GitHub user tejasapatil commented on the issue:

https://github.com/apache/spark/pull/18975

There is a difference between Hive's semantics and what this PR does. In Hive, query execution writes to a staging location, and the destination location is cleared and re-populated only after query execution ends (this happens in `MoveTask`). This PR first wipes the destination location and then runs the query to populate it with the desired data.

I like the Hive model because:
- If query execution fails, you at least still have the old data. Insert overwrite to a table does the same thing. This PR will leave the output location empty.
- Hive achieves atomicity by using a staging dir. With this PR, I am not sure what happens to the output location if some tasks have written out their final data while the rest are still working. Would it have partial output?
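
To make the staging-directory ordering concrete, here is a minimal local-filesystem sketch in Python. It is not Hive's `MoveTask` and not Spark code: `overwrite_directory` and `produce` are hypothetical names, and a local directory stands in for the output location. The callback plays the role of query execution, and the destination is only touched after it returns successfully.

```python
import os
import shutil
import tempfile
from pathlib import Path


def overwrite_directory(dest, produce):
    """Run produce(staging_dir); replace dest only if produce succeeded.

    Hypothetical helper for illustration: `produce` stands in for the
    query execution that writes output files into the staging directory.
    """
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Stage next to the destination so the final rename stays on one filesystem.
    staging = Path(tempfile.mkdtemp(prefix=".staging-", dir=dest.parent))
    try:
        produce(staging)  # may raise if any "task" fails
    except BaseException:
        # Failure: discard partial output; the old data in dest is untouched.
        shutil.rmtree(staging, ignore_errors=True)
        raise
    # Commit phase: clear the destination, then move the staged output in.
    if dest.exists():
        shutil.rmtree(dest)
    os.rename(staging, dest)
```

If `produce` raises, the old contents of `dest` survive and no partial output is visible there. Note that in this sketch the commit phase itself is not strictly atomic: there is a short window between clearing `dest` and the rename, but it is far smaller than the duration of the query.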