GitHub user tejasapatil commented on the issue:

    https://github.com/apache/spark/pull/18975
  
    There is a difference between Hive's semantics and what this PR does. In 
Hive, query execution writes to a staging location, and the destination 
location is cleared and re-populated only after query execution finishes (this 
happens in `MoveTask`). This PR first wipes the destination location and 
then runs the query to populate it with the desired data. 
    
    I like the Hive model because: 
    - If query execution fails, you at least still have the old data. Insert 
overwrite to a table does the same thing. This PR would leave the output 
location empty.
    - Hive achieves atomicity by using a staging dir. With this PR, I am not 
sure what happens to the output location if some tasks have written out their 
final data while the rest are still working. Would it have partial output?
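    
    To make the ordering concrete, here is a minimal sketch of the 
stage-then-move approach on a local filesystem. It is not Hive's `MoveTask` or 
Spark code; `insert_overwrite_directory` and the `write_output` callback are 
hypothetical names standing in for the command and for query execution.

```python
import os
import shutil
import tempfile
from pathlib import Path


def insert_overwrite_directory(dest, write_output):
    """Populate `dest` through a staging directory (illustrative only).

    `write_output` is a hypothetical callback standing in for query
    execution: it receives the staging path and writes result files there.
    """
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Stage next to the destination so the final rename stays on one filesystem.
    staging = Path(tempfile.mkdtemp(prefix=".staging_", dir=dest.parent))
    try:
        # If this raises, the destination has not been touched yet.
        write_output(staging)
    except Exception:
        shutil.rmtree(staging, ignore_errors=True)
        raise
    # Only after the query succeeded: clear the destination, then move results in.
    if dest.exists():
        shutil.rmtree(dest)
    os.rename(staging, dest)
```

    With this ordering a failed query leaves the old contents in place, and 
readers never see a half-written set of task outputs in the destination. The 
clear-then-rename step is still not strictly atomic in this sketch; there is a 
short window in which the destination does not exist.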


