Syed Shameerur Rahman created HADOOP-18842:
----------------------------------------------

             Summary: Support Overwrite Directory On Commit For S3A Committers
                 Key: HADOOP-18842
                 URL: https://issues.apache.org/jira/browse/HADOOP-18842
             Project: Hadoop Common
          Issue Type: New Feature
            Reporter: Syed Shameerur Rahman


The goal is to add a new commit mechanism in which each destination 
directory is cleared before the job's files are committed to it.

*Use Case*

In dynamic partition insert overwrite queries, the destination directories 
that need to be overwritten are not known before execution, so it is a 
challenge to clear them in advance.

 

One approach is for the underlying engine/client to clear all the destination 
directories before calling the commitJob operation. The issue with this 
approach is that if committing the files fails, we might end up with all of 
the previous data deleted, making recovery difficult or time-consuming.

 

*Solution*

Based on the mode of the commit operation, either *INSERT* or *OVERWRITE*, 
during the commitJob operation the committer will map each destination 
directory to the commits that need to be added to that directory. If the mode 
is *OVERWRITE*, the committer will delete the directory recursively and then 
commit each of the files in it. In the worst case of a failure, the number of 
destination directories left deleted will equal the number of threads (if the 
work is done in a multi-threaded way), as compared to the whole of the data if 
the deletion were done on the engine side.
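
A minimal sketch of the flow described above, not the actual S3A committer 
code. Everything in it is an assumption made for illustration: the class 
OverwriteCommitSketch, the CommitMode enum, the in-memory FakeStore standing 
in for S3, and pending commits represented as plain destination path strings. 
It is single-threaded; a multi-threaded version would hand each directory 
entry to a worker thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Illustrative only: hypothetical names, not part of hadoop-aws. */
class OverwriteCommitSketch {

    /** Hypothetical modes, named after the two modes in the proposal. */
    enum CommitMode { INSERT, OVERWRITE }

    /** In-memory stand-in for the object store: the set of visible files. */
    static final class FakeStore {
        final Set<String> files = new TreeSet<>();

        /** Remove every file under the given directory. */
        void deleteRecursively(String dir) {
            String prefix = dir.endsWith("/") ? dir : dir + "/";
            files.removeIf(f -> f.startsWith(prefix));
        }

        /** Make a pending file visible at its destination path. */
        void completeUpload(String destPath) {
            files.add(destPath);
        }
    }

    /** Parent directory of a destination path such as /table/p=1/part-0. */
    static String parentDir(String destPath) {
        int slash = destPath.lastIndexOf('/');
        return slash <= 0 ? "/" : destPath.substring(0, slash);
    }

    /** Map each destination directory to the pending files that go into it. */
    static Map<String, List<String>> groupByDirectory(List<String> pending) {
        Map<String, List<String>> byDir = new TreeMap<>();
        for (String destPath : pending) {
            byDir.computeIfAbsent(parentDir(destPath), d -> new ArrayList<>())
                 .add(destPath);
        }
        return byDir;
    }

    /** In OVERWRITE mode, clear each directory just before committing into it. */
    static void commitJob(FakeStore store, List<String> pending, CommitMode mode) {
        for (Map.Entry<String, List<String>> e : groupByDirectory(pending).entrySet()) {
            if (mode == CommitMode.OVERWRITE) {
                store.deleteRecursively(e.getKey());
            }
            for (String destPath : e.getValue()) {
                store.completeUpload(destPath);
            }
        }
    }

    public static void main(String[] args) {
        FakeStore store = new FakeStore();
        store.files.add("/table/p=1/old-0");
        store.files.add("/table/p=2/old-0");

        // The job only wrote to partition p=1, so only p=1 is overwritten.
        commitJob(store, List.of("/table/p=1/new-0", "/table/p=1/new-1"),
                  CommitMode.OVERWRITE);
        System.out.println(store.files);
    }
}
```

Because the delete happens per directory, immediately before that directory's 
files are committed, a failure partway through leaves only the directories 
currently being processed without their old data. Partitions the job never 
wrote to (p=2 in the example) are not touched at all.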



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org
