[ https://issues.apache.org/jira/browse/HADOOP-18842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17751656#comment-17751656 ]
Syed Shameerur Rahman commented on HADOOP-18842:
------------------------------------------------

[~ste...@apache.org] It would be great if you could review the above PR or the proposed changes.
Note: this is a WIP PR (unit tests and integration tests still need to be added). I would like to get your thoughts on this before taking it forward. Thanks.

> Support Overwrite Directory On Commit For S3A Committers
> --------------------------------------------------------
>
>                 Key: HADOOP-18842
>                 URL: https://issues.apache.org/jira/browse/HADOOP-18842
>             Project: Hadoop Common
>          Issue Type: New Feature
>            Reporter: Syed Shameerur Rahman
>            Priority: Major
>              Labels: pull-request-available
>
> The goal is to add a new kind of commit mechanism in which the destination
> directory is cleared before the files are committed.
>
> *Use Case*
> In dynamic-partition insert overwrite queries, the destination directories
> that need to be overwritten are not known before execution, so clearing
> them ahead of time is a challenge.
>
> One approach is for the underlying engine/client to clear all the
> destination directories before calling the commitJob operation. The problem
> with this approach is that if committing the files fails, all of the
> previous data may already have been deleted, making recovery difficult or
> time-consuming.
>
> *Solution*
> The commit operation runs in one of two modes, *INSERT* or *OVERWRITE*.
> During commitJob, the committer maps each destination directory to the
> commits that need to be added to it. If the mode is *OVERWRITE*, the
> committer deletes the directory recursively and then commits each of the
> files in the directory.
> In the worst case, a failure during the commit therefore costs at most as
> many destination directories as there are committer threads (if the commit
> is done in a multi-threaded way), rather than all of the previous data, as
> would be the case if the clearing were done on the engine side.
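
The flow described under *Solution* can be sketched roughly as follows. This is only an illustration of the idea, not the code in the PR: the destination store is modelled as an in-memory map from directory path to file names rather than S3A, the commit is single-threaded, and every name here (OverwriteCommitSketch, PendingFile, Mode, groupByDirectory, commitJob) is hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class OverwriteCommitSketch {
    /** The two commit modes named in the issue description. */
    enum Mode { INSERT, OVERWRITE }

    /** Hypothetical stand-in for a pending commit: one file bound for one directory. */
    static final class PendingFile {
        final String destDir;
        final String fileName;

        PendingFile(String destDir, String fileName) {
            this.destDir = destDir;
            this.fileName = fileName;
        }
    }

    /** Maps each destination directory to the files that must be committed into it. */
    static Map<String, List<String>> groupByDirectory(List<PendingFile> pending) {
        Map<String, List<String>> byDir = new LinkedHashMap<>();
        for (PendingFile p : pending) {
            byDir.computeIfAbsent(p.destDir, d -> new ArrayList<>()).add(p.fileName);
        }
        return byDir;
    }

    /**
     * Commits the pending files into the store (assumption: directory path -> file names).
     * In OVERWRITE mode, only directories that receive new files are cleared,
     * one directory at a time, immediately before their files are committed.
     */
    static void commitJob(Map<String, List<String>> store,
                          List<PendingFile> pending,
                          Mode mode) {
        for (Map.Entry<String, List<String>> entry : groupByDirectory(pending).entrySet()) {
            String dir = entry.getKey();
            if (mode == Mode.OVERWRITE) {
                // Models a recursive delete: drop the directory and everything below it.
                store.keySet().removeIf(k -> k.equals(dir) || k.startsWith(dir + "/"));
            }
            store.computeIfAbsent(dir, d -> new ArrayList<>()).addAll(entry.getValue());
        }
    }

    public static void main(String[] args) {
        Map<String, List<String>> store = new HashMap<>();
        store.put("table/p=1", new ArrayList<>(List.of("old-a")));
        store.put("table/p=2", new ArrayList<>(List.of("old-b")));

        List<PendingFile> pending = List.of(new PendingFile("table/p=1", "new-a"));
        commitJob(store, pending, Mode.OVERWRITE);

        // table/p=1 now holds only the new file; table/p=2 is untouched.
        System.out.println(store);
    }
}
```

Because a directory is deleted only just before its own files are committed, a failure partway through leaves the directories not yet processed intact. With a thread pool, each worker would handle one directory at a time, which is where the "number of threads" bound in the description comes from.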