[ 
https://issues.apache.org/jira/browse/HDFS-8828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696000#comment-14696000
 ] 

Yongjun Zhang commented on HDFS-8828:
-------------------------------------

Hi [~yufeigu],

Thanks for answering my question in person. So for a newly created dir, there is 
indeed a single "CREATE" entry in the snapshot diff report, and no entries for the 
new elements created below this dir.

So please take care of comments 1 and 2 in my previous review, plus: 

3.  I suggest renaming the {{getExcludeList}} method to 
{{getTraverseExcludeList}} (hopefully a better name) and giving it the following 
javadoc, as we agreed.
{code}
This method returns a list of items to be excluded when recursively traversing 
newDir to build the copy list.

Specifically, given a newly created directory newDir (a CREATE entry in the 
snapshot diff), if a previously copied file/directory itemX is moved (a RENAME 
entry in the snapshot diff) into newDir, itemX should be excluded when 
recursively traversing newDir in #traverseDirectory, so that it will not be 
copied again.

If the same itemX also has a MODIFY entry in the snapshot diff report, meaning 
it was modified after it was previously copied, it will still be added to the 
copy list (handled in the main loop of doBuildListingWithSnapshotDiff).
{code}

4. Please refactor the test code to consolidate the duplicated code that we discussed.
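
To illustrate the intent of the javadoc in item 3, here is a minimal sketch. It is not the patch code. The {{DiffEntry}} type, the in-memory {{children}} map standing in for the file system, and the {{buildCopyList}} loop are hypothetical simplifications assumed for illustration; only the names {{getTraverseExcludeList}} and {{traverseDirectory}} come from the discussion above. It also assumes that the path in a MODIFY entry is already resolved to the item's current location.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class TraverseExcludeSketch {
    enum DiffType { CREATE, DELETE, RENAME, MODIFY }

    /** Hypothetical, simplified diff entry; not the real Hadoop report class. */
    static final class DiffEntry {
        final DiffType type;
        final String path;   // item path (for RENAME: the old path)
        final String target; // new path for RENAME, otherwise null

        DiffEntry(DiffType type, String path, String target) {
            this.type = type;
            this.path = path;
            this.target = target;
        }
    }

    /** Items renamed into newDir: already copied earlier, so skip them while traversing. */
    static Set<String> getTraverseExcludeList(String newDir, List<DiffEntry> diff) {
        String prefix = newDir.endsWith("/") ? newDir : newDir + "/";
        Set<String> exclude = new HashSet<>();
        for (DiffEntry e : diff) {
            if (e.type == DiffType.RENAME && e.target != null
                    && e.target.startsWith(prefix)) {
                exclude.add(e.target);
            }
        }
        return exclude;
    }

    /** Recursively adds everything under dir, except excluded items and their subtrees. */
    static void traverseDirectory(String dir, Map<String, List<String>> children,
                                  Set<String> exclude, Set<String> copyList) {
        for (String child : children.getOrDefault(dir, Collections.<String>emptyList())) {
            if (exclude.contains(child)) {
                continue; // moved in by RENAME, not new content
            }
            copyList.add(child);
            traverseDirectory(child, children, exclude, copyList);
        }
    }

    /** Main loop: CREATE entries are traversed, MODIFY entries are always added. */
    static List<String> buildCopyList(Map<String, List<String>> children,
                                      List<DiffEntry> diff) {
        Set<String> copyList = new LinkedHashSet<>(); // keeps order, avoids duplicates
        for (DiffEntry e : diff) {
            if (e.type == DiffType.CREATE) {
                copyList.add(e.path);
                traverseDirectory(e.path, children,
                        getTraverseExcludeList(e.path, diff), copyList);
            } else if (e.type == DiffType.MODIFY) {
                copyList.add(e.path); // modified after the previous copy
            }
        }
        return new ArrayList<>(copyList);
    }
}
```

In this sketch, a file renamed into a newly created directory is skipped by the traversal, but it still ends up in the copy list if the diff also carries a MODIFY entry for it, which matches the two cases the javadoc distinguishes.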
 
Hi [~jingzhao], I had quite a few side discussions with Yufei, and I am +1 on the 
change once the above comments are addressed. Would you please take a look at 
it if you wish? I'm aiming to commit it next Monday.

Thanks much.



> Utilize Snapshot diff report to build copy list in distcp
> ---------------------------------------------------------
>
>                 Key: HDFS-8828
>                 URL: https://issues.apache.org/jira/browse/HDFS-8828
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: distcp, snapshots
>            Reporter: Yufei Gu
>            Assignee: Yufei Gu
>         Attachments: HDFS-8828.001.patch, HDFS-8828.002.patch, 
> HDFS-8828.003.patch, HDFS-8828.004.patch, HDFS-8828.005.patch, 
> HDFS-8828.006.patch
>
>
> Some users reported a huge time cost to build the file copy list in distcp (30 
> hours for 1.6M files). We can leverage the snapshot diff report to build a file 
> copy list that includes only the files/dirs changed between two snapshots 
> (or a snapshot and a normal dir). This speeds up the process in two ways: 1. 
> less copy list building time. 2. fewer file copy MR jobs.
> The HDFS snapshot diff report provides information about file/directory creation, 
> deletion, rename and modification between two snapshots or a snapshot and a 
> normal directory. HDFS-7535 synchronizes deletions and renames, then falls back to 
> the default distcp. So it still relies on the default distcp to build the complete 
> list of files under the source dir. This patch only puts created and 
> modified files into the copy list based on the snapshot diff report, so we can 
> minimize the number of files to copy. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
