[ 
https://issues.apache.org/jira/browse/HDFS-8828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14701760#comment-14701760
 ] 

Jing Zhao commented on HDFS-8828:
---------------------------------

Thanks for updating the patch, Yufei! Some further comments on the latest patch:
# We now have many static methods in DiffInfo and DistCpSync, and we're passing 
the diffs between them. It would be better to keep DiffInfo as a simple data 
structure class and move the diff report processing code and the list building 
code into the DistCpSync class. We can then make the DiffInfo EnumMap a 
class member. This would simplify a lot of code and make the logic 
clearer.
# I notice there are multiple places where we retrieve the DiffInfo list (e.g., 
with Rename type), convert it into an array, and sort it afterwards. I'm not 
sure a sorted array is necessary in all of these scenarios (e.g., 
{{getTraverseExcludeList}} and {{getRenameItem}}). Maybe we can iterate directly 
over the corresponding list inside the DiffInfo map.
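
The two suggestions above could look roughly like the following sketch. It is not taken from the patch: the {{DiffType}} enum is a stand-in for the real snapshot diff type, and {{DistCpSyncSketch}}, {{addDiff}} and {{findRename}} are hypothetical names used only for illustration. DiffInfo stays a plain data holder, the EnumMap is a class member, and the rename lookup walks the list directly instead of copying it into an array and sorting it.

```java
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; names are illustrative, not the actual DistCpSync code.
class DistCpSyncSketch {
  /** Stand-in for the snapshot diff report's diff type. */
  enum DiffType { CREATE, MODIFY, DELETE, RENAME }

  /** Simple data structure: holds paths only, no processing logic. */
  static final class DiffInfo {
    final String source;
    final String target; // only meaningful for RENAME entries

    DiffInfo(String source, String target) {
      this.source = source;
      this.target = target;
    }
  }

  // The diff map is a class member, so it is not passed between static methods.
  private final Map<DiffType, List<DiffInfo>> diffMap =
      new EnumMap<>(DiffType.class);

  DistCpSyncSketch() {
    for (DiffType type : DiffType.values()) {
      diffMap.put(type, new ArrayList<>());
    }
  }

  void addDiff(DiffType type, String source, String target) {
    diffMap.get(type).add(new DiffInfo(source, target));
  }

  /**
   * Returns the first rename whose source is the given path or one of its
   * ancestors, or null if there is none. Iterates the list in place.
   */
  DiffInfo findRename(String path) {
    for (DiffInfo diff : diffMap.get(DiffType.RENAME)) {
      if (path.equals(diff.source) || path.startsWith(diff.source + "/")) {
        return diff;
      }
    }
    return null;
  }

  public static void main(String[] args) {
    DistCpSyncSketch sync = new DistCpSyncSketch();
    sync.addDiff(DiffType.RENAME, "/src/a", "/src/b");
    sync.addDiff(DiffType.CREATE, "/src/c", null);
    System.out.println(sync.findRename("/src/a/file").target); // prints /src/b
  }
}
```

Whether a linear scan is sufficient depends on how each caller uses the list; if some caller does rely on ordering, sorting could be kept for that case only.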

> Utilize Snapshot diff report to build copy list in distcp
> ---------------------------------------------------------
>
>                 Key: HDFS-8828
>                 URL: https://issues.apache.org/jira/browse/HDFS-8828
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: distcp, snapshots
>            Reporter: Yufei Gu
>            Assignee: Yufei Gu
>         Attachments: HDFS-8828.001.patch, HDFS-8828.002.patch, 
> HDFS-8828.003.patch, HDFS-8828.004.patch, HDFS-8828.005.patch, 
> HDFS-8828.006.patch, HDFS-8828.007.patch, HDFS-8828.008.patch
>
>
> Some users reported a huge time cost to build the file copy list in distcp (30 
> hours for 1.6M files). We can leverage the snapshot diff report to build a file 
> copy list that includes only the files/dirs changed between two snapshots 
> (or between a snapshot and a normal dir). This speeds up the process in two 
> ways: 1. less copy list building time. 2. fewer file copy MR jobs.
> The HDFS snapshot diff report provides information about file/directory 
> creation, deletion, rename and modification between two snapshots or between a 
> snapshot and a normal directory. HDFS-7535 synchronizes deletions and renames, 
> then falls back to the default distcp, so it still relies on the default 
> distcp to build the complete list of files under the source dir. This patch 
> puts only created and modified files into the copy list, based on the 
> snapshot diff report, which minimizes the number of files to copy. 
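
As a rough illustration of the idea in the description, the sketch below filters a list of diff entries down to created and modified paths. It uses hypothetical stand-in types ({{DiffType}}, {{DiffEntry}}, {{buildCopyList}}) rather than the real Hadoop classes, and it assumes deletions and renames have already been synchronized separately.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; not the code from the patch.
class DiffCopyListSketch {
  /** Stand-in for the snapshot diff report's diff type. */
  enum DiffType { CREATE, MODIFY, DELETE, RENAME }

  /** Stand-in for one entry of a snapshot diff report. */
  static final class DiffEntry {
    final DiffType type;
    final String path;

    DiffEntry(DiffType type, String path) {
      this.type = type;
      this.path = path;
    }
  }

  /** Keeps only created or modified paths; other entry types are skipped. */
  static List<String> buildCopyList(List<DiffEntry> report) {
    List<String> copyList = new ArrayList<>();
    for (DiffEntry entry : report) {
      if (entry.type == DiffType.CREATE || entry.type == DiffType.MODIFY) {
        copyList.add(entry.path);
      }
    }
    return copyList;
  }

  public static void main(String[] args) {
    List<DiffEntry> report = Arrays.asList(
        new DiffEntry(DiffType.CREATE, "/data/new.txt"),
        new DiffEntry(DiffType.DELETE, "/data/old.txt"),
        new DiffEntry(DiffType.MODIFY, "/data/changed.txt"),
        new DiffEntry(DiffType.RENAME, "/data/moved.txt"));
    System.out.println(buildCopyList(report));
    // prints [/data/new.txt, /data/changed.txt]
  }
}
```

The sketch ignores details a real implementation has to handle, such as directories and paths affected by renames.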



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
