[ 
https://issues.apache.org/jira/browse/HDFS-9820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15241943#comment-15241943
 ] 

Jing Zhao commented on HDFS-9820:
---------------------------------

bq. One small correction: before we do the incremental copy, we create a 
snapshot s2 on source cluster first
No. This is incorrect. We allow {{distcp -diff s1 .}}. "s2" can be done after 
the copy. See TestDistCpSync#testSyncWithCurrent as an example.

bq. We assume that no changes have been made at target cluster after s1 was 
created before we do incremental copy in this case (assumption I)
This assumption must be verified before the new distcp run. Currently we do a 
snapshot diff report on target (between {{from}} and ".") to check. This check 
cannot be dropped, as your current patch does.

bq. Do you mean if "" ever appear as one parameter of -diff
I mean "" or "." should *never* be used as the {{fromState}} in {{distcp 
-diff}} command, otherwise we have no way to verify there is no change 
happening on target. So we actually should use "s2" here.
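
The two conditions above can be sketched as a small pre-flight check. This is only an illustration under stated assumptions: {{check_sync_preconditions}} and its arguments are hypothetical names, not part of DistCp, and the diff entries stand in for whatever the snapshot diff report on target returns.

```python
# Hypothetical helper, not DistCp code: illustrates the two preconditions
# discussed above for "distcp -diff <fromState> <toState>".

# Names that refer to the current state rather than to a named snapshot.
CURRENT_STATE_ALIASES = {"", "."}


def check_sync_preconditions(from_state, target_diff_entries):
    """Return None if the sync may proceed, otherwise a reason string.

    from_state          -- the <fromState> argument given to -diff
    target_diff_entries -- entries of the snapshot diff report on target
                           between from_state and the current state
    """
    # Without a named snapshot there is nothing to diff the target against,
    # so changes on target could not be detected.
    if from_state in CURRENT_STATE_ALIASES:
        return "fromState must be a named snapshot"
    # Any entry means the target was modified after from_state was taken.
    if target_diff_entries:
        return "target changed since snapshot " + from_state
    return None
```

With this sketch, {{check_sync_preconditions("s1", [])}} returns None, while passing "." as the first argument, or a non-empty diff, yields a reason to abort.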

bq. Because "" is just an alias of current state "snapshot"
This is also wrong. On the command line, "." is the alias of the current state.

bq. -diff <fromState> <toState> is what I feel more intuitive
bq. But if this is what you prefer, we can relax the order requirement, and let 
"" means revert operation. Would you please confirm?
What I mean is: we should always use "-diff <fromState> <toState>", but instead 
of using ".", we should use "s2". No change is necessary on {{DistCpOptions}}.

bq. bypass DistCpSync#prepareDiffList
For any modification/creation happening under a renamed directory, the diff 
report always uses the paths before the rename (as reported by HDFS-10263). 
{{prepareDiffList}} changes these paths to new paths after the rename, but when 
applying the reverse diff, we do not need to do this.
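
As a rough illustration of the path translation described above (a simplified sketch, not the actual {{DistCpSync#prepareDiffList}} code; the function name and the (old, new) pair representation are assumptions):

```python
# Simplified sketch of rewriting a path that the diff report gives under
# its pre-rename name into the corresponding post-rename path.

def to_post_rename_path(path, renames):
    """Map a pre-rename path to its post-rename path.

    renames -- list of (old_dir, new_dir) pairs taken from the diff report
    """
    for old, new in renames:
        # Match the renamed directory itself or anything beneath it, but
        # not a sibling that merely shares the prefix (e.g. "dir10").
        if path == old or path.startswith(old + "/"):
            return new + path[len(old):]
    return path
```

For a forward sync, a modification reported as {{dir1/file}} after {{dir1}} was renamed to {{dir2}} would be translated to {{dir2/file}}. When applying the reverse diff, this translation would simply be skipped.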


> Improve distcp to support efficient restore to an earlier snapshot
> ------------------------------------------------------------------
>
>                 Key: HDFS-9820
>                 URL: https://issues.apache.org/jira/browse/HDFS-9820
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: distcp
>            Reporter: Yongjun Zhang
>            Assignee: Yongjun Zhang
>         Attachments: HDFS-9820.001.patch, HDFS-9820.002.patch, 
> HDFS-9820.003.patch, HDFS-9820.004.patch
>
>
> HDFS-4167 intends to restore HDFS to the most recent snapshot, but it 
> involves some complexity and challenges. 
> HDFS-7535 improved distcp performance by avoiding copying files that were 
> only renamed since the last backup.
> On top of HDFS-7535, HDFS-8828 improved distcp performance when copying data 
> from source to target cluster, by copying only the files changed since the 
> last backup. It works by using a snapshot diff to find all changed files and 
> copying only those.
> See 
> https://blog.cloudera.com/blog/2015/12/distcp-performance-improvements-in-apache-hadoop/
> This jira proposes a variation of HDFS-8828: find the files changed in the 
> target cluster since the last snapshot sx, and copy them from the source 
> cluster's same snapshot sx, to restore the target cluster to sx.
> If a file/dir is
> - renamed, rename it back
> - created in target cluster, delete it
> - modified, put it to the copy list
> Then run distcp with the copy list, copying from the source cluster's 
> corresponding snapshot.
> This could be a new command line switch -rdiff in distcp.
> HDFS-4167 would still be nice to have. It just seems to me that HDFS-9820 
> would hopefully be easier to implement.
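
The restore steps listed in the issue description can be sketched as follows. This is a minimal illustration, not the proposed patch: {{plan_restore}} and the tuple layout are assumptions, while the "+", "M" and "R" markers follow the HDFS snapshot diff report convention. Entry kinds the description does not mention are ignored here.

```python
# Hypothetical planner: turns diff entries for the target (changes made
# since snapshot sx) into the three kinds of restore actions listed above.

def plan_restore(diff_entries):
    """Return (renames, deletes, copy_list) needed to undo the changes.

    diff_entries -- tuples of the form ("R", old_path, new_path),
                    ("+", path) or ("M", path)
    """
    renames, deletes, copy_list = [], [], []
    for entry in diff_entries:
        kind = entry[0]
        if kind == "R":
            _, old_path, new_path = entry
            # Renamed on target: rename it back (new name -> old name).
            renames.append((new_path, old_path))
        elif kind == "+":
            # Created on target after sx: delete it.
            deletes.append(entry[1])
        elif kind == "M":
            # Modified on target: copy it again from the source snapshot.
            copy_list.append(entry[1])
    return renames, deletes, copy_list
```

The resulting copy list would then be handed to distcp, reading from the source cluster's corresponding snapshot.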



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
