[ 
https://issues.apache.org/jira/browse/HDFS-9820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15241643#comment-15241643
 ] 

Jing Zhao commented on HDFS-9820:
---------------------------------

Let's say we first have snapshot s1 on both the source and the target (and the 
source and the target have been synced). Then we make some changes on the source 
and do a forward incremental distcp copy to apply the changes to the target. Based on 
our assumption, before the next incremental copy, we will create a snapshot s2 
on both the source and the target.

Let's say at this time you want to restore the target back to s1. Theoretically 
we only need to do "distcp -diff s2 s1", where s2 is the {{from}} snapshot and 
s1 is the {{to}} snapshot. Note there is no diff between the current states and 
s2 on the target. Only after verifying this can we continue the incremental 
distcp. Because HDFS-10263 is not yet available, we need to make slight changes 
when applying the reversed diff, i.e., bypass {{DistCpSync#prepareDiffList}}. 
This requires the distcp tool to understand s2 is actually after s1, and we can 
call {{getListing}} against {{path-of-snapshottable-dir/.snapshot}} to achieve 
this.
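
As a rough illustration of the ordering check, here is a minimal sketch. It 
assumes that listing {{path-of-snapshottable-dir/.snapshot}} yields one entry 
per snapshot and that each entry's modification time reflects when the snapshot 
was taken. {{SnapshotOrder}} and {{isReversed}} are hypothetical names, not 
existing DistCp code, and the map stands in for the listing result.

```java
import java.util.Map;

/** Hypothetical helper (not part of DistCp) that decides the diff direction. */
final class SnapshotOrder {
    private SnapshotOrder() {}

    /**
     * Returns true when the "from" snapshot is newer than the "to" snapshot,
     * i.e. the requested diff is a reversed one (for example "-diff s2 s1").
     *
     * @param mtimes snapshot name -> modification time in ms; assumed to be
     *               collected by listing the .snapshot directory
     */
    static boolean isReversed(Map<String, Long> mtimes, String from, String to) {
        Long fromTime = mtimes.get(from);
        Long toTime = mtimes.get(to);
        if (fromTime == null || toTime == null) {
            throw new IllegalArgumentException(
                "Unknown snapshot: " + (fromTime == null ? from : to));
        }
        return fromTime > toTime;
    }
}
```

With s1 taken before s2, {{isReversed(mtimes, "s2", "s1")}} is true and the tool 
would take the reversed-diff path; the opposite order is the normal forward sync.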

We should allow the user to pass in two snapshots in any order. The only 
restriction here is that the {{from}} snapshot must also exist in the target 
cluster and that there is no difference between this snapshot and the current 
state of the target cluster. 
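
The restriction could be expressed roughly as below. This is only a sketch: 
{{SyncPrecondition}} and {{canSync}} are hypothetical names, and the inputs 
(the target's snapshot names and the diff entries between the {{from}} snapshot 
and the target's current state) are assumed to have been fetched already.

```java
import java.util.List;
import java.util.Set;

/** Hypothetical precondition check (not part of DistCp). */
final class SyncPrecondition {
    private SyncPrecondition() {}

    /**
     * @param targetSnapshots     snapshot names present on the target
     * @param from                the "from" snapshot passed by the user
     * @param targetDiffSinceFrom diff entries between "from" and the target's
     *                            current state; must be empty to proceed
     */
    static boolean canSync(Set<String> targetSnapshots, String from,
                           List<String> targetDiffSinceFrom) {
        // Both conditions must hold: the snapshot exists on the target, and
        // the target has not changed since that snapshot was taken.
        return targetSnapshots.contains(from) && targetDiffSinceFrom.isEmpty();
    }
}
```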

> Improve distcp to support efficient restore to an earlier snapshot
> ------------------------------------------------------------------
>
>                 Key: HDFS-9820
>                 URL: https://issues.apache.org/jira/browse/HDFS-9820
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: distcp
>            Reporter: Yongjun Zhang
>            Assignee: Yongjun Zhang
>         Attachments: HDFS-9820.001.patch, HDFS-9820.002.patch, 
> HDFS-9820.003.patch, HDFS-9820.004.patch
>
>
> HDFS-4167 intends to restore HDFS to the most recent snapshot, but it involves 
> some complexity and challenges. 
> HDFS-7535 improved distcp performance by avoiding copying files whose names 
> changed since the last backup.
> On top of HDFS-7535, HDFS-8828 improved distcp performance when copying data 
> from the source to the target cluster, by copying only the files changed since 
> the last backup. It works by using the snapshot diff to find all changed 
> files and copying only those.
> See 
> https://blog.cloudera.com/blog/2015/12/distcp-performance-improvements-in-apache-hadoop/
> This jira proposes a variation of HDFS-8828: find the files changed in the 
> target cluster since the last snapshot sx, and copy them from the same 
> snapshot sx on the source cluster, to restore the target cluster to sx.
> If a file/dir is
> - renamed, rename it back
> - created in target cluster, delete it
> - modified, add it to the copy list
> Then run distcp with the copy list, copying from the source cluster's 
> corresponding snapshot.
> This could be a new command line switch -rdiff in distcp.
> HDFS-4167 would still be nice to have. It just seems to me that HDFS-9820 
> would hopefully be easier to implement.
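
The restore steps in the quoted description can be sketched as a small planning 
step that turns the target's changes since sx into three work lists. This is an 
illustration only: {{RestorePlan}}, {{Change}} and {{Kind}} are hypothetical 
names, the change list is assumed to come from a snapshot diff on the target, 
and only the three cases listed in the description are modelled.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical planner (not part of DistCp) for restoring the target to sx. */
final class RestorePlan {
    enum Kind { RENAMED, CREATED, MODIFIED }

    /** One change on the target since snapshot sx. */
    static final class Change {
        final Kind kind;
        final String path;    // path as of sx (or the new path for CREATED)
        final String newPath; // current path after a rename, otherwise null

        Change(Kind kind, String path, String newPath) {
            this.kind = kind;
            this.path = path;
            this.newPath = newPath;
        }
    }

    final List<String[]> renamesBack = new ArrayList<>(); // {currentPath, originalPath}
    final List<String> deletes = new ArrayList<>();       // created since sx
    final List<String> copyList = new ArrayList<>();      // re-copy from source's sx

    static RestorePlan from(List<Change> changes) {
        RestorePlan plan = new RestorePlan();
        for (Change c : changes) {
            switch (c.kind) {
                case RENAMED:
                    plan.renamesBack.add(new String[] {c.newPath, c.path});
                    break;
                case CREATED:
                    plan.deletes.add(c.path);
                    break;
                case MODIFIED:
                    plan.copyList.add(c.path);
                    break;
            }
        }
        return plan;
    }
}
```

The renames and deletes would be applied on the target first, and the copy list 
would then be handed to distcp to copy from the source cluster's snapshot sx.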



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
