[jira] [Commented] (HDFS-9820) Improve distcp to support efficient restore to an earlier snapshot

Yongjun Zhang (JIRA) Mon, 03 Oct 2016 13:50:23 -0700

    [ 
https://issues.apache.org/jira/browse/HDFS-9820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15543389#comment-15543389
 ]


Yongjun Zhang commented on HDFS-9820:
-------------------------------------

Copied from 
https://issues.apache.org/jira/browse/HDFS-10314?focusedCommentId=15248778&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15248778

The idea is to wrap distcp in a tool that achieves the functionality of 
distcp's -rdiff switch (if we do the same for -diff, it will be a 
different jira). Here is a description and comparison of the -diff and 
unimplemented -rdiff switches. 

{code}
Definition: Assume we have two snapshots, s1 and s2, where s1 was created
earlier and s2 is newer.

- snapshotDiff(s1, s2) represents the delta between s1 and s2. That is, if we
  apply snapshotDiff(s1, s2) on top of s1, we reach the state of s2.
- snapshotDiff(s2, s1) represents the reversed delta between s1 and s2. That
  is, if we apply snapshotDiff(s2, s1) on top of s2, we go back to the state
  of s1.

Note: When we talk about source and target, we mean distcp source and distcp
target.

A. -diff allows distcp to efficiently copy incremental changes made in the
   source cluster (on top of the previously copied snapshot s1) to the target
   cluster. Assuming snapshot s2 is created at the source to capture s1 plus
   the incremental changes, snapshotDiff(s1, s2) is the set of incremental
   changes. The result of this operation is that the target is at the s2
   state. This operation involves three steps:

  A.1 Calculate snapshotDiff(s1, s2) at the source.
  A.2 Apply the rename and delete portion of the snapshot diff at the target.
      This step is called "sync".
  A.3 Copy created/modified files from the source's s2 to the target.

B. -rdiff allows distcp to efficiently copy data from snapshot s1 to overwrite
   changes made in the target after snapshot s1 was created in the target.
   Assuming snapshot s2 is created at the target to capture the changes that
   need to be overwritten, snapshotDiff(s2, s1) is what we want to apply to
   the target. The result of this operation is that the target is at the s1
   state. This operation is similar to -diff, with some differences, and
   also involves three steps:

  B.1 Calculate snapshotDiff(s2, s1) at the target.
  B.2 Apply the rename and delete portion of the snapshot diff at the target.
      This step is called "sync".
  B.3 Copy created/modified files from the source's s1 to the target. (The
      source here can be a different cluster, or the target itself. When it
      is a different cluster, that cluster has to have a snapshot s1 with
      exactly the same name and content as the s1 at the target.)

A tabularized comparison:

          required snapshots         DiffCalc    Output After Operation
          source         target
          ------------------------------------------------------------
-diff     s1, s2    ->   s1          source      target is at s2
-rdiff    s1        ->   s1, s2      target      target is at s1

(Note: for -rdiff, the source could be the same as the target.)

So the "r" (reversed) in -rdiff means the following, and is symmetric
to -diff:

- swap the snapshot requirements of source and target in -diff
  (from "s1, s2 -> s1" to "s1 -> s1, s2")
- swap the resulting snapshot after the operation (from s2 to s1)
- swap the place where the snapshot diff is calculated (from source to target)

We require source and target to have the same snapshot s1 (same snapshot
name, same content).
{code}
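
To make the symmetry between the two definitions concrete, here is a minimal Python sketch on a toy in-memory model. It is an illustration only, not distcp code: a snapshot is assumed to be a plain dict from path to content, the names snapshot_diff and apply_diff are hypothetical, and renames are left out (a rename shows up as a delete plus a create in this model).

```python
def snapshot_diff(frm, to):
    """Toy delta that turns state `frm` into state `to` (renames not modeled)."""
    deleted = sorted(p for p in frm if p not in to)
    # Paths that are new in `to`, or whose content differs from `frm`.
    copied = sorted(p for p in to if frm.get(p) != to[p])
    return {"delete": deleted, "copy": copied}


def apply_diff(state, diff, copy_source):
    """Apply a toy delta: first the delete part ("sync"), then the copy part."""
    result = dict(state)
    for path in diff["delete"]:
        del result[path]
    for path in diff["copy"]:
        result[path] = copy_source[path]
    return result


# s1 is the earlier snapshot, s2 the newer one.
s1 = {"/d/a": "v1", "/d/b": "v1"}
s2 = {"/d/a": "v2", "/d/c": "v1"}

# -diff direction: apply snapshotDiff(s1, s2) on top of s1, copying from s2.
forward = apply_diff(s1, snapshot_diff(s1, s2), copy_source=s2)

# -rdiff direction: apply snapshotDiff(s2, s1) on top of s2, copying from s1.
restored = apply_diff(s2, snapshot_diff(s2, s1), copy_source=s1)
```

In this model, forward ends up equal to s2 and restored ends up equal to s1, which mirrors the "target is at s2" and "target is at s1" outcomes described above.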



> Improve distcp to support efficient restore to an earlier snapshot
> ------------------------------------------------------------------
>
>                 Key: HDFS-9820
>                 URL: https://issues.apache.org/jira/browse/HDFS-9820
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: distcp
>            Reporter: Yongjun Zhang
>            Assignee: Yongjun Zhang
>         Attachments: HDFS-9820.001.patch, HDFS-9820.002.patch, 
> HDFS-9820.003.patch, HDFS-9820.004.patch, HDFS-9820.005.patch
>
>
> A common use scenario (scenario 1): 
> # create snapshot sx in clusterX, 
> # do some experiments in clusterX, which create some files. 
> # throw away the changed files and go back to sx.
> Another scenario (scenario 2) is, there is a production cluster and a backup 
> cluster, we periodically sync up the data from production cluster to the 
> backup cluster with distcp. 
> The cluster in scenario 1 could be the backup cluster in scenario 2.
> For scenario 1:
> HDFS-4167 intends to restore HDFS to the most recent snapshot, and there are 
> some complexities and challenges.  Before that jira is implemented, we count on 
> distcp to copy from the snapshot to the current state. However, the performance 
> of this operation could be very bad because we have to go through all files 
> even if only a few files changed.
> For scenario 2:
> HDFS-7535 improved distcp performance by avoiding copying files whose names 
> changed since the last backup.
> On top of HDFS-7535, HDFS-8828 improved distcp performance when copying data 
> from source to target cluster, by copying only the files changed since the last 
> backup. It works by using snapshot diff to find all changed files and 
> copying only those files.
> See 
> https://blog.cloudera.com/blog/2015/12/distcp-performance-improvements-in-apache-hadoop/
> This jira is to propose a variation of HDFS-8828, to find out the files 
> changed in target cluster since last snapshot sx, and copy these from 
> snapshot sx of either the source or the target cluster, to restore target 
> cluster's current state to sx. 
> Specifically,
> If a file/dir is
> - renamed, rename it back
> - created in target cluster, delete it
> - modified, put it to the copy list
> - run distcp with the copy list, copy from the source cluster's corresponding 
> snapshot
> This could be a new command line switch -rdiff in distcp.
> As a native restore feature, HDFS-4167 would still be ideal to have. However, 
>  HDFS-9820 would hopefully be easier to implement, before HDFS-4167 is in 
> place.
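
The per-file rules in the description above (rename back, delete, put on the copy list) can be sketched as a small planning function. This is a hedged illustration in Python, not the patch attached to this jira: the Change record, its kind values, and plan_restore are hypothetical names, and only the three cases listed in the description are handled.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Change:
    """One change made in the target since snapshot sx (hypothetical record)."""
    kind: str                       # "renamed", "created", or "modified"
    path: str                       # path as it was in sx (or the created path)
    new_path: Optional[str] = None  # current path, only for "renamed"


def plan_restore(changes: List[Change]) -> Tuple[list, list, list]:
    """Split changes into rename-backs, deletes, and a copy list."""
    renames, deletes, copy_list = [], [], []
    for change in changes:
        if change.kind == "renamed":
            # Rename it back: current path -> path it had in sx.
            renames.append((change.new_path, change.path))
        elif change.kind == "created":
            # Created after sx in the target: delete it.
            deletes.append(change.path)
        elif change.kind == "modified":
            # Modified after sx: copy it again from snapshot sx.
            copy_list.append(change.path)
        else:
            raise ValueError(f"unsupported change kind: {change.kind}")
    return renames, deletes, copy_list


renames, deletes, copy_list = plan_restore([
    Change("renamed", "/data/old", "/data/new"),
    Change("created", "/data/tmp"),
    Change("modified", "/data/file"),
])
```

The copy list produced here would then be handed to distcp, reading from snapshot sx of either the source or the target cluster, as the description proposes.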



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
