[jira] [Commented] (HDFS-8828) Utilize Snapshot diff report to build diff copy list in distcp
[ https://issues.apache.org/jira/browse/HDFS-8828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15282142#comment-15282142 ]

Mingliang Liu commented on HDFS-8828:
-------------------------------------

Thanks [~jingzhao], [~yzhangal] and [~jnp] for your prompt reply. I filed [HDFS-10397] to track the effort.

> Utilize Snapshot diff report to build diff copy list in distcp
> --------------------------------------------------------------
>
>                 Key: HDFS-8828
>                 URL: https://issues.apache.org/jira/browse/HDFS-8828
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: distcp, snapshots
>            Reporter: Yufei Gu
>            Assignee: Yufei Gu
>             Fix For: 2.8.0
>
>         Attachments: HDFS-8828.001.patch, HDFS-8828.002.patch,
> HDFS-8828.003.patch, HDFS-8828.004.patch, HDFS-8828.005.patch,
> HDFS-8828.006.patch, HDFS-8828.007.patch, HDFS-8828.008.patch,
> HDFS-8828.009.patch, HDFS-8828.010.patch, HDFS-8828.011.patch
>
>
> Some users reported a huge time cost to build the file copy list in distcp
> (30 hours for 1.6M files). We can leverage the snapshot diff report to build
> a file copy list that includes only the files/dirs that changed between two
> snapshots (or a snapshot and a normal dir). This speeds up the process in two
> ways: 1. less copy list building time; 2. fewer file copy MR jobs.
> The HDFS snapshot diff report provides information about file/directory
> creation, deletion, rename and modification between two snapshots or a
> snapshot and a normal directory. HDFS-7535 synchronizes deletions and
> renames, then falls back to the default distcp, so it still relies on the
> default distcp to build the complete list of files under the source dir.
> This patch puts only created and modified files into the copy list, based on
> the snapshot diff report, which minimizes the number of files to copy.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
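
The filtering idea in the issue description above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from the attached patches: `DiffType`, `DiffEntry` and `buildCopyList` are hypothetical stand-ins for the snapshot diff report and the copy-list builder, and the real Hadoop classes may look different. It shows only the selection step: created and modified paths go into the copy list, while deletions and renames are left to the separate synchronization step that the description attributes to HDFS-7535.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch; these types are NOT the real Hadoop classes. */
final class DiffCopyListSketch {

  /** The four change kinds the description says a snapshot diff report carries. */
  enum DiffType { CREATE, MODIFY, DELETE, RENAME }

  /** One entry of an (assumed) diff report: a change kind plus a path under the source dir. */
  static final class DiffEntry {
    final DiffType type;
    final String path;

    DiffEntry(DiffType type, String path) {
      this.type = type;
      this.path = path;
    }
  }

  /** Keeps only created and modified paths, preserving the order of the report. */
  static List<String> buildCopyList(List<DiffEntry> report) {
    List<String> copyList = new ArrayList<>();
    for (DiffEntry e : report) {
      if (e.type == DiffType.CREATE || e.type == DiffType.MODIFY) {
        copyList.add(e.path);
      }
      // DELETE and RENAME entries are skipped here: per the description they are
      // synchronized on the target separately and need no data copy.
    }
    return copyList;
  }

  public static void main(String[] args) {
    List<DiffEntry> report = Arrays.asList(
        new DiffEntry(DiffType.CREATE, "dir/new.txt"),
        new DiffEntry(DiffType.DELETE, "dir/old.txt"),
        new DiffEntry(DiffType.MODIFY, "dir/changed.txt"),
        new DiffEntry(DiffType.RENAME, "dir/moved.txt"));
    System.out.println(buildCopyList(report));
  }
}
```

With the four sample entries in `main`, only the created and modified paths end up in the list, so the copy work scales with the number of changes rather than with the size of the source tree.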
[jira] [Commented] (HDFS-8828) Utilize Snapshot diff report to build diff copy list in distcp
[ https://issues.apache.org/jira/browse/HDFS-8828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15282125#comment-15282125 ]

Jitendra Nath Pandey commented on HDFS-8828:
--------------------------------------------

+1 to the proposal. '-delete' and '-diff' should have been mutually exclusive, but instead of throwing an exception, ignoring "-delete" sounds ok to prevent existing apps from failing.
[jira] [Commented] (HDFS-8828) Utilize Snapshot diff report to build diff copy list in distcp
[ https://issues.apache.org/jira/browse/HDFS-8828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15282121#comment-15282121 ]

Yongjun Zhang commented on HDFS-8828:
-------------------------------------

Thanks [~liuml07] and [~jingzhao], +1 on the proposed change.
[jira] [Commented] (HDFS-8828) Utilize Snapshot diff report to build diff copy list in distcp
[ https://issues.apache.org/jira/browse/HDFS-8828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15282112#comment-15282112 ]

Jing Zhao commented on HDFS-8828:
---------------------------------

Agree with [~liuml07]. This option change looks incompatible, and the proposed fix sounds good to me.
[jira] [Commented] (HDFS-8828) Utilize Snapshot diff report to build diff copy list in distcp
[ https://issues.apache.org/jira/browse/HDFS-8828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15282098#comment-15282098 ]

Mingliang Liu commented on HDFS-8828:
-------------------------------------

Hi, I'm aware that the {{-delete}} and {{-diff}} options are mutually exclusive. I'm thinking this patch makes existing applications (or scripts) that work just fine with both the {{-delete}} and {{-diff}} options stop working because of the {{java.lang.IllegalArgumentException: Diff is valid only with update options}} exception. That said, this change is backward incompatible.

Do you think it makes sense to ignore the {{-delete}} option, given the {{-diff}} option, instead of exiting the program? Along with that, we can print a warning message saying that "Diff is valid only with update options, and -delete option is ignored".

Ping [~jnp] for more input.
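
The behavior proposed in the comment above could look roughly like the sketch below. It is illustrative only: `DiffDeleteOptionsSketch` and its fields are hypothetical and are not the actual `DistCpOptions` code, and the check that `-diff` still requires `-update` is an assumption inferred from the exception message quoted in the comment. The point is that `-delete` combined with `-diff` is dropped with a warning instead of aborting the run.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the proposed option handling; NOT the real DistCpOptions. */
final class DiffDeleteOptionsSketch {
  final boolean update;
  final boolean delete;
  final boolean diff;
  final List<String> warnings = new ArrayList<>();

  DiffDeleteOptionsSketch(boolean update, boolean delete, boolean diff) {
    // Assumption: -diff without -update remains an error, as the quoted message suggests.
    if (diff && !update) {
      throw new IllegalArgumentException("Diff is valid only with update options");
    }
    boolean effectiveDelete = delete;
    if (diff && delete) {
      // Proposed change: ignore -delete and warn, rather than failing existing scripts.
      effectiveDelete = false;
      warnings.add("Diff is valid only with update options, and -delete option is ignored");
    }
    this.update = update;
    this.delete = effectiveDelete;
    this.diff = diff;
  }

  public static void main(String[] args) {
    // A script that passes -update, -delete and -diff together keeps running.
    DiffDeleteOptionsSketch opts = new DiffDeleteOptionsSketch(true, true, true);
    for (String w : opts.warnings) {
      System.err.println("WARN: " + w);
    }
    System.out.println("delete=" + opts.delete + ", diff=" + opts.diff);
  }
}
```

Collecting the warning in a list rather than logging it directly is only a convenience for this sketch; it keeps the example free of any logging dependency.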
[jira] [Commented] (HDFS-8828) Utilize Snapshot diff report to build diff copy list in distcp
[ https://issues.apache.org/jira/browse/HDFS-8828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14707441#comment-14707441 ]

Hudson commented on HDFS-8828:
------------------------------

FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #294 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/294/])
HDFS-8828. Utilize Snapshot diff report to build diff copy list in distcp. (Yufei Gu via Yongjun Zhang) (yzhang: rev 0bc15cb6e60dc60885234e01dec1c7cb4557a926)
* hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/CopyListing.java
* hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/SimpleCopyListing.java
* hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/TestDistCpSync.java
* hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/DistCpOptions.java
* hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/DistCp.java
* hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/DiffInfo.java
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/DistCpSync.java
* hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/TestOptionsParser.java
[jira] [Commented] (HDFS-8828) Utilize Snapshot diff report to build diff copy list in distcp
[ https://issues.apache.org/jira/browse/HDFS-8828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14707317#comment-14707317 ]

Hudson commented on HDFS-8828:
------------------------------

FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #291 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/291/])
HDFS-8828. Utilize Snapshot diff report to build diff copy list in distcp. (Yufei Gu via Yongjun Zhang) (yzhang: rev 0bc15cb6e60dc60885234e01dec1c7cb4557a926)
* hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/DistCpOptions.java
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/TestOptionsParser.java
* hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/DistCpSync.java
* hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/DiffInfo.java
* hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/TestDistCpSync.java
* hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/DistCp.java
* hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/CopyListing.java
* hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/SimpleCopyListing.java
[jira] [Commented] (HDFS-8828) Utilize Snapshot diff report to build diff copy list in distcp
[ https://issues.apache.org/jira/browse/HDFS-8828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14707347#comment-14707347 ]

Hudson commented on HDFS-8828:
------------------------------

FAILURE: Integrated in Hadoop-Yarn-trunk #1024 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/1024/])
HDFS-8828. Utilize Snapshot diff report to build diff copy list in distcp. (Yufei Gu via Yongjun Zhang) (yzhang: rev 0bc15cb6e60dc60885234e01dec1c7cb4557a926)
* hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/TestOptionsParser.java
* hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/SimpleCopyListing.java
* hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/TestDistCpSync.java
* hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/DistCpOptions.java
* hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/DiffInfo.java
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/CopyListing.java
* hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/DistCp.java
* hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/DistCpSync.java
[jira] [Commented] (HDFS-8828) Utilize Snapshot diff report to build diff copy list in distcp
[ https://issues.apache.org/jira/browse/HDFS-8828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14707587#comment-14707587 ]

Hudson commented on HDFS-8828:
------------------------------

FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #283 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/283/])
HDFS-8828. Utilize Snapshot diff report to build diff copy list in distcp. (Yufei Gu via Yongjun Zhang) (yzhang: rev 0bc15cb6e60dc60885234e01dec1c7cb4557a926)
* hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/DistCpSync.java
* hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/TestOptionsParser.java
* hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/DiffInfo.java
* hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/DistCpOptions.java
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/TestDistCpSync.java
* hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/SimpleCopyListing.java
* hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/DistCp.java
* hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/CopyListing.java
[jira] [Commented] (HDFS-8828) Utilize Snapshot diff report to build diff copy list in distcp
[ https://issues.apache.org/jira/browse/HDFS-8828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14707603#comment-14707603 ]

Hudson commented on HDFS-8828:
------------------------------

SUCCESS: Integrated in Hadoop-Hdfs-trunk #2221 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2221/])
HDFS-8828. Utilize Snapshot diff report to build diff copy list in distcp. (Yufei Gu via Yongjun Zhang) (yzhang: rev 0bc15cb6e60dc60885234e01dec1c7cb4557a926)
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/TestOptionsParser.java
* hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/SimpleCopyListing.java
* hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/DistCpSync.java
* hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/DiffInfo.java
* hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/TestDistCpSync.java
* hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/DistCpOptions.java
* hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/CopyListing.java
* hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/DistCp.java
[jira] [Commented] (HDFS-8828) Utilize Snapshot diff report to build diff copy list in distcp
[ https://issues.apache.org/jira/browse/HDFS-8828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14707540#comment-14707540 ]

Hudson commented on HDFS-8828:
------------------------------

SUCCESS: Integrated in Hadoop-Mapreduce-trunk #2240 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2240/])
HDFS-8828. Utilize Snapshot diff report to build diff copy list in distcp. (Yufei Gu via Yongjun Zhang) (yzhang: rev 0bc15cb6e60dc60885234e01dec1c7cb4557a926)
* hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/SimpleCopyListing.java
* hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/CopyListing.java
* hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/DistCpOptions.java
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/TestOptionsParser.java
* hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/DiffInfo.java
* hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/TestDistCpSync.java
* hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/DistCpSync.java
* hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/DistCp.java
[jira] [Commented] (HDFS-8828) Utilize Snapshot diff report to build diff copy list in distcp
[ https://issues.apache.org/jira/browse/HDFS-8828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14705464#comment-14705464 ]

Yufei Gu commented on HDFS-8828:
--------------------------------

Thank you very much for committing, [~yzhangal]. Thank you very much for review, [~jingzhao].
[jira] [Commented] (HDFS-8828) Utilize Snapshot diff report to build diff copy list in distcp
[ https://issues.apache.org/jira/browse/HDFS-8828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14705113#comment-14705113 ]

Hudson commented on HDFS-8828:
------------------------------

FAILURE: Integrated in Hadoop-trunk-Commit #8328 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/8328/])
HDFS-8828. Utilize Snapshot diff report to build diff copy list in distcp. (Yufei Gu via Yongjun Zhang) (yzhang: rev 0bc15cb6e60dc60885234e01dec1c7cb4557a926)
* hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/TestOptionsParser.java
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/SimpleCopyListing.java
* hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/CopyListing.java
* hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/TestDistCpSync.java
* hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/DistCpSync.java
* hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/DistCp.java
* hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/DistCpOptions.java
* hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/DiffInfo.java