[jira] [Commented] (MAPREDUCE-2257) distcp can copy blocks in parallel
[ https://issues.apache.org/jira/browse/MAPREDUCE-2257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049997#comment-14049997 ]

liuwei commented on MAPREDUCE-2257:
-----------------------------------

Since distcp now has distcp2, does a patch exist for distcp2 to copy blocks in parallel?

> distcp can copy blocks in parallel
> ----------------------------------
>
>                 Key: MAPREDUCE-2257
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2257
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: distcp
>    Affects Versions: 0.21.0
>            Reporter: dhruba borthakur
>            Assignee: Mithun Radhakrishnan
>         Attachments: MAPREDUCE-2257.patch
>
>
> The minimum unit of work for a distcp task is a file. We have files that are
> greater than 1 TB with a block size of 1 GB. If we use distcp to copy these
> files, the tasks either take a very long time or eventually fail. A better
> way for distcp would be to copy all the source blocks in parallel, and then
> stitch the blocks back into files at the destination via the HDFS Concat API
> (HDFS-222).

--
This message was sent by Atlassian JIRA
(v6.2#6252)
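The issue description proposes copying a large file as independent block-sized pieces instead of as one unit of work. A minimal sketch of that planning step follows, using plain Java and no Hadoop classes. The names ChunkPlanner, Chunk, and plan are hypothetical and are not taken from the attached patch; the stitching step through the HDFS Concat API is not shown.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: splits a file length into block-sized byte ranges. */
public class ChunkPlanner {

    /** One contiguous byte range of the source file (illustrative type, not from the patch). */
    public static final class Chunk {
        public final long offset;
        public final long length;

        Chunk(long offset, long length) {
            this.offset = offset;
            this.length = length;
        }
    }

    /**
     * Returns the byte ranges that cover [0, fileLength), each at most
     * blockSize bytes long. Each range could then be handed to a separate
     * copy task.
     */
    public static List<Chunk> plan(long fileLength, long blockSize) {
        if (fileLength < 0 || blockSize <= 0) {
            throw new IllegalArgumentException("fileLength must be >= 0 and blockSize > 0");
        }
        List<Chunk> chunks = new ArrayList<>();
        for (long offset = 0; offset < fileLength; offset += blockSize) {
            // The last chunk may be shorter than a full block.
            chunks.add(new Chunk(offset, Math.min(blockSize, fileLength - offset)));
        }
        return chunks;
    }
}
```

With the sizes quoted in the issue (a file of about 1 TB and a 1 GB block size), plan() would return roughly a thousand ranges that can be copied independently, rather than a single task that has to stream the whole file.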
[jira] [Commented] (MAPREDUCE-2257) distcp can copy blocks in parallel
[ https://issues.apache.org/jira/browse/MAPREDUCE-2257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13508414#comment-13508414 ]

Mithun Radhakrishnan commented on MAPREDUCE-2257:
-------------------------------------------------

Sorry, I haven't been able to spare the time yet. I'll try to make the time shortly.
[jira] [Commented] (MAPREDUCE-2257) distcp can copy blocks in parallel
[ https://issues.apache.org/jira/browse/MAPREDUCE-2257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13508168#comment-13508168 ]

Harsh J commented on MAPREDUCE-2257:
------------------------------------

[~mithun] - I know it's been a while, but are you still working on this? Since HDFS-222 is getting some attention, I feel it would be good to have this as a built-in use of it (and Dhruba has already mentioned that it is a great improvement to DistCp).
[jira] [Commented] (MAPREDUCE-2257) distcp can copy blocks in parallel
[ https://issues.apache.org/jira/browse/MAPREDUCE-2257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13205551#comment-13205551 ]

Mahadev konar commented on MAPREDUCE-2257:
------------------------------------------

Thanks for taking this up Mithun!
[jira] [Commented] (MAPREDUCE-2257) distcp can copy blocks in parallel
[ https://issues.apache.org/jira/browse/MAPREDUCE-2257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13205280#comment-13205280 ]

Mithun Radhakrishnan commented on MAPREDUCE-2257:
-------------------------------------------------

I'll take a look. I already have a patch that accomplishes the bulk of this. The finishing touches remain. I'll post a patch shortly.
[jira] [Commented] (MAPREDUCE-2257) distcp can copy blocks in parallel
[ https://issues.apache.org/jira/browse/MAPREDUCE-2257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13033623#comment-13033623 ]

Rodrigo Schmidt commented on MAPREDUCE-2257:
--------------------------------------------

+1

Patch looks good. Just make sure it passes the QA test. Hadoop QA doesn't seem to have picked up the latest version.
[jira] [Commented] (MAPREDUCE-2257) distcp can copy blocks in parallel
[ https://issues.apache.org/jira/browse/MAPREDUCE-2257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13026166#comment-13026166 ]

Rodrigo Schmidt commented on MAPREDUCE-2257:
--------------------------------------------

Maybe it's time to change it to non-deprecated classes.
[jira] [Commented] (MAPREDUCE-2257) distcp can copy blocks in parallel
[ https://issues.apache.org/jira/browse/MAPREDUCE-2257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025088#comment-13025088 ]

Rosie Li commented on MAPREDUCE-2257:
-------------------------------------

The original code was already using the deprecated classes, such as JobConf and InputSplit.
[jira] [Commented] (MAPREDUCE-2257) distcp can copy blocks in parallel
[ https://issues.apache.org/jira/browse/MAPREDUCE-2257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025087#comment-13025087 ]

Rodrigo Schmidt commented on MAPREDUCE-2257:
--------------------------------------------

Shouldn't you change your code to use the class that replaced the deprecated one? :)
[jira] [Commented] (MAPREDUCE-2257) distcp can copy blocks in parallel
[ https://issues.apache.org/jira/browse/MAPREDUCE-2257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025035#comment-13025035 ]

Rosie Li commented on MAPREDUCE-2257:
-------------------------------------

[javac] /data/users/rosieli/hadoop_jira/hadoop-mapred-trunk/src/tools/org/apache/hadoop/tools/DistCp.java:59: warning: [deprecation] org.apache.hadoop.mapred.FileSplit in org.apache.hadoop.mapred has been deprecated
[javac] import org.apache.hadoop.mapred.FileSplit;
[javac] ^
[javac] /data/users/rosieli/hadoop_jira/hadoop-mapred-trunk/src/tools/org/apache/hadoop/tools/DistCp.java:60: warning: [deprecation] org.apache.hadoop.mapred.InputFormat in org.apache.hadoop.mapred has been deprecated
[javac] import org.apache.hadoop.mapred.InputFormat;
[javac] ^
[javac] /data/users/rosieli/hadoop_jira/hadoop-mapred-trunk/src/tools/org/apache/hadoop/tools/DistCp.java:61: warning: [deprecation] org.apache.hadoop.mapred.InputSplit in org.apache.hadoop.mapred has been deprecated
[javac] import org.apache.hadoop.mapred.InputSplit;
[javac] ^
[javac] /data/users/rosieli/hadoop_jira/hadoop-mapred-trunk/src/tools/org/apache/hadoop/tools/DistCp.java:63: warning: [deprecation] org.apache.hadoop.mapred.JobClient in org.apache.hadoop.mapred has been deprecated
[javac] import org.apache.hadoop.mapred.JobClient;
[javac] ^
[javac] /data/users/rosieli/hadoop_jira/hadoop-mapred-trunk/src/tools/org/apache/hadoop/tools/DistCp.java:64: warning: [deprecation] org.apache.hadoop.mapred.JobConf in org.apache.hadoop.mapred has been deprecated
[javac] import org.apache.hadoop.mapred.JobConf;
[javac] ^
[javac] /data/users/rosieli/hadoop_jira/hadoop-mapred-trunk/src/tools/org/apache/hadoop/tools/DistCp.java:66: warning: [deprecation] org.apache.hadoop.mapred.Mapper in org.apache.hadoop.mapred has been deprecated
[javac] import org.apache.hadoop.mapred.Mapper;
[javac] ^
[javac] /data/users/rosieli/hadoop_jira/hadoop-mapred-trunk/src/tools/org/apache/hadoop/tools/DistCp.java:211: warning: [deprecation] org.apache.hadoop.mapred.JobConf in org.apache.hadoop.mapred has been deprecated
[javac] private JobConf conf;
[javac] ^
[javac] /data/users/rosieli/hadoop_jira/hadoop-mapred-trunk/src/tools/org/apache/hadoop/tools/DistCp.java:738: warning: [deprecation] org.apache.hadoop.mapred.JobConf in org.apache.hadoop.mapred has been deprecated
[javac] private static void checkSrcPath(JobConf jobConf, List srcPaths)
[javac] ^
[javac] /data/users/rosieli/hadoop_jira/hadoop-mapred-trunk/src/tools/org/apache/hadoop/tools/DistCp.java:831: warning: [deprecation] org.apache.hadoop.mapred.JobConf in org.apache.hadoop.mapred has been deprecated
[javac] static private void finalize(Configuration conf, JobConf jobconf,
[javac] ^
[javac] /data/users/rosieli/hadoop_jira/hadoop-mapred-trunk/src/tools/org/apache/hadoop/tools/DistCp.java:1096: warning: [deprecation] org.apache.hadoop.mapred.JobConf in org.apache.hadoop.mapred has been deprecated
[javac] private static int setMapCount(long totalBytes, JobConf job)
[javac] ^
[javac] /data/users/rosieli/hadoop_jira/hadoop-mapred-trunk/src/tools/org/apache/hadoop/tools/DistCp.java:1120: warning: [deprecation] org.apache.hadoop.mapred.JobConf in org.apache.hadoop.mapred has been deprecated
[javac] private static JobConf createJobConf(Configuration conf) {
[javac] ^
[javac] /data/users/rosieli/hadoop_jira/hadoop-mapred-trunk/src/tools/org/apache/hadoop/tools/DistCp.java:1148: warning: [deprecation] org.apache.hadoop.mapred.JobConf in org.apache.hadoop.mapred has been deprecated
[javac] private static void setReplication(Configuration conf, JobConf jobConf,
[javac] ^
[javac] /data/users/rosieli/hadoop_jira/hadoop-mapred-trunk/src/tools/org/apache/hadoop/tools/DistCp.java:1190: warning: [deprecation] org.apache.hadoop.mapred.JobConf in org.apache.hadoop.mapred has been deprecated
[javac] static boolean setup(Configuration conf, JobConf jobConf,
[javac] ^
[javac] /data/users/rosieli/hadoop_jira/hadoop-mapred-trunk/src/tools/org/apache/hadoop/tools/DistCp.java:1562: warning: [deprecation] org.apache.hadoop.mapred.JobConf in org.apache.hadoop.mapred has been deprecated
[javac] FileSystem jobfs, Path jobdir, JobConf jobconf, Configuration conf
[javac] ^
[javac] /data/users/ro
[jira] [Commented] (MAPREDUCE-2257) distcp can copy blocks in parallel
[ https://issues.apache.org/jira/browse/MAPREDUCE-2257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13021230#comment-13021230 ]

Hadoop QA commented on MAPREDUCE-2257:
--------------------------------------

-1 overall. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12476634/MAPREDUCE-2257.patch
  against trunk revision 1094093.

    +1 @author. The patch does not contain any @author tags.
    +1 tests included. The patch appears to include 4 new or modified tests.
    +1 javadoc. The javadoc tool did not generate any warning messages.
    -1 javac. The applied patch generated 2256 javac compiler warnings (more than the trunk's current 2244 warnings).
    +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
    +1 release audit. The applied patch does not increase the total number of release audit warnings.
    +1 core tests. The patch passed core unit tests.
    -1 contrib tests. The patch failed contrib unit tests.
    +1 system test framework. The patch passed system test framework compile.

Test results: https://hudson.apache.org/hudson/job/PreCommit-MAPREDUCE-Build/173//testReport/
Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-MAPREDUCE-Build/173//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: https://hudson.apache.org/hudson/job/PreCommit-MAPREDUCE-Build/173//console

This message is automatically generated.
[jira] [Commented] (MAPREDUCE-2257) distcp can copy blocks in parallel
[ https://issues.apache.org/jira/browse/MAPREDUCE-2257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13021125#comment-13021125 ]

Rosie Li commented on MAPREDUCE-2257:
-------------------------------------

FileChunkPair still holds a src/dst file pair, but with three additional fields that give the starting point and offset of the file chunk.

I also merged doCopyFile() and doCopyFileChunks(), so we now have only one doCopyFile() method.
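One way to read the merge described above is that a whole-file copy becomes the special case of a chunk copy that starts at offset 0 and runs for the full file length. The sketch below illustrates that idea with plain java.io streams. It is not the code from the patch: ChunkCopySketch, copyRange, and copyFile are hypothetical names, and the real doCopyFile() works against Hadoop file system streams rather than these classes.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;

/** Hypothetical sketch: a single copy routine that serves whole files and chunks alike. */
public class ChunkCopySketch {

    /**
     * Copies up to length bytes from in to out, starting offset bytes into
     * the stream. Returns the number of bytes actually copied, which is
     * smaller than length if the stream ends early. I/O failures are
     * rethrown unchecked to keep the example short.
     */
    public static long copyRange(InputStream in, OutputStream out, long offset, long length) {
        try {
            // Advance to the start of the chunk. skip() may skip fewer bytes
            // than requested, so loop and fall back to read() if it stalls.
            long toSkip = offset;
            while (toSkip > 0) {
                long skipped = in.skip(toSkip);
                if (skipped <= 0) {
                    if (in.read() < 0) {
                        throw new EOFException("offset is past the end of the stream");
                    }
                    skipped = 1;
                }
                toSkip -= skipped;
            }

            byte[] buffer = new byte[8192];
            long remaining = length;
            while (remaining > 0) {
                int n = in.read(buffer, 0, (int) Math.min(buffer.length, remaining));
                if (n < 0) {
                    break; // source ended before the chunk was complete
                }
                out.write(buffer, 0, n);
                remaining -= n;
            }
            return length - remaining;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** A whole-file copy is the chunk that starts at 0 and spans the file. */
    public static long copyFile(InputStream in, OutputStream out, long fileLength) {
        return copyRange(in, out, 0, fileLength);
    }
}
```

Keeping one routine means the chunked and unchunked paths cannot drift apart, which seems to be the concern behind the review comment about the if/else in copy().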
[jira] [Commented] (MAPREDUCE-2257) distcp can copy blocks in parallel
[ https://issues.apache.org/jira/browse/MAPREDUCE-2257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13020936#comment-13020936 ]

Rodrigo Schmidt commented on MAPREDUCE-2257:
--------------------------------------------

The class FileChunkPair is not really a pair, right? It stores 5 fields.

Can't we somehow unify the if/else in copy()? At least doCopyFile() could use doCopyFileChunks().
[jira] [Commented] (MAPREDUCE-2257) distcp can copy blocks in parallel
[ https://issues.apache.org/jira/browse/MAPREDUCE-2257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13016023#comment-13016023 ]

Rosie Li commented on MAPREDUCE-2257:
-------------------------------------

The failure of the contrib test is not related to the new distcp.
[jira] [Commented] (MAPREDUCE-2257) distcp can copy blocks in parallel
[ https://issues.apache.org/jira/browse/MAPREDUCE-2257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13014214#comment-13014214 ]

Hadoop QA commented on MAPREDUCE-2257:
--------------------------------------

-1 overall. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12475126/MAPREDUCE-2257.patch
  against trunk revision 1087098.

    +1 @author. The patch does not contain any @author tags.
    +1 tests included. The patch appears to include 4 new or modified tests.
    +1 javadoc. The javadoc tool did not generate any warning messages.
    -1 javac. The applied patch generated 2256 javac compiler warnings (more than the trunk's current 2244 warnings).
    +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
    +1 release audit. The applied patch does not increase the total number of release audit warnings.
    +1 core tests. The patch passed core unit tests.
    -1 contrib tests. The patch failed contrib unit tests.
    +1 system test framework. The patch passed system test framework compile.

Test results: https://hudson.apache.org/hudson/job/PreCommit-MAPREDUCE-Build/152//testReport/
Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-MAPREDUCE-Build/152//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: https://hudson.apache.org/hudson/job/PreCommit-MAPREDUCE-Build/152//console

This message is automatically generated.
[jira] [Commented] (MAPREDUCE-2257) distcp can copy blocks in parallel
[ https://issues.apache.org/jira/browse/MAPREDUCE-2257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13014085#comment-13014085 ]

Allen Wittenauer commented on MAPREDUCE-2257:
---------------------------------------------

> By default, distcp.copy.by.chunk is set to true in the configuration. The user
> can set it to false to use the original distcp. But the type of destination
> will be checked afterward. distcp.copy.by.chunk will remain true only if the
> destination file system is the distributed file system.

This needs to get added to the release notes.
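The behavior quoted above amounts to a two-step decision: read the flag with a default of true, then keep it true only when the destination is the distributed file system. A minimal sketch of that rule follows. It uses java.util.Properties as a stand-in for the Hadoop configuration object, and the class name CopyByChunkFlag, the method resolve, and the boolean parameter are hypothetical; only the key distcp.copy.by.chunk comes from the thread.

```java
import java.util.Properties;

/** Hypothetical sketch of how the copy-by-chunk flag could be resolved. */
public class CopyByChunkFlag {

    /** Configuration key named in the thread. */
    public static final String KEY = "distcp.copy.by.chunk";

    /**
     * Returns true only if the user has not disabled chunked copy AND the
     * destination is the distributed file system. The flag defaults to true
     * when it is absent from the configuration.
     */
    public static boolean resolve(Properties conf, boolean destinationIsDistributedFs) {
        boolean requested = Boolean.parseBoolean(conf.getProperty(KEY, "true"));
        return requested && destinationIsDistributedFs;
    }
}
```

Under this reading, a user who leaves the flag untouched still gets the original per-file behavior when copying to a non-HDFS destination, which is why the comment asks for the behavior to be spelled out in the release notes.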
[jira] [Commented] (MAPREDUCE-2257) distcp can copy blocks in parallel
[ https://issues.apache.org/jira/browse/MAPREDUCE-2257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13013901#comment-13013901 ]

Hadoop QA commented on MAPREDUCE-2257:
--------------------------------------

-1 overall. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12475059/MAPREDUCE-2257.patch
  against trunk revision 1087098.

    +1 @author. The patch does not contain any @author tags.
    +1 tests included. The patch appears to include 4 new or modified tests.
    +1 javadoc. The javadoc tool did not generate any warning messages.
    -1 javac. The applied patch generated 2256 javac compiler warnings (more than the trunk's current 2244 warnings).
    -1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings.
    +1 release audit. The applied patch does not increase the total number of release audit warnings.
    +1 core tests. The patch passed core unit tests.
    -1 contrib tests. The patch failed contrib unit tests.
    +1 system test framework. The patch passed system test framework compile.

Test results: https://hudson.apache.org/hudson/job/PreCommit-MAPREDUCE-Build/150//testReport/
Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-MAPREDUCE-Build/150//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: https://hudson.apache.org/hudson/job/PreCommit-MAPREDUCE-Build/150//console

This message is automatically generated.
[jira] [Commented] (MAPREDUCE-2257) distcp can copy blocks in parallel
[ https://issues.apache.org/jira/browse/MAPREDUCE-2257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13012260#comment-13012260 ]

Hadoop QA commented on MAPREDUCE-2257:
--------------------------------------

-1 overall. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12474806/MAPREDUCE-2257.patch
  against trunk revision 1082703.

    +1 @author. The patch does not contain any @author tags.
    +1 tests included. The patch appears to include 4 new or modified tests.
    +1 javadoc. The javadoc tool did not generate any warning messages.
    -1 javac. The applied patch generated 2256 javac compiler warnings (more than the trunk's current 2244 warnings).
    -1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings.
    +1 release audit. The applied patch does not increase the total number of release audit warnings.
    +1 core tests. The patch passed core unit tests.
    -1 contrib tests. The patch failed contrib unit tests.
    +1 system test framework. The patch passed system test framework compile.

Test results: https://hudson.apache.org/hudson/job/PreCommit-MAPREDUCE-Build/147//testReport/
Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-MAPREDUCE-Build/147//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: https://hudson.apache.org/hudson/job/PreCommit-MAPREDUCE-Build/147//console

This message is automatically generated.
[jira] [Commented] (MAPREDUCE-2257) distcp can copy blocks in parallel
[ https://issues.apache.org/jira/browse/MAPREDUCE-2257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13012211#comment-13012211 ]

Hadoop QA commented on MAPREDUCE-2257:
--------------------------------------

-1 overall. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12474807/MAPREDUCE-2257.patch
  against trunk revision 1082703.

    +1 @author. The patch does not contain any @author tags.
    +1 tests included. The patch appears to include 4 new or modified tests.
    +1 javadoc. The javadoc tool did not generate any warning messages.
    -1 javac. The patch appears to cause tar ant target to fail.
    -1 findbugs. The patch appears to cause Findbugs (version 1.3.9) to fail.
    +1 release audit. The applied patch does not increase the total number of release audit warnings.
    -1 core tests. The patch failed these core unit tests:
    -1 contrib tests. The patch failed contrib unit tests.
    -1 system test framework. The patch failed system test framework compile.

Test results: https://hudson.apache.org/hudson/job/PreCommit-MAPREDUCE-Build/148//testReport/
Console output: https://hudson.apache.org/hudson/job/PreCommit-MAPREDUCE-Build/148//console

This message is automatically generated.
[jira] Commented: (MAPREDUCE-2257) distcp can copy blocks in parallel
[ https://issues.apache.org/jira/browse/MAPREDUCE-2257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008669#comment-13008669 ] Rosie Li commented on MAPREDUCE-2257: - I'm working on this feature right now. Already done writing the code. Testing now. > distcp can copy blocks in parallel > -- > > Key: MAPREDUCE-2257 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-2257 > Project: Hadoop Map/Reduce > Issue Type: Improvement > Components: distcp >Reporter: dhruba borthakur >Assignee: dhruba borthakur > > The minimum unit of work for a distcp task is a file. We have files that are > greater than 1 TB with a block size of 1 GB. If we use distcp to copy these > files, the tasks either take a long long long time or finally fail. A better > way for distcp would be to copy all the source blocks in parallel, and then > stitch the blocks back to files at the destination via the HDFS Concat API > (HDFS-222)
[jira] Commented: (MAPREDUCE-2257) distcp can copy blocks in parallel
[ https://issues.apache.org/jira/browse/MAPREDUCE-2257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008557#comment-13008557 ] gopikannan venugopalsamy commented on MAPREDUCE-2257: - I want to work on this. Hey nikhil, would you like to discuss? > distcp can copy blocks in parallel > -- > > Key: MAPREDUCE-2257 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-2257 > Project: Hadoop Map/Reduce > Issue Type: Improvement > Components: distcp >Reporter: dhruba borthakur >Assignee: dhruba borthakur > > The minimum unit of work for a distcp task is a file. We have files that are > greater than 1 TB with a block size of 1 GB. If we use distcp to copy these > files, the tasks either take a long long long time or finally fail. A better > way for distcp would be to copy all the source blocks in parallel, and then > stitch the blocks back to files at the destination via the HDFS Concat API > (HDFS-222)
[jira] Commented: (MAPREDUCE-2257) distcp can copy blocks in parallel
[ https://issues.apache.org/jira/browse/MAPREDUCE-2257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13002925#comment-13002925 ] nikhil commented on MAPREDUCE-2257: --- Is anyone working on this feature? > distcp can copy blocks in parallel > -- > > Key: MAPREDUCE-2257 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-2257 > Project: Hadoop Map/Reduce > Issue Type: Improvement > Components: distcp >Reporter: dhruba borthakur >Assignee: dhruba borthakur > > The minimum unit of work for a distcp task is a file. We have files that are > greater than 1 TB with a block size of 1 GB. If we use distcp to copy these > files, the tasks either take a long long long time or finally fail. A better > way for distcp would be to copy all the source blocks in parallel, and then > stitch the blocks back to files at the destination via the HDFS Concat API > (HDFS-222)
[jira] Commented: (MAPREDUCE-2257) distcp can copy blocks in parallel
[ https://issues.apache.org/jira/browse/MAPREDUCE-2257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12989456#comment-12989456 ] gopikannan venugopalsamy commented on MAPREDUCE-2257: - Hello, I wish to contribute to this issue, but I am new to this project. Can you guys give some tips on where to start? > distcp can copy blocks in parallel > -- > > Key: MAPREDUCE-2257 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-2257 > Project: Hadoop Map/Reduce > Issue Type: Improvement > Components: distcp >Reporter: dhruba borthakur >Assignee: dhruba borthakur > > The minimum unit of work for a distcp task is a file. We have files that are > greater than 1 TB with a block size of 1 GB. If we use distcp to copy these > files, the tasks either take a long long long time or finally fail. A better > way for distcp would be to copy all the source blocks in parallel, and then > stitch the blocks back to files at the destination via the HDFS Concat API > (HDFS-222)
[jira] Commented: (MAPREDUCE-2257) distcp can copy blocks in parallel
[ https://issues.apache.org/jira/browse/MAPREDUCE-2257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12980490#action_12980490 ] dhruba borthakur commented on MAPREDUCE-2257: - A new option to distcp could trigger parallel-block copy. It cannot be used with hftp. > distcp can copy blocks in parallel > -- > > Key: MAPREDUCE-2257 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-2257 > Project: Hadoop Map/Reduce > Issue Type: Improvement > Components: distcp >Reporter: dhruba borthakur >Assignee: dhruba borthakur > > The minimum unit of work for a distcp task is a file. We have files that are > greater than 1 TB with a block size of 1 GB. If we use distcp to copy these > files, the tasks either take a long long long time or finally fail. A better > way for distcp would be to copy all the source blocks in parallel, and then > stitch the blocks back to files at the destination via the HDFS Concat API > (HDFS-222) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
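For readers following the thread, the idea in the issue description (copy block-sized pieces in parallel, then stitch them together at the destination) can be illustrated with a small local-filesystem sketch. This is not DistCp code and does not come from the attached patch. The names `block_ranges`, `copy_block`, and `parallel_copy`, the `.partNNNNN` naming, and the tiny `BLOCK_SIZE` are all hypothetical, and the final append loop is only a stand-in for the HDFS Concat API (HDFS-222) that the proposal relies on.

```python
import os
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 4  # bytes; hypothetical and tiny so a small demo file spans several blocks


def block_ranges(file_size, block_size):
    """Return (offset, length) pairs that cover the file in block-sized pieces."""
    return [(off, min(block_size, file_size - off))
            for off in range(0, file_size, block_size)]


def copy_block(src, part_path, offset, length):
    """Copy one byte range of src into its own part file."""
    with open(src, "rb") as fin, open(part_path, "wb") as fout:
        fin.seek(offset)
        fout.write(fin.read(length))
    return part_path


def parallel_copy(src, dst, block_size=BLOCK_SIZE, workers=4):
    """Copy src to dst one block per task, then stitch the parts in order."""
    ranges = block_ranges(os.path.getsize(src), block_size)
    parts = [f"{dst}.part{i:05d}" for i in range(len(ranges))]

    # Each block is an independent unit of work, so the copies can overlap.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(copy_block, src, part, off, length)
                   for part, (off, length) in zip(parts, ranges)]
        for future in futures:
            future.result()  # re-raise any failure from a worker

    # Stand-in for the concat step: append the parts in block order.
    with open(dst, "wb") as out:
        for part in parts:
            with open(part, "rb") as fin:
                out.write(fin.read())
            os.remove(part)
    return len(parts)
```

In this sketch a failed block only needs that one part to be recopied rather than the whole file, which is the property that makes the approach attractive for the 1 TB files mentioned in the description.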
[jira] Commented: (MAPREDUCE-2257) distcp can copy blocks in parallel
[ https://issues.apache.org/jira/browse/MAPREDUCE-2257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12980408#action_12980408 ] Allen Wittenauer commented on MAPREDUCE-2257: - Won't changing the unit break hftp? > distcp can copy blocks in parallel > -- > > Key: MAPREDUCE-2257 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-2257 > Project: Hadoop Map/Reduce > Issue Type: Improvement > Components: distcp >Reporter: dhruba borthakur >Assignee: dhruba borthakur > > The minimum unit of work for a distcp task is a file. We have files that are > greater than 1 TB with a block size of 1 GB. If we use distcp to copy these > files, the tasks either take a long long long time or finally fail. A better > way for distcp would be to copy all the source blocks in parallel, and then > stitch the blocks back to files at the destination via the HDFS Concat API > (HDFS-222)