[ https://issues.apache.org/jira/browse/HADOOP-11794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15067190#comment-15067190 ]
Yongjun Zhang commented on HADOOP-11794:
----------------------------------------

Hi [~mithun],

For your question 2: it's because we now support variable block length; see https://issues.apache.org/jira/browse/HDFS-3689?focusedCommentId=14277548&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14277548

For question 1: I'm worried that it will take much longer to prepare the copy listing if we need to get block locations for each file, which is why I think it's easier to defer this to the mapper. As you pointed out, this does incur more communication between the mappers and the NameNode, but since that work is split among the mappers and the calls to the NameNode are scattered across the lifespan of the distcp job, it should be easier than doing it while preparing the copy listing. Plus, we could make each mapper cache block locations when multiple block-ranges of the same file are assigned to the same mapper.

Thanks.

> distcp can copy blocks in parallel
> ----------------------------------
>
> Key: HADOOP-11794
> URL: https://issues.apache.org/jira/browse/HADOOP-11794
> Project: Hadoop Common
> Issue Type: Improvement
> Components: tools/distcp
> Affects Versions: 0.21.0
> Reporter: dhruba borthakur
> Assignee: Yongjun Zhang
> Attachments: MAPREDUCE-2257.patch
>
>
> The minimum unit of work for a distcp task is a file. We have files that are
> greater than 1 TB with a block size of 1 GB. If we use distcp to copy these
> files, the tasks either take a very long time or eventually fail. A better
> approach for distcp would be to copy all the source blocks in parallel, and then
> stitch the blocks back into files at the destination via the HDFS concat API
> (HDFS-222)

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
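The per-mapper caching idea discussed in the comment could be sketched roughly as below. This is a minimal, self-contained illustration, not distcp code: the class name `BlockLocationCache` and the method `lookupFromNameNode` are hypothetical stand-ins (in real Hadoop the lookup would be a `FileSystem#getFileBlockLocations` RPC to the NameNode).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: a mapper-local cache so that when multiple
// block-ranges of the same file are assigned to one mapper, the block
// locations are fetched from the NameNode only once and then reused.
public class BlockLocationCache {
    private final Map<String, String[]> cache = new HashMap<>();
    private int nameNodeCalls = 0; // counts simulated NameNode round-trips

    // Stand-in for the real NameNode RPC (FileSystem#getFileBlockLocations).
    private String[] lookupFromNameNode(String path) {
        nameNodeCalls++;
        return new String[] { path + "@blk_0", path + "@blk_1" };
    }

    // Returns cached locations if present; otherwise does one lookup.
    public String[] getLocations(String path) {
        return cache.computeIfAbsent(path, this::lookupFromNameNode);
    }

    public int getNameNodeCalls() {
        return nameNodeCalls;
    }

    public static void main(String[] args) {
        BlockLocationCache c = new BlockLocationCache();
        // Three block-ranges of the same file handled by this mapper,
        // plus one range of another file:
        c.getLocations("/data/bigfile");
        c.getLocations("/data/bigfile");
        c.getLocations("/data/bigfile");
        c.getLocations("/data/other");
        // Only two NameNode calls were made instead of four.
        System.out.println(c.getNameNodeCalls());
    }
}
```

Since the lookups are spread across many mappers over the lifetime of the job, the NameNode load would also be spread out in time, unlike a burst of lookups during copy-listing preparation.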