[ https://issues.apache.org/jira/browse/HADOOP-11794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15067190#comment-15067190 ]

Yongjun Zhang commented on HADOOP-11794:
----------------------------------------

Hi [~mithun],

For your question 2: it's because we now support variable-length blocks; see
https://issues.apache.org/jira/browse/HDFS-3689?focusedCommentId=14277548&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14277548

For question 1, I'm worried that preparing the copy listing would take much 
longer if we had to fetch block locations for every file up front. That's why 
I think it's easier to defer this to the mapper. As you pointed out, this does 
add more communication between the mappers and the NN, but since it's split 
among mappers, and the calls to the NN are scattered over the lifespan of the 
distcp job, it should be easier than doing it while preparing the copy 
listing. Plus, we could have a mapper cache block locations when multiple 
block-ranges of the same file are assigned to it; see the sketch below.
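
To make the caching idea concrete, here is a minimal sketch of a per-mapper 
cache keyed by source file, so a mapper that handles several block-ranges of 
the same file contacts the NN only once. The class and method names 
(BlockLocationCache, getLocations) are hypothetical, not from any posted patch:

{code:java}
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical per-mapper cache: one NN round trip per source file,
// reused across all block-ranges of that file assigned to this mapper.
class BlockLocationCache {
  private final FileSystem fs;
  private final Map<Path, BlockLocation[]> cache =
      new HashMap<Path, BlockLocation[]>();

  BlockLocationCache(FileSystem fs) {
    this.fs = fs;
  }

  BlockLocation[] getLocations(Path file, long fileLen) throws IOException {
    BlockLocation[] locs = cache.get(file);
    if (locs == null) {
      // Single call covering the whole file; cached for later block-ranges.
      locs = fs.getFileBlockLocations(file, 0, fileLen);
      cache.put(file, locs);
    }
    return locs;
  }
}
{code}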

Thanks.




> distcp can copy blocks in parallel
> ----------------------------------
>
>                 Key: HADOOP-11794
>                 URL: https://issues.apache.org/jira/browse/HADOOP-11794
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: tools/distcp
>    Affects Versions: 0.21.0
>            Reporter: dhruba borthakur
>            Assignee: Yongjun Zhang
>         Attachments: MAPREDUCE-2257.patch
>
>
> The minimum unit of work for a distcp task is a file. We have files that are 
> greater than 1 TB with a block size of 1 GB. If we use distcp to copy these 
> files, the tasks either take a very long time or eventually fail. A better 
> way for distcp would be to copy all the source blocks in parallel, and then 
> stitch the blocks back into files at the destination via the HDFS concat API 
> (HDFS-222).
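
To illustrate the stitch step described above, here is a minimal sketch using 
the HDFS concat API from HDFS-222 (DistributedFileSystem.concat), assuming 
each mapper writes its block-range to a temporary chunk file next to the 
target; the ".chunk.N" naming scheme is hypothetical:

{code:java}
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

class ChunkStitcher {
  // Stitch target.chunk.0 .. target.chunk.(numChunks-1) into the final file.
  static void stitch(DistributedFileSystem dfs, Path target, int numChunks)
      throws IOException {
    // Chunk 0 becomes the final file; the remaining chunks are appended to it.
    dfs.rename(chunkPath(target, 0), target);
    if (numChunks > 1) {
      Path[] rest = new Path[numChunks - 1];
      for (int i = 1; i < numChunks; i++) {
        rest[i - 1] = chunkPath(target, i);
      }
      // concat moves blocks inside the NameNode; no data is copied.
      // Precondition: every chunk except the last must end on a block
      // boundary, which block-aligned splits give us for free.
      dfs.concat(target, rest);
    }
  }

  private static Path chunkPath(Path target, int i) {
    return new Path(target.getParent(), target.getName() + ".chunk." + i);
  }
}
{code}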


