[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13725954#comment-13725954
 ] 

Siddharth Seth commented on MAPREDUCE-5352:
-------------------------------------------

bq. blockToNodes doesnt look like it needs to be a map?
Could likely be replaced by a set. That existed before this patch, haven't 
tried to clean up more than necessary for this specific jira.

bq. What are the results of running the new test with the old code. From what I 
see, the test has a uniform distribution of blocks and the old code should pass 
the test too. The test by itself is a good test to have. Distribution fixes 
like the one in this patch are not easy to test anyways.
The test fails with old code - it ends up generating an additional split. A 
uniform distribution doesn't necessarily mean the old code would generate 
uniform splits since it leaves fewer options for the last few nodes being 
processed. That said, tweaking variables on the test can get the new code to 
generate additional splits as well. In general though, the new code should 
generate better splits more consistently. Agree, verifying the distribution is 
difficult.

bq. Would be great if we can ascertain how the performance of the new algo 
compares to the earlier one. e.g. how much time does it take to create splits 
for 1 million blocks on 10000 machines with 4 blocks per split for example. I 
am expecting that looping once for every split will be slower though not quite 
sure how much.
I don't expect this to cause a big difference. The code removes entries from 
the nodeToBlocks map so that it does not walk those entries for each split 
generated for a node - that seems like most of the additional cost. Otherwise 
it just walks the block list for each node which is similar to the old code.
                
> Optimize node local splits generated by CombineFileInputFormat 
> ---------------------------------------------------------------
>
>                 Key: MAPREDUCE-5352
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5352
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>    Affects Versions: 2.0.5-alpha
>            Reporter: Siddharth Seth
>            Assignee: Siddharth Seth
>         Attachments: MAPREDUCE-5352.1.txt, MAPREDUCE-5352.2.txt, 
> MAPREDUCE-5352.3.txt, MAPREDUCE-5352.4.txt
>
>
> CombineFileInputFormat currently walks through all available nodes and 
> generates multiple (maxSplitsPerNode) splits on a single node before 
> attempting to generate splits on subsequent nodes. This ends up reducing the 
> possibility of generating splits for subsequent nodes - since these blocks 
> will no longer be available for subsequent nodes. Allowing splits to go 1 
> block above the max-split-size makes this worse.
> Allocating a single split per node in one iteration, should help increase the 
> distribution of splits across nodes - so the subsequent nodes will have more 
> blocks to choose from.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to