Mike Liddell created HADOOP-10124:
-------------------------------------

             Summary: Option to shuffle splits of equal size
                 Key: HADOOP-10124
                 URL: https://issues.apache.org/jira/browse/HADOOP-10124
             Project: Hadoop Common
          Issue Type: Improvement
            Reporter: Mike Liddell


Mapreduce split calculation has the following base logic (via JobClient and the 
major InputFormat implementations ):
◾enumerate input files in natural (aka linear) order.
◾create one split for each 'block-size' of each input. Apart from 
rack-awareness, combining and so on, the input file order remains in its 
natural order.
◾sort the splits by size using a stable sort based on splitsize.

When data from multiple storage services are used in a single hadoop job, we 
get better I/O utilization if the list of splits does round-robin or 
random-access across the services. 
The particular scenario arises in Azure HDInsight where jobs can easily read 
from many storage accounts and each storage account has hard limits on 
throughtput.  Concurrent access to the accounts is substantially better than 
 
Two common scenarios can cause non-ideal access pattern:
 1. many/all input files are the same size
 2. files have different sizes, but many/all input files have size>blocksize.
 In the second scenario, for each file will have one or more splits with size 
exactly equal to block size so it basically degenerates to the first scenario.

There are various ways to solve the problem but the simplest is to alter the 
mapreduce JobClient to sort splits by size _and_ randomize the order of splits 
with equal size. This keeps the old behavior effectively unchanged while also 
fixing both common problematic scenarios.

Some rare scenarios will still suffer bad access patterns due. For example if 
two storage accounts are used and the files from one storage account are all 
smaller than from the other then problems can arise. Addressing these scenarios 
would be further work, perhaps by completely randomizing the split order. These 
problematic scenarios are considered rare and not requiring immediate attention.

If further algorithms for split ordering are necessary, the implementation in 
JobClient will change to being interface-based (eg interface splitOrderer) with 
various standard implementations.  At this time there is only the need for two 
implementations and so simple Boolean flag and if/then logic is used.




--
This message was sent by Atlassian JIRA
(v6.1#6144)

Reply via email to