Mike Liddell created HADOOP-10124: ------------------------------------- Summary: Option to shuffle splits of equal size Key: HADOOP-10124 URL: https://issues.apache.org/jira/browse/HADOOP-10124 Project: Hadoop Common Issue Type: Improvement Reporter: Mike Liddell
Mapreduce split calculation has the following base logic (via JobClient and the major InputFormat implementations ): ◾enumerate input files in natural (aka linear) order. ◾create one split for each 'block-size' of each input. Apart from rack-awareness, combining and so on, the input file order remains in its natural order. ◾sort the splits by size using a stable sort based on splitsize. When data from multiple storage services are used in a single hadoop job, we get better I/O utilization if the list of splits does round-robin or random-access across the services. The particular scenario arises in Azure HDInsight where jobs can easily read from many storage accounts and each storage account has hard limits on throughtput. Concurrent access to the accounts is substantially better than Two common scenarios can cause non-ideal access pattern: 1. many/all input files are the same size 2. files have different sizes, but many/all input files have size>blocksize. In the second scenario, for each file will have one or more splits with size exactly equal to block size so it basically degenerates to the first scenario. There are various ways to solve the problem but the simplest is to alter the mapreduce JobClient to sort splits by size _and_ randomize the order of splits with equal size. This keeps the old behavior effectively unchanged while also fixing both common problematic scenarios. Some rare scenarios will still suffer bad access patterns due. For example if two storage accounts are used and the files from one storage account are all smaller than from the other then problems can arise. Addressing these scenarios would be further work, perhaps by completely randomizing the split order. These problematic scenarios are considered rare and not requiring immediate attention. If further algorithms for split ordering are necessary, the implementation in JobClient will change to being interface-based (eg interface splitOrderer) with various standard implementations. At this time there is only the need for two implementations and so simple Boolean flag and if/then logic is used. -- This message was sent by Atlassian JIRA (v6.1#6144)