[ https://issues.apache.org/jira/browse/HADOOP-10124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mike Liddell updated HADOOP-10124: ---------------------------------- Attachment: HADOOP-10124.1.patch > Option to shuffle splits of equal size > -------------------------------------- > > Key: HADOOP-10124 > URL: https://issues.apache.org/jira/browse/HADOOP-10124 > Project: Hadoop Common > Issue Type: Improvement > Reporter: Mike Liddell > Attachments: HADOOP-10124.1.patch > > > Mapreduce split calculation has the following base logic (via JobClient and > the major InputFormat implementations ): > ◾enumerate input files in natural (aka linear) order. > ◾create one split for each 'block-size' of each input. Apart from > rack-awareness, combining and so on, the input file order remains in its > natural order. > ◾sort the splits by size using a stable sort based on splitsize. > When data from multiple storage services are used in a single hadoop job, we > get better I/O utilization if the list of splits does round-robin or > random-access across the services. > The particular scenario arises in Azure HDInsight where jobs can easily read > from many storage accounts and each storage account has hard limits on > throughtput. Concurrent access to the accounts is substantially better than > > Two common scenarios can cause non-ideal access pattern: > 1. many/all input files are the same size > 2. files have different sizes, but many/all input files have size>blocksize. > In the second scenario, for each file will have one or more splits with size > exactly equal to block size so it basically degenerates to the first scenario. > There are various ways to solve the problem but the simplest is to alter the > mapreduce JobClient to sort splits by size _and_ randomize the order of > splits with equal size. This keeps the old behavior effectively unchanged > while also fixing both common problematic scenarios. > Some rare scenarios will still suffer bad access patterns due. For example if > two storage accounts are used and the files from one storage account are all > smaller than from the other then problems can arise. Addressing these > scenarios would be further work, perhaps by completely randomizing the split > order. These problematic scenarios are considered rare and not requiring > immediate attention. > If further algorithms for split ordering are necessary, the implementation in > JobClient will change to being interface-based (eg interface splitOrderer) > with various standard implementations. At this time there is only the need > for two implementations and so simple Boolean flag and if/then logic is used. -- This message was sent by Atlassian JIRA (v6.1#6144)