Hi,

Our intention is to solve this in a generic context, not just file input. Thus the split class should be generic (very similar to CompositeInputSplit from mapred).
We also already implement getRecordReader by iterating over the record readers created by the decorated input format (this method is not implemented in MultiFileInputFormat).

Regarding allocation of such a composite split to a location: in a generic context (not just file input), a better job can be done at choosing a sweet spot between data locality and fair workload distribution than the algorithm used in MultiFileInputFormat does. On the other hand, I will dive into FileInputFormat.getSplitHosts to evaluate folding rack locality into a generic, close-to-optimal allocation algorithm. Please check https://issues.apache.org/jira/browse/MAPREDUCE-5287 for already-tested Java code that works in a general context but is optimal only for single-location splits, such as the HBase case. A configuration option is given to the MapReduce job to specify the desired number of map tasks.

Thanks,
Nicu

On 6/19/13 6:01 PM, "Robert Evans" <ev...@yahoo-inc.com> wrote:

This sounds similar to MultiFileInputFormat:
http://svn.apache.org/viewvc/hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/MultiFileInputFormat.java?revision=1239482&view=markup

It would be nice if you could take a look at it and see if there is something we can do here to improve it/combine the two.

--Bobby

On 6/19/13 2:53 AM, "Nicolae Marasoiu" <nmara...@adobe.com> wrote:

Hi,

When running a map-reduce with many splits, it would be nice from a performance perspective to have fewer splits while maintaining data locality, so that the overhead of running a map task (JVM creation, map executor ramp-up, e.g. Spring context, etc.) is less impactful when frequently running map-reduces with low data and processing volume.
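The grouping idea described here could be sketched roughly as follows, outside Hadoop, with a plain-Java stand-in for InputSplit (all names here are illustrative, not the actual patch API): raw splits are bucketed by their preferred location, then each host's bucket is chunked so the job ends up with roughly the configured number of composite splits.

```java
import java.util.*;

public class SplitGrouper {
    // Hypothetical stand-in for an InputSplit with one preferred location;
    // real Hadoop splits expose String[] getLocations().
    record Split(String location, long length) {}

    // Group raw splits by location, then chunk each host's list so the job
    // produces roughly `desiredMaps` composite splits instead of one map
    // task per raw split.
    static List<List<Split>> aggregate(List<Split> splits, int desiredMaps) {
        Map<String, List<Split>> byHost = new LinkedHashMap<>();
        for (Split s : splits)
            byHost.computeIfAbsent(s.location(), k -> new ArrayList<>()).add(s);

        int perHost = Math.max(1, desiredMaps / Math.max(1, byHost.size()));
        List<List<Split>> composites = new ArrayList<>();
        for (List<Split> hostSplits : byHost.values()) {
            int chunk = (int) Math.ceil((double) hostSplits.size() / perHost);
            for (int i = 0; i < hostSplits.size(); i += chunk)
                composites.add(hostSplits.subList(i, Math.min(i + chunk, hostSplits.size())));
        }
        return composites;
    }

    public static void main(String[] args) {
        List<Split> raw = List.of(
            new Split("hostA", 10), new Split("hostA", 30),
            new Split("hostB", 20), new Split("hostB", 5));
        System.out.println(aggregate(raw, 2).size()); // prints 2
    }
}
```

This keeps each composite split single-location (the case the patch is optimal for); balancing across replicated multi-location splits is the harder allocation problem discussed above.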
I created such an AggregatingInputFormat that simply groups input splits into composite ones with the same location, and creates a record reader that iterates over the record readers created by the underlying input format for the underlying raw splits. Currently we intend to use it for HBase sharding, but I would also like to implement an optimal algorithm that ensures both fair distribution and locality, which I can describe if you find it useful to apply to multi-location splits such as replicated Kafka or HDFS.

Thanks, waiting for your feedback,
Nicu Marasoiu
Adobe
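The "record reader that iterates over the record readers of the raw splits" part can be sketched with plain iterators (Iterator&lt;String&gt; stands in for Hadoop's RecordReader; the class name is hypothetical, not the actual patch code):

```java
import java.util.*;

// Sketch of the composite record reader idea: exhaust each underlying
// split's reader in turn, presenting them to the caller as one stream.
public class CompositeReader implements Iterator<String> {
    private final Iterator<Iterator<String>> readers;
    private Iterator<String> current = Collections.emptyIterator();

    CompositeReader(List<Iterator<String>> underlying) {
        this.readers = underlying.iterator();
    }

    @Override public boolean hasNext() {
        // Advance to the next raw split's reader once the current one is done.
        while (!current.hasNext() && readers.hasNext())
            current = readers.next();
        return current.hasNext();
    }

    @Override public String next() {
        if (!hasNext()) throw new NoSuchElementException();
        return current.next();
    }

    public static void main(String[] args) {
        CompositeReader r = new CompositeReader(List.of(
            List.of("a", "b").iterator(), List.of("c").iterator()));
        List<String> all = new ArrayList<>();
        r.forEachRemaining(all::add);
        System.out.println(all); // prints [a, b, c]
    }
}
```

In the real input format, "advancing to the next reader" would also mean closing the finished RecordReader and opening the next one against its raw split, which the iterator model above elides.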