Hi,

Our intention is to solve this in a generic context, not just file input.
Thus the split class should be generic (very similar to CompositeInputSplit 
from mapred).
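To make the idea concrete, here is a minimal sketch of such a generic composite split. All names here are illustrative (it is not the actual CompositeInputSplit, and the Split interface stands in for Hadoop's InputSplit to keep the sketch self-contained): it simply wraps arbitrary child splits and reports their combined length and the union of their locations.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical generic composite split: wraps any child splits
// (not just file splits) and aggregates their size and locations.
public class CompositeSplit {

    // Minimal stand-in for org.apache.hadoop.mapreduce.InputSplit,
    // used so this sketch compiles without Hadoop on the classpath.
    public interface Split {
        long getLength();
        String[] getLocations();
    }

    private final List<Split> children = new ArrayList<>();

    public void add(Split s) {
        children.add(s);
    }

    // Total bytes across all child splits.
    public long getLength() {
        long total = 0;
        for (Split s : children) total += s.getLength();
        return total;
    }

    // Union of the child splits' preferred hosts.
    public String[] getLocations() {
        Set<String> hosts = new HashSet<>();
        for (Split s : children) {
            for (String h : s.getLocations()) hosts.add(h);
        }
        return hosts.toArray(new String[0]);
    }

    public List<Split> getChildren() {
        return children;
    }
}
```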

We also already implement getRecordReader by iterating over the record readers 
created by the decorated input format (this method is not implemented in 
MultiFileInputFormat).
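The iteration described above can be sketched roughly as follows. This is a simplified stand-in, not the real Hadoop RecordReader API: the Reader interface and all names are illustrative, and the reader simply chains one underlying reader per child split.

```java
import java.util.Iterator;
import java.util.List;

// Sketch of a record reader over a composite split: it opens the
// per-split readers produced by the decorated input format and
// chains them, moving to the next reader when one is exhausted.
public class ChainedRecordReader<K, V> {

    // Minimal stand-in for a Hadoop record reader.
    public interface Reader<K, V> {
        boolean next();       // advance; false when this child split is done
        K currentKey();
        V currentValue();
        void close();
    }

    private final Iterator<Reader<K, V>> readers;
    private Reader<K, V> current;

    public ChainedRecordReader(List<Reader<K, V>> perSplitReaders) {
        this.readers = perSplitReaders.iterator();
    }

    // Advance to the next record, transparently crossing the boundary
    // between underlying child splits.
    public boolean next() {
        while (true) {
            if (current == null) {
                if (!readers.hasNext()) return false;
                current = readers.next();
            }
            if (current.next()) return true;
            current.close();
            current = null;   // fall through to the next child reader
        }
    }

    public K currentKey() { return current.currentKey(); }
    public V currentValue() { return current.currentValue(); }
}
```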

Regarding allocation of such a composite split to a location: in a generic 
context (not just file input), we can do a better job of balancing data 
locality against fair workload distribution than the algorithm in 
MultiFileInputFormat does.

That said, I will dive into FileInputFormat.getSplitHosts to see how its 
rack-locality handling could be folded into a generic, close-to-optimal 
allocation algorithm.

Please check https://issues.apache.org/jira/browse/MAPREDUCE-5287 for Java 
code, already tested, which works in a general context but is optimal only 
for single-location splits, such as the HBase case.

A configuration property is passed to the MapReduce job to specify the 
desired number of map tasks for the job.
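One natural way to use such a setting is to derive how many raw splits to pack into each composite split. The sketch below shows only the arithmetic (ceiling division so the group count never exceeds the desired task count); it does not name any actual Hadoop configuration key.

```java
// Illustrative helper: compute how many raw splits go into each
// composite split, given the desired number of map tasks.
public class GroupingMath {
    public static int splitsPerGroup(int rawSplits, int desiredMaps) {
        if (desiredMaps <= 0) return 1;
        // Ceiling division: with rawSplits=10 and desiredMaps=3 this
        // yields 4 splits per group, i.e. 3 groups in total.
        return (rawSplits + desiredMaps - 1) / desiredMaps;
    }
}
```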

Thanks,
Nicu

On 6/19/13 6:01 PM, "Robert Evans" <ev...@yahoo-inc.com> wrote:

This sounds similar to MultiFileInputFormat

http://svn.apache.org/viewvc/hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/MultiFileInputFormat.java?revision=1239482&view=markup

It would be nice if you could take a look at it and see if there is
something we can do here to improve it/combine the two.

--Bobby

On 6/19/13 2:53 AM, "Nicolae Marasoiu" <nmara...@adobe.com> wrote:

Hi,

When running map-reduce jobs with many splits, it would be nice from a 
performance perspective to have fewer splits while maintaining data 
locality, so that the per-map-task overhead (JVM creation, map executor 
ramp-up such as Spring context initialization, etc.) matters less when 
frequently running map-reduce jobs with little data and processing.

I created such an AggregatingInputFormat that simply groups input splits 
into composite splits sharing the same location, and creates a record reader 
that iterates over the record readers created by the underlying InputFormat 
for the underlying raw splits.
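The grouping step can be sketched as below. This is not the actual AggregatingInputFormat code; SplitInfo is an illustrative stand-in for an input split, and it assumes single-location splits (the HBase case), whereas real code would consider all replica locations.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of grouping raw splits by host, so that each bucket can
// become one composite split with a single location.
public class SplitGrouper {

    // Illustrative stand-in for an input split with one location.
    public static class SplitInfo {
        public final String id;
        public final String host;   // assume single-location splits
        public SplitInfo(String id, String host) {
            this.id = id;
            this.host = host;
        }
    }

    // Bucket raw splits by host; each value list is one composite split.
    public static Map<String, List<SplitInfo>> groupByHost(List<SplitInfo> raw) {
        Map<String, List<SplitInfo>> groups = new HashMap<>();
        for (SplitInfo s : raw) {
            groups.computeIfAbsent(s.host, h -> new ArrayList<>()).add(s);
        }
        return groups;
    }
}
```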

Currently we intend to use it for HBase sharding, but I would also like to 
implement an optimal algorithm that ensures both fair distribution and 
locality, which I can describe if you find it useful to apply to 
multi-location inputs such as replicated Kafka or HDFS.

Thanks,
waiting for your feedback,
Nicu Marasoiu
Adobe

