Need a "LineBasedTextInputFormat"
---------------------------------

                 Key: HADOOP-3221
                 URL: https://issues.apache.org/jira/browse/HADOOP-3221
             Project: Hadoop Core
          Issue Type: New Feature
          Components: mapred
    Affects Versions: 0.16.2
         Environment: All
            Reporter: Milind Bhandarkar


In many "pleasantly" parallel applications, each process/mapper processes the 
same input file(s), but the computations are controlled by different 
parameters (referred to as "parameter sweeps").

One way to achieve this is to specify a set of parameters (one set per line) 
as input in a control file. The control file is the input path to the 
map-reduce application, whereas the input dataset is specified via a config 
variable in JobConf.

It would be great to have an InputFormat that splits the input file such that, 
by default, one line is fed as a value to one map task, and the key could be 
the line number, i.e. (k,v) is (LongWritable, Text).

If the user specifies the number of maps explicitly, each mapper should get a 
contiguous chunk of lines (so as to balance the load between the mappers).
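
The splitting rule described above could be sketched roughly as follows. This
is plain Java with no Hadoop dependency, and every name in it
(LineSplitSketch, LineRange, computeRanges) is hypothetical and not part of
any existing API. It assumes the total number of lines in the control file is
already known, and it treats a non-positive requested map count as "not
specified", which is an assumption made only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the line-to-split assignment; not a Hadoop class.
class LineSplitSketch {

    /** A contiguous range of lines: [firstLine, firstLine + numLines). */
    static final class LineRange {
        final long firstLine;
        final long numLines;

        LineRange(long firstLine, long numLines) {
            this.firstLine = firstLine;
            this.numLines = numLines;
        }

        @Override
        public String toString() {
            return "[" + firstLine + ", " + (firstLine + numLines) + ")";
        }
    }

    /**
     * Divides totalLines into contiguous ranges, one range per map task.
     * Assumption: requestedMaps <= 0 means the user did not specify a map
     * count, in which case every line gets its own range.
     */
    static List<LineRange> computeRanges(long totalLines, int requestedMaps) {
        List<LineRange> ranges = new ArrayList<>();
        if (totalLines <= 0) {
            return ranges; // empty control file: nothing to split
        }
        // Never create more ranges than there are lines.
        long numRanges = (requestedMaps <= 0)
                ? totalLines
                : Math.min(requestedMaps, totalLines);
        long base = totalLines / numRanges;   // lines every range receives
        long extra = totalLines % numRanges;  // first 'extra' ranges get one more
        long start = 0;
        for (long i = 0; i < numRanges; i++) {
            long length = base + (i < extra ? 1 : 0);
            ranges.add(new LineRange(start, length));
            start += length;
        }
        return ranges;
    }

    public static void main(String[] args) {
        // 10 parameter lines spread over 3 requested maps.
        System.out.println(computeRanges(10, 3));
        // No map count given: one line per map.
        System.out.println(computeRanges(4, 0));
    }
}
```

Range sizes differ by at most one line, which keeps the mappers roughly
balanced when each line costs about the same to process. A real InputFormat
would still need to translate these line ranges into byte offsets in the
control file.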

The location hints for the splits should not be derived from the input file, 
but rather, should span the whole mapred cluster.

(Is there a way to do this without having to return an array of 
nSplits*nTaskTrackers?)

Increasing the replication of the "real" input dataset (since it will be 
fetched by all the nodes) is orthogonal, and one can use DistributedCache for 
that.

(P.S. Please choose a better name for this InputFormat. I am not in love with 
the "LineBasedText" name.)

