[jira] Updated: (HADOOP-3221) Need a "LineBasedTextInputFormat"

Chris Douglas (JIRA) Sat, 10 May 2008 20:42:19 -0700

     [ 
https://issues.apache.org/jira/browse/HADOOP-3221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Chris Douglas updated HADOOP-3221:
----------------------------------

    Status: Open  (was: Patch Available)

This implements something slightly different from the requirements as stated: 
it takes the input file(s) and encodes each line (or a subset of lines) as a 
split, rather than specifying a partition of a resource with one split per 
line. This has some clear advantages for the issue at hand, i.e. one map per 
line of text, where a vanilla FileSplit is likely as large (path + offsets + 
locations) as the relevant line of text, and task placement is not misled by 
the locations of the input file's blocks.

That said, slurping all the input files and writing their contents into the 
splits may not be the best approach. The result is likely to be close to 
guessing even offsets into each input (without reading each file), and while 
there's a possible space savings if both the line length and N are small, it's 
close enough that the value added may not distinguish it from an InputFormat 
returning closely cropped FileSplits, stripped of locations. The use and 
purpose of this new InputFormat might be clearer (though not what this patch 
implements) if one set a property that governs how many lines are in each split 
(defaulting to 1).\* Since the JobTracker has to read in all the splits (and 
hold them in memory for the duration of the job), limiting the size of the file 
the user points this at would be a good idea (via a property that said user, 
if feeling daring or malicious, could override). If you felt daring, you could 
even mix stripped-down FileSplits with LineSplits based on the length of each 
section, since the classname of each split is encoded into job.splits.
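
As a rough illustration of the "closely cropped" alternative, the sketch below scans an input once and emits one (start offset, length) pair per group of N lines, refusing inputs with more lines than a configurable limit. It is plain Java with no Hadoop classes; `CroppedRanges`, `LineRange`, `cropByLines`, `linesPerSplit` and `maxLines` are hypothetical names made up for this example, not anything in the patch.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class CroppedRanges {
    /** Hypothetical stand-in for a FileSplit stripped of path and locations. */
    public static final class LineRange {
        public final long start;   // byte offset of the first line in the range
        public final long length;  // bytes covered, including newline characters
        LineRange(long start, long length) {
            this.start = start;
            this.length = length;
        }
    }

    /**
     * Groups the lines of data into byte ranges of linesPerSplit lines each.
     * The last range may hold fewer lines. Throws if the input has more than
     * maxLines lines (assumed to come from a user-settable property).
     */
    public static List<LineRange> cropByLines(byte[] data, int linesPerSplit, int maxLines) {
        if (linesPerSplit < 1) {
            throw new IllegalArgumentException("linesPerSplit must be >= 1");
        }
        List<LineRange> ranges = new ArrayList<>();
        long start = 0;
        int linesInRange = 0;
        int totalLines = 0;
        for (int i = 0; i < data.length; i++) {
            if (data[i] != '\n') {
                continue;
            }
            totalLines++;
            if (totalLines > maxLines) {
                throw new IllegalStateException("input exceeds " + maxLines + " lines");
            }
            linesInRange++;
            if (linesInRange == linesPerSplit) {
                ranges.add(new LineRange(start, i + 1 - start));
                start = i + 1;
                linesInRange = 0;
            }
        }
        if (start < data.length) {
            // Leftover lines, possibly ending in a line without a trailing newline.
            if (data[data.length - 1] != '\n' && totalLines + 1 > maxLines) {
                throw new IllegalStateException("input exceeds " + maxLines + " lines");
            }
            ranges.add(new LineRange(start, data.length - start));
        }
        return ranges;
    }

    public static void main(String[] args) {
        byte[] data = "a\nbb\nccc\ndddd\neeeee\n".getBytes(StandardCharsets.UTF_8);
        for (LineRange r : cropByLines(data, 2, 1000)) {
            System.out.println(r.start + " +" + r.length);
        }
    }
}
```

Each range carries only two longs, so the per-split cost is fixed regardless of line length; whether that beats storing the line text itself depends on how short the lines are.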

A few nits:
* This should be in o.a.h.mapred.lib, not o.a.h.mapred
* Since the map expects Text, LineSplit might as well keep Text[] rather than 
String[]
* It might be worthwhile to use LineRecordReader instead of InputStreamReader
* I'm fairly certain that "line number" should not be local to the split, but 
either the line number in the original input file or an offset into that file.

\* Semantically, it's not clear how to regard files with a number of lines not 
evenly divided by N; the current patch would group lines from different files 
into the same split, which might not be what users would expect, but the 
particular choice is not critical as long as it's documented.
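
One possible policy, sketched below in plain Java, is to never let a group cross a file boundary, so a file whose line count is not a multiple of N simply ends with a short group. Each group also records the line number of its first line in the original file rather than a split-local index. `PerFileGrouping`, `Group` and `group` are hypothetical names for this example only, and this is not what the current patch does.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PerFileGrouping {
    /** Hypothetical holder for up to n lines taken from a single file. */
    public static final class Group {
        public final String file;        // name of the file the lines came from
        public final long firstLine;     // 0-based line number within that file
        public final List<String> lines; // the lines themselves
        Group(String file, long firstLine, List<String> lines) {
            this.file = file;
            this.firstLine = firstLine;
            this.lines = lines;
        }
    }

    /**
     * Groups each file's lines into chunks of n. A chunk never spans two
     * files; the last chunk of a file may contain fewer than n lines.
     */
    public static List<Group> group(Map<String, List<String>> files, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be >= 1");
        }
        List<Group> groups = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : files.entrySet()) {
            List<String> lines = e.getValue();
            for (int i = 0; i < lines.size(); i += n) {
                int end = Math.min(i + n, lines.size());
                groups.add(new Group(e.getKey(), i, new ArrayList<>(lines.subList(i, end))));
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        Map<String, List<String>> files = new LinkedHashMap<>();
        files.put("a.txt", List.of("p1", "p2", "p3"));
        files.put("b.txt", List.of("q1", "q2"));
        for (Group g : group(files, 2)) {
            System.out.println(g.file + " @" + g.firstLine + " " + g.lines);
        }
    }
}
```

The alternative of packing lines across files keeps every group at exactly N lines except the very last, at the cost of a group whose lines do not share a single source file.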

> Need a "LineBasedTextInputFormat"
> ---------------------------------
>
>                 Key: HADOOP-3221
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3221
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.16.2
>         Environment: All
>            Reporter: Milind Bhandarkar
>            Assignee: Amareshwari Sriramadasu
>             Fix For: 0.18.0
>
>         Attachments: patch-3221-1.txt, patch-3221.txt
>
>
> In many "pleasantly" parallel applications, each process/mapper processes the 
> same input file(s), but with computations controlled by different 
> parameters.
> (Referred to as "parameter sweeps").
> One way to achieve this is to specify a set of parameters (one set per line) 
> as input in a control file (which is the input path to the map-reduce 
> application, whereas the input dataset is specified via a config variable in 
> JobConf).
> It would be great to have an InputFormat, that splits the input file such 
> that by default, one line is fed as a value to one map task, and key could be 
> line number. i.e. (k,v) is (LongWritable, Text).
> If user specifies the number of maps explicitly, each mapper should get a 
> contiguous chunk of lines (so as to load balance between the mappers.)
> The location hints for the splits should not be derived from the input file, 
> but rather, should span the whole mapred cluster.
> (Is there a way to do this without having to return an array of 
> nSplits*nTaskTrackers ?)
> Increasing the replication of the "real" input dataset (since it will be 
> fetched by all the nodes) is orthogonal, and one can use DistributedCache for 
> that.
> (P.S. Please choose a better name for this InputFormat. I am not in love with 
> the "LineBasedText" name.)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
