Hello Steve,

On Mon, Sep 12, 2011 at 7:57 AM, Steve Lewis <lordjoe2...@gmail.com> wrote:
> I have a problem where there is a single, relatively small (10-20 MB) input
> file. (It happens to be a FASTA file, which will have meaning if you are a
> biologist.) I am already using a custom InputFormat and a custom reader
> to force a custom parsing. The file may generate tens or hundreds of
> millions of key-value pairs, and the mapper does a fair amount of work on
> each record.
> The standard implementation of
>   public List<InputSplit> getSplits(JobContext job) throws IOException {
>
> uses fs.getFileBlockLocations(file, 0, length) to determine the blocks, and
> for a file of this size it will come up with a single InputSplit and a single
> mapper.
> I am looking for a good example of forcing the generation of multiple
> InputSplits for a small file. In this case I am happy if every Mapper
> instance is required to read and parse the entire file, as long as I can
> guarantee that every record is processed by only a single mapper.

Is the file splittable?

You may look at FileInputFormat's "mapred.min.split.size" property. See
http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.html#setMinInputSplitSize(org.apache.hadoop.mapreduce.Job, long)
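
A minimal sketch of driving those knobs from a new-API driver (assuming your
custom InputFormat extends FileInputFormat and isSplitable() returns true).
The stock split size works out to max(minSize, min(maxSize, blockSize)), so
for a 10-20 MB file it is usually the *max* split size that has to come down
before you see more than one split; the numbers below are illustrative:

  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

  public class SplitSizeTuning {
    public static void configure(Job job) {
      // Floor on a split's size in bytes ("mapred.min.split.size").
      FileInputFormat.setMinInputSplitSize(job, 1L * 1024 * 1024);  // 1 MB
      // Ceiling on a split's size in bytes ("mapred.max.split.size");
      // with this, a 20 MB file yields roughly 10 splits / 10 map tasks.
      FileInputFormat.setMaxInputSplitSize(job, 2L * 1024 * 1024);  // 2 MB
    }
  }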

Alternatively, 'NLineInputFormat' may be what you're really looking for:
it lets you cap the number of input lines (and hence records) per mapper
instead of fiddling with byte sizes as above.
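
Roughly, with the new-API NLineInputFormat (untested sketch; older releases
ship the equivalent old-API org.apache.hadoop.mapred.lib.NLineInputFormat
driven by the "mapred.line.input.format.linespermap" property). Note that a
multi-line FASTA record could still straddle a line-based split, so your
reader would have to handle record boundaries itself:

  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

  public class NLineSetup {
    public static void configure(Job job) {
      job.setInputFormatClass(NLineInputFormat.class);
      // Each split (and hence each map task) covers at most this many
      // input lines, regardless of how many bytes that is.
      NLineInputFormat.setNumLinesPerSplit(job, 50000);
    }
  }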

> While I think I see how I might modify  getSplits(JobContext job)  I am not
> sure how and when the code is called when the job is running on the cluster.

The method is called on the client side, at job-submission time: the splits
are computed before any map tasks are launched on the cluster.
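
If you do end up overriding it, the rough shape could be something like the
below (untested sketch; the class name, generic parameters and SPLIT_COUNT
are made up for illustration, and your RecordReader still has to align each
split to the next FASTA record boundary so that no record is read by two
mappers):

  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.List;
  import org.apache.hadoop.fs.BlockLocation;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapreduce.InputSplit;
  import org.apache.hadoop.mapreduce.JobContext;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.input.FileSplit;

  public abstract class MultiSplitInputFormat<K, V>
      extends FileInputFormat<K, V> {

    private static final int SPLIT_COUNT = 8;  // illustrative, not a Hadoop knob

    @Override
    public List<InputSplit> getSplits(JobContext job) throws IOException {
      List<InputSplit> splits = new ArrayList<InputSplit>();
      for (FileStatus file : listStatus(job)) {
        Path path = file.getPath();
        long length = file.getLen();
        // Reuse the single block's hosts as locality hints on every split.
        BlockLocation[] blocks = path.getFileSystem(job.getConfiguration())
            .getFileBlockLocations(file, 0, length);
        String[] hosts = blocks.length > 0 ? blocks[0].getHosts()
                                           : new String[0];
        long chunk = Math.max(1, (length + SPLIT_COUNT - 1) / SPLIT_COUNT);
        // Chop the file into SPLIT_COUNT byte ranges, one mapper each.
        for (long start = 0; start < length; start += chunk) {
          splits.add(new FileSplit(path, start,
              Math.min(chunk, length - start), hosts));
        }
      }
      return splits;
    }
  }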

-- 
Harsh J
