Hello Steve,

On Mon, Sep 12, 2011 at 7:57 AM, Steve Lewis <lordjoe2...@gmail.com> wrote:
> I have a problem where there is a single, relatively small (10-20 MB)
> input file. (It happens to be a FASTA file, which will have meaning if
> you are a biologist.) I am already using a custom InputFormat and a
> custom reader to force a custom parsing. The file may generate tens or
> hundreds of millions of key/value pairs, and the mapper does a fair
> amount of work on each record.
>
> The standard implementation of
>
>   public List<InputSplit> getSplits(JobContext job) throws IOException {
>
> uses fs.getFileBlockLocations(file, 0, length) to determine the blocks,
> and for a file of this size it will come up with a single InputSplit and
> a single mapper.
>
> I am looking for a good example of forcing the generation of multiple
> InputSplits for a small file. In this case I am happy if every Mapper
> instance is required to read and parse the entire file, as long as I can
> guarantee that every record is processed by only a single mapper.
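If you do end up rolling your own getSplits(), one way is to ignore the
block layout entirely and carve the file into a fixed number of byte
ranges. A rough, untested sketch (class name, key/value types and the
split count are only illustrative):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class SmallFastaInputFormat extends FileInputFormat<LongWritable, Text> {

  // How many mappers to force per input file; could also be read from the conf.
  private static final int SPLITS_PER_FILE = 16;

  @Override
  public List<InputSplit> getSplits(JobContext job) throws IOException {
    List<InputSplit> splits = new ArrayList<InputSplit>();
    for (FileStatus status : listStatus(job)) {
      Path path = status.getPath();
      long length = status.getLen();
      long chunk = Math.max(1, length / SPLITS_PER_FILE);
      for (long start = 0; start < length; start += chunk) {
        long size = Math.min(chunk, length - start);
        // No host hints; the file is tiny, so locality hardly matters.
        splits.add(new FileSplit(path, start, size, new String[0]));
      }
    }
    return splits;
  }

  @Override
  public RecordReader<LongWritable, Text> createRecordReader(
      InputSplit split, TaskAttemptContext context) {
    // Plug in your existing custom FASTA reader here.
    throw new UnsupportedOperationException("TODO: return the custom reader");
  }
}

Your reader then has to cope with records that straddle a split boundary:
skip forward to the first '>' header at or after the split's start, and
read past the split's end to finish the last record you started (the same
convention TextInputFormat's line reader follows), so that every record is
parsed by exactly one mapper. That said, the built-in knobs may already do
the job.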
Is the file splittable? You may look at FileInputFormat's
"mapred.min.split.size" property; see
http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.html#setMinInputSplitSize(org.apache.hadoop.mapreduce.Job, long)

Perhaps NLineInputFormat may also be what you're really looking for: it
lets you limit the number of records per mapper instead of fiddling with
byte sizes as the above does.

> While I think I see how I might modify getSplits(JobContext job), I am
> not sure how and when the code is called when the job is running on the
> cluster.

The method is called at the client end, at job-submission time.

--
Harsh J
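P.S. In code, those two suggestions look roughly like this. The sketch is
untested; the job name and the sizes are made up, the split-size setters
are the new-API FileInputFormat statics from the link above, and the
lines-per-map property belongs to the old-API
org.apache.hadoop.mapred.lib.NLineInputFormat. Note too that, if I am
reading FileInputFormat.computeSplitSize() right, it is the *max* split
size that actually forces extra splits out of a file smaller than one
block, so the max counterpart is shown as well.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitTuning {
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "fasta-parse");

    // Lower bound on split size, in bytes ("mapred.min.split.size"):
    FileInputFormat.setMinInputSplitSize(job, 1L * 1024 * 1024);

    // Upper bound ("mapred.max.split.size"); a small cap here is what
    // yields several splits from a small, single-block file:
    FileInputFormat.setMaxInputSplitSize(job, 2L * 1024 * 1024);

    // Alternatively, the old-API org.apache.hadoop.mapred.lib.NLineInputFormat
    // takes its records-per-mapper count from this property:
    // job.getConfiguration().setInt("mapred.line.input.format.linespermap", 50000);

    // ... set the mapper, input/output paths, etc., then submit as usual.
  }
}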