Hi, which version of Hadoop are you using? With v0.21, Hadoop supports splitting bzip2-compressed files (HADOOP-4012), so you don't even have to read from beginning to end.
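A minimal sketch of a job set up to take advantage of this, assuming Hadoop 0.21+ and a .bz2 text input; the path, job name, and the 2 MB split cap are made-up values:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    public class Bzip2SplitDemo {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "split-bzip2-input");
        job.setJarByClass(Bzip2SplitDemo.class);
        job.setInputFormatClass(TextInputFormat.class);
        // with HADOOP-4012, TextInputFormat can split a .bz2 file at
        // bzip2 block boundaries instead of handing it to a single mapper
        FileInputFormat.addInputPath(job, new Path("/data/reads.fasta.bz2"));
        // cap split size so even a 10-20 MB compressed file yields
        // several map tasks
        FileInputFormat.setMaxInputSplitSize(job, 2L * 1024 * 1024);
        // set mapper/reducer/output path as usual, then:
        // job.waitForCompletion(true);
      }
    }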
This patch is also available in the CDH3 distribution, which I would recommend since 0.21 is not declared suitable for production. The following link also summarizes a comparison of different compression formats: http://tukaani.org/lzma/benchmarks.html

Thanks,
Sudhan S

On Tue, Sep 13, 2011 at 6:41 AM, Steve Lewis <lordjoe2...@gmail.com> wrote:
> Thanks - NLineInputFormat is pretty close to what I want.
> In most cases the file is text and quite splittable, although it raises
> another issue - sometimes the file is compressed. Even though it may only
> be tens of megs, compression is useful to speed transport.
> In the case of a small file with enough work in the mapper, it may be
> useful to split even a zipped file - even if it means reading from the
> beginning to reach a specific index in the unzipped stream - ever seen
> that done?
>
> On Mon, Sep 12, 2011 at 1:36 AM, Harsh J <ha...@cloudera.com> wrote:
>> Hello Steve,
>>
>> On Mon, Sep 12, 2011 at 7:57 AM, Steve Lewis <lordjoe2...@gmail.com> wrote:
>> > I have a problem where there is a single, relatively small (10-20 MB)
>> > input file. (It happens it is a fasta file, which will have meaning if
>> > you are a biologist.) I am already using a custom InputFormat and a
>> > custom reader to force custom parsing. The file may generate tens or
>> > hundreds of millions of key-value pairs, and the mapper does a fair
>> > amount of work on each record.
>> > The standard implementation of
>> > public List<InputSplit> getSplits(JobContext job) throws IOException {
>> > uses fs.getFileBlockLocations(file, 0, length); to determine the blocks,
>> > and for a file of this size it will come up with a single InputSplit
>> > and a single mapper.
>> > I am looking for a good example of forcing the generation of multiple
>> > InputSplits for a small file. In this case I am happy if every Mapper
>> > instance is required to read and parse the entire file, as long as I
>> > can guarantee that every record is processed by only a single mapper.
>>
>> Is the file splittable?
>>
>> You may look at FileInputFormat's "mapred.min.split.size" property. See
>> http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.html#setMinInputSplitSize(org.apache.hadoop.mapreduce.Job, long)
>>
>> Perhaps 'NLineInputFormat' may also be what you're really looking for,
>> which lets you limit the number of records per mapper instead of
>> fiddling with byte sizes as above.
>>
>> > While I think I see how I might modify getSplits(JobContext job), I am
>> > not sure how and when the code is called when the job is running on
>> > the cluster.
>>
>> The method is called on the client end, at the job-submission point.
>>
>> --
>> Harsh J
>
> --
> Steven M. Lewis PhD
> 4221 105th Ave NE
> Kirkland, WA 98033
> 206-384-1340 (cell)
> Skype lordjoe_com
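For the getSplits() question quoted above - forcing several InputSplits out of one small file so each mapper may parse the whole file but owns only its share of records - a rough sketch along the lines discussed might look like this. The class name, the fixed split count, and the use of LineRecordReader are illustrative assumptions, not from the thread:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;
    import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

    public class ForcedSplitInputFormat extends FileInputFormat<LongWritable, Text> {
      // assumption: pick a count that matches the number of map tasks you want
      private static final int NUM_SPLITS = 8;

      @Override
      public List<InputSplit> getSplits(JobContext job) throws IOException {
        List<InputSplit> splits = new ArrayList<InputSplit>();
        for (Path file : getInputPaths(job)) {
          FileSystem fs = file.getFileSystem(job.getConfiguration());
          long length = fs.getFileStatus(file).getLen();
          long chunk = (length + NUM_SPLITS - 1) / NUM_SPLITS;
          for (int i = 0; i < NUM_SPLITS; i++) {
            long start = i * chunk;
            if (start >= length) {
              break;
            }
            long size = Math.min(chunk, length - start);
            // each mapper's reader may still scan from the start of the
            // file, but only records that begin in [start, start + size)
            // belong to this split, so every record is handled exactly once
            splits.add(new FileSplit(file, start, size, new String[0]));
          }
        }
        return splits;
      }

      @Override
      public RecordReader<LongWritable, Text> createRecordReader(
          InputSplit split, TaskAttemptContext context) {
        // substitute the custom fasta RecordReader here
        return new LineRecordReader();
      }
    }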