If your case is like mine, where you have lots of .gz files and you
don't want splits in the middle of those files, you can use the code I
just sent in the thread about traversing subdirectories.  (I suspect the
problem is that the MultiFileWordCount-style record reader opens files
with fs.open() directly and never consults a CompressionCodecFactory, so
your .gz files get read as raw bytes.)  In brief, your RecordReader
could do something like the following, with the matching InputFormat
sketched after it:

    // Assumes the usual imports: java.io.IOException, org.apache.hadoop.fs.*,
    // org.apache.hadoop.io.Text, org.apache.hadoop.io.compress.*, and
    // org.apache.hadoop.mapred.*.
    public static class MyRecordReader
        implements RecordReader<DocLocation, Text> {
        private CompressionCodecFactory compressionCodecs = null;
        private long start;
        private long end;
        private long pos;
        private Path file;
        private LineRecordReader.LineReader in;
        
        public MyRecordReader(JobConf job, FileSplit split)
            throws IOException {
            file = split.getPath();
            start = 0;
            end = split.getLength();
            compressionCodecs = new CompressionCodecFactory(job);
            // CompressionCodecFactory matches on the file extension, so a
            // *.gz path comes back as GzipCodec (null if uncompressed).
            CompressionCodec codec = compressionCodecs.getCodec(file);

            FileSystem fs = file.getFileSystem(job);
            FSDataInputStream fileIn = fs.open(file);

            if (codec != null) {
                // Compressed input: read lines through the decompressor.
                in = new LineRecordReader.LineReader(
                    codec.createInputStream(fileIn), job);
            } else {
                in = new LineRecordReader.LineReader(fileIn, job);
            }
            pos = 0;
        }

        // next(), createKey(), createValue(), getPos(), getProgress(), and
        // close() go here; they can follow the stock LineRecordReader
        // implementations pretty closely.
    }
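
For completeness, here is roughly what the matching InputFormat looks
like.  This is a sketch from memory against the old mapred API, not
tested code; the point is that isSplitable() returning false keeps each
gzipped file in a single split:

    public static class MyInputFormat
        extends FileInputFormat<DocLocation, Text> {

        // A gzip stream can only be decoded from the beginning, so hand
        // each .gz file to a single mapper whole rather than splitting it.
        protected boolean isSplitable(FileSystem fs, Path file) {
            return false;
        }

        public RecordReader<DocLocation, Text> getRecordReader(
                InputSplit split, JobConf job, Reporter reporter)
            throws IOException {
            reporter.setStatus(split.toString());
            return new MyRecordReader(job, (FileSplit) split);
        }
    }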

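As for the JobConf parameter Alex mentions below: I believe the knob in
question is io.compression.codecs, the comma-separated list of codec
classes that CompressionCodecFactory consults.  GzipCodec is in the
default list, so you normally shouldn't need to touch it, but if your
config overrides the default you can set it explicitly.  Something like
this (MyJob stands in for your driver class):

    JobConf job = new JobConf(MyJob.class);
    // Codecs CompressionCodecFactory will try, matched by file extension.
    job.set("io.compression.codecs",
            "org.apache.hadoop.io.compress.DefaultCodec,"
            + "org.apache.hadoop.io.compress.GzipCodec");
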
Alex Loddengaard <a...@cloudera.com> writes:

> Hi Adam,
>
> Gzipped files don't play that nicely with Hadoop, because they aren't
> splittable.  Can you use bzip2 instead?  bzip2 files play more nicely with
> Hadoop, because they're splittable.  If you're stuck with gzip, then take a
> look here: <http://issues.apache.org/jira/browse/HADOOP-437>.  I don't know
> if you'll have to set the same JobConf parameter in newer versions of
> Hadoop, but it's worth trying out.
>
> Hope this helps.
>
> Alex
>
> On Wed, Jun 3, 2009 at 11:50 AM, Adam Silberstein
> <silbe...@yahoo-inc.com> wrote:
>
>> Hi,
>>
>> I have some Hadoop code that works properly when the input files are not
>> compressed, but it is not working for the gzipped versions of those
>> files.  My files are named with *.gz, but the format is not being
>> recognized.  I'm under the impression I don't need to set any JobConf
>> parameters to indicate compressed input.
>>
>> I'm actually taking a directory name as input, and modeled that aspect
>> of my application after the MultiFileWordCount.java example in
>> org.apache.hadoop.examples.  Not sure if this is part of the problem.
>>
>> Thanks,
>>
>> Adam
>>
