Re: MultiFileInputFormat and gzipped files

2008-08-05 Thread Enis Soztutar
MultiFileWordCount uses its own RecordReader, namely 
MultiFileLineRecordReader. Unlike LineRecordReader, which automatically 
detects a file's codec and decodes it, this reader does no codec detection.


You can write a custom RecordReader similar to LineRecordReader and 
MultiFileLineRecordReader, or just add codecs to MultiFileLineRecordReader.
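
To illustrate the second option, here is a minimal sketch of the idea in 
plain Java, not the actual Hadoop code. It picks a decoder from the file 
extension before reading lines from each file in a collection. The class 
CodecAwareLineReader and its methods are hypothetical names, and 
java.util.zip.GZIPInputStream stands in for a Hadoop codec. Inside Hadoop 
the lookup would presumably go through CompressionCodecFactory.getCodec(Path) 
and CompressionCodec.createInputStream(InputStream), as LineRecordReader does.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;

// Hypothetical helper: reads lines from several files, decompressing
// the ones whose name ends in ".gz". Uses only the Java standard library.
final class CodecAwareLineReader {
    private CodecAwareLineReader() {}

    // Opens a file and wraps it in a gzip decoder when the extension says so.
    static InputStream open(Path file) throws IOException {
        InputStream raw = Files.newInputStream(file);
        if (file.getFileName().toString().endsWith(".gz")) {
            try {
                return new GZIPInputStream(raw);
            } catch (IOException e) {
                raw.close(); // do not leak the underlying stream on a bad header
                throw e;
            }
        }
        return raw; // no known codec: read the bytes as they are
    }

    // Reads every line of every file, in the order the files are given.
    static List<String> readLines(List<Path> files) throws IOException {
        List<String> lines = new ArrayList<>();
        for (Path file : files) {
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(open(file), StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    lines.add(line);
                }
            }
        }
        return lines;
    }
}
```

Choosing the stream per file is what matters here: a reader that opens 
every file as raw bytes will hand compressed data to the mapper as if it 
were text.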



Michele Catasta wrote:

Hi all,

I'm writing some Hadoop jobs that should run on a collection of
gzipped files. Everything is already working correctly with
MultiFileInputFormat and an initial step of gunzip extraction.
Considering that Hadoop correctly recognizes and handles .gz files (at
least with a single file input), I was wondering if it's able to do
the same with file collections, so that I can avoid the overhead of
sequential file extraction.
I tried to run the multi file WordCount example with a bunch of
gzipped text files (0.17.1 installation), and I get wrong output
(neither correct nor empty). With my own InputFormat (not really
different from the one in multifilewc), I got no output at all (map
input record counter = 0).

Is this the intended behavior? Are there technical reasons why it's
not working in a multi-file scenario?
Thanks in advance for the help.


Regards,
  Michele Catasta

  




MultiFileInputFormat and gzipped files

2008-07-30 Thread Michele Catasta
Hi all,

I'm writing some Hadoop jobs that should run on a collection of
gzipped files. Everything is already working correctly with
MultiFileInputFormat and an initial step of gunzip extraction.
Considering that Hadoop correctly recognizes and handles .gz files (at
least with a single file input), I was wondering if it's able to do
the same with file collections, so that I can avoid the overhead of
sequential file extraction.
I tried to run the multi file WordCount example with a bunch of
gzipped text files (0.17.1 installation), and I get wrong output
(neither correct nor empty). With my own InputFormat (not really
different from the one in multifilewc), I got no output at all (map
input record counter = 0).

Is this the intended behavior? Are there technical reasons why it's
not working in a multi-file scenario?
Thanks in advance for the help.


Regards,
  Michele Catasta