On Wed, Jun 4, 2008 at 6:52 PM, Arun C Murthy <[EMAIL PROTECTED]> wrote:
> With the current compression codecs available in Hadoop (zlib/gzip/lzo) it
> is not possible to split up a compressed file and then process it in a
> parallel manner. However once we get bzip2 to work we could split up the
> files as you are describing...

If it helps, on *nix you can split a compressed text file like this:
    gunzip -c original.txt.gz | split -a 5 -d -C 16777216 - output.txt.

Replace 16777216 (16MB) with the maximum number of bytes you want per
split.  With -C, split breaks only at line boundaries, as long as no
single line is longer than that limit.  You get files named
output.txt.00000, output.txt.00001, and so on.
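
Where GNU split is not available, roughly the same thing can be done
in Python.  This is only a sketch: the function name
split_gzip_by_lines and its parameters are made up for illustration,
and unlike split -C it writes a line longer than the limit whole into
its own piece instead of cutting it.

```python
import gzip


def split_gzip_by_lines(src_path, out_prefix, max_bytes=16 * 1024 * 1024,
                        suffix_len=5):
    """Decompress src_path into pieces of at most max_bytes each,
    starting a new piece only at a line boundary.

    Pieces are named out_prefix + zero-padded counter, e.g.
    output.txt.00000. Returns the list of piece paths in order.
    """
    paths = []
    out = None
    written = 0
    try:
        with gzip.open(src_path, "rb") as src:
            for line in src:
                # Open a new piece if there is none yet, or if this line
                # would push a non-empty piece past the byte limit.
                if out is None or (written > 0 and
                                   written + len(line) > max_bytes):
                    if out is not None:
                        out.close()
                    path = f"{out_prefix}{len(paths):0{suffix_len}d}"
                    out = open(path, "wb")
                    paths.append(path)
                    written = 0
                out.write(line)
                written += len(line)
    finally:
        if out is not None:
            out.close()
    return paths
```

Calling split_gzip_by_lines("original.txt.gz", "output.txt.") uses the
same 16MB limit and file naming as the command above.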

-Stuart
