On Wed, Jun 4, 2008 at 6:52 PM, Arun C Murthy <[EMAIL PROTECTED]> wrote:
> With the current compression codecs available in Hadoop (zlib/gzip/lzo) it
> is not possible to split up a compressed file and then process it in a
> parallel manner. However once we get bzip2 to work we could split up the
> files as you are describing...
If it helps, on *nix you can split a compressed text file like this:

    gunzip -c original.txt.gz | split -a 5 -d -C 16777216 - output.txt.

Replace 16777216 (16MB) with the maximum number of bytes you want per split. With -C (a GNU split option), the file is split only on line breaks, provided no single line is longer than that limit. You get uncompressed files named output.txt.00000, output.txt.00001, and so on.

-Stuart
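
For anyone who wants to try the command end to end, here is a minimal sketch, assuming GNU split and gzip are installed. The sample data (seq 1 1000), the tiny 1024-byte limit, and the step that recompresses each piece are illustrative assumptions, not part of the original suggestion; recompressing is only one way to keep the pieces small on disk.

```shell
#!/bin/sh
# Sketch: split a gzip file on line boundaries, then recompress each piece.
set -eu

# Work in a scratch directory so nothing existing is overwritten.
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical sample input standing in for original.txt.gz.
seq 1 1000 > original.txt
gzip -c original.txt > original.txt.gz

# At most 1024 bytes of whole lines per piece (use 16777216 for 16MB).
# The trailing dot in "output.txt." yields names like output.txt.00000.
gunzip -c original.txt.gz | split -a 5 -d -C 1024 - output.txt.

# Assumption: recompress each piece so it is its own standalone gzip file.
for f in output.txt.[0-9][0-9][0-9][0-9][0-9]; do
  gzip "$f"
done

# Sanity check: the pieces, concatenated in order, reproduce the input.
cat output.txt.*.gz | gunzip -c | cmp - original.txt
echo "pieces match original"
```

Each resulting output.txt.NNNNN.gz is a complete gzip file ending on a line break, so the pieces can be handed to separate workers independently.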