Re: Compression using Hadoop...

Doug Cutting Fri, 31 Aug 2007 10:43:42 -0700

Arun C Murthy wrote:

One way to reap benefits of both compression and better parallelism is to use 
compressed SequenceFiles: http://wiki.apache.org/lucene-hadoop/SequenceFile


Of course, this means you will have to convert the .gzip file to a .seq file 
and load it onto HDFS for your job, which should be a fairly simple piece of code.

We really need someone to contribute an InputFormat for bzip files. This has come up before: bzip is a standard compression format that is splittable.
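
For anyone wondering what makes bzip splittable: every compressed block in a bzip2 stream begins with the 48-bit marker 0x314159265359, so a reader dropped at an arbitrary offset can scan forward to the next marker and start decompressing there. The sketch below shows only that scanning step. The class and method names are made up for illustration and this is not Hadoop code. Blocks are not byte-aligned, so the scan works bit by bit. The marker can also occur by chance inside compressed data, which a real InputFormat would need to handle.

```java
import java.util.ArrayList;
import java.util.List;

public class Bzip2BlockScanSketch {
    // 48-bit marker that begins every compressed block in a bzip2 stream.
    static final long BLOCK_MAGIC = 0x314159265359L;
    static final long MASK_48 = (1L << 48) - 1;

    /** Returns the bit offset of every occurrence of the block marker in data. */
    public static List<Long> findBlockStarts(byte[] data) {
        List<Long> starts = new ArrayList<>();
        long window = 0;    // the most recent 48 bits read
        long bitsRead = 0;
        for (byte b : data) {
            // bzip2 packs bits most-significant bit first.
            for (int bit = 7; bit >= 0; bit--) {
                window = ((window << 1) | ((b >> bit) & 1)) & MASK_48;
                bitsRead++;
                if (bitsRead >= 48 && window == BLOCK_MAGIC) {
                    starts.add(bitsRead - 48);
                }
            }
        }
        return starts;
    }

    public static void main(String[] args) {
        // Synthetic bytes, not a valid bzip2 stream: the marker shifted right by 4 bits.
        byte[] shifted = {0x03, 0x14, 0x15, (byte) 0x92, 0x65, 0x35, (byte) 0x90};
        System.out.println(findBlockStarts(shifted));
    }
}
```

A record reader for a split would presumably start at the first marker at or after its split's start offset and stop at the first marker past its end.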

Another InputFormat that would be handy is zip. Zip archives, unlike tar files, can be split by reading the table of contents. So one could package a bunch of tiny files as a zip file, then the input format could split the zip file into splits that each contain a number of files inside the zip. Each map task would then have to read the table of contents from the file, but could then seek directly to the files in its split without scanning the entire file.
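
A rough sketch of the zip idea, using only java.util.zip on a local file. The names ZipSplitSketch, planSplits, readSplit and writeSampleZip are hypothetical, and nothing here is wired into Hadoop's InputFormat interface or reads from HDFS. It only illustrates the two steps: plan splits from the table of contents, then have each task open the archive and read just its own files.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

public class ZipSplitSketch {
    /** Groups the archive's file names into splits of at most filesPerSplit names each. */
    public static List<List<String>> planSplits(Path zip, int filesPerSplit) throws IOException {
        List<List<String>> splits = new ArrayList<>();
        List<String> current = new ArrayList<>();
        // Opening a ZipFile reads the table of contents, not the file data.
        try (ZipFile zf = new ZipFile(zip.toFile())) {
            Enumeration<? extends ZipEntry> entries = zf.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                if (entry.isDirectory()) continue;
                current.add(entry.getName());
                if (current.size() == filesPerSplit) {
                    splits.add(current);
                    current = new ArrayList<>();
                }
            }
        }
        if (!current.isEmpty()) splits.add(current);
        return splits;
    }

    /** What one map task would do: open the archive and read only the files in its split. */
    public static Map<String, String> readSplit(Path zip, List<String> split) throws IOException {
        Map<String, String> contents = new LinkedHashMap<>();
        try (ZipFile zf = new ZipFile(zip.toFile())) {
            for (String name : split) {
                // getInputStream seeks to this entry; other entries are not scanned.
                try (InputStream in = zf.getInputStream(zf.getEntry(name))) {
                    contents.put(name, new String(in.readAllBytes(), StandardCharsets.UTF_8));
                }
            }
        }
        return contents;
    }

    /** Writes a sample archive of n tiny text files to a temporary file. */
    public static Path writeSampleZip(int n) throws IOException {
        Path zip = Files.createTempFile("tiny-files", ".zip");
        try (ZipOutputStream out = new ZipOutputStream(Files.newOutputStream(zip))) {
            for (int i = 0; i < n; i++) {
                out.putNextEntry(new ZipEntry("file" + i + ".txt"));
                out.write(("contents of file " + i).getBytes(StandardCharsets.UTF_8));
                out.closeEntry();
            }
        }
        return zip;
    }

    public static void main(String[] args) throws IOException {
        Path zip = writeSampleZip(5);
        List<List<String>> splits = planSplits(zip, 2);
        System.out.println(splits);
        System.out.println(readSplit(zip, splits.get(2)));
        Files.delete(zip);
    }
}
```

Splitting by file count is just the simplest choice for the sketch; grouping by compressed size would probably balance the map tasks better.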

Should we file jira issues for these? Any volunteers who're interested in implementing these?


Doug
