Coincidentally, we *just* posted a blog entry about this, courtesy of Kevin Weil from Twitter:
http://www.cloudera.com/blog/2009/11/17/hadoop-at-twitter-part-1-splittable-lzo-compression/

-Todd

On Tue, Nov 17, 2009 at 8:36 AM, Todd Lipcon <t...@cloudera.com> wrote:

> On Tue, Nov 17, 2009 at 8:33 AM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
>
>> @Todd
>>
>> 1) Splittable LZO --- 2) Use a SequenceFile container
>>
>> Between the licensing, patching, and indexing, this seems to be very
>> challenging. Also, how these things fit into my hive usage is
>> not very clear. After I generate a Hive table using LZO, what process
>> runs the indexing?
>>
> Right now it's sadly a manual process - there's a jar which you run on the
> lzo-compressed file to generate the split index.
>
>> Take a look at this thread:
>>
>> http://www.mail-archive.com/common-user@hadoop.apache.org/msg00337.html
>>
>> I really like this! Easy! I set some hive parameters, and whamo!
>> Compression!
>>
>> I think support for LZO is really great, but it seems like bz2 works
>> almost out of the box, sometimes. I think in my current setup no
>> compression option works out of the box. I would rather have a slow
>> BZ2 option than no option.
>>
> Oh, I agree completely. There's a JIRA HADOOP-6349 to add FastLZ as a
> compression option. It's similar to LZO in terms of performance profile, but
> not encumbered by licensing issues. I haven't looked closely enough yet to
> know whether it's natively splittable or if we still need some kind of
> indexing pass, but at least it will be built in.
>
> -Todd
>
>> On Tue, Nov 17, 2009 at 11:23 AM, Michael E. Driscoll
>> <m.e.drisc...@gmail.com> wrote:
>> > Kevin Weil, of Twitter, has done some work extending LZO compression to
>> > work with Hadoop streaming. See
>> >
>> > http://github.com/kevinweil/hadoop-lzo
>> >
>> > MD
>> >
>> > On Tue, Nov 17, 2009 at 8:08 AM, Todd Lipcon <t...@cloudera.com> wrote:
>> >
>> >> On Tue, Nov 17, 2009 at 7:52 AM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
>> >>
>> >> > Todd,
>> >> >
>> >> > I think this is very important. From the grid in "Hadoop: The
>> >> > Definitive Guide", page 78, it appears that bzip2 and zip are the only
>> >> > formats that are splittable. As a result bzip2 would be my format of
>> >> > choice to compress my data. In particular I would like to use bzip2 on
>> >> > my hive tables. I cannot speak to how IO intensive BZ2 is; however, I
>> >> > know you can lower the compression threshold to trade off between
>> >> > compression/performance.
>> >> >
>> >> > What other options are out there?
>> >> >
>> >>
>> >> The other options are currently:
>> >>
>> >> 1) Splittable LZO
>> >>
>> >> You need to add in some external libraries here since LZO is LGPL-licensed
>> >> and thus can't be distributed with Hadoop. I've made some scripts which you
>> >> can use to generate packages compatible with Cloudera's distro here:
>> >> http://github.com/toddlipcon/hadoop-lzo-packager
>> >>
>> >> The scripts are pretty new but there are people running the LZO code in
>> >> production with a lot of success.
>> >>
>> >> Also, to make LZO splittable you have to run an indexing process across
>> >> your data one time. I believe the README in the LZO hadoop library source
>> >> tree explains this.
>> >>
>> >> 2) Use a SequenceFile container
>> >>
>> >> If you use SequenceFile for your data, you can turn on block compression
>> >> and retain splittability with any codec. The downside of course is that
>> >> you've gotta use some process to get it into this format, but once it's
>> >> there you avoid this issue completely.
>> >>
>> >> -Todd
>> >>
>> >
>> > --
>> > p: 415.860.4347
>> > b: www.dataspora.com/blog
>> > t: www.twitter.com/dataspora
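
For anyone trying to picture why option 2 (block compression inside a container) keeps a file splittable with any codec, here is a rough sketch of the idea in plain Python. It is not the SequenceFile format and does not use any Hadoop API: the names `write_blocks` and `read_block`, the 4-byte length prefix, and the block size of three records are all made up for illustration, and zlib stands in for whatever codec you would really use. The `offsets` list plays roughly the role of the split index that the indexing jar generates for an LZO file.

```python
import zlib


def write_blocks(records, records_per_block=3):
    """Compress records in independent blocks; return (data, offsets)."""
    data = bytearray()
    offsets = []  # byte offset at which each compressed block starts
    for i in range(0, len(records), records_per_block):
        chunk = "\n".join(records[i:i + records_per_block]).encode("utf-8")
        compressed = zlib.compress(chunk)
        offsets.append(len(data))
        # 4-byte big-endian length prefix so a reader knows where the block ends
        data += len(compressed).to_bytes(4, "big") + compressed
    return bytes(data), offsets


def read_block(data, offset):
    """Decompress the single block starting at a known offset."""
    length = int.from_bytes(data[offset:offset + 4], "big")
    payload = data[offset + 4:offset + 4 + length]
    return zlib.decompress(payload).decode("utf-8").split("\n")


records = [f"row-{n}" for n in range(8)]
data, offsets = write_blocks(records)

# A reader can start at any block boundary without touching earlier bytes,
# which is what lets separate tasks each take their own range of the file.
second_block = read_block(data, offsets[1])
```

Because each block is compressed on its own, a reader that knows a block's starting offset never needs the bytes before it. A file compressed as one continuous stream has no such restart points unless the codec provides them itself or a separate index records them.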