@Todd 1) Splittable LZO --- 2) Use a SequenceFile container
Between the licensing, patching, and indexing, this seems very challenging. It is also not clear how these pieces fit into my Hive usage. After I generate a Hive table using LZO, what process runs the indexing?

Take a look at this thread:
http://www.mail-archive.com/common-user@hadoop.apache.org/msg00337.html

I really like this! Easy! I set some Hive parameters, and whamo! Compression!

I think support for LZO is really great, but it seems like bz2 works almost out of the box, sometimes. In my current setup no compression option works out of the box, and I would rather have a slow BZ2 option than no option.

Edward

On Tue, Nov 17, 2009 at 11:23 AM, Michael E. Driscoll <m.e.drisc...@gmail.com> wrote:
> Kevin Weil, of Twitter, has done some work extending LZO compression to work
> with Hadoop streaming. See
>
> http://github.com/kevinweil/hadoop-lzo
>
> MD
>
> On Tue, Nov 17, 2009 at 8:08 AM, Todd Lipcon <t...@cloudera.com> wrote:
>
>> On Tue, Nov 17, 2009 at 7:52 AM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
>>
>> > Todd,
>> >
>> > I think this is very important. From the grid on page 78 of "Hadoop: The
>> > Definitive Guide", it appears that bzip2 and zip are the only formats
>> > that are splittable. As a result, bzip2 would be my format of choice to
>> > compress my data. In particular, I would like to use bzip2 on my Hive
>> > tables. I cannot speak to how IO intensive BZ2 is; however, I know you
>> > can lower the compression threshold to trade off between
>> > compression and performance.
>> >
>> > What other options are out there?
>> >
>>
>> The other options are currently:
>>
>> 1) Splittable LZO
>>
>> You need to add in some external libraries here since LZO is LGPL-licensed
>> and thus can't be distributed with Hadoop. I've made some scripts which you
>> can use to generate packages compatible with Cloudera's distro here:
>> http://github.com/toddlipcon/hadoop-lzo-packager
>>
>> The scripts are pretty new but there are people running the LZO code in
>> production with a lot of success.
>>
>> Also, to make LZO splittable you have to run an indexing process across
>> your data one time. I believe the README in the LZO hadoop library source
>> tree explains this.
>>
>> 2) Use a SequenceFile container
>>
>> If you use SequenceFile for your data, you can turn on block compression
>> and retain splittability with any codec. The downside of course is that
>> you've gotta use some process to get it into this format, but once it's
>> there you avoid this issue completely.
>>
>> -Todd
>>
>
> --
> p: 415.860.4347
> b: www.dataspora.com/blog
> t: www.twitter.com/dataspora
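
For anyone wondering why block compression or a one-time index makes a compressed file splittable, here is a toy sketch of the idea in Python. It is not the SequenceFile format and not the hadoop-lzo indexer; the function names (write_blocks, read_block), the block size, and the use of zlib are all assumptions made for illustration. The point is only that when each block is compressed independently and its byte offset is recorded, a reader can start at any block without decompressing what comes before it.

```python
import zlib


def write_blocks(records, records_per_block=1000):
    """Compress records in independent blocks.

    Returns the concatenated compressed bytes plus an index of
    (offset, length) pairs, one per block. Hypothetical helper:
    assumes records are strings that contain no newlines.
    """
    data = bytearray()
    index = []
    for start in range(0, len(records), records_per_block):
        chunk = "\n".join(records[start:start + records_per_block])
        block = zlib.compress(chunk.encode("utf-8"))
        index.append((len(data), len(block)))  # where this block starts, how long it is
        data.extend(block)
    return bytes(data), index


def read_block(data, index, i):
    """Decompress only block i, using the index to find its byte range."""
    offset, length = index[i]
    raw = zlib.decompress(data[offset:offset + length])
    return raw.decode("utf-8").split("\n")


# 2500 records at 1000 per block gives three independently readable blocks.
records = [f"row-{n}" for n in range(2500)]
data, index = write_blocks(records)

# A "split" can begin at the second block without touching the first.
middle = read_block(data, index, 1)
```

A single compressed stream with no block boundaries has no such entry points, which is roughly why an unindexed compressed file has to be read by one task from the beginning.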