Re: Cloudera 18.3 splits bz2 inputs

Todd Lipcon Tue, 17 Nov 2009 08:40:05 -0800

Coincidentally, we *just* posted a blog entry about this, courtesy of Kevin
Weil from Twitter:


http://www.cloudera.com/blog/2009/11/17/hadoop-at-twitter-part-1-splittable-lzo-compression/

-Todd

On Tue, Nov 17, 2009 at 8:36 AM, Todd Lipcon <t...@cloudera.com> wrote:

> On Tue, Nov 17, 2009 at 8:33 AM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
>
>> @Todd
>>
>> 1) Splittable LZO --- 2) Use a SequenceFile container
>>
>> Between the licensing, patching, and indexing, this seems to be very
>> challenging. Also, it is not very clear how these things fit into my
>> Hive usage. After I generate a Hive table using LZO, what process
>> runs the indexing?
>>
>>
> Right now it's sadly a manual process - there's a jar which you run on the
> lzo-compressed file to generate the split index.
>
>
>> Take a look at this thread:
>>
>> http://www.mail-archive.com/common-user@hadoop.apache.org/msg00337.html
>>
>> I really like this! Easy! I set some hive parameters, and whamo!
>> Compression!
>>
>> I think support for LZO is really great, but it seems like bz2 works
>> almost out of the box, sometimes. In my current setup no compression
>> option works out of the box, and I would rather have a slow BZ2 option
>> than no option.
>>
> Oh, I agree completely. There's a JIRA, HADOOP-6349, to add FastLZ as a
> compression option. It's similar to LZO in terms of performance profile, but
> not encumbered by licensing issues. I haven't looked closely enough yet to
> know whether it's natively splittable or if we still need some kind of
> indexing pass, but at least it will be built in.
>
> -Todd
>
>>
>>
>>
>> On Tue, Nov 17, 2009 at 11:23 AM, Michael E. Driscoll
>> <m.e.drisc...@gmail.com> wrote:
>> > Kevin Weil, of Twitter, has done some work extending LZO compression to
>> > work with Hadoop streaming.  See
>> >
>> >  http://github.com/kevinweil/hadoop-lzo
>> >
>> > MD
>> >
>> > On Tue, Nov 17, 2009 at 8:08 AM, Todd Lipcon <t...@cloudera.com> wrote:
>> >
>> >> On Tue, Nov 17, 2009 at 7:52 AM, Edward Capriolo
>> >> <edlinuxg...@gmail.com> wrote:
>> >>
>> >> >
>> >> > Todd,
>> >> >
>> >> > I think this is very important. From the grid in "Hadoop: The
>> >> > Definitive Guide" (p. 78), it appears that bzip2 and zip are the only
>> >> > formats that are splittable. As a result, bzip2 would be my format of
>> >> > choice to compress my data. In particular, I would like to use bzip2 on
>> >> > my Hive tables. I cannot speak to how IO intensive BZ2 is; however, I
>> >> > know you can lower the compression threshold to trade off between
>> >> > compression and performance.
>> >> >
>> >> > What other options are out there?
>> >> >
>> >> >
>> >> The other options are currently:
>> >>
>> >> 1) Splittable LZO
>> >>
>> >> You need to add in some external libraries here since LZO is
>> >> GPL-licensed and thus can't be distributed with Hadoop. I've made some
>> >> scripts which you can use to generate packages compatible with
>> >> Cloudera's distro here:
>> >> http://github.com/toddlipcon/hadoop-lzo-packager
>> >>
>> >> The scripts are pretty new but there are people running the LZO code in
>> >> production with a lot of success.
>> >>
>> >> Also, to make LZO splittable you have to run an indexing process across
>> >> your data one time. I believe the README in the LZO hadoop library
>> >> source tree explains this.
>> >>
>> >> 2) Use a SequenceFile container
>> >>
>> >> If you use SequenceFile for your data, you can turn on block
>> >> compression and retain splittability with any codec. The downside of
>> >> course is that you've gotta use some process to get it into this format,
>> >> but once it's there you avoid this issue completely.
>> >>
>> >>
>> >> -Todd
>> >>
>> >
>> >
>> > --
>> > p: 415.860.4347
>> > b: www.dataspora.com/blog
>> > t: www.twitter.com/dataspora
>> >
>>
>
>
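
For anyone wondering what the one-time indexing pass in option 1 buys you,
here is a toy sketch in Python. It is not the hadoop-lzo format or its
indexer: zlib stands in for LZO, and the length-prefixed block layout and
the names write_blocks, build_index and read_split are all made up for
illustration. The idea it shows is only that once you know the byte offset
where each compressed block starts, a reader can be handed a byte range and
begin decoding at a block boundary inside that range.

```python
import io
import struct
import zlib


def write_blocks(records, records_per_block=2):
    """Return bytes laid out as repeated (4-byte length, compressed block)."""
    out = io.BytesIO()
    for i in range(0, len(records), records_per_block):
        chunk = "\n".join(records[i:i + records_per_block]).encode()
        payload = zlib.compress(chunk)  # zlib is a stand-in codec here
        out.write(struct.pack(">I", len(payload)))
        out.write(payload)
    return out.getvalue()


def build_index(data):
    """The indexing pass: walk the file once, noting where each block starts."""
    offsets, pos = [], 0
    while pos < len(data):
        offsets.append(pos)
        (length,) = struct.unpack_from(">I", data, pos)
        pos += 4 + length
    return offsets


def read_split(data, index, start, end):
    """Decode only the blocks whose start offset lies in [start, end)."""
    records = []
    for off in index:
        if start <= off < end:
            (length,) = struct.unpack_from(">I", data, off)
            block = zlib.decompress(data[off + 4:off + 4 + length])
            records.extend(block.decode().split("\n"))
    return records
```

Two readers given [0, mid) and [mid, len(data)) will between them see every
record exactly once, whatever mid is, because each block belongs to the
split that contains its start offset.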

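On option 2, the reason a block-compressed container stays splittable with
any codec can be sketched the same way. This is a simplified Python toy, not
the actual SequenceFile layout: the fixed SYNC value, the length prefix and
the function names are assumptions for illustration. Each compressed block is
preceded by a marker, so a reader dropped at an arbitrary byte offset can
scan forward to the next marker and start decoding there, with no separate
index and no dependence on the codec.

```python
import bz2
import struct
import zlib

# Hypothetical fixed 16-byte marker. A real container would pick a random
# marker per file so that it is very unlikely to occur inside a payload.
SYNC = b"\x00SYNC-MARKER-16B"


def write_container(records, compress=zlib.compress, per_block=2):
    """Lay out the data as repeated (SYNC, 4-byte length, compressed block)."""
    out = bytearray()
    for i in range(0, len(records), per_block):
        payload = compress("\n".join(records[i:i + per_block]).encode())
        out += SYNC + struct.pack(">I", len(payload)) + payload
    return bytes(out)


def read_split(data, start, end, decompress=zlib.decompress):
    """Read every block whose sync marker begins in [start, end)."""
    records = []
    pos = data.find(SYNC, start)  # scan forward to the next block boundary
    while pos != -1 and pos < end:
        header = pos + len(SYNC)
        (length,) = struct.unpack_from(">I", data, header)
        body = data[header + 4:header + 4 + length]
        records.extend(decompress(body).decode().split("\n"))
        pos = data.find(SYNC, header + 4 + length)
    return records
```

Swapping the codec is just a matter of passing a different pair of
functions, for example compress=bz2.compress on the write side and
decompress=bz2.decompress on the read side; the split logic does not change.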