Most companies handling big data use LZO; a few have started exploring or using Snappy as well (which is not any easier to configure). These are the two splittable fast-compression options. Note that Snappy is not space-efficient compared to gzip or other compression algorithms, but it is a lot faster (ideal for compressing intermediate data between Map and Reduce).
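As a sketch of the intermediate-compression point above: in a Pig script, map-output compression can typically be turned on with Hadoop properties like the following (property names are the Hadoop 1.x ones current at the time of this thread; the `SnappyCodec` assumes the Snappy native libraries are installed on the cluster):

```pig
-- Enable compression of intermediate (map) output only; final output is unaffected.
SET mapred.compress.map.output true;
-- Snappy codec: fast, not space-efficient; requires native Snappy libs on all nodes.
SET mapred.map.output.compression.codec org.apache.hadoop.io.compress.SnappyCodec;
```

Since intermediate data is written and read once per job, a fast codec such as Snappy or LZO is usually preferred here over gzip or bzip2.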
Is there any repeated/heavy computation involved on the outputs other than pushing this data to a database? If not, maybe it's fine to use gzip, but you have to make sure the individual files are close to the block size, or you will have a lot of unnecessary IO transfers taking place. If you read the outputs to perform further MapReduce computation, gzip is not the best.

-Prashant

On Tue, Apr 3, 2012 at 12:18 PM, Mohit Anchlia <mohitanch...@gmail.com> wrote:

> Thanks for your input.
>
> It looks like it's some work to configure LZO. What are the other
> alternatives? We read new sequence files and generate output continuously.
> What are my options? Should I split the output into small pieces and gzip
> them? How do people solve similar problems where there is a continuous flow
> of data that generates tons of output continuously?
>
> After output is generated we again read it and load it into an OLAP db or do
> some other analysis.
>
> On Tue, Apr 3, 2012 at 11:48 AM, Prashant Kommireddi <prash1...@gmail.com> wrote:
>
> > Yes, it is splittable.
> >
> > Bzip2 consumes a lot of CPU in decompression. With Hadoop jobs generally
> > being IO bound, Bzip2 can sometimes become the performance bottleneck due
> > to this slow decompression rate (the algorithm is unable to decompress at
> > disk read rate).
> >
> > On Tue, Apr 3, 2012 at 11:39 AM, Mohit Anchlia <mohitanch...@gmail.com> wrote:
> >
> > > Is bzip2 not advisable? I think it can split too and is supported out of
> > > the box.
> > >
> > > On Thu, Mar 29, 2012 at 8:08 PM, 帝归 <xudi1...@gmail.com> wrote:
> > >
> > > > When I use LzoPigStorage, it will load all files under a directory.
> > > > But I want to compress every file under a directory and keep the file
> > > > name unchanged, just with a .lzo extension. How can I do this? Maybe I
> > > > must write a MapReduce job?
> > > >
> > > > 2012/3/30 Jonathan Coveney <jcove...@gmail.com>
> > > >
> > > > > check out:
> > > > >
> > > > > https://github.com/kevinweil/elephant-bird/tree/master/src/java/com/twitter/elephantbird/pig/store
> > > > >
> > > > > 2012/3/29 Mohit Anchlia <mohitanch...@gmail.com>
> > > > >
> > > > > > Thanks! When I store output, how can I tell Pig to compress it in
> > > > > > LZO format?
> > > > > >
> > > > > > On Thu, Mar 29, 2012 at 4:02 PM, Dmitriy Ryaboy <dvrya...@gmail.com> wrote:
> > > > > >
> > > > > > > You might find the elephant-bird project helpful for reading and
> > > > > > > creating LZO files, in raw Hadoop or using Pig.
> > > > > > > (disclaimer: I'm a committer on elephant-bird)
> > > > > > >
> > > > > > > D
> > > > > > >
> > > > > > > On Wed, Mar 28, 2012 at 9:49 AM, Prashant Kommireddi
> > > > > > > <prash1...@gmail.com> wrote:
> > > > > > > > Pig supports LZO for splittable compression.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Prashant
> > > > > > > >
> > > > > > > > On Mar 28, 2012, at 9:45 AM, Mohit Anchlia <mohitanch...@gmail.com> wrote:
> > > > > > > >
> > > > > > > >> We currently have 100s of GB of uncompressed data which we
> > > > > > > >> would like to zip using some compression that is block
> > > > > > > >> compression, so that we can use multiple input splits. Does
> > > > > > > >> Pig support any such compression?
> > > >
> > > > --
> > > > ‘(hello world)
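To make the "store output as LZO from Pig" answer concrete: elephant-bird's store functions (linked above) handle this. A minimal sketch, assuming the jar path is a placeholder and that the LZO codec and native libraries are already configured on the cluster:

```pig
-- Jar name/path below is a placeholder; use your elephant-bird build.
REGISTER 'elephant-bird.jar';

data = LOAD 'input' USING PigStorage('\t');

-- Writes LZO-compressed, splittable output (once the files are indexed).
STORE data INTO 'output_lzo' USING com.twitter.elephantbird.pig.store.LzoPigStorage();
```

Note that LZO files generally need an index (e.g. via the LZO indexer that ships with the Hadoop LZO libraries) before Hadoop can actually split them.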
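For the gzip alternative discussed earlier in the thread, Pig can also compress final job output through its output-compression properties. A sketch with placeholder paths (remember gzip output is not splittable, so keep individual files close to the HDFS block size):

```pig
-- Compress final STORE output with gzip; each part file becomes one non-splittable unit.
SET output.compression.enabled true;
SET output.compression.codec org.apache.hadoop.io.compress.GzipCodec;

data = LOAD 'input' USING PigStorage('\t');
STORE data INTO 'output_gz' USING PigStorage('\t');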