On Mon, Jan 31, 2011 at 12:36 AM, Niels Basjes <ni...@basjes.nl> wrote:
> Hi,
>
> 2011/1/31 Sean Bigdatafun <sean.bigdata...@gmail.com>:
> > GZIP is not splittable.
>
> Correct, gzip is a stream compression system which effectively means
> you can only start at the beginning of the data with decompressing.
>
> > Does that mean a GZIP block compressed sequencefile can't take advantage
> > of MR parallelism?
>
> AFAIK it should be splittable in the same blocks as the compression was
> done.
>
Splittable within the same block? Normally, each mapper would pick one
HDFS block (64 MB in an HDFS with the default configuration) of a 1 GB
file for map processing, provided the file is not GZIP compressed; that
is the scenario for an uncompressed file. But since GZIP is not
splittable, can a mapper pick a block at all, and if so, how? (If it
can't, then we can't use the MapReduce framework for parallelism.)
Can you explain in more detail?

> > How to control the size of block to be compressed in SequenceFile?
>
> Can't help you with that one.
>
> --
> Kind regards,
>
> Niels Basjes
>

--
--Sean
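
P.S. To make the two scenarios in my question concrete, here is a rough Python sketch. It is only a toy model and makes no Hadoop calls: split_count, the chunk contents, and the use of zlib are my own assumptions for illustration. The first part is the arithmetic for the 1 GB file with 64 MB blocks; the second part shows the idea Niels describes, that data compressed in independent blocks can be decompressed block by block.

```python
import math
import zlib

BLOCK_SIZE = 64 * 1024 * 1024    # 64 MB HDFS block size from the example
FILE_SIZE = 1024 * 1024 * 1024   # the 1 GB file from the example


def split_count(file_size, block_size, splittable):
    """Hypothetical helper: how many mappers could work on the file at once."""
    if not splittable:
        # A single gzip stream can only be decompressed from its first byte,
        # so one mapper has to read the whole file.
        return 1
    return math.ceil(file_size / block_size)


uncompressed_splits = split_count(FILE_SIZE, BLOCK_SIZE, splittable=True)
gzip_stream_splits = split_count(FILE_SIZE, BLOCK_SIZE, splittable=False)

# Toy model of block compression: every chunk is compressed on its own,
# so any chunk can be decompressed without reading the chunks before it.
chunks = [(b"record-%d\n" % i) * 1000 for i in range(4)]
compressed_chunks = [zlib.compress(chunk) for chunk in chunks]
third_chunk = zlib.decompress(compressed_chunks[2])

print(uncompressed_splits, gzip_stream_splits, third_chunk == chunks[2])
```

Whether a block-compressed SequenceFile really behaves like the second part is exactly what I am asking about.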