Hi Lei,
gzip is a so-called non-splittable file format - Hadoop can’t “seek” into 
the middle of the file and start decompressing there; it always has to 
read the file from the beginning, which is an undesirable thing to do on a 
Hadoop cluster. Hence you will get exactly one mapper per non-splittable 
input file. That is also why pig.maxCombinedSplitSize doesn’t help here - 
it can only combine small splits into bigger ones, it can’t split a gzip 
file into smaller ones.

You might consider uncompressing the files, using a splittable codec (such 
as bzip2), or using a binary container format (Avro, Parquet, SequenceFile).
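
For example, here is a minimal sketch of recompressing one of the files 
with bzip2 (the paths are just placeholders for yours) - bzip2 is 
block-oriented, so Hadoop can split a single .bz2 file across many 
mappers:

  # pull the gzip file out of HDFS, recompress it, and put it back
  hadoop fs -cat /input/bid2.gz | gunzip | bzip2 > bid2.bz2
  hadoop fs -put bid2.bz2 /input-bz2/

After that, pointing your LOAD at the new directory should give you 
roughly one mapper per HDFS block instead of one per file.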

Jarcec

On Aug 18, 2014, at 7:49 AM, [email protected] wrote:

> 
> I have an input directory which has 7 files:
> 804M bid10.gz 
> 814M bid11.gz 
> 808M bid2.gz 
> 812M bid4.gz 
> 803M bid5.gz 
> 818M bid8.gz 
> 823M bid9.gz
> 
> In my pig script I set the combined split size to 128M:
> 
> SET pig.maxCombinedSplitSize 134217728;
> 
> But there are only 7 mappers (one file per mapper).
> Any insight on this? 
> 
> Thanks,
> Lei 
> 
> 
> 
> [email protected]
