Is there any other way to split the input gz files in MapReduce instead of 
changing the codec?



leiwang...@gmail.com
 
From: leiwang...@gmail.com
Date: 2014-08-18 23:15
To: user
Subject: Re: Re: pig.maxCombinedSplitSize not work

Hi Jarek, can you give me some examples of how to do this? 

Thanks,
Lei


leiwang...@gmail.com
 
From: Jarek Jarcec Cecho
Date: 2014-08-18 23:01
To: user
Subject: Re: pig.maxCombinedSplitSize not work
Hi Lei,
gzip is a so-called non-splittable file format - Hadoop can’t “seek” into the 
middle of the file and start decompressing there; you always have to read the 
file from the beginning, which is an undesirable thing to do on a Hadoop 
cluster. Hence you will get one mapper per non-splittable input file.
 
You might consider uncompressing the files, using a splittable codec (such as 
bzip2), or using a binary container format (Avro, Parquet, sequence files).
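 
For example, a one-off Pig job could rewrite the gz files as bzip2, which 
Hadoop can split. This is an untested sketch: the paths /data/bids and 
/data/bids_bz2 are placeholders for your actual directories, and 
output.compression.enabled / output.compression.codec are the PigStorage 
output-compression properties, assuming I remember their names correctly:
 
-- compress PigStorage output with the splittable bzip2 codec
SET output.compression.enabled true;
SET output.compression.codec org.apache.hadoop.io.compress.BZip2Codec;
 
-- PigStorage decompresses .gz input transparently on load
raw = LOAD '/data/bids' USING PigStorage();
 
-- the part files written under /data/bids_bz2 come out bzip2-compressed
STORE raw INTO '/data/bids_bz2' USING PigStorage();
 
Once the input is bzip2, pig.maxCombinedSplitSize should behave as you 
expect, since Hadoop can then break each file into multiple input splits.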
 
Jarcec
 
On Aug 18, 2014, at 7:49 AM, leiwang...@gmail.com wrote:
 
> 
> I have an input directory which has 7 files:
> 804M bid10.gz 
> 814M bid11.gz 
> 808M bid2.gz 
> 812M bid4.gz 
> 803M bid5.gz 
> 818M bid8.gz 
> 823M bid9.gz
> 
> In my Pig script I set the combined split size to 128M:
> 
> SET pig.maxCombinedSplitSize 134217728;
> 
> But there are only 7 mappers (one file per mapper).
> Any insight on this? 
> 
> Thanks,
> Lei 
> 
> 
> 
> leiwang...@gmail.com
 
