Is there any other way to split the input gz file in MapReduce instead of changing the codec?
leiwang...@gmail.com

From: leiwang...@gmail.com
Date: 2014-08-18 23:15
To: user
Subject: Re: Re: pig.maxCombinedSplitSize not work

Hi Jarek, can you give me some examples of how to do this?

Thanks,
Lei

leiwang...@gmail.com

From: Jarek Jarcec Cecho
Date: 2014-08-18 23:01
To: user
Subject: Re: pig.maxCombinedSplitSize not work

Hi Lei,
gzip is a so-called non-splittable file format - Hadoop can't "seek" into the middle of the file and start decompressing from there; you always have to read the file from the beginning, which is an undesirable thing to do on a Hadoop cluster. Hence you will get one mapper per non-splittable input file. You might consider uncompressing the files, using a splittable codec (such as bzip2), or using a binary container file format (Avro, Parquet, sequence file).

Jarcec

On Aug 18, 2014, at 7:49 AM, leiwang...@gmail.com wrote:
>
> I have an input directory which has 7 files:
> 804M bid10.gz
> 814M bid11.gz
> 808M bid2.gz
> 812M bid4.gz
> 803M bid5.gz
> 818M bid8.gz
> 823M bid9.gz
>
> In my pig script I set the combined split size to 128M:
>
> SET pig.maxCombinedSplitSize 134217728;
>
> But there are only 7 mappers (one file per mapper).
> Any insight on this?
>
> Thanks,
> Lei
>
> leiwang...@gmail.com
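As an example of the "use a splittable codec" suggestion above, one option is to recompress the .gz files into bzip2 before loading them in Pig. A minimal sketch (not from the thread; the sample file name and temp directory are hypothetical, and real usage would loop over the bid*.gz files on HDFS via `hadoop fs -cat` or a distcp-style job rather than local files):

```shell
#!/bin/sh
# Sketch: turn a non-splittable .gz file into a splittable .bz2 file.
# Hadoop can split bzip2 input, so pig.maxCombinedSplitSize (or the
# default split size) can then produce multiple mappers per file.
set -e

workdir=$(mktemp -d)                      # hypothetical scratch directory

# Simulate one of the gzipped input files with a tiny sample.
printf 'record1\nrecord2\n' > "$workdir/bid_sample"
gzip "$workdir/bid_sample"                # produces bid_sample.gz

# Recompress: stream-decompress the gzip file and re-encode as bzip2.
gunzip -c "$workdir/bid_sample.gz" | bzip2 > "$workdir/bid_sample.bz2"

# Sanity check: the round-tripped content is unchanged.
bunzip2 -c "$workdir/bid_sample.bz2"
```

The trade-off is that bzip2 is slower to compress and decompress than gzip; the container formats Jarcec mentions (Avro, Parquet, sequence files) avoid that by being splittable with faster block-level codecs.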