Re: Input split for a streaming job!
Milind, I realised that thanks to Joey from Cloudera. I have given up on bzip.

Raj

On Mon, Nov 14, 2011 at 2:02 PM, milind.bhandar...@emc.com wrote:
> It looks like your hadoop distro does not have
> https://issues.apache.org/jira/browse/HADOOP-4012. [...]
Re: Input split for a streaming job!
It looks like your hadoop distro does not have
https://issues.apache.org/jira/browse/HADOOP-4012.

- milind

On 11/10/11 2:40 PM, "Raj V" wrote:
> I assumed that the input splits for a streaming job will follow the same
> logic as a map reduce java job, but I seem to be wrong. [...]
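For context on what HADOOP-4012 changes: with it applied, BZip2Codec implements SplittableCompressionCodec, so the framework can cut a .bz2 file into multiple input splits; without it, every compressed file is a single split. A toy model of that per-file decision (purely illustrative, not the actual Hadoop API):

```python
# Toy model of per-file splittability; "bzip2_split_support" stands in for
# a distro that carries HADOOP-4012 (Hadoop 0.21+). Illustration only.
def is_splitable(filename, bzip2_split_support=False):
    if filename.endswith(".gz"):
        return False  # gzip has no restart points: one mapper per file
    if filename.endswith(".bz2"):
        return bzip2_split_support
    return True  # uncompressed text splits at block boundaries

# Without HADOOP-4012, .bz2 input behaves just like .gz:
print(is_splitable("part-00000.bz2"))                            # False
print(is_splitable("part-00000.bz2", bzip2_split_support=True))  # True
```

This is why converting gzip to bzip2 changed nothing on Raj's distro: the codec class was the same non-splittable kind either way.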
RE: Input split for a streaming job!
Or you could use the LZO patch and get *fast* splittable compression that
doesn't depend on the bz2 generalized splittability scheme:

http://www.cloudera.com/blog/2009/06/parallel-lzo-splittable-compression-for-hadoop/
http://www.cloudera.com/blog/2009/11/hadoop-at-twitter-part-1-splittable-lzo-compression/

- Tim.

On Fri, Nov 11, 2011 at 10:44 AM, bejoy.had...@gmail.com wrote:
> AFAIK 0.21 is an unstable release, and I doubt anyone would recommend it
> for production. [...]
Re: Input split for a streaming job!
Hi Raj,

AFAIK 0.21 is an unstable release, and I doubt anyone would recommend it for
production. You can play around with it, but a better approach would be to
patch your CDH3u1 with the required patches for splittable BZip2. Just make
sure your new patch doesn't break anything else.

Regards
Bejoy K S

On Fri, 11 Nov 2011 10:34:18, Raj V wrote:
> Tim, I am using CDH3 U1 (0.20.2+923), which does not have the patch. I
> will try and use 0.21. [...]
Re: Input split for a streaming job!
Tim,

I am using CDH3 U1 (0.20.2+923), which does not have the patch. I will try
and use 0.21.

Raj

On Friday, November 11, 2011 10:25 AM, Tim Broberg wrote:
> We just stumbled on the Jira item for BZIP2 splitting, and it appears to
> have been added in 0.21. So, it appears you need at least 0.21 to play
> with splittability in BZIP2. [...]
RE: Input split for a streaming job!
What version of hadoop are you using?

We just stumbled on the Jira item for BZIP2 splitting, and it appears to have
been added in 0.21.

When I diff 0.20.205 vs trunk, I see:

<   public class BZip2Codec implements
<       org.apache.hadoop.io.compress.CompressionCodec {
---
>   @InterfaceAudience.Public
>   @InterfaceStability.Evolving
>   public class BZip2Codec implements SplittableCompressionCodec {

So, it appears you need at least 0.21 to play with splittability in BZIP2.

- Tim.

On Friday, November 11, 2011 9:18 AM, Raj V wrote:
> I am using the TextInputFormat class
> (org.apache.hadoop.mapred.TextInputFormat), and the input files were
> created using a 32MB block size and the files are bzip2. So all things
> point to my input files being splittable. [...]
Re: Input split for a streaming job!
Joey, Anirudh, Bejoy,

I am using the TextInputFormat class
(org.apache.hadoop.mapred.TextInputFormat), and the input files were created
using a 32MB block size and the files are bzip2. So all things point to my
input files being splittable.

I will continue poking around.

- best regards
Raj

On Friday, November 11, 2011 2:56 AM, Joey Echeverria wrote:
> U1 should be able to split the bzip2 files. What input format are you using?
>
> -Joey
>
> On Thu, Nov 10, 2011 at 9:06 PM, Raj V wrote:
>> Sorry to bother you offline. From the release notes for CDH3U1
>> (http://archive.cloudera.com/cdh/3/hadoop-0.20.2+923.97.releasenotes.html)
>> I understand that splitting of the bzip files is available. But returning
>> to my old problem, I still see 73 mappers. Did I misunderstand something?
>> If necessary, I can re-post the mail to the group.
>>
>> On Thursday, November 10, 2011 3:11 PM, Joey Echeverria wrote:
>>> No problem. Out of curiosity, why are you still using B3?
>>>
>>> On Thu, Nov 10, 2011 at 6:07 PM, Raj V wrote:
>>>> Joey, I think I know the answer. I am using CDH3B3 (0.20.2+737), and
>>>> this does not seem to support bzip splitting. I should have looked
>>>> before shooting off the email :-(
>>>> To answer your second question, I created a completely new set of input
>>>> files with dfs.block.size=32MB and used this as the input data.
>>>>
>>>> On Thursday, November 10, 2011 3:02 PM, Joey Echeverria wrote:
>>>>> It depends on the version of hadoop that you're using. Also, when you
>>>>> changed the block size, did you do it on the actual files, or just the
>>>>> default for new files?
>>>>>
>>>>> On Thu, Nov 10, 2011 at 5:52 PM, Raj V wrote:
>>>>>> Hi Joey, I always thought bzip was splittable.
>>>>>>
>>>>>> On Thursday, November 10, 2011 2:43 PM, Joey Echeverria wrote:
>>>>>>> Gzip and bzip2 compressed files aren't splittable, so you'll always
>>>>>>> get one mapper per file.
Re: Input split for a streaming job!
Hi Raj,

Is your streaming job using WholeFileInputFormat or some custom InputFormat
that reads files as a whole? If so, then this is the expected behavior.

Also, you mentioned you changed dfs.block.size to 32 MB. AFAIK this value
applies only to new files put into hdfs; the existing files in hdfs keep the
block size they were written with. To test such scenarios you don't need to
change the block size at the cluster level; you can specify it at the file
level while copying into hdfs:

hadoop dfs -D dfs.block.size=16777216 -copyFromLocal /src/file /dest/file

Did you do it that way, and does the number of mappers still not vary?

AFAIK bzip2 is splittable. Please correct me if I'm wrong.

On Fri, Nov 11, 2011 at 2:07 PM, Anirudh Jhina wrote:
> What InputFormat are you using? A compressed format is not splittable, so
> if you have 73 gzip files, there will be 73 corresponding mappers, one for
> each file. [...]
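Bejoy's point about block size can be modeled in a few lines: each HDFS file records the block size it was written with, so changing the cluster default (or passing -D dfs.block.size=N at copy time, as above) affects only files created afterwards. The FakeHDFS class below is purely illustrative, not a real client:

```python
MB = 1024 * 1024

class FakeHDFS:
    """Toy model: block size is stamped into each file's metadata at write
    time; it is not a live cluster-wide setting."""
    def __init__(self, default_block_size):
        self.default_block_size = default_block_size
        self.files = {}

    def put(self, path, size, block_size=None):
        # block_size mimics `hadoop dfs -D dfs.block.size=N -copyFromLocal`
        self.files[path] = (size, block_size or self.default_block_size)

    def blocks(self, path):
        size, block_size = self.files[path]
        return -(-size // block_size)  # ceiling division

fs = FakeHDFS(default_block_size=128 * MB)
fs.put("/old/part-0000.bz2", 224 * MB)  # written under the old 128 MB default
fs.default_block_size = 32 * MB         # changing the default later...
fs.put("/new/part-0000.bz2", 224 * MB)  # ...affects only files written afterwards

print(fs.blocks("/old/part-0000.bz2"))  # 2  (still 128 MB blocks)
print(fs.blocks("/new/part-0000.bz2"))  # 7  (32 MB blocks)
```

This matches what Raj eventually did: he re-created the input files from scratch so they would actually be written with 32 MB blocks.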
Re: Input split for a streaming job!
Raj,

What InputFormat are you using? A compressed format is not splittable, so if
you have 73 gzip files, there will be 73 corresponding mappers, one for each
file. Look at the TextInputFormat.isSplitable() description.

Thanks,
~Anirudh

On Thu, Nov 10, 2011 at 2:40 PM, Raj V wrote:
> I assumed that the input splits for a streaming job will follow the same
> logic as a map reduce java job, but I seem to be wrong. [...]
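The reason a gzip file always gets exactly one mapper can be demonstrated with the standard library alone: a DEFLATE stream has no restart markers, so a mapper handed the middle of a .gz file has nowhere valid to start reading. (bzip2, by contrast, delimits its compressed blocks with a magic bit pattern that a reader can scan for; that is what the splittable-bzip2 work builds on.) A small sketch:

```python
import gzip

data = b"some log line\n" * 50000
gz = gzip.compress(data)

# Reading from byte 0 works fine:
assert gzip.decompress(gz) == data

# But a split starting mid-file is useless -- there is no gzip header or
# restart marker to synchronize on, so decompression cannot begin there:
try:
    gzip.decompress(gz[len(gz) // 2:])
    print("unexpectedly succeeded")
except Exception as e:
    print("mid-stream gzip read fails:", type(e).__name__)
```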
Input split for a streaming job!
All,

I assumed that the input splits for a streaming job will follow the same
logic as a map reduce java job, but I seem to be wrong.

I started out with 73 gzipped files that vary between 23MB and 255MB in
size. My default block size was 128MB. 8 of the 73 files are larger than
128MB.

When I ran my streaming job, it ran, as expected, 73 mappers (no reducers
for this job).

Since I have 128 nodes in my cluster, I thought I would use more systems in
the cluster by increasing the number of mappers. I changed all the gzip
files into bzip2 files. I expected the number of mappers to increase to 81.
The mappers remained at 73.

I tried a second experiment: I changed my dfs.block.size to 32MB. That
should have increased my mappers to about ~250. It remains steadfast at 73.

Is my understanding wrong? With a smaller block size and bzipped files,
should I not get more mappers?

Raj
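Raj's expected figures can be sanity-checked with a short sketch. The sizes below are illustrative, chosen only to match the counts in the mail (73 files, 8 of them between 128 MB and 256 MB), and the function is a simplified model of split computation, not the real FileInputFormat (which also allows roughly 10% overhang on the last split):

```python
import math

def expected_mappers(file_sizes, block_size, splittable):
    # One mapper per file for a non-splittable codec; otherwise roughly one
    # mapper per block of each file.
    if not splittable:
        return len(file_sizes)
    return sum(max(1, math.ceil(size / block_size)) for size in file_sizes)

MB = 1024 * 1024
# Illustrative sizes: 65 files under one 128 MB block, 8 files between
# 128 MB and 256 MB.
sizes = [96 * MB] * 65 + [224 * MB] * 8

print(expected_mappers(sizes, 128 * MB, splittable=False))  # 73: gzip, or bzip2 without HADOOP-4012
print(expected_mappers(sizes, 128 * MB, splittable=True))   # 81: splittable bzip2
print(expected_mappers(sizes, 32 * MB, splittable=True))    # 251: the "~250" with 32 MB blocks
```

So both observations in the mail are consistent with one cause: on a distro where bzip2 is not splittable, neither the codec change nor the block-size change can move the mapper count off the file count of 73.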