Milind, I realised that, thanks to Joey from Cloudera. I have given up on bzip.
Raj

>________________________________
> From: "[email protected]" <[email protected]>
> To: [email protected]; [email protected]; [email protected]
> Sent: Monday, November 14, 2011 2:02 PM
> Subject: Re: Input split for a streaming job!
>
> It looks like your hadoop distro does not have
> https://issues.apache.org/jira/browse/HADOOP-4012.
>
> - milind
>
> On 11/10/11 2:40 PM, "Raj V" <[email protected]> wrote:
>
>> All
>>
>> I assumed that the input splits for a streaming job would follow the same
>> logic as a Java MapReduce job, but I seem to be wrong.
>>
>> I started out with 73 gzipped files that vary between 23 MB and 255 MB in
>> size. My default block size was 128 MB. 8 of the 73 files are larger than
>> 128 MB.
>>
>> When I ran my streaming job, it ran, as expected, 73 mappers (no
>> reducers for this job).
>>
>> Since I have 128 nodes in my cluster, I thought I would use more systems
>> in the cluster by increasing the number of mappers. I changed all the
>> gzip files into bzip2 files. I expected the number of mappers to increase
>> to 81. The mappers remained at 73.
>>
>> I tried a second experiment: I changed my dfs.block.size to 32 MB. That
>> should have increased my mappers to about ~250. It remained steadfast at
>> 73.
>>
>> Is my understanding wrong? With a smaller block size and bzip2 files,
>> should I not get more mappers?
>>
>> Raj
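The split arithmetic behind the numbers in the thread can be sketched as follows. This is a hypothetical illustration, not Hadoop's actual split code: a non-splittable compressed file (gzip, or bzip2 on a distro without HADOOP-4012) yields one mapper per file, while a splittable input yields roughly one mapper per HDFS block. The file sizes below are made up to match the thread's counts (73 files, 8 larger than one 128 MB block).

```python
import math

def estimated_mappers(file_sizes_mb, block_size_mb, splittable):
    """Rough mapper estimate: non-splittable files get one mapper each;
    splittable files get about one mapper per block."""
    if not splittable:
        return len(file_sizes_mb)
    return sum(math.ceil(size / block_size_mb) for size in file_sizes_mb)

# Illustrative sizes only: 8 files of 200 MB (spanning two 128 MB blocks)
# plus 65 files of 64 MB (one block each).
sizes = [200] * 8 + [64] * 65

print(estimated_mappers(sizes, 128, splittable=False))  # 73 (one per gzip file)
print(estimated_mappers(sizes, 128, splittable=True))   # 81 (one per block)
```

This also explains the second experiment: with a non-splittable codec, shrinking dfs.block.size has no effect on the mapper count, because splits can never cross the per-file boundary anyway.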
