All, I assumed that the input splits for a streaming job would follow the same logic as a Java MapReduce job, but I seem to be wrong.
I started out with 73 gzipped files varying between 23 MB and 255 MB in size. My default block size is 128 MB, and 8 of the 73 files are larger than 128 MB. When I ran my streaming job, it ran 73 mappers, as expected (no reducers for this job). Since I have 128 nodes in my cluster, I wanted to put more of them to work by increasing the number of mappers.

First, I converted all the gzip files to bzip2 files. I expected the 8 oversized files to split, raising the mapper count to 81, but it remained at 73.

As a second experiment, I changed dfs.block.size to 32 MB. That should have increased the mapper count to roughly 250, but it remained steadfast at 73.

Is my understanding wrong? With a smaller block size and bzip2 files, should I not get more mappers?

Raj
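To make my expectation concrete, here is a sketch of the split arithmetic I had in mind (the file sizes are hypothetical placeholders, since only the 23-255 MB range and the count of 8 oversized files are known; it assumes an unsplittable codec like gzip yields one mapper per file, while a splittable codec like bzip2 yields roughly one mapper per block-sized split):

```python
import math

def expected_mappers(file_sizes_mb, block_size_mb, splittable):
    """Rough mapper count: one per file for unsplittable codecs,
    one per block-sized split for splittable codecs."""
    if not splittable:
        # gzip cannot be split, so each file becomes a single input split.
        return len(file_sizes_mb)
    # bzip2 can be split, so each file yields ceil(size / block size) splits.
    return sum(math.ceil(size / block_size_mb) for size in file_sizes_mb)

# Hypothetical sizes: 65 files under 128 MB plus 8 files of ~200 MB each.
sizes = [100] * 65 + [200] * 8

print(expected_mappers(sizes, 128, splittable=False))  # gzip: 73
print(expected_mappers(sizes, 128, splittable=True))   # bzip2: 65 + 8*2 = 81
```

By this reasoning, converting to bzip2 alone should add one extra split for each of the 8 files over 128 MB (73 + 8 = 81), and shrinking the block size should multiply the count further.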
