All, I assumed that the input splits for a streaming job would follow the same logic as a Java MapReduce job, but I seem to be wrong.
I started out with 73 gzipped files varying between 23 MB and 255 MB in size. My default block size is 128 MB, and 8 of the 73 files are larger than 128 MB. When I ran my streaming job, it ran 73 mappers, as expected (no reducers for this job). Since I have 128 nodes in my cluster, I wanted to put more of them to work by increasing the number of mappers.

First, I converted all the gzip files to bzip2 files. I expected the 8 oversized files to split, raising the mapper count to 81, but it remained at 73.

As a second experiment, I changed dfs.block.size to 32 MB. That should have increased the mapper count to roughly 250, but it remained steadfast at 73.

Is my understanding wrong? With a smaller block size and bzip2 files, should I not get more mappers?

Raj
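To make my expectation concrete, here is a sketch of the split arithmetic I had in mind (the file sizes are hypothetical placeholders, since only the 23-255 MB range and the count of 8 oversized files are known; it assumes an unsplittable codec like gzip yields one mapper per file, while a splittable codec like bzip2 yields roughly one mapper per block-sized split):

```python
import math

def expected_mappers(file_sizes_mb, block_size_mb, splittable):
    """Rough mapper count: one per file for unsplittable codecs,
    one per block-sized split for splittable codecs."""
    if not splittable:
        # gzip cannot be split, so each file becomes a single input split.
        return len(file_sizes_mb)
    # bzip2 can be split, so each file yields ceil(size / block size) splits.
    return sum(math.ceil(size / block_size_mb) for size in file_sizes_mb)

# Hypothetical sizes: 65 files under 128 MB plus 8 files of ~200 MB each.
sizes = [100] * 65 + [200] * 8

print(expected_mappers(sizes, 128, splittable=False))  # gzip: 73
print(expected_mappers(sizes, 128, splittable=True))   # bzip2: 65 + 8*2 = 81
```

By this reasoning, converting to bzip2 alone should add one extra split for each of the 8 files over 128 MB (73 + 8 = 81), and shrinking the block size should multiply the count further.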
