On Feb 7, 2011, at 13:39, Allen Wittenauer wrote:

> On Feb 4, 2011, at 7:46 AM, Keith Wiley wrote:
>> 

[On the topic of why I care if Hadoop funnels and queues multiple input splits 
into a small number of mappers instead of perfectly parallelizing the job across 
the available slots...]

>> Because all slots are not in use.  It's a very large cluster and it's 
>> excruciating that Hadoop partially serializes a job by piling multiple map 
>> tasks onto a single mapper in a queue even when the cluster is massively 
>> underutilized.
> 
>       Well, sort of.
> 
>       The only input hadoop has to go on is your filename input which is 
> relatively tiny.  So of course it is going to underutilize.  This makes sense 
> now. :)


I think we're talking around each other a little bit here.  I'm sorry.  In my 
original description, I was referring to the nonstreaming version of my 
program.  The all-Java version doesn't use filenames; it sets up actual Hadoop 
input splits from files stored on HDFS.  These files are about 6 MB each after 
decompression.  My point, earlier in this thread, was that Hadoop's default 
behavior, even in that case, which used the actual "largish" files as the 
inputs, still assigned many input splits to a single mapper (since they are 
smaller than a block) instead of achieving perfect parallelism.

The degree of queueing seemed perfectly coordinated with the block size of 
64 MB.  That is to say, given my input files of 6 MB each, Hadoop would 
assign about 10 of them per mapper...where I wanted one per mapper and ten 
times as many mappers.

Then, my final point was that in the nonstreaming, all-Java case, I could *NOT* 
achieve the desired behavior simply by setting mapred.map.tasks to a high 
number, say, one per input file (I honestly don't remember what the behavior 
was when I tried this; it was a very long time ago).  This simply did not work; 
Hadoop ignored it and queued up all my inputs anyway.  What I had to do was set 
mapred.max.split.size really small so that Hadoop would not be willing to queue 
multiple inputs up per mapper.  Ideally, I would set mapred.max.split.size 
slightly larger than a single input file, about 6 MB.  Doing this achieves my 
desired goal: one input per mapper, perfect parallelism, and minimum job 
turnaround time.
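
For concreteness, here is roughly what that looks like in the all-Java case 
(new "mapreduce" API, 0.20-era property names).  This is a sketch from memory 
rather than my actual job code; the class name, the identity mapper, and the 
paths are just placeholders:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class OneFilePerMapper {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Cap the split size at roughly one input file (~6 MB); this is the
        // setting that actually got me one map task per input file.
        conf.setLong("mapred.max.split.size", 6L * 1024 * 1024);

        // For comparison: this is only a hint to the framework; in my
        // experience it did not force one map task per file on its own.
        conf.setInt("mapred.map.tasks", 100);

        Job job = new Job(conf, "one file per mapper");
        job.setJarByClass(OneFilePerMapper.class);
        job.setMapperClass(Mapper.class);   // identity mapper as a placeholder
        job.setNumReduceTasks(0);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

For the streaming version, I believe the equivalent is just passing 
-D mapred.max.split.size=6291456 (6 MB) on the command line.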

Now, all that said, I am perfectly open to discussion or suggestions as to how 
I ought to better handle this situation, including the notion that 
mapred.map.tasks should have worked the way I intended in the first place (Did 
I just do something wrong there?  Should it have worked the way I expected it 
to?).  At any rate, what is the proper Hadoop method for evenly distributing 
inputs across all nodes before doubling up on any given node?

Sorry, maybe this thread is getting a bit rambling.  We can drop it if people 
prefer.....

Thanks.

________________________________________________________________________________
Keith Wiley               kwi...@keithwiley.com               www.keithwiley.com

"I do not feel obliged to believe that the same God who has endowed us with
sense, reason, and intellect has intended us to forgo their use."
  -- Galileo Galilei
________________________________________________________________________________


