I am running a job that takes no input from the mapper key/value interface.  Each 
mapper reads the same small file from the distributed cache and processes it 
independently (to generate Monte Carlo samples of the problem space).  I am using 
MR purely to parallelize an otherwise redundant, independent sampling process.  
To maximize parallelism, I want to set the number of mappers explicitly, so that, 
say, 10 samples run in 1X wall-clock time by being distributed perfectly over 10 
mappers.  I am accomplishing this by generating a dummy MR input file of 
placeholder data.  Every row is identical, so I know the exact byte length of 
each row.  I then simply set the split size to that row length, with the 
intention that Hadoop will assign exactly one mapper per row.  This approach 
mostly works.  However, I get a few extraneous empty mappers.  Since they receive 
no input, they do no work and exit almost immediately, so they aren't a serious 
drain on cluster resources, but I'm confused about why I get extra mappers in the 
first place.
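For concreteness, the dummy file is generated along these lines (a minimal 
sketch; the file name, row content, and row count are placeholders, and the only 
property that matters is that every row has the same byte length):

    import java.io.BufferedWriter;
    import java.io.FileWriter;
    import java.io.IOException;

    public class DummyInputGenerator {
        public static void main(String[] args) throws IOException {
            int numMappers = 10;                  // one row per intended mapper
            String row = "xxxxxxxxxxxxxxxxxxxx";  // identical placeholder content on every row
            try (BufferedWriter out = new BufferedWriter(new FileWriter("dummy_input.txt"))) {
                for (int i = 0; i < numMappers; i++) {
                    out.write(row);               // the data the split size is meant to cover
                    out.write('\n');              // the line terminator at the heart of the question
                }
            }
            System.out.println("row bytes = " + row.length()
                    + ", record bytes incl. newline = " + (row.length() + 1));
        }
    }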

My working theory was that the line terminators in the input file must be 
accounted for when calculating split sizes (so my splits were one byte too small 
and a few extra splits were hanging off the end of the input file).  I attempted 
to fix this by adding one to the calculated split size (one greater than the 
actual row length, i.e., row length plus the newline).  This generates exactly 
the intended number of mappers, the same number as there are rows in the input 
file.  However, the labor distribution is not perfect.  Almost every run produces 
one mapper which receives no input (and ends immediately) and another mapper 
which receives two inputs, thus triggering two "processing sessions" on that 
particular mapper such that it takes twice as long to complete as the other 
mappers.  Obviously, this wrecks the potential parallelism by literally doubling 
the overall job time.
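For reference, the driver pins the split size roughly like this (a sketch using 
the new org.apache.hadoop.mapreduce API's FileInputFormat helpers; the class 
name, row-length constant, and paths are illustrative, and on the old mapred API 
the equivalent knobs would be the mapred.min.split.size / mapred.max.split.size 
properties).  With min and max pinned to the same value, the computed split size 
should be exactly one record, which is where the off-by-one on the newline shows 
up:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class SamplingDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "monte-carlo-sampling");
            job.setJarByClass(SamplingDriver.class);

            // Each record in the dummy file is rowBytes of data plus 1 byte for '\n',
            // so one full record occupies rowBytes + 1 bytes on disk.
            long rowBytes = 20L;              // measured length of one row, sans newline
            long splitSize = rowBytes + 1;    // +1 for the line terminator

            // Pin both the min and max split size so each split covers exactly one record.
            FileInputFormat.setMinInputSplitSize(job, splitSize);
            FileInputFormat.setMaxInputSplitSize(job, splitSize);

            FileInputFormat.addInputPath(job, new Path("dummy_input.txt"));
            FileOutputFormat.setOutputPath(job, new Path("sampling_output"));
            // mapper class, output types, distributed-cache file, etc. omitted for brevity

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }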

Which split size is correct: the row length without the newline, or the row 
length with the newline?  The former yields extra empty mappers while the latter 
yields exactly the right number.  However, if the latter is correct, why is the 
task distribution uneven (albeit NEARLY even), and what (if anything) can be done 
about it?

Thanks.

________________________________________________________________________________
Keith Wiley     kwi...@keithwiley.com     keithwiley.com    music.keithwiley.com

"The easy confidence with which I know another man's religion is folly teaches
me to suspect that my own is also."
                                           --  Mark Twain
________________________________________________________________________________
