I'm running a job whose mappers take a long time, which causes problems like
starving out other jobs that want to run on the same cluster.  Rewriting the
mapper algorithm is not currently an option, but I still need a way to
increase the number of mappers so that I will have greater granularity.
What is the best way to do this?

Looking through the O'Reilly book and starting from this Wiki page
<http://wiki.apache.org/hadoop/HowManyMapsAndReduces>, I've come up with a
couple of ideas:

   1. Set mapred.map.tasks to the value I want (see the sketch after this
   list).
   2. Decrease the block size of my input files.
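
For (1), here's roughly what I have in mind in the driver, using the old
mapred API (untested sketch; the class name and path arguments are just
placeholders):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class ManyMapsDriver {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(ManyMapsDriver.class);
        conf.setJobName("many-maps");
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        // As I understand it, this is only a hint: the InputFormat's
        // getSplits() computes the actual number of map tasks, and
        // FileInputFormat won't split a file below its block size.
        conf.setNumMapTasks(200);
        JobClient.runJob(conf);
      }
    }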

What are the gotchas with these approaches?  I know that (1) may not work
because this parameter is just a suggestion.  Is there a command line option
that accomplishes (2), or do I have to do a distcp with a non-default block
size?  (I think the answer is that I have to do a distcp, but I'm making
sure.)
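
If distcp is indeed the way, I assume something like the following would do
it, since I believe DistCp implements Tool and so accepts generic -D options
(untested; the paths are made up, and dfs.block.size has to be a multiple of
io.bytes.per.checksum):

    # copy with 32 MB blocks instead of the default 64 MB,
    # roughly doubling the number of splits (and thus mappers)
    hadoop distcp -Ddfs.block.size=33554432 \
        /user/me/input /user/me/input-32mb-blocks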

Are there other approaches?  Are there other gotchas that come with trying
to increase mapper granularity?  I know this can be more of an art than a
science.

Thanks.
