I'm running a job whose mappers take a long time, which causes problems like starving out other jobs that want to run on the same cluster. Rewriting the mapper algorithm is not currently an option, but I still need a way to increase the number of mappers so that I will have greater granularity. What is the best way to do this?
Looking through the O'Reilly book and starting from this wiki page <http://wiki.apache.org/hadoop/HowManyMapsAndReduces>, I've come up with a couple of ideas:

1. Set mapred.map.tasks to the value I want.
2. Decrease the block size of my input files.

What are the gotchas with these approaches? I know that (1) may not work, because this parameter is just a suggestion to the framework. Is there a command line option that accomplishes (2), or do I have to do a distcp with a non-default block size? (I think the answer is that I have to do a distcp, but I'm making sure.) Are there other approaches? Are there other gotchas that come with trying to increase mapper granularity? I know this can be more of an art than a science. Thanks.
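For context, here's a sketch of how I understand the two ideas would be expressed on the command line. The job name, jar, and paths are hypothetical, and (1) assumes the job driver uses ToolRunner/GenericOptionsParser so that `-D` properties are picked up; I'm not certain either invocation behaves as I hope, which is partly why I'm asking.

```shell
# Idea (1): pass the map-task hint when submitting the job.
# myjob.jar / MyJob / in / out are placeholders; mapred.map.tasks is
# only a hint, which is exactly the gotcha I'm worried about.
hadoop jar myjob.jar MyJob -D mapred.map.tasks=200 in out

# Idea (2): re-copy the input with a smaller block size (16 MB here)
# so the default splitter produces more splits, hence more mappers.
# This is the distcp approach I suspect is required.
hadoop distcp -D dfs.block.size=16777216 /data/input /data/input-small-blocks
```

If there's a way to get the effect of (2) without rewriting the data (e.g., something at split time rather than at HDFS block level), that would be ideal.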