In general, multithreading does not buy you much in traditional Map/Reduce. If you want the mappers to run faster you can drop the split size and get a similar result, because you get more parallelism; that is the use case we have typically concentrated on. About the only time MultithreadedMapper makes a lot of sense is when there is a lot of computation associated with each key/value pair, i.e. your process is compute bound rather than I/O bound. Wordcount is typically going to be I/O bound. I am not aware of any work being done to reduce lock contention in these cases. If you want to file a generic JIRA for the lock contention, that would be great.
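To make the contention concrete, here is a minimal JDK-only sketch (not the actual Hadoop code; class and method names are made up for illustration) of the pattern MultithreadedMapper uses internally: several worker threads pulling records from one shared reader, where every read goes through a single lock. Unless the per-record "map" work dominates, the threads mostly serialize on that lock.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;

// Sketch: N worker threads share one record reader, so every read
// must happen under a single coarse lock -- the source of the contention.
public class SharedReaderSketch {
    // Hypothetical stand-in for a RecordReader over one input split.
    static class Reader {
        private final Iterator<String> it;
        Reader(List<String> records) { this.it = records.iterator(); }
        // The coarse lock: only one thread can fetch a record at a time.
        synchronized String next() { return it.hasNext() ? it.next() : null; }
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> records = new ArrayList<>();
        for (int i = 0; i < 1000; i++) records.add("rec" + i);
        Reader reader = new Reader(records);
        ConcurrentLinkedQueue<String> processed = new ConcurrentLinkedQueue<>();

        Runnable worker = () -> {
            String rec;
            while ((rec = reader.next()) != null) {
                // The "map" work goes here; unless it costs much more than
                // next(), the threads spend their time waiting on the lock.
                processed.add(rec.toUpperCase());
            }
        };
        Thread[] threads = new Thread[4];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(worker);
            threads[i].start();
        }
        for (Thread t : threads) t.join();
        System.out.println(processed.size()); // each record consumed exactly once
    }
}
```

The lock keeps the reader correct (each record is handed out exactly once), but it also means a mostly-I/O job like wordcount gains little from the extra threads.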
My gut feeling is that the reason the lock is so coarse is that the InputFormats themselves are not thread safe. Perhaps the simplest thing you could do is change it so that each thread gets its own "split" of the actual split, and then, if one thread finishes early, add some logic to share a "split" among a limited number of threads. But as with anything in performance, never trust your gut, so please profile it before making any code changes.

--Bobby Evans

On 7/26/12 12:47 AM, "kenyh" <ken.yihan1...@gmail.com> wrote:
>
> Multithreaded MapReduce introduces multithreaded execution in a map task. In
> Hadoop 1.0.2, MultithreadedMapper implements multithreaded execution of the
> mapper function. But I found that synchronization is needed for record
> reading (reading the input key and value) and for result output. This
> contention brings heavy performance overhead, which increased a 50MB
> wordcount task's execution time from 40 seconds to 1 minute. I wonder if
> there are any optimizations to the multithreaded mapper to decrease the
> contention on input reading and output?
> --
> View this message in context:
> http://old.nabble.com/MultithreadedMapper-tp34213805p34213805.html
> Sent from the Hadoop core-dev mailing list archive at Nabble.com.
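Bobby's "split the split" suggestion above could look roughly like the following JDK-only sketch (an assumed design for illustration, not Hadoop code): partition the record range up front so each thread reads its own contiguous chunk through its own reader, removing the shared lock from the hot path entirely.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of per-thread sub-splits: divide the record range up front so
// each thread walks its own chunk and no per-record lock is needed.
public class SubSplitSketch {
    public static void main(String[] args) throws InterruptedException {
        int numRecords = 1000, numThreads = 4;
        List<String> records = new ArrayList<>();
        for (int i = 0; i < numRecords; i++) records.add("rec" + i);

        AtomicInteger total = new AtomicInteger();
        Thread[] threads = new Thread[numThreads];
        int chunk = (numRecords + numThreads - 1) / numThreads;
        for (int t = 0; t < numThreads; t++) {
            final int start = t * chunk;
            final int end = Math.min(start + chunk, numRecords);
            threads[t] = new Thread(() -> {
                // Each thread reads only its own sub-range: no shared
                // reader, hence no lock contention on the read path.
                for (int i = start; i < end; i++) {
                    records.get(i);           // read side, lock-free
                    total.incrementAndGet();  // stand-in for emitting output
                }
            });
            threads[t].start();
        }
        for (Thread th : threads) th.join();
        System.out.println(total.get()); // all records covered, none twice
    }
}
```

The trade-off this static partitioning ignores is the one the message calls out: if one thread's chunk finishes early, you need extra logic (work stealing, or sharing a sub-split among a few threads) to keep all threads busy, and real InputFormats would need a thread-safe way to open a reader per sub-range.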