If you've got 20 nodes, then you want roughly 20 reduce tasks, or maybe 40 if you'd like them to run in two waves. (That assumes 1 core per node; multiply by N for N cores.) As it is, each node runs 500-ish map tasks, and each of those map tasks has to generate 500 separate reduce output partitions for the reducers to fetch. That's going to take Hadoop a long time. 10,000 map tasks is also a very large number of map tasks. Are you processing a lot of little files? If so, try using MultiFileInputFormat or MultipleInputs to group them together.
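As a concrete illustration of the reduce-count advice, here's a minimal driver sketch against the old `org.apache.hadoop.mapred` API that Hadoop 0.19 uses. The class name, paths, and the omitted mapper/reducer setup are all illustrative, not from the thread; this is a sketch, not a complete job.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

// Hypothetical driver name; mapper/reducer classes omitted for brevity.
public class CopyPhaseTuning {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(CopyPhaseTuning.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        // Roughly one reduce per core: 20 nodes x 1 core = 20 reduces
        // in a single wave, or 40 if you'd rather run two waves.
        conf.setNumReduceTasks(20);

        JobClient.runJob(conf);
    }
}
```

With 20 reduces instead of 500, each map task writes 20 partitions rather than 500, which cuts the number of shuffle transfers by an order of magnitude.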
As was mentioned, also set mapred.reduce.parallel.copies to 20. (The default of 5 is appropriate only for 1--5 nodes.)

- Aaron

On Mon, Aug 24, 2009 at 12:31 AM, Amogh Vasekar <am...@yahoo-inc.com> wrote:
> Maybe look at the mapred.reduce.parallel.copies property to speed it up. I
> don't see why transfer speed would be configurable via params, and I don't
> think Hadoop messes with that.
>
> Thanks,
> Amogh
>
> -----Original Message-----
> From: yang song [mailto:hadoop.ini...@gmail.com]
> Sent: Monday, August 24, 2009 12:20 PM
> To: common-user@hadoop.apache.org
> Subject: How to speed up the copy phase?
>
> Hello, everyone
>
> When I submit a big job (e.g. map tasks: 10000, reduce tasks: 500), I find
> that the copy phase lasts a long, long time. From the web UI, the message
> "reduce > copy (xxxx of 10000 at 0.01 MB/s)" tells me the transfer speed
> is just 0.01 MB/s. Is that a normal value? How can I fix it?
>
> Thank you!
>
> P.S. The Hadoop version is 0.19.1. The cluster has 20 nodes. The heap size
> of the JT is 6G; the other settings are defaults.
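For reference, the parallel-copies setting is just a cluster property; a sketch of how it might look in mapred-site.xml (the description text is mine, not from the thread):

```xml
<!-- mapred-site.xml; can also be set per job,
     e.g. -D mapred.reduce.parallel.copies=20 -->
<property>
  <name>mapred.reduce.parallel.copies</name>
  <value>20</value>
  <description>Number of map outputs each reduce task fetches in
  parallel during the shuffle/copy phase. The default of 5 suits
  only very small clusters.</description>
</property>
```

Raising this lets each of the 500 reducers pull from 20 of the 10,000 map outputs at once instead of 5, which directly attacks the 0.01 MB/s copy rate described below.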