If you've got 20 nodes, then you want 20-ish reduce tasks, or maybe 40
if you want them to run in two waves. (That assumes 1 core/node; multiply by N
for N cores.) As it is, each node has 500-ish map tasks to run, and each of
those has to generate 500 separate reduce-partition output files. That's going
to take Hadoop a long time. 10,000 map tasks is also a very large number of map
tasks. Are you processing a lot of little files? If so, try using
MultiFileInputFormat or MultipleInputs to group them together.
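
For reference, the reduce count can be set in the job conf like this (a sketch, assuming Hadoop 0.19-era mapred.* property names and the 20-node, 1-core-per-node case above):

```xml
<!-- Sketch: ~2 waves of reduces on a 20-node cluster, 1 core per node. -->
<property>
  <name>mapred.reduce.tasks</name>
  <value>40</value>
</property>
```

If the job goes through ToolRunner/GenericOptionsParser, the same thing can be passed per-job with -D mapred.reduce.tasks=40.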

As Amogh mentioned, also set mapred.reduce.parallel.copies to 20. (The default
of 5 is appropriate only for clusters of 1-5 nodes.)
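
In the same vein, a sketch of that setting (again using the old mapred.* names from the 0.19 line):

```xml
<!-- Sketch: let each reducer fetch map output from 20 hosts in parallel
     instead of the default 5. -->
<property>
  <name>mapred.reduce.parallel.copies</name>
  <value>20</value>
</property>
```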

- Aaron

On Mon, Aug 24, 2009 at 12:31 AM, Amogh Vasekar <am...@yahoo-inc.com> wrote:

> Maybe look at the mapred.reduce.parallel.copies property to speed it up... I
> don't see why transfer speed would be configured via params, and I don't
> think Hadoop messes with that.
>
> Thanks,
> Amogh
>
> -----Original Message-----
> From: yang song [mailto:hadoop.ini...@gmail.com]
> Sent: Monday, August 24, 2009 12:20 PM
> To: common-user@hadoop.apache.org
> Subject: How to speed up the copy phase?
>
> Hello, everyone
>
> When I submit a big job (e.g. 10000 map tasks, 500 reduce tasks), I find that
> the copy phase lasts a very long time. From the WebUI, the message
> "reduce > copy (xxxx of 10000 at 0.01 MB/s) >" tells me the transfer speed
> is just 0.01 MB/s. Is that a normal value? How can I fix it?
>
> Thank you!
>
> P.S. The Hadoop version is 0.19.1. The cluster has 20 nodes. The heap size of
> the JT is 6G, while the other settings are defaults.
>
>