A few things caused this for me early on. First, check that the job is actually making progress. Sometimes the slowness is the result of negative progress: the reduce gets to, say, 10% complete and then drops back down to 5%. In that case the status can still show that same line with the slow throughput rate.
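One rough way to check for that kind of backsliding is to sample a reducer's status line every so often and look for the copied count going down. This is only a sketch: the status format is taken from the message quoted in this thread, while the function names and the idea of collecting samples by hand (from the web UI or logs) are my own assumptions, not anything Hadoop provides.

```python
import re

# Status format copied from the message quoted in this thread, e.g.
# "reduce > copy (14266 of 28243 at 1.30 MB/s)".
STATUS_RE = re.compile(r"reduce > copy \((\d+) of (\d+) at ([\d.]+) MB/s\)")


def parse_copy_status(status):
    """Return (copied, total, mb_per_s), or None if the line does not match."""
    m = STATUS_RE.search(status)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2)), float(m.group(3))


def find_regressions(samples):
    """Return indices of samples whose copied count dropped below the previous one.

    `samples` is a time-ordered list of status strings for ONE reducer
    (hypothetical input; gather it however is convenient).
    Lines that do not parse are skipped.
    """
    regressions = []
    previous = None
    for i, status in enumerate(samples):
        parsed = parse_copy_status(status)
        if parsed is None:
            continue
        copied = parsed[0]
        if previous is not None and copied < previous:
            regressions.append(i)
        previous = copied
    return regressions


if __name__ == "__main__":
    # Made-up samples for illustration only.
    samples = [
        "reduce > copy (1000 of 28243 at 1.30 MB/s)",
        "reduce > copy (2800 of 28243 at 1.28 MB/s)",
        "reduce > copy (1400 of 28243 at 1.31 MB/s)",
    ]
    print(find_regressions(samples))
```

If this reports any indices, the reducer lost ground between those samples, and the low MB/s figure is more likely a symptom of restarts than of a slow network.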
Changing a few of the settings below did improve things, but ultimately what fixed it for us was buying more hardware. ;-)

On Sun, Mar 1, 2009 at 10:21 PM, Jothi Padmanabhan <joth...@yahoo-inc.com> wrote:

> There are a lot of factors that affect shuffle speed.
>
> Some of them are:
>
> 1. The Number of reducers concurrently running in a node
> 2. The number of parallel copier threads that are pulling in map data
>    (mapred.reduce.parallel.copies)
> 3. Size of the individual map outputs. If Map outputs are huge, they are
>    shuffled to disk and there might be some contention if several files are
>    written to disk at the same time
> 4. Size of the buffer reserved to accommodate map outputs on the reducer
>    side (mapred.job.shuffle.input.buffer.percent).
>
> Jothi
>
> On 2/28/09 6:55 AM, "Nathan Marz" <nat...@rapleaf.com> wrote:
>
> > The Hadoop shuffle phase seems painstakingly slow. For example, I am
> > running a very large job, and all the reducers report a status such as:
> >
> > "reduce > copy (14266 of 28243 at 1.30 MB/s)"
> >
> > This is after all the mappers are finished. Is it supposed to be so
> > slow?