On Apr 20, 2009, at 9:56 AM, Mark Kerzner wrote:

Hi,

I ran a Hadoop MapReduce task in local mode, reading from and writing to HDFS, and it took 2.5 minutes. Essentially the same operations on the local file system without MapReduce took half a minute. Is this to be expected?


Hmm... some overhead is expected, but this seems like too much. What version of Hadoop are you running?

It's hard to help without more details about your application, configuration, etc., but I'll try...


It seemed that the system lost most of the time in the MapReduce operation; for example, after these messages

09/04/19 23:23:01 INFO mapred.LocalJobRunner: reduce > reduce
09/04/19 23:23:01 INFO mapred.JobClient:  map 100% reduce 92%
09/04/19 23:23:04 INFO mapred.LocalJobRunner: reduce > reduce

it waited for a long time. The final output lines were


It could be either the reduce-side merge or the HDFS write. Can you check your task logs and data-node logs?

09/04/19 23:24:13 INFO mapred.JobClient:     Combine input records=185
09/04/19 23:24:13 INFO mapred.JobClient:     Combine output records=185

Combine input records equals combine output records, which shows that the combiner is useless for this app. Turn it off; it only adds overhead.
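For reference, the combiner is whatever the job driver registers via setCombinerClass, so disabling it just means deleting that line. A sketch of a driver using the old JobConf API (class names here are placeholders, not from the original job):

```java
// Driver fragment (old JobConf API) -- a sketch, not a standalone program.
// MyJob, MyMapper, and MyReducer are hypothetical placeholder classes.
JobConf conf = new JobConf(MyJob.class);
conf.setMapperClass(MyMapper.class);
// conf.setCombinerClass(MyReducer.class);  // remove this line to turn the combiner off
conf.setReducerClass(MyReducer.class);
JobClient.runJob(conf);
```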

09/04/19 23:24:13 INFO mapred.JobClient:   File Systems
09/04/19 23:24:13 INFO mapred.JobClient:     HDFS bytes read=138103444
09/04/19 23:24:13 INFO mapred.JobClient:     HDFS bytes written=107357785
09/04/19 23:24:13 INFO mapred.JobClient:     Local bytes read=282509133
09/04/19 23:24:13 INFO mapred.JobClient:     Local bytes written=376697552

For the amount of data you are processing, you are doing far too much local-disk I/O. 'Local bytes written' should be _very_ close to 'Map output bytes', i.e. ~91M for the maps and zero for the reduces. (See the counters table on the job-details web UI.)

There are a few knobs you need to tweak to get closer to optimal performance. The good news is that it's doable; the bad news is that one _has_ to get one's fingers dirty...

Some knobs you will be interested in are:

Map-side:
* io.sort.mb
* io.sort.factor
* io.sort.record.percent
* io.sort.spill.percent

Reduce-side:
* mapred.reduce.parallel.copies
* mapred.reduce.copy.backoff
* mapred.job.shuffle.input.buffer.percent
* mapred.job.shuffle.merge.percent
* mapred.inmem.merge.threshold
* mapred.job.reduce.input.buffer.percent


Check the description of each of them in hadoop-default.xml or mapred-default.xml (depending on the version of Hadoop you are running).
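Just to show the mechanics, these properties go into the job configuration or mapred-site.xml. The values below are made up purely to illustrate the syntax, not recommendations; always start from the defaults documented in hadoop-default.xml and tune against your own job:

```xml
<!-- Illustrative config fragment only; the values are arbitrary examples.
     Check hadoop-default.xml for the real defaults and semantics. -->
<property>
  <name>io.sort.mb</name>
  <value>200</value>   <!-- map-side sort buffer size, in MB -->
</property>
<property>
  <name>io.sort.factor</name>
  <value>100</value>   <!-- number of streams merged at once during sorts -->
</property>
<property>
  <name>mapred.reduce.parallel.copies</name>
  <value>10</value>    <!-- concurrent map-output fetches per reduce -->
</property>
```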
Some more details available here: 
http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/TuningAndDebuggingMapReduce_ApacheConEU09.pdf

hth,
Arun
