On Apr 20, 2009, at 9:56 AM, Mark Kerzner wrote:
Hi,

I ran a Hadoop MapReduce task in local mode, reading and writing from HDFS, and it took 2.5 minutes. Essentially the same operations on the local file system without MapReduce took half a minute. Is this to be expected?
Hmm... some overhead is expected, but this seems too much. What version of Hadoop are you running?
It's hard to help without more details about your application, configuration, etc., but I'll try...
It seemed that the system spent most of the time in the MapReduce operation; for example, after these messages

09/04/19 23:23:01 INFO mapred.LocalJobRunner: reduce > reduce
09/04/19 23:23:01 INFO mapred.JobClient: map 100% reduce 92%
09/04/19 23:23:04 INFO mapred.LocalJobRunner: reduce > reduce

it waited for a long time. The final output lines were
It could be either the reduce-side merge or the HDFS write. Can you check your task logs and datanode logs?
09/04/19 23:24:13 INFO mapred.JobClient: Combine input records=185
09/04/19 23:24:13 INFO mapred.JobClient: Combine output records=185
That shows the combiner is useless for this app - input and output record counts are identical, so it isn't aggregating anything. Turn it off; it only adds unnecessary overhead.
09/04/19 23:24:13 INFO mapred.JobClient: File Systems
09/04/19 23:24:13 INFO mapred.JobClient: HDFS bytes read=138103444
09/04/19 23:24:13 INFO mapred.JobClient: HDFS bytes written=107357785
09/04/19 23:24:13 INFO mapred.JobClient: Local bytes read=282509133
09/04/19 23:24:13 INFO mapred.JobClient: Local bytes written=376697552
For the amount of data you are processing, you are doing far too much local-disk I/O. 'Local bytes written' should be _very_ close to the 'Map output bytes', i.e. ~91M for the maps and zero for the reduces. (See the counters table on the job-details web UI.)
There are a few knobs you need to tweak to get closer to optimal performance. The good news is that it's doable; the bad news is that one _has_ to get one's hands dirty...
Some knobs you will be interested in are:
Map-side:
* io.sort.mb
* io.sort.factor
* io.sort.record.percent
* io.sort.spill.percent
Reduce-side:
* mapred.reduce.parallel.copies
* mapred.reduce.copy.backoff
* mapred.job.shuffle.input.buffer.percent
* mapred.job.shuffle.merge.percent
* mapred.inmem.merge.threshold
* mapred.job.reduce.input.buffer.percent
Check the description for each of them in hadoop-default.xml or mapred-default.xml (depending on the version of Hadoop you are running).
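As a rough illustration, these properties can be set per-job (e.g. in your JobConf) or cluster-wide in mapred-site.xml. The values below are assumptions for the sake of example, not recommendations - tune them against your own counters:

```xml
<!-- Illustrative values only; verify defaults and tune for your workload. -->
<property>
  <name>io.sort.mb</name>
  <value>200</value> <!-- map-side sort buffer (MB); big enough to avoid extra spills -->
</property>
<property>
  <name>io.sort.factor</name>
  <value>100</value> <!-- number of streams merged at once during on-disk merges -->
</property>
<property>
  <name>mapred.reduce.parallel.copies</name>
  <value>10</value> <!-- concurrent map-output fetches per reduce -->
</property>
<property>
  <name>mapred.job.reduce.input.buffer.percent</name>
  <value>0.7</value> <!-- fraction of reduce heap allowed to hold map outputs during the reduce -->
</property>
```

The general idea is to give the map-side sort and the reduce-side shuffle enough memory that intermediate data stays out of local disk, which is where your 'Local bytes written' is coming from.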
Some more details available here:
http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/TuningAndDebuggingMapReduce_ApacheConEU09.pdf
hth,
Arun