On Apr 20, 2009, at 9:56 AM, Mark Kerzner wrote:
Hi,

I ran a Hadoop MapReduce task in local mode, reading and writing from HDFS, and it took 2.5 minutes. Essentially the same operations on the local file system without MapReduce took half a minute. Is this to be expected?
Hmm... some overhead is expected, but this seems too much. What version of Hadoop are you running?
It's hard to help without more details about your application, configuration, etc., but I'll try...
It seemed that the system spent most of the time in the MapReduce operation; for example, after these messages

09/04/19 23:23:01 INFO mapred.LocalJobRunner: reduce > reduce
09/04/19 23:23:01 INFO mapred.JobClient: map 100% reduce 92%
09/04/19 23:23:04 INFO mapred.LocalJobRunner: reduce > reduce

it waited for a long time. The final output lines were
It could be either the reduce-side merge or the HDFS write. Can you check your task logs and datanode logs?
09/04/19 23:24:13 INFO mapred.JobClient: Combine input records=185
09/04/19 23:24:13 INFO mapred.JobClient: Combine output records=185
That shows the combiner is useless for this app - input and output record counts are identical, so it isn't aggregating anything. Turn it off; it only adds unnecessary overhead.
09/04/19 23:24:13 INFO mapred.JobClient: File Systems
09/04/19 23:24:13 INFO mapred.JobClient: HDFS bytes read=138103444
09/04/19 23:24:13 INFO mapred.JobClient: HDFS bytes written=107357785
09/04/19 23:24:13 INFO mapred.JobClient: Local bytes read=282509133
09/04/19 23:24:13 INFO mapred.JobClient: Local bytes written=376697552
For the amount of data you are processing, you are doing far too much local-disk I/O. 'Local bytes written' should be _very_ close to the 'Map output bytes', i.e. ~91M for the maps and zero for the reduces. (See the counters table on the job-details web UI.)
There are a few knobs you need to tweak to get closer to optimal performance. The good news is that it's doable; the bad news is that one _has_ to get one's hands dirty...
Some knobs you will be interested in are:
Map-side:
* io.sort.mb
* io.sort.factor
* io.sort.record.percent
* io.sort.spill.percent
Reduce-side:
* mapred.reduce.parallel.copies
* mapred.reduce.copy.backoff
* mapred.job.shuffle.input.buffer.percent
* mapred.job.shuffle.merge.percent
* mapred.inmem.merge.threshold
* mapred.job.reduce.input.buffer.percent
Check the description for each of them in hadoop-default.xml or mapred-default.xml (depending on the version of Hadoop you are running).
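As a rough illustration, these properties can be set per-job (e.g. in your JobConf) or cluster-wide in mapred-site.xml. The values below are assumptions for the sake of example, not recommendations - tune them against your own counters:

```xml
<!-- Illustrative values only; verify defaults and tune for your workload. -->
<property>
  <name>io.sort.mb</name>
  <value>200</value> <!-- map-side sort buffer (MB); big enough to avoid extra spills -->
</property>
<property>
  <name>io.sort.factor</name>
  <value>100</value> <!-- number of streams merged at once during on-disk merges -->
</property>
<property>
  <name>mapred.reduce.parallel.copies</name>
  <value>10</value> <!-- concurrent map-output fetches per reduce -->
</property>
<property>
  <name>mapred.job.reduce.input.buffer.percent</name>
  <value>0.7</value> <!-- fraction of reduce heap allowed to hold map outputs during the reduce -->
</property>
```

The general idea is to give the map-side sort and the reduce-side shuffle enough memory that intermediate data stays out of local disk, which is where your 'Local bytes written' is coming from.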
Some more details available here:
http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/TuningAndDebuggingMapReduce_ApacheConEU09.pdf
hth,
Arun