Arun, thank you very much for the answer. I will turn off the combiner. I am debugging intermediate MR steps now, so performance is secondary for this stage; real tuning will come later, on a cluster. I am running 0.18.3, but general pointers are good enough at this point. I am doing a lot of I/O, because I write every binary file that I get from a SequenceFile back to local disk in order to process it with Tika. This hasn't added significant overhead, although Tika can accept an InputStream directly, which is really what I should do.
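For reference, a minimal sketch of what "give Tika a stream" could look like: open the file straight from HDFS and parse it without the local-disk round trip. This assumes Tika's AutoDetectParser convenience method and Hadoop's FileSystem API; the class name and path handling here are illustrative, not the actual job code.

```java
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class TikaFromHdfs {
    /** Parse a file stored in HDFS without copying it to local disk first. */
    public static String parseHdfsFile(Configuration conf, Path hdfsPath)
            throws Exception {
        FileSystem fs = FileSystem.get(conf);
        // Open the HDFS file directly as a stream -- no local copy needed
        InputStream in = fs.open(hdfsPath);
        try {
            BodyContentHandler handler = new BodyContentHandler();
            Metadata metadata = new Metadata();
            // AutoDetectParser sniffs the content type and dispatches
            new AutoDetectParser().parse(in, handler, metadata);
            return handler.toString();
        } finally {
            in.close();
        }
    }
}
```

This needs the hadoop-core and tika jars on the classpath; the point is simply that the bytes never touch the local file system on the way to the parser.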
Oh, and yes, I already had performance problems writing to SequenceFile, when I was copying binary files from the local file system. I hoped that writing from HDFS to HDFS would be faster - but maybe not. Anyway, it may all change on different hardware, and I am going to try it on Elastic MapReduce on Amazon. And thank you for the link - I wish I were at the conference! Anyway, at this level I have to get my hands dirty, re-read both Hadoop books, and other articles.

Cheers,
Mark

On Mon, Apr 20, 2009 at 10:24 AM, Arun C Murthy <a...@yahoo-inc.com> wrote:
>
> On Apr 20, 2009, at 9:56 AM, Mark Kerzner wrote:
>
>> Hi,
>>
>> I ran a Hadoop MapReduce task in the local mode, reading and writing from
>> HDFS, and it took 2.5 minutes. Essentially the same operations on the
>> local file system without MapReduce took 1/2 minute. Is this to be
>> expected?
>
> Hmm... some overhead is expected, but this seems too much. What version of
> Hadoop are you running?
>
> It's hard to help without more details about your application,
> configuration etc., but I'll try...
>
>> It seemed that the system lost most of the time in the MapReduce
>> operation, such as after these messages
>>
>> 09/04/19 23:23:01 INFO mapred.LocalJobRunner: reduce > reduce
>> 09/04/19 23:23:01 INFO mapred.JobClient:  map 100% reduce 92%
>> 09/04/19 23:23:04 INFO mapred.LocalJobRunner: reduce > reduce
>>
>> it waited for a long time. The final output lines were
>
> It could either be the reduce-side merge or the hdfs-write. Can you check
> your task-logs and data-node logs?
>
>> 09/04/19 23:24:13 INFO mapred.JobClient:   Combine input records=185
>> 09/04/19 23:24:13 INFO mapred.JobClient:   Combine output records=185
>
> That shows that the combiner is useless for this app; turn it off - it
> adds unnecessary overhead.
>> 09/04/19 23:24:13 INFO mapred.JobClient: File Systems
>> 09/04/19 23:24:13 INFO mapred.JobClient:   HDFS bytes read=138103444
>> 09/04/19 23:24:13 INFO mapred.JobClient:   HDFS bytes written=107357785
>> 09/04/19 23:24:13 INFO mapred.JobClient:   Local bytes read=282509133
>> 09/04/19 23:24:13 INFO mapred.JobClient:   Local bytes written=376697552
>
> For the amount of data you are processing, you are doing far too much
> local-disk I/O. 'Local bytes written' should be _very_ close to 'Map
> output bytes', i.e. 91M for maps and zero for reduces. (See the counters
> table on the job-details web UI.)
>
> There are a few knobs you need to tweak to get closer to optimal
> performance. The good news is that it's doable - the bad news is that one
> _has_ to get his/her fingers dirty...
>
> Some knobs you will be interested in are:
>
> Map-side:
> * io.sort.mb
> * io.sort.factor
> * io.sort.record.percent
> * io.sort.spill.percent
>
> Reduce-side:
> * mapred.reduce.parallel.copies
> * mapred.reduce.copy.backoff
> * mapred.job.shuffle.input.buffer.percent
> * mapred.job.shuffle.merge.percent
> * mapred.inmem.merge.threshold
> * mapred.job.reduce.input.buffer.percent
>
> Check the description of each of them in hadoop-default.xml or
> mapred-default.xml (depending on the version of Hadoop you are running).
> Some more details are available here:
> http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/TuningAndDebuggingMapReduce_ApacheConEU09.pdf
>
> hth,
> Arun
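For anyone following along: the knobs Arun lists are overridden per-site or per-job as properties. A sketch of what such an override might look like in hadoop-site.xml for a 0.18.x install - the property names are from the list above, but the values here are purely illustrative, not recommendations; the right values depend on heap size, record size, and the job itself:

```xml
<!-- hadoop-site.xml fragment; values are illustrative only -->
<property>
  <name>io.sort.mb</name>
  <value>200</value>
  <!-- map-side sort buffer; larger values mean fewer spills to local disk -->
</property>
<property>
  <name>io.sort.factor</name>
  <value>100</value>
  <!-- number of streams merged at once during sort/merge -->
</property>
<property>
  <name>mapred.reduce.parallel.copies</name>
  <value>10</value>
  <!-- parallel fetches in the shuffle phase -->
</property>
```

The same properties can also be set per-job on the JobConf before submission, which is usually preferable while experimenting.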