Arun, thank you very much for the answer. I will turn off the combiner. I am
debugging intermediate MR steps now, so I am mostly interested in
performance for this; real tuning will come later, on a cluster. I am
running 0.18.3, but general pointers are good enough at this stage.
I am doing a lot of I/O, because I write every binary file that I get from a
SequenceFile back to local disk in order to process it with Tika. This
hasn't added significant overhead, although Tika can take an InputStream
directly, which is really what I should do.
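Roughly what I have in mind is the untested sketch below; the key/value
types (Text key, BytesWritable value holding the raw bytes) are only my
assumption about how the SequenceFile is laid out, and the parse signature
varies a bit across Tika versions:

import java.io.ByteArrayInputStream;
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class TikaFromSequenceFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path(args[0]);                 // SequenceFile of binaries on HDFS

    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    Text key = new Text();
    BytesWritable value = new BytesWritable();
    AutoDetectParser parser = new AutoDetectParser();

    while (reader.next(key, value)) {
      // Wrap the record's bytes in a stream and hand it straight to Tika,
      // instead of writing a temporary file to local disk first.
      InputStream in = new ByteArrayInputStream(value.getBytes(), 0, value.getLength());
      BodyContentHandler handler = new BodyContentHandler();
      Metadata metadata = new Metadata();
      parser.parse(in, handler, metadata);         // newer Tika versions also take a ParseContext
      System.out.println(key + ": extracted " + handler.toString().length() + " characters");
    }
    reader.close();
  }
}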

Oh, and yes, I already had performance problems writing to a SequenceFile
when I was copying binary files in from the local file system. I hoped that
writing from HDFS to HDFS would be faster - but maybe not. Anyway, it may all
change on different hardware, and I am going to try it on Elastic MapReduce
on Amazon.
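For reference, that loading step was essentially the following simplified
sketch (hypothetical paths, Text keys, BytesWritable values, and no real
error handling):

import java.io.File;
import java.io.FileInputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class LocalFilesToSequenceFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path(args[1]);                  // target SequenceFile on HDFS

    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, out, Text.class, BytesWritable.class);
    try {
      // Assumes args[0] is a flat local directory of regular files.
      for (File file : new File(args[0]).listFiles()) {
        byte[] bytes = new byte[(int) file.length()];
        FileInputStream in = new FileInputStream(file);
        try {
          int off = 0;
          while (off < bytes.length) {
            off += in.read(bytes, off, bytes.length - off);
          }
        } finally {
          in.close();
        }
        writer.append(new Text(file.getName()), new BytesWritable(bytes));
      }
    } finally {
      writer.close();
    }
  }
}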

And thank you for the link - I wish I had been at the conference! Anyway, at
this level I have to get my hands dirty, re-read both Hadoop books, and read
other articles.

Cheers,
Mark

On Mon, Apr 20, 2009 at 10:24 AM, Arun C Murthy <a...@yahoo-inc.com> wrote:

>
> On Apr 20, 2009, at 9:56 AM, Mark Kerzner wrote:
>
>  Hi,
>>
>> I ran a Hadoop MapReduce task in the local mode, reading and writing from
>> HDFS, and it took 2.5 minutes. Essentially the same operations on the
>> local
>> file system without MapReduce took 1/2 minute.  Is this to be expected?
>>
>>
> Hmm... some overhead is expected, but this seems too much. What version of
> Hadoop are you running?
>
> It's hard to help without more details about your application,
> configuration, etc., but I'll try...
>
>
>  It seemed that the system lost most of the time in the MapReduce
>> operation,
>> such as after these messages
>>
>> 09/04/19 23:23:01 INFO mapred.LocalJobRunner: reduce > reduce
>> 09/04/19 23:23:01 INFO mapred.JobClient:  map 100% reduce 92%
>> 09/04/19 23:23:04 INFO mapred.LocalJobRunner: reduce > reduce
>>
>> it waited for a long time. The final output lines were
>>
>>
> It could either be the reduce-side merge or the hdfs-write. Can you check
> your task-logs and data-node logs?
>
>  09/04/19 23:24:13 INFO mapred.JobClient:     Combine input records=185
>> 09/04/19 23:24:13 INFO mapred.JobClient:     Combine output records=185
>>
>
> That shows that the combiner is useless for this app; turn it off - it adds
> unnecessary overhead.
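With the old JobConf API, turning the combiner off is just a matter of not
registering it; a minimal sketch, where MyJobDriver, MyMapper and MyReducer
stand in for the real job classes:

import org.apache.hadoop.mapred.JobConf;

public class DropCombiner {
  static JobConf configureJob() {
    JobConf conf = new JobConf(MyJobDriver.class);
    conf.setMapperClass(MyMapper.class);
    conf.setReducerClass(MyReducer.class);
    // conf.setCombinerClass(MyReducer.class);  // removed: combine input == combine output,
    //                                          // so the combiner only added overhead
    return conf;
  }
}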
>
>  09/04/19 23:24:13 INFO mapred.JobClient:   File Systems
>> 09/04/19 23:24:13 INFO mapred.JobClient:     HDFS bytes read=138103444
>> 09/04/19 23:24:13 INFO mapred.JobClient:     HDFS bytes written=107357785
>> 09/04/19 23:24:13 INFO mapred.JobClient:     Local bytes read=282509133
>> 09/04/19 23:24:13 INFO mapred.JobClient:     Local bytes written=376697552
>>
>
> For the amount of data you are processing, you are doing far too much
> local-disk I/O.
> 'Local bytes written' should be _very_ close to 'Map output bytes', i.e.,
> 91M for the maps and zero for the reduces. (See the counters table on the
> job-details web UI.)
>
> There are a few knobs you need to tweak to get closer to optimal
> performance. The good news is that it's doable - the bad news is that one
> _has_ to get his/her fingers dirty...
>
> Some knobs you will be interested in are:
>
> Map-side:
> * io.sort.mb
> * io.sort.factor
> * io.sort.record.percent
> * io.sort.spill.percent
>
> Reduce-side:
> * mapred.reduce.parallel.copies
> * mapred.reduce.copy.backoff
> * mapred.job.shuffle.input.buffer.percent
> * mapred.job.shuffle.merge.percent
> * mapred.inmem.merge.threshold
> * mapred.job.reduce.input.buffer.percent
>
>
> Check description for each of them in hadoop-default.xml or
> mapred-default.xml (depending on the version of Hadoop you are running).
> Some more details available here:
> http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/TuningAndDebuggingMapReduce_ApacheConEU09.pdf
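As a rough illustration, these knobs can also be set programmatically through
JobConf in the job driver; the values below are arbitrary starting points for
experimentation, not recommendations:

import org.apache.hadoop.mapred.JobConf;

public class TuningExample {
  public static JobConf tunedConf() {
    JobConf conf = new JobConf(TuningExample.class);
    // Map-side sort/spill buffers
    conf.setInt("io.sort.mb", 200);                        // memory (MB) for sorting map output
    conf.setInt("io.sort.factor", 50);                     // number of streams merged at once
    conf.set("io.sort.spill.percent", "0.90");             // buffer fill level that triggers a spill
    // Reduce-side shuffle and merge
    conf.setInt("mapred.reduce.parallel.copies", 10);      // concurrent map-output fetches
    conf.set("mapred.job.shuffle.input.buffer.percent", "0.70");
    conf.set("mapred.job.reduce.input.buffer.percent", "0.50");
    return conf;
  }
}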
>
> hth,
> Arun
>
>
