I used three different sample.txt files and was able to replicate the
error. The first was 1.5MB, the second 66MB, and the last 428MB. I see the
same problem regardless of the input file size: the running time of
wordcount increases with the number of mappers and reducers specified. If
the input file is the problem, how big does it have to be before the issue
disappears entirely?

If pseudo-distributed mode is the issue, what mode should I be running on
my machine, given its specs? Once again, it is a SINGLE Mac Pro with 16GB
of RAM, 4 1TB hard disks, and 2 quad-core processors.

I'm not sure if it's HADOOP-2771, since the sort/merge (shuffle) is what
seems to be taking the longest:
2 M/R ==> map: 18 sec, shuffle: 15 sec, reduce: 9 sec
4 M/R ==> map: 19 sec, shuffle: 37 sec, reduce: 2 sec
8 M/R ==> map: 21 sec, shuffle: 1 min 10 sec, reduce: 1 sec

To make sure it's not the combiner, I removed it and reran everything, and
got the same bottom line: as the number of mappers and reducers increases,
the running time goes up, with the majority of the time apparently spent
in sort/merge.
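
For reference, here is roughly what the combiner-free run looked like. This
is only a sketch of the standard wordcount written against the old
org.apache.hadoop.mapred API from 0.18, with the combiner line commented
out; the class name WordCountNoCombiner and the fixed argument order are my
own simplification, not the stock example:

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCountNoCombiner {

  // Standard word-count mapper: emit (word, 1) for every token in the line.
  public static class MapClass extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        output.collect(word, one);
      }
    }
  }

  // Standard word-count reducer: sum the counts emitted for each word.
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  // Usage: WordCountNoCombiner <numMaps> <numReduces> <input> <output>
  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(WordCountNoCombiner.class);
    conf.setJobName("wordcount-no-combiner");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(MapClass.class);
    // Combiner deliberately left out for this re-run:
    // conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);

    conf.setNumMapTasks(Integer.parseInt(args[0]));
    conf.setNumReduceTasks(Integer.parseInt(args[1]));

    FileInputFormat.setInputPaths(conf, new Path(args[2]));
    FileOutputFormat.setOutputPath(conf, new Path(args[3]));

    JobClient.runJob(conf);
  }
}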

Another thing we noticed is that the CPUs are very active during the map
phase, but once the map phase reaches 100% and only the reduce appears to
be running, the CPUs all become idle. Furthermore, no matter how many
mappers I specify, all the CPUs become very active while a job is running.
Why is this? If I specify 2 mappers and 2 reducers, shouldn't just 2 or 4
CPUs be active? Why are all 8 active?
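
In case it's relevant, here is my (possibly mistaken) understanding of the
settings involved, written as a tiny standalone sketch; the class name
ShowTaskSettings is just for illustration, and the two tasktracker
properties are the ones I see in hadoop-default.xml:

import org.apache.hadoop.mapred.JobConf;

public class ShowTaskSettings {
  public static void main(String[] args) {
    JobConf conf = new JobConf();

    // My reading of the example driver: "-m 2 -r 2" becomes these two calls.
    conf.setNumMapTasks(2);     // documented as only a hint; the real map count
                                // comes from the InputFormat's input splits
    conf.setNumReduceTasks(2);  // the reduce count is used as given

    // Separately, the number of tasks one tasktracker runs at a time seems to
    // be capped by these properties (defaults come from hadoop-default.xml):
    System.out.println("mapred.tasktracker.map.tasks.maximum = "
        + conf.get("mapred.tasktracker.map.tasks.maximum"));
    System.out.println("mapred.tasktracker.reduce.tasks.maximum = "
        + conf.get("mapred.tasktracker.reduce.tasks.maximum"));
  }
}

If that reading is right, I still don't see why all 8 cores light up during
the map phase.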

Since I can reproduce this error using Hadoop's standard word count example,
I was hoping that someone else could tell me if they can reproduce this too.
Is it true that when you increase the number of mappers and reducers on your
systems, the running time of wordcount goes up?

Thanks for the help! I'm looking forward to your responses.

-SM

On Thu, Mar 5, 2009 at 2:57 AM, Amareshwari Sriramadasu <
amar...@yahoo-inc.com> wrote:

> Are you hitting HADOOP-2771?
> -Amareshwari
>
> Sandy wrote:
>
>> Hello all,
>>
>> For the sake of benchmarking, I ran the standard hadoop wordcount example
>> on
>> an input file using 2, 4, and 8 mappers and reducers for my job.
>> In other words,  I do:
>>
>> time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 2 -r 2
>> sample.txt output
>> time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 4 -r 4
>> sample.txt output2
>> time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 8 -r 8
>> sample.txt output3
>>
>> Strangely enough, this increase in mappers and reducers results in slower
>> running times!
>> - On 2 mappers and reducers it ran for 40 seconds
>> - On 4 mappers and reducers it ran for 60 seconds
>> - On 8 mappers and reducers it ran for 90 seconds!
>>
>> Please note that the "sample.txt" file is identical in each of these runs.
>>
>> I have the following questions:
>> - Shouldn't wordcount get -faster- with additional mappers and reducers,
>> instead of slower?
>> - If it does get faster for other people, why does it become slower for
>> me?
>> I am running hadoop in pseudo-distributed mode on a single 64-bit Mac Pro
>> with 2 quad-core processors, 16 GB of RAM, and 4 1TB HDs.
>>
>> I would greatly appreciate it if someone could explain this behavior to
>> me,
>> and tell me if I'm running this wrong. How can I change my settings (if at
>> all) to get wordcount running faster when I increase the number of maps
>> and reduces?
>>
>> Thanks,
>> -SM
>>
>>
>>
>
>
