just thinking out loud here to see if anything strikes a chord.
since you're talking about an access log, i imagine the data is pretty
skewed, i.e., a good percentage of the accesses go to a single resource. if
you use resource id as the key, that means a good percentage of the
intermediate data is shuffled to just one reducer.
now the usual solution is to use a combiner, and it seems like you've done
that already. however, given that you're only using 3 task trackers, there's
still some freak chance that one reducer ends up doing most of the work.
anyways, a quick way to check this is to set your reducer to the
IdentityReducer and run the job again. that way you get the raw input to
each reducer and can check whether it's balanced.
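To make the skew point concrete, here's a small sketch (my own illustration, not part of the job; the 60% hot-key share and record counts are made-up numbers) of how a skewed key set lands on 3 reducers under hash partitioning, which is essentially what Hadoop's default HashPartitioner does (key hash modulo the number of reduce tasks):

```python
# Hypothetical illustration of key skew vs. hash partitioning.
# All numbers here are invented; the real job has 160 million resources.
import random
from collections import Counter

random.seed(0)
NUM_REDUCERS = 3

# Assume 60% of all accesses hit one hot resource (id 0); the rest are
# spread uniformly over the other resource ids.
keys = [0 if random.random() < 0.6 else random.randrange(1, 160_000)
        for _ in range(100_000)]

# hash(key) % numReduceTasks: every record for the hot key lands on the
# same reducer, so that reducer's input dwarfs the others.
per_reducer = Counter(hash(k) % NUM_REDUCERS for k in keys)
for r in range(NUM_REDUCERS):
    print(f"reducer {r}: {per_reducer[r]:6d} records")
```

Swapping in the IdentityReducer as suggested above would show you the real version of these per-reducer input counts.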
On Thu, May 14, 2009 at 1:04 PM, Tiago Macambira macamb...@gmail.com wrote:
On Wed, May 6, 2009 at 5:29 PM, Todd Lipcon t...@cloudera.com wrote:
Hi Tiago,
Hi there.
First of all, sorry for the late reply --- I was investigating the
issue further before replying.
Just to make the whole thing clear(er), let me add some numbers and
explain my problem.
I have a ~80GB sequence file holding entries for 3 million users,
recording how many times those users accessed approx. 160 million
distinct resources. There were approx. 5000 million accesses to those
resources in total. Each entry in the seq. file is encoded using Google
protocol buffers and compressed with gzip (don't ask...). I have to
extract some metrics from this data. To start, I've chosen to rank
resources by their number of accesses.
I thought that this would be a pretty simple MR application to write
and run. A beefed up WordCount, if I may:
Map:
    for each user:
        for each resource_id accessed by current user:
            # resource_id is a LongWritable
            collect(resource_id, 1)
Combine and Reduce work just as in WordCount, except that keys and
values are both LongWritables. The final step to calculate the ranking
--- sorting the resources based on their accumulated access count ---
is done using the unix sort command. Nothing really fancy here.
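For what it's worth, the whole job can be restated as a toy, in-memory Python sketch (my own illustration; the access log and ids below are invented, and Counter stands in for the combine + reduce steps):

```python
# Toy stand-in for the "beefed up WordCount" described above.
# The access log and resource ids are made up for illustration.
from collections import Counter

access_log = {                    # user -> resource ids accessed
    "user_a": [7, 7, 3],
    "user_b": [7, 5],
    "user_c": [3, 7],
}

counts = Counter()                # plays the role of combine + reduce
for user, resources in access_log.items():
    for resource_id in resources:
        counts[resource_id] += 1  # map: collect(resource_id, 1)

# the final ranking step (done with the unix sort command in the real job)
ranking = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
print(ranking)  # [(7, 4), (3, 2), (5, 1)]
```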
This MapReduce took approx. 2 hours to run --- I said 4 hours in
the previous e-mail, sorry :-) IMBW, but that seems like quite a long
time to compute a ranking. I coded a similar application in a
filter-stream framework and it took less than half an hour to run ---
even with most of its data being read over the network.
So, what I was wondering is: what am I doing wrong?
- Is it just a matter of fine-tuning my hadoop cluster setup?
- This is a valid MR application, right?
- Is it just that I have too few IO units? I'm using 4 DataNodes
and 3 TaskTrackers (dual octa-core machines).
Now, back to our regular programming...
Here are a couple thoughts:
1) How much data are you outputting? Obviously there is a certain amount
of IO involved in actually outputting data versus not ;-)
Well, the map phase is outputting 180GB of data for approx. 1000
million intermediate keys. I know it is going to take some time to
write this amount of data to disk, but still... this much time?
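A back-of-envelope check on those figures (my arithmetic, using the round numbers above and assuming decimal units): 180 GB over 1000 million records is about 180 bytes per intermediate record, while the raw payload of a (LongWritable, LongWritable) pair is only 16 bytes, so if the figures are exact, most of that volume would be per-record overhead rather than payload.

```python
# Back-of-envelope: bytes per intermediate record, from the round
# figures in the mail (assumption: decimal GB, exact counts).
map_output_bytes = 180e9          # "180GB of data"
intermediate_records = 1000e6     # "1000 million intermediate keys"

bytes_per_record = map_output_bytes / intermediate_records
raw_payload = 8 + 8               # LongWritable key + LongWritable value

print(bytes_per_record)               # 180.0
print(bytes_per_record / raw_payload) # ratio of on-disk size to raw payload
```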
2) Are you using a reduce phase in this job? If so, since you're cutting
off the data at map output time, you're also avoiding a whole sort
computation which involves significant network IO, etc.
Yes, both a Reduce and a Combine phase. From the original 5000 million
records, the combine outputs 1000 million and the reduce ends up
outputting the (expected) 160 million keys.
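Running the arithmetic on those counts (my numbers, derived only from the figures in this mail): the combiner collapses records about 5:1, but each of the 160 million final keys still reaches the reducers roughly 6 times on average, so a substantial shuffle remains even with combining.

```python
# Reduction factors implied by the record counts above (round figures).
map_emits = 5000e6      # records emitted by the map phase
combine_out = 1000e6    # records after the combiner
reduce_out = 160e6      # final distinct resource ids

print(map_emits / combine_out)    # 5.0   combiner reduction factor
print(combine_out / reduce_out)   # 6.25  avg records per key at reduce
print(map_emits / reduce_out)     # 31.25 best case with perfect combining
```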
3) What version of Hadoop are you running?
0.18.3
Cheers.
Tiago Alves Macambira
--
I may be drunk, but in the morning I will be sober, while you will
still be stupid and ugly. -Winston Churchill