On Wed, May 6, 2009 at 5:29 PM, Todd Lipcon <t...@cloudera.com> wrote:
> Hi Tiago,

Hi there.

First of all, sorry for the late reply --- I was investigating the
issue further before replying.

Just to make the whole thing clear(er), let me add some numbers and
explain my problem.

I have a ~80GB sequence file holding entries for 3 million users,
recording how many times each of those users accessed approx. 160
million distinct resources. In total there were approx. 5 billion
accesses to those resources. Each entry in the sequence file is
encoded with Google Protocol Buffers and compressed with gzip (don't
ask...). I have to extract some metrics from this data and, to start,
I've chosen to rank the resources by number of accesses.

I thought that this would be a pretty simple MR application to write
and run --- a beefed-up WordCount, if I may:

Map:
    for each user:
        for each resource_id accessed by the current user:
            # resource_id is emitted as a LongWritable
            collect(resource_id, 1)

Combine and Reduce work just as in WordCount, except that keys and
values are both LongWritables. The final step to calculate the ranking
--- sorting the resources by their accumulated access count --- is
done with the Unix sort command. Nothing really fancy here.
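
In case it makes the question clearer, here is roughly what I mean in
the old mapred API. This is just a sketch: the class names are made
up, I'm assuming the SequenceFile reaches the mapper as (LongWritable,
BytesWritable) pairs, and the protobuf/gzip decoding is elided.

// AccessCountMapper.java --- emits (resource_id, 1) for every access
// found in one user's entry
import java.io.IOException;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class AccessCountMapper extends MapReduceBase
    implements Mapper<LongWritable, BytesWritable, LongWritable, LongWritable> {

  private static final LongWritable ONE = new LongWritable(1);
  private final LongWritable resourceId = new LongWritable();

  public void map(LongWritable key, BytesWritable value,
                  OutputCollector<LongWritable, LongWritable> output,
                  Reporter reporter) throws IOException {
    // Decode the gzip-compressed protocol buffer held in 'value' (omitted),
    // then for every resource id the user accessed:
    //     resourceId.set(id);
    //     output.collect(resourceId, ONE);
  }
}

// AccessCountReducer.java --- sums the counts; used both as combiner and
// reducer, exactly like WordCount but with LongWritable keys and values
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class AccessCountReducer extends MapReduceBase
    implements Reducer<LongWritable, LongWritable, LongWritable, LongWritable> {

  public void reduce(LongWritable key, Iterator<LongWritable> values,
                     OutputCollector<LongWritable, LongWritable> output,
                     Reporter reporter) throws IOException {
    long sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new LongWritable(sum));
  }
}

With the default TextOutputFormat the reducer output is tab-separated
key/value text, so the final ranking is just something like
"sort -k2,2nr part-* > ranking.txt" over the output files.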

This MapReduce job took approx. 2 hours to run --- I said 4 hours in
the previous e-mail, sorry :-) I may be wrong, but that seems like
quite a long time to compute a ranking. I coded a similar application
in a filter-stream framework and it ran in under half an hour --- even
with most of its data being read over the network.

So, what I was wondering is: what am I doing wrong?
  -  Is it just a matter of fine-tuning my hadoop cluster setup?
  -  This is a valid MR application, right?
  -  Is it just that I have too few "IO units"? I'm using 4 DataNodes
and 3 TaskTrackers (dual octa-core machines); my current understanding
of the relevant knobs is in the snippet below.
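
(On the "IO units" point: as far as I understand, the number of map
and reduce tasks each TaskTracker runs concurrently is capped by the
two properties below in hadoop-site.xml --- the values are just a
guess for a 16-core box, so please correct me if this is the wrong
knob.)

  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>12</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>4</value>
  </property>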

Now, back to our regular programming...


> Here are a couple thoughts:
>
> 1) How much data are you outputting? Obviously there is a certain amount of
> IO involved in actually outputting data versus not ;-)

Well, the map phase outputs about 180GB of data, for approx. 1 billion
intermediate key/value pairs. I know it will take some time to write
that much data to disk, but still... this much time?
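
For reference, I suppose the job-level way to cut down that IO would
be to compress the intermediate map output, along these lines (just a
sketch --- paths, reducer count and buffer size are made up, and the
mapper/reducer classes are the ones from the sketch above):

// AccessRankingDriver.java --- sketch of the job driver with
// map-output compression turned on
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;

public class AccessRankingDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(AccessRankingDriver.class);
    conf.setJobName("resource-access-ranking");

    conf.setInputFormat(SequenceFileInputFormat.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));   // the ~80GB sequence file
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    conf.setMapperClass(AccessCountMapper.class);
    conf.setCombinerClass(AccessCountReducer.class);
    conf.setReducerClass(AccessCountReducer.class);
    conf.setOutputKeyClass(LongWritable.class);
    conf.setOutputValueClass(LongWritable.class);

    // Compress intermediate map output so less of the 180GB hits disk and the network.
    conf.setCompressMapOutput(true);
    conf.setMapOutputCompressorClass(GzipCodec.class);

    // Larger in-memory sort buffer (default is 100 MB) means fewer spills per map.
    conf.set("io.sort.mb", "200");
    conf.setNumReduceTasks(12);   // made-up number

    JobClient.runJob(conf);
  }
}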


> 2) Are you using a reduce phase in this job? If so, since you're cutting off
> the data at map output time, you're also avoiding a whole sort computation
> which involves significant network IO, etc.

Yes, both a Combine and a Reduce phase. From the original 5 billion
records, the combiner outputs about 1 billion, and the reducer ends up
outputting the expected 160 million keys.


> 3) What version of Hadoop are you running?

0.18.3

Cheers.

Tiago Alves Macambira
--
"I may be drunk, but in the morning I will be sober, while you will
still be stupid and ugly." -Winston Churchill
