Re: Large number of map output keys and performance issues.

2009-05-14 Thread Chuck Lam
just thinking out loud here to see if anything strikes a chord.

since you're talking about an access log, i imagine the data is pretty
skewed, i.e. a good percentage of the accesses go to a single resource. if you
use resource id as the key, that means a good percentage of the intermediate
data is shuffled to just one reducer.

now the usual solution is to use a combiner, and it seems like you've done
that already. however, given that you're only using 3 task trackers, there's
still a chance that one reducer ends up doing most of the work.

anyway, a quick way to check this is to set your reducer to the
IdentityReducer and run the job again. That way you can see the input to each
reducer and check whether it's balanced or not.
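
For concreteness, a minimal sketch of how that check could be wired up with the
0.18-era JobConf API; the mapper and combiner class names below are placeholders
for whatever the job already uses, not actual classes from this thread:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.SequenceFileInputFormat;
    import org.apache.hadoop.mapred.lib.IdentityReducer;

    public class ReducerBalanceCheck {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(ReducerBalanceCheck.class);
        conf.setJobName("reducer-balance-check");

        // Keep the existing mapper and combiner (placeholder names), but swap
        // in IdentityReducer so each reducer simply writes out whatever
        // intermediate data was shuffled to it.
        conf.setMapperClass(AccessCountMapper.class);
        conf.setCombinerClass(AccessCountReducer.class);
        conf.setReducerClass(IdentityReducer.class);

        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(LongWritable.class);

        // The input in this thread is a sequence file.
        conf.setInputFormat(SequenceFileInputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
        // Afterwards, compare the sizes of the part-00000..part-0000N output
        // files (or the per-reduce counters in the web UI): one part file much
        // larger than the rest means the intermediate data is skewed.
      }
    }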




On Thu, May 14, 2009 at 1:04 PM, Tiago Macambira macamb...@gmail.com wrote:

 On Wed, May 6, 2009 at 5:29 PM, Todd Lipcon t...@cloudera.com wrote:
  Hi Tiago,

 Hi there.

 First of all, sorry for the late reply --- I was investigating the
 issue further before replying.

 Just to make the whole thing clear(er), let me add some numbers and
 explain my problem.

 I have a ~80GB sequence file holding entries for 3 million users,
 recording how many times those users accessed approx. 160 million
 distinct resources. There were approx. 5000 million accesses for those
 resources. Each entry in the seq. file is encoded using Google
 protocol buffers and compressed with gzip (don't ask...). I have to
 extract some metrics from this data. To start, I've chosen to rank
 resources based on the number of accesses.

 I thought that this would be a pretty simple MR application to write
 and run. A beefed-up WordCount, if I may:

 Map:
   for each user:
     for each resource_id accessed by current user:
       # resource.id is a LongWritable
       collect(resource.id, 1)

 Combine and Reduce work just as in WordCount, except that keys and
 values are both LongWritables. The final step to calculate the ranking
 --- sorting the resources based on their accumulated access count ---
 is done using the unix sort command. Nothing really fancy here.
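
 For reference, a minimal sketch of what such a map/reduce pair might look
 like with the 0.18-era API; UserRecord and its accessor are hypothetical
 stand-ins for the protocol-buffer wrapper, and the sequence-file key is
 assumed to be the user id:

    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    // For every resource the current user accessed, emit (resourceId, 1).
    // UserRecord is a hypothetical Writable wrapping the protobuf entry.
    public class AccessCountMapper extends MapReduceBase
        implements Mapper<LongWritable, UserRecord, LongWritable, LongWritable> {

      private static final LongWritable ONE = new LongWritable(1);
      private final LongWritable resourceId = new LongWritable();

      public void map(LongWritable userId, UserRecord user,
                      OutputCollector<LongWritable, LongWritable> output,
                      Reporter reporter) throws IOException {
        for (long id : user.getAccessedResourceIds()) {  // hypothetical accessor
          resourceId.set(id);
          output.collect(resourceId, ONE);
        }
      }
    }

    // Sums the counts per resource id; also usable as the combiner.
    class AccessCountReducer extends MapReduceBase
        implements Reducer<LongWritable, LongWritable, LongWritable, LongWritable> {

      public void reduce(LongWritable resourceId, Iterator<LongWritable> counts,
                         OutputCollector<LongWritable, LongWritable> output,
                         Reporter reporter) throws IOException {
        long sum = 0;
        while (counts.hasNext()) {
          sum += counts.next().get();
        }
        output.collect(resourceId, new LongWritable(sum));
      }
    }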

 This mapreduce consumed approx. 2 hours to run --- I said 4 hours in
 the previous e-mail, sorry :-) I may be wrong, but that seems like quite a
 long time to compute a ranking. I coded a similar application in a
 filter-stream framework and it took less than half an hour to run ---
 even with most of its data being read from the network.

 So, what I was wondering is: what am I doing wrong?
  - Is it just a matter of fine-tuning my Hadoop cluster setup?
  - This is a valid MR application, right?
  - Is it just that I have too few IO units? I'm using 4 DataNodes
    and 3 TaskTrackers (dual octa-core machines).

 Now, back to our regular programming...


  Here are a couple thoughts:
 
  1) How much data are you outputting? Obviously there is a certain amount of
  IO involved in actually outputting data versus not ;-)

 Well, the map phase is outputting 180GB of data for approx. 1000
 million intermediate keys. I know it is going to take some time to
 save this amount of data to disk, but still... this much time?


  2) Are you using a reduce phase in this job? If so, since you're cutting off
  the data at map output time, you're also avoiding a whole sort computation
  which involves significant network IO, etc.

 Yes, both a Combine and a Reduce phase. From the original 5000 million
 keys, the combine outputs 1000 million and the reduce ends up outputting
 (the expected) 160 million keys.


  3) What version of Hadoop are you running?

 0.18.3

 Cheers.

 Tiago Alves Macambira
 --
 I may be drunk, but in the morning I will be sober, while you will
 still be stupid and ugly. -Winston Churchill



Re: Large number of map output keys and performance issues.

2009-05-07 Thread jason hadoop
It may simply be that your JVMs are spending their time doing garbage
collection instead of running your tasks.
My book, in chapter 6, has a section on how to tune your jobs and how to
determine what to tune. That chapter is available now as an alpha.
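
As a rough starting point (the property name is from the 0.18 era; the heap
size and GC flags below are purely illustrative, not recommendations), you can
enlarge the task JVM heap and turn on GC logging so the task logs show how
much time is spent collecting:

    import org.apache.hadoop.mapred.JobConf;

    public class GcTuningSketch {
      // mapred.child.java.opts sets the JVM options for map/reduce child
      // tasks (the default around 0.18 was -Xmx200m). The flags below give
      // a larger heap plus verbose GC logging.
      public static void applyGcOptions(JobConf conf) {
        conf.set("mapred.child.java.opts",
                 "-Xmx512m -verbose:gc -XX:+PrintGCDetails");
      }
    }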

On Wed, May 6, 2009 at 1:29 PM, Todd Lipcon t...@cloudera.com wrote:

 Hi Tiago,

 Here are a couple thoughts:

 1) How much data are you outputting? Obviously there is a certain amount of
 IO involved in actually outputting data versus not ;-)

 2) Are you using a reduce phase in this job? If so, since you're cutting off
 the data at map output time, you're also avoiding a whole sort computation
 which involves significant network IO, etc.

 3) What version of Hadoop are you running?

 Thanks
 -Todd

 On Wed, May 6, 2009 at 12:23 PM, Tiago Macambira macamb...@gmail.com
 wrote:

  I am developing a MR application w/ Hadoop that is generating a really
  large number of output keys during its map phase, and it is having
  abysmal performance.
 
  While just reading the data takes 20 minutes, and processing it without
  emitting any map output takes around 30 min, running the full
  application takes around 4 hours. Is this a known or expected issue?
 
  Cheers.
  Tiago Alves Macambira
  --
  I may be drunk, but in the morning I will be sober, while you will
  still be stupid and ugly. -Winston Churchill
 




-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com, a community for Hadoop Professionals


Large number of map output keys and performance issues.

2009-05-06 Thread Tiago Macambira
I am developing a MR application w/ Hadoop that is generating a really
large number of output keys during its map phase, and it is having
abysmal performance.

While just reading the data takes 20 minutes, and processing it without
emitting any map output takes around 30 min, running the full
application takes around 4 hours. Is this a known or expected issue?

Cheers.
Tiago Alves Macambira
--
I may be drunk, but in the morning I will be sober, while you will
still be stupid and ugly. -Winston Churchill


Re: Large number of map output keys and performance issues.

2009-05-06 Thread Todd Lipcon
Hi Tiago,

Here are a couple thoughts:

1) How much data are you outputting? Obviously there is a certain amount of
IO involved in actually outputting data versus not ;-)

2) Are you using a reduce phase in this job? If so, since you're cutting off
the data at map output time, you're also avoiding a whole sort computation
which involves significant network IO, etc.

3) What version of Hadoop are you running?

Thanks
-Todd

On Wed, May 6, 2009 at 12:23 PM, Tiago Macambira macamb...@gmail.com wrote:

 I am developing a MR application w/ Hadoop that is generating a really
 large number of output keys during its map phase, and it is having
 abysmal performance.

 While just reading the data takes 20 minutes, and processing it without
 emitting any map output takes around 30 min, running the full
 application takes around 4 hours. Is this a known or expected issue?

 Cheers.
 Tiago Alves Macambira
 --
 I may be drunk, but in the morning I will be sober, while you will
 still be stupid and ugly. -Winston Churchill