Re: Large number of map output keys and performance issues.

2009-05-14 Thread Chuck Lam
just thinking out loud here to see if anything strikes a chord. since you're talking about an access log, i imagine the data is pretty skewed, i.e., a good percentage of the accesses are for one resource. if you use resource id as key, that means a good percentage of the intermediate data is shuffled to

Re: Large number of map output keys and performance issues.

2009-05-14 Thread Tiago Macambira
On Wed, May 6, 2009 at 5:29 PM, Todd Lipcon wrote:
> Hi Tiago,

Hi there. First of all, sorry for the late reply --- I was investigating the issue further before replying. Just to make the whole thing clear(er), let me add some numbers and explain my problem. I have a ~80GB sequence file holdin

Re: Large number of map output keys and performance issues.

2009-05-07 Thread jason hadoop
It may simply be that your JVMs are spending their time doing garbage collection instead of running your tasks. Chapter 6 of my book has a section on how to tune your jobs and how to determine what to tune. That chapter is available now as an alpha. On Wed, May 6, 2009 at 1:29 PM, Todd Lipcon

Re: Large number of map output keys and performance issues.

2009-05-06 Thread Todd Lipcon
Hi Tiago,

Here are a couple of thoughts:

1) How much data are you outputting? Obviously there is a certain amount of IO involved in actually outputting data versus not ;-)

2) Are you using a reduce phase in this job? If so, since you're cutting off the data at map output time, you're also avoiding

Large number of map output keys and performance issues.

2009-05-06 Thread Tiago Macambira
I am developing an MR application w/ Hadoop that generates a really large number of output keys during its map phase, and its performance is abysmal. While just reading the data takes 20 minutes, and processing it without outputting anything from the map takes around 30 min, runnin