Just thinking out loud here to see if anything strikes a chord.
Since you're talking about an access log, I imagine the data is pretty
skewed, i.e., a good percentage of the accesses are for one resource. If
you use resource id as the key, that means a good percentage of the
intermediate data is shuffled to a single reducer, which becomes a hot spot.
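One common way to blunt that skew is local aggregation before the shuffle. Here's a minimal pure-Python sketch of in-mapper combining (the function and key_fn are hypothetical names, not from the thread; in a Java Hadoop job the same effect comes from a Combiner class):

```python
from collections import defaultdict

def map_with_local_aggregation(records, key_fn):
    """In-mapper combining: aggregate counts per key inside the map
    task, so a hot key emits one (key, partial_count) pair instead of
    millions of (key, 1) pairs all headed for the same reducer."""
    counts = defaultdict(int)
    for record in records:
        counts[key_fn(record)] += 1
    # One pair per distinct key seen by this map task.
    return sorted(counts.items())
```

For an access log keyed by requested resource, this collapses however many hits a popular URL got in one input split down to a single intermediate record.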
On Wed, May 6, 2009 at 5:29 PM, Todd Lipcon wrote:
> Hi Tiago,
Hi there.
First of all, sorry for the late reply; I was investigating the
issue further before replying.
Just to make the whole thing clear(er), let me add some numbers and
explain my problem.
I have a ~80GB sequence file holding
It may simply be that your JVMs are spending their time doing garbage
collection instead of running your tasks.
My book has a section in chapter 6 on how to tune your jobs, and how to
determine what to tune. That chapter is available now as an alpha.
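An easy way to test the GC theory is to turn on GC logging for the task JVMs. With the old-style Hadoop configuration that means something like the following in the job conf (the heap size here is just an illustrative value, not a recommendation):

```xml
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx512m -verbose:gc -XX:+PrintGCDetails</value>
</property>
```

If the task logs then show back-to-back full GCs, the time is going to collection rather than your map code.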
On Wed, May 6, 2009 at 1:29 PM, Todd Lipcon wrote:
Hi Tiago,
Here are a couple thoughts:
1) How much data are you outputting? Obviously there is a certain amount of
IO involved in actually outputting data versus not ;-)
2) Are you using a reduce phase in this job? If so, since you're cutting off
the data at map output time, you're also avoiding the sort, spill, and
shuffle work for that data.
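Todd's first point compounds quickly at this scale. A back-of-envelope sketch (the record count, key/value sizes, and per-record overhead below are made-up assumptions, not numbers from the thread):

```python
def shuffle_bytes(num_records, avg_key_bytes, avg_value_bytes, overhead=8):
    """Rough estimate of the intermediate bytes a job must buffer,
    sort, spill, and shuffle when the map phase emits output."""
    return num_records * (avg_key_bytes + avg_value_bytes + overhead)

# 2 billion records with 16-byte keys and 8-byte values comes to
# roughly 64 GB of intermediate data to sort and move:
estimate = shuffle_bytes(2_000_000_000, 16, 8)
```

So "read only" vs. "read and emit" is not an apples-to-apples comparison; the emitting run pays for all of that sorting and spilling on top of the read.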
I am developing an MR application with Hadoop that generates a really large
number of output keys during its map phase, and its performance is abysmal.
While just reading the said data takes 20 minutes, and processing it without
outputting anything from the map takes around 30 minutes, running