Yes, that is a possibility, but it will take some significant re-architecture.
I was assuming that was what I was going to have to do until I saw the key
distribution problem and thought I might be able to buy some relief by
addressing that.

The job runs once per day, starting at 1:00 AM EDT.  I have changed it to use
fewer reducers just to see how that affects the distribution.

Dave Shine
Sr. Software Engineer
321.939.5093 direct |  407.314.0122 mobile
CI Boost(tm) Clients  Outperform Online(tm)  www.ciboost.com


-----Original Message-----
From: Tim Broberg [mailto:tim.brob...@exar.com]
Sent: Friday, July 20, 2012 1:03 PM
To: mapreduce-user@hadoop.apache.org
Subject: RE: Distributing Keys across Reducers

Just a thought, but can you deal with the problem with increased granularity by 
simply making the jobs smaller?

If you have enough jobs, when one takes twice as long there will be plenty of 
other small jobs to employ the other nodes, right?

    - Tim.

________________________________________
From: David Rosenstrauch [dar...@darose.net]
Sent: Friday, July 20, 2012 7:45 AM
To: mapreduce-user@hadoop.apache.org
Subject: Re: Distributing Keys across Reducers

On 07/20/2012 09:20 AM, Dave Shine wrote:
> I have a job that is emitting over 3 billion rows from the map to the reduce. 
>  The job is configured with 43 reduce tasks.  A perfectly even distribution 
> would amount to about 70 million rows per reduce task.  However I actually 
> got around 60 million for most of the tasks, one task got over 100 million, 
> and one task got almost 350 million.  This uneven distribution caused the job 
> to run exceedingly long.
>
> I believe this is referred to as a "key skew problem", which I know is 
> heavily dependent on the actual data being processed.  Can anyone point me to 
> any blog posts, white papers, etc. that might give me some options on how to 
> deal with this issue?

Hadoop lets you override the default partitioner and replace it with your own.  
This lets you write a custom partitioning scheme which distributes your data 
more evenly.
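
As a rough illustration of that idea, a skew-aware scheme might pin a few known
heavy keys to dedicated reducers and hash everything else across the remaining
ones. The sketch below is plain Java so that it runs without Hadoop on the
classpath; the class name, the static method, and the hot-key names are all
hypothetical. In a real job the same logic would sit in a subclass of
org.apache.hadoop.mapreduce.Partitioner, overriding
getPartition(key, value, numPartitions), and be registered with
job.setPartitionerClass(...).

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: mirrors the shape of Hadoop's Partitioner.getPartition,
// but takes a String key so it can run standalone.
final class SkewAwarePartitioning {

    // Hypothetical heavy keys, each pinned to its own reducer index.
    // In practice these would come from sampling the job's real data.
    private static final Map<String, Integer> HOT_KEYS =
            new HashMap<String, Integer>();
    static {
        HOT_KEYS.put("hot-key-a", 0);
        HOT_KEYS.put("hot-key-b", 1);
    }

    private SkewAwarePartitioning() {}

    static int getPartition(String key, int numPartitions) {
        int reserved = HOT_KEYS.size();

        // Too few reducers to reserve any: fall back to plain hashing,
        // the same formula Hadoop's default HashPartitioner uses.
        if (numPartitions <= reserved) {
            return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }

        // A known heavy key gets a reducer to itself.
        Integer pinned = HOT_KEYS.get(key);
        if (pinned != null) {
            return pinned;
        }

        // Everything else is hashed across the non-reserved reducers.
        int shared = numPartitions - reserved;
        return reserved + (key.hashCode() & Integer.MAX_VALUE) % shared;
    }
}
```

This keeps every record for a given key on one reducer, so it only helps when
the overloaded task is handling a heavy key plus a lot of other traffic. If a
single key accounts for most of the 350 million rows, that key still lands on
one reducer, and the job itself would likely need to change (for example, a
combiner or a two-stage aggregation).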

HTH,

DR

