Good to know. Thanks for the update.
- Tim.
On Jul 25, 2012, at 5:21 AM, "Dave Shine" <[email protected]>
wrote:
> Just wanted to follow up on this issue. It turned out I was overlooking the
> obvious: over 8% of the mapper output had exactly the same key, which was
> actually an invalid value. Changing the mapper to not emit records with an
> invalid key made the problem go away.
>
> Moral of the story, verify the data before you blame the software.
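The mapper-side fix described above can be sketched in plain Java (no Hadoop dependencies, just the filtering logic). The validity predicate and the "-" sentinel below are hypothetical stand-ins for whatever marks a key invalid in the real data set:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class MapperFilter {
    // Hypothetical validity check: treat null, empty, or a "-" sentinel as
    // invalid. The real predicate depends on the data being processed.
    static boolean isValidKey(String key) {
        return key != null && !key.isEmpty() && !key.equals("-");
    }

    // Emit only records whose key passes validation, mirroring the fix
    // above: drop invalid-key records in the mapper so they never reach
    // a reducer.
    static List<Map.Entry<String, String>> emitValid(
            List<Map.Entry<String, String>> records) {
        List<Map.Entry<String, String>> out = new ArrayList<>();
        for (Map.Entry<String, String> r : records) {
            if (isValidKey(r.getKey())) {
                out.add(r);
            }
        }
        return out;
    }
}
```

In a real Hadoop mapper the same guard would simply wrap the context.write() call.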
>
> Dave Shine
> Sr. Software Engineer
> 321.939.5093 direct | 407.314.0122 mobile
> CI Boost(tm) Clients Outperform Online(tm) www.ciboost.com
>
>
> -----Original Message-----
> From: Dave Shine [mailto:[email protected]]
> Sent: Friday, July 20, 2012 1:13 PM
> To: [email protected]
> Subject: RE: Distributing Keys across Reducers
>
> Yes, that is a possibility, but it would take some significant rearchitecture.
> I was assuming that was what I would have to do, until I saw the key
> distribution problem and thought I might be able to buy some relief by
> addressing that.
>
> The job runs once per day, starting at 1:00 AM EDT. I have changed it to use
> fewer reducers just to see how that affects the distribution.
>
> Dave Shine
> Sr. Software Engineer
> 321.939.5093 direct | 407.314.0122 mobile CI Boost(tm) Clients Outperform
> Online(tm) www.ciboost.com
>
>
> -----Original Message-----
> From: Tim Broberg [mailto:[email protected]]
> Sent: Friday, July 20, 2012 1:03 PM
> To: [email protected]
> Subject: RE: Distributing Keys across Reducers
>
> Just a thought, but can you deal with the problem with increased granularity
> by simply making the jobs smaller?
>
> If you have enough jobs, when one takes twice as long there will be plenty of
> other small jobs to employ the other nodes, right?
>
> - Tim.
>
> ________________________________________
> From: David Rosenstrauch [[email protected]]
> Sent: Friday, July 20, 2012 7:45 AM
> To: [email protected]
> Subject: Re: Distributing Keys across Reducers
>
> On 07/20/2012 09:20 AM, Dave Shine wrote:
>> I have a job that is emitting over 3 billion rows from the map to the
>> reduce. The job is configured with 43 reduce tasks. A perfectly even
>> distribution would amount to about 70 million rows per reduce task. However
>> I actually got around 60 million for most of the tasks, one task got over
>> 100 million, and one task got almost 350 million. This uneven distribution
>> caused the job to run exceedingly long.
>>
>> I believe this is referred to as a "key skew problem", which I know is
>> heavily dependent on the actual data being processed. Can anyone point me
>> to any blog posts, white papers, etc. that might give me some options on how
>> to deal with this issue?
>
> Hadoop lets you override the default partitioner and replace it with your
> own. This lets you write a custom partitioning scheme which distributes your
> data more evenly.
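As a rough illustration of why one hot key overloads a single task, and what a custom scheme can do about it: Hadoop's default HashPartitioner sends every record with the same key to the same reducer. The sketch below reproduces that arithmetic in plain Java (not Hadoop's actual Partitioner class) and shows one possible custom scheme; the HOT_KEY value and the round-robin salting are assumptions for the sketch:

```java
public class SkewAwarePartitioner {
    // The arithmetic used by Hadoop's default HashPartitioner: mask off the
    // sign bit, then modulo by the number of reduce tasks. Every record with
    // the same key lands on the same reducer.
    static int defaultPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Hypothetical placeholder for a known hot key in the data.
    static final String HOT_KEY = "HOT_KEY";
    static int salt = 0;

    // Skew-aware variant: spread the hot key round-robin across all
    // reducers; every other key falls back to the default scheme.
    static int partition(String key, int numReduceTasks) {
        if (HOT_KEY.equals(key)) {
            salt = (salt + 1) % numReduceTasks;
            return salt;
        }
        return defaultPartition(key, numReduceTasks);
    }
}
```

Note that salting a hot key means its values no longer all arrive at one reducer, so the reduce logic must tolerate partial aggregation (e.g. a second combining pass). In the case above, where the hot key was invalid anyway, filtering in the mapper is the simpler fix.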
>
> HTH,
>
> DR
>