Good to know. Thanks for the update.
- Tim.
On Jul 25, 2012, at 5:21 AM, "Dave Shine" <[email protected]>
wrote:
> Just wanted to follow up on this issue. It turned out I was overlooking the
> obvious: over 8% of the mapper output had exactly the same key, which was
> actually an invalid value. Changing the mapper to not emit records with an
> invalid key made the problem go away.
>
> Moral of the story, verify the data before you blame the software.
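The mapper-side fix described above can be sketched in plain Java (no Hadoop dependencies, just the filtering logic). The validity predicate and the "-" sentinel below are hypothetical stand-ins for whatever marks a key invalid in the real data set:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class MapperFilter {
    // Hypothetical validity check: treat null, empty, or a "-" sentinel as
    // invalid. The real predicate depends on the data being processed.
    static boolean isValidKey(String key) {
        return key != null && !key.isEmpty() && !key.equals("-");
    }

    // Emit only records whose key passes validation, mirroring the fix
    // above: drop invalid-key records in the mapper so they never reach
    // a reducer.
    static List<Map.Entry<String, String>> emitValid(
            List<Map.Entry<String, String>> records) {
        List<Map.Entry<String, String>> out = new ArrayList<>();
        for (Map.Entry<String, String> r : records) {
            if (isValidKey(r.getKey())) {
                out.add(r);
            }
        }
        return out;
    }
}
```

In a real Hadoop mapper the same guard would simply wrap the context.write() call.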
>
> Dave Shine
> Sr. Software Engineer
> 321.939.5093 direct | 407.314.0122 mobile
> CI Boost(tm) Clients Outperform Online(tm) www.ciboost.com
>
>
> -----Original Message-----
> From: Dave Shine [mailto:[email protected]]
> Sent: Friday, July 20, 2012 1:13 PM
> To: [email protected]
> Subject: RE: Distributing Keys across Reducers
>
> Yes, that is a possibility, but it would take some significant rearchitecture.
> I was assuming that was what I would have to do, until I saw the key
> distribution problem and thought I might be able to buy some relief by
> addressing that.
>
> The job runs once per day, starting at 1:00 AM EDT. I have changed it to use
> fewer reducers just to see how that affects the distribution.
>
> Dave Shine
> Sr. Software Engineer
> 321.939.5093 direct | 407.314.0122 mobile CI Boost(tm) Clients Outperform
> Online(tm) www.ciboost.com
>
>
> -----Original Message-----
> From: Tim Broberg [mailto:[email protected]]
> Sent: Friday, July 20, 2012 1:03 PM
> To: [email protected]
> Subject: RE: Distributing Keys across Reducers
>
> Just a thought, but can you deal with the problem with increased granularity
> by simply making the jobs smaller?
>
> If you have enough jobs, when one takes twice as long there will be plenty of
> other small jobs to employ the other nodes, right?
>
> - Tim.
>
> ________________________________________
> From: David Rosenstrauch [[email protected]]
> Sent: Friday, July 20, 2012 7:45 AM
> To: [email protected]
> Subject: Re: Distributing Keys across Reducers
>
> On 07/20/2012 09:20 AM, Dave Shine wrote:
>> I have a job that is emitting over 3 billion rows from the map to the
>> reduce. The job is configured with 43 reduce tasks. A perfectly even
>> distribution would amount to about 70 million rows per reduce task. However
>> I actually got around 60 million for most of the tasks, one task got over
>> 100 million, and one task got almost 350 million. This uneven distribution
>> caused the job to run exceedingly long.
>>
>> I believe this is referred to as a "key skew problem", which I know is
>> heavily dependent on the actual data being processed. Can anyone point me
>> to any blog posts, white papers, etc. that might give me some options on how
>> to deal with this issue?
>
> Hadoop lets you override the default partitioner and replace it with your
> own. This lets you write a custom partitioning scheme which distributes your
> data more evenly.
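As a rough illustration of why one hot key overloads a single task, and what a custom scheme can do about it: Hadoop's default HashPartitioner sends every record with the same key to the same reducer. The sketch below reproduces that arithmetic in plain Java (not Hadoop's actual Partitioner class) and shows one possible custom scheme; the HOT_KEY value and the round-robin salting are assumptions for the sketch:

```java
public class SkewAwarePartitioner {
    // The arithmetic used by Hadoop's default HashPartitioner: mask off the
    // sign bit, then modulo by the number of reduce tasks. Every record with
    // the same key lands on the same reducer.
    static int defaultPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Hypothetical placeholder for a known hot key in the data.
    static final String HOT_KEY = "HOT_KEY";
    static int salt = 0;

    // Skew-aware variant: spread the hot key round-robin across all
    // reducers; every other key falls back to the default scheme.
    static int partition(String key, int numReduceTasks) {
        if (HOT_KEY.equals(key)) {
            salt = (salt + 1) % numReduceTasks;
            return salt;
        }
        return defaultPartition(key, numReduceTasks);
    }
}
```

Note that salting a hot key means its values no longer all arrive at one reducer, so the reduce logic must tolerate partial aggregation (e.g. a second combining pass). In the case above, where the hot key was invalid anyway, filtering in the mapper is the simpler fix.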
>
> HTH,
>
> DR
>