Re: Distributing Keys across Reducers

2012-07-20 Thread syed kather
Dave Shine, can you share how much data each map task is taking? If the map tasks are uneven, it might be a hot-spotting problem. Have a look at http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/ . I had also faced same pr

Re: Distributing Keys across Reducers

2012-07-20 Thread John Armstrong
On 07/20/2012 09:20 AM, Dave Shine wrote: I believe this is referred to as a “key skew problem”, which I know is heavily dependent on the actual data being processed. Can anyone point me to any blog posts, white papers, etc. that might give me some options on how to deal with this issue? I don

Re: Distributing Keys across Reducers

2012-07-20 Thread Christoph Schmitz
Hi Dave, I haven't actually done this in practice, so take this with a grain of salt ;-) One way to circumvent your problem might be to add entropy to the keys, i.e., if your keys are "a", "b" etc. and you got too many "a"s and too many "b"s, you could inflate your keys randomly to be (a, 1)
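
The key-inflation idea in this message can be sketched outside Hadoop. The following is only an illustration under stated assumptions, not code from the thread and not the Hadoop API: `partition` is a hypothetical stand-in for a hash partitioner, and the key counts, reducer count, and `num_salts` value are made up.

```python
import hashlib
import random
from collections import Counter

def partition(key, num_reducers):
    """Hypothetical stand-in for a hash partitioner (not the Hadoop API)."""
    digest = hashlib.sha256(repr(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_reducers

def reducer_loads(keys, num_reducers, num_salts=1, seed=0):
    """Count rows per reducer when each key is inflated to (key, salt)."""
    rng = random.Random(seed)
    loads = Counter()
    for key in keys:
        # With num_salts == 1 the salt is always 0, i.e. no extra entropy.
        salted = (key, rng.randrange(num_salts))
        loads[partition(salted, num_reducers)] += 1
    return loads

# Made-up skewed input: far too many "a"s.
keys = ["a"] * 9000 + ["b"] * 1000
plain = reducer_loads(keys, num_reducers=4)
salted = reducer_loads(keys, num_reducers=4, num_salts=64)
print("max rows on one reducer, unsalted:", max(plain.values()))
print("max rows on one reducer, salted:  ", max(salted.values()))
```

The trade-off is that rows for one original key now arrive at several reducers, so each reducer only produces a partial result per key, and a second step has to merge those partials by stripping the salt.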

Re: Distributing Keys across Reducers

2012-07-20 Thread David Rosenstrauch
On 07/20/2012 09:20 AM, Dave Shine wrote: I have a job that is emitting over 3 billion rows from the map to the reduce. The job is configured with 43 reduce tasks. A perfectly even distribution would amount to about 70 million rows per reduce task. However I actually got around 60 million f
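
The "about 70 million rows per reduce task" figure quoted here follows from dividing the row count by the number of reducers. A small sketch of that arithmetic, plus one possible way to quantify skew (the `skew_factor` helper and the `example_loads` values are hypothetical and not taken from the thread):

```python
def even_share(total_rows, num_reducers):
    """Rows per reducer under a perfectly even distribution."""
    return total_rows / num_reducers

def skew_factor(loads):
    """Ratio of the busiest reducer's load to the mean load (1.0 means even)."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean

# Figures from the message: a little over 3 billion rows, 43 reduce tasks.
share = even_share(3_000_000_000, 43)
print(f"{share / 1e6:.1f} million rows per reducer")

# Hypothetical per-reducer row counts, only to show how the ratio behaves.
example_loads = [100, 100, 400]
print("skew factor:", skew_factor(example_loads))
```

Since the job's wall-clock time is bounded by its slowest reducer, the busiest reducer's load relative to the mean is a more useful number to track than the average alone.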

RE: Distributing Keys across Reducers

2012-07-20 Thread Dave Shine
.ab...@gmail.com] Sent: Friday, July 20, 2012 9:58 AM To: mapreduce-user@hadoop.apache.org Subject: Re: Distributing Keys across Reducers Dave Shine , Can you share how many data is been taken by map task .If map task is uneven then it might be Hot Spotting Problem. Have an look on http://blog.s

RE: Distributing Keys across Reducers

2012-07-20 Thread Dave Shine
d this problem. Dave Shine Sr. Software Engineer 321.939.5093 direct | 407.314.0122 mobile CI Boost(tm) Clients Outperform Online(tm) www.ciboost.com -Original Message- From: John Armstrong [mailto:j...@ccri.com] Sent: Friday, July 20, 2012 10:20 AM To: mapreduce-user@hadoop.apache.org Subject

Re: Distributing Keys across Reducers

2012-07-20 Thread Harsh J
.0122 mobile > CI Boost(tm) Clients Outperform Online(tm) www.ciboost.com > > > -Original Message- > From: John Armstrong [mailto:j...@ccri.com] > Sent: Friday, July 20, 2012 10:20 AM > To: mapreduce-user@hadoop.apache.org > Subject: Re: Distributing Keys across Reducers &g

RE: Distributing Keys across Reducers

2012-07-20 Thread Dave Shine
rmstr...@ccri.com Subject: Re: Distributing Keys across Reducers Does applying a combiner make any difference? Or are these numbers with the combiner included? On Fri, Jul 20, 2012 at 8:46 PM, Dave Shine wrote: > Thanks John. > > The key is my own WritableComparable object, and I have
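
For context on the combiner question: a combiner pre-aggregates each map task's output before the shuffle, so a hot key sends fewer rows to its reducer. A minimal sketch of the idea, assuming the reduce operation is a sum (the thread does not say what Dave's reducer computes, and `combine` is a hypothetical helper, not the Hadoop API):

```python
from collections import Counter

def combine(map_output):
    """Locally sum the values per key, as a combiner would for one map task."""
    combined = Counter()
    for key, value in map_output:
        combined[key] += value
    return sorted(combined.items())

# Made-up output of a single map task: seven rows, two distinct keys.
map_output = [("a", 1)] * 5 + [("b", 1)] * 2
combined = combine(map_output)
print(len(map_output), "rows before combining,", len(combined), "after")
```

This only applies when the reduce function is associative and commutative; otherwise combining partial results would change the answer.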

RE: Distributing Keys across Reducers

2012-07-20 Thread Tim Broberg
. From: David Rosenstrauch [dar...@darose.net] Sent: Friday, July 20, 2012 7:45 AM To: mapreduce-user@hadoop.apache.org Subject: Re: Distributing Keys across Reducers On 07/20/2012 09:20 AM, Dave Shine wrote: > I have a job that is emitting over 3 billion rows from the map to the reduce. > T

RE: Distributing Keys across Reducers

2012-07-20 Thread Dave Shine
...@exar.com] Sent: Friday, July 20, 2012 1:03 PM To: mapreduce-user@hadoop.apache.org Subject: RE: Distributing Keys across Reducers Just a thought, but can you deal with the problem with increased granularity by simply making the jobs smaller? If you have enough jobs, when one takes twice as

RE: Distributing Keys across Reducers

2012-07-25 Thread Dave Shine
20, 2012 1:13 PM To: mapreduce-user@hadoop.apache.org Subject: RE: Distributing Keys across Reducers Yes, that is a possibility, but it will take some significant rearchitecture. I was assuming that was what I was going to have to do until I saw the key distribution problem and thought I might

Re: Distributing Keys across Reducers

2012-07-25 Thread Tim Broberg
> CI Boost(tm) Clients Outperform Online(tm) www.ciboost.com > > > -Original Message- > From: Dave Shine [mailto:dave.sh...@channelintelligence.com] > Sent: Friday, July 20, 2012 1:13 PM > To: mapreduce-user@hadoop.apache.org > Subject: RE: Distributing Keys across