Re: How to avoid shuffle errors for a large join ?

2015-09-16 Thread Reynold Xin
Only SQL and DataFrame for now. We are thinking about how to apply that to a more general distributed collection-based API, but it's not in 1.5. On Sat, Sep 5, 2015 at 11:56 AM, Gurvinder Singh wrote: > On 09/05/2015 11:22 AM, Reynold Xin wrote: > > Try increasing
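Since those improvements land only on the SQL/DataFrame side, one workaround is to express the join through the DataFrame API instead of an RDD join/cogroup. A minimal sketch, assuming hypothetical Parquet inputs and a shared key column "id" (paths and column names are placeholders, not from the thread):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    // Hypothetical setup; paths and column names are placeholders.
    val conf = new SparkConf().setAppName("dataframe-join-sketch")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    val big = sqlContext.read.parquet("hdfs:///data/big")
    val small = sqlContext.read.parquet("hdfs:///data/small")

    // Going through the DataFrame API lets the SQL planner (and, in 1.5, the
    // new shuffle/aggregation code paths) manage the join's shuffle instead
    // of a hand-rolled RDD cogroup.
    val joined = big.join(small, big("id") === small("id"))
    joined.write.parquet("hdfs:///data/joined")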

Re: How to avoid shuffle errors for a large join ?

2015-09-05 Thread Reynold Xin
Try increasing the shuffle memory fraction (by default it is only 16%). Again, if you run Spark 1.5, this will probably run a lot faster, especially if you increase the shuffle memory fraction ... On Tue, Sep 1, 2015 at 8:13 AM, Thomas Dudziak wrote: > While it works with
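For reference, the shuffle memory pool in Spark 1.x is controlled by spark.shuffle.memoryFraction (default 0.2, of which only 80% is usable after the safety fraction, hence the ~16% mentioned above). A sketch of raising it when building the context; the 0.4 value is purely illustrative, not a recommendation from the thread:

    import org.apache.spark.{SparkConf, SparkContext}

    // Illustrative values only: give shuffle sort/aggregation buffers a larger
    // slice of the executor heap so less data spills to disk, and shrink the
    // storage (cache) pool to compensate.
    val conf = new SparkConf()
      .setAppName("shuffle-memory-sketch")
      .set("spark.shuffle.memoryFraction", "0.4")
      .set("spark.storage.memoryFraction", "0.4")
    val sc = new SparkContext(conf)

The same settings can also be passed at submit time, e.g. --conf spark.shuffle.memoryFraction=0.4.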

Re: How to avoid shuffle errors for a large join ?

2015-09-05 Thread Gurvinder Singh
On 09/05/2015 11:22 AM, Reynold Xin wrote: > Try increasing the shuffle memory fraction (by default it is only 16%). > Again, if you run Spark 1.5, this will probably run a lot faster, > especially if you increase the shuffle memory fraction ... Hi Reynold, Does 1.5 have better join/cogroup

Re: How to avoid shuffle errors for a large join ?

2015-09-01 Thread Thomas Dudziak
While it works with sort-merge-join, it takes about 12h to finish (with 1 shuffle partitions). My hunch is that the reason for that is this: INFO ExternalSorter: Thread 3733 spilling in-memory map of 174.9 MB to disk (62 times so far) (and lots more where this comes from). On Sat, Aug 29,

Re: How to avoid shuffle errors for a large join ?

2015-08-29 Thread Reynold Xin
Can you try 1.5? This should work much, much better in 1.5 out of the box. For 1.4, I think you'd want to turn on sort-merge-join, which is off by default. However, the sort-merge join in 1.4 can still trigger a lot of garbage, making it slower. SMJ performance is probably 5x - 1000x better in
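For 1.4, the switch in question is the spark.sql.planner.sortMergeJoin flag. A minimal sketch of turning it on (assumes an existing SparkContext named sc):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)

    // Prefer sort-merge join over the default shuffled hash join for large
    // tables; off by default in 1.4, as noted above.
    sqlContext.setConf("spark.sql.planner.sortMergeJoin", "true")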

RE: How to avoid shuffle errors for a large join ?

2015-08-28 Thread java8964
partitions or memory. Yong From: tom...@gmail.com Date: Fri, 28 Aug 2015 13:55:52 -0700 Subject: Re: How to avoid shuffle errors for a large join ? To: ja...@jasonknight.us CC: user@spark.apache.org Yeah, I tried with 10k and 30k and these still failed, will try with more then. Though

Re: How to avoid shuffle errors for a large join ?

2015-08-28 Thread Thomas Dudziak
, adding partitions or memory. Yong -- From: tom...@gmail.com Date: Fri, 28 Aug 2015 13:55:52 -0700 Subject: Re: How to avoid shuffle errors for a large join ? To: ja...@jasonknight.us CC: user@spark.apache.org Yeah, I tried with 10k and 30k and these still

Re: How to avoid shuffle errors for a large join ?

2015-08-28 Thread Jason
Ahh yes, thanks for mentioning data skew, I've run into that before as well. The best way to diagnose it is to get statistics on the distribution of your join key. If a few key values have a drastically larger number of rows than the rest, then a reducer task will always be swamped no matter how many reducer-side
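A quick way to collect those statistics is to count rows per join key and look at the heaviest keys. A minimal sketch with the DataFrame API; the input path, the column "id", and the existing SparkContext sc are placeholders/assumptions, not names from the thread:

    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.functions.desc

    // Placeholder input and column name; adapt to your own table and join key.
    val sqlContext = new SQLContext(sc)
    val big = sqlContext.read.parquet("hdfs:///data/big")

    // Rows per join key, heaviest first. A handful of keys with vastly more
    // rows than the rest means one reducer stays swamped no matter how many
    // shuffle partitions are configured.
    val keyCounts = big.groupBy("id").count().orderBy(desc("count"))
    keyCounts.show(20)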

Re: How to avoid shuffle errors for a large join ?

2015-08-28 Thread Thomas Dudziak
Yeah, I tried with 10k and 30k and these still failed; will try with more, then. Though that is a little disappointing: it only writes ~7TB of shuffle data, which shouldn't in theory require more than 1000 reducers on my 10TB-memory cluster (~7GB of spill per reducer). I'm now wondering if my
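The sizing argument here (~7TB of shuffle data over 1000 reducers, i.e. ~7GB per reducer) can be turned into a rough rule of thumb: divide the expected shuffle volume by the amount you want each reduce task to handle. A back-of-the-envelope sketch; the 256MB-per-task target is an illustrative assumption, not a figure from the thread:

    // Back-of-the-envelope partition sizing; all numbers are illustrative.
    val shuffleBytes = 7L * 1024 * 1024 * 1024 * 1024   // ~7 TB of shuffle data
    val targetBytesPerTask = 256L * 1024 * 1024          // aim for ~256 MB per reduce task

    // ~28,672 partitions at these numbers -- far more than 1000, which is why
    // ~7 GB per reducer spills heavily unless the shuffle memory pool is huge.
    val suggestedPartitions = (shuffleBytes / targetBytesPerTask).toInt
    println(s"suggested shuffle partitions: $suggestedPartitions")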

Re: How to avoid shuffle errors for a large join ?

2015-08-28 Thread Jason
I had similar problems to this (reduce-side failures for large joins (25bn rows with 9bn)), and found the answer was to increase spark.sql.shuffle.partitions well beyond 1000. In my case, 16k partitions worked for me, but your tables look a little denser, so you may want to go even higher. On Thu, Aug
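For SQL/DataFrame joins the reduce-task count comes from spark.sql.shuffle.partitions (default 200). A minimal sketch of raising it into the range mentioned above; the exact value is workload-dependent, and an existing SparkContext sc is assumed:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)

    // More, smaller reduce tasks keep each task's working set inside the
    // shuffle memory pool; 16000 mirrors the value above, but tune per workload.
    sqlContext.setConf("spark.sql.shuffle.partitions", "16000")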