Only SQL and DataFrame for now.
We are thinking about how to apply that to a more general distributed
collection-based API, but it's not in 1.5.
On Sat, Sep 5, 2015 at 11:56 AM, Gurvinder Singh wrote:
> On 09/05/2015 11:22 AM, Reynold Xin wrote:
> > Try increasing the shuffle memory fraction (by default it is only 16%). ...
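
In practice that means expressing the join through the DataFrame API rather
than RDD join/cogroup, so it picks up the 1.5 SQL/DataFrame improvements. A
minimal sketch, with made-up paths and column names:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("large-join"))
val sqlContext = new SQLContext(sc)

// Hypothetical inputs; any DataFrame source works the same way.
val left  = sqlContext.read.parquet("hdfs:///data/left")
val right = sqlContext.read.parquet("hdfs:///data/right")

// A DataFrame join goes through the SQL planner, so it can use the improved
// join/shuffle machinery; an equivalent rdd1.join(rdd2) does not.
left.join(right, left("id") === right("id"))
  .write.parquet("hdfs:///data/joined")
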
Try increasing the shuffle memory fraction (by default it is only 16%).
Again, if you run Spark 1.5, this will probably run a lot faster,
especially if you increase the shuffle memory fraction ...
On Tue, Sep 1, 2015 at 8:13 AM, Thomas Dudziak wrote:
> While it works with ...
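
For reference, a minimal sketch of raising the shuffle memory fraction on
Spark 1.x; the 16% above is the 0.2 default times the 0.8 safety fraction, and
the 0.5/0.3 split below is illustrative, not a recommendation:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("large-join")
  // More execution memory for shuffle buffers...
  .set("spark.shuffle.memoryFraction", "0.5")
  // ...taken back from the block cache so the two still fit together.
  .set("spark.storage.memoryFraction", "0.3")
val sc = new SparkContext(conf)
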
On 09/05/2015 11:22 AM, Reynold Xin wrote:
> Try increasing the shuffle memory fraction (by default it is only 16%).
> Again, if you run Spark 1.5, this will probably run a lot faster,
> especially if you increase the shuffle memory fraction ...
Hi Reynold,
Does 1.5 have better join/cogroup ...
While it works with sort-merge-join, it takes about 12h to finish (with
1 shuffle partitions). My hunch is that the reason for that is this:
INFO ExternalSorter: Thread 3733 spilling in-memory map of 174.9 MB to disk
(62 times so far)
(and lots more where this comes from).
On Sat, Aug 29, 2015, Reynold Xin wrote:
Can you try 1.5? This should work much, much better in 1.5 out of the box.
For 1.4, I think you'd want to turn on sort-merge-join, which is off by
default. However, the sort-merge join in 1.4 can still trigger a lot of
garbage, making it slower. SMJ performance is probably 5x - 1000x better in
1.5.
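
A one-line sketch of turning that on, assuming a Spark 1.4 SQLContext named
sqlContext; in 1.4 the relevant flag is spark.sql.planner.sortMergeJoin:

// Off by default on 1.4; 1.5 enables sort-merge join out of the box.
sqlContext.setConf("spark.sql.planner.sortMergeJoin", "true")
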
..., adding partitions or memory.
Yong
--
From: tom...@gmail.com
Date: Fri, 28 Aug 2015 13:55:52 -0700
Subject: Re: How to avoid shuffle errors for a large join ?
To: ja...@jasonknight.us
CC: user@spark.apache.org
Yeah, I tried with 10k and 30k and these still failed, will try with more,
then. ...
Ahh yes, thanks for mentioning data skew, I've run into that before as
well. The best way there is to get statistics on the distribution of your
join key. If a few key values have a drastically larger number of rows,
then one reducer task will always be swamped no matter how many
reducer-side partitions you add.
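
A sketch of that skew check, assuming a DataFrame df with a join column named
id (both placeholders): count rows per key and compare the hottest keys to
the average.

import org.apache.spark.sql.functions.desc

// Placeholder input; use whichever DataFrame feeds the join.
val df = sqlContext.read.parquet("hdfs:///data/left")

val keyCounts = df.groupBy("id").count()
keyCounts.orderBy(desc("count")).show(20)   // the hottest join keys

// Rough skew signal: compare the hottest keys against the average.
val avgRowsPerKey = df.count().toDouble / keyCounts.count()
println(s"average rows per key: $avgRowsPerKey")
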
Yeah, I tried with 10k and 30k and these still failed, will try with more,
then. Though that is a little disappointing: it only writes ~7TB of shuffle
data, which shouldn't in theory require more than 1000 reducers on my 10TB
memory cluster (~7GB of spill per reducer).
I'm now wondering if my ...
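
Back-of-envelope, using the thread's numbers: if each reduce task should see
a few hundred MB instead of ~7GB, the partition count has to scale with the
shuffle volume. The ~512MB per-task target below is an assumption:

val shuffleBytes  = 7e12                     // ~7 TB of shuffle data
val targetPerTask = 512e6                    // assumed ~512 MB per reduce task
val partitions    = math.ceil(shuffleBytes / targetPerTask).toInt
println(partitions)                          // ~13.7k, close to the 16k below
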
I had similar problems to this (reduce-side failures for large joins, 25bn
rows joined with 9bn), and found the answer was to raise
spark.sql.shuffle.partitions beyond 1000. In my case, 16k partitions worked
for me, but your tables look a little denser, so you may want to go even
higher.
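
Sketch of that change, assuming a SQLContext named sqlContext;
spark.sql.shuffle.partitions defaults to 200, and 16000 simply mirrors the
value reported above:

sqlContext.setConf("spark.sql.shuffle.partitions", "16000")
// or at submit time:
//   spark-submit --conf spark.sql.shuffle.partitions=16000 ...
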