[ https://issues.apache.org/jira/browse/SPARK-16753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15398551#comment-15398551 ]
Reynold Xin commented on SPARK-16753:
-------------------------------------

Got it - definitely good to do a skew join. There are a lot of different ways to implement this. I think Teradata had a paper on skewed joins too.

> Spark SQL doesn't handle skewed dataset joins properly
> ------------------------------------------------------
>
>                 Key: SPARK-16753
>                 URL: https://issues.apache.org/jira/browse/SPARK-16753
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.1, 1.6.2, 2.0.0, 2.0.1
>            Reporter: Jurriaan Pruis
>         Attachments: screenshot-1.png
>
> I'm having trouble joining a 1-billion-row dataframe containing skewed data
> with multiple dataframes ranging in size from 100,000 to 10 million rows.
> This means some of the joins (about half of them) can be done as broadcast
> joins, but not all.
> Because the data in the large dataframe is skewed, we get out-of-memory
> errors in the executors, or errors like
> `org.apache.spark.shuffle.FetchFailedException: Too large frame`.
> We tried a lot of things, such as broadcast-joining the skewed rows
> separately and unioning them with the sort-merge-joined remainder. That
> works perfectly when doing one or two joins, but when doing 10 joins like
> this the query planner gets confused (see [SPARK-15326]).
> Since most of the rows are skewed on the NULL value, we use a hack that puts
> unique values in those NULL columns so the data is distributed evenly over
> all partitions. This works fine for NULL values, but this table is growing
> rapidly and we now have skewed data for non-NULL values as well, so it isn't
> a full solution to the problem.
> Right now this specific Spark job succeeds only about 30% of the time, and
> it's getting worse because of the increasing amount of data.
> How should these kinds of joins be approached in Spark?
> It seems weird that I can't find proper solutions to this problem, or other
> people having the same kind of issues, when Spark presents itself as a
> large-scale data processing engine. Joins on big datasets should be
> something Spark handles out of the box.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
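The NULL-salting hack described in the issue can be sketched in plain Python (not Spark code - all names here, such as `salt_null_keys` and `partition_of`, are illustrative). The idea: rows whose join key is NULL all hash to the same shuffle partition, creating one hot partition; replacing each NULL with a unique salt value spreads those rows across all partitions. Since NULL never equals NULL in a SQL join, the salted rows could not have matched anyway, so join results are unaffected. In Spark the same idea would be expressed by replacing NULL key columns with generated unique values before the join.

```python
import uuid
from collections import Counter

def salt_null_keys(rows, key):
    """Replace NULL (None) join keys with unique salt values so the rows
    are spread across partitions instead of all hashing to one."""
    salted = []
    for row in rows:
        if row[key] is None:
            # Unique per-row salt: these rows can never join, matching
            # SQL's NULL-never-equals-NULL semantics.
            row = dict(row, **{key: "__salt__" + uuid.uuid4().hex})
        salted.append(row)
    return salted

def partition_of(value, num_partitions=8):
    """Toy stand-in for a hash partitioner."""
    return hash(value) % num_partitions

# Skewed dataset: the vast majority of rows have a NULL join key.
rows = [{"id": None, "v": i} for i in range(1000)]
rows += [{"id": "a", "v": 1}, {"id": "b", "v": 2}]

before = Counter(partition_of(r["id"]) for r in rows)
after = Counter(partition_of(r["id"]) for r in salt_null_keys(rows, "id"))

# Before salting, all 1000 NULL rows land in a single partition;
# after salting, the load is spread roughly evenly across partitions.
print("max partition size before:", max(before.values()))
print("max partition size after:", max(after.values()))
```

This only addresses NULL skew; as the reporter notes, skew on real (non-NULL) key values needs a different treatment, such as splitting hot keys out for a separate broadcast join.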