[ https://issues.apache.org/jira/browse/SPARK-16753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15398551#comment-15398551 ]
Reynold Xin commented on SPARK-16753:
-------------------------------------

Got it - definitely good to do a skew join. There are a lot of different ways to implement this. I think Teradata had a paper on skewed joins too.

> Spark SQL doesn't handle skewed dataset joins properly
> ------------------------------------------------------
>
>                 Key: SPARK-16753
>                 URL: https://issues.apache.org/jira/browse/SPARK-16753
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.1, 1.6.2, 2.0.0, 2.0.1
>            Reporter: Jurriaan Pruis
>         Attachments: screenshot-1.png
>
> I'm having trouble joining a 1-billion-row dataframe containing skewed data
> with multiple dataframes ranging in size from 100,000 to 10 million rows.
> This means some of the joins (about half of them) can be done as broadcast
> joins, but not all.
> Because the data in the large dataframe is skewed, we get out-of-memory
> errors in the executors, or errors like
> `org.apache.spark.shuffle.FetchFailedException: Too large frame`.
> We tried a lot of things, such as broadcast-joining the skewed rows
> separately and unioning them with the sort-merge-joined remainder. That
> works perfectly when doing one or two joins, but when doing 10 joins like
> this the query planner gets confused (see [SPARK-15326]).
> Since most of the rows are skewed on the NULL value, we use a hack that puts
> unique values in those NULL columns so the data is distributed evenly over
> all partitions. This works fine for NULL values, but this table is growing
> rapidly and we now have skewed data for non-NULL values as well, so it isn't
> a full solution to the problem.
> Right now this specific Spark job succeeds only about 30% of the time, and
> it's getting worse because of the increasing amount of data.
> How should these kinds of joins be approached in Spark?
> It seems weird that I can't find proper solutions to this problem, or other
> people having the same kind of issues, when Spark presents itself as a
> large-scale data processing engine. Joins on big datasets should be
> something Spark handles out of the box.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
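The NULL-salting hack described in the issue can be sketched in plain Python (not Spark code - all names here, such as `salt_null_keys` and `partition_of`, are illustrative). The idea: rows whose join key is NULL all hash to the same shuffle partition, creating one hot partition; replacing each NULL with a unique salt value spreads those rows across all partitions. Since NULL never equals NULL in a SQL join, the salted rows could not have matched anyway, so join results are unaffected. In Spark the same idea would be expressed by replacing NULL key columns with generated unique values before the join.

```python
import uuid
from collections import Counter

def salt_null_keys(rows, key):
    """Replace NULL (None) join keys with unique salt values so the rows
    are spread across partitions instead of all hashing to one."""
    salted = []
    for row in rows:
        if row[key] is None:
            # Unique per-row salt: these rows can never join, matching
            # SQL's NULL-never-equals-NULL semantics.
            row = dict(row, **{key: "__salt__" + uuid.uuid4().hex})
        salted.append(row)
    return salted

def partition_of(value, num_partitions=8):
    """Toy stand-in for a hash partitioner."""
    return hash(value) % num_partitions

# Skewed dataset: the vast majority of rows have a NULL join key.
rows = [{"id": None, "v": i} for i in range(1000)]
rows += [{"id": "a", "v": 1}, {"id": "b", "v": 2}]

before = Counter(partition_of(r["id"]) for r in rows)
after = Counter(partition_of(r["id"]) for r in salt_null_keys(rows, "id"))

# Before salting, all 1000 NULL rows land in a single partition;
# after salting, the load is spread roughly evenly across partitions.
print("max partition size before:", max(before.values()))
print("max partition size after:", max(after.values()))
```

This only addresses NULL skew; as the reporter notes, skew on real (non-NULL) key values needs a different treatment, such as splitting hot keys out for a separate broadcast join.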