Sital Kedia created SPARK-16827:
-----------------------------------

             Summary: Query with Join produces excessive shuffle data
                 Key: SPARK-16827
                 URL: https://issues.apache.org/jira/browse/SPARK-16827
             Project: Spark
          Issue Type: Bug
          Components: Shuffle, Spark Core
    Affects Versions: 2.0.0
            Reporter: Sital Kedia

One of our Hive jobs looks like this:

    SELECT userid
    FROM table1 a
    JOIN table2 b
      ON a.ds = '2016-07-15'
     AND b.ds = '2016-07-15'
     AND a.source_id = b.id

After upgrading to Spark 2.0, the job is significantly slower. Digging into it, we found that one of the stages produces an excessive amount of shuffle data. Please note that this is a regression from Spark 1.6.