[jira] [Assigned] (SPARK-16827) Query with Join produces excessive amount of shuffle data
[ https://issues.apache.org/jira/browse/SPARK-16827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-16827:

    Assignee: Apache Spark

> Query with Join produces excessive amount of shuffle data
> ----------------------------------------------------------
>
>                 Key: SPARK-16827
>                 URL: https://issues.apache.org/jira/browse/SPARK-16827
>             Project: Spark
>          Issue Type: Bug
>          Components: Shuffle, Spark Core
>    Affects Versions: 2.0.0
>            Reporter: Sital Kedia
>            Assignee: Apache Spark
>              Labels: performance
>
> One of our Hive jobs looks like this:
> {code}
> SELECT userid
> FROM table1 a
> JOIN table2 b
> ON a.ds = '2016-07-15'
> AND b.ds = '2016-07-15'
> AND a.source_id = b.id
> {code}
> After upgrading to Spark 2.0, the job is significantly slower. Digging a little into it, we found that one of the stages produces an excessive amount of shuffle data. Please note that this is a regression from Spark 1.6: Stage 2 of the job, which used to produce 32 KB of shuffle data with 1.6, now produces more than 400 GB with Spark 2.0. We also tried turning off whole-stage code generation, but that did not help.
> PS - Even though the intermediate shuffle data size is huge, the job still produces correct output.
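For anyone triaging this, a minimal sketch of how to inspect where the shuffle boundaries fall for this query from spark-shell on the 2.0 side. It assumes the two partitioned tables from the report ({{table1}} and {{table2}}, each with a {{ds}} partition column) exist in the Hive metastore; {{spark}} is the SparkSession predefined by the shell.

{code}
// Run in spark-shell (Spark 2.0); `spark` is the predefined SparkSession.
// table1/table2 are the (assumed) Hive tables from the report above.
val df = spark.sql("""
  SELECT userid
  FROM table1 a
  JOIN table2 b
  ON a.ds = '2016-07-15'
  AND b.ds = '2016-07-15'
  AND a.source_id = b.id
""")

// Print the physical plan. Exchange nodes mark the shuffle boundaries,
// and their partitioning expressions show which keys are being hashed.
df.explain()
{code}

Comparing the Exchange nodes' partitioning expressions against the 1.6 plan for the same query should show whether the join is shuffling on different (or additional) keys after the upgrade.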
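The report notes that turning off whole-stage code generation did not help. For reference, the toggle the reporter most likely used is the internal {{spark.sql.codegen.wholeStage}} setting, which in Spark 2.0 can be flipped at runtime from the session (it can also be passed with --conf at submit time):

{code}
// Disable whole-stage code generation for this session (Spark 2.0).
// This only changes how operators are compiled, not the plan shape,
// which is consistent with it having no effect on the shuffle volume.
spark.conf.set("spark.sql.codegen.wholeStage", "false")
{code}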
[jira] [Assigned] (SPARK-16827) Query with Join produces excessive amount of shuffle data
[ https://issues.apache.org/jira/browse/SPARK-16827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-16827:

    Assignee: (was: Apache Spark)