[jira] [Assigned] (SPARK-16827) Query with Join produces excessive amount of shuffle data

2016-10-04 Thread Apache Spark (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-16827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-16827:


Assignee: Apache Spark

> Query with Join produces excessive amount of shuffle data
> ----------------------------------------------------------
>
> Key: SPARK-16827
> URL: https://issues.apache.org/jira/browse/SPARK-16827
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 2.0.0
>Reporter: Sital Kedia
>Assignee: Apache Spark
>  Labels: performance
>
> One of our Hive jobs, which looks like this -
> {code}
>  SELECT userid
>  FROM   table1 a
>  JOIN   table2 b
>    ON   a.ds = '2016-07-15'
>   AND   b.ds = '2016-07-15'
>   AND   a.source_id = b.id
> {code}
> After upgrading to Spark 2.0, the job is significantly slower. Digging a little
> into it, we found that one of the stages produces an excessive amount of
> shuffle data. Please note that this is a regression from Spark 1.6: Stage 2
> of the job, which used to produce 32KB of shuffle data with 1.6, now produces
> more than 400GB with Spark 2.0. We also tried turning off whole-stage code
> generation, but that did not help.
> PS - Even though the intermediate shuffle data size is huge, the job still
> produces correct output.
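
A minimal sketch of how one might rule out whole-stage code generation as the cause, as mentioned in the report. It assumes a Spark 2.0 SparkSession with the Hive tables table1 and table2 from the report available; spark.sql.codegen.wholeStage is the session-level configuration key for whole-stage code generation.

{code}
// Sketch only: disable whole-stage code generation for this session and
// re-run the reported query, assuming table1/table2 are Hive tables.
import org.apache.spark.sql.SparkSession

object Spark16827Repro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SPARK-16827 repro")
      .enableHiveSupport()
      .getOrCreate()

    // Turn off whole-stage code generation (per the report, this did not help).
    spark.conf.set("spark.sql.codegen.wholeStage", "false")

    val result = spark.sql(
      """SELECT userid
        |FROM   table1 a
        |JOIN   table2 b
        |  ON   a.ds = '2016-07-15'
        | AND   b.ds = '2016-07-15'
        | AND   a.source_id = b.id""".stripMargin)

    // Force execution; the stage's "Shuffle Write" size can then be checked
    // in the Spark UI to compare against the Spark 1.6 run.
    println(result.count())
  }
}
{code}

With the query forced to execute, the per-stage shuffle write sizes appear in the Spark UI's Stages tab, which is where the 32KB-vs-400GB difference would be visible.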






[jira] [Assigned] (SPARK-16827) Query with Join produces excessive amount of shuffle data

2016-10-04 Thread Apache Spark (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-16827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-16827:


Assignee: (was: Apache Spark)

> Query with Join produces excessive amount of shuffle data
> ----------------------------------------------------------
>
> Key: SPARK-16827
> URL: https://issues.apache.org/jira/browse/SPARK-16827
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 2.0.0
>Reporter: Sital Kedia
>  Labels: performance
>
> One of our Hive jobs, which looks like this -
> {code}
>  SELECT userid
>  FROM   table1 a
>  JOIN   table2 b
>    ON   a.ds = '2016-07-15'
>   AND   b.ds = '2016-07-15'
>   AND   a.source_id = b.id
> {code}
> After upgrading to Spark 2.0, the job is significantly slower. Digging a little
> into it, we found that one of the stages produces an excessive amount of
> shuffle data. Please note that this is a regression from Spark 1.6: Stage 2
> of the job, which used to produce 32KB of shuffle data with 1.6, now produces
> more than 400GB with Spark 2.0. We also tried turning off whole-stage code
> generation, but that did not help.
> PS - Even though the intermediate shuffle data size is huge, the job still
> produces correct output.
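
A quick, hedged sketch of how one could narrow down which operator introduces the extra shuffle: print the query's plans and look at the Exchange (shuffle) nodes and the chosen join strategy. It assumes the same SparkSession and Hive tables as in the sketch above.

{code}
// Sketch only: inspect the parsed/analyzed/optimized/physical plans for the
// reported query; Exchange nodes in the physical plan are the shuffles.
val query =
  """SELECT userid
    |FROM   table1 a
    |JOIN   table2 b
    |  ON   a.ds = '2016-07-15'
    | AND   b.ds = '2016-07-15'
    | AND   a.source_id = b.id""".stripMargin

// explain(true) prints all plan stages, not just the physical plan.
spark.sql(query).explain(true)
{code}

Comparing this output against the same query on Spark 1.6 (e.g. via sqlContext.sql(query).explain(true) there) would show whether the join strategy or the exchange partitioning changed between the two versions.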


