Sital Kedia created SPARK-16827:
-----------------------------------

             Summary: Query with Join produces excessive shuffle data
                 Key: SPARK-16827
                 URL: https://issues.apache.org/jira/browse/SPARK-16827
             Project: Spark
          Issue Type: Bug
          Components: Shuffle, Spark Core
    Affects Versions: 2.0.0
            Reporter: Sital Kedia


One of our Hive jobs looks like this:

 SELECT userid
   FROM table1 a
   JOIN table2 b
     ON a.ds = '2016-07-15'
    AND b.ds = '2016-07-15'
    AND a.source_id = b.id

After upgrading to Spark 2.0, the job is significantly slower.  Digging a
little into it, we found that one of the stages produces an excessive amount
of shuffle data.  Please note that this is a regression from Spark 1.6.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

