[ https://issues.apache.org/jira/browse/SPARK-16827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15571007#comment-15571007 ]
Gaoxiang Liu commented on SPARK-16827:
--------------------------------------

I am quite new to Spark, but I am now somewhat confused about the requirement. [~sitalke...@gmail.com], [~chobrian], could you please comment on what the requirement is, given what [~rxin] said?

> Stop reporting spill metrics as shuffle metrics
> -----------------------------------------------
>
>                 Key: SPARK-16827
>                 URL: https://issues.apache.org/jira/browse/SPARK-16827
>             Project: Spark
>          Issue Type: Bug
>          Components: Shuffle, Spark Core
>    Affects Versions: 2.0.0
>            Reporter: Sital Kedia
>            Assignee: Brian Cho
>              Labels: performance
>             Fix For: 2.0.2, 2.1.0
>
>
> One of our Hive jobs, which looks like this:
> {code}
> SELECT userid
> FROM table1 a
> JOIN table2 b
> ON a.ds = '2016-07-15'
> AND b.ds = '2016-07-15'
> AND a.source_id = b.id
> {code}
> is significantly slower after upgrading to Spark 2.0. Digging into it, we found that one of the stages produces an excessive amount of shuffle data. Note that this is a regression from Spark 1.6: Stage 2 of the job, which used to produce 32KB of shuffle data with 1.6, now produces more than 400GB with Spark 2.0. We also tried turning off whole-stage code generation, but that did not help.
> PS - Even though the intermediate shuffle data size is huge, the job still produces correct output.
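
As a rough, non-authoritative sketch of the direction the issue title suggests (the class and method names below are hypothetical and are not Spark's actual TaskMetrics/ShuffleWriteMetrics API), the idea is to account for sorter spill bytes separately from shuffle write bytes, so that spills to disk do not inflate the reported shuffle data size:

{code}
// Hypothetical Scala sketch: keep spill bytes out of the shuffle write counter.
class TaskIoCounters {
  private var shuffleBytesWritten: Long = 0L  // bytes actually written as shuffle output
  private var diskBytesSpilled: Long = 0L     // bytes spilled to disk by the sorter

  def recordShuffleWrite(bytes: Long): Unit = shuffleBytesWritten += bytes

  // Spills are tracked in their own counter instead of being folded into shuffle metrics.
  def recordSpill(bytes: Long): Unit = diskBytesSpilled += bytes

  def summary: String =
    s"shuffle write: $shuffleBytesWritten B, spilled: $diskBytesSpilled B"
}

object TaskIoCountersDemo extends App {
  val m = new TaskIoCounters
  m.recordShuffleWrite(32L * 1024)             // small real shuffle output (e.g. ~32KB)
  m.recordSpill(400L * 1024 * 1024 * 1024)     // large sorter spill, reported separately
  println(m.summary)
}
{code}

Under this separation, a stage like Stage 2 above would again report its shuffle write size (~32KB) rather than the much larger spill volume.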