[ 
https://issues.apache.org/jira/browse/SPARK-13902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout updated SPARK-13902:
-----------------------------------
    Fix Version/s:     (was: 2.1.0)
                   2.0.0

> Make DAGScheduler not to create duplicate stage.
> ------------------------------------------------
>
>                 Key: SPARK-13902
>                 URL: https://issues.apache.org/jira/browse/SPARK-13902
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler
>            Reporter: Takuya Ueshin
>            Assignee: Takuya Ueshin
>             Fix For: 2.0.0
>
>
> {{DAGScheduler}} sometimes generates an incorrect stage graph.
> Suppose you have the following DAG (please see this in monospaced font):
> {noformat}
> [A] <--(s_A)-- [B] <--(s_B)-- [C] <--(s_C)-- [D]
>             \                /
>               <-------------
> {noformat}
> Note: [] means an RDD, () means a shuffle dependency.
> Here, RDD {{B}} has a shuffle dependency on RDD {{A}}, and RDD {{C}} has 
> shuffle dependencies on both {{B}} and {{A}}. The shuffle dependency IDs are 
> numbers in the {{DAGScheduler}}, but to make the example easier to 
> understand, let's call the shuffled data from {{A}} shuffle dependency ID 
> {{s_A}} and the shuffled data from {{B}} shuffle dependency ID {{s_B}}.
> The {{getAncestorShuffleDependencies}} method in {{DAGScheduler}} 
> (incorrectly) does not check for duplicates when it's adding 
> ShuffleDependencies to the parents data structure, so for this DAG, when 
> {{getAncestorShuffleDependencies}} gets called on {{C}} (the parent of the 
> final RDD), {{getAncestorShuffleDependencies}} will return {{s_A}}, {{s_B}}, 
> {{s_A}} ({{s_A}} gets added twice: once when the method "visit"s RDD {{C}}, 
> and once when the method "visit"s RDD {{B}}). This is problematic because 
> this line of code: 
> https://github.com/apache/spark/blob/8ef3399/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L289
>  then generates a new shuffle stage for each dependency returned by 
> {{getAncestorShuffleDependencies}}, resulting in duplicate map stages that 
> compute the map output from RDD {{A}}.
> As a result, {{DAGScheduler}} generates the following stages and their 
> parents for each shuffle:
> |  | stage | parents |
> | s_A | ShuffleMapStage 2 | List() |
> | s_B | ShuffleMapStage 1 | List(ShuffleMapStage 0) |
> | s_C | ShuffleMapStage 3 | List(ShuffleMapStage 1, ShuffleMapStage 2) |
> | \- | ResultStage 4 | List(ShuffleMapStage 3) |
> The stage for {{s_A}} should be {{ShuffleMapStage 0}}, but it is generated 
> twice. The second copy, {{ShuffleMapStage 2}}, overwrites 
> {{ShuffleMapStage 0}} as the stage registered for {{s_A}}, while 
> {{ShuffleMapStage 1}} keeps referring to the _old_ stage {{ShuffleMapStage 0}}.
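
The behaviour described above can be reproduced with a small toy model. This is only a sketch, not Spark's code: {{Node}}, {{ShuffleDep}}, {{ancestorShuffleDeps}} and {{registerStages}} are hypothetical stand-ins for the RDD graph, the traversal and the shuffle-to-stage map, and the {{dedup}} flag is an assumed illustration of one possible duplicate check, not necessarily the fix that was committed.

```scala
import scala.collection.mutable

object DuplicateShuffleSketch {
  // Toy stand-ins (hypothetical, not Spark classes).
  final case class ShuffleDep(id: String, parent: Node)
  final case class Node(name: String, deps: List[ShuffleDep])

  // The DAG from the description: B depends on A via s_A,
  // and C depends on both A (via s_A) and B (via s_B).
  val a: Node = Node("A", Nil)
  val sA: ShuffleDep = ShuffleDep("s_A", a)
  val b: Node = Node("B", List(sA))
  val sB: ShuffleDep = ShuffleDep("s_B", b)
  val c: Node = Node("C", List(sA, sB))

  // Walks the ancestors of `root` and collects the ID of every shuffle
  // dependency encountered. Each node is visited once, but with
  // dedup = false a dependency shared by two nodes is collected twice.
  def ancestorShuffleDeps(root: Node, dedup: Boolean): List[String] = {
    val collected = mutable.ArrayBuffer.empty[String]
    val visited = mutable.HashSet.empty[Node]
    val waiting = mutable.Stack[Node](root)
    while (waiting.nonEmpty) {
      val node = waiting.pop()
      if (visited.add(node)) {
        for (dep <- node.deps) {
          if (!dedup || !collected.contains(dep.id)) collected += dep.id
          waiting.push(dep.parent)
        }
      }
    }
    collected.toList
  }

  // Creates one new stage number per returned dependency and records it
  // in a map keyed by shuffle ID; a repeated ID overwrites the earlier entry.
  def registerStages(deps: List[String]): Map[String, Int] =
    deps.zipWithIndex.toMap
}
```

In this model, {{ancestorShuffleDeps(c, dedup = false)}} returns {{s_A}}, {{s_B}}, {{s_A}}, and {{registerStages}} then maps {{s_A}} to stage 2 and {{s_B}} to stage 1, which mirrors the table above. With {{dedup = true}} the result is {{s_A}}, {{s_B}}, and {{s_A}} stays at stage 0.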



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
