[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...

2016-05-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/12655#issuecomment-217780852
  
**[Test build #58121 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58121/consoleFull)**
 for PR 12655 at commit 
[`55d6b6d`](https://github.com/apache/spark/commit/55d6b6db26aba0054c0da87d66404a68385ad3ff).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...

2016-05-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/12655#issuecomment-217780501
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/58119/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...

2016-05-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/12655#issuecomment-217780498
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...

2016-05-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/12655#issuecomment-217780496
  
**[Test build #58119 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58119/consoleFull)**
 for PR 12655 at commit 
[`b4e2eb1`](https://github.com/apache/spark/commit/b4e2eb1557dedc34b9a57b371e11ade2693ff38a).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...

2016-05-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/12655#issuecomment-217780312
  
**[Test build #58119 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58119/consoleFull)**
 for PR 12655 at commit 
[`b4e2eb1`](https://github.com/apache/spark/commit/b4e2eb1557dedc34b9a57b371e11ade2693ff38a).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...

2016-05-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/12655#issuecomment-216745078
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...

2016-05-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/12655#issuecomment-216745080
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57709/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...

2016-05-03 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/12655#issuecomment-216744662
  
**[Test build #57709 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57709/consoleFull)**
 for PR 12655 at commit 
[`ab92488`](https://github.com/apache/spark/commit/ab92488a7c92d1e7824021dc2113df370bb2d79f).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...

2016-05-03 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/12655#issuecomment-216732592
  
**[Test build #57709 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57709/consoleFull)**
 for PR 12655 at commit 
[`ab92488`](https://github.com/apache/spark/commit/ab92488a7c92d1e7824021dc2113df370bb2d79f).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...

2016-05-03 Thread kayousterhout
Github user kayousterhout commented on the pull request:

https://github.com/apache/spark/pull/12655#issuecomment-216727287
  
@ueshin that would be great so we can merge this!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...

2016-05-03 Thread ueshin
Github user ueshin commented on the pull request:

https://github.com/apache/spark/pull/12655#issuecomment-216706049
  
I think we are moving to #8923, but no updates there.
Should I use #8923 way here?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...

2016-04-28 Thread markhamstra
Github user markhamstra commented on the pull request:

https://github.com/apache/spark/pull/12655#issuecomment-215560207
  
Sorry, on re-reading my own post I realize that I wasn't completely clear.  
I'm saying that the two-line change in DAGScheduler.scala from 
https://github.com/apache/spark/pull/8923 combined with Kay's requested test 
changes is all we need for a first PR.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...

2016-04-28 Thread markhamstra
Github user markhamstra commented on the pull request:

https://github.com/apache/spark/pull/12655#issuecomment-215556785
  
I'm fine with the change in https://github.com/apache/spark/pull/8923  I'm 
looking at refactoring of `newOrUsedShuffleStage` that may effectively turn it 
into just `newShuffleStage`, so putting the `Used` check/guard in 
`getShuffleMapStage` where @suyanNone has it is a reasonable thing from my 
perspective.  Along with Kay's suggested augmentation of the tests, I think 
that's good enough for the first, simple PR.  


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...

2016-04-28 Thread suyanNone
Github user suyanNone commented on the pull request:

https://github.com/apache/spark/pull/12655#issuecomment-215329471
  
as I know, duplicate stage occurs: 
stage 2 and stage 3  all  depends on stage 1
stage 4 depends on stage 2 and stage 3

So, if we get getAncestorShuffleDependencies(stage 4), it will 
depth-traverse stage 2-> stage 1, then traverse stage 3 -> stage 1... due to we 
use stack,  we can't de-duplicate the same stage in this.

if  we traverse stage 4 with de-duplication successful, it will in 
`shuffleIdToMapStage`, which can tell other rdd which stage had traversed. 
right?




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...

2016-04-27 Thread ueshin
Github user ueshin commented on the pull request:

https://github.com/apache/spark/pull/12655#issuecomment-215317033
  
I saw the PR #8427 now.
Both the #8427 approach and @markhamstra's approach (should we use 
`getOrElseUpdate` instead of `getOrElse`?) seem like the simplest way to fix 
the duplicate-stage issue except a risk of `StackOverflowError`.

I agree that we should fix the duplicate-stage issue first in the simplest 
way possible, by #8427 or @markhamstra's, and then discuss a risk of `SOE` 
after one of them is merged.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...

2016-04-27 Thread markhamstra
Github user markhamstra commented on the pull request:

https://github.com/apache/spark/pull/12655#issuecomment-215250204
  
Yes, I'd also like to avoid trying to fix all of these related issues in 
one big PR.  There are overlapping concerns, but I agree that a simple fix to 
the duplicate-stage bug (SPARK-13902) is where I'd like to start -- which means 
delaying any merge of the other, related PRs [@rxin @andrewor14 @srowen 
@JoshRosen ... hopefully that's enough to get the message out].

I'd like to move the `shuffleToMapStage.getOrElse` within 
`newOrUsedShuffleStage` because that matches up better with the way I remember 
the desired semantics of newOrUsedStage -- i.e. if the desired Stage already 
exists, then re-use it; else create a new Stage.  Instead, what we do is: 
create a new Stage unconditionally; if some of the guts of the desired Stage 
already exist, then copy those over into the new Stage; else create a new 
Stage.  The way https://github.com/apache/spark/pull/8923 works is to retain 
the copy-the-guts approach  in newOrUsed, but to preface the callsite of 
newOrUsed with a guard to avoid the call in the Used case.  Seems like 
needlessly convoluted logic -- not that we're otherwise free from that 
particular sin within the DAGScheduler.

What I have now is kind of belt-and-suspenders. I don't know that the 
copy-the-guts logic within the `if 
(mapOutputTracker.containsShuffle(shuffleDep.shuffleId))` branch can ever do 
anything useful within the OrElse of `shuffleToMapStage.getOrElse`, so maybe 
that branch can be dropped.  I'm also still trying to convince myself that I 
can't lose my pants by not doing any mapOutput checking and copying in the case 
that `shuffleToMapStage` already has a reference to the desired used Stage.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...

2016-04-27 Thread kayousterhout
Github user kayousterhout commented on the pull request:

https://github.com/apache/spark/pull/12655#issuecomment-215240356
  
@markhamstra why is the newOrUsedShuffleStage approach better / different?

IMO the existing code is very convoluted at the moment, but we should first 
fix the duplicate-stage bug (in the simplest way possible), and then in a 
separate PR address cleaning up some of the function calls so this whole thing 
is less convoluted and easier to reason about.  But I realize I may be missing 
some issue that the simple fix in #8923 doesn't address.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...

2016-04-27 Thread markhamstra
Github user markhamstra commented on the pull request:

https://github.com/apache/spark/pull/12655#issuecomment-215237185
  
I'd also like to address the concerns of @squito 's 
https://github.com/apache/spark/pull/8427 and @sitalkedia 's 
https://github.com/apache/spark/pull/12436

These PR's are all linked with the issue of needlessly duplicated Stages.  
There probably needs to be some coordination in addressing the total set, but I 
hope that we can also handle them in more than one PR.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...

2016-04-27 Thread markhamstra
Github user markhamstra commented on the pull request:

https://github.com/apache/spark/pull/12655#issuecomment-215235472
  
What I'm actually trying to look at right now is something very similar to 
https://github.com/apache/spark/pull/8923, but pushing into 
newOrUsedShuffleStage the check for prior existence of a Stage in 
shuffleToMapStage:
```scala
  private def newOrUsedShuffleStage(
  shuffleDep: ShuffleDependency[_, _, 
_],
  firstJobId: Int): ShuffleMapStage = {
val rdd = shuffleDep.rdd
val numTasks = rdd.partitions.length
shuffleToMapStage.getOrElse(shuffleDep.shuffleId, {
  val stage = newShuffleMapStage(rdd, numTasks, shuffleDep, firstJobId, 
rdd.creationSite)
  if (mapOutputTracker.containsShuffle(shuffleDep.shuffleId)) {
val serLocs = 
mapOutputTracker.getSerializedMapOutputStatuses(shuffleDep.shuffleId)
val locs = MapOutputTracker.deserializeMapStatuses(serLocs)
(0 until locs.length).foreach { i =>
  if (locs(i) ne null) {
// locs(i) will be null if missing
stage.addOutputLoc(i, locs(i))
  }
}
  } else {
// Kind of ugly: need to register RDDs with the cache and map 
output tracker here
// since we can't do it in the RDD constructor because # of 
partitions is unknown
logInfo("Registering RDD " + rdd.id + " (" + rdd.getCreationSite + 
")")
mapOutputTracker.registerShuffle(shuffleDep.shuffleId, 
rdd.partitions.length)
  }
  stage
})
  }
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...

2016-04-27 Thread kayousterhout
Github user kayousterhout commented on the pull request:

https://github.com/apache/spark/pull/12655#issuecomment-215200108
  
Thanks for pointing that out @markhamstra. I'd be in favor of that solution 
(but augmented with a more clear test case -- which might just mean adding a 
nice ascii-art description of the DAG in the test case). What do you think 
Mark? @useshin do you see any issue with the simpler approach in #8923?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...

2016-04-27 Thread markhamstra
Github user markhamstra commented on the pull request:

https://github.com/apache/spark/pull/12655#issuecomment-215155985
  
It looks like https://github.com/apache/spark/pull/8923 by @suyanNone fixes 
at least the test that this PR adds to the DAGSchedulerSuite, and does so much 
more simply.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...

2016-04-26 Thread ueshin
Github user ueshin commented on the pull request:

https://github.com/apache/spark/pull/12655#issuecomment-214670763
  
@kayousterhout 
I added a comment to 
[JIRA](https://issues.apache.org/jira/browse/SPARK-13902?focusedCommentId=15257721=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15257721).
Could you take a look at it and let me know which example I should use for 
the description, the original one or the "simpler" one.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...

2016-04-25 Thread kayousterhout
Github user kayousterhout commented on the pull request:

https://github.com/apache/spark/pull/12655#issuecomment-214619200
  
I see -- now I understand the motivation for returning the shuffle 
dependencies topologically sorted, because it limits the depth of the recursion 
(it looks like the old code was trying to do that with the stack, but didn't 
quite get it right?).  Let me think about whether there's a simpler way to 
accomplish that.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...

2016-04-25 Thread ueshin
Github user ueshin commented on the pull request:

https://github.com/apache/spark/pull/12655#issuecomment-214617772
  
@kayousterhout 
I'll update the unit test and JIRA, PR descriptions to use the simpler 
example.

As for the `createAncestorShuffleMapStages` way, I think it will have a 
risk of `StackOverflowError` for a long job because the master branch version 
of `getAncestorShuffleDependencies` finds descendants first, which would have a 
lot of ancestors with long linage, and if we immediately create the shuffle 
stage there by the `newOrUsedShuffleStage` method, which builds all ancestor 
stages by recursive-call fashion, the `StackOverflowError` will be thrown.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...

2016-04-25 Thread kayousterhout
Github user kayousterhout commented on the pull request:

https://github.com/apache/spark/pull/12655#issuecomment-214455792
  
I just updated the 
[JIRA](https://issues.apache.org/jira/browse/SPARK-13902) with what I 
understand to be the issue.  Can you take a look and let me know if that's 
correct? If the simpler example I showed is sufficient to reproduce the issue 
(and my explanation is correct), can you simplify the unit test to use that 
example, and also update the JIRA and pull request description to have that 
text?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...

2016-04-25 Thread ueshin
Github user ueshin commented on the pull request:

https://github.com/apache/spark/pull/12655#issuecomment-214163952
  
@markhamstra Thank you for your comment.
I thought the non-optimal state in `DAGScheduler` was a kind of bug so I 
used the words "incorrect" or "illegal" but now I see what you thought.
So yes, I was talking about only graphs and want to improve `DAGSchduler` 
performance by this and #12060.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...

2016-04-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/12655#issuecomment-214152067
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...

2016-04-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/12655#issuecomment-214152071
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/56873/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...

2016-04-25 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/12655#issuecomment-214151363
  
**[Test build #56873 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56873/consoleFull)**
 for PR 12655 at commit 
[`cab5264`](https://github.com/apache/spark/commit/cab52643eddfc7d08673f8e0437a0c71adeb93d5).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...

2016-04-24 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/12655#discussion_r60862612
  
--- Diff: 
core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala ---
@@ -322,6 +322,59 @@ class DAGSchedulerSuite extends SparkFunSuite with 
LocalSparkContext with Timeou
 assert(sparkListener.stageByOrderOfExecution(0) < 
sparkListener.stageByOrderOfExecution(1))
   }
 
+  /**
+   * This test ensures that DAGScheduler build stage graph correctly.
+   * Here, we submit an RDD[F] having a linage of RDDs as follows:
+   *
+   *   <
+   * /   \
+   * [A] <--(1)-- [B] <--(2)-- [C] <--(3)-- [D] <--(4)-- [E] <--(5)-- [F]
--- End diff --

I updated the JIRA description.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...

2016-04-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/12655#issuecomment-214122523
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...

2016-04-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/12655#issuecomment-214122524
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/56868/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...

2016-04-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/12655#issuecomment-214122275
  
**[Test build #56868 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56868/consoleFull)**
 for PR 12655 at commit 
[`3a8ff84`](https://github.com/apache/spark/commit/3a8ff84622c3f136fa3511561a789163c94b2f2e).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...

2016-04-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/12655#issuecomment-214118512
  
**[Test build #56873 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56873/consoleFull)**
 for PR 12655 at commit 
[`cab5264`](https://github.com/apache/spark/commit/cab52643eddfc7d08673f8e0437a0c71adeb93d5).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...

2016-04-24 Thread markhamstra
Github user markhamstra commented on the pull request:

https://github.com/apache/spark/pull/12655#issuecomment-214101076
  
I think some of the terminology used in this and related PRs is confusing 
the issues.  When @kayousterhout and I ask about "correctness", what we are 
fundamentally concerned about is whether evaluation of the DAG produces the 
correct data elements.  I don't think that your description of "incorrect" or 
"illegal" graphs is meant to imply that incorrect data is produced from their 
evaluation.  Correct me if I am wrong, but I think that you are talking 
exclusively about graphs that are not optimal, causing duplication of effort 
and preventing further optimizations -- graphs that are taking longer to 
evaluate than is necessary, not graphs that are producing incorrect data 
elements.

If I am thinking correctly about this, then the entire effect of this and 
related PRs is to improve or optimize the DAGScheduler, not to create graphs 
and schedules that produce different end results than the DAGScheduler does now.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...

2016-04-24 Thread kayousterhout
Github user kayousterhout commented on a diff in the pull request:

https://github.com/apache/spark/pull/12655#discussion_r60858779
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala 
---
@@ -403,32 +403,47 @@ class DAGScheduler(
 parents.toList
   }
 
-  /** Find ancestor shuffle dependencies that are not registered in 
shuffleToMapStage yet */
-  private def getAncestorShuffleDependencies(rdd: RDD[_]): 
Stack[ShuffleDependency[_, _, _]] = {
-val parents = new Stack[ShuffleDependency[_, _, _]]
+  /**
+   * Find ancestor shuffle dependencies that are not registered in 
shuffleToMapStage yet.
+   * This is done in topological order to create ancestor stages first to 
ensure that the result
+   * stage graph is correctly built.
+   */
+  private def getAncestorShuffleDependencies(rdd: RDD[_]): 
Seq[ShuffleDependency[_, _, _]] = {
+val parents = new ArrayBuffer[ShuffleDependency[_, _, _]]
 val visited = new HashSet[RDD[_]]
 // We are manually maintaining a stack here to prevent 
StackOverflowError
 // caused by recursively visiting
 val waitingForVisit = new Stack[RDD[_]]
 def visit(r: RDD[_]) {
-  if (!visited(r)) {
-visited += r
-for (dep <- r.dependencies) {
-  dep match {
-case shufDep: ShuffleDependency[_, _, _] =>
-  if (!shuffleToMapStage.contains(shufDep.shuffleId)) {
-parents.push(shufDep)
-  }
-case _ =>
+  if (visited(r)) {
+waitingForVisit.pop()
+  } else {
+val visitedShuffleDeps = new ArrayBuffer[ShuffleDependency[_, _, 
_]]
+val unvisitedDeps = new ArrayBuffer[Dependency[_]]
+
+r.dependencies.foreach {
+  case dep: ShuffleDependency[_, _, _] if 
!shuffleToMapStage.contains(dep.shuffleId) =>
+if (visited(dep.rdd)) visitedShuffleDeps += dep
+else unvisitedDeps += dep
+  case dep if !visited(dep.rdd) => unvisitedDeps += dep
--- End diff --

Can you put the "unvisitedDeps += dep" on the next line (and indented)? 
AFAIK that's more consistent with the style used elsewhere in Spark


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...

2016-04-24 Thread kayousterhout
Github user kayousterhout commented on a diff in the pull request:

https://github.com/apache/spark/pull/12655#discussion_r60858768
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala 
---
@@ -403,32 +403,47 @@ class DAGScheduler(
 parents.toList
   }
 
-  /** Find ancestor shuffle dependencies that are not registered in 
shuffleToMapStage yet */
-  private def getAncestorShuffleDependencies(rdd: RDD[_]): 
Stack[ShuffleDependency[_, _, _]] = {
-val parents = new Stack[ShuffleDependency[_, _, _]]
+  /**
+   * Find ancestor shuffle dependencies that are not registered in 
shuffleToMapStage yet.
+   * This is done in topological order to create ancestor stages first to 
ensure that the result
+   * stage graph is correctly built.
+   */
+  private def getAncestorShuffleDependencies(rdd: RDD[_]): 
Seq[ShuffleDependency[_, _, _]] = {
+val parents = new ArrayBuffer[ShuffleDependency[_, _, _]]
 val visited = new HashSet[RDD[_]]
 // We are manually maintaining a stack here to prevent 
StackOverflowError
 // caused by recursively visiting
 val waitingForVisit = new Stack[RDD[_]]
 def visit(r: RDD[_]) {
-  if (!visited(r)) {
-visited += r
-for (dep <- r.dependencies) {
-  dep match {
-case shufDep: ShuffleDependency[_, _, _] =>
-  if (!shuffleToMapStage.contains(shufDep.shuffleId)) {
-parents.push(shufDep)
-  }
-case _ =>
+  if (visited(r)) {
+waitingForVisit.pop()
+  } else {
+val visitedShuffleDeps = new ArrayBuffer[ShuffleDependency[_, _, 
_]]
+val unvisitedDeps = new ArrayBuffer[Dependency[_]]
+
+r.dependencies.foreach {
+  case dep: ShuffleDependency[_, _, _] if 
!shuffleToMapStage.contains(dep.shuffleId) =>
+if (visited(dep.rdd)) visitedShuffleDeps += dep
--- End diff --

The spark style 
(https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide) is 
to always put if/else statements in curly-braces, even if they're only one line


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...

2016-04-24 Thread kayousterhout
Github user kayousterhout commented on a diff in the pull request:

https://github.com/apache/spark/pull/12655#discussion_r60858715
  
--- Diff: 
core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala ---
@@ -322,6 +322,59 @@ class DAGSchedulerSuite extends SparkFunSuite with 
LocalSparkContext with Timeou
 assert(sparkListener.stageByOrderOfExecution(0) < 
sparkListener.stageByOrderOfExecution(1))
   }
 
+  /**
+   * This test ensures that DAGScheduler build stage graph correctly.
+   * Here, we submit an RDD[F] having a linage of RDDs as follows:
+   *
+   *   <
+   * /   \
+   * [A] <--(1)-- [B] <--(2)-- [C] <--(3)-- [D] <--(4)-- [E] <--(5)-- [F]
--- End diff --

Can you update the JIRA to have a description of this test (this ascii art 
showing the DAG is great), and to explain what the previous (and incorrect) 
behavior was for this case?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...

2016-04-24 Thread kayousterhout
Github user kayousterhout commented on a diff in the pull request:

https://github.com/apache/spark/pull/12655#discussion_r60858660
  
--- Diff: 
core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala ---
@@ -322,6 +322,59 @@ class DAGSchedulerSuite extends SparkFunSuite with 
LocalSparkContext with Timeou
 assert(sparkListener.stageByOrderOfExecution(0) < 
sparkListener.stageByOrderOfExecution(1))
   }
 
+  /**
+   * This test ensures that DAGScheduler build stage graph correctly.
+   * Here, we submit an RDD[F] having a linage of RDDs as follows:
+   *
+   *   <
+   * /   \
+   * [A] <--(1)-- [B] <--(2)-- [C] <--(3)-- [D] <--(4)-- [E] <--(5)-- [F]
+   *\   /
+   *  <
+   *
+   * then check if all stages have correct parent stages.
+   * Note: [] means an RDD, () means a shuffle dependency.
+   */
+  test("parent stages") {
--- End diff --

Can you add the JIRA name to this test?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...

2016-04-24 Thread kayousterhout
Github user kayousterhout commented on a diff in the pull request:

https://github.com/apache/spark/pull/12655#discussion_r60858650
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala 
---
@@ -403,32 +403,47 @@ class DAGScheduler(
 parents.toList
   }
 
-  /** Find ancestor shuffle dependencies that are not registered in 
shuffleToMapStage yet */
-  private def getAncestorShuffleDependencies(rdd: RDD[_]): 
Stack[ShuffleDependency[_, _, _]] = {
-val parents = new Stack[ShuffleDependency[_, _, _]]
+  /**
+   * Find ancestor shuffle dependencies that are not registered in 
shuffleToMapStage yet.
+   * This is done in topological order to create ancestor stages first to 
ensure that the result
--- End diff --

Can you move this comment to some place inside the method? It doesn't seem 
relevant to someone using this method, since the order of the 
ShuffleDependencies returned doesn't matter


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...

2016-04-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/12655#issuecomment-214097411
  
**[Test build #56868 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56868/consoleFull)**
 for PR 12655 at commit 
[`3a8ff84`](https://github.com/apache/spark/commit/3a8ff84622c3f136fa3511561a789163c94b2f2e).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...

2016-04-24 Thread ueshin
GitHub user ueshin opened a pull request:

https://github.com/apache/spark/pull/12655

[SPARK-13902][SCHEDULER] Make DAGScheduler.getAncestorShuffleDependencies() 
return in topological order to ensure building ancestor stages first.

## What changes were proposed in this pull request?

`DAGScheduler`sometimes generate incorrect stage graph.
Some stages are generated for the same shuffleId twice or more and they are 
referenced by the child stages because the building order of the graph is not 
correct.

This patch is fixing it.

## How was this patch tested?

I added the sample RDD graph to show the illegal stage graph to 
`DAGSchedulerSuite`.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ueshin/apache-spark issues/SPARK-13902

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/12655.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #12655


commit 9a1724de0287b5ca41e30f3d3401fd721a2e1520
Author: Takuya UESHIN 
Date:   2016-03-15T02:21:09Z

Add a test to check if the stage graph is properly built.

commit f8b7910ecb52a5954de091ed79d5de9c19ba2744
Author: Takuya UESHIN 
Date:   2016-03-15T02:22:42Z

Make DAGScheduler.getAncestorShuffleDependencies() return in topological 
order to ensure building ancestor stages first.

commit 0ea3fc838f689729794b6ea3aaf0b88a339ec20c
Author: Takuya UESHIN 
Date:   2016-03-16T02:04:45Z

Refactor getAncestorShuffleDependencies.

commit 697b32208262b3c1c10bc2cae43b891c7970811d
Author: Takuya UESHIN 
Date:   2016-03-16T12:55:50Z

Fix topological sort.

commit d6d3c34e0e8387ce6390babba3df2464a8b2b4a1
Author: Takuya UESHIN 
Date:   2016-03-17T12:21:32Z

Merge branch 'master' into issues/SPARK-13902

commit 1636531c65912bbfb68e4c669690a9f9107d9cd1
Author: Takuya UESHIN 
Date:   2016-03-28T07:01:27Z

Add assertion to check not to overwrite illegally.

commit 92e9f4484b09f65829f6e9300042cc2b57979278
Author: Takuya UESHIN 
Date:   2016-03-28T07:19:09Z

Modify to mitigate adds extra push

commit 4b412f5e73ca9cf5ab2de1a51f6c30f01286e89a
Author: Takuya UESHIN 
Date:   2016-03-28T07:48:42Z

Modify comment.

commit 8fb9a149a03543a35c2a08c79edc53d49f66b5c2
Author: Takuya UESHIN 
Date:   2016-03-28T08:11:37Z

Add a comment to explain what the test is doing.

commit e2cfeaf3ef5a7291a235bbcbb968d88959e52e93
Author: Takuya UESHIN 
Date:   2016-03-29T03:22:36Z

Revert "Add assertion to check not to overwrite illegally."

This reverts commit 1636531c65912bbfb68e4c669690a9f9107d9cd1.

commit 3a8ff84622c3f136fa3511561a789163c94b2f2e
Author: Takuya UESHIN 
Date:   2016-04-05T02:58:53Z

Modify to cut down on the repeated scanning of data structures.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...

2016-03-30 Thread ueshin
Github user ueshin closed the pull request at:

https://github.com/apache/spark/pull/11720


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...

2016-03-30 Thread ueshin
Github user ueshin commented on the pull request:

https://github.com/apache/spark/pull/11720#issuecomment-203734197
  
I'm going to close this in favor of #12060.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...

2016-03-30 Thread ueshin
Github user ueshin commented on the pull request:

https://github.com/apache/spark/pull/11720#issuecomment-203350094
  
@markhamstra 
Could you take a look at #12060 please?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...

2016-03-30 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/11720#discussion_r57859542
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala 
---
@@ -286,6 +286,7 @@ class DAGScheduler(
   case None =>
 // We are going to register ancestor shuffle dependencies
 getAncestorShuffleDependencies(shuffleDep.rdd).foreach { dep =>
+  assert(!shuffleToMapStage.get(dep.shuffleId).isDefined)
--- End diff --

Let me show you another PR to improve `DAGScheduler` performance based on 
this PR.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...

2016-03-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11720#issuecomment-202722356
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/54407/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...

2016-03-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11720#issuecomment-202722352
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...

2016-03-28 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11720#issuecomment-202721624
  
**[Test build #54407 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54407/consoleFull)**
 for PR 11720 at commit 
[`e2cfeaf`](https://github.com/apache/spark/commit/e2cfeaf3ef5a7291a235bbcbb968d88959e52e93).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...

2016-03-28 Thread markhamstra
Github user markhamstra commented on a diff in the pull request:

https://github.com/apache/spark/pull/11720#discussion_r57668558
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala 
---
@@ -286,6 +286,7 @@ class DAGScheduler(
   case None =>
 // We are going to register ancestor shuffle dependencies
 getAncestorShuffleDependencies(shuffleDep.rdd).foreach { dep =>
+  assert(!shuffleToMapStage.get(dep.shuffleId).isDefined)
--- End diff --

Yes, I tend to agree that the way what really should be the same Stage gets 
recreated as a new Stage is undesirable in the DAGScheduler, but the more 
important point is that, even though it is not optimal, it does produce correct 
results, and we can't change that for the sake of code that seems more 
desirable.

That's not to say that some of the efforts that were begun to fix this kind 
of behavior while still generating correct results shouldn't be driven to 
completion.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...

2016-03-28 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11720#issuecomment-202689229
  
**[Test build #54407 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54407/consoleFull)**
 for PR 11720 at commit 
[`e2cfeaf`](https://github.com/apache/spark/commit/e2cfeaf3ef5a7291a235bbcbb968d88959e52e93).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...

2016-03-28 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/11720#discussion_r57667547
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala 
---
@@ -286,6 +286,7 @@ class DAGScheduler(
   case None =>
 // We are going to register ancestor shuffle dependencies
 getAncestorShuffleDependencies(shuffleDep.rdd).foreach { dep =>
+  assert(!shuffleToMapStage.get(dep.shuffleId).isDefined)
--- End diff --

@markhamstra Thank you for your review.
I'll remove the assertion for now but notice that there is a case that some 
stages refer the overwritten stage as its parent by the overwrite, which is 
undesirable for `DAGScheduler`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...

2016-03-28 Thread markhamstra
Github user markhamstra commented on a diff in the pull request:

https://github.com/apache/spark/pull/11720#discussion_r57605206
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala 
---
@@ -286,6 +286,7 @@ class DAGScheduler(
   case None =>
 // We are going to register ancestor shuffle dependencies
 getAncestorShuffleDependencies(shuffleDep.rdd).foreach { dep =>
+  assert(!shuffleToMapStage.get(dep.shuffleId).isDefined)
--- End diff --

Overwriting entries in `shuffleToMapStage` definitely isn't an outright 
error -- we should still get correct results after the overwrite; so we 
shouldn't be adding this new assertion to change an evaluation path that was 
generating correct results into an error condition.

There are other open efforts to clean up this creation of additional Stages 
that will be skipped instead of just re-using the prior-executed Stage more 
cleanly.  That issue is orthogonal to the generation of a topological ordering 
of the dependencies, so I'd prefer to handle it outside of this PR.

cc @squito   


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...

2016-03-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11720#issuecomment-202484260
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...

2016-03-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11720#issuecomment-202484262
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/54319/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...

2016-03-28 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11720#issuecomment-202483946
  
**[Test build #54319 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54319/consoleFull)**
 for PR 11720 at commit 
[`8fb9a14`](https://github.com/apache/spark/commit/8fb9a149a03543a35c2a08c79edc53d49f66b5c2).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...

2016-03-28 Thread markhamstra
Github user markhamstra commented on a diff in the pull request:

https://github.com/apache/spark/pull/11720#discussion_r57586861
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala 
---
@@ -286,6 +286,7 @@ class DAGScheduler(
   case None =>
 // We are going to register ancestor shuffle dependencies
 getAncestorShuffleDependencies(shuffleDep.rdd).foreach { dep =>
+  assert(!shuffleToMapStage.get(dep.shuffleId).isDefined)
--- End diff --

Yes, but it is not immediately obvious that that is inappropriate.  I need 
to spend some time re-familiarizing myself with newOrUsedShuffleStage.  In any 
event, just failing an assertion in the middle of the DAGScheduler is not 
likely something we want to do.  At a bare minimum, we'd want to be logging a 
more useful error message. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...

2016-03-28 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/11720#discussion_r57586226
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala 
---
@@ -286,6 +286,7 @@ class DAGScheduler(
   case None =>
 // We are going to register ancestor shuffle dependencies
 getAncestorShuffleDependencies(shuffleDep.rdd).foreach { dep =>
+  assert(!shuffleToMapStage.get(dep.shuffleId).isDefined)
--- End diff --

The original codes possibly overwrite entries in `shuffleToMapStage`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...

2016-03-28 Thread markhamstra
Github user markhamstra commented on the pull request:

https://github.com/apache/spark/pull/11720#issuecomment-202434831
  
The basic idea is a good one, but in addition to needing to spend some time 
sorting out the logic around that `assert`, I'm a little concerned about the 
performance implications of the multiple scans of the data structures in the 
`filter`, `forall` and `for`.  It looks to be possible to compose these so that 
the performance will be better at the expense of a little less legible code. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...

2016-03-28 Thread markhamstra
Github user markhamstra commented on a diff in the pull request:

https://github.com/apache/spark/pull/11720#discussion_r57581958
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala 
---
@@ -286,6 +286,7 @@ class DAGScheduler(
   case None =>
 // We are going to register ancestor shuffle dependencies
 getAncestorShuffleDependencies(shuffleDep.rdd).foreach { dep =>
+  assert(!shuffleToMapStage.get(dep.shuffleId).isDefined)
--- End diff --

Wait... sorry, let me rethink this a bit.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...

2016-03-28 Thread markhamstra
Github user markhamstra commented on a diff in the pull request:

https://github.com/apache/spark/pull/11720#discussion_r57581491
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala 
---
@@ -286,6 +286,7 @@ class DAGScheduler(
   case None =>
 // We are going to register ancestor shuffle dependencies
 getAncestorShuffleDependencies(shuffleDep.rdd).foreach { dep =>
+  assert(!shuffleToMapStage.get(dep.shuffleId).isDefined)
--- End diff --

Why are you asserting this when you are already within a match case where 
this must be true?  If it is actually possible for the contents of 
shuffleToMapStage to change between the pattern match and the foreach, then we 
need to fix the unsynchronized update, not just shutdown Spark on a failed 
assertion.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...

2016-03-28 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11720#issuecomment-202427366
  
**[Test build #54319 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54319/consoleFull)**
 for PR 11720 at commit 
[`8fb9a14`](https://github.com/apache/spark/commit/8fb9a149a03543a35c2a08c79edc53d49f66b5c2).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...

2016-03-28 Thread ueshin
Github user ueshin commented on the pull request:

https://github.com/apache/spark/pull/11720#issuecomment-202427017
  
Jenkins, retest this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...

2016-03-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11720#issuecomment-202369148
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/54298/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...

2016-03-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11720#issuecomment-202369146
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...

2016-03-28 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11720#issuecomment-202369046
  
**[Test build #54298 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54298/consoleFull)**
 for PR 11720 at commit 
[`8fb9a14`](https://github.com/apache/spark/commit/8fb9a149a03543a35c2a08c79edc53d49f66b5c2).
 * This patch **fails from timeout after a configured wait of \`250m\`**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...

2016-03-28 Thread maropu
Github user maropu commented on the pull request:

https://github.com/apache/spark/pull/11720#issuecomment-202320267
  
cc: @rxin


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...

2016-03-28 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11720#issuecomment-202298857
  
**[Test build #54298 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54298/consoleFull)**
 for PR 11720 at commit 
[`8fb9a14`](https://github.com/apache/spark/commit/8fb9a149a03543a35c2a08c79edc53d49f66b5c2).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...

2016-03-28 Thread ueshin
Github user ueshin commented on the pull request:

https://github.com/apache/spark/pull/11720#issuecomment-202298736
  
@maropu Thank you for your review.
I modified as you mentioned.

> do we get wrong answers from this kind of incorrect stage graphs?

I don't think this kind of incorrect stage graphs causes the wrong answers 
for now.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...

2016-03-28 Thread maropu
Github user maropu commented on the pull request:

https://github.com/apache/spark/pull/11720#issuecomment-202258270
  
@ueshin Great work. I'm not 100% sure though, one question; do we get wrong 
answers from this kind of incorrect stage graphs? If so, it'd be better to add 
tests in `RDDSuite`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...

2016-03-28 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/11720#discussion_r57548138
  
--- Diff: 
core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala ---
@@ -322,6 +322,53 @@ class DAGSchedulerSuite extends SparkFunSuite with 
LocalSparkContext with Timeou
 assert(sparkListener.stageByOrderOfExecution(0) < 
sparkListener.stageByOrderOfExecution(1))
   }
 
+  /*
--- End diff --

We should add comments for what's tested here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...

2016-03-28 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/11720#discussion_r57548049
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala 
---
@@ -403,25 +403,37 @@ class DAGScheduler(
 parents.toList
   }
 
-  /** Find ancestor shuffle dependencies that are not registered in 
shuffleToMapStage yet */
-  private def getAncestorShuffleDependencies(rdd: RDD[_]): 
Stack[ShuffleDependency[_, _, _]] = {
-val parents = new Stack[ShuffleDependency[_, _, _]]
+  /**
+   * Find ancestor shuffle dependencies that are not registered in 
shuffleToMapStage yet,
+   * in topological order to ensure building ancestor stages first.
--- End diff --

Nit: `This is done in topological order to ` is better?
Also, I think we need to add comments for the reason why this topological 
order is needed here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...

2016-03-28 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/11720#discussion_r57547815
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala 
---
@@ -403,25 +403,37 @@ class DAGScheduler(
 parents.toList
   }
 
-  /** Find ancestor shuffle dependencies that are not registered in 
shuffleToMapStage yet */
-  private def getAncestorShuffleDependencies(rdd: RDD[_]): 
Stack[ShuffleDependency[_, _, _]] = {
-val parents = new Stack[ShuffleDependency[_, _, _]]
+  /**
+   * Find ancestor shuffle dependencies that are not registered in 
shuffleToMapStage yet,
+   * in topological order to ensure building ancestor stages first.
+   */
+  private def getAncestorShuffleDependencies(rdd: RDD[_]): 
Seq[ShuffleDependency[_, _, _]] = {
+val parents = new ArrayBuffer[ShuffleDependency[_, _, _]]
 val visited = new HashSet[RDD[_]]
 // We are manually maintaining a stack here to prevent 
StackOverflowError
 // caused by recursively visiting
 val waitingForVisit = new Stack[RDD[_]]
 def visit(r: RDD[_]) {
   if (!visited(r)) {
-visited += r
-for (dep <- r.dependencies) {
-  dep match {
-case shufDep: ShuffleDependency[_, _, _] =>
-  if (!shuffleToMapStage.contains(shufDep.shuffleId)) {
-parents.push(shufDep)
-  }
-case _ =>
+val deps = r.dependencies.filter {
+  case shufDep: ShuffleDependency[_, _, _] =>
+!shuffleToMapStage.contains(shufDep.shuffleId)
+  case _ => true
+}
+if (deps.forall(dep => visited(dep.rdd))) {
+  visited += r
+  for (dep <- deps) {
+dep match {
+  case shufDep: ShuffleDependency[_, _, _] =>
+parents += shufDep
+  case _ =>
+}
+  }
+} else {
+  waitingForVisit.push(r)
--- End diff --

This pr adds extra push against the original codes.
To mitigate this overhead, how about poping a rdd in case of all the 
dependent rdds visited?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...

2016-03-28 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/11720#discussion_r57547560
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala 
---
@@ -403,25 +403,37 @@ class DAGScheduler(
 parents.toList
   }
 
-  /** Find ancestor shuffle dependencies that are not registered in 
shuffleToMapStage yet */
-  private def getAncestorShuffleDependencies(rdd: RDD[_]): 
Stack[ShuffleDependency[_, _, _]] = {
-val parents = new Stack[ShuffleDependency[_, _, _]]
+  /**
+   * Find ancestor shuffle dependencies that are not registered in 
shuffleToMapStage yet,
+   * in topological order to ensure building ancestor stages first.
+   */
+  private def getAncestorShuffleDependencies(rdd: RDD[_]): 
Seq[ShuffleDependency[_, _, _]] = {
--- End diff --

ISTM it'd be better to check illegal overwrites in `shuffleToMapStage` like 
`assert(!shuffleToMapStage.get(dep.shuffleId).isDefined)` in 
`https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L289`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...

2016-03-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11720#issuecomment-197906450
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...

2016-03-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11720#issuecomment-197378322
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...

2016-03-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11720#issuecomment-197378325
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/53318/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...

2016-03-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11720#issuecomment-197377676
  
**[Test build #53318 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53318/consoleFull)**
 for PR 11720 at commit 
[`697b322`](https://github.com/apache/spark/commit/697b32208262b3c1c10bc2cae43b891c7970811d).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...

2016-03-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11720#issuecomment-197905769
  
**[Test build #53425 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53425/consoleFull)**
 for PR 11720 at commit 
[`d6d3c34`](https://github.com/apache/spark/commit/d6d3c34e0e8387ce6390babba3df2464a8b2b4a1).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...

2016-03-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11720#issuecomment-197906458
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/53425/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...

2016-03-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11720#issuecomment-197320498
  
**[Test build #53318 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53318/consoleFull)**
 for PR 11720 at commit 
[`697b322`](https://github.com/apache/spark/commit/697b32208262b3c1c10bc2cae43b891c7970811d).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...

2016-03-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11720#issuecomment-197855749
  
**[Test build #53425 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53425/consoleFull)**
 for PR 11720 at commit 
[`d6d3c34`](https://github.com/apache/spark/commit/d6d3c34e0e8387ce6390babba3df2464a8b2b4a1).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...

2016-03-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11720#issuecomment-197154174
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...

2016-03-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11720#issuecomment-197154175
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/53265/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...

2016-03-15 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11720#issuecomment-197154079
  
**[Test build #53265 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53265/consoleFull)**
 for PR 11720 at commit 
[`0ea3fc8`](https://github.com/apache/spark/commit/0ea3fc838f689729794b6ea3aaf0b88a339ec20c).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...

2016-03-15 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11720#issuecomment-197127125
  
**[Test build #53265 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53265/consoleFull)**
 for PR 11720 at commit 
[`0ea3fc8`](https://github.com/apache/spark/commit/0ea3fc838f689729794b6ea3aaf0b88a339ec20c).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...

2016-03-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11720#issuecomment-196767388
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...

2016-03-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11720#issuecomment-196767390
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/53181/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...

2016-03-15 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11720#issuecomment-196767151
  
**[Test build #53181 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53181/consoleFull)**
 for PR 11720 at commit 
[`f8b7910`](https://github.com/apache/spark/commit/f8b7910ecb52a5954de091ed79d5de9c19ba2744).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...

2016-03-15 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11720#issuecomment-196728116
  
**[Test build #53181 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53181/consoleFull)**
 for PR 11720 at commit 
[`f8b7910`](https://github.com/apache/spark/commit/f8b7910ecb52a5954de091ed79d5de9c19ba2744).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...

2016-03-15 Thread ueshin
GitHub user ueshin opened a pull request:

https://github.com/apache/spark/pull/11720

[SPARK-13902][SCHEDULER] Make DAGScheduler.getAncestorShuffleDependencies() 
return in topological order to ensure building ancestor stages first.

## What changes were proposed in this pull request?

`DAGScheduler`sometimes generate incorrect stage graph.
Some stages are generated for the same shuffleId twice or more and they are 
referenced by the child stages because the building order of the graph is not 
correct.

This patch is fixing it.

## How was this patch tested?

I added the sample RDD graph to show the illegal stage graph to 
`DAGSchedulerSuite`.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ueshin/apache-spark issues/SPARK-13902

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/11720.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #11720


commit 9a1724de0287b5ca41e30f3d3401fd721a2e1520
Author: Takuya UESHIN 
Date:   2016-03-15T02:21:09Z

Add a test to check if the stage graph is properly built.

commit f8b7910ecb52a5954de091ed79d5de9c19ba2744
Author: Takuya UESHIN 
Date:   2016-03-15T02:22:42Z

Make DAGScheduler.getAncestorShuffleDependencies() return in topological 
order to ensure building ancestor stages first.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org