[GitHub] spark issue #17936: [SPARK-20638][Core]Optimize the CartesianRDD to reduce r...
Github user suyanNone commented on the issue: https://github.com/apache/spark/pull/17936 Careless of me not to notice UnsafeCartesianRDD's ExternalAppendOnlyUnsafeRowArray, that is nice. I have not read the whole discussion here... Unifying the solution with what UnsafeCartesianRDD already has would be a big improvement for CartesianRDD, and it seems simpler and easier to understand. (In our internal change, we adopted a memory-and-disk array to store the GraphX Array[EdgeAttr].) I am not sure there is a strong requirement to optimize away the per-task locality... --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17936: [SPARK-20638][Core]Optimize the CartesianRDD to reduce r...
Github user suyanNone commented on the issue: https://github.com/apache/spark/pull/17936 Maybe create a MemoryAndDiskArray, like ExternalAppendOnlyMap? A MemoryAndDiskArray could be used not only here but also for groupByKey, and its memory could be controlled by the MemoryManager.
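Since the comment above proposes a MemoryAndDiskArray analogous to ExternalAppendOnlyMap, here is a minimal standalone sketch of the idea (a hypothetical class, not Spark's API): elements stay in memory up to a threshold, full batches spill to temp files via Java serialization, and iteration replays spills in insertion order. A real version would account memory through the MemoryManager rather than a fixed element count.

```scala
import java.io.{File, FileInputStream, FileOutputStream, ObjectInputStream, ObjectOutputStream}
import scala.collection.mutable.ArrayBuffer

// Hypothetical sketch of a "MemoryAndDiskArray": append-only, spills to disk
// past `spillThreshold` elements, iterates in insertion order.
class MemoryAndDiskArray[T <: java.io.Serializable](spillThreshold: Int) {
  private val inMemory = new ArrayBuffer[T]
  private val spillFiles = new ArrayBuffer[File]

  def append(elem: T): Unit = {
    inMemory += elem
    if (inMemory.size >= spillThreshold) spill()
  }

  // Write the current in-memory batch to a temp file and clear memory.
  private def spill(): Unit = {
    val file = File.createTempFile("mad-array-", ".spill")
    file.deleteOnExit()
    val out = new ObjectOutputStream(new FileOutputStream(file))
    try {
      out.writeInt(inMemory.size)
      inMemory.foreach(out.writeObject)
    } finally out.close()
    spillFiles += file
    inMemory.clear()
  }

  // Replay spilled batches first (oldest first), then the in-memory tail.
  def iterator: Iterator[T] = {
    val spilled = spillFiles.iterator.flatMap { file =>
      val in = new ObjectInputStream(new FileInputStream(file))
      try {
        val n = in.readInt()
        (0 until n).map(_ => in.readObject().asInstanceOf[T])
      } finally in.close()
    }
    spilled ++ inMemory.iterator
  }
}
```

With a threshold of 4, appending ten elements produces two spill files plus a two-element in-memory tail, and the iterator still yields all ten in order.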
[GitHub] spark issue #12576: [SPARK-14804][Spark][Graphx] Fix Graph vertexRDD/EdgeRDD...
Github user suyanNone commented on the issue: https://github.com/apache/spark/pull/12576 @tdas agree, `isCheckpointed` should be final; in the current code, is `isCheckpointed` exposed as public only for testing? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.
[GitHub] spark issue #14765: [SPARK-15815] Keeping tell yarn the target executors in ...
Github user suyanNone commented on the issue: https://github.com/apache/spark/pull/14765 jenkins retest.
[GitHub] spark pull request #14765: [SPARK-15815] Keeping tell yarn the target executors in ...
GitHub user suyanNone opened a pull request: https://github.com/apache/spark/pull/14765 [SPARK-15815] Keeping tell yarn the target executors in ...

## What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) (If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/suyanNone/spark fix-hang

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14765.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #14765

commit 59de77b5f523340d50836f072b38afee9bf579c3
Author: hushan <hus...@xiaomi.com>
Date: 2016-08-23T03:02:53Z

    Fix hang
[GitHub] spark pull request: [SPARK][SPARK-10842]Eliminate creating duplica...
Github user suyanNone commented on the pull request: https://github.com/apache/spark/pull/8923#issuecomment-218682081 Already merged into https://github.com/apache/spark/pull/12655; marking this closed.
[GitHub] spark pull request: [SPARK][SPARK-10842]Eliminate creating duplica...
Github user suyanNone closed the pull request at: https://github.com/apache/spark/pull/8923
[GitHub] spark pull request: [SPARK-14750][Spark][UI] Support spark on yarn...
Github user suyanNone closed the pull request at: https://github.com/apache/spark/pull/12570
[GitHub] spark pull request: [SPARK-14750][Spark][UI] Support spark on yarn...
Github user suyanNone commented on the pull request: https://github.com/apache/spark/pull/12570#issuecomment-218412962 @ajbozarth oh... I am not familiar with standalone... I had no idea there was a class named `LogPage`; nice of you to tell me about that...
[GitHub] spark pull request: [SPARK-14957][Yarn] Adopt healthy dir to store...
Github user suyanNone commented on the pull request: https://github.com/apache/spark/pull/12735#issuecomment-218389524 @tgravescs, yes, the executor will fail fast... I vaguely remember an application failure caused by an unhealthy dir on the shuffle server, but I did not have time to dig into it at the time... Now I can't reproduce it, so I will close this.
[GitHub] spark pull request: [SPARK-14957][Yarn] Adopt healthy dir to store...
Github user suyanNone closed the pull request at: https://github.com/apache/spark/pull/12735
[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...
Github user suyanNone commented on the pull request: https://github.com/apache/spark/pull/12655#issuecomment-215329471 As far as I know, duplicate stages occur like this: stage 2 and stage 3 both depend on stage 1, and stage 4 depends on stage 2 and stage 3. So if we call getAncestorShuffleDependencies(stage 4), it will depth-first traverse stage 2 -> stage 1, then traverse stage 3 -> stage 1... Because we use a stack, we cannot de-duplicate the same stage during this traversal. If we traverse stage 4 with de-duplication successfully, the stage will be in `shuffleIdToMapStage`, which can tell other RDDs which stages have already been traversed. Right?
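The diamond-shaped dependency described above can be sketched with a toy depth-first traversal (illustrative only, not the actual DAGScheduler code): without de-duplication, a stack-based walk of stage 4 reaches stage 1 twice, once via stage 2 and once via stage 3; tracking visited stage ids removes the duplicate, playing the role that `shuffleIdToMapStage` plays in the real scheduler.

```scala
import scala.collection.mutable

// Toy stage graph: each stage knows only its parents.
case class Stage(id: Int, parents: Seq[Stage])

// Stack-based DFS over ancestors, skipping stages we have already seen,
// so a shared ancestor (the "diamond" case) is emitted exactly once.
def ancestorsWithDedup(stage: Stage): Seq[Int] = {
  val visited = mutable.Set[Int]()
  val order = mutable.ArrayBuffer[Int]()
  val stack = mutable.Stack[Stage](stage.parents: _*)
  while (stack.nonEmpty) {
    val s = stack.pop()
    if (!visited.contains(s.id)) {
      visited += s.id
      order += s.id
      s.parents.foreach(stack.push)
    }
  }
  order.toSeq
}
```

For the graph in the comment (stage 2 and stage 3 both on top of stage 1, stage 4 on top of both), stage 1 appears exactly once in the result.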
[GitHub] spark pull request: [SPARK-14957][Yarn] Adopt healthy dir to store...
Github user suyanNone commented on the pull request: https://github.com/apache/spark/pull/12735#issuecomment-215298717 To be honest, I have not walked through the whole Yarn shuffle server process; I just fixed the problem our user reported: they could not connect to the shuffle server because it tried to create the LevelDB in a non-existent folder. I will take some time to reproduce the problem and understand this better. @tgravescs I will look into the getRecoveryPath API...
[GitHub] spark pull request: [SPARK-14957][Yarn] Adopt healthy dir to store...
Github user suyanNone commented on the pull request: https://github.com/apache/spark/pull/12735#issuecomment-215297811 As for why there would be multiple meta files: assume we find one, but its disk has become a read-only filesystem. Do we still use it, or do we choose another healthy dir and create a new one? If the latter, we cannot delete the old one, so we end up with two files...
[GitHub] spark pull request: [SPARK-14957][Yarn] Adopt healthy dir to store...
Github user suyanNone commented on the pull request: https://github.com/apache/spark/pull/12735#issuecomment-215294953

```
registeredExecutorFile = findRegisteredExecutorFile(conf.getTrimmedStrings("yarn.nodemanager.local-dirs"));
```

We get the local dirs from the YARN config; there are many dirs, but currently we always adopt the first one, right? First, assume no meta file exists yet: if the first disk has been removed or has disk errors (read-only filesystem, input/output errors, no space left), ExternalShuffleBlockResolver will try to create a new LevelDB file there, fail, and throw an IOException, and this IOException is swallowed by YarnShuffleService:

```
try {
  blockHandler = new ExternalShuffleBlockHandler(transportConf, registeredExecutorFile);
} catch (Exception e) {
  logger.error("Failed to initialize external shuffle service", e);
}
```

So maybe we can choose a healthier disk dir for storing the meta file, to avoid an unnecessary exception.
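The fix proposed above, picking a healthy dir instead of always taking the first entry of yarn.nodemanager.local-dirs, could look roughly like this (findHealthyDir is a hypothetical helper, not Spark's API): scan the configured dirs and take the first one that exists (or can be created) and is writable.

```scala
import java.io.File

// Hypothetical helper: return the first local dir that is usable for the
// recovery/LevelDB file, skipping dirs on failed or read-only disks.
def findHealthyDir(localDirs: Seq[String]): Option[File] =
  localDirs.iterator
    .map(new File(_))
    .find(dir => (dir.isDirectory || dir.mkdirs()) && dir.canWrite)
```

A dead first disk (here simulated by a path whose parent is a plain file, so it can never be created) is skipped in favor of the next writable dir.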
[GitHub] spark pull request: [SPARK-14957][Spark] Adopt healthy dir to stor...
GitHub user suyanNone opened a pull request: https://github.com/apache/spark/pull/12735 [SPARK-14957][Spark] Adopt healthy dir to store executor meta

## What changes were proposed in this pull request?

Adopt a healthy dir to store executor meta.

## How was this patch tested?

Unit tests

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/suyanNone/spark fix-0dir

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/12735.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #12735

commit 2527391378e3f572663211f9ecc20a3a49c52ebb
Author: hushan <hus...@xiaomi.com>
Date: 2016-04-27T13:24:49Z

    Adopt healthy dir to store executor meta
[GitHub] spark pull request: [SPARK][SPARK-10842]Eliminate creating duplica...
Github user suyanNone commented on the pull request: https://github.com/apache/spark/pull/8923#issuecomment-214655571 Reverted Set to Stack, and added a test case. Reverted back to Stack because we should build map stages from the bottom up (a Stack), not in random order (a Set).
[GitHub] spark pull request: [SPARK-10796][CORE]Resubmit stage while lost t...
Github user suyanNone commented on the pull request: https://github.com/apache/spark/pull/8927#issuecomment-213363612 @squito @srowen
[GitHub] spark pull request: [SPARK-10796][CORE]Resubmit stage while lost t...
Github user suyanNone commented on a diff in the pull request: https://github.com/apache/spark/pull/8927#discussion_r60716314 --- Diff: core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala --- @@ -1083,8 +1085,6 @@ class DAGSchedulerSuite extends SparkFunSuite with LocalSparkContext with Timeou Success, makeMapStatus("hostA", reduceRdd.partitions.size))) assert(shuffleStage.numAvailableOutputs === 2) -assert(mapOutputTracker.getMapSizesByExecutorId(shuffleId, 0).map(_._1).toSet === --- End diff -- For a running stage, an executor lost will not register output locs in this PR
[GitHub] spark pull request: [SPARK-10796][CORE]Resubmit stage while lost t...
Github user suyanNone commented on a diff in the pull request: https://github.com/apache/spark/pull/8927#discussion_r60715521 --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala --- @@ -1416,6 +1449,7 @@ class DAGScheduler( outputCommitCoordinator.stageEnd(stage.id) listenerBus.post(SparkListenerStageCompleted(stage.latestInfo)) +taskScheduler.zombieTasks(stage.id) --- End diff -- Once the stage has finished, it should make the previous TaskSets zombie
[GitHub] spark pull request: [SPARK-12524][Core]DagScheduler may submit a t...
Github user suyanNone commented on the pull request: https://github.com/apache/spark/pull/12524#issuecomment-213351412 Can you refer to this and have a look: https://github.com/apache/spark/pull/8927
[GitHub] spark pull request: [Spark][Graphx] Fix Graph vertexRDD/EdgeRDD ch...
GitHub user suyanNone reopened a pull request: https://github.com/apache/spark/pull/12576 [Spark][Graphx] Fix Graph vertexRDD/EdgeRDD checkpoint results ClassCastException

## What changes were proposed in this pull request?

The PR fixes the compute chain from CheckpointRDD <- vertexRDDImpl to CheckpointRDD <- partitionRDD <- vertexRDDImpl.

## How was this patch tested?

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/suyanNone/spark refine-graph

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/12576.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #12576

commit f2ced743b2aa147ed2b43221b4f52d502b85c8c6
Author: hushan <hus...@xiaomi.com>
Date: 2016-04-21T13:43:30Z

    fix

commit a106758dd4492ccae851c48d010e6eb9fb26b849
Author: hushan <hus...@xiaomi.com>
Date: 2016-04-22T02:37:33Z

    Add testcase
[GitHub] spark pull request: [Spark][Graphx] Fix Graph vertexRDD/EdgeRDD ch...
Github user suyanNone closed the pull request at: https://github.com/apache/spark/pull/12576
[GitHub] spark pull request: [Spark][Graphx] Fix Graph vertexRDD/EdgeRDD ch...
Github user suyanNone commented on the pull request: https://github.com/apache/spark/pull/12576#issuecomment-212938601 Opened by mistake; closing for now.
[GitHub] spark pull request: [Spark][Graphx] Fix Graph vertexRDD/EdgeRDD ch...
GitHub user suyanNone opened a pull request: https://github.com/apache/spark/pull/12576 [Spark][Graphx] Fix Graph vertexRDD/EdgeRDD checkpoint results ClassCastException

## What changes were proposed in this pull request?

The PR fixes the compute chain from CheckpointRDD <- vertexRDDImpl to CheckpointRDD <- partitionRDD <- vertexRDDImpl.

## How was this patch tested?

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/suyanNone/spark refine-graph

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/12576.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #12576

commit f2ced743b2aa147ed2b43221b4f52d502b85c8c6
Author: hushan <hus...@xiaomi.com>
Date: 2016-04-21T13:43:30Z

    fix
[GitHub] spark pull request: [SPARK-14750][Spark][UI] Support spark on yarn...
GitHub user suyanNone opened a pull request: https://github.com/apache/spark/pull/12570 [SPARK-14750][Spark][UI] Support spark on yarn user refer finished application log in historyServer

## What changes were proposed in this pull request?

For Spark on YARN, let users refer to application logs after the application has finished.

## How was this patch tested?

manual tests

before:
![Uploading Screenshot from 2016-04-21 16:59:45.png…]()

after:
![Uploading Screenshot from 2016-04-21 17:00:44.png…]()
![Uploading Screenshot from 2016-04-21 17:01:15.png…]()

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/suyanNone/spark refer-hdfs

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/12570.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #12570

commit f37d5c0ad8a0fa5eaa7c305c6ddad9d6702e462c
Author: hushan <hus...@xiaomi.com>
Date: 2016-04-20T08:28:38Z

    [SPARK][UI] Make historyserver refer application log

    Summary: Ref T5959
    Test Plan: Test local
    Reviewers: liushaohui, zouchenjun, wangjiasheng, peng.zhang
    Reviewed By: peng.zhang
    Maniphest Tasks: T5959
    Differential Revision: https://phabricator.d.xiaomi.net/D31886
    Conflicts: core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala

commit 80262effb0918616a7d33e81c8e97db5d56d74ab
Author: hushan <hus...@xiaomi.com>
Date: 2016-04-21T03:26:15Z

    Refine
[GitHub] spark pull request: [SPARK-10796][CORE]Resubmit stage while lost t...
Github user suyanNone commented on the pull request: https://github.com/apache/spark/pull/8927#issuecomment-212322683 jenkins retest this please
[GitHub] spark pull request: [SPARK-10796][CORE]Resubmit stage while lost t...
GitHub user suyanNone reopened a pull request: https://github.com/apache/spark/pull/8927 [SPARK-10796][CORE]Resubmit stage while lost task in Zombie TaskSets

We hit this problem in Spark 1.3.0, and I can also reproduce it on the latest version.

Description:
1. A running `ShuffleMapStage` can have multiple `TaskSet`s: one active TaskSet and multiple zombie TaskSets.
2. A running `ShuffleMapStage` succeeds only when all of its partitions have been processed successfully, i.e. each task's MapStatus has been added into `outputLocs`.
3. The MapStatuses of a running `ShuffleMapStage` may be produced by zombie TaskSet 1 / zombie TaskSet 2 / ... / the active TaskSet N, and some MapStatuses may belong to only one TaskSet, possibly a zombie one.
4. If an executor is lost, it can happen that some of the lost executor's MapStatuses were produced by a zombie TaskSet. In the current logic, the fix for a lost MapStatus is that each TaskSet re-runs the tasks that succeeded on the lost executor: re-add them into the TaskSet's pendingTasks, and re-add their partitions into the Stage's pendingPartitions. But this is useless when the lost MapStatus belongs only to a zombie TaskSet: since it is a zombie, its `pendingTasks` will never be scheduled.
5. The only condition for resubmitting a stage is a task throwing `FetchFailedException`, but the lost executor might not remove any MapStatus of a parent stage of one of the running stages, while that running `Stage` happens to lose a MapStatus belonging only to a zombie TaskSet. So once all zombie TaskSets have finished their running tasks and the active TaskSet has finished its pending tasks, they are all removed by `TaskSchedulerImpl`, yet the running stage's pending partitions remain non-empty. It hangs...

Example:
Running stage 0.0, running TaskSet 0.0; finished task 0.0 on ExecA; running task 1.0 on ExecB; waiting task 2.0
---> task 1.0 throws FetchFailedException
---> running resubmitted stage 0.1, running TaskSet 0.1 (which re-runs task 1 and task 2); assume task 1.0 finishes on ExecA
---> ExecA is lost, and it happens that nothing throws FetchFailedException
---> TaskSet 0.1 resubmits task 1, re-adding it into pendingTasks and waiting for TaskSchedulerImpl to schedule it. TaskSet 0.0 also resubmits task 0, re-adding it into pendingTasks, but because it is a zombie, TaskSchedulerImpl skips scheduling TaskSet 0.0.
So once TaskSet 0.0 and TaskSet 0.1 are both (isZombie && runningTasks.empty), TaskSchedulerImpl removes those TaskSets. DAGScheduler still has pendingPartitions because of the task lost in TaskSet 0.0, but its TaskSets have all been removed, so it hangs.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/suyanNone/spark rerun-special

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/8927.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #8927

commit 1be6071956c6a93e2d264fb1c2db92d01c4d3fe4
Author: hushan <hus...@xiaomi.com>
Date: 2016-04-20T08:21:51Z

    Fix zombieTasksets and RemovedTaskset lost output
[GitHub] spark pull request: [SPARK-10796][CORE]Resubmit stage while lost t...
Github user suyanNone closed the pull request at: https://github.com/apache/spark/pull/8927
[GitHub] spark pull request: [SPARK-12009][Yarn]Avoid to re-allocating yarn...
Github user suyanNone commented on a diff in the pull request: https://github.com/apache/spark/pull/9992#discussion_r46235437 --- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala --- @@ -228,6 +228,7 @@ private[spark] class ApplicationMaster( } private def sparkContextStopped(sc: SparkContext) = { +sc.requestTotalExecutors(0, 0, Map.empty) --- End diff -- right... I was mixing up the YarnScheduler and YarnSchedulerBackend classes... The stop order is: YarnClusterScheduler.stop -> TaskSchedulerImpl.stop -> CoarseGrainedSchedulerBackend.stop -> async: stopExecutors -> ApplicationMaster.sparkContextStopped(sc). Maybe a good way is to override stop() in YarnSchedulerBackend:

```
YarnSchedulerBackend.stop {
  doRequestTotalExecutors() // this is an askWithRetry...
  super.stop
}
```
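The ordering proposed in the snippet above can be illustrated with a stripped-down standalone sketch (the class names only mirror Spark's; this is not the real code): the subclass drops the executor target to zero before delegating to the superclass stop, so the resource manager stops allocating containers before the asynchronous executor teardown begins.

```scala
import scala.collection.mutable.ArrayBuffer

// Stand-in for CoarseGrainedSchedulerBackend: stop() tears down executors.
class CoarseGrainedBackendSketch {
  val events = ArrayBuffer[String]()
  def stop(): Unit = events += "stopExecutors"
}

// Stand-in for YarnSchedulerBackend: tell the RM first, then delegate.
class YarnBackendSketch extends CoarseGrainedBackendSketch {
  def doRequestTotalExecutors(n: Int): Unit = events += s"requestTotalExecutors($n)"
  override def stop(): Unit = {
    doRequestTotalExecutors(0) // zero the target before executors go away
    super.stop()
  }
}
```

Calling stop() on the subclass records the zero-executor request strictly before the executor teardown, which is the ordering the comment argues for.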
[GitHub] spark pull request: [SPARK-12009][Yarn]Avoid to re-allocating yarn...
Github user suyanNone commented on the pull request: https://github.com/apache/spark/pull/9992#issuecomment-160623236 @lianhuiwang Hi, can you take a look at this?
[GitHub] spark pull request: [SPARK-12009][Yarn]Avoid to re-allocating yarn...
Github user suyanNone commented on the pull request: https://github.com/apache/spark/pull/10031#issuecomment-160623021 @lianhuiwang I prefer calling sc.requestTotalExecutors(0), because we cannot foresee the user's behavior after sc.stop, and I have already refined this in my previous version. Thanks if you can take some time to look at it.
[GitHub] spark pull request: [SPARK-12009][Yarn]Avoid to re-allocating yarn...
Github user suyanNone commented on the pull request: https://github.com/apache/spark/pull/10031#issuecomment-160507811 no... actually, it is not a good idea. An executor may be killed by Yarn for some reason... or just an akka disconnect... I hope you can close this; I will work on finding a more reasonable solution.
[GitHub] spark pull request: [SPARK-12009][Yarn]Avoid to re-allocating yarn...
GitHub user suyanNone opened a pull request: https://github.com/apache/spark/pull/9992 [SPARK-12009][Yarn]Avoid to re-allocating yarn container while driver want to stop all Executors You can merge this pull request into a Git repository by running: $ git pull https://github.com/suyanNone/spark tricky Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/9992.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #9992 commit 771b33c799d30722cd51d24f7aa456719fa2d9d0 Author: hushan <hus...@xiaomi.com> Date: 2015-11-26T07:02:44Z Avoid to re-allocator yarn container while sc.stop
[GitHub] spark pull request: [SPARK-6157][CORE] Unrolling with MEMORY_AND_D...
Github user suyanNone commented on the pull request: https://github.com/apache/spark/pull/4887#issuecomment-159140817 @andrewor14 I have the impression that cogroup will hold two RDDs' iterators. If the first iterator is `MEMORY_ONLY` and unrolling fails, the iterator is based on `Right(vector.iterator ++ values)`; then the second is `MEMORY_AND_DISK`, and if unrolling fails, we prepare to put it onto disk.
[GitHub] spark pull request: Unify dependencies entry
GitHub user suyanNone opened a pull request: https://github.com/apache/spark/pull/9691 Unify dependencies entry A small change. You can merge this pull request into a Git repository by running: $ git pull https://github.com/suyanNone/spark unify-getDependency Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/9691.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #9691 commit 4ac866fa386cbd94b078bdfc4c5c6a07d30b9ea6 Author: hushan <hus...@xiaomi.com> Date: 2015-11-13T12:05:58Z Unify dependencies entry
[GitHub] spark pull request: [SPARK-6157][CORE] Unrolling with MEMORY_AND_D...
Github user suyanNone commented on the pull request: https://github.com/apache/spark/pull/4887#issuecomment-155758546 @andrewor14, I had not made my point clearly. This issue exists because a `MEMORY_AND_DISK`-level block does not release its `unrollMemory` after the block has been put to disk successfully. In the current logic, there is an `unrollMemoryMap` that holds all unroll memory for blocks that are still unrolling or that failed to unroll. Due to SPARK-4777, we added a `pendingUnrollMemoryMap` to reserve unroll memory only for blocks that unrolled successfully; `pendingUnrollMemoryMap` + `unrollMemoryMap` is the total unroll memory in use. Now, to resolve this issue: after Spark fails to unroll a `MEMORY_AND_DISK`-level block, we need to know the specific memory size (say 199MB) of `this` block. **Important**: we can't just call `releaseUnrollMemoryForThisTask` after we have put the block to disk, because `unrollMemoryForThisTask` may contain other blocks' unroll memory — for example two cached RDDs with the same partitioner under a `cogroup` op. Right? So we need to know the specific memory size of the failed `MEMORY_AND_DISK` block, and it should be distinguishable from the rest of `unrollMemoryMap`.
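To illustrate the hazard being described, here is a minimal, hypothetical sketch of per-task unroll accounting (the names mirror the discussion, not the actual `MemoryStore` code): a blanket per-task release frees bytes that still back other blocks' iterators, while a per-block release needs the block's exact size.

```scala
import scala.collection.mutable

// Hypothetical, simplified model of per-task unroll bookkeeping.
object UnrollAccounting {
  // taskId -> total unroll bytes reserved by that task
  val unrollMemoryMap: mutable.Map[Long, Long] =
    mutable.Map[Long, Long]().withDefaultValue(0L)

  def reserve(taskId: Long, bytes: Long): Unit =
    unrollMemoryMap(taskId) += bytes

  // WRONG for this use case: drops *all* of the task's unroll memory,
  // including bytes still backing other blocks' iterators (e.g. a cogroup
  // holding two cached RDDs' partially unrolled partitions).
  def releaseAllForTask(taskId: Long): Unit =
    unrollMemoryMap.remove(taskId)

  // What the fix needs: release only the bytes attributed to the one
  // MEMORY_AND_DISK block that was just spilled to disk — which requires
  // knowing that block's exact size.
  def releaseForBlock(taskId: Long, blockBytes: Long): Unit =
    unrollMemoryMap(taskId) = math.max(0L, unrollMemoryMap(taskId) - blockBytes)
}
```

With two blocks reserved by one task, releasing only the spilled block's bytes leaves the other reservation intact, whereas the blanket release would wrongly zero both.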
[GitHub] spark pull request: [SPARK-6157][CORE] Unrolling with MEMORY_AND_D...
Github user suyanNone commented on the pull request: https://github.com/apache/spark/pull/4887#issuecomment-155985114 eh... the earlier version added a `blockManager.memoryStore.releasePendingUnrollMemoryForThisThread()` at [this line](https://github.com/apache/spark/blob/6cd98c1878a9c5c6475ed5974643021ab27862a7/core/src/main/scala/org/apache/spark/CacheManager.scala#L184). But `pendingUnrollMemory` only stores memory for blocks that unrolled successfully, not for blocks that failed to unroll, so calling `blockManager.memoryStore.releasePendingUnrollMemoryForThisThread()` there is useless — unless we change `pendingUnrollMemory` to also store unroll memory for failed `MEMORY_AND_DISK` blocks alongside successful ones. The memoryStore has two other APIs: (1) `blockManager.memoryStore.releaseUnrollMemoryForThisThread()` and (2) `blockManager.memoryStore.releaseUnrollMemoryForThisThread(someMemorySize)`. `unrollMemoryMap` stores unroll memory for blocks that are unrolling or failed to unroll. If we call (1) after [this line](https://github.com/apache/spark/blob/6cd98c1878a9c5c6475ed5974643021ab27862a7/core/src/main/scala/org/apache/spark/CacheManager.scala#L184), it will also release memory belonging to other failed blocks; if we call (2), we can't get the specific size for that `MEMORY_AND_DISK` block.
[GitHub] spark pull request: [SPARK-6157][CORE] Unrolling with MEMORY_AND_D...
Github user suyanNone commented on the pull request: https://github.com/apache/spark/pull/4887#issuecomment-155260117 = =, I will find time to update today...
[GitHub] spark pull request: [SPARK-8100][UI]Make able to refer lost execut...
Github user suyanNone closed the pull request at: https://github.com/apache/spark/pull/6644
[GitHub] spark pull request: [SPARK][SPARK-10842]Eliminate creating duplica...
Github user suyanNone commented on the pull request: https://github.com/apache/spark/pull/8923#issuecomment-153248553 @markhamstra @squito I need to re-construct and re-run the test case to confirm the problem...
```
R1 --\
      +--> R3 --> R4 --\
R2 --/        \         +--> R6 --> R7
               \-> R5 --/
```
We get the final RDD R7's shuffle deps; the first is R6. Since R6 has not been added into `shuffleToMapStage`, we collect all of its ancestor deps not yet in `shuffleIdToMapStage`: (R4(R3(R1, R2)), R5(R3(R1, R2))) — in the Stack this looks like (R4, R3, R1, R2, R5, R3, R1, R2) — and for each dep a new ShuffleMapStage is created. So the deps R1->R3 and R2->R3 are created twice. eh... @markhamstra, changing the return type from Stack to Set sounds good, but I am not sure the `shuffleDep` object is unique per shuffleId; I need to check that. If I confirm it is unique, I'd like to change `val parents = new Stack[ShuffleDependency[_, _, _]]` to `new Set`.
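The Stack-vs-Set point above can be sketched with a toy model of the ancestor walk (hypothetical names, not the actual DAGScheduler code). Deps are shuffle ids; the DAG mirrors the diagram, so the R3 subtree is reachable via both R4 and R5 and the Stack records it twice:

```scala
import scala.collection.mutable

// Toy DAG: R1,R2 -> R3 -> {R4, R5} -> R6 (ids stand in for shuffle deps).
val parentsOf: Map[Int, List[Int]] =
  Map(6 -> List(4, 5), 4 -> List(3), 5 -> List(3), 3 -> List(1, 2),
      1 -> Nil, 2 -> Nil)

// Stack-style collection: every path is recorded, so shared ancestors
// (R3 and its parents) appear once per path -> duplicate stages.
def ancestorsWithStack(root: Int): List[Int] = {
  val out = mutable.ListBuffer[Int]()
  def visit(id: Int): Unit = parentsOf(id).foreach { p => out += p; visit(p) }
  visit(root)
  out.toList
}

// Set-style collection: each dep is visited at most once, so at most one
// ShuffleMapStage would be created per shuffle dependency.
def ancestorsWithSet(root: Int): Set[Int] = {
  val seen = mutable.Set[Int]()
  def visit(id: Int): Unit = parentsOf(id).foreach { p =>
    if (seen.add(p)) visit(p)
  }
  visit(root)
  seen.toSet
}
```

The Stack walk yields (R4, R3, R1, R2, R5, R3, R1, R2), exactly the duplicated sequence described in the comment, while the Set walk yields each dep once — provided, as the comment notes, that the dep object is unique per shuffleId.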
[GitHub] spark pull request: SPARK-7729:Executor which has been killed shou...
Github user suyanNone commented on the pull request: https://github.com/apache/spark/pull/6263#issuecomment-153225204 yeah, making it configurable looks good
[GitHub] spark pull request: SPARK-7729:Executor which has been killed shou...
Github user suyanNone commented on the pull request: https://github.com/apache/spark/pull/6263#issuecomment-151010358 Hi @archit279thakur, would you mind adding the logic for a time expiry on showing the lost-executor log?
[GitHub] spark pull request: [SPARK][SPARK-10842]Eliminate creating duplica...
Github user suyanNone commented on the pull request: https://github.com/apache/spark/pull/8923#issuecomment-148297636 jenkins retest this please
[GitHub] spark pull request: [SPARK-6157][CORE] Unrolling with MEMORY_AND_D...
Github user suyanNone commented on the pull request: https://github.com/apache/spark/pull/4887#issuecomment-147957284 @andrewor14 We have three situations in which unroll memory is released:
1. Unroll succeeds. We expect to cache this block in `tryToPut`. We do not release and re-acquire memory from the MemoryManager, in order to avoid race conditions where another component steals the memory we are trying to transfer. (SPARK-4777)
2. Unroll fails for a `MEMORY_AND_DISK`-level block. We should release the memory after putting the block into the diskStore, so it can be re-acquired for other purposes. (SPARK-6157)
3. Unroll fails for a `MEMORY_ONLY`-level block. We do not release until the task finishes.
Accordingly, we may hold the reserved unroll memory after unrolling a block. `pendingUnrollMemory` was designed only for successfully unrolled blocks; for situation 2 we also need to hold that memory. We could pass a parameter to `unrollSafely` telling it the block has a disk level, so we hold the memory after a `MEMORY_AND_DISK` block fails to unroll and release it once the block has been put into the diskStore. Another solution, without adding a parameter to `unrollSafely`: for each task unrolling a block, first acquire memory into `pendingUnrollMemoryMap`, release it for situations 1 and 2, and move it into `unrollMemoryMap` for situation 3. The committed version adopts the second solution. Is there a more suitable approach? Looking forward to your suggestions.
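The "second solution" above can be sketched as a small state machine (hypothetical class and method names; a simplified model, not the actual MemoryStore): every unroll first reserves into the pending map, situations 1 and 2 release from it, and situation 3 migrates the reservation into the long-lived map.

```scala
import scala.collection.mutable

// Hypothetical sketch of the proposed pending/long-lived unroll split.
class UnrollBookkeeping {
  val pendingUnrollMemoryMap: mutable.Map[Long, Long] =
    mutable.Map[Long, Long]().withDefaultValue(0L)
  val unrollMemoryMap: mutable.Map[Long, Long] =
    mutable.Map[Long, Long]().withDefaultValue(0L)

  // Every unroll starts by reserving into the pending map.
  def startUnroll(taskId: Long, bytes: Long): Unit =
    pendingUnrollMemoryMap(taskId) += bytes

  // Situations 1 and 2: block cached, or spilled to disk -> give memory back.
  def finishAndRelease(taskId: Long, bytes: Long): Unit =
    pendingUnrollMemoryMap(taskId) -= bytes

  // Situation 3: MEMORY_ONLY unroll failed; the returned iterator still
  // references the partially unrolled data, so keep the reservation in the
  // long-lived map until the task ends.
  def keepUntilTaskEnd(taskId: Long, bytes: Long): Unit = {
    pendingUnrollMemoryMap(taskId) -= bytes
    unrollMemoryMap(taskId) += bytes
  }

  def totalForTask(taskId: Long): Long =
    pendingUnrollMemoryMap(taskId) + unrollMemoryMap(taskId)
}
```

The design point is that only situation 3 needs a reservation to outlive the unroll attempt, so only it migrates bytes out of the pending map.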
[GitHub] spark pull request: [SPARK][SPARK-10842]Eliminate creating duplica...
Github user suyanNone commented on the pull request: https://github.com/apache/spark/pull/8923#issuecomment-147999008 jenkins retest this please
[GitHub] spark pull request: [SPARK][SPARK-10842]Eliminate creating duplica...
Github user suyanNone commented on the pull request: https://github.com/apache/spark/pull/8923#issuecomment-147964293 jenkins retest this please
[GitHub] spark pull request: [SPARK-6157][CORE] Unrolling with MEMORY_AND_D...
Github user suyanNone commented on the pull request: https://github.com/apache/spark/pull/4887#issuecomment-147963975 jenkins retest this please
[GitHub] spark pull request: [SPARK-6157][CORE] Unrolling with MEMORY_AND_D...
Github user suyanNone commented on the pull request: https://github.com/apache/spark/pull/4887#issuecomment-148025758 jenkins retest this please
[GitHub] spark pull request: [SPARK-6157][CORE] Unrolling with MEMORY_AND_D...
Github user suyanNone commented on the pull request: https://github.com/apache/spark/pull/4887#issuecomment-147999540 jenkins retest this please
[GitHub] spark pull request: [SPARK-10796][CORE]Resubmit stage while lost t...
Github user suyanNone commented on the pull request: https://github.com/apache/spark/pull/8927#issuecomment-14378 jenkins retest this please
[GitHub] spark pull request: [SPARK-10796][CORE]Resubmit stage while lost t...
Github user suyanNone commented on the pull request: https://github.com/apache/spark/pull/8927#issuecomment-143698080 Reproduced it, so re-opening
[GitHub] spark pull request: [SPARK-10796][CORE]Resubmit stage while lost t...
GitHub user suyanNone reopened a pull request: https://github.com/apache/spark/pull/8927 [SPARK-10796][CORE]Resubmit stage while lost task in Zombie TaskSets We met this problem in Spark 1.3.0; I also checked the latest Spark code, and I think the problem still exists. Description:
1. A running `ShuffleMapStage` can have multiple `TaskSet`s: one active TaskSet and multiple zombie TaskSets.
2. A running `ShuffleMapStage` succeeds only once all of its partitions have been processed successfully, i.e. every task's MapStatus has been added into `outputLocs`.
3. The MapStatuses of a running `ShuffleMapStage` may be produced by zombie TaskSet 1 / zombie TaskSet 2 / ... / active TaskSet N, and some MapStatus may belong to only one TaskSet — possibly a zombie one.
4. If an executor is lost, it can happen that some of the lost executor's MapStatuses were produced by a zombie TaskSet. In the current logic, the solution to the lost-MapStatus problem is that each TaskSet re-runs the tasks that succeeded on the lost executor: re-adding them into the TaskSet's `pendingTasks` and re-adding their partitions into the Stage's `pendingPartitions`. But this is useless if a lost MapStatus belongs only to a zombie TaskSet: since it is a zombie, its `pendingTasks` will never be scheduled.
5. The only condition for resubmitting a stage is that some task throws `FetchFailedException`, but the lost executor may not hold any MapStatus of a parent stage for any of the running stages, while a running `Stage` happens to have lost a MapStatus belonging only to a zombie TaskSet. So once all zombie TaskSets have finished their running tasks and the active TaskSet has processed all its pending tasks, they are removed by `TaskSchedulerImpl` — yet that running stage's pending partitions are still non-empty. It hangs.
Example:
Running Stage 0.0, running TaskSet 0.0: finished task 0.0 on ExecA, running task 1.0 on ExecB, waiting task 2.0
---> Task 1.0 throws FetchFailedException
---> Resubmitted stage 0.1 runs TaskSet 0.1 (re-running tasks 1 and 2); assume task 1.0 finishes on ExecA
---> ExecA is lost, and it happens that nobody throws a FetchFailedException
---> TaskSet 0.1 resubmits task 1, re-adding it into `pendingTasks` and waiting for TaskSchedulerImpl to schedule it. TaskSet 0.0 also resubmits task 0, re-adding it into its `pendingTasks`, but because it is a zombie, TaskSchedulerImpl skips scheduling TaskSet 0.0.
So once TaskSet 0.0 and TaskSet 0.1 are both (isZombie && runningTasks.empty), TaskSchedulerImpl removes them. The DAGScheduler still has pending partitions due to the task lost from TaskSet 0.0, but all of the stage's TaskSets have been removed, so it hangs.
You can merge this pull request into a Git repository by running: $ git pull https://github.com/suyanNone/spark rerun-special Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/8927.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #8927 commit 554c61f800c6c1b25b1002a7255569a9c38e4154 Author: hushan <hus...@xiaomi.com> Date: 2015-09-24T09:49:22Z rerun-specail commit 3b4a683d23f951082df0b9d29dfa094683d235ea Author: hushan <hus...@xiaomi.com> Date: 2015-09-28T03:09:05Z refine commit f845f33563623a9f3d6858aba893ed8c75453403 Author: hushan <hus...@xiaomi.com> Date: 2015-09-28T03:14:18Z refine commit 301da0a20c94084bc8f783cd0e087e63f07e2124 Author: hushan <hus...@xiaomi.com> Date: 2015-09-28T03:18:46Z refine
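The hang described in this PR can be reduced to a toy model (hypothetical names; a sketch of the described behavior, not the actual scheduler code): a stage's pending partitions only drain when the TaskSet that must rerun the lost task can be scheduled, and zombie TaskSets never are.

```scala
import scala.collection.mutable

// Toy model of the zombie-TaskSet hang.
case class TaskSetModel(id: String, isZombie: Boolean,
                        pendingTasks: mutable.Set[Int])

val stagePendingPartitions = mutable.Set(0, 1)

// After ExecA is lost: partition 0's only output came from zombie
// TaskSet 0.0; partition 1's came from the still-active TaskSet 0.1.
val ts00 = TaskSetModel("0.0", isZombie = true,  mutable.Set(0))
val ts01 = TaskSetModel("0.1", isZombie = false, mutable.Set(1))

// The scheduler only offers resources to non-zombie TaskSets; a rerun
// task that completes removes its partition from the stage's pending set.
def schedule(taskSets: Seq[TaskSetModel]): Unit =
  for (ts <- taskSets if !ts.isZombie; t <- ts.pendingTasks.toList) {
    ts.pendingTasks -= t
    stagePendingPartitions -= t
  }

schedule(Seq(ts00, ts01))
// Partition 1 drains; partition 0 never will, so the stage "hangs" with
// a non-empty pending set once both TaskSets are removed.
```

After scheduling, the pending set still contains partition 0 — the exact state in which the DAGScheduler waits forever because no TaskSet remains to produce it.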
[GitHub] spark pull request: [SPARK][SPARK-10842]Eliminate creating duplica...
Github user suyanNone commented on the pull request: https://github.com/apache/spark/pull/8923#issuecomment-143923739 jenkins retest this please
[GitHub] spark pull request: [SPARK-10796][CORE]Resubmit stage while lost t...
GitHub user suyanNone opened a pull request: https://github.com/apache/spark/pull/8927 [SPARK-10796][CORE]Resubmit stage while lost task in Zombie TaskSets We met this problem in Spark 1.3.0; I also checked the latest Spark code, and I think the problem still exists.
1. A running `ShuffleMapStage` can have multiple `TaskSet`s: one active TaskSet and multiple zombie TaskSets.
2. A running `ShuffleMapStage` succeeds only once all of its partitions have been processed successfully, i.e. every task's MapStatus has been added into `outputLocs`.
3. The MapStatuses of a running `ShuffleMapStage` may be produced by zombie TaskSet 1 / zombie TaskSet 2 / ... / active TaskSet N, and some MapStatus may belong to only one TaskSet — possibly a zombie one.
4. If an executor is lost, it can happen that some of the lost executor's MapStatuses were produced by a zombie TaskSet. In the current logic, the solution to the lost-MapStatus problem is that each TaskSet re-runs the tasks that succeeded on the lost executor: re-adding them into the TaskSet's `pendingTasks` and re-adding their partitions into the Stage's `pendingPartitions`. But this is useless if a lost MapStatus belongs only to a zombie TaskSet: since it is a zombie, its `pendingTasks` will never be scheduled.
5. The only condition for resubmitting a stage is that some task throws `FetchFailedException`, but the lost executor may not hold any MapStatus of a parent stage for any of the running stages, while a running `Stage` happens to have lost a MapStatus belonging only to a zombie TaskSet. So once all zombie TaskSets have finished their running tasks and the active TaskSet has processed all its pending tasks, they are removed by `TaskSchedulerImpl` — yet that running stage's pending partitions are still non-empty. It hangs.
You can merge this pull request into a Git repository by running: $ git pull https://github.com/suyanNone/spark rerun-special Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/8927.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #8927 commit 554c61f800c6c1b25b1002a7255569a9c38e4154 Author: hushan <hus...@xiaomi.com> Date: 2015-09-24T09:49:22Z rerun-specail commit 3b4a683d23f951082df0b9d29dfa094683d235ea Author: hushan <hus...@xiaomi.com> Date: 2015-09-28T03:09:05Z refine commit f845f33563623a9f3d6858aba893ed8c75453403 Author: hushan <hus...@xiaomi.com> Date: 2015-09-28T03:14:18Z refine commit 301da0a20c94084bc8f783cd0e087e63f07e2124 Author: hushan <hus...@xiaomi.com> Date: 2015-09-28T03:18:46Z refine
[GitHub] spark pull request: [SPARK-10796][CORE]Resubmit stage while lost t...
Github user suyanNone closed the pull request at: https://github.com/apache/spark/pull/8927
[GitHub] spark pull request: [SPARK-10796][CORE]Resubmit stage while lost t...
Github user suyanNone commented on the pull request: https://github.com/apache/spark/pull/8927#issuecomment-143632342 I will run a test job on the latest code to confirm whether the problem still exists...
[GitHub] spark pull request: [SPARK][SPARK-10842]Eliminate create duplicate...
GitHub user suyanNone opened a pull request: https://github.com/apache/spark/pull/8923 [SPARK][SPARK-10842]Eliminate create duplicate stage while generate job dag When we traverse the RDD graph to generate the stage DAG, Spark skips checking whether a stage has already been added into `shuffleIdToStage` in one condition: when the shuffleDep comes from `getAncestorShuffleDependencies`. Before: ![before1](https://cloud.githubusercontent.com/assets/9818852/10117739/9e1e8778-6493-11e5-935d-ea32099e94c9.png) ![before2](https://cloud.githubusercontent.com/assets/9818852/10117740/9e20ee46-6493-11e5-95b4-f30df5ac22e1.png) After: ![after](https://cloud.githubusercontent.com/assets/9818852/10117737/9e1864c4-6493-11e5-8541-31ff0ead329a.png) ![after2](https://cloud.githubusercontent.com/assets/9818852/10117738/9e1c8658-6493-11e5-954a-fb00f36f4d3b.png) You can merge this pull request into a Git repository by running: $ git pull https://github.com/suyanNone/spark duplicate Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/8923.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #8923 commit 315887a5267a385d40a47267dcd25664414f1784 Author: hushan <hus...@xiaomi.com> Date: 2015-09-26T13:12:39Z Duplicate stage
[GitHub] spark pull request: [SPARK-6157][CORE]Unroll unsuccessful memory_a...
Github user suyanNone commented on the pull request: https://github.com/apache/spark/pull/4887#issuecomment-138480264 sorry, too late to see that; I will update it today
[GitHub] spark pull request: [SPARK][SPARK-10370]Cancel all running attempt...
GitHub user suyanNone opened a pull request: https://github.com/apache/spark/pull/8550 [SPARK][SPARK-10370]Cancel all running attempts while that stage marked as finished You can merge this pull request into a Git repository by running: $ git pull https://github.com/suyanNone/spark apache-tasksets Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/8550.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #8550 commit 75b97796a2226cbc855d8b6d198a09cce709d698 Author: hushan <hus...@xiaomi.com> Date: 2015-09-01T09:16:15Z Cancel stage task after it already marked finished
[GitHub] spark pull request: [SPARK-6157][CORE]Unroll unsuccessful memory_a...
Github user suyanNone commented on the pull request: https://github.com/apache/spark/pull/4887#issuecomment-136661103 @srowen OK, I checked the current code: the memory threshold changed to per task instead of per thread, so the problem still exists... "In the previous situation, the aim was to handle a MEMORY_AND_DISK level block whose first unroll failed: the unroll memory stayed reserved for the thread, but that unroll memory should be released once the block has been put on disk, because nobody will use the unrolled values afterwards. So there is no need to keep that unroll memory reserved." Do you agree with that description, or should I make it clearer?
[GitHub] spark pull request: [SPARK-6157][CORE]Unroll unsuccessful memory_a...
Github user suyanNone commented on the pull request: https://github.com/apache/spark/pull/4887#issuecomment-136657671 @srowen @nitin2goyal I am not sure this is still needed, because there has been a big change to the per-thread memory threshold. In the previous situation, the aim was to handle a MEMORY_AND_DISK level block whose first unroll failed: the unroll memory stayed reserved for the thread, and that unroll memory should be released once the block has been put on disk.
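The idea in these two comments can be sketched with a toy model. This is a hedged illustration, not Spark's actual MemoryStore API: `UnrollMemoryPool` and `put_memory_and_disk_block` are invented names. The point is that when an unroll fails part-way and the block is spilled to disk, the partially reserved unroll memory is released immediately rather than staying reserved for the task.

```python
class UnrollMemoryPool:
    """Toy per-task unroll-memory accounting (illustrative only)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.reserved = {}  # task_id -> bytes reserved for unrolling

    def reserve(self, task_id, n_bytes):
        """Reserve up to n_bytes; a partial grant means the unroll failed."""
        used = sum(self.reserved.values())
        granted = min(n_bytes, self.capacity - used)
        if granted > 0:
            self.reserved[task_id] = self.reserved.get(task_id, 0) + granted
        return granted == n_bytes

    def release(self, task_id):
        """Free everything this task reserved; returns the freed amount."""
        return self.reserved.pop(task_id, 0)

def put_memory_and_disk_block(pool, task_id, block_bytes, disk_store):
    """Unroll into memory if possible; otherwise spill to disk and release
    the partial unroll reservation right away (the fix under discussion)."""
    if pool.reserve(task_id, block_bytes):
        return "memory"
    disk_store.append(task_id)  # block goes to disk instead
    pool.release(task_id)       # nobody will read the partial unroll
    return "disk"
```

Without the `release` call, the 40 bytes partially granted to the failed unroll would stay reserved even though the block now lives on disk.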
[GitHub] spark pull request: [SPARK-5259][CORE] don't submit stage until it...
Github user suyanNone commented on the pull request: https://github.com/apache/spark/pull/7699#issuecomment-136639959 @squito SPARK-10370, eh... I think I already resolved it a few months ago in my local environment, but that was based on Spark 1.3.0... Are you already working on it?
[GitHub] spark pull request: [SPARK][SPARK-10370]Cancel all running attempt...
Github user suyanNone closed the pull request at: https://github.com/apache/spark/pull/8550
[GitHub] spark pull request: [SPARK][SPARK-10370]Cancel all running attempt...
Github user suyanNone commented on the pull request: https://github.com/apache/spark/pull/8550#issuecomment-136762126 @squito yeah, I will close it.
[GitHub] spark pull request: [SPARK][SPARK-10370]Cancel all running attempt...
Github user suyanNone commented on the pull request: https://github.com/apache/spark/pull/8550#issuecomment-136766398 @squito I closed the patch because there were some errors in the tests. Calling `taskScheduler.cancelTasks(stage.id, true)` in markStageAsFinished marks all the now-useless task sets related to that stage as zombie; since the stage is already marked as finished, there is no need to handle any running task for it. So, first, this prevents further tasks from being scheduled; second, it may be good to kill the running tasks of an already finished stage, since such tasks may run for a long time.
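The zombie mechanism referenced here can be modeled minimally. This is an illustrative sketch, not Spark's TaskSetManager (class and method names are hypothetical): once a stage is finished, its task sets are flagged as zombie so the scheduler stops offering them new task attempts.

```python
class TaskSet:
    """Toy task set: a zombie task set never schedules anything new."""
    def __init__(self, stage_id, pending):
        self.stage_id = stage_id
        self.pending = list(pending)  # task indices not yet launched
        self.is_zombie = False

    def next_task(self):
        if self.is_zombie or not self.pending:
            return None
        return self.pending.pop(0)

def mark_stage_as_finished(stage_id, task_sets):
    """Flag every task set of the finished stage as zombie."""
    for ts in task_sets:
        if ts.stage_id == stage_id:
            ts.is_zombie = True
```

This captures the first half of the comment (stop scheduling); actually killing already-running tasks would be a separate step on top of the flag.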
[GitHub] spark pull request: [SPARK][SPARK-10370]Cancel all running attempt...
Github user suyanNone commented on the pull request: https://github.com/apache/spark/pull/8550#issuecomment-136767971 Regarding SPARK-2666: that patch aims to cancel all of a stage's tasks as soon as a FetchFailedException is thrown, so I think it is unrelated to this patch, right? This patch only cancels tasks whose stage is already marked as finished.
[GitHub] spark pull request: KafKaDirectDstream should filter empty partiti...
Github user suyanNone closed the pull request at: https://github.com/apache/spark/pull/8237
[GitHub] spark pull request: KafKaDirectDstream should filter empty partiti...
Github user suyanNone commented on the pull request: https://github.com/apache/spark/pull/8237#issuecomment-132427763 @tdas OK, since you already discussed this before, let's keep it as it is.
[GitHub] spark pull request: KafKaDirectDstream should filter empty partiti...
Github user suyanNone commented on the pull request: https://github.com/apache/spark/pull/8237#issuecomment-132054268 @tdas, yeah, I agree it would be semantically confusing, and for a batch streaming system an empty batch will not block the next batch as long as it finishes quickly, i.e. in less than the batch interval. Ah... I am still a little uncomfortable with launching empty tasks on executors. We run Spark on YARN, where resource isolation is soft, so resource contention occurs; sometimes even an empty task's processing time can exceed the batch interval.
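The filtering this PR proposes can be sketched in a few lines. This is a hedged Python analogue, not the actual KafkaDirectDStream code (function names are invented): a Kafka partition contributes zero events exactly when its until-offset equals its from-offset, so such partitions can be dropped before a batch job is submitted.

```python
def nonempty_partitions(offset_ranges):
    """offset_ranges: list of (topic, partition, from_offset, until_offset).
    Keep only ranges that actually contain events."""
    return [r for r in offset_ranges if r[3] > r[2]]

def should_submit_batch(offset_ranges):
    # Skip the whole job when every partition is empty (a 0-event batch).
    return len(nonempty_partitions(offset_ranges)) > 0
```

The trade-off the thread discusses is that skipping empty batches changes the semantics (each batch normally produces an RDD, even an empty one), at the cost of launching no-op tasks.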
[GitHub] spark pull request: KafKaDirectDstream should filter empty partiti...
GitHub user suyanNone opened a pull request: https://github.com/apache/spark/pull/8237 KafKaDirectDstream should filter empty partition task or rdd. This avoids submitting stages and tasks for 0-event batches. You can merge this pull request into a Git repository by running: $ git pull https://github.com/suyanNone/spark spark-10052 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/8237.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #8237 commit 4e4764d0a007160ae93a9c5b5910043ab406c02b Author: hushan hus...@xiaomi.com Date: 2015-08-17T12:22:49Z KafKaDirectDstream should filter empty partition task or rdd
[GitHub] spark pull request: KafKaDirectDstream should filter empty partiti...
GitHub user suyanNone reopened a pull request: https://github.com/apache/spark/pull/8237 KafKaDirectDstream should filter empty partition task or rdd. This avoids submitting stages and tasks for 0-event batches. You can merge this pull request into a Git repository by running: $ git pull https://github.com/suyanNone/spark spark-10052 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/8237.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #8237 commit 4e4764d0a007160ae93a9c5b5910043ab406c02b Author: hushan hus...@xiaomi.com Date: 2015-08-17T12:22:49Z KafKaDirectDstream should filter empty partition task or rdd
[GitHub] spark pull request: KafKaDirectDstream should filter empty partiti...
Github user suyanNone closed the pull request at: https://github.com/apache/spark/pull/8237
[GitHub] spark pull request: [SPARK-8100][UI]Make able to refer lost execut...
Github user suyanNone commented on the pull request: https://github.com/apache/spark/pull/6644#issuecomment-128218507 @andrewor14 @harishreedharan @squito I missed andrewor14's comments about long-running apps again... As harishreedharan says, expiring entries after a time limit is worth considering.
[GitHub] spark pull request: [SPARK-5259][CORE] don't submit stage until it...
Github user suyanNone commented on the pull request: https://github.com/apache/spark/pull/7699#issuecomment-125477126 ... sorry for not updating promptly... I have been busy with offline work. It's nice of you to refine the code for #4055.
[GitHub] spark pull request: [SPARK-8100][UI]Make able to refer lost execut...
Github user suyanNone commented on a diff in the pull request: https://github.com/apache/spark/pull/6644#discussion_r35617005 --- Diff: core/src/main/scala/org/apache/spark/status/api/v1/api.scala --- @@ -60,7 +60,8 @@ class ExecutorSummary private[spark]( val totalShuffleRead: Long, val totalShuffleWrite: Long, val maxMemory: Long, -val executorLogs: Map[String, String]) +val executorLogs: Map[String, String], +val isRemoved: Boolean) --- End diff -- OK, I will refine that; thanks for telling me about MimaExcludes.
[GitHub] spark pull request: [SPARK-8100][UI]Make able to refer lost execut...
Github user suyanNone commented on the pull request: https://github.com/apache/spark/pull/6644#issuecomment-125468342 retest this please.
[GitHub] spark pull request: [SPARK-5259][CORE]Make sure shuffle metadata a...
Github user suyanNone commented on a diff in the pull request: https://github.com/apache/spark/pull/4055#discussion_r35177284 --- Diff: core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala --- @@ -739,6 +742,88 @@ class DAGSchedulerSuite assertDataStructuresEmpty() } + test("verify not submit next stage while not have registered mapStatus") { +val firstRDD = new MyRDD(sc, 3, Nil) +val firstShuffleDep = new ShuffleDependency(firstRDD, null) +val firstShuffleId = firstShuffleDep.shuffleId +val shuffleMapRdd = new MyRDD(sc, 3, List(firstShuffleDep)) +val shuffleDep = new ShuffleDependency(shuffleMapRdd, null) +val reduceRdd = new MyRDD(sc, 1, List(shuffleDep)) +submit(reduceRdd, Array(0)) + +// things start out smoothly, stage 0 completes with no issues +complete(taskSets(0), Seq( + (Success, makeMapStatus("hostB", shuffleMapRdd.partitions.size)), + (Success, makeMapStatus("hostB", shuffleMapRdd.partitions.size)), + (Success, makeMapStatus("hostA", shuffleMapRdd.partitions.size)) +)) + +// then one executor dies, and a task fails in stage 1 +runEvent(ExecutorLost("exec-hostA")) +runEvent(CompletionEvent(taskSets(1).tasks(0), + FetchFailed(null, firstShuffleId, 2, 0, "Fetch failed"), + null, null, createFakeTaskInfo(), null)) + +// so we resubmit stage 0, which completes happily +Thread.sleep(1000) +val stage0Resubmit = taskSets(2) +assert(stage0Resubmit.stageId == 0) +assert(stage0Resubmit.stageAttemptId === 1) +val task = stage0Resubmit.tasks(0) +assert(task.partitionId === 2) +runEvent(CompletionEvent(task, Success, + makeMapStatus("hostC", shuffleMapRdd.partitions.size), null, createFakeTaskInfo(), null)) + +// now here is where things get tricky: we will now have a task set representing +// the second attempt for stage 1, but we *also* have some tasks for the first attempt for +// stage 1 still going +val stage1Resubmit = taskSets(3) +assert(stage1Resubmit.stageId == 1) +assert(stage1Resubmit.stageAttemptId === 1) +assert(stage1Resubmit.tasks.length === 3) + +// we'll have some tasks finish from the first attempt, and some finish from the second attempt, +// so that we actually have all stage outputs, though no attempt has completed all its +// tasks +runEvent(CompletionEvent(taskSets(3).tasks(0), Success, + makeMapStatus("hostC", reduceRdd.partitions.size), null, createFakeTaskInfo(), null)) +runEvent(CompletionEvent(taskSets(3).tasks(1), Success, + makeMapStatus("hostC", reduceRdd.partitions.size), null, createFakeTaskInfo(), null)) +// late task finish from the first attempt +runEvent(CompletionEvent(taskSets(1).tasks(2), Success, + makeMapStatus("hostB", reduceRdd.partitions.size), null, createFakeTaskInfo(), null)) + +// What should happen now is that we submit stage 2. However, we might not see an error +// b/c of DAGScheduler's error handling (it tends to swallow errors and just log them). But +// we can check some conditions. +// Note that the really important thing here is not so much that we submit stage 2 *immediately* +// but that we don't end up with some error from these interleaved completions. It would also +// be OK (though sub-optimal) if stage 2 simply waited until the resubmission of stage 1 had +// all its tasks complete + +// check that we have all the map output for stage 0 (it should have been there even before +// the last round of completions from stage 1, but just to double check it hasn't been messed +// up) +(0 until 3).foreach { reduceIdx => + val arr = mapOutputTracker.getServerStatuses(0, reduceIdx) + assert(arr != null) + assert(arr.nonEmpty) --- End diff -- Maybe the code below would be better? ``` try { mapOutputTracker.getMapSizesByExecutorId(0, reduceIdx) } catch { case e: Exception => fail() } ```
[GitHub] spark pull request: [SPARK-5259][CORE]Make sure shuffle metadata a...
Github user suyanNone commented on a diff in the pull request: https://github.com/apache/spark/pull/4055#discussion_r35064694 --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala --- @@ -1193,8 +1193,10 @@ class DAGScheduler( // TODO: This will be really slow if we keep accumulating shuffle map stages for ((shuffleId, stage) <- shuffleToMapStage) { stage.removeOutputsOnExecutor(execId) - val locs = stage.outputLocs.map(list => if (list.isEmpty) null else list.head).toArray - mapOutputTracker.registerMapOutputs(shuffleId, locs, changeEpoch = true) + if (!runningStages.contains(stage)) { --- End diff -- yeah, that's right.
[GitHub] spark pull request: [SPARK-5259][CORE]Make sure shuffle metadata a...
Github user suyanNone commented on the pull request: https://github.com/apache/spark/pull/4055#issuecomment-123142054 @squito @markhamstra hmm, I agree with tracking only the partitionId, which makes this simpler. It's great to see this patch heading toward being closed out. Many thanks!
[GitHub] spark pull request: [SPARK-5259][CORE]Make sure shuffle metadata a...
Github user suyanNone commented on the pull request: https://github.com/apache/spark/pull/4055#issuecomment-118758477 @squito About stage.pendingTasks and shuffleMapStage.isAvailable: using `isAvailable` instead of `stage.pendingTasks` may need more care. There is a comment in the code, `// Some tasks had failed; let's resubmit this shuffleStage`, and I can't reconstruct the situation or context it refers to... If we change to `isAvailable`, the following check `if (shuffleStage.outputLocs.contains(Nil))` could perhaps be removed, and so could all the code related to `stage.pendingTasks`, since it would no longer have any readers. The logic of the DAGScheduler is complicated and always has some confusing code... ``` if (runningStages.contains(shuffleStage) && shuffleStage.pendingTasks.isEmpty) { // clearCacheLocs() if (shuffleStage.outputLocs.contains(Nil)) { // Some tasks had failed; let's resubmit this shuffleStage // TODO: Lower-level scheduler should also deal with this logInfo("Resubmitting " + shuffleStage + " (" + shuffleStage.name + ") because some of its tasks had failed: " + shuffleStage.outputLocs.zipWithIndex.filter(_._1.isEmpty).map(_._2).mkString(", ")) submitStage(shuffleStage) } else { ```
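The caution in this comment comes down to the fact that the two "is this stage done?" signals are not equivalent. A minimal Python model (hypothetical class, not Spark's actual ShuffleMapStage): `pending` tracks tasks that have not reported success, while `is_available` asks whether every partition currently has a registered map output. After an executor loss wipes some outputs, the two can disagree, which is why swapping one for the other needs care.

```python
class ShuffleMapStage:
    """Toy model of a shuffle map stage's two completion signals."""
    def __init__(self, num_partitions):
        self.output_locs = [None] * num_partitions  # map output per partition
        self.pending = set(range(num_partitions))   # tasks not yet succeeded

    def task_succeeded(self, partition, location):
        self.pending.discard(partition)
        self.output_locs[partition] = location

    def executor_lost(self, location):
        # Outputs on a lost executor disappear, but pending is unchanged:
        # the tasks did succeed, their outputs just no longer exist.
        for i, loc in enumerate(self.output_locs):
            if loc == location:
                self.output_locs[i] = None

    @property
    def is_available(self):
        return all(loc is not None for loc in self.output_locs)
```

After an executor loss, `pending` says the stage is finished while `is_available` says it is not, so code guarded by one condition cannot blindly switch to the other.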
[GitHub] spark pull request: [SPARK-5259][CORE]Make sure shuffle metadata a...
Github user suyanNone commented on the pull request: https://github.com/apache/spark/pull/4055#issuecomment-118708012 @squito oh... I had skipped that... 1) A task attempt is now described by `TaskInfo` in Spark's `TaskSetManager`. The `TaskSetManager` is responsible for completing the task attempt's `TaskInfo`, which is identified by `TaskID`, and when the `TaskSetManager` considers an attempt successful, it calls the `dagScheduler` to complete `tasks(index)`, which is a `Task`. So it makes some sense to identify a `Task` by stageId and partitionId?
[GitHub] spark pull request: [SPARK-5259][CORE]Make sure shuffle metadata a...
Github user suyanNone commented on a diff in the pull request: https://github.com/apache/spark/pull/4055#discussion_r33860928 --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala --- @@ -1193,8 +1193,10 @@ class DAGScheduler( // TODO: This will be really slow if we keep accumulating shuffle map stages for ((shuffleId, stage) <- shuffleToMapStage) { stage.removeOutputsOnExecutor(execId) - val locs = stage.outputLocs.map(list => if (list.isEmpty) null else list.head).toArray - mapOutputTracker.registerMapOutputs(shuffleId, locs, changeEpoch = true) + if (!runningStages.contains(stage)) { --- End diff -- @squito For this one: when the dagScheduler handles ExecutorLost, it re-registers all of `shuffleToMapStage`, including running map stages that are only partially complete, which causes us to register partial mapStatus entries in the MapOutputTracker.
[GitHub] spark pull request: [SPARK-5259][CORE]Make sure shuffle metadata a...
Github user suyanNone commented on a diff in the pull request: https://github.com/apache/spark/pull/4055#discussion_r33861063 --- Diff: core/src/main/scala/org/apache/spark/MapOutputTracker.scala --- @@ -305,7 +305,7 @@ private[spark] class MapOutputTrackerMaster(conf: SparkConf) if (mapStatuses.contains(shuffleId)) { val statuses = mapStatuses(shuffleId) - if (statuses.nonEmpty) { + if (statuses.nonEmpty && statuses.exists(_ != null)) { --- End diff -- Do I need a test case for this fix? I think it is a bug...
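Why `nonEmpty` alone is not a sufficient guard: the map-status array for a shuffle is pre-allocated with one slot per map task, so it is non-empty even before any output has been registered. A small Python analogue of the diff's added check (illustrative only, not the Scala code):

```python
def has_registered_outputs(statuses):
    """True only if at least one map status has actually been registered.
    len(statuses) > 0 alone is not enough: a freshly allocated array of
    None slots (one per map task) would pass that check."""
    return len(statuses) > 0 and any(s is not None for s in statuses)
```

This mirrors `statuses.nonEmpty && statuses.exists(_ != null)`: the second conjunct is what distinguishes "slots exist" from "outputs exist".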
[GitHub] spark pull request: [SPARK-5259][CORE]Make sure shuffle metadata a...
Github user suyanNone commented on a diff in the pull request: https://github.com/apache/spark/pull/4055#discussion_r33861293 --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala --- @@ -1193,8 +1193,10 @@ class DAGScheduler( // TODO: This will be really slow if we keep accumulating shuffle map stages for ((shuffleId, stage) <- shuffleToMapStage) { stage.removeOutputsOnExecutor(execId) - val locs = stage.outputLocs.map(list => if (list.isEmpty) null else list.head).toArray - mapOutputTracker.registerMapOutputs(shuffleId, locs, changeEpoch = true) + if (!runningStages.contains(stage)) { --- End diff -- One more thing to note: if we fix the problem described in that issue, we may never hit the problems caused by these two pieces of code, so this change may not help that issue directly. Should I create a separate issue for these two?
[GitHub] spark pull request: [SPARK-5259][CORE]Make sure shuffle metadata a...
Github user suyanNone commented on the pull request: https://github.com/apache/spark/pull/4055#issuecomment-118257935 @squito you are right, the task set is set to zombie so the remaining tasks are not scheduled. Your test case is good, and I modified just one place: ``` runEvent(ExecutorLost("exec-hostA")) runEvent(CompletionEvent(taskSets(1).tasks(0), FetchFailed(null, firstShuffleId, 2, 0, "Fetch failed"), null, null, createFakeTaskInfo(), null)) // so we resubmit stage 0, which completes happily scheduler.resubmitFailedStages() val stage0Resubmit = taskSets(2) assert(stage0Resubmit.stageId == 0) assert(stage0Resubmit.attempt === 1) ``` to ``` runEvent(ExecutorLost("exec-hostA")) runEvent(CompletionEvent(taskSets(1).tasks(0), FetchFailed(null, firstShuffleId, 2, 0, "Fetch failed"), null, null, createFakeTaskInfo(), null)) // so we resubmit stage 0, which completes happily Thread.sleep(1000) val stage0Resubmit = taskSets(2) assert(stage0Resubmit.stageId == 0) assert(stage0Resubmit.attempt === 1) ``` The reason is that when the dagScheduler handles FetchFailed, it runs ``` messageScheduler.schedule(new Runnable { override def run(): Unit = eventProcessLoop.post(ResubmitFailedStages) }, 200, TimeUnit.MILLISECONDS) ``` and runEvent is asynchronous, so we need to wait some time to make sure that event has been processed.
[GitHub] spark pull request: [SPARK-5259][CORE]Make sure shuffle metadata a...
Github user suyanNone commented on a diff in the pull request: https://github.com/apache/spark/pull/4055#discussion_r33858523 --- Diff: core/src/main/scala/org/apache/spark/MapOutputTracker.scala --- @@ -305,7 +305,7 @@ private[spark] class MapOutputTrackerMaster(conf: SparkConf) if (mapStatuses.contains(shuffleId)) { val statuses = mapStatuses(shuffleId) - if (statuses.nonEmpty) { + if (statuses.nonEmpty && statuses.exists(_ != null)) { --- End diff -- These two correctness fixes came up while I was reverting my change to verify that my test case fails without it; whenever I did that, it ran into some strange problems. When we submitMissingTasks, we call `getPreferredLocs` for each task, and there, for each of the RDD's ShuffleDependencies, it calls `mapOutputTracker.getLocationsWithLargestOutputs(`, which checks `statuses.nonEmpty`. That is always true, so it always goes on to call `val status = statuses(mapIdx); val blockSize = status.getSizeForBlock(reducerId)`
[GitHub] spark pull request: [SPARK-5259][CORE]Make sure shuffle metadata a...
Github user suyanNone commented on the pull request: https://github.com/apache/spark/pull/4055#issuecomment-117419882 @squito I was on vacation for the past 10 days; sorry I saw this so late. I will read your comments more carefully tomorrow, and I'm grateful to you for reviewing.
[GitHub] spark pull request: [SPARK-8044][CORE]Avoid to make directbuffer o...
Github user suyanNone closed the pull request at: https://github.com/apache/spark/pull/6586
[GitHub] spark pull request: [SPARK-8044][CORE]Avoid to make directbuffer o...
Github user suyanNone commented on the pull request: https://github.com/apache/spark/pull/6586#issuecomment-117421446

@srowen I have been on vacation for the past 10 days, sorry for seeing this so late. OK, I will close this patch.
[GitHub] spark pull request: [SPARK-8100][UI]Make able to refer lost execut...
Github user suyanNone commented on the pull request: https://github.com/apache/spark/pull/6644#issuecomment-113865096

retest this please
[GitHub] spark pull request: [SPARK-5259][CORE]Make sure mapStage.pendingta...
Github user suyanNone commented on the pull request: https://github.com/apache/spark/pull/4055#issuecomment-113414080

@squito, could you help review the test again? Thanks.
[GitHub] spark pull request: [SPARK-5259][CORE]Make sure shuffle metadata a...
Github user suyanNone commented on the pull request: https://github.com/apache/spark/pull/4055#issuecomment-113477240

retest this please
[GitHub] spark pull request: [SPARK-5259][CORE]Make sure mapStage.pendingta...
Github user suyanNone commented on a diff in the pull request: https://github.com/apache/spark/pull/4055#discussion_r32811839

--- Diff: core/src/main/scala/org/apache/spark/MapOutputTracker.scala ---
```
@@ -305,7 +305,7 @@ private[spark] class MapOutputTrackerMaster(conf: SparkConf)
     if (mapStatuses.contains(shuffleId)) {
       val statuses = mapStatuses(shuffleId)
-      if (statuses.nonEmpty) {
+      if (statuses.nonEmpty && statuses.exists(_ != null)) {
```
--- End diff --

`nonEmpty` only checks that the array length is not 0, but we register an array of size `partitionSize` for each shuffle when the job is submitted, so `statuses.nonEmpty` is always true.
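To make the point concrete, here is a minimal stand-alone Scala sketch (not Spark code; the statuses array is just `Array[String]` rather than `Array[MapStatus]`) showing that a pre-allocated array passes `nonEmpty` even before any element has been set:

```scala
// Illustration only: MapOutputTrackerMaster pre-allocates one slot per map
// partition, so the statuses array is non-empty from the moment it is registered.
object NonEmptyCheckDemo {
  def main(args: Array[String]): Unit = {
    // Simulate the registered statuses array for a 3-partition shuffle:
    // slots exist, but no map output has been recorded yet (all null).
    val statuses = new Array[String](3)

    // `nonEmpty` only checks length != 0, so it is already true here...
    assert(statuses.nonEmpty)
    // ...while `exists(_ != null)` correctly reports that nothing is registered.
    assert(!statuses.exists(_ != null))

    // Once one map task's output is recorded, both checks pass.
    statuses(0) = "mapStatus-0"
    assert(statuses.nonEmpty && statuses.exists(_ != null))
    println("ok")
  }
}
```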
[GitHub] spark pull request: [SPARK-5259][CORE]Make sure mapStage.pendingta...
Github user suyanNone commented on the pull request: https://github.com/apache/spark/pull/4055#issuecomment-113438222

retest this please
[GitHub] spark pull request: [SPARK-6157][CORE]Unroll unsuccessful memory_a...
Github user suyanNone commented on a diff in the pull request: https://github.com/apache/spark/pull/4887#discussion_r32811415

--- Diff: core/src/main/scala/org/apache/spark/storage/MemoryStore.scala ---
```
@@ -295,9 +296,9 @@ private[spark] class MemoryStore(blockManager: BlockManager, maxMemory: Long)
       // In this case, we should release the memory after we cache the block there.
       // Otherwise, if we return an iterator, we release the memory reserved here
       // later when the task finishes.
-      if (keepUnrolling) {
-        accountingLock.synchronized {
-          val amountToRelease = currentUnrollMemoryForThisThread - previousMemoryReserved
+      accountingLock.synchronized {
```
--- End diff --

@srowen `releasePendingUnrollMemoryForThisThread` means: we keep the unroll memory reserved for the unrolled block, and we need to release that memory once it is no longer used, e.g. when we put a successfully unrolled block into the MemoryStore, when we put an unsuccessfully unrolled block onto disk, or when the task completes.
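As a rough illustration of that reserve-then-release lifecycle, here is a hedged sketch; `pendingUnrollMemory`, `reserve`, and `releasePending` are illustrative names, not the actual MemoryStore fields or API:

```scala
// Sketch of pending-unroll accounting (illustrative, not Spark's MemoryStore):
// memory reserved while unrolling a block stays reserved until the block is
// stored in memory, spilled to disk, or the task finishes.
object UnrollAccountingDemo {
  private var pendingUnrollMemory = 0L
  private val accountingLock = new Object

  // Reserve unroll memory while iterating over the block's values.
  def reserve(bytes: Long): Unit = accountingLock.synchronized {
    pendingUnrollMemory += bytes
  }

  // Called once the unrolled block has been cached, spilled, or the task is
  // done: the whole pending reservation is returned to the pool.
  def releasePending(): Long = accountingLock.synchronized {
    val released = pendingUnrollMemory
    pendingUnrollMemory = 0L
    released
  }

  def main(args: Array[String]): Unit = {
    reserve(1024)
    reserve(2048)
    assert(releasePending() == 3072L) // everything reserved is released at once
    assert(releasePending() == 0L)    // releasing twice is a no-op
    println("ok")
  }
}
```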
[GitHub] spark pull request: [SPARK-5259][CORE]Make sure mapStage.pendingta...
Github user suyanNone commented on a diff in the pull request: https://github.com/apache/spark/pull/4055#discussion_r32811940

--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
```
@@ -1193,8 +1193,10 @@ class DAGScheduler(
       // TODO: This will be really slow if we keep accumulating shuffle map stages
       for ((shuffleId, stage) <- shuffleToMapStage) {
         stage.removeOutputsOnExecutor(execId)
-        val locs = stage.outputLocs.map(list => if (list.isEmpty) null else list.head).toArray
-        mapOutputTracker.registerMapOutputs(shuffleId, locs, changeEpoch = true)
+        if (!runningStages.contains(stage)) {
```
--- End diff --

We register map statuses when a stage is done, so a running stage is not yet in the MapOutputTracker and there is no need to re-register it.
[GitHub] spark pull request: [SPARK-5259][CORE]Make sure mapStage.pendingta...
Github user suyanNone commented on the pull request: https://github.com/apache/spark/pull/4055#issuecomment-113166545

@andrewor14 @srowen Already refined according to the comments.
[GitHub] spark pull request: [SPARK-5259][CORE]Make sure mapStage.pendingta...
Github user suyanNone commented on a diff in the pull request: https://github.com/apache/spark/pull/4055#discussion_r32799824

--- Diff: core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala ---
```
@@ -598,6 +598,49 @@ class DAGSchedulerSuite
     assertDataStructuresEmpty()
   }

+  test("run with ShuffleMapStage retry") {
+    val firstRDD = new MyRDD(sc, 3, Nil)
+    val firstShuffleDep = new ShuffleDependency(firstRDD, null)
+    val firstShuffleId = firstShuffleDep.shuffleId
+    val shuffleMapRdd = new MyRDD(sc, 3, List(firstShuffleDep))
+    val shuffleDep = new ShuffleDependency(shuffleMapRdd, null)
+    val reduceRdd = new MyRDD(sc, 1, List(shuffleDep))
+    submit(reduceRdd, Array(0))
+
+    complete(taskSets(0), Seq(
+      (Success, makeMapStatus("hostB", shuffleMapRdd.partitions.size)),
+      (Success, makeMapStatus("hostB", shuffleMapRdd.partitions.size)),
+      (Success, makeMapStatus("hostC", shuffleMapRdd.partitions.size))
+    ))
+
+    complete(taskSets(1), Seq(
+      (Success, makeMapStatus("hostA", reduceRdd.partitions.size)),
+      (Success, makeMapStatus("hostB", reduceRdd.partitions.size))
+    ))
+    runEvent(ExecutorLost("exec-hostA"))
+    // Resubmit the task that already succeeded on hostA
+    runEvent(CompletionEvent(taskSets(1).tasks(0), Resubmitted,
+      null, null, createFakeTaskInfo(), null))
+
+    // Causes mapOutputTracker to remove hostA's outputs for taskSets(0).
+    // The resubmitted task will fail to fetch metadata.
+    runEvent(CompletionEvent(taskSets(1).tasks(0),
+      FetchFailed(null, firstShuffleId, -1, 0, "Fetch metadata failed"),
+      null, null, createFakeTaskInfo(), null))
```
--- End diff --

Nice, you are right; I changed the test code to a wrong version.
[GitHub] spark pull request: [SPARK-5259][CORE]Make sure mapStage.pendingta...
Github user suyanNone commented on the pull request: https://github.com/apache/spark/pull/4055#issuecomment-113356815

@andrewor14 The remaining failures are because I changed the `DAGScheduler.handleExecutorLost` logic in DAGScheduler.scala.

Original version: changes the MapOutputTracker epoch for every stage whenever an executor is lost:
```
handleExecutorLost {
  ..
  for ((shuffleId, stage) <- shuffleToMapStage) {
    stage.removeOutputsOnExecutor(execId)
    val locs = stage.outputLocs.map(list => if (list.isEmpty) null else list.head).toArray
    mapOutputTracker.registerMapOutputs(shuffleId, locs, changeEpoch = true)
  }
  ..
}
```

This patch's version: a running stage does not need to be re-registered with the MapOutputTracker, so it does not change the epoch:
```
for ((shuffleId, stage) <- shuffleToMapStage) {
  stage.removeOutputsOnExecutor(execId)
  if (!runningStages.contains(stage)) {
    val locs = stage.outputLocs.map(list => if (list.isEmpty) null else list.head).toArray
    mapOutputTracker.registerMapOutputs(shuffleId, locs, changeEpoch = true)
  }
}
```
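The effect of that `runningStages` guard can be sketched in isolation; `Stage`, `shuffleToMapStage`, and `runningStages` below are simplified stand-ins, not Spark's real types:

```scala
// Sketch of the patched executor-lost loop: only stages that have finished
// (and so are registered in the MapOutputTracker) get re-registered, which is
// what bumps the epoch; running stages are skipped.
object ExecutorLostDemo {
  final case class Stage(id: Int)

  def main(args: Array[String]): Unit = {
    // Shuffle 1's map stage has completed; shuffle 2's map stage is still running.
    val shuffleToMapStage = Map(1 -> Stage(10), 2 -> Stage(20))
    val runningStages = Set(Stage(20))
    var reRegistered = List.empty[Int]

    for ((shuffleId, stage) <- shuffleToMapStage) {
      // In Spark, stage.removeOutputsOnExecutor(execId) would run here for
      // every stage, running or not.
      if (!runningStages.contains(stage)) {
        // Only the finished stage triggers registerMapOutputs(..., changeEpoch = true).
        reRegistered ::= shuffleId
      }
    }

    assert(reRegistered == List(1)) // only the completed stage was re-registered
    println("ok")
  }
}
```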