[ https://issues.apache.org/jira/browse/SPARK-2445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129226#comment-14129226 ]
Tomas Barton commented on SPARK-2445:
-------------------------------------

Well, the workaround doesn't seem to work. Here's the output at TRACE log level, produced by running the same job from a different node:

{code}
MASTER=mesos://`cat /etc/mesos/zk` ./bin/run-example SparkLR
{code}

{code}
...
On iteration 5
14/09/11 00:02:45 INFO SparkContext: Starting job: reduce at SparkLR.scala:64
14/09/11 00:02:45 TRACE DAGScheduler: Checking for newly runnable parent stages
14/09/11 00:02:45 TRACE DAGScheduler: running: Set()
14/09/11 00:02:45 TRACE DAGScheduler: waiting: Set()
14/09/11 00:02:45 TRACE DAGScheduler: failed: Set()
14/09/11 00:02:45 INFO DAGScheduler: Got job 4 (reduce at SparkLR.scala:64) with 2 output partitions (allowLocal=false)
14/09/11 00:02:45 INFO DAGScheduler: Final stage: Stage 4(reduce at SparkLR.scala:64)
14/09/11 00:02:45 INFO DAGScheduler: Parents of final stage: List()
14/09/11 00:02:45 INFO DAGScheduler: Missing parents: List()
14/09/11 00:02:45 DEBUG DAGScheduler: submitStage(Stage 4)
14/09/11 00:02:45 DEBUG DAGScheduler: missing: List()
14/09/11 00:02:45 INFO DAGScheduler: Submitting Stage 4 (MappedRDD[5] at map at SparkLR.scala:62), which has no missing parents
14/09/11 00:02:45 DEBUG DAGScheduler: submitMissingTasks(Stage 4)
14/09/11 00:02:45 INFO DAGScheduler: Submitting 2 missing tasks from Stage 4 (MappedRDD[5] at map at SparkLR.scala:62)
14/09/11 00:02:45 DEBUG DAGScheduler: New pending tasks: Set(ResultTask(4, 0), ResultTask(4, 1))
14/09/11 00:02:45 INFO TaskSchedulerImpl: Adding task set 4.0 with 2 tasks
14/09/11 00:02:45 DEBUG TaskSetManager: Epoch for TaskSet 4.0: 5
14/09/11 00:02:45 DEBUG TaskSetManager: Valid locality levels for TaskSet 4.0: PROCESS_LOCAL, NODE_LOCAL, ANY
14/09/11 00:02:45 TRACE DAGScheduler: Checking for newly runnable parent stages
14/09/11 00:02:45 TRACE DAGScheduler: running: Set(Stage 4)
14/09/11 00:02:45 TRACE DAGScheduler: waiting: Set()
14/09/11 00:02:45 TRACE DAGScheduler: failed: Set()
14/09/11 00:02:45 DEBUG TaskSchedulerImpl: parentName: , name: TaskSet_4, runningTasks: 0
14/09/11 00:02:45 INFO TaskSetManager: Starting task 4.0:1 as TID 15 on executor 20140910-231511-185277356-5050-425-101: 172.27.11.11 (PROCESS_LOCAL)
14/09/11 00:02:45 INFO TaskSetManager: Serialized task 4.0:1 as 667083 bytes in 18 ms
14/09/11 00:02:45 TRACE DAGScheduler: Checking for newly runnable parent stages
14/09/11 00:02:45 INFO TaskSetManager: Starting task 4.0:0 as TID 16 on executor 20140910-231511-185277356-5050-425-102: 172.27.11.13 (PROCESS_LOCAL)
14/09/11 00:02:45 TRACE DAGScheduler: running: Set(Stage 4)
14/09/11 00:02:45 TRACE DAGScheduler: waiting: Set()
14/09/11 00:02:45 TRACE DAGScheduler: failed: Set()
14/09/11 00:02:45 INFO TaskSetManager: Serialized task 4.0:0 as 667083 bytes in 17 ms
14/09/11 00:02:45 TRACE DAGScheduler: Checking for newly runnable parent stages
14/09/11 00:02:45 TRACE DAGScheduler: running: Set(Stage 4)
14/09/11 00:02:45 TRACE DAGScheduler: waiting: Set()
14/09/11 00:02:45 TRACE DAGScheduler: failed: Set()
14/09/11 00:02:46 INFO TaskSetManager: Re-queueing tasks for 20140910-231511-185277356-5050-425-101 from TaskSet 4.0
14/09/11 00:02:46 WARN TaskSetManager: Lost TID 15 (task 4.0:1)
14/09/11 00:02:46 TRACE DAGScheduler: Checking for newly runnable parent stages
14/09/11 00:02:46 TRACE DAGScheduler: running: Set(Stage 4)
14/09/11 00:02:46 TRACE DAGScheduler: waiting: Set()
14/09/11 00:02:46 TRACE DAGScheduler: failed: Set()
14/09/11 00:02:46 INFO DAGScheduler: Executor lost: 20140910-231511-185277356-5050-425-101 (epoch 5)
14/09/11 00:02:46 INFO BlockManagerMasterActor: Trying to remove executor 20140910-231511-185277356-5050-425-101 from BlockManagerMaster.
14/09/11 00:02:46 INFO BlockManagerMaster: Removed 20140910-231511-185277356-5050-425-101 successfully in removeExecutor
14/09/11 00:02:46 DEBUG MapOutputTrackerMaster: Increasing epoch to 6
14/09/11 00:02:46 TRACE DAGScheduler: Checking for newly runnable parent stages
14/09/11 00:02:46 TRACE DAGScheduler: running: Set(Stage 4)
14/09/11 00:02:46 TRACE DAGScheduler: waiting: Set()
14/09/11 00:02:46 TRACE DAGScheduler: failed: Set()
14/09/11 00:02:46 INFO DAGScheduler: Host added was in lost list earlier: 172.27.11.11
14/09/11 00:02:46 TRACE DAGScheduler: Checking for newly runnable parent stages
14/09/11 00:02:46 TRACE DAGScheduler: running: Set(Stage 4)
14/09/11 00:02:46 TRACE DAGScheduler: waiting: Set()
14/09/11 00:02:46 DEBUG TaskSchedulerImpl: parentName: , name: TaskSet_4, runningTasks: 1
14/09/11 00:02:46 TRACE DAGScheduler: failed: Set()
14/09/11 00:02:46 INFO TaskSetManager: Starting task 4.0:1 as TID 17 on executor 20140910-231511-185277356-5050-425-101: 172.27.11.11 (PROCESS_LOCAL)
14/09/11 00:02:46 INFO TaskSetManager: Serialized task 4.0:1 as 667083 bytes in 15 ms
14/09/11 00:02:46 TRACE DAGScheduler: Checking for newly runnable parent stages
14/09/11 00:02:46 TRACE DAGScheduler: running: Set(Stage 4)
14/09/11 00:02:46 TRACE DAGScheduler: waiting: Set()
14/09/11 00:02:46 TRACE DAGScheduler: failed: Set()
14/09/11 00:02:47 DEBUG TaskSchedulerImpl: parentName: , name: TaskSet_4, runningTasks: 2
14/09/11 00:02:48 ERROR BlockManagerMasterActor: Got two different block manager registrations on 20140910-231511-185277356-5050-425-102
14/09/11 00:02:48 DEBUG DiskBlockManager: Shutdown hook called
{code}

> MesosExecutorBackend crashes in fine grained mode
> -------------------------------------------------
>
>                 Key: SPARK-2445
>                 URL: https://issues.apache.org/jira/browse/SPARK-2445
>             Project: Spark
>          Issue Type: Bug
>          Components: Mesos
>    Affects Versions: 1.0.0
>            Reporter: Dario Rexin
>
> When multiple instances of the MesosExecutorBackend are running on the same
> slave, they will have the same executorId assigned (equal to the Mesos
> slaveId), but will have a different port (which is randomly assigned).
> Because of this, a new BlockManager cannot be registered, because one is
> already registered with the same executorId, but a different BlockManagerId.
> More description and a fix can be found in this PR on GitHub:
> https://github.com/apache/spark/pull/1358
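
For readers following the issue description, the registration conflict can be modelled roughly as below. This is a simplified Python sketch, not Spark's actual code: `BlockManagerId`, `BlockManagerRegistry`, `register`, `DuplicateRegistrationError` and `demo` are hypothetical names chosen for illustration, the ports are made up, and the real master may react differently than by raising an exception. The sketch only assumes what the description states: registrations are keyed by executorId, and two backends on the same slave share that ID while using different ports.

```python
from typing import Dict, NamedTuple


class BlockManagerId(NamedTuple):
    """Hypothetical stand-in for the identity a block manager registers with."""
    executor_id: str  # equal to the Mesos slave ID, per the issue description
    host: str
    port: int         # randomly assigned per backend instance


class DuplicateRegistrationError(Exception):
    """Raised when one executor ID maps to two different block manager IDs."""


class BlockManagerRegistry:
    """Toy registry keyed by executor ID (an assumption for illustration)."""

    def __init__(self) -> None:
        self._by_executor: Dict[str, BlockManagerId] = {}

    def register(self, bm_id: BlockManagerId) -> None:
        existing = self._by_executor.get(bm_id.executor_id)
        # Same executor ID but a different host/port means a conflict.
        if existing is not None and existing != bm_id:
            raise DuplicateRegistrationError(
                "Got two different block manager registrations on "
                + bm_id.executor_id
            )
        self._by_executor[bm_id.executor_id] = bm_id


def demo() -> str:
    """Register two backends from the same slave on different (made-up) ports."""
    slave_id = "20140910-231511-185277356-5050-425-102"
    registry = BlockManagerRegistry()
    registry.register(BlockManagerId(slave_id, "172.27.11.13", 40001))
    try:
        registry.register(BlockManagerId(slave_id, "172.27.11.13", 40002))
    except DuplicateRegistrationError as err:
        return str(err)
    return "no conflict"
```

Calling `demo()` returns the same message text as the ERROR line in the log above, since the second registration reuses the executor ID with a different port.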