Hi, this is just a follow-up regarding this issue. It turns out it's caused by a bug in Spark. I created a JIRA issue for it: https://issues.apache.org/jira/browse/SPARK-2204 and submitted a patch.
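In case it helps anyone hitting the same failures before the patch lands: as described in the thread below, coarse-grained mode is unaffected, so switching modes works around the bug. A minimal sketch, assuming Spark 1.0 on Mesos (the master URL and app name are placeholders; spark.mesos.coarse defaults to false, i.e. fine-grained):

```scala
// Workaround sketch: force coarse-grained Mesos mode, which this thread
// reports as unaffected by the bug. Master URL and app name are placeholders.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("mesos://mesos-master.example.com:5050") // placeholder master URL
  .setAppName("my-job")                               // placeholder app name
  .set("spark.mesos.coarse", "true")                  // default is false (fine-grained)

val sc = new SparkContext(conf)
```

The same property can also be set in conf/spark-defaults.conf instead of in code.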
Any chance this could be included in the 1.0.1 release?

Thanks,

- Sebastien

On Tue, Jun 17, 2014 at 2:57 PM, Sébastien Rainville <sebastienrainvi...@gmail.com> wrote:

> Hi,
>
> I'm having trouble running Spark on Mesos in fine-grained mode. I'm
> running Spark 1.0.0 and Mesos 0.18.0. The tasks are failing randomly,
> which most of the time, but not always, causes the job to fail. The same
> code runs fine in coarse-grained mode. I see the following exceptions in
> the logs of the Spark driver:
>
> W0617 10:57:36.774382 8735 sched.cpp:901] Attempting to launch task 21 with an unknown offer 20140416-011500-1369465866-5050-26096-52332715
> W0617 10:57:36.774433 8735 sched.cpp:901] Attempting to launch task 22 with an unknown offer 20140416-011500-1369465866-5050-26096-52332715
> 14/06/17 10:57:36 INFO TaskSetManager: Re-queueing tasks for 201311011608-1369465866-5050-9189-46 from TaskSet 0.0
> 14/06/17 10:57:36 WARN TaskSetManager: Lost TID 22 (task 0.0:2)
> 14/06/17 10:57:36 WARN TaskSetManager: Lost TID 19 (task 0.0:0)
> 14/06/17 10:57:36 WARN TaskSetManager: Lost TID 21 (task 0.0:1)
> 14/06/17 10:57:36 INFO DAGScheduler: Executor lost: 201311011608-1369465866-5050-9189-46 (epoch 0)
> 14/06/17 10:57:36 INFO BlockManagerMasterActor: Trying to remove executor 201311011608-1369465866-5050-9189-46 from BlockManagerMaster.
> 14/06/17 10:57:36 INFO BlockManagerMaster: Removed 201311011608-1369465866-5050-9189-46 successfully in removeExecutor
> 14/06/17 10:57:36 DEBUG MapOutputTrackerMaster: Increasing epoch to 1
> 14/06/17 10:57:36 INFO DAGScheduler: Host added was in lost list earlier: ca1-dcc1-0065.lab.mtl
>
> I don't see any exceptions in the Spark executor logs.
> The only error message I found in Mesos itself is warnings in the Mesos
> master:
>
> W0617 10:57:36.816748 26100 master.cpp:1615] Failed to validate task 21 : Task 21 attempted to use cpus(*):1 combined with already used cpus(*):1; mem(*):2048 is greater than offered mem(*):3216; disk(*):98304; ports(*):[11900-11919, 11921-11995, 11997-11999]; cpus(*):1
> W0617 10:57:36.819807 26100 master.cpp:1615] Failed to validate task 22 : Task 22 attempted to use cpus(*):1 combined with already used cpus(*):1; mem(*):2048 is greater than offered mem(*):3216; disk(*):98304; ports(*):[11900-11919, 11921-11995, 11997-11999]; cpus(*):1
> W0617 10:57:36.932287 26102 master.cpp:1615] Failed to validate task 28 : Task 28 attempted to use cpus(*):1 combined with already used cpus(*):1; mem(*):2048 is greater than offered cpus(*):1; mem(*):3216; disk(*):98304; ports(*):[11900-11960, 11962-11978, 11980-11999]
> W0617 11:05:52.783133 26098 master.cpp:2106] Ignoring unknown exited executor 201311011608-1369465866-5050-9189-46 on slave 201311011608-1369465866-5050-9189-46 (ca1-dcc1-0065.lab.mtl)
> W0617 11:05:52.787739 26103 master.cpp:2106] Ignoring unknown exited executor 201311011608-1369465866-5050-9189-34 on slave 201311011608-1369465866-5050-9189-34 (ca1-dcc1-0053.lab.mtl)
> W0617 11:05:52.790292 26102 master.cpp:2106] Ignoring unknown exited executor 201311011608-1369465866-5050-9189-59 on slave 201311011608-1369465866-5050-9189-59 (ca1-dcc1-0079.lab.mtl)
> W0617 11:05:52.800649 26099 master.cpp:2106] Ignoring unknown exited executor 201311011608-1369465866-5050-9189-18 on slave 201311011608-1369465866-5050-9189-18 (ca1-dcc1-0027.lab.mtl)
> ...
> (more of those "Ignoring unknown exited executor" warnings)
>
> I analyzed the difference between the execution of the same job in
> coarse-grained mode and fine-grained mode, and I noticed that in
> fine-grained mode the tasks get executed on executors different from the
> ones reported in Spark, as if Spark and Mesos get out of sync as to which
> executor is responsible for which task. See the following:
>
> Coarse-grained mode:
>
>   Spark                                      Mesos
>   Task Index  Task ID  Executor  Status     Task ID (UI)  Task Name  Task ID (logs)  Executor  State
>   0           0        66        SUCCESS    4             "Task 4"   0               66        RUNNING
>   1           1        59        SUCCESS    0             "Task 0"   1               59        RUNNING
>   2           2        54        SUCCESS    10            "Task 10"  2               54        RUNNING
>   3           3        128       SUCCESS    6             "Task 6"   3               128       RUNNING
>   ...
>
> Fine-grained mode:
>
>   Spark                                      Mesos
>   Task Index  Task ID  Executor  Status     Task ID (UI)  Task Name     Task ID (logs)  Executor  State
>   0           23       108       SUCCESS    23            "task 0.0:0"  23              27        FINISHED
>   0           19       65        FAILED     19            "task 0.0:0"  19              86        FINISHED
>   1           21       65        FAILED     (Mesos executor was never created)
>   1           24       92        SUCCESS    24            "task 0.0:1"  24              129       FINISHED
>   2           22       65        FAILED     (Mesos executor was never created)
>   2           25       100       SUCCESS    25            "task 0.0:2"  25              84        FINISHED
>   3           26       80        SUCCESS    26            "task 0.0:3"  26              124       FINISHED
>   4           27       65        FAILED     27            "task 0.0:4"  27              108       FINISHED
>   4           29       92        SUCCESS    29            "task 0.0:4"  29              65        FINISHED
>   5           28       65        FAILED     (Mesos executor was never created)
>   5           30       77        SUCCESS    30            "task 0.0:5"  30              62        FINISHED
>   6           0        53        SUCCESS    0             "task 0.0:6"  0               41        FINISHED
>   7           1        77        SUCCESS    1             "task 0.0:7"  1               114       FINISHED
>   ...
>
> Is it normal for the executor reported in Spark and the one reported in
> Mesos to be different when running in fine-grained mode?
>
> Please note that in this particular example the job actually succeeded,
> but most of the time it fails after 4 failed attempts of a given task.
> This job never fails in coarse-grained mode. Every job works in
> coarse-grained mode and fails the same way in fine-grained mode.
>
> Does anybody have an idea what the problem could be?
>
> Thanks,
>
> - Sebastien