[ https://issues.apache.org/jira/browse/SPARK-18935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16184588#comment-16184588 ]
Stavros Kontopoulos edited comment on SPARK-18935 at 9/28/17 6:20 PM:
----------------------------------------------------------------------

I verified the example: the error is the same, and the cause is the same as in the cluster mode case:
{noformat}
17/09/28 21:07:34 INFO MesosCoarseGrainedSchedulerBackend: Mesos task 1 is now TASK_ERROR
17/09/28 21:07:34 INFO MesosCoarseGrainedSchedulerBackend: Blacklisting Mesos slave 433038b9-80aa-43ef-b6eb-0075f5028d37-S0 due to too many failures; is Spark installed on it?
17/09/28 21:07:34 DEBUG CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove executor 1 with reason Executor finished with state LOST
17/09/28 21:07:34 INFO BlockManagerMaster: Removal of executor 1 requested
17/09/28 21:07:34 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 1
17/09/28 21:07:34 INFO BlockManagerMasterEndpoint: Trying to remove executor 1 from BlockManagerMaster.
{noformat}
The task is failing and the agent is blacklisted, which leads to starvation.
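The starvation follows from how task failures are counted per agent. Below is a minimal sketch of that bookkeeping, not Spark's actual code: `AgentState`, `FailureTracker`, and the threshold value of 2 are assumptions made for illustration; only the shape of the comparison follows the scheduler check quoted in this comment.

```scala
import scala.collection.mutable

// Hypothetical stand-in for the per-agent state the scheduler keeps.
final class AgentState { var taskFailures: Int = 0 }

// Hypothetical tracker; names and threshold are illustrative assumptions.
final class FailureTracker(maxAgentFailures: Int = 2) {
  private val agents = mutable.HashMap.empty[String, AgentState]

  // Record a failed task (for example TASK_ERROR) against the given agent.
  def recordFailure(agentId: String): Unit = {
    val state = agents.getOrElseUpdate(agentId, new AgentState)
    state.taskFailures += 1
  }

  // An agent stays eligible only while its failure count is below the threshold.
  def canLaunchOn(agentId: String): Boolean =
    agents.get(agentId).map(_.taskFailures).getOrElse(0) < maxAgentFailures
}
```

On a single-agent cluster, once that agent crosses the threshold no offer can be used any more, so the job waits indefinitely for resources.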
In the scheduler we check the number of task failures on a slave in order to avoid future launches there:
{code:java}
slaves.get(slaveId).map(_.taskFailures).getOrElse(0) < MAX_SLAVE_FAILURES
{code}
The task is failing due to:
{noformat}
I0928 21:07:34.621839 5559 master.cpp:6532] Sending status update TASK_ERROR for task 0 of framework e46985fe-1392-4d39-a3d5-e7ec77810695-0004 'Total resources cpus(spark-prive)(allocated: spark-prive):8; mem(spark-prive)(allocated: spark-prive):1408 required by task and its executor is more than available ports(spark-prive, )(allocated: spark-prive):[31000-32000]; disk(spark-prive, )(allocated: spark-prive):1000; cpus(spark-prive, )(allocated: spark-prive):8; mem(spark-prive, )(allocated: spark-prive):10024; mem(*)(allocated: spark-prive):4590; disk(*)(allocated: spark-prive):103216'
I0928 21:07:34.622593 5559 hierarchical.cpp:850] Updated allocation of framework e46985fe-1392-4d39-a3d5-e7ec77810695-0004 on agent 433038b9-80aa-43ef-b6eb-0075f5028d37-S0 from ports(spark-prive, )(allocated: spark-prive):[31000-32000]; disk(spark-prive, )(allocated: spark-prive):1000; cpus(spark-prive, )(allocated: spark-prive):8; mem(spark-prive, )(allocated: spark-prive):10024; mem(*)(allocated: spark-prive):4590; disk(*)(allocated: spark-prive):103216 to ports(spark-prive, )(allocated: spark-prive):[31000-32000]; disk(spark-prive, )(allocated: spark-prive):1000; cpus(spark-prive, )(allocated: spark-prive):8; mem(spark-prive, )(allocated: spark-prive):10024; mem(*)(allocated: spark-prive):4590; disk(*)(allocated: spark-prive):103216
I0928 21:07:34.647950 5559 master.cpp:4941] Processing REVIVE call for framework e46985fe-1392-4d39-a3d5-e7ec77810695-0004 (Spark Pi) at scheduler-df433215-b87c-4b9b-993c-a3253c5f11a8@127.0.1.1:34775
{noformat}
So again, it is the same cause I have seen before.
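One reading of the TASK_ERROR message, offered as an assumption rather than a confirmed diagnosis: the task asks for `cpus(spark-prive)` and `mem(spark-prive)` with no reservation metadata, while the offer holds `cpus(spark-prive, )` and `mem(spark-prive, )`, which look dynamically reserved with an empty principal. If resources only match when name, role, and reservation all agree, the request is not contained in the offer even though the quantities are sufficient. The sketch below models that with hypothetical types (`Resource`, `ResourceMatch`); it is not the Mesos implementation.

```scala
// Hypothetical model of a scalar resource. reservation is None for a plain
// role-tagged resource and Some(principal) for a dynamically reserved one.
final case class Resource(name: String, role: String, reservation: Option[String], amount: Double)

object ResourceMatch {
  // Sum the offered amount that agrees with the request on name, role, and reservation.
  def available(offer: Seq[Resource], wanted: Resource): Double =
    offer
      .filter(r => r.name == wanted.name && r.role == wanted.role && r.reservation == wanted.reservation)
      .map(_.amount)
      .sum

  // The offer contains the request only if every requested resource is covered.
  def contains(offer: Seq[Resource], request: Seq[Resource]): Boolean =
    request.forall(w => available(offer, w) >= w.amount)

  // Quantities copied from the TASK_ERROR log line; ports and disk omitted for brevity.
  val offer: Seq[Resource] = Seq(
    Resource("cpus", "spark-prive", Some(""), 8.0),
    Resource("mem", "spark-prive", Some(""), 10024.0),
    Resource("mem", "*", None, 4590.0)
  )

  // What the task appears to request: same role, but no reservation metadata.
  val requestWithoutReservation: Seq[Resource] = Seq(
    Resource("cpus", "spark-prive", None, 8.0),
    Resource("mem", "spark-prive", None, 1408.0)
  )

  // The same request carrying the reservation metadata found in the offer.
  val requestWithReservation: Seq[Resource] =
    requestWithoutReservation.map(_.copy(reservation = Some("")))
}
```

Under this model the request is rejected until it carries the same reservation metadata as the offered resources, which would explain a failure despite 8 cpus and 10024 MB of memory being available.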
If I try to set a principal for Spark, the Mesos framework is not able to register: even if I also set a secret, the master refuses authentication and the driver is aborted:
{noformat}
I0928 21:19:54.793844 7363 sched.cpp:232] Version: 1.3.0
I0928 21:19:54.795897 7355 sched.cpp:336] New master detected at master@127.0.1.1:5050
I0928 21:19:54.796042 7355 sched.cpp:407] Authenticating with master master@127.0.1.1:5050
I0928 21:19:54.796052 7355 sched.cpp:414] Using default CRAM-MD5 authenticatee
I0928 21:19:54.796152 7361 authenticatee.cpp:97] Initializing client SASL
I0928 21:19:54.814299 7361 authenticatee.cpp:121] Creating new client SASL connection
I0928 21:19:54.815407 7357 authenticatee.cpp:213] Received SASL authentication mechanisms: CRAM-MD5
I0928 21:19:54.815421 7357 authenticatee.cpp:239] Attempting to authenticate with mechanism 'CRAM-MD5'
I0928 21:19:54.815757 7360 authenticatee.cpp:259] Received SASL authentication step
E0928 21:19:54.816179 7355 sched.cpp:507] Master master@127.0.1.1:5050 refused authentication
I0928 21:19:54.816193 7355 sched.cpp:1187] Got error 'Master refused authentication'
I0928 21:19:54.816200 7355 sched.cpp:2055] Asked to abort the driver
17/09/28 21:19:54 ERROR MesosCoarseGrainedSchedulerBackend: Mesos error: Master refused authentication
Exception in thread "Thread-12" org.apache.spark.SparkException: Exiting due to error from cluster scheduler: Master refused authentication
	at org.apache.spark.scheduler.TaskSchedulerImpl.error(TaskSchedulerImpl.scala:500)
	at org.apache.spark.scheduler.cluster.mesos.MesosCoarseGrainedSchedulerBackend.error(MesosCoarseGrainedSchedulerBackend.scala:599)
I0928 21:19:54.817343 7355 sched.cpp:2055] Asked to abort the driver
I0928 21:19:54.817355 7355 sched.cpp:1233] Aborting framework
{noformat}

> Use Mesos "Dynamic Reservation" resource for Spark
> --------------------------------------------------
>
> Key: SPARK-18935
> URL: https://issues.apache.org/jira/browse/SPARK-18935
> Project: Spark
> Issue Type: Bug
> Affects Versions: 2.0.0, 2.0.1, 2.0.2
> Reporter: jackyoh
>
> I'm running Spark on Apache Mesos.
> Please follow these steps to reproduce the issue:
> 1. First, reserve resources on Mesos:
> curl -i -d slaveId=c24d1cfb-79f3-4b07-9f8b-c7b19543a333-S0 -d resources='[{"name":"cpus","type":"SCALAR","scalar":{"value":20},"role":"spark","reservation":{"principal":""}},{"name":"mem","type":"SCALAR","scalar":{"value":4096},"role":"spark","reservation":{"principal":""}}]' -X POST http://192.168.1.118:5050/master/reserve
> 2. Then run the spark-submit command:
> ./spark-submit --class org.apache.spark.examples.SparkPi --master mesos://192.168.1.118:5050 --conf spark.mesos.role=spark ../examples/jars/spark-examples_2.11-2.0.2.jar 10000
> The console will keep logging the same warning message, as shown below:
> 16/12/19 22:33:28 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources