[jira] [Comment Edited] (SPARK-18935) Use Mesos "Dynamic Reservation" resource for Spark

Stavros Kontopoulos (JIRA) Thu, 28 Sep 2017 11:18:59 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-18935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16184588#comment-16184588
 ]


Stavros Kontopoulos edited comment on SPARK-18935 at 9/28/17 6:17 PM:
----------------------------------------------------------------------

I verified the example: the error is the same, and the cause is the same as in 
the cluster mode case:

{noformat}

17/09/28 21:07:34 INFO MesosCoarseGrainedSchedulerBackend: Mesos task 1 is now 
TASK_ERROR
17/09/28 21:07:34 INFO MesosCoarseGrainedSchedulerBackend: Blacklisting Mesos 
slave 433038b9-80aa-43ef-b6eb-0075f5028d37-S0 due to too many failures; is 
Spark installed on it?
17/09/28 21:07:34 DEBUG CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to 
remove executor 1 with reason Executor finished with state LOST
17/09/28 21:07:34 INFO BlockManagerMaster: Removal of executor 1 requested
17/09/28 21:07:34 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to 
remove non-existent executor 1
17/09/28 21:07:34 INFO BlockManagerMasterEndpoint: Trying to remove executor 1 
from BlockManagerMaster.

{noformat}

The task is failing and the agent is blacklisted, which leads to starvation. 
In the scheduler we check the number of task failures for a slave in order to 
avoid future launches there:

{code:java}
slaves.get(slaveId).map(_.taskFailures).getOrElse(0) < MAX_SLAVE_FAILURES 
{code}
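
To illustrate how this check leads to starvation, here is a minimal sketch of the failure gate. It is a simplified model and not the actual scheduler backend: only taskFailures and MAX_SLAVE_FAILURES come from the snippet above, while the Slave class, the helper names and the threshold value of 2 are assumptions made for illustration.

```scala
import scala.collection.mutable

// Simplified model of the failure gate. Only taskFailures and
// MAX_SLAVE_FAILURES appear in the quoted check; everything else is hypothetical.
object BlacklistSketch {
  val MAX_SLAVE_FAILURES = 2 // assumed threshold for this sketch

  final class Slave { var taskFailures: Int = 0 }

  val slaves = mutable.HashMap.empty[String, Slave]

  // Same shape as the quoted check: an agent is usable only while its
  // failure count is below the threshold. Unknown agents count as 0 failures.
  def canLaunchOn(slaveId: String): Boolean =
    slaves.get(slaveId).map(_.taskFailures).getOrElse(0) < MAX_SLAVE_FAILURES

  // Record one failed task (for example a TASK_ERROR update) against the agent.
  def recordFailure(slaveId: String): Unit = {
    val slave = slaves.getOrElseUpdate(slaveId, new Slave)
    slave.taskFailures += 1
  }
}
```

With a single agent in the cluster, once the count reaches the threshold every later offer from that agent is skipped, so the job never gets resources even though the agent keeps offering them.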

The task is failing due to:

{noformat}
I0928 21:07:34.621839  5559 master.cpp:6532] Sending status update TASK_ERROR 
for task 0 of framework e46985fe-1392-4d39-a3d5-e7ec77810695-0004 'Total 
resources cpus(spark-prive)(allocated: spark-prive):8; 
mem(spark-prive)(allocated: spark-prive):1408 required by task and its executor 
is more than available ports(spark-prive, )(allocated: 
spark-prive):[31000-32000]; disk(spark-prive, )(allocated: spark-prive):1000; 
cpus(spark-prive, )(allocated: spark-prive):8; mem(spark-prive, )(allocated: 
spark-prive):10024; mem(*)(allocated: spark-prive):4590; disk(*)(allocated: 
spark-prive):103216'
I0928 21:07:34.622593  5559 hierarchical.cpp:850] Updated allocation of 
framework e46985fe-1392-4d39-a3d5-e7ec77810695-0004 on agent 
433038b9-80aa-43ef-b6eb-0075f5028d37-S0 from ports(spark-prive, )(allocated: 
spark-prive):[31000-32000]; disk(spark-prive, )(allocated: spark-prive):1000; 
cpus(spark-prive, )(allocated: spark-prive):8; mem(spark-prive, )(allocated: 
spark-prive):10024; mem(*)(allocated: spark-prive):4590; disk(*)(allocated: 
spark-prive):103216 to ports(spark-prive, )(allocated: 
spark-prive):[31000-32000]; disk(spark-prive, )(allocated: spark-prive):1000; 
cpus(spark-prive, )(allocated: spark-prive):8; mem(spark-prive, )(allocated: 
spark-prive):10024; mem(*)(allocated: spark-prive):4590; disk(*)(allocated: 
spark-prive):103216
I0928 21:07:34.647950  5559 master.cpp:4941] Processing REVIVE call for 
framework e46985fe-1392-4d39-a3d5-e7ec77810695-0004 (Spark Pi) at 
scheduler-df433215-b87c-4b9b-993c-a3253c5f11a8@127.0.1.1:34775

{noformat}

So again, it's the same cause that I have seen before.
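
One plausible reading of the master log: the task asks for cpus(spark-prive) and mem(spark-prive) without reservation details, while the offer carries cpus(spark-prive, ) and mem(spark-prive, ), i.e. dynamically reserved resources with an (empty) principal. The sketch below is a toy model of that mismatch and not the Mesos API: the Resource case class, its fields and the matching rule are assumptions for illustration, and the amounts are taken from the log above.

```scala
// Toy model of resource matching; NOT the Mesos API. All names are hypothetical.
object ReservationSketch {
  // reservationPrincipal is Some(...) for a dynamically reserved resource and
  // None when no reservation info is attached.
  final case class Resource(
      name: String,
      role: String,
      reservationPrincipal: Option[String],
      amount: Double)

  // Assumed rule: a request is only satisfied by offered resources with the
  // same name, role and reservation metadata.
  def available(offer: Seq[Resource], req: Resource): Double =
    offer
      .filter(o =>
        o.name == req.name &&
          o.role == req.role &&
          o.reservationPrincipal == req.reservationPrincipal)
      .map(_.amount)
      .sum

  // True when every requested resource is covered by matching offered resources.
  def fits(offer: Seq[Resource], request: Seq[Resource]): Boolean =
    request.forall(r => available(offer, r) >= r.amount)
}
```

Under this model, a request for 8 cpus and 1408 mem in role spark-prive with no reservation info does not fit an offer whose spark-prive resources carry a reservation, although the raw amounts (8 cpus, 10024 mem) would be sufficient. The same request with the reservation info preserved does fit.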



> Use Mesos "Dynamic Reservation" resource for Spark
> --------------------------------------------------
>
>                 Key: SPARK-18935
>                 URL: https://issues.apache.org/jira/browse/SPARK-18935
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 2.0.0, 2.0.1, 2.0.2
>            Reporter: jackyoh
>
> I'm running Spark on Apache Mesos.
> Please follow these steps to reproduce the issue:
> 1. First, reserve resources through the Mesos master:
> curl -i -d slaveId=c24d1cfb-79f3-4b07-9f8b-c7b19543a333-S0 -d 
> resources='[{"name":"cpus","type":"SCALAR","scalar":{"value":20},"role":"spark","reservation":{"principal":""}},{"name":"mem","type":"SCALAR","scalar":{"value":4096},"role":"spark","reservation":{"principal":""}}]'
>  -X POST http://192.168.1.118:5050/master/reserve
> 2. Then run the spark-submit command:
> ./spark-submit --class org.apache.spark.examples.SparkPi --master 
> mesos://192.168.1.118:5050 --conf spark.mesos.role=spark  
> ../examples/jars/spark-examples_2.11-2.0.2.jar 10000
> The console then keeps logging the same warning message, as shown below: 
> 16/12/19 22:33:28 WARN TaskSchedulerImpl: Initial job has not accepted any 
> resources; check your cluster UI to ensure that workers are registered and 
> have sufficient resources



