[jira] [Comment Edited] (SPARK-18935) Use Mesos "Dynamic Reservation" resource for Spark

Stavros Kontopoulos (JIRA) Thu, 28 Sep 2017 11:21:56 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-18935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16184588#comment-16184588
 ]


Stavros Kontopoulos edited comment on SPARK-18935 at 9/28/17 6:20 PM:
----------------------------------------------------------------------

I verified the example: the error is the same, and the cause is the same as in 
the cluster mode case:

{noformat}

17/09/28 21:07:34 INFO MesosCoarseGrainedSchedulerBackend: Mesos task 1 is now 
TASK_ERROR
17/09/28 21:07:34 INFO MesosCoarseGrainedSchedulerBackend: Blacklisting Mesos 
slave 433038b9-80aa-43ef-b6eb-0075f5028d37-S0 due to too many failures; is 
Spark installed on it?
17/09/28 21:07:34 DEBUG CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to 
remove executor 1 with reason Executor finished with state LOST
17/09/28 21:07:34 INFO BlockManagerMaster: Removal of executor 1 requested
17/09/28 21:07:34 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to 
remove non-existent executor 1
17/09/28 21:07:34 INFO BlockManagerMasterEndpoint: Trying to remove executor 1 
from BlockManagerMaster.

{noformat}

The task is failing and the agent is blacklisted, which leads to starvation. 
In the scheduler we check the number of task failures for a slave in order to 
avoid future launches there:

{code:java}
slaves.get(slaveId).map(_.taskFailures).getOrElse(0) < MAX_SLAVE_FAILURES 
{code}
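
To make the effect of this check concrete, here is a minimal, self-contained sketch of how a per-agent failure counter ends up excluding an agent from further launches. The names SlaveState, BlacklistSketch, recordFailure and canLaunchOn, as well as the threshold value of 2, are assumptions made for illustration; this is not the actual scheduler backend code.

```scala
import scala.collection.mutable

// Hypothetical stand-in for the per-agent bookkeeping kept by the scheduler.
final class SlaveState(var taskFailures: Int = 0)

object BlacklistSketch {
  // Assumed threshold, chosen only for this illustration.
  val MaxSlaveFailures = 2

  // Failure counters keyed by agent (slave) id.
  val slaves = mutable.HashMap.empty[String, SlaveState]

  // Called whenever a task on the given agent ends in a failure state.
  def recordFailure(slaveId: String): Unit = {
    val state = slaves.getOrElseUpdate(slaveId, new SlaveState())
    state.taskFailures += 1
  }

  // Same shape as the check quoted above: unknown agents count as zero failures.
  def canLaunchOn(slaveId: String): Boolean =
    slaves.get(slaveId).map(_.taskFailures).getOrElse(0) < MaxSlaveFailures
}
```

With a single agent in the cluster, once its counter reaches the threshold every later offer from it is declined, so the job never receives resources.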

The task is failing due to:

{noformat}
I0928 21:07:34.621839  5559 master.cpp:6532] Sending status update TASK_ERROR 
for task 0 of framework e46985fe-1392-4d39-a3d5-e7ec77810695-0004 'Total 
resources cpus(spark-prive)(allocated: spark-prive):8; 
mem(spark-prive)(allocated: spark-prive):1408 required by task and its executor 
is more than available ports(spark-prive, )(allocated: 
spark-prive):[31000-32000]; disk(spark-prive, )(allocated: spark-prive):1000; 
cpus(spark-prive, )(allocated: spark-prive):8; mem(spark-prive, )(allocated: 
spark-prive):10024; mem(*)(allocated: spark-prive):4590; disk(*)(allocated: 
spark-prive):103216'
I0928 21:07:34.622593  5559 hierarchical.cpp:850] Updated allocation of 
framework e46985fe-1392-4d39-a3d5-e7ec77810695-0004 on agent 
433038b9-80aa-43ef-b6eb-0075f5028d37-S0 from ports(spark-prive, )(allocated: 
spark-prive):[31000-32000]; disk(spark-prive, )(allocated: spark-prive):1000; 
cpus(spark-prive, )(allocated: spark-prive):8; mem(spark-prive, )(allocated: 
spark-prive):10024; mem(*)(allocated: spark-prive):4590; disk(*)(allocated: 
spark-prive):103216 to ports(spark-prive, )(allocated: 
spark-prive):[31000-32000]; disk(spark-prive, )(allocated: spark-prive):1000; 
cpus(spark-prive, )(allocated: spark-prive):8; mem(spark-prive, )(allocated: 
spark-prive):10024; mem(*)(allocated: spark-prive):4590; disk(*)(allocated: 
spark-prive):103216
I0928 21:07:34.647950  5559 master.cpp:4941] Processing REVIVE call for 
framework e46985fe-1392-4d39-a3d5-e7ec77810695-0004 (Spark Pi) at 
scheduler-df433215-b87c-4b9b-993c-a3253c5f11a8@127.0.1.1:34775

{noformat}

So again it's the same reason I have seen before.
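
One way to read the TASK_ERROR above (this is an interpretation of the log, not something Mesos states outright): the task asks for cpus(spark-prive):8, while the offer contains cpus(spark-prive, ):8, i.e. the same role and amount but with reservation metadata attached. If matching takes that metadata into account, 8 requested cpus are "more than available" even though 8 are offered. The sketch below models this with a hypothetical Res type and a hypothetical satisfiable helper; it is a simplification and not the Mesos resource-matching code.

```scala
// Hypothetical, simplified model of a scalar resource.
// principal = None means "no dynamic reservation info attached";
// Some("") mirrors the empty principal printed as "(spark-prive, )" in the log.
final case class Res(name: String, role: String, principal: Option[String], value: Double)

object ReservationMatchSketch {
  // A request is satisfied only by offered resources whose name, role
  // and reservation metadata all match.
  def satisfiable(required: Seq[Res], offered: Seq[Res]): Boolean =
    required.forall { r =>
      val available = offered
        .filter(o => o.name == r.name && o.role == r.role && o.principal == r.principal)
        .map(_.value)
        .sum
      available >= r.value
    }

  // Amounts taken from the master log above.
  val offered: Seq[Res] = Seq(
    Res("cpus", "spark-prive", Some(""), 8.0),
    Res("mem", "spark-prive", Some(""), 10024.0)
  )

  // Task resources built without the reservation metadata.
  val withoutReservation: Seq[Res] = Seq(
    Res("cpus", "spark-prive", None, 8.0),
    Res("mem", "spark-prive", None, 1408.0)
  )

  // The same request with the metadata carried over from the offer.
  val withReservation: Seq[Res] = withoutReservation.map(_.copy(principal = Some("")))
}
```

Under these assumptions satisfiable(withoutReservation, offered) is false and satisfiable(withReservation, offered) is true, which matches the behaviour seen in the log.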

If I try to set a principal, the Spark Mesos framework is not able to 
register: even if I also set a secret, the driver is aborted:

{noformat}
I0928 21:19:54.793844  7363 sched.cpp:232] Version: 1.3.0
I0928 21:19:54.795897  7355 sched.cpp:336] New master detected at 
master@127.0.1.1:5050
I0928 21:19:54.796042  7355 sched.cpp:407] Authenticating with master 
master@127.0.1.1:5050
I0928 21:19:54.796052  7355 sched.cpp:414] Using default CRAM-MD5 authenticatee
I0928 21:19:54.796152  7361 authenticatee.cpp:97] Initializing client SASL
I0928 21:19:54.814299  7361 authenticatee.cpp:121] Creating new client SASL 
connection
I0928 21:19:54.815407  7357 authenticatee.cpp:213] Received SASL authentication 
mechanisms: CRAM-MD5
I0928 21:19:54.815421  7357 authenticatee.cpp:239] Attempting to authenticate 
with mechanism 'CRAM-MD5'
I0928 21:19:54.815757  7360 authenticatee.cpp:259] Received SASL authentication 
step
E0928 21:19:54.816179  7355 sched.cpp:507] Master master@127.0.1.1:5050 refused 
authentication
I0928 21:19:54.816193  7355 sched.cpp:1187] Got error 'Master refused 
authentication'
I0928 21:19:54.816200  7355 sched.cpp:2055] Asked to abort the driver
17/09/28 21:19:54 ERROR MesosCoarseGrainedSchedulerBackend: Mesos error: Master 
refused authentication
Exception in thread "Thread-12" org.apache.spark.SparkException: Exiting due to 
error from cluster scheduler: Master refused authentication
        at 
org.apache.spark.scheduler.TaskSchedulerImpl.error(TaskSchedulerImpl.scala:500)
        at 
org.apache.spark.scheduler.cluster.mesos.MesosCoarseGrainedSchedulerBackend.error(MesosCoarseGrainedSchedulerBackend.scala:599)
I0928 21:19:54.817343  7355 sched.cpp:2055] Asked to abort the driver
I0928 21:19:54.817355  7355 sched.cpp:1233] Aborting framework 

{noformat}
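
For reference, a principal and secret can be passed to Spark on Mesos through the spark.mesos.principal and spark.mesos.secret properties. The fragment below is only a configuration sketch: the principal and secret values are placeholders, the object name is hypothetical, and it requires spark-core on the classpath. The master address and role are the ones that appear in the logs above. Authentication can only succeed if the master has been given a matching credential, which may be what is missing in the run shown above.

```scala
import org.apache.spark.SparkConf

object MesosAuthConfSketch {
  // Placeholder credentials; they must match an entry known to the Mesos master.
  val conf: SparkConf = new SparkConf()
    .setAppName("Spark Pi")
    .setMaster("mesos://127.0.1.1:5050")
    .set("spark.mesos.role", "spark-prive")
    .set("spark.mesos.principal", "spark-principal") // hypothetical value
    .set("spark.mesos.secret", "changeme")           // hypothetical value
}
```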







> Use Mesos "Dynamic Reservation" resource for Spark
> --------------------------------------------------
>
>                 Key: SPARK-18935
>                 URL: https://issues.apache.org/jira/browse/SPARK-18935
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 2.0.0, 2.0.1, 2.0.2
>            Reporter: jackyoh
>
> I'm running spark on Apache Mesos
> Please follow these steps to reproduce the issue:
> 1. First, run Mesos resource reserve:
> curl -i -d slaveId=c24d1cfb-79f3-4b07-9f8b-c7b19543a333-S0 -d 
> resources='[{"name":"cpus","type":"SCALAR","scalar":{"value":20},"role":"spark","reservation":{"principal":""}},{"name":"mem","type":"SCALAR","scalar":{"value":4096},"role":"spark","reservation":{"principal":""}}]'
>  -X POST http://192.168.1.118:5050/master/reserve
> 2. Then run spark-submit command:
> ./spark-submit --class org.apache.spark.examples.SparkPi --master 
> mesos://192.168.1.118:5050 --conf spark.mesos.role=spark  
> ../examples/jars/spark-examples_2.11-2.0.2.jar 10000
> And the console will keep logging the same warning message, as shown below: 
> 16/12/19 22:33:28 WARN TaskSchedulerImpl: Initial job has not accepted any 
> resources; check your cluster UI to ensure that workers are registered and 
> have sufficient resources



