[ https://issues.apache.org/jira/browse/SPARK-24641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16521524#comment-16521524 ]
Stavros Kontopoulos edited comment on SPARK-24641 at 6/24/18 4:25 PM:
----------------------------------------------------------------------
[~igor.berman] The current idea of the shuffle service on Mesos is to run it on all slaves upfront with a constraint: {{"constraints": [["hostname", "UNIQUE"]]}}. So let's set aside the case of "Marathon hasn't yet provisioned the external shuffle service on a particular node": if that holds, the setup is not going to work by design; AFAIK the service is not meant to scale with the executors.

Now, if the shuffle service is available but there is a communication error due to network issues, then the executor's Mesos task should be unreachable as well, unless the network issues come and go and the executor is visible at some point while the shuffle service is not. In general I can think of a case where the shuffle service has failed but the Mesos executor has started fine. In that case we could tell the executor to fail, so the task can be restarted elsewhere.

Right now, when an executor is started and we get a Mesos task update, we try to connect to the shuffle service. If that fails, we do nothing. So we should probably improve the logic there, but first let's identify what should work and what shouldn't. Btw, there is an ongoing effort to re-design the shuffle service to be backed by some storage and run as a service, so things will improve at some point.

The reason the driver needs to talk to the shuffle service (and is thus exposed to such issues) is described [here|https://github.com/apache/spark/blob/a5849ad9a3e5d41b5938faa7c592bcc6aec36044/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/mesos/MesosExternalShuffleClient.java#L41-L44]: "The reason why the driver has to talk to the service is for cleaning up shuffle files reliably after the application exits. Mesos does not provide a great alternative to do this, so Spark has to detect this itself." Maybe we should check if that still holds.
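The "if that fails, we do nothing" path in {{MesosCoarseGrainedSchedulerBackend.statusUpdate}} could instead retry the registration and, if all attempts are exhausted, kill the task so Mesos reschedules it elsewhere. A minimal, hypothetical sketch of that retry-then-fail policy; {{connectWithRetry}}, the attempt count, and the backoff are all illustrative, not Spark's actual API:

```java
import java.util.function.IntPredicate;

// Hypothetical sketch: retry connecting to the external shuffle service a few
// times; if every attempt fails, report failure so the caller can kill the
// Mesos task instead of silently doing nothing.
public class RetrySketch {
    /**
     * Tries up to maxAttempts times. `connect` simulates one registration
     * attempt (true = connected). Returns true if any attempt succeeded.
     */
    static boolean connectWithRetry(int maxAttempts, IntPredicate connect) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (connect.test(attempt)) {
                return true; // registered with the shuffle service
            }
            try {
                Thread.sleep(100L * attempt); // simple linear backoff between attempts
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return false; // caller would kill the task so it can restart elsewhere
    }

    public static void main(String[] args) {
        // Simulate a shuffle service that only comes up on the third attempt.
        boolean ok = connectWithRetry(5, attempt -> attempt >= 3);
        System.out.println(ok ? "registered" : "giving up; kill executor task");
    }
}
```

The point of the sketch is only the shape of the decision: transient unavailability is absorbed by the retries, while a genuinely dead service leads to an explicit task failure rather than a zombie state.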
Ideally you only need the executor to connect locally to the service (done in the BlockManager). [~susanxhuynh] thoughts?

> Spark-Mesos integration doesn't respect request to abort itself
> ----------------------------------------------------------------
>
> Key: SPARK-24641
> URL: https://issues.apache.org/jira/browse/SPARK-24641
> Project: Spark
> Issue Type: Bug
> Components: Mesos, Shuffle
> Affects Versions: 2.2.0
> Reporter: Igor Berman
> Priority: Major
>
> Hi,
> lately we came across the following corner scenario:
> We are using dynamic allocation with an external shuffle service that is managed by Marathon.
>
> Due to some network/operational issue, the external shuffle service on one of the machines (mesos-slaves) is not available for a few seconds (e.g. Marathon hasn't yet provisioned the external shuffle service on a particular node, but the framework itself has already accepted an offer on this node and tries to start up an executor).
>
> This makes the framework (Spark driver) fail, and I see an error in the driver's stderr (it seems the mesos-agent asks the driver to abort itself). However, the Spark context continues to run (seemingly in a kind of zombie mode: it can't release resources to the cluster and can't get additional offers, since the framework is aborted from Mesos's perspective).
>
> The framework in the Mesos UI moves to the "inactive" state.
> [~skonto] [~susanxhuynh] any input on this problem? Have you come across such behavior?
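For reference, the "run it on all slaves upfront" deployment mentioned above would look roughly like the following Marathon app definition. This is a sketch under assumptions: the app id, install path, and resource numbers are illustrative, not taken from this issue ({{sbin/start-mesos-shuffle-service.sh}} ships with Spark, but the path depends on your install). With the {{UNIQUE}} hostname constraint, scaling {{instances}} up to the number of agents places exactly one shuffle service per host.

{code:json}
{
  "id": "/spark-mesos-shuffle-service",
  "cmd": "/opt/spark/sbin/start-mesos-shuffle-service.sh && tail -f /opt/spark/logs/*",
  "instances": 1,
  "cpus": 1,
  "mem": 1024,
  "constraints": [["hostname", "UNIQUE"]]
}
{code}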
> I'm ready to work on a patch, but currently I don't understand where to start; it seems the driver is too fragile in this sense and something in the mesos-spark integration is missing.
>
> {code:java}
> I0412 07:31:25.827283 274 sched.cpp:759] Framework registered with 15d9838f-b266-413b-842d-f7c3567bd04a-0051
> Exception in thread "Thread-295" java.io.IOException: Failed to connect to my-company.com/10.106.14.61:7337
>     at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:232)
>     at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:182)
>     at org.apache.spark.network.shuffle.mesos.MesosExternalShuffleClient.registerDriverWithShuffleService(MesosExternalShuffleClient.java:75)
>     at org.apache.spark.scheduler.cluster.mesos.MesosCoarseGrainedSchedulerBackend.statusUpdate(MesosCoarseGrainedSchedulerBackend.scala:537)
> Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: my-company.com/10.106.14.61:7337
>     at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>     at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
>     at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:257)
>     at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:291)
>     at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:631)
>     at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566)
>     at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480)
>     at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442)
>     at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
>     at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
>     at java.lang.Thread.run(Thread.java:748)
> I0412 07:35:12.032925 277 sched.cpp:2055] Asked to abort the driver
> I0412 07:35:12.033035 277 sched.cpp:1233] Aborting framework 15d9838f-b266-413b-842d-f7c3567bd04a-0051
> {code}

--
This message was sent by Atlassian JIRA (v7.6.3#76005)