Hi Nimi,

This sounds similar to a bug I have come across before. See:
https://jira.apache.org/jira/browse/SPARK-22342?focusedCommentId=16429950&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16429950

It turned out to be a bug in libmesos (the client library used to
communicate with Mesos): "using a failoverTimeout of 0 with Mesos native
scheduler client can result in infinite subscribe loop" (
https://issues.apache.org/jira/browse/MESOS-8171). It can be fixed by
upgrading to a version of libmesos that has the fix.
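
To check whether a master log shows the same subscribe storm, a rough
sketch like the one below could count SUBSCRIBE lines per second. It is
only an illustration: it assumes the log lines start with the usual glog
prefix ("Lmmdd hh:mm:ss.uuuuuu"), and the function names, the "SUBSCRIBE"
marker and the threshold of 100 are hypothetical choices of mine, not
anything defined by Mesos or Spark.

```python
import re
from collections import Counter

# Assumption: each log line starts with a glog-style prefix such as
# "I0713 15:39:01.123456 ". Only the month/day and the whole second are kept.
GLOG_PREFIX = re.compile(r"^[IWEF](\d{4} \d{2}:\d{2}:\d{2})\.\d+ ")


def subscribes_per_second(lines, marker="SUBSCRIBE"):
    """Count lines containing `marker`, grouped by the second they were logged."""
    counts = Counter()
    for line in lines:
        match = GLOG_PREFIX.match(line)
        # Lines without a recognizable timestamp are skipped.
        if match and marker in line:
            counts[match.group(1)] += 1
    return counts


def busiest_seconds(counts, threshold=100):
    """Return sorted (second, count) pairs at or above a hypothetical threshold."""
    return sorted((sec, n) for sec, n in counts.items() if n >= threshold)
```

Passing an open log file to subscribes_per_second() and the result to
busiest_seconds() would list the seconds with an unusually high number of
SUBSCRIBE lines, which is the pattern described in MESOS-8171.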

Susan


On Fri, Jul 13, 2018 at 3:39 PM, Nimi W <psnim2...@gmail.com> wrote:

> I've come across an issue with Mesos 1.4.1 and Spark 2.2.1. We launch
> Spark tasks using the MesosClusterDispatcher in cluster mode. On a couple
> of occasions, we have noticed that when the Spark Driver crashes (due to
> various causes - human error, network error) and is then restarted, it
> sometimes issues hundreds of SUBSCRIBE requests per second to Mesos
> until the Mesos Master node gets overwhelmed and crashes. It then does
> the same to the next master node, over and over, until it takes down all
> the master nodes. Usually the only thing that fixes it is manually
> stopping the driver and restarting it.
>
> Here is a snippet of the log of the Mesos master, which just logs the
> repeated SUBSCRIBE command:
> https://gist.github.com/nemosupremo/28ef4acfd7ec5bdcccee9789c021a97f
>
> Here is the output of the Spark framework:
> https://gist.github.com/nemosupremo/d098ef4def28ebf96c14d8f87aecd133
> which also just repeats 'Transport endpoint is not connected' over and
> over.
>
> Thanks for any insights
>
>
>


-- 
Susan X. Huynh
Software engineer, Data Agility
xhu...@mesosphere.com
