Hi Nimi,

This sounds similar to a bug I have come across before. See:
https://jira.apache.org/jira/browse/SPARK-22342?focusedCommentId=16429950&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16429950

It turned out to be a bug in libmesos (the client library used to
communicate with Mesos): "using a failoverTimeout of 0 with Mesos native
scheduler client can result in infinite subscribe loop"
(https://issues.apache.org/jira/browse/MESOS-8171). It can be fixed by
upgrading to a version of libmesos that has the fix.

Susan

On Fri, Jul 13, 2018 at 3:39 PM, Nimi W <psnim2...@gmail.com> wrote:

> I've come across an issue with Mesos 1.4.1 and Spark 2.2.1. We launch
> Spark tasks using the MesosClusterDispatcher in cluster mode. On a couple
> of occasions, we have noticed that when the Spark Driver crashes (due to
> various causes - human error, network error), sometimes, when the Driver
> is restarted, it issues hundreds of SUBSCRIBE requests per second to
> Mesos until the Mesos Master node gets overwhelmed and crashes. It does
> this again to the next master node, over and over, until it takes down
> all the master nodes. Usually the only thing that will fix it is manually
> stopping the driver and restarting.
>
> Here is a snippet of the log of the mesos master, which just logs the
> repeated SUBSCRIBE command:
> https://gist.github.com/nemosupremo/28ef4acfd7ec5bdcccee9789c021a97f
>
> Here is the output of the spark framework:
> https://gist.github.com/nemosupremo/d098ef4def28ebf96c14d8f87aecd133
> which also just repeats 'Transport endpoint is not connected' over and
> over.
>
> Thanks for any insights

-- 
Susan X. Huynh
Software engineer, Data Agility
xhu...@mesosphere.com