That does sound like it could be it - I checked our libmesos version and it is 1.4.1. I'll try upgrading libmesos.
Thanks.

On Mon, Jul 23, 2018 at 12:13 PM Susan X. Huynh <xhu...@mesosphere.io> wrote:

> Hi Nimi,
>
> This sounds similar to a bug I have come across before. See:
> https://jira.apache.org/jira/browse/SPARK-22342?focusedCommentId=16429950&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16429950
>
> It turned out to be a bug in libmesos (the client library used to
> communicate with Mesos): "using a failoverTimeout of 0 with Mesos native
> scheduler client can result in infinite subscribe loop"
> (https://issues.apache.org/jira/browse/MESOS-8171). It can be fixed by
> upgrading to a version of libmesos that has the fix.
>
> Susan
>
> On Fri, Jul 13, 2018 at 3:39 PM, Nimi W <psnim2...@gmail.com> wrote:
>
>> I've come across an issue with Mesos 1.4.1 and Spark 2.2.1. We launch
>> Spark tasks using the MesosClusterDispatcher in cluster mode. On a couple
>> of occasions, we have noticed that when the Spark Driver crashes (due to
>> various causes - human error, network error) and is then restarted, it
>> sometimes issues hundreds of SUBSCRIBE requests per second to Mesos,
>> until the Mesos Master node gets overwhelmed and crashes. It then does
>> the same to the next master node, over and over, until it takes down all
>> the master nodes. Usually the only thing that will fix it is manually
>> stopping the driver and restarting it.
>>
>> Here is a snippet of the log of the Mesos master, which just logs the
>> repeated SUBSCRIBE command:
>> https://gist.github.com/nemosupremo/28ef4acfd7ec5bdcccee9789c021a97f
>>
>> Here is the output of the Spark framework:
>> https://gist.github.com/nemosupremo/d098ef4def28ebf96c14d8f87aecd133
>> which also just repeats 'Transport endpoint is not connected' over and
>> over.
>>
>> Thanks for any insights
>
> --
> Susan X. Huynh
> Software engineer, Data Agility
> xhu...@mesosphere.com
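
For anyone else checking whether their libmesos predates the fix, a version comparison along these lines may help. This is only a minimal sketch: the helper names are made up for illustration, and the "first fixed" version passed in below is a placeholder, not a confirmed value. The thread does not say which release contains the fix, so take the real value from the Fix Version field on MESOS-8171.

```python
def parse_version(text):
    """Turn a dotted version such as '1.4.1' into a tuple like (1, 4, 1).

    Tuples compare numerically, so (1, 10, 0) > (1, 9, 0), whereas the
    plain strings '1.10.0' and '1.9.0' would compare the wrong way round.
    """
    return tuple(int(part) for part in text.strip().split("."))


def needs_upgrade(installed, first_fixed):
    """Return True if `installed` is older than `first_fixed`.

    Both arguments are hypothetical inputs supplied by the caller; this
    helper does not query Mesos or libmesos itself.
    """
    return parse_version(installed) < parse_version(first_fixed)


if __name__ == "__main__":
    installed = "1.4.1"      # the libmesos version reported in this thread
    first_fixed = "1.5.0"    # PLACEHOLDER ONLY: look up the real value on MESOS-8171
    if needs_upgrade(installed, first_fixed):
        print("libmesos", installed, "is older than", first_fixed)
    else:
        print("libmesos", installed, "is at or above", first_fixed)
```

The comparison assumes purely numeric dotted versions; suffixes such as `-rc1` would need extra handling.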