Re: Spark on Mesos: Spark issuing hundreds of SUBSCRIBE requests / second and crashing Mesos
That does sound like it could be it - I checked our libmesos version and it is 1.4.1. I'll try upgrading libmesos. Thanks.

On Mon, Jul 23, 2018 at 12:13 PM Susan X. Huynh wrote:

> Hi Nimi,
>
> This sounds similar to a bug I have come across before. See:
> https://jira.apache.org/jira/browse/SPARK-22342?focusedCommentId=16429950&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16429950
>
> It turned out to be a bug in libmesos (the client library used to
> communicate with Mesos): "using a failoverTimeout of 0 with Mesos native
> scheduler client can result in infinite subscribe loop"
> (https://issues.apache.org/jira/browse/MESOS-8171). It can be fixed by
> upgrading to a version of libmesos that has the fix.
>
> Susan
>
> On Fri, Jul 13, 2018 at 3:39 PM, Nimi W wrote:
>
>> I've come across an issue with Mesos 1.4.1 and Spark 2.2.1. We launch
>> Spark tasks using the MesosClusterDispatcher in cluster mode. On a couple
>> of occasions, we have noticed that when the Spark Driver crashes (due to
>> various causes - human error, network error), sometimes, when the Driver is
>> restarted, it issues hundreds of SUBSCRIBE requests per second to Mesos
>> until the Mesos Master node gets overwhelmed and crashes. It then does the
>> same to the next master node, over and over, until it takes down all the
>> master nodes. Usually the only thing that fixes it is manually stopping the
>> driver and restarting it.
>>
>> Here is a snippet of the log of the mesos master, which just logs the
>> repeated SUBSCRIBE command:
>> https://gist.github.com/nemosupremo/28ef4acfd7ec5bdcccee9789c021a97f
>>
>> Here is the output of the spark framework:
>> https://gist.github.com/nemosupremo/d098ef4def28ebf96c14d8f87aecd133 which
>> also just repeats 'Transport endpoint is not connected' over and over.
>>
>> Thanks for any insights
>
> --
> Susan X. Huynh
> Software engineer, Data Agility
> xhu...@mesosphere.com
Re: Spark on Mesos: Spark issuing hundreds of SUBSCRIBE requests / second and crashing Mesos
Hi Nimi,

This sounds similar to a bug I have come across before. See:
https://jira.apache.org/jira/browse/SPARK-22342?focusedCommentId=16429950&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16429950

It turned out to be a bug in libmesos (the client library used to communicate with Mesos): "using a failoverTimeout of 0 with Mesos native scheduler client can result in infinite subscribe loop" (https://issues.apache.org/jira/browse/MESOS-8171). It can be fixed by upgrading to a version of libmesos that has the fix.

Susan

On Fri, Jul 13, 2018 at 3:39 PM, Nimi W wrote:

> I've come across an issue with Mesos 1.4.1 and Spark 2.2.1. We launch
> Spark tasks using the MesosClusterDispatcher in cluster mode. On a couple
> of occasions, we have noticed that when the Spark Driver crashes (due to
> various causes - human error, network error), sometimes, when the Driver is
> restarted, it issues hundreds of SUBSCRIBE requests per second to Mesos
> until the Mesos Master node gets overwhelmed and crashes. It then does the
> same to the next master node, over and over, until it takes down all the
> master nodes. Usually the only thing that fixes it is manually stopping the
> driver and restarting it.
>
> Here is a snippet of the log of the mesos master, which just logs the
> repeated SUBSCRIBE command:
> https://gist.github.com/nemosupremo/28ef4acfd7ec5bdcccee9789c021a97f
>
> Here is the output of the spark framework:
> https://gist.github.com/nemosupremo/d098ef4def28ebf96c14d8f87aecd133 which
> also just repeats 'Transport endpoint is not connected' over and over.
>
> Thanks for any insights

--
Susan X. Huynh
Software engineer, Data Agility
xhu...@mesosphere.com
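Since the suggested fix is to move to a libmesos release that contains the MESOS-8171 patch, a quick way to compare an installed version against a target is sketched below. This is only an illustration: the thread does not say which release carries the fix, so `fixed_in` is a value you would take from the ticket yourself, and the function names are made up for this example. The "1.10.0" used in the comment is a hypothetical version chosen to show why the comparison should be numeric, not a statement about where the fix landed.

```python
def parse_version(text):
    """Turn a dotted version such as '1.4.1' into a tuple of ints: (1, 4, 1)."""
    return tuple(int(part) for part in text.strip().split("."))


def needs_upgrade(installed, fixed_in):
    """Return True when `installed` is older than `fixed_in`.

    `fixed_in` is an assumption supplied by the caller (for example, read
    from the Fix Version field of the ticket); it is not known from this thread.
    """
    return parse_version(installed) < parse_version(fixed_in)


# Comparing the raw strings would get this wrong: "1.4.1" < "1.10.0" is False
# as text, because "4" sorts after "1". Comparing int tuples avoids that.
# "1.10.0" is a hypothetical version used only to show the pitfall.
```

The sketch only handles plain dotted numbers; a version string with a suffix such as `-rc1` would raise a `ValueError` and would need extra handling.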
Spark on Mesos: Spark issuing hundreds of SUBSCRIBE requests / second and crashing Mesos
I've come across an issue with Mesos 1.4.1 and Spark 2.2.1. We launch Spark tasks using the MesosClusterDispatcher in cluster mode. On a couple of occasions, we have noticed that when the Spark Driver crashes (due to various causes - human error, network error), sometimes, when the Driver is restarted, it issues hundreds of SUBSCRIBE requests per second to Mesos until the Mesos Master node gets overwhelmed and crashes. It then does the same to the next master node, over and over, until it takes down all the master nodes. Usually the only thing that fixes it is manually stopping the driver and restarting it.

Here is a snippet of the log of the mesos master, which just logs the repeated SUBSCRIBE command: https://gist.github.com/nemosupremo/28ef4acfd7ec5bdcccee9789c021a97f

Here is the output of the spark framework: https://gist.github.com/nemosupremo/d098ef4def28ebf96c14d8f87aecd133 which also just repeats 'Transport endpoint is not connected' over and over.

Thanks for any insights
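For anyone trying to confirm the same symptom, a rough way to measure the SUBSCRIBE rate from a master log is sketched below. It assumes the log lines carry the usual glog-style prefix (severity letter, `MMDD`, then `HH:MM:SS.microseconds`) and that the relevant lines contain the word `SUBSCRIBE`; the function names, the threshold of 100, and the sample lines are all made up for illustration and are not copied from the gists above.

```python
import re
from collections import Counter

# glog-style prefix: severity letter, MMDD, space, HH:MM:SS.fraction
GLOG_PREFIX = re.compile(r"^[IWEF](\d{4}) (\d{2}:\d{2}:\d{2})\.\d+")


def subscribe_rate(lines, needle="SUBSCRIBE"):
    """Count lines containing `needle`, bucketed by (MMDD, HH:MM:SS)."""
    per_second = Counter()
    for line in lines:
        if needle not in line:
            continue
        match = GLOG_PREFIX.match(line)
        if match:  # skip lines without a recognizable timestamp
            per_second[(match.group(1), match.group(2))] += 1
    return per_second


def busiest(per_second, threshold=100):
    """Return the seconds with at least `threshold` matching lines, in time order."""
    return sorted((key, n) for key, n in per_second.items() if n >= threshold)


# Synthetic sample lines, invented for this sketch (not real Mesos output).
sample = [
    "I0713 15:39:01.%06d 1234 master.cpp:1] SUBSCRIBE call (synthetic)" % i
    for i in range(150)
] + ["I0713 15:39:02.000001 1234 master.cpp:1] unrelated line (synthetic)"]

storm_seconds = busiest(subscribe_rate(sample))
```

In practice you would pass an open log file instead of `sample`, for example `subscribe_rate(open(path))`, and adjust `threshold` to whatever counts as abnormal for your cluster.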