That does sound like it could be it - I checked our libmesos version and it
is 1.4.1. I'll try upgrading libmesos.
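
A minimal sketch of that version check, for anyone scripting it across hosts:
it compares an installed libmesos version string against the first release
that carries the fix. The function names are hypothetical, and the first fixed
release is deliberately left as a parameter; it should be read from the fix
version listed on MESOS-8171 rather than assumed here.

```python
import re


def parse_version(text):
    """Turn a dotted version string such as '1.4.1' into a tuple of ints."""
    match = re.match(r"^(\d+)\.(\d+)\.(\d+)", text.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return tuple(int(part) for part in match.groups())


def needs_upgrade(installed, first_fixed):
    """Return True when the installed version is older than first_fixed.

    first_fixed is an assumption supplied by the caller (the earliest
    release containing the fix), not a value taken from this thread.
    """
    return parse_version(installed) < parse_version(first_fixed)
```

Comparing integer tuples rather than raw strings keeps the ordering numeric,
so a release such as 1.10.0 is treated as newer than 1.4.1.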

Thanks.

On Mon, Jul 23, 2018 at 12:13 PM Susan X. Huynh <xhu...@mesosphere.io>
wrote:

> Hi Nimi,
>
> This sounds similar to a bug I have come across before. See:
> https://jira.apache.org/jira/browse/SPARK-22342?focusedCommentId=16429950&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16429950
>
> It turned out to be a bug in libmesos (the client library used to
> communicate with Mesos): "using a failoverTimeout of 0 with Mesos native
> scheduler client can result in infinite subscribe loop" (
> https://issues.apache.org/jira/browse/MESOS-8171). It can be fixed by
> upgrading to a version of libmesos that has the fix.
>
> Susan
>
>
> On Fri, Jul 13, 2018 at 3:39 PM, Nimi W <psnim2...@gmail.com> wrote:
>
>> I've come across an issue with Mesos 1.4.1 and Spark 2.2.1. We launch
>> Spark tasks using the MesosClusterDispatcher in cluster mode. On a couple
>> of occasions, we have noticed that when the Spark Driver crashes (due to
>> various causes - human error, network error) and is then restarted, it
>> sometimes issues hundreds of SUBSCRIBE requests per second to Mesos
>> until the Mesos Master node gets overwhelmed and crashes. It then does the
>> same to the next master node, over and over, until it has taken down all
>> the master nodes. Usually the only thing that fixes it is manually stopping
>> the driver and restarting it.
>>
>> Here is a snippet of the Mesos master log, which just shows the
>> repeated SUBSCRIBE call:
>> https://gist.github.com/nemosupremo/28ef4acfd7ec5bdcccee9789c021a97f
>>
>> Here is the output of the Spark framework:
>> https://gist.github.com/nemosupremo/d098ef4def28ebf96c14d8f87aecd133 which
>> also just repeats 'Transport endpoint is not connected' over and over.
>>
>> Thanks for any insights
>>
>>
>>
>
>
> --
> Susan X. Huynh
> Software engineer, Data Agility
> xhu...@mesosphere.com
>
