Re: Spark on Mesos: Spark issuing hundreds of SUBSCRIBE requests / second and crashing Mesos

2018-07-23 Thread Nimi W
That does sound like it could be it - I checked our libmesos version and it
is 1.4.1. I'll try upgrading libmesos.

Thanks.

On Mon, Jul 23, 2018 at 12:13 PM Susan X. Huynh wrote:

> Hi Nimi,
>
> This sounds similar to a bug I have come across before. See:
> https://jira.apache.org/jira/browse/SPARK-22342?focusedCommentId=16429950&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16429950
>
> It turned out to be a bug in libmesos (the client library used to
> communicate with Mesos): "using a failoverTimeout of 0 with Mesos native
> scheduler client can result in infinite subscribe loop" (
> https://issues.apache.org/jira/browse/MESOS-8171). It can be fixed by
> upgrading to a version of libmesos that has the fix.
>
> Susan
>
>
> On Fri, Jul 13, 2018 at 3:39 PM, Nimi W wrote:
>
>> I've come across an issue with Mesos 1.4.1 and Spark 2.2.1. We launch
>> Spark tasks using the MesosClusterDispatcher in cluster mode. On a couple
>> of occasions, we have noticed that after the Spark Driver crashes (due to
>> various causes, such as human error or network errors), the restarted Driver
>> sometimes issues hundreds of SUBSCRIBE requests per second to Mesos until
>> the Mesos Master node is overwhelmed and crashes. It then does the same to
>> the next master node, over and over, until it has taken down all the master
>> nodes. Usually the only thing that fixes it is manually stopping the driver
>> and restarting it.
>>
>> Here is a snippet of the log of the mesos master, which just logs the
>> repeated SUBSCRIBE command:
>> https://gist.github.com/nemosupremo/28ef4acfd7ec5bdcccee9789c021a97f
>>
>> Here is the output of the spark framework:
>> https://gist.github.com/nemosupremo/d098ef4def28ebf96c14d8f87aecd133 which
>> also just repeats 'Transport endpoint is not connected' over and over.
>>
>> Thanks for any insights
>>
>>
>>
>
>
> --
> Susan X. Huynh
> Software engineer, Data Agility
> xhu...@mesosphere.com
>


Re: Spark on Mesos: Spark issuing hundreds of SUBSCRIBE requests / second and crashing Mesos

2018-07-23 Thread Susan X. Huynh
Hi Nimi,

This sounds similar to a bug I have come across before. See:
https://jira.apache.org/jira/browse/SPARK-22342?focusedCommentId=16429950&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16429950

It turned out to be a bug in libmesos (the client library used to
communicate with Mesos): "using a failoverTimeout of 0 with Mesos native
scheduler client can result in infinite subscribe loop" (
https://issues.apache.org/jira/browse/MESOS-8171). It can be fixed by
upgrading to a version of libmesos that has the fix.
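
As a rough aid for the upgrade check described above, here is a minimal
Python sketch (an illustration, not part of the original message) that
compares an installed libmesos version against the first release carrying the
MESOS-8171 fix. The function names `parse_version` and `needs_upgrade` are
hypothetical. The fix version is left as a parameter because this thread does
not state it, and the sketch assumes plain dotted numeric versions such as
1.4.1 (no "-rc" style suffixes).

```python
def parse_version(text):
    """Turn a dotted numeric version such as '1.4.1' into a tuple (1, 4, 1).

    Assumption: the version has no suffix such as '-rc1'.
    """
    return tuple(int(part) for part in text.strip().split("."))


def needs_upgrade(installed, first_fixed):
    """Return True when `installed` is older than `first_fixed`.

    `first_fixed` is a placeholder: look up the actual fix version on the
    MESOS-8171 ticket rather than trusting a value hard-coded here.
    """
    return parse_version(installed) < parse_version(first_fixed)
```

For the libmesos 1.4.1 install mentioned in this thread, one would call
`needs_upgrade("1.4.1", fix_version)` with `fix_version` taken from the
ticket. Tuples are compared numerically, so 1.10.0 correctly sorts after
1.9.3, which a plain string comparison would get wrong.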

Susan


On Fri, Jul 13, 2018 at 3:39 PM, Nimi W wrote:

> I've come across an issue with Mesos 1.4.1 and Spark 2.2.1. We launch
> Spark tasks using the MesosClusterDispatcher in cluster mode. On a couple
> of occasions, we have noticed that after the Spark Driver crashes (due to
> various causes, such as human error or network errors), the restarted Driver
> sometimes issues hundreds of SUBSCRIBE requests per second to Mesos until
> the Mesos Master node is overwhelmed and crashes. It then does the same to
> the next master node, over and over, until it has taken down all the master
> nodes. Usually the only thing that fixes it is manually stopping the driver
> and restarting it.
>
> Here is a snippet of the log of the mesos master, which just logs the
> repeated SUBSCRIBE command:
> https://gist.github.com/nemosupremo/28ef4acfd7ec5bdcccee9789c021a97f
>
> Here is the output of the spark framework:
> https://gist.github.com/nemosupremo/d098ef4def28ebf96c14d8f87aecd133 which
> also just repeats 'Transport endpoint is not connected' over and over.
>
> Thanks for any insights
>
>
>


-- 
Susan X. Huynh
Software engineer, Data Agility
xhu...@mesosphere.com


Spark on Mesos: Spark issuing hundreds of SUBSCRIBE requests / second and crashing Mesos

2018-07-13 Thread Nimi W
I've come across an issue with Mesos 1.4.1 and Spark 2.2.1. We launch Spark
tasks using the MesosClusterDispatcher in cluster mode. On a couple of
occasions, we have noticed that after the Spark Driver crashes (due to various
causes, such as human error or network errors), the restarted Driver sometimes
issues hundreds of SUBSCRIBE requests per second to Mesos until the Mesos
Master node is overwhelmed and crashes. It then does the same to the next
master node, over and over, until it has taken down all the master nodes.
Usually the only thing that fixes it is manually stopping the driver and
restarting it.

Here is a snippet of the log of the mesos master, which just logs the
repeated SUBSCRIBE command:
https://gist.github.com/nemosupremo/28ef4acfd7ec5bdcccee9789c021a97f

Here is the output of the spark framework:
https://gist.github.com/nemosupremo/d098ef4def28ebf96c14d8f87aecd133 which
also just repeats 'Transport endpoint is not connected' over and over.

Thanks for any insights
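
To put a number on a SUBSCRIBE storm like the one described above, the
following Python sketch (an illustration, not part of the original post)
counts SUBSCRIBE log lines per second in a Mesos master log. It rests on two
assumptions that should be checked against the real log: that each line
starts with a glog-style prefix such as "I0713 15:39:01.123456", and that the
relevant lines contain the word SUBSCRIBE. The function names and the default
threshold of 100 are hypothetical.

```python
import re
from collections import Counter

# Assumed glog-style prefix: severity letter, MMDD, then HH:MM:SS.fraction.
_GLOG_PREFIX = re.compile(r"^[IWEF](\d{4}) (\d{2}:\d{2}:\d{2})\.\d+")


def subscribe_rate(lines, keyword="SUBSCRIBE"):
    """Count lines mentioning `keyword`, bucketed by (MMDD, HH:MM:SS)."""
    per_second = Counter()
    for line in lines:
        if keyword not in line:
            continue
        match = _GLOG_PREFIX.match(line)
        if match:  # lines without the assumed prefix are skipped
            per_second[(match.group(1), match.group(2))] += 1
    return per_second


def storm_seconds(per_second, threshold=100):
    """Return the seconds, in sorted order, with at least `threshold` hits."""
    return sorted(key for key, count in per_second.items() if count >= threshold)
```

A typical use would be `storm_seconds(subscribe_rate(open(path)))`, where
`path` points at a master log file. Any second it returns saw at least the
threshold number of SUBSCRIBE lines.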