I've come across an issue with Mesos 1.4.1 and Spark 2.2.1. We launch Spark
tasks using the MesosClusterDispatcher in cluster mode. On a couple of
occasions, we have noticed that when the Spark Driver crashes (due to various
causes - human error, network error), sometimes, when the Driver is
restarted, it issues hundreds of SUBSCRIBE requests per second to Mesos
until the Mesos Master node gets overwhelmed and crashes. It then does the
same to the next master node, over and over, until it has taken down all the
master nodes. Usually the only thing that fixes it is manually stopping the
driver and restarting it.

Here is a snippet of the log of the mesos master, which just logs the
repeated SUBSCRIBE command:
https://gist.github.com/nemosupremo/28ef4acfd7ec5bdcccee9789c021a97f

Here is the output of the spark framework:
https://gist.github.com/nemosupremo/d098ef4def28ebf96c14d8f87aecd133 which
also just repeats 'Transport endpoint is not connected' over and over.

Thanks for any insights
