I've come across an issue with Mesos 1.4.1 and Spark 2.2.1. We launch Spark tasks using the MesosClusterDispatcher in cluster mode. On a couple of occasions, we have noticed that when the Spark driver crashes (due to various causes: human error, network error) and is then restarted, it sometimes issues hundreds of SUBSCRIBE requests per second to Mesos until the Mesos master node gets overwhelmed and crashes. It then does the same to the next master node, over and over, until it has taken down all the master nodes. Usually the only fix is to manually stop the driver and restart it.
Here is a snippet of the Mesos master log, which just shows the repeated SUBSCRIBE call: https://gist.github.com/nemosupremo/28ef4acfd7ec5bdcccee9789c021a97f

Here is the output of the Spark framework, which also just repeats 'Transport endpoint is not connected' over and over: https://gist.github.com/nemosupremo/d098ef4def28ebf96c14d8f87aecd133

Thanks for any insights.