[ https://issues.apache.org/jira/browse/SPARK-23943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
paul mackles updated SPARK-23943:
---------------------------------
    Description:
Two changes:

First, a more robust [health-check|http://mesosphere.github.io/marathon/docs/health-checks.html] for anyone who runs MesosClusterDispatcher as a Marathon app. Specifically, this check verifies that the MesosSchedulerDriver is still running, as we have seen certain cases where it stops (rather quietly) and the only way to revive it is a restart. With this health check, Marathon will restart the dispatcher if the MesosSchedulerDriver stops running. The health check lives at the URL "/health" and returns a 204 when the server is healthy and a 503 when it is not (e.g. the MesosSchedulerDriver stopped running).

Second, a server status endpoint that replies with some basic metrics about the server. The status endpoint resides at the URL "/status" and responds with:

{code:java}
{
  "action" : "ServerStatusResponse",
  "launchedDrivers" : 0,
  "message" : "server OK",
  "queuedDrivers" : 0,
  "schedulerDriverStopped" : false,
  "serverSparkVersion" : "2.3.1-SNAPSHOT",
  "success" : true
}
{code}

As you can see, it includes a snapshot of the metrics/health of the scheduler. Useful for quick debugging/troubleshooting/monitoring.

  was:
Add a more robust health-check to MesosRestServer so that anyone who runs MesosClusterDispatcher as a Marathon app can use it to check the health of the server:

[http://mesosphere.github.io/marathon/docs/health-checks.html]

Specifically, this check verifies that the MesosSchedulerDriver is still running, as we have seen certain cases where it dies (rather quietly) and the only way to revive it is a restart. With this health check, Marathon will restart the dispatcher if the MesosSchedulerDriver stops running. The health check lives at the URL "/health" and returns a 204 when the server is healthy and a 503 when it is not (e.g. the MesosSchedulerDriver stopped running).
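
For anyone who wants to consume the two endpoints from a script rather than from Marathon, here is a minimal client-side sketch in Python. It is not part of the patch. The function names (is_healthy, summarize_status, probe) and the base_url parameter are hypothetical, chosen only for illustration. The only things taken from the description are the "/health" path, the 204/503 codes, and the field names in the sample "/status" response. Treating any other status code as an error is an assumption of this sketch.

```python
import json
import urllib.error
import urllib.request

HEALTHY = 204    # per the description: server is healthy
UNHEALTHY = 503  # per the description: e.g. the MesosSchedulerDriver stopped


def is_healthy(status_code):
    """Map a /health response code to True/False as the description states."""
    if status_code == HEALTHY:
        return True
    if status_code == UNHEALTHY:
        return False
    # Assumption of this sketch: anything else is unexpected.
    raise ValueError(f"unexpected /health status code: {status_code}")


def summarize_status(body):
    """Extract the fields shown in the sample /status response from a JSON string."""
    status = json.loads(body)
    return {
        "ok": status["success"] and not status["schedulerDriverStopped"],
        "launched": status["launchedDrivers"],
        "queued": status["queuedDrivers"],
        "version": status["serverSparkVersion"],
    }


def probe(base_url, timeout=5.0):
    """GET <base_url>/health and return True/False (base_url is hypothetical)."""
    try:
        with urllib.request.urlopen(base_url + "/health", timeout=timeout) as resp:
            return is_healthy(resp.status)
    except urllib.error.HTTPError as err:
        # urllib raises HTTPError for 4xx/5xx responses, so a 503 arrives here.
        return is_healthy(err.code)


# The sample response from the description, used as offline input.
SAMPLE = """
{
  "action" : "ServerStatusResponse",
  "launchedDrivers" : 0,
  "message" : "server OK",
  "queuedDrivers" : 0,
  "schedulerDriverStopped" : false,
  "serverSparkVersion" : "2.3.1-SNAPSHOT",
  "success" : true
}
"""

if __name__ == "__main__":
    print(summarize_status(SAMPLE))
```

Keeping the status-code mapping and the JSON parsing separate from the network call means both can be exercised without a running dispatcher. Only probe() touches the network.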
> Improve observability of MesosRestServer/MesosClusterDispatcher
> ---------------------------------------------------------------
>
>                 Key: SPARK-23943
>                 URL: https://issues.apache.org/jira/browse/SPARK-23943
>             Project: Spark
>          Issue Type: Improvement
>          Components: Deploy, Mesos
>    Affects Versions: 2.2.1, 2.3.0
>         Environment:
>            Reporter: paul mackles
>            Priority: Minor
>             Fix For: 2.4.0