[ 
https://issues.apache.org/jira/browse/SPARK-23943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

paul mackles updated SPARK-23943:
---------------------------------
    Description: 
Two changes:

First, a more robust 
[health check|http://mesosphere.github.io/marathon/docs/health-checks.html] 
for anyone who runs MesosClusterDispatcher as a Marathon app. Specifically, 
this check verifies that the MesosSchedulerDriver is still running. We have 
seen cases where the driver stops quietly, and the only way to revive it is 
a restart. With this health check, Marathon will restart the dispatcher if 
the MesosSchedulerDriver stops running. The health check lives at the URL 
"/health" and returns a 204 when the server is healthy and a 503 when it is 
not (e.g. the MesosSchedulerDriver has stopped running).
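
For anyone probing the endpoint outside of Marathon, here is a minimal client-side sketch in Python. It only encodes the contract stated above (204 means healthy, 503 means unhealthy). The function names and the idea of treating every other status code or connection failure as unhealthy are assumptions made for illustration and are not part of the patch.

```python
import urllib.error
import urllib.request


def is_healthy(status_code: int) -> bool:
    """Map the /health status code to a boolean.

    Per the description, 204 means healthy and 503 means unhealthy.
    Treating any other code as unhealthy is an assumption of this sketch.
    """
    return status_code == 204


def probe(base_url: str, timeout: float = 5.0) -> bool:
    """Send GET <base_url>/health and report whether the dispatcher looks healthy.

    base_url is a placeholder for wherever the dispatcher's REST server listens.
    """
    url = base_url.rstrip("/") + "/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return is_healthy(resp.status)
    except urllib.error.HTTPError as err:
        # urllib raises HTTPError for 4xx/5xx responses, including the 503.
        return is_healthy(err.code)
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout: assumed unhealthy here.
        return False
```

Calling `probe("http://dispatcher-host:port")` with the real host and port of the dispatcher returns `True` only when the endpoint answers with a 204.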

Second, a server status endpoint that replies with some basic metrics about the 
server. The status endpoint resides at the URL "/status" and responds with:
{code:java}
{
  "action" : "ServerStatusResponse",
  "launchedDrivers" : 0,
  "message" : "server OK",
  "queuedDrivers" : 0,
  "schedulerDriverStopped" : false,
  "serverSparkVersion" : "2.3.1-SNAPSHOT",
  "success" : true
}{code}
The response includes a snapshot of the scheduler's metrics and health, which 
is useful for quick debugging, troubleshooting, and monitoring. 
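
As a rough illustration of how a monitoring script might consume this response, the Python sketch below parses the sample payload shown above and condenses it into a one-line summary. The field names come from the sample response. The `summarize` function and the summary format are hypothetical and exist only for this example.

```python
import json

# Sample body copied from the /status response shown in the description.
SAMPLE = """
{
  "action" : "ServerStatusResponse",
  "launchedDrivers" : 0,
  "message" : "server OK",
  "queuedDrivers" : 0,
  "schedulerDriverStopped" : false,
  "serverSparkVersion" : "2.3.1-SNAPSHOT",
  "success" : true
}
"""


def summarize(body: str) -> str:
    """Condense a /status response body into a single log-friendly line."""
    status = json.loads(body)
    scheduler = "STOPPED" if status["schedulerDriverStopped"] else "running"
    return "spark={} scheduler={} queued={} launched={}".format(
        status["serverSparkVersion"],
        scheduler,
        status["queuedDrivers"],
        status["launchedDrivers"],
    )


print(summarize(SAMPLE))
# prints: spark=2.3.1-SNAPSHOT scheduler=running queued=0 launched=0
```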

  was:
Add a more robust health check to MesosRestServer so that anyone who runs 
MesosClusterDispatcher as a Marathon app can use it to check the health of the 
server:

[http://mesosphere.github.io/marathon/docs/health-checks.html]

Specifically, this check verifies that the MesosSchedulerDriver is still 
running. We have seen cases where the driver dies quietly, and the only way 
to revive it is a restart. With this health check, Marathon will restart the 
dispatcher if the MesosSchedulerDriver stops running. 

The health check lives at the URL "/health" and returns a 204 when the server 
is healthy and a 503 when it is not (e.g. the MesosSchedulerDriver has stopped 
running).


> Improve observability of MesosRestServer/MesosClusterDispatcher
> ---------------------------------------------------------------
>
>                 Key: SPARK-23943
>                 URL: https://issues.apache.org/jira/browse/SPARK-23943
>             Project: Spark
>          Issue Type: Improvement
>          Components: Deploy, Mesos
>    Affects Versions: 2.2.1, 2.3.0
>         Environment:  
>  
>            Reporter: paul mackles
>            Priority: Minor
>             Fix For: 2.4.0
>



