We're running into an issue where the master periodically loses connectivity
with workers in the Spark cluster. The issue seems to manifest when the
cluster is under heavy load, but we're not entirely sure when it happens.
I've seen one or two other messages to this list about this issue, but no
one seems to have identified the actual bug.

So, to work around the issue, we'd like to programmatically monitor the
number of workers connected to the master and restart the cluster when the
master loses track of some of its workers. Any ideas on how to write such a
health check programmatically?
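One approach, assuming you're running the standalone cluster manager: the master's web UI serves a JSON status page at /json (port 8080 by default) that lists each registered worker and its state. A small sketch of a health check built on that endpoint follows; the master hostname, port, and the EXPECTED_WORKERS threshold are placeholders for your deployment, and the restart hook is left to you.

```python
import json
import urllib.request

# Assumptions: standalone-mode master with its web UI on the default
# port 8080; MASTER_URL and EXPECTED_WORKERS are placeholders.
MASTER_URL = "http://spark-master:8080/json"
EXPECTED_WORKERS = 4

def count_alive_workers(status):
    """Count workers the master currently considers ALIVE.

    `status` is the dict the standalone master serves at /json;
    its "workers" entry lists each worker with a "state" field.
    """
    return sum(1 for w in status.get("workers", [])
               if w.get("state") == "ALIVE")

def healthy(status, expected=EXPECTED_WORKERS):
    """True when at least `expected` workers are ALIVE."""
    return count_alive_workers(status) >= expected

if __name__ == "__main__":
    # Poll the master; run this from cron or a monitoring loop.
    with urllib.request.urlopen(MASTER_URL, timeout=10) as resp:
        status = json.load(resp)
    if not healthy(status):
        # Trigger your restart logic here, e.g. shell out to
        # sbin/stop-all.sh followed by sbin/start-all.sh.
        print("unhealthy: only %d alive workers"
              % count_alive_workers(status))
```

You could run this from cron every minute or so and kick off a cluster restart (or an alert) when the ALIVE count drops below the threshold.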

Thanks,
Allen



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Monitoring-spark-dis-associated-workers-tp7358.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.