Hi experts,

I set up a Spark cluster in standalone mode with 10 workers; the version is
0.9.1. I chose that version on the assumption that the latest release is
always the most stable one. However, when I unintentionally run a
problematic job (for example, one configured with a wrong SPARK_HOME path),
the workers get disconnected from the master, and the master then rejects
the workers' attempts to re-register. In fact, the worker processes don't
die and the worker ports stay open as normal, which makes the situation
worse for an online system because our monitoring system (Nagios) can't
report the issue.
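In the meantime, to at least get Nagios to notice the problem, I'm thinking
of checking the master's status page instead of the worker ports. A rough
sketch is below; it assumes the standalone master web UI exposes a /json
status page listing each worker with its state, and the host, port, and
field names are guesses I still need to verify against our deployment:

import json
import sys
from urllib.request import urlopen

# Hypothetical master host/port and expected worker count; adjust for your cluster.
MASTER_STATUS_URL = "http://spark-master:8080/json"
EXPECTED_WORKERS = 10

def main():
    try:
        with urlopen(MASTER_STATUS_URL, timeout=10) as resp:
            status = json.loads(resp.read().decode("utf-8"))
    except Exception as exc:
        print("CRITICAL - cannot read master status page: %s" % exc)
        sys.exit(2)

    # Count only workers the master itself considers registered and alive,
    # rather than probing worker ports, which stay open even when unregistered.
    alive = [w for w in status.get("workers", []) if w.get("state") == "ALIVE"]
    if len(alive) < EXPECTED_WORKERS:
        print("CRITICAL - only %d of %d workers ALIVE" % (len(alive), EXPECTED_WORKERS))
        sys.exit(2)

    print("OK - %d workers ALIVE" % len(alive))
    sys.exit(0)

if __name__ == "__main__":
    main()

Polling the master like this should at least catch the "process alive but
unregistered" state, but it obviously doesn't fix the underlying
re-registration problem.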

Before turning to the mailing list, I did some quick searches and found
that others have already reported the same issue (see the links below). Can
anyone confirm whether this is a known bug? If so, is there a fix for it,
or some way I can work around it?

http://stackoverflow.com/questions/22031006/spark-0-9-0-worker-keeps-dying-in-standalone-mode-when-job-fails
http://apache-spark-user-list.1001560.n3.nabble.com/master-attempted-to-re-register-the-worker-and-then-took-all-workers-as-unregistered-td553.html
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Worker-crashing-and-Master-not-seeing-recovered-worker-td2312.html

Thanks a lot,
Cheney
