Hi experts, I set up a Spark cluster in standalone mode with 10 workers, running version 0.9.1. I chose this version on the assumption that the latest release is also the most stable one. However, when I unintentionally run a problematic job (for example, one configured with a wrong SPARK_HOME path), the workers become disconnected from the master, and the master then rejects the workers' attempts to re-register. To make matters worse, the worker processes do not die and their ports remain open, so our monitoring system (Nagios) cannot report the issue, which is a real problem for an online system.
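The only workaround I can think of on the monitoring side is to ask the master which workers it actually considers registered, instead of probing the worker ports. Something along these lines is what I have in mind (just a rough sketch; I am assuming the master web UI on port 8080 exposes its cluster state as JSON at /json with a per-worker "state" field, and "spark-master" is a placeholder host name):

#!/usr/bin/env python3
"""Rough sketch of a Nagios-style check that counts ALIVE workers through the
standalone master's web UI instead of probing each worker's port.

Assumptions (not verified against 0.9.1): the master web UI listens on port
8080 and serves its cluster state as JSON at /json, and each worker entry in
that payload carries a "state" field such as "ALIVE" or "DEAD"."""
import json
import sys
import urllib.request

MASTER_UI = "http://spark-master:8080/json"  # hypothetical master host
EXPECTED_WORKERS = 10                        # size of our cluster


def main() -> int:
    try:
        with urllib.request.urlopen(MASTER_UI, timeout=10) as resp:
            state = json.load(resp)
    except Exception as exc:
        print(f"CRITICAL - cannot reach master UI: {exc}")
        return 2  # Nagios CRITICAL exit code

    # Count only the workers the master itself reports as ALIVE.
    alive = [w for w in state.get("workers", []) if w.get("state") == "ALIVE"]
    if len(alive) < EXPECTED_WORKERS:
        print(f"CRITICAL - only {len(alive)} of {EXPECTED_WORKERS} workers ALIVE")
        return 2
    print(f"OK - {len(alive)} workers ALIVE")
    return 0


if __name__ == "__main__":
    sys.exit(main())

That would at least surface the problem, but it does not bring the workers back, so I would still like to know whether there is a proper fix.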
Before turning to the mailing list, I did some quick searching and found that others have already reported the same issue (see the links below). Can anyone confirm whether this is a known bug? If so, is there a fix, or is there some other way to work around it?

http://stackoverflow.com/questions/22031006/spark-0-9-0-worker-keeps-dying-in-standalone-mode-when-job-fails
http://apache-spark-user-list.1001560.n3.nabble.com/master-attempted-to-re-register-the-worker-and-then-took-all-workers-as-unregistered-td553.html
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Worker-crashing-and-Master-not-seeing-recovered-worker-td2312.html

Thanks a lot,
Cheney