I'm using the spark-ec2 script to launch a 30-node r3.8xlarge cluster. Occasionally, several nodes become unresponsive: first HDFS complains that it can't find some blocks; then, when I go to restart Hadoop, the messages indicate that the connection to some nodes timed out; and when I check, I can't SSH into those nodes at all.
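For context, here is roughly how I've been narrowing down which nodes are gone. This is just a sketch of my workflow, not a fix: the slaves-file path and the exact `dfsadmin -report` wording vary by spark-ec2/Hadoop version, and the hostnames and key paths are placeholders for my setup.

```shell
# Sketch of my diagnostic steps; paths and hostnames are placeholders.

# 1. Ask HDFS which blocks it considers missing and which datanodes are dead.
hdfs fsck / | tail -n 20
hdfs dfsadmin -report | grep -i dead

# 2. Probe each slave directly; a wedged instance just hangs, so bound it.
#    (/root/spark/conf/slaves is where my spark-ec2 cluster lists workers;
#     adjust for your layout.)
for host in $(cat /root/spark/conf/slaves); do
  timeout 15 ssh -o ConnectTimeout=5 -o BatchMode=yes root@"$host" true \
    || echo "unreachable: $host"
done

# 3. Cross-check from the EC2 side (requires the AWS CLI configured locally):
#    instances that fail EC2's own status checks show up here.
aws ec2 describe-instance-status --include-all-instances \
  --query 'InstanceStatuses[?InstanceStatus.Status!=`ok`].InstanceId'
```

Step 2 is usually what confirms it for me: the same hosts that HDFS reports as dead are the ones SSH can no longer reach.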
Is this a problem others have experienced? What is causing this random failure (or where can I look for the relevant logs)? And how can I recover from it, other than destroying the cluster and starting anew? That approach is time-consuming and tedious, and it requires pulling my large dataset from S3 down to HDFS all over again, but it's what I've been doing so far.