Hi,

I am unable to see how Shark (and eventually Spark) can recover from a bad node in the cluster. One of my EC2 clusters with 50 nodes ended up with a single node suffering datanode corruption, and I see the following error when I try to load a simple file into memory using CTAS:

org.apache.hadoop.hive.ql.metadata.HiveException (org.apache.hadoop.hive.ql.metadata.HiveException: java.io.IOException: All datanodes 10.196.4.2:9200 are bad. Aborting...)
    org.apache.hadoop.hive.ql.exec.FileSinkOperator.processOp(FileSinkOperator.java:602)
    shark.execution.FileSinkOperator$$anonfun$processPartition$1.apply(FileSinkOperator.scala:84)
    shark.execution.FileSinkOperator$$anonfun$processPartition$1.apply(FileSinkOperator.scala:81)
    scala.collection.Iterator$class.foreach(Iterator.scala:772)
    ...

I'd expect Spark to recover from this by blacklisting the bad node and using the rest of the cluster to finish the task. However, it doesn't recover. I came across this email chain from around a year ago, where it was acknowledged that Spark wouldn't recover from a partially failed node: https://groups.google.com/forum/#!topic/spark-users/maFIDL-0OoI

Is this still the situation? Has any progress been made here? Even after I kill the application, start a brand new one, and retry the job, it ends up in the same state. I'd like to know how to handle this.

Thanks!
Lakshmi.
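As a possible workaround (not an answer to the recovery question), would decommissioning the bad datanode at the HDFS level help, so writes stop being routed to it? Something like the following — assuming dfs.hosts.exclude is already configured in hdfs-site.xml, and the paths here are guesses based on a typical spark-ec2 ephemeral-hdfs layout, so they may differ on other clusters:

```shell
# Add the bad datanode's hostname/IP to the HDFS excludes file
# (the file that dfs.hosts.exclude points at in hdfs-site.xml).
# Path is hypothetical; adjust to the cluster's actual layout.
echo "10.196.4.2" >> /root/ephemeral-hdfs/conf/excludes

# Tell the NameNode to re-read the include/exclude lists and
# begin decommissioning the listed node.
/root/ephemeral-hdfs/bin/hadoop dfsadmin -refreshNodes
```

After the node finishes decommissioning, HDFS should stop placing new block replicas on it, so the FileSinkOperator writes would go to healthy datanodes only. This doesn't fix Spark's lack of blacklisting, though, so I'm still interested in an answer on that front.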
-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-recovery-from-bad-nodes-tp4505.html Sent from the Apache Spark User List mailing list archive at Nabble.com.