Hi, I am unable to see how Shark (and eventually Spark) can recover from a bad
node in the cluster. One of my EC2 clusters with 50 nodes ended up with a
single node suffering datanode corruption, and I see the following error when
I try to load a simple file into memory using CTAS:
org.apache.hadoop.hive.ql.metadata.HiveException
(org.apache.hadoop.hive.ql.metadata.HiveException: java.io.IOException: All
datanodes 10.196.4.2:9200 are bad. Aborting...)
  org.apache.hadoop.hive.ql.exec.FileSinkOperator.processOp(FileSinkOperator.java:602)
  shark.execution.FileSinkOperator$$anonfun$processPartition$1.apply(FileSinkOperator.scala:84)
  shark.execution.FileSinkOperator$$anonfun$processPartition$1.apply(FileSinkOperator.scala:81)
  scala.collection.Iterator$class.foreach(Iterator.scala:772)
  ...
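For context, the statement is just a simple CTAS into a Shark in-memory
table, roughly of the following form (table and source names here are
simplified placeholders, not the actual query; Shark treats tables whose
names end in "_cached" as cached tables):

    CREATE TABLE mydata_cached AS
    SELECT * FROM mydata;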
I'd expect Spark to recover from this by blacklisting that node and using the
rest of the cluster to finish the task. However, it doesn't recover from it.
I saw this email thread from around a year ago:
https://groups.google.com/forum/#!topic/spark-users/maFIDL-0OoI - in which it
was acknowledged that Spark wouldn't recover from a partially failed node. Is
this still the situation? Has any progress been made here? Even after I kill
the application, start a brand new one, and retry the job, it ends up in the
same situation. I'd like to know how to handle this.

Thanks!
Lakshmi.


