I'm using Amazon EMR + S3 as my Spark cluster infrastructure. I'm running a job with periodic checkpointing (it has a long dependency tree, so truncating it by checkpointing is mandatory; each checkpoint has 320 partitions). The job stops halfway through with an exception:
(On the driver)
org.apache.spark.SparkException: Invalid checkpoint directory: s3n://spooky-checkpoint/9e9dbddf-e5d8-478d-9b69-b5b966126d3c/rdd-198
    at org.apache.spark.rdd.CheckpointRDD.getPartitions(CheckpointRDD.scala:54)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
    ...

(On an executor)
15/04/17 22:00:14 WARN StorageService: Encountered 4 Internal Server error(s), will retry in 800ms
15/04/17 22:00:15 WARN RestStorageService: Retrying request following error response: PUT '/9e9dbddf-e5d8-478d-9b69-b5b966126d3c/rdd-198/part-00025' -- ResponseCode: 500, ResponseStatus: Internal Server Error
...

After manually checking the checkpointed files, I found that /9e9dbddf-e5d8-478d-9b69-b5b966126d3c/rdd-198/part-00025 is indeed missing on S3.

So my questions are: if the file is missing (perhaps due to an AWS malfunction), why didn't Spark detect this immediately during the checkpointing process (so the write could be retried), instead of later throwing an unrecoverable error stating that the dependency tree is already lost? And how can I avoid this situation from happening again?

I suspect this problem hasn't been addressed before because HDFS is assumed to be immediately consistent (unlike S3, which is only eventually consistent) and extremely resilient. However, every component has some chance of failure. Can you share your best practices for checkpointing?
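For reference, here is a minimal sketch of roughly how the checkpointing is set up. The bucket name, loop, and transformation are illustrative placeholders, not my actual job code; the real job checkpoints a long iterative lineage with 320 partitions per checkpoint, as described above.

import org.apache.spark.{SparkConf, SparkContext}

object CheckpointSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("checkpoint-sketch"))

    // Checkpoint files are written to S3; this is where the part file later turns up missing.
    sc.setCheckpointDir("s3n://my-checkpoint-bucket/my-job")   // placeholder bucket

    var rdd = sc.parallelize(1 to 1000000, 320)                // 320 partitions per checkpoint
    for (i <- 1 to 100) {
      rdd = rdd.map(_ + 1)                                     // stand-in for the real transformation
      if (i % 10 == 0) {
        rdd.checkpoint()                                       // mark for checkpointing to truncate the lineage
        rdd.count()                                            // an action forces the checkpoint to be written here
      }
    }
    sc.stop()
  }
}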