I'm using Amazon EMR + S3 as my Spark cluster infrastructure. I'm running a
job with periodic checkpointing (it has a long dependency tree, so truncating
the lineage by checkpointing is mandatory; each checkpoint has 320
partitions). The job stops halfway, resulting in an exception; a simplified
sketch of the checkpointing pattern and the relevant logs follow.
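
The pattern is roughly this (a self-contained sketch with placeholder names,
intervals, and data, not my actual job):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

val sc = new SparkContext(new SparkConf().setAppName("periodic-checkpoint-sketch"))
sc.setCheckpointDir("s3n://spooky-checkpoint")

// Placeholder state and interval; the real job builds a much longer lineage.
var state: RDD[Long] = sc.parallelize(1L to 1000000L, 320)   // 320 partitions
val checkpointInterval = 10

for (iteration <- 1 to 100) {
  state = state.map(_ + 1)            // each step grows the dependency tree
  if (iteration % checkpointInterval == 0) {
    state.checkpoint()                // mark the RDD for checkpointing
    state.count()                     // force materialization so the lineage is truncated
  }
}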

(On driver)
org.apache.spark.SparkException: Invalid checkpoint directory:
s3n://spooky-checkpoint/9e9dbddf-e5d8-478d-9b69-b5b966126d3c/rdd-198
    at
org.apache.spark.rdd.CheckpointRDD.getPartitions(CheckpointRDD.scala:54)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
...

(On Executor)
15/04/17 22:00:14 WARN StorageService: Encountered 4 Internal Server
error(s), will retry in 800ms
15/04/17 22:00:15 WARN RestStorageService: Retrying request following error
response: PUT '/9e9dbddf-e5d8-478d-9b69-b5b966126d3c/rdd-198/part-00025' --
ResponseCode: 500, ResponseStatus: Internal Server Error
...

After manually checking the checkpointed files, I found that
/9e9dbddf-e5d8-478d-9b69-b5b966126d3c/rdd-198/part-00025 is indeed missing
on S3. So my question is: if it is missing (perhaps due to an AWS malfunction),
why didn't Spark detect it immediately during the checkpointing process (so
the write could be retried), instead of later throwing an irrecoverable error
stating that the dependency tree is already lost? And how can I avoid this
situation in the future?
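
The only workaround I can think of is to verify each checkpoint myself right
after materializing it, along these lines (checkpointLooksComplete is my own
hypothetical helper, assuming the rdd-NNN/part-NNNNN directory layout stays as
it is today):

import org.apache.hadoop.fs.Path
import org.apache.spark.rdd.RDD

// Hypothetical helper: after rdd.checkpoint() + an action, confirm that every
// part file actually landed in the checkpoint directory on S3.
def checkpointLooksComplete(rdd: RDD[_], expectedParts: Int): Boolean =
  rdd.getCheckpointFile.exists { dir =>
    val path = new Path(dir)
    val fs = path.getFileSystem(rdd.sparkContext.hadoopConfiguration)
    val partFiles = fs.listStatus(path).map(_.getPath.getName).count(_.startsWith("part-"))
    partFiles == expectedParts
  }

// Usage with the `state` RDD from the sketch above:
state.checkpoint()
state.count()
if (!checkpointLooksComplete(state, 320)) {
  // retry from the previous checkpoint, back off, or abort here,
  // instead of discovering the hole much later via CheckpointRDD.getPartitions
}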

I don't think this problem has been discussed before, because HDFS is assumed
to be immediately consistent (unlike S3, which is only eventually consistent)
and extremely resilient. However, every component has some chance of failure.
Can you share your best practices for checkpointing?
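
The only mitigation I've come up with so far is to point the checkpoint
directory at the EMR cluster's local HDFS instead of s3n://, and archive to S3
out of band if the checkpoints need to outlive the cluster, e.g.:

// Checkpoint to the cluster's HDFS (immediately consistent) instead of S3;
// copy to S3 separately (e.g. with hadoop distcp) if durability beyond the
// cluster's lifetime is needed.
sc.setCheckpointDir("hdfs:///spark-checkpoints")

Is that the recommended approach, or is there a better pattern?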


