Hi Yana,

You are correct. What needs to be added is that, besides the RDDs being checkpointed, metadata representing the execution of computations is also checkpointed in Spark Streaming.
Upon driver recovery, the last batches (the ones already executed and the ones that should have been executed while the driver was down) are recomputed. This is fine if we just want to recover state and no other component or data store depends on Spark's output. If we do have that requirement (which is my case), all the nodes will re-execute all that IO, causing overall system inconsistency, since the outside systems were not expecting events from the past. We need some way of making Spark aware of which computations are recomputations and which are not, so that Spark developers can introduce specific logic for those cases if they need to.

Let me know if any of this doesn't make sense.

Thanks,
Rod

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-checkpoint-recovery-causes-IO-re-execution-tp12568p13205.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
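For illustration, here is a minimal sketch of one workaround: keep a record, in a durable store, of which batch times have already produced external output, and consult it before performing the IO. The names here (`BatchOutputTracker`, `should_write`, `commit`, `process_batch`) are hypothetical and not part of any Spark API; a real version would persist the committed set in the output system itself, not in process memory.

```python
class BatchOutputTracker:
    """Records batch times whose output already reached the external system.

    In a real deployment this set would live in a durable store (often
    the same database the output goes to), so it survives driver recovery.
    """

    def __init__(self):
        self._committed = set()

    def should_write(self, batch_time):
        # True if this batch has not yet emitted its side effects.
        return batch_time not in self._committed

    def commit(self, batch_time):
        # Mark the batch's output as delivered.
        self._committed.add(batch_time)


def process_batch(tracker, batch_time, write):
    """Run the external write only for batches not seen before."""
    if tracker.should_write(batch_time):
        write(batch_time)          # the side-effecting IO
        tracker.commit(batch_time)
        return True                # performed the IO
    return False                   # recomputation: IO suppressed
```

Inside a `foreachRDD` output operation, one could key such a tracker by the batch's time so that batches replayed after checkpoint recovery skip the writes they already performed.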