Hi Yana,

You are correct. What needs to be added is that, besides the RDDs being
checkpointed, metadata representing the execution of computations is also
checkpointed in Spark Streaming.

Upon driver recovery, the last batches (the ones already executed and the
ones that should have been executed while the driver was down) are
recomputed. This is fine if we just want to recover state and no other
component or data store depends on Spark's output.
When we do have that requirement (which is my case), all the nodes will
re-execute all of that IO, causing inconsistency across the overall system,
since the outside systems were not expecting events from the past.
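In the meantime, one workaround is to make the output operation idempotent, keyed by the batch time, so that a recomputed batch becomes a no-op on the external side. Below is a minimal, Spark-free sketch of that idea; `ExternalStore` and `write_if_absent` are hypothetical stand-ins for whatever sink you use, not a real API.

```python
class ExternalStore:
    """Toy stand-in for an external data store that records batch output."""

    def __init__(self):
        self.committed = {}  # batch_time -> records already written

    def write_if_absent(self, batch_time, records):
        """Perform the write only if this batch time was never committed."""
        if batch_time in self.committed:
            return False  # recomputed batch after recovery: skip the IO
        self.committed[batch_time] = records
        return True


store = ExternalStore()


def save_batch(batch_time, records):
    # Would be called from e.g. foreachRDD; the batch time uniquely
    # identifies the batch, so re-execution of the same batch is harmless.
    return store.write_if_absent(batch_time, records)


# First execution performs the IO; a post-recovery re-execution does not.
first = save_batch(1000, ["a", "b"])
replay = save_batch(1000, ["a", "b"])
print(first, replay)
```

This pushes the deduplication into the external store rather than into Spark itself, which only works when the sink can cheaply answer "have I already seen this batch time?".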

We need some way of making Spark aware of which computations are
recomputations and which are not, so that Spark developers are empowered to
introduce specific logic where they need it.

Let me know if any of this doesn't make sense.

Thanks,
Rod 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-checkpoint-recovery-causes-IO-re-execution-tp12568p13205.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
