I'd like to better understand what happens when a streaming consumer job (using direct streaming, but also receiver-based streaming) is killed, terminated, or crashes.
Assuming it was processing a batch of RDD data when it died, what happens when the job is restarted? How much state does Spark's checkpointing maintain to allow for little or no data loss? In the direct-streaming case, would we need to update offsets in ZooKeeper ourselves to achieve better fault tolerance?

I'm looking at https://databricks.com/blog/2015/01/15/improved-driver-fault-tolerance-and-zero-data-loss-in-spark-streaming.html, which discusses write-ahead logs (WALs). Do WALs work with direct streaming? With WALs in place, e.g. when streaming from Kafka, where would a restarted consumer resume processing? For example, suppose it was processing message 25 of 100 in the Kafka topic when it crashed or was terminated.

Thanks.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/What-happens-when-a-streaming-consumer-job-is-killed-then-restarted-tp23348.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
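To make that last question concrete, here is a toy plain-Python sketch (not Spark or Kafka API code; the names `consume`, `checkpoint`, and `Crash` are made up for illustration) of the behavior I'm asking about: if the consumer durably records the last fully processed offset, a restart resumes at the next unprocessed message, reprocessing at most the one in-flight message (at-least-once semantics):

```python
# Toy simulation of a Kafka-style consumer that checkpoints offsets.
# All names are illustrative; this is not Spark or Kafka API code.

class Crash(Exception):
    pass

def consume(topic, checkpoint, crash_at=None):
    """Process messages after the last checkpointed offset.

    `checkpoint` is a one-element list standing in for a durable
    store of the last fully processed offset (-1 = nothing yet).
    """
    processed = []
    for offset, msg in enumerate(topic):
        if offset <= checkpoint[0]:
            continue                        # already processed before the crash
        if crash_at is not None and offset == crash_at:
            raise Crash(f"died while processing offset {offset}")
        processed.append(msg)               # process the message...
        checkpoint[0] = offset              # ...then durably record its offset
    return processed

topic = [f"msg-{i}" for i in range(100)]    # 100 messages in the topic
checkpoint = [-1]                           # durable state surviving the crash

try:
    consume(topic, checkpoint, crash_at=24) # dies on the 25th message
except Crash:
    pass

resumed = consume(topic, checkpoint)        # restart: picks up at offset 24
# The in-flight message (offset 24) is reprocessed; nothing before it
# is repeated and nothing after it is lost.
```

Is this roughly the guarantee the direct approach gives (with offsets in checkpoints), or does achieving it require managing offsets in ZooKeeper or another store ourselves?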