I'd like to better understand what happens when a streaming consumer job
(using either direct streaming or receiver-based streaming) is killed,
terminated, or crashes.

Assuming it was in the middle of processing a batch of RDDs, what happens when
the job is restarted?  How much state does Spark's checkpointing maintain to
allow for little or no data loss?
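
For reference, here is roughly how I understand context creation/recovery from
a checkpoint directory is supposed to work (a minimal sketch; the checkpoint
path, app name, and batch interval are just placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object CheckpointedApp {
      // Placeholder checkpoint location; any fault-tolerant store (HDFS, S3) should do.
      val checkpointDir = "hdfs:///tmp/streaming-checkpoint"

      // Builds a fresh context; only called when no checkpoint exists yet.
      def createContext(): StreamingContext = {
        val conf = new SparkConf().setAppName("checkpoint-demo")
        val ssc = new StreamingContext(conf, Seconds(10))
        ssc.checkpoint(checkpointDir)
        // ... define the DStream transformations and output operations here ...
        ssc
      }

      def main(args: Array[String]): Unit = {
        // On restart, getOrCreate rebuilds the context from the checkpoint
        // instead of calling createContext again.
        val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
        ssc.start()
        ssc.awaitTermination()
      }
    }

My assumption is that on restart getOrCreate restores the DStream graph and the
metadata of unfinished batches from the checkpoint, but I'd like to confirm
what happens to the batch that was actually in flight when the job died.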

For the direct streaming case, would we need to update offsets in ZooKeeper
ourselves to achieve more fault tolerance?
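
To make that concrete, the pattern I know of for reading the offsets each
direct-stream batch covers looks roughly like this (a sketch against the Kafka
0.8 direct API in spark-streaming-kafka; the broker, topic, and println "sink"
are placeholders). What I'm unsure about is whether we'd also need to write
these offsets back to ZooKeeper ourselves, or whether checkpointing alone is
enough:

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.StreamingContext
    import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils, OffsetRange}

    object DirectStreamOffsets {
      // ssc: an already-created StreamingContext
      def wireUp(ssc: StreamingContext): Unit = {
        val kafkaParams = Map("metadata.broker.list" -> "broker1:9092") // placeholder broker
        val topics = Set("my-topic")                                    // placeholder topic

        val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
          ssc, kafkaParams, topics)

        stream.foreachRDD { rdd =>
          // Each RDD from the direct stream carries the exact Kafka offset ranges it covers.
          val offsetRanges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

          // ... process the batch ...

          // After processing succeeds, the offsets could be persisted to ZooKeeper or
          // another external store so progress is visible to monitoring tools.
          offsetRanges.foreach { or =>
            println(s"${or.topic} partition ${or.partition}: ${or.fromOffset} -> ${or.untilOffset}")
          }
        }
      }
    }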

I'm looking at
https://databricks.com/blog/2015/01/15/improved-driver-fault-tolerance-and-zero-data-loss-in-spark-streaming.html
which discusses write-ahead logs (WALs). Do they also apply to direct streaming?
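
My understanding is that, for the receiver-based approach, the WAL is enabled
purely through configuration plus a checkpoint directory, roughly as in this
sketch (app name and paths are placeholders); what I can't tell is whether any
of this is relevant to the direct approach:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object WalDemo {
      def main(args: Array[String]): Unit = {
        // Write-ahead logging for receiver-based input is turned on via configuration;
        // it also needs a checkpoint directory on a fault-tolerant filesystem for the logs.
        val conf = new SparkConf()
          .setAppName("wal-demo") // placeholder app name
          .set("spark.streaming.receiver.writeAheadLog.enable", "true")

        val ssc = new StreamingContext(conf, Seconds(10))
        ssc.checkpoint("hdfs:///tmp/streaming-checkpoint") // placeholder path

        // ... create a receiver-based input stream and its output operations here ...

        ssc.start()
        ssc.awaitTermination()
      }
    }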

With write-ahead logs in place, e.g. when streaming from Kafka, where would a
restarted consumer resume processing?  For example, suppose it was processing
message 25 out of 100 messages in the Kafka topic when it crashed or was
terminated.

Thanks.


