We've been using the new DirectKafkaInputDStream to implement an exactly-once
processing solution that tracks the provided offset ranges within the same
transaction that persists our data results. When an exception is thrown
within the processing loop and the configured number of retries is
exhausted, the stream skips to the end of the failed range of offsets and
continues on with the next RDD.
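
In case it helps frame the question, here's roughly the shape of the
foreachRDD loop we use (a trimmed sketch only; transform,
saveResultsAndOffsets, ssc, kafkaParams and topics are stand-ins for our
real code):

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    stream.foreachRDD { rdd =>
      // RDDs produced by the direct stream expose the Kafka offset ranges they cover.
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

      val results = rdd.map { case (_, value) => transform(value) }.collect()

      // Persist results and offset ranges in the same DB transaction, so a
      // rollback leaves the tracking table pointing at the last good offsets.
      saveResultsAndOffsets(results, offsetRanges) // illustrative helper
    }

(Collecting to the driver is just for the sketch; the same transaction could
equally be done per partition.)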

That makes sense, but we're wondering how others would handle recovering
from such failures. In our case the cause of the exception was a temporary
outage of a needed service. Since the transaction rolled back at the point
of failure, our offset tracking table retained the correct offsets, so we
simply needed to restart the Spark process, whereupon it happily picked up
at the correct point and continued. Short of a restart, do people have any
good ideas for how we might recover?
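
For anyone trying the same approach: seeding createDirectStream with the
stored offsets is what lets a restart resume at the right place. A sketch,
where loadOffsetsFromDb and the tuple message handler are stand-ins:

    import kafka.common.TopicAndPartition
    import kafka.message.MessageAndMetadata
    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    // Illustrative: read the (topic, partition, untilOffset) rows written by
    // the last committed transaction and start the stream from there.
    val fromOffsets: Map[TopicAndPartition, Long] = loadOffsetsFromDb()

    val messageHandler =
      (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message)

    val stream = KafkaUtils.createDirectStream[
      String, String, StringDecoder, StringDecoder, (String, String)](
      ssc, kafkaParams, fromOffsets, messageHandler)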

FWIW, we've looked at setting the spark.task.maxFailures param to a large
value and looked for a property that would increase the wait between
attempts. That might mitigate the issue when the availability problem is
short-lived, but it wouldn't completely eliminate the need to restart.
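
To make that idea concrete, the sketch below raises the property and wraps
the dependent-service call in a hand-rolled bounded retry with exponential
backoff, since we haven't found a built-in property for the delay between
task attempts (the config value, app name and helper are illustrative):

    import org.apache.spark.SparkConf

    // Allow more task attempts before the stage (and the batch) is abandoned.
    val conf = new SparkConf()
      .setAppName("kafka-direct-etl") // illustrative name
      .set("spark.task.maxFailures", "8")

    // Inside the task: retry the flaky call ourselves with backoff, since task
    // retries alone give no control over the delay between attempts.
    def withRetry[T](attempts: Int, delayMs: Long)(call: => T): T =
      try {
        call
      } catch {
        case _: Exception if attempts > 1 =>
          Thread.sleep(delayMs)
          withRetry(attempts - 1, delayMs * 2)(call)
      }

    // e.g. inside foreachRDD / mapPartitions (flakyServiceClient is illustrative):
    // val response = withRetry(attempts = 5, delayMs = 500)(flakyServiceClient.lookup(key))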

Any thoughts or ideas welcome.



