We've been using the new DirectKafkaInputDStream to implement an exactly-once processing solution that tracks the provided offset ranges within the same transaction that persists our results. When an exception is thrown within the processing loop and the configured number of retries is exhausted, the stream skips to the end of the failed offset range and continues on with the next RDD.
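For context, our write path looks roughly like the following minimal sketch, assuming a JDBC sink. The broker, topic, connection string, and the results/offsets tables are all illustrative, but the shape of it (data and offsets committed in one per-partition transaction) is what we're doing:

import java.sql.DriverManager

import kafka.serializer.StringDecoder
import org.apache.spark.{SparkConf, TaskContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

val ssc = new StreamingContext(new SparkConf().setAppName("exactly-once"), Seconds(30))

// Direct stream: each RDD partition maps 1:1 to a Kafka topic-partition.
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, Map("metadata.broker.list" -> "broker:9092"), Set("events"))

stream.foreachRDD { rdd =>
  // Grab the offset ranges on the driver, before any shuffle breaks the
  // partition mapping.
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  rdd.foreachPartition { partition =>
    val osr = offsetRanges(TaskContext.get.partitionId)
    val conn = DriverManager.getConnection("jdbc:postgresql://dbhost/app")
    conn.setAutoCommit(false)
    try {
      val insert = conn.prepareStatement("INSERT INTO results (k, v) VALUES (?, ?)")
      partition.foreach { case (k, v) =>
        insert.setString(1, k)
        insert.setString(2, v)
        insert.executeUpdate()
      }
      // Track progress in the *same* transaction as the data.
      val track = conn.prepareStatement(
        "UPDATE offsets SET until_offset = ? WHERE topic = ? AND kafka_partition = ?")
      track.setLong(1, osr.untilOffset)
      track.setString(2, osr.topic)
      track.setInt(3, osr.partition)
      track.executeUpdate()
      conn.commit() // data and offsets succeed or fail together
    } catch {
      case e: Exception =>
        conn.rollback() // on failure the offsets stay at the last good position
        throw e
    } finally {
      conn.close()
    }
  }
}

ssc.start()
ssc.awaitTermination()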
That skip-ahead behavior makes sense, but we're wondering how others handle recovering from failures. In our case the exception was caused by a temporary outage of a service we depend on. Since the transaction rolled back at the point of failure, our offset tracking table retained the correct offsets, so we simply needed to restart the Spark process, whereupon it happily picked up at the correct point and continued.

Short of a restart, does anyone have good ideas for how we might recover? FWIW, we've looked at setting the spark.task.maxFailures property to a large value and searched for a property that would increase the wait between attempts (rough sketch of what we mean below). That might mitigate a short-lived availability problem but wouldn't completely eliminate the need to restart.

Any thoughts or ideas welcome.
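For concreteness, here's the kind of in-task backoff we have in mind, assuming the transactional write from the earlier sketch; withRetries and writeTransactionally are hypothetical helpers of ours, not Spark APIs:

import scala.util.control.NonFatal

// Hypothetical helper (ours, not a Spark API): retry a block with
// exponential backoff so a short outage of the downstream service is
// absorbed inside the task instead of burning task attempts.
def withRetries[T](attempts: Int, waitMs: Long)(body: => T): T =
  try body catch {
    case NonFatal(e) if attempts > 1 =>
      Thread.sleep(waitMs)
      withRetries(attempts - 1, waitMs * 2)(body) // double the wait each time
  }

rdd.foreachPartition { partition =>
  // The partition iterator can only be drained once, so materialize it
  // before retrying; only safe if per-partition batches are modestly sized.
  val records = partition.toList
  withRetries(attempts = 5, waitMs = 1000L) {
    writeTransactionally(records) // hypothetical: the transactional write above
  }
}

That would ride out a brief blip, but a longer outage still ends the same way, hence the question.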