Thanks Andrew and Matei. My input is from a Kafka spout, which can fail at any time, so it's not a reliable input source. The problem is duplicate tuples coming through, and I'd like those to be discarded.
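One way to discard replayed duplicates is to key each tuple by a unique ID and skip IDs already seen. This is only a minimal sketch in plain Python, assuming each tuple carries a hypothetical unique `tuple_id`; a real pipeline would use a bounded store (e.g. a TTL cache) rather than an unbounded set:

```python
# Sketch: discarding duplicate tuples by tracking already-seen IDs.
# "tuple_id" is a hypothetical unique key carried by each tuple.
def dedupe(tuples, seen=None):
    seen = set() if seen is None else seen
    out = []
    for tuple_id, value in tuples:
        if tuple_id in seen:
            continue  # duplicate replayed by the spout; drop it
        seen.add(tuple_id)
        out.append((tuple_id, value))
    return out

batch = [("a", 1), ("b", 2), ("a", 1)]  # "a" replayed after a failure
print(dedupe(batch))  # → [('a', 1), ('b', 2)]
```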
I'll take a look at those links provided by both you and Andrew.
-A

From: Matei Zaharia [mailto:matei.zaha...@gmail.com]
Sent: February-11-14 4:02 PM
To: user@spark.incubator.apache.org
Subject: Re: how is fault tolerance achieved in spark

Spark Streaming also provides fault tolerance by default, as long as your input comes from a reliable data source. Take a look at this paper: http://www.cs.berkeley.edu/~matei/papers/2013/sosp_spark_streaming.pdf or the fault tolerance section of the usage guide: http://spark.incubator.apache.org/docs/latest/streaming-programming-guide.html.

Matei

On Feb 11, 2014, at 12:27 PM, Andrew Ash <and...@andrewash.com> wrote:

Here's the original paper on how the framework achieves fault tolerance. You shouldn't have to do anything special as a user of the framework.
https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf

On Tue, Feb 11, 2014 at 12:21 PM, Adrian Mocanu <amoc...@verticalscope.com> wrote:

Anyone willing to link some resources on how to achieve fault tolerance?

From: Adrian Mocanu [mailto:amoc...@verticalscope.com]
Sent: February-10-14 1:44 PM
To: user@spark.incubator.apache.org
Subject: how is fault tolerance achieved in spark

Hi all,
I am curious how fault tolerance is achieved in Spark. More specifically: what do I need to do to make sure my aggregations, which come from streams, are fault tolerant and saved into Cassandra? I will have nodes die and would not like to count "tuples" multiple times. For example, in Trident you have to implement different interfaces. Is there a similar process for Spark?
Thanks,
-Adrian
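The original concern in the thread, counting tuples multiple times when batches are replayed after a failure, is often addressed by making the write idempotent: store each batch's partial counts under a (key, batch_id) primary key so a replayed batch overwrites its earlier write instead of adding to it. A minimal plain-Python sketch of that idea, where a dict stands in for a Cassandra table and all names are illustrative:

```python
# Sketch: replay-safe aggregation via idempotent, keyed upserts.
# "store" stands in for a Cassandra table whose primary key is
# (key, batch_id); names and structure are illustrative only.
def write_counts(store, batch_id, counts):
    for key, n in counts.items():
        store[(key, batch_id)] = n  # upsert: a replay overwrites, not adds

def total(store, key):
    # Sum the per-batch partial counts for one key.
    return sum(n for (k, _b), n in store.items() if k == key)

store = {}
write_counts(store, batch_id=7, counts={"clicks": 3})
write_counts(store, batch_id=7, counts={"clicks": 3})  # replayed batch
write_counts(store, batch_id=8, counts={"clicks": 2})
print(total(store, "clicks"))  # → 5
```

The design choice here is that correctness comes from the write path, not from detecting duplicates: replays become harmless because re-writing a batch is a no-op.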