Hi Adrian,

All transformations on RDDs have exactly-once semantics, but counting something multiple times is still possible because output operations (like foreachRDD) do not; they have at-least-once semantics, since a failed task's partitions are recomputed and their side effects (such as writes to Cassandra) can be replayed. Have a look at this link: https://spark.incubator.apache.org/docs/0.9.0/streaming-programming-guide.html#failure-of-a-worker-node

So your application or your Cassandra data model needs logic to make sure over-counting doesn't happen while storing data into Cassandra. We achieve this by storing the RDD id in our data model, so that we don't over-count even if failed partitions are recomputed.
Let me know if it helps.

Thanks,
Pankaj

On Wed, Feb 12, 2014 at 2:50 AM, Adrian Mocanu <amoc...@verticalscope.com> wrote:

> Thanks Andrew and Matei.
>
> My input is from a Kafka spout, which can fail at any time, so it's not a
> reliable input source. The problem is duplicate tuples coming through, and
> I'd like those to be discarded.
>
> I'll take a look at the links provided by you and Andrew.
>
> -A
>
> From: Matei Zaharia [mailto:matei.zaha...@gmail.com]
> Sent: February-11-14 4:02 PM
> To: user@spark.incubator.apache.org
> Subject: Re: how is fault tolerance achieved in spark
>
> Spark Streaming also provides fault tolerance by default as long as your
> input is from a reliable data source. Take a look at this paper:
> http://www.cs.berkeley.edu/~matei/papers/2013/sosp_spark_streaming.pdf
> or the fault tolerance section of the usage guide:
> http://spark.incubator.apache.org/docs/latest/streaming-programming-guide.html
>
> Matei
>
> On Feb 11, 2014, at 12:27 PM, Andrew Ash <and...@andrewash.com> wrote:
>
> Here's the original paper on how the framework achieves fault tolerance.
> You shouldn't have to do anything special as a user of the framework.
>
> https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
>
> On Tue, Feb 11, 2014 at 12:21 PM, Adrian Mocanu <amoc...@verticalscope.com> wrote:
>
> Anyone willing to link some resource on how to achieve fault tolerance?
>
> From: Adrian Mocanu [mailto:amoc...@verticalscope.com]
> Sent: February-10-14 1:44 PM
> To: user@spark.incubator.apache.org
> Subject: how is fault tolerance achieved in spark
>
> Hi all,
>
> I am curious how fault tolerance is achieved in Spark. Well, more like:
> what do I need to do to make sure my aggregations, which come from
> streams, are fault tolerant and saved into Cassandra? I will have nodes
> die and would not like to count "tuples" multiple times.
>
> For example, in Trident you have to implement different interfaces. Is
> there a similar process for Spark?
>
> Thanks,
> -Adrian