Thanks, Andrew and Matei.

My input is from a Kafka spout, which can fail at any time, so it's not a
reliable input source. The problem is that duplicate tuples come through, and
I'd like those to be discarded.
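
To illustrate, the per-batch half of what I want looks roughly like the sketch
below (Record, its id field, and kafkaStream are stand-ins for my actual
types, not real code):

    import org.apache.spark.streaming.StreamingContext._ // pair-DStream ops
    import org.apache.spark.streaming.dstream.DStream

    case class Record(id: String, value: Long)

    // Keep one tuple per id within each batch; duplicates that span batches
    // would still need keyed state or an idempotent sink downstream.
    def dedupe(kafkaStream: DStream[Record]): DStream[Record] =
      kafkaStream
        .map(r => (r.id, r))      // key by the (assumed) unique id
        .reduceByKey((a, _) => a) // keep the first record seen per id
        .map(_._2)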

I'll take a look at the links both you and Andrew provided.
-A
From: Matei Zaharia [mailto:matei.zaha...@gmail.com]
Sent: February-11-14 4:02 PM
To: user@spark.incubator.apache.org
Subject: Re: how is fault tolerance achieved in spark

Spark Streaming also provides fault tolerance by default as long as your input
comes from a reliable data source. Take a look at this paper: 
http://www.cs.berkeley.edu/~matei/papers/2013/sosp_spark_streaming.pdf or the 
fault tolerance section of the usage guide: 
http://spark.incubator.apache.org/docs/latest/streaming-programming-guide.html.
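
The basic setup is just a checkpoint directory on reliable storage; here's a
minimal sketch (the master, host, port, and path are all placeholders):

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Minimal streaming app with checkpointing enabled, so state and
    // metadata can be rebuilt after a failure.
    object CheckpointedApp {
      def main(args: Array[String]) {
        val ssc = new StreamingContext("local[2]", "CheckpointedApp", Seconds(1))
        ssc.checkpoint("hdfs://namenode:8020/spark/checkpoints")
        val lines = ssc.socketTextStream("localhost", 9999) // placeholder input
        lines.count().print() // an output operation so the job has work to do
        ssc.start()
        ssc.awaitTermination()
      }
    }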

Matei

On Feb 11, 2014, at 12:27 PM, Andrew Ash <and...@andrewash.com> wrote:


Here's the original paper on how the framework achieves fault tolerance. As a
user of the framework, you shouldn't have to do anything special.

https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf

On Tue, Feb 11, 2014 at 12:21 PM, Adrian Mocanu <amoc...@verticalscope.com>
wrote:
Is anyone willing to link to some resources on how to achieve fault tolerance?

From: Adrian Mocanu [mailto:amoc...@verticalscope.com]
Sent: February-10-14 1:44 PM
To: user@spark.incubator.apache.org
Subject: how is fault tolerance achieved in spark

Hi all,
I am curious how fault tolerance is achieved in Spark. More specifically, what
do I need to do to make sure my aggregations, which come from streams, are
fault tolerant and saved into Cassandra? I will have nodes die and would not
like to count "tuples" multiple times.
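
Concretely, I'm after something like the sketch below (a guess at the shape,
not working code: stream, the state update, and the CQL comment are my
assumptions, and the Cassandra client is elided):

    import org.apache.spark.streaming.StreamingContext._ // pair-DStream ops
    import org.apache.spark.streaming.dstream.DStream

    // Keep running counts in checkpointed state, then upsert absolute values
    // into Cassandra: a replayed batch overwrites the same row instead of
    // double-counting. Requires ssc.checkpoint(...) to be set.
    def saveCounts(stream: DStream[(String, Long)]) {
      val counts = stream.updateStateByKey[Long] {
        (batch: Seq[Long], running: Option[Long]) =>
          Some(running.getOrElse(0L) + batch.sum)
      }
      counts.foreachRDD { rdd =>
        rdd.foreachPartition { partition =>
          // open one Cassandra session per partition here (client elided)
          partition.foreach { case (key, count) =>
            // e.g. UPDATE counts SET value = <count> WHERE id = <key>
            println(s"upsert $key -> $count") // placeholder for the real write
          }
        }
      }
    }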

For example, in Trident you have to implement specific interfaces. Is there a
similar process for Spark?

Thanks
-Adrian


