Re: avoid duplicate due to executor failure in spark stream

2015-08-12 Thread Cody Koeninger
Accumulators aren't going to work to communicate state changes between
executors.  You need external storage.
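
For example, a rough sketch of what that could look like, with OffsetStore
standing in for whatever external store you pick (it is not an existing Spark
or Kafka API), keyed by topic and partition so a retried task can resume where
the failed attempt left off:

    // Rough sketch only: OffsetStore stands in for whatever external store you
    // choose (HBase, ZooKeeper, an RDBMS); it is not a Spark or Kafka API.
    trait OffsetStore extends Serializable {
      def lastPosted(topic: String, partition: Int): Long             // highest offset already posted
      def commit(topic: String, partition: Int, offset: Long): Unit   // record progress atomically
    }

    def postToExternalServer(payload: String): Unit = ???             // stand-in for the real call

    // Runs inside foreachPartition; events are (offset, payload) pairs.
    def postPartition(store: OffsetStore,
                      topic: String,
                      partition: Int,
                      events: Iterator[(Long, String)]): Unit = {
      val alreadyDone = store.lastPosted(topic, partition)
      events
        .filter { case (offset, _) => offset > alreadyDone }          // skip what a failed attempt finished
        .foreach { case (offset, payload) =>
          postToExternalServer(payload)
          store.commit(topic, partition, offset)                      // progress lives outside the executor
        }
    }

The point is that the progress record is keyed by (topic, partition) in a store
every attempt can see, not by task attempt, which is exactly what an accumulator
cannot give you.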


Re: avoid duplicate due to executor failure in spark stream

2015-08-11 Thread Shushant Arora
What if processing is neither idempotent nor done in a transaction, say when I
am posting events to an external server after processing?

Is it possible to get the accumulator of a failed task in the retry task? Is
there any way to detect whether a task is a retried task or the original task?

I was trying to achieve something like incrementing a counter after each event
is processed; if a task fails, the retry task would simply skip the already
processed events by reading the failed task's counter. Is it possible to access
an accumulator on a per-task basis without writing to HDFS or HBase?






Re: avoid duplicate due to executor failure in spark stream

2015-08-10 Thread Cody Koeninger
http://spark.apache.org/docs/latest/streaming-kafka-integration.html#approach-2-direct-approach-no-receivers

http://spark.apache.org/docs/latest/streaming-programming-guide.html#semantics-of-output-operations

https://www.youtube.com/watch?v=fXnNEq1v3VA
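
Roughly, the pattern those first two links describe looks like the sketch below
with the 1.3 direct API (the broker list and topic name are placeholders). The
offset ranges give every partition a fixed start and end, which is what lets
you make the output idempotent or transactional:

    import kafka.serializer.StringDecoder
    import org.apache.spark.{SparkConf, TaskContext}
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

    val ssc = new StreamingContext(new SparkConf().setAppName("direct-kafka"), Seconds(10))
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")   // placeholder broker list
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("events"))                                // placeholder topic

    stream.foreachRDD { rdd =>
      // Offset ranges are fixed on the driver before tasks run, so a retried
      // task sees exactly the same fromOffset/untilOffset as the failed one.
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd.foreachPartition { events =>
        val range = offsetRanges(TaskContext.get.partitionId)
        // Key an idempotent or transactional write to the external system by
        // (range.topic, range.partition, range.fromOffset) so retries are no-ops.
        events.foreach { case (_, value) =>
          // post value to the external server here
          ()
        }
      }
    }

    ssc.start()
    ssc.awaitTermination()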



avoid duplicate due to executor failure in spark stream

2015-08-10 Thread Shushant Arora
Hi

How can I avoid duplicate processing of Kafka messages in Spark Streaming 1.3
caused by executor failure?

1. Can I somehow access the accumulators of a failed task in the retry task to
skip the events that were already processed by the failed task on this
partition?

2. Or will I have to persist each processed message, check before processing
each message whether it was already processed by the failed task, and delete
this persisted information at the end of each batch?
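
Roughly, what I have in mind for option 2 is something like the sketch below,
where ProcessedStore just stands for whatever external store gets used (it is
not an existing Spark or Kafka API):

    // Rough sketch only: ProcessedStore stands for whatever external store is
    // used (HBase, Redis, a database); it is not a Spark or Kafka API.
    trait ProcessedStore extends Serializable {
      def contains(msgId: String): Boolean     // was this message already posted?
      def markProcessed(msgId: String): Unit   // remember it once it has been posted
      def clearBatch(batchTime: Long): Unit    // drop the bookkeeping for a finished batch
    }

    def post(payload: String): Unit = ???      // stand-in for the external server call

    // Runs inside foreachPartition for each batch; msgs are (msgId, payload) pairs.
    def processPartition(store: ProcessedStore, msgs: Iterator[(String, String)]): Unit =
      msgs.foreach { case (msgId, payload) =>
        if (!store.contains(msgId)) {          // skip messages a failed attempt already posted
          post(payload)
          store.markProcessed(msgId)
        }
      }

    // Runs on the driver once the whole batch has succeeded.
    def onBatchEnd(store: ProcessedStore, batchTime: Long): Unit =
      store.clearBatch(batchTime)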