Re: avoid duplicate due to executor failure in spark stream
Cody Koeninger wrote:
Accumulators aren't going to work to communicate state changes between executors. You need external storage.
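A minimal sketch of what "external storage" can look like here, pasteable into spark-shell, assuming the direct Kafka stream (so that offset ranges are available); OffsetStore and postToExternalServer are hypothetical stand-ins, not real APIs:

import org.apache.spark.TaskContext
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.kafka.HasOffsetRanges

// Hypothetical store interface; back it with HBase, ZooKeeper, a
// relational DB, etc. It has to be reachable from every executor.
trait OffsetStore extends Serializable {
  def lastProcessed(topic: String, partition: Int): Long        // -1 if none yet
  def commit(topic: String, partition: Int, offset: Long): Unit
}

// Hypothetical sink, standing in for the HTTP POST to the external server.
def postToExternalServer(event: String): Unit = { /* e.g. HTTP POST */ }

def postWithRecovery(rdd: RDD[(String, String)], store: OffsetStore): Unit = {
  // Only RDDs produced by the direct Kafka stream carry offset ranges.
  val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  rdd.foreachPartition { iter =>
    val range = ranges(TaskContext.get.partitionId)
    val done  = store.lastProcessed(range.topic, range.partition)
    iter.zipWithIndex.foreach { case ((_, value), i) =>
      val offset = range.fromOffset + i
      if (offset > done) {         // skip what a failed attempt already posted
        postToExternalServer(value)
        store.commit(range.topic, range.partition, offset)
      }
    }
  }
}

Committing after every event costs one store round trip per message; committing every N events is cheaper, at the price of up to N duplicate posts on a retry.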
Re: avoid duplicate due to executor failure in spark stream
Shushant Arora wrote (Tue, Aug 11, 2015 at 11:28 AM):
What if processing is neither idempotent nor transactional, say I am posting events to an external server after processing? Is it possible to get the accumulator of a failed task in the retry task? Is there any way to detect whether a task is a retried task or the original one? I was trying to achieve something like incrementing a counter after each event is processed, so that if a task fails, the retry task can skip the already-processed events by reading the failed task's counter. Is it possible to access an accumulator on a per-task basis without writing to HDFS or HBase?
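On the detection part: a task can tell whether it is a retry, since TaskContext exposes an attempt number (attemptNumber, 0 for the original attempt; if your Spark version predates it, only the globally unique attemptId is available, which cannot distinguish retries). But the failed attempt's accumulator stays out of reach either way: accumulator updates from failed tasks are discarded, and accumulator values are only readable on the driver. A sketch of the detection, runnable from spark-shell:

import org.apache.spark.TaskContext
import org.apache.spark.rdd.RDD

def process(rdd: RDD[String]): Unit =
  rdd.foreachPartition { iter =>
    // attemptNumber is 0 for the original attempt and > 0 for retries.
    if (TaskContext.get.attemptNumber > 0) {
      // Retry: the failed attempt's accumulator is not recoverable here,
      // so progress has to come from external storage instead.
    }
    iter.foreach { event => /* process each event */ }
  }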
Re: avoid duplicate due to executor failure in spark stream
Cody Koeninger wrote (Tue, Aug 11, 2015 at 3:15 AM):
http://spark.apache.org/docs/latest/streaming-kafka-integration.html#approach-2-direct-approach-no-receivers
http://spark.apache.org/docs/latest/streaming-programming-guide.html#semantics-of-output-operations
https://www.youtube.com/watch?v=fXnNEq1v3VA
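The first link is the crux: the direct (receiver-less) stream hands each batch exact Kafka offset ranges per partition, which is what makes idempotent or transactional output possible. A minimal wiring sketch against the Spark 1.3 API; the broker address and topic name are placeholders:

import kafka.serializer.StringDecoder
import org.apache.spark.{SparkConf, TaskContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

object DirectStreamSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("kafka-direct-sketch")
    val ssc  = new StreamingContext(conf, Seconds(10))

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092") // placeholder
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("events"))                              // placeholder topic

    stream.foreachRDD { rdd =>
      // Exact per-partition offset ranges for this batch.
      val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd.foreachPartition { iter =>
        val range = ranges(TaskContext.get.partitionId)
        // Make output idempotent by keying it on
        // (range.topic, range.partition, offset), or commit results and
        // offsets together in one transaction.
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}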
avoid duplicate due to executor failure in spark stream
Shushant Arora wrote (Mon, Aug 10, 2015 at 4:32 PM):
Hi,
How can I avoid duplicate processing of Kafka messages in Spark Streaming 1.3 caused by executor failure?
1. Can I somehow access the accumulators of a failed task in the retry task, to skip the events already processed by the failed task on that partition?
2. Or will I have to persist each processed message, check before processing each message whether it was already processed by the failed task, and delete this persisted information at the end of each batch?
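For option 2, a sketch of the bookkeeping, assuming a hypothetical key-value store with an atomic put-if-absent (e.g. HBase checkAndPut or Redis SETNX behind the interface); with the direct stream, topic-partition-offset makes a natural dedup key:

import org.apache.spark.rdd.RDD

// Hypothetical dedup store; markIfNew must be an atomic put-if-absent.
trait DedupStore extends Serializable {
  def markIfNew(key: String): Boolean  // true if the key had not been seen
  def deleteAll(prefix: String): Unit  // drop a batch's marks once it's done
}

// rdd elements are (dedupKey, event) pairs; under the direct stream the
// caller can build the key from (topic, partition, offset).
def processOnce(rdd: RDD[(String, String)], store: DedupStore,
                batchPrefix: String): Unit = {
  rdd.foreachPartition { iter =>
    iter.foreach { case (key, event) =>
      // A failed attempt's marks survive, so its events are skipped here.
      if (store.markIfNew(batchPrefix + key)) {
        /* post event to the external server */
      }
    }
  }
  // foreachPartition is an action, so the batch's output is complete by
  // here; later batches never replay these messages, so the marks can go.
  store.deleteAll(batchPrefix)
}

This costs a store round trip per message, which is why tracking a single offset per partition (as in the sketch further up the thread) is usually the cheaper route.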