Re: Can spark support exactly once based kafka ? Due to these following question?

2016-12-05 Thread Cody Koeninger
Have you read / watched the materials linked from
https://github.com/koeninger/kafka-exactly-once

On Mon, Dec 5, 2016 at 4:17 AM, Jörn Franke  wrote:
> You need to do the book keeping of what has been processed yourself. This
> may mean roughly the following (of course the devil is in the details):
> Write down in zookeeper which part of the processing job has been done and
> for which dataset all the data has been created (do not keep the data itself
> in zookeeper).
> Once you start a processing job, check in zookeeper if it has been
> processed, if not remove all staging data, if yes terminate.
>
> As I said the details depend on your job and require some careful thinking,
> but exactly once can be achieved with Spark (and potentially zookeeper or
> similar, such as Redis).
> Of course at the same time think if you need delivery in order etc.
>
> On 5 Dec 2016, at 08:59, Michal Šenkýř  wrote:
>
> Hello John,
>
> 1. If a task complete the operation, it will notify driver. The driver may
> not receive the message due to the network, and think the task is still
> running. Then the child stage won't be scheduled ?
>
> Spark's fault tolerance policy is, if there is a problem in processing a
> task or an executor is lost, run the task (and any dependent tasks) again.
> Spark attempts to minimize the number of tasks it has to recompute, so
> usually only a small part of the data is recomputed.
>
> So in your case, the driver simply schedules the task on another executor
> and continues to the next stage when it receives the data.
>
> 2. how do spark guarantee the downstream-task can receive the shuffle-data
> completely. As fact, I can't find the checksum for blocks in spark. For
> example, the upstream-task may shuffle 100Mb data, but the downstream-task
> may receive 99Mb data due to network. Can spark verify the data is received
> completely based size ?
>
> Spark uses compression with checksuming for shuffle data so it should know
> when the data is corrupt and initiate a recomputation.
>
> As for your question in the subject:
> All of this means that Spark supports at-least-once processing. There is no
> way that I know of to ensure exactly-once. You can try to minimize
> more-than-once situations by updating your offsets as soon as possible but
> that does not eliminate the problem entirely.
>
> Hope this helps,
>
> Michal Senkyr

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Can spark support exactly once based kafka ? Due to these following question?

2016-12-05 Thread Jörn Franke
You need to do the book keeping of what has been processed yourself. This may 
mean roughly the following (of course the devil is in the details):
Write down in zookeeper which part of the processing job has been done and for 
which dataset all the data has been created (do not keep the data itself in 
zookeeper).
Once you start a processing job, check in zookeeper if it has been processed, 
if not remove all staging data, if yes terminate. 

As I said the details depend on your job and require some careful thinking, but 
exactly once can be achieved with Spark (and potentially zookeeper or similar, 
such as Redis).
Of course at the same time think if you need delivery in order etc.

> On 5 Dec 2016, at 08:59, Michal Šenkýř  wrote:
> 
> Hello John,
> 
>> 1. If a task complete the operation, it will notify driver.   
>> The driver may not receive the message due to the network, and think the 
>> task is still running. Then the child stage won't be scheduled ?
> Spark's fault tolerance policy is, if there is a problem in processing a task 
> or an executor is lost, run the task (and any dependent tasks) again. Spark 
> attempts to minimize the number of tasks it has to recompute, so usually only 
> a small part of the data is recomputed.
> 
> So in your case, the driver simply schedules the task on another executor and 
> continues to the next stage when it receives the data.
>> 2. how do spark guarantee the downstream-task can receive the shuffle-data 
>> completely. As fact, I can't find the checksum for blocks in spark. For 
>> example, the upstream-task may shuffle 100Mb data, but the downstream-task 
>> may receive 99Mb data due to network. Can spark verify the data is received 
>> completely based size ?
> Spark uses compression with checksuming for shuffle data so it should know 
> when the data is corrupt and initiate a recomputation.
> 
> As for your question in the subject:
> All of this means that Spark supports at-least-once processing. There is no 
> way that I know of to ensure exactly-once. You can try to minimize 
> more-than-once situations by updating your offsets as soon as possible but 
> that does not eliminate the problem entirely.
> 
> Hope this helps,
> Michal Senkyr


Re: Can spark support exactly once based kafka ? Due to these following question?

2016-12-05 Thread Piotr Smoliński
The boundary is a bit flexible. In terms of observed DStream effective
state the direct stream semantics is exactly-once.
In terms of external system observations (like message emission), Spark
Streaming semantics is at-least-once.

Regards,
Piotr

On Mon, Dec 5, 2016 at 8:59 AM, Michal Šenkýř  wrote:

> Hello John,
>
> 1. If a task complete the operation, it will notify driver. The driver may
> not receive the message due to the network, and think the task is still
> running. Then the child stage won't be scheduled ?
>
> Spark's fault tolerance policy is, if there is a problem in processing a
> task or an executor is lost, run the task (and any dependent tasks) again.
> Spark attempts to minimize the number of tasks it has to recompute, so
> usually only a small part of the data is recomputed.
>
> So in your case, the driver simply schedules the task on another executor
> and continues to the next stage when it receives the data.
>
> 2. how do spark guarantee the downstream-task can receive the shuffle-data
> completely. As fact, I can't find the checksum for blocks in spark. For
> example, the upstream-task may shuffle 100Mb data, but the downstream-task
> may receive 99Mb data due to network. Can spark verify the data is received
> completely based size ?
>
> Spark uses compression with checksuming for shuffle data so it should know
> when the data is corrupt and initiate a recomputation.
>
> As for your question in the subject:
> All of this means that Spark supports at-least-once processing. There is
> no way that I know of to ensure exactly-once. You can try to minimize
> more-than-once situations by updating your offsets as soon as possible but
> that does not eliminate the problem entirely.
>
> Hope this helps,
>
> Michal Senkyr
>


Re: Can spark support exactly once based kafka ? Due to these following question?

2016-12-04 Thread Michal Šenkýř

Hello John,

1. If a task complete the operation, it will notify driver. The driver 
may not receive the message due to the network, and think the task is 
still running. Then the child stage won't be scheduled ?


Spark's fault tolerance policy is, if there is a problem in processing a 
task or an executor is lost, run the task (and any dependent tasks) 
again. Spark attempts to minimize the number of tasks it has to 
recompute, so usually only a small part of the data is recomputed.


So in your case, the driver simply schedules the task on another 
executor and continues to the next stage when it receives the data.


2. how do spark guarantee the downstream-task can receive the 
shuffle-data completely. As fact, I can't find the checksum for blocks 
in spark. For example, the upstream-task may shuffle 100Mb data, but 
the downstream-task may receive 99Mb data due to network. Can spark 
verify the data is received completely based size ?


Spark uses compression with checksuming for shuffle data so it should 
know when the data is corrupt and initiate a recomputation.


As for your question in the subject:
All of this means that Spark supports at-least-once processing. There is 
no way that I know of to ensure exactly-once. You can try to minimize 
more-than-once situations by updating your offsets as soon as possible 
but that does not eliminate the problem entirely.


Hope this helps,

Michal Senkyr



Can spark support exactly once based kafka ? Due to these following question?

2016-12-04 Thread John Fang
1. If a task complete the operation, it will notify driver. The driver may not 
receive the message due to the network, and think the task is still running. 
Then the child stage won't be scheduled ?
2. how do spark guarantee the downstream-task can receive the shuffle-data 
completely. As fact, I can't find the checksum for blocks in spark. For 
example, the upstream-task may shuffle 100Mb data, but the downstream-task may 
receive 99Mb data due to network. Can spark verify the data is received 
completely based size ?