Hi Eric,

To get exactly-once semantics you need a replayable source and an
idempotent sink.
The cases you've mentioned cover the two main groups of issues.
Practically any kind of programming problem can end up producing
duplicated data (even in the code that feeds Kafka). The reason doesn't
really matter: if the sink sees an already-processed key, it should
simply skip it, regardless of why the duplicate appeared.
Cody has a really good write-up about delivery semantics:
https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md#delivery-semantics
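As a rough sketch of the "skip already-processed keys" idea (all names here are hypothetical, not part of the DataSourceV2 API): the sink records which batch ids it has committed, and a replayed commit for the same id is a no-op.

```scala
// Minimal sketch of an idempotent sink. Commits are keyed by batchId,
// so a replayed micro-batch (task retry or driver recovery) is detected
// and skipped. In-memory only, for illustration; a real sink would keep
// the committed set transactionally in the output system itself.
object IdempotentSinkSketch {
  private val committed = scala.collection.mutable.Set[Long]()
  private val output = scala.collection.mutable.ArrayBuffer[String]()

  // Returns true if the batch was written, false if it was a duplicate.
  def commit(batchId: Long, rows: Seq[String]): Boolean = {
    if (committed.contains(batchId)) {
      // Already processed: skip. It doesn't matter why it was replayed.
      false
    } else {
      output ++= rows
      committed += batchId
      true
    }
  }

  def currentOutput: Seq[String] = output.toSeq
}
```

The key design point is that the dedup check and the write must be atomic with respect to the output (e.g. one transaction, or an upsert on a unique key); otherwise a failure between the two steps reintroduces duplicates.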

If you eventually move to Continuous Processing, this is worth keeping in mind:
"There are currently no automatic retries of failed tasks. Any failure will
lead to the query being stopped and it needs to be manually restarted from
the checkpoint."

BR,
G


On Wed, Dec 5, 2018 at 8:36 PM Eric Wohlstadter <wohls...@gmail.com> wrote:

> Hi all,
>  We are working on implementing a streaming sink on 2.3.1 with the
> DataSourceV2 APIs.
>
> Can anyone help check if my understanding is correct, with respect to the
> failure modes which need to be covered?
>
> We are assuming that a Reliable Receiver (such as Kafka) is used as the
> stream source. And we only want to support micro-batch execution at this
> time (not yet Continuous Processing).
>
> I believe the possible failures that need to be covered are:
>
> 1. Task failure: If a task fails, it may have written data to the sink
> output before failure. Subsequent attempts for a failed task must be
> idempotent, so that no data is duplicated in the output.
> 2. Driver failure: If the driver fails, upon recovery, it might replay a
> micro-batch that was already seen by the sink (if a failure occurs after
> the sink has committed output but before the driver has updated the
> checkpoint). In this case, the sink must be idempotent when a micro-batch
> is replayed so that no data is duplicated in the output.
>
> Are there any other cases where data might be duplicated in the stream?
> i.e. if neither of these 2 failures occur, is there still a case where
> data can be duplicated?
>
> Thanks for any help to check if my understanding is correct.
>
>
>
>
>