Implementation for exactly-once streaming sink

Eric Wohlstadter Wed, 05 Dec 2018 11:37:06 -0800

Hi all,
 We are working on implementing a streaming sink on 2.3.1 with the
DataSourceV2 APIs.


Can anyone help check if my understanding is correct, with respect to the
failure modes which need to be covered?

We are assuming that a Reliable Receiver (such as Kafka) is used as the
stream source. And we only want to support micro-batch execution at this
time (not yet Continuous Processing).

I believe the possible failures that need to be covered are:

1. Task failure: If a task fails, it may have written data to the sink
output before failure. Subsequent attempts for a failed task must be
idempotent, so that no data is duplicated in the output.
2. Driver failure: If the driver fails, upon recovery, it might replay a
micro-batch that was already seen by the sink (if a failure occurs after
the sink has committed output but before the driver has updated the
checkpoint). In this case, the sink must be idempotent when a micro-batch
is replayed so that no data is duplicated in the output.

Are there any other cases where data might be duplicated in the stream?
i.e. if neither of these 2 failures occur, is there still a case where data
can be duplicated?

Thanks for any help to check if my understanding is correct.

Implementation for exactly-once streaming sink

Reply via email to