Yeah there are really two concepts here as I think you noted:
1. Retry safe: we know that the write did not occur
2. Retry fixable: if you send that again it could work

(probably there are better names for these).

Some things we know did not do a write and may be fixed by retrying (no
leader). Some things we know didn't do a write and are not worth retrying
(message too large). Some things we don't know whether a write occurred but
are worth retrying (network error), and probably some things we don't know
and aren't worth retrying (though I can't think of one).

(I feel like Donald Rumsfeld with the "known unknowns" thing).
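
To make the two axes concrete, here is a rough sketch (hypothetical names
only, not the actual client exception hierarchy) of how you could model them
as marker interfaces:

    // Hypothetical illustration only -- not the real producer exception hierarchy.
    // "Retry safe": we know the write did not occur, so a retry cannot duplicate.
    interface RetrySafe {}

    // "Retry fixable": resending the same request could plausibly succeed.
    interface RetryFixable {}

    // No leader yet: nothing was written, and a retry may succeed once a leader exists.
    class NoLeaderException extends Exception implements RetrySafe, RetryFixable {}

    // Message too large: nothing was written, but retrying the same request is pointless.
    class MessageTooLargeException extends Exception implements RetrySafe {}

    // Network error: we don't know whether the write happened, but a retry may succeed.
    class NetworkException extends Exception implements RetryFixable {}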

In the current world, if you set retries > 0 you are saying "I accept
duplicates but want to ensure my stuff gets written"; if you set retries =
0 you are saying "I can't abide duplicates and am willing to tolerate
loss". So Retriable for us means "retry may succeed".

Originally I thought about trying to model both concepts. However, the
two arguments against it are:
1. Even if you do this, the guarantee remains "at least once delivery"
because: (1) in the network error case you just don't know whether the write
happened, and (2) a failed consumer can reprocess messages anyway.
2. The proper fix for this is to add idempotence support on the server,
which we should do.

Doing idempotence support on the server will actually fix all duplicate
problems, including the network error case (because of course the server
knows whether your write went through even though the client doesn't). Once
we have that, the client can always just retry anything marked Retriable
(i.e. retry may work) without fear of duplicates.
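
Just to illustrate the idea (a hypothetical sketch, not a description of how
we'd actually implement it): if each producer tagged its requests with a
stable id and a per-partition sequence number, the broker could recognize and
drop a retried write that already went through:

    import java.util.HashMap;
    import java.util.Map;

    // Illustrative only: one way a broker could dedupe retried produce requests.
    public class IdempotentAppendSketch {
        // Highest sequence number seen, keyed by producerId + partition.
        private final Map<String, Long> lastSequence = new HashMap<>();

        public boolean append(String producerId, String partition,
                              long sequence, byte[] record) {
            String key = producerId + "-" + partition;
            Long last = lastSequence.get(key);
            if (last != null && sequence <= last) {
                // Duplicate of a write that already succeeded (e.g. the ack was
                // lost); acknowledge again without appending.
                return false;
            }
            doAppend(partition, record);
            lastSequence.put(key, sequence);
            return true;
        }

        private void doAppend(String partition, byte[] record) {
            // write to the log...
        }
    }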

This gives exactly-once delivery to the log, and a co-operating consumer
can use the offset to dedupe and get exactly-once end-to-end.
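
A rough sketch of that consumer-side dedupe, assuming the consumer tracks the
highest offset it has processed per partition (names here are just
illustrative):

    import java.util.HashMap;
    import java.util.Map;

    public class OffsetDeduper {
        // Highest offset already processed, per topic-partition.
        private final Map<String, Long> lastProcessed = new HashMap<>();

        // Returns true if this (partition, offset) has not been processed yet,
        // and records it as processed. Offsets within a partition are totally
        // ordered, so anything at or below the high-water mark is a duplicate.
        public boolean shouldProcess(String topicPartition, long offset) {
            Long last = lastProcessed.get(topicPartition);
            if (last != null && offset <= last) {
                return false; // already saw this write, skip it
            }
            lastProcessed.put(topicPartition, offset);
            return true;
        }
    }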

So that was why I had just left one type of Retriable, used it to mean
"retry may work", and didn't try to flag anything for duplicates.

-Jay




On Tue, Feb 10, 2015 at 4:32 PM, Gwen Shapira <[email protected]> wrote:

> Hi Kafka Devs,
>
> Need your thoughts on retriable exceptions:
>
> If a user configures Kafka with min.isr > 1 and there are not enough
> replicas to safely store the data, there are two possibilities:
>
> 1. The lack of replicas was discovered before the message was written. We
> throw NotEnoughReplicas.
> 2. The lack of replicas was discovered after the message was written to the
> leader. In this case, we throw NotEnoughReplicasAfterAppend.
>
> Currently, both errors are Retriable, which means that the new producer
> will retry multiple times.
> In case of the second exception, this will cause duplicates.
>
> KAFKA-1697 suggests:
> "we probably want to make NotEnoughReplicasAfterAppend a non-retriable
> exception and let the client decide what to do."
>
> I agreed that the client (the one using the Producer) should weigh the
> problems duplicates will cause vs. the probability of losing the message
> and do something sensible, so I made the exception non-retriable.
>
> In the RB (https://reviews.apache.org/r/29647/) Joel raised a good point:
> (Joel, feel free to correct me if I misrepresented your point)
>
> "I think our interpretation of retriable is as follows (but we can discuss
> on the list if that needs to change): if the produce request hits an error,
> and there is absolutely no point in retrying then that is a non-retriable
> error. MessageSizeTooLarge is an example - since unless the producer
> changes the request to make the messages smaller there is no point in
> retrying.
>
> ...
> Duplicates can arise even for other errors (e.g., request timed out). So
> that side-effect is not compelling enough to warrant a change to make this
> non-retriable. "
>
> *(TL;DR)  Should exceptions where retries can cause duplicates still be*
> *retriable?*
>
> Gwen
>
