Re: [Vote] KIP-588: Allow producers to recover gracefully from transaction timeouts

Boyang Chen Wed, 22 Apr 2020 10:12:48 -0700

Hey Jason,

thanks for the suggestions! Addressed in the KIP.


On Wed, Apr 22, 2020 at 9:21 AM Jason Gustafson <ja...@confluent.io> wrote:

> +1 Just a couple small comments:
>
> 1. My comment about `initTransactions()` usage in the javadoc above appears
> not to have been addressed.
> 2. For the handling of INVALID_PRODUCER_EPOCH in the produce response,
> would we only try to abort if the broker supports the newer protocol
> version? I guess it would be simpler in the implementation if the producer
> did it in any case even if it might not be useful for older versions.
>
> -Jason
>
> On Fri, Apr 17, 2020 at 3:55 PM Guozhang Wang <wangg...@gmail.com> wrote:
>
> > Sounds good to me. Thanks Boyang.
> >
> > On Fri, Apr 17, 2020 at 3:32 PM Boyang Chen <reluctanthero...@gmail.com>
> > wrote:
> >
> > > Thanks Guozhang,
> > >
> > > I think most of the complexity comes from our intention to benefit
> older
> > > clients. After a second thought, I think the add-on complexity
> > counteracts
> > > the gain here as only 2.5 client is getting a slice of the resilience
> > > improvement, not for many older versions.
> > >
> > > So I decide to drop the UNKNOWN_PRODUCER_ID path, by just claiming that
> > > this change would only benefit 2.6 Producer clients. So the only path
> > that
> > > needs version detection is the new transaction coordinator handling
> > > transactional requests. If the Producer is 2.6+, we pick
> > > PRODUCER_FENCED(new error code) or TRANSACTION_TIMED_OUT as the
> response;
> > > otherwise  we return INVALID_PRODUCE_EPOCH to be consistent with older
> > > clients.
> > >
> > > Does this sound like a better plan? I already updated the KIP with
> > > simplifications.
> > >
> > >
> > > On Fri, Apr 17, 2020 at 12:02 PM Guozhang Wang <wangg...@gmail.com>
> > wrote:
> > >
> > > > Hi Boyang,
> > > >
> > > > Your reply to 3) seems conflicting with your other answers which is a
> > bit
> > > > confusing to me. Following your other answers, it seems you suggest
> > > > returning UNKNOWN_PRODUCER_ID so that 2.5 clients can trigger retry
> > logic
> > > > as well?
> > > >
> > > > To complete my reasoning here as a complete picture:
> > > >
> > > > a) post KIP-360 (2.5+) the partition leader broker does not return
> > > > UNKNOWN_PRODUCER_ID any more.
> > > > b) upon seeing an old epoch, partition leader cannot tell if it is
> due
> > to
> > > > fencing or timeout; so it could only return INVALID_PRODUCER_EPOCH.
> > > >
> > > > So the basic idea is to let the clients ask the transaction
> coordinator
> > > for
> > > > the source of truth:
> > > >
> > > > 1) 2.5+ client would handle UNKNOWN_PRODUCER_ID (which could only be
> > > > returned from old brokers) by trying to re-initialize with the
> > > transaction
> > > > coordinator; the coordinator would then tell it whether it is
> > > > PRODUCER_FENCED or TXN_TIMEOUT. And for old brokers, it would always
> > > return
> > > > PRODUCER_FENCED anyways.
> > > > 2) 2.6+ client would also handle INVALID_PRODUCER_EPOCH with the
> retry
> > > > initializing logic; and similarly the transaction coordinator would
> > > > return PRODUCER_FENCED or TXN_TIMEOUT if it is new or always
> > > > return PRODUCER_FENCED if it is old.
> > > >
> > > > The question open is, whether
> > > >
> > > > * 3) the new broker should return UNKNOWN_PRODUCER_ID now when it is
> > > > *supposed* to return INVALID_PRODUCER_EPOCH and it found the request
> is
> > > > from 2.5 client (note as mentioned in a) right now we do not
> > > > return UNKNOWN_PRODUCER_ID from brokers anymore).
> > > >
> > > > If it does, then 2.5 client could still do the retry logic to the
> > > > transaction coordinator, i.e. benefit from KIP-360; but the cost is
> > > complex
> > > > logic on the broker side as well as producer API version bump up.
> > > > If it does not, then when INVALID_PRODUCER_EPOCH is returned to the
> old
> > > > client it would treat it as fatal and not ask the txn coordinator;
> but
> > it
> > > > simplifies the broker logic and also do not require producer API
> > version
> > > > bump.
> > > >
> > > > Personally I'd suggest we do the latter, knowing that it would not
> > > benefit
> > > > 2.5 client when the partition leader gets an old epoch and does not
> > know
> > > > whether it is Fenced or Timed Out.
> > > >
> > > >
> > > > Guozhang
> > > >
> > > > On Thu, Apr 16, 2020 at 7:59 PM Boyang Chen <
> > reluctanthero...@gmail.com>
> > > > wrote:
> > > >
> > > > > Thanks Jason and Guozhang for the thoughts.
> > > > >
> > > > > On Thu, Apr 16, 2020 at 6:09 PM Guozhang Wang <wangg...@gmail.com>
> > > > wrote:
> > > > >
> > > > > > For 2/3 above, originally I was not thinking that we will have a
> > > > > different
> > > > > > exception for INVALID_PRODUCER_EPOCH and hence was thinking that
> in
> > > > order
> > > > > > to leverage KIP-360 for it, we'd have to let the broker to
> > > > > > return UNKNOWN_PRODUCER_ID. I.e. we'd change the logic of
> partition
> > > > > leader
> > > > > > as well to return UNKNOWN_PRODUCER_ID to let the producer try
> > > > > > re-initializing the PID on the coordinator, and if it is indeed
> due
> > > to
> > > > > > fencing, then coordinator can let the client know of the fatal
> > error
> > > > and
> > > > > > then fail. In that case, then we do need to bump up the producer
> > API
> > > > > > version so that the partition leader knows if it is from older or
> > > newer
> > > > > > clients: if it is older client who do not have KIP-360, we'd
> return
> > > > > > INVALID_PRODUCER_EPOCH still; for newer client, we can
> > > > > > return UNKNOWN_PRODUCER_ID to let it seek what's the truth on
> > > > > coordinator.
> > > > > >
> > > > > I know this is a bit of reversed order. Feel free to check out my
> > reply
> > > > to
> > > > > Jason first and go back here :)
> > > > > I think the partition leader will have no change of returned code
> > after
> > > > we
> > > > > discussed that only new client should be able to retry.
> > > > >
> > > > >
> > > > > > Now since we are additionally mapping INVALID_PRODUCER_EPOCH to a
> > > > > different
> > > > > > error code now and letting clients to handle that similar
> > > > > > to UNKNOWN_PRODUCER_ID, this can be saved indeed. However, it
> also
> > > > means
> > > > > > that 2.5.0 clients would not get benefited from this KIP, since
> > they
> > > > > still
> > > > > > treat INVALID_PRODUCER_EPOCH as fatal. If people this that's okay
> > > then
> > > > we
> > > > > > can simplify the partition leader behavior as well as not bumping
> > up
> > > > > > producer APIs. I think I'm a bit inclined towards simplicity over
> > > > > > benefiting older clients.
> > > > > >
> > > > >
> > > > > 2.5 client should get benefits if we return UNKNOWN_PRODUCER_ID as
> I
> > > have
> > > > > described below.
> > > > >
> > > > >
> > > > > > Guozhang
> > > > > >
> > > > > > On Thu, Apr 16, 2020 at 5:37 PM Jason Gustafson <
> > ja...@confluent.io>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi Boyang,
> > > > > > >
> > > > > > > A few minor questions below:
> > > > > > >
> > > > > > > 1. You mention UNKNOWN_PRODUCER_ID in 2.a under Resilience
> > > > > Improvements.
> > > > > > I
> > > > > > > assume that should be INVALID_PRODUCER_EPOCH? I am not sure
> this
> > > case
> > > > > > makes
> > > > > > > sense for 2.5 clients which would view this error as fatal
> > > regardless
> > > > > of
> > > > > > > whatever the broker does. Not sure there's anything we can do
> > about
> > > > > that.
> > > > > > > Similarly, if a newer client is talking to a 2.5 broker, it
> > > wouldn't
> > > > be
> > > > > > > able to bump the epoch after a timeout because the broker would
> > not
> > > > > know
> > > > > > to
> > > > > > > keep the last epoch. Unfortunately, I think the only
> improvements
> > > > that
> > > > > > are
> > > > > > > possible here are newer clients talking to newer brokers, but I
> > > might
> > > > > > have
> > > > > > > missed something.
> > > > > > >
> > > > > >
> > > > > The reasoning for returning UNKNOWN_PRODUCER_ID is to trigger the
> > retry
> > > > > logic on 2.5 client hopefully.
> > > > > As stated in the KIP, the benefited parties are *2.5 client and 2.6
> > > > > client *only. If
> > > > > transaction coordinator returns INVALID_PRODUCER_EPOCH for a
> > > transaction
> > > > > timeout case, 2.5 client still has to treat as fatal as it only
> > > > recognizes
> > > > > PRODUCER_FENCED. Returning UNKNOWN_PRODUCER_ID for 2.5 client is a
> > > > > substitute for TRANSACTION_TIMED_OUT error code, think of 9656
> > > > > <https://issues.apache.org/jira/browse/KAFKA-9656> as a similar
> > > > > motivation.
> > > > >
> > > > >
> > > > > > > 2. The proposal says that newer clients talking to older
> brokers
> > > > should
> > > > > > > treat PRODUCER_FENCED as non-fatal. Just to clarify, older
> > brokers
> > > > will
> > > > > > > only return INVALID_PRODUCER_EPOCH because the PRODUCER_FENCED
> > > error
> > > > > did
> > > > > > > not exist. I think the logic should be something like this.
> > > > > > >
> > > > > > > On the Transaction Coordinator:
> > > > > > > - For old request versions, we continue returning
> > > > > INVALID_PRODUCER_EPOCH.
> > > > > > > Clients will translate this to PRODUCER_FENCED as they do today
> > and
> > > > > treat
> > > > > > > this as a fatal error.
> > > > > > > - For new request versions, we return either PRODUCER_FENCED or
> > > > > > > TRANSACTION_TIMED_OUT depending on the case. Clients will treat
> > the
> > > > > first
> > > > > > > as fatal and will bump the epoch on the latter.
> > > > > > >
> > > > > > > On the partition leader:
> > > > > > > - Nothing changes. Old clients will treat
> INVALID_PRODUCER_EPOCH
> > as
> > > > > > fatal.
> > > > > > > New clients will attempt to bump the epoch if the right version
> > of
> > > > > > > InitProducerId is supported and fail otherwise.
> > > > > > >
> > > > > > > Does that seem right?
> > > > > > >
> > > > > >
> > > > > As I stated in the previous answer, the txn coordinator could also
> > > choose
> > > > > to return UNKNOWN_PRODUCER_ID for 2.5 clients or older
> > > > > when there is a txn timeout. The rest looks good to me.
> > > > >
> > > > >
> > > > > > > 3. Why do we need the bump to the Produce API? As far as I can
> > > tell,
> > > > we
> > > > > > are
> > > > > > > not changing any errors that are used. The
> INVALID_PRODUCER_EPOCH
> > > > error
> > > > > > is
> > > > > > > already possible today and the two new errors cannot be
> returned
> > > from
> > > > > > > Produce.
> > > > > > >
> > > > > >
> > > > > Guozhang had some thoughts about it. I think one vague point we
> > haven't
> > > > > clarified is what's the expected error code for server and client.
> > The
> > > > > relation should look like:
> > > > > old client INVALID_PRODUCER_EPOCH -> 47 (ProducerFencedException)
> > > > > new client INVALID_PRODUCER_EPOCH -> 47 (InvalidEpochException),
> > > > > PRODUCER_FENCED -> 91 (ProducerFencedException)
> > > > > old broker INVALID_PRODUCER_EPOCH -> 47 (ProducerFencedException)
> > > > > new broker INVALID_PRODUCER_EPOCH -> 47 (InvalidEpochException),
> > > > > PRODUCER_FENCED -> 91 (ProducerFencedException)
> > > > >
> > > > > As we could see here, for a new partition leader, we shall return
> > code
> > > 47
> > > > > no matter what as stated in the KIP that PRODUCER_FENCED is not a
> > valid
> > > > > case reported from partition leader. So new client should always
> > > trigger
> > > > > the retry logic for partition leader error. I guess the Produce API
> > > bump
> > > > is
> > > > > not necessary then.
> > > > >
> > > > >
> > > > > > > 4. The javadoc examples suggest the user should call
> > > > > `initTransactions()`
> > > > > > > in the case that a timeout is reached. I don't think that's
> > right.
> > > > For
> > > > > > > KIP-360, we bump the epoch internally. I think we'd want to do
> > the
> > > > same
> > > > > > > here so that we continue the current pattern where
> > > > `initTransactions()`
> > > > > > is
> > > > > > > used just once.
> > > > > > >
> > > > > >
> > > > > That's correct observation, will address it.
> > > > >
> > > > >
> > > > > > > Thanks,
> > > > > > > Jason
> > > > > > >
> > > > > > > On Thu, Apr 16, 2020 at 2:24 PM Guozhang Wang <
> > wangg...@gmail.com>
> > > > > > wrote:
> > > > > > >
> > > > > > > > +1 (binding), thanks!
> > > > > > > >
> > > > > > > > On Tue, Apr 14, 2020 at 4:36 PM Boyang Chen <
> > > > > > reluctanthero...@gmail.com>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hey all,
> > > > > > > > >
> > > > > > > > > I would like to start the vote for KIP-588:
> > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-588
> > > > > > > > >
> > > > %3A+Allow+producers+to+recover+gracefully+from+transaction+timeouts
> > > > > > > > >
> > > > > > > > > Feel free to continue posting on discussion thread if you
> > have
> > > > > > > > > any questions, thanks!
> > > > > > > > >
> > > > > > > > > Best,
> > > > > > > > > Boyang
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > -- Guozhang
> > > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > -- Guozhang
> > > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > -- Guozhang
> > > >
> > >
> >
> >
> > --
> > -- Guozhang
> >
>

Re: [Vote] KIP-588: Allow producers to recover gracefully from transaction timeouts

Reply via email to