Sounds good to me. Thanks Boyang. On Fri, Apr 17, 2020 at 3:32 PM Boyang Chen <[email protected]> wrote:
> Thanks Guozhang, > > I think most of the complexity comes from our intention to benefit older > clients. After a second thought, I think the add-on complexity counteracts > the gain here as only 2.5 client is getting a slice of the resilience > improvement, not for many older versions. > > So I decide to drop the UNKNOWN_PRODUCER_ID path, by just claiming that > this change would only benefit 2.6 Producer clients. So the only path that > needs version detection is the new transaction coordinator handling > transactional requests. If the Producer is 2.6+, we pick > PRODUCER_FENCED(new error code) or TRANSACTION_TIMED_OUT as the response; > otherwise we return INVALID_PRODUCE_EPOCH to be consistent with older > clients. > > Does this sound like a better plan? I already updated the KIP with > simplifications. > > > On Fri, Apr 17, 2020 at 12:02 PM Guozhang Wang <[email protected]> wrote: > > > Hi Boyang, > > > > Your reply to 3) seems conflicting with your other answers which is a bit > > confusing to me. Following your other answers, it seems you suggest > > returning UNKNOWN_PRODUCER_ID so that 2.5 clients can trigger retry logic > > as well? > > > > To complete my reasoning here as a complete picture: > > > > a) post KIP-360 (2.5+) the partition leader broker does not return > > UNKNOWN_PRODUCER_ID any more. > > b) upon seeing an old epoch, partition leader cannot tell if it is due to > > fencing or timeout; so it could only return INVALID_PRODUCER_EPOCH. > > > > So the basic idea is to let the clients ask the transaction coordinator > for > > the source of truth: > > > > 1) 2.5+ client would handle UNKNOWN_PRODUCER_ID (which could only be > > returned from old brokers) by trying to re-initialize with the > transaction > > coordinator; the coordinator would then tell it whether it is > > PRODUCER_FENCED or TXN_TIMEOUT. And for old brokers, it would always > return > > PRODUCER_FENCED anyways. > > 2) 2.6+ client would also handle INVALID_PRODUCER_EPOCH with the retry > > initializing logic; and similarly the transaction coordinator would > > return PRODUCER_FENCED or TXN_TIMEOUT if it is new or always > > return PRODUCER_FENCED if it is old. > > > > The question open is, whether > > > > * 3) the new broker should return UNKNOWN_PRODUCER_ID now when it is > > *supposed* to return INVALID_PRODUCER_EPOCH and it found the request is > > from 2.5 client (note as mentioned in a) right now we do not > > return UNKNOWN_PRODUCER_ID from brokers anymore). > > > > If it does, then 2.5 client could still do the retry logic to the > > transaction coordinator, i.e. benefit from KIP-360; but the cost is > complex > > logic on the broker side as well as producer API version bump up. > > If it does not, then when INVALID_PRODUCER_EPOCH is returned to the old > > client it would treat it as fatal and not ask the txn coordinator; but it > > simplifies the broker logic and also do not require producer API version > > bump. > > > > Personally I'd suggest we do the latter, knowing that it would not > benefit > > 2.5 client when the partition leader gets an old epoch and does not know > > whether it is Fenced or Timed Out. > > > > > > Guozhang > > > > On Thu, Apr 16, 2020 at 7:59 PM Boyang Chen <[email protected]> > > wrote: > > > > > Thanks Jason and Guozhang for the thoughts. > > > > > > On Thu, Apr 16, 2020 at 6:09 PM Guozhang Wang <[email protected]> > > wrote: > > > > > > > For 2/3 above, originally I was not thinking that we will have a > > > different > > > > exception for INVALID_PRODUCER_EPOCH and hence was thinking that in > > order > > > > to leverage KIP-360 for it, we'd have to let the broker to > > > > return UNKNOWN_PRODUCER_ID. I.e. we'd change the logic of partition > > > leader > > > > as well to return UNKNOWN_PRODUCER_ID to let the producer try > > > > re-initializing the PID on the coordinator, and if it is indeed due > to > > > > fencing, then coordinator can let the client know of the fatal error > > and > > > > then fail. In that case, then we do need to bump up the producer API > > > > version so that the partition leader knows if it is from older or > newer > > > > clients: if it is older client who do not have KIP-360, we'd return > > > > INVALID_PRODUCER_EPOCH still; for newer client, we can > > > > return UNKNOWN_PRODUCER_ID to let it seek what's the truth on > > > coordinator. > > > > > > > I know this is a bit of reversed order. Feel free to check out my reply > > to > > > Jason first and go back here :) > > > I think the partition leader will have no change of returned code after > > we > > > discussed that only new client should be able to retry. > > > > > > > > > > Now since we are additionally mapping INVALID_PRODUCER_EPOCH to a > > > different > > > > error code now and letting clients to handle that similar > > > > to UNKNOWN_PRODUCER_ID, this can be saved indeed. However, it also > > means > > > > that 2.5.0 clients would not get benefited from this KIP, since they > > > still > > > > treat INVALID_PRODUCER_EPOCH as fatal. If people this that's okay > then > > we > > > > can simplify the partition leader behavior as well as not bumping up > > > > producer APIs. I think I'm a bit inclined towards simplicity over > > > > benefiting older clients. > > > > > > > > > > 2.5 client should get benefits if we return UNKNOWN_PRODUCER_ID as I > have > > > described below. > > > > > > > > > > Guozhang > > > > > > > > On Thu, Apr 16, 2020 at 5:37 PM Jason Gustafson <[email protected]> > > > > wrote: > > > > > > > > > Hi Boyang, > > > > > > > > > > A few minor questions below: > > > > > > > > > > 1. You mention UNKNOWN_PRODUCER_ID in 2.a under Resilience > > > Improvements. > > > > I > > > > > assume that should be INVALID_PRODUCER_EPOCH? I am not sure this > case > > > > makes > > > > > sense for 2.5 clients which would view this error as fatal > regardless > > > of > > > > > whatever the broker does. Not sure there's anything we can do about > > > that. > > > > > Similarly, if a newer client is talking to a 2.5 broker, it > wouldn't > > be > > > > > able to bump the epoch after a timeout because the broker would not > > > know > > > > to > > > > > keep the last epoch. Unfortunately, I think the only improvements > > that > > > > are > > > > > possible here are newer clients talking to newer brokers, but I > might > > > > have > > > > > missed something. > > > > > > > > > > > > The reasoning for returning UNKNOWN_PRODUCER_ID is to trigger the retry > > > logic on 2.5 client hopefully. > > > As stated in the KIP, the benefited parties are *2.5 client and 2.6 > > > client *only. If > > > transaction coordinator returns INVALID_PRODUCER_EPOCH for a > transaction > > > timeout case, 2.5 client still has to treat as fatal as it only > > recognizes > > > PRODUCER_FENCED. Returning UNKNOWN_PRODUCER_ID for 2.5 client is a > > > substitute for TRANSACTION_TIMED_OUT error code, think of 9656 > > > <https://issues.apache.org/jira/browse/KAFKA-9656> as a similar > > > motivation. > > > > > > > > > > > 2. The proposal says that newer clients talking to older brokers > > should > > > > > treat PRODUCER_FENCED as non-fatal. Just to clarify, older brokers > > will > > > > > only return INVALID_PRODUCER_EPOCH because the PRODUCER_FENCED > error > > > did > > > > > not exist. I think the logic should be something like this. > > > > > > > > > > On the Transaction Coordinator: > > > > > - For old request versions, we continue returning > > > INVALID_PRODUCER_EPOCH. > > > > > Clients will translate this to PRODUCER_FENCED as they do today and > > > treat > > > > > this as a fatal error. > > > > > - For new request versions, we return either PRODUCER_FENCED or > > > > > TRANSACTION_TIMED_OUT depending on the case. Clients will treat the > > > first > > > > > as fatal and will bump the epoch on the latter. > > > > > > > > > > On the partition leader: > > > > > - Nothing changes. Old clients will treat INVALID_PRODUCER_EPOCH as > > > > fatal. > > > > > New clients will attempt to bump the epoch if the right version of > > > > > InitProducerId is supported and fail otherwise. > > > > > > > > > > Does that seem right? > > > > > > > > > > > > As I stated in the previous answer, the txn coordinator could also > choose > > > to return UNKNOWN_PRODUCER_ID for 2.5 clients or older > > > when there is a txn timeout. The rest looks good to me. > > > > > > > > > > > 3. Why do we need the bump to the Produce API? As far as I can > tell, > > we > > > > are > > > > > not changing any errors that are used. The INVALID_PRODUCER_EPOCH > > error > > > > is > > > > > already possible today and the two new errors cannot be returned > from > > > > > Produce. > > > > > > > > > > > > Guozhang had some thoughts about it. I think one vague point we haven't > > > clarified is what's the expected error code for server and client. The > > > relation should look like: > > > old client INVALID_PRODUCER_EPOCH -> 47 (ProducerFencedException) > > > new client INVALID_PRODUCER_EPOCH -> 47 (InvalidEpochException), > > > PRODUCER_FENCED -> 91 (ProducerFencedException) > > > old broker INVALID_PRODUCER_EPOCH -> 47 (ProducerFencedException) > > > new broker INVALID_PRODUCER_EPOCH -> 47 (InvalidEpochException), > > > PRODUCER_FENCED -> 91 (ProducerFencedException) > > > > > > As we could see here, for a new partition leader, we shall return code > 47 > > > no matter what as stated in the KIP that PRODUCER_FENCED is not a valid > > > case reported from partition leader. So new client should always > trigger > > > the retry logic for partition leader error. I guess the Produce API > bump > > is > > > not necessary then. > > > > > > > > > > > 4. The javadoc examples suggest the user should call > > > `initTransactions()` > > > > > in the case that a timeout is reached. I don't think that's right. > > For > > > > > KIP-360, we bump the epoch internally. I think we'd want to do the > > same > > > > > here so that we continue the current pattern where > > `initTransactions()` > > > > is > > > > > used just once. > > > > > > > > > > > > That's correct observation, will address it. > > > > > > > > > > > Thanks, > > > > > Jason > > > > > > > > > > On Thu, Apr 16, 2020 at 2:24 PM Guozhang Wang <[email protected]> > > > > wrote: > > > > > > > > > > > +1 (binding), thanks! > > > > > > > > > > > > On Tue, Apr 14, 2020 at 4:36 PM Boyang Chen < > > > > [email protected]> > > > > > > wrote: > > > > > > > > > > > > > Hey all, > > > > > > > > > > > > > > I would like to start the vote for KIP-588: > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-588 > > > > > > > > > %3A+Allow+producers+to+recover+gracefully+from+transaction+timeouts > > > > > > > > > > > > > > Feel free to continue posting on discussion thread if you have > > > > > > > any questions, thanks! > > > > > > > > > > > > > > Best, > > > > > > > Boyang > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > -- Guozhang > > > > > > > > > > > > > > > > > > > > > > > -- > > > > -- Guozhang > > > > > > > > > > > > > -- > > -- Guozhang > > > -- -- Guozhang
