Hey Guozhang, the CommitFailedException wrapping would cover all the non-fatal exceptions as we listed out in the KIP, generally speaking any exception that could recover safely by calling abortTxn should be wrapped.
Best, Boyang On Thu, Jan 28, 2021 at 5:22 PM Guozhang Wang <wangg...@gmail.com> wrote: > Hey Boyang, > > I think maybe there's something cracking here :) I'm just asking for > clarifications that as of today, which non-fatal exceptions the newly > introduced CommitFailedException would cover, and it seems to be only 1) > unknown pid, 2) invalid pid mapping, and 3) concurrent transactions. Is > that correct? > > On Thu, Jan 28, 2021 at 5:06 PM Boyang Chen <reluctanthero...@gmail.com> > wrote: > > > Hey Guozhang, I think TimeoutException would not be covered here as it > > potentially has a risk of hitting an illegal state on the broker side > when > > the previous commit was actually successful. Users should try to increase > > their max.block.ms to avoid hitting the timeout as a base suggestion, > > which > > is discussed in the KIP. > > > > As for the umbrella exception, I agree it has values to some extent, but > in > > terms of adoption difficulty, additional exception types usually make the > > error handling more complex than simplified, and we are doing our best to > > avoid compatibility issues caused by wrapping previously thrown raw > > exceptions that invalidates user error handling. And I'm also not in > favor > > of our exception hierarchy today where all exceptions are subclasses of > > KafkaException, and we already mixed its handling with other specific > > exception types that could be caught as superclass KafkaException. That > > could be addressed in a separate proposal if we see necessary. > > > > Boyang > > > > On Thu, Jan 28, 2021 at 3:28 PM Guozhang Wang <wangg...@gmail.com> > wrote: > > > > > Thanks Jason for the suggestion, that looks good to me too. > > > > > > Regarding the non-fatal exceptions wrapped as CommitFailed, I would > like > > to > > > clarify if we would cover all the following cases: 1) timeout, 2) > unknown > > > pid, 3) invalid pid mapping, 4) concurrent transactions? > > > > > > BTW I think it still makes sense to use an umbrella exception in case > in > > > the future we add more non-fatal cases even though today we only have a > > > few. > > > > > > > > > Guozhang > > > > > > On Thu, Jan 28, 2021 at 1:08 PM Boyang Chen < > reluctanthero...@gmail.com> > > > wrote: > > > > > > > Thanks Jason, I agree with the proposed solution here, will update > the > > > KIP. > > > > > > > > On Thu, Jan 28, 2021 at 10:52 AM Jason Gustafson <ja...@confluent.io > > > > > > wrote: > > > > > > > > > Hi Boyang, > > > > > > > > > > It seems like a reasonable suggestion. I wonder if a flag is > > sufficient > > > > > though. The current `Callback` documentation treats "fatal" errors > > from > > > > the > > > > > perspective of the individual message that was sent. > > > > > > > > > > ``` > > > > > * Non-Retriable exceptions (fatal, the message > > > will > > > > > never be sent): > > > > > ``` > > > > > > > > > > However, we also have fatal errors from the perspective of the > > > > transaction > > > > > (e.g. when the producer gets fenced). Perhaps that suggests we need > > > > > something richer than a boolean flag: > > > > > > > > > > At a high level, I think the following cases are possible: > > > > > > > > > > - message rejected (e.g. message too large, invalid topic) > > > > > - delivery failed after retries/delivery timeout (e.g. timeout, crc > > > > error, > > > > > not enough replicas) > > > > > - transaction failed (e.g. producer fenced, invalid transaction > > state) > > > > > > > > > > Perhaps instead we can have a type like the following: > > > > > > > > > > class SendFailure { > > > > > FailureType failureType; > > > > > Exception cause; > > > > > } > > > > > > > > > > enum FailureType { > > > > > MESSSAGE_REJECTED, DELIVERY_FAILED, TRANSACTION_FAILED > > > > > } > > > > > > > > > > (Not married to any of these names, just a starting point.) > > > > > > > > > > Then we add a new `onCompletion` as you've suggested: > > > > > > > > > > default void onCompletion(RecordMetadata metadata, SendFailure > > > failure) { > > > > > this.onCompletion(metadata, failure.cause()); > > > > > } > > > > > > > > > > This would give streams and other applications enough information > to > > > know > > > > > whether the message can be retried and whether the transaction can > be > > > > > aborted. > > > > > > > > > > What do you think? > > > > > > > > > > -Jason > > > > > > > > > > > > > > > On Wed, Jan 27, 2021 at 9:51 PM Boyang Chen < > > > reluctanthero...@gmail.com> > > > > > wrote: > > > > > > > > > > > Thanks Jason for the thoughts. > > > > > > > > > > > > On Wed, Jan 27, 2021 at 11:52 AM Jason Gustafson < > > ja...@confluent.io > > > > > > > > > > wrote: > > > > > > > > > > > > > Hi Boyang, > > > > > > > > > > > > > > Thanks for the iterations here. I think this is something we > > should > > > > > have > > > > > > > done a long time ago. It sounds like there are two API changes > > > here: > > > > > > > > > > > > > > 1. We are introducing the `CommitFailedException` to wrap > > abortable > > > > > > > errors that are raised from `commitTransaction`. This sounds > fine > > > to > > > > > me. > > > > > > As > > > > > > > far as I know, the only case we might need this is when we add > > > > support > > > > > to > > > > > > > let producers recover from coordinator timeouts. Are there any > > > > others? > > > > > > > > > > > > > > I think the purpose here is to ensure non-fatal exceptions are > > > > unified > > > > > > under the same > > > > > > exception umbrella, to make the proceeding to abort any ongoing > > > > > transaction > > > > > > much easier. > > > > > > I don't think `coordinator timeouts` is the only case to recover > > > here, > > > > > > since we have other > > > > > > non-fatal exceptions such as UnknownPid. > > > > > > > > > > > > 2. We are wrapping non-fatal errors raised from `send` as > > > > > `KafkaException`. > > > > > > > The motivation for this is less clear to me and it doesn't look > > > like > > > > > the > > > > > > > example from the KIP depends on it. My concern here is > > > compatibility. > > > > > > > Currently we have the following documentation for the > `Callback` > > > API: > > > > > > > > > > > > > > ``` > > > > > > > * Non-Retriable exceptions (fatal, the > > > message > > > > > will > > > > > > > never be sent): > > > > > > > * > > > > > > > * InvalidTopicException > > > > > > > * OffsetMetadataTooLargeException > > > > > > > * RecordBatchTooLargeException > > > > > > > * RecordTooLargeException > > > > > > > * UnknownServerException > > > > > > > * UnknownProducerIdException > > > > > > > * InvalidProducerEpochException > > > > > > > * > > > > > > > * Retriable exceptions (transient, may be > > > > covered > > > > > > by > > > > > > > increasing #.retries): > > > > > > > * > > > > > > > * CorruptRecordException > > > > > > > * InvalidMetadataException > > > > > > > * NotEnoughReplicasAfterAppendException > > > > > > > * NotEnoughReplicasException > > > > > > > * OffsetOutOfRangeException > > > > > > > * TimeoutException > > > > > > > * UnknownTopicOrPartitionException > > > > > > > ``` > > > > > > > > > > > > > > If we wrap all the retriable exceptions documented here as > > > > > > > `KafkaException`, wouldn't that break any error handling that > > users > > > > > might > > > > > > > already have? it's gonna introduce a compatibility issue. > > > > > > > > > > > > > > The original intention was to simplify `send` callback error > > > handling > > > > > by > > > > > > doing exception wrapping, as on Streams level > > > > > > we have to prepare an exhausting list of exceptions to catch as > > > fatal, > > > > > and > > > > > > the same lengthy list to catch as > > > > > > non-fatal. It would be much easier if we got `hints` from the > > > callback. > > > > > > However, > > > > > > I agree there is a compatibility concern, what about deprecating > > the > > > > > > existing: > > > > > > > > > > > > void onCompletion(RecordMetadata metadata, Exception exception) > > > > > > > > > > > > and replace it with: > > > > > > > > > > > > default void onCompletion(RecordMetadata metadata, Exception > > > exception, > > > > > > boolean isFatal) { > > > > > > this.onCompletion(metadata, exception); > > > > > > } > > > > > > > > > > > > to make sure new users get the benefit of understanding the > > fatality > > > > > based > > > > > > on the info presented by the producer? > > > > > > > > > > > > Thanks, > > > > > > > Jason > > > > > > > > > > > > > > > > > > > > > On Sat, Jan 23, 2021 at 3:31 AM Hiringuru <i...@hiringuru.com> > > > > wrote: > > > > > > > > > > > > > > > Why we are receiving all emails kindly remove us from > > > > > > > > dev@kafka.apache.org we don't want to receive emails > anymore. > > > > > > > > > > > > > > > > Thanks > > > > > > > > > On 01/23/2021 4:14 AM Guozhang Wang <wangg...@gmail.com> > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > Thanks Boyang, yes I think I was confused about the > different > > > > > > handling > > > > > > > of > > > > > > > > > two abortTxn calls, and now I get it was not intentional. I > > > > think I > > > > > > do > > > > > > > > not > > > > > > > > > have more concerns. > > > > > > > > > > > > > > > > > > On Fri, Jan 22, 2021 at 1:12 PM Boyang Chen < > > > > > > > reluctanthero...@gmail.com> > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > Thanks for the clarification Guozhang, I got your point > > that > > > we > > > > > > want > > > > > > > to > > > > > > > > > > have a consistent handling of fatal exceptions being > thrown > > > > from > > > > > > the > > > > > > > > > > abortTxn. I modified the current template to move the > fatal > > > > > > exception > > > > > > > > > > try-catch outside of the processing loop to make sure we > > > could > > > > > get > > > > > > a > > > > > > > > chance > > > > > > > > > > to close consumer/producer modules. Let me know what you > > > think. > > > > > > > > > > > > > > > > > > > > Best, > > > > > > > > > > Boyang > > > > > > > > > > > > > > > > > > > > On Fri, Jan 22, 2021 at 11:05 AM Boyang Chen < > > > > > > > > reluctanthero...@gmail.com> > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > My understanding is that abortTransaction would only > > throw > > > > when > > > > > > the > > > > > > > > > > > producer is in fatal state. For CommitFailed, the > > producer > > > > > should > > > > > > > > still > > > > > > > > > > be > > > > > > > > > > > in the abortable error state, so that abortTransaction > > call > > > > > would > > > > > > > not > > > > > > > > > > throw. > > > > > > > > > > > > > > > > > > > > > > On Fri, Jan 22, 2021 at 11:02 AM Guozhang Wang < > > > > > > wangg...@gmail.com > > > > > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > >> Or are you going to maintain some internal state such > > > that, > > > > > the > > > > > > > > > > >> `abortTransaction` in the catch block would never > throw > > > > again? > > > > > > > > > > >> > > > > > > > > > > >> On Fri, Jan 22, 2021 at 11:01 AM Guozhang Wang < > > > > > > > wangg...@gmail.com> > > > > > > > > > > >> wrote: > > > > > > > > > > >> > > > > > > > > > > >> > Hi Boyang/Jason, > > > > > > > > > > >> > > > > > > > > > > > >> > I've also thought about this (i.e. using > CommitFailed > > > for > > > > > all > > > > > > > > > > >> non-fatal), > > > > > > > > > > >> > but what I'm pondering is that, in the catch > > > > (CommitFailed) > > > > > > > block, > > > > > > > > > > what > > > > > > > > > > >> > would happen if the `producer.abortTransaction();` > > > throws > > > > > > again? > > > > > > > > > > should > > > > > > > > > > >> > that be captured as a fatal and cause the client to > > > close > > > > > > again. > > > > > > > > > > >> > > > > > > > > > > > >> > If yes, then naively the pattern would be: > > > > > > > > > > >> > > > > > > > > > > > >> > ... > > > > > > > > > > >> > catch (CommitFailedException e) { > > > > > > > > > > >> > // Transaction commit failed with abortable > > > error, > > > > > > user > > > > > > > > could > > > > > > > > > > >> reset > > > > > > > > > > >> > // the application state and resume with a > new > > > > > > > > transaction. > > > > > > > > > > The > > > > > > > > > > >> > root > > > > > > > > > > >> > // cause was wrapped in the thrown > exception. > > > > > > > > > > >> > resetToLastCommittedPositions(consumer); > > > > > > > > > > >> > try { > > > > > > > > > > >> > producer.abortTransaction(); > > > > > > > > > > >> > } catch (KafkaException e) { > > > > > > > > > > >> > producer.close(); > > > > > > > > > > >> > consumer.close(); > > > > > > > > > > >> > throw e; > > > > > > > > > > >> > } > > > > > > > > > > >> > } catch (KafkaException e) { > > > > > > > > > > >> > producer.close(); > > > > > > > > > > >> > consumer.close(); > > > > > > > > > > >> > throw e; > > > > > > > > > > >> > } > > > > > > > > > > >> > ... > > > > > > > > > > >> > > > > > > > > > > > >> > Guozhang > > > > > > > > > > >> > > > > > > > > > > > >> > On Fri, Jan 22, 2021 at 10:47 AM Boyang Chen < > > > > > > > > > > >> reluctanthero...@gmail.com> > > > > > > > > > > >> > wrote: > > > > > > > > > > >> > > > > > > > > > > > >> >> Hey Guozhang, > > > > > > > > > > >> >> > > > > > > > > > > >> >> Jason and I were discussing the new API offline and > > > > decided > > > > > > to > > > > > > > > take > > > > > > > > > > >> >> another > > > > > > > > > > >> >> approach. Firstly, the reason not to invent a new > API > > > > with > > > > > > > > returned > > > > > > > > > > >> >> boolean > > > > > > > > > > >> >> flag is for compatibility consideration, since old > > EOS > > > > code > > > > > > > > would not > > > > > > > > > > >> know > > > > > > > > > > >> >> that a given transaction commit was failed > internally > > > as > > > > > they > > > > > > > > don't > > > > > > > > > > >> listen > > > > > > > > > > >> >> to the output flag. Our proposed alternative > solution > > > is > > > > to > > > > > > let > > > > > > > > > > >> >> *commitTransaction > > > > > > > > > > >> >> throw CommitFailedException whenever the commit > > failed > > > > with > > > > > > > > non-fatal > > > > > > > > > > >> >> exception*, so that on the caller side the handling > > > logic > > > > > > > > becomes: > > > > > > > > > > >> >> > > > > > > > > > > >> >> try { > > > > > > > > > > >> >> if (shouldCommit) { > > > > > > > > > > >> >> producer.commitTransaction(); > > > > > > > > > > >> >> } else { > > > > > > > > > > >> >> > resetToLastCommittedPositions(consumer); > > > > > > > > > > >> >> producer.abortTransaction(); > > > > > > > > > > >> >> } > > > > > > > > > > >> >> } catch (CommitFailedException e) { > > > > > > > > > > >> >> // Transaction commit failed with abortable > > > > error, > > > > > > user > > > > > > > > could > > > > > > > > > > >> >> reset > > > > > > > > > > >> >> // the application state and resume with a > > new > > > > > > > > transaction. > > > > > > > > > > The > > > > > > > > > > >> >> root > > > > > > > > > > >> >> // cause was wrapped in the thrown > exception. > > > > > > > > > > >> >> resetToLastCommittedPositions(consumer); > > > > > > > > > > >> >> producer.abortTransaction(); > > > > > > > > > > >> >> } catch (KafkaException e) { > > > > > > > > > > >> >> producer.close(); > > > > > > > > > > >> >> consumer.close(); > > > > > > > > > > >> >> throw e; > > > > > > > > > > >> >> } > > > > > > > > > > >> >> > > > > > > > > > > >> >> This approach looks cleaner as all exception types > > > other > > > > > than > > > > > > > > > > >> CommitFailed > > > > > > > > > > >> >> will doom to be fatal, which is very easy to adopt > > for > > > > > users. > > > > > > > In > > > > > > > > the > > > > > > > > > > >> >> meantime, we still maintain the commitTxn behavior > to > > > > throw > > > > > > > > instead > > > > > > > > > > of > > > > > > > > > > >> >> silently failing. > > > > > > > > > > >> >> > > > > > > > > > > >> >> In addition, we decided to drop the recommendation > to > > > > > handle > > > > > > > > > > >> >> TimeoutException and leave it to the users to make > > the > > > > > call. > > > > > > > The > > > > > > > > > > >> downside > > > > > > > > > > >> >> for blindly calling abortTxn upon timeout is that > we > > > > could > > > > > > > > result in > > > > > > > > > > an > > > > > > > > > > >> >> illegal state when the commit was already > successful > > on > > > > the > > > > > > > > broker > > > > > > > > > > >> >> side. Without a good guarantee on the outcome, > > > > > > overcomplicating > > > > > > > > the > > > > > > > > > > >> >> template should not be encouraged IMHO. > > > > > > > > > > >> >> > > > > > > > > > > >> >> Let me know your thoughts on the new approach here, > > > thank > > > > > > you! > > > > > > > > > > >> >> > > > > > > > > > > >> >> Best, > > > > > > > > > > >> >> Boyang > > > > > > > > > > >> >> > > > > > > > > > > >> >> On Tue, Jan 19, 2021 at 11:11 AM Guozhang Wang < > > > > > > > > wangg...@gmail.com> > > > > > > > > > > >> >> wrote: > > > > > > > > > > >> >> > > > > > > > > > > >> >> > Thanks for your clarification on 2)/3), that > makes > > > > sense. > > > > > > > > > > >> >> > > > > > > > > > > > >> >> > On Tue, Jan 19, 2021 at 10:16 AM Boyang Chen < > > > > > > > > > > >> >> reluctanthero...@gmail.com> > > > > > > > > > > >> >> > wrote: > > > > > > > > > > >> >> > > > > > > > > > > > >> >> > > Thanks for the input Guozhang, replied inline. > > > > > > > > > > >> >> > > > > > > > > > > > > >> >> > > On Mon, Jan 18, 2021 at 8:57 PM Guozhang Wang < > > > > > > > > > > wangg...@gmail.com> > > > > > > > > > > >> >> > wrote: > > > > > > > > > > >> >> > > > > > > > > > > > > >> >> > > > Hello Boyang, > > > > > > > > > > >> >> > > > > > > > > > > > > > >> >> > > > Thanks for the updated KIP. I read it again > and > > > > have > > > > > > the > > > > > > > > > > >> following > > > > > > > > > > >> >> > > > thoughts: > > > > > > > > > > >> >> > > > > > > > > > > > > > >> >> > > > 0. I'm a bit concerned that if commitTxn does > > not > > > > > throw > > > > > > > any > > > > > > > > > > >> >> non-fatal > > > > > > > > > > >> >> > > > exception, and instead we rely on the > > subsequent > > > > > > beginTxn > > > > > > > > call > > > > > > > > > > to > > > > > > > > > > >> >> > throw, > > > > > > > > > > >> >> > > it > > > > > > > > > > >> >> > > > may violate some callers with a pattern that > > > > relying > > > > > on > > > > > > > > > > >> commitTxn to > > > > > > > > > > >> >> > > > succeed to make some non-rollback operations. > > For > > > > > > > example: > > > > > > > > > > >> >> > > > > > > > > > > > > > >> >> > > > beginTxn() > > > > > > > > > > >> >> > > > // do some read-write on my local DB > > > > > > > > > > >> >> > > > commitTxn() > > > > > > > > > > >> >> > > > // if commitTxn succeeds, then commit the DB > > > > > > > > > > >> >> > > > > > > > > > > > > > >> >> > > > ------------- > > > > > > > > > > >> >> > > > > > > > > > > > > > >> >> > > > The issue is that, committing DB is a > > > non-rollback > > > > > > > > operation, > > > > > > > > > > and > > > > > > > > > > >> >> users > > > > > > > > > > >> >> > > may > > > > > > > > > > >> >> > > > just rely on commitTxn to return without > error > > to > > > > > make > > > > > > > this > > > > > > > > > > >> >> > non-rollback > > > > > > > > > > >> >> > > > call. Of course we can just claim this > pattern > > is > > > > not > > > > > > > > > > legitimate > > > > > > > > > > >> >> and is > > > > > > > > > > >> >> > > not > > > > > > > > > > >> >> > > > the right way of doing things, but many users > > may > > > > > > > naturally > > > > > > > > > > adopt > > > > > > > > > > >> >> this > > > > > > > > > > >> >> > > > pattern. > > > > > > > > > > >> >> > > > > > > > > > > > > > >> >> > > > So maybe we should still let commitTxn also > > throw > > > > > > > non-fatal > > > > > > > > > > >> >> exceptions, > > > > > > > > > > >> >> > > in > > > > > > > > > > >> >> > > > which case we would then call abortTxn again. > > > > > > > > > > >> >> > > > > > > > > > > > > > >> >> > > > But if we do this, the pattern becomes: > > > > > > > > > > >> >> > > > > > > > > > > > > > >> >> > > > try { > > > > > > > > > > >> >> > > > beginTxn() > > > > > > > > > > >> >> > > > // do something > > > > > > > > > > >> >> > > > } catch (Exception) { > > > > > > > > > > >> >> > > > shouldCommit = false; > > > > > > > > > > >> >> > > > } > > > > > > > > > > >> >> > > > > > > > > > > > > > >> >> > > > if (shouldCommit) { > > > > > > > > > > >> >> > > > try { > > > > > > > > > > >> >> > > > commitTxn() > > > > > > > > > > >> >> > > > } catch (...) { // enumerate all > > fatal > > > > > > > > exceptions > > > > > > > > > > >> >> > > > shutdown() > > > > > > > > > > >> >> > > > } catch (KafkaException) { > > > > > > > > > > >> >> > > > // non-fatal > > > > > > > > > > >> >> > > > shouldCommit = false; > > > > > > > > > > >> >> > > > } > > > > > > > > > > >> >> > > > } > > > > > > > > > > >> >> > > > > > > > > > > > > > >> >> > > > if (!shouldCommit && running()) { > > > > > > > > > > >> >> > > > try { > > > > > > > > > > >> >> > > > abortTxn() > > > > > > > > > > >> >> > > > } catch (KafkaException) { > > > > > > > > > > >> >> > > > // only throw fatal > > > > > > > > > > >> >> > > > shutdown() > > > > > > > > > > >> >> > > > } > > > > > > > > > > >> >> > > > } > > > > > > > > > > >> >> > > > > > > > > > > > > > >> >> > > > --------------------- > > > > > > > > > > >> >> > > > > > > > > > > > > > >> >> > > > Which is much more complicated. > > > > > > > > > > >> >> > > > > > > > > > > > > > >> >> > > > Thank makes me think, the alternative we have > > > > > discussed > > > > > > > > offline > > > > > > > > > > >> may > > > > > > > > > > >> >> be > > > > > > > > > > >> >> > > > better: let commitTxn() to return a boolean > > flag. > > > > > > > > > > >> >> > > > > > > > > > > > > > >> >> > > > * If it returns true, it means the commit > > > > succeeded. > > > > > > > Users > > > > > > > > can > > > > > > > > > > >> >> > > comfortably > > > > > > > > > > >> >> > > > continue and do any external non-rollback > > > > operations > > > > > if > > > > > > > > they > > > > > > > > > > >> like. > > > > > > > > > > >> >> > > > * If it returns false, it means the commit > > failed > > > > > with > > > > > > > > > > non-fatal > > > > > > > > > > >> >> error > > > > > > > > > > >> >> > > *AND > > > > > > > > > > >> >> > > > the txn has been aborted*. Users do not need > to > > > > call > > > > > > > > abortTxn > > > > > > > > > > >> again. > > > > > > > > > > >> >> > > > * If it throws, then it means fatal errors. > > Users > > > > > > should > > > > > > > > shut > > > > > > > > > > >> down > > > > > > > > > > >> >> the > > > > > > > > > > >> >> > > > client. > > > > > > > > > > >> >> > > > > > > > > > > > > > >> >> > > > In this case, the pattern becomes: > > > > > > > > > > >> >> > > > > > > > > > > > > > >> >> > > > try { > > > > > > > > > > >> >> > > > beginTxn() > > > > > > > > > > >> >> > > > // do something > > > > > > > > > > >> >> > > > } catch (Exception) { > > > > > > > > > > >> >> > > > shouldCommit = false; > > > > > > > > > > >> >> > > > } > > > > > > > > > > >> >> > > > > > > > > > > > > > >> >> > > > try { > > > > > > > > > > >> >> > > > if (shouldCommit) { > > > > > > > > > > >> >> > > > commitSucceeded = commitTxn() > > > > > > > > > > >> >> > > > } else { > > > > > > > > > > >> >> > > > // reset offsets, rollback > operations, > > > etc > > > > > > > > > > >> >> > > > abortTxn() > > > > > > > > > > >> >> > > > } > > > > > > > > > > >> >> > > > } catch (KafkaException) { > > > > > > > > > > >> >> > > > // only throw fatal > > > > > > > > > > >> >> > > > shutdown() > > > > > > > > > > >> >> > > > } > > > > > > > > > > >> >> > > > > > > > > > > > > > >> >> > > > if (commitSucceeded) > > > > > > > > > > >> >> > > > // do other non-rollbackable things > > > > > > > > > > >> >> > > > else > > > > > > > > > > >> >> > > > // reset offsets, rollback operations, etc > > > > > > > > > > >> >> > > > > > > > > > > > > > >> >> > > > ------------------------- > > > > > > > > > > >> >> > > > > > > > > > > > > > >> >> > > > Of course, if we want to have better > visibility > > > > into > > > > > > what > > > > > > > > > > caused > > > > > > > > > > >> the > > > > > > > > > > >> >> > > commit > > > > > > > > > > >> >> > > > to fail and txn to abort. We can let the > return > > > > type > > > > > be > > > > > > > an > > > > > > > > enum > > > > > > > > > > >> >> instead > > > > > > > > > > >> >> > > of > > > > > > > > > > >> >> > > > a flag. But my main idea is to still let the > > > > > commitTxn > > > > > > be > > > > > > > > the > > > > > > > > > > >> final > > > > > > > > > > >> >> > point > > > > > > > > > > >> >> > > > users can tell whether this txn succeeded or > > not, > > > > > > instead > > > > > > > > of > > > > > > > > > > >> >> relying on > > > > > > > > > > >> >> > > the > > > > > > > > > > >> >> > > > next beginTxn() call. > > > > > > > > > > >> >> > > > > > > > > > > > > > >> >> > > > I agree that adding a boolean flag is indeed > > > useful > > > > > in > > > > > > > this > > > > > > > > > > case. > > > > > > > > > > >> >> Will > > > > > > > > > > >> >> > > update the KIP. > > > > > > > > > > >> >> > > > > > > > > > > > > >> >> > > 1. Re: "while maintaining the behavior to throw > > > fatal > > > > > > > > exception > > > > > > > > > > in > > > > > > > > > > >> >> raw" > > > > > > > > > > >> >> > not > > > > > > > > > > >> >> > > > sure what do you mean by "throw" here. Are > you > > > > > > proposing > > > > > > > > the > > > > > > > > > > >> >> callback > > > > > > > > > > >> >> > > would > > > > > > > > > > >> >> > > > *pass* (not throw) in any fatal exceptions > when > > > > > > triggered > > > > > > > > > > without > > > > > > > > > > >> >> > > wrapping? > > > > > > > > > > >> >> > > > > > > > > > > > > > >> >> > > > Yes, I want to say *pass*, the benefit is to > > make > > > > the > > > > > > end > > > > > > > > > > user's > > > > > > > > > > >> >> > > expectation consistent > > > > > > > > > > >> >> > > regarding exception handling. > > > > > > > > > > >> >> > > > > > > > > > > > > >> >> > > > > > > > > > > > > >> >> > > > 2. I'm not sure I understand the claim > > regarding > > > > the > > > > > > > > callback > > > > > > > > > > >> that > > > > > > > > > > >> >> "In > > > > > > > > > > >> >> > > EOS > > > > > > > > > > >> >> > > > setup, it is not required to handle the > > > exception". > > > > > Are > > > > > > > you > > > > > > > > > > >> >> proposing > > > > > > > > > > >> >> > > that, > > > > > > > > > > >> >> > > > e.g. in Streams, we do not try to handle any > > > > > exceptions > > > > > > > if > > > > > > > > EOS > > > > > > > > > > is > > > > > > > > > > >> >> > enabled > > > > > > > > > > >> >> > > > in the callback anymore, but just let > > commitTxn() > > > > > > itself > > > > > > > to > > > > > > > > > > fail > > > > > > > > > > >> to > > > > > > > > > > >> >> be > > > > > > > > > > >> >> > > > notified about the problem? Personally I > think > > in > > > > > > Streams > > > > > > > > we > > > > > > > > > > >> should > > > > > > > > > > >> >> > just > > > > > > > > > > >> >> > > > make the handling logic of the callback to be > > > > > > consistent > > > > > > > > > > >> regardless > > > > > > > > > > >> >> of > > > > > > > > > > >> >> > > the > > > > > > > > > > >> >> > > > EOS settings (today we have different logic > > > > depending > > > > > > on > > > > > > > > this > > > > > > > > > > >> logic, > > > > > > > > > > >> >> > > which > > > > > > > > > > >> >> > > > I think could be unified as well). > > > > > > > > > > >> >> > > > > > > > > > > > > > >> >> > > > My idea originates from the claim on send > API: > > > > > > > > > > >> >> > > "When used as part of a transaction, it is not > > > > > necessary > > > > > > to > > > > > > > > > > define > > > > > > > > > > >> a > > > > > > > > > > >> >> > > callback or check the result of the future in > > > order > > > > to > > > > > > > > detect > > > > > > > > > > >> errors > > > > > > > > > > >> >> > from > > > > > > > > > > >> >> > > <code>send</code>. " > > > > > > > > > > >> >> > > My understanding is that for EOS, the exception > > > will > > > > be > > > > > > > > detected > > > > > > > > > > by > > > > > > > > > > >> >> > calling > > > > > > > > > > >> >> > > transactional APIs either way, so it is a > > duplicate > > > > > > > handling > > > > > > > > to > > > > > > > > > > >> track > > > > > > > > > > >> >> > > all the sendExceptions in RecordCollector. > > > However, I > > > > > > > looked > > > > > > > > up > > > > > > > > > > >> >> > > sendException is being used today as follow: > > > > > > > > > > >> >> > > > > > > > > > > > > >> >> > > 1. Pass to "ProductionExceptionHandler" for any > > > > default > > > > > > or > > > > > > > > > > >> customized > > > > > > > > > > >> >> > > exception handler to handle > > > > > > > > > > >> >> > > 2. Stop collecting offset info or new > exceptions > > > > > > > > > > >> >> > > 3. Check and rethrow exceptions in close(), > > flush() > > > > or > > > > > > new > > > > > > > > send() > > > > > > > > > > >> >> calls > > > > > > > > > > >> >> > > > > > > > > > > > > >> >> > > To my understanding, we should still honor the > > > > > commitment > > > > > > > to > > > > > > > > #1 > > > > > > > > > > for > > > > > > > > > > >> >> any > > > > > > > > > > >> >> > > user customized implementation. The #2 does not > > > > really > > > > > > > > affect our > > > > > > > > > > >> >> > decision > > > > > > > > > > >> >> > > upon EOS. The #3 here is still valuable as it > > could > > > > > help > > > > > > us > > > > > > > > fail > > > > > > > > > > >> fast > > > > > > > > > > >> >> in > > > > > > > > > > >> >> > > new send() instead of waiting to later stage of > > > > > > processing. > > > > > > > > In > > > > > > > > > > that > > > > > > > > > > >> >> > sense, > > > > > > > > > > >> >> > > I agree we should continue to record send > > > exceptions > > > > > even > > > > > > > > under > > > > > > > > > > EOS > > > > > > > > > > >> >> case > > > > > > > > > > >> >> > to > > > > > > > > > > >> >> > > ensure the strength of stream side Producer > > logic. > > > On > > > > > the > > > > > > > > safer > > > > > > > > > > >> side, > > > > > > > > > > >> >> we > > > > > > > > > > >> >> > no > > > > > > > > > > >> >> > > longer need to wrap certain fatal exceptions > like > > > > > > > > ProducerFenced > > > > > > > > > > as > > > > > > > > > > >> >> > > TaskMigrated, since they should not crash the > > > stream > > > > > > thread > > > > > > > > if > > > > > > > > > > >> thrown > > > > > > > > > > >> >> in > > > > > > > > > > >> >> > > raw format, once we adopt the new processing > > model > > > in > > > > > the > > > > > > > > send > > > > > > > > > > >> phase. > > > > > > > > > > >> >> > > > > > > > > > > > > >> >> > > > > > > > > > > > > >> >> > > > > > > > > > > > > > >> >> > > > Guozhang > > > > > > > > > > >> >> > > > > > > > > > > > > > >> >> > > > > > > > > > > > > > >> >> > > > > > > > > > > > > > >> >> > > > On Thu, Dec 17, 2020 at 8:42 PM Boyang Chen < > > > > > > > > > > >> >> > reluctanthero...@gmail.com> > > > > > > > > > > >> >> > > > wrote: > > > > > > > > > > >> >> > > > > > > > > > > > > > >> >> > > > > Thanks for everyone's feedback so far. I > have > > > > > > polished > > > > > > > > the > > > > > > > > > > KIP > > > > > > > > > > >> >> after > > > > > > > > > > >> >> > > > > offline discussion with some folks working > on > > > EOS > > > > > to > > > > > > > > make the > > > > > > > > > > >> >> > exception > > > > > > > > > > >> >> > > > > handling more lightweight. The essential > > change > > > > is > > > > > > that > > > > > > > > we > > > > > > > > > > are > > > > > > > > > > >> not > > > > > > > > > > >> >> > > > > inventing a new intermediate exception > type, > > > but > > > > > > > instead > > > > > > > > > > >> >> separating a > > > > > > > > > > >> >> > > > full > > > > > > > > > > >> >> > > > > transaction into two phases: > > > > > > > > > > >> >> > > > > > > > > > > > > > > >> >> > > > > 1. The data transmission phase > > > > > > > > > > >> >> > > > > 2. The commit phase > > > > > > > > > > >> >> > > > > > > > > > > > > > > >> >> > > > > This way for any exception thrown from > phase > > 1, > > > > > will > > > > > > be > > > > > > > > an > > > > > > > > > > >> >> indicator > > > > > > > > > > >> >> > in > > > > > > > > > > >> >> > > > > phase 2 whether we should do commit or > abort, > > > and > > > > > > from > > > > > > > > now on > > > > > > > > > > >> >> > > > > `commitTransaction` should only throw fatal > > > > > > exceptions, > > > > > > > > > > >> similar to > > > > > > > > > > >> >> > > > > `abortTransaction`, so that any > > KafkaException > > > > > caught > > > > > > > in > > > > > > > > the > > > > > > > > > > >> >> commit > > > > > > > > > > >> >> > > phase > > > > > > > > > > >> >> > > > > will be definitely fatal to crash the app. > > For > > > > more > > > > > > > > advanced > > > > > > > > > > >> users > > > > > > > > > > >> >> > such > > > > > > > > > > >> >> > > > as > > > > > > > > > > >> >> > > > > Streams, we have the ability to further > wrap > > > > > selected > > > > > > > > types > > > > > > > > > > of > > > > > > > > > > >> >> fatal > > > > > > > > > > >> >> > > > > exceptions to trigger task migration if > > > > necessary. > > > > > > > > > > >> >> > > > > > > > > > > > > > > >> >> > > > > More details in the KIP, feel free to take > > > > another > > > > > > > look, > > > > > > > > > > >> thanks! > > > > > > > > > > >> >> > > > > > > > > > > > > > > >> >> > > > > On Thu, Dec 17, 2020 at 8:36 PM Boyang > Chen < > > > > > > > > > > >> >> > > reluctanthero...@gmail.com> > > > > > > > > > > >> >> > > > > wrote: > > > > > > > > > > >> >> > > > > > > > > > > > > > > >> >> > > > > > Thanks Bruno for the feedback. > > > > > > > > > > >> >> > > > > > > > > > > > > > > > >> >> > > > > > On Mon, Dec 7, 2020 at 5:26 AM Bruno > > Cadonna > > > < > > > > > > > > > > >> >> br...@confluent.io> > > > > > > > > > > >> >> > > > wrote: > > > > > > > > > > >> >> > > > > > > > > > > > > > > > >> >> > > > > >> Thanks Boyang for the KIP! > > > > > > > > > > >> >> > > > > >> > > > > > > > > > > >> >> > > > > >> Like Matthias, I do also not know the > > > producer > > > > > > > > internal > > > > > > > > > > well > > > > > > > > > > >> >> > enough > > > > > > > > > > >> >> > > to > > > > > > > > > > >> >> > > > > >> comment on the categorization. However, > I > > > > think > > > > > > > > having a > > > > > > > > > > >> super > > > > > > > > > > >> >> > > > exception > > > > > > > > > > >> >> > > > > >> (e.g. RetriableException) that encodes > if > > > an > > > > > > > > exception is > > > > > > > > > > >> >> fatal > > > > > > > > > > >> >> > or > > > > > > > > > > >> >> > > > not > > > > > > > > > > >> >> > > > > >> is cleaner, better understandable and > less > > > > > > > > error-prone, > > > > > > > > > > >> because > > > > > > > > > > >> >> > > > ideally > > > > > > > > > > >> >> > > > > >> when you add a new non-fatal exception > in > > > > future > > > > > > you > > > > > > > > just > > > > > > > > > > >> need > > > > > > > > > > >> >> to > > > > > > > > > > >> >> > > > think > > > > > > > > > > >> >> > > > > >> about letting it inherit from the super > > > > > exception > > > > > > > and > > > > > > > > all > > > > > > > > > > >> the > > > > > > > > > > >> >> rest > > > > > > > > > > >> >> > > of > > > > > > > > > > >> >> > > > > >> the code will just behave correctly > > without > > > > the > > > > > > need > > > > > > > > to > > > > > > > > > > wrap > > > > > > > > > > >> >> the > > > > > > > > > > >> >> > new > > > > > > > > > > >> >> > > > > >> exception into another exception each > time > > > it > > > > is > > > > > > > > thrown > > > > > > > > > > >> (maybe > > > > > > > > > > >> >> it > > > > > > > > > > >> >> > is > > > > > > > > > > >> >> > > > > >> thrown at different location in the > code). > > > > > > > > > > >> >> > > > > >> > > > > > > > > > > >> >> > > > > >> As far as I understand the following > > > statement > > > > > > from > > > > > > > > your > > > > > > > > > > >> >> previous > > > > > > > > > > >> >> > > > e-mail > > > > > > > > > > >> >> > > > > >> is the reason that currently such a > super > > > > > > exception > > > > > > > > is not > > > > > > > > > > >> >> > possible: > > > > > > > > > > >> >> > > > > >> > > > > > > > > > > >> >> > > > > >> "Right now we have RetriableException > > type, > > > if > > > > > we > > > > > > > are > > > > > > > > > > going > > > > > > > > > > >> to > > > > > > > > > > >> >> > add a > > > > > > > > > > >> >> > > > > >> `ProducerRetriableException` type, we > have > > > to > > > > > put > > > > > > > > this new > > > > > > > > > > >> >> > interface > > > > > > > > > > >> >> > > > as > > > > > > > > > > >> >> > > > > >> the parent of the RetriableException, > > > because > > > > > not > > > > > > > all > > > > > > > > > > thrown > > > > > > > > > > >> >> > > non-fatal > > > > > > > > > > >> >> > > > > >> exceptions are `retriable` in general > for > > > > > > producer" > > > > > > > > > > >> >> > > > > >> > > > > > > > > > > >> >> > > > > >> > > > > > > > > > > >> >> > > > > >> In the list of exceptions in your KIP, I > > > found > > > > > > > > non-fatal > > > > > > > > > > >> >> > exceptions > > > > > > > > > > >> >> > > > that > > > > > > > > > > >> >> > > > > >> do not inherit from RetriableException. > I > > > > guess > > > > > > > those > > > > > > > > are > > > > > > > > > > >> the > > > > > > > > > > >> >> ones > > > > > > > > > > >> >> > > you > > > > > > > > > > >> >> > > > > >> are referring to in your statement: > > > > > > > > > > >> >> > > > > >> > > > > > > > > > > >> >> > > > > >> InvalidProducerEpochException > > > > > > > > > > >> >> > > > > >> InvalidPidMappingException > > > > > > > > > > >> >> > > > > >> TransactionAbortedException > > > > > > > > > > >> >> > > > > >> > > > > > > > > > > >> >> > > > > >> All of those exceptions are non-fatal > and > > do > > > > not > > > > > > > > inherit > > > > > > > > > > >> from > > > > > > > > > > >> >> > > > > >> RetriableException. Is there a reason > for > > > > that? > > > > > If > > > > > > > > they > > > > > > > > > > >> >> depended > > > > > > > > > > >> >> > > from > > > > > > > > > > >> >> > > > > >> RetriableException we would be a bit > > closer > > > > to a > > > > > > > super > > > > > > > > > > >> >> exception I > > > > > > > > > > >> >> > > > > >> mention above. > > > > > > > > > > >> >> > > > > >> > > > > > > > > > > >> >> > > > > >> The reason is that sender may catch > those > > > > > > exceptions > > > > > > > > in > > > > > > > > > > the > > > > > > > > > > >> >> > > > > > ProduceResponse, and it currently does > > > infinite > > > > > > > > > > >> >> > > > > > retries on RetriableException. To make > sure > > > we > > > > > > could > > > > > > > > > > trigger > > > > > > > > > > >> the > > > > > > > > > > >> >> > > > > > abortTransaction() by catching non-fatal > > > thrown > > > > > > > > > > >> >> > > > > > exceptions and reinitialize the txn > state, > > we > > > > > chose > > > > > > > > not to > > > > > > > > > > >> let > > > > > > > > > > >> >> > those > > > > > > > > > > >> >> > > > > > exceptions inherit RetriableException so > > that > > > > > > > > > > >> >> > > > > > they won't cause infinite retry on > sender. > > > > > > > > > > >> >> > > > > > > > > > > > > > > > >> >> > > > > > > > > > > > > > > > >> >> > > > > >> With OutOfOrderSequenceException and > > > > > > > > > > >> >> UnknownProducerIdException, I > > > > > > > > > > >> >> > > > think > > > > > > > > > > >> >> > > > > >> to understand that their fatality > depends > > on > > > > the > > > > > > > type > > > > > > > > > > (i.e. > > > > > > > > > > >> >> > > > > >> configuration) of the producer. That > makes > > > it > > > > > > > > difficult to > > > > > > > > > > >> >> have a > > > > > > > > > > >> >> > > > super > > > > > > > > > > >> >> > > > > >> exception that encodes the retriability > as > > > > > > mentioned > > > > > > > > > > above. > > > > > > > > > > >> >> Would > > > > > > > > > > >> >> > it > > > > > > > > > > >> >> > > > be > > > > > > > > > > >> >> > > > > >> possible to introduce exceptions that > > > inherit > > > > > from > > > > > > > > > > >> >> > > RetriableException > > > > > > > > > > >> >> > > > > >> and that are thrown when those > exceptions > > > are > > > > > > caught > > > > > > > > from > > > > > > > > > > >> the > > > > > > > > > > >> >> > > brokers > > > > > > > > > > >> >> > > > > >> and the type of the producer is such > that > > > the > > > > > > > > exceptions > > > > > > > > > > are > > > > > > > > > > >> >> > > > retriable? > > > > > > > > > > >> >> > > > > >> > > > > > > > > > > >> >> > > > > >> Yea, I think in general the exception > type > > > > > mixing > > > > > > is > > > > > > > > > > >> difficult > > > > > > > > > > >> >> to > > > > > > > > > > >> >> > > get > > > > > > > > > > >> >> > > > > > them all right. I have already proposed > > > another > > > > > > > > solution > > > > > > > > > > >> based > > > > > > > > > > >> >> on > > > > > > > > > > >> >> > my > > > > > > > > > > >> >> > > > > > offline discussion with some folks > working > > on > > > > EOS > > > > > > > > > > >> >> > > > > > to make the handling more straightforward > > for > > > > end > > > > > > > users > > > > > > > > > > >> without > > > > > > > > > > >> >> the > > > > > > > > > > >> >> > > > need > > > > > > > > > > >> >> > > > > > to distinguish exception fatality. > > > > > > > > > > >> >> > > > > > > > > > > > > > > > >> >> > > > > >> Best, > > > > > > > > > > >> >> > > > > >> Bruno > > > > > > > > > > >> >> > > > > >> > > > > > > > > > > >> >> > > > > >> > > > > > > > > > > >> >> > > > > >> On 04.12.20 19:34, Guozhang Wang wrote: > > > > > > > > > > >> >> > > > > >> > Thanks Boyang for the proposal! I > made a > > > > pass > > > > > > over > > > > > > > > the > > > > > > > > > > >> list > > > > > > > > > > >> >> and > > > > > > > > > > >> >> > > here > > > > > > > > > > >> >> > > > > are > > > > > > > > > > >> >> > > > > >> > some thoughts: > > > > > > > > > > >> >> > > > > >> > > > > > > > > > > > >> >> > > > > >> > 0) Although this is not part of the > > public > > > > > API, > > > > > > I > > > > > > > > think > > > > > > > > > > we > > > > > > > > > > >> >> > should > > > > > > > > > > >> >> > > > make > > > > > > > > > > >> >> > > > > >> sure > > > > > > > > > > >> >> > > > > >> > that the suggested pattern (i.e. user > > can > > > > > always > > > > > > > > call > > > > > > > > > > >> >> abortTxn() > > > > > > > > > > >> >> > > > when > > > > > > > > > > >> >> > > > > >> > handling non-fatal errors) are indeed > > > > > supported. > > > > > > > > E.g. if > > > > > > > > > > >> the > > > > > > > > > > >> >> txn > > > > > > > > > > >> >> > > is > > > > > > > > > > >> >> > > > > >> already > > > > > > > > > > >> >> > > > > >> > aborted by the broker side, then users > > can > > > > > still > > > > > > > > call > > > > > > > > > > >> >> abortTxn > > > > > > > > > > >> >> > > which > > > > > > > > > > >> >> > > > > >> would > > > > > > > > > > >> >> > > > > >> > not throw another exception but just > be > > > > > treated > > > > > > > as a > > > > > > > > > > >> no-op. > > > > > > > > > > >> >> > > > > >> > > > > > > > > > > > >> >> > > > > >> > 1) *ConcurrentTransactionsException*: > I > > > > think > > > > > > this > > > > > > > > error > > > > > > > > > > >> can > > > > > > > > > > >> >> > also > > > > > > > > > > >> >> > > be > > > > > > > > > > >> >> > > > > >> > returned but not documented yet. This > > > should > > > > > be > > > > > > a > > > > > > > > > > >> non-fatal > > > > > > > > > > >> >> > error. > > > > > > > > > > >> >> > > > > >> > > > > > > > > > > > >> >> > > > > >> > 2) *InvalidTxnStateException*: this > > error > > > is > > > > > > > > returned > > > > > > > > > > from > > > > > > > > > > >> >> > broker > > > > > > > > > > >> >> > > > when > > > > > > > > > > >> >> > > > > >> txn > > > > > > > > > > >> >> > > > > >> > state transition failed (e.g. it is > > trying > > > > to > > > > > > > > transit to > > > > > > > > > > >> >> > > > > complete-commit > > > > > > > > > > >> >> > > > > >> > while the current state is not > > > > > prepare-commit). > > > > > > > This > > > > > > > > > > error > > > > > > > > > > >> >> could > > > > > > > > > > >> >> > > > > >> indicates > > > > > > > > > > >> >> > > > > >> > a bug on the client internal code or > the > > > > > broker > > > > > > > > code, > > > > > > > > > > OR a > > > > > > > > > > >> >> user > > > > > > > > > > >> >> > > > error > > > > > > > > > > >> >> > > > > >> --- a > > > > > > > > > > >> >> > > > > >> > similar error is > > > > > > ConcurrentTransactionsException, > > > > > > > > i.e. > > > > > > > > > > if > > > > > > > > > > >> >> Kafka > > > > > > > > > > >> >> > is > > > > > > > > > > >> >> > > > > >> bug-free > > > > > > > > > > >> >> > > > > >> > these exceptions would only be > returned > > if > > > > > users > > > > > > > > try to > > > > > > > > > > do > > > > > > > > > > >> >> > > something > > > > > > > > > > >> >> > > > > >> wrong, > > > > > > > > > > >> >> > > > > >> > e.g. calling abortTxn right after a > > > > commitTxn, > > > > > > > etc. > > > > > > > > So > > > > > > > > > > I'm > > > > > > > > > > >> >> > > thinking > > > > > > > > > > >> >> > > > it > > > > > > > > > > >> >> > > > > >> > should be a non-fatal error instead > of a > > > > fatal > > > > > > > > error, > > > > > > > > > > >> wdyt? > > > > > > > > > > >> >> > > > > >> > > > > > > > > > > > >> >> > > > > >> > 3) *KafkaException*: case i "indicates > > > fatal > > > > > > > > > > transactional > > > > > > > > > > >> >> > > sequence > > > > > > > > > > >> >> > > > > >> > (Fatal)", this is a bit conflicting > with > > > the > > > > > > > > > > >> >> > > > *OutOfSequenceException* > > > > > > > > > > >> >> > > > > >> that > > > > > > > > > > >> >> > > > > >> > is treated as non-fatal. I guess your > > > > proposal > > > > > > is > > > > > > > > that > > > > > > > > > > >> >> > > > > >> > OutOfOrderSequenceException would be > > > treated > > > > > > > either > > > > > > > > as > > > > > > > > > > >> fatal > > > > > > > > > > >> >> > with > > > > > > > > > > >> >> > > > > >> > transactional producer, or non-fatal > > with > > > > > > > idempotent > > > > > > > > > > >> >> producer, > > > > > > > > > > >> >> > is > > > > > > > > > > >> >> > > > that > > > > > > > > > > >> >> > > > > >> > right? If the producer is only > > configured > > > > with > > > > > > > > > > idempotency > > > > > > > > > > >> >> but > > > > > > > > > > >> >> > not > > > > > > > > > > >> >> > > > > >> > transaction, then throwing a > > > > > > > > > > >> >> TransactionStateCorruptedException > > > > > > > > > > >> >> > > for > > > > > > > > > > >> >> > > > > >> > non-fatal errors would be confusing > > since > > > > > users > > > > > > > are > > > > > > > > not > > > > > > > > > > >> using > > > > > > > > > > >> >> > > > > >> transactions > > > > > > > > > > >> >> > > > > >> > at all.. So I suggest we always throw > > > > > > > > > > >> OutOfSequenceException > > > > > > > > > > >> >> > as-is > > > > > > > > > > >> >> > > > > (i.e. > > > > > > > > > > >> >> > > > > >> > not wrapped) no matter how the > producer > > is > > > > > > > > configured, > > > > > > > > > > and > > > > > > > > > > >> >> let > > > > > > > > > > >> >> > the > > > > > > > > > > >> >> > > > > >> caller > > > > > > > > > > >> >> > > > > >> > decide how to handle it based on > whether > > > it > > > > is > > > > > > > only > > > > > > > > > > >> >> idempotent > > > > > > > > > > >> >> > or > > > > > > > > > > >> >> > > > > >> > transactional itself. > > > > > > > > > > >> >> > > > > >> > > > > > > > > > > > >> >> > > > > >> > 4) Besides all the txn APIs, the > > `send()` > > > > > > > callback / > > > > > > > > > > >> future > > > > > > > > > > >> >> can > > > > > > > > > > >> >> > > also > > > > > > > > > > >> >> > > > > >> throw > > > > > > > > > > >> >> > > > > >> > txn-related exceptions, I think this > KIP > > > > > should > > > > > > > also > > > > > > > > > > cover > > > > > > > > > > >> >> this > > > > > > > > > > >> >> > > API > > > > > > > > > > >> >> > > > as > > > > > > > > > > >> >> > > > > >> well? > > > > > > > > > > >> >> > > > > >> > > > > > > > > > > > >> >> > > > > >> > 5) This is related to 1/2) above: > > > sometimes > > > > > > those > > > > > > > > > > >> non-fatal > > > > > > > > > > >> >> > errors > > > > > > > > > > >> >> > > > > like > > > > > > > > > > >> >> > > > > >> > ConcurrentTxn or InvalidTxnState are > not > > > due > > > > > to > > > > > > > the > > > > > > > > > > state > > > > > > > > > > >> >> being > > > > > > > > > > >> >> > > > > >> corrupted > > > > > > > > > > >> >> > > > > >> > at the broker side, but maybe users > are > > > > doing > > > > > > > > something > > > > > > > > > > >> >> wrong. > > > > > > > > > > >> >> > So > > > > > > > > > > >> >> > > > I'm > > > > > > > > > > >> >> > > > > >> > wondering if we should further > > distinguish > > > > > those > > > > > > > > > > non-fatal > > > > > > > > > > >> >> > errors > > > > > > > > > > >> >> > > > > >> between > > > > > > > > > > >> >> > > > > >> > a) those that are caused by Kafka > > itself, > > > > > e.g. a > > > > > > > > broker > > > > > > > > > > >> timed > > > > > > > > > > >> >> > out > > > > > > > > > > >> >> > > > and > > > > > > > > > > >> >> > > > > >> > aborted a txn and later an endTxn > > request > > > is > > > > > > > > received, > > > > > > > > > > >> and b) > > > > > > > > > > >> >> > the > > > > > > > > > > >> >> > > > > user's > > > > > > > > > > >> >> > > > > >> > API call pattern is incorrect, causing > > the > > > > > > request > > > > > > > > to be > > > > > > > > > > >> >> > rejected > > > > > > > > > > >> >> > > > with > > > > > > > > > > >> >> > > > > >> an > > > > > > > > > > >> >> > > > > >> > error code from the broker. > > > > > > > > > > >> >> *TransactionStateCorruptedException* > > > > > > > > > > >> >> > > > feels > > > > > > > > > > >> >> > > > > >> to > > > > > > > > > > >> >> > > > > >> > me more like for case a), but not case > > b). > > > > > > > > > > >> >> > > > > >> > > > > > > > > > > > >> >> > > > > >> > > > > > > > > > > > >> >> > > > > >> > Guozhang > > > > > > > > > > >> >> > > > > >> > > > > > > > > > > > >> >> > > > > >> > > > > > > > > > > > >> >> > > > > >> > On Wed, Dec 2, 2020 at 4:50 PM Boyang > > > Chen < > > > > > > > > > > >> >> > > > > reluctanthero...@gmail.com> > > > > > > > > > > >> >> > > > > >> > wrote: > > > > > > > > > > >> >> > > > > >> > > > > > > > > > > > >> >> > > > > >> >> Thanks Matthias, I think your > proposal > > > > makes > > > > > > > sense > > > > > > > > as > > > > > > > > > > >> well, > > > > > > > > > > >> >> on > > > > > > > > > > >> >> > > the > > > > > > > > > > >> >> > > > > pro > > > > > > > > > > >> >> > > > > >> side > > > > > > > > > > >> >> > > > > >> >> we could have a universally agreed > > > > exception > > > > > > type > > > > > > > > to be > > > > > > > > > > >> >> caught > > > > > > > > > > >> >> > by > > > > > > > > > > >> >> > > > the > > > > > > > > > > >> >> > > > > >> user, > > > > > > > > > > >> >> > > > > >> >> without having an extra layer on top > of > > > the > > > > > > > actual > > > > > > > > > > >> >> exceptions. > > > > > > > > > > >> >> > I > > > > > > > > > > >> >> > > > > could > > > > > > > > > > >> >> > > > > >> see > > > > > > > > > > >> >> > > > > >> >> some issue on downsides: > > > > > > > > > > >> >> > > > > >> >> > > > > > > > > > > >> >> > > > > >> >> 1. The exception hierarchy will be > more > > > > > > complex. > > > > > > > > Right > > > > > > > > > > >> now > > > > > > > > > > >> >> we > > > > > > > > > > >> >> > > have > > > > > > > > > > >> >> > > > > >> >> RetriableException type, if we are > > going > > > to > > > > > > add a > > > > > > > > > > >> >> > > > > >> >> `ProducerRetriableException` type, we > > > have > > > > to > > > > > > put > > > > > > > > this > > > > > > > > > > >> new > > > > > > > > > > >> >> > > > interface > > > > > > > > > > >> >> > > > > >> as the > > > > > > > > > > >> >> > > > > >> >> parent of the RetriableException, > > because > > > > not > > > > > > all > > > > > > > > > > thrown > > > > > > > > > > >> >> > > non-fatal > > > > > > > > > > >> >> > > > > >> >> exceptions are `retriable` in general > > for > > > > > > > > producer, for > > > > > > > > > > >> >> example > > > > > > > > > > >> >> > > > > >> >> < > > > > > > > > > > >> >> > > > > >> >> > > > > > > > > > > >> >> > > > > >> > > > > > > > > > > >> >> > > > > > > > > > > > > > > >> >> > > > > > > > > > > > > > >> >> > > > > > > > > > > > > >> >> > > > > > > > > > > > >> >> > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://github.com/apache/kafka/blob/e275742f850af4a1b79b0d1bd1ac9a1d2e89c64e/clients/src/main/java/org/apache/kafka/clients/producer/internals/Sender.java#L745 > > > > > > > > > > >> >> > > > > >> >>> . > > > > > > > > > > >> >> > > > > >> >> We could have an empty interface > > > > > > > > > > >> >> `ProducerRetriableException` > > > > > > > > > > >> >> > to > > > > > > > > > > >> >> > > > let > > > > > > > > > > >> >> > > > > >> all > > > > > > > > > > >> >> > > > > >> >> the thrown exceptions implement for > > sure, > > > > but > > > > > > > it's > > > > > > > > a > > > > > > > > > > bit > > > > > > > > > > >> >> > > unorthodox > > > > > > > > > > >> >> > > > > in > > > > > > > > > > >> >> > > > > >> the > > > > > > > > > > >> >> > > > > >> >> way we deal with exceptions in > general. > > > > > > > > > > >> >> > > > > >> >> > > > > > > > > > > >> >> > > > > >> >> 2. There are cases where we throw a > > > > > > > KafkaException > > > > > > > > > > >> wrapping > > > > > > > > > > >> >> > > another > > > > > > > > > > >> >> > > > > >> >> KafkaException as either fatal or > > > > non-fatal. > > > > > If > > > > > > > we > > > > > > > > use > > > > > > > > > > an > > > > > > > > > > >> >> > > interface > > > > > > > > > > >> >> > > > > to > > > > > > > > > > >> >> > > > > >> >> solve #1, it is also required to > > > implement > > > > > > > another > > > > > > > > > > >> bloated > > > > > > > > > > >> >> > > > exception > > > > > > > > > > >> >> > > > > >> class > > > > > > > > > > >> >> > > > > >> >> which could replace KafkaException > > type, > > > as > > > > > we > > > > > > > > couldn't > > > > > > > > > > >> mark > > > > > > > > > > >> >> > > > > >> KafkaException > > > > > > > > > > >> >> > > > > >> >> as retriable for sure. > > > > > > > > > > >> >> > > > > >> >> > > > > > > > > > > >> >> > > > > >> >> 3. In terms of the encapsulation, > > > wrapping > > > > > > means > > > > > > > we > > > > > > > > > > could > > > > > > > > > > >> >> limit > > > > > > > > > > >> >> > > the > > > > > > > > > > >> >> > > > > >> scope > > > > > > > > > > >> >> > > > > >> >> of affection to the producer only, > > which > > > is > > > > > > > > important > > > > > > > > > > >> since > > > > > > > > > > >> >> we > > > > > > > > > > >> >> > > > don't > > > > > > > > > > >> >> > > > > >> want > > > > > > > > > > >> >> > > > > >> >> shared exception types to implement a > > > > > > > > producer-related > > > > > > > > > > >> >> > interface, > > > > > > > > > > >> >> > > > > such > > > > > > > > > > >> >> > > > > >> >> as UnknownTopicOrPartitionException. > > > > > > > > > > >> >> > > > > >> >> > > > > > > > > > > >> >> > > > > >> >> Best, > > > > > > > > > > >> >> > > > > >> >> Boyang > > > > > > > > > > >> >> > > > > >> >> > > > > > > > > > > >> >> > > > > >> >> On Wed, Dec 2, 2020 at 3:38 PM > Matthias > > > J. > > > > > Sax > > > > > > < > > > > > > > > > > >> >> > mj...@apache.org > > > > > > > > > > >> >> > > > > > > > > > > > > > >> >> > > > > >> wrote: > > > > > > > > > > >> >> > > > > >> >> > > > > > > > > > > >> >> > > > > >> >>> Thanks for the KIP Boyang! > > > > > > > > > > >> >> > > > > >> >>> > > > > > > > > > > >> >> > > > > >> >>> Overall, categorizing exceptions > > makes a > > > > lot > > > > > > of > > > > > > > > sense. > > > > > > > > > > >> As I > > > > > > > > > > >> >> > > don't > > > > > > > > > > >> >> > > > > know > > > > > > > > > > >> >> > > > > >> >>> the producer internals well enough, > I > > > > cannot > > > > > > > > comment > > > > > > > > > > on > > > > > > > > > > >> the > > > > > > > > > > >> >> > > > > >> >>> categorization in detail though. > > > > > > > > > > >> >> > > > > >> >>> > > > > > > > > > > >> >> > > > > >> >>> What I am wondering is, if we should > > > > > introduce > > > > > > > an > > > > > > > > > > >> exception > > > > > > > > > > >> >> > > > > interface > > > > > > > > > > >> >> > > > > >> >>> that non-fatal exception would > > implement > > > > > > instead > > > > > > > > of > > > > > > > > > > >> >> creating a > > > > > > > > > > >> >> > > new > > > > > > > > > > >> >> > > > > >> class > > > > > > > > > > >> >> > > > > >> >>> that will wrap non-fatal exceptions? > > > What > > > > > > would > > > > > > > > be the > > > > > > > > > > >> >> > pros/cons > > > > > > > > > > >> >> > > > for > > > > > > > > > > >> >> > > > > >> >>> both designs? > > > > > > > > > > >> >> > > > > >> >>> > > > > > > > > > > >> >> > > > > >> >>> > > > > > > > > > > >> >> > > > > >> >>> -Matthias > > > > > > > > > > >> >> > > > > >> >>> > > > > > > > > > > >> >> > > > > >> >>> > > > > > > > > > > >> >> > > > > >> >>> On 12/2/20 11:35 AM, Boyang Chen > > wrote: > > > > > > > > > > >> >> > > > > >> >>>> Hey there, > > > > > > > > > > >> >> > > > > >> >>>> > > > > > > > > > > >> >> > > > > >> >>>> I would like to start a discussion > > > thread > > > > > for > > > > > > > > > > KIP-691: > > > > > > > > > > >> >> > > > > >> >>>> > > > > > > > > > > >> >> > > > > >> >>> > > > > > > > > > > >> >> > > > > >> >> > > > > > > > > > > >> >> > > > > >> > > > > > > > > > > >> >> > > > > > > > > > > > > > > >> >> > > > > > > > > > > > > > >> >> > > > > > > > > > > > > >> >> > > > > > > > > > > > >> >> > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-691%3A+Enhance+Transactional+Producer+Exception+Handling > > > > > > > > > > >> >> > > > > >> >>>> > > > > > > > > > > >> >> > > > > >> >>>> The KIP is aiming to simplify the > > > > exception > > > > > > > > handling > > > > > > > > > > >> logic > > > > > > > > > > >> >> > for > > > > > > > > > > >> >> > > > > >> >>>> transactional Producer users by > > > > classifying > > > > > > > > fatal and > > > > > > > > > > >> >> > non-fatal > > > > > > > > > > >> >> > > > > >> >>> exceptions > > > > > > > > > > >> >> > > > > >> >>>> and throw them correspondingly for > > > easier > > > > > > catch > > > > > > > > and > > > > > > > > > > >> retry. > > > > > > > > > > >> >> > Let > > > > > > > > > > >> >> > > me > > > > > > > > > > >> >> > > > > >> know > > > > > > > > > > >> >> > > > > >> >>> what > > > > > > > > > > >> >> > > > > >> >>>> you think. > > > > > > > > > > >> >> > > > > >> >>>> > > > > > > > > > > >> >> > > > > >> >>>> Best, > > > > > > > > > > >> >> > > > > >> >>>> Boyang > > > > > > > > > > >> >> > > > > >> >>>> > > > > > > > > > > >> >> > > > > >> >>> > > > > > > > > > > >> >> > > > > >> >> > > > > > > > > > > >> >> > > > > >> > > > > > > > > > > > >> >> > > > > >> > > > > > > > > > > > >> >> > > > > >> > > > > > > > > > > >> >> > > > > > > > > > > > > > > > >> >> > > > > > > > > > > > > > > >> >> > > > > > > > > > > > > > >> >> > > > > > > > > > > > > > >> >> > > > -- > > > > > > > > > > >> >> > > > -- Guozhang > > > > > > > > > > >> >> > > > > > > > > > > > > > >> >> > > > > > > > > > > > > >> >> > > > > > > > > > > > >> >> > > > > > > > > > > > >> >> > -- > > > > > > > > > > >> >> > -- Guozhang > > > > > > > > > > >> >> > > > > > > > > > > > >> >> > > > > > > > > > > >> > > > > > > > > > > > >> > > > > > > > > > > > >> > -- > > > > > > > > > > >> > -- Guozhang > > > > > > > > > > >> > > > > > > > > > > > >> > > > > > > > > > > >> > > > > > > > > > > >> -- > > > > > > > > > > >> -- Guozhang > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > -- Guozhang > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > -- Guozhang > > > > > > > > -- > -- Guozhang >