Chris and Alieh, My understanding is that this KIP is really only trying to solve an issue of a "poison pill" record that fails send(). We've talked a lot about having a generic framework for all errors, but I don't think that is what this KIP is trying to do. Essentially the request is to undo the change from KAFKA-9279 <https://issues.apache.org/jira/browse/KAFKA-9279> but under specific circumstances that are controlled. I really am concerned about opening new avenues for bugs with EOS and hesitate to handle any other types of errors. I think if we all agree on the problem that we are trying to solve, it is easier to agree on solutions.
Justine On Mon, Jul 1, 2024 at 2:20 AM Alieh Saeedi <asae...@confluent.io.invalid> wrote: > Hi Matthias, > Thanks for the valid points you mentioned. I updated the KIP and the PR > with: > 1) mentioning that the new overloaded `send` throws `IllegalStateException` > if the user tries to ignore `send()` errors outside of a transaction. > 2) the default implementation in `Producer` interface throws an > `UnsupportedOperationException` > > Hi Chris, > Thanks for the feedback. I tried to clarify the points you listed: > -------> we've narrowed the scope from any error that might take place with > producing a record to Kafka, to only the ones that are thrown directly from > Producer::send; > > From the very beginning and even since KIP-1038, the main purpose was to > have "more flexibility," or, in other words, "giving the user the > authority" to handle some specific exceptions thrown from the `Producer`. > Due to the specific cases we had in mind, KIP-1038 was discarded and we > decided to not define a `CustomExceptionHandler` for `Producer` and instead > treat the `send` failures in a different way. The main issue is that `send` > makes a transition to error state, which is undoable. In fact, one single > poison pill record makes the whole batch fail. The former suggestions that > you agreed with have been all about un-doing this transition in `flush` or > `commit`. The new suggestion is to un-do (or better, NOT do) in `send` due > to the reasons listed in the discussions above. > Moreover, I would say that having such a large scope as you mentioned is > impossible. In the best case, we may have control over the `Producer`. What > shall we do with the broker? The `any error that might take place with > producing a record to Kafka` is too much, I think. > > -------> is this all we want to handle, and will it prevent us from > handling more in the future in an intuitive way? > > I think yes. This is all we want. Other sorts of errors such as having > problem with partition addition, producer fenced exception, etc seem to be > more serious issues. The intention was to handle problems created by > (maybe) a single poison pill record. BTW, I do not see any obstacles to > future changes. > > Bests, > Alieh > > On Sat, Jun 29, 2024 at 3:03 AM Chris Egerton <chr...@aiven.io.invalid> > wrote: > > > Ah, sorry--spoke too soon. The PR doesn't show that errors thrown from > > Producer::send are handled, but instead, ApiException instances that are > > caught inside KafkaProducer::doSend and are handled by returning an > > already-failed future are. I think the same question still applies (is > this > > all we want to handle, and will it prevent us from handling more in the > > future in an intuitive way), though. > > > > On Fri, Jun 28, 2024 at 8:57 PM Chris Egerton <chr...@aiven.io> wrote: > > > > > Hi Alieh, > > > > > > This KIP has evolved a lot since I last looked at it, but the changes > > seem > > > well thought-out both in semantics and API. One clarifying question I > > have > > > is that it looks based on the draft PR that we've narrowed the scope > from > > > any error that might take place with producing a record to Kafka, to > only > > > the ones that are thrown directly from Producer::send; is that the > > intended > > > behavior here? And if so, do you have thoughts on how we might design a > > > follow-up KIP that would catch all errors (including ones reported > > > asynchronously instead of synchronously)? I'd like it if we could leave > > the > > > door open for that without painting ourselves into too much of a corner > > > with the API design for this KIP. > > > > > > Cheers, > > > > > > Chris > > > > > > On Fri, Jun 28, 2024 at 6:31 PM Matthias J. Sax <mj...@apache.org> > > wrote: > > > > > >> Thanks Alieh, > > >> > > >> it seems this KIP can just pick between a couple of tradeoffs. Adding > an > > >> overloaded `send()` as the KIP propose makes sense to me and seems to > > >> provides the cleanest solution compare to there options we discussed. > > >> > > >> Given the explicit name of the passed-in option that highlights that > the > > >> option is for TX only make is pretty clear and avoids the issue of > > >> `flush()` ambiguity. > > >> > > >> > > >> Nit: We should make clear on the KIP though, that the new `send()` > > >> overload would throw an `IllegalStateException` if TX are not used > > >> (similar to other TX methods like initTx(), etc) > > >> > > >> > > >> About the `Producer` interface, I am not sure how this was done in the > > >> past (eg, KIP-266 added `Consumer.poll(Duration)` w/o a default > > >> implementation), if we need a default implementation for backward > > >> compatibility or not? If we do want to add one, I think it would be > > >> appropriate to throw an `UnsupportedOperationException` by default, > > >> instead of just keeping the default impl empty? > > >> > > >> > > >> My points are rather minor, and should not block this KIP though. > > >> Overall LGTM. > > >> > > >> > > >> > > >> -Matthias > > >> > > >> On 6/27/24 1:28 PM, Alieh Saeedi wrote: > > >> > Hi Justine, > > >> > > > >> > Thanks for the suggestion. > > >> > Making applications to validate every single record is not the best > > way, > > >> > from an efficiency point of view. > > >> > Moreover, between changing the behavior of the Producer in `send` > and > > >> > `commitTnx`, the former seems more reasonable and clean. > > >> > > > >> > Bests, > > >> > Alieh > > >> > > > >> > On Thu, Jun 27, 2024 at 8:14 PM Justine Olshan > > >> <jols...@confluent.io.invalid> > > >> > wrote: > > >> > > > >> >> Hey Alieh, > > >> >> > > >> >> I see there are two options now. So folks will be discussing the > > >> approaches > > >> >> and deciding the best way forward before we vote? > > >> >> I do think there could be a problem with the approach on commit if > we > > >> get > > >> >> stuck on an earlier error and have more records (potentially on new > > >> >> partitions) to commit as the current PR is implemented. > > >> >> > > >> >> I guess this takes us back to the question of whether the error > > should > > >> be > > >> >> cleared on send. > > >> >> > > >> >> (And I guess at the back of my mind, I'm wondering if there is a > way > > >> we can > > >> >> validate the "posion pill" records application side before we even > > try > > >> to > > >> >> send them) > > >> >> > > >> >> Justine > > >> >> > > >> >> On Wed, Jun 26, 2024 at 4:38 PM Alieh Saeedi > > >> <asae...@confluent.io.invalid > > >> >>> > > >> >> wrote: > > >> >> > > >> >>> Hi Justine, > > >> >>> > > >> >>> I did not update the KIP with `TxnSendOption` since I thought it'd > > be > > >> >>> better discussed here beforehand. > > >> >>> right now, there are 2 PRs: > > >> >>> - the PR that implements the current version of the KIP: > > >> >>> https://github.com/apache/kafka/pull/16332 > > >> >>> - the POC PR that clarifies the `TxnSendOption`: > > >> >>> https://github.com/apache/kafka/pull/16465 > > >> >>> > > >> >>> Bests, > > >> >>> Alieh > > >> >>> > > >> >>> On Thu, Jun 27, 2024 at 12:42 AM Justine Olshan > > >> >>> <jols...@confluent.io.invalid> wrote: > > >> >>> > > >> >>>> Hey Alieh, > > >> >>>> > > >> >>>> I think I am a little confused. Are the 3 points above addressed > by > > >> the > > >> >>> KIP > > >> >>>> or did something change? The PR seems to not include this change > > and > > >> >>> still > > >> >>>> has the CommitOption as well. > > >> >>>> > > >> >>>> Thanks, > > >> >>>> Justine > > >> >>>> > > >> >>>> On Wed, Jun 26, 2024 at 2:15 PM Alieh Saeedi > > >> >>> <asae...@confluent.io.invalid > > >> >>>>> > > >> >>>> wrote: > > >> >>>> > > >> >>>>> Hi all, > > >> >>>>> > > >> >>>>> > > >> >>>>> Looking at the PR <https://github.com/apache/kafka/pull/16332> > > >> >>>>> corresponding to the KIP, there are some points worthy of > mention: > > >> >>>>> > > >> >>>>> > > >> >>>>> 1) clearing the error ends up dirty/messy code in > > >> >> `TransactionManager`. > > >> >>>>> > > >> >>>>> 2) By clearing the error, we are actually doing an illegal > > >> transition > > >> >>>> from > > >> >>>>> `ABORTABLE_ERROR` to `IN_TRANSACTION` which is conceptually not > > >> >>>> acceptable. > > >> >>>>> This can be the root cause of some issues, with perhaps further > > >> >> future > > >> >>>>> changes by others. > > >> >>>>> > > >> >>>>> 3) If the poison pill record `r1` causes a transition to the > error > > >> >>> state > > >> >>>>> and then the next record `r2` requires adding a partition to the > > >> >>>>> transaction, the action fails due to being in the error state. > In > > >> >> this > > >> >>>>> case, clearing errors during `commitTnx(CLEAR_SEND_ERROR)` is > too > > >> >> late. > > >> >>>>> However, this case can NOT be the main concern as soon as > KIP-890 > > is > > >> >>>> fully > > >> >>>>> implemented. > > >> >>>>> > > >> >>>>> > > >> >>>>> My suggestion is to solve the problem where it arises. If the > > >> >>> transition > > >> >>>> to > > >> >>>>> the error state does not happen during `send()`, we do not need > to > > >> >>> clear > > >> >>>>> the error later. Therefore, instead of `CommitOption`, we can > > define > > >> >> a > > >> >>>>> `TxnSendOption` and prevent the `send()` method from going to > the > > >> >> error > > >> >>>>> state in case 1) we're in a transaction and 2) the user asked > for > > >> >>>>> IGONRE_SEND_ERRORS. For more clarity, you can take a look at the > > POC > > >> >> PR > > >> >>>>> <https://github.com/apache/kafka/pull/16465>. > > >> >>>>> > > >> >>>>> Cheers, > > >> >>>>> Alieh > > >> >>>>> > > >> >>>> > > >> >>> > > >> >> > > >> > > > >> > > > > > >