Chris and Alieh,

My understanding is that this KIP is really only trying to solve an issue
of a "poison pill" record that fails send().
We've talked a lot about having a generic framework for all errors, but I
don't think that is what this KIP is trying to do. Essentially the request
is to undo the change from KAFKA-9279
<https://issues.apache.org/jira/browse/KAFKA-9279> but under specific
circumstances that are controlled. I really am concerned about opening new
avenues for bugs with EOS and hesitate to handle any other types of errors.
I think if we all agree on the problem that we are trying to solve, it is
easier to agree on solutions.

Justine

On Mon, Jul 1, 2024 at 2:20 AM Alieh Saeedi <asae...@confluent.io.invalid>
wrote:

> Hi Matthias,
> Thanks for the valid points you mentioned. I updated the KIP and the PR
> with:
> 1) mentioning that the new overloaded `send` throws `IllegalStateException`
> if the user tries to ignore `send()` errors outside of a transaction.
> 2) the default implementation in `Producer` interface throws an
> `UnsupportedOperationException`
>
> Hi Chris,
> Thanks for the feedback. I tried to clarify the points you listed:
> -------> we've narrowed the scope from any error that might take place with
> producing a record to Kafka, to only the ones that are thrown directly from
> Producer::send;
>
> From the very beginning and even since KIP-1038, the main purpose was to
> have "more flexibility," or, in other words, "giving the user the
> authority" to handle some specific exceptions thrown from the `Producer`.
> Due to the specific cases we had in mind, KIP-1038 was discarded and we
> decided to not define a `CustomExceptionHandler` for `Producer` and instead
> treat the `send` failures in a different way. The main issue is that `send`
> makes a transition to error state, which is undoable. In fact, one single
> poison pill record makes the whole batch fail. The former suggestions that
> you agreed with have been all about un-doing this transition in `flush` or
> `commit`. The new suggestion is to un-do (or better, NOT do) in `send` due
> to the reasons listed in the discussions above.
> Moreover, I would say that having such a large scope as you mentioned is
> impossible. In the best case, we may have control over the `Producer`. What
> shall we do with the broker? The `any error that might take place with
> producing a record to Kafka` is too much, I think.
>
> -------> is this all we want to handle, and will it prevent us from
> handling more in the future in an intuitive way?
>
> I think yes. This is all we want. Other sorts of errors such as having
> problem with partition addition, producer fenced exception, etc seem to be
> more serious issues. The intention was to handle problems created by
> (maybe) a single poison pill record. BTW, I do not see any obstacles to
> future changes.
>
> Bests,
> Alieh
>
> On Sat, Jun 29, 2024 at 3:03 AM Chris Egerton <chr...@aiven.io.invalid>
> wrote:
>
> > Ah, sorry--spoke too soon. The PR doesn't show that errors thrown from
> > Producer::send are handled, but instead, ApiException instances that are
> > caught inside KafkaProducer::doSend and are handled by returning an
> > already-failed future are. I think the same question still applies (is
> this
> > all we want to handle, and will it prevent us from handling more in the
> > future in an intuitive way), though.
> >
> > On Fri, Jun 28, 2024 at 8:57 PM Chris Egerton <chr...@aiven.io> wrote:
> >
> > > Hi Alieh,
> > >
> > > This KIP has evolved a lot since I last looked at it, but the changes
> > seem
> > > well thought-out both in semantics and API. One clarifying question I
> > have
> > > is that it looks based on the draft PR that we've narrowed the scope
> from
> > > any error that might take place with producing a record to Kafka, to
> only
> > > the ones that are thrown directly from Producer::send; is that the
> > intended
> > > behavior here? And if so, do you have thoughts on how we might design a
> > > follow-up KIP that would catch all errors (including ones reported
> > > asynchronously instead of synchronously)? I'd like it if we could leave
> > the
> > > door open for that without painting ourselves into too much of a corner
> > > with the API design for this KIP.
> > >
> > > Cheers,
> > >
> > > Chris
> > >
> > > On Fri, Jun 28, 2024 at 6:31 PM Matthias J. Sax <mj...@apache.org>
> > wrote:
> > >
> > >> Thanks Alieh,
> > >>
> > >> it seems this KIP can just pick between a couple of tradeoffs. Adding
> an
> > >> overloaded `send()` as the KIP propose makes sense to me and seems to
> > >> provides the cleanest solution compare to there options we discussed.
> > >>
> > >> Given the explicit name of the passed-in option that highlights that
> the
> > >> option is for TX only make is pretty clear and avoids the issue of
> > >> `flush()` ambiguity.
> > >>
> > >>
> > >> Nit: We should make clear on the KIP though, that the new `send()`
> > >> overload would throw an `IllegalStateException` if TX are not used
> > >> (similar to other TX methods like initTx(), etc)
> > >>
> > >>
> > >> About the `Producer` interface, I am not sure how this was done in the
> > >> past (eg, KIP-266 added `Consumer.poll(Duration)` w/o a default
> > >> implementation), if we need a default implementation for backward
> > >> compatibility or not? If we do want to add one, I think it would be
> > >> appropriate to throw an `UnsupportedOperationException` by default,
> > >> instead of just keeping the default impl empty?
> > >>
> > >>
> > >> My points are rather minor, and should not block this KIP though.
> > >> Overall LGTM.
> > >>
> > >>
> > >>
> > >> -Matthias
> > >>
> > >> On 6/27/24 1:28 PM, Alieh Saeedi wrote:
> > >> > Hi Justine,
> > >> >
> > >> > Thanks for the suggestion.
> > >> > Making applications to validate every single record is not the best
> > way,
> > >> > from an efficiency point of view.
> > >> > Moreover, between changing the behavior of the Producer in `send`
> and
> > >> > `commitTnx`, the former seems more reasonable and clean.
> > >> >
> > >> > Bests,
> > >> > Alieh
> > >> >
> > >> > On Thu, Jun 27, 2024 at 8:14 PM Justine Olshan
> > >> <jols...@confluent.io.invalid>
> > >> > wrote:
> > >> >
> > >> >> Hey Alieh,
> > >> >>
> > >> >> I see there are two options now. So folks will be discussing the
> > >> approaches
> > >> >> and deciding the best way forward before we vote?
> > >> >> I do think there could be a problem with the approach on commit if
> we
> > >> get
> > >> >> stuck on an earlier error and have more records (potentially on new
> > >> >> partitions) to commit as the current PR is implemented.
> > >> >>
> > >> >> I guess this takes us back to the question of whether the error
> > should
> > >> be
> > >> >> cleared on send.
> > >> >>
> > >> >> (And I guess at the back of my mind, I'm wondering if there is a
> way
> > >> we can
> > >> >> validate the "posion pill" records application side before we even
> > try
> > >> to
> > >> >> send them)
> > >> >>
> > >> >> Justine
> > >> >>
> > >> >> On Wed, Jun 26, 2024 at 4:38 PM Alieh Saeedi
> > >> <asae...@confluent.io.invalid
> > >> >>>
> > >> >> wrote:
> > >> >>
> > >> >>> Hi Justine,
> > >> >>>
> > >> >>> I did not update the KIP with `TxnSendOption` since I thought it'd
> > be
> > >> >>> better discussed here beforehand.
> > >> >>> right now, there are 2 PRs:
> > >> >>> - the PR that implements the current version of the KIP:
> > >> >>> https://github.com/apache/kafka/pull/16332
> > >> >>> - the POC PR that clarifies the `TxnSendOption`:
> > >> >>> https://github.com/apache/kafka/pull/16465
> > >> >>>
> > >> >>> Bests,
> > >> >>> Alieh
> > >> >>>
> > >> >>> On Thu, Jun 27, 2024 at 12:42 AM Justine Olshan
> > >> >>> <jols...@confluent.io.invalid> wrote:
> > >> >>>
> > >> >>>> Hey Alieh,
> > >> >>>>
> > >> >>>> I think I am a little confused. Are the 3 points above addressed
> by
> > >> the
> > >> >>> KIP
> > >> >>>> or did something change? The PR seems to not include this change
> > and
> > >> >>> still
> > >> >>>> has the CommitOption as well.
> > >> >>>>
> > >> >>>> Thanks,
> > >> >>>> Justine
> > >> >>>>
> > >> >>>> On Wed, Jun 26, 2024 at 2:15 PM Alieh Saeedi
> > >> >>> <asae...@confluent.io.invalid
> > >> >>>>>
> > >> >>>> wrote:
> > >> >>>>
> > >> >>>>> Hi all,
> > >> >>>>>
> > >> >>>>>
> > >> >>>>> Looking at the PR <https://github.com/apache/kafka/pull/16332>
> > >> >>>>> corresponding to the KIP, there are some points worthy of
> mention:
> > >> >>>>>
> > >> >>>>>
> > >> >>>>> 1) clearing the error ends up dirty/messy code in
> > >> >> `TransactionManager`.
> > >> >>>>>
> > >> >>>>> 2) By clearing the error, we are actually doing an illegal
> > >> transition
> > >> >>>> from
> > >> >>>>> `ABORTABLE_ERROR` to `IN_TRANSACTION` which is conceptually not
> > >> >>>> acceptable.
> > >> >>>>> This can be the root cause of some issues, with perhaps further
> > >> >> future
> > >> >>>>> changes by others.
> > >> >>>>>
> > >> >>>>> 3) If the poison pill record `r1` causes a transition to the
> error
> > >> >>> state
> > >> >>>>> and then the next record `r2` requires adding a partition to the
> > >> >>>>> transaction, the action fails due to being in the error state.
> In
> > >> >> this
> > >> >>>>> case, clearing errors during `commitTnx(CLEAR_SEND_ERROR)` is
> too
> > >> >> late.
> > >> >>>>> However, this case can NOT be the main concern as soon as
> KIP-890
> > is
> > >> >>>> fully
> > >> >>>>> implemented.
> > >> >>>>>
> > >> >>>>>
> > >> >>>>> My suggestion is to solve the problem where it arises. If the
> > >> >>> transition
> > >> >>>> to
> > >> >>>>> the error state does not happen during `send()`, we do not need
> to
> > >> >>> clear
> > >> >>>>> the error later. Therefore, instead of `CommitOption`, we can
> > define
> > >> >> a
> > >> >>>>> `TxnSendOption` and prevent the `send()` method from going to
> the
> > >> >> error
> > >> >>>>> state in case 1) we're in a transaction and 2) the user asked
> for
> > >> >>>>> IGONRE_SEND_ERRORS. For more clarity, you can take a look at the
> > POC
> > >> >> PR
> > >> >>>>> <https://github.com/apache/kafka/pull/16465>.
> > >> >>>>>
> > >> >>>>> Cheers,
> > >> >>>>> Alieh
> > >> >>>>>
> > >> >>>>
> > >> >>>
> > >> >>
> > >> >
> > >>
> > >
> >
>

Reply via email to