Hey Justine,

yes we talked about `RecordTooLargeException` as an example, but did we
ever limit ourselves to only this specific exception? I think neither in
the KIP nor in the PR.  As Chris mentioned, this KIP is going to undo what
we have done in `KAFKA-9279` in case 1) the user is in a transaction and 2)
he decides to ignore the errors in which the record was not even added to
the batch. Yes, and we suggested some methods for undoing or, in fact,
moving back the transaction from the error state in `flush` or in
`commitTnx` and we finally came to the idea of not even doing the changes
(better than undoing) in `send`.

Bests,
Alieh

On Tue, Jul 2, 2024 at 8:03 PM Justine Olshan <jols...@confluent.io.invalid>
wrote:

> Hey folks,
>
> I understand where you are coming from by asking for specific use cases. My
> understanding based on previous conversations was that there were a few
> different errors that have been seen.
> One example I heard some information about was when the record was too
> large and it fails the batch. Besides that, I'm not really sure if there
> are cases in mind, though it is fair to ask on those and bring them up.
>
> > Does a record qualify as a poison pill if it targets a topic that
> doesn't exist? Or if it targets a topic that the producer principal lacks
> ACLs for? What if it fails broker-side validation (e.g., has a null key for
> a compacted topic)?
>
> I think there was some parallel work with addressing the
> UnknownTopicOrPartitionError in another way. As for the other checks, acls,
> validation etc. I am not aware of that being in Alieh's scope, but we
> should be clear about exactly what we are doing.
>
> All errors that fall into ApiException seems too broad to me.
>
> Justine
>
> On Tue, Jul 2, 2024 at 10:51 AM Alieh Saeedi <asae...@confluent.io.invalid
> >
> wrote:
>
> > Hey Chris,
> > thanks for sharing your concerns.
> >
> > 1) About the language of KIP (or maybe later in Javadocs): Is that
> alright
> > if I write all errors that fall into the `ApiException` category thrown
> > (actually returned) by Producer?
> > 2) About future expansion: do you have any better suggestions for enum
> > names? Do you think `IGNORE_API_EXEPTIONS` or something like that is a
> > "better/more accurate" one?
> >
> > Bests,
> > Alieh
> >
> > On Tue, Jul 2, 2024 at 7:29 PM Chris Egerton <chr...@aiven.io.invalid>
> > wrote:
> >
> > > Hi Alieh and Justine,
> > >
> > > I'm concerned that we're settling on a definition of "poison pill"
> that's
> > > easiest to tackle right now but may lead to shortcomings down the
> road. I
> > > understand the relationship between this KIP and KAFKA-9279, and I can
> > > totally get behind the desire to keep things small, focused, and simple
> > in
> > > the name of avoiding bugs. However, what I don't think is clear at all
> is
> > > what the "specific circumstances" are that Justine mentioned. I had a
> > > drastically different idea of what the intended behavioral change would
> > be
> > > before looking at the draft PR.
> > >
> > > I would like 1) for us to be clearer about the categories of errors
> that
> > we
> > > want to cover with this new API (especially since we'll have to find a
> > > clear, succinct way to document this for users), and 2) to make sure
> that
> > > if we do try to expand this API in the future, that we won't be painted
> > > into a corner.
> > >
> > > For item 1, hopefully we can agree that the language in the KIP
> > > for IGNORE_SEND_ERRORS ("The records causing irrecoverable errors are
> > > excluded from the batch and the transaction is committed
> successfully.")
> > is
> > > pretty vague. If we start using the phrase "poison pill record" that
> > could
> > > help, but IMO more detail would still be needed. We know that we want
> to
> > > include records that are so large that they can be immediately rejected
> > by
> > > the producer. But there are other cases that users might expect to be
> > > handled. Does a record qualify as a poison pill if it targets a topic
> > that
> > > doesn't exist? Or if it targets a topic that the producer principal
> lacks
> > > ACLs for? What if it fails broker-side validation (e.g., has a null key
> > for
> > > a compacted topic)?
> > >
> > > For item 2, this really depends on how narrow the scope of what we're
> > doing
> > > right now is. If we only handle a subset of the examples I laid out
> above
> > > that could possibly be considered poison pills with this KIP, do we
> want
> > to
> > > lock ourselves in to never addressing more in the future, or can we
> > choose
> > > an API (probably just enum names would be the only important decision
> > here)
> > > that leaves room for more later?
> > >
> > > Best,
> > >
> > > Chris
> > >
> > >
> > >
> > > On Tue, Jul 2, 2024 at 12:28 PM Justine Olshan
> > > <jols...@confluent.io.invalid>
> > > wrote:
> > >
> > > > Chris and Alieh,
> > > >
> > > > My understanding is that this KIP is really only trying to solve an
> > issue
> > > > of a "poison pill" record that fails send().
> > > > We've talked a lot about having a generic framework for all errors,
> > but I
> > > > don't think that is what this KIP is trying to do. Essentially the
> > > request
> > > > is to undo the change from KAFKA-9279
> > > > <https://issues.apache.org/jira/browse/KAFKA-9279> but under
> specific
> > > > circumstances that are controlled. I really am concerned about
> opening
> > > new
> > > > avenues for bugs with EOS and hesitate to handle any other types of
> > > errors.
> > > > I think if we all agree on the problem that we are trying to solve,
> it
> > is
> > > > easier to agree on solutions.
> > > >
> > > > Justine
> > > >
> > > > On Mon, Jul 1, 2024 at 2:20 AM Alieh Saeedi
> > <asae...@confluent.io.invalid
> > > >
> > > > wrote:
> > > >
> > > > > Hi Matthias,
> > > > > Thanks for the valid points you mentioned. I updated the KIP and
> the
> > PR
> > > > > with:
> > > > > 1) mentioning that the new overloaded `send` throws
> > > > `IllegalStateException`
> > > > > if the user tries to ignore `send()` errors outside of a
> transaction.
> > > > > 2) the default implementation in `Producer` interface throws an
> > > > > `UnsupportedOperationException`
> > > > >
> > > > > Hi Chris,
> > > > > Thanks for the feedback. I tried to clarify the points you listed:
> > > > > -------> we've narrowed the scope from any error that might take
> > place
> > > > with
> > > > > producing a record to Kafka, to only the ones that are thrown
> > directly
> > > > from
> > > > > Producer::send;
> > > > >
> > > > > From the very beginning and even since KIP-1038, the main purpose
> was
> > > to
> > > > > have "more flexibility," or, in other words, "giving the user the
> > > > > authority" to handle some specific exceptions thrown from the
> > > `Producer`.
> > > > > Due to the specific cases we had in mind, KIP-1038 was discarded
> and
> > we
> > > > > decided to not define a `CustomExceptionHandler` for `Producer` and
> > > > instead
> > > > > treat the `send` failures in a different way. The main issue is
> that
> > > > `send`
> > > > > makes a transition to error state, which is undoable. In fact, one
> > > single
> > > > > poison pill record makes the whole batch fail. The former
> suggestions
> > > > that
> > > > > you agreed with have been all about un-doing this transition in
> > `flush`
> > > > or
> > > > > `commit`. The new suggestion is to un-do (or better, NOT do) in
> > `send`
> > > > due
> > > > > to the reasons listed in the discussions above.
> > > > > Moreover, I would say that having such a large scope as you
> mentioned
> > > is
> > > > > impossible. In the best case, we may have control over the
> > `Producer`.
> > > > What
> > > > > shall we do with the broker? The `any error that might take place
> > with
> > > > > producing a record to Kafka` is too much, I think.
> > > > >
> > > > > -------> is this all we want to handle, and will it prevent us from
> > > > > handling more in the future in an intuitive way?
> > > > >
> > > > > I think yes. This is all we want. Other sorts of errors such as
> > having
> > > > > problem with partition addition, producer fenced exception, etc
> seem
> > to
> > > > be
> > > > > more serious issues. The intention was to handle problems created
> by
> > > > > (maybe) a single poison pill record. BTW, I do not see any
> obstacles
> > to
> > > > > future changes.
> > > > >
> > > > > Bests,
> > > > > Alieh
> > > > >
> > > > > On Sat, Jun 29, 2024 at 3:03 AM Chris Egerton
> > <chr...@aiven.io.invalid
> > > >
> > > > > wrote:
> > > > >
> > > > > > Ah, sorry--spoke too soon. The PR doesn't show that errors thrown
> > > from
> > > > > > Producer::send are handled, but instead, ApiException instances
> > that
> > > > are
> > > > > > caught inside KafkaProducer::doSend and are handled by returning
> an
> > > > > > already-failed future are. I think the same question still
> applies
> > > (is
> > > > > this
> > > > > > all we want to handle, and will it prevent us from handling more
> in
> > > the
> > > > > > future in an intuitive way), though.
> > > > > >
> > > > > > On Fri, Jun 28, 2024 at 8:57 PM Chris Egerton <chr...@aiven.io>
> > > wrote:
> > > > > >
> > > > > > > Hi Alieh,
> > > > > > >
> > > > > > > This KIP has evolved a lot since I last looked at it, but the
> > > changes
> > > > > > seem
> > > > > > > well thought-out both in semantics and API. One clarifying
> > > question I
> > > > > > have
> > > > > > > is that it looks based on the draft PR that we've narrowed the
> > > scope
> > > > > from
> > > > > > > any error that might take place with producing a record to
> Kafka,
> > > to
> > > > > only
> > > > > > > the ones that are thrown directly from Producer::send; is that
> > the
> > > > > > intended
> > > > > > > behavior here? And if so, do you have thoughts on how we might
> > > > design a
> > > > > > > follow-up KIP that would catch all errors (including ones
> > reported
> > > > > > > asynchronously instead of synchronously)? I'd like it if we
> could
> > > > leave
> > > > > > the
> > > > > > > door open for that without painting ourselves into too much of
> a
> > > > corner
> > > > > > > with the API design for this KIP.
> > > > > > >
> > > > > > > Cheers,
> > > > > > >
> > > > > > > Chris
> > > > > > >
> > > > > > > On Fri, Jun 28, 2024 at 6:31 PM Matthias J. Sax <
> > mj...@apache.org>
> > > > > > wrote:
> > > > > > >
> > > > > > >> Thanks Alieh,
> > > > > > >>
> > > > > > >> it seems this KIP can just pick between a couple of tradeoffs.
> > > > Adding
> > > > > an
> > > > > > >> overloaded `send()` as the KIP propose makes sense to me and
> > seems
> > > > to
> > > > > > >> provides the cleanest solution compare to there options we
> > > > discussed.
> > > > > > >>
> > > > > > >> Given the explicit name of the passed-in option that
> highlights
> > > that
> > > > > the
> > > > > > >> option is for TX only make is pretty clear and avoids the
> issue
> > of
> > > > > > >> `flush()` ambiguity.
> > > > > > >>
> > > > > > >>
> > > > > > >> Nit: We should make clear on the KIP though, that the new
> > `send()`
> > > > > > >> overload would throw an `IllegalStateException` if TX are not
> > used
> > > > > > >> (similar to other TX methods like initTx(), etc)
> > > > > > >>
> > > > > > >>
> > > > > > >> About the `Producer` interface, I am not sure how this was
> done
> > in
> > > > the
> > > > > > >> past (eg, KIP-266 added `Consumer.poll(Duration)` w/o a
> default
> > > > > > >> implementation), if we need a default implementation for
> > backward
> > > > > > >> compatibility or not? If we do want to add one, I think it
> would
> > > be
> > > > > > >> appropriate to throw an `UnsupportedOperationException` by
> > > default,
> > > > > > >> instead of just keeping the default impl empty?
> > > > > > >>
> > > > > > >>
> > > > > > >> My points are rather minor, and should not block this KIP
> > though.
> > > > > > >> Overall LGTM.
> > > > > > >>
> > > > > > >>
> > > > > > >>
> > > > > > >> -Matthias
> > > > > > >>
> > > > > > >> On 6/27/24 1:28 PM, Alieh Saeedi wrote:
> > > > > > >> > Hi Justine,
> > > > > > >> >
> > > > > > >> > Thanks for the suggestion.
> > > > > > >> > Making applications to validate every single record is not
> the
> > > > best
> > > > > > way,
> > > > > > >> > from an efficiency point of view.
> > > > > > >> > Moreover, between changing the behavior of the Producer in
> > > `send`
> > > > > and
> > > > > > >> > `commitTnx`, the former seems more reasonable and clean.
> > > > > > >> >
> > > > > > >> > Bests,
> > > > > > >> > Alieh
> > > > > > >> >
> > > > > > >> > On Thu, Jun 27, 2024 at 8:14 PM Justine Olshan
> > > > > > >> <jols...@confluent.io.invalid>
> > > > > > >> > wrote:
> > > > > > >> >
> > > > > > >> >> Hey Alieh,
> > > > > > >> >>
> > > > > > >> >> I see there are two options now. So folks will be
> discussing
> > > the
> > > > > > >> approaches
> > > > > > >> >> and deciding the best way forward before we vote?
> > > > > > >> >> I do think there could be a problem with the approach on
> > commit
> > > > if
> > > > > we
> > > > > > >> get
> > > > > > >> >> stuck on an earlier error and have more records
> (potentially
> > on
> > > > new
> > > > > > >> >> partitions) to commit as the current PR is implemented.
> > > > > > >> >>
> > > > > > >> >> I guess this takes us back to the question of whether the
> > error
> > > > > > should
> > > > > > >> be
> > > > > > >> >> cleared on send.
> > > > > > >> >>
> > > > > > >> >> (And I guess at the back of my mind, I'm wondering if there
> > is
> > > a
> > > > > way
> > > > > > >> we can
> > > > > > >> >> validate the "posion pill" records application side before
> we
> > > > even
> > > > > > try
> > > > > > >> to
> > > > > > >> >> send them)
> > > > > > >> >>
> > > > > > >> >> Justine
> > > > > > >> >>
> > > > > > >> >> On Wed, Jun 26, 2024 at 4:38 PM Alieh Saeedi
> > > > > > >> <asae...@confluent.io.invalid
> > > > > > >> >>>
> > > > > > >> >> wrote:
> > > > > > >> >>
> > > > > > >> >>> Hi Justine,
> > > > > > >> >>>
> > > > > > >> >>> I did not update the KIP with `TxnSendOption` since I
> > thought
> > > > it'd
> > > > > > be
> > > > > > >> >>> better discussed here beforehand.
> > > > > > >> >>> right now, there are 2 PRs:
> > > > > > >> >>> - the PR that implements the current version of the KIP:
> > > > > > >> >>> https://github.com/apache/kafka/pull/16332
> > > > > > >> >>> - the POC PR that clarifies the `TxnSendOption`:
> > > > > > >> >>> https://github.com/apache/kafka/pull/16465
> > > > > > >> >>>
> > > > > > >> >>> Bests,
> > > > > > >> >>> Alieh
> > > > > > >> >>>
> > > > > > >> >>> On Thu, Jun 27, 2024 at 12:42 AM Justine Olshan
> > > > > > >> >>> <jols...@confluent.io.invalid> wrote:
> > > > > > >> >>>
> > > > > > >> >>>> Hey Alieh,
> > > > > > >> >>>>
> > > > > > >> >>>> I think I am a little confused. Are the 3 points above
> > > > addressed
> > > > > by
> > > > > > >> the
> > > > > > >> >>> KIP
> > > > > > >> >>>> or did something change? The PR seems to not include this
> > > > change
> > > > > > and
> > > > > > >> >>> still
> > > > > > >> >>>> has the CommitOption as well.
> > > > > > >> >>>>
> > > > > > >> >>>> Thanks,
> > > > > > >> >>>> Justine
> > > > > > >> >>>>
> > > > > > >> >>>> On Wed, Jun 26, 2024 at 2:15 PM Alieh Saeedi
> > > > > > >> >>> <asae...@confluent.io.invalid
> > > > > > >> >>>>>
> > > > > > >> >>>> wrote:
> > > > > > >> >>>>
> > > > > > >> >>>>> Hi all,
> > > > > > >> >>>>>
> > > > > > >> >>>>>
> > > > > > >> >>>>> Looking at the PR <
> > > https://github.com/apache/kafka/pull/16332
> > > > >
> > > > > > >> >>>>> corresponding to the KIP, there are some points worthy
> of
> > > > > mention:
> > > > > > >> >>>>>
> > > > > > >> >>>>>
> > > > > > >> >>>>> 1) clearing the error ends up dirty/messy code in
> > > > > > >> >> `TransactionManager`.
> > > > > > >> >>>>>
> > > > > > >> >>>>> 2) By clearing the error, we are actually doing an
> illegal
> > > > > > >> transition
> > > > > > >> >>>> from
> > > > > > >> >>>>> `ABORTABLE_ERROR` to `IN_TRANSACTION` which is
> > conceptually
> > > > not
> > > > > > >> >>>> acceptable.
> > > > > > >> >>>>> This can be the root cause of some issues, with perhaps
> > > > further
> > > > > > >> >> future
> > > > > > >> >>>>> changes by others.
> > > > > > >> >>>>>
> > > > > > >> >>>>> 3) If the poison pill record `r1` causes a transition to
> > the
> > > > > error
> > > > > > >> >>> state
> > > > > > >> >>>>> and then the next record `r2` requires adding a
> partition
> > to
> > > > the
> > > > > > >> >>>>> transaction, the action fails due to being in the error
> > > state.
> > > > > In
> > > > > > >> >> this
> > > > > > >> >>>>> case, clearing errors during
> `commitTnx(CLEAR_SEND_ERROR)`
> > > is
> > > > > too
> > > > > > >> >> late.
> > > > > > >> >>>>> However, this case can NOT be the main concern as soon
> as
> > > > > KIP-890
> > > > > > is
> > > > > > >> >>>> fully
> > > > > > >> >>>>> implemented.
> > > > > > >> >>>>>
> > > > > > >> >>>>>
> > > > > > >> >>>>> My suggestion is to solve the problem where it arises.
> If
> > > the
> > > > > > >> >>> transition
> > > > > > >> >>>> to
> > > > > > >> >>>>> the error state does not happen during `send()`, we do
> not
> > > > need
> > > > > to
> > > > > > >> >>> clear
> > > > > > >> >>>>> the error later. Therefore, instead of `CommitOption`,
> we
> > > can
> > > > > > define
> > > > > > >> >> a
> > > > > > >> >>>>> `TxnSendOption` and prevent the `send()` method from
> going
> > > to
> > > > > the
> > > > > > >> >> error
> > > > > > >> >>>>> state in case 1) we're in a transaction and 2) the user
> > > asked
> > > > > for
> > > > > > >> >>>>> IGONRE_SEND_ERRORS. For more clarity, you can take a
> look
> > at
> > > > the
> > > > > > POC
> > > > > > >> >> PR
> > > > > > >> >>>>> <https://github.com/apache/kafka/pull/16465>.
> > > > > > >> >>>>>
> > > > > > >> >>>>> Cheers,
> > > > > > >> >>>>> Alieh
> > > > > > >> >>>>>
> > > > > > >> >>>>
> > > > > > >> >>>
> > > > > > >> >>
> > > > > > >> >
> > > > > > >>
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Reply via email to