Re: [DISCUSS] KIP-1038: Add Custom Error Handler to Producer

Alieh Saeedi Tue, 14 May 2024 06:10:10 -0700

Thanks Andrew. Done :)

@Chris: I changed the config parameter type from boolean to integer, which
defines the timeout for retrying. I thought reusing `max.block.ms` was not
reasonable as you mentioned.


So if the KIP looks good, let 's skip to the good part ;-) VOTING :)

Bests,
Alieh





On Tue, May 14, 2024 at 12:26 PM Andrew Schofield <andrew_schofi...@live.com>
wrote:

> Hi Alieh,
> Just one final comment.
>
> [AJS5] Existing classes use Retriable, not Retryable. For example:
>
> https://kafka.apache.org/21/javadoc/org/apache/kafka/common/errors/RetriableException.html
>
> I suggest RetriableResponse and NonRetriableResponse.
>
> Thanks,
> Andrew
>
> > On 13 May 2024, at 23:17, Alieh Saeedi <asae...@confluent.io.INVALID>
> wrote:
> >
> > Hi all,
> >
> >
> > Thanks for all the valid points you listed.
> >
> >
> > KIP updates and addressing concerns:
> >
> >
> > 1) The KIP now suggests two Response types: `RetryableResponse` and
> > `NonRetryableResponse`
> >
> >
> > 2) `custom.exception.handler` is changed to
> `custom.exception.handler.class`
> >
> >
> > 3) The KIP clarifies that `In the case of an implemented handler for the
> > specified exception, the handler takes precedence.`
> >
> >
> > 4)  There is now a `default` implementation for both handle() methods.
> >
> >
> > 5)  @Chris: for `UnknownTopicOrPartition`, the default is already
> retrying
> > for 60s. (In fact, the default value of `max.block.ms`). If the handler
> > instructs to FAIL or SWALLOW, there will be no retry, and if the handler
> > instructs to RETRY, that will be the default behavior, which follows the
> > values in already existing config parameters such as `max.block.ms`.
> Does
> > that make sense?
> >
> >
> > Hope the changes and explanations are convincing :)
> >
> >
> > Cheers,
> >
> > Alieh
> >
> > On Mon, May 13, 2024 at 6:40 PM Justine Olshan
> <jols...@confluent.io.invalid>
> > wrote:
> >
> >> Oh I see. The type isn't the error type but a newly defined type for the
> >> response. Makes sense and works for me.
> >>
> >> Justine
> >>
> >> On Mon, May 13, 2024 at 9:13 AM Chris Egerton <fearthecel...@gmail.com>
> >> wrote:
> >>
> >>> If we have dedicated methods for each kind of exception
> >>> (handleRecordTooLarge, handleUnknownTopicOrPartition, etc.), doesn't
> that
> >>> provide sufficient constraint? I'm not suggesting we eliminate these
> >>> methods, just that we change their return types to something more
> >> flexible.
> >>>
> >>> On Mon, May 13, 2024, 12:07 Justine Olshan
> <jols...@confluent.io.invalid
> >>>
> >>> wrote:
> >>>
> >>>> I'm not sure I agree with the Retriable and NonRetriableResponse
> >> comment.
> >>>> This doesn't limit the blast radius or enforce certain errors are
> used.
> >>>> I think we might disagree on how controlled these interfaces can be...
> >>>>
> >>>> Justine
> >>>>
> >>>> On Mon, May 13, 2024 at 8:40 AM Chris Egerton <chr...@aiven.io.invalid
> >>>
> >>>> wrote:
> >>>>
> >>>>> Hi Alieh,
> >>>>>
> >>>>> Thanks for the updates! I just have a few more thoughts:
> >>>>>
> >>>>> - I don't think a boolean property is sufficient to dictate retries
> >> for
> >>>>> unknown topic partitions, though. These errors can occur if a topic
> >> has
> >>>>> just been created, which can occur if, for example, automatic topic
> >>>>> creation is enabled for a multi-task connector. This is why I
> >> proposed
> >>> a
> >>>>> timeout instead of a boolean (and see my previous email for why
> >>> reducing
> >>>>> max.block.ms for a producer is not a viable alternative). If it
> >> helps,
> >>>> one
> >>>>> way to reproduce this yourself is to add the line
> >>>>> `fooProps.put(TASKS_MAX_CONFIG, "10");` to the integration test here:
> >>>>>
> >>>>>
> >>>>
> >>>
> >>
> https://github.com/apache/kafka/blob/5439914c32fa00d634efa7219699f1bc21add839/connect/runtime/src/test/java/org/apache/kafka/connect/integration/SourceConnectorsIntegrationTest.java#L134
> >>>>> and then check the logs afterward for messages like "Error while
> >>> fetching
> >>>>> metadata with correlation id <n> :
> >>>> {foo-topic=UNKNOWN_TOPIC_OR_PARTITION}".
> >>>>>
> >>>>> - I also don't think we need custom XxxResponse enums for every
> >>> possible
> >>>>> method; it seems like this will lead to a lot of duplication and
> >>>> cognitive
> >>>>> overhead if we want to expand the error handler in the future.
> >>> Something
> >>>>> more flexible like RetriableResponse and NonRetriableResponse could
> >>>>> suffice.
> >>>>>
> >>>>> - Finally, the KIP still doesn't state how the handler will or won't
> >>> take
> >>>>> precedence over existing retry properties. If I set `retries` or `
> >>>>> delivery.timeout.ms` or `max.block.ms` to low values, will that
> >> cause
> >>>>> retries to cease even if my custom handler would otherwise keep
> >>> returning
> >>>>> RETRY for an error?
> >>>>>
> >>>>> Cheers,
> >>>>>
> >>>>> Chris
> >>>>>
> >>>>> On Mon, May 13, 2024 at 11:02 AM Andrew Schofield <
> >>>>> andrew_schofi...@live.com>
> >>>>> wrote:
> >>>>>
> >>>>>> Hi Alieh,
> >>>>>> Just a few more comments on the KIP. It is looking much less risky
> >>> now
> >>>>> the
> >>>>>> scope
> >>>>>> is tighter.
> >>>>>>
> >>>>>> [AJS1] It would be nice to have default implementations of the
> >> handle
> >>>>>> methods
> >>>>>> so an implementor would not need to implement both themselves.
> >>>>>>
> >>>>>> [AJS2] Producer configurations which are class names usually end in
> >>>>>> “.class”.
> >>>>>> I suggest “custom.exception.handler.class”.
> >>>>>>
> >>>>>> [AJS3] If I implemented a handler, and I set a non-default value
> >> for
> >>>> one
> >>>>>> of the
> >>>>>> new configuations, what happens? I would expect that the handler
> >>> takes
> >>>>>> precedence. I wasn’t quite clear what “the control will follow the
> >>>>> handler
> >>>>>> instructions” meant.
> >>>>>>
> >>>>>> [AJS4] Because you now have an enum for the
> >>>>>> RecordTooLargeExceptionResponse,
> >>>>>> I don’t think you need to state in the comment for
> >>>>>> ProducerExceptionHandler that
> >>>>>> RETRY will be interpreted as FAIL.
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Andrew
> >>>>>>
> >>>>>>> On 13 May 2024, at 14:53, Alieh Saeedi
> >>> <asae...@confluent.io.INVALID
> >>>>>
> >>>>>> wrote:
> >>>>>>>
> >>>>>>> Hi all,
> >>>>>>>
> >>>>>>>
> >>>>>>> Thanks for the very interesting discussion during my PTO.
> >>>>>>>
> >>>>>>>
> >>>>>>> KIP updates and addressing concerns:
> >>>>>>>
> >>>>>>>
> >>>>>>> 1) Two handle() methods are defined in ProducerExceptionHandler
> >> for
> >>>> the
> >>>>>> two
> >>>>>>> exceptions with different input parameters so that we have
> >>>>>>> handle(RecordTooLargeException e, ProducerRecord record) and
> >>>>>>> handle(UnknownTopicOrPartitionException e, ProducerRecord record)
> >>>>>>>
> >>>>>>>
> >>>>>>> 2) The ProducerExceptionHandler extends `Closable` as well.
> >>>>>>>
> >>>>>>>
> >>>>>>> 3) The KIP suggests having two more configuration parameters with
> >>>>> boolean
> >>>>>>> values:
> >>>>>>>
> >>>>>>> - `drop.invalid.large.records` with a default value of `false`
> >> for
> >>>>>>> swallowing too large records.
> >>>>>>>
> >>>>>>> - `retry.unknown.topic.partition` with a default value of `true`
> >>> that
> >>>>>>> performs RETRY for `max.block.ms` ms, encountering the
> >>>>>>> UnknownTopicOrPartitionException.
> >>>>>>>
> >>>>>>>
> >>>>>>> Hope the main concerns are addressed so that we can go forward
> >> with
> >>>>>> voting.
> >>>>>>>
> >>>>>>>
> >>>>>>> Cheers,
> >>>>>>>
> >>>>>>> Alieh
> >>>>>>>
> >>>>>>> On Thu, May 9, 2024 at 11:25 PM Artem Livshits
> >>>>>>> <alivsh...@confluent.io.invalid> wrote:
> >>>>>>>
> >>>>>>>> Hi Mathias,
> >>>>>>>>
> >>>>>>>>> [AL1] While I see the point, I would think having a different
> >>>>> callback
> >>>>>>>> for every exception might not really be elegant?
> >>>>>>>>
> >>>>>>>> I'm not sure how to assess the level of elegance of the
> >> proposal,
> >>>> but
> >>>>> I
> >>>>>> can
> >>>>>>>> comment on the technical characteristics:
> >>>>>>>>
> >>>>>>>> 1. Having specific interfaces that codify the logic that is
> >>>> currently
> >>>>>>>> prescribed in the comments reduce the chance of making a
> >> mistake.
> >>>>>>>> Commments may get ignored, misuderstood or etc. but if the
> >>> contract
> >>>> is
> >>>>>>>> codified, the compilier will help to enforce the contract.
> >>>>>>>> 2. Given that the logic is trickier than it seems (the
> >>>>> record-too-large
> >>>>>> is
> >>>>>>>> an example that can easily confuse someone who's not intimately
> >>>>> familiar
> >>>>>>>> with the nuances of the batching logic), having a little more
> >>> hoops
> >>>> to
> >>>>>> jump
> >>>>>>>> would give a greater chance that whoever tries to add a new
> >> cases
> >>>>> pauses
> >>>>>>>> and thinks a bit more.
> >>>>>>>> 3. As Justine pointed out, having different method will be a
> >>> forcing
> >>>>>>>> function to go through a KIP rather than smuggle new cases
> >> through
> >>>>>>>> implementation.
> >>>>>>>> 4. Sort of a consequence of the previous 3 -- all those things
> >>>> reduce
> >>>>>> the
> >>>>>>>> chance of someone writing the code that works with 2 errors and
> >>> then
> >>>>>> when
> >>>>>>>> more errors are added in the future will suddenly incorrectly
> >>> ignore
> >>>>> new
> >>>>>>>> errors (the example I gave in the previous email).
> >>>>>>>>
> >>>>>>>>> [AL2 cont.] Similar to AL1, I see such a handler to some extend
> >>> as
> >>>>>>>> business logic. If a user puts a bad filter condition in their
> >> KS
> >>>> app,
> >>>>>> and
> >>>>>>>> drops messages
> >>>>>>>>
> >>>>>>>> I agree that there is always a chance to get a bug and lose
> >>>> messages,
> >>>>>> but
> >>>>>>>> there are generally separation of concerns that has different
> >> risk
> >>>>>> profile:
> >>>>>>>> the filtering logic may be more rigorously tested and rarely
> >>> changed
> >>>>>> (say
> >>>>>>>> an application developer does it), but setting the topics to
> >>> produce
> >>>>>> may be
> >>>>>>>> done via configuration (e.g. a user of the application does it)
> >>> and
> >>>>> it's
> >>>>>>>> generally an expectation that users would get an error when
> >>>>>> configuration
> >>>>>>>> is incorrect.
> >>>>>>>>
> >>>>>>>> What could be worse is that UnknownTopicOrPartitionException can
> >>> be
> >>>> an
> >>>>>>>> intermittent error, i.e. with a generally correct configuration,
> >>>> there
> >>>>>>>> could be metadata propagation problem on the cluster and then a
> >>>> random
> >>>>>> set
> >>>>>>>> of records could get lost.
> >>>>>>>>
> >>>>>>>>> [AL3] Maybe I misunderstand what you are saying, but to me,
> >>>> checking
> >>>>>> the
> >>>>>>>> size of the record upfront is exactly what the KIP proposes? No?
> >>>>>>>>
> >>>>>>>> It achieves the same result but solves it differently, my
> >>> proposal:
> >>>>>>>>
> >>>>>>>> 1. Application checks the validity of a record (maybe via a new
> >>>>>>>> validateRecord method) before producing it, and can just exclude
> >>> it
> >>>> or
> >>>>>>>> return an error to the user.
> >>>>>>>> 2. Application produces the record -- at this point there are no
> >>>>> records
> >>>>>>>> that could return record too large, they were either skipped at
> >>>> step 1
> >>>>>> or
> >>>>>>>> we didn't get here because step 1 failed.
> >>>>>>>>
> >>>>>>>> Vs. KIP's proposal
> >>>>>>>>
> >>>>>>>> 1. Application produces the record.
> >>>>>>>> 2. Application gets a callback.
> >>>>>>>> 3. Application returns the action on how to proceed.
> >>>>>>>>
> >>>>>>>> The advantage of the former is the clarity of semantics -- the
> >>>> record
> >>>>> is
> >>>>>>>> invalid (property of the record, not a function of server state
> >> or
> >>>>>> server
> >>>>>>>> configuration) and we can clearly know that it is the record
> >> that
> >>> is
> >>>>> bad
> >>>>>>>> and can never succeed.
> >>>>>>>>
> >>>>>>>> The KIP-proposed way actually has a very tricky point: it
> >> actually
> >>>>>> handles
> >>>>>>>> a subset of record-too-large exceptions.  The broker can return
> >>>>>>>> record-too-large and reject the whole batch (but we don't want
> >> to
> >>>>> ignore
> >>>>>>>> those because then we can skip random records that just happened
> >>> to
> >>>> be
> >>>>>> in
> >>>>>>>> the same batch), in some sense we use the same error for 2
> >>> different
> >>>>>>>> conditions and understanding that requires pretty deep
> >>> understanding
> >>>>> of
> >>>>>>>> Kafka internals.
> >>>>>>>>
> >>>>>>>> -Artem
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Wed, May 8, 2024 at 9:47 AM Justine Olshan
> >>>>>> <jols...@confluent.io.invalid
> >>>>>>>>>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> My concern with respect to it being fragile: the code that
> >>> ensures
> >>>>> the
> >>>>>>>>> error type is internal to the producer. Someone may see it and
> >>>> say, I
> >>>>>>>> want
> >>>>>>>>> to add such and such error. This looks like internal code, so I
> >>>> don't
> >>>>>>>> need
> >>>>>>>>> a KIP, and then they can change it to whatever they want
> >> thinking
> >>>> it
> >>>>> is
> >>>>>>>>> within the typical kafka improvement protocol.
> >>>>>>>>>
> >>>>>>>>> Relying on an internal change to enforce an external API is
> >>> fragile
> >>>>> in
> >>>>>> my
> >>>>>>>>> opinion. That's why I sort of agreed with Artem with enforcing
> >>> the
> >>>>>> error
> >>>>>>>> in
> >>>>>>>>> the method signature -- part of the public API.
> >>>>>>>>>
> >>>>>>>>> Chris's comments on requiring more information to handler again
> >>>> makes
> >>>>>> me
> >>>>>>>>> wonder if we are solving a problem of lack of information at
> >> the
> >>>>>>>>> application level with a more powerful solution than we need.
> >>> (Ie,
> >>>> if
> >>>>>> we
> >>>>>>>>> had more information, could the application close and restart
> >> the
> >>>>>>>>> transaction rather than having to drop records) But I am happy
> >> to
> >>>>>>>>> compromise with a handler that we can agree is sufficiently
> >>>>> controlled
> >>>>>>>> and
> >>>>>>>>> documented.
> >>>>>>>>>
> >>>>>>>>> Justine
> >>>>>>>>>
> >>>>>>>>> On Wed, May 8, 2024 at 7:20 AM Chris Egerton
> >>>> <chr...@aiven.io.invalid
> >>>>>>
> >>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hi Alieh,
> >>>>>>>>>>
> >>>>>>>>>> Continuing prior discussions:
> >>>>>>>>>>
> >>>>>>>>>> 1) Regarding the "flexibility" discussion, my overarching
> >> point
> >>> is
> >>>>>>>> that I
> >>>>>>>>>> don't see the point in allowing for this kind of pluggable
> >> logic
> >>>>>>>> without
> >>>>>>>>>> also covering more scenarios. Take example 2 in the KIP: if
> >>> we're
> >>>>>> going
> >>>>>>>>> to
> >>>>>>>>>> implement retries only on "important" topics when a topic
> >>>> partition
> >>>>>>>> isn't
> >>>>>>>>>> found, why wouldn't we also want to be able to do this for
> >> other
> >>>>>>>> errors?
> >>>>>>>>>> Again, taking authorization errors as an example, why wouldn't
> >>> we
> >>>>> want
> >>>>>>>> to
> >>>>>>>>>> be able to fail when we can't write to "important" topics
> >>> because
> >>>>> the
> >>>>>>>>>> producer principal lacks sufficient ACLs, and drop the record
> >> if
> >>>> the
> >>>>>>>>> topic
> >>>>>>>>>> isn't "important"? In a security-conscious environment with
> >>>>>>>>>> runtime-dependent topic routing (which is a common feature of
> >>> many
> >>>>>>>> source
> >>>>>>>>>> connectors, such as the Debezium connectors), this seems
> >> fairly
> >>>>>> likely.
> >>>>>>>>>>
> >>>>>>>>>> 2) As far as changing the shape of the API goes, I like
> >> Artem's
> >>>> idea
> >>>>>> of
> >>>>>>>>>> splitting out the interface based on specific exceptions. This
> >>> may
> >>>>> be
> >>>>>> a
> >>>>>>>>>> little laborious to expand in the future, but if we really
> >> want
> >>> to
> >>>>>>>>>> limit the exceptions that we cover with the handler and move
> >>>> slowly
> >>>>>> and
> >>>>>>>>>> cautiously, then IMO it'd be reasonable to reflect that in the
> >>>>>>>>> interface. I
> >>>>>>>>>> also acknowledge that there's no way to completely prevent
> >>> people
> >>>>> from
> >>>>>>>>>> shooting themselves in the foot by implementing the API
> >>>> incorrectly,
> >>>>>>>> but
> >>>>>>>>> I
> >>>>>>>>>> think it's worth it to do what we can--including leveraging
> >> the
> >>>> Java
> >>>>>>>>>> language's type system--to help them, so IMO there's value to
> >>>>>>>> eliminating
> >>>>>>>>>> the implicit behavior of failing when a policy returns RETRY
> >>> for a
> >>>>>>>>>> non-retriable error. This can take a variety of shapes and I'm
> >>> not
> >>>>>>>> going
> >>>>>>>>> to
> >>>>>>>>>> insist on anything specific, but I do want to again raise my
> >>>>> concerns
> >>>>>>>>> with
> >>>>>>>>>> the current proposal and request that we find something a
> >> little
> >>>>>>>> better.
> >>>>>>>>>>
> >>>>>>>>>> 3) Concerning the default implementation--actually, I meant
> >>> what I
> >>>>>>>> wrote
> >>>>>>>>> :)
> >>>>>>>>>> I don't want a "second" default, I want an implementation of
> >>> this
> >>>>>>>>> interface
> >>>>>>>>>> to be used as the default if no others are specified. The
> >>> behavior
> >>>>> of
> >>>>>>>>> this
> >>>>>>>>>> default implementation would be identical to existing behavior
> >>> (so
> >>>>>>>> there
> >>>>>>>>>> would be no backwards compatibility concerns like the ones
> >>> raised
> >>>> by
> >>>>>>>>>> Matthias), but it would be possible to configure this default
> >>>>> handler
> >>>>>>>>> class
> >>>>>>>>>> to behave differently for a basic set of scenarios. This would
> >>>>> mirror
> >>>>>>>>> (pun
> >>>>>>>>>> intended) the approach we've taken with Mirror Maker 2 and its
> >>>>>>>>>> ReplicationPolicy interface [1]. There is a default
> >>> implementation
> >>>>>>>>>> available [2] that recognizes a handful of basic configuration
> >>>>>>>> properties
> >>>>>>>>>> [3] for simple tweaks, but if users want, they can also
> >>> implement
> >>>>>> their
> >>>>>>>>> own
> >>>>>>>>>> replication policy for more fine-grained logic if those
> >>> properties
> >>>>>>>> aren't
> >>>>>>>>>> flexible enough.
> >>>>>>>>>>
> >>>>>>>>>> More concretely, I'm imagining something like this for the
> >>>> producer
> >>>>>>>>>> exception handler:
> >>>>>>>>>>
> >>>>>>>>>> - Default implementation class
> >>>>>>>>>> of
> >>>> org.apache.kafka.clients.producer.DefaultProducerExceptionHandler
> >>>>>>>>>> - This class would recognize two properties:
> >>>>>>>>>> - drop.invalid.large.records: Boolean property, defaults to
> >>>> false.
> >>>>> If
> >>>>>>>>>> "false", then causes the handler to return FAIL whenever
> >>>>>>>>>> a RecordTooLargeException is encountered; if "true", then
> >> causes
> >>>>>>>>>> SWALLOW/SKIP/DROP to be returned instead
> >>>>>>>>>> - unknown.topic.partition.retry.timeout.ms: Integer
> >> property,
> >>>>>>>> defaults
> >>>>>>>>>> to
> >>>>>>>>>> INT_MAX. Whenever an UnknownTopicOrPartitionException is
> >>>>> encountered,
> >>>>>>>>>> causes the handler to return FAIL if that record has been
> >>> pending
> >>>>> for
> >>>>>>>>> more
> >>>>>>>>>> than the retry timeout; otherwise, causes RETRY to be returned
> >>>>>>>>>>
> >>>>>>>>>> I think this is worth addressing now instead of later because
> >> it
> >>>>>> forces
> >>>>>>>>> us
> >>>>>>>>>> to evaluate the usefulness of this interface and it addresses
> >> a
> >>>>>>>>>> long-standing issue not just with Kafka Connect, but with the
> >>> Java
> >>>>>>>>> producer
> >>>>>>>>>> in general. For reference, here are a few tickets I collected
> >>>> after
> >>>>>>>>> briefly
> >>>>>>>>>> skimming our Jira showing that this is a real pain point for
> >>>> users:
> >>>>>>>>>> https://issues.apache.org/jira/browse/KAFKA-10340,
> >>>>>>>>>> https://issues.apache.org/jira/browse/KAFKA-12990,
> >>>>>>>>>> https://issues.apache.org/jira/browse/KAFKA-13634. Although
> >>> this
> >>>> is
> >>>>>>>>>> frequently reported with Kafka Connect, it applies to anyone
> >> who
> >>>>>>>>> configures
> >>>>>>>>>> a producer to use a high retry timeout. I am aware of the
> >>>>>> max.block.ms
> >>>>>>>>>> property, but it's painful and IMO poor behavior to require
> >>> users
> >>>> to
> >>>>>>>>> reduce
> >>>>>>>>>> the value of this property just to handle the single scenario
> >>> when
> >>>>>>>> trying
> >>>>>>>>>> to write to topics that don't exist, since it would also limit
> >>> the
> >>>>>>>> retry
> >>>>>>>>>> timeout for other operations that are legitimately retriable.
> >>>>>>>>>>
> >>>>>>>>>> Raising new points:
> >>>>>>>>>>
> >>>>>>>>>> 5) I don't see the interplay between this handler and existing
> >>>>>>>>>> retry-related properties mentioned anywhere in the KIP. I'm
> >>>> assuming
> >>>>>>>> that
> >>>>>>>>>> properties like "retries", "max.block.ms", and "
> >>>> delivery.timeout.ms
> >>>>> "
> >>>>>>>>> would
> >>>>>>>>>> take precedence over the handler and once they are exhausted,
> >>> the
> >>>>>>>>>> record/batch will fail no matter what? If so, it's probably
> >>> worth
> >>>>>>>> briefly
> >>>>>>>>>> mentioning this (no more than a sentence or two) in the KIP,
> >> and
> >>>> if
> >>>>>>>> not,
> >>>>>>>>>> I'm curious what you have in mind.
> >>>>>>>>>>
> >>>>>>>>>> 6) I also wonder if the API provides enough information in its
> >>>>> current
> >>>>>>>>>> form. Would it be possible to provide handlers with some way
> >> of
> >>>>>>>> tracking
> >>>>>>>>>> how long a record has been pending for (i.e., how long it's
> >> been
> >>>>> since
> >>>>>>>>> the
> >>>>>>>>>> record was provided as an argument to Producer::send)? One way
> >>> to
> >>>> do
> >>>>>>>> this
> >>>>>>>>>> could be to add a method like `onNewRecord(ProducerRecord)`
> >> and
> >>>>>>>>>> allow/require the handler to do its own bookkeeping, probably
> >>>> with a
> >>>>>>>>>> matching `onRecordSuccess(ProducerRecord)` method so that the
> >>>>> handler
> >>>>>>>>>> doesn't eat up an ever-increasing amount of memory trying to
> >>> track
> >>>>>>>>> records.
> >>>>>>>>>> An alternative could be to include information about the
> >> initial
> >>>>> time
> >>>>>>>> the
> >>>>>>>>>> record was received by the producer and the number of retries
> >>> that
> >>>>>> have
> >>>>>>>>>> been performed for it as parameters in the handle method(s),
> >> but
> >>>> I'm
> >>>>>>>> not
> >>>>>>>>>> sure how easy this would be to implement and it might clutter
> >>>> things
> >>>>>>>> up a
> >>>>>>>>>> bit too much.
> >>>>>>>>>>
> >>>>>>>>>> 7) A small request--can we add Closeable (or, if you prefer,
> >>>>>>>>> AutoCloseable)
> >>>>>>>>>> as a superinterface for the handler interface?
> >>>>>>>>>>
> >>>>>>>>>> [1] -
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> https://kafka.apache.org/37/javadoc/org/apache/kafka/connect/mirror/ReplicationPolicy.html
> >>>>>>>>>> [2] -
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> https://kafka.apache.org/37/javadoc/org/apache/kafka/connect/mirror/DefaultReplicationPolicy.html
> >>>>>>>>>> [3] -
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> https://kafka.apache.org/37/javadoc/org/apache/kafka/connect/mirror/DefaultReplicationPolicy.html#SEPARATOR_CONFIG
> >>>>>>>>>> ,
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> https://kafka.apache.org/37/javadoc/org/apache/kafka/connect/mirror/DefaultReplicationPolicy.html#INTERNAL_TOPIC_SEPARATOR_ENABLED_CONFIG
> >>>>>>>>>>
> >>>>>>>>>> Cheers,
> >>>>>>>>>>
> >>>>>>>>>> Chris
> >>>>>>>>>>
> >>>>>>>>>> On Wed, May 8, 2024 at 12:37 AM Matthias J. Sax <
> >>> mj...@apache.org
> >>>>>
> >>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Very interesting discussion.
> >>>>>>>>>>>
> >>>>>>>>>>> Seems a central point is the question about "how generic" we
> >>>>> approach
> >>>>>>>>>>> this, and some people think we need to be conservative and
> >>> others
> >>>>>>>> think
> >>>>>>>>>>> we should try to be as generic as possible.
> >>>>>>>>>>>
> >>>>>>>>>>> Personally, I think following a limited scope for this KIP
> >> (by
> >>>>>>>>>>> explicitly saying we only cover RecordTooLarge and
> >>>>>>>>>>> UnknownTopicOrPartition) might be better. We have concrete
> >>>> tickets
> >>>>>>>> that
> >>>>>>>>>>> we address, while for other exception (like authorization) we
> >>>> don't
> >>>>>>>>> know
> >>>>>>>>>>> if people want to handle it to begin with. Boiling the ocean
> >>>> might
> >>>>>>>> not
> >>>>>>>>>>> get us too far, and being somewhat pragmatic might help to
> >> move
> >>>>> this
> >>>>>>>>> KIP
> >>>>>>>>>>> forward. -- I also agree with Justin and Artem, that we want
> >> to
> >>>> be
> >>>>>>>>>>> careful anyway to not allow users to break stuff too easily.
> >>>>>>>>>>>
> >>>>>>>>>>> As the same time, I agree that we should setup this change
> >> in a
> >>>>>>>> forward
> >>>>>>>>>>> looking way, and thus having a single generic handler allows
> >> us
> >>>> to
> >>>>>>>>> later
> >>>>>>>>>>> extend the handler more easily. This should also simplify
> >>> follow
> >>>> up
> >>>>>>>> KIP
> >>>>>>>>>>> that might add new error cases (I actually mentioned one more
> >>> to
> >>>>>>>> Alieh
> >>>>>>>>>>> already, but we both agreed that it might be best to exclude
> >> it
> >>>>> from
> >>>>>>>>> the
> >>>>>>>>>>> KIP right now, to make the 3.8 deadline. Doing a follow up
> >> KIP
> >>> is
> >>>>> not
> >>>>>>>>>>> the end of the world.)
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> @Chris:
> >>>>>>>>>>>
> >>>>>>>>>>> (2) This sounds fair to me, but not sure how "bad" it
> >> actually
> >>>>> would
> >>>>>>>>> be?
> >>>>>>>>>>> If the contract is clearly defined, it seems to be fine what
> >>> the
> >>>>> KIP
> >>>>>>>>>>> proposes, and given that such a handler is an expert API, and
> >>> we
> >>>>> can
> >>>>>>>>>>> provide "best practices" (cf my other comment below in
> >> [AL1]),
> >>>>> being
> >>>>>>>> a
> >>>>>>>>>>> little bit pragmatic sound fine to me.
> >>>>>>>>>>>
> >>>>>>>>>>> Not sure if I understand Justin's argument on this question?
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> (3) About having a default impl or not. I am fine with adding
> >>>> one,
> >>>>>>>> even
> >>>>>>>>>>> if I am not sure why Connect could not just add its own one
> >> and
> >>>>> plug
> >>>>>>>> it
> >>>>>>>>>>> in (and we would add corresponding configs for Connect, but
> >> not
> >>>> for
> >>>>>>>> the
> >>>>>>>>>>> producer itself)? For this case, we could also do this as a
> >>>> follow
> >>>>> up
> >>>>>>>>>>> KIP, but happy to include it in this KIP to provide value to
> >>>>> Connect
> >>>>>>>>>>> right away (even if the value might not come right away if we
> >>>> miss
> >>>>>>>> the
> >>>>>>>>>>> 3.8 deadline due to expanded KIP scope...) --  For KS, we
> >> would
> >>>> for
> >>>>>>>>> sure
> >>>>>>>>>>> plugin our own impl, and lock down the config such that users
> >>>>> cannot
> >>>>>>>>> set
> >>>>>>>>>>> their own handler on the internal producer to begin with.
> >> Might
> >>>> be
> >>>>>>>> good
> >>>>>>>>>>> to elaborate why the producer should have a default? We might
> >>>>>>>> actually
> >>>>>>>>>>> want to add this to the KIP right away?
> >>>>>>>>>>>
> >>>>>>>>>>> The key for a default impl would be, to not change the
> >> current
> >>>>>>>>> behavior,
> >>>>>>>>>>> and having no default seems to achieve this. For the two
> >> cases
> >>>> you
> >>>>>>>>>>> mentioned, it's unclear to me what default value on "upper
> >>> bound
> >>>> on
> >>>>>>>>>>> retires" for UnkownTopicOrPartitionException we should set?
> >>> Seems
> >>>>> it
> >>>>>>>>>>> would need to be the same as `delivery.timeout.ms`? However,
> >>> if
> >>>>>>>> users
> >>>>>>>>>>> have `delivery.timeout.ms` actually overwritten we would
> >> need
> >>> to
> >>>>> set
> >>>>>>>>>>> this config somewhat "dynamic"? Is this feasible? If we
> >>>> hard-code 2
> >>>>>>>>>>> minutes, it might not be backward compatible. I have the
> >>>> impression
> >>>>>>>> we
> >>>>>>>>>>> might introduce some undesired coupling? -- For the "record
> >> too
> >>>>>>>> large"
> >>>>>>>>>>> case, the config seems to be boolean and setting it to
> >> `false`
> >>> by
> >>>>>>>>>>> default seems to provide backward compatibility.
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> @Artem:
> >>>>>>>>>>>
> >>>>>>>>>>> [AL1] While I see the point, I would think having a different
> >>>>>>>> callback
> >>>>>>>>>>> for every exception might not really be elegant? In the end,
> >>> the
> >>>>>>>>> handler
> >>>>>>>>>>> is an very advanced feature anyway, and if it's implemented
> >> in
> >>> a
> >>>>> bad
> >>>>>>>>>>> way, well, it's a user error -- we cannot protect users from
> >>>>>>>>> everything.
> >>>>>>>>>>> To me, a handler like this, is to some extend "business
> >> logic"
> >>>> and
> >>>>>>>> if a
> >>>>>>>>>>> user gets business logic wrong, it's hard to protect them. --
> >>> We
> >>>>>>>> would
> >>>>>>>>>>> of course provide best practice guidance in the JaveDocs, and
> >>>>> explain
> >>>>>>>>>>> that a handler should have explicit `if` statements for stuff
> >>> it
> >>>>> want
> >>>>>>>>> to
> >>>>>>>>>>> handle, and only a single default which return FAIL.
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> [AL2] Yes, but for KS we would retry at the application
> >> layer.
> >>>> Ie,
> >>>>>>>> the
> >>>>>>>>>>> TX is not completed, we clean up and setup out task from
> >>> scratch,
> >>>>> to
> >>>>>>>>>>> ensure the pending transaction is completed before we resume.
> >>> If
> >>>>> the
> >>>>>>>> TX
> >>>>>>>>>>> was indeed aborted, we would retry from older offset and thus
> >>>> just
> >>>>>>>> hit
> >>>>>>>>>>> the same error again and the loop begins again.
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> [AL2 cont.] Similar to AL1, I see such a handler to some
> >> extend
> >>>> as
> >>>>>>>>>>> business logic. If a user puts a bad filter condition in
> >> their
> >>> KS
> >>>>>>>> app,
> >>>>>>>>>>> and drops messages, it nothing we can do about it, and this
> >>>> handler
> >>>>>>>>>>> IMHO, has a similar purpose. This is also the line of
> >> thinking
> >>> I
> >>>>>>>> apply
> >>>>>>>>>>> to EOS, to address Justin's concern about "should we allow to
> >>>> drop
> >>>>>>>> for
> >>>>>>>>>>> EOS", and my answer is "yes", because it's more business
> >> logic
> >>>> than
> >>>>>>>>>>> actual error handling IMHO. And by default, we fail... So
> >> users
> >>>>>>>> opt-in
> >>>>>>>>>>> to add business logic to drop records. It's an application
> >>> level
> >>>>>>>>>>> decision how to write the code.
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> [AL3] Maybe I misunderstand what you are saying, but to me,
> >>>>> checking
> >>>>>>>>> the
> >>>>>>>>>>> size of the record upfront is exactly what the KIP proposes?
> >>> No?
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> @Justin:
> >>>>>>>>>>>
> >>>>>>>>>>>> I saw the sample
> >>>>>>>>>>>> code -- is it just an if statement checking for the error
> >>> before
> >>>>>>>> the
> >>>>>>>>>>>> handler is invoked? That seems a bit fragile.
> >>>>>>>>>>>
> >>>>>>>>>>> What do you mean by fragile? Not sure if I see your point.
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> -Matthias
> >>>>>>>>>>>
> >>>>>>>>>>> On 5/7/24 5:33 PM, Artem Livshits wrote:
> >>>>>>>>>>>> Hi Alieh,
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thanks for the KIP.  The motivation talks about very
> >> specific
> >>>>>>>> cases,
> >>>>>>>>>> but
> >>>>>>>>>>>> the interface is generic.
> >>>>>>>>>>>>
> >>>>>>>>>>>> [AL1]
> >>>>>>>>>>>> If the interface evolves in the future I think we could have
> >>> the
> >>>>>>>>>>> following
> >>>>>>>>>>>> confusion:
> >>>>>>>>>>>>
> >>>>>>>>>>>> 1. A user implemented SWALLOW action for both
> >>>>>>>> RecordTooLargeException
> >>>>>>>>>> and
> >>>>>>>>>>>> UnknownTopicOrPartitionException.  For simpicity they just
> >>>> return
> >>>>>>>>>> SWALLOW
> >>>>>>>>>>>> from the function, because it elegantly handles all known
> >>> cases.
> >>>>>>>>>>>> 2. The interface has evolved to support a new exception.
> >>>>>>>>>>>> 3. The user has upgraded their Kafka client.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Now a new kind of error gets dropped on the floor without
> >>> user's
> >>>>>>>>>>> intention
> >>>>>>>>>>>> and it would be super hard to detect and debug.
> >>>>>>>>>>>>
> >>>>>>>>>>>> To avoid the confusion, I think we should use handlers for
> >>>>> specific
> >>>>>>>>>>>> exceptions.  Then if a new exception is added it won't get
> >>>>> silently
> >>>>>>>>>>> swalled
> >>>>>>>>>>>> because the user would need to add new functionality to
> >> handle
> >>>> it.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I also have some higher level comments:
> >>>>>>>>>>>>
> >>>>>>>>>>>> [AL2]
> >>>>>>>>>>>>> it throws a TimeoutException, and the user can only blindly
> >>>>> retry,
> >>>>>>>>>> which
> >>>>>>>>>>>> may result in an infinite retry loop
> >>>>>>>>>>>>
> >>>>>>>>>>>> If the TimeoutException happens during transactional
> >>> processing
> >>>>>>>>>> (exactly
> >>>>>>>>>>>> once is the desired sematnics), then the client should not
> >>> retry
> >>>>>>>> when
> >>>>>>>>>> it
> >>>>>>>>>>>> gets TimeoutException because without knowing the reason for
> >>>>>>>>>>>> TimeoutExceptions, the client cannot know whether the
> >> message
> >>>> got
> >>>>>>>>>>> actually
> >>>>>>>>>>>> produced or not and retrying the message may result in
> >>>>> duplicatees.
> >>>>>>>>>>>>
> >>>>>>>>>>>>> The thrown TimeoutException "cuts" the connection to the
> >>>>>>>> underlying
> >>>>>>>>>> root
> >>>>>>>>>>>> cause of missing metadata
> >>>>>>>>>>>>
> >>>>>>>>>>>> Maybe we should fix the error handling and return the proper
> >>>>>>>>> underlying
> >>>>>>>>>>>> message?  Then the application can properly handle the
> >> message
> >>>>>>>> based
> >>>>>>>>> on
> >>>>>>>>>>>> preferences.
> >>>>>>>>>>>>
> >>>>>>>>>>>> From the product perspective, it's not clear how safe it is
> >> to
> >>>>>>>>> blindly
> >>>>>>>>>>>> ignore UnknownTopicOrPartitionException.  This could lead to
> >>>>>>>>> situations
> >>>>>>>>>>>> when a simple typo could lead to massive data loss (part of
> >>> the
> >>>>>>>> data
> >>>>>>>>>>> would
> >>>>>>>>>>>> effectively be produced to a "black hole" and the user may
> >> not
> >>>>>>>> notice
> >>>>>>>>>> it
> >>>>>>>>>>>> for a while).
> >>>>>>>>>>>>
> >>>>>>>>>>>> In which situations would you recommend the user to "black
> >>> hole"
> >>>>>>>>>> messages
> >>>>>>>>>>>> in case of misconfiguration?
> >>>>>>>>>>>>
> >>>>>>>>>>>> [AL3]
> >>>>>>>>>>>>
> >>>>>>>>>>>>> If the custom handler decides on SWALLOW for
> >>>>>>>>> RecordTooLargeException,
> >>>>>>>>>>>>
> >>>>>>>>>>>> Is it my understanding that this KIP proposes that
> >>> functionality
> >>>>>>>> that
> >>>>>>>>>>> would
> >>>>>>>>>>>> only be able to SWALLOW RecordTooLargeException that happen
> >>>>> because
> >>>>>>>>> the
> >>>>>>>>>>>> producer cannot produce the record (if the broker rejects
> >> the
> >>>>>>>> batch,
> >>>>>>>>>> the
> >>>>>>>>>>>> error won't get to the handler, because we cannot know which
> >>>> other
> >>>>>>>>>>> records
> >>>>>>>>>>>> get ignored).  In this case, why not just check the locally
> >>>>>>>>> configured
> >>>>>>>>>>> max
> >>>>>>>>>>>> record size upfront and not produce the recrord in the first
> >>>>> place?
> >>>>>>>>>>> Maybe
> >>>>>>>>>>>> we can expose a validation function from the producer that
> >>> could
> >>>>>>>>>> validate
> >>>>>>>>>>>> the records locally, so we don't need to produce the record
> >> in
> >>>>>>>> order
> >>>>>>>>> to
> >>>>>>>>>>>> know that it's invalid.
> >>>>>>>>>>>>
> >>>>>>>>>>>> -Artem
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Tue, May 7, 2024 at 2:07 PM Justine Olshan
> >>>>>>>>>>> <jols...@confluent.io.invalid>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Alieh and Chris,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Thanks for clarifying 1) but I saw the motivation. I guess
> >> I
> >>>> just
> >>>>>>>>>> didn't
> >>>>>>>>>>>>> understand how that would be ensured on the producer side.
> >> I
> >>>> saw
> >>>>>>>> the
> >>>>>>>>>>> sample
> >>>>>>>>>>>>> code -- is it just an if statement checking for the error
> >>>> before
> >>>>>>>> the
> >>>>>>>>>>>>> handler is invoked? That seems a bit fragile.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Can you clarify what you mean by `since the code does not
> >>> reach
> >>>>>>>> the
> >>>>>>>>> KS
> >>>>>>>>>>>>> interface and breaks somewhere in producer.` If we surfaced
> >>>> this
> >>>>>>>>> error
> >>>>>>>>>>> to
> >>>>>>>>>>>>> the application in a better way would that also be a
> >> solution
> >>>> to
> >>>>>>>> the
> >>>>>>>>>>> issue?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Justine
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Tue, May 7, 2024 at 1:55 PM Alieh Saeedi
> >>>>>>>>>>> <asae...@confluent.io.invalid>
> >>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Thank you, Chris and Justine, for the feedback.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> @Chris
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> 1) Flexibility: it has two meanings. The first meaning is
> >>> the
> >>>>> one
> >>>>>>>>> you
> >>>>>>>>>>>>>> mentioned. We are going to cover more exceptions in the
> >>>> future,
> >>>>>>>> but
> >>>>>>>>>> as
> >>>>>>>>>>>>>> Justine mentioned, we must be very conservative about
> >> adding
> >>>>> more
> >>>>>>>>>>>>>> exceptions. Additionally, flexibility mainly means that
> >> the
> >>>> user
> >>>>>>>> is
> >>>>>>>>>>> able
> >>>>>>>>>>>>> to
> >>>>>>>>>>>>>> develop their own code. As mentioned in the motivation
> >>> section
> >>>>>>>> and
> >>>>>>>>>> the
> >>>>>>>>>>>>>> examples, sometimes the user decides on dropping a record
> >>>> based
> >>>>>>>> on
> >>>>>>>>>> the
> >>>>>>>>>>>>>> topic, for example.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> 2) Defining two separate methods for retriable and
> >>>> non-retriable
> >>>>>>>>>>>>>> exceptions: although the idea is brilliant, the user may
> >>> still
> >>>>>>>>> make a
> >>>>>>>>>>>>>> mistake by implementing the wrong method and see a
> >>>> non-expecting
> >>>>>>>>>>>>> behaviour.
> >>>>>>>>>>>>>> For example, he may implement handleRetriable() for
> >>>>>>>>>>>>> RecordTooLargeException
> >>>>>>>>>>>>>> and define SWALLOW for the exception, but in practice, he
> >>> sees
> >>>>> no
> >>>>>>>>>>> change
> >>>>>>>>>>>>> in
> >>>>>>>>>>>>>> default behaviour since he implemented the wrong method. I
> >>>> think
> >>>>>>>> we
> >>>>>>>>>> can
> >>>>>>>>>>>>>> never reduce the user’s mistakes to 0.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> 3) Default implementation for Handler: the default
> >> behaviour
> >>>> is
> >>>>>>>>>> already
> >>>>>>>>>>>>>> preserved with NO need of implementing any handler or
> >>> setting
> >>>>> the
> >>>>>>>>>>>>>> corresponding config parameter `custom.exception.handler`.
> >>>> What
> >>>>>>>> you
> >>>>>>>>>>> mean
> >>>>>>>>>>>>> is
> >>>>>>>>>>>>>> actually having a second default, which requires having
> >> both
> >>>>>>>>>> interface
> >>>>>>>>>>>>> and
> >>>>>>>>>>>>>> config parameters. About UnknownTopicOrPartitionException:
> >>> the
> >>>>>>>>>> producer
> >>>>>>>>>>>>>> already offers the config parameter `max.block.ms` which
> >>>>>>>>> determines
> >>>>>>>>>>> the
> >>>>>>>>>>>>>> duration of retrying. The main purpose of the user who
> >> needs
> >>>> the
> >>>>>>>>>>> handler
> >>>>>>>>>>>>> is
> >>>>>>>>>>>>>> to get the root cause of TimeoutException and handle it in
> >>> the
> >>>>>>>> way
> >>>>>>>>> he
> >>>>>>>>>>>>>> intends. The KIP explains the necessity of it for KS
> >> users.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> 4) Naming issue: By SWALLOW, we meant actually swallow the
> >>>>> error,
> >>>>>>>>>> while
> >>>>>>>>>>>>>> SKIP means skip the record, I think. If it makes sense for
> >>>> more
> >>>>>>>>> ppl,
> >>>>>>>>>> I
> >>>>>>>>>>>>> can
> >>>>>>>>>>>>>> change it to SKIP
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> @Justine
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> 1) was addressed by Chris.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> 2 and 3) The problem is exactly what you mentioned.
> >>> Currently,
> >>>>>>>>> there
> >>>>>>>>>> is
> >>>>>>>>>>>>> no
> >>>>>>>>>>>>>> way to handle these issues application-side. Even KS users
> >>> who
> >>>>>>>>>>> implement
> >>>>>>>>>>>>> KS
> >>>>>>>>>>>>>> ProductionExceptionHandler are not able to handle the
> >>>> exceptions
> >>>>>>>> as
> >>>>>>>>>>> they
> >>>>>>>>>>>>>> intend since the code does not reach the KS interface and
> >>>> breaks
> >>>>>>>>>>>>> somewhere
> >>>>>>>>>>>>>> in producer.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Cheers,
> >>>>>>>>>>>>>> Alieh
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Tue, May 7, 2024 at 8:43 PM Chris Egerton <
> >>>>>>>>>> fearthecel...@gmail.com>
> >>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Hi Justine,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> The method signatures for the interface are indeed
> >>>> open-ended,
> >>>>>>>> but
> >>>>>>>>>> the
> >>>>>>>>>>>>>> KIP
> >>>>>>>>>>>>>>> states that its uses will be limited. See the motivation
> >>>>>>>> section:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> We believe that the user should be able to develop
> >> custom
> >>>>>>>>> exception
> >>>>>>>>>>>>>>> handlers for managing producer exceptions. On the other
> >>> hand,
> >>>>>>>> this
> >>>>>>>>>>> will
> >>>>>>>>>>>>>> be
> >>>>>>>>>>>>>>> an expert-level API, and using that may result in strange
> >>>>>>>>> behaviour
> >>>>>>>>>> in
> >>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>> system, making it hard to find the root cause. Therefore,
> >>> the
> >>>>>>>>> custom
> >>>>>>>>>>>>>>> handler is currently limited to handling
> >>>>> RecordTooLargeException
> >>>>>>>>> and
> >>>>>>>>>>>>>>> UnknownTopicOrPartitionException.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Cheers,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Chris
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Tue, May 7, 2024, 14:37 Justine Olshan
> >>>>>>>>>>> <jols...@confluent.io.invalid
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Hi Alieh,
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> I was out for KSB and then was also sick. :(
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> To your point 1) Chris, I don't think it is limited to
> >> two
> >>>>>>>>> specific
> >>>>>>>>>>>>>>>> scenarios, since the interface accepts a generic
> >>> Exception e
> >>>>>>>> and
> >>>>>>>>>> can
> >>>>>>>>>>>>> be
> >>>>>>>>>>>>>>>> implemented to check if that e is an instanceof any
> >>>> exception.
> >>>>>>>> I
> >>>>>>>>>>>>> didn't
> >>>>>>>>>>>>>>> see
> >>>>>>>>>>>>>>>> anywhere that specific errors are enforced. I'm a bit
> >>>>> concerned
> >>>>>>>>>> about
> >>>>>>>>>>>>>>> this
> >>>>>>>>>>>>>>>> actually. I'm concerned about the opened-endedness and
> >> the
> >>>>>>>>> contract
> >>>>>>>>>>>>> we
> >>>>>>>>>>>>>>> have
> >>>>>>>>>>>>>>>> with transactions. We are allowing the client to make
> >>>>> decisions
> >>>>>>>>>> that
> >>>>>>>>>>>>>> are
> >>>>>>>>>>>>>>>> somewhat invisible to the server. As an aside, can we
> >>> build
> >>>> in
> >>>>>>>>> log
> >>>>>>>>>>>>>>> messages
> >>>>>>>>>>>>>>>> when the handler decides to skip etc a message. I'm
> >> really
> >>>>>>>>>> concerned
> >>>>>>>>>>>>>>> about
> >>>>>>>>>>>>>>>> messages being silently dropped.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> I do think Chris's point 2) about retriable vs non
> >>> retriable
> >>>>>>>>> errors
> >>>>>>>>>>>>> is
> >>>>>>>>>>>>>>>> fair. I'm a bit concerned about skipping a unknown topic
> >>> or
> >>>>>>>>>> partition
> >>>>>>>>>>>>>>>> exception too early, as there are cases where it can be
> >>>>>>>>> transient.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> I'm still a little bit wary of allowing dropping records
> >>> as
> >>>>>>>> part
> >>>>>>>>> of
> >>>>>>>>>>>>> EOS
> >>>>>>>>>>>>>>>> generally as in many cases, these errors signify an
> >> issue
> >>>> with
> >>>>>>>>> the
> >>>>>>>>>>>>>>> original
> >>>>>>>>>>>>>>>> data. I understand that streams and connect/mirror maker
> >>> may
> >>>>>>>> have
> >>>>>>>>>>>>>> reasons
> >>>>>>>>>>>>>>>> they want to progress past these messages, but wondering
> >>> if
> >>>>>>>> there
> >>>>>>>>>> is
> >>>>>>>>>>>>> a
> >>>>>>>>>>>>>>> way
> >>>>>>>>>>>>>>>> that can be done application-side. I'm willing to accept
> >>>> this
> >>>>>>>>> sort
> >>>>>>>>>> of
> >>>>>>>>>>>>>>>> proposal if we can make it clear that this sort of thing
> >>> is
> >>>>>>>>>> happening
> >>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>> we limit the blast radius for what we can do.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Justine
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On Tue, May 7, 2024 at 9:55 AM Chris Egerton
> >>>>>>>>>> <chr...@aiven.io.invalid
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Hi Alieh,
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Sorry for the delay, I've been out sick. I still have
> >>> some
> >>>>>>>>>> thoughts
> >>>>>>>>>>>>>>> that
> >>>>>>>>>>>>>>>>> I'd like to see addressed before voting.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> 1) If flexibility is the motivation for a pluggable
> >>>>> interface,
> >>>>>>>>> why
> >>>>>>>>>>>>>> are
> >>>>>>>>>>>>>>> we
> >>>>>>>>>>>>>>>>> only limiting the uses for this interface to two very
> >>>>> specific
> >>>>>>>>>>>>>>> scenarios?
> >>>>>>>>>>>>>>>>> Why not also allow, e.g., authorization errors to be
> >>>> handled
> >>>>>>>> as
> >>>>>>>>>>>>> well
> >>>>>>>>>>>>>>>>> (allowing users to drop records destined for some
> >>>> off-limits
> >>>>>>>>>>>>> topics,
> >>>>>>>>>>>>>> or
> >>>>>>>>>>>>>>>>> retry for a limited duration in case there's a delay in
> >>> the
> >>>>>>>>>>>>>> propagation
> >>>>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>> ACL updates)? It'd be nice to see some analysis of
> >> other
> >>>>>>>> errors
> >>>>>>>>>>>>> that
> >>>>>>>>>>>>>>>> could
> >>>>>>>>>>>>>>>>> be handled with this new API, both to avoid the
> >> follow-up
> >>>>> work
> >>>>>>>>> of
> >>>>>>>>>>>>>>> another
> >>>>>>>>>>>>>>>>> KIP to address them in the future, and to make sure
> >> that
> >>>>> we're
> >>>>>>>>> not
> >>>>>>>>>>>>>>>> painting
> >>>>>>>>>>>>>>>>> ourselves into a corner with the current API in a way
> >>> that
> >>>>>>>> would
> >>>>>>>>>>>>> make
> >>>>>>>>>>>>>>>>> future modifications difficult.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> 2) Something feels a bit off with how retriable vs.
> >>>>>>>>> non-retriable
> >>>>>>>>>>>>>>> errors
> >>>>>>>>>>>>>>>>> are handled with the interface. Why not introduce two
> >>>>> separate
> >>>>>>>>>>>>>> methods
> >>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>> handle each case separately? That way there's no
> >>> ambiguity
> >>>> or
> >>>>>>>>>>>>>> implicit
> >>>>>>>>>>>>>>>>> behavior when, e.g., attempting to retry on a
> >>>>>>>>>>>>>> RecordTooLargeException.
> >>>>>>>>>>>>>>>> This
> >>>>>>>>>>>>>>>>> could be something like `NonRetriableResponse
> >>>>>>>>>>>>> handle(ProducerRecord,
> >>>>>>>>>>>>>>>>> Exception)` and `RetriableResponse
> >>>>>>>>> handleRetriable(ProducerRecord,
> >>>>>>>>>>>>>>>>> Exception)`, though the exact names and shape can
> >>> obviously
> >>>>> be
> >>>>>>>>>>>>> toyed
> >>>>>>>>>>>>>>>> with a
> >>>>>>>>>>>>>>>>> bit.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> 3) Although the flexibility of a pluggable interface
> >> may
> >>>>>>>> benefit
> >>>>>>>>>>>>> some
> >>>>>>>>>>>>>>>>> users' custom producer applications and Kafka Streams
> >>>>>>>>>> applications,
> >>>>>>>>>>>>>> it
> >>>>>>>>>>>>>>>>> comes at significant deployment cost for other
> >>> low-/no-code
> >>>>>>>>>>>>>>> environments,
> >>>>>>>>>>>>>>>>> including but not limited to Kafka Connect and
> >>> MirrorMaker
> >>>> 2.
> >>>>>>>>> Can
> >>>>>>>>>>>>> we
> >>>>>>>>>>>>>>> add
> >>>>>>>>>>>>>>>> a
> >>>>>>>>>>>>>>>>> default implementation of the exception handler that
> >>> allows
> >>>>>>>> for
> >>>>>>>>>>>>> some
> >>>>>>>>>>>>>>>> simple
> >>>>>>>>>>>>>>>>> behavior to be tweaked via configuration property? Two
> >>>> things
> >>>>>>>>> that
> >>>>>>>>>>>>>>> would
> >>>>>>>>>>>>>>>> be
> >>>>>>>>>>>>>>>>> nice to have would be A) an upper bound on the retry
> >> time
> >>>> for
> >>>>>>>>>>>>>>>>> unknown-topic-partition exceptions and B) an option to
> >>> drop
> >>>>>>>>>> records
> >>>>>>>>>>>>>>> that
> >>>>>>>>>>>>>>>>> are large enough to trigger a record-too-large
> >> exception.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> 4) I'd still prefer to see "SKIP" or "DROP" instead of
> >>> the
> >>>>>>>>>> proposed
> >>>>>>>>>>>>>>>>> "SWALLOW" option, which IMO is opaque and non-obvious,
> >>>>>>>>> especially
> >>>>>>>>>>>>>> when
> >>>>>>>>>>>>>>>>> trying to guess the behavior for retriable errors.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Cheers,
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Chris
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> On Fri, May 3, 2024 at 11:23 AM Alieh Saeedi
> >>>>>>>>>>>>>>>> <asae...@confluent.io.invalid
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Hi all,
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> A summary of the KIP and the discussions:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> The KIP introduces a handler interface for Producer in
> >>>> order
> >>>>>>>> to
> >>>>>>>>>>>>>>> handle
> >>>>>>>>>>>>>>>>> two
> >>>>>>>>>>>>>>>>>> exceptions: RecordTooLargeException and
> >>>>>>>>>>>>>>>> UnknownTopicOrPartitionException.
> >>>>>>>>>>>>>>>>>> The handler handles the exceptions per-record.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> - Do we need this handler?  [Motivation and Examples
> >>>>>>>> sections]
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> RecordTooLargeException: 1) In transactions, the
> >>> producer
> >>>>>>>>>>>>> collects
> >>>>>>>>>>>>>>>>> multiple
> >>>>>>>>>>>>>>>>>> records in batches. Then a RecordTooLargeException
> >>> related
> >>>>>>>> to a
> >>>>>>>>>>>>>>> single
> >>>>>>>>>>>>>>>>>> record leads to failing the entire batch. A custom
> >>>> exception
> >>>>>>>>>>>>>> handler
> >>>>>>>>>>>>>>> in
> >>>>>>>>>>>>>>>>>> this case may decide on dropping the record and
> >>> continuing
> >>>>>>>> the
> >>>>>>>>>>>>>>>>> processing.
> >>>>>>>>>>>>>>>>>> See Example 1, please. 2) More over, in Kafka
> >> Streams, a
> >>>>>>>> record
> >>>>>>>>>>>>>> that
> >>>>>>>>>>>>>>> is
> >>>>>>>>>>>>>>>>> too
> >>>>>>>>>>>>>>>>>> large is a poison pill record, and there is no way to
> >>> skip
> >>>>>>>> over
> >>>>>>>>>>>>>> it. A
> >>>>>>>>>>>>>>>>>> handler would allow us to react to this error inside
> >> the
> >>>>>>>>>>>>> producer,
> >>>>>>>>>>>>>>>> i.e.,
> >>>>>>>>>>>>>>>>>> local to where the error happens, and thus simplify
> >> the
> >>>>>>>> overall
> >>>>>>>>>>>>>> code
> >>>>>>>>>>>>>>>>>> significantly. Please read the Motivation section for
> >>> more
> >>>>>>>>>>>>>>> explanation.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> UnknownTopicOrPartitionException: For this case, the
> >>>>> producer
> >>>>>>>>>>>>>> handles
> >>>>>>>>>>>>>>>>> this
> >>>>>>>>>>>>>>>>>> exception internally and only issues a WARN log about
> >>>>> missing
> >>>>>>>>>>>>>>> metadata
> >>>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>> retries internally. Later, when the producer hits "
> >>>>>>>>>>>>>>> deliver.timeout.ms"
> >>>>>>>>>>>>>>>>> it
> >>>>>>>>>>>>>>>>>> throws a TimeoutException, and the user can only
> >> blindly
> >>>>>>>> retry,
> >>>>>>>>>>>>>> which
> >>>>>>>>>>>>>>>> may
> >>>>>>>>>>>>>>>>>> result in an infinite retry loop. The thrown
> >>>>> TimeoutException
> >>>>>>>>>>>>>> "cuts"
> >>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>> connection to the underlying root cause of missing
> >>>> metadata
> >>>>>>>>>>>>> (which
> >>>>>>>>>>>>>>>> could
> >>>>>>>>>>>>>>>>>> indeed be a transient error but is persistent for a
> >>>>>>>>> non-existing
> >>>>>>>>>>>>>>>> topic).
> >>>>>>>>>>>>>>>>>> Thus, there is no programmatic way to break the
> >> infinite
> >>>>>>>> retry
> >>>>>>>>>>>>>> loop.
> >>>>>>>>>>>>>>>>> Kafka
> >>>>>>>>>>>>>>>>>> Streams also blindly retries for this case, and the
> >>>>>>>> application
> >>>>>>>>>>>>>> gets
> >>>>>>>>>>>>>>>>> stuck.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> - Having interface vs configuration option:
> >> [Motivation,
> >>>>>>>>>>>>> Examples,
> >>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>> Rejected Alternatives sections]
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Our solution is introducing an interface due to the
> >> full
> >>>>>>>>>>>>>> flexibility
> >>>>>>>>>>>>>>>> that
> >>>>>>>>>>>>>>>>>> it offers. Sometimes users, especially Kafka Streams
> >>> ones,
> >>>>>>>>>>>>>> determine
> >>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>> handler's behaviour based on the situation. For
> >>> example, f
> >>>>>>>>>>>>>>>>>> acing UnknownTopicOrPartitionException*, *the user may
> >>>> want
> >>>>>>>> to
> >>>>>>>>>>>>>> raise
> >>>>>>>>>>>>>>> an
> >>>>>>>>>>>>>>>>>> error for some topics but retry it for other topics.
> >>>> Having
> >>>>> a
> >>>>>>>>>>>>>>>>> configuration
> >>>>>>>>>>>>>>>>>> option with a fixed set of possibilities does not
> >> serve
> >>>> the
> >>>>>>>>>>>>> user's
> >>>>>>>>>>>>>>>>>> needs. See Example 2, please.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> - Note on RecordTooLargeException: [Public Interfaces
> >>>>>>>> section]
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> If the custom handler decides on SWALLOW for
> >>>>>>>>>>>>>> RecordTooLargeException,
> >>>>>>>>>>>>>>>>> then
> >>>>>>>>>>>>>>>>>> this record will not be a part of the batch of
> >>>> transactions
> >>>>>>>> and
> >>>>>>>>>>>>>> will
> >>>>>>>>>>>>>>>> also
> >>>>>>>>>>>>>>>>>> not be sent to the broker in non-transactional mode.
> >> So
> >>> no
> >>>>>>>>>>>>> worries
> >>>>>>>>>>>>>>>> about
> >>>>>>>>>>>>>>>>>> getting a RecordTooLargeException from the broker in
> >>> this
> >>>>>>>> case,
> >>>>>>>>>>>>> as
> >>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>> record will never ever be sent to the broker. SWALLOW
> >>>> means
> >>>>>>>>> drop
> >>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>> record
> >>>>>>>>>>>>>>>>>> and continue/swallow the error.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> - What if the handle() method implements RETRY for
> >>>>>>>>>>>>>>>>> RecordTooLargeException?
> >>>>>>>>>>>>>>>>>> [Proposed Changes section]
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> We have to limit the user to only have FAIL or SWALLOW
> >>> for
> >>>>>>>>>>>>>>>>>> RecordTooLargeException. Actually, RETRY must be equal
> >>> to
> >>>>>>>> FAIL.
> >>>>>>>>>>>>>> This
> >>>>>>>>>>>>>>> is
> >>>>>>>>>>>>>>>>>> well documented/informed in javadoc.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> - What if the handle() method of the handler throws an
> >>>>>>>>> exception?
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> The handler is expected to have correct code. If it
> >>> throws
> >>>>> an
> >>>>>>>>>>>>>>>> exception,
> >>>>>>>>>>>>>>>>>> everything fails.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> This is a PoC PR <
> >>>>> https://github.com/apache/kafka/pull/15846
> >>>>>>>>>
> >>>>>>>>>>>>> ONLY
> >>>>>>>>>>>>>>> for
> >>>>>>>>>>>>>>>>>> RecordTooLargeException. The code changes related to
> >>>>>>>>>>>>>>>>>> UnknownTopicOrPartitionException will be added to this
> >>> PR
> >>>>>>>>> LATER.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Looking forward to your feedback again.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Cheers,
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Alieh
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> On Thu, Apr 25, 2024 at 11:46 PM Kirk True <
> >>>>>>>> k...@kirktrue.pro>
> >>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Hi Alieh,
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Thanks for the updates!
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Comments inline...
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> On Apr 25, 2024, at 1:10 PM, Alieh Saeedi
> >>>>>>>>>>>>>>>>> <asae...@confluent.io.INVALID
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Hi all,
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Thanks a lot for the constructive feedbacks!
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Addressing some of the main concerns:
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> - The `RecordTooLargeException` can be thrown by
> >>> broker,
> >>>>>>>>>>>>>> producer
> >>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>>> consumer. Of course, the `ProducerExceptionHandler`
> >>>>>>>> interface
> >>>>>>>>>>>>>> is
> >>>>>>>>>>>>>>>>>>> introduced
> >>>>>>>>>>>>>>>>>>>> to affect only the exceptions thrown from the
> >>> producer.
> >>>>>>>> This
> >>>>>>>>>>>>>> KIP
> >>>>>>>>>>>>>>>> very
> >>>>>>>>>>>>>>>>>>>> specifically means to provide a possibility to
> >> manage
> >>>> the
> >>>>>>>>>>>>>>>>>>>> `RecordTooLargeException` thrown from the
> >>>> Producer.send()
> >>>>>>>>>>>>>> method.
> >>>>>>>>>>>>>>>>>> Please
> >>>>>>>>>>>>>>>>>>>> see “Proposed Changes” section for more clarity. I
> >>>>>>>>>>>>> investigated
> >>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>> issue
> >>>>>>>>>>>>>>>>>>>> there thoroughly. I hope it can explain the concern
> >>>> about
> >>>>>>>> how
> >>>>>>>>>>>>>> we
> >>>>>>>>>>>>>>>>> handle
> >>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>> errors as well.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> - The problem with Callback: Methods of Callback are
> >>>>> called
> >>>>>>>>>>>>>> when
> >>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>> record
> >>>>>>>>>>>>>>>>>>>> sent to the server is acknowledged, while this is
> >> not
> >>>> the
> >>>>>>>>>>>>>> desired
> >>>>>>>>>>>>>>>>> time
> >>>>>>>>>>>>>>>>>>> for
> >>>>>>>>>>>>>>>>>>>> all exceptions. We intend to handle exceptions
> >>>> beforehand.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> I guess it makes sense to keep the expectation for
> >> when
> >>>>>>>>>>>>> Callback
> >>>>>>>>>>>>>> is
> >>>>>>>>>>>>>>>>>>> invoked as-is vs. shoehorning more into it.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> - What if the custom handler returns RETRY for
> >>>>>>>>>>>>>>>>>>> `RecordTooLargeException`? I
> >>>>>>>>>>>>>>>>>>>> assume changing the producer configuration at
> >> runtime
> >>> is
> >>>>>>>>>>>>>>> possible.
> >>>>>>>>>>>>>>>> If
> >>>>>>>>>>>>>>>>>> so,
> >>>>>>>>>>>>>>>>>>>> RETRY for a too large record is valid because maybe
> >> in
> >>>> the
> >>>>>>>>>>>>> next
> >>>>>>>>>>>>>>>> try,
> >>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>> too large record is not poisoning any more. I am not
> >>>> 100%
> >>>>>>>>>>>>> sure
> >>>>>>>>>>>>>>>> about
> >>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>> technical details, though. Otherwise, we can
> >> consider
> >>>> the
> >>>>>>>>>>>>> RETRY
> >>>>>>>>>>>>>>> as
> >>>>>>>>>>>>>>>>> FAIL
> >>>>>>>>>>>>>>>>>>> for
> >>>>>>>>>>>>>>>>>>>> this exception. Another solution would be to
> >> consider
> >>> a
> >>>>>>>>>>>>>> constant
> >>>>>>>>>>>>>>>>> number
> >>>>>>>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>>>>> times for RETRY which can be useful for other
> >>> exceptions
> >>>>> as
> >>>>>>>>>>>>>> well.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> It’s not presently possible to change the
> >> configuration
> >>>> of
> >>>>>>>> an
> >>>>>>>>>>>>>>>> existing
> >>>>>>>>>>>>>>>>>>> Producer at runtime. So if a record hits a
> >>>>>>>>>>>>>> RecordTooLargeException
> >>>>>>>>>>>>>>>>> once,
> >>>>>>>>>>>>>>>>>> no
> >>>>>>>>>>>>>>>>>>> amount of retrying (with the current Producer) will
> >>>> change
> >>>>>>>>> that
> >>>>>>>>>>>>>>> fact.
> >>>>>>>>>>>>>>>>> So
> >>>>>>>>>>>>>>>>>>> I’m still a little stuck on how to handle a response
> >> of
> >>>>>>>> RETRY
> >>>>>>>>>>>>> for
> >>>>>>>>>>>>>>> an
> >>>>>>>>>>>>>>>>>>> “oversized” record.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> - What if the handle() method itself throws an
> >>>> exception?
> >>>>> I
> >>>>>>>>>>>>>> think
> >>>>>>>>>>>>>>>>>>>> rationally and pragmatically, the behaviour must be
> >>>>> exactly
> >>>>>>>>>>>>>> like
> >>>>>>>>>>>>>>>> when
> >>>>>>>>>>>>>>>>>> no
> >>>>>>>>>>>>>>>>>>>> custom handler is defined since the user actually
> >> did
> >>>> not
> >>>>>>>>>>>>> have
> >>>>>>>>>>>>>> a
> >>>>>>>>>>>>>>>>>> working
> >>>>>>>>>>>>>>>>>>>> handler.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> I’m not convinced that ignoring an errant handler is
> >>> the
> >>>>>>>> right
> >>>>>>>>>>>>>>>> choice.
> >>>>>>>>>>>>>>>>> It
> >>>>>>>>>>>>>>>>>>> then becomes a silent failure that might have
> >>>>> repercussions,
> >>>>>>>>>>>>>>>> depending
> >>>>>>>>>>>>>>>>> on
> >>>>>>>>>>>>>>>>>>> the business logic. A user would have to proactively
> >>>> trawls
> >>>>>>>>>>>>>> through
> >>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>> logs for WARN/ERROR messages to catch it.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Throwing a hard error is pretty draconian, though…
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> - Why not use config parameters instead of an
> >>> interface?
> >>>>> As
> >>>>>>>>>>>>>>>> explained
> >>>>>>>>>>>>>>>>>> in
> >>>>>>>>>>>>>>>>>>>> the “Rejected Alternatives” section, we assume that
> >>> the
> >>>>>>>>>>>>> handler
> >>>>>>>>>>>>>>>> will
> >>>>>>>>>>>>>>>>> be
> >>>>>>>>>>>>>>>>>>>> used for a greater number of exceptions in the
> >> future.
> >>>>>>>>>>>>>> Defining a
> >>>>>>>>>>>>>>>>>>>> configuration parameter for each exception may make
> >>> the
> >>>>>>>>>>>>>>>>> configuration a
> >>>>>>>>>>>>>>>>>>> bit
> >>>>>>>>>>>>>>>>>>>> messy. Moreover, the handler offers more
> >> flexibility.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Agreed that the logic-via-configuration approach is
> >>> weird
> >>>>>>>> and
> >>>>>>>>>>>>>>>> limiting.
> >>>>>>>>>>>>>>>>>>> Forget I ever suggested it ;)
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> I’d think additional background in the Motivation
> >>> section
> >>>>>>>>> would
> >>>>>>>>>>>>>>> help
> >>>>>>>>>>>>>>>> me
> >>>>>>>>>>>>>>>>>>> understand how users might use this feature beyond a)
> >>>>>>>> skipping
> >>>>>>>>>>>>>>>>>> “oversized”
> >>>>>>>>>>>>>>>>>>> records, and b) not retrying missing topics.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Small change:
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> -ProductionExceptionHandlerResponse -> Response for
> >>>>> brevity
> >>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>> simplicity.
> >>>>>>>>>>>>>>>>>>>> Could’ve been HandlerResponse too I think!
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> The name change sounds good to me.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Thanks Alieh!
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> I thank you all again for your useful
> >>>>>>>> questions/suggestions.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> I would be happy to hear more of your concerns, as
> >>>> stated
> >>>>>>>> in
> >>>>>>>>>>>>>> some
> >>>>>>>>>>>>>>>>>>> feedback.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Cheers,
> >>>>>>>>>>>>>>>>>>>> Alieh
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> On Wed, Apr 24, 2024 at 12:31 AM Justine Olshan
> >>>>>>>>>>>>>>>>>>>> <jols...@confluent.io.invalid> wrote:
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Thanks Alieh for the updates.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> I'm a little concerned about the design pattern
> >> here.
> >>>> It
> >>>>>>>>>>>>> seems
> >>>>>>>>>>>>>>>> like
> >>>>>>>>>>>>>>>>> we
> >>>>>>>>>>>>>>>>>>> want
> >>>>>>>>>>>>>>>>>>>>> specific usages, but we are packaging it as a
> >> generic
> >>>>>>>>>>>>> handler.
> >>>>>>>>>>>>>>>>>>>>> I think we tried to narrow down on the specific
> >>> errors
> >>>> we
> >>>>>>>>>>>>> want
> >>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>> handle,
> >>>>>>>>>>>>>>>>>>>>> but it feels a little clunky as we have a generic
> >>> thing
> >>>>>>>> for
> >>>>>>>>>>>>>> two
> >>>>>>>>>>>>>>>>>> specific
> >>>>>>>>>>>>>>>>>>>>> errors.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> I'm wondering if we are using the right patterns to
> >>>> solve
> >>>>>>>>>>>>>> these
> >>>>>>>>>>>>>>>>>>> problems. I
> >>>>>>>>>>>>>>>>>>>>> agree though that we will need something more than
> >>> the
> >>>>>>>> error
> >>>>>>>>>>>>>>>> classes
> >>>>>>>>>>>>>>>>>> I'm
> >>>>>>>>>>>>>>>>>>>>> proposing if we want to have different handling be
> >>>>>>>>>>>>>> configurable.
> >>>>>>>>>>>>>>>>>>>>> My concern is that the open-endedness of a handler
> >>>> means
> >>>>>>>>>>>>> that
> >>>>>>>>>>>>>> we
> >>>>>>>>>>>>>>>> are
> >>>>>>>>>>>>>>>>>>>>> creating more problems than we are solving. It is
> >>> still
> >>>>>>>>>>>>>> unclear
> >>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>> me
> >>>>>>>>>>>>>>>>>>> how
> >>>>>>>>>>>>>>>>>>>>> we expect to handle the errors. Perhaps we could
> >>>> include
> >>>>>>>> an
> >>>>>>>>>>>>>>>> example?
> >>>>>>>>>>>>>>>>>> It
> >>>>>>>>>>>>>>>>>>>>> seems like there is a specific use case in mind and
> >>>> maybe
> >>>>>>>> we
> >>>>>>>>>>>>>> can
>
>
>

Re: [DISCUSS] KIP-1038: Add Custom Error Handler to Producer

Reply via email to