Re: [DISCUSS] KIP-1038: Add Custom Error Handler to Producer

Andrew Schofield Tue, 14 May 2024 03:25:54 -0700

Hi Alieh,
Just one final comment.

[AJS5] Existing classes use Retriable, not Retryable. For example:
https://kafka.apache.org/21/javadoc/org/apache/kafka/common/errors/RetriableException.html


I suggest RetriableResponse and NonRetriableResponse.

Thanks,
Andrew

> On 13 May 2024, at 23:17, Alieh Saeedi <asae...@confluent.io.INVALID> wrote:
>
> Hi all,
>
>
> Thanks for all the valid points you listed.
>
>
> KIP updates and addressing concerns:
>
>
> 1) The KIP now suggests two Response types: `RetryableResponse` and
> `NonRetryableResponse`
>
>
> 2) `custom.exception.handler` is changed to `custom.exception.handler.class`
>
>
> 3) The KIP clarifies that `In the case of an implemented handler for the
> specified exception, the handler takes precedence.`
>
>
> 4)  There is now a `default` implementation for both handle() methods.
>
>
> 5)  @Chris: for `UnknownTopicOrPartition`, the default is already retrying
> for 60s. (In fact, the default value of `max.block.ms`). If the handler
> instructs to FAIL or SWALLOW, there will be no retry, and if the handler
> instructs to RETRY, that will be the default behavior, which follows the
> values in already existing config parameters such as `max.block.ms`. Does
> that make sense?
>
>
> Hope the changes and explanations are convincing :)
>
>
> Cheers,
>
> Alieh
>
> On Mon, May 13, 2024 at 6:40 PM Justine Olshan <jols...@confluent.io.invalid>
> wrote:
>
>> Oh I see. The type isn't the error type but a newly defined type for the
>> response. Makes sense and works for me.
>>
>> Justine
>>
>> On Mon, May 13, 2024 at 9:13 AM Chris Egerton <fearthecel...@gmail.com>
>> wrote:
>>
>>> If we have dedicated methods for each kind of exception
>>> (handleRecordTooLarge, handleUnknownTopicOrPartition, etc.), doesn't that
>>> provide sufficient constraint? I'm not suggesting we eliminate these
>>> methods, just that we change their return types to something more
>> flexible.
>>>
>>> On Mon, May 13, 2024, 12:07 Justine Olshan <jols...@confluent.io.invalid
>>>
>>> wrote:
>>>
>>>> I'm not sure I agree with the Retriable and NonRetriableResponse
>> comment.
>>>> This doesn't limit the blast radius or enforce certain errors are used.
>>>> I think we might disagree on how controlled these interfaces can be...
>>>>
>>>> Justine
>>>>
>>>> On Mon, May 13, 2024 at 8:40 AM Chris Egerton <chr...@aiven.io.invalid
>>>
>>>> wrote:
>>>>
>>>>> Hi Alieh,
>>>>>
>>>>> Thanks for the updates! I just have a few more thoughts:
>>>>>
>>>>> - I don't think a boolean property is sufficient to dictate retries
>> for
>>>>> unknown topic partitions, though. These errors can occur if a topic
>> has
>>>>> just been created, which can occur if, for example, automatic topic
>>>>> creation is enabled for a multi-task connector. This is why I
>> proposed
>>> a
>>>>> timeout instead of a boolean (and see my previous email for why
>>> reducing
>>>>> max.block.ms for a producer is not a viable alternative). If it
>> helps,
>>>> one
>>>>> way to reproduce this yourself is to add the line
>>>>> `fooProps.put(TASKS_MAX_CONFIG, "10");` to the integration test here:
>>>>>
>>>>>
>>>>
>>>
>> https://github.com/apache/kafka/blob/5439914c32fa00d634efa7219699f1bc21add839/connect/runtime/src/test/java/org/apache/kafka/connect/integration/SourceConnectorsIntegrationTest.java#L134
>>>>> and then check the logs afterward for messages like "Error while
>>> fetching
>>>>> metadata with correlation id <n> :
>>>> {foo-topic=UNKNOWN_TOPIC_OR_PARTITION}".
>>>>>
>>>>> - I also don't think we need custom XxxResponse enums for every
>>> possible
>>>>> method; it seems like this will lead to a lot of duplication and
>>>> cognitive
>>>>> overhead if we want to expand the error handler in the future.
>>> Something
>>>>> more flexible like RetriableResponse and NonRetriableResponse could
>>>>> suffice.
>>>>>
>>>>> - Finally, the KIP still doesn't state how the handler will or won't
>>> take
>>>>> precedence over existing retry properties. If I set `retries` or `
>>>>> delivery.timeout.ms` or `max.block.ms` to low values, will that
>> cause
>>>>> retries to cease even if my custom handler would otherwise keep
>>> returning
>>>>> RETRY for an error?
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Chris
>>>>>
>>>>> On Mon, May 13, 2024 at 11:02 AM Andrew Schofield <
>>>>> andrew_schofi...@live.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Alieh,
>>>>>> Just a few more comments on the KIP. It is looking much less risky
>>> now
>>>>> the
>>>>>> scope
>>>>>> is tighter.
>>>>>>
>>>>>> [AJS1] It would be nice to have default implementations of the
>> handle
>>>>>> methods
>>>>>> so an implementor would not need to implement both themselves.
>>>>>>
>>>>>> [AJS2] Producer configurations which are class names usually end in
>>>>>> “.class”.
>>>>>> I suggest “custom.exception.handler.class”.
>>>>>>
>>>>>> [AJS3] If I implemented a handler, and I set a non-default value
>> for
>>>> one
>>>>>> of the
>>>>>> new configuations, what happens? I would expect that the handler
>>> takes
>>>>>> precedence. I wasn’t quite clear what “the control will follow the
>>>>> handler
>>>>>> instructions” meant.
>>>>>>
>>>>>> [AJS4] Because you now have an enum for the
>>>>>> RecordTooLargeExceptionResponse,
>>>>>> I don’t think you need to state in the comment for
>>>>>> ProducerExceptionHandler that
>>>>>> RETRY will be interpreted as FAIL.
>>>>>>
>>>>>> Thanks,
>>>>>> Andrew
>>>>>>
>>>>>>> On 13 May 2024, at 14:53, Alieh Saeedi
>>> <asae...@confluent.io.INVALID
>>>>>
>>>>>> wrote:
>>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>>
>>>>>>> Thanks for the very interesting discussion during my PTO.
>>>>>>>
>>>>>>>
>>>>>>> KIP updates and addressing concerns:
>>>>>>>
>>>>>>>
>>>>>>> 1) Two handle() methods are defined in ProducerExceptionHandler
>> for
>>>> the
>>>>>> two
>>>>>>> exceptions with different input parameters so that we have
>>>>>>> handle(RecordTooLargeException e, ProducerRecord record) and
>>>>>>> handle(UnknownTopicOrPartitionException e, ProducerRecord record)
>>>>>>>
>>>>>>>
>>>>>>> 2) The ProducerExceptionHandler extends `Closable` as well.
>>>>>>>
>>>>>>>
>>>>>>> 3) The KIP suggests having two more configuration parameters with
>>>>> boolean
>>>>>>> values:
>>>>>>>
>>>>>>> - `drop.invalid.large.records` with a default value of `false`
>> for
>>>>>>> swallowing too large records.
>>>>>>>
>>>>>>> - `retry.unknown.topic.partition` with a default value of `true`
>>> that
>>>>>>> performs RETRY for `max.block.ms` ms, encountering the
>>>>>>> UnknownTopicOrPartitionException.
>>>>>>>
>>>>>>>
>>>>>>> Hope the main concerns are addressed so that we can go forward
>> with
>>>>>> voting.
>>>>>>>
>>>>>>>
>>>>>>> Cheers,
>>>>>>>
>>>>>>> Alieh
>>>>>>>
>>>>>>> On Thu, May 9, 2024 at 11:25 PM Artem Livshits
>>>>>>> <alivsh...@confluent.io.invalid> wrote:
>>>>>>>
>>>>>>>> Hi Mathias,
>>>>>>>>
>>>>>>>>> [AL1] While I see the point, I would think having a different
>>>>> callback
>>>>>>>> for every exception might not really be elegant?
>>>>>>>>
>>>>>>>> I'm not sure how to assess the level of elegance of the
>> proposal,
>>>> but
>>>>> I
>>>>>> can
>>>>>>>> comment on the technical characteristics:
>>>>>>>>
>>>>>>>> 1. Having specific interfaces that codify the logic that is
>>>> currently
>>>>>>>> prescribed in the comments reduce the chance of making a
>> mistake.
>>>>>>>> Commments may get ignored, misuderstood or etc. but if the
>>> contract
>>>> is
>>>>>>>> codified, the compilier will help to enforce the contract.
>>>>>>>> 2. Given that the logic is trickier than it seems (the
>>>>> record-too-large
>>>>>> is
>>>>>>>> an example that can easily confuse someone who's not intimately
>>>>> familiar
>>>>>>>> with the nuances of the batching logic), having a little more
>>> hoops
>>>> to
>>>>>> jump
>>>>>>>> would give a greater chance that whoever tries to add a new
>> cases
>>>>> pauses
>>>>>>>> and thinks a bit more.
>>>>>>>> 3. As Justine pointed out, having different method will be a
>>> forcing
>>>>>>>> function to go through a KIP rather than smuggle new cases
>> through
>>>>>>>> implementation.
>>>>>>>> 4. Sort of a consequence of the previous 3 -- all those things
>>>> reduce
>>>>>> the
>>>>>>>> chance of someone writing the code that works with 2 errors and
>>> then
>>>>>> when
>>>>>>>> more errors are added in the future will suddenly incorrectly
>>> ignore
>>>>> new
>>>>>>>> errors (the example I gave in the previous email).
>>>>>>>>
>>>>>>>>> [AL2 cont.] Similar to AL1, I see such a handler to some extend
>>> as
>>>>>>>> business logic. If a user puts a bad filter condition in their
>> KS
>>>> app,
>>>>>> and
>>>>>>>> drops messages
>>>>>>>>
>>>>>>>> I agree that there is always a chance to get a bug and lose
>>>> messages,
>>>>>> but
>>>>>>>> there are generally separation of concerns that has different
>> risk
>>>>>> profile:
>>>>>>>> the filtering logic may be more rigorously tested and rarely
>>> changed
>>>>>> (say
>>>>>>>> an application developer does it), but setting the topics to
>>> produce
>>>>>> may be
>>>>>>>> done via configuration (e.g. a user of the application does it)
>>> and
>>>>> it's
>>>>>>>> generally an expectation that users would get an error when
>>>>>> configuration
>>>>>>>> is incorrect.
>>>>>>>>
>>>>>>>> What could be worse is that UnknownTopicOrPartitionException can
>>> be
>>>> an
>>>>>>>> intermittent error, i.e. with a generally correct configuration,
>>>> there
>>>>>>>> could be metadata propagation problem on the cluster and then a
>>>> random
>>>>>> set
>>>>>>>> of records could get lost.
>>>>>>>>
>>>>>>>>> [AL3] Maybe I misunderstand what you are saying, but to me,
>>>> checking
>>>>>> the
>>>>>>>> size of the record upfront is exactly what the KIP proposes? No?
>>>>>>>>
>>>>>>>> It achieves the same result but solves it differently, my
>>> proposal:
>>>>>>>>
>>>>>>>> 1. Application checks the validity of a record (maybe via a new
>>>>>>>> validateRecord method) before producing it, and can just exclude
>>> it
>>>> or
>>>>>>>> return an error to the user.
>>>>>>>> 2. Application produces the record -- at this point there are no
>>>>> records
>>>>>>>> that could return record too large, they were either skipped at
>>>> step 1
>>>>>> or
>>>>>>>> we didn't get here because step 1 failed.
>>>>>>>>
>>>>>>>> Vs. KIP's proposal
>>>>>>>>
>>>>>>>> 1. Application produces the record.
>>>>>>>> 2. Application gets a callback.
>>>>>>>> 3. Application returns the action on how to proceed.
>>>>>>>>
>>>>>>>> The advantage of the former is the clarity of semantics -- the
>>>> record
>>>>> is
>>>>>>>> invalid (property of the record, not a function of server state
>> or
>>>>>> server
>>>>>>>> configuration) and we can clearly know that it is the record
>> that
>>> is
>>>>> bad
>>>>>>>> and can never succeed.
>>>>>>>>
>>>>>>>> The KIP-proposed way actually has a very tricky point: it
>> actually
>>>>>> handles
>>>>>>>> a subset of record-too-large exceptions.  The broker can return
>>>>>>>> record-too-large and reject the whole batch (but we don't want
>> to
>>>>> ignore
>>>>>>>> those because then we can skip random records that just happened
>>> to
>>>> be
>>>>>> in
>>>>>>>> the same batch), in some sense we use the same error for 2
>>> different
>>>>>>>> conditions and understanding that requires pretty deep
>>> understanding
>>>>> of
>>>>>>>> Kafka internals.
>>>>>>>>
>>>>>>>> -Artem
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, May 8, 2024 at 9:47 AM Justine Olshan
>>>>>> <jols...@confluent.io.invalid
>>>>>>>>>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> My concern with respect to it being fragile: the code that
>>> ensures
>>>>> the
>>>>>>>>> error type is internal to the producer. Someone may see it and
>>>> say, I
>>>>>>>> want
>>>>>>>>> to add such and such error. This looks like internal code, so I
>>>> don't
>>>>>>>> need
>>>>>>>>> a KIP, and then they can change it to whatever they want
>> thinking
>>>> it
>>>>> is
>>>>>>>>> within the typical kafka improvement protocol.
>>>>>>>>>
>>>>>>>>> Relying on an internal change to enforce an external API is
>>> fragile
>>>>> in
>>>>>> my
>>>>>>>>> opinion. That's why I sort of agreed with Artem with enforcing
>>> the
>>>>>> error
>>>>>>>> in
>>>>>>>>> the method signature -- part of the public API.
>>>>>>>>>
>>>>>>>>> Chris's comments on requiring more information to handler again
>>>> makes
>>>>>> me
>>>>>>>>> wonder if we are solving a problem of lack of information at
>> the
>>>>>>>>> application level with a more powerful solution than we need.
>>> (Ie,
>>>> if
>>>>>> we
>>>>>>>>> had more information, could the application close and restart
>> the
>>>>>>>>> transaction rather than having to drop records) But I am happy
>> to
>>>>>>>>> compromise with a handler that we can agree is sufficiently
>>>>> controlled
>>>>>>>> and
>>>>>>>>> documented.
>>>>>>>>>
>>>>>>>>> Justine
>>>>>>>>>
>>>>>>>>> On Wed, May 8, 2024 at 7:20 AM Chris Egerton
>>>> <chr...@aiven.io.invalid
>>>>>>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Alieh,
>>>>>>>>>>
>>>>>>>>>> Continuing prior discussions:
>>>>>>>>>>
>>>>>>>>>> 1) Regarding the "flexibility" discussion, my overarching
>> point
>>> is
>>>>>>>> that I
>>>>>>>>>> don't see the point in allowing for this kind of pluggable
>> logic
>>>>>>>> without
>>>>>>>>>> also covering more scenarios. Take example 2 in the KIP: if
>>> we're
>>>>>> going
>>>>>>>>> to
>>>>>>>>>> implement retries only on "important" topics when a topic
>>>> partition
>>>>>>>> isn't
>>>>>>>>>> found, why wouldn't we also want to be able to do this for
>> other
>>>>>>>> errors?
>>>>>>>>>> Again, taking authorization errors as an example, why wouldn't
>>> we
>>>>> want
>>>>>>>> to
>>>>>>>>>> be able to fail when we can't write to "important" topics
>>> because
>>>>> the
>>>>>>>>>> producer principal lacks sufficient ACLs, and drop the record
>> if
>>>> the
>>>>>>>>> topic
>>>>>>>>>> isn't "important"? In a security-conscious environment with
>>>>>>>>>> runtime-dependent topic routing (which is a common feature of
>>> many
>>>>>>>> source
>>>>>>>>>> connectors, such as the Debezium connectors), this seems
>> fairly
>>>>>> likely.
>>>>>>>>>>
>>>>>>>>>> 2) As far as changing the shape of the API goes, I like
>> Artem's
>>>> idea
>>>>>> of
>>>>>>>>>> splitting out the interface based on specific exceptions. This
>>> may
>>>>> be
>>>>>> a
>>>>>>>>>> little laborious to expand in the future, but if we really
>> want
>>> to
>>>>>>>>>> limit the exceptions that we cover with the handler and move
>>>> slowly
>>>>>> and
>>>>>>>>>> cautiously, then IMO it'd be reasonable to reflect that in the
>>>>>>>>> interface. I
>>>>>>>>>> also acknowledge that there's no way to completely prevent
>>> people
>>>>> from
>>>>>>>>>> shooting themselves in the foot by implementing the API
>>>> incorrectly,
>>>>>>>> but
>>>>>>>>> I
>>>>>>>>>> think it's worth it to do what we can--including leveraging
>> the
>>>> Java
>>>>>>>>>> language's type system--to help them, so IMO there's value to
>>>>>>>> eliminating
>>>>>>>>>> the implicit behavior of failing when a policy returns RETRY
>>> for a
>>>>>>>>>> non-retriable error. This can take a variety of shapes and I'm
>>> not
>>>>>>>> going
>>>>>>>>> to
>>>>>>>>>> insist on anything specific, but I do want to again raise my
>>>>> concerns
>>>>>>>>> with
>>>>>>>>>> the current proposal and request that we find something a
>> little
>>>>>>>> better.
>>>>>>>>>>
>>>>>>>>>> 3) Concerning the default implementation--actually, I meant
>>> what I
>>>>>>>> wrote
>>>>>>>>> :)
>>>>>>>>>> I don't want a "second" default, I want an implementation of
>>> this
>>>>>>>>> interface
>>>>>>>>>> to be used as the default if no others are specified. The
>>> behavior
>>>>> of
>>>>>>>>> this
>>>>>>>>>> default implementation would be identical to existing behavior
>>> (so
>>>>>>>> there
>>>>>>>>>> would be no backwards compatibility concerns like the ones
>>> raised
>>>> by
>>>>>>>>>> Matthias), but it would be possible to configure this default
>>>>> handler
>>>>>>>>> class
>>>>>>>>>> to behave differently for a basic set of scenarios. This would
>>>>> mirror
>>>>>>>>> (pun
>>>>>>>>>> intended) the approach we've taken with Mirror Maker 2 and its
>>>>>>>>>> ReplicationPolicy interface [1]. There is a default
>>> implementation
>>>>>>>>>> available [2] that recognizes a handful of basic configuration
>>>>>>>> properties
>>>>>>>>>> [3] for simple tweaks, but if users want, they can also
>>> implement
>>>>>> their
>>>>>>>>> own
>>>>>>>>>> replication policy for more fine-grained logic if those
>>> properties
>>>>>>>> aren't
>>>>>>>>>> flexible enough.
>>>>>>>>>>
>>>>>>>>>> More concretely, I'm imagining something like this for the
>>>> producer
>>>>>>>>>> exception handler:
>>>>>>>>>>
>>>>>>>>>> - Default implementation class
>>>>>>>>>> of
>>>> org.apache.kafka.clients.producer.DefaultProducerExceptionHandler
>>>>>>>>>> - This class would recognize two properties:
>>>>>>>>>> - drop.invalid.large.records: Boolean property, defaults to
>>>> false.
>>>>> If
>>>>>>>>>> "false", then causes the handler to return FAIL whenever
>>>>>>>>>> a RecordTooLargeException is encountered; if "true", then
>> causes
>>>>>>>>>> SWALLOW/SKIP/DROP to be returned instead
>>>>>>>>>> - unknown.topic.partition.retry.timeout.ms: Integer
>> property,
>>>>>>>> defaults
>>>>>>>>>> to
>>>>>>>>>> INT_MAX. Whenever an UnknownTopicOrPartitionException is
>>>>> encountered,
>>>>>>>>>> causes the handler to return FAIL if that record has been
>>> pending
>>>>> for
>>>>>>>>> more
>>>>>>>>>> than the retry timeout; otherwise, causes RETRY to be returned
>>>>>>>>>>
>>>>>>>>>> I think this is worth addressing now instead of later because
>> it
>>>>>> forces
>>>>>>>>> us
>>>>>>>>>> to evaluate the usefulness of this interface and it addresses
>> a
>>>>>>>>>> long-standing issue not just with Kafka Connect, but with the
>>> Java
>>>>>>>>> producer
>>>>>>>>>> in general. For reference, here are a few tickets I collected
>>>> after
>>>>>>>>> briefly
>>>>>>>>>> skimming our Jira showing that this is a real pain point for
>>>> users:
>>>>>>>>>> https://issues.apache.org/jira/browse/KAFKA-10340,
>>>>>>>>>> https://issues.apache.org/jira/browse/KAFKA-12990,
>>>>>>>>>> https://issues.apache.org/jira/browse/KAFKA-13634. Although
>>> this
>>>> is
>>>>>>>>>> frequently reported with Kafka Connect, it applies to anyone
>> who
>>>>>>>>> configures
>>>>>>>>>> a producer to use a high retry timeout. I am aware of the
>>>>>> max.block.ms
>>>>>>>>>> property, but it's painful and IMO poor behavior to require
>>> users
>>>> to
>>>>>>>>> reduce
>>>>>>>>>> the value of this property just to handle the single scenario
>>> when
>>>>>>>> trying
>>>>>>>>>> to write to topics that don't exist, since it would also limit
>>> the
>>>>>>>> retry
>>>>>>>>>> timeout for other operations that are legitimately retriable.
>>>>>>>>>>
>>>>>>>>>> Raising new points:
>>>>>>>>>>
>>>>>>>>>> 5) I don't see the interplay between this handler and existing
>>>>>>>>>> retry-related properties mentioned anywhere in the KIP. I'm
>>>> assuming
>>>>>>>> that
>>>>>>>>>> properties like "retries", "max.block.ms", and "
>>>> delivery.timeout.ms
>>>>> "
>>>>>>>>> would
>>>>>>>>>> take precedence over the handler and once they are exhausted,
>>> the
>>>>>>>>>> record/batch will fail no matter what? If so, it's probably
>>> worth
>>>>>>>> briefly
>>>>>>>>>> mentioning this (no more than a sentence or two) in the KIP,
>> and
>>>> if
>>>>>>>> not,
>>>>>>>>>> I'm curious what you have in mind.
>>>>>>>>>>
>>>>>>>>>> 6) I also wonder if the API provides enough information in its
>>>>> current
>>>>>>>>>> form. Would it be possible to provide handlers with some way
>> of
>>>>>>>> tracking
>>>>>>>>>> how long a record has been pending for (i.e., how long it's
>> been
>>>>> since
>>>>>>>>> the
>>>>>>>>>> record was provided as an argument to Producer::send)? One way
>>> to
>>>> do
>>>>>>>> this
>>>>>>>>>> could be to add a method like `onNewRecord(ProducerRecord)`
>> and
>>>>>>>>>> allow/require the handler to do its own bookkeeping, probably
>>>> with a
>>>>>>>>>> matching `onRecordSuccess(ProducerRecord)` method so that the
>>>>> handler
>>>>>>>>>> doesn't eat up an ever-increasing amount of memory trying to
>>> track
>>>>>>>>> records.
>>>>>>>>>> An alternative could be to include information about the
>> initial
>>>>> time
>>>>>>>> the
>>>>>>>>>> record was received by the producer and the number of retries
>>> that
>>>>>> have
>>>>>>>>>> been performed for it as parameters in the handle method(s),
>> but
>>>> I'm
>>>>>>>> not
>>>>>>>>>> sure how easy this would be to implement and it might clutter
>>>> things
>>>>>>>> up a
>>>>>>>>>> bit too much.
>>>>>>>>>>
>>>>>>>>>> 7) A small request--can we add Closeable (or, if you prefer,
>>>>>>>>> AutoCloseable)
>>>>>>>>>> as a superinterface for the handler interface?
>>>>>>>>>>
>>>>>>>>>> [1] -
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>> https://kafka.apache.org/37/javadoc/org/apache/kafka/connect/mirror/ReplicationPolicy.html
>>>>>>>>>> [2] -
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>> https://kafka.apache.org/37/javadoc/org/apache/kafka/connect/mirror/DefaultReplicationPolicy.html
>>>>>>>>>> [3] -
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>> https://kafka.apache.org/37/javadoc/org/apache/kafka/connect/mirror/DefaultReplicationPolicy.html#SEPARATOR_CONFIG
>>>>>>>>>> ,
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>> https://kafka.apache.org/37/javadoc/org/apache/kafka/connect/mirror/DefaultReplicationPolicy.html#INTERNAL_TOPIC_SEPARATOR_ENABLED_CONFIG
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>>
>>>>>>>>>> Chris
>>>>>>>>>>
>>>>>>>>>> On Wed, May 8, 2024 at 12:37 AM Matthias J. Sax <
>>> mj...@apache.org
>>>>>
>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Very interesting discussion.
>>>>>>>>>>>
>>>>>>>>>>> Seems a central point is the question about "how generic" we
>>>>> approach
>>>>>>>>>>> this, and some people think we need to be conservative and
>>> others
>>>>>>>> think
>>>>>>>>>>> we should try to be as generic as possible.
>>>>>>>>>>>
>>>>>>>>>>> Personally, I think following a limited scope for this KIP
>> (by
>>>>>>>>>>> explicitly saying we only cover RecordTooLarge and
>>>>>>>>>>> UnknownTopicOrPartition) might be better. We have concrete
>>>> tickets
>>>>>>>> that
>>>>>>>>>>> we address, while for other exception (like authorization) we
>>>> don't
>>>>>>>>> know
>>>>>>>>>>> if people want to handle it to begin with. Boiling the ocean
>>>> might
>>>>>>>> not
>>>>>>>>>>> get us too far, and being somewhat pragmatic might help to
>> move
>>>>> this
>>>>>>>>> KIP
>>>>>>>>>>> forward. -- I also agree with Justin and Artem, that we want
>> to
>>>> be
>>>>>>>>>>> careful anyway to not allow users to break stuff too easily.
>>>>>>>>>>>
>>>>>>>>>>> As the same time, I agree that we should setup this change
>> in a
>>>>>>>> forward
>>>>>>>>>>> looking way, and thus having a single generic handler allows
>> us
>>>> to
>>>>>>>>> later
>>>>>>>>>>> extend the handler more easily. This should also simplify
>>> follow
>>>> up
>>>>>>>> KIP
>>>>>>>>>>> that might add new error cases (I actually mentioned one more
>>> to
>>>>>>>> Alieh
>>>>>>>>>>> already, but we both agreed that it might be best to exclude
>> it
>>>>> from
>>>>>>>>> the
>>>>>>>>>>> KIP right now, to make the 3.8 deadline. Doing a follow up
>> KIP
>>> is
>>>>> not
>>>>>>>>>>> the end of the world.)
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> @Chris:
>>>>>>>>>>>
>>>>>>>>>>> (2) This sounds fair to me, but not sure how "bad" it
>> actually
>>>>> would
>>>>>>>>> be?
>>>>>>>>>>> If the contract is clearly defined, it seems to be fine what
>>> the
>>>>> KIP
>>>>>>>>>>> proposes, and given that such a handler is an expert API, and
>>> we
>>>>> can
>>>>>>>>>>> provide "best practices" (cf my other comment below in
>> [AL1]),
>>>>> being
>>>>>>>> a
>>>>>>>>>>> little bit pragmatic sound fine to me.
>>>>>>>>>>>
>>>>>>>>>>> Not sure if I understand Justin's argument on this question?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> (3) About having a default impl or not. I am fine with adding
>>>> one,
>>>>>>>> even
>>>>>>>>>>> if I am not sure why Connect could not just add its own one
>> and
>>>>> plug
>>>>>>>> it
>>>>>>>>>>> in (and we would add corresponding configs for Connect, but
>> not
>>>> for
>>>>>>>> the
>>>>>>>>>>> producer itself)? For this case, we could also do this as a
>>>> follow
>>>>> up
>>>>>>>>>>> KIP, but happy to include it in this KIP to provide value to
>>>>> Connect
>>>>>>>>>>> right away (even if the value might not come right away if we
>>>> miss
>>>>>>>> the
>>>>>>>>>>> 3.8 deadline due to expanded KIP scope...) --  For KS, we
>> would
>>>> for
>>>>>>>>> sure
>>>>>>>>>>> plugin our own impl, and lock down the config such that users
>>>>> cannot
>>>>>>>>> set
>>>>>>>>>>> their own handler on the internal producer to begin with.
>> Might
>>>> be
>>>>>>>> good
>>>>>>>>>>> to elaborate why the producer should have a default? We might
>>>>>>>> actually
>>>>>>>>>>> want to add this to the KIP right away?
>>>>>>>>>>>
>>>>>>>>>>> The key for a default impl would be, to not change the
>> current
>>>>>>>>> behavior,
>>>>>>>>>>> and having no default seems to achieve this. For the two
>> cases
>>>> you
>>>>>>>>>>> mentioned, it's unclear to me what default value on "upper
>>> bound
>>>> on
>>>>>>>>>>> retires" for UnkownTopicOrPartitionException we should set?
>>> Seems
>>>>> it
>>>>>>>>>>> would need to be the same as `delivery.timeout.ms`? However,
>>> if
>>>>>>>> users
>>>>>>>>>>> have `delivery.timeout.ms` actually overwritten we would
>> need
>>> to
>>>>> set
>>>>>>>>>>> this config somewhat "dynamic"? Is this feasible? If we
>>>> hard-code 2
>>>>>>>>>>> minutes, it might not be backward compatible. I have the
>>>> impression
>>>>>>>> we
>>>>>>>>>>> might introduce some undesired coupling? -- For the "record
>> too
>>>>>>>> large"
>>>>>>>>>>> case, the config seems to be boolean and setting it to
>> `false`
>>> by
>>>>>>>>>>> default seems to provide backward compatibility.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> @Artem:
>>>>>>>>>>>
>>>>>>>>>>> [AL1] While I see the point, I would think having a different
>>>>>>>> callback
>>>>>>>>>>> for every exception might not really be elegant? In the end,
>>> the
>>>>>>>>> handler
>>>>>>>>>>> is an very advanced feature anyway, and if it's implemented
>> in
>>> a
>>>>> bad
>>>>>>>>>>> way, well, it's a user error -- we cannot protect users from
>>>>>>>>> everything.
>>>>>>>>>>> To me, a handler like this, is to some extend "business
>> logic"
>>>> and
>>>>>>>> if a
>>>>>>>>>>> user gets business logic wrong, it's hard to protect them. --
>>> We
>>>>>>>> would
>>>>>>>>>>> of course provide best practice guidance in the JaveDocs, and
>>>>> explain
>>>>>>>>>>> that a handler should have explicit `if` statements for stuff
>>> it
>>>>> want
>>>>>>>>> to
>>>>>>>>>>> handle, and only a single default which return FAIL.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> [AL2] Yes, but for KS we would retry at the application
>> layer.
>>>> Ie,
>>>>>>>> the
>>>>>>>>>>> TX is not completed, we clean up and setup out task from
>>> scratch,
>>>>> to
>>>>>>>>>>> ensure the pending transaction is completed before we resume.
>>> If
>>>>> the
>>>>>>>> TX
>>>>>>>>>>> was indeed aborted, we would retry from older offset and thus
>>>> just
>>>>>>>> hit
>>>>>>>>>>> the same error again and the loop begins again.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> [AL2 cont.] Similar to AL1, I see such a handler to some
>> extend
>>>> as
>>>>>>>>>>> business logic. If a user puts a bad filter condition in
>> their
>>> KS
>>>>>>>> app,
>>>>>>>>>>> and drops messages, it nothing we can do about it, and this
>>>> handler
>>>>>>>>>>> IMHO, has a similar purpose. This is also the line of
>> thinking
>>> I
>>>>>>>> apply
>>>>>>>>>>> to EOS, to address Justin's concern about "should we allow to
>>>> drop
>>>>>>>> for
>>>>>>>>>>> EOS", and my answer is "yes", because it's more business
>> logic
>>>> than
>>>>>>>>>>> actual error handling IMHO. And by default, we fail... So
>> users
>>>>>>>> opt-in
>>>>>>>>>>> to add business logic to drop records. It's an application
>>> level
>>>>>>>>>>> decision how to write the code.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> [AL3] Maybe I misunderstand what you are saying, but to me,
>>>>> checking
>>>>>>>>> the
>>>>>>>>>>> size of the record upfront is exactly what the KIP proposes?
>>> No?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> @Justin:
>>>>>>>>>>>
>>>>>>>>>>>> I saw the sample
>>>>>>>>>>>> code -- is it just an if statement checking for the error
>>> before
>>>>>>>> the
>>>>>>>>>>>> handler is invoked? That seems a bit fragile.
>>>>>>>>>>>
>>>>>>>>>>> What do you mean by fragile? Not sure if I see your point.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> -Matthias
>>>>>>>>>>>
>>>>>>>>>>> On 5/7/24 5:33 PM, Artem Livshits wrote:
>>>>>>>>>>>> Hi Alieh,
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for the KIP.  The motivation talks about very
>> specific
>>>>>>>> cases,
>>>>>>>>>> but
>>>>>>>>>>>> the interface is generic.
>>>>>>>>>>>>
>>>>>>>>>>>> [AL1]
>>>>>>>>>>>> If the interface evolves in the future I think we could have
>>> the
>>>>>>>>>>> following
>>>>>>>>>>>> confusion:
>>>>>>>>>>>>
>>>>>>>>>>>> 1. A user implemented SWALLOW action for both
>>>>>>>> RecordTooLargeException
>>>>>>>>>> and
>>>>>>>>>>>> UnknownTopicOrPartitionException.  For simpicity they just
>>>> return
>>>>>>>>>> SWALLOW
>>>>>>>>>>>> from the function, because it elegantly handles all known
>>> cases.
>>>>>>>>>>>> 2. The interface has evolved to support a new exception.
>>>>>>>>>>>> 3. The user has upgraded their Kafka client.
>>>>>>>>>>>>
>>>>>>>>>>>> Now a new kind of error gets dropped on the floor without
>>> user's
>>>>>>>>>>> intention
>>>>>>>>>>>> and it would be super hard to detect and debug.
>>>>>>>>>>>>
>>>>>>>>>>>> To avoid the confusion, I think we should use handlers for
>>>>> specific
>>>>>>>>>>>> exceptions.  Then if a new exception is added it won't get
>>>>> silently
>>>>>>>>>>> swalled
>>>>>>>>>>>> because the user would need to add new functionality to
>> handle
>>>> it.
>>>>>>>>>>>>
>>>>>>>>>>>> I also have some higher level comments:
>>>>>>>>>>>>
>>>>>>>>>>>> [AL2]
>>>>>>>>>>>>> it throws a TimeoutException, and the user can only blindly
>>>>> retry,
>>>>>>>>>> which
>>>>>>>>>>>> may result in an infinite retry loop
>>>>>>>>>>>>
>>>>>>>>>>>> If the TimeoutException happens during transactional
>>> processing
>>>>>>>>>> (exactly
>>>>>>>>>>>> once is the desired sematnics), then the client should not
>>> retry
>>>>>>>> when
>>>>>>>>>> it
>>>>>>>>>>>> gets TimeoutException because without knowing the reason for
>>>>>>>>>>>> TimeoutExceptions, the client cannot know whether the
>> message
>>>> got
>>>>>>>>>>> actually
>>>>>>>>>>>> produced or not and retrying the message may result in
>>>>> duplicatees.
>>>>>>>>>>>>
>>>>>>>>>>>>> The thrown TimeoutException "cuts" the connection to the
>>>>>>>> underlying
>>>>>>>>>> root
>>>>>>>>>>>> cause of missing metadata
>>>>>>>>>>>>
>>>>>>>>>>>> Maybe we should fix the error handling and return the proper
>>>>>>>>> underlying
>>>>>>>>>>>> message?  Then the application can properly handle the
>> message
>>>>>>>> based
>>>>>>>>> on
>>>>>>>>>>>> preferences.
>>>>>>>>>>>>
>>>>>>>>>>>> From the product perspective, it's not clear how safe it is
>> to
>>>>>>>>> blindly
>>>>>>>>>>>> ignore UnknownTopicOrPartitionException.  This could lead to
>>>>>>>>> situations
>>>>>>>>>>>> when a simple typo could lead to massive data loss (part of
>>> the
>>>>>>>> data
>>>>>>>>>>> would
>>>>>>>>>>>> effectively be produced to a "black hole" and the user may
>> not
>>>>>>>> notice
>>>>>>>>>> it
>>>>>>>>>>>> for a while).
>>>>>>>>>>>>
>>>>>>>>>>>> In which situations would you recommend the user to "black
>>> hole"
>>>>>>>>>> messages
>>>>>>>>>>>> in case of misconfiguration?
>>>>>>>>>>>>
>>>>>>>>>>>> [AL3]
>>>>>>>>>>>>
>>>>>>>>>>>>> If the custom handler decides on SWALLOW for
>>>>>>>>> RecordTooLargeException,
>>>>>>>>>>>>
>>>>>>>>>>>> Is it my understanding that this KIP proposes that
>>> functionality
>>>>>>>> that
>>>>>>>>>>> would
>>>>>>>>>>>> only be able to SWALLOW RecordTooLargeException that happen
>>>>> because
>>>>>>>>> the
>>>>>>>>>>>> producer cannot produce the record (if the broker rejects
>> the
>>>>>>>> batch,
>>>>>>>>>> the
>>>>>>>>>>>> error won't get to the handler, because we cannot know which
>>>> other
>>>>>>>>>>> records
>>>>>>>>>>>> get ignored).  In this case, why not just check the locally
>>>>>>>>> configured
>>>>>>>>>>> max
>>>>>>>>>>>> record size upfront and not produce the recrord in the first
>>>>> place?
>>>>>>>>>>> Maybe
>>>>>>>>>>>> we can expose a validation function from the producer that
>>> could
>>>>>>>>>> validate
>>>>>>>>>>>> the records locally, so we don't need to produce the record
>> in
>>>>>>>> order
>>>>>>>>> to
>>>>>>>>>>>> know that it's invalid.
>>>>>>>>>>>>
>>>>>>>>>>>> -Artem
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, May 7, 2024 at 2:07 PM Justine Olshan
>>>>>>>>>>> <jols...@confluent.io.invalid>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Alieh and Chris,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for clarifying 1) but I saw the motivation. I guess
>> I
>>>> just
>>>>>>>>>> didn't
>>>>>>>>>>>>> understand how that would be ensured on the producer side.
>> I
>>>> saw
>>>>>>>> the
>>>>>>>>>>> sample
>>>>>>>>>>>>> code -- is it just an if statement checking for the error
>>>> before
>>>>>>>> the
>>>>>>>>>>>>> handler is invoked? That seems a bit fragile.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Can you clarify what you mean by `since the code does not
>>> reach
>>>>>>>> the
>>>>>>>>> KS
>>>>>>>>>>>>> interface and breaks somewhere in producer.` If we surfaced
>>>> this
>>>>>>>>> error
>>>>>>>>>>> to
>>>>>>>>>>>>> the application in a better way would that also be a
>> solution
>>>> to
>>>>>>>> the
>>>>>>>>>>> issue?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Justine
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, May 7, 2024 at 1:55 PM Alieh Saeedi
>>>>>>>>>>> <asae...@confluent.io.invalid>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thank you, Chris and Justine, for the feedback.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> @Chris
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1) Flexibility: it has two meanings. The first meaning is
>>> the
>>>>> one
>>>>>>>>> you
>>>>>>>>>>>>>> mentioned. We are going to cover more exceptions in the
>>>> future,
>>>>>>>> but
>>>>>>>>>> as
>>>>>>>>>>>>>> Justine mentioned, we must be very conservative about
>> adding
>>>>> more
>>>>>>>>>>>>>> exceptions. Additionally, flexibility mainly means that
>> the
>>>> user
>>>>>>>> is
>>>>>>>>>>> able
>>>>>>>>>>>>> to
>>>>>>>>>>>>>> develop their own code. As mentioned in the motivation
>>> section
>>>>>>>> and
>>>>>>>>>> the
>>>>>>>>>>>>>> examples, sometimes the user decides on dropping a record
>>>> based
>>>>>>>> on
>>>>>>>>>> the
>>>>>>>>>>>>>> topic, for example.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2) Defining two separate methods for retriable and
>>>> non-retriable
>>>>>>>>>>>>>> exceptions: although the idea is brilliant, the user may
>>> still
>>>>>>>>> make a
>>>>>>>>>>>>>> mistake by implementing the wrong method and see a
>>>> non-expecting
>>>>>>>>>>>>> behaviour.
>>>>>>>>>>>>>> For example, he may implement handleRetriable() for
>>>>>>>>>>>>> RecordTooLargeException
>>>>>>>>>>>>>> and define SWALLOW for the exception, but in practice, he
>>> sees
>>>>> no
>>>>>>>>>>> change
>>>>>>>>>>>>> in
>>>>>>>>>>>>>> default behaviour since he implemented the wrong method. I
>>>> think
>>>>>>>> we
>>>>>>>>>> can
>>>>>>>>>>>>>> never reduce the user’s mistakes to 0.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 3) Default implementation for Handler: the default
>> behaviour
>>>> is
>>>>>>>>>> already
>>>>>>>>>>>>>> preserved with NO need of implementing any handler or
>>> setting
>>>>> the
>>>>>>>>>>>>>> corresponding config parameter `custom.exception.handler`.
>>>> What
>>>>>>>> you
>>>>>>>>>>> mean
>>>>>>>>>>>>> is
>>>>>>>>>>>>>> actually having a second default, which requires having
>> both
>>>>>>>>>> interface
>>>>>>>>>>>>> and
>>>>>>>>>>>>>> config parameters. About UnknownTopicOrPartitionException:
>>> the
>>>>>>>>>> producer
>>>>>>>>>>>>>> already offers the config parameter `max.block.ms` which
>>>>>>>>> determines
>>>>>>>>>>> the
>>>>>>>>>>>>>> duration of retrying. The main purpose of the user who
>> needs
>>>> the
>>>>>>>>>>> handler
>>>>>>>>>>>>> is
>>>>>>>>>>>>>> to get the root cause of TimeoutException and handle it in
>>> the
>>>>>>>> way
>>>>>>>>> he
>>>>>>>>>>>>>> intends. The KIP explains the necessity of it for KS
>> users.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 4) Naming issue: By SWALLOW, we meant actually swallow the
>>>>> error,
>>>>>>>>>> while
>>>>>>>>>>>>>> SKIP means skip the record, I think. If it makes sense for
>>>> more
>>>>>>>>> ppl,
>>>>>>>>>> I
>>>>>>>>>>>>> can
>>>>>>>>>>>>>> change it to SKIP
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> @Justine
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1) was addressed by Chris.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2 and 3) The problem is exactly what you mentioned.
>>> Currently,
>>>>>>>>> there
>>>>>>>>>> is
>>>>>>>>>>>>> no
>>>>>>>>>>>>>> way to handle these issues application-side. Even KS users
>>> who
>>>>>>>>>>> implement
>>>>>>>>>>>>> KS
>>>>>>>>>>>>>> ProductionExceptionHandler are not able to handle the
>>>> exceptions
>>>>>>>> as
>>>>>>>>>>> they
>>>>>>>>>>>>>> intend since the code does not reach the KS interface and
>>>> breaks
>>>>>>>>>>>>> somewhere
>>>>>>>>>>>>>> in producer.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>> Alieh
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, May 7, 2024 at 8:43 PM Chris Egerton <
>>>>>>>>>> fearthecel...@gmail.com>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Justine,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The method signatures for the interface are indeed
>>>> open-ended,
>>>>>>>> but
>>>>>>>>>> the
>>>>>>>>>>>>>> KIP
>>>>>>>>>>>>>>> states that its uses will be limited. See the motivation
>>>>>>>> section:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> We believe that the user should be able to develop
>> custom
>>>>>>>>> exception
>>>>>>>>>>>>>>> handlers for managing producer exceptions. On the other
>>> hand,
>>>>>>>> this
>>>>>>>>>>> will
>>>>>>>>>>>>>> be
>>>>>>>>>>>>>>> an expert-level API, and using that may result in strange
>>>>>>>>> behaviour
>>>>>>>>>> in
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>> system, making it hard to find the root cause. Therefore,
>>> the
>>>>>>>>> custom
>>>>>>>>>>>>>>> handler is currently limited to handling
>>>>> RecordTooLargeException
>>>>>>>>> and
>>>>>>>>>>>>>>> UnknownTopicOrPartitionException.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Chris
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, May 7, 2024, 14:37 Justine Olshan
>>>>>>>>>>> <jols...@confluent.io.invalid
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Alieh,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I was out for KSB and then was also sick. :(
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> To your point 1) Chris, I don't think it is limited to
>> two
>>>>>>>>> specific
>>>>>>>>>>>>>>>> scenarios, since the interface accepts a generic
>>> Exception e
>>>>>>>> and
>>>>>>>>>> can
>>>>>>>>>>>>> be
>>>>>>>>>>>>>>>> implemented to check if that e is an instanceof any
>>>> exception.
>>>>>>>> I
>>>>>>>>>>>>> didn't
>>>>>>>>>>>>>>> see
>>>>>>>>>>>>>>>> anywhere that specific errors are enforced. I'm a bit
>>>>> concerned
>>>>>>>>>> about
>>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>> actually. I'm concerned about the opened-endedness and
>> the
>>>>>>>>> contract
>>>>>>>>>>>>> we
>>>>>>>>>>>>>>> have
>>>>>>>>>>>>>>>> with transactions. We are allowing the client to make
>>>>> decisions
>>>>>>>>>> that
>>>>>>>>>>>>>> are
>>>>>>>>>>>>>>>> somewhat invisible to the server. As an aside, can we
>>> build
>>>> in
>>>>>>>>> log
>>>>>>>>>>>>>>> messages
>>>>>>>>>>>>>>>> when the handler decides to skip etc a message. I'm
>> really
>>>>>>>>>> concerned
>>>>>>>>>>>>>>> about
>>>>>>>>>>>>>>>> messages being silently dropped.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I do think Chris's point 2) about retriable vs non
>>> retriable
>>>>>>>>> errors
>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>> fair. I'm a bit concerned about skipping a unknown topic
>>> or
>>>>>>>>>> partition
>>>>>>>>>>>>>>>> exception too early, as there are cases where it can be
>>>>>>>>> transient.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'm still a little bit wary of allowing dropping records
>>> as
>>>>>>>> part
>>>>>>>>> of
>>>>>>>>>>>>> EOS
>>>>>>>>>>>>>>>> generally as in many cases, these errors signify an
>> issue
>>>> with
>>>>>>>>> the
>>>>>>>>>>>>>>> original
>>>>>>>>>>>>>>>> data. I understand that streams and connect/mirror maker
>>> may
>>>>>>>> have
>>>>>>>>>>>>>> reasons
>>>>>>>>>>>>>>>> they want to progress past these messages, but wondering
>>> if
>>>>>>>> there
>>>>>>>>>> is
>>>>>>>>>>>>> a
>>>>>>>>>>>>>>> way
>>>>>>>>>>>>>>>> that can be done application-side. I'm willing to accept
>>>> this
>>>>>>>>> sort
>>>>>>>>>> of
>>>>>>>>>>>>>>>> proposal if we can make it clear that this sort of thing
>>> is
>>>>>>>>>> happening
>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>> we limit the blast radius for what we can do.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Justine
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Tue, May 7, 2024 at 9:55 AM Chris Egerton
>>>>>>>>>> <chr...@aiven.io.invalid
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi Alieh,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Sorry for the delay, I've been out sick. I still have
>>> some
>>>>>>>>>> thoughts
>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>> I'd like to see addressed before voting.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 1) If flexibility is the motivation for a pluggable
>>>>> interface,
>>>>>>>>> why
>>>>>>>>>>>>>> are
>>>>>>>>>>>>>>> we
>>>>>>>>>>>>>>>>> only limiting the uses for this interface to two very
>>>>> specific
>>>>>>>>>>>>>>> scenarios?
>>>>>>>>>>>>>>>>> Why not also allow, e.g., authorization errors to be
>>>> handled
>>>>>>>> as
>>>>>>>>>>>>> well
>>>>>>>>>>>>>>>>> (allowing users to drop records destined for some
>>>> off-limits
>>>>>>>>>>>>> topics,
>>>>>>>>>>>>>> or
>>>>>>>>>>>>>>>>> retry for a limited duration in case there's a delay in
>>> the
>>>>>>>>>>>>>> propagation
>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>> ACL updates)? It'd be nice to see some analysis of
>> other
>>>>>>>> errors
>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>> could
>>>>>>>>>>>>>>>>> be handled with this new API, both to avoid the
>> follow-up
>>>>> work
>>>>>>>>> of
>>>>>>>>>>>>>>> another
>>>>>>>>>>>>>>>>> KIP to address them in the future, and to make sure
>> that
>>>>> we're
>>>>>>>>> not
>>>>>>>>>>>>>>>> painting
>>>>>>>>>>>>>>>>> ourselves into a corner with the current API in a way
>>> that
>>>>>>>> would
>>>>>>>>>>>>> make
>>>>>>>>>>>>>>>>> future modifications difficult.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 2) Something feels a bit off with how retriable vs.
>>>>>>>>> non-retriable
>>>>>>>>>>>>>>> errors
>>>>>>>>>>>>>>>>> are handled with the interface. Why not introduce two
>>>>> separate
>>>>>>>>>>>>>> methods
>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>> handle each case separately? That way there's no
>>> ambiguity
>>>> or
>>>>>>>>>>>>>> implicit
>>>>>>>>>>>>>>>>> behavior when, e.g., attempting to retry on a
>>>>>>>>>>>>>> RecordTooLargeException.
>>>>>>>>>>>>>>>> This
>>>>>>>>>>>>>>>>> could be something like `NonRetriableResponse
>>>>>>>>>>>>> handle(ProducerRecord,
>>>>>>>>>>>>>>>>> Exception)` and `RetriableResponse
>>>>>>>>> handleRetriable(ProducerRecord,
>>>>>>>>>>>>>>>>> Exception)`, though the exact names and shape can
>>> obviously
>>>>> be
>>>>>>>>>>>>> toyed
>>>>>>>>>>>>>>>> with a
>>>>>>>>>>>>>>>>> bit.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 3) Although the flexibility of a pluggable interface
>> may
>>>>>>>> benefit
>>>>>>>>>>>>> some
>>>>>>>>>>>>>>>>> users' custom producer applications and Kafka Streams
>>>>>>>>>> applications,
>>>>>>>>>>>>>> it
>>>>>>>>>>>>>>>>> comes at significant deployment cost for other
>>> low-/no-code
>>>>>>>>>>>>>>> environments,
>>>>>>>>>>>>>>>>> including but not limited to Kafka Connect and
>>> MirrorMaker
>>>> 2.
>>>>>>>>> Can
>>>>>>>>>>>>> we
>>>>>>>>>>>>>>> add
>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>> default implementation of the exception handler that
>>> allows
>>>>>>>> for
>>>>>>>>>>>>> some
>>>>>>>>>>>>>>>> simple
>>>>>>>>>>>>>>>>> behavior to be tweaked via configuration property? Two
>>>> things
>>>>>>>>> that
>>>>>>>>>>>>>>> would
>>>>>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>> nice to have would be A) an upper bound on the retry
>> time
>>>> for
>>>>>>>>>>>>>>>>> unknown-topic-partition exceptions and B) an option to
>>> drop
>>>>>>>>>> records
>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>> are large enough to trigger a record-too-large
>> exception.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 4) I'd still prefer to see "SKIP" or "DROP" instead of
>>> the
>>>>>>>>>> proposed
>>>>>>>>>>>>>>>>> "SWALLOW" option, which IMO is opaque and non-obvious,
>>>>>>>>> especially
>>>>>>>>>>>>>> when
>>>>>>>>>>>>>>>>> trying to guess the behavior for retriable errors.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Chris
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Fri, May 3, 2024 at 11:23 AM Alieh Saeedi
>>>>>>>>>>>>>>>> <asae...@confluent.io.invalid
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> A summary of the KIP and the discussions:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> The KIP introduces a handler interface for Producer in
>>>> order
>>>>>>>> to
>>>>>>>>>>>>>>> handle
>>>>>>>>>>>>>>>>> two
>>>>>>>>>>>>>>>>>> exceptions: RecordTooLargeException and
>>>>>>>>>>>>>>>> UnknownTopicOrPartitionException.
>>>>>>>>>>>>>>>>>> The handler handles the exceptions per-record.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> - Do we need this handler?  [Motivation and Examples
>>>>>>>> sections]
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> RecordTooLargeException: 1) In transactions, the
>>> producer
>>>>>>>>>>>>> collects
>>>>>>>>>>>>>>>>> multiple
>>>>>>>>>>>>>>>>>> records in batches. Then a RecordTooLargeException
>>> related
>>>>>>>> to a
>>>>>>>>>>>>>>> single
>>>>>>>>>>>>>>>>>> record leads to failing the entire batch. A custom
>>>> exception
>>>>>>>>>>>>>> handler
>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>> this case may decide on dropping the record and
>>> continuing
>>>>>>>> the
>>>>>>>>>>>>>>>>> processing.
>>>>>>>>>>>>>>>>>> See Example 1, please. 2) More over, in Kafka
>> Streams, a
>>>>>>>> record
>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>> too
>>>>>>>>>>>>>>>>>> large is a poison pill record, and there is no way to
>>> skip
>>>>>>>> over
>>>>>>>>>>>>>> it. A
>>>>>>>>>>>>>>>>>> handler would allow us to react to this error inside
>> the
>>>>>>>>>>>>> producer,
>>>>>>>>>>>>>>>> i.e.,
>>>>>>>>>>>>>>>>>> local to where the error happens, and thus simplify
>> the
>>>>>>>> overall
>>>>>>>>>>>>>> code
>>>>>>>>>>>>>>>>>> significantly. Please read the Motivation section for
>>> more
>>>>>>>>>>>>>>> explanation.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> UnknownTopicOrPartitionException: For this case, the
>>>>> producer
>>>>>>>>>>>>>> handles
>>>>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>>> exception internally and only issues a WARN log about
>>>>> missing
>>>>>>>>>>>>>>> metadata
>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>> retries internally. Later, when the producer hits "
>>>>>>>>>>>>>>> deliver.timeout.ms"
>>>>>>>>>>>>>>>>> it
>>>>>>>>>>>>>>>>>> throws a TimeoutException, and the user can only
>> blindly
>>>>>>>> retry,
>>>>>>>>>>>>>> which
>>>>>>>>>>>>>>>> may
>>>>>>>>>>>>>>>>>> result in an infinite retry loop. The thrown
>>>>> TimeoutException
>>>>>>>>>>>>>> "cuts"
>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>> connection to the underlying root cause of missing
>>>> metadata
>>>>>>>>>>>>> (which
>>>>>>>>>>>>>>>> could
>>>>>>>>>>>>>>>>>> indeed be a transient error but is persistent for a
>>>>>>>>> non-existing
>>>>>>>>>>>>>>>> topic).
>>>>>>>>>>>>>>>>>> Thus, there is no programmatic way to break the
>> infinite
>>>>>>>> retry
>>>>>>>>>>>>>> loop.
>>>>>>>>>>>>>>>>> Kafka
>>>>>>>>>>>>>>>>>> Streams also blindly retries for this case, and the
>>>>>>>> application
>>>>>>>>>>>>>> gets
>>>>>>>>>>>>>>>>> stuck.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> - Having interface vs configuration option:
>> [Motivation,
>>>>>>>>>>>>> Examples,
>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>> Rejected Alternatives sections]
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Our solution is introducing an interface due to the
>> full
>>>>>>>>>>>>>> flexibility
>>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>>> it offers. Sometimes users, especially Kafka Streams
>>> ones,
>>>>>>>>>>>>>> determine
>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>> handler's behaviour based on the situation. For
>>> example, f
>>>>>>>>>>>>>>>>>> acing UnknownTopicOrPartitionException*, *the user may
>>>> want
>>>>>>>> to
>>>>>>>>>>>>>> raise
>>>>>>>>>>>>>>> an
>>>>>>>>>>>>>>>>>> error for some topics but retry it for other topics.
>>>> Having
>>>>> a
>>>>>>>>>>>>>>>>> configuration
>>>>>>>>>>>>>>>>>> option with a fixed set of possibilities does not
>> serve
>>>> the
>>>>>>>>>>>>> user's
>>>>>>>>>>>>>>>>>> needs. See Example 2, please.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> - Note on RecordTooLargeException: [Public Interfaces
>>>>>>>> section]
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> If the custom handler decides on SWALLOW for
>>>>>>>>>>>>>> RecordTooLargeException,
>>>>>>>>>>>>>>>>> then
>>>>>>>>>>>>>>>>>> this record will not be a part of the batch of
>>>> transactions
>>>>>>>> and
>>>>>>>>>>>>>> will
>>>>>>>>>>>>>>>> also
>>>>>>>>>>>>>>>>>> not be sent to the broker in non-transactional mode.
>> So
>>> no
>>>>>>>>>>>>> worries
>>>>>>>>>>>>>>>> about
>>>>>>>>>>>>>>>>>> getting a RecordTooLargeException from the broker in
>>> this
>>>>>>>> case,
>>>>>>>>>>>>> as
>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>> record will never ever be sent to the broker. SWALLOW
>>>> means
>>>>>>>>> drop
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> record
>>>>>>>>>>>>>>>>>> and continue/swallow the error.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> - What if the handle() method implements RETRY for
>>>>>>>>>>>>>>>>> RecordTooLargeException?
>>>>>>>>>>>>>>>>>> [Proposed Changes section]
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> We have to limit the user to only have FAIL or SWALLOW
>>> for
>>>>>>>>>>>>>>>>>> RecordTooLargeException. Actually, RETRY must be equal
>>> to
>>>>>>>> FAIL.
>>>>>>>>>>>>>> This
>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>> well documented/informed in javadoc.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> - What if the handle() method of the handler throws an
>>>>>>>>> exception?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> The handler is expected to have correct code. If it
>>> throws
>>>>> an
>>>>>>>>>>>>>>>> exception,
>>>>>>>>>>>>>>>>>> everything fails.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> This is a PoC PR <
>>>>> https://github.com/apache/kafka/pull/15846
>>>>>>>>>
>>>>>>>>>>>>> ONLY
>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>> RecordTooLargeException. The code changes related to
>>>>>>>>>>>>>>>>>> UnknownTopicOrPartitionException will be added to this
>>> PR
>>>>>>>>> LATER.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Looking forward to your feedback again.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Alieh
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Thu, Apr 25, 2024 at 11:46 PM Kirk True <
>>>>>>>> k...@kirktrue.pro>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi Alieh,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks for the updates!
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Comments inline...
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Apr 25, 2024, at 1:10 PM, Alieh Saeedi
>>>>>>>>>>>>>>>>> <asae...@confluent.io.INVALID
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks a lot for the constructive feedbacks!
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Addressing some of the main concerns:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> - The `RecordTooLargeException` can be thrown by
>>> broker,
>>>>>>>>>>>>>> producer
>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>> consumer. Of course, the `ProducerExceptionHandler`
>>>>>>>> interface
>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>> introduced
>>>>>>>>>>>>>>>>>>>> to affect only the exceptions thrown from the
>>> producer.
>>>>>>>> This
>>>>>>>>>>>>>> KIP
>>>>>>>>>>>>>>>> very
>>>>>>>>>>>>>>>>>>>> specifically means to provide a possibility to
>> manage
>>>> the
>>>>>>>>>>>>>>>>>>>> `RecordTooLargeException` thrown from the
>>>> Producer.send()
>>>>>>>>>>>>>> method.
>>>>>>>>>>>>>>>>>> Please
>>>>>>>>>>>>>>>>>>>> see “Proposed Changes” section for more clarity. I
>>>>>>>>>>>>> investigated
>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>> issue
>>>>>>>>>>>>>>>>>>>> there thoroughly. I hope it can explain the concern
>>>> about
>>>>>>>> how
>>>>>>>>>>>>>> we
>>>>>>>>>>>>>>>>> handle
>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>> errors as well.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> - The problem with Callback: Methods of Callback are
>>>>> called
>>>>>>>>>>>>>> when
>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>> record
>>>>>>>>>>>>>>>>>>>> sent to the server is acknowledged, while this is
>> not
>>>> the
>>>>>>>>>>>>>> desired
>>>>>>>>>>>>>>>>> time
>>>>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>>> all exceptions. We intend to handle exceptions
>>>> beforehand.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I guess it makes sense to keep the expectation for
>> when
>>>>>>>>>>>>> Callback
>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>> invoked as-is vs. shoehorning more into it.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> - What if the custom handler returns RETRY for
>>>>>>>>>>>>>>>>>>> `RecordTooLargeException`? I
>>>>>>>>>>>>>>>>>>>> assume changing the producer configuration at
>> runtime
>>> is
>>>>>>>>>>>>>>> possible.
>>>>>>>>>>>>>>>> If
>>>>>>>>>>>>>>>>>> so,
>>>>>>>>>>>>>>>>>>>> RETRY for a too large record is valid because maybe
>> in
>>>> the
>>>>>>>>>>>>> next
>>>>>>>>>>>>>>>> try,
>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>> too large record is not poisoning any more. I am not
>>>> 100%
>>>>>>>>>>>>> sure
>>>>>>>>>>>>>>>> about
>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>> technical details, though. Otherwise, we can
>> consider
>>>> the
>>>>>>>>>>>>> RETRY
>>>>>>>>>>>>>>> as
>>>>>>>>>>>>>>>>> FAIL
>>>>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>>> this exception. Another solution would be to
>> consider
>>> a
>>>>>>>>>>>>>> constant
>>>>>>>>>>>>>>>>> number
>>>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>> times for RETRY which can be useful for other
>>> exceptions
>>>>> as
>>>>>>>>>>>>>> well.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> It’s not presently possible to change the
>> configuration
>>>> of
>>>>>>>> an
>>>>>>>>>>>>>>>> existing
>>>>>>>>>>>>>>>>>>> Producer at runtime. So if a record hits a
>>>>>>>>>>>>>> RecordTooLargeException
>>>>>>>>>>>>>>>>> once,
>>>>>>>>>>>>>>>>>> no
>>>>>>>>>>>>>>>>>>> amount of retrying (with the current Producer) will
>>>> change
>>>>>>>>> that
>>>>>>>>>>>>>>> fact.
>>>>>>>>>>>>>>>>> So
>>>>>>>>>>>>>>>>>>> I’m still a little stuck on how to handle a response
>> of
>>>>>>>> RETRY
>>>>>>>>>>>>> for
>>>>>>>>>>>>>>> an
>>>>>>>>>>>>>>>>>>> “oversized” record.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> - What if the handle() method itself throws an
>>>> exception?
>>>>> I
>>>>>>>>>>>>>> think
>>>>>>>>>>>>>>>>>>>> rationally and pragmatically, the behaviour must be
>>>>> exactly
>>>>>>>>>>>>>> like
>>>>>>>>>>>>>>>> when
>>>>>>>>>>>>>>>>>> no
>>>>>>>>>>>>>>>>>>>> custom handler is defined since the user actually
>> did
>>>> not
>>>>>>>>>>>>> have
>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>> working
>>>>>>>>>>>>>>>>>>>> handler.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I’m not convinced that ignoring an errant handler is
>>> the
>>>>>>>> right
>>>>>>>>>>>>>>>> choice.
>>>>>>>>>>>>>>>>> It
>>>>>>>>>>>>>>>>>>> then becomes a silent failure that might have
>>>>> repercussions,
>>>>>>>>>>>>>>>> depending
>>>>>>>>>>>>>>>>> on
>>>>>>>>>>>>>>>>>>> the business logic. A user would have to proactively
>>>> trawls
>>>>>>>>>>>>>> through
>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>> logs for WARN/ERROR messages to catch it.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Throwing a hard error is pretty draconian, though…
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> - Why not use config parameters instead of an
>>> interface?
>>>>> As
>>>>>>>>>>>>>>>> explained
>>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>> the “Rejected Alternatives” section, we assume that
>>> the
>>>>>>>>>>>>> handler
>>>>>>>>>>>>>>>> will
>>>>>>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>>>>> used for a greater number of exceptions in the
>> future.
>>>>>>>>>>>>>> Defining a
>>>>>>>>>>>>>>>>>>>> configuration parameter for each exception may make
>>> the
>>>>>>>>>>>>>>>>> configuration a
>>>>>>>>>>>>>>>>>>> bit
>>>>>>>>>>>>>>>>>>>> messy. Moreover, the handler offers more
>> flexibility.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Agreed that the logic-via-configuration approach is
>>> weird
>>>>>>>> and
>>>>>>>>>>>>>>>> limiting.
>>>>>>>>>>>>>>>>>>> Forget I ever suggested it ;)
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I’d think additional background in the Motivation
>>> section
>>>>>>>>> would
>>>>>>>>>>>>>>> help
>>>>>>>>>>>>>>>> me
>>>>>>>>>>>>>>>>>>> understand how users might use this feature beyond a)
>>>>>>>> skipping
>>>>>>>>>>>>>>>>>> “oversized”
>>>>>>>>>>>>>>>>>>> records, and b) not retrying missing topics.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Small change:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> -ProductionExceptionHandlerResponse -> Response for
>>>>> brevity
>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>> simplicity.
>>>>>>>>>>>>>>>>>>>> Could’ve been HandlerResponse too I think!
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> The name change sounds good to me.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks Alieh!
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I thank you all again for your useful
>>>>>>>> questions/suggestions.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I would be happy to hear more of your concerns, as
>>>> stated
>>>>>>>> in
>>>>>>>>>>>>>> some
>>>>>>>>>>>>>>>>>>> feedback.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>>>> Alieh
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Wed, Apr 24, 2024 at 12:31 AM Justine Olshan
>>>>>>>>>>>>>>>>>>>> <jols...@confluent.io.invalid> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Thanks Alieh for the updates.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I'm a little concerned about the design pattern
>> here.
>>>> It
>>>>>>>>>>>>> seems
>>>>>>>>>>>>>>>> like
>>>>>>>>>>>>>>>>> we
>>>>>>>>>>>>>>>>>>> want
>>>>>>>>>>>>>>>>>>>>> specific usages, but we are packaging it as a
>> generic
>>>>>>>>>>>>> handler.
>>>>>>>>>>>>>>>>>>>>> I think we tried to narrow down on the specific
>>> errors
>>>> we
>>>>>>>>>>>>> want
>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>> handle,
>>>>>>>>>>>>>>>>>>>>> but it feels a little clunky as we have a generic
>>> thing
>>>>>>>> for
>>>>>>>>>>>>>> two
>>>>>>>>>>>>>>>>>> specific
>>>>>>>>>>>>>>>>>>>>> errors.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I'm wondering if we are using the right patterns to
>>>> solve
>>>>>>>>>>>>>> these
>>>>>>>>>>>>>>>>>>> problems. I
>>>>>>>>>>>>>>>>>>>>> agree though that we will need something more than
>>> the
>>>>>>>> error
>>>>>>>>>>>>>>>> classes
>>>>>>>>>>>>>>>>>> I'm
>>>>>>>>>>>>>>>>>>>>> proposing if we want to have different handling be
>>>>>>>>>>>>>> configurable.
>>>>>>>>>>>>>>>>>>>>> My concern is that the open-endedness of a handler
>>>> means
>>>>>>>>>>>>> that
>>>>>>>>>>>>>> we
>>>>>>>>>>>>>>>> are
>>>>>>>>>>>>>>>>>>>>> creating more problems than we are solving. It is
>>> still
>>>>>>>>>>>>>> unclear
>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>> me
>>>>>>>>>>>>>>>>>>> how
>>>>>>>>>>>>>>>>>>>>> we expect to handle the errors. Perhaps we could
>>>> include
>>>>>>>> an
>>>>>>>>>>>>>>>> example?
>>>>>>>>>>>>>>>>>> It
>>>>>>>>>>>>>>>>>>>>> seems like there is a specific use case in mind and
>>>> maybe
>>>>>>>> we
>>>>>>>>>>>>>> can

Re: [DISCUSS] KIP-1038: Add Custom Error Handler to Producer

Reply via email to