[ 
https://issues.apache.org/jira/browse/KAFKA-12879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17503891#comment-17503891
 ] 

Randall Hauch edited comment on KAFKA-12879 at 3/10/22, 9:13 PM:
-----------------------------------------------------------------

The approach we decided to take was to revert the previous admin client changes 
from KAFKA-12339 to bring the admin client behavior back to previous 
expectations, and to implement retries within the KafkaBasedLog to handle cases 
like those identified in that issue.

For example, a likely root cause of KAFKA-12339 is that a Connect worker 
instantiates its KafkaConfigBackingStore (and other internal topic stores), 
each of which creates a KafkaBasedLog whose start() method creates the topic 
if it doesn't exist and then immediately tries to read the offsets. That read 
can fail if the metadata for the newly created topic hasn't yet been 
propagated to all of the brokers. We can solve this particular root cause 
easily by retrying the offset read within the KafkaBasedLog's start() method. 
Since topic metadata should be propagated relatively quickly, we don't need 
to retry for long, and most of the time the read would probably succeed 
within a few retries.
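
To illustrate the retry idea described above, here is a minimal sketch of a 
bounded retry loop. It is not the code from the merged PR: the class name 
RetrySketch, the method retryUntilTimeout, and all of its parameters are 
hypothetical, and it uses only the Java standard library so that it stays 
independent of any Kafka version. In a real worker the action would be the 
offset read, and the predicate would presumably match whatever exception 
signals that topic metadata is not yet available.

```java
import java.time.Duration;
import java.util.concurrent.Callable;
import java.util.function.Predicate;

// Hypothetical helper, not part of Kafka: retries an action for a bounded time.
public final class RetrySketch {
    private RetrySketch() {}

    /**
     * Calls the action until it succeeds, fails with a non-retriable
     * exception, or the timeout elapses. The last exception is rethrown
     * unchanged so that callers still see the original failure type.
     */
    public static <T> T retryUntilTimeout(Callable<T> action,
                                          Predicate<Exception> retriable,
                                          Duration timeout,
                                          Duration backoff) throws Exception {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            try {
                return action.call();
            } catch (Exception e) {
                boolean expired = System.nanoTime() - deadline >= 0;
                if (!retriable.test(e) || expired) {
                    throw e;
                }
                // Short pause so that metadata has a chance to propagate.
                Thread.sleep(backoff.toMillis());
            }
        }
    }
}
```

Rethrowing the last exception unchanged, rather than wrapping it in a timeout, 
keeps the original failure type visible to the caller once the retries give 
up.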

I've just merged a PR to trunk that does this. When I tried to backport it, 
some of the newer tests were flaky, so [~pnee] created two follow-up PRs to 
eliminate that flakiness, and they seem to have worked. 

I'm in the process of backporting this all the way back to the 2.6 branch, 
since that's how far back the regression from KAFKA-12339 was backported.


> Compatibility break in Admin.listOffsets()
> ------------------------------------------
>
>                 Key: KAFKA-12879
>                 URL: https://issues.apache.org/jira/browse/KAFKA-12879
>             Project: Kafka
>          Issue Type: Bug
>          Components: admin
>    Affects Versions: 2.8.0, 2.7.1, 2.6.2
>            Reporter: Tom Bentley
>            Assignee: Philip Nee
>            Priority: Major
>
> KAFKA-12339 incompatibly changed the semantics of Admin.listOffsets(). 
> Previously it would fail with {{UnknownTopicOrPartitionException}} when a 
> topic didn't exist. Now it will (eventually) fail with {{TimeoutException}}. 
> It seems this was more or less intentional, even though it would break code 
> which was expecting and handling the {{UnknownTopicOrPartitionException}}. A 
> workaround is to use {{retries=1}} and inspect the cause of the 
> {{TimeoutException}}, but this isn't really suitable for cases where the same 
> Admin client instance is being used for other calls where retries are 
> desirable.
> Furthermore, as well as the intended effect on {{listOffsets()}}, it seems 
> that the change could actually affect other methods of Admin.
> More generally, the Admin client API is vague about which exceptions can 
> propagate from which methods. This means that it's not possible to say, in 
> cases like this, whether the calling code _should_ have been relying on the 
> {{UnknownTopicOrPartitionException}} or not.
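
The workaround mentioned in the description, inspecting the cause of the 
{{TimeoutException}}, amounts to walking an exception's cause chain. A minimal 
sketch of that step follows, using only the Java standard library. The class 
CauseSketch and the method findCause are hypothetical names, not part of the 
Admin client API, and whether the original exception actually appears in the 
cause chain depends on the client version and configuration.

```java
import java.util.Optional;

// Hypothetical helper, not part of Kafka: searches an exception's cause chain.
public final class CauseSketch {
    private static final int MAX_DEPTH = 32;

    private CauseSketch() {}

    /**
     * Returns the first throwable in the cause chain (starting with the
     * error itself) that is an instance of the given type, if there is one.
     * The walk is depth-limited to guard against cyclic cause chains.
     */
    public static <T extends Throwable> Optional<T> findCause(Throwable error,
                                                              Class<T> type) {
        Throwable current = error;
        for (int depth = 0; current != null && depth < MAX_DEPTH; depth++) {
            if (type.isInstance(current)) {
                return Optional.of(type.cast(current));
            }
            current = current.getCause();
        }
        return Optional.empty();
    }
}
```

Calling code could pass the caught exception together with the exception class 
it used to handle, and fall back to its timeout handling when the result is 
empty.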



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
