[GitHub] [kafka] C0urante opened a new pull request, #14313: KAFKA-15416: Fix flaky TopicAdminTest::retryEndOffsetsShouldRetryWhenTopicNotFound test case

via GitHub Wed, 30 Aug 2023 11:14:16 -0700


C0urante opened a new pull request, #14313:
URL: https://github.com/apache/kafka/pull/14313


   [Jira](https://issues.apache.org/jira/browse/KAFKA-15416)
   
   There's some fairly subtle behavior involved here, so apologies in advance 
for the wall of text.
   
   # Problem
   
   In the `retryEndOffsetsShouldRetryWhenTopicNotFound` test case, we currently 
use a `MockTime` object that has an auto-tick of 10ms with our 
[AdminClientUnitTestEnv](https://github.com/apache/kafka/blob/703e1d9faafbf07795261b3233ab985583f17fcb/clients/src/test/java/org/apache/kafka/clients/admin/AdminClientUnitTestEnv.java),
 which ends up also being utilized by the [admin 
client](https://github.com/apache/kafka/blob/703e1d9faafbf07795261b3233ab985583f17fcb/clients/src/test/java/org/apache/kafka/clients/admin/AdminClientUnitTestEnv.java#L107-L109)
 that it provides.
   
   By default, admin clients perform periodic metadata refresh every five 
minutes (see the [metadata.max.age.ms 
property](https://github.com/apache/kafka/blob/703e1d9faafbf07795261b3233ab985583f17fcb/clients/src/main/java/org/apache/kafka/clients/admin/AdminClientConfig.java#L137)).
 With a `MockTime` instance that has an auto-tick of 10ms, periodic metadata 
refresh will be triggered every 30,000 times that `MockTime::milliseconds` (or 
`MockTime::nanoseconds`, but that's not relevant to these tests) is called.
   
   Because of how the testing admin is set up by the `AdminClientUnitTestEnv` 
class, [checks for periodic metadata 
refresh](https://github.com/apache/kafka/blob/703e1d9faafbf07795261b3233ab985583f17fcb/clients/src/main/java/org/apache/kafka/clients/admin/KafkaAdminClient.java#L1359-L1368)
 take place in a tight loop with no time spent sleeping during each iteration. 
As a result, it's not unreasonable for 30,000 iterations to take place fairly 
quickly.
   
   When a periodic metadata refresh is triggered, it can sometimes "swallow" 
the responses that we [prepare for the admin 
client](https://github.com/apache/kafka/blob/703e1d9faafbf07795261b3233ab985583f17fcb/connect/runtime/src/test/java/org/apache/kafka/connect/util/TopicAdminTest.java#L563-L565),
 causing these responses to be delivered to the periodic metadata refresh 
request instead of requests that the admin client issues in service of 
application logic (e.g., an invocation of `Admin::listOffsets`).
   
   With the current 
`TopicAdminTest::retryEndOffsetsShouldRetryWhenTopicNotFound` case, this can 
lead to a variety of timing-dependent failures, including a spurious timeout 
caused by repeated attempts to contact a node that never responds because all 
prepared responses have already been consumed, with every attempt causing the 
mocked time to move forward slowly but surely past the operation timeout.
   
   # Solutions
   
   There are a few options for a fix:
   
   1. Disable periodic metadata refresh in the admin client
   1. Disable auto-tick in our `MockTime` instance
   1. Switch to a fully-deterministic mocking method for the admin client 
(e.g., use Mockito to construct a mock `Admin` instance)
   
   In this PR I've opted for the second option, as this appears to be 
(implicitly or explicitly) the strategy used in all other tests that use the 
`AdminClientUnitTestEnv` class, and I'd prefer not to introduce inconsistency 
in the strategies we use in the `TopicAdminTest` suite to mock out admin client 
behavior.
   
   
   # Other changes
   
   ## Test retryEndOffsetsShouldWrapNonRetriableExceptionsWithConnectException
   
   This PR also tweaks the 
`retryEndOffsetsShouldWrapNonRetriableExceptionsWithConnectException` test case 
to prevent false negatives. Currently, because of a resurfacing of 
[KAFKA-12879](https://issues.apache.org/jira/browse/KAFKA-12879), admin clients 
will retry attempts to list offsets on unknown topics. The test expects the 
opposite behavior, but because the next automatic retry is met with a spurious 
timeout, it fails anyways.
   
   The test is modified here to 1) prepare a different failure (authorization 
error) that is less likely to be be treated as retriable by future changes, 2) 
prepare a valid response for the admin client in the event that it ever retries 
the operation (which should, conversely, cause the test to fail), and 3) make 
finer-grained assertions on the kind of exception thrown by 
`TopicAdmin::retryEndOffsets` to ensure that it fails for the reason that we 
expect it to.
   
   ## TopicAdmin::retryEndOffsets and TopicAdmin::endOffsets
   
   A few tweaks are made to these methods to eliminate unnecessary chained 
elements in the stack trace of exceptions they throw. This not only facilitates 
the stronger testing assertions added to 
`TopicAdminTest::retryEndOffsetsShouldWrapNonRetriableExceptionsWithConnectException`,
 it also improves the readability of the stack traces shown to users.
   
   This same improvement may be made to other parts of the `TopicAdmin` class. 
I've left that for follow-up work to try to keep this PR focused for now, but 
can add it if we agree it's warranted.
   
   ### Committer Checklist (excluded from commit message)
   - [ ] Verify design and implementation 
   - [ ] Verify test coverage and CI build status
   - [ ] Verify documentation (including upgrade notes)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [kafka] C0urante opened a new pull request, #14313: KAFKA-15416: Fix flaky TopicAdminTest::retryEndOffsetsShouldRetryWhenTopicNotFound test case

Reply via email to