C0urante opened a new pull request, #14313: URL: https://github.com/apache/kafka/pull/14313
[Jira](https://issues.apache.org/jira/browse/KAFKA-15416) There's some fairly subtle behavior involved here, so apologies in advance for the wall of text. # Problem In the `retryEndOffsetsShouldRetryWhenTopicNotFound` test case, we currently use a `MockTime` object that has an auto-tick of 10ms with our [AdminClientUnitTestEnv](https://github.com/apache/kafka/blob/703e1d9faafbf07795261b3233ab985583f17fcb/clients/src/test/java/org/apache/kafka/clients/admin/AdminClientUnitTestEnv.java), which ends up also being utilized by the [admin client](https://github.com/apache/kafka/blob/703e1d9faafbf07795261b3233ab985583f17fcb/clients/src/test/java/org/apache/kafka/clients/admin/AdminClientUnitTestEnv.java#L107-L109) that it provides. By default, admin clients perform periodic metadata refresh every five minutes (see the [metadata.max.age.ms property](https://github.com/apache/kafka/blob/703e1d9faafbf07795261b3233ab985583f17fcb/clients/src/main/java/org/apache/kafka/clients/admin/AdminClientConfig.java#L137)). With a `MockTime` instance that has an auto-tick of 10ms, periodic metadata refresh will be triggered every 30,000 times that `MockTime::milliseconds` (or `MockTime::nanoseconds`, but that's not relevant to these tests) is called. Because of how the testing admin is set up by the `AdminClientUnitTestEnv` class, [checks for periodic metadata refresh](https://github.com/apache/kafka/blob/703e1d9faafbf07795261b3233ab985583f17fcb/clients/src/main/java/org/apache/kafka/clients/admin/KafkaAdminClient.java#L1359-L1368) take place in a tight loop with no time spent sleeping during each iteration. As a result, it's not unreasonable for 30,000 iterations to take place fairly quickly. When a periodic metadata refresh is triggered, it can sometimes "swallow" the responses that we [prepare for the admin client](https://github.com/apache/kafka/blob/703e1d9faafbf07795261b3233ab985583f17fcb/connect/runtime/src/test/java/org/apache/kafka/connect/util/TopicAdminTest.java#L563-L565), causing these responses to be delivered to the periodic metadata refresh request instead of requests that the admin client issues in service of application logic (e.g., an invocation of `Admin::listOffsets`). With the current `TopicAdminTest::retryEndOffsetsShouldRetryWhenTopicNotFound` case, this can lead to a variety of timing-dependent failures, including a spurious timeout caused by repeated attempts to contact a node that never responds because all prepared responses have already been consumed, with every attempt causing the mocked time to move forward slowly but surely past the operation timeout. # Solutions There are a few options for a fix: 1. Disable periodic metadata refresh in the admin client 1. Disable auto-tick in our `MockTime` instance 1. Switch to a fully-deterministic mocking method for the admin client (e.g., use Mockito to construct a mock `Admin` instance) In this PR I've opted for the second option, as this appears to be (implicitly or explicitly) the strategy used in all other tests that use the `AdminClientUnitTestEnv` class, and I'd prefer not to introduce inconsistency in the strategies we use in the `TopicAdminTest` suite to mock out admin client behavior. # Other changes ## Test retryEndOffsetsShouldWrapNonRetriableExceptionsWithConnectException This PR also tweaks the `retryEndOffsetsShouldWrapNonRetriableExceptionsWithConnectException` test case to prevent false negatives. Currently, because of a resurfacing of [KAFKA-12879](https://issues.apache.org/jira/browse/KAFKA-12879), admin clients will retry attempts to list offsets on unknown topics. The test expects the opposite behavior, but because the next automatic retry is met with a spurious timeout, it fails anyways. The test is modified here to 1) prepare a different failure (authorization error) that is less likely to be be treated as retriable by future changes, 2) prepare a valid response for the admin client in the event that it ever retries the operation (which should, conversely, cause the test to fail), and 3) make finer-grained assertions on the kind of exception thrown by `TopicAdmin::retryEndOffsets` to ensure that it fails for the reason that we expect it to. ## TopicAdmin::retryEndOffsets and TopicAdmin::endOffsets A few tweaks are made to these methods to eliminate unnecessary chained elements in the stack trace of exceptions they throw. This not only facilitates the stronger testing assertions added to `TopicAdminTest::retryEndOffsetsShouldWrapNonRetriableExceptionsWithConnectException`, it also improves the readability of the stack traces shown to users. This same improvement may be made to other parts of the `TopicAdmin` class. I've left that for follow-up work to try to keep this PR focused for now, but can add it if we agree it's warranted. ### Committer Checklist (excluded from commit message) - [ ] Verify design and implementation - [ ] Verify test coverage and CI build status - [ ] Verify documentation (including upgrade notes) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org