Re: [PR] KAFKA-15566: Fix flaky tests in FetchRequestTest.scala in KRaft mode [kafka]
showuon merged PR #14573: URL: https://github.com/apache/kafka/pull/14573 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] KAFKA-15566: Fix flaky tests in FetchRequestTest.scala in KRaft mode [kafka]
showuon commented on PR #14573: URL: https://github.com/apache/kafka/pull/14573#issuecomment-1771954831 Ran 3 times of CI build and no fetchRequestTest failures. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] KAFKA-15566: Fix flaky tests in FetchRequestTest.scala in KRaft mode [kafka]
showuon commented on PR #14573: URL: https://github.com/apache/kafka/pull/14573#issuecomment-1770924775 > @showuon @dengziming Thanks for approving the PR. I also have identified the root cause of `checkLastFetchedEpochValidation` failure which also needs to call `TestUtils.waitUntilLeaderIsKnown` function after creating a topic partition. It retrieves the leader epoch for a topic partition too soon after creating it which results in value of `-1`. It then uses this incorrect epoch for the fetch request and fails. Since it's a same fix, I could perhaps add it to this PR as well? > > Also even though the other testcases did not fail, maybe just in case they should also call `TestUtils.waitUntilLeaderIsKnown` after creating topics. @tinaselenge , thanks for the investigation! I think we should have another PR for this issue since both Ziming and I already reviewed this PR and approved. Let's make it merge soon and focus on the other issues separately! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] KAFKA-15566: Fix flaky tests in FetchRequestTest.scala in KRaft mode [kafka]
tinaselenge commented on PR #14573: URL: https://github.com/apache/kafka/pull/14573#issuecomment-1770870075 @showuon @dengziming Thanks for approving the PR. I also have identified the root cause of `checkLastFetchedEpochValidation` failure which also needs to call `TestUtils.waitUntilLeaderIsKnown` function after creating a topic partition. It retrieves the leader epoch for the topic partition too soon after creating a topic which results in value of `-1`. It then uses this incorrect epoch for the fetch request and fails. Since it's a same fix, I could perhaps add it to this PR as well? Also even though the other testcases did not fail, maybe just in case they should also call `TestUtils.waitUntilLeaderIsKnown` after creating topics. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] KAFKA-15566: Fix flaky tests in FetchRequestTest.scala in KRaft mode [kafka]
showuon commented on PR #14573: URL: https://github.com/apache/kafka/pull/14573#issuecomment-1769750941 @dengziming , do you want to have a look? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] KAFKA-15566: Fix flaky tests in FetchRequestTest.scala in KRaft mode [kafka]
tinaselenge commented on PR #14573: URL: https://github.com/apache/kafka/pull/14573#issuecomment-1768593969 @showuon thanks for the feedback. I just made an update to use `TestUtils.waitUntilLeaderIsKnown' instead of the sleep function. The tests will now wait for the log to be created on the leader before sending fetch requests. I think this is similar to what you were suggesting. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[PR] KAFKA-15566: Fix flaky tests in FetchRequestTest.scala in KRaft mode [kafka]
tinaselenge opened a new pull request, #14573: URL: https://github.com/apache/kafka/pull/14573 - Fixed some of the failing tests in FetchRequestTest. `testFetchWithPartitionsWithIdError` and `testCreateIncrementalFetchWithPartitionsInErrorV12` fail with the following error when enabled with KRaft mode. These tests only fail sometimes when running locally but consistently failed when running in the Jenkins Pipeline. ``` expected: <0> but was: <6> Expected :0 Actual :6 org.opentest4j.AssertionFailedError: expected: <0> but was: <6> at app//org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151) at app//org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132) at app//org.junit.jupiter.api.AssertEquals.failNotEqual(AssertEquals.java:197) at app//org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:134) at app//org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:129) at app//org.junit.jupiter.api.Assertions.assertEquals(Assertions.java:325) at app//kafka.server.FetchRequestTest.testCreateIncrementalFetchWithPartitionsInErrorV12(FetchRequestTest.scala:547) ... ``` The tests create topic partitions and send fetch requests for them. The expected error code to be returned is 0 however they get `NOT_LEADER_OR_FOLLOWER` exception. We throw this exception when trying to read records from a log that doesn’t exist (https://github.com/apache/kafka/blob/dc6a53e19606674bd1276bf05d3ae7a3a2115523/core/src/main/scala/kafka/cluster/Partition.scala#L504). The issue seems to be due to receiving fetch requests before logs being created on the broker. The following test logs with extra debug lines show that we attempted to read records from the log for topic patition foo-1, a few milliseconds before it was created. ``` [2023-10-18 12:44:18,200] DEBUG [Partition foo-1 broker=0] Getting log for topic with id foo-1 to read its records (kafka.cluster.Partition:62) [2023-10-18 12:44:18,200] DEBUG [Partition foo-1 broker=0] NOT_LEADER_OR_FOLLOWER foo-1 because log is empty (kafka.cluster.Partition:62) [2023-10-18 12:44:18,205] INFO Created log for partition foo-1 in /var/folders/fm/65mtddt52vjf8hycyj0rn64rgn/T/kafka-6502713585191291330/foo-1 with properties {} (kafka.log.LogManager:66) [2023-10-18 12:44:18,212] DEBUG [KafkaApi-0] Fetch request with correlation id 1 from client client-id on partition AA:foo-1 failed due to org.apache.kafka.common.errors.NotLeaderOrFollowerException (kafka.server.KafkaApis:62) ``` In Zookeeper case, the log was created much earlier than attempting to read records from it therefore no error was returned for the fetch requests. ``` [2023-10-18 12:38:31,252] INFO Created log for partition foo-1 in /var/folders/fm/65mtddt52vjf8hycyj0rn64rgn/T/kafka-1270567419648769103/foo-1 with properties {} (kafka.log.LogManager:66) [2023-10-18 12:38:31,455] DEBUG [Partition foo-1 broker=0] Getting log for topic with id foo-1 to read its records (kafka.cluster.Partition:62) ``` The only difference between Zookeeper and KRaft mode in these tests is the way the topic partitions are created. In Zookeeper mode, we create the topic partitions directly with Zookeeper therefore seem to take less time to create the logs. In KRaft mode, we use Admin client to create topic partitions. Even though the test waits for topic partitions to get created and appear in metadata cache before sending fetch requests, it doesn’t seem to be sufficient time for the logs to be created on the brokers. Adding some sleep after creating the topic partitions and before sending the fetch requests fixed the tests. - Enabled all tests except `checkLastFetchedEpochValidation` with KRaft mode. Looking at the build history in Jenkins, all the other tests except these 2 tests and checkLastFetchedEpochValidation were passing when they were enabled with KRaft mode. Therefore enabled them with KRaft mode again but left `checkLastFetchedEpochValidation` to be investigated further. ### Committer Checklist (excluded from commit message) - [ ] Verify design and implementation - [ ] Verify test coverage and CI build status - [ ] Verify documentation (including upgrade notes) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org