Re: [PR] KAFKA-15566: Fix flaky tests in FetchRequestTest.scala in KRaft mode [kafka]

2023-10-19 Thread via GitHub


showuon merged PR #14573:
URL: https://github.com/apache/kafka/pull/14573


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] KAFKA-15566: Fix flaky tests in FetchRequestTest.scala in KRaft mode [kafka]

2023-10-19 Thread via GitHub


showuon commented on PR #14573:
URL: https://github.com/apache/kafka/pull/14573#issuecomment-1771954831

   Ran 3 times of CI build and no fetchRequestTest failures. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] KAFKA-15566: Fix flaky tests in FetchRequestTest.scala in KRaft mode [kafka]

2023-10-19 Thread via GitHub


showuon commented on PR #14573:
URL: https://github.com/apache/kafka/pull/14573#issuecomment-1770924775

   > @showuon @dengziming Thanks for approving the PR. I also have identified 
the root cause of `checkLastFetchedEpochValidation` failure which also needs to 
call `TestUtils.waitUntilLeaderIsKnown` function after creating a topic 
partition. It retrieves the leader epoch for a topic partition too soon after 
creating it which results in value of `-1`. It then uses this incorrect epoch 
for the fetch request and fails. Since it's a same fix, I could perhaps add it 
to this PR as well?
   > 
   > Also even though the other testcases did not fail, maybe just in case they 
should also call `TestUtils.waitUntilLeaderIsKnown` after creating topics.
   
   @tinaselenge , thanks for the investigation! I think we should have another 
PR for this issue since both Ziming and I already reviewed this PR and 
approved. Let's make it merge soon and focus on the other issues separately! 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] KAFKA-15566: Fix flaky tests in FetchRequestTest.scala in KRaft mode [kafka]

2023-10-19 Thread via GitHub


tinaselenge commented on PR #14573:
URL: https://github.com/apache/kafka/pull/14573#issuecomment-1770870075

   @showuon @dengziming Thanks for approving the PR. I also have identified the 
root cause of `checkLastFetchedEpochValidation` failure which also needs to 
call `TestUtils.waitUntilLeaderIsKnown` function after creating a topic 
partition. It retrieves the leader epoch for the topic partition too soon after 
creating a topic which results in value of `-1`. It then uses this incorrect 
epoch for the fetch request and fails.  Since it's a same fix, I could perhaps 
add it to this PR as well? 
   
   Also even though the other testcases did not fail, maybe just in case they 
should also call `TestUtils.waitUntilLeaderIsKnown` after creating topics. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] KAFKA-15566: Fix flaky tests in FetchRequestTest.scala in KRaft mode [kafka]

2023-10-18 Thread via GitHub


showuon commented on PR #14573:
URL: https://github.com/apache/kafka/pull/14573#issuecomment-1769750941

   @dengziming , do you want to have a look?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] KAFKA-15566: Fix flaky tests in FetchRequestTest.scala in KRaft mode [kafka]

2023-10-18 Thread via GitHub


tinaselenge commented on PR #14573:
URL: https://github.com/apache/kafka/pull/14573#issuecomment-1768593969

   @showuon thanks for the feedback. I just made an update to use 
`TestUtils.waitUntilLeaderIsKnown' instead of the sleep function. The tests 
will now wait for the log to be created on the leader before sending fetch 
requests. I think this is similar to what you were suggesting. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[PR] KAFKA-15566: Fix flaky tests in FetchRequestTest.scala in KRaft mode [kafka]

2023-10-18 Thread via GitHub


tinaselenge opened a new pull request, #14573:
URL: https://github.com/apache/kafka/pull/14573

   - Fixed some of the failing tests in FetchRequestTest. 
   
 `testFetchWithPartitionsWithIdError` and 
`testCreateIncrementalFetchWithPartitionsInErrorV12` fail with the following 
error when enabled with KRaft mode. These tests only fail sometimes when 
running locally but consistently failed when running in the Jenkins Pipeline. 
 ```
 expected: <0> but was: <6>
 Expected :0
 Actual   :6
 
 
 org.opentest4j.AssertionFailedError: expected: <0> but was: <6>
  at 
app//org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
  at 
app//org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
  at 
app//org.junit.jupiter.api.AssertEquals.failNotEqual(AssertEquals.java:197)
  at 
app//org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:134)
  at 
app//org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:129)
  at 
app//org.junit.jupiter.api.Assertions.assertEquals(Assertions.java:325)
  at 
app//kafka.server.FetchRequestTest.testCreateIncrementalFetchWithPartitionsInErrorV12(FetchRequestTest.scala:547)
 ...
 ```
 
 The tests create topic partitions and send fetch requests for them. The 
expected error code to be returned is 0 however they get 
`NOT_LEADER_OR_FOLLOWER` exception.  We throw this exception when trying to 
read records from a log that doesn’t exist 
(https://github.com/apache/kafka/blob/dc6a53e19606674bd1276bf05d3ae7a3a2115523/core/src/main/scala/kafka/cluster/Partition.scala#L504).
 The issue seems to be due to receiving fetch requests before logs being 
created on the broker.  The following test logs with extra debug lines show 
that we attempted to read records from the log for topic patition foo-1, a few 
milliseconds before it was created. 
 ```
 [2023-10-18 12:44:18,200] DEBUG [Partition foo-1 broker=0] Getting log for 
topic with id foo-1 to read its records (kafka.cluster.Partition:62)
 [2023-10-18 12:44:18,200] DEBUG [Partition foo-1 broker=0] 
NOT_LEADER_OR_FOLLOWER foo-1 because log is empty (kafka.cluster.Partition:62)
 [2023-10-18 12:44:18,205] INFO Created log for partition foo-1 in 
/var/folders/fm/65mtddt52vjf8hycyj0rn64rgn/T/kafka-6502713585191291330/foo-1
 with properties {} (kafka.log.LogManager:66)
 [2023-10-18 12:44:18,212] DEBUG [KafkaApi-0] Fetch request with 
correlation id 1 from client client-id on partition 
AA:foo-1 failed due to 
org.apache.kafka.common.errors.NotLeaderOrFollowerException 
(kafka.server.KafkaApis:62)
 ```
 In Zookeeper case, the log was created much earlier than attempting to 
read records from it therefore no error was returned for the fetch requests. 
 ```
 [2023-10-18 12:38:31,252] INFO Created log for partition foo-1 in 
/var/folders/fm/65mtddt52vjf8hycyj0rn64rgn/T/kafka-1270567419648769103/foo-1
 with properties {} (kafka.log.LogManager:66)
 [2023-10-18 12:38:31,455] DEBUG [Partition foo-1 broker=0] Getting log for 
topic with id foo-1 to read its records (kafka.cluster.Partition:62)
 ```
 
 The only difference between Zookeeper and KRaft mode in these tests is the 
way the topic partitions are created. In Zookeeper mode, we create the topic 
partitions directly with Zookeeper therefore seem to take less time to create 
the logs. In KRaft mode, we use Admin client to create topic partitions. Even 
though the test waits for topic partitions to get created and appear in 
metadata cache before sending fetch requests, it doesn’t seem to be sufficient 
time for the logs to be created on the brokers.  
 
 Adding some sleep after creating the topic partitions and before sending 
the fetch requests fixed the tests.
   
   - Enabled all tests except `checkLastFetchedEpochValidation` with KRaft mode.
 Looking at the build history in Jenkins, all the other tests except these 
2 tests and checkLastFetchedEpochValidation were passing when they were enabled 
with KRaft mode. Therefore enabled them with KRaft mode again but left 
`checkLastFetchedEpochValidation` to be investigated further. 
   
   ### Committer Checklist (excluded from commit message)
   - [ ] Verify design and implementation 
   - [ ] Verify test coverage and CI build status
   - [ ] Verify documentation (including upgrade notes)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org