Re: [PR] MINOR: Fix flaky BrokerLifecycleManagerTest [kafka]

via GitHub Mon, 27 Nov 2023 15:48:46 -0800


soarez commented on code in PR #14836:
URL: https://github.com/apache/kafka/pull/14836#discussion_r1406906907



##########
core/src/test/scala/unit/kafka/server/BrokerLifecycleManagerTest.scala:
##########
@@ -201,7 +201,7 @@ class BrokerLifecycleManagerTest {
     while (!future.isDone || context.mockClient.hasInFlightRequests) {
       context.poll()
       manager.eventQueue.wakeup()
-      context.time.sleep(100)
+      context.time.sleep(5)

Review Comment:
   It's true in theory, but in this unit test, it is not.
   
   I just tried with 6 heartbeats, and in 1,000 repetitions of the test, one of 
the runs was missing the last directory. On my laptop, with 7 heartbeats, 
10,000 test runs had no failures.
   
   I think this happens because of how `poll()` works: it's rapidly advancing 
the clock and notifying three separate systems - BrokerLifecycleManager, 
MockClient and MockChannelManager — signaling them using `Object.notify()`, 
which does not guarantee each of those threads will run straightaway, it's 
still up to the OS to schedule them onto the CPU.
   
   In practice, outside of unit tests, the delays in scheduling the 
BrokerLifecycleManager thread should be insignificant compared to the heartbeat 
interval. So I don't expect failures to be delayed for more than a heartbeat. 
   
   If the test proves to still be flaky even with 10 heartbeats, I think we can 
just increase the number. 



##########
core/src/test/scala/unit/kafka/server/BrokerLifecycleManagerTest.scala:
##########
@@ -201,7 +201,7 @@ class BrokerLifecycleManagerTest {
     while (!future.isDone || context.mockClient.hasInFlightRequests) {
       context.poll()
       manager.eventQueue.wakeup()
-      context.time.sleep(100)
+      context.time.sleep(5)

Review Comment:
   It's true in theory, but in this unit test, it is not.
   
   I just tried with 6 heartbeats, and in 1,000 repetitions of the test, one of 
the runs was missing the last directory. On my laptop, with 7 heartbeats, 
10,000 test runs had no failures.
   
   I think this happens because of how `poll()` works: it's rapidly advancing 
the clock and notifying three separate systems - BrokerLifecycleManager, 
MockClient and MockChannelManager — signaling them using `Object.notify()`, 
which does not guarantee each of those threads will run straightaway, it's 
still up to the OS to schedule them onto the CPU.
   
   In practice, outside of unit tests, the delays in scheduling the 
BrokerLifecycleManager thread should be insignificant compared to the heartbeat 
interval. So I don't expect failures to be delayed for more than a heartbeat. 
   
   If the test proves to still be flaky even with 10 heartbeats, I think we can 
just increase the number. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Re: [PR] MINOR: Fix flaky BrokerLifecycleManagerTest [kafka]

Reply via email to