Greg Harris created KAFKA-16349:
-----------------------------------

             Summary: ShutdownableThread fails build by calling Exit with race 
condition
                 Key: KAFKA-16349
                 URL: https://issues.apache.org/jira/browse/KAFKA-16349
             Project: Kafka
          Issue Type: Bug
          Components: core
    Affects Versions: 3.8.0
            Reporter: Greg Harris


`ShutdownableThread` calls `Exit.exit()` when the thread's operation throws 
FatalExitError. In normal operation, this calls System.exit, and exits the 
process. In tests, the exit procedure is masked with Exit.setExitProcedure to 
prevent tests that encounter a FatalExitError from crashing the test JVM.

Masking of exit procedures is usually done in BeforeEach/AfterEach annotations, 
with the exit procedures cleaned up immediately after the test finishes. If the 
body of the test creates a ShutdownableThread that outlives the test, such as 
by omitting `ShutdownableThread#awaitShutdown`, by having 
`ShutdownableThread#awaitShutdown` be interrupted by a test timeout, or by a 
race condition between `Exit.resetExitProcedure` and `Exit.exit`, then 
System.exit() can be called erroneously.

 
{noformat}
// First, in the test thread:
Exit.setExitProcedure(...)
try {
    new ShutdownableThread(...).start()
} finally {
    Exit.resetExitProcedure()
}
// Second, in the ShutdownableThread:
try {
    throw new FatalExitError(...)
} catch (FatalExitError e) {
    Exit.exit(...) // Calls real System.exit()
}{noformat}
 

This can be resolved by one of the following:
 # Eliminate FatalExitError usages in code when setExitProcedure is in-use
 # Eliminate the Exit.exit call from ShutdownableThread, and instead propagate 
this error to another thread to handle without a race-condition
 # Eliminate resetExitProcedure by refactoring Exit to be non-static

FatalExitError is in use in a small number of places, but may be difficult to 
eliminate:
 * FinalizedFeatureChangeListener
 * InterBrokerSendThread
 * TopicBasedRemoteLogMetadataManager

It appears that every other use of Exit.setExitProcedure/Exit.exit is done on a 
single thread, so ShutdownableThread is the only place where this race 
condition is present.

The effect of this bug is that the build is flaky, as race conditions/timeouts 
in tests can cause the gradle executors to exit with status code 1, which has 
happened 26 times in the last 28 days. I have not yet been able to confirm this 
bug is happening in other tests, but I do have a deterministic reproduction 
case with the exact same symptoms:
{noformat}
Gradle Test Run :core:test > Gradle Test Executor 38 > ShutdownableThreadTest > 
testShutdownWhenTestTimesOut(boolean) > 
"testShutdownWhenTestTimesOut(boolean).isInterruptible=true" SKIPPED
FAILURE: Build failed with an exception.
* What went wrong:
Execution failed for task ':core:test'.
> Process 'Gradle Test Executor 38' finished with non-zero exit value 1
  This problem might be caused by incorrect test process configuration.
  For more on test execution, please refer to 
https://docs.gradle.org/8.6/userguide/java_testing.html#sec:test_execution in 
the Gradle documentation.{noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to