[jira] [Commented] (KAFKA-16768) SocketServer leaks accepted SocketChannel instances due to race condition
[ https://issues.apache.org/jira/browse/KAFKA-16768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17846684#comment-17846684 ] Greg Harris commented on KAFKA-16768: - [~muralibasani] Do you mean call Acceptor#close? That might cause an infinite loop (because Acceptor#close calls Processor#close which calls Processor#closeAll). I think if the Processor closeAll (where newConnections is drained) could call just the acceptor thread `join` and get the same effect, but without the infinite loop. > SocketServer leaks accepted SocketChannel instances due to race condition > - > > Key: KAFKA-16768 > URL: https://issues.apache.org/jira/browse/KAFKA-16768 > Project: Kafka > Issue Type: Bug > Components: core >Affects Versions: 3.8.0 >Reporter: Greg Harris >Priority: Major > > The SocketServer has threads for Acceptors and Processors. These threads > communicate via Processor#accept/Processor#configureNewConnections and the > `newConnections` queue. > During shutdown, the Acceptor and Processors are each stopped by setting > shouldRun to false, and then shutdown proceeds asynchronously in all > instances together. This leads to a race condition where an Acceptor accepts > a SocketChannel and queues it to a Processor, but that Processor instance has > already started shutting down and has already drained the newConnections > queue. > KAFKA-16765 is an analogous bug in NioEchoServer, which uses a completely > different implementation but has the same flaw. > An example execution order that includes this leak: > 1. Acceptor#accept() is called, and a new SocketChannel is accepted. > 2. Acceptor#assignNewConnection() begins > 3. Acceptor#close() is called, which sets shouldRun to false in the Acceptor > and attached Processor instances > 4. Processor#run() checks the shouldRun variable, and exits the loop > 5. Processor#closeAll() executes, and drains the `newConnections` variable > 6. Processor#run() returns and the Processor thread terminates > 7. Acceptor#assignNewConnection() calls Processor#accept(), which adds the > SocketChannel to `newConnections` > 8. Acceptor#assignNewConnection() returns > 9. Acceptor#run() checks the shouldRun variable and exits the loop, and the > Acceptor thread terminates. > 10. Acceptor#close() joins all of the terminated threads, and returns > At the end of this sequence, there are still open SocketChannel instances in > newConnections, which are then considered leaked. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-16768) SocketServer leaks accepted SocketChannel instances due to race condition
[ https://issues.apache.org/jira/browse/KAFKA-16768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17846607#comment-17846607 ] Muralidhar Basani commented on KAFKA-16768: --- To fix this, when Processor#closeAll() is called, should we close all the acceptors in dataPlaneAcceptors, so that no new connections are accepted ? > SocketServer leaks accepted SocketChannel instances due to race condition > - > > Key: KAFKA-16768 > URL: https://issues.apache.org/jira/browse/KAFKA-16768 > Project: Kafka > Issue Type: Bug > Components: core >Affects Versions: 3.8.0 >Reporter: Greg Harris >Priority: Major > > The SocketServer has threads for Acceptors and Processors. These threads > communicate via Processor#accept/Processor#configureNewConnections and the > `newConnections` queue. > During shutdown, the Acceptor and Processors are each stopped by setting > shouldRun to false, and then shutdown proceeds asynchronously in all > instances together. This leads to a race condition where an Acceptor accepts > a SocketChannel and queues it to a Processor, but that Processor instance has > already started shutting down and has already drained the newConnections > queue. > KAFKA-16765 is an analogous bug in NioEchoServer, which uses a completely > different implementation but has the same flaw. > An example execution order that includes this leak: > 1. Acceptor#accept() is called, and a new SocketChannel is accepted. > 2. Acceptor#assignNewConnection() begins > 3. Acceptor#close() is called, which sets shouldRun to false in the Acceptor > and attached Processor instances > 4. Processor#run() checks the shouldRun variable, and exits the loop > 5. Processor#closeAll() executes, and drains the `newConnections` variable > 6. Processor#run() returns and the Processor thread terminates > 7. Acceptor#assignNewConnection() calls Processor#accept(), which adds the > SocketChannel to `newConnections` > 8. Acceptor#assignNewConnection() returns > 9. Acceptor#run() checks the shouldRun variable and exits the loop, and the > Acceptor thread terminates. > 10. Acceptor#close() joins all of the terminated threads, and returns > At the end of this sequence, there are still open SocketChannel instances in > newConnections, which are then considered leaked. -- This message was sent by Atlassian Jira (v8.20.10#820010)