[
https://issues.apache.org/jira/browse/KAFKA-16768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18046845#comment-18046845
]
Jainam Shah commented on KAFKA-16768:
-------------------------------------
Hey [~cyh2012aa], sounds good. I’ll share my approach as well so we can align
on a starting point and converge on a single solution.
Fix implemented: the Processor’s {{accept(...)}} path now checks {{shouldRun}}
and returns {{false}} when the Processor is shutting down; on a {{false}}
return, the Acceptor immediately closes the SocketChannel and calls
{{connectionQuotas.closeChannel(...)}} to keep the connection-quota
bookkeeping consistent.
Why this resolves the race: by making acceptance conditional on the Processor
still being alive, we ensure the _hand-off_ is fail-fast and the Acceptor never
leaves a channel queued to a Processor that has already drained and terminated.
If the Processor is shutting down, the Acceptor becomes responsible for
cleanup, so there is no window where an accepted channel remains unclosed in
{{newConnections}}.
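A minimal sketch of the shape I have in mind ({{ConnectionQuotas}} is reduced
to the single hook named above, and the re-check after the enqueue is my
addition to close the residual window between the {{shouldRun}} check and the
add; the actual patch may order things differently):
{code:scala}
import java.nio.channels.SocketChannel
import java.util.concurrent.ConcurrentLinkedQueue
import java.util.concurrent.atomic.AtomicBoolean

trait ConnectionQuotas {
  def closeChannel(channel: SocketChannel): Unit // bookkeeping hook named above
}

class Processor {
  private val shouldRun = new AtomicBoolean(true)
  private val newConnections = new ConcurrentLinkedQueue[SocketChannel]()

  // Fail-fast hand-off: return false once shutdown has begun, leaving
  // ownership of the channel (and responsibility for cleanup) with the Acceptor.
  def accept(channel: SocketChannel): Boolean = {
    if (!shouldRun.get()) return false
    newConnections.add(channel)
    // Shutdown may have started, and the shutdown drain already run, between
    // the check above and the add. If we can still remove the channel, no one
    // else will ever close it, so reject the hand-off.
    if (!shouldRun.get() && newConnections.remove(channel)) false else true
  }

  // Simplified stand-in for the real shutdown path, which also drains and
  // closes whatever is left in newConnections.
  def beginShutdown(): Unit = shouldRun.set(false)
}

class Acceptor(connectionQuotas: ConnectionQuotas) {
  def assignNewConnection(channel: SocketChannel, processor: Processor): Unit = {
    if (!processor.accept(channel)) {
      channel.close()                        // Acceptor owns cleanup on rejection
      connectionQuotas.closeChannel(channel) // keep quota bookkeeping consistent
    }
  }
}
{code}
The invariant this enforces is single ownership: at every instant exactly one
side is responsible for the channel, so it gets closed either by the
Processor’s shutdown drain or by the Acceptor, never by neither.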
> SocketServer leaks accepted SocketChannel instances due to race condition
> -------------------------------------------------------------------------
>
> Key: KAFKA-16768
> URL: https://issues.apache.org/jira/browse/KAFKA-16768
> Project: Kafka
> Issue Type: Bug
> Components: core
> Affects Versions: 3.8.0
> Reporter: Greg Harris
> Assignee: Chang-Yu Huang
> Priority: Major
> Labels: newbie
>
> The SocketServer has threads for Acceptors and Processors. These threads
> communicate via Processor#accept/Processor#configureNewConnections and the
> `newConnections` queue.
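> A simplified sketch of that hand-off (only the queue interaction is shown;
> the real Processor in SocketServer.scala carries much more state):
> {code:scala}
> import java.nio.channels.SocketChannel
> import java.util.concurrent.ConcurrentLinkedQueue
>
> class Processor {
>   private val newConnections = new ConcurrentLinkedQueue[SocketChannel]()
>
>   // Called from the Acceptor thread with a freshly accepted channel.
>   def accept(channel: SocketChannel): Unit =
>     newConnections.add(channel)
>
>   // Called from the Processor's own run loop: drain the queue and register
>   // each channel with this Processor's Selector for subsequent I/O.
>   def configureNewConnections(): Unit = {
>     var channel = newConnections.poll()
>     while (channel != null) {
>       // selector.register(...) happens here in the real code
>       channel = newConnections.poll()
>     }
>   }
> }
> {code}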
> During shutdown, the Acceptor and Processors are each stopped by setting
> shouldRun to false, and then shutdown proceeds asynchronously in all
> instances together. This leads to a race condition where an Acceptor accepts
> a SocketChannel and queues it to a Processor, but that Processor instance has
> already started shutting down and has already drained the newConnections
> queue.
> KAFKA-16765 is an analogous bug in NioEchoServer, which uses a completely
> different implementation but has the same flaw.
> An example execution order that includes this leak:
> 1. Acceptor#accept() is called, and a new SocketChannel is accepted.
> 2. Acceptor#assignNewConnection() begins
> 3. Acceptor#close() is called, which sets shouldRun to false in the Acceptor
> and attached Processor instances
> 4. Processor#run() checks the shouldRun variable, and exits the loop
> 5. Processor#closeAll() executes, and drains the `newConnections` queue
> 6. Processor#run() returns and the Processor thread terminates
> 7. Acceptor#assignNewConnection() calls Processor#accept(), which adds the
> SocketChannel to `newConnections`
> 8. Acceptor#assignNewConnection() returns
> 9. Acceptor#run() checks the shouldRun variable and exits the loop, and the
> Acceptor thread terminates.
> 10. Acceptor#close() joins all of the terminated threads, and returns
> At the end of this sequence, there are still open SocketChannel instances in
> newConnections, which are then considered leaked.
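> In code terms, the gap sits between these two methods (a condensed sketch of
> the pre-fix behavior; the real methods do considerably more):
> {code:scala}
> import java.nio.channels.SocketChannel
> import java.util.concurrent.ConcurrentLinkedQueue
>
> class LeakyProcessor {
>   private val newConnections = new ConcurrentLinkedQueue[SocketChannel]()
>
>   // Step 5: the shutdown drain runs exactly once, on the Processor thread.
>   def closeAll(): Unit = {
>     var channel = newConnections.poll()
>     while (channel != null) {
>       channel.close()
>       channel = newConnections.poll()
>     }
>   }
>
>   // Step 7: the Acceptor thread can still call this after closeAll() has
>   // returned; the channel then sits in a queue nothing will drain again.
>   def accept(channel: SocketChannel): Unit =
>     newConnections.add(channel)
> }
> {code}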
--
This message was sent by Atlassian Jira
(v8.20.10#820010)