[ https://issues.apache.org/jira/browse/KAFKA-19561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Manikumar updated KAFKA-19561:
------------------------------
    Description:
We've observed request timeouts occurring during SASL reauthentication, and analysis suggests the issue is caused by a race condition between request handling and reauthentication on the broker side. Here’s the sequence:
# Client sends a request (Req1) to the broker.
# Client initiates SASL reauthentication.
# Broker receives Req1.
# Broker also begins SASL reauthentication.
# While reauth is in progress:
** Broker completes processing of Req1 and prepares a response (Res1).
** Res1 is queued via KafkaChannel.send().
** Broker sets SelectionKey.OP_WRITE to indicate write readiness.
** However, Selector.attemptWrite() does not proceed because:
*** channel.hasSend() is true, but
*** channel.ready() is false (reauth is still in progress).
# Once reauthentication completes: Broker removes SelectionKey.OP_WRITE.
# At this point:
** channel.hasSend() and channel.ready() are now true,
** But key.isWritable() is false, so the response (Res1) is never sent.
# The response remains stuck in the send buffer. Client eventually hits a request timeout.

The fix is to set write readiness using SelectionKey.OP_WRITE at the end of Step 6. This is similar to [what we do on client side|https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/common/security/authenticator/SaslClientAuthenticator.java#L422].

  was:
We've observed request timeouts occurring during SASL reauthentication, and analysis suggests the issue is caused by a race condition between request handling and reauthentication on the broker side. Here’s the sequence:
# Client sends a request ({{Req1}}) to the broker.
# Client begins SASL reauthentication.
# Broker receives {{Req1}}.
# Broker also initiates SASL reauthentication.
# While reauth is in progress:
** Broker processes {{Req1}}, prepares {{Res1}}, and queues it via {{KafkaChannel.send()}}.
** Broker sets {{SelectionKey.OP_WRITE}} to indicate write readiness.
** However, {{Selector.attemptWrite()}} skips the send because:
*** {{channel.hasSend()}} is true, but
*** {{channel.ready()}} is false (since reauth is not yet complete).
# After reauth completes, broker removes {{OP_WRITE}} from the selection key.
# At this point:
** {{Res1}} is still pending in the channel.
** {{channel.hasSend()}} and {{channel.ready()}} are now true,
** But {{key.isWritable()}} is false, so no further write is attempted.
# The response remains stuck in the send buffer. Client eventually hits a request timeout.

The fix is to set write readiness using SelectionKey.OP_WRITE at the end of Step 6. This is similar to [what we do on client side|https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/common/security/authenticator/SaslClientAuthenticator.java#L422].


> Request Timeout During SASL Reauthentication Due to Missed OP_WRITE interest set
> ----------------------------------------------------------------------------------
>
>                 Key: KAFKA-19561
>                 URL: https://issues.apache.org/jira/browse/KAFKA-19561
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Manikumar
>            Assignee: Manikumar
>            Priority: Major
>
> We've observed request timeouts occurring during SASL reauthentication, and
> analysis suggests the issue is caused by a race condition between request
> handling and reauthentication on the broker side. Here’s the sequence:
> # Client sends a request (Req1) to the broker.
> # Client initiates SASL reauthentication.
> # Broker receives Req1.
> # Broker also begins SASL reauthentication.
> # While reauth is in progress:
> ** Broker completes processing of Req1 and prepares a response (Res1).
> ** Res1 is queued via KafkaChannel.send().
> ** Broker sets SelectionKey.OP_WRITE to indicate write readiness.
> ** However, Selector.attemptWrite() does not proceed because:
> *** channel.hasSend() is true, but
> *** channel.ready() is false (reauth is still in progress).
> # Once reauthentication completes: Broker removes SelectionKey.OP_WRITE.
> # At this point:
> ** channel.hasSend() and channel.ready() are now true,
> ** But key.isWritable() is false, so the response (Res1) is never sent.
> # The response remains stuck in the send buffer. Client eventually hits a
> request timeout.
> The fix is to set write readiness using SelectionKey.OP_WRITE at the end of
> Step 6. This is similar to [what we do on client
> side|https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/common/security/authenticator/SaslClientAuthenticator.java#L422].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
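
The sequence described in the issue can be sketched as a small, self-contained toy model. This is not Kafka's code: ToyChannel, finishReauth, poll, and ReauthWriteInterestSketch are hypothetical names invented for illustration, and the only real API used is the SelectionKey.OP_READ / SelectionKey.OP_WRITE constants from java.nio. The model assumes that a write is attempted only when write interest is set, a send is pending, and the channel is ready, which is how the issue describes the broker's behaviour. It also assumes, purely for simplicity, that the fix re-adds OP_WRITE unconditionally at the end of reauthentication.

```java
import java.nio.channels.SelectionKey;

/** Toy stand-in for one broker-side connection; NOT Kafka's KafkaChannel or Selector. */
final class ToyChannel {
    int interestOps = SelectionKey.OP_READ; // stands in for the selection key's interest set
    boolean ready = true;                   // false while (re)authentication is in progress
    String pendingSend = null;              // the queued response, if any

    boolean hasSend() {
        return pendingSend != null;
    }

    /** Models step 5: queue the response and register write interest. */
    void send(String response) {
        pendingSend = response;
        interestOps |= SelectionKey.OP_WRITE;
    }

    /** Models step 6: reauthentication completes and write interest is removed. */
    void finishReauth(boolean applyFix) {
        ready = true;
        interestOps &= ~SelectionKey.OP_WRITE; // the removal that strands the queued response
        if (applyFix) {
            // Assumed shape of the proposed fix: restore write interest at the end of step 6.
            interestOps |= SelectionKey.OP_WRITE;
        }
    }

    /** Models one poll pass: returns the response written, or null if nothing was written. */
    String poll() {
        boolean writeInterest = (interestOps & SelectionKey.OP_WRITE) != 0;
        if (writeInterest && hasSend() && ready) {
            String sent = pendingSend;
            pendingSend = null;
            interestOps &= ~SelectionKey.OP_WRITE; // nothing left to write
            return sent;
        }
        return null;
    }
}

final class ReauthWriteInterestSketch {
    /** Replays steps 2-7 and returns what the first poll after reauthentication writes. */
    static String run(boolean applyFix) {
        ToyChannel channel = new ToyChannel();
        channel.ready = false;              // steps 2-4: reauthentication is in progress
        channel.send("Res1");               // step 5: response queued, OP_WRITE set
        if (channel.poll() != null) {       // step 5: hasSend() is true but ready is false
            throw new IllegalStateException("nothing should be written during reauth");
        }
        channel.finishReauth(applyFix);     // step 6
        return channel.poll();              // step 7
    }

    public static void main(String[] args) {
        System.out.println("without fix: " + run(false));
        System.out.println("with fix:    " + run(true));
    }
}
```

Running main prints "without fix: null" and "with fix:    Res1". In the first case Res1 stays queued indefinitely, which corresponds to the client-side request timeout reported above.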