[ 
https://issues.apache.org/jira/browse/KAFKA-20237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18062605#comment-18062605
 ] 

Yin Lei commented on KAFKA-20237:
---------------------------------

Hi [~chickenchickenlove] 

Thank you for clarifying! I really appreciate the feedback from a fellow 
contributor, and the links to the KIP process are very helpful.

You've raised a crucial point regarding the Public Contract. From my 
perspective, the current behavior — where the Producer remains permanently 
stuck in the INITIALIZING state after a transient SSL/Auth failure — feels more 
like a liveness bug than an intentional design. However, I agree that changing 
this to a self-recovery mechanism might have broader implications, and it might 
need further discussion.
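To make the liveness concern concrete, here is a hypothetical, simplified sketch of the transitions involved (this enum and its transition table are my own illustration, not the actual TransactionManager code, which has more states). The report boils down to the INITIALIZING -> READY edge never being taken again after the first handshake failure:

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical, simplified model of the producer's transactional states; the
// real TransactionManager has more states and stricter transition rules.
enum ProducerTxnState {
    UNINITIALIZED, INITIALIZING, READY, FATAL_ERROR;

    // Expected behavior: a successful (retried) InitProducerId response should
    // move INITIALIZING -> READY; the reported bug is that this never happens
    // after an early SSL handshake failure.
    Set<ProducerTxnState> validNext() {
        switch (this) {
            case UNINITIALIZED: return EnumSet.of(INITIALIZING, FATAL_ERROR);
            case INITIALIZING:  return EnumSet.of(READY, FATAL_ERROR);
            case READY:         return EnumSet.of(READY, FATAL_ERROR);
            default:            return EnumSet.noneOf(ProducerTxnState.class);
        }
    }
}
```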

Hello [~jolshan], as an expert on the transactional protocol and the 
TransactionManager state machine, could you please weigh in?
Given that this leads to a permanent "silent failure" in long-running 
applications, can we treat it as a bug rather than an intentional design? I’d 
appreciate your insights.
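For reference, a minimal sketch of the producer configuration from the report's environment section (the truststore path and serializer classes are placeholder assumptions; the bootstrap address is taken from the log snippet below):

```java
import java.util.Properties;

public class ReproProducerConfig {
    // Configuration matching the reported environment; transactional.id and
    // enable.idempotence are deliberately left unset, as in the report.
    static Properties producerProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "192.168.0.10:9812"); // address from the log snippet
        props.put("security.protocol", "SSL");
        props.put("ssl.truststore.location", "/path/to/truststore.jks"); // placeholder; initially untrusted
        props.put("retries", "2");
        props.put("reconnect.backoff.ms", "30000");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");   // assumed
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer"); // assumed
        return props;
    }
}
```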

Best regards,
Yin

>  TransactionManager stuck in `INITIALIZING` state after initial SSL handshake 
> failure
> -------------------------------------------------------------------------------------
>
>                 Key: KAFKA-20237
>                 URL: https://issues.apache.org/jira/browse/KAFKA-20237
>             Project: Kafka
>          Issue Type: Bug
>          Components: clients, producer 
>    Affects Versions: 3.9.0
>         Environment: - Operating System: Linux aarch64;
> - Kafka Version (Both Client and Server): 3.9.0;
> - security.protocol: SSL;
> - Some producer configurations: retries=2, reconnect.backoff.ms=30000, 
> transactional.id not set, enable.idempotence not set;
>            Reporter: Yin Lei
>            Priority: Major
>
> I encountered a scenario where the `KafkaProducer` fails to recover if the 
> initial SSL handshake with the broker fails, even after the underlying SSL 
> configuration is corrected.
>  
> *Steps to Reproduce:*
> 1. Configure a `KafkaProducer` with SSL enabled, but use an 
> incorrect/untrusted certificate on the server side to trigger an 
> `SSLHandshakeException`.
> 2. Start the Producer and attempt to send a message.
> 3. The Producer logs show recurring SSL handshake errors. At this point, 
> `TransactionManager` enters the `INITIALIZING` state.
> 4. Correct the SSL certificate configuration on the *server side*, so that 
> the broker is now reachable and the handshake can succeed.
> 5. Observe the Producer's behavior: messages still cannot be sent to the broker.
>  
> *Expected Behavior:*
> The Producer should successfully complete the SSL handshake, and the `Sender` 
> thread should retry the `InitProducerId` request, allowing the 
> `TransactionManager` to transition from `INITIALIZING` to `READY`.
>  
> *Actual Behavior:*
> Even though the network/SSL layer has recovered, the `KafkaProducer` remains 
> unable to send messages. The `TransactionManager` stays stuck in 
> *INITIALIZING* because the initial failure to obtain a `ProducerId` is never 
> properly retried, or the state machine does not recover from the specific 
> handshake exception that occurred during the transition.
> h3. *Potential Impact:*
> In long-running microservices, if the initial connection to Kafka fails due 
> to temporary infrastructure or certificate issues, the Producer becomes 
> permanently "broken" and requires a full application restart to recover, 
> which is not ideal for high-availability systems.
> h3. *PS: Log Snippet*
> > The producer thread repeatedly prints the following log lines, and no 
> > record of any message actually being sent was found.
> ```
> 02-25 21:19:33.716+0800[TRACE][kafka-producer-network-thread | 
> producer-4][Sender 444] [Producer clientId=producer-4] Nodes with data ready 
> to send: [192.168.0.10:9812 (id: 0 rack: null)]  
> 02-25 21:19:33.716+0800[TRACE][kafka-producer-network-thread | 
> producer-4][ProducerBatch 121] For 
> ProducerBatch(topicPartition=dte_nb_federation_receive-0, recordCount=7), 
> leader wasn't updated, currentLeaderEpoch: OptionalInt[25], 
> attemptsWhenLeaderLastChanged:0, latestLeaderEpoch: OptionalInt[25], current 
> attempt: 0  
> 02-25 21:19:33.716+0800[TRACE][kafka-producer-network-thread | 
> producer-4][RecordAccumulator 823] [Producer clientId=producer-4] For 
> ProducerBatch(topicPartition=dte_nb_federation_receive-0, recordCount=7), 
> will not backoff, shouldWaitMore false, hasLeaderChanged false  
> 02-25 21:19:33.716+0800[TRACE][kafka-producer-network-thread | 
> producer-4][BuiltInPartitioner 258] [Producer clientId=producer-4] The number 
> of partitions is too small: available=1, all=1, not using adaptive for topic 
> dte_nb_federation_receive  
> 02-25 21:19:33.716+0800[TRACE][kafka-producer-network-thread | 
> producer-4][ProducerBatch 121] For 
> ProducerBatch(topicPartition=dte_nb_federation_receive-0, recordCount=7), 
> leader wasn't updated, currentLeaderEpoch: OptionalInt[25], 
> attemptsWhenLeaderLastChanged:0, latestLeaderEpoch: OptionalInt[25], current 
> attempt: 0  
> 02-25 21:19:33.716+0800[TRACE][kafka-producer-network-thread | 
> producer-4][RecordAccumulator 823] [Producer clientId=producer-4] For 
> ProducerBatch(topicPartition=dte_nb_federation_receive-0, recordCount=7), 
> will not backoff, shouldWaitMore false, hasLeaderChanged false  
> 02-25 21:19:33.716+0800[TRACE][kafka-producer-network-thread | 
> producer-4][Sender 444] [Producer clientId=producer-4] Nodes with data ready 
> to send: [192.168.0.10:9812 (id: 0 rack: null)]  
> 02-25 21:19:33.716+0800[TRACE][kafka-producer-network-thread | 
> producer-4][ProducerBatch 121] For 
> ProducerBatch(topicPartition=dte_nb_federation_receive-0, recordCount=7), 
> leader wasn't updated, currentLeaderEpoch: OptionalInt[25], 
> attemptsWhenLeaderLastChanged:0, latestLeaderEpoch: OptionalInt[25], current 
> attempt: 0  
> 02-25 21:19:33.716+0800[TRACE][kafka-producer-network-thread | 
> producer-4][RecordAccumulator 823] [Producer clientId=producer-4] For 
> ProducerBatch(topicPartition=dte_nb_federation_receive-0, recordCount=7), 
> will not backoff, shouldWaitMore false, hasLeaderChanged false  
> 02-25 21:19:33.717+0800[TRACE][kafka-producer-network-thread | 
> producer-4][BuiltInPartitioner 258] [Producer clientId=producer-4] The number 
> of partitions is too small: available=1, all=1, not using adaptive for topic 
> dte_nb_federation_receive  
> 02-25 21:19:33.717+0800[TRACE][kafka-producer-network-thread | 
> producer-4][ProducerBatch 121] For 
> ProducerBatch(topicPartition=dte_nb_federation_receive-0, recordCount=7), 
> leader wasn't updated, currentLeaderEpoch: OptionalInt[25], 
> attemptsWhenLeaderLastChanged:0, latestLeaderEpoch: OptionalInt[25], current 
> attempt: 0  
> 02-25 21:19:33.717+0800[TRACE][kafka-producer-network-thread | 
> producer-4][RecordAccumulator 823] [Producer clientId=producer-4] For 
> ProducerBatch(topicPartition=dte_nb_federation_receive-0, recordCount=7), 
> will not backoff, shouldWaitMore false, hasLeaderChanged false 
> ```
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
