[ 
https://issues.apache.org/jira/browse/KAFKA-20109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18057790#comment-18057790
 ] 

Gergely Harmadás commented on KAFKA-20109:
------------------------------------------

Hi [~svdewitmam], I have started looking at the issue. Feel free to assign it 
to me.

> Complete Kafka cluster dies on incorrect SSL config of a single controller
> --------------------------------------------------------------------------
>
>                 Key: KAFKA-20109
>                 URL: https://issues.apache.org/jira/browse/KAFKA-20109
>             Project: Kafka
>          Issue Type: Bug
>          Components: config, controller
>    Affects Versions: 4.1.1
>         Environment: Debian trixie x86_64, Apache Kafka 3.9.0 - 4.1.1
>            Reporter: Sven Dewit
>            Priority: Major
>         Attachments: reproduce.tar.gz
>
>
> Hello,
> we've recently run into a bug in Apache Kafka in KRaft mode where a whole 
> mTLS-enabled cluster (controllers + brokers) dies if a single controller is 
> (re)started with bad SSL principal mapping rules.
> The bad config was of course applied unintentionally during changes to the 
> config management of the system; it left {{ssl.principal.mapping.rules}} 
> missing for the controller listener on that one node. As soon as this single 
> controller was restarted, the whole cluster died within seconds, both 
> controllers and brokers, with this error message:
> {code:java}
> ERROR Encountered fatal fault: Unexpected error in raft IO thread 
> (org.apache.kafka.server.fault.ProcessTerminatingFaultHandler)
> org.apache.kafka.common.errors.ClusterAuthorizationException: Received 
> cluster authorization error in response InboundResponse(correlationId=493, 
> data=BeginQuorumEpochResponseData(errorCode=31, topics=[], nodeEndpoints=[]), 
> source=controller-3:9093 (id: 103 rack: null isFenced: false)) {code}
> While the missing/bad SSL principal mapping is a major misconfiguration on a 
> cluster where in-cluster communication is based on mTLS, this still should 
> not lead to the whole cluster terminating.
> The issue occurred on version 4.1.1 of Apache Kafka, but could be reproduced 
> back to 3.9.0.
> To reproduce, see the attached tarball containing
>  * {{gen-test-ca-and-certs.sh}} to create the CA and certificates for brokers 
> and controllers to work in mTLS mode
>  * {{compose.yml}} to spin up the cluster with {{podman compose}}
> Once the cluster is running, the following steps reproduce the error:
>  * {{podman compose down controller-3}} to stop controller 3
>  * uncomment line 53 in {{compose.yml}} to delete controller 3's 
> {{ssl.principal.mapping.rules}}
>  * {{podman compose up controller-3}} and watch the cluster go down the drain
>  
> In case I can provide you with any more information or support, don't 
> hesitate to reach out to me.
>  
> Best regards,
> Sven
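
Since the trigger described above is a controller starting without {{ssl.principal.mapping.rules}}, a pre-restart check in config management could catch the misconfiguration before the node rejoins the quorum. The following is only a minimal sketch and not a fix for the bug itself. It assumes the controller is configured through a plain key=value properties file, that the controller listener is named CONTROLLER, and that a listener-prefixed override of the form {{listener.name.<listener>.ssl.principal.mapping.rules}} may be in use. The helper names are hypothetical, and the check only tests for presence, not for correctness of the rules.

```python
RULES_KEY = "ssl.principal.mapping.rules"


def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first "=" only, because mapping rules contain "=" themselves.
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def has_principal_mapping(props, listener):
    """Return True if the rules are set globally or for the given listener.

    The listener-prefixed key form is an assumption made for this sketch.
    """
    prefixed = f"listener.name.{listener.lower()}.{RULES_KEY}"
    return bool(props.get(RULES_KEY) or props.get(prefixed))


if __name__ == "__main__":
    import sys

    # Usage (hypothetical): python check_rules.py controller.properties CONTROLLER
    path, listener = sys.argv[1], sys.argv[2]
    with open(path, encoding="utf-8") as fh:
        props = parse_properties(fh.read())
    if not has_principal_mapping(props, listener):
        sys.exit(f"{RULES_KEY} is missing for listener {listener}; refusing restart")
```

A deployment step could run this against the rendered config and abort the restart on a non-zero exit code. A cluster configured through container environment variables, as the attached {{compose.yml}} may be, would need a different parser.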



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
