[
https://issues.apache.org/jira/browse/KAFKA-18874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17939584#comment-17939584
]
Daniel Fonai edited comment on KAFKA-18874 at 3/31/25 8:23 AM:
---------------------------------------------------------------
[~ijuma] the version is 3.9.0; I updated the Affects Version field.
[~cmccabe] I might have described it incorrectly. What we see is that the
controller registration times out (due to transient network/DNS issues), after
which the controller does not retry the registration and the KRaft migration
cannot finish successfully. I uploaded the controller logs. They contain a lot
of UnknownHostExceptions, but those are intermittent and related to the
network/DNS issues (the cluster runs in Kubernetes and DNS is sometimes slow).
Maybe that has nothing to do with the quorum.
was (Author: JIRAUSER286467):
[~ijuma] the version is 3.9.0, I updated the Affects Version field.
[~cmccabe] I might have used the wrong phrase. What we see is that the
controller registration times out (due to transient network/DNS issues) then
the controller does not retry the registration and the KRaft migration could
not be finished successfully. I uploaded the controller logs, there are a lot
of UnknownHostExceptions but those are intermittent and related to the network/DNS
issues (the cluster is in Kubernetes and sometimes DNS is slow).
> KRaft controller does not retry registration if the first attempt times out
> ---------------------------------------------------------------------------
>
> Key: KAFKA-18874
> URL: https://issues.apache.org/jira/browse/KAFKA-18874
> Project: Kafka
> Issue Type: Bug
> Components: controller
> Affects Versions: 3.9.0
> Reporter: Daniel Fonai
> Priority: Minor
> Attachments: controller-3.log, controller-4.log, controller-5.log
>
>
> There is a [retry
> mechanism|https://github.com/apache/kafka/blob/3.9.0/core/src/main/scala/kafka/server/ControllerRegistrationManager.scala#L274]
> with exponential backoff built into KRaft controller registration. The
> timeout of the first attempt is 5 s for KRaft controllers
> ([code|https://github.com/apache/kafka/blob/3.9.0/core/src/main/scala/kafka/server/ControllerServer.scala#L448]),
> and it is not configurable.
> If the controller's first registration request times out for any reason, the
> attempt should be retried, but in practice this does not happen and the
> controller is unable to join the quorum. We see the following in the faulty
> controller's log:
> {noformat}
> 2025-02-21 13:31:46,606 INFO [ControllerRegistrationManager id=3
> incarnation=mEzjHheAQ_eDWejAFquGiw] sendControllerRegistration: attempting to
> send ControllerRegistrationRequestData(controllerId=3,
> incarnationId=mEzjHheAQ_eDWejAFquGiw, zkMigrationReady=true,
> listeners=[Listener(name='CONTROLPLANE-9090',
> host='kraft-rollback-kafka-controller-pool-3.kraft-rollback-kafka-kafka-brokers.csm-op-test-kraft-rollback-631e64ac.svc',
> port=9090, securityProtocol=1)], features=[Feature(name='kraft.version',
> minSupportedVersion=0, maxSupportedVersion=1),
> Feature(name='metadata.version', minSupportedVersion=1,
> maxSupportedVersion=21)]) (kafka.server.ControllerRegistrationManager)
> [controller-3-registration-manager-event-handler]
> ...
> 2025-02-21 13:31:51,627 ERROR [ControllerRegistrationManager id=3
> incarnation=mEzjHheAQ_eDWejAFquGiw] RegistrationResponseHandler: channel
> manager timed out before sending the request.
> (kafka.server.ControllerRegistrationManager)
> [controller-3-to-controller-registration-channel-manager]
> 2025-02-21 13:31:51,726 INFO [ControllerRegistrationManager id=3
> incarnation=mEzjHheAQ_eDWejAFquGiw] maybeSendControllerRegistration: waiting
> for the previous RPC to complete.
> (kafka.server.ControllerRegistrationManager)
> [controller-3-registration-manager-event-handler]
> {noformat}
> After this, we do not see any registration retry from the controller in the log.
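
To illustrate the behaviour expected here versus the behaviour observed, below is a minimal Java sketch. It is not the code of ControllerRegistrationManager: the class name, the backoff bounds (100 ms initial, 10 s cap), and the "pending RPC" flag are all hypothetical. The idea that a pending flag is left set after the timeout is only one possible reading of the "waiting for the previous RPC to complete" log line, not a confirmed root cause.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

// Hypothetical model of a registration loop with exponential backoff.
// None of these names or constants come from the Kafka code base.
class RegistrationRetrySketch {
    // Assumed backoff bounds, chosen only for illustration.
    static final long INITIAL_BACKOFF_MS = 100;
    static final long MAX_BACKOFF_MS = 10_000;

    // Delay before the next attempt after `failures` consecutive failures:
    // doubles each time, capped at MAX_BACKOFF_MS.
    static long backoffMs(int failures) {
        int shift = Math.min(Math.max(failures - 1, 0), 20);
        return Math.min(MAX_BACKOFF_MS, INITIAL_BACKOFF_MS << shift);
    }

    static final class Outcome {
        final boolean registered;
        final int attempts;
        final List<Long> delaysMs;

        Outcome(boolean registered, int attempts, List<Long> delaysMs) {
            this.registered = registered;
            this.attempts = attempts;
            this.delaysMs = delaysMs;
        }
    }

    // attemptSucceeds.test(n) says whether the n-th attempt (1-based) succeeds;
    // a false result models a timed-out request.
    // clearPendingOnTimeout = false models the suspected failure mode, where a
    // timeout leaves the "previous RPC" marked as still in flight.
    static Outcome simulate(IntPredicate attemptSucceeds, int maxAttempts,
                            boolean clearPendingOnTimeout) {
        boolean pendingRpc = false;
        int attempts = 0;
        int failures = 0;
        List<Long> delays = new ArrayList<>();
        while (attempts < maxAttempts) {
            if (pendingRpc) {
                // "waiting for the previous RPC to complete": nothing is sent,
                // so no further progress is possible.
                break;
            }
            pendingRpc = true;
            attempts++;
            if (attemptSucceeds.test(attempts)) {
                return new Outcome(true, attempts, delays);
            }
            failures++;
            if (clearPendingOnTimeout) {
                pendingRpc = false;              // allow the next attempt
                delays.add(backoffMs(failures)); // schedule it with backoff
            }
        }
        return new Outcome(false, attempts, delays);
    }

    public static void main(String[] args) {
        // The first three attempts time out, the fourth succeeds.
        Outcome expected = simulate(n -> n >= 4, 10, true);
        System.out.println(expected.registered + " after " + expected.attempts
                + " attempts, delays " + expected.delaysMs);

        Outcome observed = simulate(n -> n >= 4, 10, false);
        System.out.println(observed.registered + " after " + observed.attempts
                + " attempts, delays " + observed.delaysMs);
    }
}
```

In this model, the first call corresponds to what the retry mechanism should do (the timed-out attempts are followed by retries with growing delays), while the second call stops after a single attempt, which matches what the log above shows.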