[jira] [Created] (KAFKA-16620) Kraft quorum cannot be formed if all controllers are restarted at the same time
Gantigmaa Selenge created KAFKA-16620:

Summary: Kraft quorum cannot be formed if all controllers are restarted at the same time
Key: KAFKA-16620
URL: https://issues.apache.org/jira/browse/KAFKA-16620
Project: Kafka
Issue Type: Bug
Reporter: Gantigmaa Selenge

The controller quorum does not form at all after all controller nodes were accidentally restarted at the same time in a test environment. This is reproducible and happens almost every time all controller nodes of the cluster are restarted.

The cluster was started with 3 controller nodes and 3 broker nodes. After the controller nodes are restarted, one of them becomes the active controller but then resigns due to a fetch timeout. Quorum leadership bounces between the nodes like this indefinitely.

controller.quorum.fetch.timeout.ms was set to the default of 2 seconds.

Logs from an active controller:
```
2024-04-17 14:00:48,250 INFO [QuorumController id=0] Becoming the active controller at epoch 34, next write offset 1116. (org.apache.kafka.controller.QuorumController) [quorum-controller-0-event-handler]
2024-04-17 14:00:48,250 WARN [QuorumController id=0] Performing controller activation. Loaded ZK migration state of NONE. (org.apache.kafka.controller.QuorumController) [quorum-controller-0-event-handler]
2024-04-17 14:00:48,701 INFO [RaftManager id=0] Node 1 disconnected. (org.apache.kafka.clients.NetworkClient) [kafka-0-raft-outbound-request-thread]
2024-04-17 14:00:48,701 WARN [RaftManager id=0] Connection to node 1 (my-cluster-controller-1.my-cluster-kafka-brokers.roller.svc.cluster.local/10.244.0.68:9090) could not be established. Node may not be available. (org.apache.kafka.clients.NetworkClient) [kafka-0-raft-outbound-request-thread]
2024-04-17 14:00:48,776 DEBUG [UnifiedLog partition=__cluster_metadata-0, dir=/var/lib/kafka/data/kafka-log0] Flushing log up to offset 1117 (exclusive)with recovery point 1117, last flushed: 1713362448239, current time: 1713362448776,unflushed: 1 (kafka.log.UnifiedLog) [kafka-0-raft-io-thread]
2024-04-17 14:00:49,277 DEBUG [UnifiedLog partition=__cluster_metadata-0, dir=/var/lib/kafka/data/kafka-log0] Flushing log up to offset 1118 (exclusive)with recovery point 1118, last flushed: 1713362448777, current time: ...
2024-04-17 14:01:35,934 DEBUG [UnifiedLog partition=__cluster_metadata-0, dir=/var/lib/kafka/data/kafka-log0] Flushing log up to offset 1200 (exclusive)with recovery point 1200, last flushed: 1713362489371, current time: 1713362495934,unflushed: 1 (kafka.log.UnifiedLog) [kafka-0-raft-io-thread]
2024-04-17 14:01:36,121 INFO [RaftManager id=0] Did not receive fetch request from the majority of the voters within 3000ms. Current fetched voters are []. (org.apache.kafka.raft.LeaderState) [kafka-0-raft-io-thread]
2024-04-17 14:01:36,223 WARN [QuorumController id=0] Renouncing the leadership due to a metadata log event. We were the leader at epoch 34, but in the new epoch 35, the leader is (none). Reverting to last stable offset 1198. (org.apache.kafka.controller.QuorumController) [quorum-controller-0-event-handler]
2024-04-17 14:01:36,223 INFO [QuorumController id=0] failAll(NotControllerException): failing writeNoOpRecord(152156824). (org.apache.kafka.deferred.DeferredEventQueue) [quorum-controller-0-event-handler]
2024-04-17 14:01:36,223 INFO [QuorumController id=0] writeNoOpRecord: event failed with NotControllerException in 6291037 microseconds. (org.apache.kafka.controller.QuorumController) [quorum-controller-0-event-handler]
```

Logs from the follower:
```
2024-04-17 14:00:48,242 INFO [RaftManager id=2] Completed transition to FollowerState(fetchTimeoutMs=2000, epoch=34, leaderId=0, voters=[0, 1, 2], highWatermark=Optional[LogOffsetMetadata(offset=1113, metadata=Optional.empty)], fetchingSnapshot=Optional.empty) from Voted(epoch=34, votedId=0, voters=[0, 1, 2], electionTimeoutMs=1794) (org.apache.kafka.raft.QuorumState) [kafka-2-raft-io-thread]
2024-04-17 14:00:48,242 INFO [QuorumController id=2] In the new epoch 34, the leader is 0. (org.apache.kafka.controller.QuorumController) [quorum-controller-2-event-handler]
2024-04-17 14:00:48,247 DEBUG [UnifiedLog partition=__cluster_metadata-0, dir=/var/lib/kafka/data/kafka-log2] Flushing log up to offset 1116 (exclusive)with recovery point 1116, last flushed: 1713362442238, current time: 1713362448247,unflushed: 2 (kafka.log.UnifiedLog) [kafka-2-raft-io-thread]
2024-04-17 14:00:48,777 DEBUG [UnifiedLog partition=__cluster_metadata-0, dir=/var/lib/kafka/data/kafka-log2] Flushing log up to offset 1117 (exclusive)with recovery point 1117, last flushed: 1713362448249, current time: 1713362448777,unflushed: 1 (kafka.log.UnifiedLog) [kafka-2-raft-io-thread]
2024-04-17 14:00:49,278 DEBUG [UnifiedLog partition=__cluster_metadata-0, dir=/var/lib/kafka/data/kafka-log2] Flushing log up to offset 111
[jira] [Created] (KAFKA-16612) Talking to controllers via AdminClient requires reconfiguring controller listener
Gantigmaa Selenge created KAFKA-16612:

Summary: Talking to controllers via AdminClient requires reconfiguring controller listener
Key: KAFKA-16612
URL: https://issues.apache.org/jira/browse/KAFKA-16612
Project: Kafka
Issue Type: Improvement
Reporter: Gantigmaa Selenge

After KIP-919, Kafka controllers register themselves with the active controller once they start up. This registration includes the endpoints that the controller listener is configured with. These endpoints are then sent to admin clients (via DescribeClusterResponse) so that clients can send requests to the active controller. If the controller listener is configured with "CONTROLLER://0.0.0.0:9093", admin client requests fail because the client tries to connect to localhost. This was not clearly stated in the KIP or the documentation.

When clients talk to brokers, advertised.listeners is used; however, advertised.listeners is forbidden for controllers. Should we allow advertised.listeners for controllers so that the admin client can use it to talk to controllers, in the same way it uses it to talk to brokers? Or should the endpoints provided in controller.quorum.voters be returned to the admin client?

If the intention is to use the controller's regular "listeners" configuration for clients, this should be clearly documented.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
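The problem described in KAFKA-16612 above can be caught before deployment with a simple check on the listener string. The following is only a minimal sketch in Python: `parse_listener` and `is_wildcard_host` are hypothetical helper names, not part of Kafka, and the sketch assumes the `NAME://host:port` listener format shown in the issue.

```python
import ipaddress
from typing import Tuple


def parse_listener(listener: str) -> Tuple[str, str, int]:
    """Split a 'NAME://host:port' listener string into (name, host, port)."""
    name, _, endpoint = listener.partition("://")
    host, _, port = endpoint.rpartition(":")
    # IPv6 literals are written in brackets, e.g. [::]:9093
    return name, host.strip("[]"), int(port)


def is_wildcard_host(host: str) -> bool:
    """True if host is an unspecified IP address such as 0.0.0.0 or ::."""
    try:
        return ipaddress.ip_address(host).is_unspecified
    except ValueError:
        # Not an IP literal (for example a DNS name), so not a wildcard address.
        return False


# Hypothetical pre-deployment check: a wildcard bind address is fine for
# binding, but it is not something a remote admin client can connect to.
name, host, port = parse_listener("CONTROLLER://0.0.0.0:9093")
if is_wildcard_host(host):
    print(f"{name} listener uses wildcard host {host}; "
          "admin clients cannot use this endpoint to reach the controller")
```

A check like this only flags the configuration; which endpoint the controller should hand out instead is the open question raised in the issue.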
[jira] [Created] (KAFKA-16240) Flaky test PlaintextAdminIntegrationTest.testReplicaCanFetchFromLogStartOffsetAfterDeleteRecords(String).quorum=kraft
Gantigmaa Selenge created KAFKA-16240:

Summary: Flaky test PlaintextAdminIntegrationTest.testReplicaCanFetchFromLogStartOffsetAfterDeleteRecords(String).quorum=kraft
Key: KAFKA-16240
URL: https://issues.apache.org/jira/browse/KAFKA-16240
Project: Kafka
Issue Type: Test
Reporter: Gantigmaa Selenge

Failed run:
[https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-15300/8/testReport/junit/kafka.api/PlaintextAdminIntegrationTest/Build___JDK_17_and_Scala_2_13___testReplicaCanFetchFromLogStartOffsetAfterDeleteRecords_String__quorum_kraft_2/]

Stack trace:
```
org.apache.kafka.common.errors.TimeoutException: Timed out waiting for a node assignment. Call: deleteRecords(api=DELETE_RECORDS)
	at java.base/java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:396)
	at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2073)
	at org.apache.kafka.common.internals.KafkaFutureImpl.get(KafkaFutureImpl.java:165)
	at kafka.api.PlaintextAdminIntegrationTest.testReplicaCanFetchFromLogStartOffsetAfterDeleteRecords(PlaintextAdminIntegrationTest.scala:860)
```
[jira] [Created] (KAFKA-16211) Inconsistent static config values in CreateTopicsResult and DescribeConfigsResult
Gantigmaa Selenge created KAFKA-16211:

Summary: Inconsistent static config values in CreateTopicsResult and DescribeConfigsResult
Key: KAFKA-16211
URL: https://issues.apache.org/jira/browse/KAFKA-16211
Project: Kafka
Issue Type: Bug
Components: controller
Reporter: Gantigmaa Selenge

When creating a topic in a KRaft cluster, a config value returned in CreateTopicsResult can differ from the value returned when describing the topic's configs. This happens if the config was set in broker.properties or controller.properties only, or in both but with different values.

For example, start a broker with `segment.bytes` set to 573741824 in the properties file and then create a topic. The CreateTopicsResult contains:

ConfigEntry(name=segment.bytes, value=1073741824, source=DEFAULT_CONFIG, isSensitive=false, isReadOnly=false, synonyms=[], type=INT, documentation=null)

because the controller was started without setting this config. However, when you describe the configurations for the same topic, the config value set by the broker is returned:

ConfigEntry(name=segment.bytes, value=573741824, source=STATIC_BROKER_CONFIG, isSensitive=false, isReadOnly=false, synonyms=[], type=null, documentation=null)

Vice versa, if the controller is started with this config set to a different value, the create topic request returns the value set by the controller, and when you then describe the config for the same topic, you get the value set by the broker. This makes it confusing to understand which value is being used.
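The mismatch reported in KAFKA-16211 can be illustrated with a toy model. This is a sketch in Python under stated assumptions, not Kafka's actual implementation: it assumes, as the issue suggests, that the create-topic response is resolved against the controller's static config and the describe response against the broker's static config. The `resolve` function and the dictionaries are hypothetical; the values and source labels are the ones quoted in the issue.

```python
# Default taken from the CreateTopicsResult quoted in the issue.
DEFAULTS = {"segment.bytes": "1073741824"}


def resolve(name, topic_overrides, node_static_config):
    """Return (value, source) for a topic config as seen by a single node.

    Precedence in this toy model: topic override, then the node's own
    static properties file, then the built-in default.
    """
    if name in topic_overrides:
        return topic_overrides[name], "DYNAMIC_TOPIC_CONFIG"
    if name in node_static_config:
        return node_static_config[name], "STATIC_BROKER_CONFIG"
    return DEFAULTS[name], "DEFAULT_CONFIG"


# Scenario from the issue: only the broker's properties file sets segment.bytes.
controller_static = {}
broker_static = {"segment.bytes": "573741824"}
topic_overrides = {}  # the topic is created without explicit configs

# Assumption: the create response is computed from the controller's view,
# the describe response from the broker's view.
create_view = resolve("segment.bytes", topic_overrides, controller_static)
describe_view = resolve("segment.bytes", topic_overrides, broker_static)

print("create:  ", create_view)
print("describe:", describe_view)
```

Because each node consults only its own static properties, the two views agree only when the controller and broker files set the same value.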
[jira] [Created] (KAFKA-14763) Add integration test for DelegationTokenCommand tool
Gantigmaa Selenge created KAFKA-14763:

Summary: Add integration test for DelegationTokenCommand tool
Key: KAFKA-14763
URL: https://issues.apache.org/jira/browse/KAFKA-14763
Project: Kafka
Issue Type: Task
Reporter: Gantigmaa Selenge

When DelegationTokenCommand was moved from the core module to the tools module in [https://github.com/apache/kafka/pull/13172], the existing integration test could not be migrated because the tools module has no {{BaseRequestTest}} or {{SaslSetup}} to help set up integration tests. We will need to create a similar setup in the tools module and then add an integration test for the command tool.