[jira] [Created] (KAFKA-16620) Kraft quorum cannot be formed if all controllers are restarted at the same time

2024-04-25 Thread Gantigmaa Selenge (Jira)
Gantigmaa Selenge created KAFKA-16620:
-

 Summary: Kraft quorum cannot be formed if all controllers are 
restarted at the same time
 Key: KAFKA-16620
 URL: https://issues.apache.org/jira/browse/KAFKA-16620
 Project: Kafka
  Issue Type: Bug
Reporter: Gantigmaa Selenge


The controller quorum cannot form at all after all controller nodes were 
accidentally restarted at the same time in a test environment. This is 
reproducible: it happens almost every time all controller nodes of the cluster 
are restarted together. 

The cluster was started with 3 controller nodes and 3 broker nodes. After 
restarting the controller nodes, one of them becomes the active controller but 
then resigns due to a fetch timeout. Quorum leadership keeps bouncing between 
the nodes like this indefinitely. 
controller.quorum.fetch.timeout.ms was left at its default of 2 seconds. 
Logs from an active controller:
```
2024-04-17 14:00:48,250 INFO [QuorumController id=0] Becoming the active 
controller at epoch 34, next write offset 1116. 
(org.apache.kafka.controller.QuorumController) 
[quorum-controller-0-event-handler]
2024-04-17 14:00:48,250 WARN [QuorumController id=0] Performing controller 
activation. Loaded ZK migration state of NONE. 
(org.apache.kafka.controller.QuorumController) 
[quorum-controller-0-event-handler]
2024-04-17 14:00:48,701 INFO [RaftManager id=0] Node 1 disconnected. 
(org.apache.kafka.clients.NetworkClient) [kafka-0-raft-outbound-request-thread]
2024-04-17 14:00:48,701 WARN [RaftManager id=0] Connection to node 1 
(my-cluster-controller-1.my-cluster-kafka-brokers.roller.svc.cluster.local/10.244.0.68:9090)
 could not be established. Node may not be available. 
(org.apache.kafka.clients.NetworkClient) [kafka-0-raft-outbound-request-thread]
2024-04-17 14:00:48,776 DEBUG [UnifiedLog partition=__cluster_metadata-0, 
dir=/var/lib/kafka/data/kafka-log0] Flushing log up to offset 1117 
(exclusive)with recovery point 1117, last flushed: 1713362448239,  current 
time: 1713362448776,unflushed: 1 (kafka.log.UnifiedLog) [kafka-0-raft-io-thread]
2024-04-17 14:00:49,277 DEBUG [UnifiedLog partition=__cluster_metadata-0, 
dir=/var/lib/kafka/data/kafka-log0] Flushing log up to offset 1118 
(exclusive)with recovery point 1118, last flushed: 1713362448777,  current 
time: 
...
2024-04-17 14:01:35,934 DEBUG [UnifiedLog partition=__cluster_metadata-0, 
dir=/var/lib/kafka/data/kafka-log0] Flushing log up to offset 1200 
(exclusive)with recovery point 1200, last flushed: 1713362489371,  current 
time: 1713362495934,unflushed: 1 (kafka.log.UnifiedLog) [kafka-0-raft-io-thread]
2024-04-17 14:01:36,121 INFO [RaftManager id=0] Did not receive fetch request 
from the majority of the voters within 3000ms. Current fetched voters are []. 
(org.apache.kafka.raft.LeaderState) [kafka-0-raft-io-thread]
2024-04-17 14:01:36,223 WARN [QuorumController id=0] Renouncing the leadership 
due to a metadata log event. We were the leader at epoch 34, but in the new 
epoch 35, the leader is (none). Reverting to last stable offset 1198. 
(org.apache.kafka.controller.QuorumController) 
[quorum-controller-0-event-handler]
2024-04-17 14:01:36,223 INFO [QuorumController id=0] 
failAll(NotControllerException): failing writeNoOpRecord(152156824). 
(org.apache.kafka.deferred.DeferredEventQueue) 
[quorum-controller-0-event-handler]
2024-04-17 14:01:36,223 INFO [QuorumController id=0] writeNoOpRecord: event 
failed with NotControllerException in 6291037 microseconds. 
(org.apache.kafka.controller.QuorumController) 
[quorum-controller-0-event-handler]
```
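The "within 3000ms" in the leader log above is 1.5 times the 2000 ms fetch timeout, which appears to match the factor the Raft leader applies to its check-quorum timer. The following is a minimal Python sketch of that rule as I understand it, not Kafka's actual code: the function name, parameters, and the 1.5 factor are assumptions made for illustration.

```python
def has_fetching_majority(leader_id, voters, last_fetch_ms, now_ms,
                          fetch_timeout_ms=2000, factor=1.5):
    """Return True if the leader plus recently fetching voters form a majority.

    last_fetch_ms maps voter id -> timestamp (ms) of its last fetch request.
    The factor of 1.5 is an assumption inferred from the 3000ms in the log.
    """
    check_quorum_timeout_ms = int(fetch_timeout_ms * factor)
    fetched = {
        v for v in voters
        if v != leader_id
        and v in last_fetch_ms
        and now_ms - last_fetch_ms[v] <= check_quorum_timeout_ms
    }
    majority = len(voters) // 2 + 1
    # The leader counts itself towards the majority.
    return len(fetched) + 1 >= majority


voters = [0, 1, 2]
# No follower has fetched (as in "Current fetched voters are []"): leader resigns.
print(has_fetching_majority(0, voters, {}, now_ms=3000))         # False
# One follower fetched 500 ms ago: leader plus one voter is a majority of three.
print(has_fetching_majority(0, voters, {2: 2500}, now_ms=3000))  # True
```

In this model a 3-voter leader keeps its leadership as long as at least one follower fetches within the window, so the open question in the logs is why neither follower manages to fetch from node 0 in time.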
Logs from the follower:
```
2024-04-17 14:00:48,242 INFO [RaftManager id=2] Completed transition to 
FollowerState(fetchTimeoutMs=2000, epoch=34, leaderId=0, voters=[0, 1, 2], 
highWatermark=Optional[LogOffsetMetadata(offset=1113, 
metadata=Optional.empty)], fetchingSnapshot=Optional.empty) from 
Voted(epoch=34, votedId=0, voters=[0, 1, 2], electionTimeoutMs=1794) 
(org.apache.kafka.raft.QuorumState) [kafka-2-raft-io-thread]
2024-04-17 14:00:48,242 INFO [QuorumController id=2] In the new epoch 34, the 
leader is 0. (org.apache.kafka.controller.QuorumController) 
[quorum-controller-2-event-handler]
2024-04-17 14:00:48,247 DEBUG [UnifiedLog partition=__cluster_metadata-0, 
dir=/var/lib/kafka/data/kafka-log2] Flushing log up to offset 1116 
(exclusive)with recovery point 1116, last flushed: 1713362442238,  current 
time: 1713362448247,unflushed: 2 (kafka.log.UnifiedLog) [kafka-2-raft-io-thread]
2024-04-17 14:00:48,777 DEBUG [UnifiedLog partition=__cluster_metadata-0, 
dir=/var/lib/kafka/data/kafka-log2] Flushing log up to offset 1117 
(exclusive)with recovery point 1117, last flushed: 1713362448249,  current 
time: 1713362448777,unflushed: 1 (kafka.log.UnifiedLog) [kafka-2-raft-io-thread]
2024-04-17 14:00:49,278 DEBUG [UnifiedLog partition=__cluster_metadata-0, 
dir=/var/lib/kafka/data/kafka-log2] Flushing log up to offset 111
```

[jira] [Created] (KAFKA-16612) Talking to controllers via AdminClient requires reconfiguring controller listener

2024-04-24 Thread Gantigmaa Selenge (Jira)
Gantigmaa Selenge created KAFKA-16612:
-

 Summary: Talking to controllers via AdminClient requires 
reconfiguring controller listener
 Key: KAFKA-16612
 URL: https://issues.apache.org/jira/browse/KAFKA-16612
 Project: Kafka
  Issue Type: Improvement
Reporter: Gantigmaa Selenge


After KIP-919, Kafka controllers register themselves with the active controller 
once they start up. This registration includes information about the endpoints 
that the controller listener is configured with. These endpoints are then sent 
to admin clients (via DescribeClusterResponse) so that clients can send requests 
to the active controller. If the controller listener is configured with 
"CONTROLLER://0.0.0.0:9093", admin client requests will fail (the client ends up 
trying to connect to localhost). This was not clearly stated in the KIP or the 
documentation.
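To make the failure mode concrete, here is a small Python sketch that flags a listener whose host is a wildcard bind address and is therefore not something a remote admin client can connect to. It is only an illustration: the helper names are hypothetical, the hostname in the usage lines is made up, and the parsing is a simplification of Kafka's listener syntax.

```python
import ipaddress


def parse_listener(listener):
    """Split 'NAME://host:port' into (name, host, port)."""
    name, sep, rest = listener.partition("://")
    if not sep:
        raise ValueError(f"not a listener string: {listener!r}")
    host, _, port = rest.rpartition(":")
    # Strip the brackets used around IPv6 literals, e.g. [::].
    return name, host.strip("[]"), int(port)


def binds_to_wildcard(listener):
    """Return True if the listener host is an unspecified address (0.0.0.0 or ::)."""
    _, host, _ = parse_listener(listener)
    try:
        return ipaddress.ip_address(host).is_unspecified
    except ValueError:
        # Not an IP literal (for example a hostname), so not a wildcard address.
        return False


print(binds_to_wildcard("CONTROLLER://0.0.0.0:9093"))                        # True
print(binds_to_wildcard("CONTROLLER://[::]:9093"))                           # True
print(binds_to_wildcard("CONTROLLER://controller-0.example.internal:9093"))  # False
```

A check like this could be run against the controller's "listeners" value before deployment, since that value is what ends up being handed to admin clients.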

When clients talk to brokers, advertised.listeners is used; however, 
advertised.listeners is forbidden for controllers. Should we allow 
advertised.listeners for controllers so that the admin client can use it to 
talk to controllers, in the same way it uses it to talk to brokers? Or should 
the endpoints provided in controller.quorum.voters be returned to the admin 
client?

If the intention is for clients to use the regular "listeners" configuration of 
the controller, this should be clearly documented. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-16240) Flaky test PlaintextAdminIntegrationTest.testReplicaCanFetchFromLogStartOffsetAfterDeleteRecords(String).quorum=kraft

2024-02-09 Thread Gantigmaa Selenge (Jira)
Gantigmaa Selenge created KAFKA-16240:
-

 Summary: Flaky test 
PlaintextAdminIntegrationTest.testReplicaCanFetchFromLogStartOffsetAfterDeleteRecords(String).quorum=kraft
 Key: KAFKA-16240
 URL: https://issues.apache.org/jira/browse/KAFKA-16240
 Project: Kafka
  Issue Type: Test
Reporter: Gantigmaa Selenge


Failed run 
[https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-15300/8/testReport/junit/kafka.api/PlaintextAdminIntegrationTest/Build___JDK_17_and_Scala_2_13___testReplicaCanFetchFromLogStartOffsetAfterDeleteRecords_String__quorum_kraft_2/]

Stack trace

```

org.apache.kafka.common.errors.TimeoutException: Timed out waiting for a node assignment. Call: deleteRecords(api=DELETE_RECORDS)
    at java.base/java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:396)
    at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2073)
    at org.apache.kafka.common.internals.KafkaFutureImpl.get(KafkaFutureImpl.java:165)
    at kafka.api.PlaintextAdminIntegrationTest.testReplicaCanFetchFromLogStartOffsetAfterDeleteRecords(PlaintextAdminIntegrationTest.scala:860)

```





[jira] [Created] (KAFKA-16211) Inconsistent static config values in CreateTopicsResult and DescribeConfigsResult

2024-01-30 Thread Gantigmaa Selenge (Jira)
Gantigmaa Selenge created KAFKA-16211:
-

 Summary: Inconsistent static config values in CreateTopicsResult 
and DescribeConfigsResult
 Key: KAFKA-16211
 URL: https://issues.apache.org/jira/browse/KAFKA-16211
 Project: Kafka
  Issue Type: Bug
  Components: controller
Reporter: Gantigmaa Selenge


When creating a topic in a KRaft cluster, a config value returned in 
CreateTopicsResult can differ from the value returned when describing the 
topic's configs. This happens if the config was set in broker.properties or 
controller.properties, or in both but with different values. 

 

For example, start a broker with `segment.bytes` set to 573741824 in the 
properties file and then create a topic. The CreateTopicsResult contains:

ConfigEntry(name=segment.bytes, value=1073741824, source=DEFAULT_CONFIG, 
isSensitive=false, isReadOnly=false, synonyms=[], type=INT, documentation=null)

 because the controller was started without setting this config. 

However, when you describe the configuration of the same topic, the config 
value set by the broker is returned:

ConfigEntry(name=segment.bytes, value=573741824, 
source=STATIC_BROKER_CONFIG, isSensitive=false, isReadOnly=false, synonyms=[], 
type=null, documentation=null)

 

Conversely, if the controller is started with this config set to a different 
value, the create topic request returns the value set by the controller, and 
when you then describe the config for the same topic, you get the value set by 
the broker. This makes it hard to tell which value is actually being used.
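The mismatch described above can be pictured with a toy model in which each node resolves a topic config against its own static properties. This Python sketch is not Kafka's resolution code: the function name and dictionaries are hypothetical, and it assumes the static property in question is `log.segment.bytes`, the broker-level synonym of the topic config `segment.bytes`.

```python
# Built-in default for the topic config, as shown in the CreateTopicsResult above.
TOPIC_DEFAULTS = {"segment.bytes": 1073741824}
# Assumed broker-level synonym used in the node's properties file.
SYNONYMS = {"segment.bytes": "log.segment.bytes"}


def effective_value(topic_overrides, name, node_static_config):
    """Resolve a topic config from the point of view of a single node."""
    if name in topic_overrides:
        return topic_overrides[name], "DYNAMIC_TOPIC_CONFIG"
    synonym = SYNONYMS[name]
    if synonym in node_static_config:
        return node_static_config[synonym], "STATIC_BROKER_CONFIG"
    return TOPIC_DEFAULTS[name], "DEFAULT_CONFIG"


controller_props = {}                              # controller started without the config
broker_props = {"log.segment.bytes": 573741824}    # broker started with it

# Create topic is answered by the controller, describe configs by the broker.
print(effective_value({}, "segment.bytes", controller_props))
# (1073741824, 'DEFAULT_CONFIG')
print(effective_value({}, "segment.bytes", broker_props))
# (573741824, 'STATIC_BROKER_CONFIG')
```

The two calls reproduce the two ConfigEntry values in the report: the same topic yields different values and sources depending on which node's static properties are consulted.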





[jira] [Created] (KAFKA-14763) Add integration test for DelegationTokenCommand tool

2023-02-27 Thread Gantigmaa Selenge (Jira)
Gantigmaa Selenge created KAFKA-14763:
-

 Summary: Add integration test for DelegationTokenCommand tool
 Key: KAFKA-14763
 URL: https://issues.apache.org/jira/browse/KAFKA-14763
 Project: Kafka
  Issue Type: Task
Reporter: Gantigmaa Selenge


When moving DelegationTokenCommand from the core module to the tools module in 
[https://github.com/apache/kafka/pull/13172], the existing integration test 
could not be migrated because there is no {{BaseRequestTest}} or {{SaslSetup}} 
to help set up integration tests in the tools module. We will need to create a 
similar setup in the tools module and add an integration test for the 
command tool. 

 


