[jira] [Commented] (KAFKA-14996) CreateTopic fails with UnknownServerException if num partitions >= QuorumController.MAX_RECORDS_PER_BATCH

2023-05-24 Thread Edoardo Comar (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-14996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17725858#comment-17725858
 ] 

Edoardo Comar commented on KAFKA-14996:
---

I found another way to get into the error state.

Setup: a 3-node broker/controller cluster in which all 3 nodes are voters. If I 
shut down the 2 non-active quorum members, the remaining active controller 
enters a state where it logs 
`[2023-05-24 16:29:45,129] WARN [BrokerToControllerChannelManager id=1 
name=heartbeat] Received error UNKNOWN_SERVER_ERROR from node 1 when making an 
ApiVersionsRequest with correlation id 3945. Disconnecting. 
(org.apache.kafka.clients.NetworkClient)`

and correspondingly 
```
[2023-05-24 16:29:45,128] WARN [QuorumController id=1] getFinalizedFeatures: failed with unknown server exception RuntimeException in 222 us.  The controller is already in standby mode. (org.apache.kafka.controller.QuorumController)
java.lang.RuntimeException: No in-memory snapshot for epoch 159730. Snapshot epochs are:
    at org.apache.kafka.timeline.SnapshotRegistry.getSnapshot(SnapshotRegistry.java:173)
    at org.apache.kafka.timeline.SnapshotRegistry.iterator(SnapshotRegistry.java:131)
    at org.apache.kafka.timeline.TimelineObject.get(TimelineObject.java:69)
    at org.apache.kafka.controller.FeatureControlManager.finalizedFeatures(FeatureControlManager.java:303)
    at org.apache.kafka.controller.QuorumController.lambda$finalizedFeatures$16(QuorumController.java:2016)
    at org.apache.kafka.controller.QuorumController$ControllerReadEvent.run(QuorumController.java:546)
```
in controller.log
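
The exception text suggests a read at an epoch for which no in-memory snapshot is registered. Below is a minimal sketch of that failure mode. `EpochSnapshots` is a hypothetical stand-in written for illustration; it is not the real `org.apache.kafka.timeline.SnapshotRegistry`, and the stored `String` state is an assumption.

```java
import java.util.TreeMap;
import java.util.stream.Collectors;

/** Hypothetical stand-in for a registry of in-memory snapshots keyed by epoch. */
final class EpochSnapshots {
    private final TreeMap<Long, String> snapshots = new TreeMap<>();

    /** Remember the state that was current at the given epoch. */
    void register(long epoch, String state) {
        snapshots.put(epoch, state);
    }

    /** Look up a snapshot; fail loudly if the epoch was never registered or was dropped. */
    String get(long epoch) {
        String state = snapshots.get(epoch);
        if (state == null) {
            String known = snapshots.keySet().stream()
                .map(String::valueOf)
                .collect(Collectors.joining(", "));
            throw new RuntimeException("No in-memory snapshot for epoch " + epoch
                + ". Snapshot epochs are: " + known);
        }
        return state;
    }

    public static void main(String[] args) {
        EpochSnapshots registry = new EpochSnapshots();
        registry.register(61900L, "state@61900");
        try {
            registry.get(84310L); // epoch that was never registered
        } catch (RuntimeException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In this sketch an empty registry produces a message ending in "Snapshot epochs are: " with nothing after the colon, which is the shape of the message logged above.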

> CreateTopic fails with UnknownServerException if num partitions >= 
> QuorumController.MAX_RECORDS_PER_BATCH 
> --
>
> Key: KAFKA-14996
> URL: https://issues.apache.org/jira/browse/KAFKA-14996
> Project: Kafka
>  Issue Type: Bug
>  Components: controller
>Reporter: Edoardo Comar
>Assignee: Edoardo Comar
>Priority: Critical
>
> If an attempt is made to create a topic with
> num partitions >= QuorumController.MAX_RECORDS_PER_BATCH (10000),
> the client receives an UnknownServerException; it should instead receive a 
> more descriptive error.
> The controller logs
> {{2023-05-12 19:25:10,018] WARN [QuorumController id=1] createTopics: failed 
> with unknown server exception IllegalStateException at epoch 2 in 21956 us.  
> Renouncing leadership and reverting to the last committed offset 174. 
> (org.apache.kafka.controller.QuorumController)}}
> {{java.lang.IllegalStateException: Attempted to atomically commit 10001 
> records, but maxRecordsPerBatch is 10000}}
> {{    at 
> org.apache.kafka.controller.QuorumController.appendRecords(QuorumController.java:812)}}
> {{    at 
> org.apache.kafka.controller.QuorumController$ControllerWriteEvent.run(QuorumController.java:719)}}
> {{    at 
> org.apache.kafka.queue.KafkaEventQueue$EventContext.run(KafkaEventQueue.java:127)}}
> {{    at 
> org.apache.kafka.queue.KafkaEventQueue$EventHandler.handleEvents(KafkaEventQueue.java:210)}}
> {{    at 
> org.apache.kafka.queue.KafkaEventQueue$EventHandler.run(KafkaEventQueue.java:181)}}
> {{    at java.base/java.lang.Thread.run(Thread.java:829)}}
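
The "better error" requested in the description could come from checking the record count before the atomic append is attempted. The following is a rough sketch only, not Kafka's actual code: `BatchSizeGuard` and its methods are hypothetical, and it assumes that creating a topic needs one topic record plus one record per partition, which is consistent with the 10001 records in the log above but is not confirmed there. The limit is passed as a parameter rather than read from `QuorumController`.

```java
import java.util.Optional;

/** Hypothetical pre-check that rejects oversized topic creations with a clear message. */
final class BatchSizeGuard {

    /** Assumption: one topic record plus one record per partition. */
    static int recordsForNewTopic(int numPartitions) {
        return 1 + numPartitions;
    }

    /** Returns a descriptive rejection reason, or empty if the records fit in one batch. */
    static Optional<String> check(int numPartitions, int maxRecordsPerBatch) {
        int needed = recordsForNewTopic(numPartitions);
        if (needed > maxRecordsPerBatch) {
            return Optional.of("Creating a topic with " + numPartitions
                + " partitions needs " + needed + " records, but at most "
                + maxRecordsPerBatch + " can be committed atomically");
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        int limit = 10_000; // illustrative limit, supplied by the caller
        System.out.println(check(9_999, limit));  // fits: exactly 10000 records
        System.out.println(check(10_000, limit)); // rejected: 10001 records
    }
}
```

A guard like this would let the controller answer with a request-level validation error instead of failing inside the write path and renouncing leadership.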



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-14996) CreateTopic fails with UnknownServerException if num partitions >= QuorumController.MAX_RECORDS_PER_BATCH

2023-05-24 Thread Jira


[ 
https://issues.apache.org/jira/browse/KAFKA-14996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17725712#comment-17725712
 ] 

Klemen Košir commented on KAFKA-14996:
--

The {{getFinalizedFeatures}} regression (and cluster instability) was probably 
introduced with this PR:

[https://github.com/apache/kafka/pull/13679#issuecomment-1559681643]



[jira] [Commented] (KAFKA-14996) CreateTopic fails with UnknownServerException if num partitions >= QuorumController.MAX_RECORDS_PER_BATCH

2023-05-22 Thread Edoardo Comar (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-14996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17724948#comment-17724948
 ] 

Edoardo Comar commented on KAFKA-14996:
---

Is the state the controller ends up in similar to the one described in

https://issues.apache.org/jira/browse/KAFKA-14644

?



[jira] [Commented] (KAFKA-14996) CreateTopic fails with UnknownServerException if num partitions >= QuorumController.MAX_RECORDS_PER_BATCH

2023-05-22 Thread Edoardo Comar (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-14996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17724946#comment-17724946
 ] 

Edoardo Comar commented on KAFKA-14996:
---

The controller instability is not reproducible with 3.4 (at git commit 
`721a917b44`), so it must be a regression.

 



[jira] [Commented] (KAFKA-14996) CreateTopic fails with UnknownServerException if num partitions >= QuorumController.MAX_RECORDS_PER_BATCH

2023-05-19 Thread Edoardo Comar (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-14996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17724308#comment-17724308
 ] 

Edoardo Comar commented on KAFKA-14996:
---

The controller.log files are full of entries like this:

{{[2023-05-19 15:50:18,834] WARN [QuorumController id=0] getFinalizedFeatures: 
failed with unknown server exception RuntimeException in 28 us.  The controller 
is already in standby mode. (org.apache.kafka.controller.QuorumController)}}
{{java.lang.RuntimeException: No in-memory snapshot for epoch 84310. Snapshot 
epochs are: 61900}}
{{    at 
org.apache.kafka.timeline.SnapshotRegistry.getSnapshot(SnapshotRegistry.java:173)}}
{{    at 
org.apache.kafka.timeline.SnapshotRegistry.iterator(SnapshotRegistry.java:131)}}
{{    at org.apache.kafka.timeline.TimelineObject.get(TimelineObject.java:69)}}
{{    at 
org.apache.kafka.controller.FeatureControlManager.finalizedFeatures(FeatureControlManager.java:303)}}
{{    at 
org.apache.kafka.controller.QuorumController.lambda$finalizedFeatures$16(QuorumController.java:2016)}}
{{    at 
org.apache.kafka.controller.QuorumController$ControllerReadEvent.run(QuorumController.java:546)}}
{{    at 
org.apache.kafka.queue.KafkaEventQueue$EventContext.run(KafkaEventQueue.java:127)}}
{{    at 
org.apache.kafka.queue.KafkaEventQueue$EventHandler.handleEvents(KafkaEventQueue.java:210)}}
{{    at 
org.apache.kafka.queue.KafkaEventQueue$EventHandler.run(KafkaEventQueue.java:181)}}
{{    at java.base/java.lang.Thread.run(Thread.java:829)}}



[jira] [Commented] (KAFKA-14996) CreateTopic fails with UnknownServerException if num partitions >= QuorumController.MAX_RECORDS_PER_BATCH

2023-05-19 Thread Edoardo Comar (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-14996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17724307#comment-17724307
 ] 

Edoardo Comar commented on KAFKA-14996:
---

Given that this means a single client request can make a cluster 
unavailable, I'd raise the Priority to Critical.



[jira] [Commented] (KAFKA-14996) CreateTopic fails with UnknownServerException if num partitions >= QuorumController.MAX_RECORDS_PER_BATCH

2023-05-19 Thread Edoardo Comar (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-14996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17724300#comment-17724300
 ] 

Edoardo Comar commented on KAFKA-14996:
---

A similar error is encountered when creating more than 
QuorumController.MAX_RECORDS_PER_BATCH partitions on an existing topic.

More worrying, the cluster can become unstable after the error occurs.

Seen in a 6-node cluster: nodes 0, 1, 2 are combined broker/controllers and 
nodes 3, 4, 5 are brokers only.

For example, server.log for node 1:

 

{{[2023-05-19 15:43:32,640] INFO [RaftManager id=1] Completed transition to 
CandidateState(localId=1, epoch=300, retries=86, voteStates={0=UNRECORDED, 
1=GRANTED, 2=UNRECORDED}, highWatermark=Optional.empty, electionTimeoutMs=1145) 
from CandidateState(localId=1, epoch=299, retries=85, 
voteStates={0=UNRECORDED, 1=GRANTED, 2=UNRECORDED}, 
highWatermark=Optional.empty, electionTimeoutMs=1817) 
(org.apache.kafka.raft.QuorumState)}}
{{[2023-05-19 15:43:32,649] WARN [RaftManager id=1] Received error 
UNKNOWN_SERVER_ERROR from node 0 when making an ApiVersionsRequest with 
correlation id 4646. Disconnecting. (org.apache.kafka.clients.NetworkClient)}}
{{[2023-05-19 15:43:32,650] WARN [RaftManager id=1] Received error 
UNKNOWN_SERVER_ERROR from node 2 when making an ApiVersionsRequest with 
correlation id 4647. Disconnecting. (org.apache.kafka.clients.NetworkClient)}}
{{[2023-05-19 15:43:33,095] WARN [RaftManager id=1] Received error 
UNKNOWN_SERVER_ERROR from node 0 when making an ApiVersionsRequest with 
correlation id 4652. Disconnecting. (org.apache.kafka.clients.NetworkClient)}}
{{[2023-05-19 15:43:33,147] WARN [RaftManager id=1] Received error 
UNKNOWN_SERVER_ERROR from node 2 when making an ApiVersionsRequest with 
correlation id 4656. Disconnecting. (org.apache.kafka.clients.NetworkClient)}}
{{[2023-05-19 15:43:33,594] WARN [RaftManager id=1] Received error 
UNKNOWN_SERVER_ERROR from node 0 when making an ApiVersionsRequest with 
correlation id 4678. Disconnecting. (org.apache.kafka.clients.NetworkClient)}}
{{[2023-05-19 15:43:33,696] WARN [RaftManager id=1] Received error 
UNKNOWN_SERVER_ERROR from node 2 when making an ApiVersionsRequest with 
correlation id 4684. Disconnecting. (org.apache.kafka.clients.NetworkClient)}}
{{[2023-05-19 15:43:33,773] INFO [RaftManager id=1] Election has timed out, 
backing off for 1000ms before becoming a candidate again 
(org.apache.kafka.raft.KafkaRaftClient)}}
{{[2023-05-19 15:43:34,774] INFO [RaftManager id=1] Re-elect as candidate after 
election backoff has completed (org.apache.kafka.raft.KafkaRaftClient)}}
{{[2023-05-19 15:43:34,784] INFO [RaftManager id=1] Completed transition to 
CandidateState(localId=1, epoch=301, retries=87, voteStates={0=UNRECORDED, 
1=GRANTED, 2=UNRECORDED}, highWatermark=Optional.empty, electionTimeoutMs=1022) 
from CandidateState(localId=1, epoch=300, retries=86, 
voteStates={0=UNRECORDED, 1=GRANTED, 2=UNRECORDED}, 
highWatermark=Optional.empty, electionTimeoutMs=1145) 
(org.apache.kafka.raft.QuorumState)}}
{{[2023-05-19 15:43:34,802] WARN [RaftManager id=1] Received error 
UNKNOWN_SERVER_ERROR from node 0 when making an ApiVersionsRequest with 
correlation id 4691. Disconnecting. (org.apache.kafka.clients.NetworkClient)}}
{{[2023-05-19 15:43:34,825] WARN [RaftManager id=1] Received error 
UNKNOWN_SERVER_ERROR from node 2 when making an ApiVersionsRequest with 
correlation id 4692. Disconnecting. (org.apache.kafka.clients.NetworkClient)}}
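
The election never completes in the log above because node 1 only ever holds its own vote. The arithmetic can be sketched as follows. `VoteTally` and its `Vote` enum are hypothetical names made up for this illustration, not Kafka's Raft classes; the vote states are copied from the `voteStates` field in the log.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical tally of candidate votes, mirroring the voteStates field in the log. */
final class VoteTally {
    enum Vote { UNRECORDED, GRANTED, REJECTED }

    /** Smallest number of votes that forms a majority of the voter set. */
    static int majority(int voters) {
        return voters / 2 + 1;
    }

    /** A candidate wins only once granted votes reach a majority of all voters. */
    static boolean hasWon(Map<Integer, Vote> voteStates) {
        long granted = voteStates.values().stream()
            .filter(v -> v == Vote.GRANTED)
            .count();
        return granted >= majority(voteStates.size());
    }

    public static void main(String[] args) {
        Map<Integer, Vote> states = new LinkedHashMap<>();
        states.put(0, Vote.UNRECORDED);
        states.put(1, Vote.GRANTED); // node 1 votes for itself
        states.put(2, Vote.UNRECORDED);
        System.out.println(hasWon(states)); // false: 1 granted vote, 2 needed
    }
}
```

As long as the requests to nodes 0 and 2 fail with UNKNOWN_SERVER_ERROR, their votes stay UNRECORDED and each election times out, which matches the steadily increasing epoch and retries counters.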

 

> CreateTopic fails with UnknownServerException if num partitions >= 
> QuorumController.MAX_RECORDS_PER_BATCH 
> --
>
> Key: KAFKA-14996
> URL: https://issues.apache.org/jira/browse/KAFKA-14996
> Project: Kafka
>  Issue Type: Bug
>  Components: controller
>Reporter: Edoardo Comar
>Assignee: Edoardo Comar
>Priority: Major
>
> If an attempt is made to create a topic with
> num partitions >= QuorumController.MAX_RECORDS_PER_BATCH (10000),
> the client receives an UnknownServerException; it should instead receive a 
> more descriptive error.
> The controller logs
> {{2023-05-12 19:25:10,018] WARN [QuorumController id=1] createTopics: failed 
> with unknown server exception IllegalStateException at epoch 2 in 21956 us.  
> Renouncing leadership and reverting to the last committed offset 174. 
> (org.apache.kafka.controller.QuorumController)}}
> {{java.lang.IllegalStateException: Attempted to atomically commit 10001 
> records, but maxRecordsPerBatch is 10000}}
> {{    at 
> org.apache.kafka.controller.QuorumController.appendRecords(QuorumController.java:812)}}
> {{    at 
> org.apache.kafka.controller.QuorumController$ControllerWriteEvent.run(QuorumController.java:719)}}
> {{    at 
> org.apache.kafka.queue.KafkaEventQueue$EventContext.run(KafkaEventQueue.java:127)}}
>