[jira] [Comment Edited] (KAFKA-14996) CreateTopic fails with UnknownServerException if num partitions >= QuorumController.MAX_RECORDS_PER_BATCH

[ https://issues.apache.org/jira/browse/KAFKA-14996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17724946#comment-17724946 ]

Edoardo Comar edited comment on KAFKA-14996 at 5/22/23 2:15 PM:

The controller instability is not reproducible with 3.4 (at git commit `2f13471181`), so it must be a regression.
Also, 3.5 (`10189d6159`) does not exhibit the controller bug.

was (Author: ecomar):
The controller instability is not reproducible with 3.4 (at git commit `2f13471181`), so it must be a regression.

> CreateTopic fails with UnknownServerException if num partitions >= QuorumController.MAX_RECORDS_PER_BATCH
> --
>
>                 Key: KAFKA-14996
>                 URL: https://issues.apache.org/jira/browse/KAFKA-14996
>             Project: Kafka
>          Issue Type: Bug
>          Components: controller
>            Reporter: Edoardo Comar
>            Assignee: Edoardo Comar
>            Priority: Critical
>
> If an attempt is made to create a topic with
> num partitions >= QuorumController.MAX_RECORDS_PER_BATCH (1),
> the client receives an UnknownServerException; it should receive a more specific error instead.
> The controller logs:
> {{[2023-05-12 19:25:10,018] WARN [QuorumController id=1] createTopics: failed with unknown server exception IllegalStateException at epoch 2 in 21956 us. Renouncing leadership and reverting to the last committed offset 174. (org.apache.kafka.controller.QuorumController)}}
> {{java.lang.IllegalStateException: Attempted to atomically commit 10001 records, but maxRecordsPerBatch is 1}}
> {{ at org.apache.kafka.controller.QuorumController.appendRecords(QuorumController.java:812)}}
> {{ at org.apache.kafka.controller.QuorumController$ControllerWriteEvent.run(QuorumController.java:719)}}
> {{ at org.apache.kafka.queue.KafkaEventQueue$EventContext.run(KafkaEventQueue.java:127)}}
> {{ at org.apache.kafka.queue.KafkaEventQueue$EventHandler.handleEvents(KafkaEventQueue.java:210)}}
> {{ at org.apache.kafka.queue.KafkaEventQueue$EventHandler.run(KafkaEventQueue.java:181)}}
> {{ at java.base/java.lang.Thread.run(Thread.java:829)}}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
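
The issue description above asks for a better error than UnknownServerException when a request would exceed the batch limit. One way to picture that is a size check that runs before any append is attempted. The following is a minimal Python sketch of the idea, not Kafka's Java code: the function names, the result type, the assumption that a topic needs one topic-level record plus one record per partition, and the limit of 10000 (chosen only so that the 10001-record request in the log fails) are all hypothetical.

```python
from dataclasses import dataclass

# Assumed limit for illustration only; not read from any Kafka source file.
ASSUMED_MAX_RECORDS_PER_BATCH = 10_000


@dataclass(frozen=True)
class CreateTopicResult:
    """Outcome of the hypothetical pre-check."""
    ok: bool
    error: str = ""


def records_needed(num_partitions: int) -> int:
    # Assumption: one topic-level record plus one record per partition.
    return 1 + num_partitions


def precheck_create_topic(
    num_partitions: int,
    max_records: int = ASSUMED_MAX_RECORDS_PER_BATCH,
) -> CreateTopicResult:
    """Reject an oversized request with a descriptive error instead of raising."""
    if num_partitions < 1:
        return CreateTopicResult(False, "number of partitions must be at least 1")
    needed = records_needed(num_partitions)
    if needed > max_records:
        return CreateTopicResult(
            False,
            f"request needs {needed} records in one atomic batch, "
            f"but the limit is {max_records}",
        )
    return CreateTopicResult(True)
```

With these assumptions, `precheck_create_topic(10_000)` is rejected with a message naming both the required record count and the limit, while `precheck_create_topic(9_999)` passes. The point of the sketch is only that the check happens before the write path, so an oversized request does not have to surface as an IllegalStateException.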
[jira] [Comment Edited] (KAFKA-14996) CreateTopic fails with UnknownServerException if num partitions >= QuorumController.MAX_RECORDS_PER_BATCH

[ https://issues.apache.org/jira/browse/KAFKA-14996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17724946#comment-17724946 ]

Edoardo Comar edited comment on KAFKA-14996 at 5/22/23 1:10 PM:

The controller instability is not reproducible with 3.4 (at git commit `2f13471181`), so it must be a regression.

was (Author: ecomar):
The controller instability is not reproducible with 3.4 (at git commit `721a917b44`), so it must be a regression.
[jira] [Comment Edited] (KAFKA-14996) CreateTopic fails with UnknownServerException if num partitions >= QuorumController.MAX_RECORDS_PER_BATCH

[ https://issues.apache.org/jira/browse/KAFKA-14996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17724307#comment-17724307 ]

Edoardo Comar edited comment on KAFKA-14996 at 5/19/23 2:55 PM:

Given that this means a client request can cause a cluster to become unavailable, I'd raise the Priority to Critical.
This is a potential denial of service attack?
cc [~mimaison] [~ijuma] [~rajinisiva...@gmail.com]

was (Author: ecomar):
Given that this means a client request can cause a cluster to become unavailable, I'd raise the Priority to Critical.
This is a potential denial of service attack.
cc [~mimaison] [~ijuma] [~rajinisiva...@gmail.com]
[jira] [Comment Edited] (KAFKA-14996) CreateTopic fails with UnknownServerException if num partitions >= QuorumController.MAX_RECORDS_PER_BATCH

[ https://issues.apache.org/jira/browse/KAFKA-14996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17724307#comment-17724307 ]

Edoardo Comar edited comment on KAFKA-14996 at 5/19/23 2:55 PM:

Given that this means a client request can cause a cluster to become unavailable, I'd raise the Priority to Critical.
This is a potential denial of service attack.
cc [~mimaison] [~ijuma] [~rajinisiva...@gmail.com]

was (Author: ecomar):
Given that this means a client request can cause a cluster to become unavailable, I'd raise the Priority to Critical.
[jira] [Comment Edited] (KAFKA-14996) CreateTopic fails with UnknownServerException if num partitions >= QuorumController.MAX_RECORDS_PER_BATCH

[ https://issues.apache.org/jira/browse/KAFKA-14996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17724300#comment-17724300 ]

Edoardo Comar edited comment on KAFKA-14996 at 5/19/23 2:48 PM:

A similar error is encountered when creating partitions > QuorumController.MAX_RECORDS_PER_BATCH on an existing topic.

More worrying is that the cluster looks like it can be unstable after the error occurs.
Seen in a cluster with 6 nodes: 0,1,2=broker,controller and 3,4,5=broker.

For example, server.log for node 1:

{{[2023-05-19 15:43:32,640] INFO [RaftManager id=1] Completed transition to CandidateState(localId=1, epoch=300, retries=86, voteStates=\{0=UNRECORDED, 1=GRANTED, 2=UNRECORDED}, highWatermark=Optional.empty, electionTimeoutMs=1145) from CandidateState(localId=1, epoch=299, retries=85, voteStates=\{0=UNRECORDED, 1=GRANTED, 2=UNRECORDED}, highWatermark=Optional.empty, electionTimeoutMs=1817) (org.apache.kafka.raft.QuorumState)}}
{{[2023-05-19 15:43:32,649] WARN [RaftManager id=1] Received error UNKNOWN_SERVER_ERROR from node 0 when making an ApiVersionsRequest with correlation id 4646. Disconnecting. (org.apache.kafka.clients.NetworkClient)}}
{{[2023-05-19 15:43:32,650] WARN [RaftManager id=1] Received error UNKNOWN_SERVER_ERROR from node 2 when making an ApiVersionsRequest with correlation id 4647. Disconnecting. (org.apache.kafka.clients.NetworkClient)}}
{{[2023-05-19 15:43:33,095] WARN [RaftManager id=1] Received error UNKNOWN_SERVER_ERROR from node 0 when making an ApiVersionsRequest with correlation id 4652. Disconnecting. (org.apache.kafka.clients.NetworkClient)}}
{{[2023-05-19 15:43:33,147] WARN [RaftManager id=1] Received error UNKNOWN_SERVER_ERROR from node 2 when making an ApiVersionsRequest with correlation id 4656. Disconnecting. (org.apache.kafka.clients.NetworkClient)}}
{{[2023-05-19 15:43:33,594] WARN [RaftManager id=1] Received error UNKNOWN_SERVER_ERROR from node 0 when making an ApiVersionsRequest with correlation id 4678. Disconnecting. (org.apache.kafka.clients.NetworkClient)}}
{{[2023-05-19 15:43:33,696] WARN [RaftManager id=1] Received error UNKNOWN_SERVER_ERROR from node 2 when making an ApiVersionsRequest with correlation id 4684. Disconnecting. (org.apache.kafka.clients.NetworkClient)}}
{{[2023-05-19 15:43:33,773] INFO [RaftManager id=1] Election has timed out, backing off for 1000ms before becoming a candidate again (org.apache.kafka.raft.KafkaRaftClient)}}
{{[2023-05-19 15:43:34,774] INFO [RaftManager id=1] Re-elect as candidate after election backoff has completed (org.apache.kafka.raft.KafkaRaftClient)}}
{{[2023-05-19 15:43:34,784] INFO [RaftManager id=1] Completed transition to CandidateState(localId=1, epoch=301, retries=87, voteStates=\{0=UNRECORDED, 1=GRANTED, 2=UNRECORDED}, highWatermark=Optional.empty, electionTimeoutMs=1022) from CandidateState(localId=1, epoch=300, retries=86, voteStates=\{0=UNRECORDED, 1=GRANTED, 2=UNRECORDED}, highWatermark=Optional.empty, electionTimeoutMs=1145) (org.apache.kafka.raft.QuorumState)}}
{{[2023-05-19 15:43:34,802] WARN [RaftManager id=1] Received error UNKNOWN_SERVER_ERROR from node 0 when making an ApiVersionsRequest with correlation id 4691. Disconnecting. (org.apache.kafka.clients.NetworkClient)}}
{{[2023-05-19 15:43:34,825] WARN [RaftManager id=1] Received error UNKNOWN_SERVER_ERROR from node 2 when making an ApiVersionsRequest with correlation id 4692. Disconnecting. (org.apache.kafka.clients.NetworkClient)}}

In this state, client requests that should mutate the metadata (e.g. deleting a topic) always time out:

{{% bin/kafka-topics.sh --bootstrap-server localhost:9092 --delete --topic edotest1}}
{{Error while executing topic command : Call(callName=deleteTopics, deadlineMs=1684507597582, tries=1, nextAllowedTryMs=1684507597698) timed out at 1684507597598 after 1 attempt(s)}}
{{[2023-05-19 15:46:37,602] ERROR org.apache.kafka.common.errors.TimeoutException: Call(callName=deleteTopics, deadlineMs=1684507597582, tries=1, nextAllowedTryMs=1684507597698) timed out at 1684507597598 after 1 attempt(s)}}
{{Caused by: org.apache.kafka.common.errors.DisconnectException: Cancelled deleteTopics request with correlation id 5 due to node 5 being disconnected}}
{{ (kafka.admin.TopicCommand$)}}
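
The server.log excerpt in this comment shows node 1 repeatedly re-entering CandidateState with a growing epoch and retry count, which is what suggests the quorum never elects a leader. A rough way to spot that pattern in a log file is sketched below in Python. It is only an illustration: the regular expression is written against the line format quoted in this comment, the sample lines are abbreviated copies of those lines, and the function names and the `min_retries` threshold are hypothetical choices, not part of any Kafka tooling.

```python
import re

# Matches the target state of "Completed transition to CandidateState(...)" lines.
CANDIDATE_RE = re.compile(
    r"\[RaftManager id=(?P<node>\d+)\] Completed transition to "
    r"CandidateState\(localId=\d+, epoch=(?P<epoch>\d+), retries=(?P<retries>\d+)"
)


def candidate_transitions(lines):
    """Return (node, epoch, retries) for every candidate transition found."""
    found = []
    for line in lines:
        m = CANDIDATE_RE.search(line)
        if m:
            found.append((int(m["node"]), int(m["epoch"]), int(m["retries"])))
    return found


def looks_stuck(transitions, min_retries=10):
    """Heuristic: epochs strictly increase and the latest retry count is high."""
    if not transitions:
        return False
    epochs = [epoch for _, epoch, _ in transitions]
    increasing = all(a < b for a, b in zip(epochs, epochs[1:]))
    return increasing and transitions[-1][2] >= min_retries


# Abbreviated copies of two lines from the excerpt in this comment.
sample = [
    "[2023-05-19 15:43:32,640] INFO [RaftManager id=1] Completed transition to "
    "CandidateState(localId=1, epoch=300, retries=86, voteStates={0=UNRECORDED, "
    "1=GRANTED, 2=UNRECORDED}, highWatermark=Optional.empty, electionTimeoutMs=1145)",
    "[2023-05-19 15:43:34,784] INFO [RaftManager id=1] Completed transition to "
    "CandidateState(localId=1, epoch=301, retries=87, voteStates={0=UNRECORDED, "
    "1=GRANTED, 2=UNRECORDED}, highWatermark=Optional.empty, electionTimeoutMs=1022)",
]

transitions = candidate_transitions(sample)
```

On the two sample lines, `looks_stuck(transitions)` is true because the epoch rises from 300 to 301 while the retry count is already 87. The threshold is arbitrary; a healthy election normally settles after far fewer retries, so any small value would flag the log above.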