[jira] [Updated] (KAFKA-14845) Broker ZNode creation can fail due to lost Zookeeper Session ID
[ https://issues.apache.org/jira/browse/KAFKA-14845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexandre Dupriez updated KAFKA-14845: -- Description: Our production environment faced a use case where registration of a broker failed due to the presence of a "conflicting" broker znode in Zookeeper. This case is not without familiarity to that fixed by KAFKA-6584 and induced by the Zookeeper bug (or feature) tracked in ZOOKEEPER-2985 opened as of today. A network partition disturbed communication channels between the Kafka and Zookeeper clusters for about 20% of the brokers in the cluster. One of this broker was not able to re-register with Zookeeper and was excluded from the cluster until it was restarted. Broker logs show the failed registration due to a "conflicting" znode write which in this case does not exactly match the scenario covered by KAFKA-6584. The sequence of logs on the broker is as follows. First, a connection is established with the Zookeeper node 3. {code:java} [2023-03-05 16:01:55,342] INFO Socket connection established, initiating session, client: /1.2.3.4:40200, server: zk.3/5.6.7.8:2182 (org.apache.zookeeper.ClientCnxn) [2023-03-05 16:01:55,342] INFO channel is connected: [id: 0x2b45ae40, L:/1.1.3.4:40200 - R:zk.3/5.6.7.8:2182] (org.apache.zookeeper.ClientCnxnSocketNetty){code} An existing Zookeeper session was expired, and upon reconnection, the Zookeeper state change handler was invoked. The creation of the ephemeral znode /brokers/ids/18 started on the controller thread. {code:java} [2023-03-05 16:01:55,345] INFO Creating /brokers/ids/18 (is it secure? false) (kafka.zk.KafkaZkClient){code} The client "session" timed out after 6 seconds. Note the session is 0x0 and the absence of "{_}Session establishment complete{_}" log: the broker appears to have never received or processed the response from the Zookeeper node. {code:java} [2023-03-05 16:02:01,343] INFO Client session timed out, have not heard from server in 6000ms for sessionid 0x0, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn) [2023-03-05 16:02:01,343] INFO channel is disconnected: [id: 0x2b45ae40, L:/1.2.3.4:40200 ! R:zk.3/5.6.7.8:2182] (org.apache.zookeeper.ClientCnxnSocketNetty){code} Pending requests were aborted with a {{CONNECTIONLOSS}} error and the client started waiting on a new connection notification. {code:java} [2023-03-05 16:02:01,343] INFO [ZooKeeperClient Kafka server] Waiting until connected. (kafka.zookeeper.ZooKeeperClient){code} A new connection was created with the Zookeeper node 1. Note that a valid (new) session ({{{}0x1006c6e0b830001{}}}) was reported by Kafka this time. {code:java} [2023-03-05 16:02:02,037] INFO Socket connection established, initiating session, client: /1.2.3.4:58080, server: zk.1/9.10.11.12:2182 (org.apache.zookeeper.ClientCnxn) [2023-03-05 16:02:02,037] INFO channel is connected: [id: 0x68fba106, L:/1.2.3.4:58080 - R:zk.1/9.10.11.12:2182] (org.apache.zookeeper.ClientCnxnSocketNetty) [2023-03-05 16:02:03,054] INFO Session establishment complete on server zk.1/9.10.11.12:2182, sessionid = 0x1006c6e0b830001, negotiated timeout = 18000 (org.apache.zookeeper.ClientCnxn){code} The Kafka ZK client is notified of the connection. {code:java} [2023-03-05 16:02:03,054] INFO [ZooKeeperClient Kafka server] Connected. (kafka.zookeeper.ZooKeeperClient){code} The broker sends the request to create the znode {{/brokers/ids/18}} which already exists. The error path implemented for KAFKA-6584 is then followed. However, in this case, the session owning the ephemeral node {{0x30043230ac1}} ({{{}216172783240153793{}}}) is different from the last active Zookeeper session which the broker has recorded. And it is also different from the current session {{0x1006c6e0b830001}} ({{{}72176813933264897{}}}), hence the recreation of the broker znode is not attempted. {code:java} [2023-03-05 16:02:04,466] ERROR Error while creating ephemeral at /brokers/ids/18, node already exists and owner '216172783240153793' does not match current session '72176813933264897' (kafka.zk.KafkaZkClient$CheckedEphemeral) org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists at org.apache.zookeeper.KeeperException.create(KeeperException.java:126) at kafka.zk.KafkaZkClient$CheckedEphemeral.getAfterNodeExists(KafkaZkClient.scala:1821) at kafka.zk.KafkaZkClient$CheckedEphemeral.create(KafkaZkClient.scala:1759) at kafka.zk.KafkaZkClient.checkedEphemeralCreate(KafkaZkClient.scala:1726) at kafka.zk.KafkaZkClient.registerBroker(KafkaZkClient.scala:95) at kafka.controller.KafkaController.processRegisterBrokerAndReelect(KafkaController.scala:1810) at kafka.controller.KafkaController.process(KafkaController.scala:1853) at
[jira] [Updated] (KAFKA-14845) Broker ZNode creation can fail due to lost Zookeeper Session ID
[ https://issues.apache.org/jira/browse/KAFKA-14845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexandre Dupriez updated KAFKA-14845: -- Description: Our production environment faced a use case where registration of a broker failed due to the presence of a "conflicting" broker znode in Zookeeper. This case is not without familiarity to that fixed by KAFKA-6584 and induced by the Zookeeper bug (or feature) tracked in ZOOKEEPER-2985 opened as of today. A network partition disturbed communication channels between the Kafka and Zookeeper clusters for about 20% of the brokers in the cluster. One of this broker was not able to re-register with Zookeeper and was excluded from the cluster until it was restarted. Broker logs show the failed registration due to a "conflicting" znode write which in this case does not exactly match the scenario covered by KAFKA-6584. The sequence of logs on the broker is as follows. First, a connection is established with the Zookeeper node 3. {code:java} [2023-03-05 16:01:55,342] INFO Socket connection established, initiating session, client: /1.2.3.4:40200, server: zk.3/5.6.7.8:2182 (org.apache.zookeeper.ClientCnxn) [2023-03-05 16:01:55,342] INFO channel is connected: [id: 0x2b45ae40, L:/1.1.3.4:40200 - R:zk.3/5.6.7.8:2182] (org.apache.zookeeper.ClientCnxnSocketNetty){code} An existing Zookeeper session was expired, and upon reconnection, the Zookeeper state change handler was invoked. The creation of the ephemeral znode /brokers/ids/18 started on the controller thread. {code:java} [2023-03-05 16:01:55,345] INFO Creating /brokers/ids/18 (is it secure? false) (kafka.zk.KafkaZkClient){code} The client "session" timed out after 6 seconds. Note the session is 0x0 and the absence of "{_}Session establishment complete{_}" log: the broker appears to have never received or processed the response from the Zookeeper node. {code:java} [2023-03-05 16:02:01,343] INFO Client session timed out, have not heard from server in 6000ms for sessionid 0x0, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn) [2023-03-05 16:02:01,343] INFO channel is disconnected: [id: 0x2b45ae40, L:/1.2.3.4:40200 ! R:zk.3/5.6.7.8:2182] (org.apache.zookeeper.ClientCnxnSocketNetty){code} Pending requests were aborted with a {{CONNECTIONLOSS}} error and the client started waiting on a new connection notification. {code:java} [2023-03-05 16:02:01,343] INFO [ZooKeeperClient Kafka server] Waiting until connected. (kafka.zookeeper.ZooKeeperClient){code} A new connection was created with the Zookeeper node 1. Note that a valid (new) session ({{{}0x1006c6e0b830001{}}}) was reported by Kafka this time. {code:java} [2023-03-05 16:02:02,037] INFO Socket connection established, initiating session, client: /1.2.3.4:58080, server: zk.1/9.10.11.12:2182 (org.apache.zookeeper.ClientCnxn) [2023-03-05 16:02:02,037] INFO channel is connected: [id: 0x68fba106, L:/1.2.3.4:58080 - R:zk.1/9.10.11.12:2182] (org.apache.zookeeper.ClientCnxnSocketNetty) [2023-03-05 16:02:03,054] INFO Session establishment complete on server zk.1/9.10.11.12:2182, sessionid = 0x1006c6e0b830001, negotiated timeout = 18000 (org.apache.zookeeper.ClientCnxn){code} The Kafka ZK client is notified of the connection. {code:java} [2023-03-05 16:02:03,054] INFO [ZooKeeperClient Kafka server] Connected. (kafka.zookeeper.ZooKeeperClient){code} The broker sends the request to create the znode {{/brokers/ids/18}} which already exists. The error path implemented for KAFKA-6584 is then followed. However, in this case, the session owning the ephemeral node {{0x30043230ac1}} ({{{}216172783240153793{}}}) is different from the last active Zookeeper session which the broker has recorded. And it is also different from the current session {{0x1006c6e0b830001}} ({{{}72176813933264897{}}}), hence the recreation of the broker znode is not attempted. {code:java} [2023-03-05 16:02:04,466] ERROR Error while creating ephemeral at /brokers/ids/18, node already exists and owner '216172783240153793' does not match current session '72176813933264897' (kafka.zk.KafkaZkClient$CheckedEphemeral) org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists at org.apache.zookeeper.KeeperException.create(KeeperException.java:126) at kafka.zk.KafkaZkClient$CheckedEphemeral.getAfterNodeExists(KafkaZkClient.scala:1821) at kafka.zk.KafkaZkClient$CheckedEphemeral.create(KafkaZkClient.scala:1759) at kafka.zk.KafkaZkClient.checkedEphemeralCreate(KafkaZkClient.scala:1726) at kafka.zk.KafkaZkClient.registerBroker(KafkaZkClient.scala:95) at kafka.controller.KafkaController.processRegisterBrokerAndReelect(KafkaController.scala:1810) at kafka.controller.KafkaController.process(KafkaController.scala:1853) at
[jira] [Updated] (KAFKA-14845) Broker ZNode creation can fail due to lost Zookeeper Session ID
[ https://issues.apache.org/jira/browse/KAFKA-14845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexandre Dupriez updated KAFKA-14845: -- Description: Our production environment faced a use case where registration of a broker failed due to the presence of a "conflicting" broker znode in Zookeeper. This case is not without familiarity to that fixed by KAFKA-6584 and induced by the Zookeeper bug (or feature) tracked in ZOOKEEPER-2985 opened as of today. A network partition disturbed communication channels between the Kafka and Zookeeper clusters for about 20% of the brokers in the cluster. One of this broker was not able to re-register with Zookeeper and was excluded from the cluster until it was restarted. Broker logs show the failed registration due to a "conflicting" znode write which in this case does not exactly match the scenario covered by KAFKA-6584. The sequence of logs on the broker is as follows. First, a connection is established with the Zookeeper node 3. {code:java} [2023-03-05 16:01:55,342] INFO Socket connection established, initiating session, client: /1.2.3.4:40200, server: zk.3/5.6.7.8:2182 (org.apache.zookeeper.ClientCnxn) [2023-03-05 16:01:55,342] INFO channel is connected: [id: 0x2b45ae40, L:/1.1.3.4:40200 - R:zk.3/5.6.7.8:2182] (org.apache.zookeeper.ClientCnxnSocketNetty){code} An existing Zookeeper session was expired, and upon reconnection, the Zookeeper state change handler was invoked. The creation of the ephemeral znode /brokers/ids/18 started on the controller thread. {code:java} [2023-03-05 16:01:55,345] INFO Creating /brokers/ids/18 (is it secure? false) (kafka.zk.KafkaZkClient){code} The client "session" timed out after 6 seconds. Note the session is 0x0 and the absence of "{_}Session establishment complete{_}" log: the broker appears to have never received or processed the response from the Zookeeper node. {code:java} [2023-03-05 16:02:01,343] INFO Client session timed out, have not heard from server in 6000ms for sessionid 0x0, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn) [2023-03-05 16:02:01,343] INFO channel is disconnected: [id: 0x2b45ae40, L:/1.2.3.4:40200 ! R:zk.3/5.6.7.8:2182] (org.apache.zookeeper.ClientCnxnSocketNetty){code} Pending requests were aborted with a {{CONNECTIONLOSS}} error and the client started waiting on a new connection notification. {code:java} [2023-03-05 16:02:01,343] INFO [ZooKeeperClient Kafka server] Waiting until connected. (kafka.zookeeper.ZooKeeperClient){code} A new connection was created with the Zookeeper node 1. Note that a valid (new) session ({{{}0x1006c6e0b830001{}}}) was reported by Kafka this time. {code:java} [2023-03-05 16:02:02,037] INFO Socket connection established, initiating session, client: /1.2.3.4:58080, server: zk.1/9.10.11.12:2182 (org.apache.zookeeper.ClientCnxn) [2023-03-05 16:02:02,037] INFO channel is connected: [id: 0x68fba106, L:/1.2.3.4:58080 - R:zk.1/9.10.11.12:2182] (org.apache.zookeeper.ClientCnxnSocketNetty) [2023-03-05 16:02:03,054] INFO Session establishment complete on server zk.1/9.10.11.12:2182, sessionid = 0x1006c6e0b830001, negotiated timeout = 18000 (org.apache.zookeeper.ClientCnxn){code} The Kafka ZK client is notified of the connection. {code:java} [2023-03-05 16:02:03,054] INFO [ZooKeeperClient Kafka server] Connected. (kafka.zookeeper.ZooKeeperClient){code} The broker sends the request to create the znode {{/brokers/ids/18}} which already exists. The error path implemented for KAFKA-6584 is then followed. However, in this case, the session owning the ephemeral node {{0x30043230ac1}} ({{{}216172783240153793{}}}) is different from the last active Zookeeper session which the broker has recorded. And it is also different from the current session {{0x1006c6e0b830001}} ({{{}72176813933264897{}}}), hence the recreation of the broker znode is not attempted. {code:java} [2023-03-05 16:02:04,466] ERROR Error while creating ephemeral at /brokers/ids/18, node already exists and owner '216172783240153793' does not match current session '72176813933264897' (kafka.zk.KafkaZkClient$CheckedEphemeral) org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists at org.apache.zookeeper.KeeperException.create(KeeperException.java:126) at kafka.zk.KafkaZkClient$CheckedEphemeral.getAfterNodeExists(KafkaZkClient.scala:1821) at kafka.zk.KafkaZkClient$CheckedEphemeral.create(KafkaZkClient.scala:1759) at kafka.zk.KafkaZkClient.checkedEphemeralCreate(KafkaZkClient.scala:1726) at kafka.zk.KafkaZkClient.registerBroker(KafkaZkClient.scala:95) at kafka.controller.KafkaController.processRegisterBrokerAndReelect(KafkaController.scala:1810) at kafka.controller.KafkaController.process(KafkaController.scala:1853) at
[jira] [Updated] (KAFKA-14845) Broker ZNode creation can fail due to lost Zookeeper Session ID
[ https://issues.apache.org/jira/browse/KAFKA-14845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexandre Dupriez updated KAFKA-14845: -- Description: Our production environment faced a use case where registration of a broker failed due to the presence of a "conflicting" broker znode in Zookeeper. This case is not without familiarity to that fixed by KAFKA-6584 and induced by the Zookeeper bug (or feature) tracked in ZOOKEEPER-2985 opened as of today. A network partition disturbed communication channels between the Kafka and Zookeeper clusters for about 20% of the brokers in the cluster. One of this broker was not able to re-register with Zookeeper and was excluded from the cluster until it was restarted. Broker logs show the failed registration due to a "conflicting" znode write which in this case does not exactly match the scenario covered by KAFKA-6584. The sequence of logs on the broker is as follows. First, a connection is established with the Zookeeper node 3. {code:java} [2023-03-05 16:01:55,342] INFO Socket connection established, initiating session, client: /1.2.3.4:40200, server: zk.3/5.6.7.8:2182 (org.apache.zookeeper.ClientCnxn) [2023-03-05 16:01:55,342] INFO channel is connected: [id: 0x2b45ae40, L:/1.1.3.4:40200 - R:zk.3/5.6.7.8:2182] (org.apache.zookeeper.ClientCnxnSocketNetty){code} An existing Zookeeper session was expired, and upon reconnection, the Zookeeper state change handler was invoked. The creation of the ephemeral znode /brokers/ids/18 started on the controller thread. {code:java} [2023-03-05 16:01:55,345] INFO Creating /brokers/ids/18 (is it secure? false) (kafka.zk.KafkaZkClient){code} The client "session" timed out after 6 seconds. Note the session is 0x0 and the absence of "{_}Session establishment complete{_}" log: the broker appears to have never received or processed the response from the Zookeeper node. {code:java} [2023-03-05 16:02:01,343] INFO Client session timed out, have not heard from server in 6000ms for sessionid 0x0, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn) [2023-03-05 16:02:01,343] INFO channel is disconnected: [id: 0x2b45ae40, L:/1.2.3.4:40200 ! R:zk.3/5.6.7.8:2182] (org.apache.zookeeper.ClientCnxnSocketNetty){code} Pending requests were aborted with a {{CONNECTIONLOSS}} error and the client started waiting on a new connection notification. {code:java} [2023-03-05 16:02:01,343] INFO [ZooKeeperClient Kafka server] Waiting until connected. (kafka.zookeeper.ZooKeeperClient){code} A new connection was created with the Zookeeper node 1. Note that a valid (new) session ({{{}0x1006c6e0b830001{}}}) was reported by Kafka this time. {code:java} [2023-03-05 16:02:02,037] INFO Socket connection established, initiating session, client: /1.2.3.4:58080, server: zk.1/9.10.11.12:2182 (org.apache.zookeeper.ClientCnxn) [2023-03-05 16:02:02,037] INFO channel is connected: [id: 0x68fba106, L:/1.2.3.4:58080 - R:zk.1/9.10.11.12:2182] (org.apache.zookeeper.ClientCnxnSocketNetty) [2023-03-05 16:02:03,054] INFO Session establishment complete on server zk.1/9.10.11.12:2182, sessionid = 0x1006c6e0b830001, negotiated timeout = 18000 (org.apache.zookeeper.ClientCnxn){code} The Kafka ZK client is notified of the connection. {code:java} [2023-03-05 16:02:03,054] INFO [ZooKeeperClient Kafka server] Connected. (kafka.zookeeper.ZooKeeperClient){code} The broker sends the request to create the znode {{/brokers/ids/18}} which already exists. The error path implemented for KAFKA-6584 is then followed. However, in this case, the session owning the ephemeral node {{0x30043230ac1}} ({{{}216172783240153793{}}}) is different from the last active Zookeeper session which the broker has recorded. And it is also different from the current session {{0x1006c6e0b830001}} ({{{}72176813933264897{}}}), hence the recreation of the broker znode is not attempted. {code:java} [2023-03-05 16:02:04,466] ERROR Error while creating ephemeral at /brokers/ids/18, node already exists and owner '216172783240153793' does not match current session '72176813933264897' (kafka.zk.KafkaZkClient$CheckedEphemeral) org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists at org.apache.zookeeper.KeeperException.create(KeeperException.java:126) at kafka.zk.KafkaZkClient$CheckedEphemeral.getAfterNodeExists(KafkaZkClient.scala:1821) at kafka.zk.KafkaZkClient$CheckedEphemeral.create(KafkaZkClient.scala:1759) at kafka.zk.KafkaZkClient.checkedEphemeralCreate(KafkaZkClient.scala:1726) at kafka.zk.KafkaZkClient.registerBroker(KafkaZkClient.scala:95) at kafka.controller.KafkaController.processRegisterBrokerAndReelect(KafkaController.scala:1810) at kafka.controller.KafkaController.process(KafkaController.scala:1853) at
[jira] [Updated] (KAFKA-14845) Broker ZNode creation can fail due to lost Zookeeper Session ID
[ https://issues.apache.org/jira/browse/KAFKA-14845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexandre Dupriez updated KAFKA-14845: -- Description: Our production environment faced a use case where registration of a broker failed due to the presence of a "conflicting" broker znode in Zookeeper. This case is not without familiarity to that fixed by KAFKA-6584 and induced by the Zookeeper bug (or feature) tracked in ZOOKEEPER-2985 opened as of today. A network partition disturbed communication channels between the Kafka and Zookeeper clusters for about 20% of the brokers in the cluster. One of this broker was not able to re-register with Zookeeper and was excluded from the cluster until it was restarted. Broker logs show the failed registration due to a "conflicting" znode write which in this case does not exactly match the scenario covered by KAFKA-6584. The sequence of logs on the broker is as follows. First, a connection is established with the Zookeeper node 3. {code:java} [2023-03-05 16:01:55,342] INFO Socket connection established, initiating session, client: /1.2.3.4:40200, server: zk.3/5.6.7.8:2182 (org.apache.zookeeper.ClientCnxn) [2023-03-05 16:01:55,342] INFO channel is connected: [id: 0x2b45ae40, L:/1.1.3.4:40200 - R:zk.3/5.6.7.8:2182] (org.apache.zookeeper.ClientCnxnSocketNetty){code} An existing Zookeeper session was expired, and upon reconnection, the Zookeeper state change handler was invoked. The creation of the ephemeral znode /brokers/ids/18 started on the controller thread. {code:java} [2023-03-05 16:01:55,345] INFO Creating /brokers/ids/18 (is it secure? false) (kafka.zk.KafkaZkClient){code} The client "session" timed out after 6 seconds. Note the session is 0x0 and the absence of "{_}Session establishment complete{_}" log: the broker appears to have never received or processed the response from the Zookeeper node. {code:java} [2023-03-05 16:02:01,343] INFO Client session timed out, have not heard from server in 6000ms for sessionid 0x0, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn) [2023-03-05 16:02:01,343] INFO channel is disconnected: [id: 0x2b45ae40, L:/1.2.3.4:40200 ! R:zk.3/5.6.7.8:2182] (org.apache.zookeeper.ClientCnxnSocketNetty){code} Pending requests were aborted with a {{CONNECTIONLOSS}} error and the client started waiting on a new connection notification. {code:java} [2023-03-05 16:02:01,343] INFO [ZooKeeperClient Kafka server] Waiting until connected. (kafka.zookeeper.ZooKeeperClient){code} A new connection was created with the Zookeeper node 1. Note that a valid (new) session ({{{}0x1006c6e0b830001{}}}) was reported by Kafka this time. {code:java} [2023-03-05 16:02:02,037] INFO Socket connection established, initiating session, client: /1.2.3.4:58080, server: zk.1/9.10.11.12:2182 (org.apache.zookeeper.ClientCnxn) [2023-03-05 16:02:02,037] INFO channel is connected: [id: 0x68fba106, L:/1.2.3.4:58080 - R:zk.1/9.10.11.12:2182] (org.apache.zookeeper.ClientCnxnSocketNetty) [2023-03-05 16:02:03,054] INFO Session establishment complete on server zk.1/9.10.11.12:2182, sessionid = 0x1006c6e0b830001, negotiated timeout = 18000 (org.apache.zookeeper.ClientCnxn){code} The Kafka ZK client is notified of the connection. {code:java} [2023-03-05 16:02:03,054] INFO [ZooKeeperClient Kafka server] Connected. (kafka.zookeeper.ZooKeeperClient){code} The broker sends the request to create the znode {{/brokers/ids/18}} which already exists. The error path implemented for KAFKA-6584 is then followed. However, in this case, the session owning the ephemeral node {{0x30043230ac1}} ({{{}216172783240153793{}}}) is different from the last active Zookeeper session which the broker has recorded. And it is also different from the current session {{0x1006c6e0b830001}} ({{{}72176813933264897{}}}), hence the recreation of the broker znode is not attempted. {code:java} [2023-03-05 16:02:04,466] ERROR Error while creating ephemeral at /brokers/ids/18, node already exists and owner '216172783240153793' does not match current session '72176813933264897' (kafka.zk.KafkaZkClient$CheckedEphemeral) org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists at org.apache.zookeeper.KeeperException.create(KeeperException.java:126) at kafka.zk.KafkaZkClient$CheckedEphemeral.getAfterNodeExists(KafkaZkClient.scala:1821) at kafka.zk.KafkaZkClient$CheckedEphemeral.create(KafkaZkClient.scala:1759) at kafka.zk.KafkaZkClient.checkedEphemeralCreate(KafkaZkClient.scala:1726) at kafka.zk.KafkaZkClient.registerBroker(KafkaZkClient.scala:95) at kafka.controller.KafkaController.processRegisterBrokerAndReelect(KafkaController.scala:1810) at kafka.controller.KafkaController.process(KafkaController.scala:1853) at
[jira] [Updated] (KAFKA-14845) Broker ZNode creation can fail due to lost Zookeeper Session ID
[ https://issues.apache.org/jira/browse/KAFKA-14845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexandre Dupriez updated KAFKA-14845: -- Description: Our production environment faced a use case where registration of a broker failed due to the presence of a "conflicting" broker znode in Zookeeper. This case is not without familiarity to that fixed by KAFKA-6584 and induced by the Zookeeper bug (or feature) tracked in ZOOKEEPER-2985 opened as of today. A network partition disturbed communication channels between the Kafka and Zookeeper clusters for about 20% of the brokers in the cluster. One of this broker was not able to re-register with Zookeeper and was excluded from the cluster until it was restarted. Broker logs show the failed registration due to a "conflicting" znode write which in this case does not exactly match the scenario covered by KAFKA-6584. The sequence of logs on the broker is as follows. First, a connection is established with the Zookeeper node 3. {code:java} [2023-03-05 16:01:55,342] INFO Socket connection established, initiating session, client: /1.2.3.4:40200, server: zk.3/5.6.7.8:2182 (org.apache.zookeeper.ClientCnxn) [2023-03-05 16:01:55,342] INFO channel is connected: [id: 0x2b45ae40, L:/1.1.3.4:40200 - R:zk.3/5.6.7.8:2182] (org.apache.zookeeper.ClientCnxnSocketNetty){code} An existing Zookeeper session was expired, and upon reconnection, the Zookeeper state change handler was invoked. The creation of the ephemeral znode /brokers/ids/18 started on the controller thread. {code:java} [2023-03-05 16:01:55,345] INFO Creating /brokers/ids/18 (is it secure? false) (kafka.zk.KafkaZkClient){code} The client "session" timed out after 6 seconds. Note the session is 0x0 and the absence of "{_}Session establishment complete{_}" log: the broker appears to have never received or processed the response from the Zookeeper node. {code:java} [2023-03-05 16:02:01,343] INFO Client session timed out, have not heard from server in 6000ms for sessionid 0x0, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn) [2023-03-05 16:02:01,343] INFO channel is disconnected: [id: 0x2b45ae40, L:/1.2.3.4:40200 ! R:zk.3/5.6.7.8:2182] (org.apache.zookeeper.ClientCnxnSocketNetty){code} Pending requests were aborted with a {{CONNECTIONLOSS}} error and the client started waiting on a new connection notification. {code:java} [2023-03-05 16:02:01,343] INFO [ZooKeeperClient Kafka server] Waiting until connected. (kafka.zookeeper.ZooKeeperClient){code} A new connection was created with the Zookeeper node 1. Note that a valid (new) session ({{{}0x1006c6e0b830001{}}}) was reported by Kafka this time. {code:java} [2023-03-05 16:02:02,037] INFO Socket connection established, initiating session, client: /1.2.3.4:58080, server: zk.1/9.10.11.12:2182 (org.apache.zookeeper.ClientCnxn) [2023-03-05 16:02:02,037] INFO channel is connected: [id: 0x68fba106, L:/1.2.3.4:58080 - R:zk.1/9.10.11.12:2182] (org.apache.zookeeper.ClientCnxnSocketNetty) [2023-03-05 16:02:03,054] INFO Session establishment complete on server zk.1/9.10.11.12:2182, sessionid = 0x1006c6e0b830001, negotiated timeout = 18000 (org.apache.zookeeper.ClientCnxn){code} The Kafka ZK client is notified of the connection. {code:java} [2023-03-05 16:02:03,054] INFO [ZooKeeperClient Kafka server] Connected. (kafka.zookeeper.ZooKeeperClient){code} The broker sends the request to create the znode {{/brokers/ids/18}} which already exists. The error path implemented for KAFKA-6584 is then followed. However, in this case, the session owning the ephemeral node {{0x30043230ac1}} ({{{}216172783240153793{}}}) is different from the last active Zookeeper session which the broker has recorded. And it is also different from the current session {{0x1006c6e0b830001}} ({{{}72176813933264897{}}}), hence the recreation of the broker znode is not attempted. {code:java} [2023-03-05 16:02:04,466] ERROR Error while creating ephemeral at /brokers/ids/18, node already exists and owner '216172783240153793' does not match current session '72176813933264897' (kafka.zk.KafkaZkClient$CheckedEphemeral) org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists at org.apache.zookeeper.KeeperException.create(KeeperException.java:126) at kafka.zk.KafkaZkClient$CheckedEphemeral.getAfterNodeExists(KafkaZkClient.scala:1821) at kafka.zk.KafkaZkClient$CheckedEphemeral.create(KafkaZkClient.scala:1759) at kafka.zk.KafkaZkClient.checkedEphemeralCreate(KafkaZkClient.scala:1726) at kafka.zk.KafkaZkClient.registerBroker(KafkaZkClient.scala:95) at kafka.controller.KafkaController.processRegisterBrokerAndReelect(KafkaController.scala:1810) at kafka.controller.KafkaController.process(KafkaController.scala:1853) at
[jira] [Updated] (KAFKA-14845) Broker ZNode creation can fail due to lost Zookeeper Session ID
[ https://issues.apache.org/jira/browse/KAFKA-14845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexandre Dupriez updated KAFKA-14845: -- Attachment: (was: broker-registration.drawio (4).png) > Broker ZNode creation can fail due to lost Zookeeper Session ID > --- > > Key: KAFKA-14845 > URL: https://issues.apache.org/jira/browse/KAFKA-14845 > Project: Kafka > Issue Type: Bug >Reporter: Alexandre Dupriez >Assignee: Alexandre Dupriez >Priority: Minor > Attachments: kafka-broker-reg.log, phoque.png > > > Our production environment faced a use case where registration of a broker > failed due to the presence of a "conflicting" broker znode in Zookeeper. This > case is not without familiarity to that fixed by KAFKA-6584 and induced by > the Zookeeper bug (or feature) tracked in ZOOKEEPER-2985 opened as of today. > A network partition disturbed communication channels between the Kafka and > Zookeeper clusters for about 20% of the brokers in the cluster. One of this > broker was not able to re-register with Zookeeper and was excluded from the > cluster until it was restarted. Broker logs show the failed registration due > to a "conflicting" znode write which in this case does not exactly match the > scenario covered by KAFKA-6584. > The sequence of logs on the broker is as follows. > First, a connection is established with the Zookeeper node 3. > {code:java} > [2023-03-05 16:01:55,342] INFO Socket connection established, initiating > session, client: /1.2.3.4:40200, server: zk.3/5.6.7.8:2182 > (org.apache.zookeeper.ClientCnxn) > [2023-03-05 16:01:55,342] INFO channel is connected: [id: 0x2b45ae40, > L:/1.1.3.4:40200 - R:zk.3/5.6.7.8:2182] > (org.apache.zookeeper.ClientCnxnSocketNetty){code} > An existing Zookeeper session was expired, and upon reconnection, the > Zookeeper state change handler was invoked. The creation of the ephemeral > znode /brokers/ids/18 started on the controller thread. > {code:java} > [2023-03-05 16:01:55,345] INFO Creating /brokers/ids/18 (is it secure? false) > (kafka.zk.KafkaZkClient){code} > The client "session" timed out after 6 seconds. Note the session is 0x0 and > the absence of "{_}Session establishment complete{_}" log: the broker appears > to have never received or processed the response from the Zookeeper node. > {code:java} > [2023-03-05 16:02:01,343] INFO Client session timed out, have not heard from > server in 6000ms for sessionid 0x0, closing socket connection and attempting > reconnect (org.apache.zookeeper.ClientCnxn) > [2023-03-05 16:02:01,343] INFO channel is disconnected: [id: 0x2b45ae40, > L:/1.2.3.4:40200 ! R:zk.3/5.6.7.8:2182] > (org.apache.zookeeper.ClientCnxnSocketNetty){code} > Pending requests were aborted with a {{CONNECTIONLOSS}} error and the client > started waiting on a new connection notification. > {code:java} > [2023-03-05 16:02:01,343] INFO [ZooKeeperClient Kafka server] Waiting until > connected. (kafka.zookeeper.ZooKeeperClient){code} > A new connection was created with the Zookeeper node 1. Note that a valid > (new) session ({{{}0x1006c6e0b830001{}}}) was reported by Kafka this time. > {code:java} > [2023-03-05 16:02:02,037] INFO Socket connection established, initiating > session, client: /1.2.3.4:58080, server: zk.1/9.10.11.12:2182 > (org.apache.zookeeper.ClientCnxn) > [2023-03-05 16:02:02,037] INFO channel is connected: [id: 0x68fba106, > L:/1.2.3.4:58080 - R:zk.1/9.10.11.12:2182] > (org.apache.zookeeper.ClientCnxnSocketNetty) > [2023-03-05 16:02:03,054] INFO Session establishment complete on server > zk.1/9.10.11.12:2182, sessionid = 0x1006c6e0b830001, negotiated timeout = > 18000 (org.apache.zookeeper.ClientCnxn){code} > The Kafka ZK client is notified of the connection. > {code:java} > [2023-03-05 16:02:03,054] INFO [ZooKeeperClient Kafka server] Connected. > (kafka.zookeeper.ZooKeeperClient){code} > The broker sends the request to create the znode {{/brokers/ids/18}} which > already exists. The error path implemented for KAFKA-6584 is then followed. > However, in this case, the session owning the ephemeral node > {{0x30043230ac1}} ({{{}216172783240153793{}}}) is different from the last > active Zookeeper session which the broker has recorded. And it is also > different from the current session {{0x1006c6e0b830001}} > ({{{}72176813933264897{}}}), hence the recreation of the broker znode is not > attempted. > {code:java} > [2023-03-05 16:02:04,466] ERROR Error while creating ephemeral at > /brokers/ids/18, node already exists and owner '216172783240153793' does not > match current session '72176813933264897' > (kafka.zk.KafkaZkClient$CheckedEphemeral) > org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = > NodeExists > at >
[jira] [Updated] (KAFKA-14845) Broker ZNode creation can fail due to lost Zookeeper Session ID
[ https://issues.apache.org/jira/browse/KAFKA-14845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexandre Dupriez updated KAFKA-14845: -- Attachment: phoque.png > Broker ZNode creation can fail due to lost Zookeeper Session ID > --- > > Key: KAFKA-14845 > URL: https://issues.apache.org/jira/browse/KAFKA-14845 > Project: Kafka > Issue Type: Bug >Reporter: Alexandre Dupriez >Assignee: Alexandre Dupriez >Priority: Minor > Attachments: kafka-broker-reg.log, phoque.png > > > Our production environment faced a use case where registration of a broker > failed due to the presence of a "conflicting" broker znode in Zookeeper. This > case is not without familiarity to that fixed by KAFKA-6584 and induced by > the Zookeeper bug (or feature) tracked in ZOOKEEPER-2985 opened as of today. > A network partition disturbed communication channels between the Kafka and > Zookeeper clusters for about 20% of the brokers in the cluster. One of this > broker was not able to re-register with Zookeeper and was excluded from the > cluster until it was restarted. Broker logs show the failed registration due > to a "conflicting" znode write which in this case does not exactly match the > scenario covered by KAFKA-6584. > The sequence of logs on the broker is as follows. > First, a connection is established with the Zookeeper node 3. > {code:java} > [2023-03-05 16:01:55,342] INFO Socket connection established, initiating > session, client: /1.2.3.4:40200, server: zk.3/5.6.7.8:2182 > (org.apache.zookeeper.ClientCnxn) > [2023-03-05 16:01:55,342] INFO channel is connected: [id: 0x2b45ae40, > L:/1.1.3.4:40200 - R:zk.3/5.6.7.8:2182] > (org.apache.zookeeper.ClientCnxnSocketNetty){code} > An existing Zookeeper session was expired, and upon reconnection, the > Zookeeper state change handler was invoked. The creation of the ephemeral > znode /brokers/ids/18 started on the controller thread. > {code:java} > [2023-03-05 16:01:55,345] INFO Creating /brokers/ids/18 (is it secure? false) > (kafka.zk.KafkaZkClient){code} > The client "session" timed out after 6 seconds. Note the session is 0x0 and > the absence of "{_}Session establishment complete{_}" log: the broker appears > to have never received or processed the response from the Zookeeper node. > {code:java} > [2023-03-05 16:02:01,343] INFO Client session timed out, have not heard from > server in 6000ms for sessionid 0x0, closing socket connection and attempting > reconnect (org.apache.zookeeper.ClientCnxn) > [2023-03-05 16:02:01,343] INFO channel is disconnected: [id: 0x2b45ae40, > L:/1.2.3.4:40200 ! R:zk.3/5.6.7.8:2182] > (org.apache.zookeeper.ClientCnxnSocketNetty){code} > Pending requests were aborted with a {{CONNECTIONLOSS}} error and the client > started waiting on a new connection notification. > {code:java} > [2023-03-05 16:02:01,343] INFO [ZooKeeperClient Kafka server] Waiting until > connected. (kafka.zookeeper.ZooKeeperClient){code} > A new connection was created with the Zookeeper node 1. Note that a valid > (new) session ({{{}0x1006c6e0b830001{}}}) was reported by Kafka this time. > {code:java} > [2023-03-05 16:02:02,037] INFO Socket connection established, initiating > session, client: /1.2.3.4:58080, server: zk.1/9.10.11.12:2182 > (org.apache.zookeeper.ClientCnxn) > [2023-03-05 16:02:02,037] INFO channel is connected: [id: 0x68fba106, > L:/1.2.3.4:58080 - R:zk.1/9.10.11.12:2182] > (org.apache.zookeeper.ClientCnxnSocketNetty) > [2023-03-05 16:02:03,054] INFO Session establishment complete on server > zk.1/9.10.11.12:2182, sessionid = 0x1006c6e0b830001, negotiated timeout = > 18000 (org.apache.zookeeper.ClientCnxn){code} > The Kafka ZK client is notified of the connection. > {code:java} > [2023-03-05 16:02:03,054] INFO [ZooKeeperClient Kafka server] Connected. > (kafka.zookeeper.ZooKeeperClient){code} > The broker sends the request to create the znode {{/brokers/ids/18}} which > already exists. The error path implemented for KAFKA-6584 is then followed. > However, in this case, the session owning the ephemeral node > {{0x30043230ac1}} ({{{}216172783240153793{}}}) is different from the last > active Zookeeper session which the broker has recorded. And it is also > different from the current session {{0x1006c6e0b830001}} > ({{{}72176813933264897{}}}), hence the recreation of the broker znode is not > attempted. > {code:java} > [2023-03-05 16:02:04,466] ERROR Error while creating ephemeral at > /brokers/ids/18, node already exists and owner '216172783240153793' does not > match current session '72176813933264897' > (kafka.zk.KafkaZkClient$CheckedEphemeral) > org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = > NodeExists > at >
[jira] [Updated] (KAFKA-14845) Broker ZNode creation can fail due to lost Zookeeper Session ID
[ https://issues.apache.org/jira/browse/KAFKA-14845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexandre Dupriez updated KAFKA-14845: -- Attachment: broker-registration.drawio (4).png > Broker ZNode creation can fail due to lost Zookeeper Session ID > --- > > Key: KAFKA-14845 > URL: https://issues.apache.org/jira/browse/KAFKA-14845 > Project: Kafka > Issue Type: Bug >Reporter: Alexandre Dupriez >Assignee: Alexandre Dupriez >Priority: Minor > Attachments: broker-registration.drawio (4).png, kafka-broker-reg.log > > > Our production environment faced a use case where registration of a broker > failed due to the presence of a "conflicting" broker znode in Zookeeper. This > case is not without familiarity to that fixed by KAFKA-6584 and induced by > the Zookeeper bug (or feature) tracked in ZOOKEEPER-2985 opened as of today. > A network partition disturbed communication channels between the Kafka and > Zookeeper clusters for about 20% of the brokers in the cluster. One of this > broker was not able to re-register with Zookeeper and was excluded from the > cluster until it was restarted. Broker logs show the failed registration due > to a "conflicting" znode write which in this case does not exactly match the > scenario covered by KAFKA-6584. > The sequence of logs on the broker is as follows. > First, a connection is established with the Zookeeper node 3. > {code:java} > [2023-03-05 16:01:55,342] INFO Socket connection established, initiating > session, client: /1.2.3.4:40200, server: zk.3/5.6.7.8:2182 > (org.apache.zookeeper.ClientCnxn) > [2023-03-05 16:01:55,342] INFO channel is connected: [id: 0x2b45ae40, > L:/1.1.3.4:40200 - R:zk.3/5.6.7.8:2182] > (org.apache.zookeeper.ClientCnxnSocketNetty){code} > An existing Zookeeper session was expired, and upon reconnection, the > Zookeeper state change handler was invoked. The creation of the ephemeral > znode /brokers/ids/18 started on the controller thread. > {code:java} > [2023-03-05 16:01:55,345] INFO Creating /brokers/ids/18 (is it secure? false) > (kafka.zk.KafkaZkClient){code} > The client "session" timed out after 6 seconds. Note the session is 0x0 and > the absence of "{_}Session establishment complete{_}" log: the broker appears > to have never received or processed the response from the Zookeeper node. > {code:java} > [2023-03-05 16:02:01,343] INFO Client session timed out, have not heard from > server in 6000ms for sessionid 0x0, closing socket connection and attempting > reconnect (org.apache.zookeeper.ClientCnxn) > [2023-03-05 16:02:01,343] INFO channel is disconnected: [id: 0x2b45ae40, > L:/1.2.3.4:40200 ! R:zk.3/5.6.7.8:2182] > (org.apache.zookeeper.ClientCnxnSocketNetty){code} > Pending requests were aborted with a {{CONNECTIONLOSS}} error and the client > started waiting on a new connection notification. > {code:java} > [2023-03-05 16:02:01,343] INFO [ZooKeeperClient Kafka server] Waiting until > connected. (kafka.zookeeper.ZooKeeperClient){code} > A new connection was created with the Zookeeper node 1. Note that a valid > (new) session ({{{}0x1006c6e0b830001{}}}) was reported by Kafka this time. > {code:java} > [2023-03-05 16:02:02,037] INFO Socket connection established, initiating > session, client: /1.2.3.4:58080, server: zk.1/9.10.11.12:2182 > (org.apache.zookeeper.ClientCnxn) > [2023-03-05 16:02:02,037] INFO channel is connected: [id: 0x68fba106, > L:/1.2.3.4:58080 - R:zk.1/9.10.11.12:2182] > (org.apache.zookeeper.ClientCnxnSocketNetty) > [2023-03-05 16:02:03,054] INFO Session establishment complete on server > zk.1/9.10.11.12:2182, sessionid = 0x1006c6e0b830001, negotiated timeout = > 18000 (org.apache.zookeeper.ClientCnxn){code} > The Kafka ZK client is notified of the connection. > {code:java} > [2023-03-05 16:02:03,054] INFO [ZooKeeperClient Kafka server] Connected. > (kafka.zookeeper.ZooKeeperClient){code} > The broker sends the request to create the znode {{/brokers/ids/18}} which > already exists. The error path implemented for KAFKA-6584 is then followed. > However, in this case, the session owning the ephemeral node > {{0x30043230ac1}} ({{{}216172783240153793{}}}) is different from the last > active Zookeeper session which the broker has recorded. And it is also > different from the current session {{0x1006c6e0b830001}} > ({{{}72176813933264897{}}}), hence the recreation of the broker znode is not > attempted. > {code:java} > [2023-03-05 16:02:04,466] ERROR Error while creating ephemeral at > /brokers/ids/18, node already exists and owner '216172783240153793' does not > match current session '72176813933264897' > (kafka.zk.KafkaZkClient$CheckedEphemeral) > org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = > NodeExists > at >
[jira] [Updated] (KAFKA-14845) Broker ZNode creation can fail due to lost Zookeeper Session ID
[ https://issues.apache.org/jira/browse/KAFKA-14845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexandre Dupriez updated KAFKA-14845: -- Description: Our production environment faced a use case where registration of a broker failed due to the presence of a "conflicting" broker znode in Zookeeper. This case is not without familiarity to that fixed by KAFKA-6584 and induced by the Zookeeper bug (or feature) tracked in ZOOKEEPER-2985 opened as of today. A network partition disturbed communication channels between the Kafka and Zookeeper clusters for about 20% of the brokers in the cluster. One of this broker was not able to re-register with Zookeeper and was excluded from the cluster until it was restarted. Broker logs show the failed registration due to a "conflicting" znode write which in this case does not exactly match the scenario covered by KAFKA-6584. The sequence of logs on the broker is as follows. First, a connection is established with the Zookeeper node 3. {code:java} [2023-03-05 16:01:55,342] INFO Socket connection established, initiating session, client: /1.2.3.4:40200, server: zk.3/5.6.7.8:2182 (org.apache.zookeeper.ClientCnxn) [2023-03-05 16:01:55,342] INFO channel is connected: [id: 0x2b45ae40, L:/1.1.3.4:40200 - R:zk.3/5.6.7.8:2182] (org.apache.zookeeper.ClientCnxnSocketNetty){code} An existing Zookeeper session was expired, and upon reconnection, the Zookeeper state change handler was invoked. The creation of the ephemeral znode /brokers/ids/18 started on the controller thread. {code:java} [2023-03-05 16:01:55,345] INFO Creating /brokers/ids/18 (is it secure? false) (kafka.zk.KafkaZkClient){code} The client "session" timed out after 6 seconds. Note the session is 0x0 and the absence of "{_}Session establishment complete{_}" log: the broker appears to have never received or processed the response from the Zookeeper node. {code:java} [2023-03-05 16:02:01,343] INFO Client session timed out, have not heard from server in 6000ms for sessionid 0x0, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn) [2023-03-05 16:02:01,343] INFO channel is disconnected: [id: 0x2b45ae40, L:/1.2.3.4:40200 ! R:zk.3/5.6.7.8:2182] (org.apache.zookeeper.ClientCnxnSocketNetty){code} Pending requests were aborted with a {{CONNECTIONLOSS}} error and the client started waiting on a new connection notification. {code:java} [2023-03-05 16:02:01,343] INFO [ZooKeeperClient Kafka server] Waiting until connected. (kafka.zookeeper.ZooKeeperClient){code} A new connection was created with the Zookeeper node 1. Note that a valid (new) session ({{{}0x1006c6e0b830001{}}}) was reported by Kafka this time. {code:java} [2023-03-05 16:02:02,037] INFO Socket connection established, initiating session, client: /1.2.3.4:58080, server: zk.1/9.10.11.12:2182 (org.apache.zookeeper.ClientCnxn) [2023-03-05 16:02:02,037] INFO channel is connected: [id: 0x68fba106, L:/1.2.3.4:58080 - R:zk.1/9.10.11.12:2182] (org.apache.zookeeper.ClientCnxnSocketNetty) [2023-03-05 16:02:03,054] INFO Session establishment complete on server zk.1/9.10.11.12:2182, sessionid = 0x1006c6e0b830001, negotiated timeout = 18000 (org.apache.zookeeper.ClientCnxn){code} The Kafka ZK client is notified of the connection. {code:java} [2023-03-05 16:02:03,054] INFO [ZooKeeperClient Kafka server] Connected. (kafka.zookeeper.ZooKeeperClient){code} The broker sends the request to create the znode {{/brokers/ids/18}} which already exists. The error path implemented for KAFKA-6584 is then followed. However, in this case, the session owning the ephemeral node {{0x30043230ac1}} ({{{}216172783240153793{}}}) is different from the last active Zookeeper session which the broker has recorded. And it is also different from the current session {{0x1006c6e0b830001}} ({{{}72176813933264897{}}}), hence the recreation of the broker znode is not attempted. {code:java} [2023-03-05 16:02:04,466] ERROR Error while creating ephemeral at /brokers/ids/18, node already exists and owner '216172783240153793' does not match current session '72176813933264897' (kafka.zk.KafkaZkClient$CheckedEphemeral) org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists at org.apache.zookeeper.KeeperException.create(KeeperException.java:126) at kafka.zk.KafkaZkClient$CheckedEphemeral.getAfterNodeExists(KafkaZkClient.scala:1821) at kafka.zk.KafkaZkClient$CheckedEphemeral.create(KafkaZkClient.scala:1759) at kafka.zk.KafkaZkClient.checkedEphemeralCreate(KafkaZkClient.scala:1726) at kafka.zk.KafkaZkClient.registerBroker(KafkaZkClient.scala:95) at kafka.controller.KafkaController.processRegisterBrokerAndReelect(KafkaController.scala:1810) at kafka.controller.KafkaController.process(KafkaController.scala:1853) at
[jira] [Updated] (KAFKA-14845) Broker ZNode creation can fail due to lost Zookeeper Session ID
[ https://issues.apache.org/jira/browse/KAFKA-14845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexandre Dupriez updated KAFKA-14845: -- Description: Our production environment faced a use case where registration of a broker failed due to the presence of a "conflicting" broker znode in Zookeeper. This case is not without familiarity to that fixed by KAFKA-6584 and induced by the Zookeeper bug (or feature) tracked in ZOOKEEPER-2985 opened as of today. A network partition disturbed communication channels between the Kafka and Zookeeper clusters for about 20% of the brokers in the cluster. One of this broker was not able to re-register with Zookeeper and was excluded from the cluster until it was restarted. Broker logs show the failed registration due to a "conflicting" znode write which in this case does not exactly match the scenario covered by KAFKA-6584. The sequence of logs on the broker is as follows. First, a connection is established with the Zookeeper node 3. {code:java} [2023-03-05 16:01:55,342] INFO Socket connection established, initiating session, client: /1.2.3.4:40200, server: zk.3/5.6.7.8:2182 (org.apache.zookeeper.ClientCnxn) [2023-03-05 16:01:55,342] INFO channel is connected: [id: 0x2b45ae40, L:/1.1.3.4:40200 - R:zk.3/5.6.7.8:2182] (org.apache.zookeeper.ClientCnxnSocketNetty){code} An existing Zookeeper session was expired, and upon reconnection, the Zookeeper state change handler was invoked. The creation of the ephemeral znode /brokers/ids/18 started on the controller thread. {code:java} [2023-03-05 16:01:55,345] INFO Creating /brokers/ids/18 (is it secure? false) (kafka.zk.KafkaZkClient){code} The client "session" timed out after 6 seconds. Note the session is 0x0 and the absence of "{_}Session establishment complete{_}" log: the broker appears to have never received or processed the response from the Zookeeper node. {code:java} [2023-03-05 16:02:01,343] INFO Client session timed out, have not heard from server in 6000ms for sessionid 0x0, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn) [2023-03-05 16:02:01,343] INFO channel is disconnected: [id: 0x2b45ae40, L:/1.2.3.4:40200 ! R:zk.3/5.6.7.8:2182] (org.apache.zookeeper.ClientCnxnSocketNetty){code} Pending requests were aborted with a {{CONNECTIONLOSS}} error and the client started waiting on a new connection notification. {code:java} [2023-03-05 16:02:01,343] INFO [ZooKeeperClient Kafka server] Waiting until connected. (kafka.zookeeper.ZooKeeperClient){code} A new connection was created with the Zookeeper node 1. Note that a valid (new) session ({{{}0x1006c6e0b830001{}}}) was reported by Kafka this time. {code:java} [2023-03-05 16:02:02,037] INFO Socket connection established, initiating session, client: /1.2.3.4:58080, server: zk.1/9.10.11.12:2182 (org.apache.zookeeper.ClientCnxn) [2023-03-05 16:02:02,037] INFO channel is connected: [id: 0x68fba106, L:/1.2.3.4:58080 - R:zk.1/9.10.11.12:2182] (org.apache.zookeeper.ClientCnxnSocketNetty) [2023-03-05 16:02:03,054] INFO Session establishment complete on server zk.1/9.10.11.12:2182, sessionid = 0x1006c6e0b830001, negotiated timeout = 18000 (org.apache.zookeeper.ClientCnxn){code} The Kafka ZK client is notified of the connection. {code:java} [2023-03-05 16:02:03,054] INFO [ZooKeeperClient Kafka server] Connected. (kafka.zookeeper.ZooKeeperClient){code} The broker sends the request to create the znode {{/brokers/ids/18}} which already exists. The error path implemented for KAFKA-6584 is then followed. However, in this case, the session owning the ephemeral node {{0x30043230ac1}} ({{{}216172783240153793{}}}) is different from the last active Zookeeper session which the broker has recorded. And it is also different from the current session {{0x1006c6e0b830001}} ({{{}72176813933264897{}}}), hence the recreation of the broker znode is not attempted. {code:java} [2023-03-05 16:02:04,466] ERROR Error while creating ephemeral at /brokers/ids/18, node already exists and owner '216172783240153793' does not match current session '72176813933264897' (kafka.zk.KafkaZkClient$CheckedEphemeral) org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists at org.apache.zookeeper.KeeperException.create(KeeperException.java:126) at kafka.zk.KafkaZkClient$CheckedEphemeral.getAfterNodeExists(KafkaZkClient.scala:1821) at kafka.zk.KafkaZkClient$CheckedEphemeral.create(KafkaZkClient.scala:1759) at kafka.zk.KafkaZkClient.checkedEphemeralCreate(KafkaZkClient.scala:1726) at kafka.zk.KafkaZkClient.registerBroker(KafkaZkClient.scala:95) at kafka.controller.KafkaController.processRegisterBrokerAndReelect(KafkaController.scala:1810) at kafka.controller.KafkaController.process(KafkaController.scala:1853) at
[jira] [Updated] (KAFKA-14845) Broker ZNode creation can fail due to lost Zookeeper Session ID
[ https://issues.apache.org/jira/browse/KAFKA-14845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexandre Dupriez updated KAFKA-14845: -- Description: Our production environment faced a use case where registration of a broker failed due to the presence of a "conflicting" broker znode in Zookeeper. This case is not without familiarity to that fixed by KAFKA-6584 and induced by the Zookeeper bug (or feature) tracked in ZOOKEEPER-2985 opened as of today. A network partition disturbed communication channels between the Kafka and Zookeeper clusters for about 20% of the brokers in the cluster. One of this broker was not able to re-register with Zookeeper and was excluded from the cluster until it was restarted. Broker logs show the failed registration due to a "conflicting" znode write which in this case does not exactly match the scenario covered by KAFKA-6584. The sequence of logs on the broker is as follows. First, a connection is established with the Zookeeper node 3. {code:java} [2023-03-05 16:01:55,342] INFO Socket connection established, initiating session, client: /1.2.3.4:40200, server: zk.3/5.6.7.8:2182 (org.apache.zookeeper.ClientCnxn) [2023-03-05 16:01:55,342] INFO channel is connected: [id: 0x2b45ae40, L:/1.1.3.4:40200 - R:zk.3/5.6.7.8:2182] (org.apache.zookeeper.ClientCnxnSocketNetty){code} An existing Zookeeper session was expired, and upon reconnection, the Zookeeper state change handler was invoked. The creation of the ephemeral znode /brokers/ids/18 started on the controller thread. {code:java} [2023-03-05 16:01:55,345] INFO Creating /brokers/ids/18 (is it secure? false) (kafka.zk.KafkaZkClient){code} The client "session" timed out after 6 seconds. Note the session is 0x0 and the absence of "{_}Session establishment complete{_}" log: the broker appears to have never received or processed the response from the Zookeeper node. {code:java} [2023-03-05 16:02:01,343] INFO Client session timed out, have not heard from server in 6000ms for sessionid 0x0, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn) [2023-03-05 16:02:01,343] INFO channel is disconnected: [id: 0x2b45ae40, L:/1.2.3.4:40200 ! R:zk.3/5.6.7.8:2182] (org.apache.zookeeper.ClientCnxnSocketNetty){code} Pending requests were aborted with a {{CONNECTIONLOSS}} error and the client started waiting on a new connection notification. {code:java} [2023-03-05 16:02:01,343] INFO [ZooKeeperClient Kafka server] Waiting until connected. (kafka.zookeeper.ZooKeeperClient){code} A new connection was created with the Zookeeper node 1. Note that a valid (new) session ({{{}0x1006c6e0b830001{}}}) was reported by Kafka this time. {code:java} [2023-03-05 16:02:02,037] INFO Socket connection established, initiating session, client: /1.2.3.4:58080, server: zk.1/9.10.11.12:2182 (org.apache.zookeeper.ClientCnxn) [2023-03-05 16:02:02,037] INFO channel is connected: [id: 0x68fba106, L:/1.2.3.4:58080 - R:zk.1/9.10.11.12:2182] (org.apache.zookeeper.ClientCnxnSocketNetty) [2023-03-05 16:02:03,054] INFO Session establishment complete on server zk.1/9.10.11.12:2182, sessionid = 0x1006c6e0b830001, negotiated timeout = 18000 (org.apache.zookeeper.ClientCnxn){code} The Kafka ZK client is notified of the connection. {code:java} [2023-03-05 16:02:03,054] INFO [ZooKeeperClient Kafka server] Connected. (kafka.zookeeper.ZooKeeperClient){code} The broker sends the request to create the znode {{/brokers/ids/18}} which already exists. The error path implemented for KAFKA-6584 is then followed. However, in this case, the session owning the ephemeral node {{0x30043230ac1}} ({{{}216172783240153793{}}}) is different from the last active Zookeeper session which the broker has recorded. And it is also different from the current session {{0x1006c6e0b830001}} ({{{}72176813933264897{}}}), hence the recreation of the broker znode is not attempted. {code:java} [2023-03-05 16:02:04,466] ERROR Error while creating ephemeral at /brokers/ids/18, node already exists and owner '216172783240153793' does not match current session '72176813933264897' (kafka.zk.KafkaZkClient$CheckedEphemeral) org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists at org.apache.zookeeper.KeeperException.create(KeeperException.java:126) at kafka.zk.KafkaZkClient$CheckedEphemeral.getAfterNodeExists(KafkaZkClient.scala:1821) at kafka.zk.KafkaZkClient$CheckedEphemeral.create(KafkaZkClient.scala:1759) at kafka.zk.KafkaZkClient.checkedEphemeralCreate(KafkaZkClient.scala:1726) at kafka.zk.KafkaZkClient.registerBroker(KafkaZkClient.scala:95) at kafka.controller.KafkaController.processRegisterBrokerAndReelect(KafkaController.scala:1810) at kafka.controller.KafkaController.process(KafkaController.scala:1853) at
[jira] [Updated] (KAFKA-14845) Broker ZNode creation can fail due to lost Zookeeper Session ID
[ https://issues.apache.org/jira/browse/KAFKA-14845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexandre Dupriez updated KAFKA-14845: -- Description: Our production environment faced a use case where registration of a broker failed due to the presence of a "conflicting" broker znode in Zookeeper. This case is not without familiarity to that fixed by KAFKA-6584 and induced by the Zookeeper bug (or feature) tracked in ZOOKEEPER-2985 opened as of today. A network partition disturbed communication channels between the Kafka and Zookeeper clusters for about 20% of the brokers in the cluster. One of this broker was not able to re-register with Zookeeper and was excluded from the cluster until it was restarted. Broker logs show the failed registration due to a "conflicting" znode write which in this case does not exactly match the scenario covered by KAFKA-6584. The sequence of logs on the broker is as follows. First, a connection is established with the Zookeeper node 3. {code:java} [2023-03-05 16:01:55,342] INFO Socket connection established, initiating session, client: /1.2.3.4:40200, server: zk.3/5.6.7.8:2182 (org.apache.zookeeper.ClientCnxn) [2023-03-05 16:01:55,342] INFO channel is connected: [id: 0x2b45ae40, L:/1.1.3.4:40200 - R:zk.3/5.6.7.8:2182] (org.apache.zookeeper.ClientCnxnSocketNetty){code} An existing Zookeeper session was expired, and upon reconnection, the Zookeeper state change handler was invoked. The creation of the ephemeral znode /brokers/ids/18 started on the controller thread. {code:java} [2023-03-05 16:01:55,345] INFO Creating /brokers/ids/18 (is it secure? false) (kafka.zk.KafkaZkClient){code} The client "session" timed out after 6 seconds. Note the session is 0x0 and the absence of "{_}Session establishment complete{_}" log: the broker appears to have never received or processed the response from the Zookeeper node. {code:java} [2023-03-05 16:02:01,343] INFO Client session timed out, have not heard from server in 6000ms for sessionid 0x0, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn) [2023-03-05 16:02:01,343] INFO channel is disconnected: [id: 0x2b45ae40, L:/1.2.3.4:40200 ! R:zk.3/5.6.7.8:2182] (org.apache.zookeeper.ClientCnxnSocketNetty){code} Pending requests were aborted with a {{CONNECTIONLOSS}} error and the client started waiting on a new connection notification. {code:java} [2023-03-05 16:02:01,343] INFO [ZooKeeperClient Kafka server] Waiting until connected. (kafka.zookeeper.ZooKeeperClient){code} A new connection was created with the Zookeeper node 1. Note that a valid (new) session ({{{}0x1006c6e0b830001{}}}) was reported by Kafka this time. {code:java} [2023-03-05 16:02:02,037] INFO Socket connection established, initiating session, client: /1.2.3.4:58080, server: zk.1/9.10.11.12:2182 (org.apache.zookeeper.ClientCnxn) [2023-03-05 16:02:02,037] INFO channel is connected: [id: 0x68fba106, L:/1.2.3.4:58080 - R:zk.1/9.10.11.12:2182] (org.apache.zookeeper.ClientCnxnSocketNetty) [2023-03-05 16:02:03,054] INFO Session establishment complete on server zk.1/9.10.11.12:2182, sessionid = 0x1006c6e0b830001, negotiated timeout = 18000 (org.apache.zookeeper.ClientCnxn){code} The Kafka ZK client is notified of the connection. {code:java} [2023-03-05 16:02:03,054] INFO [ZooKeeperClient Kafka server] Connected. (kafka.zookeeper.ZooKeeperClient){code} The broker sends the request to create the znode {{/brokers/ids/18}} which already exists. The error path implemented for KAFKA-6584 is then followed. However, in this case, the session owning the ephemeral node {{0x30043230ac1}} ({{{}216172783240153793{}}}) is different from the last active Zookeeper session which the broker has recorded. And it is also different from the current session {{0x1006c6e0b830001}} ({{{}72176813933264897{}}}), hence the recreation of the broker znode is not attempted. {code:java} [2023-03-05 16:02:04,466] ERROR Error while creating ephemeral at /brokers/ids/18, node already exists and owner '216172783240153793' does not match current session '72176813933264897' (kafka.zk.KafkaZkClient$CheckedEphemeral) org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists at org.apache.zookeeper.KeeperException.create(KeeperException.java:126) at kafka.zk.KafkaZkClient$CheckedEphemeral.getAfterNodeExists(KafkaZkClient.scala:1821) at kafka.zk.KafkaZkClient$CheckedEphemeral.create(KafkaZkClient.scala:1759) at kafka.zk.KafkaZkClient.checkedEphemeralCreate(KafkaZkClient.scala:1726) at kafka.zk.KafkaZkClient.registerBroker(KafkaZkClient.scala:95) at kafka.controller.KafkaController.processRegisterBrokerAndReelect(KafkaController.scala:1810) at kafka.controller.KafkaController.process(KafkaController.scala:1853) at
[jira] [Updated] (KAFKA-14845) Broker ZNode creation can fail due to lost Zookeeper Session ID
[ https://issues.apache.org/jira/browse/KAFKA-14845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexandre Dupriez updated KAFKA-14845: -- Summary: Broker ZNode creation can fail due to lost Zookeeper Session ID (was: Broker ZNode creation can fail due to a session ID unknown to the broker) > Broker ZNode creation can fail due to lost Zookeeper Session ID > --- > > Key: KAFKA-14845 > URL: https://issues.apache.org/jira/browse/KAFKA-14845 > Project: Kafka > Issue Type: Bug >Reporter: Alexandre Dupriez >Assignee: Alexandre Dupriez >Priority: Minor > Attachments: kafka-broker-reg.log > > > Our production environment faced a use case where registration of a broker > failed due to the presence of a "conflicting" broker znode in Zookeeper. This > case is not without familiarity to that fixed by KAFKA-6584 and induced by > the Zookeeper bug (or feature) tracked in ZOOKEEPER-2985 opened as of today. > A network partition disturbed communication channels between the Kafka and > Zookeeper clusters for about 20% of the brokers in the cluster. One of this > broker was not able to re-register with Zookeeper and was excluded from the > cluster until it was restarted. Broker logs show the failed registration due > to a "conflicting" znode write which in this case does not exactly match the > scenario covered by KAFKA-6584. > The sequence of logs on the broker is as follows. > First, a connection is established with the Zookeeper node 3. > {code:java} > [2023-03-05 16:01:55,342] INFO Socket connection established, initiating > session, client: /1.2.3.4:40200, server: zk.3/5.6.7.8:2182 > (org.apache.zookeeper.ClientCnxn) > [2023-03-05 16:01:55,342] INFO channel is connected: [id: 0x2b45ae40, > L:/1.1.3.4:40200 - R:zk.3/5.6.7.8:2182] > (org.apache.zookeeper.ClientCnxnSocketNetty){code} > An existing Zookeeper session was expired, and upon reconnection, the > Zookeeper state change handler was invoked. The creation of the ephemeral > znode /brokers/ids/18 started on the controller thread. > {code:java} > [2023-03-05 16:01:55,345] INFO Creating /brokers/ids/18 (is it secure? false) > (kafka.zk.KafkaZkClient){code} > The client "session" timed out after 6 seconds. Note the session is 0x0 and > the absence of "{_}Session establishment complete{_}" log: the broker appears > to have never received or processed the response from the Zookeeper node. > {code:java} > [2023-03-05 16:02:01,343] INFO Client session timed out, have not heard from > server in 6000ms for sessionid 0x0, closing socket connection and attempting > reconnect (org.apache.zookeeper.ClientCnxn) > [2023-03-05 16:02:01,343] INFO channel is disconnected: [id: 0x2b45ae40, > L:/1.2.3.4:40200 ! R:zk.3/5.6.7.8:2182] > (org.apache.zookeeper.ClientCnxnSocketNetty){code} > Pending requests were aborted with a {{CONNECTIONLOSS}} error and the client > started waiting on a new connection notification. > {code:java} > [2023-03-05 16:02:01,343] INFO [ZooKeeperClient Kafka server] Waiting until > connected. (kafka.zookeeper.ZooKeeperClient){code} > A new connection was created with the Zookeeper node 1. Note that a valid > (new) session ({{{}0x1006c6e0b830001{}}}) was reported by Kafka this time. > {code:java} > [2023-03-05 16:02:02,037] INFO Socket connection established, initiating > session, client: /1.2.3.4:58080, server: zk.1/9.10.11.12:2182 > (org.apache.zookeeper.ClientCnxn) > [2023-03-05 16:02:02,037] INFO channel is connected: [id: 0x68fba106, > L:/1.2.3.4:58080 - R:zk.1/9.10.11.12:2182] > (org.apache.zookeeper.ClientCnxnSocketNetty) > [2023-03-05 16:02:03,054] INFO Session establishment complete on server > zk.1/9.10.11.12:2182, sessionid = 0x1006c6e0b830001, negotiated timeout = > 18000 (org.apache.zookeeper.ClientCnxn){code} > The Kafka ZK client is notified of the connection. > {code:java} > [2023-03-05 16:02:03,054] INFO [ZooKeeperClient Kafka server] Connected. > (kafka.zookeeper.ZooKeeperClient){code} > The broker sends the request to create the znode {{/brokers/ids/18}} which > already exists. The error path implemented for KAFKA-6584 is then followed. > However, in this case, the session owning the ephemeral node > {{0x30043230ac1}} ({{{}216172783240153793{}}}) is different from the last > active Zookeeper session which the broker has recorded. And it is also > different from the current session {{0x1006c6e0b830001}} > ({{{}72176813933264897{}}}), hence the recreation of the broker znode is not > attempted. > {code:java} > [2023-03-05 16:02:04,466] ERROR Error while creating ephemeral at > /brokers/ids/18, node already exists and owner '216172783240153793' does not > match current session '72176813933264897' > (kafka.zk.KafkaZkClient$CheckedEphemeral) >