[
https://issues.apache.org/jira/browse/NIFI-15833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18073486#comment-18073486
]
David Handermann commented on NIFI-15833:
-----------------------------------------
[~danw4] Thanks for providing the generated summary and analysis.
For clarification, have you been able to reproduce this issue with an external
ZooKeeper cluster, or does it appear specific to the embedded ZooKeeper
configuration?
> When all nodes of a NiFi cluster are started simultaneously (e.g., via
> automated deployment tooling), the node that wins the ZooKeeper Cluster
> Coordinator election can enter a permanent deadlock.
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: NIFI-15833
> URL: https://issues.apache.org/jira/browse/NIFI-15833
> Project: Apache NiFi
> Issue Type: Bug
> Components: Core Framework
> Affects Versions: 2.4.0, 2.5.0, 2.6.0, 2.7.0, 2.8.0
> Environment: OS: Red Hat Enterprise Linux 8 (x86_64)
> Java: OpenJDK 21.0.10+7 (openjdk-21.0.10.0.7-1.el8.x86_64)
> NiFi: Apache NiFi 2.8.0
> ZooKeeper: Embedded (bundled with NiFi)
> Cluster: 5 nodes, 8 vCPUs / 15Gi RAM each, JVM heap -Xms7G -Xmx7G
> Deployment: Ansible-automated simultaneous start across all nodes
> Reporter: Dan W
> Priority: Major
>
> h1. Cluster Coordinator Self-Connection Deadlock During Simultaneous Node
> Startup
> h2. Summary
> When all nodes of a NiFi cluster are started simultaneously (e.g., via
> automated deployment tooling), the node that wins the ZooKeeper Cluster
> Coordinator election can enter a permanent deadlock. The elected coordinator
> attempts to join the cluster by sending a connection request to the
> coordinator — which is itself — and
> {{AbstractNodeProtocolSender.validateNotConnectingToSelf()}} throws an
> {{UnknownServiceAddressException}}, blocking the join. The node retries every
> ~5 seconds indefinitely. It never completes initialization, never accepts
> connection requests from other nodes, and the entire cluster fails to form.
> *This is a permanent deadlock with no self-recovery mechanism.* Manual
> intervention (restart) is the only way out.
> *Version:* Apache NiFi 2.8.0, Java 21 (openjdk-21.0.10.0.7), embedded
> ZooKeeper
> *Cluster Size:* 5 nodes (reproduced in a production environment)
> ----
> h2. Environment
> * 5-node NiFi cluster, all nodes identical hardware/configuration
> * Embedded ZooKeeper (all 5 nodes participate)
> * OIDC authentication via external identity provider
> * Deployment is automated: all 5 nodes are stopped, flow definition is
> deployed, and all 5 nodes are started within the same second
> * JVM: {{-Xms7G -Xmx7G}} per node
> * Relevant {{nifi.properties}}:
> {{nifi.cluster.is.node=true
> nifi.cluster.node.protocol.port=11443
> nifi.cluster.node.read.timeout=5 sec
> nifi.cluster.node.connection.timeout=5 sec}}
> ----
> h2. Reproduction Steps
> # Configure a 5-node NiFi 2.8.0 cluster with embedded ZooKeeper
> # Stop all 5 nodes simultaneously
> # Start all 5 nodes simultaneously (within the same second)
> # Observe cluster state via API: {{GET /nifi-api/controller/cluster}}
> *Expected:* All 5 nodes connect and form a cluster within a few minutes.
> *Actual:* The cluster never forms. The API returns HTTP 409 ("The Flow
> Controller is initializing the Data Flow") indefinitely. Cluster summary
> reports 0 connected nodes.
> This is reliably reproducible when all nodes start within a tight window. It
> does NOT occur when nodes are started with a stagger (e.g., 30+ seconds
> apart), because an already-initialized node wins the coordinator election and
> can accept connection requests normally.
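> For reference, the cluster-state check in step 4 can be scripted. A minimal
> Java probe follows (host and port are placeholders, and no authentication is
> shown; an OIDC-secured cluster would additionally need a bearer token):
> {code:java}
> import java.net.URI;
> import java.net.http.HttpClient;
> import java.net.http.HttpRequest;
> import java.net.http.HttpResponse;
>
> // Minimal probe for step 4 of the reproduction. Host/port are placeholders.
> public class ClusterProbe {
>     public static void main(String[] args) throws Exception {
>         HttpClient client = HttpClient.newHttpClient();
>         HttpRequest request = HttpRequest.newBuilder(
>                 URI.create("https://node-1.example.com:8443/nifi-api/controller/cluster"))
>             .GET()
>             .build();
>         HttpResponse<String> response =
>             client.send(request, HttpResponse.BodyHandlers.ofString());
>         // During the deadlock this prints 409 and "The Flow Controller is
>         // initializing the Data Flow" indefinitely.
>         System.out.println(response.statusCode() + ": " + response.body());
>     }
> }
> {code}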
> ----
> h2. Root Cause Analysis
> h3. The Deadlock
> The bug is in the cluster join sequence during startup. Here is the exact
> chain of events:
> # All 5 nodes start simultaneously and begin their initialization sequence
> # Embedded ZooKeeper forms a quorum and holds a leader election for the
> "Cluster Coordinator" role
> # *node-1* wins the election and becomes Cluster Coordinator
> # node-1's {{[main]}} thread enters {{StandardFlowService.load()}} →
> {{StandardFlowService.connect()}}, which calls
> {{NodeProtocolSenderListener.requestConnection()}} to join the cluster
> # {{AbstractNodeProtocolSender.requestConnection()}} calls
> {{validateNotConnectingToSelf()}} at line 119
> # This method detects that the coordinator address resolved from ZooKeeper
> ({{node-1:11443}}) matches the local node
> # It throws {{UnknownServiceAddressException}}:
> {{Cluster Coordinator is currently node-1.example.com:11443, which is this
> node,
> but connecting to self is not allowed at this phase of the lifecycle.
> This node must wait for a new Cluster Coordinator to be elected before
> connecting to the cluster.}}
> # {{StandardFlowService.connect()}} catches the exception, waits ~5 seconds,
> and retries
> # *Goto step 4.* The coordinator election does not change because node-1 is
> still running and holding the ZooKeeper ephemeral node.
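> For illustration, the self-check in steps 5-7 reduces to something like the
> following sketch (paraphrased from the reported behavior, not the actual NiFi
> source; the address-comparison details are assumptions):
> {code:java}
> // Paraphrased sketch of the self-check described in steps 5-7; the real
> // implementation is at AbstractNodeProtocolSender.java:119 in NiFi 2.8.0.
> private void validateNotConnectingToSelf(final InetSocketAddress coordinatorAddress,
>                                          final InetSocketAddress localAddress) {
>     if (coordinatorAddress.equals(localAddress)) {
>         // Thrown unconditionally, even when this node won the election and
>         // no other coordinator can be elected while it stays up.
>         throw new UnknownServiceAddressException("Cluster Coordinator is currently "
>             + coordinatorAddress + ", which is this node, but connecting to self "
>             + "is not allowed at this phase of the lifecycle.");
>     }
> }
> {code}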
> h3. Why It Never Recovers
> The deadlock is permanent because of a circular dependency:
> * *node-1 cannot join the cluster* because {{validateNotConnectingToSelf()}}
> prevents it from sending a connection request to itself
> * *node-1 will not relinquish the coordinator role* because it is still
> running and its ZooKeeper session is active
> * *A new coordinator will not be elected* because ZooKeeper has no reason to
> re-elect — the current leader's session is alive
> * *The other 4 nodes cannot form the cluster* because the coordinator
> (node-1) never completed initialization and never processes their connection
> requests
> The error message says "This node must wait for a new Cluster Coordinator to
> be elected" — but that will never happen without external intervention.
> h3. Observed Cluster State After 2+ Hours
> From the application logs, here is the actual cluster state observed 2+ hours
> after startup:
> *node-1 (elected coordinator):*
> * Stuck on {{[main]}} thread in {{StandardFlowService.connect()}} retry loop
> * Logged the {{validateNotConnectingToSelf}} exception *180 times* (every ~5
> seconds for 2+ hours)
> * Receiving heartbeats from node-3 and node-5 (they think node-1 is
> coordinator), but cannot process them meaningfully because initialization
> never completed
> * API returns HTTP 409 on all endpoints
> *node-2:*
> * Also elected itself as coordinator (split-brain — see below)
> * Receiving heartbeats from node-4
> * Sending heartbeats to itself
> * No WARN/ERROR in logs, but cluster still shows 0 connected nodes
> * API returns HTTP 409
> *node-3:* Sending heartbeats to node-1 (stuck coordinator). Never joined the
> cluster.
> *node-4:* Sending heartbeats to node-2. Never joined the cluster.
> *node-5:* Sending heartbeats to node-1 (stuck coordinator). Never joined the
> cluster.
> h3. Secondary Issue: Split-Brain Coordinator Election
> During the deadlock, we observed what appears to be a split-brain in the
> ZooKeeper coordinator election:
> * node-1 believes it is the coordinator (from initial election)
> * node-2 also believes it is the coordinator (possibly from a secondary
> election after ZooKeeper session instability)
> * node-3 and node-5 send heartbeats to node-1
> * node-4 sends heartbeats to node-2
> This may be a consequence of the primary deadlock causing ZooKeeper session
> timeouts on some nodes while node-1's session remains active, leading to
> inconsistent election results across the quorum.
> ----
> h2. Stack Trace (from node-1, repeated 180 times over 2+ hours)
>
> {{WARN [main] o.a.nifi.controller.StandardFlowService Failed to connect to
> cluster
> org.apache.nifi.cluster.protocol.UnknownServiceAddressException: Cluster
> Coordinator is currently node-1.example.com:11443, which is this node, but
> connecting to self is not allowed at this phase of the lifecycle. This node
> must wait for a new Cluster Coordinator to be elected before connecting to
> the cluster.
> at
> o.a.n.cluster.protocol.AbstractNodeProtocolSender.validateNotConnectingToSelf(AbstractNodeProtocolSender.java:119)
> at
> o.a.n.cluster.protocol.AbstractNodeProtocolSender.requestConnection(AbstractNodeProtocolSender.java:74)
> at
> o.a.n.cluster.protocol.impl.NodeProtocolSenderListener.requestConnection(NodeProtocolSenderListener.java:91)
> at
> o.a.n.controller.StandardFlowService.connect(StandardFlowService.java:825)
> at
> o.a.n.controller.StandardFlowService.load(StandardFlowService.java:449)
> at o.a.n.web.server.JettyServer.start(JettyServer.java:842)
> at o.a.n.runtime.Application.startServer(Application.java:131)
> at o.a.n.runtime.Application.run(Application.java:78)
> at o.a.n.runtime.Application.run(Application.java:60)
> at org.apache.nifi.NiFi.main(NiFi.java:42)}}
> *Meanwhile, node-1 is also processing heartbeats from other nodes:*
>
> {{INFO [Process Cluster Protocol Request-32]
> o.a.n.c.p.impl.SocketProtocolListener
> Finished processing request (type=HEARTBEAT, length=6060 bytes) from
> node-5:8443 in 44 millis
> INFO [Process Cluster Protocol Request-33]
> o.a.n.c.p.impl.SocketProtocolListener
> Finished processing request (type=HEARTBEAT, length=6060 bytes) from
> node-3:8443 in 45 millis
> WARN [main] o.a.nifi.controller.StandardFlowService Failed to connect to
> cluster
> ...connecting to self is not allowed at this phase of the lifecycle...
> INFO [Process Cluster Protocol Request-34]
> o.a.n.c.p.impl.SocketProtocolListener
> Finished processing request (type=HEARTBEAT, length=6052 bytes) from
> node-5:8443 in 45 millis}}
> The node is receiving heartbeats and responding to protocol requests on
> background threads, but the {{[main]}} thread is stuck in the retry loop and
> never completes {{JettyServer.start()}}.
> ----
> h2. Affected Code
> The bug originates in
> {{AbstractNodeProtocolSender.validateNotConnectingToSelf()}}:
> # *{{AbstractNodeProtocolSender.java:119}}* —
> {{validateNotConnectingToSelf()}} throws {{UnknownServiceAddressException}}
> when the coordinator address matches the local node. This check exists to
> prevent circular connection requests, but it does not account for the case
> where the coordinator IS the node that needs to bootstrap the cluster.
> # *{{AbstractNodeProtocolSender.java:74}}* — {{requestConnection()}} calls
> {{validateNotConnectingToSelf()}} unconditionally before sending the
> connection request.
> # *{{StandardFlowService.java:825}}* — {{connect()}} catches the exception
> and retries, but has no fallback logic for the self-coordinator case. No
> retry limit. No coordinator resignation mechanism.
> # *{{StandardFlowService.java:449}}* — {{load()}} calls {{connect()}} during
> initialization, blocking the {{[main]}} thread (and therefore
> {{JettyServer.start()}}) until the connection succeeds — which it never
> does.
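> Taken together, the failing path behaves like the following sketch (a
> paraphrase of the observed behavior, not the actual {{StandardFlowService}}
> source; helper names are illustrative):
> {code:java}
> // Paraphrase of the observed retry behavior in StandardFlowService.connect();
> // helper names are illustrative. There is no attempt limit and no
> // self-coordinator fallback, so the [main] thread never leaves this loop.
> private ConnectionResponse connect() {
>     while (true) {
>         try {
>             return senderListener.requestConnection(createConnectionRequest());
>         } catch (final UnknownServiceAddressException e) {
>             logger.warn("Failed to connect to cluster", e);
>             sleepQuietly(5_000L); // observed ~5 second retry interval
>         }
>     }
> }
> {code}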
> ----
> h2. Suggested Fix
> The coordinator node should be able to bootstrap itself without sending a
> network connection request. When {{validateNotConnectingToSelf()}} detects
> that the current node IS the coordinator, instead of throwing an exception,
> the node should handle its own connection locally:
> h3. Option A: Local Self-Connection (Preferred)
> In {{AbstractNodeProtocolSender.requestConnection()}}, when the
> coordinator resolves to self, bypass the network send and instead invoke the
> coordinator's connection handling logic directly (the same code path that
> processes incoming {{CONNECTION_REQUEST}} messages from other nodes). The
> coordinator can admit itself as the first cluster member, complete
> initialization, and then accept connection requests from the remaining nodes
> normally.
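> A rough sketch of the short-circuit (all names other than
> {{requestConnection()}} are hypothetical; the wiring to the coordinator's
> actual {{CONNECTION_REQUEST}} handler would need to be worked out):
> {code:java}
> // Illustrative sketch of Option A; field and helper names are hypothetical.
> public ConnectionResponse requestConnection(final ConnectionRequestMessage msg) {
>     final InetSocketAddress coordinatorAddress = getServiceAddress();
>     if (isLocalNode(coordinatorAddress)) {
>         // Bootstrap case: this node IS the coordinator. Skip the socket send
>         // and invoke the same handler that services CONNECTION_REQUEST
>         // messages from other nodes, admitting this node as the first member.
>         return connectionRequestHandler.handle(msg);
>     }
>     validateNotConnectingToSelf(coordinatorAddress);
>     return sendOverSocket(coordinatorAddress, msg);
> }
> {code}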
> h3. Option B: Coordinator Resignation with Backoff
> If the coordinator detects it cannot connect to itself during the
> initialization phase, it should:
> # Relinquish the ZooKeeper coordinator ephemeral node
> # Wait a randomized backoff period
> # Allow a different (possibly already-initialized) node to win the election
> # Retry connection to the new coordinator
> This is less ideal because it adds latency and still depends on another node
> being ready, but it breaks the deadlock.
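> Sketched under the assumption that the leader-election wrapper exposes
> register/unregister operations (the method names here are hypothetical):
> {code:java}
> // Illustrative sketch of Option B; the election API calls are hypothetical
> // stand-ins for NiFi's leader-election wrapper.
> private void resignAndBackOff() throws InterruptedException {
>     leaderElectionManager.unregister("Cluster Coordinator"); // drop the ephemeral node
>     final long backoffMillis =
>         java.util.concurrent.ThreadLocalRandom.current().nextLong(1_000L, 10_000L);
>     Thread.sleep(backoffMillis); // randomized backoff so nodes do not re-collide
>     leaderElectionManager.register("Cluster Coordinator");   // rejoin the election
>     // The caller then retries requestConnection() against whichever node
>     // now holds the coordinator role.
> }
> {code}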
> h3. Option C: Retry Limit with Forced Re-Election
> Add a maximum retry count to the {{StandardFlowService.connect()}} loop.
> After N failed attempts (e.g., 12 attempts = ~60 seconds), the node should
> forcibly resign the coordinator role by closing its ZooKeeper leader election
> participation, wait for a new election, and retry.
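> A sketch of the bounded loop (constants and election calls are illustrative,
> not actual NiFi APIs):
> {code:java}
> // Illustrative sketch of Option C; after maxAttempts failures the node
> // forces a re-election instead of retrying forever.
> private ConnectionResponse connectWithRetryLimit() throws InterruptedException {
>     final int maxAttempts = 12; // ~60 seconds at the observed 5-second interval
>     for (int attempt = 1; attempt <= maxAttempts; attempt++) {
>         try {
>             return senderListener.requestConnection(createConnectionRequest());
>         } catch (final UnknownServiceAddressException e) {
>             Thread.sleep(5_000L);
>         }
>     }
>     // Attempts exhausted: resign, rejoin the election, then start over.
>     leaderElectionManager.unregister("Cluster Coordinator");
>     leaderElectionManager.register("Cluster Coordinator");
>     return connectWithRetryLimit();
> }
> {code}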
> ----
> h2. Current Workaround
> The only workaround is to restart the deadlocked coordinator node (or restart
> the entire cluster). On restart, a different node may win the election and
> bootstrap successfully — but this is not guaranteed and the same deadlock can
> recur.
> *Reliable workaround:* Stagger node startups by 30+ seconds so that the first
> node fully initializes before the next starts. This ensures the coordinator
> election is won by a node that has already completed initialization, but it
> is impractical for automated deployment pipelines that manage clusters
> declaratively.
> ----
> h2. Impact
> * *Severity:* Critical for automated/orchestrated deployments
> * *Data Loss:* None (the cluster never forms, so no data is processed or
> lost)
> * *Recovery:* Requires manual intervention (restart)
> * *Scope:* Any NiFi cluster where nodes are started simultaneously (common
> in CI/CD pipelines, container orchestrators, Ansible/Terraform deployments,
> and auto-scaling groups)