Dan W created NIFI-15833:
----------------------------
Summary: When all nodes of a NiFi cluster are started
simultaneously (e.g., via automated deployment tooling), the node that wins the
ZooKeeper Cluster Coordinator election can enter a permanent deadlock.
Key: NIFI-15833
URL: https://issues.apache.org/jira/browse/NIFI-15833
Project: Apache NiFi
Issue Type: Bug
Components: Core Framework
Affects Versions: 2.8.0, 2.7.0, 2.6.0, 2.5.0, 2.4.0
Environment: OS: Red Hat Enterprise Linux 8 (x86_64)
Java: OpenJDK 21.0.10+7 (openjdk-21.0.10.0.7-1.el8.x86_64)
NiFi: Apache NiFi 2.8.0
ZooKeeper: Embedded (bundled with NiFi)
Cluster: 5 nodes, 8 vCPUs / 15Gi RAM each, JVM heap -Xms7G -Xmx7G
Deployment: Ansible-automated simultaneous start across all nodes
Reporter: Dan W
h1. Cluster Coordinator Self-Connection Deadlock During Simultaneous Node
Startup
h2. Summary
When all nodes of a NiFi cluster are started simultaneously (e.g., via
automated deployment tooling), the node that wins the ZooKeeper Cluster
Coordinator election can enter a permanent deadlock. The elected coordinator
attempts to join the cluster by sending a connection request to the coordinator,
which is itself, so {{AbstractNodeProtocolSender.validateNotConnectingToSelf()}}
throws an {{{}UnknownServiceAddressException{}}} and the request is rejected.
The node then retries every ~5 seconds indefinitely: it never completes
initialization, never accepts connection requests from other nodes, and the
entire cluster fails to form.
*This is a permanent deadlock with no self-recovery mechanism.* Manual
intervention (restart) is the only way out.
*Version:* Apache NiFi 2.8.0, Java 21 (openjdk-21.0.10.0.7), embedded ZooKeeper
*Cluster Size:* 5 nodes (reproduced in a production environment)
----
h2. Environment
* 5-node NiFi cluster, all nodes identical hardware/configuration
* Embedded ZooKeeper (all 5 nodes participate)
* OIDC authentication via external identity provider
* Deployment is automated: all 5 nodes are stopped, flow definition is
deployed, and all 5 nodes are started within the same second
* JVM: {{-Xms7G -Xmx7G}} per node
* Relevant {{{}nifi.properties{}}}:
{noformat}
nifi.cluster.is.node=true
nifi.cluster.node.protocol.port=11443
nifi.cluster.node.read.timeout=5 sec
nifi.cluster.node.connection.timeout=5 sec
{noformat}
----
h2. Reproduction Steps
# Configure a 5-node NiFi 2.8.0 cluster with embedded ZooKeeper
# Stop all 5 nodes simultaneously
# Start all 5 nodes simultaneously (within the same second)
# Observe cluster state via API: {{GET /nifi-api/controller/cluster}} (see the probe sketch below)
*Expected:* All 5 nodes connect and form a cluster within a few minutes.
*Actual:* The cluster never forms. The API returns HTTP 409 ("The Flow
Controller is initializing the Data Flow") indefinitely. Cluster summary
reports 0 connected nodes.
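For step 4, a minimal probe of the following shape shows the failure mode. This is a sketch only: the host and port are examples, and TLS trust plus the OIDC bearer token our environment requires are omitted.
{code:java}
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Minimal cluster-state probe for step 4. Host/port are examples; a real
// deployment also needs TLS trust material and an Authorization header.
public class ClusterProbe {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://node-1.example.com:8443/nifi-api/controller/cluster"))
                .GET()
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        // Healthy cluster: HTTP 200 and a node list with 5 connected nodes.
        // Deadlocked cluster: HTTP 409 with "The Flow Controller is
        // initializing the Data Flow", indefinitely.
        System.out.println(response.statusCode());
        System.out.println(response.body());
    }
}
{code}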
This is reliably reproducible when all nodes start within a tight window. It
does NOT occur when nodes are started with a stagger (e.g., 30+ seconds apart),
because an already-initialized node wins the coordinator election and can
accept connection requests normally.
----
h2. Root Cause Analysis
h3. The Deadlock
The bug is in the cluster join sequence during startup. Here is the exact chain
of events (a sketch of the failing check follows the list):
# All 5 nodes start simultaneously and begin their initialization sequence
# Embedded ZooKeeper forms a quorum and holds a leader election for the
"Cluster Coordinator" role
# *node-1* wins the election and becomes Cluster Coordinator
# node-1's {{[main]}} thread enters {{StandardFlowService.load()}} →
{{{}StandardFlowService.connect(){}}}, which calls
{{NodeProtocolSenderListener.requestConnection()}} to join the cluster
# {{AbstractNodeProtocolSender.requestConnection()}} calls
{{validateNotConnectingToSelf()}} at line 119
# This method detects that the coordinator address resolved from ZooKeeper
({{{}node-1:11443{}}}) matches the local node
# It throws {{{}UnknownServiceAddressException{}}}:
{{Cluster Coordinator is currently node-1.example.com:11443, which is this node, but connecting to self is not allowed at this phase of the lifecycle. This node must wait for a new Cluster Coordinator to be elected before connecting to the cluster.}}
# {{StandardFlowService.connect()}} catches the exception, waits ~5 seconds,
and retries
# *Goto step 4.* The coordinator election does not change because node-1 is
still running and holding the ZooKeeper ephemeral node.
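The failing check, paraphrased: the sketch below captures the behavior described in steps 5-8, but it is not the actual NiFi source; the nested exception class and {{isLocalAddress()}} are stand-ins for the real classes named in the stack trace further down.
{code:java}
import java.net.InetSocketAddress;

// Paraphrased shape of the check from steps 5-8. NOT the actual NiFi
// source: the nested exception and isLocalAddress() are simplified
// stand-ins for the real classes named in the stack trace.
public abstract class SelfConnectionCheckSketch {

    static class UnknownServiceAddressException extends RuntimeException {
        UnknownServiceAddressException(String message) { super(message); }
    }

    protected void validateNotConnectingToSelf(InetSocketAddress coordinator) {
        if (isLocalAddress(coordinator)) {
            // On node-1, ZooKeeper resolves the coordinator to node-1:11443,
            // i.e. this node, so every join attempt dies right here.
            throw new UnknownServiceAddressException(
                    "Cluster Coordinator is currently " + coordinator
                    + ", which is this node, but connecting to self is not"
                    + " allowed at this phase of the lifecycle.");
        }
    }

    // Stand-in for however the real sender compares the resolved coordinator
    // address against the local node's protocol address.
    protected abstract boolean isLocalAddress(InetSocketAddress address);
}
{code}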
h3. Why It Never Recovers
The deadlock is permanent because of a circular dependency:
* *node-1 cannot join the cluster* because {{validateNotConnectingToSelf()}}
prevents it from sending a connection request to itself
* *node-1 will not relinquish the coordinator role* because it is still
running and its ZooKeeper session is active
* *A new coordinator will not be elected* because ZooKeeper has no reason to
re-elect — the current leader's session is alive
* *The other 4 nodes cannot form the cluster* because the coordinator (node-1)
never completed initialization and never processes their connection requests
The error message says "This node must wait for a new Cluster Coordinator to be
elected" — but that will never happen without external intervention.
h3. Observed Cluster State After 2+ Hours
From the application logs, here is the actual cluster state observed 2+ hours
after startup:
*node-1 (elected coordinator):*
* Stuck on {{[main]}} thread in {{StandardFlowService.connect()}} retry loop
* Logged the {{validateNotConnectingToSelf}} exception *180 times* over 2+
hours
* Receiving heartbeats from node-3 and node-5 (they think node-1 is
coordinator), but cannot process them meaningfully because initialization never
completed
* API returns HTTP 409 on all endpoints
*node-2:*
* Also elected itself as coordinator (split-brain — see below)
* Receiving heartbeats from node-4
* Sending heartbeats to itself
* No WARN/ERROR in logs, but cluster still shows 0 connected nodes
* API returns HTTP 409
*node-3:* Sending heartbeats to node-1 (stuck coordinator). Never joined the
cluster.
*node-4:* Sending heartbeats to node-2. Never joined the cluster.
*node-5:* Sending heartbeats to node-1 (stuck coordinator). Never joined the
cluster.
h3. Secondary Issue: Split-Brain Coordinator Election
During the deadlock, we observed what appears to be a split-brain in the
ZooKeeper coordinator election:
* node-1 believes it is the coordinator (from initial election)
* node-2 also believes it is the coordinator (possibly from a secondary
election after ZooKeeper session instability)
* node-3 and node-5 send heartbeats to node-1
* node-4 sends heartbeats to node-2
This may be a consequence of the primary deadlock causing ZooKeeper session
timeouts on some nodes while node-1's session remains active, leading to
inconsistent election results across the quorum.
----
h2. Stack Trace (from node-1, repeated 180 times over 2+ hours)
{noformat}
WARN [main] o.a.nifi.controller.StandardFlowService Failed to connect to cluster
org.apache.nifi.cluster.protocol.UnknownServiceAddressException: Cluster Coordinator is currently node-1.example.com:11443, which is this node, but connecting to self is not allowed at this phase of the lifecycle. This node must wait for a new Cluster Coordinator to be elected before connecting to the cluster.
    at o.a.n.cluster.protocol.AbstractNodeProtocolSender.validateNotConnectingToSelf(AbstractNodeProtocolSender.java:119)
    at o.a.n.cluster.protocol.AbstractNodeProtocolSender.requestConnection(AbstractNodeProtocolSender.java:74)
    at o.a.n.cluster.protocol.impl.NodeProtocolSenderListener.requestConnection(NodeProtocolSenderListener.java:91)
    at o.a.n.controller.StandardFlowService.connect(StandardFlowService.java:825)
    at o.a.n.controller.StandardFlowService.load(StandardFlowService.java:449)
    at o.a.n.web.server.JettyServer.start(JettyServer.java:842)
    at o.a.n.runtime.Application.startServer(Application.java:131)
    at o.a.n.runtime.Application.run(Application.java:78)
    at o.a.n.runtime.Application.run(Application.java:60)
    at org.apache.nifi.NiFi.main(NiFi.java:42)
{noformat}
*Meanwhile, node-1 is also processing heartbeats from other nodes:*
{noformat}
INFO [Process Cluster Protocol Request-32] o.a.n.c.p.impl.SocketProtocolListener Finished processing request (type=HEARTBEAT, length=6060 bytes) from node-5:8443 in 44 millis
INFO [Process Cluster Protocol Request-33] o.a.n.c.p.impl.SocketProtocolListener Finished processing request (type=HEARTBEAT, length=6060 bytes) from node-3:8443 in 45 millis
WARN [main] o.a.nifi.controller.StandardFlowService Failed to connect to cluster
...connecting to self is not allowed at this phase of the lifecycle...
INFO [Process Cluster Protocol Request-34] o.a.n.c.p.impl.SocketProtocolListener Finished processing request (type=HEARTBEAT, length=6052 bytes) from node-5:8443 in 45 millis
{noformat}
The node is receiving heartbeats and responding to protocol requests on
background threads, but the {{[main]}} thread is stuck in the retry loop and
never completes {{{}JettyServer.start(){}}}.
----
h2. Affected Code
The bug originates in
{{{}AbstractNodeProtocolSender.validateNotConnectingToSelf(){}}}:
# *{{AbstractNodeProtocolSender.java:119}}* —
{{validateNotConnectingToSelf()}} throws {{UnknownServiceAddressException}}
when the coordinator address matches the local node. This check exists to
prevent circular connection requests, but it does not account for the case
where the coordinator IS the node that needs to bootstrap the cluster.
# *{{AbstractNodeProtocolSender.java:74}}* — {{requestConnection()}} calls
{{validateNotConnectingToSelf()}} unconditionally before sending the connection
request.
# *{{StandardFlowService.java:825}}* — {{connect()}} catches the exception and
retries, but has no fallback logic for the self-coordinator case. No retry
limit. No coordinator resignation mechanism (see the loop sketch after this
list).
# *{{StandardFlowService.java:449}}* — {{load()}} calls {{connect()}} during
initialization, blocking the {{[main]}} thread (and therefore
{{{}JettyServer.start(){}}}) until the connection succeeds — which it never
does.
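Taken together, the observed behavior corresponds to a loop of roughly this shape. This is a paraphrase with stand-in types, not the actual {{StandardFlowService}} code; only the control flow (catch, sleep ~5 seconds, retry forever) reflects the logs.
{code:java}
// Paraphrased shape of the connect/retry behavior in items 3 and 4.
// ConnectionRequester and the caught exception type are stand-ins.
public class ConnectRetrySketch {

    interface ConnectionRequester {
        String requestConnection(); // runs validateNotConnectingToSelf() first
    }

    static String connect(ConnectionRequester requester) throws InterruptedException {
        while (true) {
            try {
                return requester.requestConnection();
            } catch (RuntimeException e) { // UnknownServiceAddressException in NiFi
                System.err.println("Failed to connect to cluster: " + e.getMessage());
                // No retry limit, no escalating backoff, no coordinator
                // resignation: on the self-coordinator this never exits, and
                // the [main] thread never returns to JettyServer.start().
                Thread.sleep(5_000L);
            }
        }
    }
}
{code}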
----
h2. Suggested Fix
The coordinator node should be able to bootstrap itself without sending a
network connection request. When {{validateNotConnectingToSelf()}} detects that
the current node IS the coordinator, instead of throwing an exception, the node
should handle its own connection locally:
h3. Option A: Local Self-Connection (Preferred)
In {{{}AbstractNodeProtocolSender.requestConnection(){}}}, when the coordinator
resolves to self, bypass the network send and instead invoke the coordinator's
connection handling logic directly (the same code path that processes incoming
{{CONNECTION_REQUEST}} messages from other nodes). The coordinator can admit
itself as the first cluster member, complete initialization, and then accept
connection requests from the remaining nodes normally.
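A rough sketch of the proposed control flow follows; every name here is a hypothetical stand-in for the real protocol classes, and only the branch matters: local dispatch for self, unchanged socket path otherwise.
{code:java}
import java.net.InetSocketAddress;

// Option A sketch. All names are hypothetical stand-ins.
public class SelfBootstrapSketch {

    interface ConnectionHandler {
        // Represents the existing coordinator-side code path that serves
        // CONNECTION_REQUEST messages arriving from other nodes.
        String handle(String connectionRequest);
    }

    private final ConnectionHandler localHandler;

    SelfBootstrapSketch(ConnectionHandler localHandler) {
        this.localHandler = localHandler;
    }

    String requestConnection(String request, InetSocketAddress coordinator,
                             boolean coordinatorIsSelf) {
        if (coordinatorIsSelf) {
            // Bootstrap case: instead of throwing, the elected coordinator
            // admits itself as the first cluster member via the same handler
            // used for remote nodes, completes initialization, and can then
            // accept the remaining nodes' requests normally.
            return localHandler.handle(request);
        }
        return sendOverSocket(coordinator, request); // existing network path
    }

    private String sendOverSocket(InetSocketAddress coordinator, String request) {
        throw new UnsupportedOperationException("network path elided in sketch");
    }
}
{code}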
h3. Option B: Coordinator Resignation with Backoff
If the coordinator detects it cannot connect to itself during the
initialization phase, it should:
# Relinquish the ZooKeeper coordinator ephemeral node
# Wait a randomized backoff period
# Allow a different (possibly already-initialized) node to win the election
# Retry connection to the new coordinator
This is less ideal because it adds latency and still depends on another node
being ready, but it breaks the deadlock.
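Sketched with a hypothetical election interface (NiFi's actual Curator-based election manager API may differ):
{code:java}
import java.util.concurrent.ThreadLocalRandom;

// Option B sketch. LeaderElection is a hypothetical stand-in for the
// ZooKeeper/Curator election machinery; only the control flow matters.
public class ResignAndRetrySketch {

    interface LeaderElection {
        void resign();        // drop the coordinator ephemeral node
        String awaitLeader(); // block until the next election settles
    }

    static String resignAndRetry(LeaderElection election) throws InterruptedException {
        // Step 1: relinquish the coordinator role.
        election.resign();
        // Step 2: randomized backoff so simultaneously started nodes do not
        // all re-enter the election at the same instant and collide again.
        Thread.sleep(ThreadLocalRandom.current().nextLong(5_000L, 30_000L));
        // Steps 3 and 4: let another (ideally already-initialized) node win,
        // then retry the connection request against the new coordinator.
        return election.awaitLeader();
    }
}
{code}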
h3. Option C: Retry Limit with Forced Re-Election
Add a maximum retry count to the {{StandardFlowService.connect()}} loop. After
N failed attempts (e.g., 12 attempts = ~60 seconds), the node should forcibly
resign the coordinator role by closing its ZooKeeper leader election
participation, wait for a new election, and retry.
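A minimal shape for the bounded loop (hypothetical names again; {{resign.run()}} stands in for closing the node's leader-election participation):
{code:java}
// Option C sketch: bounded retries, then forced re-election. Names are
// hypothetical; resign.run() stands in for closing this node's ZooKeeper
// leader-election participation so a different node can win.
public class BoundedConnectSketch {

    interface ConnectionAttempt {
        String run(); // throws RuntimeException on self-connection rejection
    }

    static String connect(ConnectionAttempt attempt, Runnable resign)
            throws InterruptedException {
        final int maxAttempts = 12; // ~60 seconds at one attempt per ~5 seconds
        int failures = 0;
        while (true) {
            try {
                return attempt.run();
            } catch (RuntimeException e) {
                if (++failures >= maxAttempts) {
                    resign.run();  // force a new coordinator election
                    failures = 0;  // restart the retry budget afterwards
                }
                Thread.sleep(5_000L);
            }
        }
    }
}
{code}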
----
h2. Current Workaround
The only workaround is to restart the deadlocked coordinator node (or restart
the entire cluster). On restart, a different node may win the election and
bootstrap successfully — but this is not guaranteed and the same deadlock can
recur.
*Reliable workaround:* Stagger node startups by 30+ seconds so that the first
node fully initializes before the next starts. This ensures the coordinator
election is won by a node that has already completed initialization. This is
impractical for automated deployment pipelines that manage clusters
declaratively.
----
h2. Impact
* *Severity:* Critical for automated/orchestrated deployments
* *Data Loss:* None (the cluster never forms, so no data is processed or lost)
* *Recovery:* Requires manual intervention (restart)
* *Scope:* Any NiFi cluster where nodes are started simultaneously (common in
CI/CD pipelines, container orchestrators, Ansible/Terraform deployments, and
auto-scaling groups)