Dan W created NIFI-15833:
----------------------------

             Summary: Cluster Coordinator self-connection deadlock when all 
nodes of a NiFi cluster are started simultaneously
                 Key: NIFI-15833
                 URL: https://issues.apache.org/jira/browse/NIFI-15833
             Project: Apache NiFi
          Issue Type: Bug
          Components: Core Framework
    Affects Versions: 2.8.0, 2.7.0, 2.6.0, 2.5.0, 2.4.0
         Environment: OS: Red Hat Enterprise Linux 8 (x86_64)
 Java: OpenJDK 21.0.10+7 (openjdk-21.0.10.0.7-1.el8.x86_64)
 NiFi: Apache NiFi 2.8.0
 ZooKeeper: Embedded (bundled with NiFi)
 Cluster: 5 nodes, 8 vCPUs / 15Gi RAM each, JVM heap -Xms7G -Xmx7G
 Deployment: Ansible-automated simultaneous start across all nodes
            Reporter: Dan W


h1. Cluster Coordinator Self-Connection Deadlock During Simultaneous Node 
Startup
h2. Summary

When all nodes of a NiFi cluster are started simultaneously (e.g., via 
automated deployment tooling), the node that wins the ZooKeeper Cluster 
Coordinator election can enter a permanent deadlock. The elected coordinator 
attempts to join the cluster by sending a connection request to the coordinator 
— which is itself — and 
{{AbstractNodeProtocolSender.validateNotConnectingToSelf()}} rejects the 
request with an {{{}UnknownServiceAddressException{}}}. The node retries every 
~5 seconds indefinitely: it never completes initialization, never accepts 
connection requests from other nodes, and the entire cluster fails to form.

*This is a permanent deadlock with no self-recovery mechanism.* Manual 
intervention (restart) is the only way out.

*Version:* Apache NiFi 2.8.0, Java 21 (openjdk-21.0.10.0.7), embedded ZooKeeper
*Cluster Size:* 5 nodes (reproduced in a production environment)
----
h2. Environment
 * 5-node NiFi cluster, all nodes identical hardware/configuration
 * Embedded ZooKeeper (all 5 nodes participate)
 * OIDC authentication via external identity provider
 * Deployment is automated: all 5 nodes are stopped, flow definition is 
deployed, and all 5 nodes are started within the same second
 * JVM: {{-Xms7G -Xmx7G}} per node
 * Relevant {{{}nifi.properties{}}}:
 {{nifi.cluster.is.node=true
nifi.cluster.node.protocol.port=11443
nifi.cluster.node.read.timeout=5 sec
nifi.cluster.node.connection.timeout=5 sec}}

----
h2. Reproduction Steps
 # Configure a 5-node NiFi 2.8.0 cluster with embedded ZooKeeper
 # Stop all 5 nodes simultaneously
 # Start all 5 nodes simultaneously (within the same second)
 # Observe cluster state via API: {{GET /nifi-api/controller/cluster}}

*Expected:* All 5 nodes connect and form a cluster within a few minutes.
*Actual:* The cluster never forms. The API returns HTTP 409 ("The Flow 
Controller is initializing the Data Flow") indefinitely. Cluster summary 
reports 0 connected nodes.

This is reliably reproducible when all nodes start within a tight window. It 
does NOT occur when nodes are started with a stagger (e.g., 30+ seconds apart), 
because an already-initialized node wins the coordinator election and can 
accept connection requests normally.
----
h2. Root Cause Analysis
h3. The Deadlock

The bug is in the cluster join sequence during startup. Here is the exact chain 
of events:
 # All 5 nodes start simultaneously and begin their initialization sequence
 # Embedded ZooKeeper forms a quorum and holds a leader election for the 
"Cluster Coordinator" role
 # *node-1* wins the election and becomes Cluster Coordinator
 # node-1's {{[main]}} thread enters {{StandardFlowService.load()}} → 
{{{}StandardFlowService.connect(){}}}, which calls 
{{NodeProtocolSenderListener.requestConnection()}} to join the cluster
 # {{AbstractNodeProtocolSender.requestConnection()}} calls 
{{validateNotConnectingToSelf()}} at line 119
 # This method detects that the coordinator address resolved from ZooKeeper 
({{{}node-1:11443{}}}) matches the local node
 # It throws {{{}UnknownServiceAddressException{}}}:
 {{Cluster Coordinator is currently node-1.example.com:11443, which is this 
node,
but connecting to self is not allowed at this phase of the lifecycle.
This node must wait for a new Cluster Coordinator to be elected before
connecting to the cluster.}}
 # {{StandardFlowService.connect()}} catches the exception, waits ~5 seconds, 
and retries
 # *Goto step 4.* The coordinator election does not change because node-1 is 
still running and holding the ZooKeeper ephemeral node.
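The retry loop described in steps 4–9 can be illustrated with a small, self-contained simulation. This is NOT the actual NiFi source — the method names below paraphrase {{validateNotConnectingToSelf()}} and the {{StandardFlowService.connect()}} retry loop, and the demo caps iterations only so it terminates:

```java
// Simplified simulation of the startup retry loop described above.
// NOT the actual NiFi source; names and structure are illustrative only.
public class SelfConnectionLoopDemo {

    static final String LOCAL_NODE = "node-1.example.com:11443";

    // Stand-in for resolving the coordinator address from ZooKeeper. While
    // node-1 holds the ephemeral leader znode, this answer never changes.
    static String resolveCoordinatorFromZooKeeper() {
        return LOCAL_NODE;
    }

    // Stand-in for AbstractNodeProtocolSender.validateNotConnectingToSelf()
    static void validateNotConnectingToSelf(String coordinator) {
        if (coordinator.equals(LOCAL_NODE)) {
            throw new IllegalStateException(
                "Cluster Coordinator is currently " + coordinator
                + ", which is this node; connecting to self is not allowed");
        }
    }

    public static void main(String[] args) {
        int attempts = 0;
        boolean connected = false;
        // The real connect() loop has no retry limit; we cap at 5 iterations
        // here purely so the demo terminates.
        while (!connected && attempts < 5) {
            attempts++;
            try {
                validateNotConnectingToSelf(resolveCoordinatorFromZooKeeper());
                connected = true; // never reached while this node is leader
            } catch (IllegalStateException e) {
                // connect() catches, waits ~5s, retries; the coordinator
                // never changes, so every attempt fails identically
            }
        }
        System.out.println("connected=" + connected + " attempts=" + attempts);
    }
}
```

Every iteration resolves the same coordinator (itself), fails the same check, and retries — the loop invariant never changes.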

h3. Why It Never Recovers

The deadlock is permanent because of a circular dependency:
 * *node-1 cannot join the cluster* because {{validateNotConnectingToSelf()}} 
prevents it from sending a connection request to itself
 * *node-1 will not relinquish the coordinator role* because it is still 
running and its ZooKeeper session is active
 * *A new coordinator will not be elected* because ZooKeeper has no reason to 
re-elect — the current leader's session is alive
 * *The other 4 nodes cannot form the cluster* because the coordinator (node-1) 
never completed initialization and never processes their connection requests

The error message says "This node must wait for a new Cluster Coordinator to be 
elected" — but that will never happen without external intervention.
h3. Observed Cluster State After 2+ Hours

From the application logs, here is the actual cluster state observed 2+ hours 
after startup:

*node-1 (elected coordinator):*
 * Stuck on {{[main]}} thread in {{StandardFlowService.connect()}} retry loop
 * Logged the {{validateNotConnectingToSelf}} exception *180 times* over the 
2+ hour window (retry interval ~5 seconds)
 * Receiving heartbeats from node-3 and node-5 (they think node-1 is 
coordinator), but cannot process them meaningfully because initialization never 
completed
 * API returns HTTP 409 on all endpoints

*node-2:*
 * Also elected itself as coordinator (split-brain — see below)
 * Receiving heartbeats from node-4
 * Sending heartbeats to itself
 * No WARN/ERROR in logs, but cluster still shows 0 connected nodes
 * API returns HTTP 409

*node-3:* Sending heartbeats to node-1 (stuck coordinator). Never joined the 
cluster.
*node-4:* Sending heartbeats to node-2. Never joined the cluster.
*node-5:* Sending heartbeats to node-1 (stuck coordinator). Never joined the 
cluster.
h3. Secondary Issue: Split-Brain Coordinator Election

During the deadlock, we observed what appears to be a split-brain in the 
ZooKeeper coordinator election:
 * node-1 believes it is the coordinator (from initial election)
 * node-2 also believes it is the coordinator (possibly from a secondary 
election after ZooKeeper session instability)
 * node-3 and node-5 send heartbeats to node-1
 * node-4 sends heartbeats to node-2

This may be a consequence of the primary deadlock causing ZooKeeper session 
timeouts on some nodes while node-1's session remains active, leading to 
inconsistent election results across the quorum.
----
h2. Stack Trace (from node-1, repeated 180 times over 2+ hours)
 
{{WARN [main] o.a.nifi.controller.StandardFlowService Failed to connect to 
cluster
org.apache.nifi.cluster.protocol.UnknownServiceAddressException: Cluster 
Coordinator is currently node-1.example.com:11443, which is this node, but 
connecting to self is not allowed at this phase of the lifecycle. This node 
must wait for a new Cluster Coordinator to be elected before connecting to the 
cluster.
        at 
o.a.n.cluster.protocol.AbstractNodeProtocolSender.validateNotConnectingToSelf(AbstractNodeProtocolSender.java:119)
        at 
o.a.n.cluster.protocol.AbstractNodeProtocolSender.requestConnection(AbstractNodeProtocolSender.java:74)
        at 
o.a.n.cluster.protocol.impl.NodeProtocolSenderListener.requestConnection(NodeProtocolSenderListener.java:91)
        at 
o.a.n.controller.StandardFlowService.connect(StandardFlowService.java:825)
        at 
o.a.n.controller.StandardFlowService.load(StandardFlowService.java:449)
        at o.a.n.web.server.JettyServer.start(JettyServer.java:842)
        at o.a.n.runtime.Application.startServer(Application.java:131)
        at o.a.n.runtime.Application.run(Application.java:78)
        at o.a.n.runtime.Application.run(Application.java:60)
        at org.apache.nifi.NiFi.main(NiFi.java:42)}}

*Meanwhile, node-1 is also processing heartbeats from other nodes:*
 
{{INFO [Process Cluster Protocol Request-32] 
o.a.n.c.p.impl.SocketProtocolListener
  Finished processing request (type=HEARTBEAT, length=6060 bytes) from 
node-5:8443 in 44 millis

INFO [Process Cluster Protocol Request-33] o.a.n.c.p.impl.SocketProtocolListener
  Finished processing request (type=HEARTBEAT, length=6060 bytes) from 
node-3:8443 in 45 millis

WARN [main] o.a.nifi.controller.StandardFlowService Failed to connect to cluster
  ...connecting to self is not allowed at this phase of the lifecycle...

INFO [Process Cluster Protocol Request-34] o.a.n.c.p.impl.SocketProtocolListener
  Finished processing request (type=HEARTBEAT, length=6052 bytes) from 
node-5:8443 in 45 millis}}

The node is receiving heartbeats and responding to protocol requests on 
background threads, but the {{[main]}} thread is stuck in the retry loop and 
never completes {{{}JettyServer.start(){}}}.
----
h2. Affected Code

The bug originates in 
{{{}AbstractNodeProtocolSender.validateNotConnectingToSelf(){}}}:
 # *{{AbstractNodeProtocolSender.java:119}}* — 
{{validateNotConnectingToSelf()}} throws {{UnknownServiceAddressException}} 
when the coordinator address matches the local node. This check exists to 
prevent circular connection requests, but it does not account for the case 
where the coordinator IS the node that needs to bootstrap the cluster.
 # *{{AbstractNodeProtocolSender.java:74}}* — {{requestConnection()}} calls 
{{validateNotConnectingToSelf()}} unconditionally before sending the connection 
request.
 # *{{StandardFlowService.java:825}}* — {{connect()}} catches the exception and 
retries, but has no fallback logic for the self-coordinator case. No retry 
limit. No coordinator resignation mechanism.
 # *{{StandardFlowService.java:449}}* — {{load()}} calls {{connect()}} during 
initialization, blocking the {{[main]}} thread (and therefore 
{{{}JettyServer.start(){}}}) until the connection succeeds — which it never 
does.

----
h2. Suggested Fix

The coordinator node should be able to bootstrap itself without sending a 
network connection request. When {{validateNotConnectingToSelf()}} detects that 
the current node IS the coordinator, instead of throwing an exception, the node 
should handle its own connection locally:
h3. Option A: Local Self-Connection (Preferred)

In {{{}AbstractNodeProtocolSender.requestConnection(){}}}, when the coordinator 
resolves to self, bypass the network send and instead invoke the coordinator's 
connection handling logic directly (the same code path that processes incoming 
{{CONNECTION_REQUEST}} messages from other nodes). The coordinator can admit 
itself as the first cluster member, complete initialization, and then accept 
connection requests from the remaining nodes normally.
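A minimal sketch of Option A, with purely hypothetical names ({{handleConnectionRequestLocally()}} etc. do not exist in NiFi; they stand in for the coordinator-side {{CONNECTION_REQUEST}} handler):

```java
// Hypothetical sketch of Option A. All names are illustrative stand-ins,
// not real NiFi APIs; the actual code paths differ.
public class LocalSelfConnectionDemo {

    static final String LOCAL_NODE = "node-1.example.com:11443";

    static String resolveCoordinator() {
        return LOCAL_NODE; // this node won the election
    }

    // Stand-in for the coordinator-side handler that normally processes
    // CONNECTION_REQUEST messages arriving over the socket.
    static String handleConnectionRequestLocally(String nodeId) {
        return "ADMITTED:" + nodeId;
    }

    static String sendOverNetwork(String coordinator, String nodeId) {
        throw new UnsupportedOperationException("not reached in this demo");
    }

    // Sketch of a revised requestConnection(): when the coordinator resolves
    // to this node, invoke the connection-handling logic directly instead of
    // throwing UnknownServiceAddressException.
    static String requestConnection(String nodeId) {
        String coordinator = resolveCoordinator();
        if (coordinator.equals(LOCAL_NODE)) {
            // Bypass the network send: admit ourselves as the first member
            return handleConnectionRequestLocally(nodeId);
        }
        return sendOverNetwork(coordinator, nodeId);
    }

    public static void main(String[] args) {
        System.out.println(requestConnection(LOCAL_NODE));
    }
}
```

The key design point is that the self-connection reuses the exact validation and admission logic that remote nodes go through, so the coordinator is not a special case anywhere downstream.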
h3. Option B: Coordinator Resignation with Backoff

If the coordinator detects it cannot connect to itself during the 
initialization phase, it should:
 # Relinquish the ZooKeeper coordinator ephemeral node
 # Wait a randomized backoff period
 # Allow a different (possibly already-initialized) node to win the election
 # Retry connection to the new coordinator

This is less ideal because it adds latency and still depends on another node 
being ready, but it breaks the deadlock.
h3. Option C: Retry Limit with Forced Re-Election

Add a maximum retry count to the {{StandardFlowService.connect()}} loop. After 
N failed attempts (e.g., 12 attempts = ~60 seconds), the node should forcibly 
resign the coordinator role by closing its ZooKeeper leader election 
participation, wait for a new election, and retry.
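The Option C loop can be sketched as follows. Again, every name here is hypothetical — {{resignCoordinatorRole()}} stands in for closing the node's ZooKeeper leader-election participation, and the demo simply lets another node win the next election:

```java
// Illustrative sketch of Option C; all names are hypothetical, not NiFi APIs.
public class RetryLimitDemo {

    static final String LOCAL_NODE = "node-1:11443";
    static String currentCoordinator = LOCAL_NODE;

    // Stand-in for closing this node's ZooKeeper leader-election
    // participation; in this demo another node simply wins the re-election.
    static void resignCoordinatorRole() {
        currentCoordinator = "node-2:11443";
    }

    static boolean tryConnect() {
        // Connecting to self is still rejected, exactly as today
        return !currentCoordinator.equals(LOCAL_NODE);
    }

    public static void main(String[] args) {
        final int maxAttempts = 12; // e.g., 12 x ~5s retry interval = ~60s
        int attempts = 0;
        boolean connected = false;
        while (!connected) {
            attempts++;
            connected = tryConnect();
            if (!connected && attempts >= maxAttempts) {
                // Break the deadlock: give up leadership so another node
                // (possibly already initialized) can win the election
                resignCoordinatorRole();
            }
        }
        System.out.println("connected after " + attempts + " attempts");
    }
}
```

In the simulation, attempts 1–12 fail against the self-coordinator, the node resigns, and attempt 13 succeeds against the new coordinator.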
----
h2. Current Workaround

The only workaround is to restart the deadlocked coordinator node (or restart 
the entire cluster). On restart, a different node may win the election and 
bootstrap successfully — but this is not guaranteed and the same deadlock can 
recur.

*Reliable workaround:* Stagger node startups by 30+ seconds so that the first 
node fully initializes before the next starts. This ensures the coordinator 
election is won by a node that has already completed initialization. This is 
impractical for automated deployment pipelines that manage clusters 
declaratively.
----
h2. Impact
 * *Severity:* Critical for automated/orchestrated deployments
 * *Data Loss:* None (the cluster never forms, so no data is processed or lost)
 * *Recovery:* Requires manual intervention (restart)
 * *Scope:* Any NiFi cluster where nodes are started simultaneously (common in 
CI/CD pipelines, container orchestrators, Ansible/Terraform deployments, and 
auto-scaling groups)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
