[ 
https://issues.apache.org/jira/browse/NIFI-15833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18073494#comment-18073494
 ] 

Dan W commented on NIFI-15833:
------------------------------

I'd also note that the issue doesn't appear to be limited to simultaneous 
startup scenarios. We've observed what looks like the same underlying problem — 
stale coordinator election state — surfacing after transient network delays and 
other incidental disruptions during normal cluster operation. Even if we 
introduced a mandatory staggered startup between nodes in our deployment 
pipeline, that wouldn't address the core issue: the cluster coordination 
protocol has no mechanism to detect or recover from a node sending heartbeats 
to a non-coordinator. Once a node latches onto the wrong coordinator — whether 
from a race at startup or a momentary network hiccup — it stays there 
permanently with no errors and no self-correction.

> When all nodes of a NiFi cluster are started simultaneously (e.g., via 
> automated deployment tooling), the node that wins the ZooKeeper Cluster 
> Coordinator election can enter a permanent deadlock.
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: NIFI-15833
>                 URL: https://issues.apache.org/jira/browse/NIFI-15833
>             Project: Apache NiFi
>          Issue Type: Bug
>          Components: Core Framework
>    Affects Versions: 2.4.0, 2.5.0, 2.6.0, 2.7.0, 2.8.0
>         Environment: OS: Red Hat Enterprise Linux 8 (x86_64)
>  Java: OpenJDK 21.0.10+7 (openjdk-21.0.10.0.7-1.el8.x86_64)
>  NiFi: Apache NiFi 2.8.0
>  ZooKeeper: Embedded (bundled with NiFi)
>  Cluster: 5 nodes, 8 vCPUs / 15Gi RAM each, JVM heap -Xms7G -Xmx7G
>  Deployment: Ansible-automated simultaneous start across all nodes
>            Reporter: Dan W
>            Priority: Major
>
> h1. Cluster Coordinator Self-Connection Deadlock During Simultaneous Node 
> Startup
> h2. Summary
> When all nodes of a NiFi cluster are started simultaneously (e.g., via 
> automated deployment tooling), the node that wins the ZooKeeper Cluster 
> Coordinator election can enter a permanent deadlock. The elected coordinator 
> attempts to join the cluster by sending a connection request to the 
> coordinator — which is itself — and 
> {{AbstractNodeProtocolSender.validateNotConnectingToSelf()}} throws an 
> {{{}UnknownServiceAddressException{}}}, blocking it. The node retries every 
> ~5 seconds indefinitely. It never completes initialization, never accepts 
> connection requests from other nodes, and the entire cluster fails to form.
> *This is a permanent deadlock with no self-recovery mechanism.* Manual 
> intervention (restart) is the only way out.
> *Version:* Apache NiFi 2.8.0, Java 21 (openjdk-21.0.10.0.7), embedded 
> ZooKeeper
> *Cluster Size:* 5 nodes (reproduced in a production environment)
> ----
> h2. Environment
>  * 5-node NiFi cluster, all nodes identical hardware/configuration
>  * Embedded ZooKeeper (all 5 nodes participate)
>  * OIDC authentication via external identity provider
>  * Deployment is automated: all 5 nodes are stopped, flow definition is 
> deployed, and all 5 nodes are started within the same second
>  * JVM: {{-Xms7G -Xmx7G}} per node
>  * Relevant {{{}nifi.properties{}}}:
>  {{nifi.cluster.is.node=true
> nifi.cluster.node.protocol.port=11443
> nifi.cluster.node.read.timeout=5 sec
> nifi.cluster.node.connection.timeout=5 sec}}
> ----
> h2. Reproduction Steps
>  # Configure a 5-node NiFi 2.8.0 cluster with embedded ZooKeeper
>  # Stop all 5 nodes simultaneously
>  # Start all 5 nodes simultaneously (within the same second)
>  # Observe cluster state via API: {{GET /nifi-api/controller/cluster}}
> *Expected:* All 5 nodes connect and form a cluster within a few minutes.
> *Actual:* The cluster never forms. The API returns HTTP 409 ("The Flow 
> Controller is initializing the Data Flow") indefinitely. Cluster summary 
> reports 0 connected nodes.
> This is reliably reproducible when all nodes start within a tight window. It 
> does NOT occur when nodes are started with a stagger (e.g., 30+ seconds 
> apart), because an already-initialized node wins the coordinator election and 
> can accept connection requests normally.
> ----
> h2. Root Cause Analysis
> h3. The Deadlock
> The bug is in the cluster join sequence during startup. Here is the exact 
> chain of events:
>  # All 5 nodes start simultaneously and begin their initialization sequence
>  # Embedded ZooKeeper forms a quorum and holds a leader election for the 
> "Cluster Coordinator" role
>  # *node-1* wins the election and becomes Cluster Coordinator
>  # node-1's {{[main]}} thread enters {{StandardFlowService.load()}} → 
> {{{}StandardFlowService.connect(){}}}, which calls 
> {{NodeProtocolSenderListener.requestConnection()}} to join the cluster
>  # {{AbstractNodeProtocolSender.requestConnection()}} calls 
> {{validateNotConnectingToSelf()}} at line 119
>  # This method detects that the coordinator address resolved from ZooKeeper 
> ({{{}node-1:11443{}}}) matches the local node
>  # It throws {{{}UnknownServiceAddressException{}}}:
>  {{Cluster Coordinator is currently node-1.example.com:11443, which is this 
> node,
> but connecting to self is not allowed at this phase of the lifecycle.
> This node must wait for a new Cluster Coordinator to be elected before
> connecting to the cluster.}}
>  # {{StandardFlowService.connect()}} catches the exception, waits ~5 seconds, 
> and retries
>  # *Goto step 4.* The coordinator election does not change because node-1 is 
> still running and holding the ZooKeeper ephemeral node.
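The cycle in steps 4 through 9 can be reproduced in isolation. Below is a minimal standalone simulation; the class and method names are invented for illustration and are not NiFi's actual code. Because the coordinator lookup always resolves to the local node while node-1 holds the election, every attempt fails and the loop makes no progress:

```java
// Minimal simulation of the retry cycle (hypothetical names, not NiFi code).
public class SelfConnectDeadlockDemo {

    static final String LOCAL = "node-1.example.com:11443";

    // node-1 holds the ZooKeeper ephemeral node, so the lookup always
    // resolves the coordinator to node-1 itself (steps 6-7 above).
    static String resolveCoordinatorFromZooKeeper() {
        return LOCAL;
    }

    // Stand-in for validateNotConnectingToSelf(): rejects a self-connection.
    static void validateNotConnectingToSelf(String coordinator) {
        if (coordinator.equals(LOCAL)) {
            throw new IllegalStateException(
                "connecting to self is not allowed at this phase of the lifecycle");
        }
    }

    // Stand-in for the StandardFlowService.connect() retry loop, bounded so
    // the simulation terminates. Returns the number of failed attempts.
    static int connectLoop(int maxAttempts) {
        int failures = 0;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                validateNotConnectingToSelf(resolveCoordinatorFromZooKeeper());
                return failures; // would join the cluster here; never reached
            } catch (IllegalStateException e) {
                failures++;      // real code waits ~5 seconds, then retries
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        // 180 attempts, matching the 180 logged exceptions: all of them fail.
        System.out.println(connectLoop(180));
    }
}
```

Nothing in the loop ever changes the value returned by the ZooKeeper lookup, which is the circular dependency described in the next section.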
> h3. Why It Never Recovers
> The deadlock is permanent because of a circular dependency:
>  * *node-1 cannot join the cluster* because {{validateNotConnectingToSelf()}} 
> prevents it from sending a connection request to itself
>  * *node-1 will not relinquish the coordinator role* because it is still 
> running and its ZooKeeper session is active
>  * *A new coordinator will not be elected* because ZooKeeper has no reason to 
> re-elect — the current leader's session is alive
>  * *The other 4 nodes cannot form the cluster* because the coordinator 
> (node-1) never completed initialization and never processes their connection 
> requests
> The error message says "This node must wait for a new Cluster Coordinator to 
> be elected" — but that will never happen without external intervention.
> h3. Observed Cluster State After 2+ Hours
> From the application logs, here is the actual cluster state observed 2+ hours 
> after startup:
> *node-1 (elected coordinator):*
>  * Stuck on {{[main]}} thread in {{StandardFlowService.connect()}} retry loop
>  * Logged the {{validateNotConnectingToSelf}} exception *180 times* (every ~5 
> seconds for 2+ hours)
>  * Receiving heartbeats from node-3 and node-5 (they think node-1 is 
> coordinator), but cannot process them meaningfully because initialization 
> never completed
>  * API returns HTTP 409 on all endpoints
> *node-2:*
>  * Also elected itself as coordinator (split-brain — see below)
>  * Receiving heartbeats from node-4
>  * Sending heartbeats to itself
>  * No WARN/ERROR in logs, but cluster still shows 0 connected nodes
>  * API returns HTTP 409
> *node-3:* Sending heartbeats to node-1 (stuck coordinator). Never joined the 
> cluster.
> *node-4:* Sending heartbeats to node-2. Never joined the cluster.
> *node-5:* Sending heartbeats to node-1 (stuck coordinator). Never joined the 
> cluster.
> h3. Secondary Issue: Split-Brain Coordinator Election
> During the deadlock, we observed what appears to be a split-brain in the 
> ZooKeeper coordinator election:
>  * node-1 believes it is the coordinator (from initial election)
>  * node-2 also believes it is the coordinator (possibly from a secondary 
> election after ZooKeeper session instability)
>  * node-3 and node-5 send heartbeats to node-1
>  * node-4 sends heartbeats to node-2
> This may be a consequence of the primary deadlock causing ZooKeeper session 
> timeouts on some nodes while node-1's session remains active, leading to 
> inconsistent election results across the quorum.
> ----
> h2. Stack Trace (from node-1, repeated 180 times over 2+ hours)
>  
> {{WARN [main] o.a.nifi.controller.StandardFlowService Failed to connect to 
> cluster
> org.apache.nifi.cluster.protocol.UnknownServiceAddressException: Cluster 
> Coordinator is currently node-1.example.com:11443, which is this node, but 
> connecting to self is not allowed at this phase of the lifecycle. This node 
> must wait for a new Cluster Coordinator to be elected before connecting to 
> the cluster.
>       at 
> o.a.n.cluster.protocol.AbstractNodeProtocolSender.validateNotConnectingToSelf(AbstractNodeProtocolSender.java:119)
>       at 
> o.a.n.cluster.protocol.AbstractNodeProtocolSender.requestConnection(AbstractNodeProtocolSender.java:74)
>       at 
> o.a.n.cluster.protocol.impl.NodeProtocolSenderListener.requestConnection(NodeProtocolSenderListener.java:91)
>       at 
> o.a.n.controller.StandardFlowService.connect(StandardFlowService.java:825)
>       at 
> o.a.n.controller.StandardFlowService.load(StandardFlowService.java:449)
>       at o.a.n.web.server.JettyServer.start(JettyServer.java:842)
>       at o.a.n.runtime.Application.startServer(Application.java:131)
>       at o.a.n.runtime.Application.run(Application.java:78)
>       at o.a.n.runtime.Application.run(Application.java:60)
>       at org.apache.nifi.NiFi.main(NiFi.java:42)}}
> *Meanwhile, node-1 is also processing heartbeats from other nodes:*
>  
> {{INFO [Process Cluster Protocol Request-32] 
> o.a.n.c.p.impl.SocketProtocolListener
>   Finished processing request (type=HEARTBEAT, length=6060 bytes) from 
> node-5:8443 in 44 millis
> INFO [Process Cluster Protocol Request-33] 
> o.a.n.c.p.impl.SocketProtocolListener
>   Finished processing request (type=HEARTBEAT, length=6060 bytes) from 
> node-3:8443 in 45 millis
> WARN [main] o.a.nifi.controller.StandardFlowService Failed to connect to 
> cluster
>   ...connecting to self is not allowed at this phase of the lifecycle...
> INFO [Process Cluster Protocol Request-34] 
> o.a.n.c.p.impl.SocketProtocolListener
>   Finished processing request (type=HEARTBEAT, length=6052 bytes) from 
> node-5:8443 in 45 millis}}
> The node is receiving heartbeats and responding to protocol requests on 
> background threads, but the {{[main]}} thread is stuck in the retry loop and 
> never completes {{{}JettyServer.start(){}}}.
> ----
> h2. Affected Code
> The bug originates in 
> {{{}AbstractNodeProtocolSender.validateNotConnectingToSelf(){}}}:
>  # *{{AbstractNodeProtocolSender.java:119}}* — 
> {{validateNotConnectingToSelf()}} throws {{UnknownServiceAddressException}} 
> when the coordinator address matches the local node. This check exists to 
> prevent circular connection requests, but it does not account for the case 
> where the coordinator IS the node that needs to bootstrap the cluster.
>  # *{{AbstractNodeProtocolSender.java:74}}* — {{requestConnection()}} calls 
> {{validateNotConnectingToSelf()}} unconditionally before sending the 
> connection request.
>  # *{{StandardFlowService.java:825}}* — {{connect()}} catches the exception 
> and retries, but has no fallback logic for the self-coordinator case. No 
> retry limit. No coordinator resignation mechanism.
>  # *{{StandardFlowService.java:449}}* — {{load()}} calls {{connect()}} during 
> initialization, blocking the {{[main]}} thread (and therefore 
> {{{}JettyServer.start(){}}}) until the connection succeeds — which it never 
> does.
> ----
> h2. Suggested Fix
> The coordinator node should be able to bootstrap itself without sending a 
> network connection request. When {{validateNotConnectingToSelf()}} detects 
> that the current node IS the coordinator, instead of throwing an exception, 
> the node should handle its own connection locally:
> h3. Option A: Local Self-Connection (Preferred)
> In {{{}AbstractNodeProtocolSender.requestConnection(){}}}, when the 
> coordinator resolves to self, bypass the network send and instead invoke the 
> coordinator's connection handling logic directly (the same code path that 
> processes incoming {{CONNECTION_REQUEST}} messages from other nodes). The 
> coordinator can admit itself as the first cluster member, complete 
> initialization, and then accept connection requests from the remaining nodes 
> normally.
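A rough shape of that bypass, under the assumption that the coordinator's connection-handling logic can be invoked directly. All names here are invented for illustration and do not match NiFi's actual API:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of Option A; class and method names are invented.
public class SelfBootstrapSketch {

    final String localAddress;
    final List<String> clusterMembers = new ArrayList<>();

    SelfBootstrapSketch(String localAddress) {
        this.localAddress = localAddress;
    }

    // Same logic that processes incoming CONNECTION_REQUEST messages from
    // other nodes, invoked here without a network round trip.
    void handleConnectionRequest(String nodeAddress) {
        clusterMembers.add(nodeAddress);
    }

    void requestConnection(String coordinatorAddress) {
        if (coordinatorAddress.equals(localAddress)) {
            // Coordinator resolves to self: admit this node as the first
            // cluster member locally instead of throwing.
            handleConnectionRequest(localAddress);
        } else {
            sendOverProtocolSocket(coordinatorAddress);
        }
    }

    void sendOverProtocolSocket(String coordinatorAddress) {
        // Normal path: serialize and send the request over the socket (elided).
    }
}
```

With this shape, the elected coordinator completes initialization as a one-node cluster, and the remaining nodes join through the unchanged network path.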
> h3. Option B: Coordinator Resignation with Backoff
> If the coordinator detects it cannot connect to itself during the 
> initialization phase, it should:
>  # Relinquish the ZooKeeper coordinator ephemeral node
>  # Wait a randomized backoff period
>  # Allow a different (possibly already-initialized) node to win the election
>  # Retry connection to the new coordinator
> This is less ideal because it adds latency and still depends on another node 
> being ready, but it breaks the deadlock.
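The four steps above can be sketched as follows; all names are hypothetical, and the backoff bound is a parameter only so the sketch is testable:

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch of Option B; interface and method names are invented.
public class ResignWithBackoffSketch {

    interface Coordination {
        void relinquishCoordinatorRole();  // e.g. delete the ZK ephemeral node
        String awaitNewCoordinator();      // block until a new leader is elected
        void connectTo(String coordinatorAddress);
    }

    static void recoverFromSelfCoordination(Coordination coord, long maxBackoffMs)
            throws InterruptedException {
        coord.relinquishCoordinatorRole();                    // step 1
        Thread.sleep(ThreadLocalRandom.current()
                .nextLong(0, maxBackoffMs + 1));              // step 2: randomized wait
        String newCoordinator = coord.awaitNewCoordinator();  // step 3
        coord.connectTo(newCoordinator);                      // step 4
    }
}
```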
> h3. Option C: Retry Limit with Forced Re-Election
> Add a maximum retry count to the {{StandardFlowService.connect()}} loop. 
> After N failed attempts (e.g., 12 attempts = ~60 seconds), the node should 
> forcibly resign the coordinator role by closing its ZooKeeper leader election 
> participation, wait for a new election, and retry.
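In sketch form, with hypothetical names (the real retry loop lives in {{StandardFlowService.connect()}}):

```java
// Hypothetical sketch of Option C; interface and method names are invented.
public class BoundedRetrySketch {

    interface LeaderElection {
        boolean isLocalNodeCoordinator(); // does the coordinator resolve to self?
        void resign();                    // close ZK leader-election participation
    }

    // Returns true if a connection attempt can proceed; false if the retry
    // budget was exhausted and the node resigned to force a re-election.
    static boolean connectWithRetryLimit(LeaderElection election, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (!election.isLocalNodeCoordinator()) {
                return true;  // another node is coordinator; connect normally
            }
            // the real loop would wait ~5 seconds between attempts (elided)
        }
        election.resign();    // e.g. 12 attempts (~60 s) exhausted: step down
        return false;         // caller waits for the new election, then retries
    }
}
```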
> ----
> h2. Current Workaround
> The only workaround is to restart the deadlocked coordinator node (or restart 
> the entire cluster). On restart, a different node may win the election and 
> bootstrap successfully — but this is not guaranteed and the same deadlock can 
> recur.
> *Reliable workaround:* Stagger node startups by 30+ seconds so that the first 
> node fully initializes before the next starts. This ensures the coordinator 
> election is won by a node that has already completed initialization. This is 
> impractical for automated deployment pipelines that manage clusters 
> declaratively.
> ----
> h2. Impact
>  * *Severity:* Critical for automated/orchestrated deployments
>  * *Data Loss:* None (the cluster never forms, so no data is processed or 
> lost)
>  * *Recovery:* Requires manual intervention (restart)
>  * *Scope:* Any NiFi cluster where nodes are started simultaneously (common 
> in CI/CD pipelines, container orchestrators, Ansible/Terraform deployments, 
> and auto-scaling groups)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
