[ https://issues.apache.org/jira/browse/NIFI-15833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18073494#comment-18073494 ]
Dan W commented on NIFI-15833:
------------------------------
I'd also note that the issue doesn't appear to be limited to simultaneous
startup scenarios. We've observed what looks like the same underlying problem —
stale coordinator election state — surfacing after transient network delays and
other incidental disruptions during normal cluster operation. Even if we
introduced a mandatory staggered startup between nodes in our deployment
pipeline, that wouldn't address the core issue: the cluster coordination
protocol has no mechanism to detect or recover from a node sending heartbeats
to a non-coordinator. Once a node latches onto the wrong coordinator — whether
from a race at startup or a momentary network hiccup — it stays there
permanently with no errors and no self-correction.
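A sketch of the kind of check that appears to be missing: a heartbeater that re-resolves the coordinator from the election state before each beat instead of latching onto a cached address. All names here are illustrative stand-ins, not NiFi's actual API; the resolver is a placeholder for a ZooKeeper lookup.

```java
import java.util.function.Supplier;

// Illustrative sketch only: a heartbeat sender that re-resolves the
// coordinator before each heartbeat and corrects a stale cached target,
// rather than permanently sending heartbeats to a non-coordinator.
class HeartbeatTargetCheck {

    private String cachedCoordinator;
    private final Supplier<String> resolveCoordinator; // stand-in for a ZooKeeper lookup

    HeartbeatTargetCheck(String initial, Supplier<String> resolver) {
        this.cachedCoordinator = initial;
        this.resolveCoordinator = resolver;
    }

    // Returns the address the next heartbeat should go to, correcting drift
    // between the cached target and the authoritative election state.
    String nextHeartbeatTarget() {
        String current = resolveCoordinator.get();
        if (!current.equals(cachedCoordinator)) {
            cachedCoordinator = current;   // self-correct instead of latching
        }
        return cachedCoordinator;
    }
}
```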
> When all nodes of a NiFi cluster are started simultaneously (e.g., via
> automated deployment tooling), the node that wins the ZooKeeper Cluster
> Coordinator election can enter a permanent deadlock.
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: NIFI-15833
> URL: https://issues.apache.org/jira/browse/NIFI-15833
> Project: Apache NiFi
> Issue Type: Bug
> Components: Core Framework
> Affects Versions: 2.4.0, 2.5.0, 2.6.0, 2.7.0, 2.8.0
> Environment: OS: Red Hat Enterprise Linux 8 (x86_64)
> Java: OpenJDK 21.0.10+7 (openjdk-21.0.10.0.7-1.el8.x86_64)
> NiFi: Apache NiFi 2.8.0
> ZooKeeper: Embedded (bundled with NiFi)
> Cluster: 5 nodes, 8 vCPUs / 15Gi RAM each, JVM heap -Xms7G -Xmx7G
> Deployment: Ansible-automated simultaneous start across all nodes
> Reporter: Dan W
> Priority: Major
>
> h1. Cluster Coordinator Self-Connection Deadlock During Simultaneous Node
> Startup
> h2. Summary
> When all nodes of a NiFi cluster are started simultaneously (e.g., via
> automated deployment tooling), the node that wins the ZooKeeper Cluster
> Coordinator election can enter a permanent deadlock. The elected coordinator
> attempts to join the cluster by sending a connection request to the
> coordinator — which is itself — and
> {{AbstractNodeProtocolSender.validateNotConnectingToSelf()}} throws an
> {{UnknownServiceAddressException}}, blocking the join. The node retries every
> ~5 seconds indefinitely. It never completes initialization, never accepts
> connection requests from other nodes, and the entire cluster fails to form.
> *This is a permanent deadlock with no self-recovery mechanism.* Manual
> intervention (restart) is the only way out.
> *Version:* Apache NiFi 2.8.0, Java 21 (openjdk-21.0.10.0.7), embedded
> ZooKeeper
> *Cluster Size:* 5 nodes (reproduced in a production environment)
> ----
> h2. Environment
> * 5-node NiFi cluster, all nodes identical hardware/configuration
> * Embedded ZooKeeper (all 5 nodes participate)
> * OIDC authentication via external identity provider
> * Deployment is automated: all 5 nodes are stopped, flow definition is
> deployed, and all 5 nodes are started within the same second
> * JVM: {{-Xms7G -Xmx7G}} per node
> * Relevant {{nifi.properties}}:
> {{nifi.cluster.is.node=true
> nifi.cluster.node.protocol.port=11443
> nifi.cluster.node.read.timeout=5 sec
> nifi.cluster.node.connection.timeout=5 sec}}
> ----
> h2. Reproduction Steps
> # Configure a 5-node NiFi 2.8.0 cluster with embedded ZooKeeper
> # Stop all 5 nodes simultaneously
> # Start all 5 nodes simultaneously (within the same second)
> # Observe cluster state via API: {{GET /nifi-api/controller/cluster}}
> *Expected:* All 5 nodes connect and form a cluster within a few minutes.
> *Actual:* The cluster never forms. The API returns HTTP 409 ("The Flow
> Controller is initializing the Data Flow") indefinitely. Cluster summary
> reports 0 connected nodes.
> This is reliably reproducible when all nodes start within a tight window. It
> does NOT occur when nodes are started with a stagger (e.g., 30+ seconds
> apart), because an already-initialized node wins the coordinator election and
> can accept connection requests normally.
> ----
> h2. Root Cause Analysis
> h3. The Deadlock
> The bug is in the cluster join sequence during startup. Here is the exact
> chain of events:
> # All 5 nodes start simultaneously and begin their initialization sequence
> # Embedded ZooKeeper forms a quorum and holds a leader election for the
> "Cluster Coordinator" role
> # *node-1* wins the election and becomes Cluster Coordinator
> # node-1's {{[main]}} thread enters {{StandardFlowService.load()}} →
> {{StandardFlowService.connect()}}, which calls
> {{NodeProtocolSenderListener.requestConnection()}} to join the cluster
> # {{AbstractNodeProtocolSender.requestConnection()}} calls
> {{validateNotConnectingToSelf()}} at line 119
> # This method detects that the coordinator address resolved from ZooKeeper
> ({{node-1:11443}}) matches the local node
> # It throws {{UnknownServiceAddressException}}:
> {{Cluster Coordinator is currently node-1.example.com:11443, which is this
> node,
> but connecting to self is not allowed at this phase of the lifecycle.
> This node must wait for a new Cluster Coordinator to be elected before
> connecting to the cluster.}}
> # {{StandardFlowService.connect()}} catches the exception, waits ~5 seconds,
> and retries
> # *Goto step 4.* The coordinator election does not change because node-1 is
> still running and holding the ZooKeeper ephemeral node.
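> The circular retry in steps 4-8 can be sketched as a toy simulation. Class and method names below are illustrative stand-ins, not NiFi's actual API; the real logic lives in {{AbstractNodeProtocolSender}} and {{StandardFlowService}}.

```java
// Toy model of the startup deadlock: the elected coordinator fails its own
// self-check on every attempt, and the election never changes underneath it.
class SelfConnectDeadlockSim {

    // Coordinator elected via ZooKeeper; it never changes while the
    // winner's ephemeral session stays alive.
    static final String ELECTED_COORDINATOR = "node-1";

    // Mirrors validateNotConnectingToSelf(): reject when coordinator == self.
    static void validateNotConnectingToSelf(String localNode) {
        if (localNode.equals(ELECTED_COORDINATOR)) {
            throw new IllegalStateException("connecting to self is not allowed");
        }
    }

    // Mirrors the connect() retry loop: returns the attempt number on which
    // the connection request could be sent, or -1 if the budget runs out.
    // The real loop has NO limit, so on the coordinator it spins forever.
    static int attemptConnection(String localNode, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                validateNotConnectingToSelf(localNode);
                return attempt;   // would send CONNECTION_REQUEST here
            } catch (IllegalStateException e) {
                // caught; real code waits ~5 s, then retries (back to step 4)
            }
        }
        return -1;                // never connected: the deadlock
    }
}
```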
> h3. Why It Never Recovers
> The deadlock is permanent because of a circular dependency:
> * *node-1 cannot join the cluster* because {{validateNotConnectingToSelf()}}
> prevents it from sending a connection request to itself
> * *node-1 will not relinquish the coordinator role* because it is still
> running and its ZooKeeper session is active
> * *A new coordinator will not be elected* because ZooKeeper has no reason to
> re-elect — the current leader's session is alive
> * *The other 4 nodes cannot form the cluster* because the coordinator
> (node-1) never completed initialization and never processes their connection
> requests
> The error message says "This node must wait for a new Cluster Coordinator to
> be elected" — but that will never happen without external intervention.
> h3. Observed Cluster State After 2+ Hours
> From the application logs, here is the actual cluster state observed 2+ hours
> after startup:
> *node-1 (elected coordinator):*
> * Stuck on {{[main]}} thread in {{StandardFlowService.connect()}} retry loop
> * Logged the {{validateNotConnectingToSelf}} exception *180 times* (every ~5
> seconds for 2+ hours)
> * Receiving heartbeats from node-3 and node-5 (they think node-1 is
> coordinator), but cannot process them meaningfully because initialization
> never completed
> * API returns HTTP 409 on all endpoints
> *node-2:*
> * Also elected itself as coordinator (split-brain — see below)
> * Receiving heartbeats from node-4
> * Sending heartbeats to itself
> * No WARN/ERROR in logs, but cluster still shows 0 connected nodes
> * API returns HTTP 409
> *node-3:* Sending heartbeats to node-1 (stuck coordinator). Never joined the
> cluster.
> *node-4:* Sending heartbeats to node-2. Never joined the cluster.
> *node-5:* Sending heartbeats to node-1 (stuck coordinator). Never joined the
> cluster.
> h3. Secondary Issue: Split-Brain Coordinator Election
> During the deadlock, we observed what appears to be a split-brain in the
> ZooKeeper coordinator election:
> * node-1 believes it is the coordinator (from initial election)
> * node-2 also believes it is the coordinator (possibly from a secondary
> election after ZooKeeper session instability)
> * node-3 and node-5 send heartbeats to node-1
> * node-4 sends heartbeats to node-2
> This may be a consequence of the primary deadlock causing ZooKeeper session
> timeouts on some nodes while node-1's session remains active, leading to
> inconsistent election results across the quorum.
> ----
> h2. Stack Trace (from node-1, repeated 180 times over 2+ hours)
>
> {{WARN [main] o.a.nifi.controller.StandardFlowService Failed to connect to
> cluster
> org.apache.nifi.cluster.protocol.UnknownServiceAddressException: Cluster
> Coordinator is currently node-1.example.com:11443, which is this node, but
> connecting to self is not allowed at this phase of the lifecycle. This node
> must wait for a new Cluster Coordinator to be elected before connecting to
> the cluster.
> at
> o.a.n.cluster.protocol.AbstractNodeProtocolSender.validateNotConnectingToSelf(AbstractNodeProtocolSender.java:119)
> at
> o.a.n.cluster.protocol.AbstractNodeProtocolSender.requestConnection(AbstractNodeProtocolSender.java:74)
> at
> o.a.n.cluster.protocol.impl.NodeProtocolSenderListener.requestConnection(NodeProtocolSenderListener.java:91)
> at
> o.a.n.controller.StandardFlowService.connect(StandardFlowService.java:825)
> at
> o.a.n.controller.StandardFlowService.load(StandardFlowService.java:449)
> at o.a.n.web.server.JettyServer.start(JettyServer.java:842)
> at o.a.n.runtime.Application.startServer(Application.java:131)
> at o.a.n.runtime.Application.run(Application.java:78)
> at o.a.n.runtime.Application.run(Application.java:60)
> at org.apache.nifi.NiFi.main(NiFi.java:42)}}
> *Meanwhile, node-1 is also processing heartbeats from other nodes:*
>
> {{INFO [Process Cluster Protocol Request-32]
> o.a.n.c.p.impl.SocketProtocolListener
> Finished processing request (type=HEARTBEAT, length=6060 bytes) from
> node-5:8443 in 44 millis
> INFO [Process Cluster Protocol Request-33]
> o.a.n.c.p.impl.SocketProtocolListener
> Finished processing request (type=HEARTBEAT, length=6060 bytes) from
> node-3:8443 in 45 millis
> WARN [main] o.a.nifi.controller.StandardFlowService Failed to connect to
> cluster
> ...connecting to self is not allowed at this phase of the lifecycle...
> INFO [Process Cluster Protocol Request-34]
> o.a.n.c.p.impl.SocketProtocolListener
> Finished processing request (type=HEARTBEAT, length=6052 bytes) from
> node-5:8443 in 45 millis}}
> The node is receiving heartbeats and responding to protocol requests on
> background threads, but the {{[main]}} thread is stuck in the retry loop and
> never completes {{JettyServer.start()}}.
> ----
> h2. Affected Code
> The bug originates in
> {{AbstractNodeProtocolSender.validateNotConnectingToSelf()}}:
> # *{{AbstractNodeProtocolSender.java:119}}* —
> {{validateNotConnectingToSelf()}} throws {{UnknownServiceAddressException}}
> when the coordinator address matches the local node. This check exists to
> prevent circular connection requests, but it does not account for the case
> where the coordinator IS the node that needs to bootstrap the cluster.
> # *{{AbstractNodeProtocolSender.java:74}}* — {{requestConnection()}} calls
> {{validateNotConnectingToSelf()}} unconditionally before sending the
> connection request.
> # *{{StandardFlowService.java:825}}* — {{connect()}} catches the exception
> and retries, but has no fallback logic for the self-coordinator case. No
> retry limit. No coordinator resignation mechanism.
> # *{{StandardFlowService.java:449}}* — {{load()}} calls {{connect()}} during
> initialization, blocking the {{[main]}} thread (and therefore
> {{JettyServer.start()}}) until the connection succeeds — which it never
> does.
> ----
> h2. Suggested Fix
> The coordinator node should be able to bootstrap itself without sending a
> network connection request. When {{validateNotConnectingToSelf()}} detects
> that the current node IS the coordinator, instead of throwing an exception,
> the node should handle its own connection locally:
> h3. Option A: Local Self-Connection (Preferred)
> In {{AbstractNodeProtocolSender.requestConnection()}}, when the
> coordinator resolves to self, bypass the network send and instead invoke the
> coordinator's connection handling logic directly (the same code path that
> processes incoming {{CONNECTION_REQUEST}} messages from other nodes). The
> coordinator can admit itself as the first cluster member, complete
> initialization, and then accept connection requests from the remaining nodes
> normally.
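> A minimal sketch of this idea, under the assumption of simplified interfaces (the class, method, and member names below are hypothetical, not NiFi's actual code):

```java
import java.util.Set;
import java.util.concurrent.ConcurrentSkipListSet;

// Sketch of Option A: when the coordinator resolves to self, skip the
// socket send and invoke the local CONNECTION_REQUEST handling directly.
class SelfBootstrapSketch {

    // Stand-in for the coordinator's view of admitted cluster members.
    static final Set<String> clusterMembers = new ConcurrentSkipListSet<>();

    // Stand-in for the handler that processes incoming CONNECTION_REQUESTs.
    static void handleConnectionRequestLocally(String nodeId) {
        clusterMembers.add(nodeId);   // admit the node as a cluster member
    }

    // Replacement for the unconditional self-check in requestConnection().
    static boolean requestConnection(String localNode, String coordinator) {
        if (localNode.equals(coordinator)) {
            // Instead of throwing UnknownServiceAddressException, admit
            // ourselves locally: the coordinator becomes the first member
            // and can finish initialization.
            handleConnectionRequestLocally(localNode);
            return true;
        }
        // Otherwise: send CONNECTION_REQUEST over the socket as today
        // (toy stand-in; the real path goes through the protocol socket).
        handleConnectionRequestLocally(localNode);
        return true;
    }
}
```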
> h3. Option B: Coordinator Resignation with Backoff
> If the coordinator detects it cannot connect to itself during the
> initialization phase, it should:
> # Relinquish the ZooKeeper coordinator ephemeral node
> # Wait a randomized backoff period
> # Allow a different (possibly already-initialized) node to win the election
> # Retry connection to the new coordinator
> This is less ideal because it adds latency and still depends on another node
> being ready, but it breaks the deadlock.
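> A toy sketch of the resignation path (again with hypothetical names; the election model is deliberately simplified to "the next live participant wins"):

```java
import java.util.concurrent.ThreadLocalRandom;

// Sketch of Option B: resign the coordinator role, back off with jitter,
// and let a different node win the re-election.
class CoordinatorResignSketch {

    // Randomized backoff so resigning nodes do not immediately re-race
    // each other in the next election.
    static long backoffMillis(long baseMillis, long jitterMillis) {
        return baseMillis + ThreadLocalRandom.current().nextLong(jitterMillis);
    }

    // Toy election: after the stuck coordinator resigns, the first
    // remaining participant wins; null if nobody else is left.
    static String resignAndReelect(String[] nodes, String resigned) {
        for (String n : nodes) {
            if (!n.equals(resigned)) {
                return n;
            }
        }
        return null;
    }
}
```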
> h3. Option C: Retry Limit with Forced Re-Election
> Add a maximum retry count to the {{StandardFlowService.connect()}} loop.
> After N failed attempts (e.g., 12 attempts = ~60 seconds), the node should
> forcibly resign the coordinator role by closing its ZooKeeper leader election
> participation, wait for a new election, and retry.
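> The bounded loop could look roughly like this (hypothetical names; the real retry loop is in {{StandardFlowService.connect()}}, and the waiting/re-election steps are elided):

```java
// Sketch of Option C: cap the connect() retry loop and force re-election
// once the budget is exhausted, instead of retrying forever.
class RetryLimitSketch {

    static final int MAX_ATTEMPTS = 12;   // ~60 s at the 5 s retry interval

    // Returns "connected" if the self-check ever passes, or "re-elect"
    // once the retry budget is exhausted.
    static String connectWithLimit(String localNode, String coordinator) {
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            if (!localNode.equals(coordinator)) {
                return "connected";       // self-check passed; send the request
            }
            // self-check failed: wait ~5 s (elided) and retry
        }
        // Budget exhausted: close ZooKeeper leader-election participation,
        // wait for a new election, then retry against the new coordinator.
        return "re-elect";
    }
}
```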
> ----
> h2. Current Workaround
> The only workaround is to restart the deadlocked coordinator node (or restart
> the entire cluster). On restart, a different node may win the election and
> bootstrap successfully — but this is not guaranteed and the same deadlock can
> recur.
> *Reliable workaround:* Stagger node startups by 30+ seconds so that the first
> node fully initializes before the next starts. This ensures the coordinator
> election is won by a node that has already completed initialization. This is
> impractical for automated deployment pipelines that manage clusters
> declaratively.
> ----
> h2. Impact
> * *Severity:* Critical for automated/orchestrated deployments
> * *Data Loss:* None (the cluster never forms, so no data is processed or
> lost)
> * *Recovery:* Requires manual intervention (restart)
> * *Scope:* Any NiFi cluster where nodes are started simultaneously (common
> in CI/CD pipelines, container orchestrators, Ansible/Terraform deployments,
> and auto-scaling groups)
--
This message was sent by Atlassian Jira
(v8.20.10#820010)