Mark Payne created NIFI-16006:
---------------------------------
Summary: Cluster coordinator can disconnect a freshly-rejoined node
Key: NIFI-16006
URL: https://issues.apache.org/jira/browse/NIFI-16006
Project: Apache NiFi
Issue Type: Bug
Reporter: Mark Payne
When a NiFi cluster node is disconnected through the REST API (PUT
/controller/cluster/nodes/{id} with status DISCONNECTING), a heartbeat from
that node can arrive at the coordinator immediately afterward, because the
heartbeat was created and dispatched on the node before the node received the
disconnection notification. When that happens, AbstractHeartbeatMonitor logs:
Ignoring received heartbeat from disconnected node <host:port>. Node was
disconnected due to [User Disconnected Node]. Issuing disconnection request.
and enqueues a new DISCONNECTION_REQUEST directed at that node.
NodeClusterCoordinator's "Disconnect <nodeId>" thread retries the delivery
indefinitely until the notification is successfully received, logging "Failed to
notify <host:port> that it has been disconnected" on each failed attempt.
If the node's JVM is stopped before the retry succeeds and then restarted, the
new JVM sends a Cluster Connection Request to the coordinator, is accepted as
CONNECTING, and within milliseconds the queued DISCONNECTION_REQUEST
(originally intended for the old JVM) is finally delivered to it.
StandardFlowService on the new JVM processes the disconnection-notification and
flips the node to "Not Clustered", but not before the node's first heartbeat
reaches the coordinator. The coordinator therefore observes the node as
CONNECTED for one heartbeat cycle and only catches up to reality when it times
out the missing heartbeats (~16-17 seconds with the default 2-second heartbeat
interval and 8x missing-heartbeat threshold: 2 s x 8 = 16 s).
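The timeout arithmetic can be checked with a few lines of Java. This is only an illustrative sketch: HeartbeatTimeout and missingHeartbeatThreshold are hypothetical names, not NiFi classes, and the 2-second interval and 8x threshold are the values quoted above. The threshold itself works out to 16 seconds; the extra second or so observed in practice would be consistent with the coordinator checking for missing heartbeats periodically rather than continuously, which is an assumption here.

```java
import java.time.Duration;

/** Hypothetical helper (not part of NiFi) that reproduces the timeout arithmetic. */
class HeartbeatTimeout {

    /** How long a node may stay silent before its heartbeats count as missing. */
    static Duration missingHeartbeatThreshold(Duration heartbeatInterval, int missableHeartbeats) {
        return heartbeatInterval.multipliedBy(missableHeartbeats);
    }

    public static void main(String[] args) {
        // Values taken from the description: 2-second interval, 8x threshold.
        Duration threshold = missingHeartbeatThreshold(Duration.ofSeconds(2), 8);
        System.out.println(threshold.getSeconds() + " seconds"); // prints "16 seconds"
    }
}
```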
Net effect: any observer that reads cluster state during the brief
false-CONNECTED window — including REST clients, the UI, and system-test
helpers such as NiFiSystemIT.waitForAllNodesConnected() — sees a healthy
cluster and proceeds. The cluster silently degrades several seconds later with
the node DISCONNECTED for "Lack of Heartbeat" and no auto-reconnect.
h3. Root cause:
Two cooperating issues; fixing either one alone would prevent the bug:
# NodeClusterCoordinator does not cancel pending DISCONNECTION_REQUEST retry
  attempts when it receives a fresh Connection Request from the same node. The
  retry succeeds against the new JVM and is interpreted as a legitimate
  user-issued disconnect.
# StandardFlowService does not track a connection-generation identifier that
  would let it discard a DISCONNECTION_REQUEST that is older than its current
  Connection Request. As a result, the freshly-joined node blindly processes the
  stale message and disconnects itself.
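A minimal sketch of how fixes for the two issues above might look, not NiFi's actual code. PendingDisconnects, ConnectionGenerationGuard, and all of their methods are hypothetical names. The generation is assumed to be any monotonically increasing number the node attaches to each Connection Request (for example the time the request was sent) and that the coordinator would echo back on every DISCONNECTION_REQUEST aimed at that connection.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical coordinator-side registry of in-flight disconnect retry loops. */
class PendingDisconnects {
    private final Map<String, AtomicBoolean> cancelFlags = new ConcurrentHashMap<>();

    /** Called when a retry loop starts; the loop would check the returned flag before each attempt. */
    AtomicBoolean register(String nodeId) {
        AtomicBoolean flag = new AtomicBoolean(false);
        AtomicBoolean previous = cancelFlags.put(nodeId, flag);
        if (previous != null) {
            previous.set(true); // an older loop for the same node is superseded
        }
        return flag;
    }

    /** Called when a fresh Connection Request arrives; returns true if a retry loop was cancelled. */
    boolean cancelFor(String nodeId) {
        AtomicBoolean flag = cancelFlags.remove(nodeId);
        if (flag == null) {
            return false;
        }
        flag.set(true);
        return true;
    }
}

/** Hypothetical node-side guard that discards disconnect requests from an older connection. */
class ConnectionGenerationGuard {
    private final AtomicLong currentGeneration = new AtomicLong(Long.MIN_VALUE);

    /** Record the generation sent with the node's latest Connection Request. */
    void onConnectionRequestSent(long generation) {
        currentGeneration.set(generation);
    }

    /** A disconnect request is honored only if it is not older than the current connection. */
    boolean shouldProcessDisconnect(long requestGeneration) {
        return requestGeneration >= currentGeneration.get();
    }
}
```

In the scenario described here, the old JVM's connection would carry the lower generation, so the restarted node would drop the queued request instead of flipping itself to "Not Clustered". Either guard on its own would be enough to break the sequence.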
h3. Reproduction steps:
# Start a 2-node NiFi cluster.
# Confirm both nodes are CONNECTED.
# PUT /nifi-api/controller/cluster/nodes/{node2Id} with status DISCONNECTING.
# Immediately kill the node 2 JVM (do not wait for the disconnect retries on the
  coordinator to settle).
# Restart the node 2 JVM.
# Poll GET /nifi-api/controller/cluster/nodes from the coordinator: node 2
  reports CONNECTED briefly, then transitions to DISCONNECTED with disconnect
  reason Lack of Heartbeat ~16-17 seconds later. Nothing reconnects it.
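The REST call and the polling check in the steps above could be scripted roughly as follows. This is a hedged sketch: ReproHelper and its methods are hypothetical, the base URL assumes an unsecured cluster, and the JSON body shape (a "node" object with "nodeId" and "status") is an assumption about the REST entity rather than something stated in this report. Killing and restarting the node 2 JVM remains a manual step.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.List;

/** Hypothetical helper for scripting the reproduction; not part of NiFi. */
class ReproHelper {

    /** JSON body for the disconnect call. The field names are an assumption. */
    static String disconnectBody(String nodeId) {
        return "{\"node\":{\"nodeId\":\"" + nodeId + "\",\"status\":\"DISCONNECTING\"}}";
    }

    /** Builds (but does not send) the PUT request that asks the coordinator to disconnect a node. */
    static HttpRequest disconnectRequest(String baseUrl, String nodeId) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/nifi-api/controller/cluster/nodes/" + nodeId))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(disconnectBody(nodeId)))
                .build();
    }

    /**
     * Given node 2's status from successive polls, reports whether the node was
     * seen as CONNECTED and then later as DISCONNECTED.
     */
    static boolean showsFalseConnectedWindow(List<String> polledStatuses) {
        int firstConnected = polledStatuses.indexOf("CONNECTED");
        return firstConnected >= 0
                && polledStatuses.subList(firstConnected + 1, polledStatuses.size()).contains("DISCONNECTED");
    }
}
```

The request can be sent with java.net.http.HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()); a secured cluster would additionally need authentication, which is omitted here.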