Mark Payne created NIFI-16006:
---------------------------------

             Summary: Cluster coordinator can disconnect a freshly-rejoined node
                 Key: NIFI-16006
                 URL: https://issues.apache.org/jira/browse/NIFI-16006
             Project: Apache NiFi
          Issue Type: Bug
            Reporter: Mark Payne


When a NiFi cluster node is disconnected through the REST API (PUT 
/controller/cluster/nodes/\{id} with status DISCONNECTING), a heartbeat from 
that node can arrive at the coordinator immediately afterward, because the 
heartbeat was created and dispatched on the node before the node received the 
disconnection notification. In that case AbstractHeartbeatMonitor logs:

Ignoring received heartbeat from disconnected node <host:port>. Node was 
disconnected due to [User Disconnected Node]. Issuing disconnection request.

and enqueues a new DISCONNECTION_REQUEST directed at that node. 
NodeClusterCoordinator's Disconnect <nodeId> thread retries the delivery 
indefinitely until the notification is successfully received, logging Failed to 
notify <host:port> that it has been disconnected on each failed attempt.

If the node's JVM is stopped before the retry succeeds and then restarted, the 
new JVM sends a Cluster Connection Request to the coordinator, is accepted as 
CONNECTING, and within milliseconds the queued DISCONNECTION_REQUEST 
(originally intended for the old JVM) is finally delivered to it. 
StandardFlowService on the new JVM processes the disconnection-notification and 
flips the node to "Not Clustered", but not before the node's first heartbeat 
reaches the coordinator. The coordinator therefore observes the node as 
CONNECTED for one heartbeat cycle and only catches up to reality when it times 
out the missing heartbeats (~16-17 seconds with the default 2-second heartbeat 
interval and a threshold of 8 missed heartbeats).
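
The timeout figure above follows from simple arithmetic. A minimal sketch, assuming the 2-second interval and 8-heartbeat threshold quoted in this report; HeartbeatTimeoutSketch and its method are hypothetical names, not NiFi classes. Any delay beyond the computed threshold is presumably the latency of the coordinator's periodic check, which is an assumption here.

```java
import java.time.Duration;

// Hypothetical helper, not NiFi code: models the timeout arithmetic only.
final class HeartbeatTimeoutSketch {

    // Silence the coordinator tolerates before disconnecting a node for
    // "Lack of Heartbeat": interval multiplied by the missed-heartbeat threshold.
    static Duration missedHeartbeatThreshold(Duration heartbeatInterval, int missableHeartbeats) {
        return heartbeatInterval.multipliedBy(missableHeartbeats);
    }

    public static void main(String[] args) {
        Duration threshold = missedHeartbeatThreshold(Duration.ofSeconds(2), 8);
        System.out.println(threshold.getSeconds()); // prints 16
    }
}
```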

Net effect: any observer that reads cluster state during the brief 
false-CONNECTED window — including REST clients, the UI, and system-test 
helpers such as NiFiSystemIT.waitForAllNodesConnected() — sees a healthy 
cluster and proceeds. The cluster silently degrades several seconds later with 
the node DISCONNECTED for "Lack of Heartbeat" and no auto-reconnect.

h3. Root cause:

Two cooperating issues; fixing either one alone would prevent the bug:

# NodeClusterCoordinator does not cancel pending DISCONNECTION_REQUEST retry 
  attempts when it receives a fresh Connection Request from the same node. The 
  retry succeeds against the new JVM and is interpreted as a legitimate 
  user-issued disconnect.
# StandardFlowService does not track a connection-generation identifier that 
  would let it discard a DISCONNECTION_REQUEST that is older than its current 
  Connection Request. As a result, the freshly-joined node blindly processes the 
  stale message and disconnects itself.
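
The two missing safeguards can be sketched as follows. This is only an illustration under stated assumptions, not the actual NiFi code or a proposed patch: CoordinatorSketch, NodeSketch, and all of their methods are hypothetical, and the real classes carry far more state. The coordinator side drops any pending disconnect retry when a new connection request arrives; the node side ignores a disconnect stamped with an older generation than its current connection.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for the coordinator; not a NiFi class.
final class CoordinatorSketch {
    private final AtomicLong generationCounter = new AtomicLong();
    // nodeId -> generation assigned at the node's most recent connection request
    private final Map<String, Long> currentGeneration = new ConcurrentHashMap<>();
    // nodeId -> generation a still-undelivered disconnection request was issued for
    private final Map<String, Long> pendingDisconnects = new ConcurrentHashMap<>();

    // A fresh connection request gets a new generation and cancels stale retries.
    long onConnectionRequest(String nodeId) {
        long generation = generationCounter.incrementAndGet();
        currentGeneration.put(nodeId, generation);
        pendingDisconnects.remove(nodeId); // safeguard 1: cancel pending retries
        return generation;
    }

    // Records a disconnection request that will be retried until delivered.
    void requestDisconnect(String nodeId) {
        Long generation = currentGeneration.get(nodeId);
        if (generation != null) {
            pendingDisconnects.put(nodeId, generation);
        }
    }

    // Generation to stamp on the next retry, or null if nothing is pending.
    Long pendingDisconnectGeneration(String nodeId) {
        return pendingDisconnects.get(nodeId);
    }
}

// Hypothetical stand-in for the node-side flow service; not a NiFi class.
final class NodeSketch {
    private long connectionGeneration;
    private boolean clustered;

    void onConnected(long generation) {
        connectionGeneration = generation;
        clustered = true;
    }

    // Safeguard 2: a request older than the current connection is discarded.
    // Returns true only if the disconnect was applied.
    boolean onDisconnectionRequest(long requestGeneration) {
        if (requestGeneration < connectionGeneration) {
            return false;
        }
        clustered = false;
        return true;
    }

    boolean isClustered() {
        return clustered;
    }
}
```

In this sketch either safeguard is sufficient on its own: with the first, the stale request is never resent; with the second, the restarted node rejects it even if it is.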

h3. Reproduction steps:

# Start a 2-node NiFi cluster.
# Confirm both nodes are CONNECTED.
# PUT /nifi-api/controller/cluster/nodes/\{node2Id} with status DISCONNECTING.
# Immediately kill the node 2 JVM (do not wait for the disconnect retries on the 
  coordinator to settle).
# Restart the node 2 JVM.
# Poll GET /nifi-api/controller/cluster/nodes from the coordinator: node 2 
  reports CONNECTED briefly, then transitions to DISCONNECTED with disconnect 
  reason Lack of Heartbeat ~16-17 seconds later. Nothing reconnects it.
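
When scripting the final polling step, the failure signature is a CONNECTED reading followed later by a DISCONNECTED one. A minimal sketch of that check, assuming the node 2 status strings have already been extracted from each poll response (the HTTP calls and JSON parsing are omitted); FalseConnectedDetector is a hypothetical helper, not part of NiFi or its system tests.

```java
import java.util.List;

// Hypothetical helper: classifies the statuses polled for a single node.
final class FalseConnectedDetector {

    // True if the node was observed CONNECTED and at some later poll
    // DISCONNECTED, which is the pattern this issue produces after a restart.
    static boolean sawFalseConnectedWindow(List<String> polledStatuses) {
        boolean seenConnected = false;
        for (String status : polledStatuses) {
            if ("CONNECTED".equals(status)) {
                seenConnected = true;
            } else if (seenConnected && "DISCONNECTED".equals(status)) {
                return true;
            }
        }
        return false;
    }
}
```

A check like this only reports the problem if polling continues past the heartbeat timeout; a helper that returns on the first CONNECTED reading will miss it.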



