Ravikumar M created NIFI-15839:
----------------------------------

             Summary: Cluster nodes randomly disconnected due to Component 
Revision mismatch — causes full cluster downtime
                 Key: NIFI-15839
                 URL: https://issues.apache.org/jira/browse/NIFI-15839
             Project: Apache NiFi
          Issue Type: Bug
    Affects Versions: 2.5.0, 2.0.0-M2
         Environment: Tested on both 3-node and 5-node NiFi clusters running on 
Kubernetes with ZooKeeper, on m6a.12xlarge EC2 instances (48 vCPU / 192 GB)
            Reporter: Ravikumar M


We are experiencing random node disconnections caused by Component Revision 
count mismatches (off-by-one) between the Cluster Coordinator and other nodes. 
This occurs during minor canvas changes (stopping/starting a single processor) 
and sometimes with no user-initiated changes at all.

The coordinator's Heartbeat Monitor detects a Revision Update Count difference 
of exactly 1 and forces the node to reconnect:


WARN [Heartbeat Monitor Thread-1] o.a.n.c.c.node.NodeClusterCoordinator 
Requesting that <node> reconnect to the cluster due to: Node has a Revision 
Update Count of <N+1> but local value is only <N>
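To illustrate the failure mode: the coordinator appears to treat any positive 
revision-count difference as grounds for a forced reconnect. The sketch below 
is hypothetical and not actual NiFi code; it just contrasts the strict check 
we observe with a tolerance-aware variant of the kind we are asking about.

```java
// Hypothetical sketch only -- class and method names are illustrative,
// not part of the NiFi codebase.
public class RevisionCheck {

    // Returns true if the coordinator should force the node to reconnect.
    // The behavior we observe is effectively tolerance = 0: a difference
    // of exactly 1 is enough to trigger the reconnect.
    static boolean shouldReconnect(long nodeCount, long localCount,
                                   long tolerance) {
        return nodeCount - localCount > tolerance;
    }

    public static void main(String[] args) {
        // Off-by-one mismatch: the strict check forces a reconnect,
        // while a tolerance of 1 would give the node time to catch up.
        System.out.println(shouldReconnect(101, 100, 0)); // true
        System.out.println(shouldReconnect(101, 100, 1)); // false
    }
}
```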

Critical Impact — Coordinator Overload Causes Full Cluster Downtime:

When the coordinator forces multiple non-coordinator nodes to reconnect 
simultaneously, the reconnection load overwhelms the coordinator itself. Under 
this load, the coordinator also goes down, resulting in a complete cluster 
outage with no healthy nodes available to process traffic. This is the primary 
production impact — what starts as a single off-by-one revision mismatch 
cascades into total downtime.

We believe this is related to NIFI-8204 and NIFI-13885, though in our case the 
coordinator is healthy when the mismatch first occurs.

This is actively causing production downtime for us. Could the community advise 
on:

1. Is there a known fix or patch available for this in a newer version?
2. Would introducing a tolerance threshold or retry delay in the heartbeat 
revision validation be a viable fix?
3. Could reconnection requests be staggered to avoid overwhelming the 
coordinator?
4. Is there any recommended workaround or configuration change to mitigate 
this until a fix is available?
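On the staggering question, the idea we have in mind is roughly the 
following. This is a hypothetical sketch, not NiFi code: assign each 
reconnecting node an offset so reconnection requests do not all hit the 
coordinator at the same instant.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch only -- illustrates staggered reconnect scheduling,
// not an actual NiFi API.
public class StaggeredReconnect {

    // Assigns each of nodeCount nodes a reconnect delay spaced stepMillis
    // apart, so the coordinator handles one reconnection at a time.
    static List<Long> reconnectDelaysMillis(int nodeCount, long stepMillis) {
        List<Long> delays = new ArrayList<>();
        for (int i = 0; i < nodeCount; i++) {
            delays.add(i * stepMillis);
        }
        return delays;
    }

    public static void main(String[] args) {
        // Three nodes, 5 seconds apart instead of all at once.
        System.out.println(reconnectDelaysMillis(3, 5000)); // [0, 5000, 10000]
    }
}
```

A small random jitter on top of the fixed step would further reduce the 
chance of simultaneous reconnects; we kept the sketch deterministic for 
clarity.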
Any guidance would be greatly appreciated. Happy to provide additional logs or 
details if needed.

Thanks.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
