Ravikumar M created NIFI-15839:
----------------------------------
Summary: Cluster nodes randomly disconnected due to Component
Revision mismatch — causes full cluster downtime
Key: NIFI-15839
URL: https://issues.apache.org/jira/browse/NIFI-15839
Project: Apache NiFi
Issue Type: Bug
Affects Versions: 2.5.0, 2.0.0-M2
Environment: Tested on both 3-node and 5-node NiFi clusters running on
Kubernetes with ZooKeeper, on EC2 m6a.12xlarge instances (48 vCPU / 192 GB
RAM)
Reporter: Ravikumar M
We are experiencing random node disconnections caused by Component Revision
count mismatches (off-by-one) between the Cluster Coordinator and other nodes.
This occurs during minor canvas changes (stopping/starting a single processor)
and sometimes with no user-initiated changes at all.
The coordinator's Heartbeat Monitor detects a Revision Update Count difference
of exactly 1 and forces the node to reconnect:
WARN [Heartbeat Monitor Thread-1] o.a.n.c.c.node.NodeClusterCoordinator
Requesting that <node> reconnect to the cluster due to: Node has a Revision
Update Count of <N+1> but local value is only <N>
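For clarity, the warning implies the coordinator compares the Revision Update
Count reported in a node's heartbeat against its own local count and requests
a reconnect on any positive difference. The sketch below is our simplified
reading of that behavior; the class and method names are ours, not NiFi's
actual source.

{code:java}
// Illustrative simplification of the check the warning above implies.
// Class, method, and parameter names here are ours, not NiFi's source.
public class HeartbeatRevisionCheck {

    /**
     * Compare the revision update count a node reports in its heartbeat
     * against the coordinator's local count.
     */
    public boolean shouldRequestReconnect(long reportedCount, long localCount) {
        // Any positive difference, even the off-by-one we observe,
        // appears to trigger a reconnect request.
        return reportedCount > localCount;
    }

    public static void main(String[] args) {
        HeartbeatRevisionCheck check = new HeartbeatRevisionCheck();
        // Node reports N+1 while the coordinator holds N: reconnect is forced.
        System.out.println(check.shouldRequestReconnect(101, 100)); // true
    }
}
{code}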
Critical Impact — Coordinator Overload Causes Full Cluster Downtime:
When the coordinator forces multiple non-coordinator nodes to reconnect
simultaneously, the reconnection load overwhelms the coordinator itself. Under
this load, the coordinator also goes down, resulting in a complete cluster
outage with no healthy nodes available to process traffic. This is the primary
production impact — what starts as a single off-by-one revision mismatch
cascades into total downtime.
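As an illustration of the staggering idea we raise in the questions below,
reconnection requests could be spread out with a base delay plus random jitter
per node instead of being issued simultaneously. The sketch below uses
entirely hypothetical names and is not NiFi code; it only shows the shape of
the mitigation.

{code:java}
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: stagger reconnection requests so the coordinator is
// not hit by every node at once. None of these names exist in NiFi.
public class StaggeredReconnector {

    private static final long BASE_DELAY_MS = 5_000;  // gap between successive nodes
    private static final long MAX_JITTER_MS = 2_000;  // random jitter per node

    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    public void requestReconnects(List<String> nodeIds) {
        long delay = 0;
        for (String nodeId : nodeIds) {
            // Each node gets a later slot than the previous one, plus jitter,
            // so reconnections arrive one at a time rather than as a burst.
            delay += BASE_DELAY_MS + ThreadLocalRandom.current().nextLong(MAX_JITTER_MS);
            scheduler.schedule(() -> sendReconnectRequest(nodeId), delay, TimeUnit.MILLISECONDS);
        }
    }

    private void sendReconnectRequest(String nodeId) {
        System.out.println("Requesting reconnect from " + nodeId);
    }
}
{code}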
We believe this is related to NIFI-8204 and NIFI-13885, though in our case the
coordinator is healthy when the mismatch first occurs.
This is actively causing production downtime for us. Could the community
advise on the following?
1. Is there a known fix or patch available for this in a newer version?
2. Would introducing a tolerance threshold or retry delay in the heartbeat
revision validation be a viable fix? (A rough sketch of the idea follows this
list.)
3. Could reconnection requests be staggered to avoid overwhelming the
coordinator, as sketched above?
4. Is there any recommended workaround or configuration change to mitigate
this until a fix is available?
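To make question 2 concrete, the kind of change we have in mind looks roughly
like the sketch below: tolerate a small, transient revision gap for a bounded
number of consecutive heartbeats before forcing a reconnect. All class,
method, and constant names are hypothetical; this illustrates the idea only
and is not a patch against NiFi's actual validation code.

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of a tolerance threshold in heartbeat revision
// validation. None of these names are NiFi's; this only illustrates the
// idea raised in question 2.
public class TolerantRevisionValidator {

    private static final long TOLERATED_GAP = 1;  // allow an off-by-one gap...
    private static final int  MAX_STRIKES   = 3;  // ...for up to 3 consecutive heartbeats

    private final Map<String, Integer> strikes = new ConcurrentHashMap<>();

    public boolean shouldRequestReconnect(String nodeId, long reportedCount, long localCount) {
        long gap = reportedCount - localCount;
        if (gap <= 0) {
            strikes.remove(nodeId);  // in sync (or behind): clear strikes
            return false;
        }
        if (gap <= TOLERATED_GAP) {
            // Small gap: possibly a transient ordering issue; give the node
            // a few heartbeats to converge before forcing a reconnect.
            int count = strikes.merge(nodeId, 1, Integer::sum);
            return count > MAX_STRIKES;
        }
        return true;  // large gap: request reconnect immediately
    }
}
{code}

A retry delay could equivalently be expressed as waiting MAX_STRIKES heartbeat
intervals before acting on the mismatch.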
Any guidance would be greatly appreciated. Happy to provide additional logs or
details if needed.
Thanks.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)