[ https://issues.apache.org/jira/browse/NIFI-12969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17832242#comment-17832242 ]
Nissim Shiman commented on NIFI-12969:
--------------------------------------

[~pvillard] This is an excellent observation, as the symptoms are very similar. I tried this on a 2.0.0-SNAPSHOT build that postdates the NIFI-12232 fix, but the issue remains. On closer inspection, the cause of this issue (in StandardConnection.java) appears to be the line immediately after the one that caused NIFI-12232, so that one may have been masking some occurrences of this one. But yes, you are correct that they are in the same area of code.

> Under heavy load, nifi node unable to rejoin cluster, graph modified with temp funnel
> -------------------------------------------------------------------------------------
>
>                 Key: NIFI-12969
>                 URL: https://issues.apache.org/jira/browse/NIFI-12969
>             Project: Apache NiFi
>          Issue Type: Bug
>    Affects Versions: 1.24.0, 2.0.0-M2
>            Reporter: Nissim Shiman
>            Assignee: Nissim Shiman
>            Priority: Major
>
> Under heavy load, if a node leaves the cluster (due to a heartbeat timeout), it is often unable to rejoin the cluster.
> The node's graph will also have been modified with a temp-funnel.
> Appears to be some sort of [timing issue|https://github.com/apache/nifi/blob/rel/nifi-2.0.0-M2/nifi-nar-bundles/nifi-framework-bundle/nifi-framework/nifi-framework-components/src/main/java/org/apache/nifi/connectable/StandardConnection.java#L298]
> # To reproduce, on a nifi cluster of three nodes, set up:
> 2 GenerateFlowFile processors -> PG
> Inside PG:
> inputPort -> UpdateAttribute
> # Keep all defaults except for the following:
> For UpdateAttribute, terminate the success relationship.
> One of the GenerateFlowFile processors can be disabled;
> the other one should have Run Schedule set to 0 min (this will allow for the heavy load).
> # In nifi.properties (on all 3 nodes), to allow nodes to fall out of the cluster, set:
> nifi.cluster.protocol.heartbeat.interval=2 sec (default is 5)
> nifi.cluster.protocol.heartbeat.missable.max=1 (default is 8)
> Restart nifi.
> Start flow. The nodes will quickly fall out of and rejoin the cluster.
> After a few minutes, one will likely be unable to rejoin. The graph for that node will show the disabled GenerateFlowFile pointing to a funnel (a temp-funnel) instead of the PG.
> The stack trace in that node's nifi-app.log will look like this (from 2.0.0-M2):
> {code:java}
> 2024-03-28 13:55:19,395 INFO [Reconnect to Cluster] o.a.nifi.controller.StandardFlowService Node disconnected due to Failed to properly handle Reconnection request due to org.apache.nifi.controller.serialization.FlowSynchronizationException: Failed to connect node to cluster because local flow controller partially updated. Administrator should disconnect node and review flow for corruption.
> 2024-03-28 13:55:19,395 ERROR [Reconnect to Cluster] o.a.nifi.controller.StandardFlowService Handling reconnection request failed due to: org.apache.nifi.controller.serialization.FlowSynchronizationException: Failed to connect node to cluster because local flow controller partially updated. Administrator should disconnect node and review flow for corruption.
> org.apache.nifi.controller.serialization.FlowSynchronizationException: Failed to connect node to cluster because local flow controller partially updated. Administrator should disconnect node and review flow for corruption.
>     at org.apache.nifi.controller.StandardFlowService.loadFromConnectionResponse(StandardFlowService.java:985)
>     at org.apache.nifi.controller.StandardFlowService.handleReconnectionRequest(StandardFlowService.java:655)
>     at org.apache.nifi.controller.StandardFlowService$1.run(StandardFlowService.java:384)
>     at java.base/java.lang.Thread.run(Thread.java:1583)
> Caused by: org.apache.nifi.controller.serialization.FlowSynchronizationException: java.lang.IllegalStateException: Cannot change destination of Connection because FlowFiles from this Connection are currently held by LocalPort[id=99213c00-78ca-4848-112f-5454cc20656b, type=INPUT_PORT, name=inputPort, group=innerPG]
>     at org.apache.nifi.controller.serialization.VersionedFlowSynchronizer.synchronizeFlow(VersionedFlowSynchronizer.java:472)
>     at org.apache.nifi.controller.serialization.VersionedFlowSynchronizer.sync(VersionedFlowSynchronizer.java:223)
>     at org.apache.nifi.controller.FlowController.synchronize(FlowController.java:1740)
>     at org.apache.nifi.persistence.StandardFlowConfigurationDAO.load(StandardFlowConfigurationDAO.java:91)
>     at org.apache.nifi.controller.StandardFlowService.loadFromBytes(StandardFlowService.java:805)
>     at org.apache.nifi.controller.StandardFlowService.loadFromConnectionResponse(StandardFlowService.java:954)
>     ... 3 common frames omitted
> Caused by: java.lang.IllegalStateException: Cannot change destination of Connection because FlowFiles from this Connection are currently held by LocalPort[id=99213c00-78ca-4848-112f-5454cc20656b, type=INPUT_PORT, name=inputPort, group=innerPG]
>     at org.apache.nifi.connectable.StandardConnection.setDestination(StandardConnection.java:299)
>     at org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.updateConnectionDestinations(StandardVersionedComponentSynchronizer.java:705)
>     at org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.synchronize(StandardVersionedComponentSynchronizer.java:423)
>     at org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.lambda$synchronize$0(StandardVersionedComponentSynchronizer.java:248)
>     at org.apache.nifi.controller.flow.AbstractFlowManager.withParameterContextResolution(AbstractFlowManager.java:638)
>     at org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.synchronize(StandardVersionedComponentSynchronizer.java:243)
>     at org.apache.nifi.groups.StandardProcessGroup.synchronizeFlow(StandardProcessGroup.java:3860)
>     at org.apache.nifi.controller.serialization.VersionedFlowSynchronizer.synchronizeFlow(VersionedFlowSynchronizer.java:464)
>     ... 8 common frames omitted
> 2024-03-28 13:55:19,395 INFO [Reconnect to Cluster] o.a.n.c.c.node.NodeClusterCoordinator machine-name-2.organization.org:8443 requested disconnection from cluster due to org.apache.nifi.controller.serialization.FlowSynchronizationException: Failed to connect node to cluster because local flow controller partially updated. Administrator should disconnect node and review flow for corruption.
> 2024-03-28 13:55:19,395 INFO [Reconnect to Cluster] o.a.n.c.c.node.NodeClusterCoordinator Status of <machine-name-2.organization>.org:8443 changed from NodeConnectionStatus[nodeId=<machine-name-2.organization>.org:8443, state=CONNECTING, updateId=852] to NodeConnectionStatus[nodeId=<machine-name-2.organization>.org:8443, state=DISCONNECTED, Disconnect Code=Node's Flow did not Match Cluster Flow, Disconnect Reason=org.apache.nifi.controller.serialization.FlowSynchronizationException: Failed to connect node to cluster because local flow controller partially updated. Administrator should disconnect node and review flow for corruption., updateId=854]
> {code}
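
To make the innermost IllegalStateException in the trace easier to reason about, here is a toy model of the kind of guard that message suggests. It is only a sketch built from the exception text. ToyPort, ToyConnection, the held counter, and tryRetarget are hypothetical names for illustration, not NiFi classes or the actual StandardConnection logic. It assumes the destination change is rejected whenever the current destination still holds FlowFiles taken from the connection, which would make the outcome depend on timing under heavy load.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class DestinationGuardSketch {

    /** Hypothetical stand-in for a destination such as the input port named in the trace. */
    static final class ToyPort {
        final String name;
        // FlowFiles this port has taken from the connection but not yet finished with (assumed model).
        final AtomicInteger held = new AtomicInteger();

        ToyPort(String name) {
            this.name = name;
        }
    }

    /** Hypothetical stand-in for a connection; this is not NiFi's StandardConnection. */
    static final class ToyConnection {
        private ToyPort destination;

        ToyConnection(ToyPort destination) {
            this.destination = destination;
        }

        synchronized ToyPort getDestination() {
            return destination;
        }

        synchronized void setDestination(ToyPort newDestination) {
            // Guard modeled on the exception message: refuse while the current destination holds FlowFiles.
            if (destination.held.get() > 0) {
                throw new IllegalStateException(
                        "Cannot change destination of Connection because FlowFiles from this Connection"
                                + " are currently held by " + destination.name);
            }
            destination = newDestination;
        }
    }

    /** Returns true if the destination was changed, false if the guard rejected the change. */
    static boolean tryRetarget(ToyConnection connection, ToyPort target) {
        try {
            connection.setDestination(target);
            return true;
        } catch (IllegalStateException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        ToyPort inputPort = new ToyPort("inputPort");
        ToyPort newDestination = new ToyPort("newDestination");
        ToyConnection connection = new ToyConnection(inputPort);

        inputPort.held.incrementAndGet(); // the port is mid-transfer when the change is attempted
        System.out.println("while held: " + tryRetarget(connection, newDestination));

        inputPort.held.decrementAndGet(); // the transfer has finished
        System.out.println("after release: " + tryRetarget(connection, newDestination));
    }
}
```

In this toy model the same call fails or succeeds depending only on whether the port happens to hold a FlowFile at that instant, which is the sort of window a source running with a 0 Run Schedule would keep open most of the time.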