[jira] [Commented] (NIFI-12969) Under heavy load, nifi node unable to rejoin cluster, graph modified with temp funnel

Mark Payne (Jira) Wed, 03 Apr 2024 13:41:05 -0700


    [ 
https://issues.apache.org/jira/browse/NIFI-12969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17833738#comment-17833738
 ]


Mark Payne commented on NIFI-12969:
-----------------------------------

[~Nissim Shiman] [~pgyori] I pushed a PR that appears to address the issue. I 
believe you're on the right track: the situation is caused by the temp funnel 
being used incorrectly. But rather than trying to detect when that is about to 
happen and/or roll back, we should fix the underlying bug, which was in the 
logic that decides when the temp funnel is created. In this case, there should 
never be a temp funnel. In cases where we DO need a temp funnel, the existing 
logic should handle stopping the Port, which would make this work smoothly. The 
issue arose here because the Port was (rightly) left running. We just need to 
avoid creating the temp funnel unnecessarily.
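
To illustrate the failure mode, here is a minimal, self-contained sketch. It is 
a toy model, not NiFi's code: the class TempFunnelSketch, its nested types, the 
holdsFlowFiles flag, and the needsTempFunnel predicate are all hypothetical 
names made up for this example, and the predicate is only a placeholder for 
whatever condition the PR actually uses. The only detail taken from the report 
is the IllegalStateException message. It shows that rerouting a connection away 
from a running Port that holds FlowFiles fails, while skipping the unnecessary 
reroute leaves the connection untouched.

```java
public class TempFunnelSketch {

    /** Toy stand-in for a connectable component (port, funnel, processor). */
    static final class Component {
        final String id;
        // Assumption for this sketch: true while a running component still
        // holds FlowFiles pulled from the incoming connection.
        final boolean holdsFlowFiles;

        Component(String id, boolean holdsFlowFiles) {
            this.id = id;
            this.holdsFlowFiles = holdsFlowFiles;
        }
    }

    /** Toy connection that refuses to change destination while FlowFiles are held. */
    static final class Connection {
        private Component destination;

        Connection(Component destination) {
            this.destination = destination;
        }

        Component getDestination() {
            return destination;
        }

        void setDestination(Component newDestination) {
            if (destination.holdsFlowFiles) {
                // Message text mirrors the one in the reported stack trace.
                throw new IllegalStateException("Cannot change destination of Connection because "
                        + "FlowFiles from this Connection are currently held by " + destination.id);
            }
            destination = newDestination;
        }
    }

    /** Hypothetical placeholder predicate: reroute only if the destination really changes. */
    static boolean needsTempFunnel(Connection connection, String proposedDestinationId) {
        return !connection.getDestination().id.equals(proposedDestinationId);
    }

    /**
     * Simulates one synchronization pass. With alwaysReroute=true the temp funnel
     * is used even when nothing changes, which models the unnecessary reroute.
     */
    static String synchronize(Connection connection, String proposedDestinationId, boolean alwaysReroute) {
        if (!alwaysReroute && !needsTempFunnel(connection, proposedDestinationId)) {
            return "unchanged";
        }
        try {
            connection.setDestination(new Component("temp-funnel", false));
            return "rerouted";
        } catch (IllegalStateException e) {
            return "failed";
        }
    }

    public static void main(String[] args) {
        Component runningPort = new Component("inputPort", true);
        // Unnecessary temp funnel against a running Port that holds FlowFiles.
        System.out.println(synchronize(new Connection(runningPort), "inputPort", true));
        // No temp funnel when the destination is not actually changing.
        System.out.println(synchronize(new Connection(runningPort), "inputPort", false));
    }
}
```

In this toy model the first call reports "failed" and the second reports 
"unchanged", which matches the idea that the Port can stay running as long as 
no temp funnel is introduced.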

> Under heavy load, nifi node unable to rejoin cluster, graph modified with 
> temp funnel
> -------------------------------------------------------------------------------------
>
>                 Key: NIFI-12969
>                 URL: https://issues.apache.org/jira/browse/NIFI-12969
>             Project: Apache NiFi
>          Issue Type: Bug
>    Affects Versions: 1.24.0, 2.0.0-M2
>            Reporter: Nissim Shiman
>            Assignee: Mark Payne
>            Priority: Critical
>             Fix For: 2.0.0-M3, 1.26.0
>
>         Attachments: nifi-app.log, simple_flow.png, 
> simple_flow_with_temp-funnel.png
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Under heavy load, if a node leaves the cluster (due to a heartbeat timeout), 
> it is often unable to rejoin the cluster.
> The node's graph will also have been modified with a temp-funnel.
> This appears to be some sort of [timing 
> issue|https://github.com/apache/nifi/blob/rel/nifi-2.0.0-M2/nifi-nar-bundles/nifi-framework-bundle/nifi-framework/nifi-framework-components/src/main/java/org/apache/nifi/connectable/StandardConnection.java#L298]
>  # To reproduce, on a NiFi cluster of three nodes, set up:
> 2 GenerateFlowFile processors -> PG
> Inside PG:
> inputPort -> UpdateAttribute
>  # Keep all defaults except for the following:
> For UpdateAttribute, terminate the success relationship.
> One of the GenerateFlowFile processors can be disabled;
> the other one should have its Run Schedule set to 0 min (this produces the 
> heavy load).
>  # In nifi.properties (on all 3 nodes), to allow nodes to fall out of the 
> cluster, set:
> nifi.cluster.protocol.heartbeat.interval=2 sec  (default is 5 sec)
> nifi.cluster.protocol.heartbeat.missable.max=1   (default is 8)
> Restart NiFi. Start the flow. The nodes will quickly fall out of and rejoin 
> the cluster. After a few minutes, one will likely be unable to rejoin. The 
> graph for that node will show the disabled GenerateFlowFile now pointing to a 
> funnel (a temp-funnel) instead of the PG.
> The stack trace in that node's nifi-app.log will look like this (this is from 
> 2.0.0-M2):
> {code:java}
> 2024-03-28 13:55:19,395 INFO [Reconnect to Cluster] o.a.nifi.controller.StandardFlowService Node disconnected due to Failed to properly handle Reconnection request due to org.apache.nifi.controller.serialization.FlowSynchronizationException: Failed to connect node to cluster because local flow controller partially updated. Administrator should disconnect node and review flow for corruption.
> 2024-03-28 13:55:19,395 ERROR [Reconnect to Cluster] o.a.nifi.controller.StandardFlowService Handling reconnection request failed due to: org.apache.nifi.controller.serialization.FlowSynchronizationException: Failed to connect node to cluster because local flow controller partially updated. Administrator should disconnect node and review flow for corruption.
> org.apache.nifi.controller.serialization.FlowSynchronizationException: Failed to connect node to cluster because local flow controller partially updated. Administrator should disconnect node and review flow for corruption.
>         at org.apache.nifi.controller.StandardFlowService.loadFromConnectionResponse(StandardFlowService.java:985)
>         at org.apache.nifi.controller.StandardFlowService.handleReconnectionRequest(StandardFlowService.java:655)
>         at org.apache.nifi.controller.StandardFlowService$1.run(StandardFlowService.java:384)
>         at java.base/java.lang.Thread.run(Thread.java:1583)
> Caused by: org.apache.nifi.controller.serialization.FlowSynchronizationException: java.lang.IllegalStateException: Cannot change destination of Connection because FlowFiles from this Connection are currently held by LocalPort[id=99213c00-78ca-4848-112f-5454cc20656b, type=INPUT_PORT, name=inputPort, group=innerPG]
>         at org.apache.nifi.controller.serialization.VersionedFlowSynchronizer.synchronizeFlow(VersionedFlowSynchronizer.java:472)
>         at org.apache.nifi.controller.serialization.VersionedFlowSynchronizer.sync(VersionedFlowSynchronizer.java:223)
>         at org.apache.nifi.controller.FlowController.synchronize(FlowController.java:1740)
>         at org.apache.nifi.persistence.StandardFlowConfigurationDAO.load(StandardFlowConfigurationDAO.java:91)
>         at org.apache.nifi.controller.StandardFlowService.loadFromBytes(StandardFlowService.java:805)
>         at org.apache.nifi.controller.StandardFlowService.loadFromConnectionResponse(StandardFlowService.java:954)
>         ... 3 common frames omitted
> Caused by: java.lang.IllegalStateException: Cannot change destination of Connection because FlowFiles from this Connection are currently held by LocalPort[id=99213c00-78ca-4848-112f-5454cc20656b, type=INPUT_PORT, name=inputPort, group=innerPG]
>         at org.apache.nifi.connectable.StandardConnection.setDestination(StandardConnection.java:299)
>         at org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.updateConnectionDestinations(StandardVersionedComponentSynchronizer.java:705)
>         at org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.synchronize(StandardVersionedComponentSynchronizer.java:423)
>         at org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.lambda$synchronize$0(StandardVersionedComponentSynchronizer.java:248)
>         at org.apache.nifi.controller.flow.AbstractFlowManager.withParameterContextResolution(AbstractFlowManager.java:638)
>         at org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.synchronize(StandardVersionedComponentSynchronizer.java:243)
>         at org.apache.nifi.groups.StandardProcessGroup.synchronizeFlow(StandardProcessGroup.java:3860)
>         at org.apache.nifi.controller.serialization.VersionedFlowSynchronizer.synchronizeFlow(VersionedFlowSynchronizer.java:464)
>         ... 8 common frames omitted
> 2024-03-28 13:55:19,395 INFO [Reconnect to Cluster] o.a.n.c.c.node.NodeClusterCoordinator machine-name-2.organization.org:8443 requested disconnection from cluster due to org.apache.nifi.controller.serialization.FlowSynchronizationException: Failed to connect node to cluster because local flow controller partially updated. Administrator should disconnect node and review flow for corruption.
> 2024-03-28 13:55:19,395 INFO [Reconnect to Cluster] o.a.n.c.c.node.NodeClusterCoordinator Status of <machine-name-2.organization>.org:8443 changed from NodeConnectionStatus[nodeId=<machine-name-2.organization>.org:8443, state=CONNECTING, updateId=852] to NodeConnectionStatus[nodeId=<machine-name-2.organization>.org:8443, state=DISCONNECTED, Disconnect Code=Node's Flow did not Match Cluster Flow, Disconnect Reason=org.apache.nifi.controller.serialization.FlowSynchronizationException: Failed to connect node to cluster because local flow controller partially updated. Administrator should disconnect node and review flow for corruption., updateId=854]
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
