[ https://issues.apache.org/jira/browse/NIFI-12232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Handermann updated NIFI-12232:
------------------------------------
    Fix Version/s: 1.26.0
       Resolution: Fixed
           Status: Resolved  (was: Patch Available)

> Frequent "failed to connect node to cluster because local flow controller partially updated. Administrator should disconnect node and review flow for corruption"
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: NIFI-12232
>                 URL: https://issues.apache.org/jira/browse/NIFI-12232
>             Project: Apache NiFi
>          Issue Type: Bug
>          Components: Configuration Management
>    Affects Versions: 1.23.2
>            Reporter: John Joseph
>            Assignee: Mark Payne
>            Priority: Major
>             Fix For: 2.0.0, 1.26.0
>
>         Attachments: image-2023-10-16-16-12-31-027.png, image-2024-02-14-13-33-44-354.png
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> This is an issue that we have been observing in version 1.23.2 of NiFi when we try to upgrade.
> Since rolling upgrades are not supported in NiFi, we scale out the revision that is running and {_}run a helm upgrade{_}.
> We have NiFi running in k8s cluster mode, and there is a post job that calls the Tenants and Policies APIs. A successful run looks like this:
> {code:java}
> set_policies() Action: 'read' Resource: '/flow' entity_id: 'ad2d3ad6-5d69-3e0f-95e9-c7feb36e2de5' entity_name: 'CN=nifi-api-admin' entity_type: 'USER'
> set_policies() status: '200'
> 'read' '/flow' policy already exists. It will be updated...
> set_policies() fetching policy inside -eq 200 status: '200'
> set_policies() after update PUT: '200'
> set_policies() Action: 'read' Resource: '/tenants' entity_id: 'ad2d3ad6-5d69-3e0f-95e9-c7feb36e2de5' entity_name: 'CN=nifi-api-admin' entity_type: 'USER'
> set_policies() status: '200'{code}
> *_This job was running fine in 1.23.0, 1.22 and other previous versions._* In {*}{{1.23.2}}{*}, we are noticing that the job fails very frequently with these error logs:
> {code:java}
> set_policies() Action: 'read' Resource: '/flow' entity_id: 'ad2d3ad6-5d69-3e0f-95e9-c7feb36e2de5' entity_name: 'CN=nifi-api-admin' entity_type: 'USER'
> set_policies() status: '200'
> 'read' '/flow' policy already exists. It will be updated...
> set_policies() fetching policy inside -eq 200 status: '200'
> set_policies() after update PUT: '400'
> An error occurred getting 'read' '/flow' policy: 'This node is disconnected from its configured cluster. The requested change will only be allowed if the flag to acknowledge the disconnected node is set.'{code}
> {{_*'This node is disconnected from its configured cluster. The requested change will only be allowed if the flag to acknowledge the disconnected node is set.'*_}}
> The job is configured to run only after all the pods are up and running. Although the pods are up, we see this exception inside the pods:
> {code:java}
> org.apache.nifi.controller.serialization.FlowSynchronizationException: Failed to connect node to cluster because local flow controller partially updated. Administrator should disconnect node and review flow for corruption.
>     at org.apache.nifi.controller.StandardFlowService.loadFromConnectionResponse(StandardFlowService.java:1059)
>     at org.apache.nifi.controller.StandardFlowService.handleReconnectionRequest(StandardFlowService.java:667)
>     at org.apache.nifi.controller.StandardFlowService.access$200(StandardFlowService.java:107)
>     at org.apache.nifi.controller.StandardFlowService$1.run(StandardFlowService.java:396)
>     at java.base/java.lang.Thread.run(Thread.java:833)
> Caused by: org.apache.nifi.controller.serialization.FlowSynchronizationException: java.lang.IllegalStateException: Cannot change destination of Connection because the current destination is running
>     at org.apache.nifi.controller.serialization.VersionedFlowSynchronizer.synchronizeFlow(VersionedFlowSynchronizer.java:448)
>     at org.apache.nifi.controller.serialization.VersionedFlowSynchronizer.sync(VersionedFlowSynchronizer.java:206)
>     at org.apache.nifi.controller.serialization.StandardFlowSynchronizer.sync(StandardFlowSynchronizer.java:42)
>     at org.apache.nifi.controller.FlowController.synchronize(FlowController.java:1530)
>     at org.apache.nifi.persistence.StandardFlowConfigurationDAO.load(StandardFlowConfigurationDAO.java:104)
>     at org.apache.nifi.controller.StandardFlowService.loadFromBytes(StandardFlowService.java:817)
>     at org.apache.nifi.controller.StandardFlowService.loadFromConnectionResponse(StandardFlowService.java:1028)
>     ... 4 common frames omitted
> Caused by: java.lang.IllegalStateException: Cannot change destination of Connection because the current destination is running
>     at org.apache.nifi.connectable.StandardConnection.setDestination(StandardConnection.java:310)
>     at org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.updateConnectionDestinations(StandardVersionedComponentSynchronizer.java:700)
>     at org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.synchronize(StandardVersionedComponentSynchronizer.java:405)
>     at org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.synchronizeChildGroups(StandardVersionedComponentSynchronizer.java:543)
>     at org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.synchronize(StandardVersionedComponentSynchronizer.java:427)
>     at org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.lambda$synchronize$0(StandardVersionedComponentSynchronizer.java:266)
>     at org.apache.nifi.controller.flow.AbstractFlowManager.withParameterContextResolution(AbstractFlowManager.java:550)
>     at org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.synchronize(StandardVersionedComponentSynchronizer.java:261)
>     at org.apache.nifi.groups.StandardProcessGroup.synchronizeFlow(StandardProcessGroup.java:3977)
>     at org.apache.nifi.controller.serialization.VersionedFlowSynchronizer.synchronizeFlow(VersionedFlowSynchronizer.java:439)
>     ... 10 common frames omitted{code}
> Attaching screenshots of the UI as well. This issue is observed often when checking with the CLI command:
> {code:java}
> ./cli.sh nifi cluster-summary -u https://nifi-headless.doc-norc.svc.cluster.local:9443 -ts /opt/nifi/cert_mgr/truststore.jks -tst jks -tsp changeit -ks /opt/nifi/cert_mgr/keystore.jks -kst jks -ksp changeit
> Total node count: 0
> Connected node count: 0
> Clustered: true
> Connected to cluster: false{code}
>
> We tried this workaround:
> {code:java}
> 1. Exec into the pod that has the flow file issue and delete the flow file so that it is removed from the PVC
> 2. Exit from the pod
> 3. Delete the pod that had the problem{code}
> The pod will respawn, and the cluster coordinator will recreate the flow file from the connected nodes.
> This connected all the nodes. But this does not feel like an ideal solution, as we are seeing this issue quite often and cannot run this workaround every time.
> !image-2023-10-16-16-12-31-027.png!
>
> We also see this exception sometimes:
> {code:java}
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /nifi/leaders/Cluster Coordinator
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
>     at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:2480)
>     at org.apache.curator.framework.imps.GetChildrenBuilderImpl$3.call(GetChildrenBuilderImpl.java:243)
>     at org.apache.curator.framework.imps.GetChildrenBuilderImpl$3.call(GetChildrenBuilderImpl.java:232)
>     at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:94)
>     at org.apache.curator.framework.imps.GetChildrenBuilderImpl.pathInForeground(GetChildrenBuilderImpl.java:229)
>     at org.apache.curator.framework.imps.GetChildrenBuilderImpl.forPath(GetChildrenBuilderImpl.java:220)
>     at org.apache.curator.framework.imps.GetChildrenBuilderImpl.forPath(GetChildrenBuilderImpl.java:42)
>     at org.apache.curator.framework.recipes.locks.LockInternals.getSortedChildren(LockInternals.java:155)
>     at org.apache.curator.framework.recipes.locks.LockInternals.getParticipantNodes(LockInternals.java:135)
>     at org.apache.curator.framework.recipes.locks.InterProcessMutex.getParticipantNodes(InterProcessMutex.java:170)
>     at org.apache.curator.framework.recipes.leader.LeaderSelector.getLeader(LeaderSelector.java:336)
>     at org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager.getLeader(CuratorLeaderElectionManager.java:281)
>     at org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener.verifyLeader(CuratorLeaderElectionManager.java:572)
>     at org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener.isLeader(CuratorLeaderElectionManager.java:526)
>     at org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$LeaderRole.isLeader(CuratorLeaderElectionManager.java:467)
>     at org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager.isLeader(CuratorLeaderElectionManager.java:262)
>     at org.apache.nifi.cluster.coordination.node.NodeClusterCoordinator.isActiveClusterCoordinator(NodeClusterCoordinator.java:824)
>     at org.apache.nifi.cluster.coordination.heartbeat.AbstractHeartbeatMonitor.monitorHeartbeats(AbstractHeartbeatMonitor.java:132)
>     at org.apache.nifi.cluster.coordination.heartbeat.AbstractHeartbeatMonitor$1.run(AbstractHeartbeatMonitor.java:84)
>     at org.apache.nifi.engine.FlowEngine$2.run(FlowEngine.java:110){code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
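
The report notes that the policy job starts once all pods are up, yet the PUT still fails because a node is disconnected. A possible mitigation on the job side (not the fix shipped for this issue) is to gate the job on cluster connectivity rather than pod readiness. The sketch below is a minimal illustration under stated assumptions: `fetch` is a hypothetical callable that returns the parsed JSON of a cluster status request, and the payload is assumed to have the shape `{"cluster": {"nodes": [{"status": ...}]}}` with `CONNECTED` marking a joined node. Verify both against the REST API of the NiFi version in use.

```python
import time
from typing import Any, Callable, Mapping


def all_nodes_connected(payload: Mapping[str, Any], expected_nodes: int) -> bool:
    """True only if the payload lists exactly expected_nodes nodes, all CONNECTED.

    The payload shape is an assumption (see the lead-in), not taken from the report.
    """
    nodes = payload.get("cluster", {}).get("nodes", [])
    if len(nodes) != expected_nodes:
        return False
    return all(node.get("status") == "CONNECTED" for node in nodes)


def wait_for_cluster(
    fetch: Callable[[], Mapping[str, Any]],  # hypothetical: returns parsed cluster JSON
    expected_nodes: int,
    timeout_s: float = 600.0,
    interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until every expected node is connected, or give up at the timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            if all_nodes_connected(fetch(), expected_nodes):
                return True
        except Exception:
            # A node that is still joining may reject or fail the request;
            # treat any error as "not ready yet" and retry.
            pass
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

The job would call `wait_for_cluster(...)` before the first `set_policies()` call and exit non-zero if it returns `False`, so a disconnected node surfaces as a failed gate instead of a `400` in the middle of the policy updates.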