[ https://issues.apache.org/jira/browse/NIFI-10362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580862#comment-17580862 ]
ASF subversion and git services commented on NIFI-10362:
--------------------------------------------------------

Commit 21503f6353c33063b7acff5915a94397aad72926 in nifi's branch refs/heads/main from Mark Payne
[ https://gitbox.apache.org/repos/asf?p=nifi.git;h=21503f6353 ]

NIFI-10362: When asynchronous node disconnect is issued, do not send disconnect to node if the node becomes reconnected in the interim. Also, addressed the issue in which a disconnected node acts on a replicated request during the first phase by detecting that it's the first phase if configured for cluster, not only when connected to a cluster.

This closes #6308

Signed-off-by: David Handermann <exceptionfact...@apache.org>

> Cluster can disconnect node as soon as it rejoins cluster upon restart
> ----------------------------------------------------------------------
>
>                 Key: NIFI-10362
>                 URL: https://issues.apache.org/jira/browse/NIFI-10362
>             Project: Apache NiFi
>          Issue Type: Bug
>          Components: Core Framework
>            Reporter: Mark Payne
>            Assignee: Mark Payne
>            Priority: Major
>             Fix For: 1.18.0
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> When the Cluster Coordinator disconnects a node because a user requested that
> the node be disconnected, the node is immediately marked as DISCONNECTED,
> and a background thread is then responsible for notifying the node that it
> has been disconnected. The background task retries several times if it cannot
> successfully send the notification.
> However, if the node is disconnected and then restarted before it has been
> notified, the node becomes CONNECTING (and possibly then CONNECTED) before
> the background task is triggered. The task then tells the node that it is
> DISCONNECTED, but the Cluster Coordinator does not think so (because it has
> already changed the state back to CONNECTING/CONNECTED).
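The race described above can be illustrated with a minimal sketch. This is not the code from the commit: the class `DisconnectNotifier`, its nested `NodeState` enum, and the method names are hypothetical, and the actual network call is replaced by a counter. The sketch assumes only that the background task can look up the coordinator's current view of the node immediately before sending.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical illustration; names do not come from the NiFi code base.
class DisconnectNotifier {
    enum NodeState { CONNECTING, CONNECTED, DISCONNECTED }

    // The coordinator's current view of each node, keyed by node address.
    private final Map<String, NodeState> states = new ConcurrentHashMap<>();
    private int notificationsSent = 0;

    void setState(String nodeId, NodeState state) {
        states.put(nodeId, state);
    }

    // Called by the background task on each attempt. Re-reads the state so
    // that a node which has rejoined in the interim is not told to disconnect.
    boolean notifyIfStillDisconnected(String nodeId) {
        if (states.get(nodeId) != NodeState.DISCONNECTED) {
            return false; // node reconnected (or is unknown); skip the notification
        }
        notificationsSent++; // stand-in for sending the disconnect message
        return true;
    }

    int getNotificationsSent() {
        return notificationsSent;
    }
}
```

A check immediately before sending narrows the window rather than closing it completely, so a real implementation would likely need to repeat the check on every retry.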
> While the chances that this happens are slim in production and it's easily
> worked around (by simply waiting a few seconds after disconnecting a node
> before restarting it, or just restarting without disconnecting), it causes a
> lot of problems for system tests and potentially other automated activities.
> It results in the following log message in the Cluster Coordinator:
> {code:java}
> 2022-08-15 00:47:50,200 ERROR [Disconnect localhost:5672] org.apache.nifi.cluster.coordination.node.NodeClusterCoordinator Failed to notify localhost:5672 that it has been disconnected from the cluster due to User anonymous requested that node be disconnected from cluster
> {code}
> And then we see confusing error messages such as:
> {code:java}
> 2022-08-15 00:48:01,461 INFO [Replicate Request Thread-23] org.apache.nifi.cluster.coordination.http.replication.ThreadPoolRequestReplicator Received a status of 200 from localhost:5672 for request PUT /nifi-api/flow/process-groups/root when performing first stage of two-stage commit. The action will not occur. Node explanation: {"id":"root","state":"STOPPED"}
> {code}
> This is because when the cluster coordinator replicates the request to all
> nodes, the node that thinks it is disconnected receives the request and
> performs the action. It then responds with a "200 OK" but it should have
> noted that it's the first phase of a 2-phase action and responded with "201
> Continue".
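The second part of the fix, treating a request as first-phase whenever the node is configured for clustering rather than only when it is connected, can be sketched as follows. Everything here is illustrative: the class `TwoPhaseRequestCheck`, the `Outcome` enum, and the header name `X-Example-Validation-Phase` are hypothetical stand-ins, not NiFi identifiers, and the sketch assumes the coordinator marks first-phase requests with a header.

```java
import java.util.Map;

// Hypothetical illustration; names do not come from the NiFi code base.
class TwoPhaseRequestCheck {
    // Assumed marker that the coordinator adds to first-phase requests.
    static final String VALIDATION_HEADER = "X-Example-Validation-Phase";

    enum Outcome { VALIDATE_ONLY, PERFORM_ACTION }

    // Behavior described in the issue: a node that believes it is
    // disconnected ignores the marker and performs the action.
    static Outcome decideWhenConnectedOnly(boolean connected, Map<String, String> headers) {
        return connected && headers.containsKey(VALIDATION_HEADER)
                ? Outcome.VALIDATE_ONLY
                : Outcome.PERFORM_ACTION;
    }

    // Behavior described in the commit message: any node configured for
    // clustering honors the marker, whether or not it is currently connected.
    static Outcome decideWhenConfiguredForCluster(boolean configuredForCluster, Map<String, String> headers) {
        return configuredForCluster && headers.containsKey(VALIDATION_HEADER)
                ? Outcome.VALIDATE_ONLY
                : Outcome.PERFORM_ACTION;
    }
}
```

With the second check, a clustered node that wrongly believes it is disconnected still only validates a first-phase request instead of applying it.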