[ 
https://issues.apache.org/jira/browse/NIFI-10362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580862#comment-17580862
 ] 

ASF subversion and git services commented on NIFI-10362:
--------------------------------------------------------

Commit 21503f6353c33063b7acff5915a94397aad72926 in nifi's branch 
refs/heads/main from Mark Payne
[ https://gitbox.apache.org/repos/asf?p=nifi.git;h=21503f6353 ]

NIFI-10362: When an asynchronous node disconnect is issued, do not send the 
disconnect to the node if the node becomes reconnected in the interim. Also 
addressed the issue in which a disconnected node acts on a replicated request 
during the first phase, by detecting that it's the first phase whenever the 
node is configured for clustering, not only when it is connected to a cluster.

This closes #6308

Signed-off-by: David Handermann <exceptionfact...@apache.org>
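
The first half of the commit message (skip the notification if the node has 
reconnected in the interim) can be illustrated with a minimal sketch. This is 
not the code from the commit: the class DisconnectGuardSketch, its NodeState 
enum, and the method notifyIfStillDisconnected are hypothetical names chosen 
for illustration, and the notifier callback stands in for whatever actually 
sends the disconnect message to the node.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

// Hypothetical sketch; names and structure are assumptions, not NiFi's classes.
class DisconnectGuardSketch {
    enum NodeState { CONNECTING, CONNECTED, DISCONNECTED }

    // Coordinator-side view of each node's state, keyed by node identifier.
    private final Map<String, NodeState> states = new ConcurrentHashMap<>();

    // Stand-in for the code that delivers the disconnect message to a node.
    private final Consumer<String> notifier;

    DisconnectGuardSketch(Consumer<String> notifier) {
        this.notifier = notifier;
    }

    void setState(String nodeId, NodeState state) {
        states.put(nodeId, state);
    }

    // Body of the background task. Re-reads the state just before sending and
    // skips the notification if the node is no longer DISCONNECTED.
    // Returns true only if the notification was sent.
    // Note: this narrows the race window but does not make check-and-send atomic.
    boolean notifyIfStillDisconnected(String nodeId) {
        if (states.get(nodeId) != NodeState.DISCONNECTED) {
            return false;
        }
        notifier.accept(nodeId);
        return true;
    }
}
```

With this guard, a node that was marked DISCONNECTED and then moved back to 
CONNECTING or CONNECTED before the background task ran would not receive a 
stale disconnect message.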


> Cluster can disconnect node as soon as it rejoins cluster upon restart
> ----------------------------------------------------------------------
>
>                 Key: NIFI-10362
>                 URL: https://issues.apache.org/jira/browse/NIFI-10362
>             Project: Apache NiFi
>          Issue Type: Bug
>          Components: Core Framework
>            Reporter: Mark Payne
>            Assignee: Mark Payne
>            Priority: Major
>             Fix For: 1.18.0
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> When the Cluster Coordinator disconnects a node due to a user requesting that 
> the node get disconnected, the node is immediately marked as DISCONNECTED, 
> and then a background thread is responsible for notifying the node that it's 
> been disconnected. The background task retries several times if it cannot 
> successfully send the notification.
> However, if the node is disconnected and then restarted before it's been 
> notified, we have a situation in which the node becomes CONNECTING (and 
> possibly then CONNECTED), and then the background task is triggered. This 
> then results in the node being told that it's DISCONNECTED. But the Cluster 
> Coordinator doesn't think so (because it's already changed the state back to 
> CONNECTING/CONNECTED).
> While the chances that this happens are slim in production and it's easily 
> worked around (by simply waiting a few seconds after disconnecting a node 
> before restarting it, or just restarting without disconnecting), it causes a 
> lot of problems for system tests and potentially other automated activities.
> It results in the following log message in the Cluster Coordinator:
> {code:java}
> 2022-08-15 00:47:50,200 ERROR [Disconnect localhost:5672] 
> org.apache.nifi.cluster.coordination.node.NodeClusterCoordinator Failed to 
> notify localhost:5672 that it has been disconnected from the cluster due to 
> User anonymous requested that node be disconnected from cluster {code}
> And then we see confusing error messages such as:
> {code:java}
> 2022-08-15 00:48:01,461 INFO [Replicate Request Thread-23] 
> org.apache.nifi.cluster.coordination.http.replication.ThreadPoolRequestReplicator
>  Received a status of 200 from localhost:5672 for request PUT 
> /nifi-api/flow/process-groups/root when performing first stage of two-stage 
> commit. The action will not occur. Node explanation: 
> {"id":"root","state":"STOPPED"} {code}
> This is because when the cluster coordinator replicates the request to all 
> nodes, the node that thinks it is disconnected receives the request and 
> performs the action. It then responds with a "200 OK" but it should have 
> noted that it's the first phase of a 2-phase action and responded with "201 
> Continue".
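
The first-phase behavior described above can be sketched as follows. This is 
an illustration under stated assumptions, not NiFi's implementation: the class 
FirstPhaseSketch and the header name X-Example-Validation-Phase are 
hypothetical, and the status value 201 is taken from the issue text as 
written, so the value used by the real code may differ. The point shown is 
that the first-phase check depends on whether the node is configured for 
clustering, not on whether it currently believes it is connected.

```java
import java.util.Map;

// Hypothetical sketch; header name and class are assumptions for illustration.
class FirstPhaseSketch {
    // Hypothetical marker header that the coordinator adds during phase one.
    static final String VALIDATION_HEADER = "X-Example-Validation-Phase";

    // Value as quoted in the issue description; not verified against NiFi.
    static final int CONTINUE_STATUS = 201;
    static final int OK_STATUS = 200;

    // A request is treated as first-phase when the node is configured for
    // clustering and the marker header is present, even if the node currently
    // thinks it is disconnected.
    static boolean isFirstPhase(Map<String, String> headers, boolean configuredForCluster) {
        return configuredForCluster && headers.containsKey(VALIDATION_HEADER);
    }

    // Returns the status the node would answer with. The action runs only
    // when the request is not a first-phase (validation) request.
    static int handle(Map<String, String> headers, boolean configuredForCluster, Runnable action) {
        if (isFirstPhase(headers, configuredForCluster)) {
            return CONTINUE_STATUS; // validate only, do not act
        }
        action.run();
        return OK_STATUS;
    }
}
```

In this sketch a node that has been told it is disconnected, but is still 
configured for clustering, answers a first-phase request with the continue 
status and leaves the action for the second phase.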



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
