Hi, I have a NiFi cluster up and running and I'm testing various failover scenarios.
I have two nodes in the cluster:

  - centos-a : Cluster Coordinator / Primary Node
  - centos-b : cluster node

I noticed in one of the scenarios, where I killed the Cluster Coordinator node, that the following happened: centos-b could no longer contact the coordinator and became the new coordinator / primary node (as expected):

  Failed to send heartbeat due to: org.apache.nifi.cluster.protocol.ProtocolException: Failed to send message to Cluster Coordinator due to: java.net.ConnectException: Connection refused (Connection refused)
  This node has been elected Leader for Role 'Primary Node'
  This node has been elected Leader for Role 'Cluster Coordinator'

When attempting to access the UI on centos-b, I got the following error:

  2017-05-18 11:18:49,368 WARN [Replicate Request Thread-2] o.a.n.c.c.h.r.ThreadPoolRequestReplicator Failed to replicate request GET /nifi-api/flow/current-user to centos-a:8080 due to {}

If my understanding is correct, NiFi tries to replicate requests to all connected nodes in the cluster. Here, centos-a was killed a while back and should have been disconnected, but as far as NiFi was concerned it was still connected. As a result I can no longer access the UI (due to the replication error), but I can still look up the cluster info via the REST API. And sure enough, it still sees centos-a as CONNECTED:
{
  "cluster": {
    "generated": "11:20:13 UTC",
    "nodes": [
      {
        "activeThreadCount": 0,
        "address": "centos-b",
        "apiPort": 8080,
        "events": [
          {
            "category": "INFO",
            "message": "Node Status changed from CONNECTING to CONNECTED",
            "timestamp": "05/18/2017 11:17:31 UTC"
          },
          {
            "category": "INFO",
            "message": "Node Status changed from [Unknown Node] to CONNECTING",
            "timestamp": "05/18/2017 11:17:27 UTC"
          }
        ],
        "heartbeat": "05/18/2017 11:20:09 UTC",
        "nodeId": "a5bce78d-23ea-4435-a0dd-4b731459f1b9",
        "nodeStartTime": "05/18/2017 11:17:25 UTC",
        "queued": "8,492 / 13.22 MB",
        "roles": ["Primary Node", "Cluster Coordinator"],
        "status": "CONNECTED"
      },
      {
        "address": "centos-a",
        "apiPort": 8080,
        "events": [],
        "nodeId": "b89e8418-4b7f-4743-bdf4-4a08a92f3892",
        "roles": [],
        "status": "CONNECTED"
      }
    ]
  }
}

When centos-a was brought back online, I noticed the following status change:

  Status of centos-a:8080 changed from NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTED, updateId=15] to NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTING, updateId=19]

So it went straight from CONNECTED to CONNECTING; it clearly skipped the DISCONNECTED step. When shutting down the centos-a node gracefully with nifi.sh stop, it does go into the DISCONNECTED state:

  Status of centos-a:8080 changed from NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTED, updateId=12] to NodeConnectionStatus[nodeId=centos-a:8080, state=DISCONNECTED, Disconnect Code=Node was Shutdown, Disconnect Reason=Node was Shutdown, updateId=13]

How can I debug this further, and can somebody provide some additional insights?
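To make the behaviour I expected concrete, here is a small standalone sketch (my own illustration, not NiFi code; the `stale_nodes` helper and the 40-second threshold are assumptions on my part) that flags nodes in a cluster response like the one above whose last heartbeat is older than a threshold, which is roughly what I would expect the coordinator to do before marking a node DISCONNECTED:

```python
from datetime import datetime, timedelta

# Hypothetical helper, not part of NiFi: given the parsed JSON from
# /nifi-api/controller/cluster, return addresses of nodes whose last
# heartbeat is older than `threshold` (or that have no heartbeat at all).
def stale_nodes(cluster_json, now, threshold=timedelta(seconds=40)):
    stale = []
    for node in cluster_json["cluster"]["nodes"]:
        hb = node.get("heartbeat")
        if hb is None:
            # A killed node may never have heartbeated to this coordinator.
            stale.append(node["address"])
            continue
        last = datetime.strptime(hb.replace(" UTC", ""), "%m/%d/%Y %H:%M:%S")
        if now - last > threshold:
            stale.append(node["address"])
    return stale

# Minimal payload in the same shape as the REST response above.
payload = {
    "cluster": {
        "nodes": [
            {"address": "centos-b", "heartbeat": "05/18/2017 11:20:09 UTC"},
            {"address": "centos-a"},  # killed node: no heartbeat reported
        ]
    }
}

now = datetime.strptime("05/18/2017 11:20:13", "%m/%d/%Y %H:%M:%S")
print(stale_nodes(payload, now))  # -> ['centos-a']
```

By this logic centos-a should be considered stale, yet the coordinator still reports it as CONNECTED.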
I have seen nodes getting disconnected due to missing heartbeats:

  Status of centos-a:8080 changed from NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTED, updateId=10] to NodeConnectionStatus[nodeId=centos-a:8080, state=DISCONNECTED, Disconnect Code=Lack of Heartbeat, Disconnect Reason=Have not received a heartbeat from node in 41 seconds, updateId=11]

But sometimes it doesn't seem to detect this, and NiFi keeps thinking the node is CONNECTED despite not having received a heartbeat in ages. Any ideas?

--
View this message in context: http://apache-nifi-users-list.2361937.n4.nabble.com/Nifi-Cluster-fails-to-disconnect-node-when-node-was-killed-tp1942.html
Sent from the Apache NiFi Users List mailing list archive at Nabble.com.
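For completeness, these are the heartbeat-related settings from my nifi.properties (the values shown are the defaults I am running with, not a recommendation). If I understand correctly, the coordinator marks a node DISCONNECTED after missing several consecutive heartbeat intervals, which would line up with the "41 seconds" message above, but I may be wrong about the exact multiplier:

```properties
# How often each node sends a heartbeat to the Cluster Coordinator
nifi.cluster.protocol.heartbeat.interval=5 sec

# Socket timeouts for node-to-node cluster communication
nifi.cluster.node.connection.timeout=5 sec
nifi.cluster.node.read.timeout=5 sec
```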