Hi, I have a NiFi cluster up and running and I'm testing various failover scenarios.
I have two nodes in the cluster:

  - centos-a : Cluster Coordinator / Primary Node
  - centos-b : cluster node

I noticed in one of the scenarios, where I killed the Cluster Coordinator node, that the following happened: centos-b could no longer contact the coordinator and became the new coordinator / primary node (as expected):

  Failed to send heartbeat due to: org.apache.nifi.cluster.protocol.ProtocolException: Failed to send message to Cluster Coordinator due to: java.net.ConnectException: Connection refused (Connection refused)
  This node has been elected Leader for Role 'Primary Node'
  This node has been elected Leader for Role 'Cluster Coordinator'

When attempting to access the UI on centos-b, I got the following error:

  2017-05-18 11:18:49,368 WARN [Replicate Request Thread-2] o.a.n.c.c.h.r.ThreadPoolRequestReplicator Failed to replicate request GET /nifi-api/flow/current-user to centos-a:8080 due to {}

If my understanding is correct, NiFi tries to replicate requests to all connected nodes in the cluster. Here, centos-a was killed a while back and should have been disconnected, but as far as NiFi was concerned it was still connected. As a result I can no longer access the UI (due to the replication error), but I can still look up the cluster info via the REST API. And sure enough, it still sees centos-a as CONNECTED:
{
  "cluster": {
    "generated": "11:20:13 UTC",
    "nodes": [
      {
        "activeThreadCount": 0,
        "address": "centos-b",
        "apiPort": 8080,
        "events": [
          {
            "category": "INFO",
            "message": "Node Status changed from CONNECTING to CONNECTED",
            "timestamp": "05/18/2017 11:17:31 UTC"
          },
          {
            "category": "INFO",
            "message": "Node Status changed from [Unknown Node] to CONNECTING",
            "timestamp": "05/18/2017 11:17:27 UTC"
          }
        ],
        "heartbeat": "05/18/2017 11:20:09 UTC",
        "nodeId": "a5bce78d-23ea-4435-a0dd-4b731459f1b9",
        "nodeStartTime": "05/18/2017 11:17:25 UTC",
        "queued": "8,492 / 13.22 MB",
        "roles": ["Primary Node", "Cluster Coordinator"],
        "status": "CONNECTED"
      },
      {
        "address": "centos-a",
        "apiPort": 8080,
        "events": [],
        "nodeId": "b89e8418-4b7f-4743-bdf4-4a08a92f3892",
        "roles": [],
        "status": "CONNECTED"
      }
    ]
  }
}

When centos-a was brought back online, I noticed the following status change:

  Status of centos-a:8080 changed from NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTED, updateId=15] to NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTING, updateId=19]

So it went straight from CONNECTED to CONNECTING; it clearly skipped the DISCONNECTED step. When shutting down the centos-a node gracefully with nifi.sh stop, it does go into the DISCONNECTED state:

  Status of centos-a:8080 changed from NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTED, updateId=12] to NodeConnectionStatus[nodeId=centos-a:8080, state=DISCONNECTED, Disconnect Code=Node was Shutdown, Disconnect Reason=Node was Shutdown, updateId=13]

How can I debug this further, and can somebody provide some additional insights?
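To make the behaviour I expected concrete, here is a small standalone sketch (my own illustration, not NiFi code; the `stale_nodes` helper and the 40-second threshold are assumptions on my part) that flags nodes in a cluster response like the one above whose last heartbeat is older than a threshold, which is roughly what I would expect the coordinator to do before marking a node DISCONNECTED:

```python
from datetime import datetime, timedelta

# Hypothetical helper, not part of NiFi: given the parsed JSON from
# /nifi-api/controller/cluster, return addresses of nodes whose last
# heartbeat is older than `threshold` (or that have no heartbeat at all).
def stale_nodes(cluster_json, now, threshold=timedelta(seconds=40)):
    stale = []
    for node in cluster_json["cluster"]["nodes"]:
        hb = node.get("heartbeat")
        if hb is None:
            # A killed node may never have heartbeated to this coordinator.
            stale.append(node["address"])
            continue
        last = datetime.strptime(hb.replace(" UTC", ""), "%m/%d/%Y %H:%M:%S")
        if now - last > threshold:
            stale.append(node["address"])
    return stale

# Minimal payload in the same shape as the REST response above.
payload = {
    "cluster": {
        "nodes": [
            {"address": "centos-b", "heartbeat": "05/18/2017 11:20:09 UTC"},
            {"address": "centos-a"},  # killed node: no heartbeat reported
        ]
    }
}

now = datetime.strptime("05/18/2017 11:20:13", "%m/%d/%Y %H:%M:%S")
print(stale_nodes(payload, now))  # -> ['centos-a']
```

By this logic centos-a should be considered stale, yet the coordinator still reports it as CONNECTED.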
I have seen nodes getting disconnected due to missing heartbeats:

  Status of centos-a:8080 changed from NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTED, updateId=10] to NodeConnectionStatus[nodeId=centos-a:8080, state=DISCONNECTED, Disconnect Code=Lack of Heartbeat, Disconnect Reason=Have not received a heartbeat from node in 41 seconds, updateId=11]

But sometimes it doesn't seem to detect this, and NiFi keeps thinking the node is CONNECTED despite not having received a heartbeat in ages. Any ideas?

--
View this message in context: http://apache-nifi-users-list.2361937.n4.nabble.com/Nifi-Cluster-fails-to-disconnect-node-when-node-was-killed-tp1942.html
Sent from the Apache NiFi Users List mailing list archive at Nabble.com.
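For completeness, these are the heartbeat-related settings from my nifi.properties (the values shown are the defaults I am running with, not a recommendation). If I understand correctly, the coordinator marks a node DISCONNECTED after missing several consecutive heartbeat intervals, which would line up with the "41 seconds" message above, but I may be wrong about the exact multiplier:

```properties
# How often each node sends a heartbeat to the Cluster Coordinator
nifi.cluster.protocol.heartbeat.interval=5 sec

# Socket timeouts for node-to-node cluster communication
nifi.cluster.node.connection.timeout=5 sec
nifi.cluster.node.read.timeout=5 sec
```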