Re: Nifi Cluster fails to disconnect node when node was killed

Neil Derraugh Thu, 18 May 2017 09:29:43 -0700

Pretty sure this is the problem I was describing in the "Phantom Node"
thread recently.


If I kill non-primary nodes the cluster remains healthy despite the lost
nodes.  The terminated nodes end up with a DISCONNECTED status.

If I kill the primary it winds up with a CONNECTED status, but a new
primary/cluster coordinator gets elected too.

Additionally it seems in 1.2.0 that the REST API no longer supports deleting
a node in a CONNECTED state (Cannot remove Node with ID
1780fde7-c2f4-469c-9884-fe843eac5b73 because it is not disconnected,
current state = CONNECTED).  So right now I don't have a workaround and
have to kill all the nodes and start over.
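
For reference, a minimal sketch of the two REST calls involved: a disconnect
request followed by a delete. It assumes the NiFi 1.x cluster endpoints (PUT and
DELETE on /nifi-api/controller/cluster/nodes/{id}); the base URL and the helper
names are hypothetical. Whether the disconnect request succeeds for a killed
node that the coordinator still reports as CONNECTED is not established in this
thread.

```python
import json
import urllib.request

# Assumption: an unsecured node reachable at this address; adjust as needed.
BASE_URL = "http://centos-b:8080/nifi-api"


def node_url(node_id, base_url=BASE_URL):
    """URL of a single cluster node resource."""
    return f"{base_url}/controller/cluster/nodes/{node_id}"


def build_disconnect_request(node_id, base_url=BASE_URL):
    """PUT asking the coordinator to move the node to DISCONNECTING."""
    body = json.dumps(
        {"node": {"nodeId": node_id, "status": "DISCONNECTING"}}
    ).encode("utf-8")
    return urllib.request.Request(
        node_url(node_id, base_url),
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def build_delete_request(node_id, base_url=BASE_URL):
    """DELETE removing the node; per the error above, only valid once disconnected."""
    return urllib.request.Request(node_url(node_id, base_url), method="DELETE")


def remove_node(node_id, base_url=BASE_URL):
    """Send the disconnect request, then the delete request."""
    requests = (
        build_disconnect_request(node_id, base_url),
        build_delete_request(node_id, base_url),
    )
    for req in requests:
        with urllib.request.urlopen(req) as resp:
            print(req.get_method(), req.full_url, resp.status)
```

Calling remove_node("1780fde7-c2f4-469c-9884-fe843eac5b73") would issue both
requests in order; urlopen raises an HTTPError if the cluster rejects either one.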

On Thu, May 18, 2017 at 11:20 AM, Mark Payne <marka...@hotmail.com> wrote:

> Hello,
>
> Just looking through this thread now. I believe that I understand the
> problem. I have updated the JIRA with details about what I think the
> cause is and a potential remedy.
>
> Thanks
> -Mark
>
> > On May 18, 2017, at 9:49 AM, Matt Gilman <matt.c.gil...@gmail.com> wrote:
> >
> > Thanks for the additional details. They will be helpful when working the
> > JIRA. All nodes, including the coordinator, heartbeat to the active
> > coordinator. This means that the coordinator effectively heartbeats to
> > itself. It appears, based on your log messages, that this is not happening.
> > Because no heartbeats were received from any node, the lack of heartbeats
> > from the terminated node is not considered.
> >
> > Matt
> >
> > Sent from my iPhone
> >
> >> On May 18, 2017, at 8:30 AM, ddewaele <ddewa...@gmail.com> wrote:
> >>
> >> Found something interesting in the centos-b debug logging....
> >>
> >> After centos-a (the coordinator) is killed, centos-b takes over. Notice how
> >> it "Will not disconnect any nodes due to lack of heartbeat" and how it still
> >> sees centos-a as connected despite the fact that there are no heartbeats
> >> anymore.
> >>
> >> 2017-05-18 12:41:38,010 INFO [Leader Election Notification Thread-2]
> >> o.apache.nifi.controller.FlowController This node elected Active Cluster
> >> Coordinator
> >> 2017-05-18 12:41:38,010 DEBUG [Leader Election Notification Thread-2]
> >> o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor Purging old heartbeats
> >> 2017-05-18 12:41:38,014 INFO [Leader Election Notification Thread-1]
> >> o.apache.nifi.controller.FlowController This node has been elected Primary
> >> Node
> >> 2017-05-18 12:41:38,353 DEBUG [Heartbeat Monitor Thread-1]
> >> o.a.n.c.c.h.AbstractHeartbeatMonitor Received no new heartbeats. Will not
> >> disconnect any nodes due to lack of heartbeat
> >> 2017-05-18 12:41:41,336 DEBUG [Process Cluster Protocol Request-3]
> >> o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor Received new heartbeat from
> >> centos-b:8080
> >> 2017-05-18 12:41:41,337 DEBUG [Process Cluster Protocol Request-3]
> >> o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor
> >>
> >> Calculated diff between current cluster status and node cluster status as
> >> follows:
> >> Node: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED,
> >> updateId=45], NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTED,
> >> updateId=42]]
> >> Self: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED,
> >> updateId=45], NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTED,
> >> updateId=42]]
> >> Difference: []
> >>
> >>
> >> 2017-05-18 12:41:41,337 INFO [Process Cluster Protocol Request-3]
> >> o.a.n.c.p.impl.SocketProtocolListener Finished processing request
> >> 410e7db5-8bb0-4f97-8ee8-fc8647c54959 (type=HEARTBEAT, length=2341 bytes)
> >> from centos-b:8080 in 3 millis
> >> 2017-05-18 12:41:41,339 INFO [Clustering Tasks Thread-2]
> >> o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2017-05-18
> >> 12:41:41,330 and sent to centos-b:10001 at 2017-05-18 12:41:41,339; send
> >> took 8 millis
> >> 2017-05-18 12:41:43,354 INFO [Heartbeat Monitor Thread-1]
> >> o.a.n.c.c.h.AbstractHeartbeatMonitor Finished processing 1 heartbeats in
> >> 93276 nanos
> >> 2017-05-18 12:41:46,346 DEBUG [Process Cluster Protocol Request-4]
> >> o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor Received new heartbeat from
> >> centos-b:8080
> >> 2017-05-18 12:41:46,346 DEBUG [Process Cluster Protocol Request-4]
> >> o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor
> >>
> >> Calculated diff between current cluster status and node cluster status as
> >> follows:
> >> Node: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED,
> >> updateId=45], NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTED,
> >> updateId=42]]
> >> Self: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED,
> >> updateId=45], NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTED,
> >> updateId=42]]
> >> Difference: []
> >>
> >>
> >>
> >>
> >> --
> >> View this message in context:
> >> http://apache-nifi-users-list.2361937.n4.nabble.com/Nifi-Cluster-fails-to-disconnect-node-when-node-was-killed-tp1942p1950.html
> >> Sent from the Apache NiFi Users List mailing list archive at Nabble.com.
>
>
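
To make the behaviour in the quoted log and in Matt's explanation concrete, here
is a toy model of a heartbeat monitor. It is inferred only from the log messages
above ("Purging old heartbeats", "Received no new heartbeats. Will not disconnect
any nodes due to lack of heartbeat") and is not NiFi's actual code. The class and
method names are hypothetical and the 40-second threshold is arbitrary. The model
assumes the monitor only examines nodes for which it holds a heartbeat, so a node
whose heartbeats were purged on election and that never sends another is never
considered for disconnection.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitorModel:
    """Toy model of a coordinator-side heartbeat monitor (not NiFi code)."""

    max_age_seconds: float
    connected: set = field(default_factory=set)
    # node id -> timestamp (seconds) of the most recent heartbeat on record
    latest: dict = field(default_factory=dict)

    def receive(self, node_id, timestamp):
        """Record a heartbeat from a node."""
        self.latest[node_id] = timestamp

    def purge(self):
        """Drop all recorded heartbeats, as logged when a node is elected."""
        self.latest.clear()

    def monitor(self, now):
        """Run one monitoring pass and return the set of nodes disconnected."""
        if not self.latest:
            # Corresponds to "Received no new heartbeats. Will not
            # disconnect any nodes due to lack of heartbeat".
            return set()
        # Only nodes with a heartbeat on record are checked for staleness.
        stale = {
            node for node, ts in self.latest.items()
            if now - ts > self.max_age_seconds
        }
        self.connected -= stale
        for node in stale:
            del self.latest[node]
        return stale


if __name__ == "__main__":
    monitor = HeartbeatMonitorModel(
        max_age_seconds=40.0,
        connected={"centos-a:8080", "centos-b:8080"},
    )
    monitor.purge()                         # centos-b elected coordinator
    monitor.receive("centos-b:8080", 3.0)   # centos-b heartbeats to itself
    print(monitor.monitor(now=5.0))         # nothing is disconnected
    print(sorted(monitor.connected))        # centos-a is still listed
```

In this model centos-a stays in the connected set indefinitely, because no
heartbeat from it is ever recorded after the purge and so it is never compared
against the threshold.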
