Pretty sure this is the problem I was describing in the "Phantom Node" thread recently.
If I kill non-primary nodes, the cluster remains healthy despite the lost nodes. The terminated nodes end up with a DISCONNECTED status.

If I kill the primary, it winds up with a CONNECTED status, but a new primary/cluster coordinator gets elected too.

Additionally, it seems that in 1.2.0 the REST API no longer supports deleting a node in a CONNECTED state (Cannot remove Node with ID 1780fde7-c2f4-469c-9884-fe843eac5b73 because it is not disconnected, current state = CONNECTED).

So right now I don't have a workaround and have to kill all the nodes and start over.

On Thu, May 18, 2017 at 11:20 AM, Mark Payne <marka...@hotmail.com> wrote:

> Hello,
>
> Just looking through this thread now. I believe that I understand the problem. I have updated the JIRA with details about what I think is the problem and a potential remedy for the problem.
>
> Thanks
> -Mark
>
> > On May 18, 2017, at 9:49 AM, Matt Gilman <matt.c.gil...@gmail.com> wrote:
> >
> > Thanks for the additional details. They will be helpful when working the JIRA. All nodes, including the coordinator, heartbeat to the active coordinator. This means that the coordinator effectively heartbeats to itself. It appears, based on your log messages, that this is not happening. Because no heartbeats were received from any node, the lack of heartbeats from the terminated node is not considered.
> >
> > Matt
> >
> > Sent from my iPhone
> >
> >> On May 18, 2017, at 8:30 AM, ddewaele <ddewa...@gmail.com> wrote:
> >>
> >> Found something interesting in the centos-b debug logging....
> >>
> >> after centos-a (the coordinator) is killed centos-b takes over. Notice how it "Will not disconnect any nodes due to lack of heartbeat" and how it still sees centos-a as connected despite the fact that there are no heartbeats anymore.
> >>
> >> 2017-05-18 12:41:38,010 INFO [Leader Election Notification Thread-2] o.apache.nifi.controller.FlowController This node elected Active Cluster Coordinator
> >> 2017-05-18 12:41:38,010 DEBUG [Leader Election Notification Thread-2] o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor Purging old heartbeats
> >> 2017-05-18 12:41:38,014 INFO [Leader Election Notification Thread-1] o.apache.nifi.controller.FlowController This node has been elected Primary Node
> >> 2017-05-18 12:41:38,353 DEBUG [Heartbeat Monitor Thread-1] o.a.n.c.c.h.AbstractHeartbeatMonitor Received no new heartbeats. Will not disconnect any nodes due to lack of heartbeat
> >> 2017-05-18 12:41:41,336 DEBUG [Process Cluster Protocol Request-3] o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor Received new heartbeat from centos-b:8080
> >> 2017-05-18 12:41:41,337 DEBUG [Process Cluster Protocol Request-3] o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor
> >>
> >> Calculated diff between current cluster status and node cluster status as follows:
> >> Node: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED, updateId=45], NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTED, updateId=42]]
> >> Self: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED, updateId=45], NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTED, updateId=42]]
> >> Difference: []
> >>
> >>
> >> 2017-05-18 12:41:41,337 INFO [Process Cluster Protocol Request-3] o.a.n.c.p.impl.SocketProtocolListener Finished processing request 410e7db5-8bb0-4f97-8ee8-fc8647c54959 (type=HEARTBEAT, length=2341 bytes) from centos-b:8080 in 3 millis
> >> 2017-05-18 12:41:41,339 INFO [Clustering Tasks Thread-2] o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2017-05-18 12:41:41,330 and sent to centos-b:10001 at 2017-05-18 12:41:41,339; send took 8 millis
> >> 2017-05-18 12:41:43,354 INFO [Heartbeat Monitor Thread-1] o.a.n.c.c.h.AbstractHeartbeatMonitor Finished processing 1 heartbeats in 93276 nanos
> >> 2017-05-18 12:41:46,346 DEBUG [Process Cluster Protocol Request-4] o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor Received new heartbeat from centos-b:8080
> >> 2017-05-18 12:41:46,346 DEBUG [Process Cluster Protocol Request-4] o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor
> >>
> >> Calculated diff between current cluster status and node cluster status as follows:
> >> Node: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED, updateId=45], NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTED, updateId=42]]
> >> Self: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED, updateId=45], NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTED, updateId=42]]
> >> Difference: []
> >>
> >> --
> >> View this message in context: http://apache-nifi-users-list.2361937.n4.nabble.com/Nifi-Cluster-fails-to-disconnect-node-when-node-was-killed-tp1942p1950.html
> >> Sent from the Apache NiFi Users List mailing list archive at Nabble.com.
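
The monitor behavior visible in the quoted log can be modeled with a small sketch. This is not NiFi's code: the class `HeartbeatMonitor`, its method names, and the 40-second timeout are hypothetical. It assumes, based only on the log messages and Matt's explanation, that the staleness check looks solely at nodes for which a heartbeat has been recorded since the last purge. Under that assumption, a node that was killed before the new coordinator took over never gets a record and so is never disconnected.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Toy model of a coordinator-side heartbeat monitor (hypothetical, not NiFi code)."""

    timeout_s: float
    connected: set = field(default_factory=set)
    latest: dict = field(default_factory=dict)  # node id -> time of last heartbeat

    def purge(self) -> None:
        # Models the "Purging old heartbeats" step logged right after election.
        self.latest.clear()

    def receive(self, node_id: str, now: float) -> None:
        self.latest[node_id] = now

    def check(self, now: float) -> set:
        """Run one monitoring pass and return the nodes it disconnects."""
        if not self.latest:
            # Models "Received no new heartbeats. Will not disconnect any nodes".
            return set()
        # Assumption: only nodes with a recorded heartbeat are examined.
        stale = {n for n, t in self.latest.items() if now - t > self.timeout_s}
        for node_id in stale:
            self.connected.discard(node_id)
            del self.latest[node_id]
        return stale


# centos-b becomes coordinator after centos-a was killed.
monitor = HeartbeatMonitor(timeout_s=40.0, connected={"centos-a:8080", "centos-b:8080"})
monitor.purge()
first_pass = monitor.check(now=0.0)  # nothing recorded yet, so nothing is disconnected

# centos-b keeps heartbeating to itself; centos-a never does.
disconnected = set()
for t in range(3, 300, 5):
    monitor.receive("centos-b:8080", float(t))
    disconnected |= monitor.check(now=float(t) + 2.0)

print(sorted(monitor.connected))  # centos-a:8080 is still listed
```

Running it leaves both nodes in `connected`, which matches the `state=CONNECTED` entry for centos-a in the diff output. In this model, a node that heartbeats at least once and then stops is caught by the same check, which would fit the DISCONNECTED status seen when a non-primary node is killed.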