That's fair. But just to be totally clear on my end: after one of these disaster scenarios, with a newly quorum-elected primary, things cannot be driven through the UI, nor through at least parts of the REST API.
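For concreteness, here is a sketch (not anything from the thread itself) of how that post-failure state can be inspected by summarizing the JSON from GET nifi-api/controller/cluster on any surviving node. The payload shape (nodes carrying "status" and "roles") is an assumption based on the NiFi 1.x REST API, and the node addresses below are made up; fetch the real response with curl before relying on this.

```python
# Sketch: summarize which node holds the elected roles after a failure.
# Assumed payload shape: {"cluster": {"nodes": [{"address", "apiPort",
# "status", "roles"}, ...]}} per the NiFi 1.x /controller/cluster endpoint.
def summarize_cluster(payload):
    """Map each node's address to its reported state and elected roles."""
    summary = {}
    for node in payload["cluster"]["nodes"]:
        addr = "{}:{}".format(node["address"], node["apiPort"])
        summary[addr] = (node["status"], tuple(node.get("roles", [])))
    return summary

# Illustrative response for a 3-node cluster after node-a is killed:
# node-a still (wrongly) shows CONNECTED, node-b now holds both roles.
resp = {"cluster": {"nodes": [
    {"address": "node-a", "apiPort": 8080, "status": "CONNECTED", "roles": []},
    {"address": "node-b", "apiPort": 8080, "status": "CONNECTED",
     "roles": ["Primary Node", "Cluster Coordinator"]},
    {"address": "node-c", "apiPort": 8080, "status": "CONNECTED", "roles": []},
]}}
print(summarize_cluster(resp))
```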
I just ran through the following. We have 3 nodes A, B, and C, with A primary, and A becomes unreachable without first disconnecting. Then B and C may (I haven't verified) continue operating the flow they had in the cluster's last "good" state. But they do elect a new primary, as per the REST nifi-api/controller/cluster response. But now the flow can't be changed, and in some cases it can't be reported on either, i.e. some GETs fail, like nifi-api/flow/process-groups/root. Are we describing the same behavior?

On Fri, May 19, 2017 at 11:12 AM, Joe Witt <joe.w...@gmail.com> wrote:
> If there is no longer a quorum then we cannot drive things from the UI,
> but the remaining cluster is intact from a functioning point of view,
> other than being able to assign a primary to handle the one-off items.
>
> On Fri, May 19, 2017 at 11:04 AM, Neil Derraugh
> <neil.derra...@intellifylearning.com> wrote:
> > Hi Joe,
> >
> > Maybe I'm missing something, but if the primary node suffers a network
> > partition or container/VM/machine loss or becomes otherwise unreachable,
> > then the cluster is unusable, at least from the UI.
> >
> > If that's not so, please correct me.
> >
> > Thanks,
> > Neil
> >
> > On Thu, May 18, 2017 at 9:56 PM, Joe Witt <joe.w...@gmail.com> wrote:
> >> Neil,
> >>
> >> Want to make sure I understand what you're saying. What are you
> >> stating is a single point of failure?
> >>
> >> Thanks
> >> Joe
> >>
> >> On Thu, May 18, 2017 at 5:27 PM, Neil Derraugh
> >> <neil.derra...@intellifylearning.com> wrote:
> >> > Thanks for the insight, Matt.
> >> >
> >> > It's a disaster recovery issue. It's not something I plan on doing
> >> > on purpose. It seems it is a single point of failure, unfortunately.
> >> > I can see no other way to resolve the issue other than to blow
> >> > everything away and start a new cluster.
> >> >
> >> > On Thu, May 18, 2017 at 2:49 PM, Matt Gilman <matt.c.gil...@gmail.com>
> >> > wrote:
> >> >> Neil,
> >> >>
> >> >> Disconnecting a node prior to removal is the correct process. It
> >> >> appears that the check was lost going from 0.x to 1.x. Folks reported
> >> >> this JIRA [1] indicating that deleting a connected node did not work.
> >> >> This process does not work because the node needs to be disconnected
> >> >> first. The JIRA was addressed by restoring the check that a node is
> >> >> disconnected prior to deletion.
> >> >>
> >> >> Hopefully the JIRA I filed earlier today [2] will address the phantom
> >> >> node you were seeing. Until then, can you update your workaround to
> >> >> disconnect the node in question prior to deletion?
> >> >>
> >> >> Thanks
> >> >>
> >> >> Matt
> >> >>
> >> >> [1] https://issues.apache.org/jira/browse/NIFI-3295
> >> >> [2] https://issues.apache.org/jira/browse/NIFI-3933
> >> >>
> >> >> On Thu, May 18, 2017 at 12:29 PM, Neil Derraugh
> >> >> <neil.derra...@intellifylearning.com> wrote:
> >> >>> Pretty sure this is the problem I was describing in the "Phantom
> >> >>> Node" thread recently.
> >> >>>
> >> >>> If I kill non-primary nodes the cluster remains healthy despite the
> >> >>> lost nodes. The terminated nodes end up with a DISCONNECTED status.
> >> >>>
> >> >>> If I kill the primary it winds up with a CONNECTED status, but a new
> >> >>> primary/cluster coordinator gets elected too.
> >> >>>
> >> >>> Additionally, it seems that in 1.2.0 the REST API no longer supports
> >> >>> deleting a node in a CONNECTED state ("Cannot remove Node with ID
> >> >>> 1780fde7-c2f4-469c-9884-fe843eac5b73 because it is not disconnected,
> >> >>> current state = CONNECTED"). So right now I don't have a workaround
> >> >>> and have to kill all the nodes and start over.
> >> >>>
> >> >>> On Thu, May 18, 2017 at 11:20 AM, Mark Payne <marka...@hotmail.com>
> >> >>> wrote:
> >> >>>> Hello,
> >> >>>>
> >> >>>> Just looking through this thread now. I believe that I understand
> >> >>>> the problem. I have updated the JIRA with details about what I
> >> >>>> think is the problem and a potential remedy for the problem.
> >> >>>>
> >> >>>> Thanks
> >> >>>> -Mark
> >> >>>>
> >> >>>> > On May 18, 2017, at 9:49 AM, Matt Gilman <matt.c.gil...@gmail.com>
> >> >>>> > wrote:
> >> >>>> >
> >> >>>> > Thanks for the additional details. They will be helpful when
> >> >>>> > working the JIRA. All nodes, including the coordinator, heartbeat
> >> >>>> > to the active coordinator. This means that the coordinator
> >> >>>> > effectively heartbeats to itself. It appears, based on your log
> >> >>>> > messages, that this is not happening. Because no heartbeats were
> >> >>>> > received from any node, the lack of heartbeats from the
> >> >>>> > terminated node is not considered.
> >> >>>> >
> >> >>>> > Matt
> >> >>>> >
> >> >>>> > Sent from my iPhone
> >> >>>> >
> >> >>>> >> On May 18, 2017, at 8:30 AM, ddewaele <ddewa...@gmail.com> wrote:
> >> >>>> >>
> >> >>>> >> Found something interesting in the centos-b debug logging....
> >> >>>> >>
> >> >>>> >> After centos-a (the coordinator) is killed, centos-b takes over.
> >> >>>> >> Notice how it "Will not disconnect any nodes due to lack of
> >> >>>> >> heartbeat" and how it still sees centos-a as connected despite
> >> >>>> >> the fact that there are no heartbeats anymore.
> >> >>>> >>
> >> >>>> >> 2017-05-18 12:41:38,010 INFO [Leader Election Notification Thread-2] o.apache.nifi.controller.FlowController This node elected Active Cluster Coordinator
> >> >>>> >> 2017-05-18 12:41:38,010 DEBUG [Leader Election Notification Thread-2] o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor Purging old heartbeats
> >> >>>> >> 2017-05-18 12:41:38,014 INFO [Leader Election Notification Thread-1] o.apache.nifi.controller.FlowController This node has been elected Primary Node
> >> >>>> >> 2017-05-18 12:41:38,353 DEBUG [Heartbeat Monitor Thread-1] o.a.n.c.c.h.AbstractHeartbeatMonitor Received no new heartbeats. Will not disconnect any nodes due to lack of heartbeat
> >> >>>> >> 2017-05-18 12:41:41,336 DEBUG [Process Cluster Protocol Request-3] o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor Received new heartbeat from centos-b:8080
> >> >>>> >> 2017-05-18 12:41:41,337 DEBUG [Process Cluster Protocol Request-3] o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor
> >> >>>> >> Calculated diff between current cluster status and node cluster status as follows:
> >> >>>> >> Node: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED, updateId=45], NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTED, updateId=42]]
> >> >>>> >> Self: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED, updateId=45], NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTED, updateId=42]]
> >> >>>> >> Difference: []
> >> >>>> >>
> >> >>>> >> 2017-05-18 12:41:41,337 INFO [Process Cluster Protocol Request-3] o.a.n.c.p.impl.SocketProtocolListener Finished processing request 410e7db5-8bb0-4f97-8ee8-fc8647c54959 (type=HEARTBEAT, length=2341 bytes) from centos-b:8080 in 3 millis
> >> >>>> >> 2017-05-18 12:41:41,339 INFO [Clustering Tasks Thread-2] o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2017-05-18 12:41:41,330 and sent to centos-b:10001 at 2017-05-18 12:41:41,339; send took 8 millis
> >> >>>> >> 2017-05-18 12:41:43,354 INFO [Heartbeat Monitor Thread-1] o.a.n.c.c.h.AbstractHeartbeatMonitor Finished processing 1 heartbeats in 93276 nanos
> >> >>>> >> 2017-05-18 12:41:46,346 DEBUG [Process Cluster Protocol Request-4] o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor Received new heartbeat from centos-b:8080
> >> >>>> >> 2017-05-18 12:41:46,346 DEBUG [Process Cluster Protocol Request-4] o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor
> >> >>>> >> Calculated diff between current cluster status and node cluster status as follows:
> >> >>>> >> Node: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED, updateId=45], NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTED, updateId=42]]
> >> >>>> >> Self: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED, updateId=45], NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTED, updateId=42]]
> >> >>>> >> Difference: []
> >> >>>> >>
> >> >>>> >> --
> >> >>>> >> View this message in context: http://apache-nifi-users-list.2361937.n4.nabble.com/Nifi-Cluster-fails-to-disconnect-node-when-node-was-killed-tp1942p1950.html
> >> >>>> >> Sent from the Apache NiFi Users List mailing list archive at Nabble.com.
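The disconnect-before-delete workaround Matt describes can be sketched as two REST calls. This is only a sketch: the node ID is the one from the error message in the thread, the host is a placeholder, and the exact endpoint paths and PUT body shape are assumptions based on the NiFi 1.x node API, so verify them against your version. The requests are only built here, not sent.

```python
import json

# Placeholder host; the node ID is taken from the thread's error message.
BASE = "http://node-b:8080/nifi-api/controller/cluster/nodes"
NODE_ID = "1780fde7-c2f4-469c-9884-fe843eac5b73"

# Step 1: ask the coordinator to disconnect the node; deletion is refused
# while the node is still in the CONNECTED state.
disconnect = {
    "method": "PUT",
    "url": "{}/{}".format(BASE, NODE_ID),
    "body": json.dumps({"node": {"nodeId": NODE_ID, "status": "DISCONNECTING"}}),
}

# Step 2: once the node reports DISCONNECTED, removal should succeed.
delete = {"method": "DELETE", "url": "{}/{}".format(BASE, NODE_ID), "body": None}

for req in (disconnect, delete):
    print(req["method"], req["url"])
```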