If there is no longer a quorum then we cannot drive things from the UI, but the remaining cluster is intact from a functioning point of view, aside from being able to assign a primary node to handle the one-off items.
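When the UI cannot be reached, cluster membership can still be read from the REST API of any surviving node. A minimal sketch, assuming a NiFi 1.x cluster, an unsecured API, and the third-party Python requests library; the host name is illustrative and the response shape is assumed from the 1.x /controller/cluster endpoint:

    import requests  # third-party: pip install requests

    # Any surviving node's API; host and port are illustrative.
    NIFI_API = "http://centos-b:8080/nifi-api"

    def cluster_summary():
        """Print each node's id, address, state, and roles, roughly as
        the cluster view in the UI would show them."""
        resp = requests.get(f"{NIFI_API}/controller/cluster")
        resp.raise_for_status()
        for node in resp.json()["cluster"]["nodes"]:
            print(node["nodeId"], node["address"],
                  node["status"], node.get("roles", []))

    if __name__ == "__main__":
        cluster_summary()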
On Fri, May 19, 2017 at 11:04 AM, Neil Derraugh
<neil.derra...@intellifylearning.com> wrote:
> Hi Joe,
>
> Maybe I'm missing something, but if the primary node suffers a network
> partition or container/VM/machine loss or becomes otherwise unreachable,
> then the cluster is unusable, at least from the UI.
>
> If that's not so please correct me.
>
> Thanks,
> Neil
>
> On Thu, May 18, 2017 at 9:56 PM, Joe Witt <joe.w...@gmail.com> wrote:
>> Neil,
>>
>> Want to make sure I understand what you're saying. What are you stating
>> is a single point of failure?
>>
>> Thanks
>> Joe
>>
>> On Thu, May 18, 2017 at 5:27 PM, Neil Derraugh
>> <neil.derra...@intellifylearning.com> wrote:
>>> Thanks for the insight Matt.
>>>
>>> It's a disaster recovery issue. It's not something I plan on doing on
>>> purpose. It seems it is a single point of failure, unfortunately. I
>>> can see no other way to resolve the issue than to blow everything
>>> away and start a new cluster.
>>>
>>> On Thu, May 18, 2017 at 2:49 PM, Matt Gilman <matt.c.gil...@gmail.com>
>>> wrote:
>>>> Neil,
>>>>
>>>> Disconnecting a node prior to removal is the correct process. It
>>>> appears that the check was lost going from 0.x to 1.x. Folks
>>>> reported this JIRA [1] indicating that deleting a connected node did
>>>> not work. This process does not work because the node needs to be
>>>> disconnected first. The JIRA was addressed by restoring the check
>>>> that a node is disconnected prior to deletion.
>>>>
>>>> Hopefully the JIRA I filed earlier today [2] will address the
>>>> phantom node you were seeing. Until then, can you update your
>>>> workaround to disconnect the node in question prior to deletion?
>>>>
>>>> Thanks
>>>>
>>>> Matt
>>>>
>>>> [1] https://issues.apache.org/jira/browse/NIFI-3295
>>>> [2] https://issues.apache.org/jira/browse/NIFI-3933
>>>>
>>>> On Thu, May 18, 2017 at 12:29 PM, Neil Derraugh
>>>> <neil.derra...@intellifylearning.com> wrote:
>>>>> Pretty sure this is the problem I was describing in the "Phantom
>>>>> Node" thread recently.
>>>>>
>>>>> If I kill non-primary nodes the cluster remains healthy despite the
>>>>> lost nodes. The terminated nodes end up with a DISCONNECTED status.
>>>>>
>>>>> If I kill the primary it winds up with a CONNECTED status, but a
>>>>> new primary/cluster coordinator gets elected too.
>>>>>
>>>>> Additionally, it seems that in 1.2.0 the REST API no longer
>>>>> supports deleting a node in a CONNECTED state (Cannot remove Node
>>>>> with ID 1780fde7-c2f4-469c-9884-fe843eac5b73 because it is not
>>>>> disconnected, current state = CONNECTED). So right now I don't have
>>>>> a workaround and have to kill all the nodes and start over.
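Matt's disconnect-before-delete process can also be scripted rather than driven through the UI. A minimal sketch, with the endpoints and payload shape assumed from the NiFi 1.x REST API and the third-party requests library; the host is illustrative, and in practice you would poll until the node actually reports DISCONNECTED before deleting, since disconnection is asynchronous:

    import requests  # third-party: pip install requests

    NIFI_API = "http://centos-b:8080/nifi-api"  # any reachable node; illustrative

    def remove_node(node_id):
        """Disconnect a node, then delete it -- the order the 1.x API
        enforces (deleting a CONNECTED node is rejected)."""
        # Ask the cluster coordinator to disconnect the node.
        requests.put(
            f"{NIFI_API}/controller/cluster/nodes/{node_id}",
            json={"node": {"nodeId": node_id, "status": "DISCONNECTING"}},
        ).raise_for_status()
        # Once the node reports DISCONNECTED, removal is allowed.
        requests.delete(
            f"{NIFI_API}/controller/cluster/nodes/{node_id}"
        ).raise_for_status()

    # Example with the node ID from Neil's error message:
    # remove_node("1780fde7-c2f4-469c-9884-fe843eac5b73")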
>>>>> On Thu, May 18, 2017 at 11:20 AM, Mark Payne <marka...@hotmail.com>
>>>>> wrote:
>>>>>> Hello,
>>>>>>
>>>>>> Just looking through this thread now. I believe that I understand
>>>>>> the problem. I have updated the JIRA with details about what I
>>>>>> think is the problem and a potential remedy for it.
>>>>>>
>>>>>> Thanks
>>>>>> -Mark
>>>>>>
>>>>>>> On May 18, 2017, at 9:49 AM, Matt Gilman <matt.c.gil...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Thanks for the additional details. They will be helpful when
>>>>>>> working the JIRA. All nodes, including the coordinator, heartbeat
>>>>>>> to the active coordinator. This means that the coordinator
>>>>>>> effectively heartbeats to itself. It appears, based on your log
>>>>>>> messages, that this is not happening. Because no heartbeats were
>>>>>>> received from any node, the lack of heartbeats from the
>>>>>>> terminated node is not considered.
>>>>>>>
>>>>>>> Matt
>>>>>>>
>>>>>>> Sent from my iPhone
>>>>>>>
>>>>>>>> On May 18, 2017, at 8:30 AM, ddewaele <ddewa...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> Found something interesting in the centos-b debug logging...
>>>>>>>>
>>>>>>>> After centos-a (the coordinator) is killed, centos-b takes over.
>>>>>>>> Notice how it "Will not disconnect any nodes due to lack of
>>>>>>>> heartbeat" and how it still sees centos-a as connected despite
>>>>>>>> the fact that there are no heartbeats anymore.
>>>>>>>>
>>>>>>>> 2017-05-18 12:41:38,010 INFO [Leader Election Notification Thread-2] o.apache.nifi.controller.FlowController This node elected Active Cluster Coordinator
>>>>>>>> 2017-05-18 12:41:38,010 DEBUG [Leader Election Notification Thread-2] o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor Purging old heartbeats
>>>>>>>> 2017-05-18 12:41:38,014 INFO [Leader Election Notification Thread-1] o.apache.nifi.controller.FlowController This node has been elected Primary Node
>>>>>>>> 2017-05-18 12:41:38,353 DEBUG [Heartbeat Monitor Thread-1] o.a.n.c.c.h.AbstractHeartbeatMonitor Received no new heartbeats. Will not disconnect any nodes due to lack of heartbeat
>>>>>>>> 2017-05-18 12:41:41,336 DEBUG [Process Cluster Protocol Request-3] o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor Received new heartbeat from centos-b:8080
>>>>>>>> 2017-05-18 12:41:41,337 DEBUG [Process Cluster Protocol Request-3] o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor
>>>>>>>>
>>>>>>>> Calculated diff between current cluster status and node cluster status as follows:
>>>>>>>> Node: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED, updateId=45], NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTED, updateId=42]]
>>>>>>>> Self: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED, updateId=45], NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTED, updateId=42]]
>>>>>>>> Difference: []
>>>>>>>>
>>>>>>>> 2017-05-18 12:41:41,337 INFO [Process Cluster Protocol Request-3] o.a.n.c.p.impl.SocketProtocolListener Finished processing request 410e7db5-8bb0-4f97-8ee8-fc8647c54959 (type=HEARTBEAT, length=2341 bytes) from centos-b:8080 in 3 millis
>>>>>>>> 2017-05-18 12:41:41,339 INFO [Clustering Tasks Thread-2] o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2017-05-18 12:41:41,330 and sent to centos-b:10001 at 2017-05-18 12:41:41,339; send took 8 millis
>>>>>>>> 2017-05-18 12:41:43,354 INFO [Heartbeat Monitor Thread-1] o.a.n.c.c.h.AbstractHeartbeatMonitor Finished processing 1 heartbeats in 93276 nanos
>>>>>>>> 2017-05-18 12:41:46,346 DEBUG [Process Cluster Protocol Request-4] o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor Received new heartbeat from centos-b:8080
>>>>>>>> 2017-05-18 12:41:46,346 DEBUG [Process Cluster Protocol Request-4] o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor
>>>>>>>>
>>>>>>>> Calculated diff between current cluster status and node cluster status as follows:
>>>>>>>> Node: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED, updateId=45], NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTED, updateId=42]]
>>>>>>>> Self: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED, updateId=45], NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTED, updateId=42]]
>>>>>>>> Difference: []
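To make the monitor behavior in these logs concrete, here is a simplified sketch of the guard Matt describes. It is an illustration only, not NiFi's implementation; the function, data structures, and timeout value are invented for the example:

    import time

    HEARTBEAT_TIMEOUT_SECS = 40  # illustrative threshold, not NiFi's value

    def nodes_to_disconnect(last_heartbeat, connected_nodes):
        """last_heartbeat: node id -> time the active coordinator last
        received a heartbeat from that node (the coordinator heartbeats
        to itself, so its own entry should appear here too)."""
        if not last_heartbeat:
            # "Received no new heartbeats. Will not disconnect any nodes
            # due to lack of heartbeat": with zero heartbeats the monitor
            # cannot tell a silent node from its own failure to receive,
            # so every node stays CONNECTED -- including the killed
            # coordinator, which is why centos-a never changes state.
            return []
        now = time.time()
        # Nodes whose latest heartbeat is missing or too old are
        # candidates for disconnection.
        return [n for n in connected_nodes
                if now - last_heartbeat.get(n, 0) > HEARTBEAT_TIMEOUT_SECS]

Once heartbeats from surviving nodes do arrive, a node that stays silent past the threshold would normally be marked DISCONNECTED; the JIRA Mark updated tracks why that was not happening for the killed coordinator here.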