I see. Yeah, that sounds like something the JIRA Gilman mentioned will resolve. Thanks for clarifying. I'm sure that JIRA will be addressed soon.
On May 19, 2017 1:06 PM, "Neil Derraugh" <neil.derra...@intellifylearning.com> wrote:

> That's the whole problem from my perspective: it stays CONNECTED. It never becomes DISCONNECTED. You can't delete it from the API in 1.2.0.
>
> That's why I said it was a single point of failure. The exact semantics of calling it a single point of failure might be debatable, but the fact that the cluster can't be modified and/or gracefully shut down (afaik) is what I was referring to.
>
> On Fri, May 19, 2017 at 12:40 PM, Joe Witt <joe.w...@gmail.com> wrote:
>
>> I believe at the state you describe that down node is now considered disconnected. The cluster behavior prohibits you from making changes when it knows not all members of the cluster can honor the change. If you are sure you want to make the changes anyway and move on without that node you should be able to remove it/delete it from the cluster. Now you have a cluster of two connected nodes and you can make changes.
>>
>> On May 19, 2017 12:23 PM, "Neil Derraugh" <neil.derraugh@intellifylearning.com> wrote:
>>
>>> That's fair. But for the sake of total clarity on my own part, after one of these disaster scenarios with a newly quorum-elected primary things cannot be driven through the UI and at least through parts of the REST API.
>>>
>>> I just ran through the following. We have 3 nodes A, B, C with A primary, and A becomes unreachable without first disconnecting. Then B and C may (I haven't verified) continue operating the flow they had in the cluster's last "good" state. But they do elect a new primary, as per the REST nifi-api/controller/cluster response. But now the flow can't be changed, and in some cases it can't be reported on either, i.e. some GETs fail, like nifi-api/flow/process-groups/root.
>>>
>>> Are we describing the same behavior?
>>>
>>> On Fri, May 19, 2017 at 11:12 AM, Joe Witt <joe.w...@gmail.com> wrote:
>>>
>>>> If there is no longer a quorum then we cannot drive things from the UI but the cluster remaining is intact from a functioning point of view other than being able to assign a primary to handle the one-off items.
>>>>
>>>> On Fri, May 19, 2017 at 11:04 AM, Neil Derraugh <neil.derra...@intellifylearning.com> wrote:
>>>> > Hi Joe,
>>>> >
>>>> > Maybe I'm missing something, but if the primary node suffers a network partition or container/vm/machine loss or becomes otherwise unreachable then the cluster is unusable, at least from the UI.
>>>> >
>>>> > If that's not so please correct me.
>>>> >
>>>> > Thanks,
>>>> > Neil
>>>> >
>>>> > On Thu, May 18, 2017 at 9:56 PM, Joe Witt <joe.w...@gmail.com> wrote:
>>>> >>
>>>> >> Neil,
>>>> >>
>>>> >> Want to make sure I understand what you're saying. What are you stating is a single point of failure?
>>>> >>
>>>> >> Thanks
>>>> >> Joe
>>>> >>
>>>> >> On Thu, May 18, 2017 at 5:27 PM, Neil Derraugh <neil.derra...@intellifylearning.com> wrote:
>>>> >> > Thanks for the insight Matt.
>>>> >> >
>>>> >> > It's a disaster recovery issue. It's not something I plan on doing on purpose. It seems it is a single point of failure unfortunately. I can see no other way to resolve the issue other than to blow everything away and start a new cluster.
>>>> >> >
>>>> >> > On Thu, May 18, 2017 at 2:49 PM, Matt Gilman <matt.c.gil...@gmail.com> wrote:
>>>> >> >>
>>>> >> >> Neil,
>>>> >> >>
>>>> >> >> Disconnecting a node prior to removal is the correct process. It appears that the check was lost going from 0.x to 1.x. Folks reported this JIRA [1] indicating that deleting a connected node did not work. This process does not work because the node needs to be disconnected first.
>>>> >> >> The JIRA was addressed by restoring the check that a node is disconnected prior to deletion.
>>>> >> >>
>>>> >> >> Hopefully the JIRA I filed earlier today [2] will address the phantom node you were seeing. Until then, can you update your workaround to disconnect the node in question prior to deletion?
>>>> >> >>
>>>> >> >> Thanks
>>>> >> >>
>>>> >> >> Matt
>>>> >> >>
>>>> >> >> [1] https://issues.apache.org/jira/browse/NIFI-3295
>>>> >> >> [2] https://issues.apache.org/jira/browse/NIFI-3933
>>>> >> >>
>>>> >> >> On Thu, May 18, 2017 at 12:29 PM, Neil Derraugh <neil.derra...@intellifylearning.com> wrote:
>>>> >> >>>
>>>> >> >>> Pretty sure this is the problem I was describing in the "Phantom Node" thread recently.
>>>> >> >>>
>>>> >> >>> If I kill non-primary nodes the cluster remains healthy despite the lost nodes. The terminated nodes end up with a DISCONNECTED status.
>>>> >> >>>
>>>> >> >>> If I kill the primary it winds up with a CONNECTED status, but a new primary/cluster coordinator gets elected too.
>>>> >> >>>
>>>> >> >>> Additionally it seems in 1.2.0 that the REST API no longer supports deleting a node in a CONNECTED state (Cannot remove Node with ID 1780fde7-c2f4-469c-9884-fe843eac5b73 because it is not disconnected, current state = CONNECTED). So right now I don't have a workaround and have to kill all the nodes and start over.
>>>> >> >>>
>>>> >> >>> On Thu, May 18, 2017 at 11:20 AM, Mark Payne <marka...@hotmail.com> wrote:
>>>> >> >>>>
>>>> >> >>>> Hello,
>>>> >> >>>>
>>>> >> >>>> Just looking through this thread now. I believe that I understand the problem.
>>>> >> >>>> I have updated the JIRA with details about what I think is the problem and a potential remedy for the problem.
>>>> >> >>>>
>>>> >> >>>> Thanks
>>>> >> >>>> -Mark
>>>> >> >>>>
>>>> >> >>>> > On May 18, 2017, at 9:49 AM, Matt Gilman <matt.c.gil...@gmail.com> wrote:
>>>> >> >>>> >
>>>> >> >>>> > Thanks for the additional details. They will be helpful when working the JIRA. All nodes, including the coordinator, heartbeat to the active coordinator. This means that the coordinator effectively heartbeats to itself. It appears, based on your log messages, that this is not happening. Because no heartbeats were received from any node, the lack of heartbeats from the terminated node is not considered.
>>>> >> >>>> >
>>>> >> >>>> > Matt
>>>> >> >>>> >
>>>> >> >>>> > Sent from my iPhone
>>>> >> >>>> >
>>>> >> >>>> >> On May 18, 2017, at 8:30 AM, ddewaele <ddewa...@gmail.com> wrote:
>>>> >> >>>> >>
>>>> >> >>>> >> Found something interesting in the centos-b debug logging....
>>>> >> >>>> >>
>>>> >> >>>> >> After centos-a (the coordinator) is killed centos-b takes over. Notice how it "Will not disconnect any nodes due to lack of heartbeat" and how it still sees centos-a as connected despite the fact that there are no heartbeats anymore.
>>>> >> >>>> >>
>>>> >> >>>> >> 2017-05-18 12:41:38,010 INFO [Leader Election Notification Thread-2] o.apache.nifi.controller.FlowController This node elected Active Cluster Coordinator
>>>> >> >>>> >> 2017-05-18 12:41:38,010 DEBUG [Leader Election Notification Thread-2] o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor Purging old heartbeats
>>>> >> >>>> >> 2017-05-18 12:41:38,014 INFO [Leader Election Notification Thread-1] o.apache.nifi.controller.FlowController This node has been elected Primary Node
>>>> >> >>>> >> 2017-05-18 12:41:38,353 DEBUG [Heartbeat Monitor Thread-1] o.a.n.c.c.h.AbstractHeartbeatMonitor Received no new heartbeats. Will not disconnect any nodes due to lack of heartbeat
>>>> >> >>>> >> 2017-05-18 12:41:41,336 DEBUG [Process Cluster Protocol Request-3] o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor Received new heartbeat from centos-b:8080
>>>> >> >>>> >> 2017-05-18 12:41:41,337 DEBUG [Process Cluster Protocol Request-3] o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor
>>>> >> >>>> >>
>>>> >> >>>> >> Calculated diff between current cluster status and node cluster status as follows:
>>>> >> >>>> >> Node: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED, updateId=45], NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTED, updateId=42]]
>>>> >> >>>> >> Self: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED, updateId=45], NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTED, updateId=42]]
>>>> >> >>>> >> Difference: []
>>>> >> >>>> >>
>>>> >> >>>> >> 2017-05-18 12:41:41,337 INFO [Process Cluster Protocol Request-3] o.a.n.c.p.impl.SocketProtocolListener Finished processing request 410e7db5-8bb0-4f97-8ee8-fc8647c54959 (type=HEARTBEAT, length=2341 bytes) from centos-b:8080 in 3 millis
>>>> >> >>>> >> 2017-05-18 12:41:41,339 INFO [Clustering Tasks Thread-2] o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2017-05-18 12:41:41,330 and sent to centos-b:10001 at 2017-05-18 12:41:41,339; send took 8 millis
>>>> >> >>>> >> 2017-05-18 12:41:43,354 INFO [Heartbeat Monitor Thread-1] o.a.n.c.c.h.AbstractHeartbeatMonitor Finished processing 1 heartbeats in 93276 nanos
>>>> >> >>>> >> 2017-05-18 12:41:46,346 DEBUG [Process Cluster Protocol Request-4] o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor Received new heartbeat from centos-b:8080
>>>> >> >>>> >> 2017-05-18 12:41:46,346 DEBUG [Process Cluster Protocol Request-4] o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor
>>>> >> >>>> >>
>>>> >> >>>> >> Calculated diff between current cluster status and node cluster status as follows:
>>>> >> >>>> >> Node: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED, updateId=45], NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTED, updateId=42]]
>>>> >> >>>> >> Self: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED, updateId=45], NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTED, updateId=42]]
>>>> >> >>>> >> Difference: []
>>>> >> >>>> >>
>>>> >> >>>> >> --
>>>> >> >>>> >> View this message in context:
>>>> >> >>>> >> http://apache-nifi-users-list.2361937.n4.nabble.com/Nifi-Cluster-fails-to-disconnect-node-when-node-was-killed-tp1942p1950.html
>>>> >> >>>> >> Sent from the Apache NiFi Users List mailing list archive at Nabble.com.
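
The symptom discussed in this thread can be illustrated with a toy model. This is a minimal sketch, not NiFi's code: the class `ToyHeartbeatMonitor`, its method names, the `simulate` helper, and the 40-second timeout are all hypothetical. It also assumes that only nodes with a recorded heartbeat are checked for staleness, which is one reading of the log output above and is not confirmed anywhere in the thread.

```python
from dataclasses import dataclass, field

CONNECTED = "CONNECTED"
DISCONNECTED = "DISCONNECTED"


@dataclass
class ToyHeartbeatMonitor:
    """Hypothetical stand-in for a cluster coordinator's heartbeat monitor."""
    timeout: float
    states: dict = field(default_factory=dict)            # node id -> connection state
    latest_heartbeat: dict = field(default_factory=dict)  # node id -> time last heard from

    def on_elected_coordinator(self):
        # Modeled on the "Purging old heartbeats" log line: the newly
        # elected coordinator starts with no heartbeat records at all.
        self.latest_heartbeat.clear()

    def receive(self, node_id, now):
        self.latest_heartbeat[node_id] = now

    def monitor_pass(self, now):
        if not self.latest_heartbeat:
            # Early return quoted from the log in this thread.
            return ("Received no new heartbeats. Will not disconnect "
                    "any nodes due to lack of heartbeat")
        # ASSUMPTION: only nodes that have a heartbeat record are examined,
        # so a node that never reported to this coordinator is never timed out.
        for node_id, seen in self.latest_heartbeat.items():
            if now - seen > self.timeout:
                self.states[node_id] = DISCONNECTED
        return f"Finished processing {len(self.latest_heartbeat)} heartbeats"


def simulate():
    monitor = ToyHeartbeatMonitor(
        timeout=40.0,
        states={"centos-a:8080": CONNECTED, "centos-b:8080": CONNECTED},
    )
    # centos-a (old coordinator) is killed; centos-b takes over.
    monitor.on_elected_coordinator()
    first_message = monitor.monitor_pass(now=0.0)
    # centos-b keeps heartbeating to itself; centos-a stays silent.
    for t in range(5, 600, 5):
        monitor.receive("centos-b:8080", now=float(t))
        monitor.monitor_pass(now=float(t))
    return first_message, monitor.states


first_message, final_states = simulate()
print(first_message)
print(final_states)
```

Under these assumptions centos-a is still reported as CONNECTED long after its last heartbeat, and a removal request that requires the DISCONNECTED state would keep being rejected, which matches the situation Neil describes.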