That's fair. But just to be totally clear on my end: after one of these disaster scenarios, with a newly quorum-elected primary, things cannot be driven through the UI, nor through at least parts of the REST API.
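For concreteness, here is a sketch (not anything from the thread itself) of how that post-failure state can be inspected by summarizing the JSON from GET nifi-api/controller/cluster on any surviving node. The payload shape (nodes carrying "status" and "roles") is an assumption based on the NiFi 1.x REST API, and the node addresses below are made up; fetch the real response with curl before relying on this.

```python
# Sketch: summarize which node holds the elected roles after a failure.
# Assumed payload shape: {"cluster": {"nodes": [{"address", "apiPort",
# "status", "roles"}, ...]}} per the NiFi 1.x /controller/cluster endpoint.
def summarize_cluster(payload):
    """Map each node's address to its reported state and elected roles."""
    summary = {}
    for node in payload["cluster"]["nodes"]:
        addr = "{}:{}".format(node["address"], node["apiPort"])
        summary[addr] = (node["status"], tuple(node.get("roles", [])))
    return summary

# Illustrative response for a 3-node cluster after node-a is killed:
# node-a still (wrongly) shows CONNECTED, node-b now holds both roles.
resp = {"cluster": {"nodes": [
    {"address": "node-a", "apiPort": 8080, "status": "CONNECTED", "roles": []},
    {"address": "node-b", "apiPort": 8080, "status": "CONNECTED",
     "roles": ["Primary Node", "Cluster Coordinator"]},
    {"address": "node-c", "apiPort": 8080, "status": "CONNECTED", "roles": []},
]}}
print(summarize_cluster(resp))
```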
I just ran through the following. We have 3 nodes A, B, and C, with A primary, and A becomes unreachable without first disconnecting. Then B and C may (I haven't verified) continue operating the flow they had in the cluster's last "good" state. But they do elect a new primary, as per the REST nifi-api/controller/cluster response. But now the flow can't be changed, and in some cases it can't be reported on either, i.e. some GETs fail, like nifi-api/flow/process-groups/root. Are we describing the same behavior?

On Fri, May 19, 2017 at 11:12 AM, Joe Witt <joe.w...@gmail.com> wrote:
> If there is no longer a quorum then we cannot drive things from the UI,
> but the remaining cluster is intact from a functioning point of view,
> other than being able to assign a primary to handle the one-off items.
>
> On Fri, May 19, 2017 at 11:04 AM, Neil Derraugh
> <neil.derra...@intellifylearning.com> wrote:
> > Hi Joe,
> >
> > Maybe I'm missing something, but if the primary node suffers a network
> > partition or container/VM/machine loss or becomes otherwise unreachable,
> > then the cluster is unusable, at least from the UI.
> >
> > If that's not so, please correct me.
> >
> > Thanks,
> > Neil
> >
> > On Thu, May 18, 2017 at 9:56 PM, Joe Witt <joe.w...@gmail.com> wrote:
> >> Neil,
> >>
> >> Want to make sure I understand what you're saying. What are you
> >> stating is a single point of failure?
> >>
> >> Thanks
> >> Joe
> >>
> >> On Thu, May 18, 2017 at 5:27 PM, Neil Derraugh
> >> <neil.derra...@intellifylearning.com> wrote:
> >> > Thanks for the insight, Matt.
> >> >
> >> > It's a disaster recovery issue. It's not something I plan on doing
> >> > on purpose. It seems it is a single point of failure, unfortunately.
> >> > I can see no other way to resolve the issue other than to blow
> >> > everything away and start a new cluster.
> >> >
> >> > On Thu, May 18, 2017 at 2:49 PM, Matt Gilman <matt.c.gil...@gmail.com>
> >> > wrote:
> >> >> Neil,
> >> >>
> >> >> Disconnecting a node prior to removal is the correct process. It
> >> >> appears that the check was lost going from 0.x to 1.x. Folks reported
> >> >> this JIRA [1] indicating that deleting a connected node did not work.
> >> >> This process does not work because the node needs to be disconnected
> >> >> first. The JIRA was addressed by restoring the check that a node is
> >> >> disconnected prior to deletion.
> >> >>
> >> >> Hopefully the JIRA I filed earlier today [2] will address the phantom
> >> >> node you were seeing. Until then, can you update your workaround to
> >> >> disconnect the node in question prior to deletion?
> >> >>
> >> >> Thanks
> >> >>
> >> >> Matt
> >> >>
> >> >> [1] https://issues.apache.org/jira/browse/NIFI-3295
> >> >> [2] https://issues.apache.org/jira/browse/NIFI-3933
> >> >>
> >> >> On Thu, May 18, 2017 at 12:29 PM, Neil Derraugh
> >> >> <neil.derra...@intellifylearning.com> wrote:
> >> >>> Pretty sure this is the problem I was describing in the "Phantom
> >> >>> Node" thread recently.
> >> >>>
> >> >>> If I kill non-primary nodes the cluster remains healthy despite the
> >> >>> lost nodes. The terminated nodes end up with a DISCONNECTED status.
> >> >>>
> >> >>> If I kill the primary it winds up with a CONNECTED status, but a new
> >> >>> primary/cluster coordinator gets elected too.
> >> >>>
> >> >>> Additionally, it seems that in 1.2.0 the REST API no longer supports
> >> >>> deleting a node in a CONNECTED state ("Cannot remove Node with ID
> >> >>> 1780fde7-c2f4-469c-9884-fe843eac5b73 because it is not disconnected,
> >> >>> current state = CONNECTED"). So right now I don't have a workaround
> >> >>> and have to kill all the nodes and start over.
> >> >>>
> >> >>> On Thu, May 18, 2017 at 11:20 AM, Mark Payne <marka...@hotmail.com>
> >> >>> wrote:
> >> >>>> Hello,
> >> >>>>
> >> >>>> Just looking through this thread now. I believe that I understand
> >> >>>> the problem. I have updated the JIRA with details about what I
> >> >>>> think is the problem and a potential remedy for the problem.
> >> >>>>
> >> >>>> Thanks
> >> >>>> -Mark
> >> >>>>
> >> >>>> > On May 18, 2017, at 9:49 AM, Matt Gilman <matt.c.gil...@gmail.com>
> >> >>>> > wrote:
> >> >>>> >
> >> >>>> > Thanks for the additional details. They will be helpful when
> >> >>>> > working the JIRA. All nodes, including the coordinator, heartbeat
> >> >>>> > to the active coordinator. This means that the coordinator
> >> >>>> > effectively heartbeats to itself. It appears, based on your log
> >> >>>> > messages, that this is not happening. Because no heartbeats were
> >> >>>> > received from any node, the lack of heartbeats from the
> >> >>>> > terminated node is not considered.
> >> >>>> >
> >> >>>> > Matt
> >> >>>> >
> >> >>>> > Sent from my iPhone
> >> >>>> >
> >> >>>> >> On May 18, 2017, at 8:30 AM, ddewaele <ddewa...@gmail.com> wrote:
> >> >>>> >>
> >> >>>> >> Found something interesting in the centos-b debug logging....
> >> >>>> >>
> >> >>>> >> After centos-a (the coordinator) is killed, centos-b takes over.
> >> >>>> >> Notice how it "Will not disconnect any nodes due to lack of
> >> >>>> >> heartbeat" and how it still sees centos-a as connected despite
> >> >>>> >> the fact that there are no heartbeats anymore.
> >> >>>> >>
> >> >>>> >> 2017-05-18 12:41:38,010 INFO [Leader Election Notification Thread-2] o.apache.nifi.controller.FlowController This node elected Active Cluster Coordinator
> >> >>>> >> 2017-05-18 12:41:38,010 DEBUG [Leader Election Notification Thread-2] o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor Purging old heartbeats
> >> >>>> >> 2017-05-18 12:41:38,014 INFO [Leader Election Notification Thread-1] o.apache.nifi.controller.FlowController This node has been elected Primary Node
> >> >>>> >> 2017-05-18 12:41:38,353 DEBUG [Heartbeat Monitor Thread-1] o.a.n.c.c.h.AbstractHeartbeatMonitor Received no new heartbeats. Will not disconnect any nodes due to lack of heartbeat
> >> >>>> >> 2017-05-18 12:41:41,336 DEBUG [Process Cluster Protocol Request-3] o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor Received new heartbeat from centos-b:8080
> >> >>>> >> 2017-05-18 12:41:41,337 DEBUG [Process Cluster Protocol Request-3] o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor
> >> >>>> >> Calculated diff between current cluster status and node cluster status as follows:
> >> >>>> >> Node: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED, updateId=45], NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTED, updateId=42]]
> >> >>>> >> Self: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED, updateId=45], NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTED, updateId=42]]
> >> >>>> >> Difference: []
> >> >>>> >>
> >> >>>> >> 2017-05-18 12:41:41,337 INFO [Process Cluster Protocol Request-3] o.a.n.c.p.impl.SocketProtocolListener Finished processing request 410e7db5-8bb0-4f97-8ee8-fc8647c54959 (type=HEARTBEAT, length=2341 bytes) from centos-b:8080 in 3 millis
> >> >>>> >> 2017-05-18 12:41:41,339 INFO [Clustering Tasks Thread-2] o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2017-05-18 12:41:41,330 and sent to centos-b:10001 at 2017-05-18 12:41:41,339; send took 8 millis
> >> >>>> >> 2017-05-18 12:41:43,354 INFO [Heartbeat Monitor Thread-1] o.a.n.c.c.h.AbstractHeartbeatMonitor Finished processing 1 heartbeats in 93276 nanos
> >> >>>> >> 2017-05-18 12:41:46,346 DEBUG [Process Cluster Protocol Request-4] o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor Received new heartbeat from centos-b:8080
> >> >>>> >> 2017-05-18 12:41:46,346 DEBUG [Process Cluster Protocol Request-4] o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor
> >> >>>> >> Calculated diff between current cluster status and node cluster status as follows:
> >> >>>> >> Node: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED, updateId=45], NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTED, updateId=42]]
> >> >>>> >> Self: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED, updateId=45], NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTED, updateId=42]]
> >> >>>> >> Difference: []
> >> >>>> >>
> >> >>>> >> --
> >> >>>> >> View this message in context: http://apache-nifi-users-list.2361937.n4.nabble.com/Nifi-Cluster-fails-to-disconnect-node-when-node-was-killed-tp1942p1950.html
> >> >>>> >> Sent from the Apache NiFi Users List mailing list archive at Nabble.com.
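The disconnect-before-delete workaround Matt describes can be sketched as two REST calls. This is only a sketch: the node ID is the one from the error message in the thread, the host is a placeholder, and the exact endpoint paths and PUT body shape are assumptions based on the NiFi 1.x node API, so verify them against your version. The requests are only built here, not sent.

```python
import json

# Placeholder host; the node ID is taken from the thread's error message.
BASE = "http://node-b:8080/nifi-api/controller/cluster/nodes"
NODE_ID = "1780fde7-c2f4-469c-9884-fe843eac5b73"

# Step 1: ask the coordinator to disconnect the node; deletion is refused
# while the node is still in the CONNECTED state.
disconnect = {
    "method": "PUT",
    "url": "{}/{}".format(BASE, NODE_ID),
    "body": json.dumps({"node": {"nodeId": NODE_ID, "status": "DISCONNECTING"}}),
}

# Step 2: once the node reports DISCONNECTED, removal should succeed.
delete = {"method": "DELETE", "url": "{}/{}".format(BASE, NODE_ID), "body": None}

for req in (disconnect, delete):
    print(req["method"], req["url"])
```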