Hi Joe,

Maybe I'm missing something, but if the primary node suffers a network
partition or container/VM/machine loss, or becomes otherwise unreachable,
then the cluster is unusable, at least from the UI.

If that's not so please correct me.

Thanks,
Neil

On Thu, May 18, 2017 at 9:56 PM, Joe Witt <joe.w...@gmail.com> wrote:

> Neil,
>
> Want to make sure I understand what you're saying. What are you stating
> is a single point of failure?
>
> Thanks
> Joe
>
> On Thu, May 18, 2017 at 5:27 PM, Neil Derraugh
> <neil.derra...@intellifylearning.com> wrote:
> > Thanks for the insight Matt.
> >
> > It's a disaster recovery issue.  It's not something I plan on doing on
> > purpose.  Unfortunately, it seems to be a single point of failure.  I can
> > see no way to resolve the issue other than to blow everything away and
> > start a new cluster.
> >
> > On Thu, May 18, 2017 at 2:49 PM, Matt Gilman <matt.c.gil...@gmail.com>
> > wrote:
> >>
> >> Neil,
> >>
> >> Disconnecting a node prior to removal is the correct process. It appears
> >> that the check was lost going from 0.x to 1.x. Folks reported JIRA [1]
> >> indicating that deleting a connected node did not work. That process
> >> cannot work, because the node needs to be disconnected first. The JIRA
> >> was addressed by restoring the check that a node is disconnected prior
> >> to deletion.
> >>
> >> Hopefully the JIRA I filed earlier today [2] will address the phantom
> >> node you were seeing. Until then, can you update your workaround to
> >> disconnect the node in question prior to deletion?
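> >>
> >> If it helps, something along these lines should do it against the REST
> >> API (an untested Python sketch; it assumes the 1.x
> >> /controller/cluster/nodes endpoints and an unsecured cluster, so treat
> >> it as illustrative rather than definitive):
> >>
> >> import requests
> >>
> >> NIFI = "http://centos-b:8080/nifi-api"  # any reachable node's API
> >> NODE_ID = "<id of the node to remove>"  # from GET /controller/cluster
> >>
> >> # Ask the cluster to disconnect the node first...
> >> requests.put(
> >>     NIFI + "/controller/cluster/nodes/" + NODE_ID,
> >>     json={"node": {"nodeId": NODE_ID, "status": "DISCONNECTING"}},
> >> ).raise_for_status()
> >>
> >> # ...then delete it once it reports DISCONNECTED.
> >> requests.delete(NIFI + "/controller/cluster/nodes/" + NODE_ID).raise_for_status()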
> >>
> >> Thanks
> >>
> >> Matt
> >>
> >> [1] https://issues.apache.org/jira/browse/NIFI-3295
> >> [2] https://issues.apache.org/jira/browse/NIFI-3933
> >>
> >> On Thu, May 18, 2017 at 12:29 PM, Neil Derraugh
> >> <neil.derra...@intellifylearning.com> wrote:
> >>>
> >>> Pretty sure this is the problem I was describing in the "Phantom Node"
> >>> thread recently.
> >>>
> >>> If I kill non-primary nodes, the cluster remains healthy despite the
> >>> lost nodes.  The terminated nodes end up with a DISCONNECTED status.
> >>>
> >>> If I kill the primary, it winds up with a CONNECTED status, but a new
> >>> primary/cluster coordinator gets elected too.
> >>>
> >>> Additionally, it seems that in 1.2.0 the REST API no longer supports
> >>> deleting a node in a CONNECTED state ("Cannot remove Node with ID
> >>> 1780fde7-c2f4-469c-9884-fe843eac5b73 because it is not disconnected,
> >>> current state = CONNECTED").  So right now I don't have a workaround
> >>> and have to kill all the nodes and start over.
> >>>
> >>> On Thu, May 18, 2017 at 11:20 AM, Mark Payne <marka...@hotmail.com>
> >>> wrote:
> >>>>
> >>>> Hello,
> >>>>
> >>>> Just looking through this thread now. I believe that I understand the
> >>>> problem. I have updated the JIRA with details about what I think the
> >>>> problem is, along with a potential remedy.
> >>>>
> >>>> Thanks
> >>>> -Mark
> >>>>
> >>>> > On May 18, 2017, at 9:49 AM, Matt Gilman <matt.c.gil...@gmail.com>
> >>>> > wrote:
> >>>> >
> >>>> > Thanks for the additional details. They will be helpful when working
> >>>> > the JIRA. All nodes, including the coordinator, heartbeat to the
> >>>> > active coordinator. This means that the coordinator effectively
> >>>> > heartbeats to itself. It appears, based on your log messages, that
> >>>> > this is not happening. Because no heartbeats were received from any
> >>>> > node, the lack of heartbeats from the terminated node is not
> >>>> > considered.
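> >>>> >
> >>>> > To illustrate (this is just the gist of the monitor's guard, not the
> >>>> > actual NiFi source):
> >>>> >
> >>>> > from datetime import datetime, timedelta
> >>>> >
> >>>> > HEARTBEAT_TIMEOUT = timedelta(seconds=40)  # illustrative threshold
> >>>> >
> >>>> > def process_heartbeats(latest_heartbeats):
> >>>> >     """latest_heartbeats: node id -> time of last heartbeat."""
> >>>> >     if not latest_heartbeats:
> >>>> >         # The branch you are hitting: with zero heartbeats received,
> >>>> >         # the monitor assumes it is the problem and disconnects no one.
> >>>> >         print("Received no new heartbeats. Will not disconnect any "
> >>>> >               "nodes due to lack of heartbeat")
> >>>> >         return
> >>>> >     now = datetime.now()
> >>>> >     for node_id, last_seen in latest_heartbeats.items():
> >>>> >         if now - last_seen > HEARTBEAT_TIMEOUT:
> >>>> >             print("Disconnecting", node_id, "due to lack of heartbeat")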
> >>>> >
> >>>> > Matt
> >>>> >
> >>>> > Sent from my iPhone
> >>>> >
> >>>> >> On May 18, 2017, at 8:30 AM, ddewaele <ddewa...@gmail.com> wrote:
> >>>> >>
> >>>> >> Found something interesting in the centos-b debug logging:
> >>>> >>
> >>>> >> After centos-a (the coordinator) is killed, centos-b takes over.
> >>>> >> Notice how it logs "Will not disconnect any nodes due to lack of
> >>>> >> heartbeat" and how it still sees centos-a as CONNECTED despite the
> >>>> >> fact that there are no heartbeats from it anymore.
> >>>> >>
> >>>> >> 2017-05-18 12:41:38,010 INFO [Leader Election Notification Thread-2] o.apache.nifi.controller.FlowController This node elected Active Cluster Coordinator
> >>>> >> 2017-05-18 12:41:38,010 DEBUG [Leader Election Notification Thread-2] o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor Purging old heartbeats
> >>>> >> 2017-05-18 12:41:38,014 INFO [Leader Election Notification Thread-1] o.apache.nifi.controller.FlowController This node has been elected Primary Node
> >>>> >> 2017-05-18 12:41:38,353 DEBUG [Heartbeat Monitor Thread-1] o.a.n.c.c.h.AbstractHeartbeatMonitor Received no new heartbeats. Will not disconnect any nodes due to lack of heartbeat
> >>>> >> 2017-05-18 12:41:41,336 DEBUG [Process Cluster Protocol Request-3] o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor Received new heartbeat from centos-b:8080
> >>>> >> 2017-05-18 12:41:41,337 DEBUG [Process Cluster Protocol Request-3] o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor
> >>>> >> Calculated diff between current cluster status and node cluster status as follows:
> >>>> >> Node: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED, updateId=45], NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTED, updateId=42]]
> >>>> >> Self: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED, updateId=45], NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTED, updateId=42]]
> >>>> >> Difference: []
> >>>> >>
> >>>> >> 2017-05-18 12:41:41,337 INFO [Process Cluster Protocol Request-3] o.a.n.c.p.impl.SocketProtocolListener Finished processing request 410e7db5-8bb0-4f97-8ee8-fc8647c54959 (type=HEARTBEAT, length=2341 bytes) from centos-b:8080 in 3 millis
> >>>> >> 2017-05-18 12:41:41,339 INFO [Clustering Tasks Thread-2] o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2017-05-18 12:41:41,330 and sent to centos-b:10001 at 2017-05-18 12:41:41,339; send took 8 millis
> >>>> >> 2017-05-18 12:41:43,354 INFO [Heartbeat Monitor Thread-1] o.a.n.c.c.h.AbstractHeartbeatMonitor Finished processing 1 heartbeats in 93276 nanos
> >>>> >> 2017-05-18 12:41:46,346 DEBUG [Process Cluster Protocol Request-4] o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor Received new heartbeat from centos-b:8080
> >>>> >> 2017-05-18 12:41:46,346 DEBUG [Process Cluster Protocol Request-4] o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor
> >>>> >> Calculated diff between current cluster status and node cluster status as follows:
> >>>> >> Node: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED, updateId=45], NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTED, updateId=42]]
> >>>> >> Self: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED, updateId=45], NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTED, updateId=42]]
> >>>> >> Difference: []
> >>>> >>
> >>>>
> >>>
> >>
> >
>
