Re: Nifi Cluster fails to disconnect node when node was killed

Joe Witt Fri, 19 May 2017 09:40:49 -0700

I believe that, in the state you describe, the down node is now considered
disconnected.  The cluster prohibits you from making changes when it knows
that not all members of the cluster can honor the change.  If you are sure
you want to make the changes anyway and move on without that node, you
should be able to remove (delete) it from the cluster.  You then have a
cluster of two connected nodes and you can make changes.
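
For anyone scripting the removal, here is a minimal sketch of the
disconnect-then-delete sequence discussed further down the thread. It is
not taken from the NiFi docs. It assumes an unsecured cluster, and the
paths and JSON payload (PUT and DELETE on
nifi-api/controller/cluster/nodes/{id}, with status DISCONNECTING) are as
I recall them from the 1.x REST API, so please verify them against the
REST API documentation for your version. The host name and the helper
names (node_url, build_disconnect_request, build_delete_request,
remove_node) are made up for illustration. Whether the disconnect request
succeeds while the cluster is in the state described below is not
something this sketch can promise.

```python
import json
import urllib.request


def node_url(base_url, node_id):
    # Assumed path; check the REST API docs for your NiFi version.
    return f"{base_url.rstrip('/')}/nifi-api/controller/cluster/nodes/{node_id}"


def build_disconnect_request(base_url, node_id):
    # Assumed payload shape: ask the coordinator to disconnect the node.
    body = json.dumps({"node": {"nodeId": node_id, "status": "DISCONNECTING"}})
    return urllib.request.Request(
        node_url(base_url, node_id),
        data=body.encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def build_delete_request(base_url, node_id):
    # Deletion is only expected to work once the node is DISCONNECTED.
    return urllib.request.Request(node_url(base_url, node_id), method="DELETE")


def remove_node(base_url, node_id, opener=urllib.request.urlopen):
    # Disconnect first, then delete. Any HTTP error raised by the opener
    # propagates, so a failed disconnect prevents the delete from being sent.
    for request in (
        build_disconnect_request(base_url, node_id),
        build_delete_request(base_url, node_id),
    ):
        with opener(request) as response:
            response.read()


if __name__ == "__main__":
    # Only prints what would be sent; call remove_node(...) to send it.
    base = "http://centos-b:8080"  # hypothetical address of a surviving node
    node = "1780fde7-c2f4-469c-9884-fe843eac5b73"  # ID from the error quoted below
    for req in (build_disconnect_request(base, node), build_delete_request(base, node)):
        print(req.get_method(), req.full_url)
```

Building the requests separately from sending them makes it easy to
inspect exactly what would go to the cluster before running it against a
live node.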


On May 19, 2017 12:23 PM, "Neil Derraugh" <
neil.derra...@intellifylearning.com> wrote:

> That's fair.  But for the sake of total clarity on my own part, after one
> of these disaster scenarios with a newly quorum-elected primary, things
> cannot be driven through the UI or through at least parts of the REST API.
>
> I just ran through the following.  We have 3 nodes A, B, C with A primary,
> and A becomes unreachable without first disconnecting.  Then B and C may (I
> haven't verified) continue operating the flow they had in the cluster's
> last "good" state.  But they do elect a new primary, as per the REST
> nifi-api/controller/cluster response.  But now the flow can't be changed,
> and in some cases it can't be reported on either, i.e. some GETs fail, like
> nifi-api/flow/process-groups/root.
>
> Are we describing the same behavior?
>
> On Fri, May 19, 2017 at 11:12 AM, Joe Witt <joe.w...@gmail.com> wrote:
>
>> If there is no longer a quorum then we cannot drive things from the UI
>> but the cluster remaining is intact from a functioning point of view
>> other than being able to assign a primary to handle the one-off items.
>>
>> On Fri, May 19, 2017 at 11:04 AM, Neil Derraugh
>> <neil.derra...@intellifylearning.com> wrote:
>> > Hi Joe,
>> >
>> > Maybe I'm missing something, but if the primary node suffers a network
>> > partition or container/vm/machine loss or becomes otherwise unreachable
>> > then the cluster is unusable, at least from the UI.
>> >
>> > If that's not so please correct me.
>> >
>> > Thanks,
>> > Neil
>> >
>> > On Thu, May 18, 2017 at 9:56 PM, Joe Witt <joe.w...@gmail.com> wrote:
>> >>
>> >> Neil,
>> >>
>> >> Want to make sure I understand what you're saying.  What are you
>> >> stating is a single point of failure?
>> >>
>> >> Thanks
>> >> Joe
>> >>
>> >> On Thu, May 18, 2017 at 5:27 PM, Neil Derraugh
>> >> <neil.derra...@intellifylearning.com> wrote:
>> >> > Thanks for the insight Matt.
>> >> >
>> >> > It's a disaster recovery issue.  It's not something I plan on doing on
>> >> > purpose.  It seems it is a single point of failure unfortunately.  I can
>> >> > see no other way to resolve the issue other than to blow everything away
>> >> > and start a new cluster.
>> >> >
>> >> > On Thu, May 18, 2017 at 2:49 PM, Matt Gilman <matt.c.gil...@gmail.com>
>> >> > wrote:
>> >> >>
>> >> >> Neil,
>> >> >>
>> >> >> Disconnecting a node prior to removal is the correct process. It
>> >> >> appears that the check was lost going from 0.x to 1.x. Folks reported
>> >> >> this JIRA [1] indicating that deleting a connected node did not work.
>> >> >> This process does not work because the node needs to be disconnected
>> >> >> first. The JIRA was addressed by restoring the check that a node is
>> >> >> disconnected prior to deletion.
>> >> >>
>> >> >> Hopefully the JIRA I filed earlier today [2] will address the phantom
>> >> >> node you were seeing. Until then, can you update your workaround to
>> >> >> disconnect the node in question prior to deletion?
>> >> >>
>> >> >> Thanks
>> >> >>
>> >> >> Matt
>> >> >>
>> >> >> [1] https://issues.apache.org/jira/browse/NIFI-3295
>> >> >> [2] https://issues.apache.org/jira/browse/NIFI-3933
>> >> >>
>> >> >> On Thu, May 18, 2017 at 12:29 PM, Neil Derraugh
>> >> >> <neil.derra...@intellifylearning.com> wrote:
>> >> >>>
>> >> >>> Pretty sure this is the problem I was describing in the "Phantom Node"
>> >> >>> thread recently.
>> >> >>>
>> >> >>> If I kill non-primary nodes the cluster remains healthy despite the
>> >> >>> lost nodes.  The terminated nodes end up with a DISCONNECTED status.
>> >> >>>
>> >> >>> If I kill the primary it winds up with a CONNECTED status, but a new
>> >> >>> primary/cluster coordinator gets elected too.
>> >> >>>
>> >> >>> Additionally it seems in 1.2.0 that the REST API no longer supports
>> >> >>> deleting a node in a CONNECTED state (Cannot remove Node with ID
>> >> >>> 1780fde7-c2f4-469c-9884-fe843eac5b73 because it is not disconnected,
>> >> >>> current state = CONNECTED).  So right now I don't have a workaround and
>> >> >>> have to kill all the nodes and start over.
>> >> >>>
>> >> >>> On Thu, May 18, 2017 at 11:20 AM, Mark Payne <marka...@hotmail.com>
>> >> >>> wrote:
>> >> >>>>
>> >> >>>> Hello,
>> >> >>>>
>> >> >>>> Just looking through this thread now. I believe that I understand the
>> >> >>>> problem. I have updated the JIRA with details about what I think is
>> >> >>>> the problem and a potential remedy for the problem.
>> >> >>>>
>> >> >>>> Thanks
>> >> >>>> -Mark
>> >> >>>>
>> >> >>>> > On May 18, 2017, at 9:49 AM, Matt Gilman <matt.c.gil...@gmail.com>
>> >> >>>> > wrote:
>> >> >>>> >
>> >> >>>> > Thanks for the additional details. They will be helpful when
>> >> >>>> > working the JIRA. All nodes, including the coordinator, heartbeat
>> >> >>>> > to the active coordinator. This means that the coordinator
>> >> >>>> > effectively heartbeats to itself. It appears, based on your log
>> >> >>>> > messages, that this is not happening. Because no heartbeats were
>> >> >>>> > received from any node, the lack of heartbeats from the terminated
>> >> >>>> > node is not considered.
>> >> >>>> >
>> >> >>>> > Matt
>> >> >>>> >
>> >> >>>> > Sent from my iPhone
>> >> >>>> >
>> >> >>>> >> On May 18, 2017, at 8:30 AM, ddewaele <ddewa...@gmail.com> wrote:
>> >> >>>> >>
>> >> >>>> >> Found something interesting in the centos-b debug logging....
>> >> >>>> >>
>> >> >>>> >> After centos-a (the coordinator) is killed, centos-b takes over.
>> >> >>>> >> Notice how it "Will not disconnect any nodes due to lack of
>> >> >>>> >> heartbeat" and how it still sees centos-a as connected despite the
>> >> >>>> >> fact that there are no heartbeats anymore.
>> >> >>>> >>
>> >> >>>> >> 2017-05-18 12:41:38,010 INFO [Leader Election Notification Thread-2] o.apache.nifi.controller.FlowController This node elected Active Cluster Coordinator
>> >> >>>> >> 2017-05-18 12:41:38,010 DEBUG [Leader Election Notification Thread-2] o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor Purging old heartbeats
>> >> >>>> >> 2017-05-18 12:41:38,014 INFO [Leader Election Notification Thread-1] o.apache.nifi.controller.FlowController This node has been elected Primary Node
>> >> >>>> >> 2017-05-18 12:41:38,353 DEBUG [Heartbeat Monitor Thread-1] o.a.n.c.c.h.AbstractHeartbeatMonitor Received no new heartbeats. Will not disconnect any nodes due to lack of heartbeat
>> >> >>>> >> 2017-05-18 12:41:41,336 DEBUG [Process Cluster Protocol Request-3] o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor Received new heartbeat from centos-b:8080
>> >> >>>> >> 2017-05-18 12:41:41,337 DEBUG [Process Cluster Protocol Request-3] o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor
>> >> >>>> >>
>> >> >>>> >> Calculated diff between current cluster status and node cluster status as follows:
>> >> >>>> >> Node: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED, updateId=45], NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTED, updateId=42]]
>> >> >>>> >> Self: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED, updateId=45], NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTED, updateId=42]]
>> >> >>>> >> Difference: []
>> >> >>>> >>
>> >> >>>> >>
>> >> >>>> >> 2017-05-18 12:41:41,337 INFO [Process Cluster Protocol Request-3] o.a.n.c.p.impl.SocketProtocolListener Finished processing request 410e7db5-8bb0-4f97-8ee8-fc8647c54959 (type=HEARTBEAT, length=2341 bytes) from centos-b:8080 in 3 millis
>> >> >>>> >> 2017-05-18 12:41:41,339 INFO [Clustering Tasks Thread-2] o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2017-05-18 12:41:41,330 and sent to centos-b:10001 at 2017-05-18 12:41:41,339; send took 8 millis
>> >> >>>> >> 2017-05-18 12:41:43,354 INFO [Heartbeat Monitor Thread-1] o.a.n.c.c.h.AbstractHeartbeatMonitor Finished processing 1 heartbeats in 93276 nanos
>> >> >>>> >> 2017-05-18 12:41:46,346 DEBUG [Process Cluster Protocol Request-4] o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor Received new heartbeat from centos-b:8080
>> >> >>>> >> 2017-05-18 12:41:46,346 DEBUG [Process Cluster Protocol Request-4] o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor
>> >> >>>> >>
>> >> >>>> >> Calculated diff between current cluster status and node cluster status as follows:
>> >> >>>> >> Node: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED, updateId=45], NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTED, updateId=42]]
>> >> >>>> >> Self: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED, updateId=45], NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTED, updateId=42]]
>> >> >>>> >> Difference: []
>> >> >>>> >>
>> >> >>>> >>
>> >> >>>> >>
>> >> >>>> >>
>> >> >>>> >> --
>> >> >>>> >> View this message in context:
>> >> >>>> >>
>> >> >>>> >> http://apache-nifi-users-list.2361937.n4.nabble.com/Nifi-Cluster-fails-to-disconnect-node-when-node-was-killed-tp1942p1950.html
>> >> >>>> >> Sent from the Apache NiFi Users List mailing list archive at
>> >> >>>> >> Nabble.com.
>> >> >>>>
>> >> >>>
>> >> >>
>> >> >
>> >
>> >
>>
>
>
