Re: Nifi Cluster fails to disconnect node when node was killed

Joe Witt Fri, 19 May 2017 10:10:28 -0700

I see.  Yeah, that sounds like something the JIRA Matt Gilman mentioned will
resolve.  Thanks for clarifying.  I'm sure that JIRA will be addressed soon.


On May 19, 2017 1:06 PM, "Neil Derraugh" <
neil.derra...@intellifylearning.com> wrote:

> That's the whole problem from my perspective: it stays CONNECTED.  It
> never becomes DISCONNECTED.  You can't delete it from the API in 1.2.0.
>
> That's why I said it was a single point of failure.  The exact semantics
> of calling it a single point of failure might be debatable, but the fact
> that the cluster can't be modified and/or gracefully shut down (afaik) is
> what I was referring to.
>
> On Fri, May 19, 2017 at 12:40 PM, Joe Witt <joe.w...@gmail.com> wrote:
>
>> I believe at the state you describe that down node is now considered
>> disconnected.  The cluster behavior prohibits you from making changes when
>> it knows that not all members of the cluster can honor the change.  If you
>> are sure you want to make the changes anyway and move on without that node
>> you should be able to remove it/delete it from the cluster.  Now you have a
>> cluster of two connected nodes and you can make changes.
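
For reference, a minimal sketch of the disconnect-then-delete sequence
described above, in Python.  The host name, the helper function names, and
the request body shape (a "node" entity with status DISCONNECTING) are
assumptions based on my reading of the NiFi REST API; please verify them
against the REST API docs for your version.  Whether the PUT succeeds for a
node that is stuck in CONNECTED is not confirmed anywhere in this thread.

```python
import json
import urllib.request

# Assumed base URL of a node that is still reachable; adjust for your cluster.
BASE_URL = "http://centos-b:8080/nifi-api"


def build_disconnect_request(node_id: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) the PUT asking the cluster to disconnect a node."""
    # Assumed body shape: a node entity whose status is set to DISCONNECTING.
    body = {"node": {"nodeId": node_id, "status": "DISCONNECTING"}}
    return urllib.request.Request(
        url=f"{base_url}/controller/cluster/nodes/{node_id}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def build_delete_request(node_id: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) the DELETE that removes a disconnected node."""
    return urllib.request.Request(
        url=f"{base_url}/controller/cluster/nodes/{node_id}",
        method="DELETE",
    )
```

Each request can be passed to urllib.request.urlopen() in order.  The
DELETE is expected to be rejected until the node actually reports
DISCONNECTED, so check the node status between the two calls.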
>>
>> On May 19, 2017 12:23 PM, "Neil Derraugh" <neil.derraugh@intellifylearni
>> ng.com> wrote:
>>
>>> That's fair.  But for the sake of total clarity on my own part, after
>>> one of these disaster scenarios with a newly quorum-elected primary, things
>>> cannot be driven through the UI or through at least parts of the REST API.
>>>
>>> I just ran through the following.  We have 3 nodes A, B, C with A
>>> primary, and A becomes unreachable without first disconnecting.  Then B and
>>> C may (I haven't verified) continue operating the flow they had in the
>>> cluster's last "good" state.  But they do elect a new primary, as per the
>>> REST nifi-api/controller/cluster response.  But now the flow can't be
>>> changed, and in some cases it can't be reported on either, i.e. some GETs
>>> fail, like nifi-api/flow/process-groups/root.
>>>
>>> Are we describing the same behavior?
>>>
>>> On Fri, May 19, 2017 at 11:12 AM, Joe Witt <joe.w...@gmail.com> wrote:
>>>
>>>> If there is no longer a quorum then we cannot drive things from the UI
>>>> but the remaining cluster is intact from a functioning point of view
>>>> other than being able to assign a primary to handle the one-off items.
>>>>
>>>> On Fri, May 19, 2017 at 11:04 AM, Neil Derraugh
>>>> <neil.derra...@intellifylearning.com> wrote:
>>>> > Hi Joe,
>>>> >
>>>> > Maybe I'm missing something, but if the primary node suffers a network
>>>> > partition or container/vm/machine loss or becomes otherwise unreachable,
>>>> > then the cluster is unusable, at least from the UI.
>>>> >
>>>> > If that's not so please correct me.
>>>> >
>>>> > Thanks,
>>>> > Neil
>>>> >
>>>> > On Thu, May 18, 2017 at 9:56 PM, Joe Witt <joe.w...@gmail.com> wrote:
>>>> >>
>>>> >> Neil,
>>>> >>
>>>> >> Want to make sure I understand what you're saying.  What are you
>>>> >> stating is a single point of failure?
>>>> >>
>>>> >> Thanks
>>>> >> Joe
>>>> >>
>>>> >> On Thu, May 18, 2017 at 5:27 PM, Neil Derraugh
>>>> >> <neil.derra...@intellifylearning.com> wrote:
>>>> >> > Thanks for the insight Matt.
>>>> >> >
>>>> >> > It's a disaster recovery issue.  It's not something I plan on doing on
>>>> >> > purpose.  It seems it is a single point of failure, unfortunately.  I can
>>>> >> > see no way to resolve the issue other than to blow everything away and
>>>> >> > start a new cluster.
>>>> >> >
>>>> >> > On Thu, May 18, 2017 at 2:49 PM, Matt Gilman <matt.c.gil...@gmail.com> wrote:
>>>> >> >>
>>>> >> >> Neil,
>>>> >> >>
>>>> >> >> Disconnecting a node prior to removal is the correct process. It appears
>>>> >> >> that the check was lost going from 0.x to 1.x. Folks reported this JIRA [1]
>>>> >> >> indicating that deleting a connected node did not work. This process does
>>>> >> >> not work because the node needs to be disconnected first. The JIRA was
>>>> >> >> addressed by restoring the check that a node is disconnected prior to
>>>> >> >> deletion.
>>>> >> >>
>>>> >> >> Hopefully the JIRA I filed earlier today [2] will address the phantom node
>>>> >> >> you were seeing. Until then, can you update your workaround to disconnect
>>>> >> >> the node in question prior to deletion?
>>>> >> >>
>>>> >> >> Thanks
>>>> >> >>
>>>> >> >> Matt
>>>> >> >>
>>>> >> >> [1] https://issues.apache.org/jira/browse/NIFI-3295
>>>> >> >> [2] https://issues.apache.org/jira/browse/NIFI-3933
>>>> >> >>
>>>> >> >> On Thu, May 18, 2017 at 12:29 PM, Neil Derraugh
>>>> >> >> <neil.derra...@intellifylearning.com> wrote:
>>>> >> >>>
>>>> >> >>> Pretty sure this is the problem I was describing in the "Phantom Node"
>>>> >> >>> thread recently.
>>>> >> >>>
>>>> >> >>> If I kill non-primary nodes the cluster remains healthy despite the lost
>>>> >> >>> nodes.  The terminated nodes end up with a DISCONNECTED status.
>>>> >> >>>
>>>> >> >>> If I kill the primary it winds up with a CONNECTED status, but a new
>>>> >> >>> primary/cluster coordinator gets elected too.
>>>> >> >>>
>>>> >> >>> Additionally it seems in 1.2.0 that the REST API no longer supports
>>>> >> >>> deleting a node in a CONNECTED state (Cannot remove Node with ID
>>>> >> >>> 1780fde7-c2f4-469c-9884-fe843eac5b73 because it is not disconnected,
>>>> >> >>> current state = CONNECTED).  So right now I don't have a workaround and
>>>> >> >>> have to kill all the nodes and start over.
>>>> >> >>>
>>>> >> >>> On Thu, May 18, 2017 at 11:20 AM, Mark Payne <marka...@hotmail.com> wrote:
>>>> >> >>>>
>>>> >> >>>> Hello,
>>>> >> >>>>
>>>> >> >>>> Just looking through this thread now. I believe that I understand the
>>>> >> >>>> problem. I have updated the JIRA with details about what I think is the
>>>> >> >>>> problem and a potential remedy for the problem.
>>>> >> >>>>
>>>> >> >>>> Thanks
>>>> >> >>>> -Mark
>>>> >> >>>>
>>>> >> >>>> > On May 18, 2017, at 9:49 AM, Matt Gilman <matt.c.gil...@gmail.com> wrote:
>>>> >> >>>> >
>>>> >> >>>> > Thanks for the additional details. They will be helpful when working
>>>> >> >>>> > the JIRA. All nodes, including the coordinator, heartbeat to the active
>>>> >> >>>> > coordinator. This means that the coordinator effectively heartbeats to
>>>> >> >>>> > itself. It appears, based on your log messages, that this is not
>>>> >> >>>> > happening. Because no heartbeats were received from any node, the lack
>>>> >> >>>> > of heartbeats from the terminated node is not considered.
>>>> >> >>>> >
>>>> >> >>>> > Matt
>>>> >> >>>> >
>>>> >> >>>> > Sent from my iPhone
>>>> >> >>>> >
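
To make the stuck-CONNECTED state easier to picture, here is a toy model in
Python.  It is not NiFi's code: the class, its fields, and the 40 second
timeout are hypothetical, and the rule that only nodes with a recorded
heartbeat are examined is my assumption about one way the logged behavior
could arise.  The actual cause is whatever is tracked in NIFI-3933.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToyHeartbeatMonitor:
    """Hypothetical coordinator-side monitor; a sketch, not NiFi's implementation."""
    timeout_s: float
    states: Dict[str, str] = field(default_factory=dict)       # node id -> state
    last_seen: Dict[str, float] = field(default_factory=dict)  # node id -> last heartbeat time

    def purge(self) -> None:
        # Models the "Purging old heartbeats" step logged on election.
        self.last_seen.clear()

    def receive(self, node_id: str, now: float) -> None:
        self.last_seen[node_id] = now

    def check(self, now: float) -> List[str]:
        if not self.last_seen:
            # Models "Received no new heartbeats. Will not disconnect any nodes".
            return []
        disconnected = []
        # Assumption: only nodes with a recorded heartbeat are ever examined.
        for node_id, seen in self.last_seen.items():
            if self.states.get(node_id) == "CONNECTED" and now - seen > self.timeout_s:
                self.states[node_id] = "DISCONNECTED"
                disconnected.append(node_id)
        return disconnected


monitor = ToyHeartbeatMonitor(
    timeout_s=40.0,
    states={"centos-a:8080": "CONNECTED", "centos-b:8080": "CONNECTED"},
)
monitor.purge()          # centos-b has just been elected coordinator
monitor.check(now=0.0)   # nothing recorded yet, so nothing is disconnected
for t in range(3, 120, 5):
    monitor.receive("centos-b:8080", now=t)  # only the surviving node heartbeats
    monitor.check(now=t + 2)
print(monitor.states["centos-a:8080"])  # prints CONNECTED
```

In this model the killed node never sends a heartbeat to the new
coordinator, so it never has a record that could time out.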
>>>> >> >>>> >> On May 18, 2017, at 8:30 AM, ddewaele <ddewa...@gmail.com> wrote:
>>>> >> >>>> >>
>>>> >> >>>> >> Found something interesting in the centos-b debug logging....
>>>> >> >>>> >>
>>>> >> >>>> >> After centos-a (the coordinator) is killed, centos-b takes over. Notice
>>>> >> >>>> >> how it "Will not disconnect any nodes due to lack of heartbeat" and how
>>>> >> >>>> >> it still sees centos-a as connected despite the fact that there are no
>>>> >> >>>> >> heartbeats anymore.
>>>> >> >>>> >>
>>>> >> >>>> >> 2017-05-18 12:41:38,010 INFO [Leader Election Notification Thread-2] o.apache.nifi.controller.FlowController This node elected Active Cluster Coordinator
>>>> >> >>>> >> 2017-05-18 12:41:38,010 DEBUG [Leader Election Notification Thread-2] o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor Purging old heartbeats
>>>> >> >>>> >> 2017-05-18 12:41:38,014 INFO [Leader Election Notification Thread-1] o.apache.nifi.controller.FlowController This node has been elected Primary Node
>>>> >> >>>> >> 2017-05-18 12:41:38,353 DEBUG [Heartbeat Monitor Thread-1] o.a.n.c.c.h.AbstractHeartbeatMonitor Received no new heartbeats. Will not disconnect any nodes due to lack of heartbeat
>>>> >> >>>> >> 2017-05-18 12:41:41,336 DEBUG [Process Cluster Protocol Request-3] o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor Received new heartbeat from centos-b:8080
>>>> >> >>>> >> 2017-05-18 12:41:41,337 DEBUG [Process Cluster Protocol Request-3] o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor
>>>> >> >>>> >>
>>>> >> >>>> >> Calculated diff between current cluster status and node cluster status as follows:
>>>> >> >>>> >> Node: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED, updateId=45], NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTED, updateId=42]]
>>>> >> >>>> >> Self: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED, updateId=45], NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTED, updateId=42]]
>>>> >> >>>> >> Difference: []
>>>> >> >>>> >>
>>>> >> >>>> >>
>>>> >> >>>> >> 2017-05-18 12:41:41,337 INFO [Process Cluster Protocol Request-3] o.a.n.c.p.impl.SocketProtocolListener Finished processing request 410e7db5-8bb0-4f97-8ee8-fc8647c54959 (type=HEARTBEAT, length=2341 bytes) from centos-b:8080 in 3 millis
>>>> >> >>>> >> 2017-05-18 12:41:41,339 INFO [Clustering Tasks Thread-2] o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2017-05-18 12:41:41,330 and sent to centos-b:10001 at 2017-05-18 12:41:41,339; send took 8 millis
>>>> >> >>>> >> 2017-05-18 12:41:43,354 INFO [Heartbeat Monitor Thread-1] o.a.n.c.c.h.AbstractHeartbeatMonitor Finished processing 1 heartbeats in 93276 nanos
>>>> >> >>>> >> 2017-05-18 12:41:46,346 DEBUG [Process Cluster Protocol Request-4] o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor Received new heartbeat from centos-b:8080
>>>> >> >>>> >> 2017-05-18 12:41:46,346 DEBUG [Process Cluster Protocol Request-4] o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor
>>>> >> >>>> >>
>>>> >> >>>> >> Calculated diff between current cluster status and node cluster status as follows:
>>>> >> >>>> >> Node: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED, updateId=45], NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTED, updateId=42]]
>>>> >> >>>> >> Self: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED, updateId=45], NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTED, updateId=42]]
>>>> >> >>>> >> Difference: []
>>>> >> >>>> >>
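
A small Python sketch for spotting this condition in a debug log: it lists
nodes that the coordinator still reports as CONNECTED but that never show
up in a "Received new heartbeat from" line.  It assumes the log has its
original, unwrapped lines (one entry per line, not the email-wrapped form),
and the function names and regular expressions are mine, written against
the excerpt above rather than any documented log format.

```python
import re
from typing import Dict, List, Tuple

# Matches entries such as:
#   NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTED, updateId=42]
NODE_STATUS = re.compile(
    r"NodeConnectionStatus\[nodeId=(?P<node>[^,\]]+),\s*"
    r"state=(?P<state>\w+),\s*updateId=(?P<update>\d+)\]"
)
HEARTBEAT = re.compile(r"Received new heartbeat from (\S+)")


def parse_statuses(line: str) -> Dict[str, Tuple[str, int]]:
    """Map node id -> (state, updateId) for every status entry on one line."""
    return {m["node"]: (m["state"], int(m["update"])) for m in NODE_STATUS.finditer(line)}


def silent_connected_nodes(log_text: str) -> List[str]:
    """Nodes listed as CONNECTED in a 'Self:' line that never sent a heartbeat."""
    heard = set(HEARTBEAT.findall(log_text))
    statuses: Dict[str, Tuple[str, int]] = {}
    for line in log_text.splitlines():
        if line.lstrip().startswith("Self:"):
            statuses.update(parse_statuses(line))
    return sorted(
        node for node, (state, _) in statuses.items()
        if state == "CONNECTED" and node not in heard
    )


SAMPLE = (
    "2017-05-18 12:41:41,336 DEBUG [Process Cluster Protocol Request-3] "
    "o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor Received new heartbeat from centos-b:8080\n"
    "Self: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED, updateId=45], "
    "NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTED, updateId=42]]\n"
)
print(silent_connected_nodes(SAMPLE))  # prints ['centos-a:8080']
```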
>>>> >> >>>> >>
>>>> >> >>>> >>
>>>> >> >>>> >>
>>>> >> >>>> >> --
>>>> >> >>>> >> View this message in context:
>>>> >> >>>> >>
>>>> >> >>>> >> http://apache-nifi-users-list.2361937.n4.nabble.com/Nifi-Cluster-fails-to-disconnect-node-when-node-was-killed-tp1942p1950.html
>>>> >> >>>> >> Sent from the Apache NiFi Users List mailing list archive at
>>>> >> >>>> >> Nabble.com.
>>>> >> >>>>
>>>> >> >>>
>>>> >> >>
>>>> >> >
>>>> >
>>>> >
>>>>
>>>
>>>
>
