If there is no longer a quorum then we cannot drive things from the UI, but the remaining cluster is intact from a functioning point of view, aside from being able to assign a primary node to handle the one-off items.
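When the UI cannot be reached, cluster membership can still be read from the REST API of any surviving node. A minimal sketch, assuming a NiFi 1.x cluster, an unsecured API, and the third-party Python requests library; the host name is illustrative and the response shape is assumed from the 1.x /controller/cluster endpoint:

    import requests  # third-party: pip install requests

    # Any surviving node's API; host and port are illustrative.
    NIFI_API = "http://centos-b:8080/nifi-api"

    def cluster_summary():
        """Print each node's id, address, state, and roles, roughly as
        the cluster view in the UI would show them."""
        resp = requests.get(f"{NIFI_API}/controller/cluster")
        resp.raise_for_status()
        for node in resp.json()["cluster"]["nodes"]:
            print(node["nodeId"], node["address"],
                  node["status"], node.get("roles", []))

    if __name__ == "__main__":
        cluster_summary()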
On Fri, May 19, 2017 at 11:04 AM, Neil Derraugh
<neil.derra...@intellifylearning.com> wrote:
> Hi Joe,
>
> Maybe I'm missing something, but if the primary node suffers a network
> partition or container/VM/machine loss or becomes otherwise unreachable,
> then the cluster is unusable, at least from the UI.
>
> If that's not so please correct me.
>
> Thanks,
> Neil
>
> On Thu, May 18, 2017 at 9:56 PM, Joe Witt <joe.w...@gmail.com> wrote:
>> Neil,
>>
>> Want to make sure I understand what you're saying. What are you stating
>> is a single point of failure?
>>
>> Thanks
>> Joe
>>
>> On Thu, May 18, 2017 at 5:27 PM, Neil Derraugh
>> <neil.derra...@intellifylearning.com> wrote:
>>> Thanks for the insight Matt.
>>>
>>> It's a disaster recovery issue. It's not something I plan on doing on
>>> purpose. It seems it is a single point of failure, unfortunately. I
>>> can see no other way to resolve the issue than to blow everything
>>> away and start a new cluster.
>>>
>>> On Thu, May 18, 2017 at 2:49 PM, Matt Gilman <matt.c.gil...@gmail.com>
>>> wrote:
>>>> Neil,
>>>>
>>>> Disconnecting a node prior to removal is the correct process. It
>>>> appears that the check was lost going from 0.x to 1.x. Folks
>>>> reported this JIRA [1] indicating that deleting a connected node did
>>>> not work. This process does not work because the node needs to be
>>>> disconnected first. The JIRA was addressed by restoring the check
>>>> that a node is disconnected prior to deletion.
>>>>
>>>> Hopefully the JIRA I filed earlier today [2] will address the
>>>> phantom node you were seeing. Until then, can you update your
>>>> workaround to disconnect the node in question prior to deletion?
>>>>
>>>> Thanks
>>>>
>>>> Matt
>>>>
>>>> [1] https://issues.apache.org/jira/browse/NIFI-3295
>>>> [2] https://issues.apache.org/jira/browse/NIFI-3933
>>>>
>>>> On Thu, May 18, 2017 at 12:29 PM, Neil Derraugh
>>>> <neil.derra...@intellifylearning.com> wrote:
>>>>> Pretty sure this is the problem I was describing in the "Phantom
>>>>> Node" thread recently.
>>>>>
>>>>> If I kill non-primary nodes the cluster remains healthy despite the
>>>>> lost nodes. The terminated nodes end up with a DISCONNECTED status.
>>>>>
>>>>> If I kill the primary it winds up with a CONNECTED status, but a
>>>>> new primary/cluster coordinator gets elected too.
>>>>>
>>>>> Additionally, it seems that in 1.2.0 the REST API no longer
>>>>> supports deleting a node in a CONNECTED state (Cannot remove Node
>>>>> with ID 1780fde7-c2f4-469c-9884-fe843eac5b73 because it is not
>>>>> disconnected, current state = CONNECTED). So right now I don't have
>>>>> a workaround and have to kill all the nodes and start over.
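Matt's disconnect-before-delete process can also be scripted rather than driven through the UI. A minimal sketch, with the endpoints and payload shape assumed from the NiFi 1.x REST API and the third-party requests library; the host is illustrative, and in practice you would poll until the node actually reports DISCONNECTED before deleting, since disconnection is asynchronous:

    import requests  # third-party: pip install requests

    NIFI_API = "http://centos-b:8080/nifi-api"  # any reachable node; illustrative

    def remove_node(node_id):
        """Disconnect a node, then delete it -- the order the 1.x API
        enforces (deleting a CONNECTED node is rejected)."""
        # Ask the cluster coordinator to disconnect the node.
        requests.put(
            f"{NIFI_API}/controller/cluster/nodes/{node_id}",
            json={"node": {"nodeId": node_id, "status": "DISCONNECTING"}},
        ).raise_for_status()
        # Once the node reports DISCONNECTED, removal is allowed.
        requests.delete(
            f"{NIFI_API}/controller/cluster/nodes/{node_id}"
        ).raise_for_status()

    # Example with the node ID from Neil's error message:
    # remove_node("1780fde7-c2f4-469c-9884-fe843eac5b73")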
>>>>> On Thu, May 18, 2017 at 11:20 AM, Mark Payne <marka...@hotmail.com>
>>>>> wrote:
>>>>>> Hello,
>>>>>>
>>>>>> Just looking through this thread now. I believe that I understand
>>>>>> the problem. I have updated the JIRA with details about what I
>>>>>> think is the problem and a potential remedy for it.
>>>>>>
>>>>>> Thanks
>>>>>> -Mark
>>>>>>
>>>>>>> On May 18, 2017, at 9:49 AM, Matt Gilman <matt.c.gil...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Thanks for the additional details. They will be helpful when
>>>>>>> working the JIRA. All nodes, including the coordinator, heartbeat
>>>>>>> to the active coordinator. This means that the coordinator
>>>>>>> effectively heartbeats to itself. It appears, based on your log
>>>>>>> messages, that this is not happening. Because no heartbeats were
>>>>>>> received from any node, the lack of heartbeats from the
>>>>>>> terminated node is not considered.
>>>>>>>
>>>>>>> Matt
>>>>>>>
>>>>>>> Sent from my iPhone
>>>>>>>
>>>>>>>> On May 18, 2017, at 8:30 AM, ddewaele <ddewa...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> Found something interesting in the centos-b debug logging...
>>>>>>>>
>>>>>>>> After centos-a (the coordinator) is killed, centos-b takes over.
>>>>>>>> Notice how it "Will not disconnect any nodes due to lack of
>>>>>>>> heartbeat" and how it still sees centos-a as connected despite
>>>>>>>> the fact that there are no heartbeats anymore.
>>>>>>>>
>>>>>>>> 2017-05-18 12:41:38,010 INFO [Leader Election Notification Thread-2] o.apache.nifi.controller.FlowController This node elected Active Cluster Coordinator
>>>>>>>> 2017-05-18 12:41:38,010 DEBUG [Leader Election Notification Thread-2] o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor Purging old heartbeats
>>>>>>>> 2017-05-18 12:41:38,014 INFO [Leader Election Notification Thread-1] o.apache.nifi.controller.FlowController This node has been elected Primary Node
>>>>>>>> 2017-05-18 12:41:38,353 DEBUG [Heartbeat Monitor Thread-1] o.a.n.c.c.h.AbstractHeartbeatMonitor Received no new heartbeats. Will not disconnect any nodes due to lack of heartbeat
>>>>>>>> 2017-05-18 12:41:41,336 DEBUG [Process Cluster Protocol Request-3] o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor Received new heartbeat from centos-b:8080
>>>>>>>> 2017-05-18 12:41:41,337 DEBUG [Process Cluster Protocol Request-3] o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor
>>>>>>>>
>>>>>>>> Calculated diff between current cluster status and node cluster status as follows:
>>>>>>>> Node: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED, updateId=45], NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTED, updateId=42]]
>>>>>>>> Self: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED, updateId=45], NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTED, updateId=42]]
>>>>>>>> Difference: []
>>>>>>>>
>>>>>>>> 2017-05-18 12:41:41,337 INFO [Process Cluster Protocol Request-3] o.a.n.c.p.impl.SocketProtocolListener Finished processing request 410e7db5-8bb0-4f97-8ee8-fc8647c54959 (type=HEARTBEAT, length=2341 bytes) from centos-b:8080 in 3 millis
>>>>>>>> 2017-05-18 12:41:41,339 INFO [Clustering Tasks Thread-2] o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2017-05-18 12:41:41,330 and sent to centos-b:10001 at 2017-05-18 12:41:41,339; send took 8 millis
>>>>>>>> 2017-05-18 12:41:43,354 INFO [Heartbeat Monitor Thread-1] o.a.n.c.c.h.AbstractHeartbeatMonitor Finished processing 1 heartbeats in 93276 nanos
>>>>>>>> 2017-05-18 12:41:46,346 DEBUG [Process Cluster Protocol Request-4] o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor Received new heartbeat from centos-b:8080
>>>>>>>> 2017-05-18 12:41:46,346 DEBUG [Process Cluster Protocol Request-4] o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor
>>>>>>>>
>>>>>>>> Calculated diff between current cluster status and node cluster status as follows:
>>>>>>>> Node: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED, updateId=45], NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTED, updateId=42]]
>>>>>>>> Self: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED, updateId=45], NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTED, updateId=42]]
>>>>>>>> Difference: []
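To make the monitor behavior in these logs concrete, here is a simplified sketch of the guard Matt describes. It is an illustration only, not NiFi's implementation; the function, data structures, and timeout value are invented for the example:

    import time

    HEARTBEAT_TIMEOUT_SECS = 40  # illustrative threshold, not NiFi's value

    def nodes_to_disconnect(last_heartbeat, connected_nodes):
        """last_heartbeat: node id -> time the active coordinator last
        received a heartbeat from that node (the coordinator heartbeats
        to itself, so its own entry should appear here too)."""
        if not last_heartbeat:
            # "Received no new heartbeats. Will not disconnect any nodes
            # due to lack of heartbeat": with zero heartbeats the monitor
            # cannot tell a silent node from its own failure to receive,
            # so every node stays CONNECTED -- including the killed
            # coordinator, which is why centos-a never changes state.
            return []
        now = time.time()
        # Nodes whose latest heartbeat is missing or too old are
        # candidates for disconnection.
        return [n for n in connected_nodes
                if now - last_heartbeat.get(n, 0) > HEARTBEAT_TIMEOUT_SECS]

Once heartbeats from surviving nodes do arrive, a node that stays silent past the threshold would normally be marked DISCONNECTED; the JIRA Mark updated tracks why that was not happening for the killed coordinator here.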