Neil,

I want to make sure I understand what you're saying. What you are stating is that this is a single point of failure?
Thanks
Joe

On Thu, May 18, 2017 at 5:27 PM, Neil Derraugh
<neil.derra...@intellifylearning.com> wrote:
> Thanks for the insight Matt.
>
> It's a disaster recovery issue. It's not something I plan on doing on
> purpose. It seems it is a single point of failure, unfortunately. I can see
> no other way to resolve the issue than to blow everything away and
> start a new cluster.
>
> On Thu, May 18, 2017 at 2:49 PM, Matt Gilman <matt.c.gil...@gmail.com>
> wrote:
>>
>> Neil,
>>
>> Disconnecting a node prior to removal is the correct process. It appears
>> that the check was lost going from 0.x to 1.x. Folks reported this JIRA [1]
>> indicating that deleting a connected node did not work. That process does
>> not work because the node needs to be disconnected first. The JIRA was
>> addressed by restoring the check that a node is disconnected prior to
>> deletion.
>>
>> Hopefully the JIRA I filed earlier today [2] will address the phantom node
>> you were seeing. Until then, can you update your workaround to disconnect
>> the node in question prior to deletion?
>>
>> Thanks
>>
>> Matt
>>
>> [1] https://issues.apache.org/jira/browse/NIFI-3295
>> [2] https://issues.apache.org/jira/browse/NIFI-3933
>>
>> On Thu, May 18, 2017 at 12:29 PM, Neil Derraugh
>> <neil.derra...@intellifylearning.com> wrote:
>>>
>>> Pretty sure this is the problem I was describing in the "Phantom Node"
>>> thread recently.
>>>
>>> If I kill non-primary nodes the cluster remains healthy despite the lost
>>> nodes. The terminated nodes end up with a DISCONNECTED status.
>>>
>>> If I kill the primary it winds up with a CONNECTED status, but a new
>>> primary/cluster coordinator gets elected too.
>>>
>>> Additionally, it seems that in 1.2.0 the REST API no longer supports
>>> deleting a node in a CONNECTED state (Cannot remove Node with ID
>>> 1780fde7-c2f4-469c-9884-fe843eac5b73 because it is not disconnected,
>>> current state = CONNECTED).
>>> So right now I don't have a workaround and have to kill
>>> all the nodes and start over.
>>>
>>> On Thu, May 18, 2017 at 11:20 AM, Mark Payne <marka...@hotmail.com>
>>> wrote:
>>>>
>>>> Hello,
>>>>
>>>> Just looking through this thread now. I believe that I understand the
>>>> problem. I have updated the JIRA with details about what I think is the
>>>> problem and a potential remedy for it.
>>>>
>>>> Thanks
>>>> -Mark
>>>>
>>>> > On May 18, 2017, at 9:49 AM, Matt Gilman <matt.c.gil...@gmail.com>
>>>> > wrote:
>>>> >
>>>> > Thanks for the additional details. They will be helpful when working
>>>> > the JIRA. All nodes, including the coordinator, heartbeat to the active
>>>> > coordinator. This means that the coordinator effectively heartbeats to
>>>> > itself. It appears, based on your log messages, that this is not
>>>> > happening. Because no heartbeats were received from any node, the lack
>>>> > of heartbeats from the terminated node is not considered.
>>>> >
>>>> > Matt
>>>> >
>>>> > Sent from my iPhone
>>>> >
>>>> >> On May 18, 2017, at 8:30 AM, ddewaele <ddewa...@gmail.com> wrote:
>>>> >>
>>>> >> Found something interesting in the centos-b debug logging...
>>>> >>
>>>> >> After centos-a (the coordinator) is killed, centos-b takes over.
>>>> >> Notice how it "Will not disconnect any nodes due to lack of heartbeat"
>>>> >> and how it still sees centos-a as connected despite the fact that
>>>> >> there are no heartbeats anymore.
>>>> >> >>>> >> 2017-05-18 12:41:38,010 INFO [Leader Election Notification Thread-2] >>>> >> o.apache.nifi.controller.FlowController This node elected Active >>>> >> Cluster >>>> >> Coordinator >>>> >> 2017-05-18 12:41:38,010 DEBUG [Leader Election Notification Thread-2] >>>> >> o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor Purging old heartbeats >>>> >> 2017-05-18 12:41:38,014 INFO [Leader Election Notification Thread-1] >>>> >> o.apache.nifi.controller.FlowController This node has been elected >>>> >> Primary >>>> >> Node >>>> >> 2017-05-18 12:41:38,353 DEBUG [Heartbeat Monitor Thread-1] >>>> >> o.a.n.c.c.h.AbstractHeartbeatMonitor Received no new heartbeats. Will >>>> >> not >>>> >> disconnect any nodes due to lack of heartbeat >>>> >> 2017-05-18 12:41:41,336 DEBUG [Process Cluster Protocol Request-3] >>>> >> o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor Received new heartbeat >>>> >> from >>>> >> centos-b:8080 >>>> >> 2017-05-18 12:41:41,337 DEBUG [Process Cluster Protocol Request-3] >>>> >> o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor >>>> >> >>>> >> Calculated diff between current cluster status and node cluster >>>> >> status as >>>> >> follows: >>>> >> Node: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED, >>>> >> updateId=45], NodeConnectionStatus[nodeId=centos-a:8080, >>>> >> state=CONNECTED, >>>> >> updateId=42]] >>>> >> Self: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED, >>>> >> updateId=45], NodeConnectionStatus[nodeId=centos-a:8080, >>>> >> state=CONNECTED, >>>> >> updateId=42]] >>>> >> Difference: [] >>>> >> >>>> >> >>>> >> 2017-05-18 12:41:41,337 INFO [Process Cluster Protocol Request-3] >>>> >> o.a.n.c.p.impl.SocketProtocolListener Finished processing request >>>> >> 410e7db5-8bb0-4f97-8ee8-fc8647c54959 (type=HEARTBEAT, length=2341 >>>> >> bytes) >>>> >> from centos-b:8080 in 3 millis >>>> >> 2017-05-18 12:41:41,339 INFO [Clustering Tasks Thread-2] >>>> >> o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2017-05-18 
>>>> >> 12:41:41,330 and sent to centos-b:10001 at 2017-05-18 12:41:41,339;
>>>> >> send took 8 millis
>>>> >> 2017-05-18 12:41:43,354 INFO [Heartbeat Monitor Thread-1]
>>>> >> o.a.n.c.c.h.AbstractHeartbeatMonitor Finished processing 1 heartbeats
>>>> >> in 93276 nanos
>>>> >> 2017-05-18 12:41:46,346 DEBUG [Process Cluster Protocol Request-4]
>>>> >> o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor Received new heartbeat
>>>> >> from centos-b:8080
>>>> >> 2017-05-18 12:41:46,346 DEBUG [Process Cluster Protocol Request-4]
>>>> >> o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor
>>>> >>
>>>> >> Calculated diff between current cluster status and node cluster
>>>> >> status as follows:
>>>> >> Node: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED,
>>>> >> updateId=45], NodeConnectionStatus[nodeId=centos-a:8080,
>>>> >> state=CONNECTED, updateId=42]]
>>>> >> Self: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED,
>>>> >> updateId=45], NodeConnectionStatus[nodeId=centos-a:8080,
>>>> >> state=CONNECTED, updateId=42]]
>>>> >> Difference: []
>>>> >>
>>>> >> --
>>>> >> View this message in context:
>>>> >> http://apache-nifi-users-list.2361937.n4.nabble.com/Nifi-Cluster-fails-to-disconnect-node-when-node-was-killed-tp1942p1950.html
>>>> >> Sent from the Apache NiFi Users List mailing list archive at
>>>> >> Nabble.com.
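[Editor's note] The disconnect-then-delete sequence Matt recommends can be driven through NiFi's cluster REST API. The sketch below is a minimal illustration, assuming an unsecured NiFi 1.x cluster, the `/nifi-api/controller/cluster/nodes/{id}` endpoints, and the host and node UUID mentioned earlier in the thread; it is not the thread authors' code.

```python
import json
import urllib.request

# Assumed values, taken from the thread; adjust for your cluster.
NIFI_API = "http://centos-b:8080/nifi-api"
NODE_ID = "1780fde7-c2f4-469c-9884-fe843eac5b73"


def disconnect_payload(node_id: str) -> bytes:
    """Build the PUT body asking the cluster to disconnect a node."""
    return json.dumps(
        {"node": {"nodeId": node_id, "status": "DISCONNECTING"}}
    ).encode()


def remove_node(api: str, node_id: str) -> None:
    """Disconnect a node, then delete it from the cluster."""
    # Step 1: disconnect. Required since the check restored in NIFI-3295:
    # a CONNECTED node cannot be deleted directly.
    req = urllib.request.Request(
        f"{api}/controller/cluster/nodes/{node_id}",
        data=disconnect_payload(node_id),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    urllib.request.urlopen(req)

    # Step 2: delete. Accepted now that the node is DISCONNECTED.
    req = urllib.request.Request(
        f"{api}/controller/cluster/nodes/{node_id}", method="DELETE"
    )
    urllib.request.urlopen(req)


if __name__ == "__main__":
    remove_node(NIFI_API, NODE_ID)
```

On a secured cluster the requests would additionally need authentication (e.g. a bearer token header), omitted here for brevity.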
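[Editor's note] Matt's explanation of the failure mode can be sketched as a toy model: a freshly elected coordinator purges all heartbeat records, and the monitor only evaluates nodes it has heard from since that purge, so a killed node that never heartbeats again is simply never considered. This is an illustration of the reported quirk under those assumptions, not NiFi's actual implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class HeartbeatMonitor:
    """Toy model of the coordinator-side heartbeat monitoring quirk."""
    threshold_secs: float = 40.0
    # node id -> time of the last heartbeat received since the last purge
    last_seen: Dict[str, float] = field(default_factory=dict)

    def on_elected(self) -> None:
        # "Purging old heartbeats": a newly elected coordinator starts
        # with no heartbeat records at all.
        self.last_seen.clear()

    def receive(self, node_id: str, at: float) -> None:
        self.last_seen[node_id] = at

    def stale(self, now: float) -> List[str]:
        # Only nodes that have heartbeated since the purge are evaluated.
        # A node killed before the election (centos-a) never appears in
        # last_seen, so it is never flagged and stays CONNECTED.
        return [
            node for node, t in self.last_seen.items()
            if now - t > self.threshold_secs
        ]


monitor = HeartbeatMonitor()
monitor.on_elected()                      # centos-b becomes coordinator
monitor.receive("centos-b:8080", 100.0)   # only centos-b heartbeats
print(monitor.stale(now=110.0))           # → [] : centos-a is never evaluated
```

The remedy discussed in the JIRAs amounts to also tracking nodes the coordinator believes are CONNECTED but has not heard from, rather than only the nodes with recent heartbeat records.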