Have you evaluated latency across the machines in your cluster?  I ask
because 122ms is pretty long and 917ms is very long.  Are these nodes
across a WAN link?
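
A quick way to sanity-check that (hostnames below are just placeholders) is
to measure round-trip time from each node to its peers and to each ZK
server, e.g.:

    ping -c 20 node2.example.com
    echo ruok | nc zk1.example.com 2181    # a healthy ZK answers "imok"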

On Thu, May 25, 2017 at 11:08 AM, Mark Bean <mark.o.b...@gmail.com> wrote:
> Update: now all 5 nodes, regardless of whether they are running a ZK server,
> are showing SUSPENDED -> RECONNECTED transitions.
>
> On Thu, May 25, 2017 at 10:23 AM, Mark Bean <mark.o.b...@gmail.com> wrote:
>
>> I reduced the number of embedded ZooKeeper servers on the 5-node NiFi
>> cluster from 5 to 3. This has improved the situation. I no longer see any of
>> the three nodes which are also ZK servers disconnecting/reconnecting to the
>> cluster as before. However, the two nodes which are not running ZK continue
>> to disconnect and reconnect. The following is taken from one of the non-ZK
>> nodes. It's curious that some messages are issued twice from the same
>> thread but reference different objects.
>>
>> nifi-app.log
>> 2017-05-25 13:40:01,628 INFO [main-EventThread]
>> o.a.c.f.state.ConnectionStateManager
>> State change: SUSPENDED
>> 2017-05-25 13:39:45,627 INFO [Clustering Tasks Thread-1] 
>> o.a.n.c.c.ClusterProtocolHeartbeater
>> Heartbeat created at 2017-05-25 13:39:45,504 and sent to FQDN:PORT at
>> 2017-05-25 13:39:45,627; send took 122 millis
>> 2017-05-25 13:39:50,862 INFO [Clustering Tasks Thread-1] 
>> o.a.n.c.c.ClusterProtocolHeartbeater
>> Heartbeat created at 2017-05-25 13:39:50,732 and sent to FQDN:PORT at
>> 2017-05-25 13:39:50,862; send took 122 millis
>> 2017-05-25 13:39:56,089 INFO [Clustering Tasks Thread-1] 
>> o.a.n.c.c.ClusterProtocolHeartbeater
>> Heartbeat created at 2017-05-25 13:39:55,966 and sent to FQDN:PORT at
>> 2017-05-25 13:39:56,089; send took 129 millis
>> 2017-05-25 13:40:01,629 INFO [Curator-ConnectionStateManager-0]
>> o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.
>> leader.election.CuratorLeaderElectionManager$ElectionListener@68f8b6a2
>> Connection State changed to SUSPENDED
>> 2017-05-25 13:40:01,629 INFO [Curator-ConnectionStateManager-0]
>> o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.
>> leader.election.CuratorLeaderElectionManager$ElectionListener@663f55cd
>> Connection State changed to SUSPENDED
>> 2017-05-25 13:40:02,412 INFO [main-EventThread] 
>> o.a.c.f.state.ConnectionStateManager
>> State change: RECONNECTED
>> 2017-05-25 13:40:02,413 INFO [Curator-ConnectionStateManager-0]
>> o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.
>> leader.election.CuratorLeaderElectionManager$ElectionListener@68f8b6a2
>> Connection State changed to RECONNECTED
>> 2017-05-25 13:40:02,413 INFO [Curator-ConnectionStateManager-0]
>> o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.
>> leader.election.CuratorLeaderElectionManager$ElectionListener@663f55cd
>> Connection State changed to RECONNECTED
>> 2017-05-25 13:40:02,550 INFO [Clustering Tasks Thread-1] 
>> o.a.n.c.c.ClusterProtocolHeartbeater
>> Heartbeat created at 2017-05-25 13:40:01,632 and sent to FQDN:PORT at
>> 2017-05-25 13:40:02,550; send took 917 millis
>> 2017-05-25 13:40:07,787 INFO [Clustering Tasks Thread-1] 
>> o.a.n.c.c.ClusterProtocolHeartbeater
>> Heartbeat created at 2017-05-25 13:40:07,657 and sent to FQDN:PORT at
>> 2017-05-25 13:40:07,787; send took 129 millis
>>
>> I will work on setting up an external ZK next, but would still like some
>> insight into what is being observed with the embedded ZK.
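>>
>> When I do, I expect the relevant nifi.properties changes on each node to
>> look roughly like this (hostnames are placeholders):
>>
>>     nifi.state.management.embedded.zookeeper.start=false
>>     nifi.zookeeper.connect.string=zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181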
>>
>> Thanks,
>> Mark
>>
>>
>>
>>
>> On Wed, May 24, 2017 at 3:57 PM, Mark Bean <mark.o.b...@gmail.com> wrote:
>>
>>> Yes, we are using the embedded ZK. We will try instantiating an external
>>> ZK and see if that resolves the problem.
>>>
>>> The load on the system is extremely small. Currently (while nodes are
>>> disconnecting/reconnecting) all input ports to the flow are turned off. The
>>> only data in the flow is from a single GenerateFlowFile processor generating
>>> 5 bytes every 30 seconds.
>>>
>>> Also, it is a 5-node cluster with embedded ZK on each node. First, I will
>>> try reducing ZK to only 3 nodes. Then, I will try a 3-node external ZK.
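>>>
>>> For the first step, my understanding is it amounts to roughly this
>>> (hostnames are placeholders): list only three servers in
>>> conf/zookeeper.properties, enable the embedded ZK on just those three
>>> nodes, and point the connect string at the same three, e.g.:
>>>
>>>     # conf/zookeeper.properties (same on all nodes)
>>>     server.1=node1.example.com:2888:3888
>>>     server.2=node2.example.com:2888:3888
>>>     server.3=node3.example.com:2888:3888
>>>
>>>     # conf/nifi.properties -- true only on nodes 1-3, false on nodes 4 and 5
>>>     nifi.state.management.embedded.zookeeper.start=true
>>>     nifi.zookeeper.connect.string=node1.example.com:2181,node2.example.com:2181,node3.example.com:2181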
>>>
>>> Thanks,
>>> Mark
>>>
>>> On Wed, May 24, 2017 at 11:49 AM, Joe Witt <joe.w...@gmail.com> wrote:
>>>
>>>> Are you using the embedded ZooKeeper?  If yes, we recommend using an
>>>> external ZooKeeper.
>>>>
>>>> What type of load are the systems under when this occurs (CPU,
>>>> network, memory, disk I/O)? Under high load the default timeouts for
>>>> clustering are too aggressive.  You can relax these for higher-load
>>>> clusters and should see good behavior.  Even if the system overall is
>>>> not under particularly high load, garbage collection pauses that are
>>>> lengthy and/or frequent can have the same high-load effect as far as
>>>> the JVM is concerned.
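>>>>
>>>> For reference, the settings involved live in nifi.properties; the
>>>> values below are only meant to illustrate "relaxed", not to be a
>>>> recommendation:
>>>>
>>>>     nifi.cluster.protocol.heartbeat.interval=15 sec
>>>>     nifi.cluster.node.connection.timeout=30 sec
>>>>     nifi.cluster.node.read.timeout=30 sec
>>>>     nifi.zookeeper.connect.timeout=10 secs
>>>>     nifi.zookeeper.session.timeout=10 secs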
>>>>
>>>> Thanks
>>>> Joe
>>>>
>>>> On Wed, May 24, 2017 at 9:11 AM, Mark Bean <mark.o.b...@gmail.com>
>>>> wrote:
>>>> > We have a cluster which is showing signs of instability. The Primary
>>>> > Node and Coordinator are reassigned to different nodes every several
>>>> > minutes. I believe this is due to lack of heartbeat or other
>>>> > coordination. The following error occurs periodically in the
>>>> > nifi-app.log:
>>>> >
>>>> > ERROR [CommitProcessor:1] o.apache.zookeeper.server.NIOServerCnxn
>>>> > Unexpected Exception:
>>>> > java.nio.channels.CancelledKeyException: null
>>>> >         at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
>>>> >         at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77)
>>>> >         at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:151)
>>>> >         at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1081)
>>>> >         at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:404)
>>>> >         at org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:74)
>>>> >
>>>> > Apache NiFi 1.2.0
>>>> >
>>>> > Thoughts?
>>>>
>>>
>>>
>>
