Re: unstable cluster

Mark Bean Thu, 25 May 2017 07:23:24 -0700

I reduced the number of embedded ZooKeeper servers on the 5-Node NiFi
Cluster from 5 to 3. This has improved the situation. I do not see any of
the three Nodes which are also ZK servers disconnecting/reconnecting to the
cluster as before. However, the two Nodes which are not running ZK continue
to disconnect and reconnect. The following is taken from one of the non-ZK
Nodes. It's curious that some messages are issued twice from the same
thread, but reference a different object.


nifi-app.log
2017-05-25 13:39:45,627 INFO [Clustering Tasks Thread-1]
o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2017-05-25
13:39:45,504 and sent to FQDN:PORT at 2017-05-25 13:39:45,627; send took
122 millis
2017-05-25 13:39:50,862 INFO [Clustering Tasks Thread-1]
o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2017-05-25
13:39:50,732 and sent to FQDN:PORT at 2017-05-25 13:39:50,862; send took
122 millis
2017-05-25 13:39:56,089 INFO [Clustering Tasks Thread-1]
o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2017-05-25
13:39:55,966 and sent to FQDN:PORT at 2017-05-25 13:39:56,089; send took
129 millis
2017-05-25 13:40:01,628 INFO [main-EventThread]
o.a.c.f.state.ConnectionStateManager State change: SUSPENDED
2017-05-25 13:40:01,629 INFO [Curator-ConnectionStateManager-0]
o.a.n.c.l.e.CuratorLeaderElectionManager
org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@68f8b6a2
Connection State changed to SUSPENDED
2017-05-25 13:40:01,629 INFO [Curator-ConnectionStateManager-0]
o.a.n.c.l.e.CuratorLeaderElectionManager
org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@663f55cd
Connection State changed to SUSPENDED
2017-05-25 13:40:02,412 INFO [main-EventThread]
o.a.c.f.state.ConnectionStateManager State change: RECONNECTED
2017-05-25 13:40:02,413 INFO [Curator-ConnectionStateManager-0]
o.a.n.c.l.e.CuratorLeaderElectionManager
org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@68f8b6a2
Connection State changed to RECONNECTED
2017-05-25 13:40:02,413 INFO [Curator-ConnectionStateManager-0]
o.a.n.c.l.e.CuratorLeaderElectionManager
org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@663f55cd
Connection State changed to RECONNECTED
2017-05-25 13:40:02,550 INFO [Clustering Tasks Thread-1]
o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2017-05-25
13:40:01,632 and sent to FQDN:PORT at 2017-05-25 13:40:02,550; send took
917 millis
2017-05-25 13:40:07,787 INFO [Clustering Tasks Thread-1]
o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2017-05-25
13:40:07,657 and sent to FQDN:PORT at 2017-05-25 13:40:07,787; send took
129 millis
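
To check whether slow heartbeats line up with the ZooKeeper SUSPENDED
events over a longer stretch of the log, something like the following
could be used. It is only a rough sketch: it assumes each entry in
nifi-app.log sits on a single line (the excerpt above is wrapped by the
mail client), and the function names and the 500 ms threshold are my own
choices, not anything NiFi defines.

```python
import re
from datetime import datetime

# Assumption: one log entry per line, starting with a timestamp such as
# "2017-05-25 13:40:02,550".
TS = r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})"
HEARTBEAT = re.compile(TS + r" .*Heartbeat created? at .*; send took (\d+) millis")
STATE = re.compile(TS + r" .*ConnectionStateManager State change: (\w+)")


def parse_ts(text):
    """Parse the timestamp at the start of a log entry."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S,%f")


def summarize(lines, slow_millis=500):
    """Return (slow_heartbeats, state_changes) found in the log lines.

    slow_heartbeats: (timestamp, send_millis) for sends >= slow_millis.
    state_changes:   (timestamp, state) for Curator state changes.
    The 500 ms default is an arbitrary, hypothetical threshold.
    """
    slow, states = [], []
    for line in lines:
        m = HEARTBEAT.match(line)
        if m:
            millis = int(m.group(2))
            if millis >= slow_millis:
                slow.append((parse_ts(m.group(1)), millis))
            continue
        m = STATE.match(line)
        if m:
            states.append((parse_ts(m.group(1)), m.group(2)))
    return slow, states
```

Calling summarize() on the open log file and comparing the two lists by
timestamp should show whether each slow send falls next to a
SUSPENDED/RECONNECTED pair, as it does in the excerpt above.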

I will work on setting up an external ZK next, but would still like some
insight into what is being observed with the embedded ZK.

Thanks,
Mark




On Wed, May 24, 2017 at 3:57 PM, Mark Bean <mark.o.b...@gmail.com> wrote:

> Yes, we are using the embedded ZK. We will try instantiating an external
> ZK and see if that resolves the problem.
>
> The load on the system is extremely small. Currently (as Nodes are
> disconnecting/reconnecting) all input ports to the flow are turned off. The
> only data in the flow is from a single GenerateFlowFile generating 5B every
> 30 secs.
>
> Also, it is a 5-node cluster with embedded ZK on each node. First, I will
> try reducing ZK to only 3 nodes. Then, I will try a 3-node external ZK.
>
> Thanks,
> Mark
>
> On Wed, May 24, 2017 at 11:49 AM, Joe Witt <joe.w...@gmail.com> wrote:
>
>> Are you using the embedded Zookeeper?  If yes we recommend using an
>> external zookeeper.
>>
>> What type of load are the systems under when this occurs (cpu,
>> network, memory, disk io)? Under high load the default timeouts for
>> clustering are too aggressive.  You can relax these for higher load
>> clusters and should see good behavior.  Even if the overall system load
>> is not that high, lengthy and/or frequent garbage collection pauses can
>> have the same effect as high load as far as the JVM is concerned.
>>
>> Thanks
>> Joe
>>
>> On Wed, May 24, 2017 at 9:11 AM, Mark Bean <mark.o.b...@gmail.com> wrote:
>> > We have a cluster which is showing signs of instability. The Primary Node
>> > and Coordinator are reassigned to different nodes every several minutes. I
>> > believe this is due to lack of heartbeat or other coordination. The
>> > following error occurs periodically in the nifi-app.log
>> >
>> > ERROR [CommitProcessor:1] o.apache.zookeeper.server.NIOServerCnxn
>> > Unexpected Exception:
>> > java.nio.channels.CancelledKeyException: null
>> >         at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
>> >         at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77)
>> >         at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:151)
>> >         at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1081)
>> >         at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:404)
>> >         at org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:74)
>> >
>> > Apache NiFi 1.2.0
>> >
>> > Thoughts?
>>
>
>
