ping shows acceptably fast response time between servers, approximately 0.100-0.150 ms
On Thu, May 25, 2017 at 11:13 AM, Joe Witt <joe.w...@gmail.com> wrote: > have you evaluated latency across the machines in your cluster? I ask > because 122ms is pretty long and 917ms is very long. Are these nodes > across a WAN link? > > On Thu, May 25, 2017 at 11:08 AM, Mark Bean <mark.o.b...@gmail.com> wrote: > > Update: now all 5 nodes, regardless of ZK server, are indicating > SUSPENDED > > -> RECONNECTED. > > > > On Thu, May 25, 2017 at 10:23 AM, Mark Bean <mark.o.b...@gmail.com> > wrote: > > > >> I reduced the number of embedded ZooKeeper servers on the 5-Node NiFi > >> Cluster from 5 to 3. This has improved the situation. I do not see any > of > >> the three Nodes which are also ZK servers disconnecting/reconnecting to > the > >> cluster as before. However, the two Nodes which are not running ZK > continue > >> to disconnect and reconnect. The following is taken from one of the > non-ZK > >> Nodes. It's curious that some messages are issued twice from the same > >> thread, but reference a different object > >> > >> nifi-app.log > >> 2017-05-25 13:40:01,628 INFO [main-EventTrhead] o.a.c.f.state. > ConnectionStateManager > >> State change: SUSPENDED > >> 2017-05-25 13:39:45,627 INFO [Clustering Tasks Thread-1] o.a.n.c.c. > ClusterProtocolHeaertbeater > >> Heartbeat create at 2017-05-25 13:39:45,504 and sent to FQDN:PORT at > >> 2017-05-25 13:39:45,627; send took 122 millis > >> 2017-05-25 13:39:50,862 INFO [Clustering Tasks Thread-1] o.a.n.c.c. > ClusterProtocolHeaertbeater > >> Heartbeat create at 2017-05-25 13:39:50,732 and sent to FQDN:PORT at > >> 2017-05-25 13:39:50,862; send took 122 millis > >> 2017-05-25 13:39:56,089 INFO [Clustering Tasks Thread-1] o.a.n.c.c. > ClusterProtocolHeaertbeater > >> Heartbeat create at 2017-05-25 13:39:55,966 and sent to FQDN:PORT at > >> 2017-05-25 13:39:56,089; send took 129 millis > >> 2017-05-25 13:40:01,629 INFO [Curator-ConnectionStateManager-0] > >> o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller. > >> leader.election.CuratorLeaderElectionManager$ElectionListener@68f8b6a2 > >> Connection State changed to SUSPENDED > >> 2017-05-25 13:40:01,629 INFO [Curator-ConnectionStateManager-0] > >> o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller. > >> leader.election.CuratorLeaderElectionManager$ElectionListener@663f55cd > >> Connection State changed to SUSPENDED > >> 2017-05-25 13:40:02,412 INFO [main-EventThread] o.a.c.f.state. > ConnectinoStateManager > >> State change: RECONNECTED > >> 2017-05-25 13:40:02,413 INFO [Curator-ConnectionStateManager-0] > >> o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller. > >> leader.election.CuratorLeaderElectionManager$ElectionListener@68f8b6a2 > >> Connection State changed to RECONNECTED > >> 2017-05-25 13:40:02,413 INFO [Curator-ConnectionStateManager-0] > >> o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller. > >> leader.election.CuratorLeaderElectionManager$ElectionListener@663f55cd > >> Connection State changed to RECONNECTED > >> 2017-05-25 13:40:02,550 INFO [Clustering Tasks Thread-1] o.a.n.c.c. > ClusterProtocolHeaertbeater > >> Heartbeat create at 2017-05-25 13:40:01,632 and sent to FQDN:PORT at > >> 2017-05-25 13:40:02,550; send took 917 millis > >> 2017-05-25 13:40:07,787 INFO [Clustering Tasks Thread-1] o.a.n.c.c. > ClusterProtocolHeaertbeater > >> Heartbeat create at 2017-05-25 13:40:07,657 and sent to FQDN:PORT at > >> 2017-05-25 13:40:07,787; send took 129 millis > >> > >> I will work on setting up an external ZK next, but would still like some > >> insight to what is being observed with the embedded ZK. > >> > >> Thanks, > >> Mark > >> > >> > >> > >> > >> On Wed, May 24, 2017 at 3:57 PM, Mark Bean <mark.o.b...@gmail.com> > wrote: > >> > >>> Yes, we are using the embedded ZK. We will try instantiating and > external > >>> ZK and see if that resolves the problem. > >>> > >>> The load on the system is extremely small. Currently (as Nodes are > >>> disconnecting/reconnecting) all input ports to the flow are turned > off. The > >>> only data in the flow is from a single GenerateFlow generating 5B > every 30 > >>> secs. > >>> > >>> Also, it is a 5-node cluster with embedded ZK on each node. First, I > will > >>> try reducing ZK to only 3 nodes. Then, I will try a 3-node external ZK. > >>> > >>> Thanks, > >>> Mark > >>> > >>> On Wed, May 24, 2017 at 11:49 AM, Joe Witt <joe.w...@gmail.com> wrote: > >>> > >>>> Are you using the embedded Zookeeper? If yes we recommend using an > >>>> external zookeeper. > >>>> > >>>> What type of load are the systems under when this occurs (cpu, > >>>> network, memory, disk io)? Under high load the default timeouts for > >>>> clustering are too aggressive. You can relax these for higher load > >>>> clusters and should see good behavior. Even if the system overall is > >>>> not under all that high of load if you're seeing garbage collection > >>>> pauses that are lengthy and/or frequent it can cause the same high > >>>> load effect as far as the JVM is concerned. > >>>> > >>>> Thanks > >>>> Joe > >>>> > >>>> On Wed, May 24, 2017 at 9:11 AM, Mark Bean <mark.o.b...@gmail.com> > >>>> wrote: > >>>> > We have a cluster which is showing signs of instability. The Primary > >>>> Node > >>>> > and Coordinator are reassigned to different nodes every several > >>>> minutes. I > >>>> > believe this is due to lack of heartbeat or other coordination. The > >>>> > following error occurs periodically in the nifi-app.log > >>>> > > >>>> > ERROR [CommitProcessor:1] o.apache.zookeeper.server.NIOServerCnxn > >>>> > Unexpected Exception: > >>>> > java.nio.channels.CancelledKeyException: null > >>>> > at sun.nio.ch.SelectionKeyImpl.en > >>>> sureValid(SectionKeyImpl.java:73) > >>>> > at sun.nio.ch.SelectionKeyImpl.in > >>>> terestOps(SelctionKeyImpl.java:77) > >>>> > at > >>>> > org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServ > >>>> erCnxn.java:151) > >>>> > at > >>>> > org.apache.zookeeper.server.NIOServerCnXn.sendResopnse(NIOSe > >>>> rverCnxn.java:1081) > >>>> > at > >>>> > org.apache.zookeeper.server.FinalRequestProcessor.processReq > >>>> uest(FinalRequestProcessor.java:404) > >>>> > at > >>>> > org.apache.zookeeper.server.quorum.CommitProcessor.run(Commi > >>>> tProcessor.java:74) > >>>> > > >>>> > Apache NiFi 1.2.0 > >>>> > > >>>> > Thoughts? > >>>> > >>> > >>> > >> >