Jeff,

The Nodes are disconnecting from the Cluster due to the problem reported in [1]. ZooKeeper fixed this in 3.4.10, which was the reason for inquiring about upgrading the embedded ZK to 3.4.10. I understand there are additional reasons (log4j) to wait for a later ZK release so those fixes can be included as well. But can we take two smaller steps (especially since the timeframe for ZK 3.5.2 or 3.6.0 is somewhat unknown) rather than one big step?
[1] https://issues.apache.org/jira/browse/ZOOKEEPER-2044

On Tue, May 30, 2017 at 8:42 AM, Jeff <jtsw...@gmail.com> wrote:
> Mark,
>
> I did report a JIRA [1] for upgrading to 3.5.2 or 3.6.0 (just due to log4j
> issues) once it's out and stable. There are issues with the way that ZK
> refers to log4j classes in the code that cause issues for NiFi and our
> Toolkit. However, there has been some back and forth [2] (in 3.4.0, which
> doesn't fix the issue, but moves towards fixing it), [3], and [4] on the
> changes being implemented in versions 3.5.2 and 3.6.0. Also, it looks like
> ZK 3.6.0 is headed toward using log4j 2 [5].
>
> There are many components outside of NiFi that are still using ZK 3.4.6, so
> it may be a while before we can move to 3.4.10. I don't currently know
> anything about the forward compatibility of 3.4.6. Are there
> improvements/fixes in 3.4.10 which you need?
>
> [1] https://issues.apache.org/jira/browse/NIFI-3067
> [2] https://issues.apache.org/jira/browse/ZOOKEEPER-850
> [3] https://issues.apache.org/jira/browse/ZOOKEEPER-1371
> [4] https://issues.apache.org/jira/browse/ZOOKEEPER-2393
> [5] https://issues.apache.org/jira/browse/ZOOKEEPER-2342
>
> - Jeff
>
> On Tue, May 30, 2017 at 8:15 AM Mark Bean <mark.o.b...@gmail.com> wrote:
>
> > Updated to external ZooKeeper last Friday. Over the weekend, there are no
> > reports of SUSPENDED or RECONNECTED.
> >
> > Are there plans to upgrade the embedded ZooKeeper to the latest version,
> > 3.4.10?
> >
> > Thanks,
> > Mark
> >
> > On Thu, May 25, 2017 at 11:56 AM, Joe Witt <joe.w...@gmail.com> wrote:
> >
> > > I looked at a secured cluster and the send times are routinely at
> > > 100ms, similar to yours. I think what I was flagging as potentially
> > > interesting is not interesting at all.
> > >
> > > On Thu, May 25, 2017 at 11:34 AM, Joe Witt <joe.w...@gmail.com> wrote:
> > > > Ok.
> > > > Well, as a point of comparison, I'm looking at heartbeat logs from
> > > > another cluster and the times are consistently 1-3 millis for the
> > > > send. Yours above show 100+ms typical with one north of 900ms. Not
> > > > sure how relevant that is, but it's something I noticed.
> > > >
> > > > On Thu, May 25, 2017 at 11:29 AM, Mark Bean <mark.o.b...@gmail.com> wrote:
> > > >> ping shows acceptably fast response time between servers,
> > > >> approximately 0.100-0.150 ms
> > > >>
> > > >> On Thu, May 25, 2017 at 11:13 AM, Joe Witt <joe.w...@gmail.com> wrote:
> > > >>
> > > >>> Have you evaluated latency across the machines in your cluster? I ask
> > > >>> because 122ms is pretty long and 917ms is very long. Are these nodes
> > > >>> across a WAN link?
> > > >>>
> > > >>> On Thu, May 25, 2017 at 11:08 AM, Mark Bean <mark.o.b...@gmail.com> wrote:
> > > >>> > Update: now all 5 nodes, regardless of ZK server, are indicating
> > > >>> > SUSPENDED -> RECONNECTED.
> > > >>> >
> > > >>> > On Thu, May 25, 2017 at 10:23 AM, Mark Bean <mark.o.b...@gmail.com> wrote:
> > > >>> >
> > > >>> >> I reduced the number of embedded ZooKeeper servers on the 5-Node NiFi
> > > >>> >> Cluster from 5 to 3. This has improved the situation. I do not see any of
> > > >>> >> the three Nodes which are also ZK servers disconnecting/reconnecting to
> > > >>> >> the cluster as before. However, the two Nodes which are not running ZK
> > > >>> >> continue to disconnect and reconnect. The following is taken from one of
> > > >>> >> the non-ZK Nodes. It's curious that some messages are issued twice from
> > > >>> >> the same thread, but reference a different object.
> > > >>> >>
> > > >>> >> nifi-app.log
> > > >>> >> 2017-05-25 13:40:01,628 INFO [main-EventThread] o.a.c.f.state.
> > > >>> >> ConnectionStateManager State change: SUSPENDED
> > > >>> >> 2017-05-25 13:39:45,627 INFO [Clustering Tasks Thread-1]
> > > >>> >> o.a.n.c.c.ClusterProtocolHeartbeater
> > > >>> >> Heartbeat created at 2017-05-25 13:39:45,504 and sent to FQDN:PORT at
> > > >>> >> 2017-05-25 13:39:45,627; send took 122 millis
> > > >>> >> 2017-05-25 13:39:50,862 INFO [Clustering Tasks Thread-1]
> > > >>> >> o.a.n.c.c.ClusterProtocolHeartbeater
> > > >>> >> Heartbeat created at 2017-05-25 13:39:50,732 and sent to FQDN:PORT at
> > > >>> >> 2017-05-25 13:39:50,862; send took 122 millis
> > > >>> >> 2017-05-25 13:39:56,089 INFO [Clustering Tasks Thread-1]
> > > >>> >> o.a.n.c.c.ClusterProtocolHeartbeater
> > > >>> >> Heartbeat created at 2017-05-25 13:39:55,966 and sent to FQDN:PORT at
> > > >>> >> 2017-05-25 13:39:56,089; send took 129 millis
> > > >>> >> 2017-05-25 13:40:01,629 INFO [Curator-ConnectionStateManager-0]
> > > >>> >> o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.
> > > >>> >> leader.election.CuratorLeaderElectionManager$ElectionListener@68f8b6a2
> > > >>> >> Connection State changed to SUSPENDED
> > > >>> >> 2017-05-25 13:40:01,629 INFO [Curator-ConnectionStateManager-0]
> > > >>> >> o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.
> > > >>> >> leader.election.CuratorLeaderElectionManager$ElectionListener@663f55cd
> > > >>> >> Connection State changed to SUSPENDED
> > > >>> >> 2017-05-25 13:40:02,412 INFO [main-EventThread]
> > > >>> >> o.a.c.f.state.ConnectionStateManager State change: RECONNECTED
> > > >>> >> 2017-05-25 13:40:02,413 INFO [Curator-ConnectionStateManager-0]
> > > >>> >> o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.
> > > >>> >> leader.election.CuratorLeaderElectionManager$ElectionListener@68f8b6a2
> > > >>> >> Connection State changed to RECONNECTED
> > > >>> >> 2017-05-25 13:40:02,413 INFO [Curator-ConnectionStateManager-0]
> > > >>> >> o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.
> > > >>> >> leader.election.CuratorLeaderElectionManager$ElectionListener@663f55cd
> > > >>> >> Connection State changed to RECONNECTED
> > > >>> >> 2017-05-25 13:40:02,550 INFO [Clustering Tasks Thread-1]
> > > >>> >> o.a.n.c.c.ClusterProtocolHeartbeater
> > > >>> >> Heartbeat created at 2017-05-25 13:40:01,632 and sent to FQDN:PORT at
> > > >>> >> 2017-05-25 13:40:02,550; send took 917 millis
> > > >>> >> 2017-05-25 13:40:07,787 INFO [Clustering Tasks Thread-1]
> > > >>> >> o.a.n.c.c.ClusterProtocolHeartbeater
> > > >>> >> Heartbeat created at 2017-05-25 13:40:07,657 and sent to FQDN:PORT at
> > > >>> >> 2017-05-25 13:40:07,787; send took 129 millis
> > > >>> >>
> > > >>> >> I will work on setting up an external ZK next, but would still like some
> > > >>> >> insight into what is being observed with the embedded ZK.
> > > >>> >>
> > > >>> >> Thanks,
> > > >>> >> Mark
> > > >>> >>
> > > >>> >> On Wed, May 24, 2017 at 3:57 PM, Mark Bean <mark.o.b...@gmail.com> wrote:
> > > >>> >>
> > > >>> >>> Yes, we are using the embedded ZK. We will try instantiating an external
> > > >>> >>> ZK and see if that resolves the problem.
> > > >>> >>>
> > > >>> >>> The load on the system is extremely small. Currently (as Nodes are
> > > >>> >>> disconnecting/reconnecting) all input ports to the flow are turned off.
> > > >>> >>> The only data in the flow is from a single GenerateFlowFile generating
> > > >>> >>> 5B every 30 secs.
> > > >>> >>>
> > > >>> >>> Also, it is a 5-node cluster with embedded ZK on each node.
> > > >>> >>> First, I will try reducing ZK to only 3 nodes. Then, I will try a
> > > >>> >>> 3-node external ZK.
> > > >>> >>>
> > > >>> >>> Thanks,
> > > >>> >>> Mark
> > > >>> >>>
> > > >>> >>> On Wed, May 24, 2017 at 11:49 AM, Joe Witt <joe.w...@gmail.com> wrote:
> > > >>> >>>
> > > >>> >>>> Are you using the embedded ZooKeeper? If yes, we recommend using an
> > > >>> >>>> external ZooKeeper.
> > > >>> >>>>
> > > >>> >>>> What type of load are the systems under when this occurs (cpu,
> > > >>> >>>> network, memory, disk io)? Under high load the default timeouts for
> > > >>> >>>> clustering are too aggressive. You can relax these for higher load
> > > >>> >>>> clusters and should see good behavior. Even if the system overall is
> > > >>> >>>> not under especially high load, garbage collection pauses that are
> > > >>> >>>> lengthy and/or frequent can cause the same high-load effect as far as
> > > >>> >>>> the JVM is concerned.
> > > >>> >>>>
> > > >>> >>>> Thanks
> > > >>> >>>> Joe
> > > >>> >>>>
> > > >>> >>>> On Wed, May 24, 2017 at 9:11 AM, Mark Bean <mark.o.b...@gmail.com> wrote:
> > > >>> >>>> > We have a cluster which is showing signs of instability. The Primary
> > > >>> >>>> > Node and Coordinator are reassigned to different nodes every several
> > > >>> >>>> > minutes. I believe this is due to lack of heartbeat or other
> > > >>> >>>> > coordination. The following error occurs periodically in the
> > > >>> >>>> > nifi-app.log
> > > >>> >>>> >
> > > >>> >>>> > ERROR [CommitProcessor:1] o.apache.zookeeper.server.
> > > >>> >>>> > NIOServerCnxn Unexpected Exception:
> > > >>> >>>> > java.nio.channels.CancelledKeyException: null
> > > >>> >>>> > at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
> > > >>> >>>> > at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77)
> > > >>> >>>> > at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:151)
> > > >>> >>>> > at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1081)
> > > >>> >>>> > at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:404)
> > > >>> >>>> > at org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:74)
> > > >>> >>>> >
> > > >>> >>>> > Apache NiFi 1.2.0
> > > >>> >>>> >
> > > >>> >>>> > Thoughts?
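[Editor's note: the timeout relaxation and external-ZooKeeper switch Joe suggests above can be sketched in nifi.properties. The property names below are from the NiFi configuration guide; the values are illustrative assumptions for a loaded cluster, not tuned recommendations.]

```properties
# Sketch only: relaxed cluster timeouts (NiFi 1.2.0 defaults are 5 secs
# for the cluster node timeouts and 3 secs for the ZooKeeper timeouts).
nifi.cluster.node.connection.timeout=30 secs
nifi.cluster.node.read.timeout=30 secs
nifi.zookeeper.connect.timeout=10 secs
nifi.zookeeper.session.timeout=10 secs

# Switching from the embedded ZooKeeper to an external 3-node ensemble
# (hostnames are placeholders).
nifi.state.management.embedded.zookeeper.start=false
nifi.zookeeper.connect.string=zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181
```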
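[Editor's note: the thread compares heartbeat send times across clusters by eyeballing the "send took N millis" lines. A quick way to pull those out of nifi-app.log is sketched below; `slow_heartbeats` and the 100 ms threshold are hypothetical, not part of NiFi.]

```python
import re

# Matches the trailing latency in ClusterProtocolHeartbeater log lines,
# e.g. "... sent to FQDN:PORT at 2017-05-25 13:40:02,550; send took 917 millis"
SEND_RE = re.compile(r"send took (\d+) millis")

def slow_heartbeats(lines, threshold_ms=100):
    """Return heartbeat send latencies (in ms) at or above threshold_ms."""
    latencies = []
    for line in lines:
        m = SEND_RE.search(line)
        if m:
            ms = int(m.group(1))
            if ms >= threshold_ms:
                latencies.append(ms)
    return latencies

# Sample lines based on the log excerpt in this thread:
log = [
    "2017-05-25 13:39:45,627 INFO [Clustering Tasks Thread-1] ... send took 122 millis",
    "2017-05-25 13:40:02,550 INFO [Clustering Tasks Thread-1] ... send took 917 millis",
    "2017-05-25 13:40:07,787 INFO [Clustering Tasks Thread-1] ... send took 2 millis",
]
print(slow_heartbeats(log))  # -> [122, 917]
```

On a healthy LAN cluster the list should be empty or near-empty; the 1-3 ms sends Joe describes fall well under any reasonable threshold.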