Jeff,

The Nodes are disconnecting from the Cluster due to the problem reported in [1]. ZooKeeper fixed this in 3.4.10, which was the reason for inquiring about upgrading the embedded ZK to 3.4.10. I understand there are additional reasons (log4j) to wait for a later ZK release so those fixes can be included as well. But can we take two smaller steps (especially since the timeframe for ZK 3.5.2 or 3.6.0 is somewhat unknown) rather than one big step?
[1] https://issues.apache.org/jira/browse/ZOOKEEPER-2044

On Tue, May 30, 2017 at 8:42 AM, Jeff <jtsw...@gmail.com> wrote:
> Mark,
>
> I did report a JIRA [1] for upgrading to 3.5.2 or 3.6.0 (just due to log4j
> issues) once it's out and stable. There are issues with the way that ZK
> refers to log4j classes in the code that cause issues for NiFi and our
> Toolkit. However, there has been some back and forth [2] (in 3.4.0, which
> doesn't fix the issue, but moves towards fixing it), [3], and [4] on the
> changes being implemented in versions 3.5.2 and 3.6.0. Also, it looks like
> ZK 3.6.0 is headed toward using log4j 2 [5].
>
> There are many components outside of NiFi that are still using ZK 3.4.6, so
> it may be a while before we can move to 3.4.10. I don't currently know
> anything about the forward compatibility of 3.4.6. Are there
> improvements/fixes in 3.4.10 which you need?
>
> [1] https://issues.apache.org/jira/browse/NIFI-3067
> [2] https://issues.apache.org/jira/browse/ZOOKEEPER-850
> [3] https://issues.apache.org/jira/browse/ZOOKEEPER-1371
> [4] https://issues.apache.org/jira/browse/ZOOKEEPER-2393
> [5] https://issues.apache.org/jira/browse/ZOOKEEPER-2342
>
> - Jeff
>
> On Tue, May 30, 2017 at 8:15 AM Mark Bean <mark.o.b...@gmail.com> wrote:
>
> > Updated to external ZooKeeper last Friday. Over the weekend, there are no
> > reports of SUSPENDED or RECONNECTED.
> >
> > Are there plans to upgrade the embedded ZooKeeper to the latest version,
> > 3.4.10?
> >
> > Thanks,
> > Mark
> >
> > On Thu, May 25, 2017 at 11:56 AM, Joe Witt <joe.w...@gmail.com> wrote:
> >
> > > I looked at a secured cluster and the send times are routinely at
> > > 100ms, similar to yours. I think what I was flagging as potentially
> > > interesting is not interesting at all.
> > >
> > > On Thu, May 25, 2017 at 11:34 AM, Joe Witt <joe.w...@gmail.com> wrote:
> > > > Ok.
> > > > Well, as a point of comparison, I'm looking at heartbeat logs from
> > > > another cluster and the times are consistently 1-3 millis for the
> > > > send. Yours above show 100+ms typical with one north of 900ms. Not
> > > > sure how relevant that is, but it's something I noticed.
> > > >
> > > > On Thu, May 25, 2017 at 11:29 AM, Mark Bean <mark.o.b...@gmail.com> wrote:
> > > >> ping shows acceptably fast response time between servers,
> > > >> approximately 0.100-0.150 ms
> > > >>
> > > >> On Thu, May 25, 2017 at 11:13 AM, Joe Witt <joe.w...@gmail.com> wrote:
> > > >>
> > > >>> Have you evaluated latency across the machines in your cluster? I ask
> > > >>> because 122ms is pretty long and 917ms is very long. Are these nodes
> > > >>> across a WAN link?
> > > >>>
> > > >>> On Thu, May 25, 2017 at 11:08 AM, Mark Bean <mark.o.b...@gmail.com> wrote:
> > > >>> > Update: now all 5 nodes, regardless of ZK server, are indicating
> > > >>> > SUSPENDED -> RECONNECTED.
> > > >>> >
> > > >>> > On Thu, May 25, 2017 at 10:23 AM, Mark Bean <mark.o.b...@gmail.com> wrote:
> > > >>> >
> > > >>> >> I reduced the number of embedded ZooKeeper servers on the 5-Node NiFi
> > > >>> >> Cluster from 5 to 3. This has improved the situation. I do not see any of
> > > >>> >> the three Nodes which are also ZK servers disconnecting/reconnecting to
> > > >>> >> the cluster as before. However, the two Nodes which are not running ZK
> > > >>> >> continue to disconnect and reconnect. The following is taken from one of
> > > >>> >> the non-ZK Nodes. It's curious that some messages are issued twice from
> > > >>> >> the same thread, but reference a different object.
> > > >>> >>
> > > >>> >> nifi-app.log
> > > >>> >> 2017-05-25 13:40:01,628 INFO [main-EventThread] o.a.c.f.state.
> > > >>> >> ConnectionStateManager State change: SUSPENDED
> > > >>> >> 2017-05-25 13:39:45,627 INFO [Clustering Tasks Thread-1]
> > > >>> >> o.a.n.c.c.ClusterProtocolHeartbeater
> > > >>> >> Heartbeat created at 2017-05-25 13:39:45,504 and sent to FQDN:PORT at
> > > >>> >> 2017-05-25 13:39:45,627; send took 122 millis
> > > >>> >> 2017-05-25 13:39:50,862 INFO [Clustering Tasks Thread-1]
> > > >>> >> o.a.n.c.c.ClusterProtocolHeartbeater
> > > >>> >> Heartbeat created at 2017-05-25 13:39:50,732 and sent to FQDN:PORT at
> > > >>> >> 2017-05-25 13:39:50,862; send took 122 millis
> > > >>> >> 2017-05-25 13:39:56,089 INFO [Clustering Tasks Thread-1]
> > > >>> >> o.a.n.c.c.ClusterProtocolHeartbeater
> > > >>> >> Heartbeat created at 2017-05-25 13:39:55,966 and sent to FQDN:PORT at
> > > >>> >> 2017-05-25 13:39:56,089; send took 129 millis
> > > >>> >> 2017-05-25 13:40:01,629 INFO [Curator-ConnectionStateManager-0]
> > > >>> >> o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.
> > > >>> >> leader.election.CuratorLeaderElectionManager$ElectionListener@68f8b6a2
> > > >>> >> Connection State changed to SUSPENDED
> > > >>> >> 2017-05-25 13:40:01,629 INFO [Curator-ConnectionStateManager-0]
> > > >>> >> o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.
> > > >>> >> leader.election.CuratorLeaderElectionManager$ElectionListener@663f55cd
> > > >>> >> Connection State changed to SUSPENDED
> > > >>> >> 2017-05-25 13:40:02,412 INFO [main-EventThread]
> > > >>> >> o.a.c.f.state.ConnectionStateManager State change: RECONNECTED
> > > >>> >> 2017-05-25 13:40:02,413 INFO [Curator-ConnectionStateManager-0]
> > > >>> >> o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.
> > > >>> >> leader.election.CuratorLeaderElectionManager$ElectionListener@68f8b6a2
> > > >>> >> Connection State changed to RECONNECTED
> > > >>> >> 2017-05-25 13:40:02,413 INFO [Curator-ConnectionStateManager-0]
> > > >>> >> o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.
> > > >>> >> leader.election.CuratorLeaderElectionManager$ElectionListener@663f55cd
> > > >>> >> Connection State changed to RECONNECTED
> > > >>> >> 2017-05-25 13:40:02,550 INFO [Clustering Tasks Thread-1]
> > > >>> >> o.a.n.c.c.ClusterProtocolHeartbeater
> > > >>> >> Heartbeat created at 2017-05-25 13:40:01,632 and sent to FQDN:PORT at
> > > >>> >> 2017-05-25 13:40:02,550; send took 917 millis
> > > >>> >> 2017-05-25 13:40:07,787 INFO [Clustering Tasks Thread-1]
> > > >>> >> o.a.n.c.c.ClusterProtocolHeartbeater
> > > >>> >> Heartbeat created at 2017-05-25 13:40:07,657 and sent to FQDN:PORT at
> > > >>> >> 2017-05-25 13:40:07,787; send took 129 millis
> > > >>> >>
> > > >>> >> I will work on setting up an external ZK next, but would still like some
> > > >>> >> insight into what is being observed with the embedded ZK.
> > > >>> >>
> > > >>> >> Thanks,
> > > >>> >> Mark
> > > >>> >>
> > > >>> >> On Wed, May 24, 2017 at 3:57 PM, Mark Bean <mark.o.b...@gmail.com> wrote:
> > > >>> >>
> > > >>> >>> Yes, we are using the embedded ZK. We will try instantiating an external
> > > >>> >>> ZK and see if that resolves the problem.
> > > >>> >>>
> > > >>> >>> The load on the system is extremely small. Currently (as Nodes are
> > > >>> >>> disconnecting/reconnecting) all input ports to the flow are turned off.
> > > >>> >>> The only data in the flow is from a single GenerateFlowFile generating
> > > >>> >>> 5B every 30 secs.
> > > >>> >>>
> > > >>> >>> Also, it is a 5-node cluster with embedded ZK on each node.
> > > >>> >>> First, I will try reducing ZK to only 3 nodes. Then, I will try a
> > > >>> >>> 3-node external ZK.
> > > >>> >>>
> > > >>> >>> Thanks,
> > > >>> >>> Mark
> > > >>> >>>
> > > >>> >>> On Wed, May 24, 2017 at 11:49 AM, Joe Witt <joe.w...@gmail.com> wrote:
> > > >>> >>>
> > > >>> >>>> Are you using the embedded ZooKeeper? If yes, we recommend using an
> > > >>> >>>> external ZooKeeper.
> > > >>> >>>>
> > > >>> >>>> What type of load are the systems under when this occurs (cpu,
> > > >>> >>>> network, memory, disk io)? Under high load the default timeouts for
> > > >>> >>>> clustering are too aggressive. You can relax these for higher load
> > > >>> >>>> clusters and should see good behavior. Even if the system overall is
> > > >>> >>>> not under especially high load, garbage collection pauses that are
> > > >>> >>>> lengthy and/or frequent can cause the same high-load effect as far as
> > > >>> >>>> the JVM is concerned.
> > > >>> >>>>
> > > >>> >>>> Thanks
> > > >>> >>>> Joe
> > > >>> >>>>
> > > >>> >>>> On Wed, May 24, 2017 at 9:11 AM, Mark Bean <mark.o.b...@gmail.com> wrote:
> > > >>> >>>> > We have a cluster which is showing signs of instability. The Primary
> > > >>> >>>> > Node and Coordinator are reassigned to different nodes every several
> > > >>> >>>> > minutes. I believe this is due to lack of heartbeat or other
> > > >>> >>>> > coordination. The following error occurs periodically in the
> > > >>> >>>> > nifi-app.log
> > > >>> >>>> >
> > > >>> >>>> > ERROR [CommitProcessor:1] o.apache.zookeeper.server.
> > > >>> >>>> > NIOServerCnxn Unexpected Exception:
> > > >>> >>>> > java.nio.channels.CancelledKeyException: null
> > > >>> >>>> > at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
> > > >>> >>>> > at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77)
> > > >>> >>>> > at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:151)
> > > >>> >>>> > at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1081)
> > > >>> >>>> > at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:404)
> > > >>> >>>> > at org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:74)
> > > >>> >>>> >
> > > >>> >>>> > Apache NiFi 1.2.0
> > > >>> >>>> >
> > > >>> >>>> > Thoughts?
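[Editor's note: the timeout relaxation and external-ZooKeeper switch Joe suggests above can be sketched in nifi.properties. The property names below are from the NiFi configuration guide; the values are illustrative assumptions for a loaded cluster, not tuned recommendations.]

```properties
# Sketch only: relaxed cluster timeouts (NiFi 1.2.0 defaults are 5 secs
# for the cluster node timeouts and 3 secs for the ZooKeeper timeouts).
nifi.cluster.node.connection.timeout=30 secs
nifi.cluster.node.read.timeout=30 secs
nifi.zookeeper.connect.timeout=10 secs
nifi.zookeeper.session.timeout=10 secs

# Switching from the embedded ZooKeeper to an external 3-node ensemble
# (hostnames are placeholders).
nifi.state.management.embedded.zookeeper.start=false
nifi.zookeeper.connect.string=zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181
```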
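[Editor's note: the thread compares heartbeat send times across clusters by eyeballing the "send took N millis" lines. A quick way to pull those out of nifi-app.log is sketched below; `slow_heartbeats` and the 100 ms threshold are hypothetical, not part of NiFi.]

```python
import re

# Matches the trailing latency in ClusterProtocolHeartbeater log lines,
# e.g. "... sent to FQDN:PORT at 2017-05-25 13:40:02,550; send took 917 millis"
SEND_RE = re.compile(r"send took (\d+) millis")

def slow_heartbeats(lines, threshold_ms=100):
    """Return heartbeat send latencies (in ms) at or above threshold_ms."""
    latencies = []
    for line in lines:
        m = SEND_RE.search(line)
        if m:
            ms = int(m.group(1))
            if ms >= threshold_ms:
                latencies.append(ms)
    return latencies

# Sample lines based on the log excerpt in this thread:
log = [
    "2017-05-25 13:39:45,627 INFO [Clustering Tasks Thread-1] ... send took 122 millis",
    "2017-05-25 13:40:02,550 INFO [Clustering Tasks Thread-1] ... send took 917 millis",
    "2017-05-25 13:40:07,787 INFO [Clustering Tasks Thread-1] ... send took 2 millis",
]
print(slow_heartbeats(log))  # -> [122, 917]
```

On a healthy LAN cluster the list should be empty or near-empty; the 1-3 ms sends Joe describes fall well under any reasonable threshold.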