Hey Mark, thanks again for your help. We just installed a dedicated ZooKeeper cluster. I will report our experiences on this topic as soon as we have regained the courage to update to version 1.13.2 again :D
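[Editor's note: as a sketch of what such a dedicated three-node ensemble's configuration might look like, under the caveat that the zk1-zk3 hostnames, paths, and property values below are illustrative placeholders and not the cluster from this thread:]

```shell
# Sketch of a zoo.cfg for a dedicated 3-node ZooKeeper ensemble.
# Hostnames (zk1-zk3) and dataDir are placeholders; ports are the
# ZooKeeper defaults.
cat > /tmp/zoo.cfg <<'EOF'
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
clientPort=2181
server.1=zk1.contoso.com:2888:3888
server.2=zk2.contoso.com:2888:3888
server.3=zk3.contoso.com:2888:3888
EOF

# NiFi is then pointed at the external ensemble (instead of the
# embedded ZK) via nifi.properties, e.g.:
#   nifi.state.management.embedded.zookeeper.start=false
#   nifi.zookeeper.connect.string=zk1.contoso.com:2181,zk2.contoso.com:2181,zk3.contoso.com:2181
cat /tmp/zoo.cfg
```

Each ZooKeeper host additionally needs a myid file in its dataDir containing that server's number (1, 2, or 3).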
--- Original Message ---
From: Mark Payne <[email protected]>
Date: 23.08.2021 15:50:35
To: "[email protected]" <[email protected]>, Axel Schwarz <[email protected]>
Subject: Re: Disfunctional cluster with version 1.13.2

> Hey Axel,
>
> Yes - it is definitely advisable to run ZooKeeper external to NiFi, on another
> server/VM. In fact, I would not recommend running any other service co-located
> with NiFi. NiFi is really designed to be quite greedy and make use of whatever
> hardware you allow it to. This can be a problem specifically with ZooKeeper.
> If a NiFi node gets really busy (due to a sudden burst of traffic, for example),
> it can easily starve ZooKeeper of threads. This can happen due to
> over-utilization of the CPU, or it can happen due to a lot of garbage
> collection, etc. Typically this is a very temporary thing, and NiFi happily
> recovers once the load has been worked down.
>
> But when this happens, if using an embedded ZK, all of the nodes in the NiFi
> cluster would start having trouble communicating with ZooKeeper. As a result,
> they will all start disconnecting and reconnecting. This, in turn, puts greater
> stress on the node that is already falling behind. And so begins a cycle
> that can lead to cluster instability. Running an external ZooKeeper on the
> same node is better than embedded, because at least it doesn't share the same
> JVM / heap / garbage collection, etc. But contention for disks & CPU
> can still cause the same issues. So running it elsewhere is definitely
> recommended.
>
> In terms of how many ZooKeeper servers to use - it really depends on what
> you need. A single ZooKeeper instance is enough to get up & running but
> becomes a single point of failure, so typically 3 nodes are recommended for
> ZooKeeper (though the nodes don't need to be particularly powerful). I don't
> think adding any additional nodes provides any benefit, beyond making it
> more resilient if additional ZK nodes fail.
>
> Thanks
> -Mark
>
> On Aug 23, 2021, at 4:30 AM, Axel Schwarz <[email protected]> wrote:
>
> > Hey Mark,
> >
> > thanks for your answer. We always had an external ZooKeeper running,
> > but on the same hosts as the NiFi instances. But while researching
> > the load-balancing problem in version 1.13.2, we decided to use the embedded
> > ZooKeeper, because the documentation gives no hint at all that an external
> > ZooKeeper instance should be used in a production environment. So we just
> > thought we should get rid of that possible source of error, like
> > incompatible versions etc.
> >
> > So just to be clear: is it officially advisable to run an external
> > ZooKeeper instance on extra servers? If so, which version should we use?
> > How many servers should a ZooKeeper cluster have to manage 3-5 NiFi hosts?
> >
> > Gotta say, I really appreciate all the work this community puts into
> > helping others. I hope I can give something back some time.
> >
> > --- Original Message ---
> > From: Mark Payne <[email protected]>
> > Date: 17.08.2021 16:13:29
> > To: "[email protected]" <[email protected]>, Axel Schwarz <[email protected]>
> > Subject: Re: Disfunctional cluster with version 1.13.2
> >
> >> Axel,
> >>
> >> So a few things to consider here.
> >>
> >> The NullPointer that you're showing there in the logs was fixed in 1.14.0
> >> [1].
> >>
> >> Secondly, there is the question of why the ZooKeeper client is
> >> disconnecting. Given that you're running an embedded ZooKeeper server, it
> >> could be for any number of reasons - low Java heap, too high CPU usage,
> >> over-utilized disks, etc. The embedded ZooKeeper is a convenience that
> >> should be used for testing & playing around locally but should never be
> >> used for production use - you should always depend on an external
> >> ZooKeeper for any production use case, running on a different machine
> >> (or set of machines).
> >> I suspect moving to an external ZooKeeper alone will be enough to help
> >> significantly here.
> >>
> >> But if you run into issues after moving to an external ZooKeeper, I would
> >> recommend running "bin/nifi.sh diagnostics diag1.txt" when you're having
> >> trouble. This will gather all sorts of information that you should be able
> >> to share, including garbage collection information, how many times the
> >> cluster coordinator is changing, etc., that can be helpful in diagnosing
> >> cluster-related issues.
> >>
> >> Thanks
> >> -Mark
> >>
> >> [1] https://issues.apache.org/jira/browse/NIFI-8394
> >>
> >>> On Aug 17, 2021, at 1:18 AM, Axel Schwarz <[email protected]> wrote:
> >>>
> >>> Hey Pierre,
> >>>
> >>> we're currently using the embedded ZK. So as far as I know this is
> >>> version 3.5.6.
> >>>
> >>> --- Original Message ---
> >>> From: Pierre Villard <[email protected]>
> >>> Date: 16.08.2021 16:59:22
> >>> To: [email protected], Axel Schwarz <[email protected]>
> >>> Subject: Re: Disfunctional cluster with version 1.13.2
> >>>
> >>>> Hi,
> >>>>
> >>>> What's the version of ZK?
> >>>>
> >>>> Thanks,
> >>>> Pierre
> >>>>
> >>>> On Thu, 12 Aug 2021 at 09:55, Axel Schwarz <[email protected]> wrote:
> >>>>
> >>>>> Dear all,
> >>>>>
> >>>>> after successfully battling the load balancing and installing version
> >>>>> 1.13.2 again in our 3-node production environment, we experienced
> >>>>> another failure in the cluster, resulting in a complete cut-off of the
> >>>>> flow just 1.5 h after the update.
> >>>>> We noticed it just by trying to access the web interface, which
> >>>>> immediately after login showed something like:
> >>>>>
> >>>>> "Cannot replicate request to Node nifiHost1.contoso.com:8443 because the
> >>>>> node is not connected"
> >>>>>
> >>>>> There was nothing we could do through the web interface aside from
> >>>>> staring at this message, and when looking at the live logs there was
> >>>>> nothing suspicious. The log moved on as if nothing had happened.
> >>>>> After a restart of the cluster everything was working fine again, but we
> >>>>> saw that the entire flow hadn't been working for some period of time.
> >>>>> This alone is really uncool, as we run a cluster for exactly that
> >>>>> reason: the flow should keep working even if some node decides to
> >>>>> malfunction for whatever reason.
> >>>>>
> >>>>> Digging a little deeper into the logs showed two noticeable problems:
> >>>>>
> >>>>> 1. ZooKeeper is restarting every few minutes.
> >>>>> In the log, this always looks like the following:
> >>>>>
> >>>>> ------------------------------------(nifiHost1.contoso.com) nifi-app.log------------------------------------------------------------------------------------
> >>>>>
> >>>>> 2021-08-11 12:02:39,187 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> 2021-08-11 12:02:39,187 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer shutting down
> >>>>> 2021-08-11 12:02:39,194 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.z.server.FinalRequestProcessor shutdown of request processor complete
> >>>>> 2021-08-11 12:02:39,196 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.z.server.SyncRequestProcessor Shutting down
> >>>>> 2021-08-11 12:02:39,196 INFO [SyncThread:1] o.a.z.server.SyncRequestProcessor SyncRequestProcessor exited!
> >>>>> 2021-08-11 12:02:39,199 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer minSessionTimeout set to 4000
> >>>>> 2021-08-11 12:02:39,200 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer maxSessionTimeout set to 40000
> >>>>> 2021-08-11 12:02:39,200 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Created server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000 datadir /opt/nifi/logs/zookeeper/version-2 snapdir /opt/nifi/state/zookeeper/version-2
> >>>>> 2021-08-11 12:02:39,276 ERROR [Timer-Driven Process Thread-27-EventThread] org.apache.zookeeper.ClientCnxn Error while calling watcher
> >>>>> java.lang.NullPointerException: null
> >>>>>         at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:535)
> >>>>>         at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
> >>>>> 2021-08-11 12:02:39,277 INFO [main-EventThread] o.a.c.f.state.ConnectionStateManager State change: SUSPENDED
> >>>>> 2021-08-11 12:02:40,001 INFO [main-EventThread] o.a.c.f.state.ConnectionStateManager State change: RECONNECTED
> >>>>> 2021-08-11 12:02:40,006 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@1d37e350 Connection State changed to SUSPENDED
> >>>>> 2021-08-11 12:02:40,006 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@1908642e Connection State changed to SUSPENDED
> >>>>> 2021-08-11 12:02:40,006 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@1d37e350 Connection State changed to RECONNECTED
> >>>>> 2021-08-11 12:02:40,006 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@1908642e Connection State changed to RECONNECTED
> >>>>> 2021-08-11 12:02:40,014 INFO [main-EventThread] o.a.c.framework.imps.EnsembleTracker New config event received: {}
> >>>>> 2021-08-11 12:02:40,127 INFO [Clustering Tasks Thread-2] o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2021-08-11 12:02:39,176 and sent to nifiHost3.contoso.com:54321 at 2021-08-11 12:02:40,127; send took 951 millis
> >>>>> 2021-08-11 12:02:40,186 ERROR [Timer-Driven Process Thread-27-EventThread] org.apache.zookeeper.ClientCnxn Error while calling watcher
> >>>>> java.lang.NullPointerException: null
> >>>>>         at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:535)
> >>>>>         at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
> >>>>>
> >>>>> ------------------------------------------------------------------------------------------------------------------------------------------------------------
> >>>>>
> >>>>> This happens, as mentioned, every few minutes and looks like this when
> >>>>> condensed through grep:
> >>>>> ------------------------------------(nifiHost1.contoso.com) nifi-app.log | grep "ZooKeeperServer Shutting down"---------------------------------------------
> >>>>>
> >>>>> ./nifi-app.log:4237:2021-08-11 12:02:39,187 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> ./nifi-app.log:4378:2021-08-11 12:03:22,084 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> ./nifi-app.log:4447:2021-08-11 12:03:57,639 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> ./nifi-app.log:4563:2021-08-11 12:04:18,921 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> ./nifi-app.log:4629:2021-08-11 12:04:33,199 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> ./nifi-app.log:4650:2021-08-11 12:05:03,939 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> ./nifi-app.log:4801:2021-08-11 12:05:39,348 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> ./nifi-app.log:5327:2021-08-11 12:06:56,736 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> ./nifi-app.log:7250:2021-08-11 12:21:41,464 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> ./nifi-app.log:7703:2021-08-11 12:24:54,244 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> ./nifi-app.log:8148:2021-08-11 12:27:21,093 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> ./nifi-app.log:8190:2021-08-11 12:27:37,820 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> ./nifi-app.log:8350:2021-08-11 12:28:37,417 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> ./nifi-app.log:8797:2021-08-11 12:33:25,184 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> ./nifi-app.log:8813:2021-08-11 12:33:39,672 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> ./nifi-app.log:8971:2021-08-11 12:34:22,003 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> ./nifi-app.log:8984:2021-08-11 12:34:36,511 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> ./nifi-app.log:9349:2021-08-11 12:36:29,802 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> ./nifi-app.log:9376:2021-08-11 12:36:41,541 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> ./nifi-app.log:9491:2021-08-11 12:37:12,390 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> ./nifi-app.log:9504:2021-08-11 12:37:25,888 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> ./nifi-app.log:9676:2021-08-11 12:39:22,775 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> ./nifi-app.log:9842:2021-08-11 12:40:49,396 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> ./nifi-app.log:10251:2021-08-11 12:43:38,807 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> ./nifi-app.log:10397:2021-08-11 12:44:36,326 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> ./nifi-app.log:11405:2021-08-11 12:52:26,384 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> ./nifi-app.log:11629:2021-08-11 12:53:33,054 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> ./nifi-app.log:11827:2021-08-11 12:54:43,783 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> ./nifi-app.log:11990:2021-08-11 12:56:23,389 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> ./nifi-app.log:12085:2021-08-11 12:57:21,410 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> ./nifi-app.log:12823:2021-08-11 13:01:23,674 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> ./nifi-app.log:13087:2021-08-11 13:04:44,524 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> ./nifi-app.log:14610:2021-08-11 13:16:15,432 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> ./nifi-app.log:14635:2021-08-11 13:16:37,724 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> ./nifi-app.log:14751:2021-08-11 13:17:38,169 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> ./nifi-app.log:15136:2021-08-11 13:17:59,547 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> ./nifi-app.log:15203:2021-08-11 13:18:30,695 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> ./nifi-app.log:15516:2021-08-11 13:21:43,631 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> ./nifi-app.log:15743:2021-08-11 13:22:49,630 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> ./nifi-app.log:15950:2021-08-11 13:23:50,574 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> ./nifi-app.log:16035:2021-08-11 13:24:30,913 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> ./nifi-app.log:16396:2021-08-11 13:28:01,873 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> ./nifi-app.log:16470:2021-08-11 13:29:12,550 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> ./nifi-app.log:16541:2021-08-11 13:29:38,490 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> ./nifi-app.log:16568:2021-08-11 13:29:53,554 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> ./nifi-app.log:16678:2021-08-11 13:31:13,437 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> ./nifi-app.log:16705:2021-08-11 13:31:28,195 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>>
> >>>>> ------------------------------------------------------------------------------------------------------------------------------------------------------------
> >>>>>
> >>>>> So my first question is: is this behaviour of ZooKeeper normal? Well, it
> >>>>> is just an INFO, but I would love to hear that this is a completely
> >>>>> normal phenomenon. :D
> >>>>>
> >>>>> 2.
> >>>>> After all these ZooKeeper restarts, NiFi finally throws the following
> >>>>> error along with the "last" ZooKeeper restart:
> >>>>>
> >>>>> ------------------------------------(nifiHost1.contoso.com) nifi-app.log------------------------------------------------------------------------------------
> >>>>>
> >>>>> 2021-08-11 13:31:13,083 INFO [main-EventThread] o.a.c.f.state.ConnectionStateManager State change: SUSPENDED
> >>>>> 2021-08-11 13:31:13,437 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> 2021-08-11 13:31:13,437 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer shutting down
> >>>>> 2021-08-11 13:31:13,439 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.z.server.FinalRequestProcessor shutdown of request processor complete
> >>>>> 2021-08-11 13:31:13,440 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.z.server.SyncRequestProcessor Shutting down
> >>>>> 2021-08-11 13:31:13,440 INFO [SyncThread:1] o.a.z.server.SyncRequestProcessor SyncRequestProcessor exited!
> >>>>> 2021-08-11 13:31:13,450 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer minSessionTimeout set to 4000
> >>>>> 2021-08-11 13:31:13,450 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer maxSessionTimeout set to 40000
> >>>>> 2021-08-11 13:31:13,450 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Created server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000 datadir /opt/nifi/logs/zookeeper/version-2 snapdir /opt/nifi/state/zookeeper/version-2
> >>>>> 2021-08-11 13:31:13,462 ERROR [Load-Balance Server Thread-5] o.a.n.c.q.c.s.ConnectionLoadBalanceServer Failed to communicate with Peer nifiHost2.contoso.com/192.168.1.11:19500
> >>>>> java.io.EOFException: Expected to receive a Transaction Completion Indicator from Peer nifiHost2.contoso.com but encountered EOF
> >>>>>         at org.apache.nifi.controller.queue.clustered.server.StandardLoadBalanceProtocol.completeTransaction(StandardLoadBalanceProtocol.java:299)
> >>>>>         at org.apache.nifi.controller.queue.clustered.server.StandardLoadBalanceProtocol.receiveFlowFiles(StandardLoadBalanceProtocol.java:275)
> >>>>>         at org.apache.nifi.controller.queue.clustered.server.StandardLoadBalanceProtocol.receiveFlowFiles(StandardLoadBalanceProtocol.java:133)
> >>>>>         at org.apache.nifi.controller.queue.clustered.server.ConnectionLoadBalanceServer$CommunicateAction.run(ConnectionLoadBalanceServer.java:162)
> >>>>>         at java.lang.Thread.run(Thread.java:748)
> >>>>> 2021-08-11 13:31:13,464 ERROR [Load-Balance Server Thread-6] o.a.n.c.q.c.s.ConnectionLoadBalanceServer Failed to communicate with Peer nifiHost2.contoso.com/192.168.1.11:19502
> >>>>> java.io.EOFException: Expected to receive a Transaction Completion Indicator from Peer nifiHost2.contoso.com but encountered EOF
> >>>>>         at org.apache.nifi.controller.queue.clustered.server.StandardLoadBalanceProtocol.completeTransaction(StandardLoadBalanceProtocol.java:299)
> >>>>>         at org.apache.nifi.controller.queue.clustered.server.StandardLoadBalanceProtocol.receiveFlowFiles(StandardLoadBalanceProtocol.java:275)
> >>>>>         at org.apache.nifi.controller.queue.clustered.server.StandardLoadBalanceProtocol.receiveFlowFiles(StandardLoadBalanceProtocol.java:133)
> >>>>>         at org.apache.nifi.controller.queue.clustered.server.ConnectionLoadBalanceServer$CommunicateAction.run(ConnectionLoadBalanceServer.java:162)
> >>>>>         at java.lang.Thread.run(Thread.java:748)
> >>>>> 2021-08-11 13:31:13,589 INFO [Process Cluster Protocol Request-1] o.a.n.c.c.node.NodeClusterCoordinator Status of nifiHost1.contoso.com:8443 changed from NodeConnectionStatus[nodeId=nifiHost1.contoso.com:8443, state=CONNECTED, updateId=16] to NodeConnectionStatus[nodeId=nifiHost1.contoso.com:8443, state=CONNECTING, updateId=16]
> >>>>> 2021-08-11 13:31:13,793 INFO [Process Cluster Protocol Request-7] o.a.n.c.p.impl.SocketProtocolListener Finished processing request 88980d85-b956-4713-9686-50d76a407871 (type=RECONNECTION_REQUEST, length=2209040 bytes) from nifiHost1.contoso.com:8443 in 142 millis
> >>>>> 2021-08-11 13:31:13,794 INFO [Reconnect to Cluster] o.a.nifi.controller.StandardFlowService Processing reconnection request from cluster coordinator.
> >>>>> 2021-08-11 13:31:13,794 INFO [Reconnect to Cluster] o.a.n.c.l.e.CuratorLeaderElectionManager Election Listener for Role Primary Node disabled
> >>>>> 2021-08-11 13:31:13,802 INFO [main-EventThread] o.a.c.f.state.ConnectionStateManager State change: RECONNECTED
> >>>>> 2021-08-11 13:31:28,195 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> 2021-08-11 13:31:28,196 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer shutting down
> >>>>> 2021-08-11 13:31:28,198 INFO [pool-22-thread-1] o.a.n.c.r.WriteAheadFlowFileRepository Initiating checkpoint of FlowFile Repository
> >>>>> 2021-08-11 13:31:28,202 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.z.server.FinalRequestProcessor shutdown of request processor complete
> >>>>> 2021-08-11 13:31:28,203 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.z.server.SyncRequestProcessor Shutting down
> >>>>> 2021-08-11 13:31:28,203 INFO [SyncThread:1] o.a.z.server.SyncRequestProcessor SyncRequestProcessor exited!
> >>>>> 2021-08-11 13:31:28,204 INFO [Timer-Driven Process Thread-21] o.a.n.p.store.WriteAheadStorePartition Successfully rolled over Event Writer for Provenance Event Store Partition[directory=/opt/pr] due to MAX_TIME_REACHED. Event File was 6.7 MB and contained 1626 events.
> >>>>> 2021-08-11 13:31:28,208 INFO [pool-22-thread-1] o.a.n.wali.SequentialAccessWriteAheadLog Checkpointed Write-Ahead Log with 10 Records and 0 Swap Files in 8 milliseconds (Stop-the-world time = 1 milliseconds), max Transaction ID 27772177643
> >>>>> 2021-08-11 13:31:28,208 INFO [pool-22-thread-1] o.a.n.c.r.WriteAheadFlowFileRepository Successfully checkpointed FlowFile Repository with 10 records in 8 milliseconds
> >>>>> 2021-08-11 13:31:28,295 INFO [main-EventThread] o.a.c.f.state.ConnectionStateManager State change: SUSPENDED
> >>>>> 2021-08-11 13:31:29,808 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer minSessionTimeout set to 4000
> >>>>> 2021-08-11 13:31:29,809 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer maxSessionTimeout set to 40000
> >>>>> 2021-08-11 13:31:29,809 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Created server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000 datadir /opt/nifi/logs/zookeeper/version-2 snapdir /opt/nifi/state/zookeeper/version-2
> >>>>> 2021-08-11 13:31:32,291 INFO [main-EventThread] o.a.c.f.state.ConnectionStateManager State change: RECONNECTED
> >>>>>
> >>>>> ------------------------------------------------------------------------------------------------------------------------------------------------------------
> >>>>>
> >>>>> After that, the log continues with normal messages about heartbeats and
> >>>>> provenance events, but no more errors occur. That was exactly the time
> >>>>> the whole flow cut off.
> >>>>>
> >>>>> I've got to add that we've never experienced that flow cut-off behaviour
> >>>>> with version 1.12.1, but the ZooKeeper restarts occur there as well, and
> >>>>> as often as seen above.
> >>>>> But when first updating to version 1.13.2 back in June, we experienced
> >>>>> this flow cut-off multiple times, besides the non-working load
> >>>>> balancing. Back then I thought the two were connected, but now we had
> >>>>> fully functional load balancing and the cut-off happened nevertheless.
> >>>>>
> >>>>> I'm thankful for any hint. :)
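[Editor's note: the "condensed through grep" view in the quoted logs above is just plain grep -n. A small self-contained sketch; the two sample lines below are stand-ins for the real nifi-app.log, which is not reproduced here:]

```shell
# Recreate the condensed view of repeated ZooKeeperServer restarts.
# The sample file stands in for the real nifi-app.log.
cat > /tmp/nifi-app-sample.log <<'EOF'
2021-08-11 12:02:39,187 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
2021-08-11 12:02:40,127 INFO [Clustering Tasks Thread-2] o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2021-08-11 12:02:39,176
2021-08-11 12:03:22,084 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
EOF

# -n prefixes each match with its line number, producing the
# ./nifi-app.log:4237:... style shown in the thread.
grep -n "ZooKeeperServer Shutting down" /tmp/nifi-app-sample.log
```

Adding -c instead of -n gives just the restart count, which is handy for comparing nodes.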

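[Editor's note on Mark's sizing advice: ZooKeeper stays writable only while a strict majority of the ensemble (a quorum) is up, so an n-node ensemble tolerates floor((n-1)/2) failed nodes. That is why 3 nodes are the usual recommendation and extra nodes only buy additional failure tolerance. The arithmetic, as a quick sketch:]

```shell
# Majority-quorum arithmetic behind the "use 3 (or 5) ZK nodes" advice:
# quorum = floor(n/2) + 1, tolerated failures = n - quorum.
for n in 1 3 5; do
  quorum=$(( n / 2 + 1 ))
  echo "$n node(s): quorum=$quorum, tolerates $(( n - quorum )) failure(s)"
done
```

Note that even-sized ensembles add nothing: 4 nodes still need a quorum of 3 and so still tolerate only 1 failure, same as 3 nodes.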