Hey Mark, thanks again for your help. We just installed a dedicated ZooKeeper cluster. I will report our experiences on this topic as soon as we have regained the courage to update to version 1.13.2 again :D
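[Editor's note: as a sketch of what such a dedicated three-node ensemble's configuration might look like, under the caveat that the zk1-zk3 hostnames, paths, and property values below are illustrative placeholders and not the cluster from this thread:]

```shell
# Sketch of a zoo.cfg for a dedicated 3-node ZooKeeper ensemble.
# Hostnames (zk1-zk3) and dataDir are placeholders; ports are the
# ZooKeeper defaults.
cat > /tmp/zoo.cfg <<'EOF'
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
clientPort=2181
server.1=zk1.contoso.com:2888:3888
server.2=zk2.contoso.com:2888:3888
server.3=zk3.contoso.com:2888:3888
EOF

# NiFi is then pointed at the external ensemble (instead of the
# embedded ZK) via nifi.properties, e.g.:
#   nifi.state.management.embedded.zookeeper.start=false
#   nifi.zookeeper.connect.string=zk1.contoso.com:2181,zk2.contoso.com:2181,zk3.contoso.com:2181
cat /tmp/zoo.cfg
```

Each ZooKeeper host additionally needs a myid file in its dataDir containing that server's number (1, 2, or 3).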
--- Original Message ---
From: Mark Payne <[email protected]>
Date: 23.08.2021 15:50:35
To: "[email protected]" <[email protected]>, Axel Schwarz <[email protected]>
Subject: Re: Disfunctional cluster with version 1.13.2

> Hey Axel,
>
> Yes - it is definitely advisable to run ZooKeeper external to NiFi, on another
> server/VM. In fact, I would not recommend running any other service co-located
> with NiFi. NiFi is really designed to be quite greedy and make use of whatever
> hardware you allow it to. This can be a problem specifically with ZooKeeper.
> If a NiFi node gets really busy (due to a sudden burst of traffic, for example),
> it can easily starve ZooKeeper of threads. This can happen due to
> over-utilization of the CPU, or it can happen due to a lot of garbage
> collection, etc. Typically this is a very temporary thing, and NiFi happily
> recovers once the load has been worked down.
>
> But when this happens, if using an embedded ZK, all of the nodes in the NiFi
> cluster would start having trouble communicating with ZooKeeper. As a result,
> they will all start disconnecting and reconnecting. This, in turn, puts greater
> stress on the node that is already falling behind. And so begins a cycle
> that can lead to cluster instability. Running an external ZooKeeper on the
> same node is better than embedded, because at least it doesn't share the same
> JVM / heap / garbage collection, etc. But contention for disks & CPU
> can still cause the same issues. So running it elsewhere is definitely
> recommended.
>
> In terms of how many ZooKeeper servers to use - it really depends on what
> you need. A single ZooKeeper instance is enough to get up & running but
> becomes a single point of failure, so typically 3 nodes are recommended for
> ZooKeeper (though the nodes don't need to be particularly powerful). I don't
> think adding any additional nodes provides any benefit, beyond making it
> more resilient if additional ZK nodes fail.
>
> Thanks
> -Mark
>
> On Aug 23, 2021, at 4:30 AM, Axel Schwarz <[email protected]> wrote:
>
> > Hey Mark,
> >
> > thanks for your answer. We always had an external ZooKeeper running,
> > but on the same hosts as the NiFi instances. But while researching
> > the load-balancing problem in version 1.13.2, we decided to use the embedded
> > ZooKeeper, because the documentation gives no hint at all that an external
> > ZooKeeper instance should be used in a production environment. So we just
> > thought we should get rid of that possible source of error, like
> > incompatible versions etc.
> >
> > So just to be clear: is it officially advisable to run an external
> > ZooKeeper instance on extra servers? If so, which version should we use?
> > How many servers should a ZooKeeper cluster have to manage 3-5 NiFi hosts?
> >
> > Gotta say, I really appreciate all the work this community puts into
> > helping others. I hope I can give something back some time.
> >
> > --- Original Message ---
> > From: Mark Payne <[email protected]>
> > Date: 17.08.2021 16:13:29
> > To: "[email protected]" <[email protected]>, Axel Schwarz <[email protected]>
> > Subject: Re: Disfunctional cluster with version 1.13.2
> >
> >> Axel,
> >>
> >> So a few things to consider here.
> >>
> >> The NullPointer that you're showing there in the logs was fixed in 1.14.0
> >> [1].
> >>
> >> Secondly, there is the question of why the ZooKeeper client is
> >> disconnecting. Given that you're running an embedded ZooKeeper server, it
> >> could be for any number of reasons - low Java heap, too high CPU usage,
> >> over-utilized disks, etc. The embedded ZooKeeper is a convenience that
> >> should be used for testing & playing around locally but should never be
> >> used for production use - you should always depend on an external
> >> ZooKeeper for any production use case, running on a different machine
> >> (or set of machines).
> >> I suspect moving to an external ZooKeeper alone will be enough to help
> >> significantly here.
> >>
> >> But if you run into issues after moving to an external ZooKeeper, I would
> >> recommend running "bin/nifi.sh diagnostics diag1.txt" when you're having
> >> trouble. This will gather all sorts of information that you should be able
> >> to share, including garbage collection information, how many times the
> >> cluster coordinator is changing, etc., that can be helpful in diagnosing
> >> cluster-related issues.
> >>
> >> Thanks
> >> -Mark
> >>
> >> [1] https://issues.apache.org/jira/browse/NIFI-8394
> >>
> >>> On Aug 17, 2021, at 1:18 AM, Axel Schwarz <[email protected]> wrote:
> >>>
> >>> Hey Pierre,
> >>>
> >>> we're currently using the embedded ZK. So as far as I know this is
> >>> version 3.5.6.
> >>>
> >>> --- Original Message ---
> >>> From: Pierre Villard <[email protected]>
> >>> Date: 16.08.2021 16:59:22
> >>> To: [email protected], Axel Schwarz <[email protected]>
> >>> Subject: Re: Disfunctional cluster with version 1.13.2
> >>>
> >>>> Hi,
> >>>>
> >>>> What's the version of ZK?
> >>>>
> >>>> Thanks,
> >>>> Pierre
> >>>>
> >>>> On Thu, 12 Aug 2021 at 09:55, Axel Schwarz <[email protected]> wrote:
> >>>>
> >>>>> Dear all,
> >>>>>
> >>>>> after successfully battling the load balancing and installing version
> >>>>> 1.13.2 again in our 3-node production environment, we experienced
> >>>>> another failure in the cluster, resulting in a complete cut-off of the
> >>>>> flow just 1.5 h after the update.
> >>>>> We noticed it just by trying to access the web interface, which
> >>>>> immediately after login showed something like:
> >>>>>
> >>>>> "Cannot replicate request to Node nifiHost1.contoso.com:8443 because the
> >>>>> node is not connected"
> >>>>>
> >>>>> There was nothing we could do through the web interface aside from
> >>>>> staring at this message, and when looking at the live logs there was
> >>>>> nothing suspicious. The log moved on as if nothing had happened.
> >>>>> After a restart of the cluster everything was working fine again, but we
> >>>>> saw that the entire flow hadn't been working for some period of time.
> >>>>> This alone is really uncool, as we run a cluster for exactly that
> >>>>> reason: the flow should keep working even if some node decides to
> >>>>> malfunction for whatever reason.
> >>>>>
> >>>>> Digging a little deeper into the logs showed two noticeable problems:
> >>>>>
> >>>>> 1. ZooKeeper is restarting every few minutes.
> >>>>> In the log, this always looks like the following:
> >>>>>
> >>>>> ------------------------------------(nifiHost1.contoso.com) nifi-app.log------------------------------------------------------------------------------------
> >>>>>
> >>>>> 2021-08-11 12:02:39,187 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> 2021-08-11 12:02:39,187 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer shutting down
> >>>>> 2021-08-11 12:02:39,194 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.z.server.FinalRequestProcessor shutdown of request processor complete
> >>>>> 2021-08-11 12:02:39,196 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.z.server.SyncRequestProcessor Shutting down
> >>>>> 2021-08-11 12:02:39,196 INFO [SyncThread:1] o.a.z.server.SyncRequestProcessor SyncRequestProcessor exited!
> >>>>> 2021-08-11 12:02:39,199 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer minSessionTimeout set to 4000
> >>>>> 2021-08-11 12:02:39,200 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer maxSessionTimeout set to 40000
> >>>>> 2021-08-11 12:02:39,200 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Created server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000 datadir /opt/nifi/logs/zookeeper/version-2 snapdir /opt/nifi/state/zookeeper/version-2
> >>>>> 2021-08-11 12:02:39,276 ERROR [Timer-Driven Process Thread-27-EventThread] org.apache.zookeeper.ClientCnxn Error while calling watcher
> >>>>> java.lang.NullPointerException: null
> >>>>>         at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:535)
> >>>>>         at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
> >>>>> 2021-08-11 12:02:39,277 INFO [main-EventThread] o.a.c.f.state.ConnectionStateManager State change: SUSPENDED
> >>>>> 2021-08-11 12:02:40,001 INFO [main-EventThread] o.a.c.f.state.ConnectionStateManager State change: RECONNECTED
> >>>>> 2021-08-11 12:02:40,006 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@1d37e350 Connection State changed to SUSPENDED
> >>>>> 2021-08-11 12:02:40,006 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@1908642e Connection State changed to SUSPENDED
> >>>>> 2021-08-11 12:02:40,006 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@1d37e350 Connection State changed to RECONNECTED
> >>>>> 2021-08-11 12:02:40,006 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@1908642e Connection State changed to RECONNECTED
> >>>>> 2021-08-11 12:02:40,014 INFO [main-EventThread] o.a.c.framework.imps.EnsembleTracker New config event received: {}
> >>>>> 2021-08-11 12:02:40,127 INFO [Clustering Tasks Thread-2] o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2021-08-11 12:02:39,176 and sent to nifiHost3.contoso.com:54321 at 2021-08-11 12:02:40,127; send took 951 millis
> >>>>> 2021-08-11 12:02:40,186 ERROR [Timer-Driven Process Thread-27-EventThread] org.apache.zookeeper.ClientCnxn Error while calling watcher
> >>>>> java.lang.NullPointerException: null
> >>>>>         at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:535)
> >>>>>         at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
> >>>>>
> >>>>> ------------------------------------------------------------------------------------------------------------------------------------------------------------
> >>>>>
> >>>>> This happens, as mentioned, every few minutes and looks like this when
> >>>>> condensed through grep:
> >>>>> ------------------------------------(nifiHost1.contoso.com) nifi-app.log | grep "ZooKeeperServer Shutting down"---------------------------------------------
> >>>>>
> >>>>> ./nifi-app.log:4237:2021-08-11 12:02:39,187 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> ./nifi-app.log:4378:2021-08-11 12:03:22,084 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> ./nifi-app.log:4447:2021-08-11 12:03:57,639 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> ./nifi-app.log:4563:2021-08-11 12:04:18,921 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> ./nifi-app.log:4629:2021-08-11 12:04:33,199 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> ./nifi-app.log:4650:2021-08-11 12:05:03,939 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> ./nifi-app.log:4801:2021-08-11 12:05:39,348 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> ./nifi-app.log:5327:2021-08-11 12:06:56,736 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> ./nifi-app.log:7250:2021-08-11 12:21:41,464 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> ./nifi-app.log:7703:2021-08-11 12:24:54,244 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> ./nifi-app.log:8148:2021-08-11 12:27:21,093 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> ./nifi-app.log:8190:2021-08-11 12:27:37,820 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> ./nifi-app.log:8350:2021-08-11 12:28:37,417 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> ./nifi-app.log:8797:2021-08-11 12:33:25,184 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> ./nifi-app.log:8813:2021-08-11 12:33:39,672 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> ./nifi-app.log:8971:2021-08-11 12:34:22,003 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> ./nifi-app.log:8984:2021-08-11 12:34:36,511 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> ./nifi-app.log:9349:2021-08-11 12:36:29,802 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> ./nifi-app.log:9376:2021-08-11 12:36:41,541 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> ./nifi-app.log:9491:2021-08-11 12:37:12,390 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> ./nifi-app.log:9504:2021-08-11 12:37:25,888 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> ./nifi-app.log:9676:2021-08-11 12:39:22,775 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> ./nifi-app.log:9842:2021-08-11 12:40:49,396 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> ./nifi-app.log:10251:2021-08-11 12:43:38,807 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> ./nifi-app.log:10397:2021-08-11 12:44:36,326 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> ./nifi-app.log:11405:2021-08-11 12:52:26,384 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> ./nifi-app.log:11629:2021-08-11 12:53:33,054 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> ./nifi-app.log:11827:2021-08-11 12:54:43,783 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> ./nifi-app.log:11990:2021-08-11 12:56:23,389 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> ./nifi-app.log:12085:2021-08-11 12:57:21,410 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> ./nifi-app.log:12823:2021-08-11 13:01:23,674 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> ./nifi-app.log:13087:2021-08-11 13:04:44,524 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> ./nifi-app.log:14610:2021-08-11 13:16:15,432 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> ./nifi-app.log:14635:2021-08-11 13:16:37,724 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> ./nifi-app.log:14751:2021-08-11 13:17:38,169 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> ./nifi-app.log:15136:2021-08-11 13:17:59,547 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> ./nifi-app.log:15203:2021-08-11 13:18:30,695 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> ./nifi-app.log:15516:2021-08-11 13:21:43,631 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> ./nifi-app.log:15743:2021-08-11 13:22:49,630 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> ./nifi-app.log:15950:2021-08-11 13:23:50,574 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> ./nifi-app.log:16035:2021-08-11 13:24:30,913 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> ./nifi-app.log:16396:2021-08-11 13:28:01,873 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> ./nifi-app.log:16470:2021-08-11 13:29:12,550 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> ./nifi-app.log:16541:2021-08-11 13:29:38,490 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> ./nifi-app.log:16568:2021-08-11 13:29:53,554 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> ./nifi-app.log:16678:2021-08-11 13:31:13,437 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> ./nifi-app.log:16705:2021-08-11 13:31:28,195 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>>
> >>>>> ------------------------------------------------------------------------------------------------------------------------------------------------------------
> >>>>>
> >>>>> So my first question is: is this behaviour of ZooKeeper normal? Well, it
> >>>>> is just an INFO, but I would love to hear that this is a completely
> >>>>> normal phenomenon. :D
> >>>>>
> >>>>> 2.
> >>>>> After all these ZooKeeper restarts, NiFi finally throws the following
> >>>>> error along with the "last" ZooKeeper restart:
> >>>>>
> >>>>> ------------------------------------(nifiHost1.contoso.com) nifi-app.log------------------------------------------------------------------------------------
> >>>>>
> >>>>> 2021-08-11 13:31:13,083 INFO [main-EventThread] o.a.c.f.state.ConnectionStateManager State change: SUSPENDED
> >>>>> 2021-08-11 13:31:13,437 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> 2021-08-11 13:31:13,437 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer shutting down
> >>>>> 2021-08-11 13:31:13,439 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.z.server.FinalRequestProcessor shutdown of request processor complete
> >>>>> 2021-08-11 13:31:13,440 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.z.server.SyncRequestProcessor Shutting down
> >>>>> 2021-08-11 13:31:13,440 INFO [SyncThread:1] o.a.z.server.SyncRequestProcessor SyncRequestProcessor exited!
> >>>>> 2021-08-11 13:31:13,450 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer minSessionTimeout set to 4000
> >>>>> 2021-08-11 13:31:13,450 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer maxSessionTimeout set to 40000
> >>>>> 2021-08-11 13:31:13,450 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Created server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000 datadir /opt/nifi/logs/zookeeper/version-2 snapdir /opt/nifi/state/zookeeper/version-2
> >>>>> 2021-08-11 13:31:13,462 ERROR [Load-Balance Server Thread-5] o.a.n.c.q.c.s.ConnectionLoadBalanceServer Failed to communicate with Peer nifiHost2.contoso.com/192.168.1.11:19500
> >>>>> java.io.EOFException: Expected to receive a Transaction Completion Indicator from Peer nifiHost2.contoso.com but encountered EOF
> >>>>>         at org.apache.nifi.controller.queue.clustered.server.StandardLoadBalanceProtocol.completeTransaction(StandardLoadBalanceProtocol.java:299)
> >>>>>         at org.apache.nifi.controller.queue.clustered.server.StandardLoadBalanceProtocol.receiveFlowFiles(StandardLoadBalanceProtocol.java:275)
> >>>>>         at org.apache.nifi.controller.queue.clustered.server.StandardLoadBalanceProtocol.receiveFlowFiles(StandardLoadBalanceProtocol.java:133)
> >>>>>         at org.apache.nifi.controller.queue.clustered.server.ConnectionLoadBalanceServer$CommunicateAction.run(ConnectionLoadBalanceServer.java:162)
> >>>>>         at java.lang.Thread.run(Thread.java:748)
> >>>>> 2021-08-11 13:31:13,464 ERROR [Load-Balance Server Thread-6] o.a.n.c.q.c.s.ConnectionLoadBalanceServer Failed to communicate with Peer nifiHost2.contoso.com/192.168.1.11:19502
> >>>>> java.io.EOFException: Expected to receive a Transaction Completion Indicator from Peer nifiHost2.contoso.com but encountered EOF
> >>>>>         at org.apache.nifi.controller.queue.clustered.server.StandardLoadBalanceProtocol.completeTransaction(StandardLoadBalanceProtocol.java:299)
> >>>>>         at org.apache.nifi.controller.queue.clustered.server.StandardLoadBalanceProtocol.receiveFlowFiles(StandardLoadBalanceProtocol.java:275)
> >>>>>         at org.apache.nifi.controller.queue.clustered.server.StandardLoadBalanceProtocol.receiveFlowFiles(StandardLoadBalanceProtocol.java:133)
> >>>>>         at org.apache.nifi.controller.queue.clustered.server.ConnectionLoadBalanceServer$CommunicateAction.run(ConnectionLoadBalanceServer.java:162)
> >>>>>         at java.lang.Thread.run(Thread.java:748)
> >>>>> 2021-08-11 13:31:13,589 INFO [Process Cluster Protocol Request-1] o.a.n.c.c.node.NodeClusterCoordinator Status of nifiHost1.contoso.com:8443 changed from NodeConnectionStatus[nodeId=nifiHost1.contoso.com:8443, state=CONNECTED, updateId=16] to NodeConnectionStatus[nodeId=nifiHost1.contoso.com:8443, state=CONNECTING, updateId=16]
> >>>>> 2021-08-11 13:31:13,793 INFO [Process Cluster Protocol Request-7] o.a.n.c.p.impl.SocketProtocolListener Finished processing request 88980d85-b956-4713-9686-50d76a407871 (type=RECONNECTION_REQUEST, length=2209040 bytes) from nifiHost1.contoso.com:8443 in 142 millis
> >>>>> 2021-08-11 13:31:13,794 INFO [Reconnect to Cluster] o.a.nifi.controller.StandardFlowService Processing reconnection request from cluster coordinator.
> >>>>> 2021-08-11 13:31:13,794 INFO [Reconnect to Cluster] o.a.n.c.l.e.CuratorLeaderElectionManager Election Listener for Role Primary Node disabled
> >>>>> 2021-08-11 13:31:13,802 INFO [main-EventThread] o.a.c.f.state.ConnectionStateManager State change: RECONNECTED
> >>>>> 2021-08-11 13:31:28,195 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
> >>>>> 2021-08-11 13:31:28,196 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer shutting down
> >>>>> 2021-08-11 13:31:28,198 INFO [pool-22-thread-1] o.a.n.c.r.WriteAheadFlowFileRepository Initiating checkpoint of FlowFile Repository
> >>>>> 2021-08-11 13:31:28,202 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.z.server.FinalRequestProcessor shutdown of request processor complete
> >>>>> 2021-08-11 13:31:28,203 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.z.server.SyncRequestProcessor Shutting down
> >>>>> 2021-08-11 13:31:28,203 INFO [SyncThread:1] o.a.z.server.SyncRequestProcessor SyncRequestProcessor exited!
> >>>>> 2021-08-11 13:31:28,204 INFO [Timer-Driven Process Thread-21] o.a.n.p.store.WriteAheadStorePartition Successfully rolled over Event Writer for Provenance Event Store Partition[directory=/opt/pr] due to MAX_TIME_REACHED. Event File was 6.7 MB and contained 1626 events.
> >>>>> 2021-08-11 13:31:28,208 INFO [pool-22-thread-1] o.a.n.wali.SequentialAccessWriteAheadLog Checkpointed Write-Ahead Log with 10 Records and 0 Swap Files in 8 milliseconds (Stop-the-world time = 1 milliseconds), max Transaction ID 27772177643
> >>>>> 2021-08-11 13:31:28,208 INFO [pool-22-thread-1] o.a.n.c.r.WriteAheadFlowFileRepository Successfully checkpointed FlowFile Repository with 10 records in 8 milliseconds
> >>>>> 2021-08-11 13:31:28,295 INFO [main-EventThread] o.a.c.f.state.ConnectionStateManager State change: SUSPENDED
> >>>>> 2021-08-11 13:31:29,808 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer minSessionTimeout set to 4000
> >>>>> 2021-08-11 13:31:29,809 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer maxSessionTimeout set to 40000
> >>>>> 2021-08-11 13:31:29,809 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Created server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000 datadir /opt/nifi/logs/zookeeper/version-2 snapdir /opt/nifi/state/zookeeper/version-2
> >>>>> 2021-08-11 13:31:32,291 INFO [main-EventThread] o.a.c.f.state.ConnectionStateManager State change: RECONNECTED
> >>>>>
> >>>>> ------------------------------------------------------------------------------------------------------------------------------------------------------------
> >>>>>
> >>>>> After that, the log continues with normal messages about heartbeats and
> >>>>> provenance events, but no more errors occur. That was exactly the time
> >>>>> the whole flow cut off.
> >>>>>
> >>>>> I've got to add that we've never experienced that flow cut-off behaviour
> >>>>> with version 1.12.1, but the ZooKeeper restarts occur there as well, and
> >>>>> as often as seen above.
> >>>>> But when first updating to version 1.13.2 back in June, we experienced
> >>>>> this flow cut-off multiple times, besides the non-working load
> >>>>> balancing. Back then I thought the two were connected, but now we had
> >>>>> fully functional load balancing and the cut-off happened nevertheless.
> >>>>>
> >>>>> I'm thankful for any hint. :)
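[Editor's note: the "condensed through grep" view in the quoted logs above is just plain grep -n. A small self-contained sketch; the two sample lines below are stand-ins for the real nifi-app.log, which is not reproduced here:]

```shell
# Recreate the condensed view of repeated ZooKeeperServer restarts.
# The sample file stands in for the real nifi-app.log.
cat > /tmp/nifi-app-sample.log <<'EOF'
2021-08-11 12:02:39,187 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
2021-08-11 12:02:40,127 INFO [Clustering Tasks Thread-2] o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2021-08-11 12:02:39,176
2021-08-11 12:03:22,084 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
EOF

# -n prefixes each match with its line number, producing the
# ./nifi-app.log:4237:... style shown in the thread.
grep -n "ZooKeeperServer Shutting down" /tmp/nifi-app-sample.log
```

Adding -c instead of -n gives just the restart count, which is handy for comparing nodes.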

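[Editor's note on Mark's sizing advice: ZooKeeper stays writable only while a strict majority of the ensemble (a quorum) is up, so an n-node ensemble tolerates floor((n-1)/2) failed nodes. That is why 3 nodes are the usual recommendation and extra nodes only buy additional failure tolerance. The arithmetic, as a quick sketch:]

```shell
# Majority-quorum arithmetic behind the "use 3 (or 5) ZK nodes" advice:
# quorum = floor(n/2) + 1, tolerated failures = n - quorum.
for n in 1 3 5; do
  quorum=$(( n / 2 + 1 ))
  echo "$n node(s): quorum=$quorum, tolerates $(( n - quorum )) failure(s)"
done
```

Note that even-sized ensembles add nothing: 4 nodes still need a quorum of 3 and so still tolerate only 1 failure, same as 3 nodes.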