Also, which version of ZooKeeper are you running, and which image? (I've found that some versions and images are more stable than others.)
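One way to gather the version is ZooKeeper's `srvr` four-letter command, which reports the server version and its mode (leader or follower). Here is a minimal Python sketch, assuming `srvr` is allowed by the ensemble's four-letter-word whitelist and that client port 2181 is reachable from wherever the script runs. The helper names `parse_srvr` and `query_srvr` are made up for illustration.

```python
import socket
import sys


def parse_srvr(text):
    """Pull the version and mode out of the text returned by `srvr`."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, so timestamps in the value survive.
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key = key.strip().lower()
        if key == "zookeeper version":
            info["version"] = value.strip()
        elif key == "mode":
            info["mode"] = value.strip()
    return info


def query_srvr(host, port=2181, timeout=5.0):
    """Send `srvr` to one ZooKeeper server and parse the reply.

    Assumption: the server permits the `srvr` four-letter command.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"srvr")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closes the connection after replying
                break
            chunks.append(data)
    return parse_srvr(b"".join(chunks).decode("utf-8", errors="replace"))


if __name__ == "__main__":
    # Hostnames are passed on the command line; nothing is hard-coded.
    for host in sys.argv[1:]:
        try:
            print(host, query_srvr(host))
        except OSError as exc:
            print(host, "unreachable:", exc)
```

Run it from a pod inside the cluster with the ensemble hostnames as arguments, for example the `zk-N.zk-hs.ki.svc.cluster.local` names that appear in the log below. The image name still has to come from the ZooKeeper StatefulSet spec.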
Cheers,
Chris Sampson

On Tue, 29 Sep 2020, 17:34 Sushil Kumar, <skm....@gmail.com> wrote:

> Hello Wyll
>
> It may be helpful if you can send nifi.properties.
>
> Thanks
> Sushil Kumar
>
> On Tue, Sep 29, 2020 at 7:58 AM Wyll Ingersoll <
> wyllys.ingers...@keepertech.com> wrote:
>
>>
>> I have a 3-node Nifi (1.11.4) cluster in kubernetes environment (as a
>> StatefulSet) using external zookeeper (3 nodes also) to manage state.
>>
>> Whenever even 1 node (pod/container) goes down or is restarted, it can
>> throw the whole cluster into a bad state that forces me to restart ALL of
>> the pods in order to recover. This seems wrong. The problem seems to be
>> that when the primary node goes away, the remaining 2 nodes don't ever try
>> to take over. Instead, I have to restart all of them individually until one
>> of them becomes the primary, then the other 2 eventually join and sync up.
>>
>> When one of the nodes is refusing to sync up, I often see these errors in
>> the log and the only way to get it back into the cluster is to restart it.
>> The node showing the errors below never seems to be able to rejoin or
>> resync with the other 2 nodes.
>>
>>
>> 2020-09-29 10:18:53,324 ERROR [Reconnect to Cluster]
>> o.a.nifi.controller.StandardFlowService Handling reconnection request
>> failed due to: org.apache.nifi.cluster.ConnectionException: Failed to
>> connect node to cluster due to: java.lang.NullPointerException
>>
>> org.apache.nifi.cluster.ConnectionException: Failed to connect node to
>> cluster due to: java.lang.NullPointerException
>>
>> at org.apache.nifi.controller.StandardFlowService.loadFromConnectionResponse(StandardFlowService.java:1035)
>>
>> at org.apache.nifi.controller.StandardFlowService.handleReconnectionRequest(StandardFlowService.java:668)
>>
>> at org.apache.nifi.controller.StandardFlowService.access$200(StandardFlowService.java:109)
>>
>> at org.apache.nifi.controller.StandardFlowService$1.run(StandardFlowService.java:415)
>>
>> at java.lang.Thread.run(Thread.java:748)
>>
>> Caused by: java.lang.NullPointerException: null
>>
>> at org.apache.nifi.controller.StandardFlowService.loadFromConnectionResponse(StandardFlowService.java:989)
>>
>> ... 4 common frames omitted
>>
>> 2020-09-29 10:18:53,326 INFO [Reconnect to Cluster]
>> o.a.c.f.imps.CuratorFrameworkImpl Starting
>>
>> 2020-09-29 10:18:53,327 INFO [Reconnect to Cluster]
>> org.apache.zookeeper.ClientCnxnSocket jute.maxbuffer value is 4194304 Bytes
>>
>> 2020-09-29 10:18:53,328 INFO [Reconnect to Cluster]
>> o.a.c.f.imps.CuratorFrameworkImpl Default schema
>>
>> 2020-09-29 10:18:53,807 INFO [Reconnect to Cluster-EventThread]
>> o.a.c.f.state.ConnectionStateManager State change: CONNECTED
>>
>> 2020-09-29 10:18:53,809 INFO [Reconnect to Cluster-EventThread]
>> o.a.c.framework.imps.EnsembleTracker New config event received:
>> {server.1=zk-0.zk-hs.ki.svc.cluster.local:2888:3888:participant;
>> 0.0.0.0:2181, version=0,
>> server.3=zk-2.zk-hs.ki.svc.cluster.local:2888:3888:participant;
>> 0.0.0.0:2181,
>> server.2=zk-1.zk-hs.ki.svc.cluster.local:2888:3888:participant;
>> 0.0.0.0:2181}
>>
>> 2020-09-29 10:18:53,810 INFO [Curator-Framework-0]
>> o.a.c.f.imps.CuratorFrameworkImpl backgroundOperationsLoop exiting
>>
>> 2020-09-29 10:18:53,813 INFO [Reconnect to Cluster-EventThread]
>> o.a.c.framework.imps.EnsembleTracker New config event received:
>> {server.1=zk-0.zk-hs.ki.svc.cluster.local:2888:3888:participant;
>> 0.0.0.0:2181, version=0,
>> server.3=zk-2.zk-hs.ki.svc.cluster.local:2888:3888:participant;
>> 0.0.0.0:2181,
>> server.2=zk-1.zk-hs.ki.svc.cluster.local:2888:3888:participant;
>> 0.0.0.0:2181}
>>
>> 2020-09-29 10:18:54,323 INFO [Reconnect to Cluster]
>> o.a.n.c.l.e.CuratorLeaderElectionManager Cannot unregister Leader Election
>> Role 'Primary Node' becuase that role is not registered
>>
>> 2020-09-29 10:18:54,324 INFO [Reconnect to Cluster]
>> o.a.n.c.l.e.CuratorLeaderElectionManager Cannot unregister Leader Election
>> Role 'Cluster Coordinator' becuase that role is not registered