Hi Shawn, Thanks for reporting the issue. I'll see if I can reproduce it and figure out why it's occurring/what we can do to fix it. I created a Jira issue here: https://issues.apache.org/jira/browse/NIFI-9560.
Nathan On Mon, Jan 10, 2022 at 1:24 PM Shawn Weeks <swe...@weeksconsulting.us> wrote: > In my case the IP of Zookeeper is not changing though… > > > > Thanks > > Shawn > > > > *From:* Shawn Weeks <swe...@weeksconsulting.us> > *Sent:* Monday, January 10, 2022 12:23 PM > *To:* users@nifi.apache.org > *Subject:* RE: NiFi Losing Zookeeper Connection > > > > Of note someone on Stack overflow is having this issue with the current > version of the curator framework. This sounds like the same issue. > > > > > https://stackoverflow.com/questions/68215630/why-isnt-curator-recovering-when-zookeeper-is-back-online > > > > Thanks > > Shawn > > > > *From:* Shawn Weeks <swe...@weeksconsulting.us> > *Sent:* Monday, January 10, 2022 12:12 PM > *To:* users@nifi.apache.org > *Subject:* NiFi Losing Zookeeper Connection > > > > I’ve been dealing with a Zookeeper connection issue on NiFi 1.14 for a > while now and I was wondering if anyone had any ideas. Basic issue is a > NiFi node will lose its connection to Zookeeper due to network > interruptions and then it’s never able to get its connection back. Logs > look like it’s retrying over and over but I suspect it’s not and it’s stuck > in this mode where the connection is gone but it’s never going to > reconnect. Only way to resolve the issue is to restart NiFi. Exception in > the logs starts around 2022-01-10 17:20:55,919 and I’ve cross referenced it > with some zookeeper logs at the same time. All three zookeeper logs show a > similar error about this box. In this example 192.168.1.212 is the IP for > the NiFi instance called nifi0592.example.org. This is running in AWS and > I’ve reviewed flow logs for REJECT or firewall blocks but nothing. We’re on > Zookeeper 3.6.3 and I’m seeing this across multiple NiFi instances and > VPCs. I’ve found mentions of the suspended in a zookeeper ticket but the > client version that fixed it has been in NiFi for several versions now. > > > > Thanks > > Shawn > > > > # NiFi Log > > 2022-01-10 17:19:57,464 INFO [Write-Ahead Local State Provider > Maintenance] org.wali.MinimalLockingWriteAheadLog > org.wali.MinimalLockingWriteAheadLog@718198db checkpointed with 2951 > Records and 0 Swap Files in 19 milliseconds (Stop-the-world time = 11 > milliseconds, Clear Edit Logs time = 1 millis), max Transaction ID 1224814 > > 2022-01-10 17:19:57,781 WARN [Clustering Tasks Thread-3] > o.apache.nifi.controller.FlowController Failed to send heartbeat due to: > org.apache.nifi.cluster.protocol.ProtocolException: Cannot send heartbeat > because there is no Cluster Coordinator currently elected > > 2022-01-10 17:19:57,927 INFO [Timer-Driven Process Thread-13] > o.a.n.remote.StandardRemoteProcessGroup Successfully refreshed Flow > Contents for RemoteProcessGroup[https://nifi0590.example.org:8443/nifi]; > updated to reflect 2 Input Ports [InputPort[name=vantage_file_push, > targetId=51747258-3f23-3cc2-885c-0acf8f94d8dc], > InputPort[name=incoming_bulletin, > targetId=45d7c264-3094-352f-9734-7c379d2ec648]] and 0 Output Ports [] > > 2022-01-10 17:20:05,918 WARN [Curator-ConnectionStateManager-0] > o.a.c.f.state.ConnectionStateManager Session timeout has elapsed while > SUSPENDED. Injecting a session expiration. Elapsed ms: 10001. Adjusted > session timeout ms: 10000 > > 2022-01-10 17:20:12,884 WARN [Clustering Tasks Thread-3] > o.apache.nifi.controller.FlowController Failed to send heartbeat due to: > org.apache.nifi.cluster.protocol.ProtocolException: Cannot send heartbeat > because there is no Cluster Coordinator currently elected > > 2022-01-10 17:20:15,918 WARN [Curator-ConnectionStateManager-0] > o.a.c.f.state.ConnectionStateManager Session timeout has elapsed while > SUSPENDED. Injecting a session expiration. Elapsed ms: 10000. Adjusted > session timeout ms: 10000 > > 2022-01-10 17:20:16,992 INFO [pool-13-thread-1] > o.a.n.c.r.WriteAheadFlowFileRepository Initiating checkpoint of FlowFile > Repository > > 2022-01-10 17:20:16,992 INFO [pool-13-thread-1] > o.a.n.c.r.WriteAheadFlowFileRepository Successfully checkpointed FlowFile > Repository with 98 records in 0 milliseconds > > 2022-01-10 17:20:19,438 INFO [Timer-Driven Process Thread-36] > o.a.nifi.groups.StandardProcessGroup > StandardProcessGroup[identifier=9e444aad-017c-1000-ffff-ffffe0ebbb57,name=Service > - Invoke EBM Workflow] is not the most recent version of the flow that is > under Version Control; current version is 1; most recent version is 2 > > 2022-01-10 17:20:19,849 INFO [Timer-Driven Process Thread-36] > o.a.nifi.groups.StandardProcessGroup > StandardProcessGroup[identifier=f9b53979-9eac-1ed5-a8c0-446e5b758cd4,name=Monitor > Inbound SFTP] is not the most recent version of the flow that is under > Version Control; current version is 1; most recent version is 3 > > 2022-01-10 17:20:19,866 INFO [Timer-Driven Process Thread-36] > o.a.nifi.groups.StandardProcessGroup > StandardProcessGroup[identifier=e36d47d2-9a3a-1a20-0000-00002bc9db2d,name=New > EBM Parent] is not the most recent version of the flow that is under > Version Control; current version is 5; most recent version is 6 > > 2022-01-10 17:20:24,142 INFO [Cleanup Archive for default] > o.a.n.c.repository.FileSystemRepository Successfully deleted 0 files (0 > bytes) from archive > > 2022-01-10 17:20:25,918 WARN [Curator-ConnectionStateManager-0] > o.a.c.f.state.ConnectionStateManager Session timeout has elapsed while > SUSPENDED. Injecting a session expiration. Elapsed ms: 10000. Adjusted > session timeout ms: 10000 > > 2022-01-10 17:20:27,944 INFO [Timer-Driven Process Thread-10] > o.a.n.remote.StandardRemoteProcessGroup Successfully refreshed Flow > Contents for RemoteProcessGroup[https://nifi0590.example.org:8443/nifi]; > updated to reflect 2 Input Ports [InputPort[name=vantage_file_push, > targetId=51747258-3f23-3cc2-885c-0acf8f94d8dc], > InputPort[name=incoming_bulletin, > targetId=45d7c264-3094-352f-9734-7c379d2ec648]] and 0 Output Ports [] > > 2022-01-10 17:20:27,986 WARN [Clustering Tasks Thread-3] > o.apache.nifi.controller.FlowController Failed to send heartbeat due to: > org.apache.nifi.cluster.protocol.ProtocolException: Cannot send heartbeat > because there is no Cluster Coordinator currently elected > > 2022-01-10 17:20:35,919 WARN [Curator-ConnectionStateManager-0] > o.a.c.f.state.ConnectionStateManager Session timeout has elapsed while > SUSPENDED. Injecting a session expiration. Elapsed ms: 10001. Adjusted > session timeout ms: 10000 > > 2022-01-10 17:20:36,993 INFO [pool-13-thread-1] > o.a.n.c.r.WriteAheadFlowFileRepository Initiating checkpoint of FlowFile > Repository > > 2022-01-10 17:20:36,993 INFO [pool-13-thread-1] > o.a.n.c.r.WriteAheadFlowFileRepository Successfully checkpointed FlowFile > Repository with 98 records in 0 milliseconds > > 2022-01-10 17:20:43,086 WARN [Clustering Tasks Thread-3] > o.apache.nifi.controller.FlowController Failed to send heartbeat due to: > org.apache.nifi.cluster.protocol.ProtocolException: Cannot send heartbeat > because there is no Cluster Coordinator currently elected > > 2022-01-10 17:20:45,919 WARN [Curator-ConnectionStateManager-0] > o.a.c.f.state.ConnectionStateManager Session timeout has elapsed while > SUSPENDED. Injecting a session expiration. Elapsed ms: 10000. Adjusted > session timeout ms: 10000 > > 2022-01-10 17:20:55,229 INFO [NiFi Web Server-61-EventThread] > org.apache.zookeeper.ClientCnxnSocket jute.maxbuffer value is 4194304 Bytes > > 2022-01-10 17:20:55,682 INFO [NiFi Web Server-3434-EventThread] > org.apache.zookeeper.ClientCnxnSocket jute.maxbuffer value is 4194304 Bytes > > 2022-01-10 17:20:55,919 WARN [Curator-ConnectionStateManager-0] > o.a.c.f.state.ConnectionStateManager Session timeout has elapsed while > SUSPENDED. Injecting a session expiration. Elapsed ms: 10000. Adjusted > session timeout ms: 10000 > > 2022-01-10 17:20:56,993 INFO [pool-13-thread-1] > o.a.n.c.r.WriteAheadFlowFileRepository Initiating checkpoint of FlowFile > Repository > > 2022-01-10 17:20:57,032 INFO [pool-13-thread-1] > o.a.n.c.r.WriteAheadFlowFileRepository Successfully checkpointed FlowFile > Repository with 98 records in 0 milliseconds > > 2022-01-10 17:20:58,388 INFO [Timer-Driven Process Thread-59] > o.a.n.remote.StandardRemoteProcessGroup Successfully refreshed Flow > Contents for RemoteProcessGroup[https://nifi0590.example.org:8443/nifi]; > updated to reflect 2 Input Ports [InputPort[name=vantage_file_push, > targetId=51747258-3f23-3cc2-885c-0acf8f94d8dc], > InputPort[name=incoming_bulletin, > targetId=45d7c264-3094-352f-9734-7c379d2ec648]] and 0 Output Ports [] > > 2022-01-10 17:21:04,546 WARN [Clustering Tasks Thread-3] > o.apache.nifi.controller.FlowController Failed to send heartbeat due to: > org.apache.nifi.cluster.protocol.ProtocolException: Cannot send heartbeat > because there is no Cluster Coordinator currently elected > > 2022-01-10 17:21:05,416 INFO [NiFi Web Server-4693] > o.a.n.c.m.e.NoConnectedNodesException Cluster failed processing request: > org.apache.nifi.cluster.exception.NoClusterCoordinatorException: No node > has yet been elected Cluster Coordinator. Cannot establish connection to > cluster yet.. Returning Service Unavailable response. > > 2022-01-10 17:21:05,418 WARN [Http Site-to-Site PeerSelector] > o.apache.nifi.remote.client.PeerSelector Could not communicate with > nifi0592.example.org:8443 to determine which node(s) exist in the remote > NiFi instance, due to > org.apache.nifi.remote.util.SiteToSiteRestApiClient$HttpGetFailedException: > response code 503:Service Unavailable with explanation: null > > 2022-01-10 17:21:05,533 INFO [Http Site-to-Site PeerSelector] > o.apache.nifi.remote.client.PeerSelector Successfully refreshed peer status > cache; remote group consists of 2 peers > > 2022-01-10 17:21:05,920 WARN [Curator-ConnectionStateManager-0] > o.a.c.f.state.ConnectionStateManager Session timeout has elapsed while > SUSPENDED. Injecting a session expiration. Elapsed ms: 10000. Adjusted > session timeout ms: 10000 > > 2022-01-10 17:21:15,919 WARN [Curator-ConnectionStateManager-0] > o.a.c.f.state.ConnectionStateManager Session timeout has elapsed while > SUSPENDED. Injecting a session expiration. Elapsed ms: 10000. Adjusted > session timeout ms: 10000 > > 2022-01-10 17:21:17,032 INFO [pool-13-thread-1] > o.a.n.c.r.WriteAheadFlowFileRepository Initiating checkpoint of FlowFile > Repository > > > > # Zookeeper Log > > 2022-01-10 17:20:54,455 [myid:3] - WARN > [NIOWorkerThread-2:NIOServerCnxn@364] - Unexpected exception > > EndOfStreamException: Unable to read additional data from client, it > probably closed the socket: address = /192.168.1.212:51384, session = > 0x3002a3c766c02b7 > > at > org.apache.zookeeper.server.NIOServerCnxn.handleFailedRead(NIOServerCnxn.java:163) > > at > org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:326) > > at > org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:522) > > at > org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154) > > at > java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > > at > java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > > at java.base/java.lang.Thread.run(Thread.java:829) > > 2022-01-10 17:20:54,455 [myid:3] - WARN > [NIOWorkerThread-1:NIOServerCnxn@364] - Unexpected exception > > EndOfStreamException: Unable to read additional data from client, it > probably closed the socket: address = /192.168.1.212:51380, session = > 0x3002a3c766c02b6 > > at > org.apache.zookeeper.server.NIOServerCnxn.handleFailedRead(NIOServerCnxn.java:163) > > at > org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:326) > > at > org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:522) > > at > org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154) > > at > java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > > at > java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > > at java.base/java.lang.Thread.run(Thread.java:829) > > 2022-01-10 17:20:55,045 [myid:3] - INFO [ > SessionTracker:ZooKeeperServer@610] - Expiring session 0x3002a3c766c02b7, > timeout of 10000ms exceeded > > 2022-01-10 17:20:55,045 [myid:3] - INFO [ > SessionTracker:ZooKeeperServer@610] - Expiring session 0x3002a3c766c02b6, > timeout of 10000ms exceeded > > 2022-01-10 17:20:55,045 [myid:3] - INFO [ > RequestThrottler:QuorumZooKeeperServer@159] - Submitting global > closeSession request for session 0x3002a3c766c02b7 > > 2022-01-10 17:20:55,045 [myid:3] - INFO [ > RequestThrottler:QuorumZooKeeperServer@159] - Submitting global > closeSession request for session 0x3002a3c766c02b6 > > 2022-01-10 17:20:55,910 [myid:3] - INFO > [CommitProcessor:3:LeaderSessionTracker@104] - Committing global session > 0x3002a3c766c02b8 > > 2022-01-10 17:20:55,910 [myid:3] - INFO > [CommitProcessor:3:LeaderSessionTracker@104] - Committing global session > 0x3002a3c766c02b9 >