I've been dealing with a ZooKeeper connection issue on NiFi 1.14 for a while now and was wondering if anyone had any ideas.

The basic issue is that a NiFi node loses its connection to ZooKeeper due to a network interruption and is then never able to get the connection back. The logs look like it's retrying over and over, but I suspect it isn't, and that it's stuck in a state where the connection is gone and it's never going to reconnect. The only way to resolve the issue is to restart NiFi.

The exception in the NiFi logs starts around 2022-01-10 17:20:55,919, and I've cross-referenced it with the ZooKeeper logs from the same time. All three ZooKeeper logs show a similar error about this box. In this example, 192.168.1.212 is the IP of the NiFi instance called nifi0592.example.org.

This is running in AWS, and I've reviewed the flow logs for REJECT entries or firewall blocks but found nothing. We're on ZooKeeper 3.6.3, and I'm seeing this across multiple NiFi instances and VPCs. I've found mentions of the SUSPENDED message in a ZooKeeper ticket, but the client version that fixed it has been in NiFi for several versions now.

Thanks
Shawn

# NiFi Log
2022-01-10 17:19:57,464 INFO [Write-Ahead Local State Provider Maintenance] org.wali.MinimalLockingWriteAheadLog org.wali.MinimalLockingWriteAheadLog@718198db checkpointed with 2951 Records and 0 Swap Files in 19 milliseconds (Stop-the-world time = 11 milliseconds, Clear Edit Logs time = 1 millis), max Transaction ID 1224814
2022-01-10 17:19:57,781 WARN [Clustering Tasks Thread-3] o.apache.nifi.controller.FlowController Failed to send heartbeat due to: org.apache.nifi.cluster.protocol.ProtocolException: Cannot send heartbeat because there is no Cluster Coordinator currently elected
2022-01-10 17:19:57,927 INFO [Timer-Driven Process Thread-13] o.a.n.remote.StandardRemoteProcessGroup Successfully refreshed Flow Contents for RemoteProcessGroup[https://nifi0590.example.org:8443/nifi]; updated to reflect 2 Input Ports [InputPort[name=vantage_file_push, targetId=51747258-3f23-3cc2-885c-0acf8f94d8dc], InputPort[name=incoming_bulletin, targetId=45d7c264-3094-352f-9734-7c379d2ec648]] and 0 Output Ports []
2022-01-10 17:20:05,918 WARN [Curator-ConnectionStateManager-0] o.a.c.f.state.ConnectionStateManager Session timeout has elapsed while SUSPENDED. Injecting a session expiration. Elapsed ms: 10001. Adjusted session timeout ms: 10000
2022-01-10 17:20:12,884 WARN [Clustering Tasks Thread-3] o.apache.nifi.controller.FlowController Failed to send heartbeat due to: org.apache.nifi.cluster.protocol.ProtocolException: Cannot send heartbeat because there is no Cluster Coordinator currently elected
2022-01-10 17:20:15,918 WARN [Curator-ConnectionStateManager-0] o.a.c.f.state.ConnectionStateManager Session timeout has elapsed while SUSPENDED. Injecting a session expiration. Elapsed ms: 10000. Adjusted session timeout ms: 10000
2022-01-10 17:20:16,992 INFO [pool-13-thread-1] o.a.n.c.r.WriteAheadFlowFileRepository Initiating checkpoint of FlowFile Repository
2022-01-10 17:20:16,992 INFO [pool-13-thread-1] o.a.n.c.r.WriteAheadFlowFileRepository Successfully checkpointed FlowFile Repository with 98 records in 0 milliseconds
2022-01-10 17:20:19,438 INFO [Timer-Driven Process Thread-36] o.a.nifi.groups.StandardProcessGroup StandardProcessGroup[identifier=9e444aad-017c-1000-ffff-ffffe0ebbb57,name=Service - Invoke EBM Workflow] is not the most recent version of the flow that is under Version Control; current version is 1; most recent version is 2
2022-01-10 17:20:19,849 INFO [Timer-Driven Process Thread-36] o.a.nifi.groups.StandardProcessGroup StandardProcessGroup[identifier=f9b53979-9eac-1ed5-a8c0-446e5b758cd4,name=Monitor Inbound SFTP] is not the most recent version of the flow that is under Version Control; current version is 1; most recent version is 3
2022-01-10 17:20:19,866 INFO [Timer-Driven Process Thread-36] o.a.nifi.groups.StandardProcessGroup StandardProcessGroup[identifier=e36d47d2-9a3a-1a20-0000-00002bc9db2d,name=New EBM Parent] is not the most recent version of the flow that is under Version Control; current version is 5; most recent version is 6
2022-01-10 17:20:24,142 INFO [Cleanup Archive for default] o.a.n.c.repository.FileSystemRepository Successfully deleted 0 files (0 bytes) from archive
2022-01-10 17:20:25,918 WARN [Curator-ConnectionStateManager-0] o.a.c.f.state.ConnectionStateManager Session timeout has elapsed while SUSPENDED. Injecting a session expiration. Elapsed ms: 10000. Adjusted session timeout ms: 10000
2022-01-10 17:20:27,944 INFO [Timer-Driven Process Thread-10] o.a.n.remote.StandardRemoteProcessGroup Successfully refreshed Flow Contents for RemoteProcessGroup[https://nifi0590.example.org:8443/nifi]; updated to reflect 2 Input Ports [InputPort[name=vantage_file_push, targetId=51747258-3f23-3cc2-885c-0acf8f94d8dc], InputPort[name=incoming_bulletin, targetId=45d7c264-3094-352f-9734-7c379d2ec648]] and 0 Output Ports []
2022-01-10 17:20:27,986 WARN [Clustering Tasks Thread-3] o.apache.nifi.controller.FlowController Failed to send heartbeat due to: org.apache.nifi.cluster.protocol.ProtocolException: Cannot send heartbeat because there is no Cluster Coordinator currently elected
2022-01-10 17:20:35,919 WARN [Curator-ConnectionStateManager-0] o.a.c.f.state.ConnectionStateManager Session timeout has elapsed while SUSPENDED. Injecting a session expiration. Elapsed ms: 10001. Adjusted session timeout ms: 10000
2022-01-10 17:20:36,993 INFO [pool-13-thread-1] o.a.n.c.r.WriteAheadFlowFileRepository Initiating checkpoint of FlowFile Repository
2022-01-10 17:20:36,993 INFO [pool-13-thread-1] o.a.n.c.r.WriteAheadFlowFileRepository Successfully checkpointed FlowFile Repository with 98 records in 0 milliseconds
2022-01-10 17:20:43,086 WARN [Clustering Tasks Thread-3] o.apache.nifi.controller.FlowController Failed to send heartbeat due to: org.apache.nifi.cluster.protocol.ProtocolException: Cannot send heartbeat because there is no Cluster Coordinator currently elected
2022-01-10 17:20:45,919 WARN [Curator-ConnectionStateManager-0] o.a.c.f.state.ConnectionStateManager Session timeout has elapsed while SUSPENDED. Injecting a session expiration. Elapsed ms: 10000. Adjusted session timeout ms: 10000
2022-01-10 17:20:55,229 INFO [NiFi Web Server-61-EventThread] org.apache.zookeeper.ClientCnxnSocket jute.maxbuffer value is 4194304 Bytes
2022-01-10 17:20:55,682 INFO [NiFi Web Server-3434-EventThread] org.apache.zookeeper.ClientCnxnSocket jute.maxbuffer value is 4194304 Bytes
2022-01-10 17:20:55,919 WARN [Curator-ConnectionStateManager-0] o.a.c.f.state.ConnectionStateManager Session timeout has elapsed while SUSPENDED. Injecting a session expiration. Elapsed ms: 10000. Adjusted session timeout ms: 10000
2022-01-10 17:20:56,993 INFO [pool-13-thread-1] o.a.n.c.r.WriteAheadFlowFileRepository Initiating checkpoint of FlowFile Repository
2022-01-10 17:20:57,032 INFO [pool-13-thread-1] o.a.n.c.r.WriteAheadFlowFileRepository Successfully checkpointed FlowFile Repository with 98 records in 0 milliseconds
2022-01-10 17:20:58,388 INFO [Timer-Driven Process Thread-59] o.a.n.remote.StandardRemoteProcessGroup Successfully refreshed Flow Contents for RemoteProcessGroup[https://nifi0590.example.org:8443/nifi]; updated to reflect 2 Input Ports [InputPort[name=vantage_file_push, targetId=51747258-3f23-3cc2-885c-0acf8f94d8dc], InputPort[name=incoming_bulletin, targetId=45d7c264-3094-352f-9734-7c379d2ec648]] and 0 Output Ports []
2022-01-10 17:21:04,546 WARN [Clustering Tasks Thread-3] o.apache.nifi.controller.FlowController Failed to send heartbeat due to: org.apache.nifi.cluster.protocol.ProtocolException: Cannot send heartbeat because there is no Cluster Coordinator currently elected
2022-01-10 17:21:05,416 INFO [NiFi Web Server-4693] o.a.n.c.m.e.NoConnectedNodesException Cluster failed processing request: org.apache.nifi.cluster.exception.NoClusterCoordinatorException: No node has yet been elected Cluster Coordinator. Cannot establish connection to cluster yet.. Returning Service Unavailable response.
2022-01-10 17:21:05,418 WARN [Http Site-to-Site PeerSelector] o.apache.nifi.remote.client.PeerSelector Could not communicate with nifi0592.example.org:8443 to determine which node(s) exist in the remote NiFi instance, due to org.apache.nifi.remote.util.SiteToSiteRestApiClient$HttpGetFailedException: response code 503:Service Unavailable with explanation: null
2022-01-10 17:21:05,533 INFO [Http Site-to-Site PeerSelector] o.apache.nifi.remote.client.PeerSelector Successfully refreshed peer status cache; remote group consists of 2 peers
2022-01-10 17:21:05,920 WARN [Curator-ConnectionStateManager-0] o.a.c.f.state.ConnectionStateManager Session timeout has elapsed while SUSPENDED. Injecting a session expiration. Elapsed ms: 10000. Adjusted session timeout ms: 10000
2022-01-10 17:21:15,919 WARN [Curator-ConnectionStateManager-0] o.a.c.f.state.ConnectionStateManager Session timeout has elapsed while SUSPENDED. Injecting a session expiration. Elapsed ms: 10000. Adjusted session timeout ms: 10000
2022-01-10 17:21:17,032 INFO [pool-13-thread-1] o.a.n.c.r.WriteAheadFlowFileRepository Initiating checkpoint of FlowFile Repository

# ZooKeeper Log
2022-01-10 17:20:54,455 [myid:3] - WARN [NIOWorkerThread-2:NIOServerCnxn@364] - Unexpected exception
EndOfStreamException: Unable to read additional data from client, it probably closed the socket: address = /192.168.1.212:51384, session = 0x3002a3c766c02b7
    at org.apache.zookeeper.server.NIOServerCnxn.handleFailedRead(NIOServerCnxn.java:163)
    at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:326)
    at org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:522)
    at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
2022-01-10 17:20:54,455 [myid:3] - WARN [NIOWorkerThread-1:NIOServerCnxn@364] - Unexpected exception
EndOfStreamException: Unable to read additional data from client, it probably closed the socket: address = /192.168.1.212:51380, session = 0x3002a3c766c02b6
    at org.apache.zookeeper.server.NIOServerCnxn.handleFailedRead(NIOServerCnxn.java:163)
    at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:326)
    at org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:522)
    at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
2022-01-10 17:20:55,045 [myid:3] - INFO [SessionTracker:ZooKeeperServer@610] - Expiring session 0x3002a3c766c02b7, timeout of 10000ms exceeded
2022-01-10 17:20:55,045 [myid:3] - INFO [SessionTracker:ZooKeeperServer@610] - Expiring session 0x3002a3c766c02b6, timeout of 10000ms exceeded
2022-01-10 17:20:55,045 [myid:3] - INFO [RequestThrottler:QuorumZooKeeperServer@159] - Submitting global closeSession request for session 0x3002a3c766c02b7
2022-01-10 17:20:55,045 [myid:3] - INFO [RequestThrottler:QuorumZooKeeperServer@159] - Submitting global closeSession request for session 0x3002a3c766c02b6
2022-01-10 17:20:55,910 [myid:3] - INFO [CommitProcessor:3:LeaderSessionTracker@104] - Committing global session 0x3002a3c766c02b8
2022-01-10 17:20:55,910 [myid:3] - INFO [CommitProcessor:3:LeaderSessionTracker@104] - Committing global session 0x3002a3c766c02b9
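
# Checking the SUSPENDED interval
For anyone who wants to check the same pattern in their own logs, here is a minimal Python sketch that measures the time between consecutive Curator "Session timeout has elapsed while SUSPENDED" warnings. The function name `suspended_intervals` is made up for illustration, and the sample lines are copied from the NiFi log above. A steady interval equal to the session timeout (10000 ms here) would be consistent with the client staying SUSPENDED rather than reconnecting, though it does not prove that on its own.

```python
import re
from datetime import datetime

# Leading timestamp of a NiFi log entry, e.g. "2022-01-10 17:20:05,918".
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) ")
# Text of the Curator warning seen in the NiFi log above.
MARKER = "Session timeout has elapsed while SUSPENDED"


def suspended_intervals(lines):
    """Return the seconds between consecutive SUSPENDED warnings (hypothetical helper)."""
    stamps = []
    for line in lines:
        if MARKER not in line:
            continue  # ignore every other kind of log entry
        match = TIMESTAMP.match(line)
        if match:
            stamps.append(datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f"))
    # Pair each warning with the next one and take the difference.
    return [(later - earlier).total_seconds() for earlier, later in zip(stamps, stamps[1:])]


# Sample entries copied from the NiFi log in this post.
SAMPLE = [
    "2022-01-10 17:20:05,918 WARN [Curator-ConnectionStateManager-0] o.a.c.f.state.ConnectionStateManager Session timeout has elapsed while SUSPENDED. Injecting a session expiration. Elapsed ms: 10001. Adjusted session timeout ms: 10000",
    "2022-01-10 17:20:12,884 WARN [Clustering Tasks Thread-3] o.apache.nifi.controller.FlowController Failed to send heartbeat due to: org.apache.nifi.cluster.protocol.ProtocolException: Cannot send heartbeat because there is no Cluster Coordinator currently elected",
    "2022-01-10 17:20:15,918 WARN [Curator-ConnectionStateManager-0] o.a.c.f.state.ConnectionStateManager Session timeout has elapsed while SUSPENDED. Injecting a session expiration. Elapsed ms: 10000. Adjusted session timeout ms: 10000",
    "2022-01-10 17:20:25,918 WARN [Curator-ConnectionStateManager-0] o.a.c.f.state.ConnectionStateManager Session timeout has elapsed while SUSPENDED. Injecting a session expiration. Elapsed ms: 10000. Adjusted session timeout ms: 10000",
]

if __name__ == "__main__":
    print(suspended_intervals(SAMPLE))  # prints [10.0, 10.0]
```

To run it against a real log, pass an open file instead of the sample list, for example `suspended_intervals(open("nifi-app.log"))`, where the path is an assumption about where your NiFi application log lives.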