Hi Shawn,

Thanks for reporting the issue. I'll see if I can reproduce it and figure
out why it's occurring/what we can do to fix it. I created a Jira issue
here: https://issues.apache.org/jira/browse/NIFI-9560.

Nathan

On Mon, Jan 10, 2022 at 1:24 PM Shawn Weeks <swe...@weeksconsulting.us>
wrote:

> In my case the IP of Zookeeper is not changing though…
>
>
>
> Thanks
>
> Shawn
>
>
>
> *From:* Shawn Weeks <swe...@weeksconsulting.us>
> *Sent:* Monday, January 10, 2022 12:23 PM
> *To:* users@nifi.apache.org
> *Subject:* RE: NiFi Losing Zookeeper Connection
>
>
>
> Of note someone on Stack overflow is having this issue with the current
> version of the curator framework. This sounds like the same issue.
>
>
>
>
> https://stackoverflow.com/questions/68215630/why-isnt-curator-recovering-when-zookeeper-is-back-online
>
>
>
> Thanks
>
> Shawn
>
>
>
> *From:* Shawn Weeks <swe...@weeksconsulting.us>
> *Sent:* Monday, January 10, 2022 12:12 PM
> *To:* users@nifi.apache.org
> *Subject:* NiFi Losing Zookeeper Connection
>
>
>
> I’ve been dealing with a Zookeeper connection issue on NiFi 1.14 for a
> while now and I was wondering if anyone had any ideas. Basic issue is a
> NiFi node will lose its connection to Zookeeper due to network
> interruptions and then it’s never able to get its connection back. Logs
> look like it’s retrying over and over but I suspect it’s not and it’s stuck
> in this mode where the connection is gone but it’s never going to
> reconnect. Only way to resolve the issue is to restart NiFi. Exception in
> the logs starts around 2022-01-10 17:20:55,919 and I’ve cross referenced it
> with some zookeeper logs at the same time. All three zookeeper logs show a
> similar error about this box. In this example 192.168.1.212 is the IP for
> the NiFi instance called nifi0592.example.org. This is running in AWS and
> I’ve reviewed flow logs for REJECT or firewall blocks but nothing. We’re on
> Zookeeper 3.6.3 and I’m seeing this across multiple NiFi instances and
> VPCs. I’ve found mentions of the suspended in a zookeeper ticket but the
> client version that fixed it has been in NiFi for several versions now.
>
>
>
> Thanks
>
> Shawn
>
>
>
> # NiFi Log
>
> 2022-01-10 17:19:57,464 INFO [Write-Ahead Local State Provider
> Maintenance] org.wali.MinimalLockingWriteAheadLog
> org.wali.MinimalLockingWriteAheadLog@718198db checkpointed with 2951
> Records and 0 Swap Files in 19 milliseconds (Stop-the-world time = 11
> milliseconds, Clear Edit Logs time = 1 millis), max Transaction ID 1224814
>
> 2022-01-10 17:19:57,781 WARN [Clustering Tasks Thread-3]
> o.apache.nifi.controller.FlowController Failed to send heartbeat due to:
> org.apache.nifi.cluster.protocol.ProtocolException: Cannot send heartbeat
> because there is no Cluster Coordinator currently elected
>
> 2022-01-10 17:19:57,927 INFO [Timer-Driven Process Thread-13]
> o.a.n.remote.StandardRemoteProcessGroup Successfully refreshed Flow
> Contents for RemoteProcessGroup[https://nifi0590.example.org:8443/nifi];
> updated to reflect 2 Input Ports [InputPort[name=vantage_file_push,
> targetId=51747258-3f23-3cc2-885c-0acf8f94d8dc],
> InputPort[name=incoming_bulletin,
> targetId=45d7c264-3094-352f-9734-7c379d2ec648]] and 0 Output Ports []
>
> 2022-01-10 17:20:05,918 WARN [Curator-ConnectionStateManager-0]
> o.a.c.f.state.ConnectionStateManager Session timeout has elapsed while
> SUSPENDED. Injecting a session expiration. Elapsed ms: 10001. Adjusted
> session timeout ms: 10000
>
> 2022-01-10 17:20:12,884 WARN [Clustering Tasks Thread-3]
> o.apache.nifi.controller.FlowController Failed to send heartbeat due to:
> org.apache.nifi.cluster.protocol.ProtocolException: Cannot send heartbeat
> because there is no Cluster Coordinator currently elected
>
> 2022-01-10 17:20:15,918 WARN [Curator-ConnectionStateManager-0]
> o.a.c.f.state.ConnectionStateManager Session timeout has elapsed while
> SUSPENDED. Injecting a session expiration. Elapsed ms: 10000. Adjusted
> session timeout ms: 10000
>
> 2022-01-10 17:20:16,992 INFO [pool-13-thread-1]
> o.a.n.c.r.WriteAheadFlowFileRepository Initiating checkpoint of FlowFile
> Repository
>
> 2022-01-10 17:20:16,992 INFO [pool-13-thread-1]
> o.a.n.c.r.WriteAheadFlowFileRepository Successfully checkpointed FlowFile
> Repository with 98 records in 0 milliseconds
>
> 2022-01-10 17:20:19,438 INFO [Timer-Driven Process Thread-36]
> o.a.nifi.groups.StandardProcessGroup
> StandardProcessGroup[identifier=9e444aad-017c-1000-ffff-ffffe0ebbb57,name=Service
> - Invoke EBM Workflow] is not the most recent version of the flow that is
> under Version Control; current version is 1; most recent version is 2
>
> 2022-01-10 17:20:19,849 INFO [Timer-Driven Process Thread-36]
> o.a.nifi.groups.StandardProcessGroup
> StandardProcessGroup[identifier=f9b53979-9eac-1ed5-a8c0-446e5b758cd4,name=Monitor
> Inbound SFTP] is not the most recent version of the flow that is under
> Version Control; current version is 1; most recent version is 3
>
> 2022-01-10 17:20:19,866 INFO [Timer-Driven Process Thread-36]
> o.a.nifi.groups.StandardProcessGroup
> StandardProcessGroup[identifier=e36d47d2-9a3a-1a20-0000-00002bc9db2d,name=New
> EBM Parent] is not the most recent version of the flow that is under
> Version Control; current version is 5; most recent version is 6
>
> 2022-01-10 17:20:24,142 INFO [Cleanup Archive for default]
> o.a.n.c.repository.FileSystemRepository Successfully deleted 0 files (0
> bytes) from archive
>
> 2022-01-10 17:20:25,918 WARN [Curator-ConnectionStateManager-0]
> o.a.c.f.state.ConnectionStateManager Session timeout has elapsed while
> SUSPENDED. Injecting a session expiration. Elapsed ms: 10000. Adjusted
> session timeout ms: 10000
>
> 2022-01-10 17:20:27,944 INFO [Timer-Driven Process Thread-10]
> o.a.n.remote.StandardRemoteProcessGroup Successfully refreshed Flow
> Contents for RemoteProcessGroup[https://nifi0590.example.org:8443/nifi];
> updated to reflect 2 Input Ports [InputPort[name=vantage_file_push,
> targetId=51747258-3f23-3cc2-885c-0acf8f94d8dc],
> InputPort[name=incoming_bulletin,
> targetId=45d7c264-3094-352f-9734-7c379d2ec648]] and 0 Output Ports []
>
> 2022-01-10 17:20:27,986 WARN [Clustering Tasks Thread-3]
> o.apache.nifi.controller.FlowController Failed to send heartbeat due to:
> org.apache.nifi.cluster.protocol.ProtocolException: Cannot send heartbeat
> because there is no Cluster Coordinator currently elected
>
> 2022-01-10 17:20:35,919 WARN [Curator-ConnectionStateManager-0]
> o.a.c.f.state.ConnectionStateManager Session timeout has elapsed while
> SUSPENDED. Injecting a session expiration. Elapsed ms: 10001. Adjusted
> session timeout ms: 10000
>
> 2022-01-10 17:20:36,993 INFO [pool-13-thread-1]
> o.a.n.c.r.WriteAheadFlowFileRepository Initiating checkpoint of FlowFile
> Repository
>
> 2022-01-10 17:20:36,993 INFO [pool-13-thread-1]
> o.a.n.c.r.WriteAheadFlowFileRepository Successfully checkpointed FlowFile
> Repository with 98 records in 0 milliseconds
>
> 2022-01-10 17:20:43,086 WARN [Clustering Tasks Thread-3]
> o.apache.nifi.controller.FlowController Failed to send heartbeat due to:
> org.apache.nifi.cluster.protocol.ProtocolException: Cannot send heartbeat
> because there is no Cluster Coordinator currently elected
>
> 2022-01-10 17:20:45,919 WARN [Curator-ConnectionStateManager-0]
> o.a.c.f.state.ConnectionStateManager Session timeout has elapsed while
> SUSPENDED. Injecting a session expiration. Elapsed ms: 10000. Adjusted
> session timeout ms: 10000
>
> 2022-01-10 17:20:55,229 INFO [NiFi Web Server-61-EventThread]
> org.apache.zookeeper.ClientCnxnSocket jute.maxbuffer value is 4194304 Bytes
>
> 2022-01-10 17:20:55,682 INFO [NiFi Web Server-3434-EventThread]
> org.apache.zookeeper.ClientCnxnSocket jute.maxbuffer value is 4194304 Bytes
>
> 2022-01-10 17:20:55,919 WARN [Curator-ConnectionStateManager-0]
> o.a.c.f.state.ConnectionStateManager Session timeout has elapsed while
> SUSPENDED. Injecting a session expiration. Elapsed ms: 10000. Adjusted
> session timeout ms: 10000
>
> 2022-01-10 17:20:56,993 INFO [pool-13-thread-1]
> o.a.n.c.r.WriteAheadFlowFileRepository Initiating checkpoint of FlowFile
> Repository
>
> 2022-01-10 17:20:57,032 INFO [pool-13-thread-1]
> o.a.n.c.r.WriteAheadFlowFileRepository Successfully checkpointed FlowFile
> Repository with 98 records in 0 milliseconds
>
> 2022-01-10 17:20:58,388 INFO [Timer-Driven Process Thread-59]
> o.a.n.remote.StandardRemoteProcessGroup Successfully refreshed Flow
> Contents for RemoteProcessGroup[https://nifi0590.example.org:8443/nifi];
> updated to reflect 2 Input Ports [InputPort[name=vantage_file_push,
> targetId=51747258-3f23-3cc2-885c-0acf8f94d8dc],
> InputPort[name=incoming_bulletin,
> targetId=45d7c264-3094-352f-9734-7c379d2ec648]] and 0 Output Ports []
>
> 2022-01-10 17:21:04,546 WARN [Clustering Tasks Thread-3]
> o.apache.nifi.controller.FlowController Failed to send heartbeat due to:
> org.apache.nifi.cluster.protocol.ProtocolException: Cannot send heartbeat
> because there is no Cluster Coordinator currently elected
>
> 2022-01-10 17:21:05,416 INFO [NiFi Web Server-4693]
> o.a.n.c.m.e.NoConnectedNodesException Cluster failed processing request:
> org.apache.nifi.cluster.exception.NoClusterCoordinatorException: No node
> has yet been elected Cluster Coordinator. Cannot establish connection to
> cluster yet.. Returning Service Unavailable response.
>
> 2022-01-10 17:21:05,418 WARN [Http Site-to-Site PeerSelector]
> o.apache.nifi.remote.client.PeerSelector Could not communicate with
> nifi0592.example.org:8443 to determine which node(s) exist in the remote
> NiFi instance, due to
> org.apache.nifi.remote.util.SiteToSiteRestApiClient$HttpGetFailedException:
> response code 503:Service Unavailable with explanation: null
>
> 2022-01-10 17:21:05,533 INFO [Http Site-to-Site PeerSelector]
> o.apache.nifi.remote.client.PeerSelector Successfully refreshed peer status
> cache; remote group consists of 2 peers
>
> 2022-01-10 17:21:05,920 WARN [Curator-ConnectionStateManager-0]
> o.a.c.f.state.ConnectionStateManager Session timeout has elapsed while
> SUSPENDED. Injecting a session expiration. Elapsed ms: 10000. Adjusted
> session timeout ms: 10000
>
> 2022-01-10 17:21:15,919 WARN [Curator-ConnectionStateManager-0]
> o.a.c.f.state.ConnectionStateManager Session timeout has elapsed while
> SUSPENDED. Injecting a session expiration. Elapsed ms: 10000. Adjusted
> session timeout ms: 10000
>
> 2022-01-10 17:21:17,032 INFO [pool-13-thread-1]
> o.a.n.c.r.WriteAheadFlowFileRepository Initiating checkpoint of FlowFile
> Repository
>
>
>
> # Zookeeper Log
>
> 2022-01-10 17:20:54,455 [myid:3] - WARN
> [NIOWorkerThread-2:NIOServerCnxn@364] - Unexpected exception
>
> EndOfStreamException: Unable to read additional data from client, it
> probably closed the socket: address = /192.168.1.212:51384, session =
> 0x3002a3c766c02b7
>
>      at
> org.apache.zookeeper.server.NIOServerCnxn.handleFailedRead(NIOServerCnxn.java:163)
>
>      at
> org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:326)
>
>      at
> org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:522)
>
>      at
> org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154)
>
>      at
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>
>      at
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>
>      at java.base/java.lang.Thread.run(Thread.java:829)
>
> 2022-01-10 17:20:54,455 [myid:3] - WARN
> [NIOWorkerThread-1:NIOServerCnxn@364] - Unexpected exception
>
> EndOfStreamException: Unable to read additional data from client, it
> probably closed the socket: address = /192.168.1.212:51380, session =
> 0x3002a3c766c02b6
>
>      at
> org.apache.zookeeper.server.NIOServerCnxn.handleFailedRead(NIOServerCnxn.java:163)
>
>      at
> org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:326)
>
>      at
> org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:522)
>
>      at
> org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154)
>
>      at
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>
>      at
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>
>      at java.base/java.lang.Thread.run(Thread.java:829)
>
> 2022-01-10 17:20:55,045 [myid:3] - INFO  [
> SessionTracker:ZooKeeperServer@610] - Expiring session 0x3002a3c766c02b7,
> timeout of 10000ms exceeded
>
> 2022-01-10 17:20:55,045 [myid:3] - INFO  [
> SessionTracker:ZooKeeperServer@610] - Expiring session 0x3002a3c766c02b6,
> timeout of 10000ms exceeded
>
> 2022-01-10 17:20:55,045 [myid:3] - INFO  [
> RequestThrottler:QuorumZooKeeperServer@159] - Submitting global
> closeSession request for session 0x3002a3c766c02b7
>
> 2022-01-10 17:20:55,045 [myid:3] - INFO  [
> RequestThrottler:QuorumZooKeeperServer@159] - Submitting global
> closeSession request for session 0x3002a3c766c02b6
>
> 2022-01-10 17:20:55,910 [myid:3] - INFO
> [CommitProcessor:3:LeaderSessionTracker@104] - Committing global session
> 0x3002a3c766c02b8
>
> 2022-01-10 17:20:55,910 [myid:3] - INFO
> [CommitProcessor:3:LeaderSessionTracker@104] - Committing global session
> 0x3002a3c766c02b9
>

Reply via email to