I think I beat you to it. Feel free to mark whichever a dupelicate.

https://issues.apache.org/jira/browse/NIFI-9559

Thanks
Shawn

From: Nathan Gough <thena...@apache.org>
Sent: Monday, January 10, 2022 12:37 PM
To: users@nifi.apache.org
Subject: Re: NiFi Losing Zookeeper Connection

Hi Shawn,

Thanks for reporting the issue. I'll see if I can reproduce it and figure out 
why it's occurring/what we can do to fix it. I created a Jira issue here: 
https://issues.apache.org/jira/browse/NIFI-9560.

Nathan

On Mon, Jan 10, 2022 at 1:24 PM Shawn Weeks 
<swe...@weeksconsulting.us<mailto:swe...@weeksconsulting.us>> wrote:
In my case the IP of Zookeeper is not changing though…

Thanks
Shawn

From: Shawn Weeks <swe...@weeksconsulting.us<mailto:swe...@weeksconsulting.us>>
Sent: Monday, January 10, 2022 12:23 PM
To: users@nifi.apache.org<mailto:users@nifi.apache.org>
Subject: RE: NiFi Losing Zookeeper Connection

Of note someone on Stack overflow is having this issue with the current version 
of the curator framework. This sounds like the same issue.

https://stackoverflow.com/questions/68215630/why-isnt-curator-recovering-when-zookeeper-is-back-online

Thanks
Shawn

From: Shawn Weeks <swe...@weeksconsulting.us<mailto:swe...@weeksconsulting.us>>
Sent: Monday, January 10, 2022 12:12 PM
To: users@nifi.apache.org<mailto:users@nifi.apache.org>
Subject: NiFi Losing Zookeeper Connection

I’ve been dealing with a Zookeeper connection issue on NiFi 1.14 for a while 
now and I was wondering if anyone had any ideas. Basic issue is a NiFi node 
will lose its connection to Zookeeper due to network interruptions and then 
it’s never able to get its connection back. Logs look like it’s retrying over 
and over but I suspect it’s not and it’s stuck in this mode where the 
connection is gone but it’s never going to reconnect. Only way to resolve the 
issue is to restart NiFi. Exception in the logs starts around 2022-01-10 
17:20:55,919 and I’ve cross referenced it with some zookeeper logs at the same 
time. All three zookeeper logs show a similar error about this box. In this 
example 192.168.1.212 is the IP for the NiFi instance called 
nifi0592.example.org<http://nifi0592.example.org>. This is running in AWS and 
I’ve reviewed flow logs for REJECT or firewall blocks but nothing. We’re on 
Zookeeper 3.6.3 and I’m seeing this across multiple NiFi instances and VPCs. 
I’ve found mentions of the suspended in a zookeeper ticket but the client 
version that fixed it has been in NiFi for several versions now.

Thanks
Shawn

# NiFi Log
2022-01-10 17:19:57,464 INFO [Write-Ahead Local State Provider Maintenance] 
org.wali.MinimalLockingWriteAheadLog 
org.wali.MinimalLockingWriteAheadLog@718198db<mailto:org.wali.MinimalLockingWriteAheadLog@718198db>
 checkpointed with 2951 Records and 0 Swap Files in 19 milliseconds 
(Stop-the-world time = 11 milliseconds, Clear Edit Logs time = 1 millis), max 
Transaction ID 1224814
2022-01-10 17:19:57,781 WARN [Clustering Tasks Thread-3] 
o.apache.nifi.controller.FlowController Failed to send heartbeat due to: 
org.apache.nifi.cluster.protocol.ProtocolException: Cannot send heartbeat 
because there is no Cluster Coordinator currently elected
2022-01-10 17:19:57,927 INFO [Timer-Driven Process Thread-13] 
o.a.n.remote.StandardRemoteProcessGroup Successfully refreshed Flow Contents 
for RemoteProcessGroup[https://nifi0590.example.org:8443/nifi]; updated to 
reflect 2 Input Ports [InputPort[name=vantage_file_push, 
targetId=51747258-3f23-3cc2-885c-0acf8f94d8dc], 
InputPort[name=incoming_bulletin, 
targetId=45d7c264-3094-352f-9734-7c379d2ec648]] and 0 Output Ports []
2022-01-10 17:20:05,918 WARN [Curator-ConnectionStateManager-0] 
o.a.c.f.state.ConnectionStateManager Session timeout has elapsed while 
SUSPENDED. Injecting a session expiration. Elapsed ms: 10001. Adjusted session 
timeout ms: 10000
2022-01-10 17:20:12,884 WARN [Clustering Tasks Thread-3] 
o.apache.nifi.controller.FlowController Failed to send heartbeat due to: 
org.apache.nifi.cluster.protocol.ProtocolException: Cannot send heartbeat 
because there is no Cluster Coordinator currently elected
2022-01-10 17:20:15,918 WARN [Curator-ConnectionStateManager-0] 
o.a.c.f.state.ConnectionStateManager Session timeout has elapsed while 
SUSPENDED. Injecting a session expiration. Elapsed ms: 10000. Adjusted session 
timeout ms: 10000
2022-01-10 17:20:16,992 INFO [pool-13-thread-1] 
o.a.n.c.r.WriteAheadFlowFileRepository Initiating checkpoint of FlowFile 
Repository
2022-01-10 17:20:16,992 INFO [pool-13-thread-1] 
o.a.n.c.r.WriteAheadFlowFileRepository Successfully checkpointed FlowFile 
Repository with 98 records in 0 milliseconds
2022-01-10 17:20:19,438 INFO [Timer-Driven Process Thread-36] 
o.a.nifi.groups.StandardProcessGroup 
StandardProcessGroup[identifier=9e444aad-017c-1000-ffff-ffffe0ebbb57,name=Service
 - Invoke EBM Workflow] is not the most recent version of the flow that is 
under Version Control; current version is 1; most recent version is 2
2022-01-10 17:20:19,849 INFO [Timer-Driven Process Thread-36] 
o.a.nifi.groups.StandardProcessGroup 
StandardProcessGroup[identifier=f9b53979-9eac-1ed5-a8c0-446e5b758cd4,name=Monitor
 Inbound SFTP] is not the most recent version of the flow that is under Version 
Control; current version is 1; most recent version is 3
2022-01-10 17:20:19,866 INFO [Timer-Driven Process Thread-36] 
o.a.nifi.groups.StandardProcessGroup 
StandardProcessGroup[identifier=e36d47d2-9a3a-1a20-0000-00002bc9db2d,name=New 
EBM Parent] is not the most recent version of the flow that is under Version 
Control; current version is 5; most recent version is 6
2022-01-10 17:20:24,142 INFO [Cleanup Archive for default] 
o.a.n.c.repository.FileSystemRepository Successfully deleted 0 files (0 bytes) 
from archive
2022-01-10 17:20:25,918 WARN [Curator-ConnectionStateManager-0] 
o.a.c.f.state.ConnectionStateManager Session timeout has elapsed while 
SUSPENDED. Injecting a session expiration. Elapsed ms: 10000. Adjusted session 
timeout ms: 10000
2022-01-10 17:20:27,944 INFO [Timer-Driven Process Thread-10] 
o.a.n.remote.StandardRemoteProcessGroup Successfully refreshed Flow Contents 
for RemoteProcessGroup[https://nifi0590.example.org:8443/nifi]; updated to 
reflect 2 Input Ports [InputPort[name=vantage_file_push, 
targetId=51747258-3f23-3cc2-885c-0acf8f94d8dc], 
InputPort[name=incoming_bulletin, 
targetId=45d7c264-3094-352f-9734-7c379d2ec648]] and 0 Output Ports []
2022-01-10 17:20:27,986 WARN [Clustering Tasks Thread-3] 
o.apache.nifi.controller.FlowController Failed to send heartbeat due to: 
org.apache.nifi.cluster.protocol.ProtocolException: Cannot send heartbeat 
because there is no Cluster Coordinator currently elected
2022-01-10 17:20:35,919 WARN [Curator-ConnectionStateManager-0] 
o.a.c.f.state.ConnectionStateManager Session timeout has elapsed while 
SUSPENDED. Injecting a session expiration. Elapsed ms: 10001. Adjusted session 
timeout ms: 10000
2022-01-10 17:20:36,993 INFO [pool-13-thread-1] 
o.a.n.c.r.WriteAheadFlowFileRepository Initiating checkpoint of FlowFile 
Repository
2022-01-10 17:20:36,993 INFO [pool-13-thread-1] 
o.a.n.c.r.WriteAheadFlowFileRepository Successfully checkpointed FlowFile 
Repository with 98 records in 0 milliseconds
2022-01-10 17:20:43,086 WARN [Clustering Tasks Thread-3] 
o.apache.nifi.controller.FlowController Failed to send heartbeat due to: 
org.apache.nifi.cluster.protocol.ProtocolException: Cannot send heartbeat 
because there is no Cluster Coordinator currently elected
2022-01-10 17:20:45,919 WARN [Curator-ConnectionStateManager-0] 
o.a.c.f.state.ConnectionStateManager Session timeout has elapsed while 
SUSPENDED. Injecting a session expiration. Elapsed ms: 10000. Adjusted session 
timeout ms: 10000
2022-01-10 17:20:55,229 INFO [NiFi Web Server-61-EventThread] 
org.apache.zookeeper.ClientCnxnSocket jute.maxbuffer value is 4194304 Bytes
2022-01-10 17:20:55,682 INFO [NiFi Web Server-3434-EventThread] 
org.apache.zookeeper.ClientCnxnSocket jute.maxbuffer value is 4194304 Bytes
2022-01-10 17:20:55,919 WARN [Curator-ConnectionStateManager-0] 
o.a.c.f.state.ConnectionStateManager Session timeout has elapsed while 
SUSPENDED. Injecting a session expiration. Elapsed ms: 10000. Adjusted session 
timeout ms: 10000
2022-01-10 17:20:56,993 INFO [pool-13-thread-1] 
o.a.n.c.r.WriteAheadFlowFileRepository Initiating checkpoint of FlowFile 
Repository
2022-01-10 17:20:57,032 INFO [pool-13-thread-1] 
o.a.n.c.r.WriteAheadFlowFileRepository Successfully checkpointed FlowFile 
Repository with 98 records in 0 milliseconds
2022-01-10 17:20:58,388 INFO [Timer-Driven Process Thread-59] 
o.a.n.remote.StandardRemoteProcessGroup Successfully refreshed Flow Contents 
for RemoteProcessGroup[https://nifi0590.example.org:8443/nifi]; updated to 
reflect 2 Input Ports [InputPort[name=vantage_file_push, 
targetId=51747258-3f23-3cc2-885c-0acf8f94d8dc], 
InputPort[name=incoming_bulletin, 
targetId=45d7c264-3094-352f-9734-7c379d2ec648]] and 0 Output Ports []
2022-01-10 17:21:04,546 WARN [Clustering Tasks Thread-3] 
o.apache.nifi.controller.FlowController Failed to send heartbeat due to: 
org.apache.nifi.cluster.protocol.ProtocolException: Cannot send heartbeat 
because there is no Cluster Coordinator currently elected
2022-01-10 17:21:05,416 INFO [NiFi Web Server-4693] 
o.a.n.c.m.e.NoConnectedNodesException Cluster failed processing request: 
org.apache.nifi.cluster.exception.NoClusterCoordinatorException: No node has 
yet been elected Cluster Coordinator. Cannot establish connection to cluster 
yet.. Returning Service Unavailable response.
2022-01-10 17:21:05,418 WARN [Http Site-to-Site PeerSelector] 
o.apache.nifi.remote.client.PeerSelector Could not communicate with 
nifi0592.example.org:8443<http://nifi0592.example.org:8443> to determine which 
node(s) exist in the remote NiFi instance, due to 
org.apache.nifi.remote.util.SiteToSiteRestApiClient$HttpGetFailedException: 
response code 503:Service Unavailable with explanation: null
2022-01-10 17:21:05,533 INFO [Http Site-to-Site PeerSelector] 
o.apache.nifi.remote.client.PeerSelector Successfully refreshed peer status 
cache; remote group consists of 2 peers
2022-01-10 17:21:05,920 WARN [Curator-ConnectionStateManager-0] 
o.a.c.f.state.ConnectionStateManager Session timeout has elapsed while 
SUSPENDED. Injecting a session expiration. Elapsed ms: 10000. Adjusted session 
timeout ms: 10000
2022-01-10 17:21:15,919 WARN [Curator-ConnectionStateManager-0] 
o.a.c.f.state.ConnectionStateManager Session timeout has elapsed while 
SUSPENDED. Injecting a session expiration. Elapsed ms: 10000. Adjusted session 
timeout ms: 10000
2022-01-10 17:21:17,032 INFO [pool-13-thread-1] 
o.a.n.c.r.WriteAheadFlowFileRepository Initiating checkpoint of FlowFile 
Repository

# Zookeeper Log
2022-01-10 17:20:54,455 [myid:3] - WARN  [NIOWorkerThread-2:NIOServerCnxn@364] 
- Unexpected exception
EndOfStreamException: Unable to read additional data from client, it probably 
closed the socket: address = /192.168.1.212:51384<http://192.168.1.212:51384>, 
session = 0x3002a3c766c02b7
     at 
org.apache.zookeeper.server.NIOServerCnxn.handleFailedRead(NIOServerCnxn.java:163)
     at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:326)
     at 
org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:522)
     at 
org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154)
     at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
     at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
     at java.base/java.lang.Thread.run(Thread.java:829)
2022-01-10 17:20:54,455 [myid:3] - WARN  [NIOWorkerThread-1:NIOServerCnxn@364] 
- Unexpected exception
EndOfStreamException: Unable to read additional data from client, it probably 
closed the socket: address = /192.168.1.212:51380<http://192.168.1.212:51380>, 
session = 0x3002a3c766c02b6
     at 
org.apache.zookeeper.server.NIOServerCnxn.handleFailedRead(NIOServerCnxn.java:163)
     at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:326)
     at 
org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:522)
     at 
org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154)
     at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
     at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
     at java.base/java.lang.Thread.run(Thread.java:829)
2022-01-10 17:20:55,045 [myid:3] - INFO  [SessionTracker:ZooKeeperServer@610] - 
Expiring session 0x3002a3c766c02b7, timeout of 10000ms exceeded
2022-01-10 17:20:55,045 [myid:3] - INFO  [SessionTracker:ZooKeeperServer@610] - 
Expiring session 0x3002a3c766c02b6, timeout of 10000ms exceeded
2022-01-10 17:20:55,045 [myid:3] - INFO  
[RequestThrottler:QuorumZooKeeperServer@159] - Submitting global closeSession 
request for session 0x3002a3c766c02b7
2022-01-10 17:20:55,045 [myid:3] - INFO  
[RequestThrottler:QuorumZooKeeperServer@159] - Submitting global closeSession 
request for session 0x3002a3c766c02b6
2022-01-10 17:20:55,910 [myid:3] - INFO  
[CommitProcessor:3:LeaderSessionTracker@104] - Committing global session 
0x3002a3c766c02b8
2022-01-10 17:20:55,910 [myid:3] - INFO  
[CommitProcessor:3:LeaderSessionTracker@104] - Committing global session 
0x3002a3c766c02b9

Reply via email to