Re: Dealing with cluster errors

2017-02-13 Thread Jeff
Joe, Some settings to try if these issues occur again: In nifi.properties: nifi.cluster.node.read.timeout=30 sec; in zookeeper.properties: tickTime=5000. Try switching your Remote Process Group settings from HTTP to RAW, and set the Communications Timeout to something like 1m. On Mon, Feb 13, 2017 at
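Concretely, those suggestions map to roughly the following (the values are Jeff's starting points rather than tuned recommendations, and the site-to-site options live on the Remote Process Group itself in the UI, not in a file):

  # conf/nifi.properties
  nifi.cluster.node.read.timeout=30 sec

  # conf/zookeeper.properties (embedded ZooKeeper)
  tickTime=5000

  # On each Remote Process Group (configured in the UI, not a config file):
  #   Transport Protocol:      RAW (instead of HTTP)
  #   Communications Timeout:  1 min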

Re: Dealing with cluster errors

2017-02-13 Thread Joe Gresock
The disk utilization is currently 90-95% used by system and user, and iowait is very low. We do use site-to-site. Interestingly, I can no longer replicate the problem, which is good but puzzling. Since the problem first started, I have externalized the ZK quorum and decreased the scheduled
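For anyone wanting to check the same things on their own nodes, an illustrative set of commands (not from the original thread):

  # Filesystem space in use
  df -h

  # Per-device I/O utilization plus CPU user/system/iowait breakdown,
  # sampled every 5 seconds (from the sysstat package)
  iostat -x 5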

Re: Dealing with cluster errors

2017-02-13 Thread Jeff
Hello Joe, What is the disk utilization on the nodes of your cluster while you're having issues using the UI? I have done some testing under heavy disk utilization and have had to increase the timeout values for cluster communication to prevent replication requests from timing out. Does

Re: Dealing with cluster errors

2017-02-13 Thread Joe Gresock
"Can you tell us more about the processors using cluster scoped state and what the rates are through them?" In this case it's probably not relevant, because I have that processor stopped. However, it's a custom MongoDB processor that stores the last mongo ID in the cluster scoped state, to

Re: Dealing with cluster errors

2017-02-13 Thread Joe Witt
Joe, can you tell us more about the processors using cluster scoped state and what the rates are through them? I could envision us putting too much strain on ZK in some cases. Thanks, Joe. On Mon, Feb 13, 2017 at 10:51 AM, Joe Gresock wrote: > I was able to externalize my

Re: Dealing with cluster errors

2017-02-13 Thread Joe Gresock
I was able to externalize my ZooKeeper quorum, which is now running on 3 separate VMs. I am able to bring up the NiFi cluster when my data flow is stopped, and I can tell the ZK migration worked because I have some processors with cluster-scoped state. However, I am still having a hard time
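For anyone making the same move, pointing NiFi at an external quorum comes down to roughly these settings (hostnames are placeholders, not Joe's actual hosts):

  # conf/nifi.properties on every node
  nifi.state.management.embedded.zookeeper.start=false
  nifi.zookeeper.connect.string=zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181

  # conf/state-management.xml, in the zk-provider cluster provider
  <property name="Connect String">zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</property>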

Re: Dealing with cluster errors

2017-02-10 Thread Andrew Grande
Joe, External ZK quorum would be my first move. And make sure those boxes have fast disks and no heavy load from other processes. Andrew On Fri, Feb 10, 2017, 7:23 AM Joe Gresock wrote: > I should add that the flows on the individual nodes appear to be processing > the

Re: Dealing with cluster errors

2017-02-10 Thread Joe Gresock
I should add that the flows on the individual nodes appear to be processing the data just fine, and the solution I've found so far is to just wait for the data to subside, after which point the console comes up successfully. So, no complaint on the durability of the underlying data flows. It's

Dealing with cluster errors

2017-02-10 Thread Joe Gresock
We have a 7-node cluster and currently use the embedded ZooKeeper servers on 3 of the nodes. I've noticed that when we have a high volume in our flow (which hits the CPU pretty hard), I have a really hard time getting the console page to come up, as it cycles through the following
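For context, an embedded 3-node ensemble like this is typically wired up along these lines (hostnames are placeholders, not Joe's actual configuration):

  # conf/nifi.properties, only on the 3 nodes that run ZooKeeper
  nifi.state.management.embedded.zookeeper.start=true

  # conf/zookeeper.properties on those same 3 nodes
  # (each node also needs a matching myid file in the ZooKeeper state directory)
  server.1=node1.example.com:2888:3888
  server.2=node2.example.com:2888:3888
  server.3=node3.example.com:2888:3888

  # conf/nifi.properties on all 7 nodes, pointing at the quorum
  nifi.zookeeper.connect.string=node1.example.com:2181,node2.example.com:2181,node3.example.com:2181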