Mark,

You could try increasing the "tickTime" setting in zookeeper.properties (or zoo.cfg, if your external ZooKeeper uses the stock config file) to give ZK a little more time before timing something out. tickTime is defined [1] as "the length of a single tick, which is the basic time unit used by ZooKeeper, as measured in milliseconds. It is used to regulate heartbeats, and timeouts. For example, the minimum session timeout will be two ticks". You might want to try a value of 10000 or 15000 there, making a tick 10 or 15 seconds respectively.
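
As a rough sketch of what that change could look like, assuming an otherwise default server config (minSessionTimeout and maxSessionTimeout left unset), here is the setting with the default session-timeout bounds from the same admin guide worked out in the comments:

```properties
# tickTime is in milliseconds; 10000 ms makes one tick 10 seconds.
tickTime=10000

# Assumption: minSessionTimeout and maxSessionTimeout are not set, so the
# documented defaults apply:
#   minimum session timeout =  2 * tickTime =  20000 ms (20 s)
#   maximum session timeout = 20 * tickTime = 200000 ms (200 s)
# A client asking for a 30-second session (nifi.zookeeper.session.timeout=30 secs)
# falls inside that range.
```

With tickTime=15000 the minimum becomes 30 seconds, so a 30-second session timeout would sit right at the floor. Keep in mind that initLimit and syncLimit are counted in ticks, so raising tickTime stretches those as well.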

There are other settings you might want to look into as well, but when I was testing on heavily loaded clusters, increasing the timeouts for NiFi as you've already done, along with increasing tickTime, gave good results.

[1] http://zookeeper.apache.org/doc/r3.4.6/zookeeperAdmin.html

On Tue, May 8, 2018 at 10:43 AM Mark Bean <mark.o.b...@gmail.com> wrote:

> We have a 6-node cluster using external ZooKeeper. It is heavily loaded,
> and we are attempting to tune some of the properties to alleviate some
> observed issues. By "heavily loaded" I mean the graph is large (approx.
> 3,000 processors) and there is a lot of data in process (approx. 2M
> flowfiles/120GB queued)
>
> One symptom we see is that changes to the graph are not replicated to other
> nodes, and the Node(s) are subsequently disconnected from the cluster. In
> one example, we see in the nifi-app.log that the node is disconnected due
> to "failed to process request PUT
> /nifi-api/connection/976a60b5d-3c4e-3bbb-8fbe-4790f3ecb147"
>
> The following properties are set in nifi.properties:
>
> nifi.cluster.node.protocol.threads=30
> nifi.cluster.node.protocol.max.threads=50
> nifi.cluster.node.event.history.size=25
> nifi.cluster.node.connection.timeout=60 sec
> nifi.cluster.node.read.timeout=60 sec
> nifi.cluster.node.max.concurrent.requests=500
> nifi.cluster.node.request.replication.claim.timeout=20 secs
>
> nifi.zookeeper.connect.timeout=30 secs
> nifi.zookeeper.session.timeout=30 secs
>
> Some of the (timeout) values are set fairly high due to the heavily loaded
> system; we allow a longer time to complete tasks. Are there interrelated
> properties which a long timeout might actually become detrimental? Are
> there other properties we should look at more closely?
>
> Thanks,
> Mark