We have a 6-node cluster using external ZooKeeper. It is heavily loaded, and we are attempting to tune some of the properties to alleviate some observed issues. By "heavily loaded" I mean the graph is large (approx. 3,000 processors) and there is a lot of data in process (approx. 2M flowfiles/120GB queued).
One symptom we see is that changes to the graph are not replicated to the other nodes, and the affected node(s) are subsequently disconnected from the cluster. In one example, nifi-app.log shows that the node was disconnected due to "failed to process request PUT /nifi-api/connection/976a60b5d-3c4e-3bbb-8fbe-4790f3ecb147".

The following properties are set in nifi.properties:

nifi.cluster.node.protocol.threads=30
nifi.cluster.node.protocol.max.threads=50
nifi.cluster.node.event.history.size=25
nifi.cluster.node.connection.timeout=60 sec
nifi.cluster.node.read.timeout=60 sec
nifi.cluster.node.max.concurrent.requests=500
nifi.cluster.node.request.replication.claim.timeout=20 secs
nifi.zookeeper.connect.timeout=30 secs
nifi.zookeeper.session.timeout=30 secs

Some of the timeout values are set fairly high because the system is heavily loaded; we want to allow more time for tasks to complete. Are there interrelated properties for which a long timeout might actually become detrimental? Are there other properties we should look at more closely?

Thanks,
Mark