We've had some more clustering issues. We found that some nodes run out of memory during unexpected spikes in data, which then triggers a GC stop-the-world pause. We lowered our thread count, and that has allowed the cluster to stabilize for the time being.
Our hardware is pretty robust: we usually have 1,000+ threads running on each node in the cluster (~4,000 threads cumulative). Each node has about 500 GB of RAM, but we've only been running NiFi with a 70 GB heap, and it usually uses only about 50 GB. I enabled GC logging, and after analyzing the data we decided to increase the heap size. We are experimenting with raising the max heap to 200 GB to better absorb spikes in data. We are using the default G1GC.

Also, how much impact is there from leaving GC logging on all the time? The metrics we are getting are really helpful for debugging/analysis, but we don't want to slow down the cluster too much. Thoughts on issues we might encounter? Things we should consider? --Peter
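For reference, the kind of settings we're experimenting with in NiFi's bootstrap.conf look roughly like this (a sketch, not our exact config; the log path and rotation sizes are placeholders, and the -Xlog syntax assumes Java 9+ unified logging):

```properties
# Heap sizing under test (previously -Xms70g/-Xmx70g)
java.arg.2=-Xms200g
java.arg.3=-Xmx200g

# G1GC is the JVM default on modern Java; stated explicitly here
java.arg.13=-XX:+UseG1GC

# Unified GC logging with rotation so log files stay bounded
java.arg.14=-Xlog:gc*:file=/var/log/nifi/gc.log:time,uptime:filecount=5,filesize=20m
```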