We've had some more clustering issues: some nodes run out of memory during 
unexpected spikes in data, and then we hit a GC stop-the-world pause... We 
lowered our thread count, and that has allowed the cluster to stabilize for 
the time being.

Our hardware is pretty robust: we usually have 1000+ threads running on each 
node in the cluster (~4,000 threads cumulative). Each node has about 500 GB of 
RAM, but we've only been running NiFi with a 70 GB heap, and it usually uses 
only about 50 GB.

I enabled GC logging, and after analyzing the data we decided to increase the 
heap size. We are experimenting with raising the max heap to 200 GB to better 
absorb spikes in data. We are using the default G1GC.
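For reference, a heap change like this would go in NiFi's bootstrap.conf via 
the java.arg.* properties. A minimal sketch, assuming the stock bootstrap.conf 
layout (the arg numbers are just the conventional defaults; they only need to 
be unique):

```properties
# Initial and max heap -- setting them equal avoids heap-resize pauses
java.arg.2=-Xms200g
java.arg.3=-Xmx200g
```

One thing we're watching: with a heap that large, G1's auto-selected region 
size maxes out at 32 MB, so very large objects may still be treated as 
"humongous" allocations.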

Also, how much impact is there from doing GC logging all the time? The metrics 
we are getting are really helpful for debugging and analysis, but we don't 
want to slow down the cluster too much.
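For what it's worth, on Java 9+ the unified logging flags let you rotate GC 
logs so they stay bounded on disk; the overhead of writing them to local disk 
is generally small. A sketch of what we're using (the property suffix, file 
path, and sizes are placeholders, not NiFi defaults):

```properties
# Rotated GC log: 5 files of ~20 MB each, with timestamps and tags
java.arg.13=-Xlog:gc*:file=/var/log/nifi/gc.log:time,uptime,level,tags:filecount=5,filesize=20m
```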

Thoughts on issues we might encounter? Things we should consider?

--Peter
