Aaron,

My guess would be that you are hitting a Full Garbage Collection. With such a huge Java heap, that will cause a "stop the world" pause for quite a long time. Which garbage collector are you using? Have you tried reducing the heap from 48 GB to say 4 or 8 GB?
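
One way to see whether long collections are actually happening is to read the JVM's own collector counters. The snippet below is only a rough sketch using the standard `java.lang.management` API; the class name `GcPauseReport` and the helper `describe` are made up for illustration. It reports on the JVM it runs in, so to look at the NiFi process itself you would need to read the same MXBeans over a remote JMX connection or turn on GC logging instead.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.Locale;

// Hypothetical helper class, not part of NiFi.
public class GcPauseReport {

    // Formats one collector's totals; the average is total time divided by collection count.
    static String describe(String name, long count, long totalMillis) {
        // The MXBean returns -1 when a value is undefined for that collector.
        if (count <= 0 || totalMillis < 0) {
            return String.format(Locale.ROOT, "%s: no collections recorded", name);
        }
        double avgMillis = (double) totalMillis / count;
        return String.format(Locale.ROOT,
                "%s: collections=%d, totalMs=%d, avgMs=%.1f",
                name, count, totalMillis, avgMillis);
    }

    public static void main(String[] args) {
        // One bean per collector (typically a young-generation and an old-generation collector).
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(
                    describe(gc.getName(), gc.getCollectionCount(), gc.getCollectionTime()));
        }
    }
}
```

If the old-generation collector shows a small number of collections with a very large average time, that would be consistent with the Full GC theory. The reported time is the JVM's approximate accumulated collection time, so treat it as a hint rather than an exact pause measurement.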
Thanks
-Mark

> On Jul 14, 2016, at 11:14 AM, Aaron Longfield <alongfi...@gmail.com> wrote:
>
> Hi,
>
> I'm having an issue with a small (two node) NiFi cluster where the nodes will stop processing any queued flowfiles. I haven't seen any error messages logged related to it, and when attempting to restart the service, NiFi doesn't respond and the script forcibly kills it. This causes multiple flowfile versions to hang around, and generally makes me feel like it might be causing data loss.
>
> I'm running the web UI on a different box, and when things stop working, it stops showing changes to counts in any queues, and the thread count never changes. It still thinks the nodes are connecting and responding, though.
>
> My environment is two 8 cpu systems w/ 60GB memory with 48GB given to the NiFi JVM in bootstrap.conf. I have timer threads limited to 12, and event threads to 4. Install is on the current Amazon Linux AMI and using OpenJDK 1.8.0.91 x64.
>
> Any ideas, other debug steps, or changes that I can try? I'm running 0.7.0, having upgraded from 0.6.1, but this has been occurring with both versions. The higher the flowfile volume I push through, the faster this happens.
>
> Thanks for any help there is to give!
>
> -Aaron Longfield