Mike, also, regarding Joe's question about backpressure "not being applied": if you're comfortable with a profiler, I think Joe and I both gravitated to 0x00000006c533b770 being locked at org.apache.nifi.provenance.PersistentProvenanceRepository.persistRecord(PersistentProvenanceRepository.java:757). It would be interesting to see whether that section takes longer over time.
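For anyone following along with a profiler: the contention pattern described downthread can be reproduced outside NiFi. Below is a minimal, self-contained sketch (class name, sleep durations, and timeout are illustrative, not NiFi's actual code) showing processor-style reader threads failing to get a ReentrantReadWriteLock read lock while a long-running maintenance-style writer holds the write lock:

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class ProvLockDemo {
    public static void main(String[] args) throws InterruptedException {
        final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

        // Simulates a maintenance thread holding the write lock during a long purge.
        Thread maintenance = new Thread(() -> {
            lock.writeLock().lock();
            try {
                TimeUnit.SECONDS.sleep(2); // stand-in for a slow purge/rollover
            } catch (InterruptedException ignored) {
            } finally {
                lock.writeLock().unlock();
            }
        });
        maintenance.start();
        TimeUnit.MILLISECONDS.sleep(100); // let the writer acquire the lock first

        // Simulates a processor committing its session: the persist path takes a read lock.
        boolean acquired = lock.readLock().tryLock(200, TimeUnit.MILLISECONDS);
        System.out.println("processor got read lock: " + acquired); // false while the purge holds the write lock

        maintenance.join();
        // Once the writer releases, readers proceed normally.
        lock.readLock().lock();
        lock.readLock().unlock();
    }
}
```

If the persistRecord section shows this shape in a profiler, the interesting question becomes what the write-lock holder (the maintenance/purge thread) is spending its time on.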
On Thu, Feb 16, 2017 at 11:56 PM, Joe Witt <joe.w...@gmail.com> wrote:
> Mike
>
> One more thing... can you please grab a couple more thread dumps for us
> with 5 to 10 mins between?
>
> I don't see a deadlock, but do suspect either just crazy slow IO going
> on or a possible livelock. The thread dumps will help narrow that down
> a bit.
>
> Can you run 'iostat -xmh 20' for a bit (or its equivalent) on the
> system too, please.
>
> Thanks
> Joe
>
> On Thu, Feb 16, 2017 at 11:52 PM, Joe Witt <joe.w...@gmail.com> wrote:
> > Mike,
> >
> > No need for more info. Heap/GC looks beautiful.
> >
> > The thread dump, however, shows some problems. The provenance
> > repository is locked up. Numerous threads are sitting here:
> >
> > at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
> > at org.apache.nifi.provenance.PersistentProvenanceRepository.persistRecord(PersistentProvenanceRepository.java:757)
> >
> > This means these are processors committing their sessions and updating
> > provenance, but they're waiting on a read lock to provenance. This lock
> > cannot be obtained because a provenance maintenance thread is
> > attempting to purge old events and cannot.
> >
> > I recall us having addressed this, so am looking to see when that was
> > addressed. If provenance is not critical for you right now, you can
> > swap out the persistent implementation for the volatile provenance
> > repository. In nifi.properties, change this line:
> >
> > nifi.provenance.repository.implementation=org.apache.nifi.provenance.PersistentProvenanceRepository
> >
> > to
> >
> > nifi.provenance.repository.implementation=org.apache.nifi.provenance.VolatileProvenanceRepository
> >
> > The behavior reminds me of this issue, which was fixed in 1.x:
> > https://issues.apache.org/jira/browse/NIFI-2395
> >
> > Need to dig into this more...
> >
> > Thanks
> > Joe
> >
> > On Thu, Feb 16, 2017 at 11:36 PM, Mikhail Sosonkin <mikh...@synack.com> wrote:
> >> Hi Joe,
> >>
> >> Thank you for your quick response. The system is currently in the
> >> deadlock state with 10 worker threads spinning, so I'll gather the
> >> info you requested.
> >>
> >> - The available space on the partition is 223G free of 500G (same as
> >>   was available for 0.6.1).
> >> - java.arg.3=-Xmx4096m in bootstrap.conf.
> >> - The thread dump and jstats are here:
> >>   https://gist.github.com/nologic/1ac064cb42cc16ca45d6ccd1239ce085
> >>
> >> Unfortunately, it's hard to predict when the decay starts, and it
> >> takes too long to monitor the system manually. However, if after
> >> seeing the attached dumps you still need thread dumps taken while it
> >> decays, I can set up a timer script.
> >>
> >> Let me know if you need any more info.
> >>
> >> Thanks,
> >> Mike.
> >>
> >> On Thu, Feb 16, 2017 at 9:54 PM, Joe Witt <joe.w...@gmail.com> wrote:
> >>> Mike,
> >>>
> >>> Can you capture a series of thread dumps as the gradual decay occurs,
> >>> noting when each was generated and specifically calling out the "now
> >>> the system is doing nothing" point? Can you check the space available
> >>> on the system during these times as well? Also, please advise on the
> >>> behavior of the heap/garbage collection. Often (not always) a gradual
> >>> decay in performance can suggest an issue with GC, as you know. Can
> >>> you run something like
> >>>
> >>> jstat -gcutil -h5 <pid> 1000
> >>>
> >>> and capture those results in chunks as well.
> >>>
> >>> This would give us a pretty good picture of the health of the system
> >>> and JVM around these times. The info is probably too much for the
> >>> mailing list, so feel free to create a JIRA for this and put
> >>> attachments there, or link to gists on GitHub, etc.
> >>>
> >>> Pretty confident we can get to the bottom of what you're seeing
> >>> quickly.
> >>>
> >>> Thanks
> >>> Joe
> >>>
> >>> On Thu, Feb 16, 2017 at 9:43 PM, Mikhail Sosonkin <mikh...@synack.com> wrote:
> >>> > Hello,
> >>> >
> >>> > Recently we upgraded from 0.6.1 to 1.1.1, and at first everything
> >>> > was working well. However, a few hours later none of the processors
> >>> > were showing any activity. I then tried restarting NiFi, which
> >>> > caused some flowfiles to get corrupted, as evidenced by exceptions
> >>> > thrown in nifi-app.log; the processors still showed no activity.
> >>> > Next, I stopped the service and deleted all state
> >>> > (content_repository, database_repository, flowfile_repository,
> >>> > provenance_repository, work). The processors then start working for
> >>> > a few hours (maybe a day) until the deadlock occurs again.
> >>> >
> >>> > So the cycle continues: I have to periodically reset the service
> >>> > and delete the state to get things moving. Obviously, that's not
> >>> > great. I'll note that the flow.xml file has been changed by the new
> >>> > version of NiFi as I added/removed processors, but 95% of the flow
> >>> > configuration is the same as before the upgrade. So I'm wondering
> >>> > if there is a configuration setting that causes these deadlocks.
> >>> >
> >>> > What I've been able to observe is that the deadlock is "gradual":
> >>> > my flow usually takes about 4-5 threads to execute, but the
> >>> > deadlock causes the worker threads to max out at the limit, and I'm
> >>> > not even able to stop any processors or list queues. I also have
> >>> > not seen this behavior in a fresh install of NiFi where the
> >>> > flow.xml starts out empty.
> >>> >
> >>> > Can you give me some advice on what to do about this? Would the
> >>> > problem be resolved if I manually rebuilt the flow with the new
> >>> > version of NiFi (not looking forward to that)?
> >>> >
> >>> > Much appreciated.
> >>> >
> >>> > Mike.
> >>> >
> >>> > This email may contain material that is confidential for the sole
> >>> > use of the intended recipient(s). Any review, reliance or
> >>> > distribution or disclosure by others without express permission is
> >>> > strictly prohibited. If you are not the intended recipient, please
> >>> > contact the sender and delete all copies of this message.