Eric, How much memory is available on the VM/system? Curious if maybe oom_killer or some process like that is killing nifi. Might want to check system logs like dmesg to see if there’s anything there.
Thanks -Mark On Apr 19, 2022, at 4:28 PM, Eric Secules <esecu...@gmail.com<mailto:esecu...@gmail.com>> wrote: Hi Joe & Mark, I'll work on getting those JVM stats. But the instance is locked down pretty tight so getting GC, and Thread dump will be difficult. We renamed the flow.xml.gz file and restarted with the hope of seeing some UI. The overall CPU usage is lower, but that is not helping the UI. Bootstrap log says: /opt/nifi/nifi-current/logs$ tail nifi-bootstrap.log 2022-04-19 19:43:03,104 INFO [main] o.a.n.b.NotificationServiceManager Successfully loaded the following 0 services: [] 2022-04-19 19:43:03,107 INFO [main] org.apache.nifi.bootstrap.RunNiFi Registered no Notification Services for Notification Type NIFI_STARTED 2022-04-19 19:43:03,107 INFO [main] org.apache.nifi.bootstrap.RunNiFi Registered no Notification Services for Notification Type NIFI_STOPPED 2022-04-19 19:43:03,107 INFO [main] org.apache.nifi.bootstrap.RunNiFi Registered no Notification Services for Notification Type NIFI_DIED 2022-04-19 19:43:03,143 INFO [main] org.apache.nifi.bootstrap.RunNiFi Runtime Java version: 1.8.0_265 2022-04-19 19:43:03,154 INFO [main] org.apache.nifi.bootstrap.Command Starting Apache NiFi... 2022-04-19 19:43:03,155 INFO [main] org.apache.nifi.bootstrap.Command Working Directory: /opt/nifi/nifi-current 2022-04-19 19:43:03,155 INFO [main] org.apache.nifi.bootstrap.Command Command: /usr/local/openjdk-8/bin/java -classpath /opt/nifi/nifi-current/./conf:/opt/nifi/nifi-current/./lib/nifi-nar-utils-1.12.1.jar:/opt/nifi/nifi-current/./lib/nifi-runtime-1.12.1.jar:/opt/nifi/nifi-current/./lib/nifi-framework-api-1.12.1.jar:/opt/nifi/nifi-current/./lib/logback-classic-1.2.3.jar:/opt/nifi/nifi-current/./lib/javax.servlet-api-3.1.0.jar:/opt/nifi/nifi-current/./lib/slf4j-api-1.7.30.jar:/opt/nifi/nifi-current/./lib/log4j-over-slf4j-1.7.30.jar:/opt/nifi/nifi-current/./lib/jul-to-slf4j-1.7.30.jar:/opt/nifi/nifi-current/./lib/nifi-properties-1.12.1.jar:/opt/nifi/nifi-current/./lib/nifi-api-1.12.1.jar:/opt/nifi/nifi-current/./lib/jcl-over-slf4j-1.7.30.jar:/opt/nifi/nifi-current/./lib/logback-core-1.2.3.jar:/opt/nifi/nifi-current/./lib/jetty-schemas-3.1.jar -Dorg.apache.jasper.compiler.disablejsr199=true -Xmx14g -Xms12g -Dcurator-log-only-first-connection-issue-as-error-level=true -Djavax.security.auth.useSubjectCredsOnly=true -Djava.security.egd=file:/dev/urandom -Dzookeeper.admin.enableServer=false -Dsun.net.http.allowRestrictedHeaders=true -Djava.net.preferIPv4Stack=true -Djava.awt.headless=true -Djava.protocol.handler.pkgs=sun.net.www.protocol -Dnifi.properties.file.path=/opt/nifi/nifi-current/./conf/nifi.properties -Dnifi.bootstrap.listen.port=34211 -Dapp=NiFi -Dorg.apache.nifi.bootstrap.config.log.dir=/opt/nifi/nifi-current/logs org.apache.nifi.NiFi 2022-04-19 19:43:08,517 INFO [main] org.apache.nifi.bootstrap.Command Launched Apache NiFi with Process ID 123 2022-04-19 19:43:13,834 INFO [NiFi Bootstrap Command Listener] org.apache.nifi.bootstrap.RunNiFi Apache NiFi now running and listening for Bootstrap requests on port 44417 Mark, our nifi.properties has no value set for that property, so we are using whatever the default is i guess. Since it's 1.14.0 we will take your suggestion and set it to a long time. CPU taken by nifi is 37% (3.5 CPU cores) and 22.2% memory the jvm was started with -Xmx14g -Xms12g Disk IO is happening, but not at a high rate. App log contains these messages before the Jetty shutdown message {"level":"INFO","thread":"main","message":"o.a.n.r.v.FileBasedVariableRegistry Loaded 113 properties from system properties and environment variables"} {"level":"INFO","thread":"main","message":"o.a.n.r.v.FileBasedVariableRegistry Loaded a total of 113 properties. Including precedence overrides effective accessible registry key size is 113"} {"level":"INFO","thread":"main","message":"o.a.n.p.store.WriteAheadStorePartition After recovering ./provenance_repository, next Event ID to be generated will be 13089684"} {"level":"INFO","thread":"main","message":"o.a.n.p.index.lucene.LuceneEventIndex Will avoid re-indexing Provenance Events because the newest index is defunct, so it will be re-indexed in the background"} {"level":"INFO","thread":"pool-20-thread-1","message":"o.a.n.p.index.lucene.LuceneEventIndex Determined that Max Event ID indexed for Partition default is approximately 13082369 based on index ./provenance_repository/lucene-8-index-1649528440360"} {"level":"INFO","thread":"pool-20-thread-1","message":"o.a.n.p.store.WriteAheadStorePartition The last Provenance Event indexed for partition default is 13072369, but the last event written to partition has ID 13089683. Re-indexing up to the last 17314 events for ./provenance_repository to ensure that the Event Index is accurate and up-to-date"} {"level":"INFO","thread":"pool-20-thread-1","message":"o.a.n.p.store.WriteAheadStorePartition Finished re-indexing 17315 events across 2 files for ./provenance_repository in 9.713 seconds"} {"level":"INFO","thread":"main","message":"o.a.n.c.repository.FileSystemRepository Maximum Threshold for Container default set to 2858730232217 bytes; if volume exceeds this size, archived data will be deleted until it no longer exceeds this size"} {"level":"INFO","thread":"main","message":"o.a.n.c.repository.FileSystemRepository Initializing FileSystemRepository with 'Always Sync' set to false"} {"level":"INFO","thread":"Thread-1","message":"org.apache.nifi.NiFi Initiating shutdown of Jetty web server..."} But there are no system error or warning messages. Despite seeing logs that the webserver is listening, we always get "connection refused" when trying to communicate with it. Thanks, Eric On Tue, Apr 19, 2022 at 12:57 PM Mark Payne <marka...@hotmail.com<mailto:marka...@hotmail.com>> wrote: Eric, I certainly agree with what Joe said. I would also recommend checking in nifi.properties if you have a value set for: nifi.monitor.long.running.task.schedule I recommend setting that to “9999 hours” In 1.14.0, we introduced the notion of a Long-Running Task Monitor. It’s generally very fast. Typically runs in 10s of milliseconds on my macbook. But it relies on JVM-specific code, and we’ve seen in some environments that can cause the UI responsiveness to be very adversely affected. We disabled the task monitor by default in 1.15, I believe, because of this. Thanks -Mark On Apr 19, 2022, at 3:44 PM, Joe Witt <joe.w...@gmail.com<mailto:joe.w...@gmail.com>> wrote: Eric When the UI isn't responsive it would be great to have a snapshot of: - CPU usage at that time - GC behavior/logging at/around that time. - IO Utilization around that time - NiFi Thread dump precisely during it and ideally also one after it responds again NiFi Restarting itself is very interesting of course. There should be more in the app log and bootstrap that will help illuminate the issue then. Thanks On Tue, Apr 19, 2022 at 12:42 PM Eric Secules <esecu...@gmail.com<mailto:esecu...@gmail.com>> wrote: By the way, I am running NiFi 1.14.0 and it looks like it keeps restarting itself. I am seeing this in the logs about once an hour. {"level":"INFO","thread":"main","message":"org.apache.nifi.NiFi Controller initialization took 4737354582168 nanoseconds (4737 seconds)."} On Tue, Apr 19, 2022 at 12:34 PM Eric Secules <esecu...@gmail.com<mailto:esecu...@gmail.com>> wrote: Hello, When my nifi system goes under high load the web UI becomes unresponsive until load comes down. Is there a way I can see what's going on (processor status summary, queued count, active thread count) when the UI is unresponsive? The logs are not showing any errors and the various repositories are all mounted to separate volumes with elastic capacity so I am sure that none of them ran out of space. Our monitoring shows bursts of CPU usage and memory use lower than normal. The logs show that the StandardProcessScheduler stops processors followed by starting them, but I never see logs related to the UI being ready to serve. It does this about once an hour. I see that the flow is slowly processing based on log activity and databases. How can I see what's going on when the web UI is not responding? Thanks, Eric