Thanks Dave. I am not familiar with Prometheus JVM metrics, but I gather it's an open source solution that you have coupled with Grafana for visualization. I will have a look into this.
Best,
Marco

On Tue, Jul 11, 2023 at 9:32 AM Dave Reynolds <dave.e.reyno...@gmail.com> wrote:

> Hi Marco,

> On 11/07/2023 09:04, Marco Neumann wrote:
> > Dave, can you say a bit more about the profiling methodology? Are you using a tool such as VisualVM to collect the data? Or do you just use the system monitor?

> The JVM metrics here are from prometheus scanning the metrics exposed by fuseki via the built-in micrometer (displayed using grafana). They give a *lot* of detail on things like GC behaviour etc which I'm not showing.
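> In case a concrete sketch helps: as far as I can tell the built-in micrometer support is essentially the standard micrometer JVM binders behind a prometheus registry. A minimal standalone sketch of the same idea (not Fuseki's actual wiring; the class name and choice of binders here are just illustrative) is roughly:

> import io.micrometer.core.instrument.binder.jvm.JvmGcMetrics;
> import io.micrometer.core.instrument.binder.jvm.JvmMemoryMetrics;
> import io.micrometer.prometheus.PrometheusConfig;
> import io.micrometer.prometheus.PrometheusMeterRegistry;
>
> public class JvmMetricsSketch {
>     public static void main(String[] args) {
>         PrometheusMeterRegistry registry = new PrometheusMeterRegistry(PrometheusConfig.DEFAULT);
>         // Heap/non-heap gauges plus the jvm.buffer.* gauges that the direct
>         // and mapped buffer graphs are drawn from.
>         new JvmMemoryMetrics().bindTo(registry);
>         try (JvmGcMetrics gcMetrics = new JvmGcMetrics()) {
>             // GC pause timings plus allocation/promotion counters.
>             gcMetrics.bindTo(registry);
>             // Prometheus text format; fuseki serves the equivalent over HTTP
>             // and prometheus pulls it on each scrape.
>             System.out.println(registry.scrape());
>         }
>     }
> }

> Grafana then just charts whatever series prometheus has scraped and stored.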
> Ironically, the only thing this fuseki was doing when it died originally was supporting these metric scans, and the health check ping.

> The overall memory curve is picked up by telegraf scanning the OS level stats for the docker processes (collected via influx DB and again displayed in grafana). These are what you would get with e.g. top on the machine or a system monitor, but it means we have longer term records which we can access remotely. When I quoted 240K RES, 35K shared, that was actually just top on the machine.

> When running locally you can also use things like jconsole or visualVM, but I actually find the prometheus + telegraf metrics we have in our production monitoring more detailed and easier to work with. We run lots of services, so the monitoring and alerting stack, while all industry standard, has been a life saver for us.

> For doing the debugging locally I also tried setting the JVM flags to enable finer-grained native memory tracking and used jcmd (in a scripted loop) to pull out those more detailed metrics, though they are not that much more detailed than the micrometer/prometheus metrics.
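> The loop itself is nothing special; sketched in Java here just for illustration (the pid argument, the five minute interval and writing to stdout are arbitrary choices, and the target JVM needs to be started with -XX:NativeMemoryTracking=summary, or =detail, for jcmd to have anything to report):

> import java.util.concurrent.TimeUnit;
>
> public class NativeMemoryPoll {
>     public static void main(String[] args) throws Exception {
>         String pid = args[0]; // pid of the fuseki JVM
>         while (true) {
>             // Same output as running "jcmd <pid> VM.native_memory summary" by hand.
>             new ProcessBuilder("jcmd", pid, "VM.native_memory", "summary")
>                     .inheritIO()
>                     .start()
>                     .waitFor();
>             TimeUnit.MINUTES.sleep(5);
>         }
>     }
> }

> Diffing successive summaries shows which NMT category, if any, is growing in step with RES.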
> That use of jcmd and the caution on how to interpret RES came from the blog item I mentioned earlier:
> https://poonamparhar.github.io/troubleshooting_native_memory_leaks/

> For the memory leak checking I used valgrind but there's lots of others.

> Dave

> > Marco

> > On Tue, Jul 11, 2023 at 8:57 AM Dave Reynolds <dave.e.reyno...@gmail.com> wrote:

> >> For interest[*] ...

> >> This is what the core JVM metrics look like when transitioning from a Jetty 10 to a Jetty 9.4 instance. You can see the direct buffer cycling up to 500MB (which happens to be the max heap setting) on Jetty 10, nothing on Jetty 9. The drop in Mapped buffers is just because TDB hadn't been asked any queries yet.

> >> https://www.dropbox.com/scl/fi/9afhrztbb36fvzqkuw996/fuseki-jetty10-jetty9-transition.png?rlkey=7fpj4x1pn5mjnf3jjwenmp65m&dl=0

> >> Here's the same metrics around the time of triggering a TDB backup. It shows the mapped buffer use for TDB but no significant impact on heap etc.

> >> https://www.dropbox.com/scl/fi/0s40vpizf94c4w3m2awna/fuseki-jetty10-backup.png?rlkey=ai31m6z58w0uex8zix8e9ctna&dl=0

> >> These are all on the same instance as the RES memory trace:

> >> https://www.dropbox.com/scl/fi/c58nqkr2hi193a84btedg/fuseki-4.9.0-jetty-9.4.png?rlkey=b7osnj6k1oy1xskl4j25zz6o8&dl=0

> >> Dave

> >> [*] I've been staring at metric graphs for so many days I may have a distorted notion of what's interesting :)

> >> On 11/07/2023 08:39, Dave Reynolds wrote:
> >>> After a 10 hour test of 4.9.0 with Jetty 9.4 on java 17 in the production, containerized, environment, it is indeed very stable.

> >>> Running at less than 6% of memory on a 4GB machine compared to peaks of ~50% for versions with Jetty 10. RES shows as 240K with 35K shared (presumably mostly libraries).

> >>> Copy of trace is:
> >>> https://www.dropbox.com/scl/fi/c58nqkr2hi193a84btedg/fuseki-4.9.0-jetty-9.4.png?rlkey=b7osnj6k1oy1xskl4j25zz6o8&dl=0

> >>> The high spikes on the left of the image are the prior run with out-of-the-box 4.7.0 on the same JVM.

> >>> The small spike at 06:00 is a dump, so TDB was able to touch and scan all the (modest) data with a very minor blip in resident size (as you'd hope). JVM stats show the mapped buffers for TDB jumping up but confirm heap is stable at < 60M, non-heap 60M.

> >>> Dave

> >>> On 10/07/2023 20:52, Dave Reynolds wrote:
> >>>> Since this thread has got complex, I'm posting this update here at the top level.

> >>>> Thanks to folks, especially Andy and Rob, for suggestions and for investigating.

> >>>> After a lot more testing at our end I believe we now have some workarounds.

> >>>> First, at least on java 17, the process growth does seem to level out. Despite what I just said to Rob, having just checked our soak tests, a jena 4.7.0/java 17 test with 500MB max heap has lasted for 7 days. Process size oscillates between 1.5GB and 2GB but hasn't gone above that in a week. The oscillation is almost entirely the cycling of the direct memory buffers used by Jetty. Empirically those cycle up to something comparable to the set max heap size, at least for us (presumably because -XX:MaxDirectMemorySize defaults to the max heap size when it isn't set explicitly).

> >>>> While this week long test was 4.7.0, based on earlier tests I suspect 4.8.0 (and now 4.9.0) would also level out, at least on a timescale of days.

> >>>> The key has been setting the max heap low. At 2GB, and even 1GB (the default on a 4GB machine), we see higher peak levels of direct buffers and the overall process size grew to around 3GB, at which point the container is killed on the small machines. Though java 17 does seem to be better behaved than java 11, so switching to that probably also helped.

> >>>> Given the actual heap is low (50MB heap, 60MB non-heap), needing 2GB to run in feels high but is workable. So my previously suggested rule of thumb for this low memory regime, allow 4x the max heap size, seems to work.

> >>>> Second, we're now pretty confident the issue is jetty 10+.

> >>>> We've built a fuseki-server 4.9.0 with Jetty replaced by version 9.4.51.v20230217. This required some minor source changes to compile and pass tests. On a local bare metal test where we saw process growth up to 1.5-2GB, this build has run stably using less than 500MB for 4 hours.

> >>>> We'll set a longer term test running in the target containerized environment to confirm things, but we're quite hopeful this will be long term stable.

> >>>> I realise Jetty 9.4.x is out of community support but eclipse say EOL is "unlikely to happen before 2025". So, while this may not be a solution for the Jena project, it could give us a workaround at the cost of doing custom builds.

> >>>> Dave

> >>>> On 03/07/2023 14:20, Dave Reynolds wrote:
> >>>>> We have a very strange problem with recent fuseki versions when running (in docker containers) on small machines. Suspect a jetty issue but it's not clear.

> >>>>> Wondering if anyone has seen anything like this.

> >>>>> This is a production service but with tiny data (~250k triples, ~60MB as NQuads). Runs on 4GB machines with a java heap allocation of 500MB[1].

> >>>>> We used to run using 3.16 on jdk 8 (AWS Corretto for the long term support) with no problems.

> >>>>> Switching to fuseki 4.8.0 on jdk 11, the process grows in the space of a day or so to reach ~3GB of memory, at which point the 4GB machine becomes unviable and things get OOM killed.

> >>>>> The strange thing is that this growth happens when the system is answering no Sparql queries at all, just regular health ping checks and (prometheus) metrics scrapes from the monitoring systems.

> >>>>> Furthermore, the space being consumed is not visible to any of the JVM metrics:
> >>>>> - Heap and non-heap are stable at around 100MB total (mostly non-heap metaspace).
> >>>>> - Mapped buffers stay at 50MB and remain long term stable.
> >>>>> - Direct memory buffers are allocated up to around 500MB and then reclaimed. Since there are no sparql queries at all, we assume this is jetty NIO buffers being churned as a result of the metric scrapes. However, this direct buffer behaviour seems stable; it cycles between 0 and 500MB on approx a 10 min cycle but is stable over a period of days and shows no leaks.

> >>>>> Yet the java process grows from an initial 100MB to at least 3GB. This can occur in the space of a couple of hours or can take up to a day or two, with no predictability in how fast.

> >>>>> Presumably there is some low level JNI space allocated by Jetty (?) which is invisible to all the JVM metrics and is not being reliably reclaimed.

> >>>>> Trying 4.6.0, which we've had fewer problems with elsewhere, it seems to grow to around 1GB (plus up to 0.5GB for the cycling direct memory buffers) and then stay stable (at least on a three day soak test). We could live with allocating 1.5GB to a system that should only need a few 100MB, but we're concerned that it may not be stable in the really long term and, in any case, we would rather be able to update to more recent fuseki versions.

> >>>>> Trying 4.8.0 on java 17, it grows rapidly to around 1GB again but then keeps ticking up slowly at random intervals. We project that it would take a few weeks to grow to the scale it did under java 11, but it will still eventually kill the machine.

> >>>>> Anyone seen anything remotely like this?

> >>>>> Dave

> >>>>> [1] 500M heap may be overkill but there can be some complex queries and that should still leave plenty of space for OS buffers etc in the remaining memory on a 4GB machine.

--
---
Marco Neumann