Dave, can you say a bit more about the profiling methodology? Are you using a tool such as VisualVM to collect the data? Or do you just use the system monitor?

Marco
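For reference, the figures discussed in this thread (heap, non-heap, and the mapped/direct buffer pools) are exposed by the JVM itself over JMX, so they can be read with VisualVM, scraped into Prometheus, or polled with a few lines of plain Java. A minimal, dependency-free sketch, not necessarily the setup used for the graphs below:

    import java.lang.management.BufferPoolMXBean;
    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryMXBean;
    import java.util.List;

    // Poll the JVM's own view of heap, non-heap and NIO buffer pools,
    // the same figures a tool like VisualVM reports.
    public class JvmMemoryPoll {
        public static void main(String[] args) throws InterruptedException {
            MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
            List<BufferPoolMXBean> pools =
                    ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class);
            while (true) {
                System.out.printf("heap=%dMB non-heap=%dMB%n",
                        mem.getHeapMemoryUsage().getUsed() >> 20,
                        mem.getNonHeapMemoryUsage().getUsed() >> 20);
                for (BufferPoolMXBean p : pools) {
                    // The "direct" and "mapped" pools cover NIO buffers;
                    // they do NOT cover raw malloc/JNI allocations.
                    System.out.printf("  %s: used=%dMB capacity=%dMB count=%d%n",
                            p.getName(), p.getMemoryUsed() >> 20,
                            p.getTotalCapacity() >> 20, p.getCount());
                }
                Thread.sleep(10_000);
            }
        }
    }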
On Tue, Jul 11, 2023 at 8:57 AM Dave Reynolds <dave.e.reyno...@gmail.com> wrote:

> For interest[*] ...
>
> This is what the core JVM metrics look like when transitioning from a Jetty 10 to a Jetty 9.4 instance. You can see the direct buffers cycling up to 500MB (which happens to be the max heap setting) on Jetty 10, nothing on Jetty 9. The drop in Mapped buffers is just because TDB hadn't been asked any queries yet.
>
> https://www.dropbox.com/scl/fi/9afhrztbb36fvzqkuw996/fuseki-jetty10-jetty9-transition.png?rlkey=7fpj4x1pn5mjnf3jjwenmp65m&dl=0
>
> Here are the same metrics around the time of triggering a TDB backup, showing the mapped buffer use for TDB but no significant impact on heap etc.
>
> https://www.dropbox.com/scl/fi/0s40vpizf94c4w3m2awna/fuseki-jetty10-backup.png?rlkey=ai31m6z58w0uex8zix8e9ctna&dl=0
>
> These are all on the same instance as the RES memory trace:
>
> https://www.dropbox.com/scl/fi/c58nqkr2hi193a84btedg/fuseki-4.9.0-jetty-9.4.png?rlkey=b7osnj6k1oy1xskl4j25zz6o8&dl=0
>
> Dave
>
> [*] I've been staring at metric graphs for so many days I may have a distorted notion of what's interesting :)
>
> On 11/07/2023 08:39, Dave Reynolds wrote:
> > After a 10-hour test of 4.9.0 with Jetty 9.4 on java 17 in the production, containerized environment, it is indeed very stable.
> >
> > It runs at less than 6% of memory on a 4GB machine, compared to peaks of ~50% for versions with Jetty 10. RES shows as 240K with 35K shared (presumably mostly libraries).
> >
> > A copy of the trace is:
> > https://www.dropbox.com/scl/fi/c58nqkr2hi193a84btedg/fuseki-4.9.0-jetty-9.4.png?rlkey=b7osnj6k1oy1xskl4j25zz6o8&dl=0
> >
> > The high spikes on the left of the image are the prior run with out-of-the-box 4.7.0 on the same JVM.
> >
> > The small spike at 06:00 is a dump, so TDB was able to touch and scan all the (modest) data with a very minor blip in resident size (as you'd hope). JVM stats show the mapped buffers for TDB jumping up but confirm the heap is stable at < 60M, non-heap at 60M.
> >
> > Dave
> >
> > On 10/07/2023 20:52, Dave Reynolds wrote:
> >> Since this thread has got complex, I'm posting this update here at the top level.
> >>
> >> Thanks to folks, especially Andy and Rob, for suggestions and for investigating.
> >>
> >> After a lot more testing at our end I believe we now have some workarounds.
> >>
> >> First, at least on java 17, the process growth does seem to level out. Despite what I just said to Rob, having just checked our soak tests, a jena 4.7.0/java 17 test with a 500MB max heap has lasted for 7 days. Process size oscillates between 1.5GB and 2GB but hasn't gone above that in a week. The oscillation is almost entirely the cycling of the direct memory buffers used by Jetty. Empirically those cycle up to something comparable to the set max heap size, at least for us.
> >>
> >> While this week-long test was on 4.7.0, based on earlier tests I suspect 4.8.0 (and now 4.9.0) would also level out, at least on a timescale of days.
> >>
> >> The key has been setting the max heap low. At 2GB, and even 1GB (the default on a 4GB machine), we saw higher peak levels of direct buffers, and overall process size grew to around 3GB, at which point the container is killed on the small machines. Java 17 also seems to be better behaved than java 11, so switching to that probably helped too.
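For context, that observation matches the HotSpot default: unless -XX:MaxDirectMemorySize is set explicitly, the ceiling for direct buffer allocation defaults to roughly the max heap size, which would explain the direct buffers cycling up to about -Xmx. A standalone sketch to observe that ceiling on a given JVM (nothing Fuseki-specific; the class name is illustrative):

    import java.nio.ByteBuffer;
    import java.util.ArrayList;
    import java.util.List;

    // Allocate direct buffers until the JVM refuses. Run with e.g.
    //   java -Xmx500m DirectCap                                (cap defaults to max heap)
    //   java -Xmx500m -XX:MaxDirectMemorySize=128m DirectCap   (explicit cap)
    // and compare where the OutOfMemoryError lands.
    public class DirectCap {
        public static void main(String[] args) {
            List<ByteBuffer> hold = new ArrayList<>(); // keep buffers reachable
            long allocatedMb = 0;
            try {
                while (true) {
                    hold.add(ByteBuffer.allocateDirect(16 << 20)); // 16MB chunks
                    allocatedMb += 16;
                }
            } catch (OutOfMemoryError e) {
                // HotSpot reports "Direct buffer memory" when the cap is hit
                System.out.println("Direct allocation stopped after ~"
                        + allocatedMb + "MB: " + e.getMessage());
            }
        }
    }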
> >> Given the actual heap is low (50MB heap, 60MB non-heap), needing 2GB to run in feels high but is workable. So my previously suggested rule of thumb, that in this low-memory regime you should allow 4x the max heap size, seems to work.
> >>
> >> Second, we're now pretty confident the issue is jetty 10+.
> >>
> >> We've built a fuseki-server 4.9.0 with Jetty replaced by version 9.4.51.v20230217. This required some minor source changes to compile and pass tests. On a local bare-metal test, where we previously saw process growth up to 1.5-2GB, this build has run stably using less than 500MB for 4 hours.
> >>
> >> We'll set a longer-term test running in the target containerized environment to confirm, but we are quite hopeful this will be stable in the long term.
> >>
> >> I realise Jetty 9.4.x is out of community support, but eclipse say EOL is "unlikely to happen before 2025". So, while this may not be a solution for the Jena project, it could give us a workaround at the cost of doing custom builds.
> >>
> >> Dave
> >>
> >> On 03/07/2023 14:20, Dave Reynolds wrote:
> >>> We have a very strange problem with recent fuseki versions when running (in docker containers) on small machines. We suspect a jetty issue, but it's not clear.
> >>>
> >>> Wondering if anyone has seen anything like this.
> >>>
> >>> This is a production service but with tiny data (~250k triples, ~60MB as NQuads). It runs on 4GB machines with a java heap allocation of 500MB[1].
> >>>
> >>> We used to run 3.16 on jdk 8 (AWS Corretto for the long-term support) with no problems.
> >>>
> >>> Switching to fuseki 4.8.0 on jdk 11, the process grows in the space of a day or so to reach ~3GB of memory, at which point the 4GB machine becomes unviable and things get OOM killed.
> >>>
> >>> The strange thing is that this growth happens when the system is answering no SPARQL queries at all, just regular health ping checks and (prometheus) metrics scrapes from the monitoring systems.
> >>>
> >>> Furthermore, the space being consumed is not visible to any of the JVM metrics:
> >>> - Heap and non-heap are stable at around 100MB total (mostly non-heap metaspace).
> >>> - Mapped buffers stay at 50MB and remain stable long term.
> >>> - Direct memory buffers are allocated up to around 500MB and then reclaimed. Since there are no SPARQL queries at all, we assume this is jetty NIO buffers being churned as a result of the metric scrapes. However, this direct buffer behaviour seems stable: it cycles between 0 and 500MB on approximately a 10-minute cycle, is stable over a period of days, and shows no leaks.
> >>>
> >>> Yet the java process grows from an initial 100MB to at least 3GB. This can occur in the space of a couple of hours or can take up to a day or two, with no predictability in how fast.
> >>>
> >>> Presumably there is some low-level JNI space allocated by Jetty (?) which is invisible to all the JVM metrics and is not being reliably reclaimed.
> >>>
> >>> Trying 4.6.0, which we've had fewer problems with elsewhere, it seems to grow to around 1GB (plus up to 0.5GB for the cycling direct memory buffers) and then stays stable (at least on a three-day soak test). We could live with allocating 1.5GB to a system that should only need a few hundred MB, but we are concerned that it may not be stable in the really long term and, in any case, would rather be able to update to more recent fuseki versions.
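One way to quantify the space that is invisible to the JVM metrics described above is to compare what the JVM can account for with the kernel's view of the process. A Linux-only sketch, assuming /proc is readable from inside the container (class name and output format are illustrative):

    import java.io.IOException;
    import java.lang.management.BufferPoolMXBean;
    import java.lang.management.ManagementFactory;
    import java.nio.file.Files;
    import java.nio.file.Path;

    // Compare JVM-accounted memory (heap + non-heap + NIO buffer pools)
    // with the resident set size the kernel reports for this process.
    // A large, growing gap indicates native allocations the JVM metrics
    // cannot see (malloc arenas, JNI, etc.).
    public class RssGap {
        static long rssMb() throws IOException {
            for (String line : Files.readAllLines(Path.of("/proc/self/status"))) {
                if (line.startsWith("VmRSS:")) {
                    // e.g. "VmRSS:   245760 kB"
                    return Long.parseLong(line.replaceAll("\\D", "")) / 1024;
                }
            }
            return -1; // not on Linux
        }

        public static void main(String[] args) throws IOException {
            var mem = ManagementFactory.getMemoryMXBean();
            long jvmVisible = mem.getHeapMemoryUsage().getUsed()
                    + mem.getNonHeapMemoryUsage().getUsed();
            for (BufferPoolMXBean p :
                    ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
                jvmVisible += p.getMemoryUsed(); // "direct" and "mapped" pools
            }
            long rss = rssMb();
            System.out.printf("JVM-visible=%dMB RSS=%dMB gap=%dMB%n",
                    jvmVisible >> 20, rss, rss - (jvmVisible >> 20));
        }
    }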
> >>> Trying 4.8.0 on java 17, it grows rapidly to around 1GB again but then keeps ticking up slowly at random intervals. We project that it would take a few weeks to grow to the scale it did under java 11, but it would still eventually kill the machine.
> >>>
> >>> Anyone seen anything remotely like this?
> >>>
> >>> Dave
> >>>
> >>> [1] 500M heap may be overkill, but there can be some complex queries, and that should still leave plenty of space for OS buffers etc. in the remaining memory on a 4GB machine.

--
---
Marco Neumann