For interest[*] ...
This is what the core JVM metrics look like when transitioning from a
Jetty 10 to a Jetty 9.4 instance. You can see the direct buffers cycling
up to 500MB (which happens to be the max heap setting) on Jetty 10, and
nothing on Jetty 9. The drop in Mapped buffers is just because TDB hadn't
been asked any queries yet.
https://www.dropbox.com/scl/fi/9afhrztbb36fvzqkuw996/fuseki-jetty10-jetty9-transition.png?rlkey=7fpj4x1pn5mjnf3jjwenmp65m&dl=0
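(In case anyone wants to cross-check these numbers without a Prometheus
stack: the direct and mapped figures in these graphs are just the
standard JDK buffer pool beans. A minimal sketch, nothing Fuseki-specific
assumed:

import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;

public class BufferPools {
    public static void main(String[] args) {
        // The "direct" pool is where Jetty's NIO buffers show up,
        // the "mapped" pool is where TDB's memory-mapped files show up.
        for (BufferPoolMXBean pool :
                ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            System.out.printf("%-8s count=%d used=%dMB capacity=%dMB%n",
                    pool.getName(),
                    pool.getCount(),
                    pool.getMemoryUsed() / (1024 * 1024),
                    pool.getTotalCapacity() / (1024 * 1024));
        }
    }
}
)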
Here are the same metrics around the time of triggering a TDB backup.
They show the mapped buffer use for TDB but no significant impact on the
heap etc.
https://www.dropbox.com/scl/fi/0s40vpizf94c4w3m2awna/fuseki-jetty10-backup.png?rlkey=ai31m6z58w0uex8zix8e9ctna&dl=0
These are all on the same instance as the RES memory trace:
https://www.dropbox.com/scl/fi/c58nqkr2hi193a84btedg/fuseki-4.9.0-jetty-9.4.png?rlkey=b7osnj6k1oy1xskl4j25zz6o8&dl=0
Dave
[*] I've been staring at metric graphs for so many days I may have a
distorted notion of what's interesting :)
On 11/07/2023 08:39, Dave Reynolds wrote:
After a 10 hour test of 4.9.0 with Jetty 9.4 on java 17 in the
production, containerized, environment, it is indeed very stable.
It runs at less than 6% of memory on a 4GB machine, compared to peaks of
~50% for versions with Jetty 10. RES shows as 240K with 35K shared
(presumably mostly libraries).
Copy of trace is:
https://www.dropbox.com/scl/fi/c58nqkr2hi193a84btedg/fuseki-4.9.0-jetty-9.4.png?rlkey=b7osnj6k1oy1xskl4j25zz6o8&dl=0
The high spikes on the left of the image are from the prior run with
out-of-the-box 4.7.0 on the same JVM.
The small spike at 06:00 is a dump, so TDB was able to touch and scan all
the (modest) data with only a very minor blip in resident size (as you'd
hope). JVM stats show the mapped buffers for TDB jumping up but confirm
the heap is stable at < 60M, with non-heap at 60M.
Dave
On 10/07/2023 20:52, Dave Reynolds wrote:
Since this thread has got complex, I'm posting this update here at the
top level.
Thanks to folks, especially Andy and Rob for suggestions and for
investigating.
After a lot more testing at our end I believe we now have some
workarounds.
First, at least on java 17, the process growth does seem to level out.
Despite what I just said to Rob, having just checked our soak tests, a
jena 4.7.0/java 17 test with 500MB max heap has lasted for 7 days.
Process size oscillates between 1.5GB and 2GB but hasn't gone above
that in a week. The oscillation is almost entirely the cycling of the
direct memory buffers used by Jetty. Empirically those cycle up to
something comparable to the set max heap size, at least for us.
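(My understanding of why the direct buffers track the heap setting: on
HotSpot, if -XX:MaxDirectMemorySize isn't set, the direct buffer limit
defaults to the max heap size, so capping the heap also caps what
Jetty's buffer pools can reach. A quick sketch to check what a given JVM
ended up with; it uses the HotSpot-specific com.sun.management bean, so
treat it as HotSpot-only:

import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

public class DirectLimit {
    public static void main(String[] args) {
        HotSpotDiagnosticMXBean hotspot =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        // "0" means unset, in which case the effective direct buffer
        // limit defaults to the max heap size (Runtime.maxMemory()).
        String configured = hotspot.getVMOption("MaxDirectMemorySize").getValue();
        long maxHeapMB = Runtime.getRuntime().maxMemory() / (1024 * 1024);
        System.out.println("MaxDirectMemorySize = " + configured);
        System.out.println("Max heap            = " + maxHeapMB + "MB");
    }
}
)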
While this week long test was 4.7.0, based on earlier tests I suspect
4.8.0 (and now 4.9.0) would also level out at least on a timescale of
days.
The key has been setting the max heap low. At 2GB, and even 1GB (the
default on a 4GB machine), we saw higher peak levels of direct buffers
and the overall process size grew to around 3GB, at which point the
container is killed on the small machines. That said, java 17 does seem
to be better behaved than java 11, so switching to that probably also
helped.
Given that the actual heap use is low (50MB heap, 60MB non-heap),
needing 2GB to run in feels high but is workable. So my previously
suggested rule of thumb, that in this low memory regime you should allow
4x the max heap size, seems to work.
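(Back-of-envelope for where the 4x comes from in our case; the numbers
below are just the ones already quoted in this thread, nothing more
authoritative:

public class RoughBudget {
    public static void main(String[] args) {
        // Illustrative only: figures as observed in this thread.
        int maxHeapMB   = 500;        // -Xmx500m
        int directMB    = maxHeapMB;  // Jetty direct buffers cycle up to roughly the heap cap
        int nonHeapMB   = 100;        // metaspace, code cache etc.
        int nativeEtcMB = 400;        // TDB mapped files, thread stacks, allocator slack (the fuzzy bit)

        int totalMB = maxHeapMB + directMB + nonHeapMB + nativeEtcMB;
        // ~1500MB, in line with the observed 1.5-2GB oscillation;
        // budgeting 4x the max heap (2GB) just leaves some headroom on top.
        System.out.println("Rough process ceiling: " + totalMB + "MB");
    }
}
)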
Second, we're now pretty confident the issue is jetty 10+.
We've built a fuseki-server 4.9.0 with Jetty replaced by version
9.4.51.v20230217. This required some minor source changes to compile
and pass tests. On a local bare metal test where we saw process growth
up to 1.5-2GB this build has run stably using less than 500MB for 4
hours.
We'll set a longer term test running in the target containerized
environment to confirm things, but we're quite hopeful this will be
stable in the long term.
I realise Jetty 9.4.x is out of community support but Eclipse says EOL
is "unlikely to happen before 2025". So, while this may not be a
solution for the Jena project, it could give us a workaround at the
cost of doing custom builds.
Dave
On 03/07/2023 14:20, Dave Reynolds wrote:
We have a very strange problem with recent fuseki versions when
running (in docker containers) on small machines. Suspect a jetty
issue but it's not clear.
Wondering if anyone has seen anything like this.
This is a production service but with tiny data (~250k triples, ~60MB
as NQuads). Runs on 4GB machines with java heap allocation of 500MB[1].
We used to run using 3.16 on jdk 8 (AWS Corretto for the long term
support) with no problems.
Switching to fuseki 4.8.0 on jdk 11 the process grows in the space of
a day or so to reach ~3GB of memory at which point the 4GB machine
becomes unviable and things get OOM killed.
The strange thing is that this growth happens when the system is
answering no Sparql queries at all, just regular health ping checks
and (prometheus) metrics scrapes from the monitoring systems.
Furthermore, the space being consumed is not visible to any of the JVM
metrics:
- Heap and non-heap are stable at around 100MB total (mostly
non-heap metaspace).
- Mapped buffers stay at 50MB and remain long term stable.
- Direct memory buffers are allocated up to around 500MB and then
reclaimed. Since there are no SPARQL queries at all, we assume this is
Jetty NIO buffers being churned as a result of the metrics scrapes.
However, this direct buffer behaviour seems stable: it cycles between
0 and 500MB on approximately a 10 minute period, but holds steady over
a period of days and shows no leaks.
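(For anyone trying to reproduce the no-query load: it is just whatever
the monitoring does, i.e. polling the standard Fuseki /$/metrics
endpoint. A minimal stand-in for the scraper; the host, port and 15s
interval are placeholders, not our actual monitoring config:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ScrapeLoop {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        // Placeholder URL: Fuseki exposes Prometheus-format metrics at /$/metrics.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:3030/$/metrics"))
                .GET()
                .build();
        while (true) {
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println("scrape: " + response.statusCode()
                    + " " + response.body().length() + " bytes");
            Thread.sleep(15_000);   // assumed 15s scrape interval
        }
    }
}
)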
Yet the java process grows from an initial 100MB to at least 3GB.
This can occur in the space of a couple of hours or can take up to a
day or two with no predictability in how fast.
Presumably there is some low level JNI space allocated by Jetty (?)
which is invisible to all the JVM metrics and is not being reliably
reclaimed.
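(One way to try to pin that down would be the JVM's Native Memory
Tracking: start the process with -XX:NativeMemoryTracking=summary and
pull a report with jcmd <pid> VM.native_memory summary. The sketch below
fetches the same report in-process via the diagnostic command MBean; the
MBean and operation names are my understanding of how jcmd commands are
exposed over JMX, so treat them as an assumption to verify. Anything
malloc'd entirely outside the JVM, e.g. glibc arena growth, still won't
show up there.

import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class NativeMemoryReport {
    public static void main(String[] args) throws Exception {
        // Requires the JVM to be started with -XX:NativeMemoryTracking=summary.
        // HotSpot-specific; the MBean/operation names below are assumptions
        // about how jcmd's VM.native_memory command is exposed over JMX.
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        ObjectName diagnostics = new ObjectName("com.sun.management:type=DiagnosticCommand");
        String report = (String) server.invoke(
                diagnostics,
                "vmNativeMemory",
                new Object[] { new String[] { "summary" } },
                new String[] { String[].class.getName() });
        System.out.println(report);
    }
}
)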
Trying 4.6.0, which we've had fewer problems with elsewhere, it seems
to grow to around 1GB (plus up to 0.5GB for the cycling direct memory
buffers) and then stays stable (at least on a three day soak test). We
could live with allocating 1.5GB to a system that should only need a
few hundred MB, but we're concerned that it may not be stable in the
really long term and, in any case, we would rather be able to update to
more recent fuseki versions.
Trying 4.8.0 on java 17, it grows rapidly to around 1GB again but then
keeps ticking up slowly at random intervals. We project that it would
take a few weeks to grow to the scale it did under java 11, but it will
still eventually kill the machine.
Anyone seen anything remotely like this?
Dave
[1] 500M heap may be overkill but there can be some complex queries
and that should still leave plenty of space for OS buffers etc in the
remaining memory on a 4GB machine.