Hey Andy,

Metaspace seems to be stable!
We’re running this on Java 11 currently. I can check it out with Java 17
though. We’ve currently set Xms/Xmx to 2560MB and MaxMetaspaceSize to
256MB. The ECS task is set with a ceiling of 4GB memory and 1 vCPU.

Could it be more of a race condition than the size of used objects, due
to logging? I do see some time-sensitive eviction code in Jetty -
https://github.com/eclipse/jetty.project/blob/9e16d81cf8922c75e3d2d96c66442b896a9c69e1/jetty-io/src/main/java/org/eclipse/jetty/io/ArrayRetainableByteBufferPool.java#L374
Not sure if the same type of thing exists in the Jena codebase.

I will try to check with `--empty` also.

Thanks,
Conal

On 2023/07/19 21:10:24 Andy Seaborne wrote:
> Conal,
>
> Thanks for the information.
> Can you see if metaspace is growing as well?
>
> All,
>
> Could someone please try running Fuseki main, with no datasets (--empty)
> and some healthcheck ping traffic.
>
>     Andy
>
> On 19/07/2023 14:42, Conal McLaughlin wrote:
> > Hey Dave,
> >
> > Thank you for providing an in-depth analysis of your issues.
> > We appear to be witnessing the same type of problems with our current
> > Fuseki deployment.
> > We are deploying a containerised Fuseki into an AWS ECS task alongside
> > other containers - this may not be ideal but that’s a different story.
> >
> > I just wanted to add another data point to everything you have described.
> > Firstly, it does seem like “idle” (or very low traffic) instances are
> > the problem for us (coupled with a larger heap than necessary).
> > We witness the same increase in the ECS task memory consumption up until
> > the whole thing is killed off, which includes the Fuseki container.
> >
> > In an attempt to see what was going on beneath the hood, we turned up
> > the logging to TRACE in the log4j2.xml file provided to Fuseki.
> > This appeared to stabilise the increasing memory consumption.
> > Even just switching the `logger.jetty.level` to TRACE alleviates the issue.
> > Colour me confused!
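For concreteness, a sketch of the JVM settings described above (flag spellings are the standard HotSpot ones). The NativeMemoryTracking flag is a hypothetical extra for chasing off-heap growth, not part of our current setup:

```shell
# Flags as described above; NativeMemoryTracking is an optional extra
# diagnostic (assumption: useful here, we have not set it in production).
JAVA_OPTS="-Xms2560m -Xmx2560m -XX:MaxMetaspaceSize=256m"
JAVA_OPTS="$JAVA_OPTS -XX:NativeMemoryTracking=summary"
echo "$JAVA_OPTS"
# e.g.  java $JAVA_OPTS -jar fuseki-server.jar ...
# then inspect native allocations with:  jcmd <pid> VM.native_memory summary
```

With NMT enabled, `jcmd <pid> VM.native_memory summary` breaks the process footprint down by JVM subsystem, which can help decide whether growth is metaspace, direct buffers, or something invisible to the usual metrics.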
> A Log4j logger that is active will use a few objects - maybe that's enough
> to trigger a minor GC which in turn is enough to flush some non-heap
> resources.
>
> How big is the heap?
> This is Java 17?
>
> > We are testing this on Fuseki 4.8.0/TDB2 with close to 0 triples and
> > extremely low query traffic / health checks via /ping.
> >
> > ecs-task-memory: https://pasteboard.co/KPk7uhH2F9Lp.png
> >
> > Cheers,
> > Conal
> >
> > On 2023/07/11 09:31:25 Dave Reynolds wrote:
> > > Hi Rob,
> > >
> > > Good point. Will try to find time to experiment with that but given the
> > > testing cycle time that will take a while and can't start immediately.
> > >
> > > I'm a little sceptical though. As mentioned before, all the metrics we
> > > see show the direct memory pool that Jetty uses cycling up to the max heap
> > > size and then being collected, but with no long-term growth to match the
> > > process size growth. This really feels more like a bug (though not sure
> > > where) than tuning. The fact that actual behaviour doesn't match the
> > > documentation isn't encouraging.
> > >
> > > It's also pretty hard to figure out what the right pool configuration would
> > > be. This thing is just being asked to deliver a few metrics (12KB per
> > > request) several times a minute but manages to eat 500MB of direct
> > > buffer space every 5 mins. So what the right pool parameters are to
> > > support real usage peaks is not going to be easy to figure out.
> > >
> > > Nonetheless you are right. That's something that should be explored.
> > >
> > > Dave
> > >
> > >
> > > On 11/07/2023 09:45, Rob @ DNR wrote:
> > > > Dave
> > > >
> > > > Thanks for the further information.
> > > > Have you experimented with using Jetty 10 but providing more
> > > > detailed configuration? Fuseki supports providing detailed Jetty
> > > > configuration if needed via the --jetty-config option.
> > > >
> > > > The following section looks relevant:
> > > >
> > > > https://eclipse.dev/jetty/documentation/jetty-10/operations-guide/index.html#og-module-bytebufferpool
> > > >
> > > > It looks like the default is that Jetty uses a heuristic to
> > > > determine these values; sadly the heuristic in question is not detailed
> > > > in that documentation.
> > > >
> > > > Best guess from digging through their code is that the “heuristic”
> > > > is this:
> > > >
> > > > https://github.com/eclipse/jetty.project/blob/jetty-10.0.x/jetty-io/src/main/java/org/eclipse/jetty/io/AbstractByteBufferPool.java#L78-L84
> > > >
> > > > i.e., ¼ of the configured max heap size. This doesn’t necessarily
> > > > align with the exact sizes of process growth you see, but I note the
> > > > documentation does explicitly say that buffers used can go beyond these
> > > > limits, but that those will just be GC’d rather than pooled for reuse.
> > > >
> > > > Example byte buffer configuration at
> > > > https://github.com/eclipse/jetty.project/blob/9a05c75ad28ebad4abbe624fa432664c59763747/jetty-server/src/main/config/etc/jetty-bytebufferpool.xml#L4
> > > >
> > > > Any chance you could try customising this for your needs with stock
> > > > Fuseki and see if this allows you to make the process size smaller and
> > > > sufficiently predictable for your use case?
> > > >
> > > > Rob
> > > >
> > > > From: Dave Reynolds <da...@gmail.com>
> > > > Date: Tuesday, 11 July 2023 at 08:58
> > > > To: users@jena.apache.org <us...@jena.apache.org>
> > > > Subject: Re: Mystery memory leak in fuseki
> > > >
> > > > For interest[*] ...
> > > >
> > > > This is what the core JVM metrics look like when transitioning from a
> > > > Jetty 10 to a Jetty 9.4 instance.
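[A rough sketch of the kind of `--jetty-config` byte buffer pool override Rob suggests above. The class name, argument order, and values below are inferred from the linked jetty-bytebufferpool.xml and should be checked against the exact file for your Jetty version before use:]

```xml
<?xml version="1.0"?>
<!DOCTYPE Configure PUBLIC "-//Jetty//Configure//EN" "https://www.eclipse.org/jetty/configure_10_0.dtd">
<Configure>
  <!-- Sketch only: check the linked example file for exact arguments -->
  <New id="byteBufferPool" class="org.eclipse.jetty.io.ArrayByteBufferPool">
    <Arg type="int">0</Arg>          <!-- minCapacity -->
    <Arg type="int">-1</Arg>         <!-- factor (-1 = default) -->
    <Arg type="int">-1</Arg>         <!-- maxCapacity -->
    <Arg type="int">-1</Arg>         <!-- maxQueueLength -->
    <Arg type="long">33554432</Arg>  <!-- maxHeapMemory: e.g. 32MB instead of heap/4 -->
    <Arg type="long">33554432</Arg>  <!-- maxDirectMemory: e.g. 32MB -->
  </New>
</Configure>
```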
> > > > You can see the direct buffer cycling up
> > > > to 500MB (which happens to be the max heap setting) on Jetty 10, nothing
> > > > on Jetty 9. The drop in mapped buffers is just because TDB hadn't been
> > > > asked any queries yet.
> > > >
> > > > https://www.dropbox.com/scl/fi/9afhrztbb36fvzqkuw996/fuseki-jetty10-jetty9-transition.png?rlkey=7fpj4x1pn5mjnf3jjwenmp65m&dl=0
> > > >
> > > > Here's the same metrics around the time of triggering a TDB backup. Shows
> > > > the mapped buffer use for TDB but no significant impact on heap etc.
> > > >
> > > > https://www.dropbox.com/scl/fi/0s40vpizf94c4w3m2awna/fuseki-jetty10-backup.png?rlkey=ai31m6z58w0uex8zix8e9ctna&dl=0
> > > >
> > > > These are all on the same instance as the RES memory trace:
> > > >
> > > > https://www.dropbox.com/scl/fi/c58nqkr2hi193a84btedg/fuseki-4.9.0-jetty-9.4.png?rlkey=b7osnj6k1oy1xskl4j25zz6o8&dl=0
> > > >
> > > > Dave
> > > >
> > > > [*] I've been staring at metric graphs for so many days I may have a
> > > > distorted notion of what's interesting :)
> > > >
> > > > On 11/07/2023 08:39, Dave Reynolds wrote:
> > > >> After a 10 hour test of 4.9.0 with Jetty 9.4 on Java 17 in the
> > > >> production, containerized, environment it is indeed very stable.
> > > >>
> > > >> Running at less than 6% of memory on a 4GB machine compared to peaks of
> > > >> ~50% for versions with Jetty 10. RES shows as 240K with 35K shared
> > > >> (presumably mostly libraries).
> > > >>
> > > >> Copy of trace is:
> > > >> https://www.dropbox.com/scl/fi/c58nqkr2hi193a84btedg/fuseki-4.9.0-jetty-9.4.png?rlkey=b7osnj6k1oy1xskl4j25zz6o8&dl=0
> > > >>
> > > >> The high spikes on the left of the image are the prior run with
> > > >> out-of-the-box 4.7.0 on the same JVM.
> > > >>
> > > >> The small spike at 06:00 is a dump, so TDB was able to touch and scan all
> > > >> the (modest) data with a very minor blip in resident size (as you'd hope).
> > > >> JVM stats show the mapped buffers for TDB jumping up but confirm heap is
> > > >> stable at < 60M, non-heap 60M.
> > > >>
> > > >> Dave
> > > >>
> > > >> On 10/07/2023 20:52, Dave Reynolds wrote:
> > > >>> Since this thread has got complex, I'm posting this update here at the
> > > >>> top level.
> > > >>>
> > > >>> Thanks to folks, especially Andy and Rob, for suggestions and for
> > > >>> investigating.
> > > >>>
> > > >>> After a lot more testing at our end I believe we now have some
> > > >>> workarounds.
> > > >>>
> > > >>> First, at least on Java 17, the process growth does seem to level out.
> > > >>> Despite what I just said to Rob, having just checked our soak tests, a
> > > >>> Jena 4.7.0/Java 17 test with 500MB max heap has lasted for 7 days.
> > > >>> Process size oscillates between 1.5GB and 2GB but hasn't gone above
> > > >>> that in a week. The oscillation is almost entirely the cycling of the
> > > >>> direct memory buffers used by Jetty. Empirically those cycle up to
> > > >>> something comparable to the set max heap size, at least for us.
> > > >>>
> > > >>> While this week-long test was 4.7.0, based on earlier tests I suspect
> > > >>> 4.8.0 (and now 4.9.0) would also level out, at least on a timescale of
> > > >>> days.
> > > >>>
> > > >>> The key has been setting the max heap low. At 2GB and even 1GB (the
> > > >>> default on a 4GB machine) we saw higher peak levels of direct buffers,
> > > >>> and overall process size grew to around 3GB, at which point the
> > > >>> container is killed on the small machines. Though Java 17 does seem to
> > > >>> be better behaved than Java 11, so switching to that probably also
> > > >>> helped.
> > > >>>
> > > >>> Given the actual heap is low (50MB heap, 60MB non-heap), needing
> > > >>> 2GB to run in feels high but is workable. So my previously suggested
> > > >>> rule of thumb that, in this low memory regime, we allow 4x the max heap
> > > >>> size seems to work.
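[The "direct buffers cycle up to about the max heap" observation lines up with the quarter-of-max-heap heuristic Rob dug out of the Jetty 10 code earlier in the thread. A sketch of the arithmetic, assuming that reading is right; the class and method names are illustrative only:]

```java
public class PoolHeuristic {
    // Jetty 10's reported default (per the AbstractByteBufferPool link
    // above): budget roughly a quarter of the max heap for pooled buffers.
    static long heuristicBudget(long maxHeapBytes) {
        return maxHeapBytes / 4;
    }

    public static void main(String[] args) {
        long maxHeap = Runtime.getRuntime().maxMemory();
        // Note: buffers can exceed this budget; the excess is GC'd rather
        // than pooled, which matches the observed cycling up to ~maxHeap.
        System.out.printf("max heap=%dMB, heuristic pool budget=%dMB%n",
                maxHeap >> 20, heuristicBudget(maxHeap) >> 20);
    }
}
```

[On this reading, a 500MB max heap implies a ~128MB pool budget per pool, which is why lowering the heap also lowers the direct buffer peaks.]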
> > > >>>
> > > >>> Second, we're now pretty confident the issue is Jetty 10+.
> > > >>>
> > > >>> We've built a fuseki-server 4.9.0 with Jetty replaced by version
> > > >>> 9.4.51.v20230217. This required some minor source changes to compile
> > > >>> and pass tests. On a local bare metal test where we saw process growth
> > > >>> up to 1.5-2GB, this build has run stably using less than 500MB for 4
> > > >>> hours.
> > > >>>
> > > >>> We'll set a longer-term test running in the target containerized
> > > >>> environment to confirm things, but we're quite hopeful this will be
> > > >>> long-term stable.
> > > >>>
> > > >>> I realise Jetty 9.4.x is out of community support but Eclipse say EOL
> > > >>> is "unlikely to happen before 2025". So, while this may not be a
> > > >>> solution for the Jena project, it could give us a workaround at the
> > > >>> cost of doing custom builds.
> > > >>>
> > > >>> Dave
> > > >>>
> > > >>>
> > > >>> On 03/07/2023 14:20, Dave Reynolds wrote:
> > > >>>> We have a very strange problem with recent Fuseki versions when
> > > >>>> running (in Docker containers) on small machines. We suspect a Jetty
> > > >>>> issue but it's not clear.
> > > >>>>
> > > >>>> Wondering if anyone has seen anything like this.
> > > >>>>
> > > >>>> This is a production service but with tiny data (~250k triples, ~60MB
> > > >>>> as NQuads). It runs on 4GB machines with a Java heap allocation of
> > > >>>> 500MB[1].
> > > >>>>
> > > >>>> We used to run using 3.16 on JDK 8 (AWS Corretto for the long-term
> > > >>>> support) with no problems.
> > > >>>>
> > > >>>> Switching to Fuseki 4.8.0 on JDK 11, the process grows in the space of
> > > >>>> a day or so to reach ~3GB of memory, at which point the 4GB machine
> > > >>>> becomes unviable and things get OOM killed.
> > > >>>>
> > > >>>> The strange thing is that this growth happens when the system is
> > > >>>> answering no SPARQL queries at all, just regular health ping checks
> > > >>>> and (Prometheus) metrics scrapes from the monitoring systems.
> > > >>>>
> > > >>>> Furthermore, the space being consumed is not visible to any of the JVM
> > > >>>> metrics:
> > > >>>> - Heap and non-heap are stable at around 100MB total (mostly
> > > >>>> non-heap metaspace).
> > > >>>> - Mapped buffers stay at 50MB and remain long-term stable.
> > > >>>> - Direct memory buffers are allocated up to around 500MB then
> > > >>>> reclaimed. Since there are no SPARQL queries at all, we assume this is
> > > >>>> Jetty NIO buffers being churned as a result of the metric scrapes.
> > > >>>> However, this direct buffer behaviour seems stable; it cycles between
> > > >>>> 0 and 500MB on approx a 10 min cycle but is stable over a period of
> > > >>>> days and shows no leaks.
> > > >>>>
> > > >>>> Yet the Java process grows from an initial 100MB to at least 3GB.
> > > >>>> This can occur in the space of a couple of hours or can take up to a
> > > >>>> day or two, with no predictability in how fast.
> > > >>>>
> > > >>>> Presumably there is some low-level JNI space allocated by Jetty (?)
> > > >>>> which is invisible to all the JVM metrics and is not being reliably
> > > >>>> reclaimed.
> > > >>>>
> > > >>>> Trying 4.6.0, which we've had fewer problems with elsewhere, that
> > > >>>> seems to grow to around 1GB (plus up to 0.5GB for the cycling direct
> > > >>>> memory buffers) and then stays stable (at least on a three-day soak
> > > >>>> test). We could live with allocating 1.5GB to a system that should
> > > >>>> only need a few 100MB, but we're concerned that it may not be stable
> > > >>>> in the really long term and, in any case, would rather be able to
> > > >>>> update to more recent Fuseki versions.
> > > >>>>
> > > >>>> Trying 4.8.0 on Java 17, it grows rapidly to around 1GB again but then
> > > >>>> keeps ticking up slowly at random intervals. We project that it would
> > > >>>> take a few weeks to grow to the scale it did under Java 11, but it
> > > >>>> will still eventually kill the machine.
> > > >>>>
> > > >>>> Anyone seen anything remotely like this?
> > > >>>>
> > > >>>> Dave
> > > >>>>
> > > >>>> [1] 500M heap may be overkill but there can be some complex queries,
> > > >>>> and that should still leave plenty of space for OS buffers etc. in the
> > > >>>> remaining memory on a 4GB machine.
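[The heap, mapped buffer, and direct buffer figures quoted throughout this thread can be read straight from the JVM without a metrics stack, via the standard java.lang.management API. A minimal sketch; the class name BufferPoolWatch is illustrative:]

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;
import java.util.List;

public class BufferPoolWatch {
    // Report each JVM buffer pool ("direct", "mapped", ...) with its
    // buffer count, memory used, and total capacity, one per line.
    static String report() {
        StringBuilder sb = new StringBuilder();
        List<BufferPoolMXBean> pools =
                ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class);
        for (BufferPoolMXBean p : pools) {
            sb.append(String.format("%s: count=%d used=%d capacity=%d%n",
                    p.getName(), p.getCount(),
                    p.getMemoryUsed(), p.getTotalCapacity()));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Allocate a direct buffer so the "direct" pool shows non-zero usage.
        ByteBuffer buf = ByteBuffer.allocateDirect(1 << 20);
        System.out.print(report());
    }
}
```

[Polling this periodically (or scraping the equivalent Micrometer gauges Fuseki exposes) is how one can confirm, as above, that direct buffers cycle while the process RES keeps growing.]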