Since this thread has got complex, I'm posting this update here at the top level.

Thanks to folks, especially Andy and Rob, for the suggestions and for investigating.

After a lot more testing at our end I believe we now have some workarounds.

First, at least on Java 17, the process growth does seem to level out. Despite what I just said to Rob, having just checked our soak tests, a Jena 4.7.0/Java 17 test with a 500MB max heap has lasted for 7 days. Process size oscillates between 1.5GB and 2GB but hasn't gone above that in a week. The oscillation is almost entirely the cycling of the direct memory buffers used by Jetty; empirically those cycle up to something comparable to the set max heap size, at least for us.
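
That tallies with the JVM's defaults: unless -XX:MaxDirectMemorySize is set explicitly, the direct-buffer limit defaults to roughly the max heap size. For reference, a sketch of the kind of launch settings involved, assuming the standalone jar (the values and config file name are illustrative, not a recommendation - in particular we haven't soak-tested an explicit direct-memory cap):

    java -Xmx500m -XX:MaxDirectMemorySize=512m \
         -jar fuseki-server.jar --config=config.ttl

In principle -XX:MaxDirectMemorySize would let the Jetty buffer churn be capped independently of the heap.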

While this week-long test was on 4.7.0, based on earlier tests I suspect 4.8.0 (and now 4.9.0) would also level out, at least on a timescale of days.

The key has been setting the max heap low. At 2GB, and even at 1GB (the default on a 4GB machine), we saw higher peak levels of direct buffers, and the overall process size grew to around 3GB, at which point the container is killed on the small machines. Java 17 also seems to be better behaved than Java 11, so switching to that probably helped as well.

Given that actual heap usage is low (50MB heap, 60MB non-heap), needing 2GB to run in feels high, but it is workable. So my previously suggested rule of thumb for this low-memory regime - allow about 4x the max heap size for the whole process - seems to work.
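
For the record, the rough budget behind that rule of thumb, using our numbers (the direct-buffer figure assumes the default cap of roughly the max heap size; the remainder is the unaccounted native growth this thread is about):

    max heap (committed, worst case)       ~0.5GB
    direct buffers (Jetty NIO churn)       ~0.5GB
    metaspace and other non-heap           ~0.1GB
    unaccounted native / allocator slack   ~0.4-0.9GB
    -------------------------------------------------
    observed process size                   1.5-2GB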

Second, we're now pretty confident the issue lies in Jetty 10+.

We've built a fuseki-server 4.9.0 with Jetty replaced by version 9.4.51.v20230217. This required some minor source changes to compile and pass tests. On a local bare-metal test, where we previously saw process growth up to 1.5-2GB, this build has run stably using less than 500MB for 4 hours.
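
In case anyone wants to try the same thing, the core of it is pinning the Jetty version in the Jena build, plus the minor source fixes mentioned above. A sketch only - the property name below is from memory and may not match the current Jena parent pom:

    <properties>
      <!-- hypothetical property name; check the Jena parent pom -->
      <ver.jetty>9.4.51.v20230217</ver.jetty>
    </properties>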

We'll set a longer-term test running in the target containerized environment to confirm things, but we're quite hopeful this will be stable in the long term.

I realise Jetty 9.4.x is out of community support, but Eclipse says EOL is "unlikely to happen before 2025". So, while this may not be a solution for the Jena project itself, it could give us a workaround at the cost of doing custom builds.

Dave


On 03/07/2023 14:20, Dave Reynolds wrote:
We have a very strange problem with recent Fuseki versions when running (in Docker containers) on small machines. We suspect a Jetty issue, but it's not clear.

Wondering if anyone has seen anything like this.

This is a production service, but with tiny data (~250k triples, ~60MB as N-Quads). It runs on 4GB machines with a Java heap allocation of 500MB [1].

We used to run 3.16 on JDK 8 (AWS Corretto, for the long-term support) with no problems.

After switching to Fuseki 4.8.0 on JDK 11, the process grows in the space of a day or so to ~3GB of memory, at which point the 4GB machine becomes unviable and things get OOM-killed.

The strange thing is that this growth happens while the system is answering no SPARQL queries at all - just regular health-ping checks and (Prometheus) metrics scrapes from the monitoring systems.

Furthermore, the space being consumed is not visible to any of the JVM metrics:
- Heap and non-heap are stable at around 100MB total (mostly non-heap metaspace).
- Mapped buffers stay at 50MB and remain long-term stable.
- Direct memory buffers are allocated up to around 500MB and then reclaimed. Since there are no SPARQL queries at all, we assume this is Jetty NIO buffers being churned as a result of the metrics scrapes. However, this direct-buffer behaviour seems stable: it cycles between 0 and 500MB on approximately a 10-minute cycle, but is stable over a period of days and shows no leaks. (See the snippet below for a direct way to read these pools.)
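
A minimal, standalone way to read those same pools, independent of our monitoring stack - this just dumps the standard JVM buffer-pool MXBeans (the "direct" and "mapped" pools referred to above):

    import java.lang.management.BufferPoolMXBean;
    import java.lang.management.ManagementFactory;

    public class BufferPools {
        public static void main(String[] args) {
            // One bean per pool: typically "direct" and "mapped"
            for (BufferPoolMXBean pool :
                    ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
                System.out.printf("%s: buffers=%d used=%dMB capacity=%dMB%n",
                        pool.getName(),
                        pool.getCount(),
                        pool.getMemoryUsed() / (1024 * 1024),
                        pool.getTotalCapacity() / (1024 * 1024));
            }
        }
    }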

Yet the Java process grows from an initial 100MB to at least 3GB. This can occur in the space of a couple of hours, or can take up to a day or two, with no predictability in how fast.

Presumably there is some low-level native space allocated by Jetty (JNI?) which is invisible to all the JVM metrics and is not being reliably reclaimed.
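
One way to at least narrow this down (a suggestion only; we haven't run it for long periods) is the JVM's native memory tracking, which accounts for native allocations the JVM itself makes:

    # start the JVM with tracking enabled (has some overhead)
    -XX:NativeMemoryTracking=summary

    # then query the running process
    jcmd <pid> VM.native_memory summary

If the growth does not show up there either, that points at allocations outside the JVM's own accounting (e.g. allocator behaviour or native libraries), which would fit the "invisible to JVM metrics" symptom.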

Trying 4.6.0, which we've had fewer problems with elsewhere, the process seems to grow to around 1GB (plus up to 0.5GB for the cycling direct memory buffers) and then stays stable (at least on a three-day soak test). We could live with allocating 1.5GB to a system that should only need a few hundred MB, but we're concerned that it may not be stable in the really long term and, in any case, would rather be able to update to more recent Fuseki versions.

Trying 4.8.0 on Java 17, it grows rapidly to around 1GB again, but then keeps ticking up slowly at random intervals. We project that it would take a few weeks to grow to the scale it did under Java 11, but it would still eventually kill the machine.

Anyone seen anything remotely like this?

Dave

[1] A 500MB heap may be overkill, but there can be some complex queries, and that should still leave plenty of space for OS buffers etc. in the remaining memory on a 4GB machine.


