For interest[*] ...
This is what the core JVM metrics look like when transitioning from a
Jetty 10 to a Jetty 9.4 instance. You can see the direct buffers cycling
up to 500MB (which happens to be the max heap setting) on Jetty 10, and
nothing on Jetty 9. The drop in Mapped buffers is just because TDB hadn't
been asked any queries yet.
https://www.dropbox.com/scl/fi/9afhrztbb36fvzqkuw996/fuseki-jetty10-jetty9-transition.png?rlkey=7fpj4x1pn5mjnf3jjwenmp65m&dl=0
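(In case anyone wants to cross-check these numbers without a Prometheus
stack: the direct and mapped figures in these graphs are just the
standard JDK buffer pool beans. A minimal sketch, nothing Fuseki-specific
assumed:

import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;

public class BufferPools {
    public static void main(String[] args) {
        // The "direct" pool is where Jetty's NIO buffers show up,
        // the "mapped" pool is where TDB's memory-mapped files show up.
        for (BufferPoolMXBean pool :
                ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            System.out.printf("%-8s count=%d used=%dMB capacity=%dMB%n",
                    pool.getName(),
                    pool.getCount(),
                    pool.getMemoryUsed() / (1024 * 1024),
                    pool.getTotalCapacity() / (1024 * 1024));
        }
    }
}
)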
Here are the same metrics around the time of triggering a TDB backup.
They show the mapped buffer use for TDB but no significant impact on the
heap etc.
https://www.dropbox.com/scl/fi/0s40vpizf94c4w3m2awna/fuseki-jetty10-backup.png?rlkey=ai31m6z58w0uex8zix8e9ctna&dl=0
These are all on the same instance as the RES memory trace:
https://www.dropbox.com/scl/fi/c58nqkr2hi193a84btedg/fuseki-4.9.0-jetty-9.4.png?rlkey=b7osnj6k1oy1xskl4j25zz6o8&dl=0
Dave
[*] I've been staring at metric graphs for so many days I may have a
distorted notion of what's interesting :)
On 11/07/2023 08:39, Dave Reynolds wrote:
After a 10 hour test of 4.9.0 with Jetty 9.4 on java 17 in the
production, containerized, environment, it is indeed very stable.
It runs at less than 6% of memory on a 4GB machine, compared to peaks of
~50% for versions with Jetty 10. RES shows as 240K with 35K shared
(presumably mostly libraries).
Copy of trace is:
https://www.dropbox.com/scl/fi/c58nqkr2hi193a84btedg/fuseki-4.9.0-jetty-9.4.png?rlkey=b7osnj6k1oy1xskl4j25zz6o8&dl=0
The high spikes on the left of the image are from the prior run with
out-of-the-box 4.7.0 on the same JVM.
The small spike at 06:00 is a dump, so TDB was able to touch and scan all
the (modest) data with only a very minor blip in resident size (as you'd
hope). JVM stats show the mapped buffers for TDB jumping up but confirm
the heap is stable at < 60M, with non-heap at 60M.
Dave
On 10/07/2023 20:52, Dave Reynolds wrote:
Since this thread has got complex, I'm posting this update here at the
top level.
Thanks to folks, especially Andy and Rob for suggestions and for
investigating.
After a lot more testing at our end I believe we now have some
workarounds.
First, at least on java 17, the process growth does seem to level out.
Despite what I just said to Rob, having just checked our soak tests, a
jena 4.7.0/java 17 test with 500MB max heap has lasted for 7 days.
Process size oscillates between 1.5GB and 2GB but hasn't gone above
that in a week. The oscillation is almost entirely the cycling of the
direct memory buffers used by Jetty. Empirically those cycle up to
something comparable to the set max heap size, at least for us.
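(My understanding of why the direct buffers track the heap setting: on
HotSpot, if -XX:MaxDirectMemorySize isn't set, the direct buffer limit
defaults to the max heap size, so capping the heap also caps what
Jetty's buffer pools can reach. A quick sketch to check what a given JVM
ended up with; it uses the HotSpot-specific com.sun.management bean, so
treat it as HotSpot-only:

import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

public class DirectLimit {
    public static void main(String[] args) {
        HotSpotDiagnosticMXBean hotspot =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        // "0" means unset, in which case the effective direct buffer
        // limit defaults to the max heap size (Runtime.maxMemory()).
        String configured = hotspot.getVMOption("MaxDirectMemorySize").getValue();
        long maxHeapMB = Runtime.getRuntime().maxMemory() / (1024 * 1024);
        System.out.println("MaxDirectMemorySize = " + configured);
        System.out.println("Max heap            = " + maxHeapMB + "MB");
    }
}
)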
While this week long test was 4.7.0, based on earlier tests I suspect
4.8.0 (and now 4.9.0) would also level out at least on a timescale of
days.
The key has been setting the max heap low. At 2GB, and even 1GB (the
default on a 4GB machine), we saw higher peak levels of direct buffers
and the overall process size grew to around 3GB, at which point the
container is killed on the small machines. That said, java 17 does seem
to be better behaved than java 11, so switching to that probably also
helped.
Given that the actual heap use is low (50MB heap, 60MB non-heap),
needing 2GB to run in feels high but is workable. So my previously
suggested rule of thumb, that in this low memory regime you should allow
4x the max heap size, seems to work.
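(Back-of-envelope for where the 4x comes from in our case; the numbers
below are just the ones already quoted in this thread, nothing more
authoritative:

public class RoughBudget {
    public static void main(String[] args) {
        // Illustrative only: figures as observed in this thread.
        int maxHeapMB   = 500;        // -Xmx500m
        int directMB    = maxHeapMB;  // Jetty direct buffers cycle up to roughly the heap cap
        int nonHeapMB   = 100;        // metaspace, code cache etc.
        int nativeEtcMB = 400;        // TDB mapped files, thread stacks, allocator slack (the fuzzy bit)

        int totalMB = maxHeapMB + directMB + nonHeapMB + nativeEtcMB;
        // ~1500MB, in line with the observed 1.5-2GB oscillation;
        // budgeting 4x the max heap (2GB) just leaves some headroom on top.
        System.out.println("Rough process ceiling: " + totalMB + "MB");
    }
}
)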
Second, we're now pretty confident the issue is jetty 10+.
We've built a fuseki-server 4.9.0 with Jetty replaced by version
9.4.51.v20230217. This required some minor source changes to compile
and pass tests. On a local bare metal test where we saw process growth
up to 1.5-2GB this build has run stably using less than 500MB for 4
hours.
We'll set a longer term test running in the target containerized
environment to confirm things, but we're quite hopeful this will be
stable in the long term.
I realise Jetty 9.4.x is out of community support but Eclipse says EOL
is "unlikely to happen before 2025". So, while this may not be a
solution for the Jena project, it could give us a workaround at the
cost of doing custom builds.
Dave
On 03/07/2023 14:20, Dave Reynolds wrote:
We have a very strange problem with recent fuseki versions when
running (in docker containers) on small machines. Suspect a jetty
issue but it's not clear.
Wondering if anyone has seen anything like this.
This is a production service but with tiny data (~250k triples, ~60MB
as NQuads). Runs on 4GB machines with java heap allocation of 500MB[1].
We used to run using 3.16 on jdk 8 (AWS Corretto for the long term
support) with no problems.
Switching to fuseki 4.8.0 on jdk 11 the process grows in the space of
a day or so to reach ~3GB of memory at which point the 4GB machine
becomes unviable and things get OOM killed.
The strange thing is that this growth happens when the system is
answering no Sparql queries at all, just regular health ping checks
and (prometheus) metrics scrapes from the monitoring systems.
Furthermore, the space being consumed is not visible to any of the JVM
metrics:
- Heap and non-heap are stable at around 100MB total (mostly
non-heap metaspace).
- Mapped buffers stay at 50MB and remain long term stable.
- Direct memory buffers are allocated up to around 500MB and then
reclaimed. Since there are no SPARQL queries at all, we assume this is
Jetty NIO buffers being churned as a result of the metrics scrapes.
However, this direct buffer behaviour seems stable: it cycles between
0 and 500MB on approximately a 10 minute period, but holds steady over
a period of days and shows no leaks.
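(For anyone trying to reproduce the no-query load: it is just whatever
the monitoring does, i.e. polling the standard Fuseki /$/metrics
endpoint. A minimal stand-in for the scraper; the host, port and 15s
interval are placeholders, not our actual monitoring config:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ScrapeLoop {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        // Placeholder URL: Fuseki exposes Prometheus-format metrics at /$/metrics.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:3030/$/metrics"))
                .GET()
                .build();
        while (true) {
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println("scrape: " + response.statusCode()
                    + " " + response.body().length() + " bytes");
            Thread.sleep(15_000);   // assumed 15s scrape interval
        }
    }
}
)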
Yet the java process grows from an initial 100MB to at least 3GB.
This can occur in the space of a couple of hours or can take up to a
day or two with no predictability in how fast.
Presumably there is some low level JNI space allocated by Jetty (?)
which is invisible to all the JVM metrics and is not being reliably
reclaimed.
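(One way to try to pin that down would be the JVM's Native Memory
Tracking: start the process with -XX:NativeMemoryTracking=summary and
pull a report with jcmd <pid> VM.native_memory summary. The sketch below
fetches the same report in-process via the diagnostic command MBean; the
MBean and operation names are my understanding of how jcmd commands are
exposed over JMX, so treat them as an assumption to verify. Anything
malloc'd entirely outside the JVM, e.g. glibc arena growth, still won't
show up there.

import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class NativeMemoryReport {
    public static void main(String[] args) throws Exception {
        // Requires the JVM to be started with -XX:NativeMemoryTracking=summary.
        // HotSpot-specific; the MBean/operation names below are assumptions
        // about how jcmd's VM.native_memory command is exposed over JMX.
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        ObjectName diagnostics = new ObjectName("com.sun.management:type=DiagnosticCommand");
        String report = (String) server.invoke(
                diagnostics,
                "vmNativeMemory",
                new Object[] { new String[] { "summary" } },
                new String[] { String[].class.getName() });
        System.out.println(report);
    }
}
)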
Trying 4.6.0, which we've had fewer problems with elsewhere, it seems
to grow to around 1GB (plus up to 0.5GB for the cycling direct memory
buffers) and then stays stable (at least on a three day soak test). We
could live with allocating 1.5GB to a system that should only need a
few hundred MB, but we're concerned that it may not be stable in the
really long term and, in any case, we would rather be able to update to
more recent fuseki versions.
Trying 4.8.0 on java 17, it grows rapidly to around 1GB again but then
keeps ticking up slowly at random intervals. We project that it would
take a few weeks to grow to the scale it did under java 11, but it will
still eventually kill the machine.
Anyone seen anything remotely like this?
Dave
[1] 500M heap may be overkill but there can be some complex queries
and that should still leave plenty of space for OS buffers etc in the
remaining memory on a 4GB machine.