Hi Frank,

That's consistent with what we see but I don't think that's a memory leak as such.

If you continuously GET the /$/ping or /$/metrics endpoints then you see direct memory buffers being allocated and eventually reclaimed. See the graphs we posted earlier.
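If anyone else wants to watch this without a full monitoring stack, a loop along these lines is enough (the metric names are the standard micrometer JVM buffer metrics, so treat the exact names as an assumption about what your Fuseki build exposes):

  # Poll the direct buffer pool metrics every few seconds; note that the
  # polling itself is what generates the buffer churn being observed.
  while true; do
    curl -s 'http://127.0.0.1:3030/$/metrics' \
      | grep 'jvm_buffer_memory_used_bytes{id="direct"'
    sleep 5
  done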

This applies to either endpoint, so it's not e.g. the micrometer library; it seems to happen for any HTTP request/response. I assume it is Jetty allocating these buffers.

We assume the same would apply if running actual queries at a similar rate and not scraping the metrics - that the direct memory buffer allocation would happen for any HTTP traffic.

Direct memory buffers are only reclaimed, by a separate cleaner thread, after a garbage collection, so how much direct memory gets used before it's all cleaned up depends on how fast GCs are happening relative to how fast buffers are being allocated.

We see the total direct memory buffer pool grow to a size roughly comparable to the allocated maximum heap size (much larger than the actual heap and non-heap usage in our case) before the GC and cleanup manage to collect them. At the rates we poll the ping and metrics endpoints, the cycle time for the direct memory buffers to grow to maximum and then get cleaned up is O(10 mins). However, all of this does depend on what else is going on: you can force faster garbage collects to reduce the maximum direct buffers left around, at the cost of the CPU for all those GCs (a crude sketch of that is below).
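For the record, "force faster garbage collects" can be as crude as an explicit GC on a timer; the pid and interval here are purely illustrative:

  # Trigger an explicit GC periodically so the cleaner thread gets a chance
  # to release direct buffers sooner. Costs CPU, as noted above.
  while true; do
    jcmd <fuseki-pid> GC.run
    sleep 120
  done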

So it is surprising to see that high a rate of DM buffer allocation for pretty simple HTTP requests, and surprising that new buffers keep getting allocated (I thought Jetty managed a buffer pool and should be reusing rather than reallocating each time). However, it does all get garbage collected away on a short timescale and isn't a leak.

Our issue is that on a timescale of hours to days we see the process size grow, with that direct memory churn superimposed on top. It doesn't grow steadily or predictably and "largely" levels off after a day or two, but that level is rather higher than the max heap, much higher than the actual heap, and isn't completely stable. Subjectively we agree with you that the rate of growth is higher on later versions, but the whole behaviour is not sufficiently deterministic to be sure. Our previous impression that 4.6.x, for example, reached a long-term stable level may not be right; it just levels off a little more quickly than 4.7.0 and 4.8.0, but it's not clear it is any more absolutely stable than those.

All of this happens in non-docker environments, though containers do add complications.

After days of running with native memory tracking on and repeated heap dumps we have no smoking gun and no solution.
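For anyone repeating this, a heap dump can be taken with the standard jcmd command (the output path here is just illustrative):

  # Dumps live (reachable) objects by default, which is what we want here.
  jcmd <fuseki-pid> GC.heap_dump /tmp/fuseki-heap.hprof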

The heap itself is stable with just a few byte buffers (presumably the heap-side references to those DM buffers) coming and going; there's nothing on the heap that grows and nothing suspicious.

All the many different JVM native memory pools are long-term stable, at least as far as the JVM metrics and JVM native memory tracking are concerned. We have tried running with tracking on for a day and watching the native memory details with a continuous jcmd (see below), and all we see is the growth and shrinkage of the "Other" native memory pool where the direct buffers hide out. When the process size (whether RES or PSS, Proportional Set Size) has grown by e.g. 2x there's no sign of it in any of the JVM memory pools.
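For anyone wanting to repeat the native memory tracking exercise, the standard JDK flag and jcmd sub-commands are enough; the JVM_ARGS wiring is just one way to pass the flag through the fuseki-server script:

  # Start Fuseki with native memory tracking enabled (small runtime overhead).
  JVM_ARGS="-Xmx500M -XX:NativeMemoryTracking=detail" ./fuseki-server &

  # Take a baseline, then diff against it periodically while the process grows.
  jcmd <fuseki-pid> VM.native_memory baseline
  jcmd <fuseki-pid> VM.native_memory summary.diff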

The best theory we have is that this is some freak memory-fragmentation effect from the surprisingly high churn of direct memory buffer allocation. Eventually the usage does level off, at least on a timescale of weeks, but it's not clear whether it does so sufficiently stably on a timescale of months to allow reliable long-term running.

The fact that this growth is not deterministic, not dependent on the work being done and levels off suggests it's not a leak as such but :handwave: some sort of fragmentation.

There doesn't seem to be any way to force the JVM to preallocate the direct memory pool and then leave it at a fixed size. You can set a maximum for direct memory (example below) but it's not clear that helps. Setting it does seem to mean that the buffers are GCed more quickly, but any setting low enough to make a difference leaves no margin to handle bursts of real requests.
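For anyone experimenting with that cap, it's the standard JVM flag; the 256M value is purely illustrative, not a recommendation:

  # Cap direct buffer usage. Allocations beyond the cap fail with
  # "OutOfMemoryError: Direct buffer memory" rather than growing further.
  JVM_ARGS="-Xmx500M -XX:MaxDirectMemorySize=256M" ./fuseki-server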

Our rule of thumb for the moment is to expect a process size of at least 4x your maximum heap setting. So for a fuseki configuration serving modest data we might often allow a 500MB heap, and with java 8/fuseki 3.16.0 we would see stable process sizes around 250MB. With the same configuration on java 11/17 and fuseki 4.x, expect the process size to grow to around 2GB and then allow some safety margin on top. So roughly 10x more expensive. Whether even that's good enough to allow long-term running on small machines we don't know.
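When sizing against that rule of thumb, the number to watch is the process RES/PSS rather than any JVM metric; something like the following is enough to check it (smaps_rollup needs a reasonably recent kernel):

  # Resident set size (kB) as reported by ps.
  ps -o rss= -p <fuseki-pid>

  # Proportional set size (kB), a fairer number when pages are shared.
  grep '^Pss:' /proc/<fuseki-pid>/smaps_rollup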

Dave


On 05/07/2023 16:25, Lange, Frank wrote:
Hi folks,

Dave mentioned ping for health checks in his initial post. I have the suspicion 
that the ping endpoint produces a memory leak.

Test setup:
* Debian 12, Linux kernel 6.1.0-9-amd64
* `java -version`:
   openjdk version "17.0.7" 2023-04-18
   OpenJDK Runtime Environment (build 17.0.7+7-Debian-1deb12u1)
* fresh install of apache-jena-fuseki-4.8.0, default configuration (run 
directory is created automatically), no datasets
* Fuseki is started via the `fuseki-server` script; no extra JVM_ARGS (i.e. it 
defaults to -Xmx4G)

Test execution:
* Call the ping endpoint a few hundred times, e.g. via `for i in {1..100}; do 
curl http://127.0.0.1:3030/$/ping; done`.

Observation:
* Memory consumption of the java process increases significantly up to around 
4GB, then a GC steps in and reduces memory consumption to less than 1GB. 
The cycle repeats with more pings.
* Tweaking the JVM arg -Xmx can change when GC steps in (see the example below).
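For the record, the JVM arg is passed via the JVM_ARGS environment variable that the `fuseki-server` script picks up; the value here is just an example:

  JVM_ARGS="-Xmx1G" ./fuseki-server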

Can anyone reproduce my observations?

I tried that with all versions from v4.8.0 down to v4.0.0 and I'm happy to 
give you some clues:
The erratic behaviour starts with version 4.3.0, so it's advisable to check what happened 
between v4.2.0 and v4.3.0. Another impression is that v4.1.0 is even less 
"memory-leaky" than v4.2.0.

I also analyzed with VisualVM in this test setup, but to be honest I don't see 
any suspicious memory leak situation there.


Best regards,
Frank



________________________________
From: Dave Reynolds <dave.e.reyno...@gmail.com>
Sent: Tuesday, 4 July 2023 12:16
To: users@jena.apache.org
Subject: Re: Mystery memory leak in fuseki

Try that again:

For interest this is what the JVM metrics look like. The main
heap/non-heap ones are:

https://www.dropbox.com/s/g1ih98kprnvjvxx/fusdeki-metrics-1.png?dl=0

So stable at around 75MB used, 110MB committed.

Whereas the buffer pools are:

https://www.dropbox.com/s/c77b2oarzxjlsa7/fuseki-buffer-metrics.png?dl=0

So it gets up to a size comparable with the allowed max heap size (500MB)
then is GCed back. Lots of churn just for reporting the metrics, but no sign
of the upward trend which dominates the MEM% curves and nothing to
explain the growth to 1.8GB and beyond.

Guess could try doing a heap dump anyway in case that gives a clue but
not sure that's the right haystack.

Dave

On 04/07/2023 10:56, Dave Reynolds wrote:
For interest this is what the JVM metrics look like. The main
heap/non-heap ones are:

https://www.dropbox.com/s/8auux5v352ur04m/fusdeki-metrics-1.png?dl=0

So stable at around 75MB used, 110MB committed.

Whereas the buffer pools are:

https://www.dropbox.com/s/c77b2oarzxjlsa7/fuseki-buffer-metrics.png?dl=0

So it gets up to a size comparable with the allowed max heap size (500MB)
then is GCed back. Lots of churn just for reporting the metrics, but no sign
of the upward trend which dominates the MEM% curves and nothing to
explain the growth to 1.8GB and beyond.

Guess could try doing a heap dump anyway in case that gives a clue but
not sure that's the right haystack.

Dave


On 04/07/2023 10:41, Dave Reynolds wrote:
  >  Does this only happen in a container?  Or can you reproduce it
running locally as well?

Not reproduced locally yet, partly because it's harder to set up the
equivalent metrics monitoring there.

Can try harder at that.

  > If you can reproduce it locally then attaching a profiler like
VisualVM so you can take a heap snapshot and see where the memory is
going that would be useful

Thanks, aware of that option but I thought that would just allow us to
probe the heap, non-heap and buffer JVM memory pools. We have quite
detailed monitoring traces on all the JVM metrics which confirm heap
and non-heap are fine, sitting stably at a low level and not
reflecting the leak.

That's also what tells us the direct memory buffers are cycling but
being properly collected and not leaking. Assuming the JVM metrics are
accurate then the leak is somewhere in native memory beyond the ken of
the JVM metrics.

Dave


On 04/07/2023 10:11, Rob @ DNR wrote:
Does this only happen in a container?  Or can you reproduce it
running locally as well?

If you can reproduce it locally then attaching a profiler like
VisualVM so you can take a heap snapshot and see where the memory is
going that would be useful

Rob

From: Dave Reynolds <dave.e.reyno...@gmail.com>
Date: Tuesday, 4 July 2023 at 09:31
To: users@jena.apache.org <users@jena.apache.org>
Subject: Re: Mystery memory leak in fuseki
Tried 4.7.0 under the most up-to-date java 17 and it acts like 4.8.0. After
16 hours it gets to about 1.6GB and by eye has nearly flattened off
but not completely.

For interest here's a MEM% curve on a 4GB box (hope the link works).

https://www.dropbox.com/s/xjmluk4o3wlwo0y/fuseki-mem-percent.png?dl=0

The flattish curve from 12:00 to 17:20 is a run using 3.16.0 for
comparison. The curve from then onwards is 4.7.0.

The spikes on the 4.7.0 curve match the allocation and recovery of the direct
memory buffers. The JVM metrics show those cycling around every 10 mins
and being reclaimed each time, with no leaking visible at that level.
Heap, non-heap and mapped buffers are all basically unchanging, which is
to be expected since it's doing nothing apart from reporting metrics.

Whereas this curve (again from 17:20 onwards) shows basically the same
4.7.0 set-up on a separate host, showing that despite flattening out
somewhat, usage continues to grow - at least on a 16-hour timescale.

https://www.dropbox.com/s/k0v54yq4kexklk0/fuseki-mem-percent-2.png?dl=0


Both of those runs were using Eclipse Temurin on a base Ubuntu jammy
container. Previous runs used AWS Corretto on an AL2 base container.
The behaviour was basically unchanged, which eliminates this being some
Corretto-specific issue or a weird base container OS issue.

Dave

On 03/07/2023 14:54, Andy Seaborne wrote:
Hi Dave,

Could you try 4.7.0?

4.6.0 was 2022-08-20
4.7.0 was 2022-12-27
4.8.0 was 2023-04-20

This is an in-memory database?

Micrometer/Prometheus has had several upgrades but if it is not heap and
not direct memory (I thought that was a hard bound set at start up), I
don't see how it can be involved.

       Andy

On 03/07/2023 14:20, Dave Reynolds wrote:
We have a very strange problem with recent fuseki versions when
running (in docker containers) on small machines. Suspect a jetty
issue but it's not clear.

Wondering if anyone has seen anything like this.

This is a production service but with tiny data (~250k triples, ~60MB
as NQuads). Runs on 4GB machines with java heap allocation of
500MB[1].

We used to run using 3.16 on jdk 8 (AWS Corretto for the long term
support) with no problems.

Switching to fuseki 4.8.0 on jdk 11 the process grows in the space of
a day or so to reach ~3GB of memory at which point the 4GB machine
becomes unviable and things get OOM killed.

The strange thing is that this growth happens when the system is
answering no Sparql queries at all, just regular health ping checks
and (prometheus) metrics scrapes from the monitoring systems.

Furthermore the space being consumed is not visible to any of the JVM
metrics:
- Heap and non-heap are stable at around 100MB total (mostly
non-heap metaspace).
- Mapped buffers stay at 50MB and remain long term stable.
- Direct memory buffers being allocated up to around 500MB then being
reclaimed. Since there are no sparql queries at all we assume this is
jetty NIO buffers being churned as a result of the metric scrapes.
However, this direct buffer behaviour seems stable: it cycles between
0 and 500MB on approximately a 10-minute cycle, but is stable over a period
of days and shows no leaks.

Yet the java process grows from an initial 100MB to at least 3GB. This
can occur in the space of a couple of hours or can take up to a day or
two with no predictability in how fast.

Presumably there is some low level JNI space allocated by Jetty (?)
which is invisible to all the JVM metrics and is not being reliably
reclaimed.

Trying 4.6.0, which we've had fewer problems with elsewhere, it seems
to grow to around 1GB (plus up to 0.5GB for the cycling direct memory
buffers) and then stays stable (at least on a three-day soak test).
We could live with allocating 1.5GB to a system that should only need
a few hundred MB, but we are concerned that it may not be stable in the
really long term and, in any case, would rather be able to update to more
recent fuseki versions.

Trying 4.8.0 on java 17, it grows rapidly to around 1GB again but then
keeps ticking up slowly at random intervals. We project that it would
take a few weeks to grow to the scale it did under java 11, but it will
still eventually kill the machine.

Anyone seen anything remotely like this?

Dave

[1]  500M heap may be overkill but there can be some complex queries
and that should still leave plenty of space for OS buffers etc in the
remaining memory on a 4GB machine.




