Hey Andy,

Metaspace seems to be stable!
We’re running this on Java 11 currently. I can check it out with Java 17
though. We’ve currently set Xms/Xmx to 2560MB and MaxMetaspaceSize to
256MB. The ECS task is set with a ceiling of 4GB memory and 1 vCPU.

Could it be more of a race condition than the size of used objects, due
to logging? I do see some time-sensitive eviction code in Jetty -
https://github.com/eclipse/jetty.project/blob/9e16d81cf8922c75e3d2d96c66442b896a9c69e1/jetty-io/src/main/java/org/eclipse/jetty/io/ArrayRetainableByteBufferPool.java#L374
Not sure if the same type of thing exists in the Jena codebase.

I will try to check with `--empty` also.

Thanks,
Conal

On 2023/07/19 21:10:24 Andy Seaborne wrote:
> Conal,
>
> Thanks for the information.
> Can you see if metaspace is growing as well?
>
> All,
>
> Could someone please try running Fuseki main, with no datasets (--empty)
> and some healthcheck ping traffic.
>
>     Andy
>
> On 19/07/2023 14:42, Conal McLaughlin wrote:
> > Hey Dave,
> >
> > Thank you for providing an in-depth analysis of your issues.
> > We appear to be witnessing the same type of problems with our current
> > Fuseki deployment.
> > We are deploying a containerised Fuseki into an AWS ECS task alongside
> > other containers - this may not be ideal but that’s a different story.
> >
> > I just wanted to add another data point to everything you have described.
> > Firstly, it does seem like “idle” (or very low traffic) instances are
> > the problem for us (coupled with a larger heap than necessary).
> > We witness the same increase in the ECS task memory consumption up until
> > the whole thing is killed off, which includes the Fuseki container.
> >
> > In an attempt to see what was going on beneath the hood, we turned up
> > the logging to TRACE in the log4j2.xml file provided to Fuseki.
> > This appeared to stabilise the increasing memory consumption.
> > Even just switching the `logger.jetty.level` to TRACE alleviates the issue.
> > Colour me confused!
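For concreteness, a sketch of the JVM settings described above (flag spellings are the standard HotSpot ones). The NativeMemoryTracking flag is a hypothetical extra for chasing off-heap growth, not part of our current setup:

```shell
# Flags as described above; NativeMemoryTracking is an optional extra
# diagnostic (assumption: useful here, we have not set it in production).
JAVA_OPTS="-Xms2560m -Xmx2560m -XX:MaxMetaspaceSize=256m"
JAVA_OPTS="$JAVA_OPTS -XX:NativeMemoryTracking=summary"
echo "$JAVA_OPTS"
# e.g.  java $JAVA_OPTS -jar fuseki-server.jar ...
# then inspect native allocations with:  jcmd <pid> VM.native_memory summary
```

With NMT enabled, `jcmd <pid> VM.native_memory summary` breaks the process footprint down by JVM subsystem, which can help decide whether growth is metaspace, direct buffers, or something invisible to the usual metrics.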
> A Log4j logger that is active will use a few objects - maybe that's enough
> to trigger a minor GC which in turn is enough to flush some non-heap
> resources.
>
> How big is the heap?
> This is Java 17?
>
> > We are testing this on Fuseki 4.8.0/TDB2 with close to 0 triples and
> > extremely low query traffic / health checks via /ping.
> >
> > ecs-task-memory: https://pasteboard.co/KPk7uhH2F9Lp.png
> >
> > Cheers,
> > Conal
> >
> > On 2023/07/11 09:31:25 Dave Reynolds wrote:
> > > Hi Rob,
> > >
> > > Good point. Will try to find time to experiment with that but given the
> > > testing cycle time that will take a while and can't start immediately.
> > >
> > > I'm a little sceptical though. As mentioned before, all the metrics we
> > > see show the direct memory pool that Jetty uses cycling up to the max heap
> > > size and then being collected, but with no long-term growth to match the
> > > process size growth. This really feels more like a bug (though not sure
> > > where) than tuning. The fact that actual behaviour doesn't match the
> > > documentation isn't encouraging.
> > >
> > > It's also pretty hard to figure out what the right pool configuration would
> > > be. This thing is just being asked to deliver a few metrics (12KB per
> > > request) several times a minute but manages to eat 500MB of direct
> > > buffer space every 5 mins. So what the right pool parameters are to
> > > support real usage peaks is not going to be easy to figure out.
> > >
> > > Nonetheless you are right. That's something that should be explored.
> > >
> > > Dave
> > >
> > >
> > > On 11/07/2023 09:45, Rob @ DNR wrote:
> > > > Dave
> > > >
> > > > Thanks for the further information.
> > > > Have you experimented with using Jetty 10 but providing more
> > > > detailed configuration? Fuseki supports providing detailed Jetty
> > > > configuration if needed via the --jetty-config option.
> > > >
> > > > The following section looks relevant:
> > > >
> > > > https://eclipse.dev/jetty/documentation/jetty-10/operations-guide/index.html#og-module-bytebufferpool
> > > >
> > > > It looks like the default is that Jetty uses a heuristic to
> > > > determine these values; sadly the heuristic in question is not detailed
> > > > in that documentation.
> > > >
> > > > Best guess from digging through their code is that the “heuristic”
> > > > is this:
> > > >
> > > > https://github.com/eclipse/jetty.project/blob/jetty-10.0.x/jetty-io/src/main/java/org/eclipse/jetty/io/AbstractByteBufferPool.java#L78-L84
> > > >
> > > > i.e., ¼ of the configured max heap size. This doesn’t necessarily
> > > > align with the exact sizes of process growth you see, but I note the
> > > > documentation does explicitly say that buffers used can go beyond these
> > > > limits, but that those will just be GC’d rather than pooled for reuse.
> > > >
> > > > Example byte buffer configuration at
> > > > https://github.com/eclipse/jetty.project/blob/9a05c75ad28ebad4abbe624fa432664c59763747/jetty-server/src/main/config/etc/jetty-bytebufferpool.xml#L4
> > > >
> > > > Any chance you could try customising this for your needs with stock
> > > > Fuseki and see if this allows you to make the process size smaller and
> > > > sufficiently predictable for your use case?
> > > >
> > > > Rob
> > > >
> > > > From: Dave Reynolds <da...@gmail.com>
> > > > Date: Tuesday, 11 July 2023 at 08:58
> > > > To: users@jena.apache.org <us...@jena.apache.org>
> > > > Subject: Re: Mystery memory leak in fuseki
> > > >
> > > > For interest[*] ...
> > > >
> > > > This is what the core JVM metrics look like when transitioning from a
> > > > Jetty 10 to a Jetty 9.4 instance.
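[A rough sketch of the kind of `--jetty-config` byte buffer pool override Rob suggests above. The class name, argument order, and values below are inferred from the linked jetty-bytebufferpool.xml and should be checked against the exact file for your Jetty version before use:]

```xml
<?xml version="1.0"?>
<!DOCTYPE Configure PUBLIC "-//Jetty//Configure//EN" "https://www.eclipse.org/jetty/configure_10_0.dtd">
<Configure>
  <!-- Sketch only: check the linked example file for exact arguments -->
  <New id="byteBufferPool" class="org.eclipse.jetty.io.ArrayByteBufferPool">
    <Arg type="int">0</Arg>          <!-- minCapacity -->
    <Arg type="int">-1</Arg>         <!-- factor (-1 = default) -->
    <Arg type="int">-1</Arg>         <!-- maxCapacity -->
    <Arg type="int">-1</Arg>         <!-- maxQueueLength -->
    <Arg type="long">33554432</Arg>  <!-- maxHeapMemory: e.g. 32MB instead of heap/4 -->
    <Arg type="long">33554432</Arg>  <!-- maxDirectMemory: e.g. 32MB -->
  </New>
</Configure>
```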
> > > > You can see the direct buffer cycling up
> > > > to 500MB (which happens to be the max heap setting) on Jetty 10, nothing
> > > > on Jetty 9. The drop in mapped buffers is just because TDB hadn't been
> > > > asked any queries yet.
> > > >
> > > > https://www.dropbox.com/scl/fi/9afhrztbb36fvzqkuw996/fuseki-jetty10-jetty9-transition.png?rlkey=7fpj4x1pn5mjnf3jjwenmp65m&dl=0
> > > >
> > > > Here's the same metrics around the time of triggering a TDB backup. Shows
> > > > the mapped buffer use for TDB but no significant impact on heap etc.
> > > >
> > > > https://www.dropbox.com/scl/fi/0s40vpizf94c4w3m2awna/fuseki-jetty10-backup.png?rlkey=ai31m6z58w0uex8zix8e9ctna&dl=0
> > > >
> > > > These are all on the same instance as the RES memory trace:
> > > >
> > > > https://www.dropbox.com/scl/fi/c58nqkr2hi193a84btedg/fuseki-4.9.0-jetty-9.4.png?rlkey=b7osnj6k1oy1xskl4j25zz6o8&dl=0
> > > >
> > > > Dave
> > > >
> > > > [*] I've been staring at metric graphs for so many days I may have a
> > > > distorted notion of what's interesting :)
> > > >
> > > > On 11/07/2023 08:39, Dave Reynolds wrote:
> > > >> After a 10 hour test of 4.9.0 with Jetty 9.4 on Java 17 in the
> > > >> production, containerized, environment it is indeed very stable.
> > > >>
> > > >> Running at less than 6% of memory on a 4GB machine compared to peaks of
> > > >> ~50% for versions with Jetty 10. RES shows as 240K with 35K shared
> > > >> (presumably mostly libraries).
> > > >>
> > > >> Copy of trace is:
> > > >> https://www.dropbox.com/scl/fi/c58nqkr2hi193a84btedg/fuseki-4.9.0-jetty-9.4.png?rlkey=b7osnj6k1oy1xskl4j25zz6o8&dl=0
> > > >>
> > > >> The high spikes on the left of the image are the prior run with
> > > >> out-of-the-box 4.7.0 on the same JVM.
> > > >>
> > > >> The small spike at 06:00 is a dump, so TDB was able to touch and scan all
> > > >> the (modest) data with a very minor blip in resident size (as you'd hope).
> > > >> JVM stats show the mapped buffers for TDB jumping up but confirm heap is
> > > >> stable at < 60M, non-heap 60M.
> > > >>
> > > >> Dave
> > > >>
> > > >> On 10/07/2023 20:52, Dave Reynolds wrote:
> > > >>> Since this thread has got complex, I'm posting this update here at the
> > > >>> top level.
> > > >>>
> > > >>> Thanks to folks, especially Andy and Rob, for suggestions and for
> > > >>> investigating.
> > > >>>
> > > >>> After a lot more testing at our end I believe we now have some
> > > >>> workarounds.
> > > >>>
> > > >>> First, at least on Java 17, the process growth does seem to level out.
> > > >>> Despite what I just said to Rob, having just checked our soak tests, a
> > > >>> Jena 4.7.0/Java 17 test with 500MB max heap has lasted for 7 days.
> > > >>> Process size oscillates between 1.5GB and 2GB but hasn't gone above
> > > >>> that in a week. The oscillation is almost entirely the cycling of the
> > > >>> direct memory buffers used by Jetty. Empirically those cycle up to
> > > >>> something comparable to the set max heap size, at least for us.
> > > >>>
> > > >>> While this week-long test was 4.7.0, based on earlier tests I suspect
> > > >>> 4.8.0 (and now 4.9.0) would also level out, at least on a timescale of
> > > >>> days.
> > > >>>
> > > >>> The key has been setting the max heap low. At 2GB and even 1GB (the
> > > >>> default on a 4GB machine) we saw higher peak levels of direct buffers,
> > > >>> and overall process size grew to around 3GB, at which point the
> > > >>> container is killed on the small machines. Though Java 17 does seem to
> > > >>> be better behaved than Java 11, so switching to that probably also
> > > >>> helped.
> > > >>>
> > > >>> Given the actual heap is low (50MB heap, 60MB non-heap), needing
> > > >>> 2GB to run in feels high but is workable. So my previously suggested
> > > >>> rule of thumb that, in this low memory regime, we allow 4x the max heap
> > > >>> size seems to work.
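[The "direct buffers cycle up to about the max heap" observation lines up with the quarter-of-max-heap heuristic Rob dug out of the Jetty 10 code earlier in the thread. A sketch of the arithmetic, assuming that reading is right; the class and method names are illustrative only:]

```java
public class PoolHeuristic {
    // Jetty 10's reported default (per the AbstractByteBufferPool link
    // above): budget roughly a quarter of the max heap for pooled buffers.
    static long heuristicBudget(long maxHeapBytes) {
        return maxHeapBytes / 4;
    }

    public static void main(String[] args) {
        long maxHeap = Runtime.getRuntime().maxMemory();
        // Note: buffers can exceed this budget; the excess is GC'd rather
        // than pooled, which matches the observed cycling up to ~maxHeap.
        System.out.printf("max heap=%dMB, heuristic pool budget=%dMB%n",
                maxHeap >> 20, heuristicBudget(maxHeap) >> 20);
    }
}
```

[On this reading, a 500MB max heap implies a ~128MB pool budget per pool, which is why lowering the heap also lowers the direct buffer peaks.]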
> > > >>>
> > > >>> Second, we're now pretty confident the issue is Jetty 10+.
> > > >>>
> > > >>> We've built a fuseki-server 4.9.0 with Jetty replaced by version
> > > >>> 9.4.51.v20230217. This required some minor source changes to compile
> > > >>> and pass tests. On a local bare metal test where we saw process growth
> > > >>> up to 1.5-2GB, this build has run stably using less than 500MB for 4
> > > >>> hours.
> > > >>>
> > > >>> We'll set a longer-term test running in the target containerized
> > > >>> environment to confirm things, but we're quite hopeful this will be
> > > >>> long-term stable.
> > > >>>
> > > >>> I realise Jetty 9.4.x is out of community support but Eclipse say EOL
> > > >>> is "unlikely to happen before 2025". So, while this may not be a
> > > >>> solution for the Jena project, it could give us a workaround at the
> > > >>> cost of doing custom builds.
> > > >>>
> > > >>> Dave
> > > >>>
> > > >>>
> > > >>> On 03/07/2023 14:20, Dave Reynolds wrote:
> > > >>>> We have a very strange problem with recent Fuseki versions when
> > > >>>> running (in Docker containers) on small machines. We suspect a Jetty
> > > >>>> issue but it's not clear.
> > > >>>>
> > > >>>> Wondering if anyone has seen anything like this.
> > > >>>>
> > > >>>> This is a production service but with tiny data (~250k triples, ~60MB
> > > >>>> as NQuads). It runs on 4GB machines with a Java heap allocation of
> > > >>>> 500MB[1].
> > > >>>>
> > > >>>> We used to run using 3.16 on JDK 8 (AWS Corretto for the long-term
> > > >>>> support) with no problems.
> > > >>>>
> > > >>>> Switching to Fuseki 4.8.0 on JDK 11, the process grows in the space of
> > > >>>> a day or so to reach ~3GB of memory, at which point the 4GB machine
> > > >>>> becomes unviable and things get OOM killed.
> > > >>>>
> > > >>>> The strange thing is that this growth happens when the system is
> > > >>>> answering no SPARQL queries at all, just regular health ping checks
> > > >>>> and (Prometheus) metrics scrapes from the monitoring systems.
> > > >>>>
> > > >>>> Furthermore, the space being consumed is not visible to any of the JVM
> > > >>>> metrics:
> > > >>>> - Heap and non-heap are stable at around 100MB total (mostly
> > > >>>> non-heap metaspace).
> > > >>>> - Mapped buffers stay at 50MB and remain long-term stable.
> > > >>>> - Direct memory buffers are allocated up to around 500MB then
> > > >>>> reclaimed. Since there are no SPARQL queries at all, we assume this is
> > > >>>> Jetty NIO buffers being churned as a result of the metric scrapes.
> > > >>>> However, this direct buffer behaviour seems stable; it cycles between
> > > >>>> 0 and 500MB on approx a 10 min cycle but is stable over a period of
> > > >>>> days and shows no leaks.
> > > >>>>
> > > >>>> Yet the Java process grows from an initial 100MB to at least 3GB.
> > > >>>> This can occur in the space of a couple of hours or can take up to a
> > > >>>> day or two, with no predictability in how fast.
> > > >>>>
> > > >>>> Presumably there is some low-level JNI space allocated by Jetty (?)
> > > >>>> which is invisible to all the JVM metrics and is not being reliably
> > > >>>> reclaimed.
> > > >>>>
> > > >>>> Trying 4.6.0, which we've had fewer problems with elsewhere, that
> > > >>>> seems to grow to around 1GB (plus up to 0.5GB for the cycling direct
> > > >>>> memory buffers) and then stays stable (at least on a three-day soak
> > > >>>> test). We could live with allocating 1.5GB to a system that should
> > > >>>> only need a few 100MB, but we're concerned that it may not be stable
> > > >>>> in the really long term and, in any case, would rather be able to
> > > >>>> update to more recent Fuseki versions.
> > > >>>>
> > > >>>> Trying 4.8.0 on Java 17, it grows rapidly to around 1GB again but then
> > > >>>> keeps ticking up slowly at random intervals. We project that it would
> > > >>>> take a few weeks to grow to the scale it did under Java 11, but it
> > > >>>> will still eventually kill the machine.
> > > >>>>
> > > >>>> Anyone seen anything remotely like this?
> > > >>>>
> > > >>>> Dave
> > > >>>>
> > > >>>> [1] 500M heap may be overkill but there can be some complex queries,
> > > >>>> and that should still leave plenty of space for OS buffers etc. in the
> > > >>>> remaining memory on a 4GB machine.
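[The heap, mapped buffer, and direct buffer figures quoted throughout this thread can be read straight from the JVM without a metrics stack, via the standard java.lang.management API. A minimal sketch; the class name BufferPoolWatch is illustrative:]

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;
import java.util.List;

public class BufferPoolWatch {
    // Report each JVM buffer pool ("direct", "mapped", ...) with its
    // buffer count, memory used, and total capacity, one per line.
    static String report() {
        StringBuilder sb = new StringBuilder();
        List<BufferPoolMXBean> pools =
                ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class);
        for (BufferPoolMXBean p : pools) {
            sb.append(String.format("%s: count=%d used=%d capacity=%d%n",
                    p.getName(), p.getCount(),
                    p.getMemoryUsed(), p.getTotalCapacity()));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Allocate a direct buffer so the "direct" pool shows non-zero usage.
        ByteBuffer buf = ByteBuffer.allocateDirect(1 << 20);
        System.out.print(report());
    }
}
```

[Polling this periodically (or scraping the equivalent Micrometer gauges Fuseki exposes) is how one can confirm, as above, that direct buffers cycle while the process RES keeps growing.]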