Thanks Dave. I am not familiar with Prometheus JVM metrics, but I gather it's an open source solution that you have coupled with Grafana for visualization. I will have a look into this.
Best,
Marco

On Tue, Jul 11, 2023 at 9:32 AM Dave Reynolds <dave.e.reyno...@gmail.com> wrote:

> Hi Marco,

> On 11/07/2023 09:04, Marco Neumann wrote:
> > Dave, can you say a bit more about the profiling methodology? Are you using a tool such as VisualVM to collect the data? Or do you just use the system monitor?

> The JVM metrics here are from prometheus scanning the metrics exposed by fuseki via the built-in micrometer (displayed using grafana). They give a *lot* of detail on things like GC behaviour etc which I'm not showing.
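> In case a concrete sketch helps: as far as I can tell the built-in micrometer support is essentially the standard micrometer JVM binders behind a prometheus registry. A minimal standalone sketch of the same idea (not Fuseki's actual wiring; the class name and choice of binders here are just illustrative) is roughly:

> import io.micrometer.core.instrument.binder.jvm.JvmGcMetrics;
> import io.micrometer.core.instrument.binder.jvm.JvmMemoryMetrics;
> import io.micrometer.prometheus.PrometheusConfig;
> import io.micrometer.prometheus.PrometheusMeterRegistry;
>
> public class JvmMetricsSketch {
>     public static void main(String[] args) {
>         PrometheusMeterRegistry registry = new PrometheusMeterRegistry(PrometheusConfig.DEFAULT);
>         // Heap/non-heap gauges plus the jvm.buffer.* gauges that the direct
>         // and mapped buffer graphs are drawn from.
>         new JvmMemoryMetrics().bindTo(registry);
>         try (JvmGcMetrics gcMetrics = new JvmGcMetrics()) {
>             // GC pause timings plus allocation/promotion counters.
>             gcMetrics.bindTo(registry);
>             // Prometheus text format; fuseki serves the equivalent over HTTP
>             // and prometheus pulls it on each scrape.
>             System.out.println(registry.scrape());
>         }
>     }
> }

> Grafana then just charts whatever series prometheus has scraped and stored.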
> Ironically, the only thing this fuseki was doing when it died originally was supporting these metric scans, and the health check ping.

> The overall memory curve is picked up by telegraf scanning the OS level stats for the docker processes (collected via influx DB and again displayed in grafana). These are what you would get with e.g. top on the machine or a system monitor, but it means we have longer term records which we can access remotely. When I quoted 240K RES, 35K shared, that was actually just top on the machine.

> When running locally you can also use things like jconsole or visualVM, but I actually find the prometheus + telegraf metrics we have in our production monitoring more detailed and easier to work with. We run lots of services, so the monitoring and alerting stack, while all industry standard, has been a life saver for us.

> For doing the debugging locally I also tried setting the JVM flags to enable finer-grained native memory tracking and used jcmd (in a scripted loop) to pull out those more detailed metrics, though they are not that much more detailed than the micrometer/prometheus metrics.
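> The loop itself is nothing special; sketched in Java here just for illustration (the pid argument, the five minute interval and writing to stdout are arbitrary choices, and the target JVM needs to be started with -XX:NativeMemoryTracking=summary, or =detail, for jcmd to have anything to report):

> import java.util.concurrent.TimeUnit;
>
> public class NativeMemoryPoll {
>     public static void main(String[] args) throws Exception {
>         String pid = args[0]; // pid of the fuseki JVM
>         while (true) {
>             // Same output as running "jcmd <pid> VM.native_memory summary" by hand.
>             new ProcessBuilder("jcmd", pid, "VM.native_memory", "summary")
>                     .inheritIO()
>                     .start()
>                     .waitFor();
>             TimeUnit.MINUTES.sleep(5);
>         }
>     }
> }

> Diffing successive summaries shows which NMT category, if any, is growing in step with RES.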
> That use of jcmd and the caution on how to interpret RES came from the blog item I mentioned earlier:
> https://poonamparhar.github.io/troubleshooting_native_memory_leaks/

> For the memory leak checking I used valgrind but there's lots of others.

> Dave

> > Marco

> > On Tue, Jul 11, 2023 at 8:57 AM Dave Reynolds <dave.e.reyno...@gmail.com> wrote:

> >> For interest[*] ...

> >> This is what the core JVM metrics look like when transitioning from a Jetty 10 to a Jetty 9.4 instance. You can see the direct buffer cycling up to 500MB (which happens to be the max heap setting) on Jetty 10, nothing on Jetty 9. The drop in Mapped buffers is just because TDB hadn't been asked any queries yet.

> >> https://www.dropbox.com/scl/fi/9afhrztbb36fvzqkuw996/fuseki-jetty10-jetty9-transition.png?rlkey=7fpj4x1pn5mjnf3jjwenmp65m&dl=0

> >> Here's the same metrics around the time of triggering a TDB backup. It shows the mapped buffer use for TDB but no significant impact on heap etc.

> >> https://www.dropbox.com/scl/fi/0s40vpizf94c4w3m2awna/fuseki-jetty10-backup.png?rlkey=ai31m6z58w0uex8zix8e9ctna&dl=0

> >> These are all on the same instance as the RES memory trace:

> >> https://www.dropbox.com/scl/fi/c58nqkr2hi193a84btedg/fuseki-4.9.0-jetty-9.4.png?rlkey=b7osnj6k1oy1xskl4j25zz6o8&dl=0

> >> Dave

> >> [*] I've been staring at metric graphs for so many days I may have a distorted notion of what's interesting :)

> >> On 11/07/2023 08:39, Dave Reynolds wrote:
> >>> After a 10 hour test of 4.9.0 with Jetty 9.4 on java 17 in the production, containerized, environment, it is indeed very stable.

> >>> Running at less than 6% of memory on a 4GB machine compared to peaks of ~50% for versions with Jetty 10. RES shows as 240K with 35K shared (presumably mostly libraries).

> >>> Copy of trace is:
> >>> https://www.dropbox.com/scl/fi/c58nqkr2hi193a84btedg/fuseki-4.9.0-jetty-9.4.png?rlkey=b7osnj6k1oy1xskl4j25zz6o8&dl=0

> >>> The high spikes on the left of the image are the prior run with out-of-the-box 4.7.0 on the same JVM.

> >>> The small spike at 06:00 is a dump, so TDB was able to touch and scan all the (modest) data with a very minor blip in resident size (as you'd hope). JVM stats show the mapped buffers for TDB jumping up but confirm heap is stable at < 60M, non-heap 60M.

> >>> Dave

> >>> On 10/07/2023 20:52, Dave Reynolds wrote:
> >>>> Since this thread has got complex, I'm posting this update here at the top level.

> >>>> Thanks to folks, especially Andy and Rob, for suggestions and for investigating.

> >>>> After a lot more testing at our end I believe we now have some workarounds.

> >>>> First, at least on java 17, the process growth does seem to level out. Despite what I just said to Rob, having just checked our soak tests, a jena 4.7.0/java 17 test with 500MB max heap has lasted for 7 days. Process size oscillates between 1.5GB and 2GB but hasn't gone above that in a week. The oscillation is almost entirely the cycling of the direct memory buffers used by Jetty. Empirically those cycle up to something comparable to the set max heap size, at least for us (presumably because -XX:MaxDirectMemorySize defaults to the max heap size when it isn't set explicitly).

> >>>> While this week long test was 4.7.0, based on earlier tests I suspect 4.8.0 (and now 4.9.0) would also level out, at least on a timescale of days.

> >>>> The key has been setting the max heap low. At 2GB, and even 1GB (the default on a 4GB machine), we see higher peak levels of direct buffers and the overall process size grew to around 3GB, at which point the container is killed on the small machines. Though java 17 does seem to be better behaved than java 11, so switching to that probably also helped.

> >>>> Given the actual heap is low (50MB heap, 60MB non-heap), needing 2GB to run in feels high but is workable. So my previously suggested rule of thumb for this low memory regime, allow 4x the max heap size, seems to work.

> >>>> Second, we're now pretty confident the issue is jetty 10+.

> >>>> We've built a fuseki-server 4.9.0 with Jetty replaced by version 9.4.51.v20230217. This required some minor source changes to compile and pass tests. On a local bare metal test where we saw process growth up to 1.5-2GB, this build has run stably using less than 500MB for 4 hours.

> >>>> We'll set a longer term test running in the target containerized environment to confirm things, but we're quite hopeful this will be long term stable.

> >>>> I realise Jetty 9.4.x is out of community support but eclipse say EOL is "unlikely to happen before 2025". So, while this may not be a solution for the Jena project, it could give us a workaround at the cost of doing custom builds.

> >>>> Dave

> >>>> On 03/07/2023 14:20, Dave Reynolds wrote:
> >>>>> We have a very strange problem with recent fuseki versions when running (in docker containers) on small machines. Suspect a jetty issue but it's not clear.

> >>>>> Wondering if anyone has seen anything like this.

> >>>>> This is a production service but with tiny data (~250k triples, ~60MB as NQuads). Runs on 4GB machines with a java heap allocation of 500MB[1].

> >>>>> We used to run using 3.16 on jdk 8 (AWS Corretto for the long term support) with no problems.

> >>>>> Switching to fuseki 4.8.0 on jdk 11, the process grows in the space of a day or so to reach ~3GB of memory, at which point the 4GB machine becomes unviable and things get OOM killed.

> >>>>> The strange thing is that this growth happens when the system is answering no Sparql queries at all, just regular health ping checks and (prometheus) metrics scrapes from the monitoring systems.

> >>>>> Furthermore, the space being consumed is not visible to any of the JVM metrics:
> >>>>> - Heap and non-heap are stable at around 100MB total (mostly non-heap metaspace).
> >>>>> - Mapped buffers stay at 50MB and remain long term stable.
> >>>>> - Direct memory buffers are allocated up to around 500MB and then reclaimed. Since there are no sparql queries at all, we assume this is jetty NIO buffers being churned as a result of the metric scrapes. However, this direct buffer behaviour seems stable; it cycles between 0 and 500MB on approx a 10 min cycle but is stable over a period of days and shows no leaks.

> >>>>> Yet the java process grows from an initial 100MB to at least 3GB. This can occur in the space of a couple of hours or can take up to a day or two, with no predictability in how fast.

> >>>>> Presumably there is some low level JNI space allocated by Jetty (?) which is invisible to all the JVM metrics and is not being reliably reclaimed.

> >>>>> Trying 4.6.0, which we've had fewer problems with elsewhere, it seems to grow to around 1GB (plus up to 0.5GB for the cycling direct memory buffers) and then stay stable (at least on a three day soak test). We could live with allocating 1.5GB to a system that should only need a few 100MB, but we're concerned that it may not be stable in the really long term and, in any case, we would rather be able to update to more recent fuseki versions.

> >>>>> Trying 4.8.0 on java 17, it grows rapidly to around 1GB again but then keeps ticking up slowly at random intervals. We project that it would take a few weeks to grow to the scale it did under java 11, but it will still eventually kill the machine.

> >>>>> Anyone seen anything remotely like this?

> >>>>> Dave

> >>>>> [1] 500M heap may be overkill but there can be some complex queries and that should still leave plenty of space for OS buffers etc in the remaining memory on a 4GB machine.

--
---
Marco Neumann