> We use TDB rather than TDB2

Dave, what is the reason behind that?

On Fri, Jan 16, 2026, 4:39 AM Dave Reynolds <[email protected]>
wrote:

> Separate from the memory limits, which others have discussed, storage
> performance makes a big difference.
>
> We successfully run in AWS EKS Kubernetes with dataset sizes around 120
> million triples (and larger datasets elsewhere), but to make that work
> well we use NVMe ephemeral storage rather than EBS/EFS, on instance
> types with large ephemeral disks such as i4i.large. There are various
> ways to use ephemeral storage from k8s, but we found the simple
> brute-force approach best: map the ephemeral disk to the container
> storage area so that emptyDir volumes land on it, and use those for the
> Fuseki database area. Since instance-store data is lost when the node
> goes away, you also need good monitoring, backups, and container init
> procedures.
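>
> As a rough illustration (not our exact manifests), the pod side of that
> looks something like the following, assuming the node has been
> bootstrapped so that the kubelet/container storage area sits on the
> NVMe disk (the image name and mount path are placeholders):
>
>     apiVersion: v1
>     kind: Pod
>     metadata:
>       name: fuseki
>     spec:
>       nodeSelector:
>         # Well-known label; pin to an NVMe-backed instance type.
>         node.kubernetes.io/instance-type: i4i.large
>       containers:
>         - name: fuseki
>           image: my-registry/fuseki:latest   # placeholder image
>           volumeMounts:
>             - name: database
>               mountPath: /fuseki/databases   # illustrative database area
>       volumes:
>         # emptyDir is allocated from the node's container storage,
>         # which the bootstrap step has mapped onto the ephemeral NVMe.
>         - name: database
>           emptyDir: {}
>
> The only important part is that the database directory ends up on local
> NVMe rather than on EBS/EFS.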
>
> It's easy enough to test on a simple EC2 instance whether NVMe gives
> you much performance benefit for your query patterns and then, only if
> so, figure out how you want to manage ephemeral storage in k8s.
>
> Oh, and on memory use: assuming you have a Prometheus/Grafana or
> similar monitoring stack set up, the JVM metrics are very handy guides.
> The container WSS (working set size) metric is the one Kubernetes
> accounts against, and it should run rather higher than the JVM total;
> that difference is largely the buffered pages Rob mentions. We
> typically expect it (and thus the pod memory request) to be 2-3 times
> the committed heap.
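>
> If it helps, that comparison can be written down as a Prometheus
> recording rule, roughly like this (metric and label names will depend
> on your scrape configuration, so treat it as a sketch):
>
>     groups:
>       - name: fuseki-memory
>         rules:
>           # Container working set vs committed JVM heap; by the rule
>           # of thumb above this ratio tends to sit around 2-3.
>           - record: fuseki:wss_to_committed_heap:ratio
>             expr: |
>               container_memory_working_set_bytes{container="fuseki"}
>                 / on (pod)
>               sum by (pod) (jvm_memory_committed_bytes{area="heap"})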
>
> We use TDB rather than TDB2, so our experience may not be fully
> representative.
>
> Dave
>
> On 15/01/2026 12:24, Vince Wouters via users wrote:
> > Hello Jena community,
> >
> > We’re looking for guidance on what other avenues are worth exploring to
> > improve overall query performance on our Apache Jena Fuseki instance,
> > which consists of multiple datasets, each containing millions of triples.
> >
> >
> > *Setup*
> >
> >     - Apache Jena Fuseki *5.5.0*
> >     - TDB2-backed datasets
> >     - Running on *AWS EKS (Kubernetes)*
> >     - Dataset size: ~15.6 million triples
> >
> >
> > *Infrastructure*
> >
> >     - Instances tested:
> >        - *c5a.2xlarge* (16 GiB instance, 12 GiB pod memory)
> >        - *c5a.4xlarge* (32 GiB instance, 28 GiB pod memory)
> >     - JVM memory derived from container limits (sketch below)
> >     - Grafana metrics show no storage bottleneck (IOPS and throughput
> >       remain well within limits)
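> >
> > (For illustration, the heap is derived from the container limit with
> > something like the following in the container spec; the values and
> > the env var name are image-specific:)
> >
> >     containers:
> >       - name: fuseki
> >         resources:
> >           limits:
> >             memory: "12Gi"
> >         env:
> >           # Heap sized as a fraction of the container limit rather
> >           # than a fixed -Xmx; variable name depends on the image.
> >           - name: JAVA_OPTIONS
> >             value: "-XX:MaxRAMPercentage=50"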
> >
> > *Test Queries*
> > SELECT (COUNT(DISTINCT ?s) AS ?sCount) WHERE { GRAPH ?g { ?s ?p ?o } }
> >
> > Takes around 80 seconds for our dataset.
> >
> > SELECT (COUNT(DISTINCT CONCAT(STR(?s), STR(?p))) AS ?c)
> > WHERE { GRAPH ?g { ?s ?p ?o } }
> >
> > Takes around 120 seconds for our dataset.
> >
> > *What we’ve observed*
> >
> >     - The first query is stable once a minimum heap is available.
> >     - The second query is memory-intensive:
> >        - On the smaller instance, it will time out once available heap
> >          drops below a certain threshold.
> >        - On the larger instance we see clear improvements, but not linear
> >        scaling.
> >     - Increasing heap helps to a point, but does not feel like the full
> >     solution.
> >
> >
> > *Other things we’ve tried*
> >
> >     - TDB optimizer, but that isn’t an option with our number of datasets
> >     and graphs, as far as we can tell.
> >
> > *Question*
> > Given this type of workload and dataset size, what other routes should we
> > consider to improve performance, beyond simply adjusting heap size?
> >
> > In production, we have multiple datasets consisting of millions of
> > triples, and our end goal is to improve query times for our users.
> >
> > Any guidance or pointers would be much appreciated.
> >
> > Thanks in advance.
> >
>
>
