Separate from the memory limits, which others have discussed, storage performance makes a big difference.

We successfully run on AWS EKS (Kubernetes) with dataset sizes of around 120 million triples (and larger datasets elsewhere), but to make that work well we use NVMe ephemeral storage rather than EBS/EFS, on instance types with large ephemeral drives such as i4i.large. There are various ways to use ephemeral storage from k8s, but we found the simple brute-force approach works best: map the ephemeral drive to the container storage area so that emptyDir volumes land on it, and use those for the Fuseki database area. Make sure you have good monitoring, backup and container init procedures.
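If it helps, a minimal sketch of the pod side of that arrangement looks something like the following. It assumes the node's kubelet/container storage has already been remapped onto the NVMe instance store at node bootstrap (that part is outside the manifest), and the image name and mount path are placeholders rather than our actual setup:

apiVersion: v1
kind: Pod
metadata:
  name: fuseki
spec:
  nodeSelector:
    node.kubernetes.io/instance-type: i4i.large   # nodes with local NVMe
  containers:
    - name: fuseki
      image: example/fuseki:latest                # placeholder image
      volumeMounts:
        - name: databases
          mountPath: /fuseki/databases            # placeholder TDB database directory
  volumes:
    - name: databases
      emptyDir: {}   # lands on the node's ephemeral NVMe once kubelet storage is remapped

Because emptyDir is, by definition, ephemeral, that is where the backup and container init procedures mentioned above come in.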

It is easy enough to test on a simple EC2 instance whether NVMe gives you much performance benefit for your query patterns and then, only if so, figure out how you want to manage ephemeral storage in k8s.

Oh, and on memory use: assuming you have a Prometheus/Grafana or similar monitoring stack set up, the JVM metrics are very handy guides. The container working set size (WSS) metric is the one Kubernetes itself uses for memory accounting, and it should be running rather higher than the JVM total; the difference is largely the buffered pages Rob mentions. We typically expect the WSS (and thus the pod memory request) to be 2-3 times the committed heap.
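As a rough illustration of that sizing rule (the numbers are invented for the example, and the environment variable used to pass JVM options depends on which Fuseki image you run), the container part of the pod spec would look something like:

containers:
  - name: fuseki
    env:
      - name: JVM_ARGS                  # variable name depends on your image; placeholder here
        value: "-Xms4g -Xmx4g"          # committed heap of 4 GiB
    resources:
      requests:
        memory: 10Gi                    # ~2-3x the heap, leaving room for buffered/mapped pages
      limits:
        memory: 12Gi

The gap between the heap and the request is what shows up in Grafana as the difference between the JVM total and the container WSS.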

We use TDB rather than TDB2, so our experience may not be fully representative.

Dave

On 15/01/2026 12:24, Vince Wouters via users wrote:
Hello Jena community,

We’re looking for guidance on what other avenues are worth exploring to
improve overall query performance on our Apache Jena Fuseki instance, which
consists of multiple datasets, each containing millions of triples.


*Setup*

    - Apache Jena Fuseki *5.5.0*
    - TDB2-backed datasets
    - Running on *AWS EKS (Kubernetes)*
    - Dataset size: ~15.6 million triples


*Infrastructure*

    - Instances tested:
       - *c5a.2xlarge* (16 GiB instance, 12 GiB pod memory)
       - *c5a.4xlarge* (32 GiB instance, 28 GiB pod memory)
    - JVM memory derived from container limits
    - Grafana metrics show no storage bottleneck (IOPS and throughput remain
    well within limits)

*Test Queries*
SELECT (COUNT(DISTINCT ?s) AS ?sCount) WHERE { GRAPH ?g { ?s ?p ?o } }

Takes around 80 seconds for our dataset.

SELECT (COUNT(DISTINCT CONCAT(STR(?s), STR(?p))) AS ?c) WHERE { GRAPH ?g { ?s ?p ?o } }

Takes around 120 seconds for our dataset.

*What we’ve observed*

    - The first query is stable once a minimum heap is available.
    - The second query is memory-intensive:
       - On the smaller instance, it will time out once available heap drops
       below a certain threshold.
       - On the larger instance we see clear improvements, but not linear
       scaling.
    - Increasing heap helps to a point, but does not feel like the full
    solution.


*Other things we’ve tried*

    - TDB optimizer, but that isn’t an option with our number of datasets
    and graphs, as far as we can tell.

*Question*
Given this type of workload and dataset size, what other routes should we
consider to improve performance, beyond simply adjusting heap size?

In production, we have multiple datasets consisting of millions of triples,
and our end goal is to improve query times for our users.

Any guidance or pointers would be much appreciated.

Thanks in advance.
