> We use TDB rather than TDB2

Dave, what is the reason behind that?
On Fri, Jan 16, 2026, 4:39 AM Dave Reynolds <[email protected]> wrote:

> Separate from the memory limits, which others have discussed, storage
> performance makes a big difference.
>
> We successfully run in AWS EKS Kubernetes with dataset sizes around 120
> million triples (and larger datasets elsewhere), but to make that work
> well we use NVMe ephemeral storage rather than EBS/EFS. We use instances
> with large ephemeral storage, like i4i.large, for that. There are various
> ways to use ephemeral storage from k8s, but we found the simple brute-force
> approach the best - map ephemeral storage to the container storage area so
> emptyDir volumes are on ephemeral, and use those for the Fuseki database
> area. Have good monitoring, good backup, and good container init procedures.
>
> It is easy enough to test on a simple EC2 instance to see whether NVMe
> gives you much performance benefit for your query patterns and then, only
> if so, figure out how you want to manage ephemeral storage in k8s.
>
> Oh, and on memory use: assuming you have a Prometheus/Grafana or similar
> monitoring stack set up, the JVM metrics are very handy guides. The
> container WSS (working set size) metric equates to the k8s metric and
> should be running rather higher than the JVM total. That difference is
> largely the buffered pages Rob mentions. We typically expect that (and
> thus the pod memory request) to be 2-3 times the committed heap.
>
> We use TDB rather than TDB2, so our experience may not be fully
> representative.
>
> Dave
>
> On 15/01/2026 12:24, Vince Wouters via users wrote:
> > Hello Jena community,
> >
> > We’re looking for guidance on what other avenues are worth exploring to
> > improve overall query performance on our Apache Jena Fuseki instance,
> > which consists of multiple datasets, each containing millions of triples.
> >
> > *Setup*
> >
> > - Apache Jena Fuseki *5.5.0*
> > - TDB2-backed datasets
> > - Running on *AWS EKS (Kubernetes)*
> > - Dataset size: ~15.6 million triples
> >
> > *Infrastructure*
> >
> > - Instances tested:
> >   - *c5a.2xlarge* (16 GiB instance, 12 GiB pod memory)
> >   - *c5a.4xlarge* (32 GiB instance, 28 GiB pod memory)
> > - JVM memory derived from container limits
> > - Grafana metrics show no storage bottleneck (IOPS and throughput remain
> >   well within limits)
> >
> > *Test Queries*
> >
> > SELECT (COUNT(DISTINCT ?s) AS ?sCount) WHERE { GRAPH ?g { ?s ?p ?o } }
> >
> > Takes around 80 seconds for our dataset.
> >
> > SELECT (COUNT(DISTINCT CONCAT(STR(?s), STR(?p))) AS ?c)
> > WHERE { GRAPH ?g { ?s ?p ?o } }
> >
> > Takes around 120 seconds for our dataset.
> >
> > *What we’ve observed*
> >
> > - The first query is stable once a minimum heap is available.
> > - The second query is memory-intensive:
> >   - On the smaller instance, it times out once the available heap drops
> >     below a certain threshold.
> >   - On the larger instance we see clear improvements, but not linear
> >     scaling.
> > - Increasing heap helps to a point, but does not feel like the full
> >   solution.
> >
> > *Other things we’ve tried*
> >
> > - The TDB optimizer, but that isn’t an option with our number of
> >   datasets and graphs, as far as we can tell.
> >
> > *Question*
> >
> > Given this type of workload and dataset size, what other routes should
> > we consider to improve performance, beyond simply adjusting heap size?
> >
> > In production, we have multiple datasets consisting of millions of
> > triples, and our end goal is to improve query times for our users.
> >
> > Any guidance or pointers would be much appreciated.
> > Thanks in advance.
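
For anyone following the thread, here is a rough sketch of what Dave's
emptyDir-on-ephemeral setup and the 2-3x heap sizing could look like as a
pod spec. The image name, paths, env var name and sizes are illustrative
assumptions rather than a tested configuration, and it presumes the node
bootstrap already mounts the NVMe instance store under the kubelet root,
so that emptyDir volumes land on local NVMe rather than EBS:

apiVersion: v1
kind: Pod
metadata:
  name: fuseki
spec:
  containers:
    - name: fuseki
      image: my-registry/jena-fuseki:latest   # placeholder image
      env:
        # Env var name depends on the image; the point is a fixed committed heap.
        - name: JAVA_OPTIONS
          value: "-Xmx8g"
      resources:
        requests:
          memory: 20Gi        # roughly 2.5x the 8g heap, per the 2-3x guideline
        limits:
          memory: 24Gi
      volumeMounts:
        - name: databases
          mountPath: /fuseki/databases   # TDB database area; path depends on image
  volumes:
    - name: databases
      emptyDir: {}   # on NVMe only if the kubelet/container root sits on the instance store

The emptyDir declaration does nothing special by itself; any benefit comes
from the node-level mount of the instance store, which is why Dave suggests
first checking on a plain EC2 instance whether NVMe helps your query
patterns at all.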

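As an aside on the second test query above: the CONCAT builds a string for
every (?s, ?p, ?o) row before the DISTINCT can deduplicate, which is
presumably where much of the heap goes. If the intent is to count distinct
subject/predicate pairs, a subquery form avoids materialising the strings.
It is not strictly equivalent (CONCAT can in principle conflate pairs whose
string forms happen to collide) and may or may not be faster on your data,
so treat it as something to test rather than a drop-in replacement:

SELECT (COUNT(*) AS ?c)
WHERE {
  SELECT DISTINCT ?s ?p
  WHERE { GRAPH ?g { ?s ?p ?o } }
}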