This could be relevant:

"COTTAS: Columnar Triple Table Storage for Efficient and Compressed RDF Management"
https://sferrada.com/publication/2025-iswc-arenas-guerrero-cottas/2025-iswc-arenas-guerrero-cottas.pdf

On Thu, 15 Jan 2026 at 15.00, Martynas Jusevičius <[email protected]> wrote:

> Hey Rob,
>
> I've changed the subject so as to not derail the memory optimization
> thread.
>
> Given how many times this topic has come up on this list, and how often
> Jena is struggling with the types of queries and the sizes of datasets that
> it should be able to handle in theory, maybe the problem is the Java-based
> architecture of TDB?
>
> Maybe it requires a new type of persistence backend such as RocksDB? We
> know there is some prototype of this: https://github.com/afs/TDB3
> We also know that Stardog is using RocksDB for storage:
> https://docs.stardog.com/operating-stardog/database-administration/storage-optimize
>
> IMO the lack of scalable open-source triplestores is one of the main pain
> points in the RDF ecosystem.
> I love Jena and Fuseki and I'm using it as the default triplestore in my
> projects, but I have doubts whether I could use it in a high-load
> production system.
>
> I also know this is an open-source project with limited resources, but
> that is a different topic.
>
> Martynas
> atomgraph.com
>
> On Thu, Jan 15, 2026 at 2:34 PM Rob @ DNR <[email protected]> wrote:
>
>> Hi Vince
>>
>> > JVM memory derived from container limits
>>
>> What do you mean by this specifically?
>>
>> As has been discussed and referenced previously on this list, and is
>> noted in our TDB FAQs [1], much of the memory usage for TDB databases is in
>> terms of off-heap usage via memory mapped files.
>>
>> Therefore, setting the JVM heap too high can actually reduce performance
>> as the JVM is then competing against the OS for memory and forcing the
>> mapped files to be paged out reducing performance.
>>
>> So firstly, I’d make sure you aren’t setting the JVM heap to use too much
>> of your available memory. Ensure you are leaving some headroom between JVM
>> heap and container limits for OS usage for the memory mapped files. Since
>> you mention you have Grafana in place I’d also look at any metrics that
>> might be available around memory mapped file usage/paging etc. to see if
>> this might be your problem.
>>
>> > The second query is memory-intensive
>>
>> Yes, operators like DISTINCT that require the query engine to keep large
>> chunks of the data in-memory are always going to be memory-intensive. The
>> Jena query engine is generally designed for lazy streaming and calculation
>> of results as much as possible. If you have control over queries being
>> issued then I would look at whether you actually need to be using operators
>> like DISTINCT in your queries.
>>
>> > TDB optimizer, but that isn’t an option with our number of datasets
>> > and graphs, as far as we can tell
>>
>> Don’t really follow this statement. I assume you’re referring to the
>> optional stats based optimiser? Unless your datasets are being frequently
>> updated, I don’t see why you wouldn’t gain some value from generating the
>> stats for each dataset. Remember that the TDB optimizer works on a
>> per-dataset basis so you can generate stats files for each dataset, or some
>> subset of your datasets, placing each stats file into the relevant database
>> directory and they don’t interact with each other.
>>
>> > In production, we have multiple datasets consisting of millions of
>> > triples, and our end goal is to improve query times for our users.
>>
>> Often the best way to improve query times for users is either to exert
>> more control over the queries (if the queries aren’t end-user controlled)
>> using tools like Jena’s qparse [2] to analyse your queries and experiment
>> with modifications to them that might optimise better. Or if you permit
>> arbitrary queries to educate/train your users on best practises/how to
>> write better queries/SPARQL optimisation etc.
>>
>> Another thing to consider is that if you aren’t doing federated queries
>> across your multiple datasets you might actually be better off having
>> independent smaller instances of Fuseki running on smaller AWS nodes, each
>> serving a separate dataset. This would give you more flexibility to tune
>> the resources, JVM heap etc. for each dataset depending on its needs.
>>
>> Hope this helps,
>>
>> Rob
>>
>> [1] https://jena.apache.org/documentation/tdb/faqs.html#java-heap
>> [2] https://jena.apache.org/documentation/query/explain.html
>>
>> From: Vince Wouters via users <[email protected]>
>> Date: Thursday, 15 January 2026 at 12:25
>> To: [email protected] <[email protected]>
>> Cc: Vince Wouters <[email protected]>
>> Subject: Increasing Apache Jena Performance
>>
>> Hello Jena community,
>>
>> We’re looking for guidance on what other avenues are worth exploring to
>> improve overall query performance on our Apache Jena Fuseki instance,
>> which consists of multiple datasets, each containing millions of triples.
>>
>> *Setup*
>>
>> - Apache Jena Fuseki *5.5.0*
>> - TDB2-backed datasets
>> - Running on *AWS EKS (Kubernetes)*
>> - Dataset size: ~15.6 million triples
>>
>> *Infrastructure*
>>
>> - Instances tested:
>>    - *c5a.2xlarge* (16 GiB instance, 12 GiB pod memory)
>>    - *c5a.4xlarge* (32 GiB instance, 28 GiB pod memory)
>> - JVM memory derived from container limits
>> - Grafana metrics show no storage bottleneck (IOPS and throughput remain
>>   well within limits)
>>
>> *Test Queries*
>>
>> SELECT (COUNT(DISTINCT ?s) AS ?sCount) WHERE { GRAPH ?g { ?s ?p ?o } }
>>
>> Takes around 80 seconds for our dataset.
>>
>> SELECT (COUNT(DISTINCT CONCAT(STR(?s), STR(?p))) AS ?c) WHERE { GRAPH ?g { ?s ?p ?o } }
>>
>> Takes around 120 seconds for our dataset.
>>
>> *What we’ve observed*
>>
>> - The first query is stable once a minimum heap is available.
>> - The second query is memory-intensive:
>>    - On the smaller instance, it will time out once available heap drops
>>      below a certain threshold.
>>    - On the larger instance we see clear improvements, but not linear
>>      scaling.
>> - Increasing heap helps to a point, but does not feel like the full
>>   solution.
>>
>> *Other things we’ve tried*
>>
>> - TDB optimizer, but that isn’t an option with our number of datasets
>>   and graphs, as far as we can tell.
>>
>> *Question*
>>
>> Given this type of workload and dataset size, what other routes should we
>> consider to improve performance, beyond simply adjusting heap size?
>>
>> In production, we have multiple datasets consisting of millions of
>> triples, and our end goal is to improve query times for our users.
>>
>> Any guidance or pointers would be much appreciated.
>>
>> Thanks in advance.
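
For what it's worth, Rob's point in the quoted thread about leaving headroom between the JVM heap and the container limit can be sketched as simple arithmetic. This is only an illustration: the function names and the 0.5 heap fraction are hypothetical placeholders, not a recommendation from the thread or from the Jena documentation. The 12 GiB and 28 GiB pod limits are the ones Vince listed, and `-Xmx` is the standard JVM maximum-heap flag.

```python
def suggest_heap_mib(container_limit_mib: int, heap_fraction: float = 0.5) -> int:
    """Return a candidate JVM max heap (MiB) as a fraction of the pod limit.

    heap_fraction is an assumed tuning knob (hypothetical default 0.5);
    the remainder stays available to the OS for TDB's memory-mapped files.
    """
    if not 0.0 < heap_fraction < 1.0:
        raise ValueError("heap_fraction must be strictly between 0 and 1")
    return int(container_limit_mib * heap_fraction)


def headroom_mib(container_limit_mib: int, heap_mib: int) -> int:
    """Memory (MiB) left outside the heap for off-heap and OS page cache use."""
    return container_limit_mib - heap_mib


if __name__ == "__main__":
    # Pod memory limits mentioned in the thread: 12 GiB and 28 GiB.
    for limit_gib in (12, 28):
        limit_mib = limit_gib * 1024
        heap = suggest_heap_mib(limit_mib)
        spare = headroom_mib(limit_mib, heap)
        print(f"{limit_gib} GiB pod: -Xmx{heap}m, {spare} MiB left for mapped files")
```

The right fraction depends on the workload, so the paging and mapped-file metrics Rob mentions are the better guide than any fixed number.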
