Re: Java-based storage backend

Martynas Jusevičius Sun, 03 May 2026 12:46:01 -0700

This could be relevant:

COTTAS: Columnar Triple Table Storage for Efficient and Compressed RDF
Management
https://sferrada.com/publication/2025-iswc-arenas-guerrero-cottas/2025-iswc-arenas-guerrero-cottas.pdf


On Thu, 15 Jan 2026 at 15.00, Martynas Jusevičius <[email protected]>
wrote:

> Hey Rob,
>
> I've changed the subject so as to not derail the memory optimization
> thread.
>
> Given how many times this topic has come up on this list, and how often
> Jena struggles with the types of queries and the sizes of datasets that
> it should be able to handle in theory, maybe the problem is the Java-based
> architecture of TDB?
>
> Maybe it requires a new type of persistence backend such as RocksDB? We
> know there is some prototype of this: https://github.com/afs/TDB3
> We also know that Stardog is using RocksDB for storage:
> https://docs.stardog.com/operating-stardog/database-administration/storage-optimize
>
> IMO the lack of scalable open-source triplestores is one of the main pain
> points in the RDF ecosystem.
> I love Jena and Fuseki and I'm using it as the default triplestore in my
> projects, but I have doubts whether I could use it in a high-load
> production system.
>
> I also know this is an open-source project with limited resources, but
> that is a different topic.
>
> Martynas
> atomgraph.com
>
> On Thu, Jan 15, 2026 at 2:34 PM Rob @ DNR <[email protected]> wrote:
>
>> Hi Vince
>>
>> > JVM memory derived from container limits
>>
>> What do you mean by this specifically?
>>
>> As has been discussed and referenced previously on this list, and is
>> noted in our TDB FAQs [1], much of the memory usage for TDB databases is
>> off-heap, via memory-mapped files.
>>
>> Therefore, setting the JVM heap too high can actually reduce performance,
>> as the JVM is then competing against the OS for memory and forcing the
>> mapped files to be paged out.
>>
>> So firstly, I’d make sure you aren’t setting the JVM heap to use too much
>> of your available memory.  Ensure you are leaving some headroom between JVM
>> heap and container limits for OS usage for the memory mapped files.  Since
>> you mention you have Grafana in place I’d also look at any metrics that
>> might be available around memory mapped file usage/paging etc. to see if
>> this might be your problem.
>>
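As a rough illustration of the headroom advice above, here is a minimal
Python sketch that splits a container memory limit into a JVM heap and the
remainder left for the OS page cache that backs the memory-mapped files. The
helper names and the 25% heap fraction are assumptions made for this example
only, not a Jena recommendation; the right split depends on the workload and
should be checked against the paging metrics mentioned above.

```python
from typing import Tuple


def split_container_memory(container_limit_mib: int,
                           heap_fraction: float = 0.25) -> Tuple[int, int]:
    """Return (heap_mib, headroom_mib) for a given container limit.

    heap_fraction is a hypothetical starting point, not a documented
    Jena default. Everything not given to the heap stays available to
    the OS for memory-mapped TDB files.
    """
    if container_limit_mib <= 0:
        raise ValueError("container_limit_mib must be positive")
    if not 0 < heap_fraction < 1:
        raise ValueError("heap_fraction must be strictly between 0 and 1")
    heap_mib = int(container_limit_mib * heap_fraction)
    return heap_mib, container_limit_mib - heap_mib


def jvm_heap_flag(heap_mib: int) -> str:
    """Format the standard JVM maximum-heap option for a size in MiB."""
    return f"-Xmx{heap_mib}m"


if __name__ == "__main__":
    # Pod memory limits taken from the original post (12 GiB and 28 GiB).
    for label, gib in (("12 GiB pod", 12), ("28 GiB pod", 28)):
        heap, headroom = split_container_memory(gib * 1024)
        print(f"{label}: {jvm_heap_flag(heap)}, "
              f"{headroom} MiB left for mapped files and the OS")
```

With the pod sizes from the original post, this yields a heap of 3072 MiB
for the 12 GiB pod and 7168 MiB for the 28 GiB pod under the assumed 25%
fraction.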
>> > The second query is memory-intensive
>>
>> Yes, operators like DISTINCT that require the query engine to keep large
>> chunks of the data in-memory are always going to be memory-intensive.  The
>> Jena query engine is generally designed for lazy streaming and calculation
>> of results as much as possible.  If you have control over queries being
>> issued then I would look at whether you actually need to be using operators
>> like DISTINCT in your queries.
>>
>> > TDB optimizer, but that isn’t an option with our number of datasets
>> > and graphs, as far as we can tell
>>
>> Don’t really follow this statement.  I assume you’re referring to the
>> optional stats based optimiser?  Unless your datasets are being frequently
>> updated, I don’t see why you wouldn’t gain some value from generating the
>> stats for each dataset.  Remember that the TDB optimizer works on a
>> per-dataset basis so you can generate stats files for each dataset, or some
>> subset of your datasets, placing each stats file into the relevant database
>> directory and they don’t interact with each other.
>>
>> > In production, we have multiple datasets consisting of millions of
>> > triples, and our end goal is to improve query times for our users.
>>
>> Often the best way to improve query times for users is either to exert
>> more control over the queries (if the queries aren’t end-user controlled),
>> using tools like Jena’s qparse [2] to analyse your queries and experiment
>> with modifications to them that might optimise better, or, if you permit
>> arbitrary queries, to educate/train your users on best practices/how to
>> write better queries/SPARQL optimisation etc.
>>
>> Another thing to consider is that if you aren’t doing federated queries
>> across your multiple datasets you might actually be better off having
>> independent smaller instances of Fuseki running on smaller AWS nodes, each
>> serving a separate dataset.  This would give you more flexibility to tune
>> the resources, JVM heap etc. for each dataset depending on its needs.
>>
>> Hope this helps,
>>
>> Rob
>>
>>
>> [1] https://jena.apache.org/documentation/tdb/faqs.html#java-heap
>> [2] https://jena.apache.org/documentation/query/explain.html
>>
>> From: Vince Wouters via users <[email protected]>
>> Date: Thursday, 15 January 2026 at 12:25
>> To: [email protected] <[email protected]>
>> Cc: Vince Wouters <[email protected]>
>> Subject: Increasing Apache Jena Performance
>>
>> Hello Jena community,
>>
>> We’re looking for guidance on what other avenues are worth exploring to
>> improve overall query performance on our Apache Jena Fuseki instance,
>> which consists of multiple datasets, each containing millions of triples.
>>
>>
>> *Setup*
>>
>>    - Apache Jena Fuseki *5.5.0*
>>    - TDB2-backed datasets
>>    - Running on *AWS EKS (Kubernetes)*
>>    - Dataset size: ~15.6 million triples
>>
>>
>> *Infrastructure*
>>
>>    - Instances tested:
>>       - *c5a.2xlarge* (16 GiB instance, 12 GiB pod memory)
>>       - *c5a.4xlarge* (32 GiB instance, 28 GiB pod memory)
>>    - JVM memory derived from container limits
>>    - Grafana metrics show no storage bottleneck (IOPS and throughput
>>    remain well within limits)
>>
>> *Test Queries*
>> SELECT (COUNT(DISTINCT ?s) AS ?sCount) WHERE { GRAPH ?g { ?s ?p ?o } }
>>
>> Takes around 80 seconds for our dataset.
>>
>> SELECT (COUNT(DISTINCT CONCAT(STR(?s), STR(?p))) AS ?c) WHERE { GRAPH ?g { ?s ?p ?o } }
>>
>> Takes around 120 seconds for our dataset.
>>
>> *What we’ve observed*
>>
>>    - The first query is stable once a minimum heap is available.
>>    - The second query is memory-intensive:
>>       - On the smaller instance, it will time out once available heap
>>       drops below a certain threshold.
>>       - On the larger instance we see clear improvements, but not linear
>>       scaling.
>>    - Increasing heap helps to a point, but does not feel like the full
>>    solution.
>>
>>
>> *Other things we’ve tried*
>>
>>    - TDB optimizer, but that isn’t an option with our number of datasets
>>    and graphs, as far as we can tell.
>>
>> *Question*
>> Given this type of workload and dataset size, what other routes should we
>> consider to improve performance, beyond simply adjusting heap size?
>>
>> In production, we have multiple datasets consisting of millions of
>> triples, and our end goal is to improve query times for our users.
>>
>> Any guidance or pointers would be much appreciated.
>>
>> Thanks in advance.
>>
>
