Re: Why DatasetGraphInMemory?

Rob @ DNR Mon, 22 May 2023 02:07:45 -0700

Fuseki is effectively the Jena project’s database server: it allows sharing a 
single Jena Dataset amongst many processes and users.


This means that users expect database-server-like behaviour, i.e. 
transactions and read isolation, which the transactional in-memory dataset 
provides when running Fuseki in in-memory mode.
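
To make "read isolation" concrete, here is a minimal stdlib-only sketch. It is 
not Jena's implementation: SnapshotStore, beginRead and add are hypothetical 
names invented for this example, triples are plain strings, and a single writer 
is assumed. It only shows the general idea that a reader keeps the immutable 
state it started with while a writer publishes a new one.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical illustration only: this is not Jena's DatasetGraphInMemory.
final class SnapshotStore {
    // The currently published, immutable set of "triples".
    private final AtomicReference<Set<String>> current =
            new AtomicReference<>(Set.of());

    // "Begin read": capture whatever immutable state is current right now.
    Set<String> beginRead() {
        return current.get();
    }

    // "Write and commit": build a new immutable set and publish it atomically.
    // Copying the whole set is a simplification made for this sketch.
    synchronized void add(String triple) {
        Set<String> next = new HashSet<>(current.get());
        next.add(triple);
        current.set(Set.copyOf(next));
    }
}

class SnapshotDemo {
    public static void main(String[] args) {
        SnapshotStore store = new SnapshotStore();
        store.add("s1 p o");

        Set<String> reader = store.beginRead(); // reader starts here
        store.add("s2 p o");                    // a write commits afterwards

        System.out.println(reader.size());            // prints 1
        System.out.println(store.beginRead().size()); // prints 2
    }
}
```

The sketch copies the full set on every write, which would be far too slow for 
real data. Persistent data structures such as the Dexx collections mentioned in 
the quoted message avoid that by sharing structure between versions, at the 
price of slower individual operations than a plain mutable map.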

I’m not sure about the full context of that comment but I don’t think that’s 
entirely true.  It depends on how the user starts and runs Fuseki.  Most people 
who want a persistent dataset would be using TDB which has its own completely 
independent Dataset implementation, query executor and persistent data 
structures.

Broadly speaking, users of Fuseki run it in 3 main ways:

  1.  With TDB (the --loc=/path/to/db flag)
  2.  In memory (the --mem flag)
  3.  With a configuration file (the --config flag)

For 1, DatasetGraphInMemory doesn’t get used AFAIK; the TDB-specific 
implementations are used instead.  For 2, it’s the default dataset.  For 3, it 
will depend on what the user has placed in their configuration file, and might 
be a mixture of 1 and 2 plus inference, ancillary index wrappers 
(text/geospatial indexing), etc.

Again, I think you’re getting hung up on the wrong thing here. An improved 
in-memory Graph implementation will have benefits, but not necessarily for 
all use cases.  There are plenty of use cases where you just want to briefly 
load/generate a bunch of RDF in memory, manipulate it and move on, and those 
will benefit greatly from an improved in-memory implementation.

Fuseki, as a database server, likely won’t benefit (except perhaps in some 
people’s custom configuration setups).  However, people who want performance 
with Fuseki should already be using TDB anyway.

Hope this helps,

Rob

From: Arne Bernhardt <arne.bernha...@gmail.com>
Date: Friday, 19 May 2023 at 21:21
To: dev@jena.apache.org <dev@jena.apache.org>
Subject: Why DatasetGraphInMemory?
Hi,
in a recent response
<https://github.com/apache/jena/issues/1867#issuecomment-1546931793> to an
issue it was said that "Fuseki - uses DatasetGraphInMemory mostly".
For my PR <https://github.com/apache/jena/pull/1865>, I added a JMH
benchmark suite to the project. So it was easy for me to compare the
performance of GraphMem with
"DatasetGraphFactory.createTxnMem().getDefaultGraph()".
DatasetGraphInMemory is much slower in every discipline tested (#add,
#delete, #contains, #find, #stream).
Maybe my approach is too naive?
I understand very well that the underlying Dexx Collections Framework, with
its immutable persistent data structures, makes threading and transaction
handling easy and that there are no issues with consuming iterators or
streams even after a read transaction has closed.
Is it currently supported for consumers to use iterators and streams after
a transaction has been closed? If so, I don't currently see an easy way to
replace DatasetGraphInMemory with a faster implementation. (although
transaction-aware iterators that copy the remaining elements into lists
could be an option).
Are there other reasons why DatasetGraphInMemory is the preferred dataset
implementation for Fuseki?
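
A minimal sketch of the "transaction-aware iterators that copy the remaining
elements into lists" option mentioned above, using only the Java standard
library. DetachableIterator and detach() are hypothetical names for this
example and are not Jena APIs; a plain ArrayList stands in for the store, and
calling detach() by hand stands in for the end of a read transaction.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch: none of these names are Jena APIs.
final class DetachableIterator<T> implements Iterator<T> {
    private Iterator<T> delegate;

    DetachableIterator(Iterator<T> live) {
        this.delegate = live;
    }

    // To be called when the read transaction ends: drain the elements not yet
    // consumed into a private list, so iteration can continue afterwards
    // without touching the underlying store.
    void detach() {
        List<T> rest = new ArrayList<>();
        while (delegate.hasNext()) {
            rest.add(delegate.next());
        }
        delegate = rest.iterator();
    }

    @Override
    public boolean hasNext() {
        return delegate.hasNext();
    }

    @Override
    public T next() {
        return delegate.next();
    }
}

class DetachDemo {
    public static void main(String[] args) {
        List<String> store = new ArrayList<>(List.of("t1", "t2", "t3"));
        DetachableIterator<String> it =
                new DetachableIterator<>(store.iterator());

        System.out.println(it.next()); // consumed while the "transaction" is open
        it.detach();                   // the "transaction" closes
        store.clear();                 // a later writer mutates the store

        while (it.hasNext()) {
            System.out.println(it.next()); // remaining elements are still served
        }
    }
}
```

In a real implementation the transaction would have to track its open
iterators and detach them when it ends, and the copy costs time and memory
proportional to the unconsumed remainder, so this is only one possible
trade-off.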

Cheers,
Arne
