JENA-624: "Develop a new in-memory RDF Dataset implementation"

A. Soroka Wed, 26 Aug 2015 07:31:05 -0700

Hey, folks--

There hasn't been too much feedback on my proposal for a journaling 
DatasetGraph:


https://github.com/ajs6f/jena/tree/JournalingDatasetgraph

which was, and still is, meant as a step towards JENA-624: Develop a new 
in-memory RDF Dataset implementation. So I'm moving on to look at the real 
problem: an in-memory DatasetGraph with high concurrency, for use with modern 
hardware running many, many threads in large core memory.

I'm beginning to sketch out rough code, and I'd like to run some design 
decisions past the list to get criticism/advice/horrified warnings/whatever 
needs to be said.

1) All-transactional action: i.e. no non-transactional operation. This is 
obviously a great thing for simplifying my work, but I hope it won't be out of 
line with the expected uses for this stuff. 

2) 6 covering indexes in the forms GSPO, GOPS, SPOG, OSGP, PGSO, OPSG. I figure 
this plays to the strength of in-core-memory operation, which is raw speed, but 
obviously it is going to cost space.

3) At least for now, all commits succeed.

4) The use of persistent datastructures to avoid complex and error-prone 
fine-grained locking regimes. I'm using http://pcollections.org/ for now, but I 
am in no way committed to it nor do I claim to have thoroughly vetted it. It's 
simple but enough to get started, and that's all I need to bring the real 
design questions into focus.

5) Snapshot isolation. Transactions do not see commits that occur during their 
lifetime. Each works entirely from the state of the DatasetGraph at the start 
of its life.

6) At most one transaction per thread, for now. Transactions are not 
thread-safe. These are simplifying assumptions that could be relaxed later.
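
To make point 2 concrete, here is a minimal Java sketch of why those six orders 
are enough: for any combination of bound slots in a quad pattern, at least one 
of the six indexes has exactly those slots as its leading prefix, so a lookup 
never has to scan past unbound leading slots. IndexChooser, Slot and choose are 
hypothetical names made up for illustration; none of this is Jena API or the 
proposed code.

```java
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper (not Jena API): picks a covering index for a quad pattern. */
class IndexChooser {
    enum Slot { G, S, P, O }

    // The six index orders listed in point 2.
    static final List<String> INDEXES =
            List.of("GSPO", "GOPS", "SPOG", "OSGP", "PGSO", "OPSG");

    /**
     * Returns the first index whose leading slots are exactly the bound
     * slots of the pattern, or null if no listed index qualifies.
     */
    static String choose(Set<Slot> bound) {
        for (String index : INDEXES) {
            // Collect the first bound.size() slots of this index order.
            Set<Slot> prefix = EnumSet.noneOf(Slot.class);
            for (int i = 0; i < bound.size(); i++) {
                prefix.add(Slot.valueOf(String.valueOf(index.charAt(i))));
            }
            if (prefix.equals(bound)) {
                return index;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Enumerate all 16 combinations of bound slots via a 4-bit mask.
        for (int mask = 0; mask < 16; mask++) {
            Set<Slot> bound = EnumSet.noneOf(Slot.class);
            for (Slot s : Slot.values()) {
                if ((mask & (1 << s.ordinal())) != 0) {
                    bound.add(s);
                }
            }
            System.out.println(bound + " -> " + choose(bound));
        }
    }
}
```

Running main prints one line per pattern shape. Picking the first match is an 
arbitrary choice for the sketch; where several indexes qualify, a real 
implementation might prefer one over another.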

My current design operates as follows:

At the start of a transaction, a fresh in-transaction reference is taken 
atomically from the AtomicReference that points to the index block. As 
operations are performed in the transaction, that in-transaction reference is 
progressed (in the sense in which any persistent datastructure is progressed) 
while the operations are recorded. Upon an abort, the in-transaction reference 
and the record are just thrown away. Upon a commit, the in-transaction 
reference is thrown away and the operation record is re-run against the main 
reference (the one that is copied at the beginning of a transaction). That 
rerun happens inside an atomic update (hence the use of AtomicReference). This 
all should avoid the need for explicit locking in Jena and should confine any 
blocking against the indexes to the actual duration of a commit.
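
A rough Java sketch of that lifecycle, under loud simplifying assumptions: quads 
are plain strings, a single immutable Set stands in for the whole index block, 
and a copy-on-write set stands in for a real persistent collection such as the 
pcollections ones. SnapshotStore, Txn and their methods are hypothetical names 
for illustration, not the code in my branch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.UnaryOperator;

/** Hypothetical sketch (not Jena code) of snapshot transactions with replay on commit. */
class SnapshotStore {
    // The "main reference": points at the current immutable state.
    private final AtomicReference<Set<String>> main =
            new AtomicReference<>(Collections.emptySet());

    /** Starts a transaction by taking the current state as its snapshot. */
    Txn begin() {
        return new Txn(main.get());
    }

    Set<String> current() {
        return main.get();
    }

    /** One transaction; not thread-safe, meant to be used by a single thread. */
    final class Txn {
        private Set<String> view; // the in-transaction reference
        private final List<UnaryOperator<Set<String>>> log = new ArrayList<>();

        private Txn(Set<String> snapshot) {
            this.view = snapshot;
        }

        void add(String quad) {
            apply(s -> copyWith(s, quad, true));
        }

        void delete(String quad) {
            apply(s -> copyWith(s, quad, false));
        }

        boolean contains(String quad) {
            return view.contains(quad);
        }

        // Progress the private view and record the operation for later replay.
        private void apply(UnaryOperator<Set<String>> op) {
            view = op.apply(view);
            log.add(op);
        }

        /** Abort: drop the record; nothing ever reaches the main reference. */
        void abort() {
            log.clear();
        }

        /** Commit: replay the recorded operations against the main reference. */
        void commit() {
            main.updateAndGet(state -> {
                Set<String> next = state;
                for (UnaryOperator<Set<String>> op : log) {
                    next = op.apply(next);
                }
                return next;
            });
            log.clear();
        }
    }

    // Copy-on-write stand-in for a persistent set: returns a new immutable set.
    private static Set<String> copyWith(Set<String> s, String quad, boolean add) {
        Set<String> copy = new HashSet<>(s);
        if (add) {
            copy.add(quad);
        } else {
            copy.remove(quad);
        }
        return Collections.unmodifiableSet(copy);
    }
}
```

One thing the sketch brings out: AtomicReference.updateAndGet does not block. 
It retries the update function when another thread wins the race, so the 
replayed operations have to be free of side effects. Whether retry-on-contention 
or a real lock around the replay is the better fit is still an open question 
for me.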

What do you guys think?



---
A. Soroka
The University of Virginia Library
