Hey, folks--

There hasn't been too much feedback on my proposal for a journaling DatasetGraph:
https://github.com/ajs6f/jena/tree/JournalingDatasetgraph

which was, and still is, meant as a step towards JENA-624: Develop a new in-memory RDF Dataset implementation. So I'm moving on to look at the real problem: an in-memory DatasetGraph with high concurrency, for use with modern hardware running many, many threads in large core memory.

I'm beginning to sketch out rough code, and I'd like to run some design decisions past the list to get criticism/advice/horrified warnings/whatever needs to be said.

1) All-transactional action: i.e. no non-transactional operation. This is obviously a great thing for simplifying my work, but I hope it won't be out of line with the expected uses for this stuff.

2) 6 covering indexes in the forms GSPO, GOPS, SPOG, OSGP, PGSO, OPSG. I figure to play to the strength of in-core-memory operation: raw speed, but obviously this is going to cost space.

3) At least for now, all commits succeed.

4) The use of persistent datastructures to avoid complex and error-prone fine-grained locking regimes. I'm using http://pcollections.org/ for now, but I am in no way committed to it, nor do I claim to have thoroughly vetted it. It's simple, but it's enough to get started, and that's all I need to bring the real design questions into focus.

5) Snapshot isolation. Transactions do not see commits that occur during their lifetime. Each works entirely from the state of the DatasetGraph at the start of its life.

6) At most one transaction per thread, for now. Transactions are not thread-safe. These are simplifying assumptions that could be relaxed later.

My current design operates as follows:

At the start of a transaction, a fresh in-transaction reference is taken atomically from the AtomicReference that points to the index block. As operations are performed in the transaction, that in-transaction reference is progressed (in the sense in which any persistent datastructure is progressed) while the operations are recorded.
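
For illustration, the transaction lifecycle can be sketched in plain Java roughly as follows. This is a sketch under stated assumptions, not the actual implementation: `SnapshotDataset`, `Indexes` and `Txn` are hypothetical names, quads are plain strings, a single copy-on-write `HashSet` stands in for a real persistent structure such as pcollections, and the six covering indexes are collapsed into that one set. It covers the snapshot and recording steps above as well as the abort and commit handling described in the next paragraph.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.UnaryOperator;

/** Illustrative only: every name here is hypothetical, none of it is Jena API. */
class SnapshotDataset {

    /** Stand-in for the persistent index block: an immutable set of quads-as-strings. */
    static final class Indexes {
        private final Set<String> quads;

        Indexes(Set<String> quads) { this.quads = Collections.unmodifiableSet(quads); }

        // Each change copies the set (O(n)); a real persistent structure would share structure.
        Indexes plus(String quad) {
            Set<String> copy = new HashSet<>(quads);
            copy.add(quad);
            return new Indexes(copy);
        }

        Indexes minus(String quad) {
            Set<String> copy = new HashSet<>(quads);
            copy.remove(quad);
            return new Indexes(copy);
        }

        boolean contains(String quad) { return quads.contains(quad); }
    }

    /** The main reference: the only shared mutable state. */
    private final AtomicReference<Indexes> mainRef =
            new AtomicReference<>(new Indexes(new HashSet<>()));

    Txn begin() { return new Txn(); }

    /** One transaction; not thread-safe, meant to be used by a single thread. */
    final class Txn {
        // In-transaction reference, taken atomically at the start of the transaction.
        private Indexes local = mainRef.get();
        // Record of the operations performed, in order.
        private final List<UnaryOperator<Indexes>> record = new ArrayList<>();

        void add(String quad) { apply(ix -> ix.plus(quad)); }

        void delete(String quad) { apply(ix -> ix.minus(quad)); }

        // Reads only ever see the transaction's own snapshot plus its own changes.
        boolean contains(String quad) { return local.contains(quad); }

        private void apply(UnaryOperator<Indexes> op) {
            local = op.apply(local); // progress the in-transaction reference
            record.add(op);          // and remember the operation for commit
        }

        // Abort: throw away the in-transaction reference and the record.
        void abort() {
            local = null;
            record.clear();
        }

        // Commit: re-run the record against the main reference in one atomic update.
        void commit() {
            mainRef.updateAndGet(current -> {
                Indexes next = current;
                for (UnaryOperator<Indexes> op : record) {
                    next = op.apply(next);
                }
                return next;
            });
            local = null;
            record.clear();
        }
    }

    public static void main(String[] args) {
        SnapshotDataset ds = new SnapshotDataset();
        Txn writer = ds.begin();
        Txn reader = ds.begin();
        writer.add("g s p o");
        writer.commit();
        System.out.println(reader.contains("g s p o"));     // false: reader keeps its snapshot
        System.out.println(ds.begin().contains("g s p o")); // true: a new transaction sees the commit
    }
}
```

One thing the sketch makes visible: `AtomicReference.updateAndGet` does not block. It re-applies its function if another commit lands first, so the recorded operations have to be side-effect-free and safe to run more than once.
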
Upon an abort, the in-transaction reference and the record are just thrown away. Upon a commit, the in-transaction reference is thrown away and the operation record is re-run against the main reference (the one that is copied at the beginning of a transaction). That rerun happens inside an atomic update (hence the use of AtomicReference). This all should avoid the need for explicit locking in Jena and should confine any blocking against the indexes to the actual duration of a commit.

What do you guys think?

---
A. Soroka
The University of Virginia Library