Switching to email for general discussion:
Stephen,
Could you say something about your design ideas for transactions and
concurrency in Parliament? And specifically, what is the unit of MVCC?
I've mainly been thinking about the design for TDB: a design that uses
blocks as the unit of transaction and MVCC, with each writer seeing a
view of the block space that accords with its initial time point.
Using a WAL as a redo log, the changes are not in the main database
until the WAL is checkpointed.
If you have the time sequence:
R1: start reader 1
W1: start writer 1
W1: finish writer 1
# a reader starting here sees the same view as R1
R2: start reader 2
W2: start writer 2
W2: finish writer 2
R3: start reader 3
... finish queries ...
which is single writer but reader 1 spans two writers.
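The snapshot rule behind that timeline can be sketched as follows: every writer commit bumps a global version, and each reader is pinned to the version current when it started. This is an illustrative model only, not TDB code; all names are assumptions.

```python
# Minimal model of snapshot visibility for the timeline above
# (illustrative only; not actual TDB code).

class Database:
    def __init__(self):
        self.version = 0      # bumped on each writer commit

    def begin_reader(self):
        return self.version   # a reader is pinned to its start version

    def commit_writer(self):
        self.version += 1     # W finishes: later readers see its changes
        return self.version

db = Database()
r1 = db.begin_reader()   # R1 starts: sees version 0
db.commit_writer()       # W1 starts and finishes
r2 = db.begin_reader()   # R2 starts: sees version 1 (includes W1)
db.commit_writer()       # W2 starts and finishes
r3 = db.begin_reader()   # R3 sees version 2, while R1 still reads at 0
assert (r1, r2, r3) == (0, 1, 2)
```

So reader 1 keeps its version-0 view for its whole lifetime, even though it spans two writers.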
My idea has been to create an "IP layer for data" (IP as in Internet
Protocol) over blocks, rather than making the data structures above that
layer (B+Trees, extensible hash tables) versioned. The data structure
code has to be slightly version-aware, to pass the version context up and
down to the block layer, but the algorithms are essentially not version
specific. This depends on making the block layer sufficiently efficient,
which might become the practical transaction size limitation.
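The separation might look roughly like this: the block layer resolves (block id, version) pairs, and the index code merely threads a version context through without branching on it. A hypothetical sketch, not the TDB BlockMgr API:

```python
# Sketch of a version-aware block layer under a version-agnostic index
# (hypothetical names; not the TDB BlockMgr API).

class BlockLayer:
    def __init__(self):
        self.blocks = {}   # (block_id, version) -> bytes

    def write(self, block_id, version, data):
        self.blocks[(block_id, version)] = data

    def read(self, block_id, version):
        # Return the newest copy of the block at or before 'version'.
        for v in range(version, -1, -1):
            if (block_id, v) in self.blocks:
                return self.blocks[(block_id, v)]
        return None

class Index:
    """Index code only carries the version context; the algorithm
    itself does not change per version."""
    def __init__(self, layer, version):
        self.layer, self.version = layer, version

    def get(self, block_id):
        return self.layer.read(block_id, self.version)

layer = BlockLayer()
layer.write(7, 0, b"root-v0")
layer.write(7, 1, b"root-v1")
old, new = Index(layer, 0), Index(layer, 1)
assert old.get(7) == b"root-v0"
assert new.get(7) == b"root-v1"
```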
Compression, or recording operations on blocks according to type (an
extreme form of application-driven compression), would be very good. It
retains the data-publishing focus of TDB - that is, reader performance
is primary and costs fall mainly on the writers.
Only experimentation will tell. It is my hope to have other data
structures for indexing in the future, so a clean separation of the
transaction/MVCC issues from the data structure algorithms is going to
be good.
Given that design, SERIALIZABLE is the natural isolation policy. There
is very little locking over blocks in the main database (see below for
the possibility of none!).
That's not to say that the API design should not include these constants,
even if they are not used.
It happens naturally: each writer sees its modified blocks (as written
to the WAL) and the blocks of past committed writers, plus the base
database. Readers just don't get changed blocks; reader 1 is working
against the main database, while reader 2 gets modified blocks from the
WAL, because the WAL can't all be written to the main database until
reader 1 gets out of the way.
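The lookup order this implies: a writer resolves a block first from its own uncommitted changes, then from committed WAL copies up to its base version, then the main database; a reader skips the uncommitted set. A sketch under those assumptions (data structures and names are illustrative):

```python
# Block lookup order under the WAL scheme described above
# (illustrative; structures and names are assumptions).

main_db     = {1: "A0", 2: "B0"}    # checkpointed blocks
wal         = {(2, 1): "B1"}        # committed, not yet checkpointed: block 2 at version 1
uncommitted = {1: "A2-draft"}       # the current writer's private changes

def reader_read(block_id, version):
    for v in range(version, 0, -1):        # committed WAL copies, newest first
        if (block_id, v) in wal:
            return wal[(block_id, v)]
    return main_db[block_id]               # fall back to the main database

def writer_read(block_id, base_version):
    if block_id in uncommitted:            # own changes take precedence
        return uncommitted[block_id]
    return reader_read(block_id, base_version)

assert reader_read(2, 0) == "B0"        # reader 1's view: pre-W1
assert reader_read(2, 1) == "B1"        # reader 2's view: sees W1's commit
assert writer_read(1, 1) == "A2-draft"  # writer sees its own draft block
```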
READ_COMMITTED makes sense for limiting growth of the WAL, as it means
that the WAL can be flushed to the main database before an earlier
reader has finished. It would have to be done in such a way that all
the changes happen at once, or the data structure invariants could be
invalid; that is, the reader would also have to be outside such an index
as well, and that needs some locking that would not be there otherwise.
REPEATABLE_READ would need block locking for the updates, but seems to
have a quirk. Unfortunately, just knowing which blocks the reader has
seen may not be strong enough: a change to the blocks of an index may,
for data structure consistency, involve both a block the reader has
looked at and one it has not, and the block locking will miss that. Not
sure that's safe for the B+Trees - it's not in general.
READ_UNCOMMITTED is very likely to break the index structures as the
higher level assumptions in the B+trees will not be valid.
A feature of the TxTDB MVCC design is that locking is not being done on
the main database. Changes only happen when writing back committed
blocks, and by then you know the original block being overwritten is
never seen by a reader. Write-back occurs only when all readers up to
that writer version have exited, so any outstanding reader will be
getting the updated block via the WAL copy. No block lock needed (a bit
scary, that bit :-).
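The no-lock write-back condition can be stated in one predicate: a block committed at version v may be copied into the main database once no active reader holds a snapshot older than v. A sketch, not TxTDB code:

```python
# Write-back safety condition: a block committed at version v can be
# checkpointed once every active reader's snapshot is >= v
# (sketch; not TxTDB code).

def checkpointable(commit_version, active_reader_versions):
    if not active_reader_versions:
        return True   # no readers at all: write back freely
    return min(active_reader_versions) >= commit_version

assert checkpointable(1, [0, 1]) is False  # a version-0 reader still needs the old block
assert checkpointable(1, [1, 2]) is True   # all readers already see version >= 1
assert checkpointable(2, [])     is True   # idle system: always safe
```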
Recovery is a matter of redoing the WAL, so it is potentially more
expensive than an undo log, where the undo actions cover just the
running transactions, not a period of recent history of writers. But we
don't crash, do we? :-)
I think that limiting the number of versions back that will be admitted
to some control constant (like 10 or 20, ideally adapting to fewer for
larger transactions) will be appropriate. If it's growing, then it's a
sign the system is being asked to do more than it has the processing
power to do.
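One way to realise that limit is admission control: admit a new writer only while the number of outstanding (un-checkpointed) versions is below the bound. The constant and names below are illustrative, not a fixed design:

```python
# Admission control on version history depth (illustrative policy sketch).

MAX_PENDING_VERSIONS = 10   # the "control constant" mentioned above

def admit_writer(newest_version, oldest_reader_version):
    # Pending history = versions that cannot yet be checkpointed
    # because an old reader still pins them.
    pending = newest_version - oldest_reader_version
    return pending < MAX_PENDING_VERSIONS

assert admit_writer(5, 0) is True     # 5 pending versions: under the limit
assert admit_writer(12, 0) is False   # falling behind: push back on writers
```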
Most of the time, the WAL is write-only, and there are in-memory copies
of blocks (AKA a cache, with version tiers). This is analogous to the
direct-mode write cache there is currently, but with an additional
write-back policy as it needs to sync when a transaction commits.
Andy
On 21/03/11 22:09, Stephen Allen (JIRA) wrote:
[
https://issues.apache.org/jira/browse/JENA-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13009433#comment-13009433
]
Stephen Allen commented on JENA-41:
-----------------------------------
I think your idea about the DatasetGraph being the interface for
transactions makes sense. Transactional DatasetGraphs could also
provide fallback behavior for legacy code by implementing autocommit
transactions if the user called methods on a dataset that was not
initialized in a transactionBegin() call.
With regard to the isolation levels, I believe some of the lower
levels can make sense for particular applications or queries. For
example, say you want to know the sizes of a few graphs:
BEGIN READ_ONLY;
SELECT (COUNT(*) AS ?c) WHERE { GRAPH <http://example/g1> { ?s ?p ?o } } ;
SELECT (COUNT(*) AS ?c) WHERE { GRAPH <http://example/g2> { ?s ?p ?o } } ;
COMMIT;
Assuming a traditional pessimistic locking scheme, running the
transaction at SERIALIZABLE could cause the locks held by the first
select query to also be held through the second query, reducing
concurrency (using two transactions instead might not be a good idea
as there is usually some amount of overhead associated with creating
and committing transactions).
If you were OK with the possibility that the two query results are
not truly serializable with respect to each other, then you could
improve concurrency by using a READ_COMMITTED isolation level instead
that would give serializable results for each query (but not the
whole transaction). And if you really just needed a rough estimate
of size, using READ_UNCOMMITTED may be able to avoid locking
altogether.
An additional motivating factor for MVCC implementations is that they
may be implementing snapshot isolation, which probably maps better to
READ_COMMITTED than SERIALIZABLE (especially if it could do predicate
locking for true serializable behavior but allow cheaper snapshot
isolation if that was all that was needed). The Postgres
documentation does a good job of describing this [1].
I would find it useful to have multiple isolation levels available
(even if internally I'm mapping them all to SERIALIZABLE at first).
The four ANSI Isolation levels seem appropriate, and remember that
implementations are allowed to map unavailable lower levels to higher
levels as desired.
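That mapping rule ("unavailable lower levels map to higher ones") is simple to state in code: walk upward from the requested ANSI level to the first supported one. Level names follow ANSI SQL; the function is an illustration, not a proposed API:

```python
# Mapping a requested isolation level to the nearest supported stricter
# level (illustrative; level names follow ANSI SQL).

LEVELS = ["READ_UNCOMMITTED", "READ_COMMITTED",
          "REPEATABLE_READ", "SERIALIZABLE"]

def effective_level(requested, supported):
    # Walk upward from the requested level to the first supported one.
    start = LEVELS.index(requested)
    for level in LEVELS[start:]:
        if level in supported:
            return level
    raise ValueError("no supported level at or above " + requested)

# At first only SERIALIZABLE is implemented: everything maps up to it.
assert effective_level("READ_COMMITTED", {"SERIALIZABLE"}) == "SERIALIZABLE"
# Later, lower levels can be honoured directly.
assert effective_level("READ_COMMITTED",
                       {"READ_COMMITTED", "SERIALIZABLE"}) == "READ_COMMITTED"
```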
[1]
http://developer.postgresql.org/pgdocs/postgres/transaction-iso.html