Hi Andy, This is rather long email I'm afraid, but I've split it into two parts: Parliament and TxTDB comments:
Parliament ---------- The unit of MVCC I'm looking at is per quad. This is similar to PostgreSQL's row-level MVCC. It also means that the WAL is used for durability guarantees and transaction rollback rather than concurrency control. The WAL will track statement insertions/deletions and disk page states on the first modification after a checkpoint. Each statement in Parliament will have two transaction IDs: a creation ID that identifies the transaction that created the statement, and an expiration ID that identifies the transaction that expired (deleted) the statement. As statements are immutable (statements can only be created or deleted), we can expire a statement by atomically updating the expired ID field and avoid locking. When a statement is expired, Parliament retains it until all transactions that were in process at the time of expiration have completed (after which time the statement is allowed to be removed by the background vacuum task). Read iterators can determine if any particular statement is visible by recording two things at the start of the transaction: 1) the current transaction ID, and 2) all in-process transaction IDs. When a transactional query is issued, Parliament displays all statement versions that match the following criteria: - The statement’s creation ID is a committed transaction and is less than the current transaction counter - The statement lacks an expiration ID, its expiration ID was in process at query start, or the expiration ID is greater than the current transaction counter One helpful result of structuring our statement iterator operation in this manner is that we can align our physical data storage with the abstract RDF concept that a resource can be said to only exist if there exists a statement that uses that resource. If we ensure that the only canonical source for determining the existence of a resource is the Statement List, then we are free to loosen the isolation level of the resource dictionary to that of UNCOMMITTED_READ while retaining an overall SERIALIZABLE isolation for the triple store itself (thus there is no requirement for MVCC on the resource dictionary, and minimal locking requirements). Because there is no coordination between read and write transactions, it is not easy to know exactly when a particular expired statement can safely be deleted from the statement list. Borrowing from other MVCC implementations (PostgreSQL in particular), Parliament will use a VACUUM task that will iterate through all statements and remove any links to statements that are considered expired by all open transactions (possibly placing the address in a free-space list for reuse). Right now I am going to focus on an implementation with only a single writer, but with the idea that multiple writers is possible in the future. It that aspect that drives me to suspect that I will want more isolation levels than SERIALIZABLE as lower levels could reduce potentially expensive locking. It could also be useful for improving the throughput of lots of short read queries as well. UNCOMMITTED_READ for example could avoid the locking needed to get a list of all in-process transactions and instead atomically get and increment the current transaction counter. Some of these isolation levels may only make sense if the transaction interface could specify the start and end of a query inside of a transaction since at the DatasetGraph level all you get are .find(Quad) operations. The example I gave on the JIRA would need some way of indicating query start inside of a transaction. TxTDB ----- It sounds like you may be describing shadow paging [1]. In shadow paging, the basic idea is that any reader or writer gets a view onto the data based on reachability from pointers in a particular root block. Pages that are reachable from any live root block are never modified. A vacuum process is required to collect the space from blocks that are no longer reachable (the last reader out could do it too). Updates to indexes must be treated in roughly the same way as data pages, because they contain pointers to different data. If you wrote each block to disk at commit time the system would belong to the class of logless transaction systems (although a small log might be needed to prevent partial updates to the root page pointers). With a shadow paging system you also would not have to actually copy the new pages back over top of the old ones. -Stephen [1] http://en.wikipedia.org/wiki/Shadow_paging -----Original Message----- From: Andy Seaborne [mailto:[email protected]] Sent: Tuesday, March 22, 2011 11:06 AM To: [email protected] Subject: Different policy for concurrency access in TDB supporting a single writer and multiple readers Switching to email for general discussion: Stephen, Could you say something about your design ideas for transactions and concurrency in Parliament? And specifically, what is the unit of MVCC? I've mainly been thinking about the design for TDB, and a design that uses blocks as the unit of transaction and MVCC, with each writer seeing a view of the block space that accords with its initial time point Using a WAL as a redo log, the changes are not in the main database until the WAL is checkpointed. If you have the time sequence: R1: start reader 1 W1: start writer 1 W1: finish writer 1 # a reader here is like R1 R2: start reader 2 W2: start writer 2 W2: finish writer 2 R3: start reader 3 ... finish queries ... which is single writer but reader 1 spans two writers. My idea has been to create an "IP layer for data" (IP as in Internet Protocol) over blocks, rather that making the data structures above that layer (B+Trees, Extensible hash tables) versioned - the data structure code has to be slightly version aware to pass the version context up and down to the block layer but the algorithms are essentially not version specific. This depends on making the block layer sufficiently efficient which might become the practical transaction size limitation. Compression, or recording operations on block according to type (an extreme form of application-driven compression), would be very good. It retains the data publishing-focus of TDB - that is, reader and reader performance is primary and costs go mainly on the writers. Only experimentation will tell. It is my hope to have other data structures for indexing in the future so a clean separation of the tranasaction/MVCC issues and the data structure algorithms is going to be good. Given that design, SERIALIZABLE is the natural isolation policy. There is very little locking over blocks in the main database (see below for the possibility of none!) That's not to say that the API design should not have these constants in even if not used. It happens naturally, each writer sees its modified blocks (as written to the WAL) and the blocks of past committed writers + the base database. Readers just don't get changed blocks; reader 1 is working agains the main database; reader 2 gets modified blocks from the WAL because the WAL can't be all written to the main database until reader 1 gets out the way. READ_COMMITTED makes sense for limiting growth of the WAL as it means that the WAL can be flushed to the main database before an earlier reader has finished. It would have to be done in such a way that all the changes happen at once or the data structure invariants could be invalid; that is, the reader would also have to be outside such an index as well and that needs some locking that would not be there otherwise. REPEATABLE_READ would need block locking for the updates but seems to have a quirk. Unfortunately, just knowing which blocks the reader has seen may not be strong enough as a change to the blocks of an index may result in a change to a block the reader has looked at and one it has not in some data structure consistency way, but the block locking will miss that. Not sure that's safe for the B+trees - it's not in general. READ_UNCOMMITTED is very likely to break the index structures as the higher level assumptions in the B+trees will not be valid. A feature of the TxTDB MVCC design is that locking is not being done on the main database. Changes only happen when writing back committed blocks, and then you know the original block to be over-written is never seen by a reader. Write-back occurs only when all readers upto that writer version have exited and so any outstanding reader will be getting the updated block via the WAL copy. No block lock needed (a bit scary that bit :-). Recovery is a matter of redoing the WAL so it is potentially more expensive than an undo log where the undo actions are just running transactions, not a period of recent history of writers. But we don't crash do we? :-) I think that limiting the number of versions back that will be admitted to some control constant (like 10 or 20, ideally adapting to few for larger tranactions) will be appropriate. if it's growing, then it's a sign the system is being asked to do more than it has the processing power to do. Most of the time, the WAL is write-only, and there are in-memory copies of blocks (AKA a cache, with version tiers). This is analogous to the direct-mode write cache there is currently, but with an additional write-back policy as it needs to sync when a transaction commits. Andy On 21/03/11 22:09, Stephen Allen (JIRA) wrote: > > [ > https://issues.apache.org/jira/browse/JENA-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13009433#comment-13009433 > ] > > Stephen Allen commented on JENA-41: > ----------------------------------- > > I think your idea about the DatasetGraph being the interface for > transactions makes sense. Transactional DatasetGraphs could also > provide fallback behavior for legacy code by implementing autocommit > transactions if the user called methods on a dataset that was not > initialized in a transactionBegin() call. > > > With regard to the isolation levels, I believe some of the lower > levels can make sense for particular applications or queries. For > example say you want to know the size of a few of graphs. > >BEGIN READ_ONLY; >select count (*) where { graph<http://example/g1> { ?s ?p ?o . } } ; >select count (*) where { graph<http://example/g2> { ?s ?p ?o . } } ; >COMMIT; > > Assuming a traditional pessimistic locking scheme, running the > transaction at SERIALIZABLE could cause the locks held by the first > select query to also be held through the second query, reducing > concurrency (using two transactions instead might not be a good idea > as there is usually some amount of overhead associated with creating > and committing transactions). > > If you were OK with the possibility that the two query results are > not truly serializable with respect to each other, then you could > improve concurrency by using a READ_COMMITTED isolation level instead > that would give serializable results for each query (but not the > whole transaction). And if you really just needed a rough estimate > of size, using READ_UNCOMMITTED may be able to avoid locking all > together. > > An additional motivating factor for MVCC implementations is that they > may be implementing snapshot isolation, which probably maps better to > READ_COMMITTED than SERIALIZABLE (especially if it could do predicate > locking for true serializable behavior but allow cheaper snapshot > isolation if that was all that was needed). The Postgres > documentation does a good job of describing this [1]. > > I would find it useful to have multiple isolation levels available > (even if internally I'm mapping them all to SERIALIZABLE at first). > The four ANSI Isolation levels seem appropriate, and remember that > implementations are allowed to map unavailable lower levels to higher > levels as desired. > > > [1] > http://developer.postgresql.org/pgdocs/postgres/transaction-iso.html
