Andy,

I'm still in the design process, so this could change.  But my plan for now is 
to use BDB's logging for the resource dictionary and our own logging for the 
statement and resource tables.  The database "element" for those two tables 
will be a single tuple (i.e. a row).  These tables are not append only, 
statements may have their expiration XID updated and resources will have their 
linked list and statistics updated.  As I mentioned, I am focusing on 
single-writer for now, so that simplifies a lot of the logging and concurrency 
design.


I presently can imagine three options for implementing durability in Parliament 
(+Pros and -Cons):

1) Stick with memory mapped I/O by using an undo/redo log (records old value 
and new value for each element)
     + Memory mapping is high performance database elements
     + I/O system is implemented this way already
     - Need to deal with the possibility that OS can write changes before they 
are flushed in the log.  Two methods:
        a) All writes that would go to pages are instead buffered in memory 
until the log is flushed periodically then copied to the actual pages
        b) Make use of functions to lock modified pages in memory until log is 
flushed (Windows has VirtualLock(), Linux has mlock())

2) Abandon memory mapped I/O and use our own buffer manager
     + Full control of flushing to disk
     + Can use any logging scheme we want (redo may not be feasible for huge 
write transactions though)
     - A fair amount of work to implement

3) Use BDB's Queue structure for the statement and resource tables [1]
     + Fast, stable, easy to use instead of implementing our own log
     - Logical record numbers limited to 32-bit unsigned type (eventually 
limits us to only 4,294,967,295 triples)


I am leaning towards solution 1) at the moment.  The undo/redo log is 
attractive because of the drawbacks quoted in [2] (pp 869):

- Undo logging requires that data be written to disk immediately after a 
transaction finishes, perhaps increasing the number of disk I/O's that need to 
be performed.

- On the other hand, redo logging requires us to keep all modified blocks in 
buffers until the transaction commits and the log records have been flushed, 
perhaps increasing the average number of buffers required by transactions.

- Both undo and redo logs may put contradictory requirements on how buffers are 
handled during a checkpoint, unless the database elements are complete blocks 
or sets of blocks.

The last point is important because I want to use tuples as my elements (since 
they are tiny compared to the block, the log size is dramatically reduced even 
though I am storing both undo and redo information).  The rule for undo/redo is 
simply that the update log record appears on disk before any database elements 
are modified on disk, and one of the implementations of 1) can do that.

As an aside, I recall you mentioning that you had a BDB version of TDB, using 
that would seem to offer a fast, stable way of adding transactions to your 
B-trees.  Out of curiosity, were there problems with using BDB?

-Stephen

P.S. Parliament currently stores triples-table-per-graph, but internally it can 
accept quads, the interface to Jena needs to be changed to support quads 
however.  Ultimately I think we may have as a runtime option what method is 
desired, since triples-table-per-graph offers some advantages when you have a 
small number of large graphs that you can then drop very efficiently.

[1] http://pybsddb.sourceforge.net/ref/am_conf/logrec.html
[2] http://www.amazon.com/Database-Systems-Complete-Book-2nd/dp/0131873253



-----Original Message-----
From: Andy Seaborne [mailto:[email protected]] 
Sent: Sunday, March 27, 2011 10:54 AM
To: [email protected]
Subject: Re: Different policy for concurrency access in TDB supporting a single 
writer and multiple readers



On 22/03/11 23:03, Stephen Allen wrote:
> Hi Andy,
>
> This is rather long email I'm afraid, but I've split it into two
> parts: Parliament and TxTDB comments:
>
>
> Parliament ----------
>
> The unit of MVCC I'm looking at is per quad.  This is similar to
> PostgreSQL's row-level MVCC.  It also means that the WAL is used for
> durability guarantees and transaction rollback rather than
> concurrency control.  The WAL will track statement
> insertions/deletions and disk page states on the first modification
> after a checkpoint.

Stephen,

On the Parliament design to check I understand ...

When a new quad is added to the store during a transaction, the quad is 
written to the WAL.  is anything else written?

The quad needs to be added to the statement list and any new nodes to 
resource table.  Some data structure manipulations need to be done 
atomically with respect to failure.  The data structures are in memory 
mapped files - there is a chance (small) part of the mapped file is 
written back to disk at anytime is the OS decides it needs to swap out 
that page.  Having been recently used by being updated, it's quite 
unlikely to be a candidate but it's possible.

Does Parliament record the lower level statement list and resource table 
changes?  Or is teh adding of a quad and adding of a resource 
idempotent, at least can be repeated without reading from the 
datastructure in some way?

These actions that are repeatable without looking at the original state 
are quite nice from a redo-log perspective.

I can see how it might be possible to have datastructures that are 
insensitive to corruption under append because the repeat is simply to 
put in the right answers without reading the potentially corrupt data.

B+Trees don't have this property - when a tree node splits, three nodes 
need writing (left of split, right of split, parent).  A change to a 
single node is OK because the trees don't support duplicate keys.  But 
if the tree is partially written back in a split then the tree on disk 
is broken and reply does not help - the replay traverses the tree 
assuming it's valid.

The chances of this are small - splits aren't the common change 
operation, and updates close in time are likely to be all written or not 
written, but it's possible to interrupt the writing in a small window.

(side question: the ISWC2009 only describes triples - is that a quads 
table or a triples-table-per-graph. )

        Andy

Reply via email to