Andy,
I'm still in the design process, so this could change. But my plan for now is
to use BDB's logging for the resource dictionary and our own logging for the
statement and resource tables. The database "element" for those two tables
will be a single tuple (i.e. a row). These tables are not append only,
statements may have their expiration XID updated and resources will have their
linked list and statistics updated. As I mentioned, I am focusing on
single-writer for now, so that simplifies a lot of the logging and concurrency
design.
I presently can imagine three options for implementing durability in Parliament
(+Pros and -Cons):
1) Stick with memory mapped I/O by using an undo/redo log (records old value
and new value for each element)
+ Memory mapping is high performance database elements
+ I/O system is implemented this way already
- Need to deal with the possibility that OS can write changes before they
are flushed in the log. Two methods:
a) All writes that would go to pages are instead buffered in memory
until the log is flushed periodically then copied to the actual pages
b) Make use of functions to lock modified pages in memory until log is
flushed (Windows has VirtualLock(), Linux has mlock())
2) Abandon memory mapped I/O and use our own buffer manager
+ Full control of flushing to disk
+ Can use any logging scheme we want (redo may not be feasible for huge
write transactions though)
- A fair amount of work to implement
3) Use BDB's Queue structure for the statement and resource tables [1]
+ Fast, stable, easy to use instead of implementing our own log
- Logical record numbers limited to 32-bit unsigned type (eventually
limits us to only 4,294,967,295 triples)
I am leaning towards solution 1) at the moment. The undo/redo log is
attractive because of the drawbacks quoted in [2] (pp 869):
- Undo logging requires that data be written to disk immediately after a
transaction finishes, perhaps increasing the number of disk I/O's that need to
be performed.
- On the other hand, redo logging requires us to keep all modified blocks in
buffers until the transaction commits and the log records have been flushed,
perhaps increasing the average number of buffers required by transactions.
- Both undo and redo logs may put contradictory requirements on how buffers are
handled during a checkpoint, unless the database elements are complete blocks
or sets of blocks.
The last point is important because I want to use tuples as my elements (since
they are tiny compared to the block, the log size is dramatically reduced even
though I am storing both undo and redo information). The rule for undo/redo is
simply that the update log record appears on disk before any database elements
are modified on disk, and one of the implementations of 1) can do that.
As an aside, I recall you mentioning that you had a BDB version of TDB, using
that would seem to offer a fast, stable way of adding transactions to your
B-trees. Out of curiosity, were there problems with using BDB?
-Stephen
P.S. Parliament currently stores triples-table-per-graph, but internally it can
accept quads, the interface to Jena needs to be changed to support quads
however. Ultimately I think we may have as a runtime option what method is
desired, since triples-table-per-graph offers some advantages when you have a
small number of large graphs that you can then drop very efficiently.
[1] http://pybsddb.sourceforge.net/ref/am_conf/logrec.html
[2] http://www.amazon.com/Database-Systems-Complete-Book-2nd/dp/0131873253
-----Original Message-----
From: Andy Seaborne [mailto:[email protected]]
Sent: Sunday, March 27, 2011 10:54 AM
To: [email protected]
Subject: Re: Different policy for concurrency access in TDB supporting a single
writer and multiple readers
On 22/03/11 23:03, Stephen Allen wrote:
> Hi Andy,
>
> This is rather long email I'm afraid, but I've split it into two
> parts: Parliament and TxTDB comments:
>
>
> Parliament ----------
>
> The unit of MVCC I'm looking at is per quad. This is similar to
> PostgreSQL's row-level MVCC. It also means that the WAL is used for
> durability guarantees and transaction rollback rather than
> concurrency control. The WAL will track statement
> insertions/deletions and disk page states on the first modification
> after a checkpoint.
Stephen,
On the Parliament design to check I understand ...
When a new quad is added to the store during a transaction, the quad is
written to the WAL. is anything else written?
The quad needs to be added to the statement list and any new nodes to
resource table. Some data structure manipulations need to be done
atomically with respect to failure. The data structures are in memory
mapped files - there is a chance (small) part of the mapped file is
written back to disk at anytime is the OS decides it needs to swap out
that page. Having been recently used by being updated, it's quite
unlikely to be a candidate but it's possible.
Does Parliament record the lower level statement list and resource table
changes? Or is teh adding of a quad and adding of a resource
idempotent, at least can be repeated without reading from the
datastructure in some way?
These actions that are repeatable without looking at the original state
are quite nice from a redo-log perspective.
I can see how it might be possible to have datastructures that are
insensitive to corruption under append because the repeat is simply to
put in the right answers without reading the potentially corrupt data.
B+Trees don't have this property - when a tree node splits, three nodes
need writing (left of split, right of split, parent). A change to a
single node is OK because the trees don't support duplicate keys. But
if the tree is partially written back in a split then the tree on disk
is broken and reply does not help - the replay traverses the tree
assuming it's valid.
The chances of this are small - splits aren't the common change
operation, and updates close in time are likely to be all written or not
written, but it's possible to interrupt the writing in a small window.
(side question: the ISWC2009 only describes triples - is that a quads
table or a triples-table-per-graph. )
Andy