On 22/03/11 23:03, Stephen Allen wrote:
Hi Andy,
This is rather long email I'm afraid, but I've split it into two parts:
Parliament and TxTDB comments:
Very helpful indeed. I'll respond to the TxTDB part here -- the
pArliament description is interesting and helpful as to what you plan to
do and how you came to your design decisions.
TxTDB -----
It sounds like you may be describing shadow paging [1]. In shadow
paging, the basic idea is that any reader or writer gets a view onto
the data based on reachability from pointers in a particular root
block. Pages that are reachable from any live root block are never
modified. A vacuum process is required to collect the space from
blocks that are no longer reachable (the last reader out could do it
too). Updates to indexes must be treated in roughly the same way as
data pages, because they contain pointers to different data. If you
wrote each block to disk at commit time the system would belong to
the class of logless transaction systems (although a small log might
be needed to prevent partial updates to the root page pointers).
With a shadow paging system you also would not have to actually copy
the new pages back over top of the old ones.
-Stephen
[1] http://en.wikipedia.org/wiki/Shadow_paging
Sort of shadow paging - the Wikipedia page is describing shadow paging
where a new page is allocated with new page id, with the need to chase
back up any datastructre that refers to the page and change that (the
latter could be in-place or shadow'ed recursed). It requires data
structure involvement to chase the references.
I'm thinking of keeping the same page id, and the transactions have a
context that enables them to go to the right place place to find a given
page. The data structures referring to the block do not change; the
"get block 123" step carries out the looking for the right version.
The B+Tree code is already structured around "get block", "put block"
operations - one reason this new implementation was done, rather than
pick up an existing one.
http://www.sqlite.org/wal.html
Section "How WAL Works"
which isn't quite the same WAL as Wikipedia either. This WAL is a redo
log without change in-place as the transaction progresses.
I hope that one form of compression for new blocks is to use a
block-type compressor so adding a triple to a index is not a copy of the
whole block, but just "shift-insert bytes ABC at offset 123, moving up
old bytes". Most changes in a B+Tree are on one block, not causing
block split or rebalancing. That level of compress takes the block
change down to about "insert triple" size, except there are multiple
indexes. TDB does not store a distinguished triple as such although one
index is considered "primary" but they are all cluster indexes at the
moment. Primary is just the one the code assumes exists and falls back
on although currently only total indexing is done so that never happens.
It would also be possible to reverse the whole thing - copy old data out
of the way and change in-place. That puts the I/O to the database on
the writer critical path, which is random access, whereas redo puts
append+sync on the log as the I/O on the writer critical path.
Andy