I agree there is much more space to save in leaf level compression. Duplicate key and prefix compression are a couple of possible areas.
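
To make "prefix compression" concrete, here is a rough sketch of the idea in plain Java -- this is not Derby code, and the class and method names are invented for illustration. Each leaf key after the first is stored as the length of the prefix it shares with its predecessor plus the remaining suffix bytes:

    import java.util.ArrayList;
    import java.util.List;

    // Illustration only: adjacent sorted keys usually share a prefix, so we
    // store (shared-prefix length, suffix) instead of the full key bytes.
    public class PrefixCompressionSketch {

        /** One compressed entry: reuse this many bytes of the previous key, then append the suffix. */
        public record Entry(int sharedPrefixLen, byte[] suffix) {}

        public static List<Entry> compress(List<byte[]> sortedKeys) {
            List<Entry> out = new ArrayList<>();
            byte[] prev = new byte[0];
            for (byte[] key : sortedKeys) {
                int shared = 0;
                int max = Math.min(prev.length, key.length);
                while (shared < max && prev[shared] == key[shared]) {
                    shared++;
                }
                byte[] suffix = new byte[key.length - shared];
                System.arraycopy(key, shared, suffix, 0, suffix.length);
                out.add(new Entry(shared, suffix));
                prev = key;
            }
            return out;
        }
    }

How much leaf space this actually saves obviously depends on the key distribution.
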
Dibyendu Majumdar wrote:
> Mike Matrigali wrote:
>
>> One could create compressed branch entries if the RowLocation were not needed to differentiate between 2 adjacent entries (i.e., for 2 adjacent branch entries: mike, row loc 1, leaf page 113; stan, row loc 2, leaf page 112 -- for these 2 branch rows the minimum suffix of mike/stan would be enough without the row location). Because the page/row format in Derby does not require a constant number of columns per row, such a change would not be too hard. One could also do suffix compression within a column - for instance, in the previous example only m/s is really necessary, but such a decision is very datatype dependent and needs to be supported by the datatype itself - not a decision made by store looking at the bytes.
>>
> I guess it would be more fruitful to do compression of duplicate keys in non-unique indexes at leaf-level pages, unless this is already done.
>
> Attached is a revised package.html with Javadoc-style links. I haven't been able to test the links - but I think they should work.
>
> Regards
>
> ------------------------------------------------------------------------
>
> Implements the BTree access method, which is the basis for SQL indexes.
>
>
> Overview
>
> Derby implements secondary SQL indexes as BTrees. The high-level features of the BTree implementation are:
>
>    1. Derby implements the standard B+Tree algorithm, in which all keys are stored in leaf pages, and the upper levels are organized as a redundant index for enabling rapid location of a particular key. See The Ubiquitous B-Tree, Douglas Comer <http://portal.acm.org/citation.cfm?id=356776> for a definition of the B+Tree.
>    2. The BTree implementation supports concurrency using page-level latches, and recovery using a Write Ahead Log.
>    3. Derby uses data only locking for its logical row-level locking.
>    4. Derby supports all four SQL isolation levels: serializable, repeatable read, read committed, and read uncommitted.
>    5. Derby uses logical key deletes. This enables it to perform undos during rollbacks and restart recovery as single-page operations.
>
>
> High-level structure of the B+Tree
>
>     * The first row in all pages, branch or leaf, is a control row. In branch pages this is a {@link org.apache.derby.impl.store.access.btree.BranchControlRow BranchControlRow}, and in leaf pages it is a {@link org.apache.derby.impl.store.access.btree.LeafControlRow LeafControlRow}.
>     * In addition to the BranchControlRow, branch-level pages contain {@link org.apache.derby.impl.store.access.btree.BranchRow BranchRows}.
>     * At branch level, a page is linked to its left and right siblings, its parent, and its children. The number of child pages is 1 greater than the number of BranchRows in the page. The leftmost child pointer is stored in the BranchControlRow itself. Each BranchRow contains a child pointer as its last column.
>     * At leaf level also, a page is linked to its left and right siblings and its parent. In addition to the LeafControlRow, leaf-level pages contain {@link org.apache.derby.impl.sql.execute.IndexRow IndexRows}. The last column in an IndexRow is a {@link org.apache.derby.iapi.types.RowLocation RowLocation}.
>     * IndexRows are generated by the {@link org.apache.derby.iapi.sql.dictionary.IndexRowGenerator IndexRowGenerator}.
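>
> To make the page layout above concrete, here is a minimal sketch in plain Java of what a leaf entry and a branch entry conceptually carry. These are not Derby's actual classes (Derby rows are arrays of column values), and the field names are invented for illustration:
>
>     // Illustration only -- not Derby's on-page format.
>     public class BTreeRowShapes {
>
>         /** Leaf entry: the N indexed key columns plus the base row's address. */
>         public static class LeafEntry {
>             Object[] keyColumns;
>             long baseRowLocation;   // stands in for org.apache.derby.iapi.types.RowLocation
>         }
>
>         /** Branch entry: the same key columns (and, currently, the row location) plus a child page pointer as the last column. */
>         public static class BranchEntry {
>             Object[] keyColumns;
>             long baseRowLocation;
>             long childPageNumber;   // last column: points to one child page
>         }
>
>         // A branch page with k BranchEntries references k + 1 children: the
>         // leftmost child pointer lives in the BranchControlRow, and each
>         // BranchEntry carries one child pointer of its own.
>     }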
>
>
> Handling of Unique Keys
>
> From one perspective, the current BTree implementation assumes all rows it handles are unique. This is because the row always includes the address of the base table row (RowLocation). One could not use the current BTree implementation to store two rows which were exactly the same, including the RowLocation. Having all rows unique makes the code simpler, as binary searches don't have to worry about edge cases of duplicate keys.
>
> Having said that, the BTree implementation does support rejecting rows which are not unique given a subset of the columns of the row. Currently this is only used to determine whether index rows are unique, by comparing all columns other than the last one. So, to {@link org.apache.derby.impl.store.access.btree.BTree#create create} a unique SQL index you create a BTree telling it to enforce uniqueness on (number of columns - 1) columns, and to create a non-unique SQL index you create a BTree telling it to enforce uniqueness on all columns.
>
> Note that currently branch rows also include the RowLocation. For most of the code RowLocation is basically opaque, so most of the code just treats the BTree as follows:
>
>     * The BTree is created with N columns.
>     * Every row must be unique when comparing all N columns.
>     * Branch rows contain N+1 columns - all the columns from the leaf entry plus one more that identifies the child page it points to.
>
> There is no reason that branch rows always need the row location; the minimum requirement is that 2 adjacent branch rows must uniquely discriminate between the keys on 2 different leaf pages. Again, for simplicity it is assumed that the keys on a branch page are never duplicates; this simplifies binary search somewhat.
>
>
> Latching implementation
>
> Derby uses latches on pages to get exclusive access to a page while reading or writing it (Derby uses only exclusive latches; no shared latches are used). In order to prevent deadlocks, latches requested while holding other latches are always requested top-down and left-to-right. BTree splits are always left to right. If for any reason the code needs to break this protocol, it first requests the latch NOWAIT; if it can't get the latch, it releases all current latches, waits for the latch it is trying to get, and then, after obtaining it, goes about getting the other latches it needs for the particular operation. While traversing down the tree, Derby may hold 2 latches: one on the parent and one on the child. It then continues doing "ladder" latching down the tree, releasing the higher node once it has successfully latched the next lower node. Latches are short term: only held while reading/modifying the page, never held while an I/O is happening. Structure modifications are all isolated from other operations through the use of latches.
>
>
> Locking and Isolation Levels
>
> Derby uses data only locking for its logical row-level locking. All isolation level implementation is done using logical locks (Derby does not support non-locking isolation such as multi-versioning).
>
> In data only locking, the key for all locks, whether obtained in the BTree or in the base table, is the address of the base table row. In the BTree case, the address of the base table row ({@link org.apache.derby.iapi.types.RowLocation RowLocation}) is always the last column of the leaf index entry ({@link org.apache.derby.impl.sql.execute.IndexRow IndexRow}).
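>
> As a rough illustration of the data only locking point (plain Java, invented names -- not Derby's lock manager API): the lock taken while reading an index entry and the lock taken while touching the base row are keyed identically, by the base row's address, so the two paths contend on one and the same logical lock:
>
>     // Illustration only.
>     public class DataOnlyLockingSketch {
>
>         /** Stand-in for a row lock key; Derby keys row locks by the base row's RowLocation. */
>         public record LockKey(long baseContainerId, long baseRowLocation) {}
>
>         /** Lock taken when a scan reads a leaf index entry: keyed by the entry's last column. */
>         public static LockKey lockForIndexEntry(long baseContainerId, long rowLocationColumn) {
>             return new LockKey(baseContainerId, rowLocationColumn);
>         }
>
>         /** Lock taken when the base table row itself is read or updated: the same key. */
>         public static LockKey lockForBaseRow(long baseContainerId, long rowLocation) {
>             return new LockKey(baseContainerId, rowLocation);
>         }
>
>         public static void main(String[] args) {
>             // Both paths produce equal keys, so index access and heap access
>             // block on a single logical row lock.
>             System.out.println(lockForIndexEntry(42L, 1001L).equals(lockForBaseRow(42L, 1001L)));   // true
>         }
>     }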
>
> Serializable
>     Derby uses "previous key" locking to prevent phantoms from being inserted into a range being accessed by a scan. A scan always locks the index row just "previous" to the first entry that satisfies the scan condition, and then all index entries going forward until the end of the scan. All row locks are held until end of transaction. Inserts also get previous key locks, which will block if the existing row is locked.
> Repeatable Read
>     Same as serializable, except that no previous key locking is done.
> Read Committed
>     Write locks are held until end of transaction. Read locks are released once the query in question no longer needs them.
> Read Uncommitted
>     Write locks are held until end of transaction. No read row locks are acquired. The code still gets table-level intent locks to prevent concurrent DDL during the query.
>
>
> BTree Structure Modifications
>
> In Derby, SMOs (structure modification operations, i.e. page splits) only happen top-down. This is not as concurrent as the bottom-up approach in ARIES/IM <http://www.almaden.ibm.com/u/mohan/RJ6846.pdf>, but it is simpler. As in ARIES/IM, "Not more than 2 index pages are held latched simultaneously at any time. In order to improve concurrency and to avoid deadlocks involving latches, even those latches are not held while waiting for a lock which is not immediately grantable. No data page latch is held or acquired during an index access. Latch coupling is used while traversing the tree - i.e. the latch on a parent page is held while requesting a latch on a child page."
>
> In the case of a simple split, exclusive latches are held on P (the parent) and C (the child). If there is room in P, then a new page C1 (right child) is created, a new discriminator key is inserted into P, and rows are moved to C1. Latches are held on P and C for the duration of the split, and then released. The new page C1 is retrieved from a list of free pages, which may have come from preallocation or from deleted space reclamation. If there is no free space, space is automatically added to the file and used.
>
> The hard case is when P does not have room for the discriminator key. In this case all latches are released, and Derby does a split pass from top to bottom, splitting the internal nodes that do not have room for the discriminator key. Note this may result in more splits than necessary for this particular insert, but the assumption is that the splits would have to happen eventually anyway. After this split pass is done, the search for the insert starts again from the top down, but it must once again check for space because it has given up all its latches and some other transaction may have used the space in the meantime.
>
> An optimization is possible: remember C and check whether it is still the right location, and/or use sideways pointers to search to the right rather than re-searching the tree.
>
>
> Logical Key Deletes
>
> In both the BTree and the Heap, deletes are first executed by marking a "deleted" bit. This is to ensure space on the page for abort, since row-level locking allows other rows on the page to be modified concurrently with the transaction executing the delete. The system uses a background daemon to schedule work after commit to reclaim the space of the deleted rows. A row marked deleted can be "purged" if one can obtain a lock on it (if it were an uncommitted delete, the transaction doing the delete would still hold an exclusive lock on the row).
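>
> A minimal sketch of the logical-delete idea described above, in plain Java with invented names (not Derby's page API): delete sets a bit so undo never has to re-insert, and purging only happens once a lock on the row can be obtained, which proves the deleting transaction has finished:
>
>     import java.util.concurrent.locks.Lock;
>
>     // Illustration only.
>     public class LogicalDeleteSketch {
>
>         public static class Slot {
>             Object[] row;
>             boolean deleted;   // logical delete: the bit is set, the bytes stay on the page
>         }
>
>         /** Delete = set the bit; rollback just clears it, so no re-insert (and no split) is ever needed. */
>         public static void delete(Slot slot) {
>             slot.deleted = true;
>         }
>
>         /** Purge a deleted slot only if a NOWAIT-style lock attempt succeeds,
>             i.e. the transaction that deleted it is no longer holding its lock. */
>         public static boolean tryPurge(Slot slot, Lock rowLock) {
>             if (!slot.deleted || !rowLock.tryLock()) {
>                 return false;
>             }
>             try {
>                 slot.row = null;   // reclaim the space
>                 return true;
>             } finally {
>                 rowLock.unlock();
>             }
>         }
>     }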
>
>
> Garbage Collection of deleted keys
>
> Since rows are only marked as "deleted", and not physically removed, it is necessary to perform space reclamation on deleted rows.
>
>
> Online during BTree split
>
> Whenever there is not enough room on a leaf to do an insert, the code attempts to find space on the leaf by checking whether it can reclaim any committed deletes on that leaf. That work only requires the latch on the leaf and NOWAIT row write locks. It is expected that most of the space reclamation done in the BTree goes through this path. Most of this work is done in {@link org.apache.derby.impl.store.access.btree.BTreeController#reclaim_deleted_rows BTreeController.reclaim_deleted_rows}.
>
>
> BTree post-commit work
>
> Whenever a delete operation deletes the last row from a leaf page, a BtreePostCommit job is queued to be executed after the transaction which did the delete commits. This work currently requires a table-level lock, as page merges have not been implemented to run concurrently with other operations. Many DBMSes don't even try to do page merges except when called from some sort of reorg utility. If all rows on the page are purged, then the page is moved to the free list and a merge of the tree is performed.
>
> It is expected that normal space reclamation happens with row locks during BTree splits, which is why not much work has been done to optimize the BTree post-commit path.
>
>
> Logging and Recovery
>
> Derby uses physical redo and logical undo for BTree inserts and deletes. Logical undo is simplified as a result of using logical key deletes. If keys were physically removed during deletes, then the undo of a key delete would require an insert operation, which could potentially lead to page splits at various levels within the tree. Since the key is not physically removed, but only marked as "deleted", undoing a key delete is accomplished easily. However, since the page on which the insert or delete took place may have changed, it may be necessary to {@link org.apache.derby.impl.store.access.btree.index.B2IUndo#findUndo search} for the correct page.
>
> Structure modification operations (SMOs) are handled as independent {@link org.apache.derby.iapi.store.access.conglomerate.TransactionManager#getInternalTransaction internal transactions} and commit separately from the transaction that initiated the SMO. Once an SMO has completed successfully, it is not undone, even if the transaction that caused it decides to abort. During the undo phase of restart recovery, incomplete internal transactions are undone BEFORE any regular transactions. This ensures that the BTrees are structurally consistent before normal undo begins.
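>
> As a rough sketch of the logical-undo search described above (plain Java, invented interfaces -- the real logic lives in B2IUndo.findUndo): the undo first tries the page recorded in the log record, and only if the key is no longer there (for example because a later split moved it) does it re-search the tree from the root:
>
>     // Illustration only.
>     public class LogicalUndoSketch {
>
>         interface Page {
>             boolean containsKey(Object[] key);
>             void undoOperation(Object[] key);   // e.g. clear or set the deleted bit
>             void releaseLatch();
>         }
>
>         interface Tree {
>             Page latchPage(long pageNumber);    // latch and return the page, or null if it is gone
>             Page searchLeafFor(Object[] key);   // re-traverse from the root
>         }
>
>         /** Physical redo can trust the page number in the log record; logical undo cannot. */
>         static void logicalUndo(Tree tree, long loggedPageNumber, Object[] key) {
>             Page page = tree.latchPage(loggedPageNumber);
>             if (page == null || !page.containsKey(key)) {
>                 if (page != null) {
>                     page.releaseLatch();
>                 }
>                 page = tree.searchLeafFor(key);   // the key moved: search for it again
>             }
>             try {
>                 page.undoOperation(key);
>             } finally {
>                 page.releaseLatch();
>             }
>         }
>     }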
