I agree there is much more space to save in leaf level compression. Duplicate key and prefix compression are a couple of possible areas.
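
To make "prefix compression" concrete, here is a rough sketch of the idea in plain Java -- this is not Derby code, and the class and method names are invented for illustration. Each leaf key after the first is stored as the length of the prefix it shares with its predecessor plus the remaining suffix bytes:

    import java.util.ArrayList;
    import java.util.List;

    // Illustration only: adjacent sorted keys usually share a prefix, so we
    // store (shared-prefix length, suffix) instead of the full key bytes.
    public class PrefixCompressionSketch {

        /** One compressed entry: reuse this many bytes of the previous key, then append the suffix. */
        public record Entry(int sharedPrefixLen, byte[] suffix) {}

        public static List<Entry> compress(List<byte[]> sortedKeys) {
            List<Entry> out = new ArrayList<>();
            byte[] prev = new byte[0];
            for (byte[] key : sortedKeys) {
                int shared = 0;
                int max = Math.min(prev.length, key.length);
                while (shared < max && prev[shared] == key[shared]) {
                    shared++;
                }
                byte[] suffix = new byte[key.length - shared];
                System.arraycopy(key, shared, suffix, 0, suffix.length);
                out.add(new Entry(shared, suffix));
                prev = key;
            }
            return out;
        }
    }

How much leaf space this actually saves obviously depends on the key distribution.
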
Dibyendu Majumdar wrote:
> Mike Matrigali wrote:
>
>> One could create compressed branch entries if the RowLocation were not needed to differentiate between 2 adjacent entries (i.e., for 2 adjacent branch entries: mike, row loc 1, leaf page 113; stan, row loc 2, leaf page 112 -- for these 2 branch rows the minimum suffix of mike/stan would be enough without the row location). Because the page/row format in Derby does not require a constant number of columns per row, such a change would not be too hard. One could also do suffix compression within a column - for instance, in the previous example only m/s is really necessary, but such a decision is very datatype dependent and needs to be supported by the datatype itself - not a decision made by store looking at the bytes.
>>
> I guess it would be more fruitful to do compression of duplicate keys in non-unique indexes at leaf-level pages, unless this is already done.
>
> Attached is a revised package.html with Javadoc-style links. I haven't been able to test the links - but I think they should work.
>
> Regards
>
> ------------------------------------------------------------------------
>
> Implements the BTree access method, which is the basis for SQL indexes.
>
>
> Overview
>
> Derby implements secondary SQL indexes as BTrees. The high-level features of the BTree implementation are:
>
>    1. Derby implements the standard B+Tree algorithm, in which all keys are stored in leaf pages, and the upper levels are organized as a redundant index for enabling rapid location of a particular key. See The Ubiquitous B-Tree, Douglas Comer <http://portal.acm.org/citation.cfm?id=356776> for a definition of the B+Tree.
>    2. The BTree implementation supports concurrency using page-level latches, and recovery using a Write Ahead Log.
>    3. Derby uses data only locking for its logical row-level locking.
>    4. Derby supports all four SQL isolation levels: serializable, repeatable read, read committed, and read uncommitted.
>    5. Derby uses logical key deletes. This enables it to perform undos during rollbacks and restart recovery as single-page operations.
>
>
> High-level structure of the B+Tree
>
>     * The first row in all pages, branch or leaf, is a control row. In branch pages this is a {@link org.apache.derby.impl.store.access.btree.BranchControlRow BranchControlRow}, and in leaf pages it is a {@link org.apache.derby.impl.store.access.btree.LeafControlRow LeafControlRow}.
>     * In addition to the BranchControlRow, branch-level pages contain {@link org.apache.derby.impl.store.access.btree.BranchRow BranchRows}.
>     * At branch level, a page is linked to its left and right siblings, its parent, and its children. The number of child pages is 1 greater than the number of BranchRows in the page. The leftmost child pointer is stored in the BranchControlRow itself. Each BranchRow contains a child pointer as its last column.
>     * At leaf level also, a page is linked to its left and right siblings and its parent. In addition to the LeafControlRow, leaf-level pages contain {@link org.apache.derby.impl.sql.execute.IndexRow IndexRows}. The last column in an IndexRow is a {@link org.apache.derby.iapi.types.RowLocation RowLocation}.
>     * IndexRows are generated by the {@link org.apache.derby.iapi.sql.dictionary.IndexRowGenerator IndexRowGenerator}.
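>
> To make the page layout above concrete, here is a minimal sketch in plain Java of what a leaf entry and a branch entry conceptually carry. These are not Derby's actual classes (Derby rows are arrays of column values), and the field names are invented for illustration:
>
>     // Illustration only -- not Derby's on-page format.
>     public class BTreeRowShapes {
>
>         /** Leaf entry: the N indexed key columns plus the base row's address. */
>         public static class LeafEntry {
>             Object[] keyColumns;
>             long baseRowLocation;   // stands in for org.apache.derby.iapi.types.RowLocation
>         }
>
>         /** Branch entry: the same key columns (and, currently, the row location) plus a child page pointer as the last column. */
>         public static class BranchEntry {
>             Object[] keyColumns;
>             long baseRowLocation;
>             long childPageNumber;   // last column: points to one child page
>         }
>
>         // A branch page with k BranchEntries references k + 1 children: the
>         // leftmost child pointer lives in the BranchControlRow, and each
>         // BranchEntry carries one child pointer of its own.
>     }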
>
>
> Handling of Unique Keys
>
> From one perspective, the current BTree implementation assumes all rows it handles are unique. This is because the row always includes the address of the base table row (RowLocation). One could not use the current BTree implementation to store two rows which were exactly the same, including the RowLocation. Having all rows unique makes the code simpler, as binary searches don't have to worry about edge cases of duplicate keys.
>
> Having said that, the BTree implementation does support rejecting rows which are not unique given a subset of the columns of the row. Currently this is only used to determine whether index rows are unique, by comparing all columns other than the last one. So, to {@link org.apache.derby.impl.store.access.btree.BTree#create create} a unique SQL index you create a BTree telling it to enforce uniqueness on (number of columns - 1) columns, and to create a non-unique SQL index you create a BTree telling it to enforce uniqueness on all columns.
>
> Note that currently branch rows also include the RowLocation. For most of the code RowLocation is basically opaque, so most of the code just treats the BTree as follows:
>
>     * The BTree is created with N columns.
>     * Every row must be unique when comparing all N columns.
>     * Branch rows contain N+1 columns - all the columns from the leaf entry plus one more that identifies the child page it points to.
>
> There is no reason that branch rows always need the row location; the minimum requirement is that 2 adjacent branch rows must uniquely discriminate between the keys on 2 different leaf pages. Again, for simplicity it is assumed that the keys on a branch page are never duplicates; this simplifies binary search somewhat.
>
>
> Latching implementation
>
> Derby uses latches on pages to get exclusive access to a page while reading or writing it (Derby uses only exclusive latches; no shared latches are used). In order to prevent deadlocks, latches requested while holding other latches are always requested top-down and left-to-right. BTree splits are always left to right. If for any reason the code needs to break this protocol, it first requests the latch NOWAIT; if it can't get the latch, it releases all current latches, waits for the latch it is trying to get, and then, after obtaining it, goes about getting the other latches it needs for the particular operation. While traversing down the tree, Derby may hold 2 latches: one on the parent and one on the child. It then continues doing "ladder" latching down the tree, releasing the higher node once it has successfully latched the next lower node. Latches are short term: only held while reading/modifying the page, never held while an I/O is happening. Structure modifications are all isolated from other operations through the use of latches.
>
>
> Locking and Isolation Levels
>
> Derby uses data only locking for its logical row-level locking. All isolation level implementation is done using logical locks (Derby does not support non-locking isolation such as multi-versioning).
>
> In data only locking, the key for all locks, whether obtained in the BTree or in the base table, is the address of the base table row. In the BTree case, the address of the base table row ({@link org.apache.derby.iapi.types.RowLocation RowLocation}) is always the last column of the leaf index entry ({@link org.apache.derby.impl.sql.execute.IndexRow IndexRow}).
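>
> As a rough illustration of the data only locking point (plain Java, invented names -- not Derby's lock manager API): the lock taken while reading an index entry and the lock taken while touching the base row are keyed identically, by the base row's address, so the two paths contend on one and the same logical lock:
>
>     // Illustration only.
>     public class DataOnlyLockingSketch {
>
>         /** Stand-in for a row lock key; Derby keys row locks by the base row's RowLocation. */
>         public record LockKey(long baseContainerId, long baseRowLocation) {}
>
>         /** Lock taken when a scan reads a leaf index entry: keyed by the entry's last column. */
>         public static LockKey lockForIndexEntry(long baseContainerId, long rowLocationColumn) {
>             return new LockKey(baseContainerId, rowLocationColumn);
>         }
>
>         /** Lock taken when the base table row itself is read or updated: the same key. */
>         public static LockKey lockForBaseRow(long baseContainerId, long rowLocation) {
>             return new LockKey(baseContainerId, rowLocation);
>         }
>
>         public static void main(String[] args) {
>             // Both paths produce equal keys, so index access and heap access
>             // block on a single logical row lock.
>             System.out.println(lockForIndexEntry(42L, 1001L).equals(lockForBaseRow(42L, 1001L)));   // true
>         }
>     }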
>
> Serializable
>     Derby uses "previous key" locking to prevent phantoms from being inserted into a range being accessed by a scan. A scan always locks the index row just "previous" to the first entry that satisfies the scan condition, and then all index entries going forward until the end of the scan. All row locks are held until end of transaction. Inserts also get previous key locks, which will block if the existing row is locked.
> Repeatable Read
>     Same as serializable, except that no previous key locking is done.
> Read Committed
>     Write locks are held until end of transaction. Read locks are released once the query in question no longer needs them.
> Read Uncommitted
>     Write locks are held until end of transaction. No read row locks are acquired. The code still gets table-level intent locks to prevent concurrent DDL during the query.
>
>
> BTree Structure Modifications
>
> In Derby, SMOs (structure modification operations, i.e. page splits) only happen top-down. This is not as concurrent as the bottom-up approach in ARIES/IM <http://www.almaden.ibm.com/u/mohan/RJ6846.pdf>, but it is simpler. As in ARIES/IM, "Not more than 2 index pages are held latched simultaneously at any time. In order to improve concurrency and to avoid deadlocks involving latches, even those latches are not held while waiting for a lock which is not immediately grantable. No data page latch is held or acquired during an index access. Latch coupling is used while traversing the tree - i.e. the latch on a parent page is held while requesting a latch on a child page."
>
> In the case of a simple split, exclusive latches are held on P (the parent) and C (the child). If there is room in P, then a new page C1 (right child) is created, a new discriminator key is inserted into P, and rows are moved to C1. Latches are held on P and C for the duration of the split, and then released. The new page C1 is retrieved from a list of free pages, which may have come from preallocation or from deleted space reclamation. If there is no free space, space is automatically added to the file and used.
>
> The hard case is when P does not have room for the discriminator key. In this case all latches are released, and Derby does a split pass from top to bottom, splitting the internal nodes that do not have room for the discriminator key. Note this may result in more splits than necessary for this particular insert, but the assumption is that the splits would have to happen eventually anyway. After this split pass is done, the search for the insert starts again from the top down, but it must once again check for space because it has given up all its latches and some other transaction may have used the space in the meantime.
>
> An optimization is possible: remember C and check whether it is still the right location, and/or use sideways pointers to search to the right rather than re-searching the tree.
>
>
> Logical Key Deletes
>
> In both the BTree and the Heap, deletes are first executed by marking a "deleted" bit. This is to ensure space on the page for abort, since row-level locking allows other rows on the page to be modified concurrently with the transaction executing the delete. The system uses a background daemon to schedule work after commit to reclaim the space of the deleted rows. A row marked deleted can be "purged" if one can obtain a lock on it (if it were an uncommitted delete, the transaction doing the delete would still hold an exclusive lock on the row).
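>
> A minimal sketch of the logical-delete idea described above, in plain Java with invented names (not Derby's page API): delete sets a bit so undo never has to re-insert, and purging only happens once a lock on the row can be obtained, which proves the deleting transaction has finished:
>
>     import java.util.concurrent.locks.Lock;
>
>     // Illustration only.
>     public class LogicalDeleteSketch {
>
>         public static class Slot {
>             Object[] row;
>             boolean deleted;   // logical delete: the bit is set, the bytes stay on the page
>         }
>
>         /** Delete = set the bit; rollback just clears it, so no re-insert (and no split) is ever needed. */
>         public static void delete(Slot slot) {
>             slot.deleted = true;
>         }
>
>         /** Purge a deleted slot only if a NOWAIT-style lock attempt succeeds,
>             i.e. the transaction that deleted it is no longer holding its lock. */
>         public static boolean tryPurge(Slot slot, Lock rowLock) {
>             if (!slot.deleted || !rowLock.tryLock()) {
>                 return false;
>             }
>             try {
>                 slot.row = null;   // reclaim the space
>                 return true;
>             } finally {
>                 rowLock.unlock();
>             }
>         }
>     }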
>
>
> Garbage Collection of deleted keys
>
> Since rows are only marked as "deleted", and not physically removed, it is necessary to perform space reclamation on deleted rows.
>
>
> Online during BTree split
>
> Whenever there is not enough room on a leaf to do an insert, the code attempts to find space on the leaf by checking whether it can reclaim any committed deletes on that leaf. That work only requires the latch on the leaf and NOWAIT row write locks. It is expected that most of the space reclamation done in the BTree goes through this path. Most of this work is done in {@link org.apache.derby.impl.store.access.btree.BTreeController#reclaim_deleted_rows BTreeController.reclaim_deleted_rows}.
>
>
> BTree post-commit work
>
> Whenever a delete operation deletes the last row from a leaf page, a BtreePostCommit job is queued to be executed after the transaction which did the delete commits. This work currently requires a table-level lock, as page merges have not been implemented to run concurrently with other operations. Many DBMSes don't even try to do page merges except when called from some sort of reorg utility. If all rows on the page are purged, then the page is moved to the free list and a merge of the tree is performed.
>
> It is expected that normal space reclamation happens with row locks during BTree splits, which is why not much work has been done to optimize the BTree post-commit path.
>
>
> Logging and Recovery
>
> Derby uses physical redo and logical undo for BTree inserts and deletes. Logical undo is simplified as a result of using logical key deletes. If keys were physically removed during deletes, then the undo of a key delete would require an insert operation, which could potentially lead to page splits at various levels within the tree. Since the key is not physically removed, but only marked as "deleted", undoing a key delete is accomplished easily. However, since the page on which the insert or delete took place may have changed, it may be necessary to {@link org.apache.derby.impl.store.access.btree.index.B2IUndo#findUndo search} for the correct page.
>
> Structure modification operations (SMOs) are handled as independent {@link org.apache.derby.iapi.store.access.conglomerate.TransactionManager#getInternalTransaction internal transactions} and commit separately from the transaction that initiated the SMO. Once an SMO has completed successfully, it is not undone, even if the transaction that caused it decides to abort. During the undo phase of restart recovery, incomplete internal transactions are undone BEFORE any regular transactions. This ensures that the BTrees are structurally consistent before normal undo begins.
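>
> As a rough sketch of the logical-undo search described above (plain Java, invented interfaces -- the real logic lives in B2IUndo.findUndo): the undo first tries the page recorded in the log record, and only if the key is no longer there (for example because a later split moved it) does it re-search the tree from the root:
>
>     // Illustration only.
>     public class LogicalUndoSketch {
>
>         interface Page {
>             boolean containsKey(Object[] key);
>             void undoOperation(Object[] key);   // e.g. clear or set the deleted bit
>             void releaseLatch();
>         }
>
>         interface Tree {
>             Page latchPage(long pageNumber);    // latch and return the page, or null if it is gone
>             Page searchLeafFor(Object[] key);   // re-traverse from the root
>         }
>
>         /** Physical redo can trust the page number in the log record; logical undo cannot. */
>         static void logicalUndo(Tree tree, long loggedPageNumber, Object[] key) {
>             Page page = tree.latchPage(loggedPageNumber);
>             if (page == null || !page.containsKey(key)) {
>                 if (page != null) {
>                     page.releaseLatch();
>                 }
>                 page = tree.searchLeafFor(key);   // the key moved: search for it again
>             }
>             try {
>                 page.undoOperation(key);
>             } finally {
>                 page.releaseLatch();
>             }
>         }
>     }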
