This proposal introduces "lossy index tuples" (LITE), an idea that provides two important optimizations for btrees, and potentially other index types, stimulated by recent blog complaints.
The two optimizations are Clustered Indexes (mostly for INSERTs) and Update Duplicate Removal (for non-HOT UPDATEs), discussed in more detail below. LITE can be used for either or both of these optimizations independently, even though they are implemented in fairly similar ways, so if one of them gets shot down the other is still potentially valid.

If this has wings, I'll start tracking it as a wiki page. If it doesn't, at least we tried to respond with technical solutions to external input.

=Lossy Index Tuple Enhancement=

The LITE concept is this: if the 13th index tuple flag bit is set, the index tuple represents a collection of tuples on one heap page rather than a pointer to a single root heap tuple. The 13th bit is currently unused, so this optimization is fully backwards compatible with all existing indexes; we just start using them more efficiently over time. Importantly, this means the user doesn't need to do anything at all to begin benefiting from LITE. It also means that much of the required code is non-invasive to the index code. These benefits are much better than inventing a new index type and leaving the user to figure out how and when to use it.

We call the 13th bit the "lossy bit", using the same terminology as bitmap index access, indicating that we no longer hold exact details of the heap tuples the pointer refers to, nor do we do reference counting. When the lossy bit is set we no longer interpret the TID as (blocknumber, linepointer); we interpret the data as (blocknumber, lite_flags). One bit of the lite_flags is reserved to indicate whether the LITE tuple represents multiple duplicates or a range of values. The remaining lite_flags record which ranges of linepointers are covered by the LITE tuple, splitting the block into 15 ranges of linepointers. When coverage grows, we simply OR the bit for the new linepointer's range into the bitmask.
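As a concrete illustration, here is a minimal sketch of that interpretation. All names and the exact layout are hypothetical (this is not the actual ItemPointerData definition); it assumes 16 linepointers per range, 15 ranges, and one bit kept for the duplicate-vs-range flag:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define LITE_RANGE_FLAG 0x8000	/* assumed: 1 = range tuple, 0 = duplicates */
#define LITE_NRANGES	15		/* remaining 15 bits cover linepointer ranges */

/* Hypothetical lossy TID: the 16-bit linepointer field becomes a flag word. */
typedef struct
{
	uint32_t	blocknumber;
	uint16_t	lite_flags;
} LiteTid;

/* Map a 1-based linepointer offset to one of 15 consecutive ranges of 16. */
int
lite_range_for_offset(uint16_t offnum)
{
	int			range = (offnum - 1) / 16;

	return (range < LITE_NRANGES) ? range : LITE_NRANGES - 1;
}

/* Extend coverage: OR the bit for the new linepointer's range into the mask. */
void
lite_cover_offset(LiteTid *tid, uint16_t offnum)
{
	tid->lite_flags |= (uint16_t) (1 << lite_range_for_offset(offnum));
}

/* Does this lossy tuple already cover the given linepointer's range? */
bool
lite_covers_offset(const LiteTid *tid, uint16_t offnum)
{
	return (tid->lite_flags & (1 << lite_range_for_offset(offnum))) != 0;
}
```

Two updates landing in the same 16-linepointer range set the same bit, which is why heavily repeated updates to nearby tuples cost no further index writes.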
(The ranges are consecutive, not striped, to ensure that heavily repetitive updates are optimized.) If anyone is worried about running out of index tuple flags, we could reserve a few of the 16 bits for additional flags rather than use them all for linepointer ranges.

By splitting heap blocks into 15 ranges we avoid the need to scan the whole block in most cases, so the performance degradation from single pointer to complete-block pointer should be much smoother.

This optimization would still allow UNIQUE indexes. In fact, it would almost never overlap with a unique index anyway, since we seldom update the PK of a table. Lossy index tuples would not be able to skip heap access during index-only scans, but the presence of covered index columns would increase uniqueness and very likely stop the index tuple being marked lossy.

The remainder of the description is split between the two use cases/optimizations, which are similar but would each need their own code.

== Update Duplicate Removal ==

We want an optimization that reduces the effect of multiple UPDATEs on the same block that produce duplicate index values because another index column has been updated and a non-HOT index insert has been generated. This optimizes the case where someone has 12 indexes yet updates a column in just one of them, so we optimize away the 11 unnecessary index updates, if possible.

As UPDATEs occur, we request inserts into the index:

* If a lossy index pointer already exists that covers the new linepointer, we skip the index insert, optimizing writes to the index.
* If a lossy index pointer already exists for that value but doesn't yet cover our linepointer, we update the bitmask.
* If a non-lossy index pointer already exists, we set the lossy bit and update the bitmask to reflect which linepointers the index entry covers.
* If no index pointer exists, we insert one.

The first update on a row will not be optimized, but the 2nd through 16th could be.
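The insert-time decision above can be condensed into a single function. This is an illustrative sketch with hypothetical names, not the real btinsert() logic:

```c
#include <assert.h>
#include <stdbool.h>

/* Possible outcomes when an UPDATE requests an index insert (hypothetical). */
typedef enum
{
	LITE_PLAIN_INSERT,		/* no entry yet: insert a normal index tuple */
	LITE_MAKE_LOSSY,		/* non-lossy entry exists: set lossy bit + bitmask */
	LITE_UPDATE_BITMASK,	/* lossy entry exists but misses our linepointer */
	LITE_SKIP_INSERT		/* lossy entry already covers us: no index write */
} LiteAction;

LiteAction
lite_insert_action(bool entry_exists, bool entry_is_lossy, bool covers_new_lp)
{
	if (!entry_exists)
		return LITE_PLAIN_INSERT;	/* the first update is not optimized */
	if (!entry_is_lossy)
		return LITE_MAKE_LOSSY;
	return covers_new_lp ? LITE_SKIP_INSERT : LITE_UPDATE_BITMASK;
}
```

LITE_SKIP_INSERT is the case that eliminates the index write and its WAL record entirely.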
These actions optimize away most of the writes and WAL activity on the index data block, since only 1 in every 16 updates (in practice about 1 in 10 on average) would cause changes to the block. Most importantly, updates will cause far fewer index block splits, and reindexing will be needed less often. The same situation can occur if we have many INSERTs with the same values on the same block, so this could also reduce the size of btrees on indexes with many duplicate values.

== Clustered Index ==

We want an optimization that allows us to index a consecutive range of values on one block. Currently, if we have values 1-100 on one heap block, we have 100 index pointers all pointing to that one heap block. LITE would allow that to be indexed as a single index tuple, in the common case.

As an INSERT starts a new block, we begin with a base value of N. Frequently the next value in the table is N+1 and is inserted into the same block (or could be arranged to be). After, say, 100 INSERTs we have values 1-100 on the heap block. On the first INSERT we insert a normal index tuple pointing to it. As we INSERT a new value we can mark the index tuple as an RLITE (range-LITE) tuple. Further inserts with a higher value on this data block are optimized away, as long as there is no later tuple, e.g.

  val1 -> block0
  val100 -> block1

If we insert the value 85, it would go to block0; if we insert the value 105, it would go to block1 without creating an additional index tuple. The value -5 would create a new tuple.

Index entries that split block ranges would be more complex and costly, though they are seen only infrequently in real-world apps. E.g. we might have values 1, 90, 91 on block0 and values 100, 105 on block1. If we insert the value 25 into block2, we would then have an index that looks like this:

  val1 -> block0
  val25 -> block2
  val90 -> block0
  val100 -> block1

First, we would clearly benefit here if we route the heap insert for value 25 to occur on block0.
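The value-to-block routing in the examples above amounts to "find the last RLITE entry whose starting value is <= the new value". A toy sketch (hypothetical names; a real btree descent would get this for free from its normal binary search):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One RLITE index entry: the lowest value it covers, and its heap block. */
typedef struct
{
	int			minval;
	uint32_t	block;
} RliteEntry;

/*
 * Return the position of the entry that routes 'value', or -1 if the value
 * sorts before every entry (a new index tuple is then required, as for the
 * value -5 in the example). Entries are sorted by minval.
 */
ptrdiff_t
rlite_route(const RliteEntry *entries, size_t n, int value)
{
	ptrdiff_t	best = -1;
	size_t		i;

	for (i = 0; i < n; i++)
	{
		if (entries[i].minval > value)
			break;
		best = (ptrdiff_t) i;
	}
	return best;
}
```

With the two-entry index above, value 85 routes to block0 and value 105 to block1, neither creating a new index tuple.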
That routing may not be acceptable, or the block itself may be full, so we must handle the block split. If a value is inserted into a range, we must re-read the heap block to see how to handle it. I.e. the index tuple val1 represents values from val1 to val100 and points to block0; if we insert val25 we must re-read block0 and edit the index entries around it, creating two new index tuples: val1 (representing 1-24) and val90 (representing 90-99). This looks doable, but more complex, so I'm posting this now to allow for discussion and for people to spot the holes.

== IndexScan ==

Note that the executor code for IndexScan appears identical between the two optimizations. The difference between duplicate and range LITE tuples is needed only at INSERT time (or when an UPDATE sets an indexed column to a new value). When an IndexScan sees a LITE tuple, it scans the linepointer ranges covered by that index tuple (which might eventually amount to a full block scan). For example, with 1 bit set we would scan 16 linepointers (on an 8192-byte block), with 2 bits set we would scan 32 linepointers, etc., not necessarily in consecutive ranges. So the IndexScan can return potentially many heap rows per index entry, which need to be re-checked and may also need to be sorted prior to returning them. If no rows are returned we can kill the index pointer, just as we do now for btrees, so it will eventually be removed from the index without the need for VACUUM.

A BitmapIndexScan that sees a lossy pointer can also use the lossy TID concept, so we can have partially lossy bitmaps.

== VACUUM ==

VACUUM would not cause a problem, since line pointer removal for non-lossy index tuples would occur normally. Lossy index tuples would be ignored by index vacuum, but they would not prevent removal of line pointers from the heap.

== Code Footprint ==

All of this code is concentrated in the Scan nodes and in btinsert(). There are no changes to WAL records or replay.
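To make the read side concrete, here is a sketch of how an IndexScan (see the IndexScan section above) might expand a LITE flag word into the linepointer offsets to visit, under the same assumptions as before (15 ranges of 16 linepointers on an 8K block; names hypothetical):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Expand a LITE flag word into 1-based linepointer offsets to scan:
 * bit i covers offsets i*16+1 .. i*16+16.  'out' must hold 240 entries.
 * Returns the number of offsets written; the caller re-checks each heap
 * tuple against the scan quals, since the pointer is lossy.
 */
size_t
lite_offsets_to_scan(uint16_t lite_flags, uint16_t *out)
{
	size_t		n = 0;
	int			bit;

	for (bit = 0; bit < 15; bit++)
	{
		if (lite_flags & (1 << bit))
		{
			int			k;

			for (k = 1; k <= 16; k++)
				out[n++] = (uint16_t) (bit * 16 + k);
		}
	}
	return n;
}
```

One bit set yields 16 offsets to check; two bits, 32; all 15 bits degrade gracefully into a full-block scan.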
The details are all in the way we insert and interpret index tuples.

== Other designs ==

This is related in a few ways to Grouped Index Tuple (GIT) indexes, a patch Heikki worked on in 2007, but is not that close.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers