Making all nbtree entries unique by having heap TIDs participate in comparisons
I've been thinking about using heap TID as a tie-breaker when comparing B-Tree index tuples for a while now [1]. I'd like to make all tuples at the leaf level unique, as assumed by L&Y. This can enable "retail index tuple deletion", which I think we'll probably end up implementing in some form or another, possibly as part of the zheap project. It's also possible that this work will facilitate GIN-style deduplication based on run length encoding of TIDs, or storing versioned heap TIDs in an out-of-line nbtree-versioning structure (unique indexes only). I can see many possibilities, but we have to start somewhere.

I attach an unfinished prototype of suffix truncation, which also sometimes *adds* a new attribute in pivot tuples. It adds an extra heap TID from the leaf level when truncating away non-distinguishing attributes during a leaf page split, though only when it must. The patch also has nbtree treat heap TID as a first-class part of the key space of the index. Claudio wrote a patch that did something similar, though without the suffix truncation part [2] (I haven't studied his patch, to be honest).

My patch is actually a very indirect spin-off of Anastasia's covering index patch, and I want to show what I have in mind now, while it's still swapped into my head. I won't do any serious work on this project unless and until I see a way to implement retail index tuple deletion, which seems like a multi-year project that requires the buy-in of multiple senior community members. On its own, my patch regresses performance unacceptably in some workloads, probably due to interactions with kill_prior_tuple()/LP_DEAD hint setting, and interactions with page space management when there are many "duplicates" (it can still help performance in some pgbench workloads with non-unique indexes, though).
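[Editor's note: the tie-breaker ordering described above can be sketched in a few lines of C. This is not the actual nbtree code -- the struct and function names here are invented for illustration -- but it shows the key point: heap TID behaves like an implicit trailing key attribute, so no two leaf tuples ever compare as fully equal.]

```c
#include <stdint.h>

/* Hypothetical simplified index tuple: one integer key attribute plus a
 * heap TID (block number, offset number).  Real nbtree tuples are far
 * more complex; this only illustrates the comparison order. */
typedef struct
{
    int32_t  key;        /* user-visible attribute */
    uint32_t tid_block;  /* heap TID: block number */
    uint16_t tid_offset; /* heap TID: offset number */
} DemoIndexTuple;

/* Compare as if heap TID were an implicit trailing key attribute:
 * user attributes first, heap TID only as a tie-breaker.  Every tuple
 * therefore occupies a distinct position in the key space. */
static int
demo_compare(const DemoIndexTuple *a, const DemoIndexTuple *b)
{
    if (a->key != b->key)
        return (a->key < b->key) ? -1 : 1;
    if (a->tid_block != b->tid_block)
        return (a->tid_block < b->tid_block) ? -1 : 1;
    if (a->tid_offset != b->tid_offset)
        return (a->tid_offset < b->tid_offset) ? -1 : 1;
    return 0;
}
```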
Note that the approach to suffix truncation that I've taken isn't even my preferred approach [3] -- it's a medium-term solution that enables making a heap TID attribute part of the key space, which enables everything else. Cheap incremental/retail tuple deletion is the real prize here; don't lose sight of that when looking through my patch. If we're going to teach nbtree to truncate this new implicit heap TID attribute, which seems essential, then we might as well teach nbtree to do suffix truncation of other (user-visible) attributes while we're at it. This patch isn't a particularly effective implementation of suffix truncation, because that's not what I'm truly interested in improving here (plus I haven't even bothered to optimize the logic for picking a split point in light of suffix truncation).

amcheck
===

This patch adds amcheck coverage, which seems like essential infrastructure for developing a feature such as this. Extensive amcheck coverage gave me confidence in my general approach.

The basic idea, invariant-wise, is to treat truncated attributes (often including a truncated heap TID attribute in internal pages) as "minus infinity" attributes, which participate in comparisons if and only if we reach such attributes before the end of the scan key (a smaller keysz for the index scan could prevent this). I've generalized the minus infinity concept that _bt_compare() has always considered as a special case, extending it to individual attributes. It's actually possible to remove that old hard-coded _bt_compare() logic with this patch applied without breaking anything, since we can rely on the comparison of an explicitly 0-attribute tuple working the same way (pg_upgrade'd databases will break if we do this, however, so I didn't go that far).

Note that I didn't change the logic that has _bt_binsrch() treat internal pages in a special way when tuples compare as equal. We still need that logic for cases where keysz is less than the number of indexed columns.
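[Editor's note: the "minus infinity" rule for truncated attributes may be easier to see in a toy sketch. The names below are invented, and this is not the patch's actual _bt_compare(); it only demonstrates the rule that a truncated attribute compares lower than any scan key value, so the scan key compares greater as soon as it reaches one.]

```c
#include <stdint.h>

/* Hypothetical pivot tuple with suffix truncation: only the first
 * "natts" attribute values are present; the rest were truncated away
 * and behave as "minus infinity" in comparisons. */
typedef struct
{
    int     natts;    /* number of untruncated attributes */
    int32_t attrs[4]; /* attribute values (first natts are valid) */
} DemoPivot;

/* Compare a scan key of keysz attributes against a pivot tuple.  If we
 * run past the pivot's untruncated attributes before exhausting the
 * scan key, the truncated attribute reads as minus infinity, so the
 * scan key compares greater. */
static int
demo_compare_pivot(const int32_t *scankey, int keysz, const DemoPivot *pivot)
{
    for (int i = 0; i < keysz; i++)
    {
        if (i >= pivot->natts)
            return 1;       /* truncated attribute: minus infinity */
        if (scankey[i] != pivot->attrs[i])
            return (scankey[i] < pivot->attrs[i]) ? -1 : 1;
    }
    return 0;               /* equal on all compared attributes */
}
```

Note how a smaller keysz can prevent the truncated attributes from ever participating, exactly as described above.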
It's only possible to avoid this _bt_binsrch() thing for internal pages when all attributes, including heap TID, were specified and compared (an insertion scan key has to have an entry for every indexed column, including even heap TID). Doing better there doesn't seem worth the trouble of teaching _bt_compare() to tell the _bt_binsrch() caller about this as a special case. That means that we still move left on equality in some cases where it isn't strictly necessary, contrary to L&Y. However, amcheck verifies that the classic "Ki < v <= Ki+1" invariant holds (as opposed to "Ki <= v <= Ki+1") when verifying parent/child relationships, which demonstrates that I have restored the classic invariant (I just don't find it worthwhile to take advantage of it within _bt_binsrch() just yet).

Most of this work was done while I was an employee of VMware, though I joined Crunchy Data on Monday and cleaned it up a bit more since then. I'm excited about joining Crunchy, but I should also acknowledge VMware's strong support of my work.

[1] https://wiki.postgresql.org/wiki/Key_normalization#Making_all_items_in_the_index_unique_by_treating_heap_TID_as_an_implicit_last_att
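[Editor's note: as a rough illustration of the "Ki < v <= Ki+1" invariant that amcheck verifies here, consider this toy bounds check. The names are invented and the values are plain integers; real amcheck works with IndexTuples and opclass comparators.]

```c
#include <stdbool.h>
#include <stdint.h>

/* amcheck-style sketch: verify the classic Lehman & Yao invariant
 * "Ki < v <= Ki+1" for every value v on a child page bounded by the
 * separator keys Ki (low_sep) and Ki+1 (high_sep) in its parent.
 * The strict inequality on the left is what distinguishes the classic
 * invariant from the weaker "Ki <= v <= Ki+1" form. */
static bool
demo_check_bounds(const int32_t *child_values, int nvalues,
                  int32_t low_sep, int32_t high_sep)
{
    for (int i = 0; i < nvalues; i++)
    {
        if (!(low_sep < child_values[i] && child_values[i] <= high_sep))
            return false;
    }
    return true;
}
```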
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On 26/02/2019 12:31, Peter Geoghegan wrote:
> On Mon, Jan 28, 2019 at 7:32 AM Heikki Linnakangas wrote:
>> I spent some time first trying to understand the current algorithm, and
>> then rewriting it in a way that I find easier to understand. I came up
>> with the attached. I think it optimizes for the same goals as your
>> patch, but the approach is quite different.
>
> Attached is v13 of the patch series, which significantly refactors
> nbtsplitloc.c to implement the algorithm using the approach from your
> prototype posted on January 28 -- I now take a "top down" approach that
> materializes all legal split points up-front, as opposed to the initial
> "bottom up" approach that used recursion, and weighed everything
> (balance of free space, suffix truncation, etc) all at once.

Great, looks much simpler now, indeed! Now I finally understand the algorithm.

> I'm using qsort() to sort the candidate split points array. I'm not
> trying to do something clever to avoid the up-front effort of sorting
> everything, even though we could probably get away with that much of
> the time (e.g. by doing a top-N sort in default mode). Testing has
> shown that using an inlined qsort() routine in the style of tuplesort.c
> would make my serial test cases/microbenchmarks faster, without adding
> much complexity. We're already competitive with the master branch even
> without this microoptimization, so I've put that off for now. What do
> you think of the idea of specializing an inlineable qsort() for
> nbtsplitloc.c?

If the performance is acceptable without it, let's not bother. We can optimize later.

What would be the worst case scenario for this? Splitting a page that has as many tuples as possible, I guess, so maybe inserting into a table with a single-column index, with 32k BLCKSZ. Have you done performance testing on something like that?

> Unlike in your prototype, v13 makes the array for holding candidate
> split points into a single big allocation that is always exactly
> BLCKSZ. The idea is that palloc() can thereby recycle the big
> _bt_findsplitloc() allocation within _bt_split(). palloc() considers
> 8KiB to be the upper limit on the size of individual blocks it manages,
> and we'll always go on to palloc(BLCKSZ) through the _bt_split() call
> to PageGetTempPage(). In a sense, we're not even allocating memory that
> we weren't allocating already. (Not sure that this really matters, but
> it is easy to do it that way.)

Rounding up the allocation to BLCKSZ seems like a premature optimization. Even if it saved some cycles, I don't think it's worth the trouble of having to explain all that in the comment.

I think you could change the curdelta, leftfree, and rightfree fields in SplitPoint to int16, to make the array smaller.

> Other changes from your prototype:
>
> * I found a more efficient representation than a pair of raw IndexTuple
> pointers for each candidate split. Actually, I use the same old
> representation (firstoldonright + newitemonleft) in each split, and
> provide routines to work backwards from that to get the lastleft and
> firstright tuples. This approach is far more space efficient, and space
> efficiency matters when you're allocating space for hundreds of items
> in a critical path like this.

Ok.

> * You seemed to refactor _bt_checksplitloc() in passing, making it not
> do the newitemisfirstonright thing. I changed that back. Did I miss
> something that you intended here?

My patch treated the new item the same as all the old items, in _bt_checksplitloc(), so it didn't need newitemisonright. You still need it with your approach.

> Changes to my own code since v12:
>
> * Simplified "Add "split after new tuple" optimization" commit, and
> made it more consistent with associated code. This is something that
> was made a lot easier by the new approach. It would be great to hear
> what you think of this.

I looked at it very briefly. Yeah, it's pretty simple now. Nice!
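[Editor's note: to make the "materialize then qsort()" discussion above concrete, here is a toy sketch. The names and the simplified penalty are invented; the real nbtsplitloc.c weighs free space balance, suffix truncation, and more, and uses the compact (firstoldonright + newitemonleft) representation rather than raw tuple pointers.]

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Compact candidate split point, roughly as described in the thread:
 * an offset plus a flag, instead of a pair of IndexTuple pointers. */
typedef struct
{
    int16_t curdelta;        /* free space imbalance after the split */
    int16_t firstoldonright; /* first pre-existing item on right page */
    bool    newitemonleft;   /* does the new item go on the left page? */
} DemoSplitPoint;

/* qsort() comparator: smaller imbalance sorts first */
static int
demo_splitpoint_cmp(const void *arg1, const void *arg2)
{
    const DemoSplitPoint *a = (const DemoSplitPoint *) arg1;
    const DemoSplitPoint *b = (const DemoSplitPoint *) arg2;

    if (a->curdelta != b->curdelta)
        return (a->curdelta < b->curdelta) ? -1 : 1;
    return 0;
}

/* Sort all materialized candidates so the most evenly balanced
 * split point comes first. */
static void
demo_sort_splitpoints(DemoSplitPoint *candidates, size_t n)
{
    qsort(candidates, n, sizeof(DemoSplitPoint), demo_splitpoint_cmp);
}
```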
About this comment on _bt_findsplitloc():

/*
 * _bt_findsplitloc() -- find an appropriate place to split a page.
 *
 * The main goal here is to equalize the free space that will be on each
 * split page, *after accounting for the inserted tuple*. (If we fail to
 * account for it, we might find ourselves with too little room on the page
 * that it needs to go into!)
 *
 * If the page is the rightmost page on its level, we instead try to arrange
 * to leave the left split page fillfactor% full. In this way, when we are
 * inserting successively increasing keys (consider sequences, timestamps,
 * etc) we will end up with a tree whose pages are about fillfactor% full,
 * instead of the 50% full result that we'd get without this special case.
 * This is the same as nbtsort.c produces for a newly-created tree. Note
 * that leaf and nonleaf pages use different fillfactors.
 *
 * We are passed the intended insert position of the new tuple, expressed as
 * the offsetnumber of the tuple it must go in front of (this could be
 * maxoff+1 if the tuple is to go at the end). The new tuple itself is also
 * p
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
Some comments on v13-0002-make-heap-TID-a-tie-breaker-nbtree-index-column.patch below. Mostly about code comments. In general, I think a round of copy-editing the comments, to use simpler language, would do good. The actual code changes look good to me.

/*
 * _bt_findinsertloc() -- Finds an insert location for a tuple
 *
 * On entry, *bufptr contains the page that the new tuple unambiguously
 * belongs on. This may not be quite right for callers that just called
 * _bt_check_unique(), though, since they won't have initially searched
 * using a scantid. They'll have to insert into a page somewhere to the
 * right in rare cases where there are many physical duplicates in a
 * unique index, and their scantid directs us to some page full of
 * duplicates to the right, where the new tuple must go. (Actually,
 * since !heapkeyspace pg_upgraded'd non-unique indexes never get a
 * scantid, they too may require that we move right. We treat them
 * somewhat like unique indexes.)

Seems confusing to first say assertively that "*bufptr contains the page that the new tuple unambiguously belongs to", and then immediately go on to list a whole bunch of exceptions. Maybe just remove "unambiguously".

@@ -759,7 +787,10 @@ _bt_findinsertloc(Relation rel,
 	 * If this page was incompletely split, finish the split now. We
 	 * do this while holding a lock on the left sibling, which is not
 	 * good because finishing the split could be a fairly lengthy
-	 * operation. But this should happen very seldom.
+	 * operation. But this should only happen when inserting into a
+	 * unique index that has more than an entire page for duplicates
+	 * of the value being inserted. (!heapkeyspace non-unique indexes
+	 * are an exception, once again.)
 	 */
 	if (P_INCOMPLETE_SPLIT(lpageop))
 	{

This happens very seldom, because you only get an incomplete split if you crash in the middle of a page split, and that should be very rare. I don't think we need to list more fine-grained conditions here, that just confuses the reader.
/*
 * _bt_useduplicatepage() -- Settle for this page of duplicates?
 *
 * Prior to PostgreSQL 12/Btree version 4, heap TID was never treated
 * as a part of the keyspace. If there were many tuples of the same
 * value spanning more than one leaf page, a new tuple of that same
 * value could legally be placed on any one of the pages.
 *
 * This function handles the question of whether or not an insertion
 * of a duplicate into a pg_upgrade'd !heapkeyspace index should
 * insert on the page contained in buf when a choice must be made.
 * Preemptive microvacuuming is performed here when that could allow
 * caller to insert on to the page in buf.
 *
 * Returns true if caller should proceed with insert on buf's page.
 * Otherwise, caller should move on to the page to the right (caller
 * must always be able to still move right following call here).
 */

So, this function is only used for legacy pg_upgraded indexes. The comment implies that, but doesn't actually say it.

/*
 * Get tie-breaker heap TID attribute, if any. Macro works with both pivot
 * and non-pivot tuples, despite differences in how heap TID is represented.
 *
 * Assumes that any tuple without INDEX_ALT_TID_MASK set has a t_tid that
 * points to the heap, and that all pivot tuples have INDEX_ALT_TID_MASK set
 * (since all pivot tuples must as of BTREE_VERSION 4). When non-pivot
 * tuples use the INDEX_ALT_TID_MASK representation in the future, they'll
 * probably also contain a heap TID at the end of the tuple. We currently
 * assume that a tuple with INDEX_ALT_TID_MASK set is a pivot tuple within
 * heapkeyspace indexes (and that a tuple without it set must be a non-pivot
 * tuple), but it might also be used by non-pivot tuples in the future.
 * pg_upgrade'd !heapkeyspace indexes only set INDEX_ALT_TID_MASK in pivot
 * tuples that actually originated with the truncation of one or more
 * attributes.
 */
#define BTreeTupleGetHeapTID(itup) ...

The comment claims that "all pivot tuples must be as of BTREE_VERSION 4". I thought that all internal tuples are called pivot tuples, even on version 3. I think what this means to say is that this macro is only used on BTREE_VERSION 4 indexes. Or perhaps that pivot tuples can only have a heap TID in BTREE_VERSION 4 indexes.

This macro (and many others in nbtree.h) is quite complicated. A static inline function might be easier to read.
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On Sun, Mar 3, 2019 at 10:02 PM Heikki Linnakangas wrote:
> Some comments on
> v13-0002-make-heap-TID-a-tie-breaker-nbtree-index-column.patch below.
> Mostly about code comments. In general, I think a round of copy-editing
> the comments, to use simpler language, would do good. The actual code
> changes look good to me.

I'm delighted that the code looks good to you, and makes sense overall. I worked hard to make the patch a natural adjunct to the existing code, which wasn't easy.

> Seems confusing to first say assertively that "*bufptr contains the page
> that the new tuple unambiguously belongs to", and then immediately go on
> to list a whole bunch of exceptions. Maybe just remove "unambiguously".

This is fixed in v14 of the patch series.

> This happens very seldom, because you only get an incomplete split if
> you crash in the middle of a page split, and that should be very rare. I
> don't think we need to list more fine-grained conditions here, that just
> confuses the reader.

Fixed in v14.

> > /*
> >  * _bt_useduplicatepage() -- Settle for this page of duplicates?
>
> So, this function is only used for legacy pg_upgraded indexes. The
> comment implies that, but doesn't actually say it.

I made that more explicit in v14.

> > /*
> >  * Get tie-breaker heap TID attribute, if any. Macro works with both pivot
> >  * and non-pivot tuples, despite differences in how heap TID is represented.
> >
> > #define BTreeTupleGetHeapTID(itup) ...

I fixed up the comments above BTreeTupleGetHeapTID() significantly.

> The comment claims that "all pivot tuples must be as of BTREE_VERSION
> 4". I thought that all internal tuples are called pivot tuples, even on
> version 3.

In my mind, "pivot tuple" is a term that describes any tuple that contains a separator key, which could apply to any nbtree version. It's useful to have a distinct term (to not just say "separator key tuple") because Lehman and Yao think of separator keys as separate and distinct from downlinks.
Internal page splits actually split *between* a separator key and a downlink. So nbtree internal page splits must split "inside a pivot tuple", leaving its separator on the left hand side (new high key), and its downlink on the right hand side (new minus infinity tuple). Pivot tuples may contain a separator key and a downlink, just a downlink, or just a separator key (sometimes this is implicit, and the block number is garbage).

I am particular about the terminology because the pivot tuple vs. downlink vs. separator key thing causes a lot of confusion, particularly when you're using Lehman and Yao (or Lanin and Shasha) to understand how things work in Postgres. We want to have a broad term that refers to the tuples that describe the keyspace (pivot tuples), since it's often helpful to refer to them collectively, without seeming to contradict Lehman and Yao.

> I think what this means to say is that this macro is only
> used on BTREE_VERSION 4 indexes. Or perhaps that pivot tuples can only
> have a heap TID in BTREE_VERSION 4 indexes.

My high level approach to pg_upgrade/versioning is for index scans to *pretend* that every nbtree index (even on v2 and v3) has a heap attribute that actually makes the keys unique. The difference is that v4 gets to use a scantid, and actually rely on the sort order of heap TIDs, whereas pg_upgrade'd indexes "are not allowed to look at the heap attribute", and must never provide a scantid (they also cannot use the !minusinfkey optimization, but this is only an optimization that v4 indexes don't truly need). They always do the right thing (move left) on otherwise-equal pivot tuples, since they have no scantid. That's why _bt_compare() can use BTreeTupleGetHeapTID() without caring about the version the index uses. It might be NULL for a pivot tuple in a v3 index, even though we imagine/pretend that it should have a value set.
But that doesn't matter, because higher level code knows that !heapkeyspace indexes should never get a scantid (_bt_compare() does Assert() that they got that detail right, though). We "have no reason to peek", because we don't have a scantid, so index scans work essentially the same way, regardless of the version in use. There are a few specific cross-version things that we need to think about outside of making sure that there is never a scantid (and !minusinfkey optimization is unused) in < v4 indexes, but these are all related to unique indexes.

"Pretending that all indexes have a heap TID" is a very useful mental model. Nothing really changes, even though you might guess that changing the classic "Subtree S is described by Ki < v <= Ki+1" invariant would need to break code in _bt_binsrch()/_bt_compare(). Just pretend that the classic invariant was there since the Berkeley days, and don't do anything that breaks the useful illusion on versions before v4.

> This macro (and many others in nbtree.h) is quite complicated. A static
> inline function might be easier to read.

I agree that the macros a
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
I'm looking at the first patch in the series now. I'd suggest that you commit that very soon. It's useful on its own, and seems pretty much ready to be committed already. I don't think it will be much affected by whatever changes we make to the later patches, anymore.

I did some copy-editing of the code comments, see attached patch which applies on top of v14-0001-Refactor-nbtree-insertion-scankeys.patch. Mostly, to use more Plain English: use active voice instead of passive, split long sentences, avoid difficult words.

I also had a few comments and questions on some details. I added them in the same patch, marked with "HEIKKI:". Please take a look.

- Heikki

diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 3680e69b89a..eb4df2ebbe6 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -609,6 +609,9 @@
 original search scankey is consulted as each index entry is sequentially
 scanned to decide whether to return the entry and whether the scan can
 stop (see _bt_checkkeys()).
 
+HEIKKI: The above probably needs some updating, now that we have a
+separate BTScanInsert struct to represent an insertion scan key.
+
 We use term "pivot" index tuples to distinguish tuples which don't point
 to heap tuples, but rather used for tree navigation. Pivot tuples includes
 all tuples on non-leaf pages and high keys on leaf pages. Note that pivot
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index b3fbba276dd..2a2d6576060 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -97,9 +97,12 @@ static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
 *		will allow duplicates. Otherwise (UNIQUE_CHECK_YES or
 *		UNIQUE_CHECK_EXISTING) it will throw error for a duplicate.
 *		For UNIQUE_CHECK_EXISTING we merely run the duplicate check, and
- *		don't actually insert. If rel is a unique index, then every call
- *		here is a checkingunique call (i.e. every call does a duplicate
- *		check, though perhaps only a tentative check).
+ *		don't actually insert.
+
+HEIKKI: 'checkingunique' is a local variable in the function. Seems a bit
+weird to talk about it in the function comment. I didn't understand what
+the point of adding this sentence was, so I removed it.
+
 *
 *		The result value is only significant for UNIQUE_CHECK_PARTIAL:
 *		it must be true if the entry is known unique, else false.
@@ -285,9 +288,10 @@ top:
 	CheckForSerializableConflictIn(rel, NULL, buf);
 
 	/*
-	 * Do the insertion. Note that itup_key contains mutable state used
-	 * by _bt_check_unique to help _bt_findinsertloc avoid repeating its
-	 * binary search. !checkingunique case must start own binary search.
+	 * Do the insertion. Note that itup_key contains state filled in by
+	 * _bt_check_unique to help _bt_findinsertloc avoid repeating its
+	 * binary search. !checkingunique case must start its own binary
+	 * search.
 	 */
 	newitemoff = _bt_findinsertloc(rel, itup_key, &buf, checkingunique,
 								   itup, stack, heapRel);
@@ -311,10 +315,6 @@ top:
 /*
  * _bt_check_unique() -- Check for violation of unique index constraint
  *
- * Sets state in itup_key sufficient for later _bt_findinsertloc() call to
- * reuse most of the work of our initial binary search to find conflicting
- * tuples.
- *
  * Returns InvalidTransactionId if there is no conflict, else an xact ID
  * we must wait for to see if it commits a conflicting tuple. If an actual
  * conflict is detected, no return --- just ereport(). If an xact ID is
@@ -326,6 +326,10 @@ top:
  * InvalidTransactionId because we don't want to wait. In this case we
  * set *is_unique to false if there is a potential conflict, and the
  * core code must redo the uniqueness check later.
+ *
+ * As a side-effect, sets state in itup_key that can later be used by
+ * _bt_findinsertloc() to reuse most of the binary search work we do
+ * here.
  */
 static TransactionId
 _bt_check_unique(Relation rel, BTScanInsert itup_key,
@@ -352,8 +356,8 @@ _bt_check_unique(Relation rel, BTScanInsert itup_key,
 	maxoff = PageGetMaxOffsetNumber(page);
 
 	/*
-	 * Save binary search bounds. Note that this is also used within
-	 * _bt_findinsertloc() later.
+	 * Save binary search bounds. We use them in the fastpath below, but
+	 * also in the _bt_findinsertloc() call later.
 	 */
 	itup_key->savebinsrch = true;
 	offset = _bt_binsrch(rel, itup_key, buf);
@@ -375,16 +379,16 @@ _bt_check_unique(Relation rel, BTScanInsert itup_key,
 	if (offset <= maxoff)
 	{
 		/*
-		 * Fastpath: _bt_binsrch() search bounds can be used to limit our
-		 * consideration to items that are definitely duplicates in most
-		 * cases (though not when original page is empty, or when initial
-		 * offset is past the end of the original page, which may indicate
-		 * that we'll have to examine a second or subsequent page).
+		 *
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On Tue, Mar 5, 2019 at 3:37 AM Heikki Linnakangas wrote:
> I'm looking at the first patch in the series now. I'd suggest that you
> commit that very soon. It's useful on its own, and seems pretty much
> ready to be committed already. I don't think it will be much affected by
> whatever changes we make to the later patches, anymore.

I agree that the parts covered by the first patch in the series are very unlikely to need changes, but I hesitate to commit it weeks ahead of the other patches.

Some of the things that make _bt_findinsertloc() fast are missing for v3 indexes. The "consider secondary factors during nbtree splits" patch actually more than compensates for that with v3 indexes, at least in some cases, but the first patch applied on its own will slightly regress performance. At least, I benchmarked the first patch on its own several months ago and noticed a small regression at the time, though I don't have the exact details at hand. It might have been an invalid result, because I wasn't particularly thorough at the time. We do make some gains in the first patch (the _bt_check_unique() thing), but we also check the high key more than we need to within _bt_findinsertloc() for non-unique indexes. Plus, the microvacuuming thing isn't as streamlined.

It's a lot of work to validate and revalidate the performance of a patch like this, and I'd rather commit the first three patches within a couple of days of each other (I can validate v3 indexes and v4 indexes separately, though). We can put off the other patches for longer, and treat them as independent. I guess I'd also push the final amcheck patch following the first three -- no point in holding back on that. Then we'd be left with "Add "split after new tuple" optimization", and "Add high key "continuescan" optimization" as independent improvements that can be pushed at the last minute of the final CF.

> I also had a few comments and questions on some details.
> I added them in
> the same patch, marked with "HEIKKI:". Please take a look.

Will respond now. Any point that I haven't responded to directly has been accepted.

> +HEIKKI: 'checkingunique' is a local variable in the function. Seems a bit
> +weird to talk about it in the function comment. I didn't understand what
> +the point of adding this sentence was, so I removed it.

Maybe there is no point in the comment you reference here, but I like the idea of "checkingunique", because that symbol name is a common thread between a number of functions that coordinate with each other. It's not just a local variable in one function.

> @@ -588,6 +592,17 @@ _bt_check_unique(Relation rel, BTScanInsert itup_key,
> 		if (P_RIGHTMOST(opaque))
> 			break;
> 		highkeycmp = _bt_compare(rel, itup_key, page, P_HIKEY);
> +
> +		/*
> +		 * HEIKKI: This assertion might fire if the user-defined opclass
> +		 * is broken. It's just an assertion, so maybe that's ok. With a
> +		 * broken opclass, it's obviously "garbage in, garbage out", but
> +		 * we should try to behave sanely anyway. I don't remember what
> +		 * our general policy on that is; should we assert, elog(ERROR),
> +		 * or continue silently in that case? An elog(ERROR) or
> +		 * elog(WARNING) would feel best to me, but I don't remember what
> +		 * we usually do.
> +		 */
> 		Assert(highkeycmp <= 0);
> 		if (highkeycmp != 0)
> 			break;

We don't really have a general policy on it. However, I don't have any sympathy for the idea of trying to soldier on with a corrupt index. I also don't think that it's worth making this a "can't happen" error. Like many of my assertions, this assertion is intended to document an invariant. I don't actually anticipate that it could ever really fail.

> +Should we mention explicitly that this binary-search reuse is only applicable
> +if unique checks were performed? It's kind of implied by the fact that it's
> +_bt_check_unique() that saves the state, but perhaps we should be more clear
> +about it.

I guess so.
> +What is a "garbage duplicate"? Same as a "dead duplicate"? Yes. > +The last sentence, about garbage duplicates, seems really vague. Why do we > +ever do any comparisons that are not strictly necessary? Perhaps it's best to > +just remove that last sentence. Okay -- will remove. > + > +HEIKKI: I don't buy the argument that microvacuuming has to happen here. You > +could easily imagine a separate function that does microvacuuming, and resets > +(or even updates) the binary-search cache in the insertion key. I agree this > +is a convenient place to do it, though. It wasn't supposed to be a water-tight argument. I'll just say that it's convenient. > +/* HEIKKI: > +Do we need 'checkunique' as an argument? If unique checks were not > +performed, the insertion key will simply not have saved state. > +*/ We need it in the next patch in the series, because i
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On Tue, Mar 5, 2019 at 3:03 PM Peter Geoghegan wrote:
> I agree that the parts covered by the first patch in the series are
> very unlikely to need changes, but I hesitate to commit it weeks ahead
> of the other patches.

I know I'm stating the obvious here, but we don't have many weeks left at this point. I have not reviewed any code, but I have been following this thread and I'd really like to see this work go into PostgreSQL 12, assuming it's in good enough shape. It sounds like really good stuff.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On Wed, Mar 6, 2019 at 1:37 PM Robert Haas wrote:
> I know I'm stating the obvious here, but we don't have many weeks left
> at this point. I have not reviewed any code, but I have been
> following this thread and I'd really like to see this work go into
> PostgreSQL 12, assuming it's in good enough shape. It sounds like
> really good stuff.

Thanks! Barring any objections, I plan to commit the first 3 patches (plus the amcheck "relocate" patch) within 7 - 10 days (that's almost everything). Heikki hasn't reviewed 'Add high key "continuescan" optimization' yet, and it seems like he should take a look at that before I proceed with it. But that seems like the least controversial enhancement within the entire patch series, so I'm not very worried about it.

I'm currently working on v15, which has comment-only revisions requested by Heikki. I expect to continue to work with him to make sure that he is happy with the presentation. I'll also need to revalidate the performance of the patch series following recent minor changes to the logic for choosing a split point. That can take days. This is why I don't want to commit the first patch without committing at least the first three all at once -- it increases the amount of performance validation work I'll have to do considerably. (I have to consider both v4 and v3 indexes already, which seems like enough work.)

Two of the later patches (one of which I plan to push as part of the first batch of commits) use heuristics to decide where to split the page. As a Postgres contributor, I have learned to avoid inventing heuristics, so this automatically makes me a bit uneasy. However, I don't feel so bad about it here, on reflection. The on-disk size of the TPC-C indexes is reduced by 35% with the 'Add "split after new tuple" optimization' patch (I think that the entire database is usually about 12% smaller). There simply isn't a fundamentally better way to get the same benefit, and I'm sure that nobody will argue that we should just accept the fact that the most influential database benchmark of all time has a big index bloat problem with Postgres. That would be crazy.

That said, it's not impossible that somebody will shout at me because my heuristics made their index bloated. I can't see how that could happen, but I am prepared. I can always adjust the heuristics when new information comes to light. I have fairly thorough test cases that should allow me to do this without regressing anything else. This is a risk that can be managed sensibly.

There is no gnawing ambiguity about the on-disk changes laid down in the second patch (nor the first patch), though. Making on-disk changes is always a bit scary, but making the keys unique is clearly a big improvement architecturally, as it brings nbtree closer to the Lehman & Yao design without breaking anything for v3 indexes (v3 indexes simply aren't allowed to use a heap TID in their scankey). Unique keys also allow amcheck to relocate every tuple in the index from the root page, using the same code path as regular index scans. We'll be relying on the uniqueness of keys within amcheck from the beginning, before anybody teaches nbtree to perform retail index tuple deletion.

--
Peter Geoghegan
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On 06/03/2019 04:03, Peter Geoghegan wrote:
> On Tue, Mar 5, 2019 at 3:37 AM Heikki Linnakangas wrote:
>> I'm looking at the first patch in the series now. I'd suggest that you
>> commit that very soon. It's useful on its own, and seems pretty much
>> ready to be committed already. I don't think it will be much affected by
>> whatever changes we make to the later patches, anymore.

After staring at the first patch for a bit longer, a few things started to bother me:

* The new struct is called BTScanInsert, but it's used for searches, too. It makes sense when you read the README, which explains the difference between "search scan keys" and "insertion scan keys", but now that we have a separate struct for this, perhaps we could call insertion scan keys by a less confusing name. I don't know what to suggest, though. "Positioning key"?

* We store the binary search bounds in BTScanInsertData, but they're only used during insertions.

* The binary search bounds are specific to a particular buffer. But that buffer is passed around separately from the bounds. It seems easy to have them go out of sync, so that you try to use the cached bounds for a different page. The savebinsrch and restorebinsrch mechanism is used to deal with that, but it is pretty complicated.

I came up with the attached (against master), which addresses the 2nd and 3rd points. I added a whole new BTInsertStateData struct, to hold the binary search bounds. BTScanInsert now only holds the 'scankeys' array, and the 'nextkey' flag. The new BTInsertStateData struct also holds the current buffer we're considering to insert to, and a 'bounds_valid' flag to indicate if the saved bounds are valid for the current buffer. That way, it's more straightforward to clear the 'bounds_valid' flag whenever we move right.

I made a copy of _bt_binsrch, _bt_binsrch_insert. It does the binary search like _bt_binsrch does, but the bounds caching is only done in _bt_binsrch_insert. Seems more clear to have separate functions for them now, even though there's some duplication.

>> +/* HEIKKI:
>> +Do we need 'checkunique' as an argument? If unique checks were not
>> +performed, the insertion key will simply not have saved state.
>> +*/
>
> We need it in the next patch in the series, because it's also useful
> for optimizing away the high key check with non-unique indexes. We know
> that _bt_moveright() was called at the leaf level, with scantid filled
> in, so there is no question of needing to move right within
> _bt_findinsertloc() (provided it's a heapkeyspace index).

Hmm. Perhaps it would be better to move the call to _bt_binsrch (or _bt_binsrch_insert with this patch) to outside _bt_findinsertloc. So that _bt_findinsertloc would only be responsible for finding the correct page to insert to. So the overall code, after patch #2, would be like:

	/*
	 * Do the insertion. First move right to find the correct page to
	 * insert to, if necessary. If we're inserting to a non-unique index,
	 * _bt_search() already did this when it checked if a move to the
	 * right was required for leaf page. Insertion scankey's scantid
	 * would have been filled out at the time. On a unique index, the
	 * current buffer is the first buffer containing duplicates, however,
	 * so we may need to move right to the correct location for this
	 * tuple.
	 */
	if (checkingunique || itup_key->heapkeyspace)
		_bt_findinsertpage(rel, &insertstate, stack, heapRel);

	newitemoff = _bt_binsrch_insert(rel, &insertstate);

	_bt_insertonpg(rel, insertstate.buf, InvalidBuffer, stack, itup,
				   newitemoff, false);

Does this make sense?

> Actually, we even need it in the first patch: we only restore a binary
> search because we know that there is something to restore, and must ask
> for it to be restored explicitly (anything else seems unsafe). Maybe we
> can't restore it because it's not a unique index, or maybe we can't
> restore it because we microvacuumed, or moved right to get free space.
> I don't think that it'll be helpful to make _bt_findinsertloc() pretend
> that it doesn't know exactly where the binary search bounds come from --
> it already knows plenty about unique indexes specifically, and about how
> it may have to invalidate the bounds. The whole way that it couples
> buffer locks is only useful for unique indexes, so it already knows
> *plenty* about unique indexes specifically.

The attached new version simplifies this, IMHO. The bounds and the current buffer go together in the same struct, so it's easier to keep track of whether the bounds are valid or not.

	- * starting a regular index scan some can be omitted. The array is used as a
	+ * starting a regular index scan, some can be omitted. The array is used as a
	  * flexible array member, though it's sized in a way that makes it possible to
	  * use stack allocations. See nbtree/README for full details.
	+
	+HEIKKI: I don't see anything in the README about stack allocations. What
	+exactly does the README reference refer to? No code seems to actually allocate
	+this in the
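The division of labor being discussed here can be sketched in miniature. The following is a self-contained toy model, not the patch's actual code: plain ints stand in for scan keys, index tuples, and buffers, and all `Toy*` names are invented for illustration. The point it demonstrates is the shape of the proposed split: the immutable insertion scan key in one struct, the mutable insertion state (current buffer, cached binary search bounds, `bounds_valid` flag) in another, with the flag cleared whenever the buffer changes so stale bounds can never be applied to the wrong page.

```c
#include <assert.h>
#include <stdbool.h>

/* Immutable part: stands in for BTScanInsertData's scankeys array */
typedef struct ToyScanInsert
{
	int			scankey;
} ToyScanInsert;

/* Mutable part: stands in for the proposed BTInsertStateData */
typedef struct ToyInsertState
{
	ToyScanInsert *itup_key;	/* key for the tuple being inserted */
	int			buf;			/* "buffer" currently being considered */
	int			low;			/* cached binary search bounds; only */
	int			high;			/* meaningful when bounds_valid is set */
	bool		bounds_valid;
} ToyInsertState;

/*
 * Binary search of a sorted "page" for the insertion offset, reusing and
 * then re-caching the bounds, in the spirit of _bt_binsrch_insert().
 */
static int
toy_binsrch_insert(const int *page, int nitems, ToyInsertState *state)
{
	int			low = 0;
	int			high = nitems;

	if (state->bounds_valid)
	{
		/* an earlier search of the same buffer already narrowed this */
		low = state->low;
		high = state->high;
	}

	while (low < high)
	{
		int			mid = low + (high - low) / 2;

		if (page[mid] < state->itup_key->scankey)
			low = mid + 1;
		else
			high = mid;
	}

	state->low = low;
	state->high = high;
	state->bounds_valid = true;

	return low;					/* first offset whose key is >= scankey */
}

/* Moving to another page must invalidate the cached bounds */
static void
toy_step_right(ToyInsertState *state)
{
	state->buf++;
	state->bounds_valid = false;
}
```

Because the bounds and the buffer travel in one struct, "invalidate on move right" is a single assignment next to the buffer change, rather than a savebinsrch/restorebinsrch protocol spread across callers.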
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On Wed, Mar 6, 2019 at 10:15 PM Heikki Linnakangas wrote:
> After staring at the first patch for a bit longer, a few things started
> to bother me:
>
> * The new struct is called BTScanInsert, but it's used for searches,
> too. It makes sense when you read the README, which explains the
> difference between "search scan keys" and "insertion scan keys", but now
> that we have a separate struct for this, perhaps we could call insertion
> scan keys by a less confusing name. I don't know what to suggest,
> though. "Positioning key"?

I think that insertion scan key is fine. It's been called that for almost twenty years. It's not like it's an intuitive concept that could be conveyed easily if only we came up with a new, pithy name.

> * We store the binary search bounds in BTScanInsertData, but they're
> only used during insertions.
>
> * The binary search bounds are specific to a particular buffer. But
> that buffer is passed around separately from the bounds. It seems easy
> to have them go out of sync, so that you try to use the cached bounds
> for a different page. The savebinsrch and restorebinsrch mechanism is
> used to deal with that, but it is pretty complicated.

That might be an improvement, but I do think that using mutable state in the insertion scankey to restrict a search is an idea that could work well in at least one other way. That really isn't a once-off thing, even though it looks that way.

> I came up with the attached (against master), which addresses the 2nd
> and 3rd points. I added a whole new BTInsertStateData struct, to hold
> the binary search bounds. BTScanInsert now only holds the 'scankeys'
> array, and the 'nextkey' flag.

It will also have to store heapkeyspace, of course. And minusinfkey. BTW, I would like to hear what you think of the idea of minusinfkey (and the !minusinfkey optimization) specifically.

> The new BTInsertStateData struct also
> holds the current buffer we're considering to insert to, and a
> 'bounds_valid' flag to indicate if the saved bounds are valid for the
> current buffer. That way, it's more straightforward to clear the
> 'bounds_valid' flag whenever we move right.

I'm not sure that that's an improvement. Moving right should be very rare with my patch. gcov shows that we never move right here anymore with the regression tests, or within _bt_check_unique() -- not once. For a second, I thought that you forgot to invalidate the bounds_valid flag, because you didn't pass it directly, by value, to _bt_useduplicatepage().

> I made a copy of _bt_binsrch, _bt_binsrch_insert. It does the binary
> search like _bt_binsrch does, but the bounds caching is only done in
> _bt_binsrch_insert. Seems more clear to have separate functions for them
> now, even though there's some duplication.

I'll have to think about that some more. Having a separate _bt_binsrch_insert() may be worth it, but I'll need to do some profiling.

> Hmm. Perhaps it would be better to move the call to _bt_binsrch (or
> _bt_binsrch_insert with this patch) to outside _bt_findinsertloc. So
> that _bt_findinsertloc would only be responsible for finding the correct
> page to insert to.

Maybe, but as I said, it's not like _bt_findinsertloc() doesn't know all about unique indexes already. This is pointed out in a comment in _bt_doinsert(), even. I guess that it might have to be changed to say _bt_findinsertpage() instead, with your new approach.

> So the overall code, after patch #2, would be like:
>
> 	/*
> 	 * Do the insertion. First move right to find the correct page to
> 	 * insert to, if necessary. If we're inserting to a non-unique index,
> 	 * _bt_search() already did this when it checked if a move to the
> 	 * right was required for leaf page. Insertion scankey's scantid
> 	 * would have been filled out at the time. On a unique index, the
> 	 * current buffer is the first buffer containing duplicates, however,
> 	 * so we may need to move right to the correct location for this
> 	 * tuple.
> 	 */
> 	if (checkingunique || itup_key->heapkeyspace)
> 		_bt_findinsertpage(rel, &insertstate, stack, heapRel);
>
> 	newitemoff = _bt_binsrch_insert(rel, &insertstate);
>
> 	_bt_insertonpg(rel, insertstate.buf, InvalidBuffer, stack, itup,
> 				   newitemoff, false);
>
> Does this make sense?

I guess you're saying this because you noticed that the for (;;) loop in _bt_findinsertloc() doesn't do that much in many cases, because of the fastpath. I suppose that this could be an improvement, provided we keep all the assertions that verify that the work "_bt_findinsertpage()" would have done, had it been called, was in fact unnecessary (e.g., checking the high key/rightmost-ness).

> The attached new version simplifies this, IMHO. The bounds and the
> current buffer go together in the same struct, so it's easier to keep
> track of whether the bounds are valid or not.

I'll look into integrating this with my current draft v15 tomorrow. Need to sleep on it.

--
Peter Geoghegan
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On Wed, Mar 6, 2019 at 10:54 PM Peter Geoghegan wrote:
> It will also have to store heapkeyspace, of course. And minusinfkey.
> BTW, I would like to hear what you think of the idea of minusinfkey
> (and the !minusinfkey optimization) specifically.

> I'm not sure that that's an improvement. Moving right should be very
> rare with my patch. gcov shows that we never move right here anymore
> with the regression tests, or within _bt_check_unique() -- not once.
> For a second, I thought that you forgot to invalidate the bounds_valid
> flag, because you didn't pass it directly, by value to
> _bt_useduplicatepage().

BTW, the !minusinfkey optimization is why we literally never move right within _bt_findinsertloc() while the regression tests run. We always land on the correct leaf page to begin with. (It works with unique index insertions, where scantid is NULL when we descend the tree.)

In general, there are two good reasons for us to move right:

* There was a concurrent page split (or page deletion), and we just missed the downlink in the parent, and need to recover.

* We omit some columns from our scan key (at least scantid), and there are perhaps dozens of matches -- this is not relevant to _bt_doinsert() code.

The single value strategy used by nbtsplitloc.c does a good job of making it unlikely that _bt_check_unique()-wise duplicates will cross leaf pages, so there will almost always be one leaf page to visit. And the !minusinfkey optimization ensures that the only reason we'll move right is because of a concurrent page split, within _bt_moveright(). The buffer lock coupling move to the right that _bt_findinsertloc() does should be considered an edge case with all of these measures, at least with v4 indexes.

--
Peter Geoghegan
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On 07/03/2019 14:54, Peter Geoghegan wrote:
> On Wed, Mar 6, 2019 at 10:15 PM Heikki Linnakangas wrote:
>> After staring at the first patch for a bit longer, a few things started
>> to bother me:
>>
>> * The new struct is called BTScanInsert, but it's used for searches,
>> too. It makes sense when you read the README, which explains the
>> difference between "search scan keys" and "insertion scan keys", but now
>> that we have a separate struct for this, perhaps we could call insertion
>> scan keys by a less confusing name. I don't know what to suggest,
>> though. "Positioning key"?
>
> I think that insertion scan key is fine. It's been called that for
> almost twenty years. It's not like it's an intuitive concept that could
> be conveyed easily if only we came up with a new, pithy name.

Yeah. It's been like that forever, but I must confess I hadn't paid any attention to it, until now. I had not understood that text in the README explaining the difference between search and insertion scan keys, before looking at this patch. Not sure I ever read it with any thought. Now that I understand it, I don't like the "insertion scan key" name.

> BTW, I would like to hear what you think of the idea of minusinfkey
> (and the !minusinfkey optimization) specifically.

I don't understand it :-(. I guess that's valuable feedback on its own. I'll spend more time reading the code around that, but meanwhile, if you can think of a simpler way to explain it in the comments, that'd be good.

>> The new BTInsertStateData struct also holds the current buffer we're
>> considering to insert to, and a 'bounds_valid' flag to indicate if the
>> saved bounds are valid for the current buffer. That way, it's more
>> straightforward to clear the 'bounds_valid' flag whenever we move right.
>
> I'm not sure that that's an improvement. Moving right should be very
> rare with my patch. gcov shows that we never move right here anymore
> with the regression tests, or within _bt_check_unique() -- not once.

I haven't given performance much thought, really. I don't expect this method to be any slower, but the point of the refactoring is to make the code easier to understand.

>> 	/*
>> 	 * Do the insertion. First move right to find the correct page to
>> 	 * insert to, if necessary. If we're inserting to a non-unique index,
>> 	 * _bt_search() already did this when it checked if a move to the
>> 	 * right was required for leaf page. Insertion scankey's scantid
>> 	 * would have been filled out at the time. On a unique index, the
>> 	 * current buffer is the first buffer containing duplicates, however,
>> 	 * so we may need to move right to the correct location for this
>> 	 * tuple.
>> 	 */
>> 	if (checkingunique || itup_key->heapkeyspace)
>> 		_bt_findinsertpage(rel, &insertstate, stack, heapRel);
>>
>> 	newitemoff = _bt_binsrch_insert(rel, &insertstate);
>>
>> 	_bt_insertonpg(rel, insertstate.buf, InvalidBuffer, stack, itup,
>> 				   newitemoff, false);
>>
>> Does this make sense?
>
> I guess you're saying this because you noticed that the for (;;) loop
> in _bt_findinsertloc() doesn't do that much in many cases, because of
> the fastpath.

The idea is that _bt_findinsertpage() would not need to know whether the unique checks were performed or not. I'd like to encapsulate all the information about the "insert position we're considering" in the BTInsertStateData struct. Passing 'checkingunique' as a separate argument violates that, because when it's set, the key means something slightly different.

Hmm. Actually, with patch #2, _bt_findinsertloc() could look at whether 'scantid' is set, instead of 'checkingunique'. That would seem better. If it looks at 'checkingunique', it's making the assumption that if unique checks were not performed, then we are already positioned on the correct page, according to the heap TID. But looking at 'scantid' seems like a more direct way of getting the same information. And then we won't need to pass the 'checkingunique' flag as an "out-of-band" argument.

So I'm specifically suggesting that we replace this, in _bt_findinsertloc:

	if (!checkingunique && itup_key->heapkeyspace)
		break;

With this:

	if (itup_key->scantid)
		break;

And remove the 'checkingunique' argument from _bt_findinsertloc.

- Heikki
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On 07/03/2019 15:41, Heikki Linnakangas wrote:
> [...]
> So I'm specifically suggesting that we replace this, in
> _bt_findinsertloc:
>
> 	if (!checkingunique && itup_key->heapkeyspace)
> 		break;
>
> With this:
>
> 	if (itup_key->scantid)
> 		break;
>
> And remove the 'checkingunique' argument from _bt_findinsertloc.

Ah, scratch that. By the time we call _bt_findinsertloc(), scantid has already been restored, even if it was not set originally when we did _bt_search.

My dislike here is that passing 'checkingunique' as a separate argument acts like a "modifier", slightly changing the meaning of the insertion scan key. If it's not set, we know we're positioned on the correct page. Otherwise, we might not be. And it's a pretty indirect way of saying that, as it also depends on 'heapkeyspace'. Perhaps add a flag to BTInsertStateData, to indicate the same thing more explicitly. Something like "bool is_final_insertion_page; /* when set, no need to move right */".

- Heikki
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On 05/03/2019 05:16, Peter Geoghegan wrote:
> Attached is v14, which has changes based on your feedback.

As a quick check of the backwards-compatibility code, i.e. !heapkeyspace, I hacked _bt_initmetapage to force the version number to 3, and ran the regression tests. It failed an assertion in the 'create_index' test:

(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x7f2943f9a535 in __GI_abort () at abort.c:79
#2  0x5622c7d9d6b4 in ExceptionalCondition (conditionName=0x5622c7e4cbe8 "!(_bt_check_natts(rel, key->heapkeyspace, page, offnum))", errorType=0x5622c7e4c62a "FailedAssertion", fileName=0x5622c7e4c734 "nbtsearch.c", lineNumber=511) at assert.c:54
#3  0x5622c78627fb in _bt_compare (rel=0x5622c85afbe0, key=0x7ffd7a996db0, page=0x7f293d433780 "", offnum=2) at nbtsearch.c:511
#4  0x5622c7862640 in _bt_binsrch (rel=0x5622c85afbe0, key=0x7ffd7a996db0, buf=4622) at nbtsearch.c:432
#5  0x5622c7861ec9 in _bt_search (rel=0x5622c85afbe0, key=0x7ffd7a996db0, bufP=0x7ffd7a9976d4, access=1, snapshot=0x5622c8353740) at nbtsearch.c:142
#6  0x5622c7863a44 in _bt_first (scan=0x5622c841e828, dir=ForwardScanDirection) at nbtsearch.c:1183
#7  0x5622c785f8b0 in btgettuple (scan=0x5622c841e828, dir=ForwardScanDirection) at nbtree.c:245
#8  0x5622c78522e3 in index_getnext_tid (scan=0x5622c841e828, direction=ForwardScanDirection) at indexam.c:542
#9  0x5622c7a67784 in IndexOnlyNext (node=0x5622c83ad280) at nodeIndexonlyscan.c:120
#10 0x5622c7a438d5 in ExecScanFetch (node=0x5622c83ad280, accessMtd=0x5622c7a67254 , recheckMtd=0x5622c7a67bc9 ) at execScan.c:95
#11 0x5622c7a4394a in ExecScan (node=0x5622c83ad280, accessMtd=0x5622c7a67254 , recheckMtd=0x5622c7a67bc9 ) at execScan.c:145
#12 0x5622c7a67c73 in ExecIndexOnlyScan (pstate=0x5622c83ad280) at nodeIndexonlyscan.c:322
#13 0x5622c7a41814 in ExecProcNodeFirst (node=0x5622c83ad280) at execProcnode.c:445
#14 0x5622c7a501a5 in ExecProcNode (node=0x5622c83ad280) at ../../../src/include/executor/executor.h:231
#15 0x5622c7a50693 in fetch_input_tuple (aggstate=0x5622c83acdd0) at nodeAgg.c:406
#16 0x5622c7a529d9 in agg_retrieve_direct (aggstate=0x5622c83acdd0) at nodeAgg.c:1737
#17 0x5622c7a525a9 in ExecAgg (pstate=0x5622c83acdd0) at nodeAgg.c:1552
#18 0x5622c7a41814 in ExecProcNodeFirst (node=0x5622c83acdd0) at execProcnode.c:445
#19 0x5622c7a3621d in ExecProcNode (node=0x5622c83acdd0) at ../../../src/include/executor/executor.h:231
#20 0x5622c7a38bd9 in ExecutePlan (estate=0x5622c83acb78, planstate=0x5622c83acdd0, use_parallel_mode=false, operation=CMD_SELECT, sendTuples=true, numberTuples=0, direction=ForwardScanDirection, dest=0x5622c8462088, execute_once=true) at execMain.c:1645
#21 0x5622c7a36872 in standard_ExecutorRun (queryDesc=0x5622c83a9eb8, direction=ForwardScanDirection, count=0, execute_once=true) at execMain.c:363
#22 0x5622c7a36696 in ExecutorRun (queryDesc=0x5622c83a9eb8, direction=ForwardScanDirection, count=0, execute_once=true) at execMain.c:307
#23 0x5622c7c357dc in PortalRunSelect (portal=0x5622c8336778, forward=true, count=0, dest=0x5622c8462088) at pquery.c:929
#24 0x5622c7c3546f in PortalRun (portal=0x5622c8336778, count=9223372036854775807, isTopLevel=true, run_once=true, dest=0x5622c8462088, altdest=0x5622c8462088, completionTag=0x7ffd7a997d50 "") at pquery.c:770
#25 0x5622c7c2f029 in exec_simple_query (query_string=0x5622c82cf508 "SELECT count(*) FROM onek_with_null WHERE unique1 IS NULL AND unique2 IS NULL;") at postgres.c:1215
#26 0x5622c7c3369a in PostgresMain (argc=1, argv=0x5622c82faee0, dbname=0x5622c82fac50 "regression", username=0x5622c82c81e8 "heikki") at postgres.c:4256
#27 0x5622c7b8bcf2 in BackendRun (port=0x5622c82f3d80) at postmaster.c:4378
#28 0x5622c7b8b45b in BackendStartup (port=0x5622c82f3d80) at postmaster.c:4069
#29 0x5622c7b87633 in ServerLoop () at postmaster.c:1699
#30 0x5622c7b86e61 in PostmasterMain (argc=3, argv=0x5622c82c6160) at postmaster.c:1372
#31 0x5622c7aa9925 in main (argc=3, argv=0x5622c82c6160) at main.c:228

I haven't investigated any deeper, but apparently something's broken. This was with patch v14, without any further changes.

- Heikki
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On Thu, Mar 7, 2019 at 12:14 AM Heikki Linnakangas wrote:
> I haven't investigated any deeper, but apparently something's broken.
> This was with patch v14, without any further changes.

Try it with my patch -- attached. I think that you missed that the INCLUDE indexes thing within nbtsort.c needs to be changed back.

--
Peter Geoghegan

From cdfe29c5479da6198aa918f4c373cb8a1a1acfe1 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan
Date: Mon, 21 Jan 2019 15:35:37 -0800
Subject: [PATCH 08/12] DEBUG: Force version 3 artificially.

---
 contrib/amcheck/expected/check_btree.out     |  7 ++-
 contrib/pageinspect/expected/btree.out       |  2 +-
 contrib/pgstattuple/expected/pgstattuple.out | 10 +-
 src/backend/access/nbtree/nbtpage.c          |  2 +-
 src/backend/access/nbtree/nbtsort.c          | 16
 src/backend/postmaster/postmaster.c          |  1 +
 src/test/regress/expected/dependency.out     |  4 ++--
 src/test/regress/expected/event_trigger.out  |  4 ++--
 src/test/regress/expected/foreign_data.out   |  8
 src/test/regress/expected/rowsecurity.out    |  4 ++--
 10 files changed, 24 insertions(+), 34 deletions(-)

diff --git a/contrib/amcheck/expected/check_btree.out b/contrib/amcheck/expected/check_btree.out
index 687fde8fce..60bebb1c00 100644
--- a/contrib/amcheck/expected/check_btree.out
+++ b/contrib/amcheck/expected/check_btree.out
@@ -139,11 +139,8 @@ VACUUM delete_test_table;
 DELETE FROM delete_test_table WHERE a < 79990;
 VACUUM delete_test_table;
 SELECT bt_index_parent_check('delete_test_table_pkey', true, true);
- bt_index_parent_check
-
-(1 row)
-
+ERROR: index "delete_test_table_pkey" does not support relocating tuples
+HINT: Only indexes initialized on PostgreSQL 12 support relocation verification.
 --
 -- BUG #15597: must not assume consistent input toasting state when forming
 -- tuple. Bloom filter must fingerprint normalized index tuple representation.
diff --git a/contrib/pageinspect/expected/btree.out b/contrib/pageinspect/expected/btree.out
index 067e73f21a..7f003bf801 100644
--- a/contrib/pageinspect/expected/btree.out
+++ b/contrib/pageinspect/expected/btree.out
@@ -5,7 +5,7 @@ CREATE INDEX test1_a_idx ON test1 USING btree (a);
 SELECT * FROM bt_metap('test1_a_idx');
 -[ RECORD 1 ]---+---
 magic    | 340322
-version  | 4
+version  | 3
 root     | 1
 level    | 0
 fastroot | 1
diff --git a/contrib/pgstattuple/expected/pgstattuple.out b/contrib/pgstattuple/expected/pgstattuple.out
index 9920dbfd40..9858ea69d4 100644
--- a/contrib/pgstattuple/expected/pgstattuple.out
+++ b/contrib/pgstattuple/expected/pgstattuple.out
@@ -48,7 +48,7 @@ select version, tree_level,
 from pgstatindex('test_pkey');
  version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation
 ---------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+--------------------
- 4 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | NaN | NaN
+ 3 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | NaN | NaN
 (1 row)

 select version, tree_level,
@@ -58,7 +58,7 @@ select version, tree_level,
 from pgstatindex('test_pkey'::text);
  version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation
 ---------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+--------------------
- 4 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | NaN | NaN
+ 3 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | NaN | NaN
 (1 row)

 select version, tree_level,
@@ -68,7 +68,7 @@ select version, tree_level,
 from pgstatindex('test_pkey'::name);
  version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation
 ---------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+--------------------
- 4 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | NaN | NaN
+ 3 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | NaN | NaN
 (1 row)

 select version, tree_level,
@@ -78,7 +78,7 @@ select version, tr
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On Wed, Mar 6, 2019 at 11:41 PM Heikki Linnakangas wrote:
> > BTW, I would like to hear what you think of the idea of minusinfkey
> > (and the !minusinfkey optimization) specifically.
>
> I don't understand it :-(. I guess that's valuable feedback on its own.
> I'll spend more time reading the code around that, but meanwhile, if you
> can think of a simpler way to explain it in the comments, that'd be good.

Here is another way of explaining it: When I drew you that picture while we were in Lisbon, I mentioned to you that the patch sometimes used a sentinel scantid value that was greater than minus infinity, but less than any real scantid. This could be used to force an otherwise-equal-to-pivot search to go left rather than uselessly going right. I explained this about 30 minutes in, when I was drawing you a picture.

Well, that sentinel heap TID thing doesn't exist any more, because it was replaced by the !minusinfkey optimization, which is a *generalization* of the same idea, extending it to all columns (not just the heap TID column). That way, you never have to go to two pages just because you searched for a value that happened to be right at the edge of a leaf page.

Page deletion wants to assume that truncated attributes from the high key of the page being deleted have actual negative infinity values -- negative infinity is a value, just like any other, albeit one that can only appear in pivot tuples. This is simulated by VACUUM using "minusinfkey = true". We go left in the parent, not right, and land on the correct leaf page. Technically we don't compare the negative infinity values in the pivot to the negative infinity values in the scankey, but we return 0 just as if we had, and found them equal.

Similarly, v3 indexes specify "minusinfkey = true" in all cases, because they always want to go left -- just like in old Postgres versions.
They don't have negative infinity values (matches can be on either side of the all-equal pivot, so they must go left).

--
Peter Geoghegan
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On Thu, Mar 7, 2019 at 12:37 AM Peter Geoghegan wrote:
> When I drew you that picture while we were in Lisbon, I mentioned to
> you that the patch sometimes used a sentinel scantid value that was
> greater than minus infinity, but less than any real scantid. This
> could be used to force an otherwise-equal-to-pivot search to go left
> rather than uselessly going right. I explained this about 30 minutes
> in, when I was drawing you a picture.

I meant the opposite: it could be used to go right, instead of going left when descending the tree and unnecessarily moving right on the leaf level. As I said, moving right on the leaf level (rather than during the descent) should only happen when it's necessary, such as when there is a concurrent page split. It shouldn't happen reliably when searching for the same value, unless there really are matches across multiple leaf pages, and that's just what we have to do.

--
Peter Geoghegan
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On Wed, Mar 6, 2019 at 11:41 PM Heikki Linnakangas wrote:
> I don't understand it :-(. I guess that's valuable feedback on its own.
> I'll spend more time reading the code around that, but meanwhile, if you
> can think of a simpler way to explain it in the comments, that'd be good.

One more thing on this: If you force bitmap index scans (by disabling index-only scans and index scans with the "enable_" GUCs), then you get EXPLAIN (ANALYZE, BUFFERS) instrumentation for the index alone (and the heap, separately). No visibility map accesses, which obscure the same numbers for a similar index-only scan.

You can then observe that most searches for a single value will touch the bare minimum number of index pages. For example, if there are 3 levels in the index, you should access only 3 index pages total, unless there are literally hundreds of matches and we cannot avoid storing them on more than one leaf page. You'll see that the scan touches the minimum possible number of index pages, because of:

* Many duplicates strategy. (Not single value strategy, which I incorrectly mentioned in relation to this earlier.)

* The !minusinfkey optimization, which ensures that we go to the right of an otherwise-equal pivot tuple in an internal page, rather than left.

* The "continuescan" high key patch, which ensures that the scan doesn't go to the right from the first leaf page to try to find even more matches. The high key on the same leaf page will indicate that the scan is over, without actually visiting the sibling. (Again, I'm assuming that your search is for a single value.)

--
Peter Geoghegan
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On 08/03/2019 12:22, Peter Geoghegan wrote:
> I would like to work through these other items with you
> (_bt_binsrch_insert() and so on), but I think that it would be helpful
> if you made an effort to understand the minusinfkey stuff first. I
> spent a lot of time improving the explanation of that within
> _bt_compare(). It's important.

Ok, after thinking about it for a while, I think I understand the minus infinity stuff now. Let me try to explain it in my own words:

Imagine that you have an index with two key columns, A and B. The index has two leaf pages, with the following items:

+--------+    +--------+
| Page 1 |    | Page 2 |
|        |    |        |
| 1 1    |    | 2 1    |
| 1 2    |    | 2 2    |
| 1 3    |    | 2 3    |
| 1 4    |    | 2 4    |
| 1 5    |    | 2 5    |
+--------+    +--------+

The key space is neatly split on the first key column - probably thanks to the new magic in the page split code. Now, what do we have as the high key of page 1? Answer: "2 -inf". The "-inf" is not stored in the key itself, the second key column is just omitted, and the search code knows to treat it implicitly as a value that's lower than any real value. Hence, "minus infinity".

However, during page deletion, we need to perform a search to find the downlink pointing to a leaf page. We do that by using the leaf page's high key as the search key. But the search needs to treat it slightly differently in that case. Normally, searching with a single key value, "2", we would land on page 2, because any real value beginning with "2" would be on that page, but in the page deletion case, we want to find page 1. Setting the BTScanInsert.minusinfkey flag tells the search code to do that.

Question: Wouldn't it be more straightforward to use "1 +inf" as page 1's high key? I.e. treat any missing columns as positive infinity. That way, the search for page deletion wouldn't need to be treated differently. That's also how this used to work: all tuples on a page used to be <= high key, not strictly < high key.

And it would also make the rightmost page less of a special case: you could pretend the rightmost page has a pivot tuple with all columns truncated away, which means positive infinity. You have this comment in _bt_split which touches the subject:

/*
 * The "high key" for the new left page will be the first key that's going
 * to go into the new right page, or possibly a truncated version if this
 * is a leaf page split.  This might be either the existing data item at
 * position firstright, or the incoming tuple.
 *
 * The high key for the left page is formed using the first item on the
 * right page, which may seem to be contrary to Lehman & Yao's approach of
 * using the left page's last item as its new high key when splitting on
 * the leaf level.  It isn't, though: suffix truncation will leave the
 * left page's high key fully equal to the last item on the left page when
 * two tuples with equal key values (excluding heap TID) enclose the split
 * point.  It isn't actually necessary for a new leaf high key to be equal
 * to the last item on the left for the L&Y "subtree" invariant to hold.
 * It's sufficient to make sure that the new leaf high key is strictly
 * less than the first item on the right leaf page, and greater than or
 * equal to (not necessarily equal to) the last item on the left leaf
 * page.
 *
 * In other words, when suffix truncation isn't possible, L&Y's exact
 * approach to leaf splits is taken.  (Actually, even that is slightly
 * inaccurate.  A tuple with all the keys from firstright but the heap TID
 * from lastleft will be used as the new high key, since the last left
 * tuple could be physically larger despite being opclass-equal in respect
 * of all attributes prior to the heap TID attribute.)
 */

But it doesn't explain why it's done like that.

- Heikki
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On Fri, Mar 8, 2019 at 2:12 AM Heikki Linnakangas wrote:
> Now, what do we have as the high key of page 1? Answer: "2 -inf". The
> "-inf" is not stored in the key itself, the second key column is just
> omitted, and the search code knows to treat it implicitly as a value
> that's lower than any real value. Hence, "minus infinity".

Right.

> However, during page deletion, we need to perform a search to find the
> downlink pointing to a leaf page. We do that by using the leaf page's
> high key as the search key. But the search needs to treat it slightly
> differently in that case. Normally, searching with a single key value,
> "2", we would land on page 2, because any real value beginning with "2"
> would be on that page, but in the page deletion case, we want to find
> page 1. Setting the BTScanInsert.minusinfkey flag tells the search code
> to do that.

Right.

> Question: Wouldn't it be more straightforward to use "1 +inf" as page
> 1's high key? I.e. treat any missing columns as positive infinity.

That might also work, but it wouldn't be more straightforward on balance. This is because:

* We have always taken the new high key from the firstright item, and we also continue to do that on internal pages -- same as before. It would certainly complicate the nbtsplitloc.c code to have to deal with this new special case now (leaf and internal pages would have to have far different handling, not just slightly different handling).

* We have always had "-inf" values as the first item on an internal page, which is explicitly truncated to zero attributes as of Postgres v11. It seems ugly to me to make truncated attributes mean negative infinity in that context, but positive infinity in every other context.

* Another reason that I prefer "-inf" to "+inf" is that you can imagine an implementation that makes pivot tuples into normalized binary keys, that are truncated using generic/opclass-agnostic logic, and compared using strcmp().
If the scankey binary string is longer than the pivot tuple, then it's greater according to strcmp() -- that just works. And, you can truncate the original binary strings built using opclass infrastructure without having to understand where attributes begin and end (though this relies on encoding things like NULL-ness a certain way). If we define truncation to be "+inf" now, then none of this works.

All of that said, maybe it would be clearer if page deletion was not the special case that has to opt in to minusinfkey semantics. Perhaps it would make more sense for everyone else to opt out of minusinfkey semantics, and to get the !minusinfkey optimization as a result of that. I only did it the other way around because that meant that only nbtpage.c had to acknowledge the special case.

Even calling it minusinfkey is misleading in one way, because we're not so much searching for "-inf" values as we are searching for the first page that could have tuples for the untruncated attributes. But isn't that how this has always worked, given that we've had to deal with duplicate pivot tuples on the same level before now? As I said, we're not doing an extra thing when minusinfkey is true (during page deletion) -- it's the other way around. Saying that we're searching for minus infinity values for the truncated attributes is kind of a lie, although the search does behave that way.

> That way, the search for page deletion wouldn't need to be treated
> differently. That's also how this used to work: all tuples on a page
> used to be <= high key, not strictly < high key.

That isn't accurate -- it still works that way on the leaf level. The alternative that you've described is possible, I think, but the key space works just the same with either of our approaches. You've merely thought of an alternative way of generating new high keys that satisfy the same invariants as my own scheme.
Provided the new separator for the high key is >= last item on the left and < first item on the right, everything works. As you point out, the original Lehman and Yao rule for leaf pages (which Postgres kinda followed before) is that the high key is <= items on the leaf level. But this patch makes nbtree follow that rule fully and properly.

Maybe you noticed that amcheck tests < on internal pages, and only checks <= on leaf pages. Perhaps it led you to believe that I did things differently. Actually, this is classic Lehman and Yao. The keys in internal pages are all "separators" as far as Lehman and Yao are concerned, so the high key is less of a special case on internal pages. We check < on internal pages because all separators are supposed to be unique on a level. But, as I said, we do check <= on the leaf level.

Take a look at "Fig. 7 A B-Link Tree" in the Lehman and Yao paper if this is unclear. That shows that internal pages have unique keys -- we can therefore expect the high key to be < items in internal pages. It also shows that leaf pages copy the high key from the last item on the left page -- we can expect the high key to be <= items on the leaf level.
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On Fri, Mar 8, 2019 at 10:48 AM Peter Geoghegan wrote:
> > Question: Wouldn't it be more straightforward to use "1 +inf" as page
> > 1's high key? I.e. treat any missing columns as positive infinity.
>
> That might also work, but it wouldn't be more straightforward on
> balance. This is because:

I thought of another reason:

* The 'Add high key "continuescan" optimization' is effective because the high key of a leaf page tends to look relatively dissimilar to other items on the page. The optimization would almost never help if the high key was derived from the lastleft item at the time of a split -- that's no more informative than the lastleft item itself. As things stand with the patch, a high key usually has a value for its last untruncated attribute that can only appear on the page to the right, and never the current page. We'd quite like to be able to conclude that the page to the right can't be interesting there and then, without needing to visit it. Making new leaf high keys "as close as possible to items on the right, without actually touching them" makes the optimization quite likely to work out with the TPC-C indexes, when we search for orderline items for an order that is rightmost of a leaf page in the orderlines primary key.

And another reason:

* This makes it likely that any new items that would go between the original lastleft and firstright items end up on the right page when they're inserted after the lastleft/firstright split. This is generally a good thing, because we've optimized WAL-logging for new pages that go on the right. (You pointed this out to me in Lisbon, in fact.)

--
Peter Geoghegan
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On Fri, Mar 8, 2019 at 10:48 AM Peter Geoghegan wrote:
> All of that said, maybe it would be clearer if page deletion was not
> the special case that has to opt in to minusinfkey semantics. Perhaps
> it would make more sense for everyone else to opt out of minusinfkey
> semantics, and to get the !minusinfkey optimization as a result of
> that. I only did it the other way around because that meant that only
> nbtpage.c had to acknowledge the special case.

This seems like a good idea -- we should reframe the !minusinfkey optimization, without actually changing the behavior. Flip it around. The minusinfkey field within the insertion scankey struct would be called something like "descendrighttrunc" instead. Same idea, but with the definition inverted. Most _bt_search() callers (all of those outside of nbtpage.c and amcheck) would be required to opt in to that optimization to get it.

Under this arrangement, nbtpage.c/page deletion would not ask for the "descendrighttrunc" optimization, and would therefore continue to do what it has always done: find the first leaf page that its insertion scankey values could be on (we don't lie about searching for negative infinity, or having a negative infinity sentinel value in scan key).

The only difference for page deletion between v3 indexes and v4 indexes is that with v4 indexes we'll relocate the same leaf page reliably, since every separator key value is guaranteed to be unique on its level (including the leaf level/leaf high keys). This is just a detail, though, and not one that's even worth pointing out; we're not *relying* on that being true on v4 indexes anyway (we check that the block number is a match too, which is strictly necessary for v3 indexes and seems like a good idea for v4 indexes).
This is also good because it makes it clear that the unique index code within _bt_doinsert() (that temporarily sets scantid to NULL) benefits from the descendrighttrunc/!minusinfkey optimization -- it should be "honest" and ask for it explicitly. We can make _bt_doinsert() opt in to the optimization for unique indexes, but not for other indexes, where scantid is set from the start. The descendrighttrunc/!minusinfkey optimization cannot help when scantid is set from the start, because we'll always have an attribute value in the insertion scankey that breaks the tie for us instead. We'll always move right of a heap-TID-truncated separator key whose untruncated attributes are all equal to a prefix of our insertion scankey values.

(This _bt_doinsert() descendrighttrunc/!minusinfkey optimization for unique indexes matters more than you might think -- we do really badly with things like Zipfian distributions currently, and reducing the contention goes some way towards helping with that. Postgres Pro noticed this a couple of years back, and analyzed it in detail at that time. It's really nice that we very rarely have to move right within code like _bt_check_unique() and _bt_findsplitloc() with the patch.)

Does that make sense to you? Can you live with the name "descendrighttrunc", or do you have a better one?

Thanks
--
Peter Geoghegan
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On 08/03/2019 23:21, Peter Geoghegan wrote:
> This seems like a good idea -- we should reframe the !minusinfkey
> optimization, without actually changing the behavior. Flip it around.
> The minusinfkey field within the insertion scankey struct would be
> called something like "descendrighttrunc" instead. Same idea, but with
> the definition inverted.
>
> Does that make sense to you? Can you live with the name
> "descendrighttrunc", or do you have a better one?

"descendrighttrunc" doesn't make much sense to me, either. I don't understand it. Maybe a comment would make it clear, though.

I don't feel like this is an optimization. It's a natural consequence of what the high key means. I guess you can think of it as an optimization, in the same way that not fully scanning the whole index for every search is an optimization, but that's not how I think of it :-).

If we don't flip the meaning of the flag, then maybe calling it something like "searching_for_leaf_page" would make sense:

/*
 * When set, we're searching for the leaf page with the given high key,
 * rather than leaf tuples matching the search keys.
 *
 * Normally, when !searching_for_pivot_tuple, if a page's high key
 * has truncated columns, and the columns that are present are equal to
 * the search key, the search will not descend to that page.  For
 * example, if an index has two columns, and a page's high key is
 * ("foo", <omitted>), and the search key is also ("foo", <omitted>),
 * the search will not descend to that page, but to its right sibling.
 * The omitted column in the high key means that all tuples on the page
 * must be strictly < "foo", so we don't need to visit it.  However,
 * sometimes we perform a search to find the parent of a leaf page,
 * using the leaf page's high key as the search key.  In that case, when
 * we search for ("foo", <omitted>), we do want to land on that page,
 * not its right sibling.
 */
bool		searching_for_leaf_page;

As the patch stands, you're also setting minusinfkey when dealing with v3 indexes. I think it would be better to only set searching_for_leaf_page in nbtpage.c. In general, I think BTScanInsert should describe the search key, regardless of whether it's a V3 or V4 index. Properties of the index belong elsewhere. (We're violating that by storing the 'heapkeyspace' flag in BTScanInsert. That wart is probably OK, it is pretty convenient to have it there. But in principle...)

- Heikki
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On Sun, Mar 10, 2019 at 7:09 AM Heikki Linnakangas wrote:
> "descendrighttrunc" doesn't make much sense to me, either. I don't
> understand it. Maybe a comment would make it clear, though.

It's not an easily grasped concept. I don't think that any name will easily convey the idea to the reader, though. I'm happy to go with whatever name you prefer.

> I don't feel like this is an optimization. It's a natural consequence of
> what the high key means. I guess you can think of it as an optimization,
> in the same way that not fully scanning the whole index for every search
> is an optimization, but that's not how I think of it :-).

I would agree with this in a green field situation, where we don't have to consider the legacy of v3 indexes. But that's not the case here.

> If we don't flip the meaning of the flag, then maybe calling it
> something like "searching_for_leaf_page" would make sense:
>
> /*
>  * When set, we're searching for the leaf page with the given high key,
>  * rather than leaf tuples matching the search keys.
>  *
>  * Normally, when !searching_for_pivot_tuple, if a page's high key

I guess you meant to say "searching_for_pivot_tuple" both times (not "searching_for_leaf_page"). After all, we always search for a leaf page. :-)

I'm fine with "searching_for_pivot_tuple", I think. The underscores are not really stylistically consistent with other stuff in nbtree.h, but I can use something very similar to your suggestion that is consistent.

>  * has truncated columns, and the columns that are present are equal to
>  * the search key, the search will not descend to that page.  For
>  * example, if an index has two columns, and a page's high key is
>  * ("foo", <omitted>), and the search key is also ("foo", <omitted>),
>  * the search will not descend to that page, but its right sibling.  The
>  * omitted column in the high key means that all tuples on the page must
>  * be strictly < "foo", so we don't need to visit it.  However, sometimes
>  * we perform a search to find the parent of a leaf page, using the leaf
>  * page's high key as the search key.  In that case, when we search for
>  * ("foo", <omitted>), we do want to land on that page, not its right
>  * sibling.
>  */
> bool		searching_for_leaf_page;

That works for me (assuming you meant searching_for_pivot_tuple).

> As the patch stands, you're also setting minusinfkey when dealing with
> v3 indexes. I think it would be better to only set
> searching_for_leaf_page in nbtpage.c.

That would mean I would have to check both heapkeyspace and minusinfkey within _bt_compare(). I would rather just keep the assertion that makes sure that !heapkeyspace callers are also minusinfkey callers, and the comments that explain why that is. It might even matter to performance -- having an extra condition in _bt_compare() is something we should avoid.

> In general, I think BTScanInsert
> should describe the search key, regardless of whether it's a V3 or V4
> index. Properties of the index belong elsewhere. (We're violating that
> by storing the 'heapkeyspace' flag in BTScanInsert. That wart is
> probably OK, it is pretty convenient to have it there. But in principle...)

The idea with pg_upgrade'd v3 indexes is, as I said a while back, that they too have a heap TID attribute. nbtsearch.c code is not allowed to rely on its value, though, and must use minusinfkey/searching_for_pivot_tuple semantics (relying on its value being minus infinity is still relying on its value being something).

Now, it's also true that there are a number of things that we have to do within nbtinsert.c for !heapkeyspace that don't really concern the key space as such. Even still, thinking about everything with reference to the keyspace, and keeping that as similar as possible between v3 and v4 is a good thing. It is up to high level code (such as _bt_first()) to not allow a !heapkeyspace index scan to do something that won't work for it.
It is not up to low level code like _bt_compare() to worry about these differences (beyond asserting that caller got it right).

If page deletion didn't need minusinfkey semantics (if nobody but v3 indexes needed that), then I would just make the "move right of separator" !minusinfkey code within _bt_compare() test heapkeyspace. But we do have a general need for minusinfkey semantics, so it seems simpler and more future-proof to keep heapkeyspace out of low-level nbtsearch.c code.

--
Peter Geoghegan
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On 10/03/2019 20:53, Peter Geoghegan wrote:
> On Sun, Mar 10, 2019 at 7:09 AM Heikki Linnakangas wrote:
> > If we don't flip the meaning of the flag, then maybe calling it
> > something like "searching_for_leaf_page" would make sense:
> >
> > /*
> >  * When set, we're searching for the leaf page with the given high key,
> >  * rather than leaf tuples matching the search keys.
> >  *
> >  * Normally, when !searching_for_pivot_tuple, if a page's high key
>
> I guess you meant to say "searching_for_pivot_tuple" both times (not
> "searching_for_leaf_page"). After all, we always search for a leaf
> page. :-)

Ah, yeah. Not sure. I wrote it as "searching_for_pivot_tuple" first, but changed to "searching_for_leaf_page" at the last minute. My thinking was that in the page-deletion case, you're trying to re-locate a particular leaf page. Otherwise, you're searching for matching tuples, not a particular page. Although during insertion, I guess you are also searching for a particular page, the page to insert to.

> > As the patch stands, you're also setting minusinfkey when dealing with
> > v3 indexes. I think it would be better to only set
> > searching_for_leaf_page in nbtpage.c.
>
> That would mean I would have to check both heapkeyspace and minusinfkey
> within _bt_compare().

Yeah.

> I would rather just keep the assertion that makes sure that
> !heapkeyspace callers are also minusinfkey callers, and the comments
> that explain why that is. It might even matter to performance -- having
> an extra condition in _bt_compare() is something we should avoid.

It's a hot codepath, but I doubt it's *that* hot that it matters, performance-wise...

> > In general, I think BTScanInsert should describe the search key,
> > regardless of whether it's a V3 or V4 index. Properties of the index
> > belong elsewhere. (We're violating that by storing the 'heapkeyspace'
> > flag in BTScanInsert. That wart is probably OK, it is pretty
> > convenient to have it there. But in principle...)
>
> The idea with pg_upgrade'd v3 indexes is, as I said a while back, that
> they too have a heap TID attribute. nbtsearch.c code is not allowed to
> rely on its value, though, and must use
> minusinfkey/searching_for_pivot_tuple semantics (relying on its value
> being minus infinity is still relying on its value being something).

Yeah. I find that's a complicated way to think about it. My mental model is that v4 indexes store heap TIDs, and every tuple is unique thanks to that. But on v3, we don't store heap TIDs, and duplicates are possible.

- Heikki
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On Sun, Mar 10, 2019 at 12:53 PM Heikki Linnakangas wrote:
> Ah, yeah. Not sure. I wrote it as "searching_for_pivot_tuple" first, but
> changed to "searching_for_leaf_page" at the last minute. My thinking was
> that in the page-deletion case, you're trying to re-locate a particular
> leaf page. Otherwise, you're searching for matching tuples, not a
> particular page. Although during insertion, I guess you are also
> searching for a particular page, the page to insert to.

I prefer something like "searching_for_pivot_tuple", because it's unambiguous. Okay with that?

> It's a hot codepath, but I doubt it's *that* hot that it matters,
> performance-wise...

I'll figure that out. Although I am currently looking into a regression in workloads that fit in shared_buffers, that my micro-benchmarks didn't catch initially. Indexes are still much smaller, but we get a ~2% regression all the same. OTOH, we get a 7.5%+ increase in throughput when the workload is I/O bound, and latency is generally no worse and even better with any workload. I suspect that the nice top-down approach to nbtsplitloc.c has its costs... will let you know more when I know more.

> > The idea with pg_upgrade'd v3 indexes is, as I said a while back, that
> > they too have a heap TID attribute. nbtsearch.c code is not allowed to
> > rely on its value, though, and must use
> > minusinfkey/searching_for_pivot_tuple semantics (relying on its value
> > being minus infinity is still relying on its value being something).
>
> Yeah. I find that's a complicated way to think about it. My mental model
> is that v4 indexes store heap TIDs, and every tuple is unique thanks to
> that. But on v3, we don't store heap TIDs, and duplicates are possible.

I'll try it that way, then.

--
Peter Geoghegan
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On 12/03/2019 04:47, Peter Geoghegan wrote:
> In conclusion: I think that this regression is a cost worth accepting. The regression in throughput is relatively small (2% - 3%), and the NEW_ORDER transaction seems to be the only problem (NEW_ORDER happens to be used for 45% of all transactions with TPC-C, and inserts into the largest index by far, without reading much). The patch overtakes master after a few hours anyway -- the patch will still win after about 6 hours, once the database gets big enough, despite all the contention. As I said, I think that we see a regression *because* the indexes are much smaller, not in spite of the fact that they're smaller. The TPC-C/BenchmarkSQL indexes never fail to be about 40% smaller than they are on master, no matter the details, even after many hours.
Yeah, that's fine. I'm curious, though, could you bloat the indexes back to the old size by setting the fillfactor? - Heikki
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On Mon, Mar 11, 2019 at 11:30 PM Heikki Linnakangas wrote: > Yeah, that's fine. I'm curious, though, could you bloat the indexes back > to the old size by setting the fillfactor? I think that that might work, though it's hard to say for sure offhand. The "split after new item" optimization is supposed to be a variation of rightmost splits, of course. We apply fillfactor in the same way much of the time. You would still literally split immediately after the new item some of the time, though, which makes it unclear how much bloat there would be without testing it. Some indexes mostly apply fillfactor in non-rightmost pages, while other indexes mostly split at the exact point past the new item, depending on details like the size of the groupings. I am currently doing a multi-day 6,000 warehouse benchmark, since I want to be sure that the bloat resistance will hold up over days. I think that it will, because there aren't that many updates, and they're almost all HOT-safe. I'll put the idea of a 50/50 fillfactor benchmark with the high-contention/regressed workload on my TODO list, though. -- Peter Geoghegan
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On Mon, Mar 11, 2019 at 10:47 PM Peter Geoghegan wrote: > > On Sun, Mar 10, 2019 at 5:17 PM Peter Geoghegan wrote: > > The regression that I mentioned earlier isn't in pgbench type > > workloads (even when the distribution is something more interesting > > that the uniform distribution default). It is only in workloads with > > lots of page splits and lots of index churn, where we get most of the > > benefit of the patch, but also where the costs are most apparent. > > Hopefully it can be fixed, but if not I'm inclined to think that it's > > a price worth paying. This certainly still needs further analysis and > > discussion, though. This revision of the patch does not attempt to > > address that problem in any way. > > I believe that I've figured out what's going on here. > > At first, I thought that this regression was due to the cycles that > have been added to page splits, but that doesn't seem to be the case > at all. Nothing that I did to make page splits faster helped (e.g. > temporarily go back to doing them "bottom up" made no difference). CPU > utilization was consistently slightly *higher* with the master branch > (patch spent slightly more CPU time idle). I now believe that the > problem is with LWLock/buffer lock contention on index pages, and that > that's an inherent cost with a minority of write-heavy high contention > workloads. A cost that we should just accept. If I wanted to try to say this in fewer words, would it be fair to say that reducing the size of an index by 40% without changing anything else can increase contention on the remaining pages? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On Tue, Mar 12, 2019 at 11:32 AM Robert Haas wrote: > If I wanted to try to say this in fewer words, would it be fair to say > that reducing the size of an index by 40% without changing anything > else can increase contention on the remaining pages? Yes. -- Peter Geoghegan
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On Tue, Mar 12, 2019 at 2:34 PM Peter Geoghegan wrote: > On Tue, Mar 12, 2019 at 11:32 AM Robert Haas wrote: > > If I wanted to try to say this in fewer words, would it be fair to say > > that reducing the size of an index by 40% without changing anything > > else can increase contention on the remaining pages? > > Yes. Hey, I understood something today! I think it's pretty clear that we have to view that as acceptable. I mean, we could reduce contention even further by finding a way to make indexes 40% larger, but I think it's clear that nobody wants that. Now, maybe in the future we'll want to work on other techniques for reducing contention, but I don't think we should make that the problem of this patch, especially because the regressions are small and go away after a few hours of heavy use. We should optimize for the case where the user intends to keep the database around for years, not hours. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On Tue, Mar 12, 2019 at 11:40 AM Robert Haas wrote: > Hey, I understood something today! And I said something that could be understood! > I think it's pretty clear that we have to view that as acceptable. I > mean, we could reduce contention even further by finding a way to make > indexes 40% larger, but I think it's clear that nobody wants that. > Now, maybe in the future we'll want to work on other techniques for > reducing contention, but I don't think we should make that the problem > of this patch, especially because the regressions are small and go > away after a few hours of heavy use. We should optimize for the case > where the user intends to keep the database around for years, not > hours. I think so too. There is a feature in other database systems called "reverse key indexes", which deals with this problem in a rather extreme way. This situation is very similar to the situation with rightmost page splits, where fillfactor is applied to pack leaf pages full. The only difference is that there are multiple groupings, not just one single implicit grouping (everything in the index). You could probably make very similar observations about rightmost page splits that apply leaf fillfactor. 
Here is an example of how the largest index looks for master with the 100 warehouse case that was slightly regressed:

    table_name    |      index_name       | page_type |  npages   | avg_live_items | avg_dead_items | avg_item_size
------------------+-----------------------+-----------+-----------+----------------+----------------+---------------
 bmsql_order_line | bmsql_order_line_pkey | R         |         1 |         54.000 |          0.000 |        23.000
 bmsql_order_line | bmsql_order_line_pkey | I         |    11,482 |        143.200 |          0.000 |        23.000
 bmsql_order_line | bmsql_order_line_pkey | L         | 1,621,316 |        139.458 |          0.003 |        24.000

Here is what we see with the patch:

    table_name    |      index_name       | page_type |  npages   | avg_live_items | avg_dead_items | avg_item_size
------------------+-----------------------+-----------+-----------+----------------+----------------+---------------
 bmsql_order_line | bmsql_order_line_pkey | R         |         1 |         29.000 |          0.000 |        22.000
 bmsql_order_line | bmsql_order_line_pkey | I         |     5,957 |        159.149 |          0.000 |        23.000
 bmsql_order_line | bmsql_order_line_pkey | L         |   936,170 |        233.496 |          0.052 |        23.999

REINDEX would leave bmsql_order_line_pkey with 262 items, and we see here that it has 233 after several hours, which is pretty good given the amount of contention. The index actually looks very much like it was just REINDEXed when initial bulk loading finishes, before we get any updates, even though that happens using retail insertions. Notice that the number of internal pages is reduced by almost a full 50% -- it's somewhat better than the reduction in the number of leaf pages, because the benefits compound (items in the root are even a bit smaller, because of this compounding effect, despite alignment effects). Internal pages are the most important pages to have cached, but also potentially the biggest points of contention. -- Peter Geoghegan
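To put numbers on that compounding effect, here is a quick back-of-the-envelope check (a Python sketch using only the page counts from the output above; nothing here comes from the patch itself):

```python
# Page counts per level for bmsql_order_line_pkey, taken from the
# pgstatindex-style output above (master vs. patch, after several hours
# of TPC-C).
master = {"root": 1, "internal": 11_482, "leaf": 1_621_316}
patch = {"root": 1, "internal": 5_957, "leaf": 936_170}

def reduction(level):
    """Fractional reduction in page count at a given B-Tree level."""
    return 1 - patch[level] / master[level]

print(f"leaf pages:     {reduction('leaf'):.1%} smaller")      # ~42.3%
print(f"internal pages: {reduction('internal'):.1%} smaller")  # ~48.1%

# The internal level shrinks by more than the leaf level: each leaf page
# holds more tuples, so fewer downlinks are needed, and the savings
# compound as you move up the tree.
assert reduction("internal") > reduction("leaf")
```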
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
Hi, On 2019-03-11 19:47:29 -0700, Peter Geoghegan wrote: > I now believe that the problem is with LWLock/buffer lock contention > on index pages, and that that's an inherent cost with a minority of > write-heavy high contention workloads. A cost that we should just > accept. Have you looked at an offwake or lwlock wait graph (bcc tools) or something in that vein? Would be interesting to see what is waiting for what most often... Greetings, Andres Freund
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On Tue, Mar 12, 2019 at 12:40 PM Andres Freund wrote: > Have you looked at an offwake or lwlock wait graph (bcc tools) or > something in that vein? Would be interesting to see what is waiting for > what most often... Not recently, though I did use your BCC script for this very purpose quite a few months ago. I don't remember it helping that much at the time, but then that was with a version of the patch that lacked a couple of important optimizations that we have now. We're now very careful to not descend to the left with an equal pivot tuple. We descend right instead when that's definitely the only place we'll find matches (a high key doesn't count as a match in almost all cases!). Edge-cases where we unnecessarily move left then right, or unnecessarily move right a second time once on the leaf level have been fixed. I fixed the regression I was worried about at the time, without getting much benefit from the BCC script, and moved on. This kind of minutiae is more important than it sounds. I have used EXPLAIN(ANALYZE, BUFFERS) instrumentation to make sure that I understand where every single block access comes from with these edge-cases, paying close attention to the structure of the index, and how the key space is broken up (the values of pivot tuples in internal pages). It is one thing to make the index smaller, and another thing to take full advantage of that -- I have both. This is one of the reasons why I believe that this minor regression cannot be avoided, short of simply allowing the index to get bloated: I'm simply not doing things that differently outside of the page split code, and what I am doing differently is clearly superior. Both in general, and for the NEW_ORDER transaction in particular. I'll make that another TODO item -- this regression will be revisited using BCC instrumentation. I am currently performing a multi-day benchmark on a very large TPC-C/BenchmarkSQL database, and it will have to wait for that. 
(I would like to use the same environment as before.) -- Peter Geoghegan
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On 2019-03-12 14:15:06 -0700, Peter Geoghegan wrote: > On Tue, Mar 12, 2019 at 12:40 PM Andres Freund wrote: > > Have you looked at an offwake or lwlock wait graph (bcc tools) or > > something in that vein? Would be interesting to see what is waiting for > > what most often... > > Not recently, though I did use your BCC script for this very purpose > quite a few months ago. I don't remember it helping that much at the > time, but then that was with a version of the patch that lacked a > couple of important optimizations that we have now. We're now very > careful to not descend to the left with an equal pivot tuple. We > descend right instead when that's definitely the only place we'll find > matches (a high key doesn't count as a match in almost all cases!). > Edge-cases where we unnecessarily move left then right, or > unnecessarily move right a second time once on the leaf level have > been fixed. I fixed the regression I was worried about at the time, > without getting much benefit from the BCC script, and moved on. > > This kind of minutiae is more important than it sounds. I have used > EXPLAIN(ANALYZE, BUFFERS) instrumentation to make sure that I > understand where every single block access comes from with these > edge-cases, paying close attention to the structure of the index, and > how the key space is broken up (the values of pivot tuples in internal > pages). It is one thing to make the index smaller, and another thing > to take full advantage of that -- I have both. This is one of the > reasons why I believe that this minor regression cannot be avoided, > short of simply allowing the index to get bloated: I'm simply not > doing things that differently outside of the page split code, and what > I am doing differently is clearly superior. Both in general, and for > the NEW_ORDER transaction in particular. > > I'll make that another TODO item -- this regression will be revisited > using BCC instrumentation. 
> I am currently performing a multi-day benchmark on a very large TPC-C/BenchmarkSQL database, and it will have to wait for that. (I would like to use the same environment as before.)
I'm basically just curious which buffers have most of the additional contention. Is it the lower number of leaf pages, the inner pages, or (somewhat inexplicably) the meta page, or ...? I was thinking that the callstack that e.g. my lwlock tool gives should be able to explain what callstack most of the waits are occurring on. (I should work a bit on that script, I locally had a version that showed both waiters and the waking up callstack, but I don't find it anymore) Greetings, Andres Freund
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On Tue, Mar 12, 2019 at 2:22 PM Andres Freund wrote: > I'm basically just curious which buffers have most of the additional > contention. Is it the lower number of leaf pages, the inner pages, or > (somewhat inexplicably) the meta page, or ...? I was thinking that the > callstack that e.g. my lwlock tool gives should be able to explain what > callstack most of the waits are occurring on. Right -- that's exactly what I'm interested in, too. If we can characterize the contention in terms of the types of nbtree blocks that are involved (their level), that could be really helpful. There are 200x+ more leaf blocks than internal blocks, so the internal blocks are a lot hotter. But there are also far fewer splits of internal pages, because you need hundreds of leaf page splits to get one internal split. Is the problem contention caused by internal page splits, or is it contention in internal pages that can be traced back to leaf splits, which insert a downlink into their parent page? It would be really cool to have some idea of the answers to questions like these. -- Peter Geoghegan
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On Wed, Mar 6, 2019 at 10:15 PM Heikki Linnakangas wrote: > I made a copy of the _bt_binsrch, _bt_binsrch_insert. It does the binary > search like _bt_binsrch does, but the bounds caching is only done in > _bt_binsrch_insert. Seems more clear to have separate functions for them > now, even though there's some duplication. > /* > * Do the insertion. First move right to find the correct page to > * insert to, if necessary. If we're inserting to a non-unique index, > * _bt_search() already did this when it checked if a move to the > * right was required for leaf page. Insertion scankey's scantid > * would have been filled out at the time. On a unique index, the > * current buffer is the first buffer containing duplicates, however, > * so we may need to move right to the correct location for this > * tuple. > */ > if (checkingunique || itup_key->heapkeyspace) > _bt_findinsertpage(rel, &insertstate, stack, heapRel); > > newitemoff = _bt_binsrch_insert(rel, &insertstate); > The attached new version simplifies this, IMHO. The bounds and the > current buffer go together in the same struct, so it's easier to keep > track whether the bounds are valid or not. Now that you have a full understanding of how the negative infinity sentinel values work, and how page deletion's leaf page search and !heapkeyspace indexes need to be considered, I think that we should come back to this _bt_binsrch()/_bt_findsplitloc() stuff. My sense is that you now have a full understanding of all the subtleties of the patch, including those that affect unique index insertion. That will make it much easier to talk about these unresolved questions. My current sense is that it isn't useful to store the current buffer alongside the binary search bounds/hint. It'll hardly ever need to be invalidated, because we'll hardly ever have to move right within _bt_findsplitloc() when doing unique index insertion (as I said before, the regression tests *never* have to do this according to gcov). 
We're talking about a very specific set of conditions here, so I'd like something that's lightweight and specialized. I agree that the savebinsrch/restorebinsrch fields are a bit ugly, though. I can't think of anything that's better offhand. Perhaps you can suggest something that is both lightweight, and an improvement on savebinsrch/restorebinsrch. I'm of the opinion that having a separate _bt_binsrch_insert() does not make anything clearer. Actually, I think that saving the bounds within the original _bt_binsrch() makes the design of that function clearer, not less clear. It's all quite confusing at the moment, given the rightmost/!leaf/page empty special cases. Seeing how the bounds are reused (or not reused) outside of _bt_binsrch() helps with that. The first 3 patches seem commitable now, but I think that it's important to be sure that I've addressed everything you raised satisfactorily before pushing. Or that everything works in a way that you can live with, at least. It would be great if you could take a look at the 'Add high key "continuescan" optimization' patch, which is the only one you haven't commented on so far (excluding the amcheck "relocate" patch, which is less important). I can put that one off for a while after the first 3 go in. I will also put off the "split after new item" commit for at least a week or two. I'm sure that the idea behind the "continuescan" patch will now seem pretty obvious to you -- it's just taking advantage of the high key when an index scan on the leaf level (which uses a search style scankey, not an insertion style scankey) looks like it may have to move to the next leaf page, but we'd like to avoid it where possible. Checking the high key there is now much more likely to result in the index scan not going to the next page, since we're more careful when considering a leaf split point these days. The high key often looks like the items on the page to the right, not the items on the same page. Thanks -- Peter Geoghegan
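For readers following along, the mechanism being debated can be reduced to a toy (all names here are illustrative, not the actual nbtinsert.c code): the binary search performed during the unique check establishes bounds on the new tuple's position on the page, and the later insertion-point search can resume within those bounds instead of starting over -- unless we had to move right to another page, which invalidates them.

```python
def binsrch(page_items, key, cached_bounds=None):
    """Find the first offset whose item is >= key, optionally resuming
    from bounds established by an earlier search of the same page."""
    lo, hi = cached_bounds if cached_bounds is not None else (0, len(page_items))
    while lo < hi:
        mid = (lo + hi) // 2
        if page_items[mid] < key:
            lo = mid + 1
        else:
            hi = mid
    return lo

page = [10, 20, 20, 20, 30, 40]

# Uniqueness check: one full binary search, remembering where it landed.
low = binsrch(page, 20)
cached = (low, len(page))   # duplicates, if any, start at 'low'

# ... duplicate checking happens here; if we had to move right to another
# page, 'cached' would have to be thrown away ...

# Insertion-point search: restarts within the narrowed interval and
# arrives at the same offset as a from-scratch search would.
assert binsrch(page, 20, cached) == binsrch(page, 20) == 1
```

This is the sense in which the bounds are a pure optimization: the cached search and the from-scratch search must always agree, which is also why the regression-test observation about how rarely invalidation occurs doesn't change the code's correctness obligations.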
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On Tue, Mar 12, 2019 at 11:40 AM Robert Haas wrote: > I think it's pretty clear that we have to view that as acceptable. I > mean, we could reduce contention even further by finding a way to make > indexes 40% larger, but I think it's clear that nobody wants that. I found this analysis of bloat in the production database of Gitlab in January 2019 fascinating: https://about.gitlab.com/handbook/engineering/infrastructure/blueprint/201901-postgres-bloat/
They determined that their tables consisted of about 2% bloat, whereas indexes were 51% bloat (determined by running VACUUM FULL, and measuring how much smaller indexes and tables were afterwards). That in itself may not be that telling. What is telling is that index bloat disproportionately affects certain kinds of indexes. As they put it, "Indexes that do not serve a primary key constraint make up 95% of the overall index bloat". In other words, the vast majority of all bloat occurs within non-unique indexes, with most remaining bloat in unique indexes.
One factor that could be relevant is that unique indexes get a lot more opportunistic LP_DEAD killing. Unique indexes don't rely on the similar-but-distinct kill_prior_tuple optimization -- a lot more tuples can be killed within _bt_check_unique() than with kill_prior_tuple in realistic cases. That's probably not really that big a factor, though. It seems almost certain that "getting tired" is the single biggest problem. The blog post drills down further, and cites the examples of several *extremely* bloated indexes on a single column, which is obviously low cardinality. This includes an index on a boolean field, and an index on an enum-like text field. In my experience, having many indexes like that is very common in real world applications, though not at all common in popular benchmarks (with the exception of TPC-E). 
It also looks like they may benefit from the "split after new item" optimization, at least among the few unique indexes that were very bloated, such as merge_requests_pkey: https://gitlab.com/snippets/1812014 Gitlab is open source, so it should be possible to confirm my theory about the "split after new item" optimization (I am certain about "getting tired", though). Their schema is defined here: https://gitlab.com/gitlab-org/gitlab-ce/blob/master/db/schema.rb I don't have time to confirm all this right now, but I am pretty confident that they have both problems. And almost as confident that they'd notice substantial benefits from this patch series. -- Peter Geoghegan
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On 13/03/2019 03:28, Peter Geoghegan wrote:
> On Wed, Mar 6, 2019 at 10:15 PM Heikki Linnakangas wrote:
>> I made a copy of the _bt_binsrch, _bt_binsrch_insert. It does the binary search like _bt_binsrch does, but the bounds caching is only done in _bt_binsrch_insert. Seems more clear to have separate functions for them now, even though there's some duplication.
>>
>> /*
>>  * Do the insertion. First move right to find the correct page to
>>  * insert to, if necessary. If we're inserting to a non-unique index,
>>  * _bt_search() already did this when it checked if a move to the
>>  * right was required for leaf page. Insertion scankey's scantid
>>  * would have been filled out at the time. On a unique index, the
>>  * current buffer is the first buffer containing duplicates, however,
>>  * so we may need to move right to the correct location for this
>>  * tuple.
>>  */
>> if (checkingunique || itup_key->heapkeyspace)
>>     _bt_findinsertpage(rel, &insertstate, stack, heapRel);
>>
>> newitemoff = _bt_binsrch_insert(rel, &insertstate);
>>
>> The attached new version simplifies this, IMHO. The bounds and the current buffer go together in the same struct, so it's easier to keep track whether the bounds are valid or not.
> Now that you have a full understanding of how the negative infinity sentinel values work, and how page deletion's leaf page search and !heapkeyspace indexes need to be considered, I think that we should come back to this _bt_binsrch()/_bt_findsplitloc() stuff. My sense is that you now have a full understanding of all the subtleties of the patch, including those that affect unique index insertion. That will make it much easier to talk about these unresolved questions.
> My current sense is that it isn't useful to store the current buffer alongside the binary search bounds/hint. It'll hardly ever need to be invalidated, because we'll hardly ever have to move right within _bt_findsplitloc() when doing unique index insertion (as I said before, the regression tests *never* have to do this according to gcov).
It doesn't matter how often it happens, the code still needs to deal with it. So let's try to make it as readable as possible.
> We're talking about a very specific set of conditions here, so I'd like something that's lightweight and specialized. I agree that the savebinsrch/restorebinsrch fields are a bit ugly, though. I can't think of anything that's better offhand. Perhaps you can suggest something that is both lightweight, and an improvement on savebinsrch/restorebinsrch.
Well, IMHO holding the buffer and the bounds in the new struct is more clean than the savebinsrch/restorebinsrch flags. That's exactly why I suggested it. I don't know what else to suggest. I haven't done any benchmarking, but I doubt there's any measurable difference.
> I'm of the opinion that having a separate _bt_binsrch_insert() does not make anything clearer. Actually, I think that saving the bounds within the original _bt_binsrch() makes the design of that function clearer, not less clear. It's all quite confusing at the moment, given the rightmost/!leaf/page empty special cases. Seeing how the bounds are reused (or not reused) outside of _bt_binsrch() helps with that.
Ok. I think having some code duplication is better than one function that tries to do many things, but I'm not wedded to that. - Heikki
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On 13/03/2019 03:28, Peter Geoghegan wrote:
> It would be great if you could take a look at the 'Add high key "continuescan" optimization' patch, which is the only one you haven't commented on so far (excluding the amcheck "relocate" patch, which is less important). I can put that one off for a while after the first 3 go in. I will also put off the "split after new item" commit for at least a week or two.
> I'm sure that the idea behind the "continuescan" patch will now seem pretty obvious to you -- it's just taking advantage of the high key when an index scan on the leaf level (which uses a search style scankey, not an insertion style scankey) looks like it may have to move to the next leaf page, but we'd like to avoid it where possible. Checking the high key there is now much more likely to result in the index scan not going to the next page, since we're more careful when considering a leaf split point these days. The high key often looks like the items on the page to the right, not the items on the same page.
Oh yeah, that makes perfect sense. I wonder why we haven't done it like that before? The new page split logic makes it more likely to help, but even without that, I don't see any downside.
I find it a bit confusing that the logic is now split between _bt_checkkeys() and _bt_readpage(). For a forward scan, _bt_readpage() does the high-key check, but the corresponding "first-key" check in a backward scan is done in _bt_checkkeys(). I'd suggest moving the logic completely to _bt_readpage(), so that it's in one place. With that, _bt_checkkeys() can always check the keys as it's told, without looking at the LP_DEAD flag. Like the attached. - Heikki

From 4b5ea0f361e3feda93852bd084fb0d325e808e4c Mon Sep 17 00:00:00 2001
From: Peter Geoghegan
Date: Mon, 12 Nov 2018 13:11:21 -0800
Subject: [PATCH 1/1] Add high key "continuescan" optimization. 
Teach B-Tree forward index scans to check the high key before moving to the next page in the hopes of finding that it isn't actually necessary to move to the next page. We already opportunistically force a key check of the last item on leaf pages, even when it's clear that it cannot be returned to the scan due to being dead-to-all, for the same reason. Since forcing the last item to be key checked no longer makes any difference in the case of forward scans, the existing extra key check is now only used for backwards scans. Like the existing check, the new check won't always work out, but that seems like an acceptable price to pay. The new approach is more effective than just checking non-pivot tuples, especially with composite indexes and non-unique indexes. The high key represents an upper bound on all values that can appear on the page, which is often greater than whatever tuple happens to appear last at the time of the check. Also, suffix truncation's new logic for picking a split point will often result in high keys that are relatively dissimilar to the other (non-pivot) tuples on the page, and therefore more likely to indicate that the scan need not proceed to the next page. Note that even pre-pg_upgrade'd v3 indexes make use of this optimization. 
(This is Heikki's refactored version)
---
 src/backend/access/nbtree/nbtsearch.c |  86 ++---
 src/backend/access/nbtree/nbtutils.c  | 103 +++---
 src/include/access/nbtree.h           |   3 +-
 3 files changed, 122 insertions(+), 70 deletions(-)

diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index af3da3aa5b6..243be6f410d 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -1220,7 +1220,6 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 	OffsetNumber minoff;
 	OffsetNumber maxoff;
 	int			itemIndex;
-	IndexTuple	itup;
 	bool		continuescan;
 
 	/*
@@ -1241,6 +1240,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 		_bt_parallel_release(scan, BufferGetBlockNumber(so->currPos.buf));
 	}
 
+	continuescan = true;	/* default assumption */
 	minoff = P_FIRSTDATAKEY(opaque);
 	maxoff = PageGetMaxOffsetNumber(page);
 
@@ -1282,23 +1282,58 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 		while (offnum <= maxoff)
 		{
-			itup = _bt_checkkeys(scan, page, offnum, dir, &continuescan);
-			if (itup != NULL)
+			ItemId		iid = PageGetItemId(page, offnum);
+			IndexTuple	itup;
+
+			/*
+			 * If the scan specifies not to return killed tuples, then we
+			 * treat a killed tuple as not passing the qual. Most of the
+			 * time, it's a win to not bother examining the tuple's index
+			 * keys, but just skip to the next tuple.
+			 */
+			if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
+			{
+				offnum = OffsetNumberNext(offnum);
+				continue;
+			}
+
+			itup = (IndexTuple) PageGetItem(page, iid);
+
+			if (_bt_checkkeys(scan, itup, dir, &continuescan))
 			{
 				/* tuple passes all scan key c
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On Thu, Mar 14, 2019 at 4:00 AM Heikki Linnakangas wrote: > Oh yeah, that makes perfect sense. I wonder why we haven't done it like > that before? The new page split logic makes it more likely to help, but > even without that, I don't see any downside. The only downside is that we spend a few extra cycles, and that might not work out. This optimization would have always worked, though. The new page split logic clearly makes checking the high key in the "continuescan" path an easy win. > I find it a bit confusing that the logic is now split between > _bt_checkkeys() and _bt_readpage(). For a forward scan, _bt_readpage() > does the high-key check, but the corresponding "first-key" check in a > backward scan is done in _bt_checkkeys(). I'd suggest moving the logic > completely to _bt_readpage(), so that it's in one place. With that, > _bt_checkkeys() can always check the keys as it's told, without looking > at the LP_DEAD flag. Like the attached. I'm convinced. I'd like to go a bit further, and also pass tupnatts to _bt_checkkeys(). That makes it closer to the similar _bt_check_rowcompare() function that _bt_checkkeys() must sometimes call. It also allows us to only call BTreeTupleGetNAtts() for the high key, while passing down a generic, loop-invariant IndexRelationGetNumberOfAttributes() value for non-pivot tuples. I'll do it that way in the next revision. Thanks -- Peter Geoghegan
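The shape of the optimization, reduced to a runnable toy (function names and the qual representation here are made up for illustration; the real logic lives in _bt_readpage()/_bt_checkkeys() and uses proper scan keys):

```python
def check_keys(tup, scan_quals):
    """Return (matches, continuescan) for one tuple against '<=' quals,
    given as (attribute number, upper bound) pairs."""
    for attno, bound in scan_quals:
        if tup[attno] > bound:
            # Keys are in order, so no later tuple can match either.
            return False, False
    return True, True

def read_page(items, high_key, scan_quals):
    """Return (matching tuples, whether to visit the next page) for one
    leaf page of a forward scan."""
    matches = [t for t in items if check_keys(t, scan_quals)[0]]
    if high_key is None:            # rightmost page: +infinity high key
        return matches, False
    # The new check: if even the page's upper bound (its high key) fails
    # the quals, the next page cannot contain matches -- don't visit it.
    _, continuescan = check_keys(high_key, scan_quals)
    return matches, continuescan

# Scan WHERE a <= 15. The page holds a = 10..14; suffix truncation tends
# to produce a high key that looks like the *next* page's items (a = 20).
items = [(10,), (12,), (14,)]
matches, move_right = read_page(items, high_key=(20,), scan_quals=[(0, 15)])
print(matches, move_right)   # [(10,), (12,), (14,)] False
```

With a high key of (14,) instead, the check would be inconclusive and the scan would still step right, which is why this only costs a few extra cycles when it doesn't work out.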
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On 16/03/2019 06:16, Peter Geoghegan wrote: On Thu, Mar 14, 2019 at 2:21 AM Heikki Linnakangas wrote: It doesn't matter how often it happens, the code still needs to deal with it. So let's try to make it as readable as possible. Well, IMHO holding the buffer and the bounds in the new struct is cleaner than the savebinsrch/restorebinsrch flags. That's exactly why I suggested it. I don't know what else to suggest. I haven't done any benchmarking, but I doubt there's any measurable difference. Fair enough. Attached is v17, which does it using the approach taken in your earlier prototype. I even came around to your view on _bt_binsrch_insert() -- I kept that part, too. Note, however, that I still pass checkingunique to _bt_findinsertloc(), because that's a distinct condition from whether or not bounds were cached (they happen to be the same thing right now, but I don't want to assume that). This revision also integrates your approach to the "continuescan" optimization patch, with the small tweak I mentioned yesterday (we also pass ntupatts). I also prefer this approach. Great, thank you! It would be nice if you could take a look at the amcheck "relocate" patch. When I started looking at this, I thought that "relocate" means "move". So I thought that the new mode would actually move tuples, i.e. that it would modify the index. That sounded horrible. Of course, it doesn't actually do that. It just checks that each tuple can be re-found, or "relocated", by descending the tree from the root. I'd suggest changing the language to avoid that confusion. It seems like a nice way to catch all kinds of index corruption issues. Although, we already check that the tuples are in order within the page. Is it really necessary to traverse the tree for every tuple, as well? Maybe do it just for the first and last item?

+ * This routine can detect very subtle transitive consistency issues across
+ * more than one level of the tree. Leaf pages all have a high key (even the
+ * rightmost page has a conceptual positive infinity high key), but not a low
+ * key. Their downlink in the parent is a lower bound, which along with the
+ * high key is almost enough to detect every possible inconsistency. A
+ * downlink separator key value won't always be available from the parent,
+ * though, because the first items of internal pages are negative infinity
+ * items, truncated down to zero attributes during internal page splits.
+ * While it's true that bt_downlink_check() and the high key check can detect
+ * most imaginable key space problems, there are remaining problems they
+ * won't detect with non-pivot tuples in cousin leaf pages. Starting a search
+ * from the root for every existing leaf tuple detects small inconsistencies
+ * in upper levels of the tree that cannot be detected any other way.
+ * (Besides all this, it's probably a useful testing strategy to exhaustively
+ * verify that all non-pivot tuples can be relocated in the index using the
+ * same code paths as those used by index scans.)

I don't understand this. Can you give an example of this kind of inconsistency? - Heikki
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On Sat, Mar 16, 2019 at 1:44 AM Heikki Linnakangas wrote: > > It would be nice if you could take a look at the amcheck "relocate" > > patch > When I started looking at this, I thought that "relocate" means "move". > So I thought that the new mode would actually move tuples, i.e. that it > would modify the index. That sounded horrible. Of course, it doesn't > actually do that. It just checks that each tuple can be re-found, or > "relocated", by descending the tree from the root. I'd suggest changing > the language to avoid that confusion. Okay. What do you suggest? :-) > It seems like a nice way to catch all kinds of index corruption issues. > Although, we already check that the tuples are in order within the page. > Is it really necessary to traverse the tree for every tuple, as well? > Maybe do it just for the first and last item? It's mainly intended as a developer option. I want it to be possible to detect any form of corruption, however unlikely. It's an adversarial mindset that will at least make me less nervous about the patch. > I don't understand this. Can you give an example of this kind of > inconsistency? The commit message gives an example, but I suggest trying it out for yourself. Corrupt the least significant key byte of a root page of a B-Tree using pg_hexedit. Say it's an index on a text column, then you'd corrupt the last byte to be something slightly wrong. Then, the only way to catch it is with "relocate" verification. You'll only miss a few tuples on a cousin page at the leaf level (those on either side of the high key that the corrupted separator key in the root was originally copied from). The regular checks won't catch this, because the keys are similar enough one level down. The "minus infinity" item is a kind of a blind spot, because we cannot do a parent check of its children, because we don't have the key (it's truncated when the item becomes a right page minus infinity item, during an internal page split). -- Peter Geoghegan
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On 16/03/2019 10:51, Peter Geoghegan wrote: On Sat, Mar 16, 2019 at 1:44 AM Heikki Linnakangas wrote: It would be nice if you could take a look at the amcheck "relocate" patch When I started looking at this, I thought that "relocate" means "move". So I thought that the new mode would actually move tuples, i.e. that it would modify the index. That sounded horrible. Of course, it doesn't actually do that. It just checks that each tuple can be re-found, or "relocated", by descending the tree from the root. I'd suggest changing the language to avoid that confusion. Okay. What do you suggest? :-) Hmm. "re-find", maybe? We use that term in a few error messages already, to mean something similar. It seems like a nice way to catch all kinds of index corruption issues. Although, we already check that the tuples are in order within the page. Is it really necessary to traverse the tree for every tuple, as well? Maybe do it just for the first and last item? It's mainly intended as a developer option. I want it to be possible to detect any form of corruption, however unlikely. It's an adversarial mindset that will at least make me less nervous about the patch. Fair enough. At first, I thought this would be horrendously expensive, but thinking about it a bit more, nearby tuples will always follow the same path through the upper nodes, so it'll all be cached. So maybe it's not quite so bad. I don't understand this. Can you give an example of this kind of inconsistency? The commit message gives an example, but I suggest trying it out for yourself. Corrupt the least significant key byte of a root page of a B-Tree using pg_hexedit. Say it's an index on a text column, then you'd corrupt the last byte to be something slightly wrong. Then, the only way to catch it is with "relocate" verification. You'll only miss a few tuples on a cousin page at the leaf level (those on either side of the high key that the corrupted separator key in the root was originally copied from). 
The regular checks won't catch this, because the keys are similar enough one level down. The "minus infinity" item is a kind of a blind spot, because we cannot do a parent check of its children, because we don't have the key (it's truncated when the item becomes a right page minus infinity item, during an internal page split). Hmm. So, the initial situation would be something like this:

                +-------------+
                | 1: root     |
                |             |
                | -inf -> 2   |
                | 20   -> 3   |
                +-------------+

        +-------------+   +-------------+
        | 2: internal |   | 3: internal |
        |             |   |             |
        | -inf -> 4   |   | -inf -> 6   |
        | 10   -> 5   |   | 30   -> 7   |
        |             |   |             |
        | hi: 20      |   |             |
        +-------------+   +-------------+

+---------+  +---------+  +---------+  +---------+
| 4: leaf |  | 5: leaf |  | 6: leaf |  | 7: leaf |
|         |  |         |  |         |  |         |
| 1       |  | 11      |  | 21      |  | 31      |
| 5       |  | 15      |  | 25      |  | 35      |
| 9       |  | 19      |  | 29      |  | 39      |
|         |  |         |  |         |  |         |
| hi: 10  |  | hi: 20  |  | hi: 30  |  |         |
+---------+  +---------+  +---------+  +---------+

Then, a cosmic ray changes the 20 on the root page to 18. That causes the leaf tuple 19 to become non-re-findable; if you descend the tree for 19, you'll incorrectly land on page 6. But it also causes the high key on page 2 to be different from the downlink on the root page. Wouldn't the existing checks catch this? - Heikki
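As a toy illustration of Heikki's example (ordinary Python, not PostgreSQL code; the page representation is invented), here is what descending the tree looks like after the root separator is corrupted from 20 to 18 -- the leaf tuple 19 becomes non-re-findable:

```python
# Sketch of Heikki's corruption example. Internal pages are lists of
# (separator_key, child) pairs, with -inf as the first separator; leaf
# pages are plain lists of keys. Descent follows the last downlink whose
# separator is <= the search key, like a B-Tree search would.

NEG_INF = float("-inf")

leaf4 = [1, 5, 9]
leaf5 = [11, 15, 19]
leaf6 = [21, 25, 29]
leaf7 = [31, 35, 39]

internal2 = [(NEG_INF, leaf4), (10, leaf5)]
internal3 = [(NEG_INF, leaf6), (30, leaf7)]
root = [(NEG_INF, internal2), (20, internal3)]

def descend(page, key):
    """Descend from an internal page to the leaf a search for key reaches."""
    while isinstance(page[0], tuple):       # internal pages hold pairs
        child = page[0][1]
        for sep, c in page:
            if sep <= key:
                child = c                   # last downlink with sep <= key
            else:
                break
        page = child
    return page

assert descend(root, 19) is leaf5           # healthy tree: 19 is re-findable

root[1] = (18, internal3)                   # "cosmic ray" flips 20 -> 18
assert descend(root, 19) is leaf6           # descent lands on the wrong leaf
assert 19 not in descend(root, 19)          # 19 can no longer be re-found
```

Every key other than 19 still descends to the right leaf here, which is why per-page ordering checks alone stay quiet.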
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On Sat, Mar 16, 2019 at 9:17 AM Heikki Linnakangas wrote: > Hmm. "re-find", maybe? We use that term in a few error messages already, > to mean something similar. WFM. > At first, I thought this would be horrendously expensive, but thinking > about it a bit more, nearby tuples will always follow the same path > through the upper nodes, so it'll all be cached. So maybe it's not quite > so bad. That's deliberate, though you could call bt_relocate_from_root() from anywhere else if you wanted to. It's a bit like a big nested loop join, where the inner side has locality. > Then, a cosmic ray changes the 20 on the root page to 18. That causes > the the leaf tuple 19 to become non-re-findable; if you descend the for > 19, you'll incorrectly land on page 6. But it also causes the high key > on page 2 to be different from the downlink on the root page. Wouldn't > the existing checks catch this? No, the existing checks will not check that. I suppose something closer to the existing approach *could* detect this issue, by making sure that the "seam of identical high keys" from the root to the leaf are a match, but we don't use the high key outside of its own page. Besides, there is something useful about having the code actually rely on _bt_search(). There are other similar cases, where it's not obvious how you can do verification without either 1) crossing multiple levels, or 2) retaining a "low key" as well as a high key (this is what Goetz Graefe calls "retaining fence keys to solve the cousin verification problem" in Modern B-Tree Techniques). What if the corruption was in the leaf page 6 from your example, which had a 20 at the start? We wouldn't have compared the downlink from the parent to the child, because leaf page 6 is the leftmost child, and so we only have "-inf". The lower bound actually comes from the root page, because we truncate "-inf" attributes during page splits, even though we don't have to. 
Most of the time they're not "absolute minus infinity" -- they're "minus infinity in this subtree". Maybe you could actually do something with the high key from leaf page 5 to detect the stray value "20" in leaf page 6, but again, we don't do anything like that right now. -- Peter Geoghegan
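The "minus infinity" semantics discussed above can be sketched outside of C. This is a hypothetical model, not the real _bt_compare(): a suffix-truncated pivot tuple simply runs out of attributes, and any remaining scan key attribute then compares as greater:

```python
# Sketch of "truncated attribute == minus infinity" comparison semantics.
# scankey and pivot are tuples of attribute values; pivot may be
# suffix-truncated (shorter than the scan key), possibly to zero attributes.

def compare(scankey, pivot):
    """Return <0, 0 or >0, in the style of a three-way comparator."""
    for i, key_att in enumerate(scankey):
        if i >= len(pivot):
            return 1                    # truncated attribute: minus infinity
        if key_att != pivot[i]:
            return -1 if key_att < pivot[i] else 1
    return 0                            # scan key exhausted: equal

# A pivot truncated to zero attributes behaves like the classic hard-coded
# "minus infinity" item: every scan key compares greater.
assert compare(("b", 7), ()) > 0
assert compare(("b", 7), ("b",)) > 0    # suffix-truncated: heap TID gone
assert compare(("a", 7), ("b",)) < 0
assert compare(("b",), ("b", 7)) == 0   # shorter scan key (smaller keysz)
```

Note the last case: when the scan key has fewer attributes than the pivot, the tuples compare as equal, which is why the special _bt_binsrch() treatment of internal pages is still needed when keysz is less than the number of indexed columns.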
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On Sat, Mar 16, 2019 at 9:55 AM Peter Geoghegan wrote: > On Sat, Mar 16, 2019 at 9:17 AM Heikki Linnakangas wrote: > > Hmm. "re-find", maybe? We use that term in a few error messages already, > > to mean something similar. > > WFM. Actually, how about "rootsearch", or "rootdescend"? You're supposed to hyphenate "re-find", and so it doesn't really work as a function argument name. -- Peter Geoghegan
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On 16/03/2019 19:32, Peter Geoghegan wrote: On Sat, Mar 16, 2019 at 9:55 AM Peter Geoghegan wrote: On Sat, Mar 16, 2019 at 9:17 AM Heikki Linnakangas wrote: Hmm. "re-find", maybe? We use that term in a few error messages already, to mean something similar. WFM. Actually, how about "rootsearch", or "rootdescend"? You're supposed to hyphenate "re-find", and so it doesn't really work as a function argument name. Works for me. - Heikki
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On 16/03/2019 18:55, Peter Geoghegan wrote: On Sat, Mar 16, 2019 at 9:17 AM Heikki Linnakangas wrote: Then, a cosmic ray changes the 20 on the root page to 18. That causes the leaf tuple 19 to become non-re-findable; if you descend the tree for 19, you'll incorrectly land on page 6. But it also causes the high key on page 2 to be different from the downlink on the root page. Wouldn't the existing checks catch this? No, the existing checks will not check that. I suppose something closer to the existing approach *could* detect this issue, by making sure that the "seam of identical high keys" from the root to the leaf are a match, but we don't use the high key outside of its own page. Besides, there is something useful about having the code actually rely on _bt_search(). There are other similar cases, where it's not obvious how you can do verification without either 1) crossing multiple levels, or 2) retaining a "low key" as well as a high key (this is what Goetz Graefe calls "retaining fence keys to solve the cousin verification problem" in Modern B-Tree Techniques). What if the corruption was in the leaf page 6 from your example, which had a 20 at the start? We wouldn't have compared the downlink from the parent to the child, because leaf page 6 is the leftmost child, and so we only have "-inf". The lower bound actually comes from the root page, because we truncate "-inf" attributes during page splits, even though we don't have to. Most of the time they're not "absolute minus infinity" -- they're "minus infinity in this subtree". AFAICS, there is a copy of every page's high key in its immediate parent. Either in the downlink of the right sibling, or as the high key of the parent page itself. Cross-checking those would catch any corruption in high keys. Note that this would potentially catch some corruption that the descend-from-root would not. 
If you have a mismatch between the high key of a leaf page and its parent or grandparent, all the current tuples might still pass the rootdescend check. But a tuple might get inserted into the wrong location later. Maybe you could actually do something with the high key from leaf page 5 to detect the stray value "20" in leaf page 6, but again, we don't do anything like that right now. Hmm, yeah, to check for stray values, you could follow the left-link, get the high key of the left sibling, and compare against that. - Heikki
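The stray-value cross-check Heikki suggests can be sketched as follows. This is a hypothetical illustration with an invented page layout, not amcheck code, and the exact boundary semantics (strict vs. non-strict comparison against the high key) differ in real nbtree:

```python
# Sketch: detect "stray" values on a leaf page by bounding its keys
# between the left sibling's high key (lower bound, obtained by following
# the left-link) and the page's own high key (upper bound).

def check_leaf_bounds(pages):
    """pages: leaf pages in left-to-right order; each is a dict with
    'items' (keys on the page) and 'highkey' (None on the rightmost page)."""
    lower = None                          # leftmost leaf has no lower bound
    for page in pages:
        for key in page["items"]:
            if lower is not None and key < lower:
                raise ValueError(f"stray value {key} below left sibling's "
                                 f"high key {lower}")
            if page["highkey"] is not None and key > page["highkey"]:
                raise ValueError(f"stray value {key} above high key "
                                 f"{page['highkey']}")
        lower = page["highkey"]           # becomes the next page's lower bound

# The stray 12 on the third leaf violates the bound implied by its left
# sibling's high key (20), even though the page itself is ordered.
leaves = [
    {"items": [1, 5, 9],    "highkey": 10},
    {"items": [11, 15, 19], "highkey": 20},
    {"items": [12, 21, 25], "highkey": 30},
]
try:
    check_leaf_bounds(leaves)
    caught = False
except ValueError:
    caught = True
assert caught
```

The point of the sketch is that the stray value is invisible to a per-page ordering check, and invisible to the parent when the page is a leftmost ("-inf") child -- only the left sibling's high key exposes it.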
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On Sat, Mar 16, 2019 at 1:33 PM Heikki Linnakangas wrote: > AFAICS, there is a copy of every page's high key in its immediate > parent. Either in the downlink of the right sibling, or as the high key > of the parent page itself. Cross-checking those would catch any > corruption in high keys. I agree that it's always true that the high key is also in the parent, and we could cross-check that within the child. Actually, I should probably figure out a way of arranging for the Bloom filter used within bt_relocate_from_root() (which has been around since PG v11) to include the key itself when fingerprinting. That would probably necessitate that we don't truncate "negative infinity" items (it was actually that way about 18 years ago), just for the benefit of verification. This is almost the same thing as what Graefe argues for (don't think that you need a low key on the leaf level, since you can cross a single level there). I wonder if truncating the negative infinity item in internal pages to zero attributes is actually worth it, since a low key might be useful for a number of reasons. > Note that this would potentially catch some corruption that the > descend-from-root would not. If you have a mismatch between the high key > of a leaf page and its parent or grandparent, all the current tuples > might pass the rootdescend check. But a tuple might get inserted into > the wrong location later. I also agree with this. However, the rootdescend check will always work better than this in some cases that you can at least imagine, for as long as there are negative infinity items to worry about. (And, even if we decided not to truncate to support easy verification, there is still a good argument to be made for involving the authoritative _bt_search() code at some point). > > Maybe you could actually do something with the high key from leaf page > > 5 to detect the stray value "20" in leaf page 6, but again, we don't > > do anything like that right now. 
> > Hmm, yeah, to check for stray values, you could follow the left-link, > get the high key of the left sibling, and compare against that. Graefe argues for retaining a low key, even in leaf pages (the left page's old high key becomes the left page's low key during a split, and the left page's new high key becomes the new right page's low key at the same time). It's useful for what he calls "write-optimized B-Trees", and maybe even for optional compression. As I said earlier, I guess you can just go left on the leaf level if you need to, and you have all you need. But I'd need to think about it some more. Point taken; rootdescend isn't enough to make verification exactly perfect. But it makes verification very close to perfect, because you're going to get right answers to queries as long as it passes (I think). There could be a problem for a future insertion that you might hope to detect now, but can't. But you'd have to be extraordinarily unlucky to have that happen for any amount of time. Unlucky even by my own extremely paranoid standard. -- Peter Geoghegan
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On Sat, Mar 16, 2019 at 1:47 PM Peter Geoghegan wrote: > I agree that it's always true that the high key is also in the parent, > and we could cross-check that within the child. Actually, I should > probably figure out a way of arranging for the Bloom filter used > within bt_relocate_from_root() (which has been around since PG v11) to > include the key itself when fingerprinting. That would probably > necessitate that we don't truncate "negative infinity" items (it was > actually that way about 18 years ago), just for the benefit of > verification. Clarification: You'd fingerprint an entire pivot tuple -- key, block number, everything. Then, you'd probe the Bloom filter for the high key one level down, with the downlink block in the high key set to point to the current sibling on the same level (the child level). As I said, I think that the only reason that that cannot be done at present is because of the micro-optimization of truncating the first item on the right page to zero attributes during an internal page split. (We can retain the key without getting rid of the hard-coded logic for negative infinity within _bt_compare()). bt_relocate_from_root() already has smarts around interrupted page splits (with the incomplete split bit set). Finally, you'd also make bt_downlink_check follow every downlink, not all-but-one downlink (no more excuse for leaving out the first one if we don't truncate during internal page splits). -- Peter Geoghegan
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On Sat, Mar 16, 2019 at 2:01 PM Peter Geoghegan wrote: > > On Sat, Mar 16, 2019 at 1:47 PM Peter Geoghegan wrote: > > I agree that it's always true that the high key is also in the parent, > > and we could cross-check that within the child. Actually, I should > > probably figure out a way of arranging for the Bloom filter used > > within bt_relocate_from_root() (which has been around since PG v11) to > > include the key itself when fingerprinting. That would probably > > necessitate that we don't truncate "negative infinity" items (it was > > actually that way about 18 years ago), just for the benefit of > > verification. > > Clarification: You'd fingerprint an entire pivot tuple -- key, block > number, everything. Then, you'd probe the Bloom filter for the high > key one level down, with the downlink block in the high key set to > point to the current sibling on the same level (the child level). As I > said, I think that the only reason that that cannot be done at present > is because of the micro-optimization of truncating the first item on > the right page to zero attributes during an internal page split. (We > can retain the key without getting rid of the hard-coded logic for > negative infinity within _bt_compare()). > > bt_relocate_from_root() already has smarts around interrupted page > splits (with the incomplete split bit set). Clarification to my clarification: I meant bt_downlink_missing_check(), not bt_relocate_from_root(). The former really has been around since v11, unlike the latter, which is part of this new amcheck patch we're discussing. -- Peter Geoghegan
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On Mon, Mar 18, 2019 at 4:59 AM Heikki Linnakangas wrote: > I'm getting a regression failure from the 'create_table' test with this: > Are you seeing that? Yes -- though the bug is in your revised v18, not the original v18, which passed CFTester. Your revision fails on Travis/Linux, which is pretty close to what I see locally, and much less subtle than the test failures you mentioned: https://travis-ci.org/postgresql-cfbot/postgresql/builds/507816665 "make check" did pass locally for me with your patch, but "make check-world" (parallel recipe) did not. The original v18 passed both CFTester tests about 15 hour ago: https://travis-ci.org/postgresql-cfbot/postgresql/builds/507643402 I see the bug. You're not supposed to test this way with a heapkeyspace index: > + if (P_RIGHTMOST(lpageop) || > + _bt_compare(rel, itup_key, page, P_HIKEY) != 0) > + break; This is because the presence of scantid makes it almost certain that you'll break when you shouldn't. You're doing it the old way, which is inappropriate for a heapkeyspace index. Note that it would probably take much longer to notice this bug if the "consider secondary factors" patch was also applied, because then you would rarely have cause to step right here (duplicates would never occupy more than a single page in the regression tests). The test failures are probably also timing sensitive, though they happen very reliably on my machine. > Looking at the patches 1 and 2 again: > > I'm still not totally happy with the program flow and all the conditions > in _bt_findsplitloc(). I have a hard time following which codepaths are > taken when. I refactored that, so that there is a separate copy of the > loop for V3 and V4 indexes. The big difference is that you make the possible call to _bt_stepright() conditional on this being a checkingunique index -- the duplicate code is indented in that branch of _bt_findsplitloc(). Whereas I break early in the loop when "checkingunique && heapkeyspace". 
The flow of the original loop not only had less code. It also contrasted the important differences between heapkeyspace and !heapkeyspace cases:

/* If this is the page that the tuple must go on, stop */
if (P_RIGHTMOST(lpageop))
    break;
cmpval = _bt_compare(rel, itup_key, page, P_HIKEY);
if (itup_key->heapkeyspace)
{
    if (cmpval <= 0)
        break;
}
else
{
    /*
     * pg_upgrade'd !heapkeyspace index.
     *
     * May have to handle legacy case where there is a choice of which
     * page to place new tuple on, and we must balance space
     * utilization as best we can. Note that this may invalidate
     * cached bounds for us.
     */
    if (cmpval != 0 || _bt_useduplicatepage(rel, heapRel, insertstate))
        break;
}

I thought it was obvious that the "cmpval <= 0" code was different for a reason. It now seems that this at least needs a comment. I still believe that the best way to handle the !heapkeyspace case is to make it similar to the heapkeyspace checkingunique case, regardless of whether or not we're checkingunique. The fact that this bug slipped in supports that view. Besides, the alternative that you suggest treats !heapkeyspace indexes as if they were just as important to the reader, which seems inappropriate (better to make the legacy case follow the new case, not the other way around). I'm fine with the comment tweaks that you made that are not related to _bt_findsplitloc(), though. I won't push the patches today, to give you the opportunity to respond. I am not at all convinced right now, though. -- Peter Geoghegan
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On Tue, Mar 12, 2019 at 11:40 AM Robert Haas wrote: > I think it's pretty clear that we have to view that as acceptable. I > mean, we could reduce contention even further by finding a way to make > indexes 40% larger, but I think it's clear that nobody wants that. > Now, maybe in the future we'll want to work on other techniques for > reducing contention, but I don't think we should make that the problem > of this patch, especially because the regressions are small and go > away after a few hours of heavy use. We should optimize for the case > where the user intends to keep the database around for years, not > hours. I came back to the question of contention recently. I don't think it's okay to make contention worse in workloads where indexes are mostly the same size as before, such as almost any workload that pgbench can simulate. I have made a lot of the fact that the TPC-C indexes are ~40% smaller, in part because lots of people outside the community find TPC-C interesting, and in part because this patch series is focused on cases where we currently do unusually badly (cases where good intuitions about how B-Trees are supposed to perform break down [1]). These pinpointable problems must affect a lot of users some of the time, but certainly not all users all of the time. The patch series is actually supposed to *improve* the situation with index buffer lock contention in general, and it looks like it manages to do that with pgbench, which doesn't do inserts into indexes, except for those required for non-HOT updates. pgbench requires relatively few page splits, but is in every other sense a high contention workload. 
With pgbench scale factor 20, here are results for patch and master with a Gaussian distribution on my 8 thread/4 core home server, with each run reported lasting 10 minutes, repeating twice for client counts 1, 2, 8, 16, and 64, patch and master branch:

\set aid random_gaussian(1, 10 * :scale, 20)
\set bid random(1, 1 * :scale)
\set tid random(1, 10 * :scale)
\set delta random(-5000, 5000)
BEGIN;
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
END;

1st pass (init pgbench from scratch for each database, scale 20)

1 client master:
tps = 7203.983289 (including connections establishing)
tps = 7204.020457 (excluding connections establishing)
latency average = 0.139 ms
latency stddev = 0.026 ms

1 client patch:
tps = 7012.575167 (including connections establishing)
tps = 7012.590007 (excluding connections establishing)
latency average = 0.143 ms
latency stddev = 0.020 ms

2 clients master:
tps = 13434.043832 (including connections establishing)
tps = 13434.076194 (excluding connections establishing)
latency average = 0.149 ms
latency stddev = 0.032 ms

2 clients patch:
tps = 13105.620223 (including connections establishing)
tps = 13105.654109 (excluding connections establishing)
latency average = 0.153 ms
latency stddev = 0.033 ms

8 clients master:
tps = 27126.852038 (including connections establishing)
tps = 27126.986978 (excluding connections establishing)
latency average = 0.295 ms
latency stddev = 0.095 ms

8 clients patch:
tps = 27945.457965 (including connections establishing)
tps = 27945.565242 (excluding connections establishing)
latency average = 0.286 ms
latency stddev = 0.089 ms

16 clients master:
tps = 32297.612323 (including connections establishing)
tps = 32297.743929 (excluding connections establishing)
latency average = 0.495 ms
latency stddev = 0.185 ms

16 clients patch:
tps = 33434.889405 (including connections establishing)
tps = 33435.021738 (excluding connections establishing)
latency average = 0.478 ms
latency stddev = 0.167 ms

64 clients master:
tps = 25699.029787 (including connections establishing)
tps = 25699.217022 (excluding connections establishing)
latency average = 2.482 ms
latency stddev = 1.715 ms

64 clients patch:
tps = 26513.816673 (including connections establishing)
tps = 26514.013638 (excluding connections establishing)
latency average = 2.405 ms
latency stddev = 1.690 ms

2nd pass (init pgbench from scratch for each database, scale 20)

1 client master:
tps = 7172.995796 (including connections establishing)
tps = 7173.013472 (excluding connections establishing)
latency average = 0.139 ms
latency stddev = 0.022 ms

1 client patch:
tps = 7024.724365 (including connections establishing)
tps = 7024.739237 (excluding connections establishing)
latency average = 0.142 ms
latency stddev = 0.021 ms

2 clients master:
tps = 13489.016303 (including connections establishing)
tps = 13489.047968 (excluding connections establishing)
latency average = 0.148 ms
latency stddev = 0.032 ms

2 clients patch: tps
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On Mon, Mar 18, 2019 at 7:34 PM Peter Geoghegan wrote: > With pgbench scale factor 20, here are results for patch and master > with a Gaussian distribution on my 8 thread/4 core home server, with > each run reported lasting 10 minutes, repeating twice for client > counts 1, 2, 8, 16, and 64, patch and master branch: > > 1 client master: > tps = 7203.983289 (including connections establishing) > 1 client patch: > tps = 7012.575167 (including connections establishing) > > 2 clients master: > tps = 13434.043832 (including connections establishing) > 2 clients patch: > tps = 13105.620223 (including connections establishing) Blech. I think the patch has enough other advantages that it's worth accepting that, but it's not great. We seem to keep finding reasons to reduce single client performance in the name of scalability, which is often reasonable but not wonderful. > However, this isn't completely > free (particularly the page split stuff), and it doesn't pay for > itself until the number of clients ramps up. I don't really understand that explanation. It makes sense that more intelligent page split decisions could require more CPU cycles, but it is not evident to me why more clients would help better page split decisions pay off. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On Mon, Mar 18, 2019 at 5:00 PM Robert Haas wrote: > Blech. I think the patch has enough other advantages that it's worth > accepting that, but it's not great. We seem to keep finding reasons > to reduce single client performance in the name of scalability, which > is often reasonable but not wonderful. The good news is that the quicksort that we now perform in nbtsplitloc.c is not optimized at all. Heikki thought it premature to optimize that, for example by inlining/specializing the quicksort. I can make that 3x faster fairly easily, which could easily change the picture here. The code will be uglier that way, but not much more complicated. I even prototyped this, and managed to make serial microbenchmarks I've used noticeably faster. This could very well fix the problem here. It clearly showed up in perf profiles with serial bulk loads. > > However, this isn't completely > > free (particularly the page split stuff), and it doesn't pay for > > itself until the number of clients ramps up. > > I don't really understand that explanation. It makes sense that more > intelligent page split decisions could require more CPU cycles, but it > is not evident to me why more clients would help better page split > decisions pay off. Smarter choices on page splits pay off with higher client counts because they reduce contention at likely hot points. It's kind of crazy that the code in _bt_check_unique() sometimes has to move right, while holding an exclusive buffer lock on the original page and a shared buffer lock on its sibling page at the same time. It then has to hold a third buffer lock concurrently, this time on any heap pages it is interested in. Each in turn, to check if they're possibly conflicting. gcov shows that that never happens with the regression tests once the patch is applied (you can at least get away with only having one buffer lock on a leaf page at all times in practically all cases). -- Peter Geoghegan
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On Mon, Mar 18, 2019 at 5:12 PM Peter Geoghegan wrote: > Smarter choices on page splits pay off with higher client counts > because they reduce contention at likely hot points. It's kind of > crazy that the code in _bt_check_unique() sometimes has to move right, > while holding an exclusive buffer lock on the original page and a > shared buffer lock on its sibling page at the same time. It then has > to hold a third buffer lock concurrently, this time on any heap pages > it is interested in. Actually, by the time we get to 16 clients, this workload does make the indexes and tables smaller. Here is pg_buffercache output collected after the first 16 client case:

Master
======

 relname               │ relforknumber │ size_main_rel_fork_blocks │ buffer_count │ avg_buffer_usg
───────────────────────┼───────────────┼───────────────────────────┼──────────────┼────────────────────
 pgbench_history       │             0 │                   123,484 │      123,484 │ 4.9989715266755207
 pgbench_accounts      │             0 │                    34,665 │       10,682 │ 4.4948511514697622
 pgbench_accounts_pkey │             0 │                     5,708 │        1,561 │ 4.8731582319026265
 pgbench_tellers       │             0 │                       489 │          489 │ 5.
 pgbench_branches      │             0 │                       284 │          284 │ 5.
 pgbench_tellers_pkey  │             0 │                        56 │           56 │ 5.

Patch
=====

 relname               │ relforknumber │ size_main_rel_fork_blocks │ buffer_count │ avg_buffer_usg
───────────────────────┼───────────────┼───────────────────────────┼──────────────┼────────────────────
 pgbench_history       │             0 │                   127,864 │      127,864 │ 4.9980447975974473
 pgbench_accounts      │             0 │                    33,933 │        9,614 │ 4.3517786561264822
 pgbench_accounts_pkey │             0 │                     5,487 │        1,322 │ 4.8857791225416036
 pgbench_tellers       │             0 │                       204 │          204 │ 4.9803921568627451
 pgbench_branches      │             0 │                       198 │          198 │ 4.3535353535353535
 pgbench_tellers_pkey  │             0 │                        14 │           14 │ 5.

The main fork for pgbench_history is larger with the patch, obviously, but that's good. pgbench_accounts_pkey is about 4% smaller, which is probably the most interesting observation that can be made here, but the tables are also smaller. pgbench_accounts itself is ~2% smaller. pgbench_branches is ~30% smaller, and pgbench_tellers is 60% smaller. Of course, the smaller tables were already very small, so maybe that isn't important. 
I think that this is due to more effective pruning, possibly because we get better lock arbitration as a consequence of better splits, and also because duplicates are in heap TID order. I haven't observed this effect with larger databases, which have been my focus. It isn't weird that shared_buffers doesn't have all the pgbench_accounts blocks, since, of course, this is highly skewed by design -- most blocks were never accessed from the table. This effect seems to be robust, at least with this workload. The second round of benchmarks (which have their own pgbench -i initialization) show very similar amounts of bloat at the same point. It may not be that significant, but it's also not a fluke. -- Peter Geoghegan
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On Mon, Mar 18, 2019 at 10:17 AM Peter Geoghegan wrote: > The big difference is that you make the possible call to > _bt_stepright() conditional on this being a checkingunique index -- > the duplicate code is indented in that branch of _bt_findsplitloc(). > Whereas I break early in the loop when "checkingunique && > heapkeyspace". Heikki and I discussed this issue privately, over IM, and reached final agreement on remaining loose ends. I'm going to use his code for _bt_findsplitloc(). Plan to push a final version of the first four patches tomorrow morning PST. -- Peter Geoghegan
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On Tue, Mar 19, 2019 at 4:15 PM Peter Geoghegan wrote: > Heikki and I discussed this issue privately, over IM, and reached > final agreement on remaining loose ends. I'm going to use his code for > _bt_findsplitloc(). Plan to push a final version of the first four > patches tomorrow morning PST. I've committed the first 4 patches. Many thanks to Heikki for his very valuable help! Thanks also to the other reviewers. I'll likely push the remaining two patches on Sunday or Monday. -- Peter Geoghegan
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On Thu, Mar 21, 2019 at 10:28 AM Peter Geoghegan wrote:
> I've committed the first 4 patches. Many thanks to Heikki for his very
> valuable help! Thanks also to the other reviewers.
>
> I'll likely push the remaining two patches on Sunday or Monday.

I noticed that if I initdb and run "make installcheck" with and without the "split after new tuple" optimization patch, the largest system catalog indexes shrink quite noticeably:

Master
======
pg_depend_depender_index          1456 kB
pg_depend_reference_index         1416 kB
pg_class_tblspc_relfilenode_index  224 kB

Patch
=====
pg_depend_depender_index          1088 kB  -- ~25% smaller
pg_depend_reference_index         1136 kB  -- ~20% smaller
pg_class_tblspc_relfilenode_index  160 kB  -- 28% smaller

This is interesting to me because it is further evidence that the problem that the patch targets is reasonably common. It's also interesting to me because we benefit despite the fact that there are a lot of duplicates in parts of these indexes; we vary our strategy at different parts of the key space, which works well. We pack pages tightly where they're full of duplicates, using the "single value" strategy that I've already committed, whereas we apply the "split after new tuple" optimization in parts of the index with localized monotonically increasing insertions. If there were no duplicates in the indexes, then they'd be about 40% smaller, which is exactly what we see with the TPC-C indexes (they're all unique indexes, with very few physical duplicates). Looks like the duplicates are mostly bootstrap mode entries. Lots of the pg_depend_depender_index duplicates look like "(classid, objid, objsubid)=(0, 0, 0)", for example.

I also noticed one further difference: the pg_shdepend_depender_index index grew from 40 kB to 48 kB. I guess that might count as a regression, though I'm not sure that it should. I think that we would do better if the volume of data in the underlying table was greater.
contrib/pageinspect shows that a small number of the leaf pages in the improved cases are not very filled at all, which is more than made up for by the fact that many more pages are packed as if they were created by a rightmost split (262 items of 24 byte tuples is exactly consistent with that). IOW, I suspect that the extra page in pg_shdepend_depender_index is due to a "local minimum". -- Peter Geoghegan
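The effect being described here can be sketched with a toy simulation (illustrative only -- PAGE_CAPACITY and the pure-ascending workload are assumptions for the sketch, not nbtree behavior): a 50/50 split leaves every left page half empty under localized ascending insertions, while splitting right after the new item packs pages as if they came from rightmost splits.

```python
# Toy model of leaf page splits under monotonically increasing
# insertions. Not nbtree code; just the shape of the argument.

PAGE_CAPACITY = 100  # tuples per leaf page, an arbitrary assumption

def simulate(n_tuples, split_after_new_item):
    pages = [0]  # tuple counts; ascending insertions always hit the last page
    for _ in range(n_tuples):
        if pages[-1] == PAGE_CAPACITY:
            if split_after_new_item:
                # Leave the existing page full, start a fresh right page.
                pages.append(0)
            else:
                # 50/50 split: half the tuples move to the new right page.
                pages[-1] = PAGE_CAPACITY // 2
                pages.append(PAGE_CAPACITY // 2)
        pages[-1] += 1
    return len(pages)

# With ascending insertions the smarter split roughly halves the index:
print(simulate(10_000, split_after_new_item=False))  # 199 pages, ~50% full
print(simulate(10_000, split_after_new_item=True))   # 100 pages, fully packed
```

This is the ~40% size difference the unique TPC-C indexes show, in miniature; real workloads mix duplicate-heavy and ascending regions, which is why the strategy has to vary across the key space.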
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On Fri, Mar 22, 2019 at 2:15 PM Peter Geoghegan wrote:
> On Thu, Mar 21, 2019 at 10:28 AM Peter Geoghegan wrote:
> > I'll likely push the remaining two patches on Sunday or Monday.
>
> I noticed that if I initdb and run "make installcheck" with and
> without the "split after new tuple" optimization patch, the largest
> system catalog indexes shrink quite noticeably:

I pushed this final patch a week ago, as commit f21668f3, concluding work on integrating the patch series. I have some closing thoughts with which I'd like to close out the project.

I was casually discussing this project over IM with Robert the other day. I was asked a question I'd often asked myself about the "split after new item" heuristics: What if you're wrong? What if some "black swan" type workload fools your heuristics into bloating an index uncontrollably?

I gave an answer to his question that may have seemed kind of inscrutable. My intuition about the worst case for the heuristics is based on its similarity to the worst case for quicksort. Any real-world instance of quicksort going quadratic is essentially a case where we *consistently* do the wrong thing when selecting a pivot. A random pivot selection will still perform reasonably well, because we'll still choose the median pivot on average. A malicious actor will always be able to fool any quicksort implementation into going quadratic [1] in certain circumstances. We're defending against Murphy, not Machiavelli, though, so that's okay.

I think that I can produce a more tangible argument than this, though. The attached patch removes every heuristic that limits the application of the "split after new item" optimization (it doesn't force the optimization in the case of rightmost splits, or in the case where the new item happens to be first on the page, since the caller isn't prepared for that). This is an attempt to come up with a wildly exaggerated worst case. Nevertheless, the consequences are not actually all that bad.
Summary:

* The "UK land registry" test case that I leaned on a lot for the patch has a final index that's about 1% larger. However, it was about 16% smaller compared to Postgres without the patch, so this is not a problem.

* Most of the TPC-C indexes are actually slightly smaller, because we didn't quite go as far as we could have (TPC-C strongly rewards this optimization). 8 out of the 10 indexes are either smaller or unchanged. The customer name index is about 28% larger, though. The oorder table index is also about 28% larger.

* TPC-E never benefits from the "split after new item" optimization, and yet the picture isn't so bad here either. The holding history PK is about 40% bigger, which is quite bad, and the biggest regression overall. However, in other affected cases indexes are about 15% larger, which is not that bad.

Also attached are the regressions from my test suite in the form of diff files -- these are the full details of the regression, just in case that's interesting to somebody.

This isn't the final word. I'm not asking anybody to accept with total certainty that there can never be a "black swan" workload that the heuristics consistently mishandle, leading to pathological performance. However, I think it's fair to say that the risk of that happening has been managed well. The attached test patch literally removes any restraint on applying the optimization, and yet we arguably do no worse than Postgres 11 would overall.

Once again, I would like to thank my collaborators for all their help, especially Heikki.

[1] https://www.cs.dartmouth.edu/~doug/mdmspe.pdf

-- Peter Geoghegan

Attachments: always-split-after-new-item.patch, land_balanced.diff, tpcc_balanced.diff, tpce_balanced.diff
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On 09/01/2019 02:47, Peter Geoghegan wrote:
> On Fri, Dec 28, 2018 at 3:32 PM Peter Geoghegan wrote:
>> On Fri, Dec 28, 2018 at 3:20 PM Heikki Linnakangas wrote:
>>> I'm envisioning that you have an array, with one element for each
>>> item on the page (including the tuple we're inserting, which isn't
>>> really on the page yet). In the first pass, you count up from left
>>> to right, filling the array. Next, you compute the complete
>>> penalties, starting from the middle, walking outwards.
>>
>> Ah, right. I think I see what you mean now. Leave it with me. I'll
>> need to think about this some more.
>
> Attached is v10 of the patch series, which has many changes based on
> your feedback. However, I didn't end up refactoring _bt_findsplitloc()
> in the way you described, because it seemed hard to balance all of
> the concerns there. I still have an open mind on this question, and
> recognize the merit in what you suggested. Perhaps it's possible to
> reach a compromise here.

I spent some time first trying to understand the current algorithm, and then rewriting it in a way that I find easier to understand. I came up with the attached. I think it optimizes for the same goals as your patch, but the approach is quite different. At a very high level, I believe the goals can be described as:

1. Find out how much suffix truncation is possible, i.e. how many key columns can be truncated away, in the best case, among all possible ways to split the page.

2. Among all the splits that achieve that optimum suffix truncation, find the one with smallest "delta".

For performance reasons, it doesn't actually do it in that order. It's more like this:

1. First, scan all split positions, recording the 'leftfree' and 'rightfree' at every valid split position. The array of possible splits is kept in order by offset number. (This scans through all items, but the math is simple, so it's pretty fast)

2. Compute the optimum suffix truncation, by comparing the leftmost and rightmost keys, among all the possible split positions.

3. Split the array of possible splits in half, and process both halves recursively. The recursive process "zooms in" to the place where we'd expect to find the best candidate, but will ultimately scan through all split candidates, if no "good enough" match is found.

I've only been testing this on leaf splits. I didn't understand how the penalty worked for internal pages in your patch. In this version, the same algorithm is used for leaf and internal pages.

I'm sure this still has bugs in it, and could use some polishing, but I think this will be a more readable way of doing it.

What have you been using to test this? I wrote the attached little test extension, to explore what _bt_findsplitloc() decides in different scenarios. It's pretty rough, but that's what I've been using while hacking on this. It prints output like this:

postgres=# select test_split();
NOTICE:  test 1:
  left    2/358: 1 0
  left  358/358: 1 356
  right   1/ 51: 1 357
  right  51/ 51: 1 407 SPLIT TUPLE
  split ratio: 12/87
NOTICE:  test 2:
  left    2/358: 0 0
  left  358/358: 356 356
  right   1/ 51: 357 357
  right  51/ 51: 407 407 SPLIT TUPLE
  split ratio: 12/87
NOTICE:  test 3:
  left    2/358: 0 0
  left  320/358: 10 10 SPLIT TUPLE
  left  358/358: 48 48
  right   1/ 51: 49 49
  right  51/ 51: 99 99
  split ratio: 12/87
NOTICE:  test 4:
  left    2/309: 1 100
  left  309/309: 1 407 SPLIT TUPLE
  right   1/100: 2 0
  right 100/100: 2 99
  split ratio: 24/75

Each test consists of creating a temp table with one index, and inserting rows in a certain pattern, until the root page splits. It then prints the first and last tuples on both pages, after the split, as well as the tuple that caused the split. I don't know if this is useful to anyone but myself, but I thought I'd share it.

- Heikki

/*-------------------------------------------------------------------------
 *
 * nbtsplitloc.c
 *	  Choose split point code for Postgres btree implementation.
 *
 * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
 * Portions Copyright (c) 1994, Regents of the University of California
 *
 * IDENTIFICATION
 *	  src/backend/access/nbtree/nbtsplitloc.c
 *
 *-------------------------------------------------------------------------
 */
#include "postgres.h"

#include "access/nbtree.h"
#include "storage/lmgr.h"

typedef struct
{
	/* FindSplitData candidate split */
	int			leftfree;
	int			rightfree;
	OffsetNumber firstoldonright;
	bool		newitemonleft;
	IndexTuple	lastleft_tuple;
	IndexTuple	firstright_tuple;
} SplitPoint;

typedef struct
{
	/* context data for _bt_checksplitloc */
	Relation	rel;
	bool		is_leaf;		/* T if splitting a leaf page */
	OffsetNumber newitemoff;	/* where the new item is to be inserted */
	int			leftspace;		/* space available for items on left page */
	int			rightspace;		/* space available for items on right page */
	int			dataitemstotal;	/* space taken by all items, old and new */
	int			ncandidates;	/* current numbe
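Step 1 of the approach Heikki outlines above -- one left-to-right pass recording 'leftfree' and 'rightfree' at every valid split position -- can be sketched like this (Python with made-up item sizes; the real code accounts for page headers, the high key, and the incoming tuple's position, all omitted here):

```python
# One-pass enumeration of valid split candidates. The math per item is
# a running total, which is why the full scan is cheap.

PAGE_SIZE = 8192  # assumed usable space per page, for illustration

def record_split_candidates(item_sizes):
    """item_sizes: byte sizes of all items, including the incoming one.
    Returns (offset, leftfree, rightfree) for every valid split, in
    offset order: items[:offset] go left, items[offset:] go right."""
    total = sum(item_sizes)
    candidates = []
    left = 0
    for off in range(1, len(item_sizes)):
        left += item_sizes[off - 1]           # running total: simple math
        leftfree = PAGE_SIZE - left
        rightfree = PAGE_SIZE - (total - left)
        if leftfree >= 0 and rightfree >= 0:  # both halves must fit
            candidates.append((off, leftfree, rightfree))
    return candidates

# 20 items of 700 bytes each: only offsets 9..11 leave both halves
# within the 11-items-per-page limit (8192 // 700 == 11).
cands = record_split_candidates([700] * 20)
print([c[0] for c in cands])  # [9, 10, 11]
```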
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On Mon, Jan 28, 2019 at 7:32 AM Heikki Linnakangas wrote: > I spent some time first trying to understand the current algorithm, and > then rewriting it in a way that I find easier to understand. I came up > with the attached. I think it optimizes for the same goals as your > patch, but the approach is quite different. At a very high level, I > believe the goals can be described as: > > 1. Find out how much suffix truncation is possible, i.e. how many key > columns can be truncated away, in the best case, among all possible ways > to split the page. > > 2. Among all the splits that achieve that optimum suffix truncation, > find the one with smallest "delta". Thanks for going to the trouble of implementing what you have in mind! I agree that the code that I wrote within nbtsplitloc.c is very subtle, and I do think that I have further work to do to make its design clearer. I think that this high level description of the goals of the algorithm is inaccurate in subtle but important ways, though. Hopefully there will be a way of making it more understandable that preserves certain important characteristics. If you had my test cases/data that would probably help you a lot (more on that later). The algorithm I came up with almost always does these two things in the opposite order, with each considered in clearly separate phases. There are good reasons for this. We start with the same criteria as the old algorithm. We assemble a small array of candidate split points, rather than one split point, but otherwise it's almost exactly the same (the array is sorted by delta). Then, at the very end, we iterate through the small array to find the best choice for suffix truncation. IOW, we only consider suffix truncation as a *secondary* goal. The delta is still by far the most important thing 99%+ of the time. 
I assume it's fairly rare to not have two distinct tuples within 9 or so tuples of the delta-wise optimal split position -- 99% is probably a low estimate, at least in OLTP apps, or within unique indexes. I see that you do something with a "good enough" delta that seems like it also makes delta the most important thing, but that doesn't appear to be, uh, good enough. ;-)

Now, it's true that my approach does occasionally work in a way close to what you describe above -- it does this when we give up on default mode and check "how much suffix truncation is possible?" exhaustively, for every possible candidate split point. "Many duplicates" mode kicks in when we need to be aggressive about suffix truncation. Even then, the exact goals are different to what you have in mind in subtle but important ways. While "truncating away the heap TID" isn't really a special case in other places, it is a special case for my version of nbtsplitloc.c, which more or less avoids it at all costs. Truncating away heap TID is more important than truncating away any other attribute by a *huge* margin. Many duplicates mode *only* specifically cares about truncating the final TID attribute. That is the only thing that is ever treated as more important than delta, though even there we don't forget about delta entirely. That is, we assume that the "perfect penalty" is nkeyatts when in many duplicates mode, because we don't care about suffix truncation beyond heap TID truncation by then. So, if we find 5 split points out of 250 in the final array that avoid appending heap TID, we use the earliest/lowest delta out of those 5. We're not going to try to maximize the number of *additional* attributes that get truncated, because that can make the leaf pages unbalanced in an *unbounded* way. None of these 5 split points are "good enough", but the distinction between their deltas still matters a lot.
We strongly prefer a split with a *mediocre* delta to a split with a *terrible* delta -- a bigger high key is the least of our worries here. (I made similar mistakes myself months ago, BTW.) Your version of the algorithm makes a test case of mine (UK land registry test case [1]) go from having an index that's 1101 MB with my version of the algorithm/patch and 1329 MB on the master branch to an index that's 3030 MB in size. I think that this happens because it effectively fails to give any consideration to delta at all at certain points. On leaf pages with lots of unique keys, your algorithm does about as well as mine because all possible split points look equally good suffix-truncation-wise, plus you have the "good enough" test, so delta isn't ignored. I think that your algorithm also works well when there are many duplicates but only one non-TID index column, since the heap TID truncation versus other truncation issue does not arise. The test case I used is an index on "(county, city, locality)", though -- low cardinality, but more than a single column. That's a *combination* of two separate considerations, that seem to get conflated. I don't think that you can avoid doing "a second pass" in some sense, because these really are separate considerations.
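The delta-first, truncation-second order being described can be sketched as follows (a simplification with invented helper names, not nbtsplitloc.c itself; the real code also tracks page space accounting, newitemonleft, and the separate "many duplicates" mode):

```python
# Default-mode sketch: collect a handful of candidate split points with
# the smallest "delta" (distance from a balanced split), then among
# those prefer the candidate whose boundary tuples diverge earliest,
# i.e. the one that allows the most suffix truncation.

MAX_CANDIDATES = 9  # small array of near-balanced candidates, assumed

def choose_split(tuples, nkeyatts):
    """tuples: key-attribute tuples in index order. Returns the split
    offset: tuples[:off] go left, tuples[off:] go right."""
    middle = len(tuples) // 2
    # Phase one: candidates ranked purely by balance (smallest delta first).
    candidates = sorted(range(1, len(tuples)),
                        key=lambda off: abs(off - middle))[:MAX_CANDIDATES]

    def truncatable(off):
        # How many trailing attributes could be truncated from the new
        # pivot, given the tuples on either side of this split?
        lastleft, firstright = tuples[off - 1], tuples[off]
        for attr in range(nkeyatts):
            if lastleft[attr] != firstright[attr]:
                return nkeyatts - attr - 1
        return -1  # all attributes equal: a heap TID would be needed

    # Phase two: best truncation wins; ties go to the better delta,
    # because candidates are already in delta order and max() keeps
    # the first maximal element.
    return max(candidates, key=truncatable)

tuples = [('apple', i) for i in range(5)] + [('banana', i) for i in range(5)]
print(choose_split(tuples, 2))  # splits between 'apple' and 'banana'
```

Note how delta stays primary: a split point far from the middle is never even considered, no matter how much truncation it would allow -- which is the distinction being drawn against a truncation-first algorithm.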
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On Mon, Jan 28, 2019 at 1:41 PM Peter Geoghegan wrote: > Thanks for going to the trouble of implementing what you have in mind! > > I agree that the code that I wrote within nbtsplitloc.c is very > subtle, and I do think that I have further work to do to make its > design clearer. I think that this high level description of the goals > of the algorithm is inaccurate in subtle but important ways, though. > Hopefully there will be a way of making it more understandable that > preserves certain important characteristics. Heikki and I had the opportunity to talk about this recently. We found an easy way forward. I believe that the nbtsplitloc.c algorithm itself is fine -- the code will need to be refactored, though. nbtsplitloc.c can be refactored to assemble a list of legal split points up front, before deciding which one to go with in a separate pass (using one of two "alternative modes", as before). I now understand that Heikki simply wants to separate the questions of "Is this candidate split point legal?" from "Is this known-legal candidate split point good/ideal based on my current criteria?". This seems like a worthwhile goal to me. Heikki accepts the need for multiple modes/passes, provided recursion isn't used in the implementation. It's clear to me that the algorithm should start off trying to split towards the middle of the page (or towards the end in the rightmost case), while possibly making a small compromise on the exact split point to maximize the effectiveness of suffix truncation. We must change strategy entirely if and only if the middle of the page (or wherever we'd like to split initially) is found to be completely full of duplicates -- that's where the need for a second pass comes in. This should almost never happen in most applications. Even when it happens, we only care about not splitting inside a group of duplicates. That's not the same thing as caring about maximizing the number of attributes truncated away. 
Those two things seem similar, but are actually very different. It might have sounded like Heikki and I disagreed on the design of the algorithm at a high level, or what its goals ought to be. That is not the case, though. (At least not so far.) -- Peter Geoghegan
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On Mon, Feb 11, 2019 at 12:54 PM Peter Geoghegan wrote: > Notable improvements in v12: I've been benchmarking v12, once again using a slightly modified BenchmarkSQL that doesn't do up-front CREATE INDEX builds [1], since the problems with index bloat don't take so long to manifest themselves when the indexes are inserted into incrementally from the very beginning. This benchmarking process took over 20 hours, with a database that started off at about 90GB (700 TPC-C/BenchmarkSQL warehouses were used). That easily exceeded available main memory on my test server, which was 32GB. This is a pretty I/O bound workload, and a fairly write-heavy one at that. I used a Samsung 970 PRO 512GB, NVMe PCIe M.2 2280 SSD for both pg_wal and the default and only tablespace. Importantly, I figured out that I should disable both hash joins and merge joins with BenchmarkSQL, in order to force all joins to be nested loop joins. Otherwise, the "stock level" transaction eventually starts to use a hash join, even though that's about 10x slower than a nestloop join (~4ms vs. ~40ms on this machine) -- the hash join produces a lot of noise without really testing anything. It usually takes a couple of hours before we start to get obviously-bad plans, but it also usually takes about that long until the patch series starts to noticeably overtake the master branch. I don't think that TPC-C will ever benefit from using a hash join or a merge join, since it's supposed to be a pure OLTP benchmark, and is a benchmark that MySQL is known to do at least respectably-well on. This is the first benchmark I've published that was considerably I/O bound. There are significant improvements in performance across the board, on every measure, though it takes several hours for that to really show. The benchmark was not rate-limited. 16 clients/"terminals" are used throughout. There were 5 runs for master and 5 for patch, interlaced, each lasting 2 hours. 
Initialization occurred once, so it's expected that both databases will gradually get larger across runs.

Summary (appears in same order as the execution of each run) -- each run is 2 hours, so 20 hours total excluding initial load time (2 hours * 5 runs for master + 2 hours * 5 runs for patch):

Run 1 -- master: Measured tpmTOTAL = 90063.79, Measured tpmC (NewOrders) = 39172.37
Run 1 -- patch:  Measured tpmTOTAL = 90922.63, Measured tpmC (NewOrders) = 39530.2
Run 2 -- master: Measured tpmTOTAL = 77091.63, Measured tpmC (NewOrders) = 33530.66
Run 2 -- patch:  Measured tpmTOTAL = 83905.48, Measured tpmC (NewOrders) = 36508.38  <-- 8.8% increase in tpmTOTAL/throughput
Run 3 -- master: Measured tpmTOTAL = 71224.25, Measured tpmC (NewOrders) = 30949.24
Run 3 -- patch:  Measured tpmTOTAL = 78268.29, Measured tpmC (NewOrders) = 34021.98  <-- 9.8% increase in tpmTOTAL/throughput
Run 4 -- master: Measured tpmTOTAL = 71671.96, Measured tpmC (NewOrders) = 31163.29
Run 4 -- patch:  Measured tpmTOTAL = 73097.42, Measured tpmC (NewOrders) = 31793.99
Run 5 -- master: Measured tpmTOTAL = 66503.38, Measured tpmC (NewOrders) = 28908.8
Run 5 -- patch:  Measured tpmTOTAL = 71072.3,  Measured tpmC (NewOrders) = 30885.56  <-- 6.9% increase in tpmTOTAL/throughput

There were *also* significant reductions in transaction latency for the patch -- see the full html reports in the provided tar archive for full details (URL provided below). The html reports have nice SVG graphs, generated by BenchmarkSQL using R -- one for transaction throughput, and another for transaction latency. The overall picture is that the patched version starts out ahead, and has a much more gradual decline as the database becomes larger and more bloated. Note also that the statistics collector stats show a *big* reduction in blocks read into shared_buffers for the duration of these runs.
For example, here is what pg_stat_database shows for run 3 (I reset the stats between runs):

master: blks_read = 78,412,640, blks_hit = 4,022,619,556
patch:  blks_read = 70,033,583, blks_hit = 4,505,308,517  <-- 10.7% reduction in blks_read/logical I/O

This suggests an indirect benefit, likely related to how buffers are evicted in each case. pg_stat_bgwriter indicates that more buffers are written out during checkpoints, while fewer are written out by backends. I won't speculate further on what all of this means right now, though. You can find the raw details for blks_read for each and every run in the full tar archive. It is available for download from:

https://drive.google.com/file/d/1kN4fDmh1a9jtOj8URPrnGYAmuMPmcZax/view?usp=sharing

There are also dumps of the other pg_stat* views at the end of each run, logs for each run, etc. There's more information than anybody else is likely to find interesting. If anyone needs help in recreating this benchmark, then I'd be happy to assist in that. There is a shell script (zsh) included in the tar archive, although that will need to be changed a bit to point to the correct installations and so on.

Independent validation
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On Thu, Jun 14, 2018 at 2:44 PM, Peter Geoghegan wrote: > I've been thinking about using heap TID as a tie-breaker when > comparing B-Tree index tuples for a while now [1]. I'd like to make > all tuples at the leaf level unique, as assumed by L&Y. This can > enable "retail index tuple deletion", which I think we'll probably end > up implementing in some form or another, possibly as part of the zheap > project. It's also possible that this work will facilitate GIN-style > deduplication based on run length encoding of TIDs, or storing > versioned heap TIDs in an out-of-line nbtree-versioning structure > (unique indexes only). I can see many possibilities, but we have to > start somewhere. Yes, retail index deletion is essential for the delete-marking approach that is proposed for zheap. It could also be extremely useful in some workloads with the regular heap. If the indexes are large -- say, 100GB -- and the number of tuples that vacuum needs to kill is small -- say, 5 -- scanning them all to remove the references to those tuples is really inefficient. If we had retail index deletion, then we could make a cost-based decision about which approach to use in a particular case. > mind now, while it's still swapped into my head. I won't do any > serious work on this project unless and until I see a way to implement > retail index tuple deletion, which seems like a multi-year project > that requires the buy-in of multiple senior community members. Can you enumerate some of the technical obstacles that you see? > On its > own, my patch regresses performance unacceptably in some workloads, > probably due to interactions with kill_prior_tuple()/LP_DEAD hint > setting, and interactions with page space management when there are > many "duplicates" (it can still help performance in some pgbench > workloads with non-unique indexes, though). I think it would be helpful if you could talk more about these regressions (and the wins). 
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
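The cost-based decision Robert describes can be illustrated with a toy model (the model and all the numbers are assumptions for illustration, not anything in PostgreSQL): a full index scan reads every index page once, while retail deletion costs roughly one B-Tree descent per dead tuple.

```python
# Toy cost model, measured in page reads, for choosing between
# vacuuming an index by full scan and retail index tuple deletion.

def cheaper_strategy(index_pages, dead_tuples, tree_height):
    # Full-scan vacuum touches every index page once.
    full_scan_cost = index_pages
    # Retail deletion does one root-to-leaf descent per dead tuple.
    retail_cost = dead_tuples * tree_height
    return "retail" if retail_cost < full_scan_cost else "full-scan"

# The 100GB / 5 dead tuples case: ~13M 8kB pages, height 4.
print(cheaper_strategy(13_000_000, 5, 4))           # retail, by a huge margin
# With millions of dead tuples, scanning the whole index wins instead.
print(cheaper_strategy(13_000_000, 10_000_000, 4))  # full-scan
```

A real planner-style decision would also weigh WAL volume, lock traffic, and the fact that retail deletion requires the unique-key property this thread is about -- but the asymmetry is already stark in the simple model.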
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On Fri, Jun 15, 2018 at 2:36 PM, Robert Haas wrote:
> Yes, retail index deletion is essential for the delete-marking
> approach that is proposed for zheap.

Makes sense. I don't know that much about zheap. I'm sure that retail index tuple deletion is really important in general, though. The Gray & Reuter book treats unique keys as a basic assumption, as do other authoritative reference works and papers. Other database systems probably make unique indexes simply use the user-visible attributes as unique values, but appending heap TID as a unique-ifier is probably a reasonably common design for secondary indexes (it would also be nice if we could simply not store duplicates for unique indexes, rather than using heap TID). I generally have a very high opinion of the nbtree code, but this seems like a problem that ought to be fixed.

I've convinced myself that I basically have the right idea with this patch, because the classic L&Y invariants have all been tested with an enhanced amcheck run against all indexes in a regression test database. There was other stress-testing, too. The remaining problems are fixable, but I need some guidance.

> It could also be extremely useful in some workloads with the regular
> heap. If the indexes are large -- say, 100GB -- and the number of
> tuples that vacuum needs to kill is small -- say, 5 -- scanning them
> all to remove the references to those tuples is really inefficient.
> If we had retail index deletion, then we could make a cost-based
> decision about which approach to use in a particular case.

I remember talking to Andres about this in a bar 3 years ago. I can imagine variations of pruning that do some amount of this when there are lots of duplicates. Perhaps something like InnoDB's purge threads, which do things like in-place deletes of secondary indexes after an updating (or deleting) xact commits. I believe that that mechanism targets secondary indexes specifically, and that it operates quite eagerly.
> Can you enumerate some of the technical obstacles that you see? The #1 technical obstacle is that I simply don't know where I should try to take this patch, given that it probably needs to be tied to some much bigger project, such as zheap. I have an open mind, though, and intend to help if I can. I'm not really sure what the #2 and #3 problems are, because I'd need to be able to see a few steps ahead to be sure. Maybe #2 is that I'm doing something wonky to avoid breaking duplicate checking for unique indexes. (The way that unique duplicate checking has always worked [1] is perhaps questionable, though.) > I think it would be helpful if you could talk more about these > regressions (and the wins). I think that the performance regressions are due to the fact that when you have a huge number of duplicates today, it's useful to be able to claim space to fit further duplicates from almost any of the multiple leaf pages that contain or have contained duplicates. I'd hoped that the increased temporal locality that the patch gets would more than make up for that. As far as I can tell, the problem is that temporal locality doesn't help enough. I saw that performance was somewhat improved with extreme Zipf distribution contention, but it went the other way with less extreme contention. The details are not that fresh in my mind, since I shelved this patch for a while following limited performance testing. The code could certainly use more performance testing, and more general polishing. I'm not strongly motivated to do that right now, because I don't quite see a clear path to making this patch useful. But, as I said, I have an open mind about what the next step should be. [1] https://wiki.postgresql.org/wiki/Key_normalization#Avoiding_unnecessary_unique_index_enforcement -- Peter Geoghegan
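The basic tie-breaker rule under discussion -- heap TID acting as an implicit last key attribute -- can be sketched like this (illustrative Python, not the actual _bt_compare() code):

```python
import functools

# Index tuples compare by their key attributes first, and by heap TID
# (block number, offset number) only when all keys are equal, so every
# leaf tuple occupies a unique position in the key space.

def compare_tuples(a_keys, a_tid, b_keys, b_tid):
    """Return -1/0/+1, treating heap TID as the last key attribute."""
    for a, b in zip(a_keys, b_keys):
        if a < b:
            return -1
        if a > b:
            return 1
    # All user-visible attributes equal: heap TID breaks the tie.
    if a_tid < b_tid:
        return -1
    if a_tid > b_tid:
        return 1
    return 0

# Duplicates of ('smith',) end up stored in heap TID order, which is
# what makes a retail ("fetch exactly this tuple") deletion possible:
dups = [(('smith',), (7, 2)), (('smith',), (3, 1)), (('smith',), (3, 4))]
dups.sort(key=functools.cmp_to_key(
    lambda x, y: compare_tuples(x[0], x[1], y[0], y[1])))
# dups is now ordered by TID: (3, 1), (3, 4), (7, 2)
```

With this ordering, deleting one known-dead tuple is a single descent to a unique key, instead of a walk across every leaf page that might hold a matching duplicate.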
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On Fri, Jun 15, 2018 at 8:47 PM Peter Geoghegan wrote: > > I think it would be helpful if you could talk more about these > > regressions (and the wins). > > I think that the performance regressions are due to the fact that when > you have a huge number of duplicates today, it's useful to be able to > claim space to fit further duplicates from almost any of the multiple > leaf pages that contain or have contained duplicates. I'd hoped that > the increased temporal locality that the patch gets would more than > make up for that. As far as I can tell, the problem is that temporal > locality doesn't help enough. I saw that performance was somewhat > improved with extreme Zipf distribution contention, but it went the > other way with less extreme contention. The details are not that fresh > in my mind, since I shelved this patch for a while following limited > performance testing. > > The code could certainly use more performance testing, and more > general polishing. I'm not strongly motivated to do that right now, > because I don't quite see a clear path to making this patch useful. > But, as I said, I have an open mind about what the next step should > be. Way back when I was dabbling in this kind of endeavor, my main idea to counteract that, and possibly improve performance overall, was a microvacuum kind of thing that would do some on-demand cleanup to remove duplicates or make room before page splits. Since nbtree uniqueification enables efficient retail deletions, that could end up as a net win. I never got around to implementing it though, and it does get tricky if you don't want to allow unbounded latency spikes.
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On Mon, Jun 18, 2018 at 7:57 AM, Claudio Freire wrote: > Way back when I was dabbling in this kind of endeavor, my main idea to > counteract that, and possibly improve performance overall, was a > microvacuum kind of thing that would do some on-demand cleanup to > remove duplicates or make room before page splits. Since nbtree > uniqueification enables efficient retail deletions, that could end up > as a net win. That sounds like a mechanism that works a bit like _bt_vacuum_one_page(), which we run at the last second before a page split. We do this to see if a page split that looks necessary can actually be avoided. I imagine that retail index tuple deletion (the whole point of this project) would be run by a VACUUM-like process that kills tuples that are dead to everyone. Even with something like zheap, you cannot just delete index tuples until you establish that they're truly dead. I guess that the delete marking stuff that Robert mentioned marks tuples as dead when the deleting transaction commits. Maybe we could justify having _bt_vacuum_one_page() do cleanup of those tuples (i.e. check if they're visible to anyone, and if not recycle), because we at least know that the deleting transaction committed there. That is, they could be recently dead or dead, and it may be worth going to the extra trouble of checking which when we know that it's one of the two possibilities. -- Peter Geoghegan
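The last-second cleanup that _bt_vacuum_one_page() performs can be pictured with a toy model (hypothetical Python, not PostgreSQL's C code; `Leaf`, `PAGE_CAPACITY`, and the dead flags are all invented for illustration):

```python
# Toy model of the "microvacuum before split" idea: when a leaf page looks
# full, first discard items already marked dead (like LP_DEAD hints), and
# only split if that frees no room. Not PostgreSQL code.

PAGE_CAPACITY = 4  # items per page, tiny for illustration


class Leaf:
    def __init__(self):
        self.items = []   # list of (key, is_dead)
        self.splits = 0

    def insert(self, key):
        if len(self.items) >= PAGE_CAPACITY:
            # last-second cleanup: drop items already marked dead
            self.items = [(k, d) for (k, d) in self.items if not d]
        if len(self.items) >= PAGE_CAPACITY:
            # still no room: a real split would move half to a new page
            self.splits += 1
            self.items = self.items[: PAGE_CAPACITY // 2]
        self.items.append((key, False))
        self.items.sort()


leaf = Leaf()
for k in range(4):
    leaf.insert(k)
# mark two items dead, as kill_prior_tuple / LP_DEAD hinting might
leaf.items = [(k, k % 2 == 0) for (k, _) in leaf.items]
leaf.insert(99)
print(leaf.splits, len(leaf.items))  # 0 3: cleanup made room, no split
```

The point is only the ordering of steps: try reclaiming already-dead items first, and treat the page split as the last resort.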
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On Mon, Jun 18, 2018 at 2:03 PM Peter Geoghegan wrote: > > On Mon, Jun 18, 2018 at 7:57 AM, Claudio Freire > wrote: > > Way back when I was dabbling in this kind of endeavor, my main idea to > > counteract that, and possibly improve performance overall, was a > > microvacuum kind of thing that would do some on-demand cleanup to > > remove duplicates or make room before page splits. Since nbtree > > uniqueification enables efficient retail deletions, that could end up > > as a net win. > > That sounds like a mechanism that works a bit like > _bt_vacuum_one_page(), which we run at the last second before a page > split. We do this to see if a page split that looks necessary can > actually be avoided. > > I imagine that retail index tuple deletion (the whole point of this > project) would be run by a VACUUM-like process that kills tuples that > are dead to everyone. Even with something like zheap, you cannot just > delete index tuples until you establish that they're truly dead. I > guess that the delete marking stuff that Robert mentioned marks tuples > as dead when the deleting transaction commits. Maybe we could justify > having _bt_vacuum_one_page() do cleanup to those tuples (i.e. check if > they're visible to anyone, and if not recycle), because we at least > know that the deleting transaction committed there. That is, they > could be recently dead or dead, and it may be worth going to the extra > trouble of checking which when we know that it's one of the two > possibilities. Yes, but currently bt_vacuum_one_page does local work on the pinned page. Doing dead tuple deletion however involves reading the heap to check visibility at the very least, and doing it on the whole page might involve several heap fetches, so it's an order of magnitude heavier if done naively. But the idea is to do just that, only not naively.
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On Mon, Jun 18, 2018 at 10:33 PM, Peter Geoghegan wrote: > On Mon, Jun 18, 2018 at 7:57 AM, Claudio Freire > wrote: >> Way back when I was dabbling in this kind of endeavor, my main idea to >> counteract that, and possibly improve performance overall, was a >> microvacuum kind of thing that would do some on-demand cleanup to >> remove duplicates or make room before page splits. Since nbtree >> uniqueification enables efficient retail deletions, that could end up >> as a net win. > > That sounds like a mechanism that works a bit like > _bt_vacuum_one_page(), which we run at the last second before a page > split. We do this to see if a page split that looks necessary can > actually be avoided. > > I imagine that retail index tuple deletion (the whole point of this > project) would be run by a VACUUM-like process that kills tuples that > are dead to everyone. Even with something like zheap, you cannot just > delete index tuples until you establish that they're truly dead. I > guess that the delete marking stuff that Robert mentioned marks tuples > as dead when the deleting transaction commits. > No, I don't think that is the case because we want to perform in-place updates for indexed-column-updates. If we won't delete-mark the index tuple before performing in-place update, then we will have two tuples in the index which point to the same heap-TID. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On Tue, Jun 19, 2018 at 4:03 AM, Amit Kapila wrote: >> I imagine that retail index tuple deletion (the whole point of this >> project) would be run by a VACUUM-like process that kills tuples that >> are dead to everyone. Even with something like zheap, you cannot just >> delete index tuples until you establish that they're truly dead. I >> guess that the delete marking stuff that Robert mentioned marks tuples >> as dead when the deleting transaction commits. >> > > No, I don't think that is the case because we want to perform in-place > updates for indexed-column-updates. If we won't delete-mark the index > tuple before performing in-place update, then we will have two tuples > in the index which point to the same heap-TID. How can an old MVCC snapshot that needs to find the heap tuple using some now-obsolete key values get to the heap tuple via an index scan if there are no index tuples that stick around until "recently dead" heap tuples become "fully dead"? How can you avoid keeping around both old and new index tuples at the same time? -- Peter Geoghegan
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On Tue, Jun 19, 2018 at 11:13 PM, Peter Geoghegan wrote: > On Tue, Jun 19, 2018 at 4:03 AM, Amit Kapila wrote: >>> I imagine that retail index tuple deletion (the whole point of this >>> project) would be run by a VACUUM-like process that kills tuples that >>> are dead to everyone. Even with something like zheap, you cannot just >>> delete index tuples until you establish that they're truly dead. I >>> guess that the delete marking stuff that Robert mentioned marks tuples >>> as dead when the deleting transaction commits. >>> >> >> No, I don't think that is the case because we want to perform in-place >> updates for indexed-column-updates. If we won't delete-mark the index >> tuple before performing in-place update, then we will have two tuples >> in the index which point to the same heap-TID. > > How can an old MVCC snapshot that needs to find the heap tuple using > some now-obsolete key values get to the heap tuple via an index scan > if there are no index tuples that stick around until "recently dead" > heap tuples become "fully dead"? How can you avoid keeping around both > old and new index tuples at the same time? > Both values will be present in the index, but the old value will be delete-marked. It is correct that we can't remove the value (index tuple) from the index until it is truly dead (not visible to anyone), but during a delete or index-update operation, we need to traverse the index to mark the entries as delete-marked. See, at this stage, I don't want to go in too much detail discussion of how delete-marking will happen in zheap and also I am not sure this thread is the right place to discuss details of that technology. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On Tue, Jun 19, 2018 at 8:52 PM, Amit Kapila wrote: > Both values will be present in the index, but the old value will be > delete-marked. It is correct that we can't remove the value (index > tuple) from the index until it is truly dead (not visible to anyone), > but during a delete or index-update operation, we need to traverse the > index to mark the entries as delete-marked. See, at this stage, I > don't want to go in too much detail discussion of how delete-marking > will happen in zheap and also I am not sure this thread is the right > place to discuss details of that technology. I don't understand, but okay. I can provide feedback once a design for delete marking is available. -- Peter Geoghegan
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On Thu, Jun 14, 2018 at 11:44 AM, Peter Geoghegan wrote: > I attach an unfinished prototype of suffix truncation, that also > sometimes *adds* a new attribute in pivot tuples. It adds an extra > heap TID from the leaf level when truncating away non-distinguishing > attributes during a leaf page split, though only when it must. The > patch also has nbtree treat heap TID as a first class part of the key > space of the index. Claudio wrote a patch that did something similar, > though without the suffix truncation part [2] (I haven't studied his > patch, to be honest). My patch is actually a very indirect spin-off of > Anastasia's covering index patch, and I want to show what I have in > mind now, while it's still swapped into my head. I won't do any > serious work on this project unless and until I see a way to implement > retail index tuple deletion, which seems like a multi-year project > that requires the buy-in of multiple senior community members. On its > own, my patch regresses performance unacceptably in some workloads, > probably due to interactions with kill_prior_tuple()/LP_DEAD hint > setting, and interactions with page space management when there are > many "duplicates" (it can still help performance in some pgbench > workloads with non-unique indexes, though). I attach a revised version, which is still very much of prototype quality, but manages to solve a few of the problems that v1 had. Andrey Lepikhov (CC'd) asked me to post any improved version I might have for use with his retail index tuple deletion patch, so I thought I'd post what I have. The main development for v2 is that the sort order of the implicit heap TID attribute is flipped. In v1, it was in "ascending" order. In v2, comparisons of heap TIDs are inverted to make the attribute order "descending". This has a number of advantages: * It's almost consistent with the current behavior when there are repeated insertions of duplicates. 
Currently, this tends to result in page splits of the leftmost leaf page among pages that mostly consist of the same duplicated value. This means that the destabilizing impact on DROP SCHEMA ... CASCADE regression test output noted before [1] is totally eliminated. There is now only a single trivial change to regression test "expected" files, whereas in v1 dozens of "expected" files had to be changed, often resulting in less useful reports for the user. * The performance regression I observed with various pgbench workloads seems to have gone away, or is now within the noise range. A patch like this one requires a lot of validation and testing, so this should be taken with a grain of salt. I may have been too quick to give up on my original ambition of writing a stand-alone patch that can be justified entirely on its own merits, without being tied to some much more ambitious project like retail index tuple deletion by VACUUM, or zheap's deletion marking. I still haven't tried to replace the kludgey handling of unique index enforcement, even though that would probably have a measurable additional performance benefit. I think that this patch could become an unambiguous win. [1] https://postgr.es/m/CAH2-Wz=wAKwhv0PqEBFuK2_s8E60kZRMzDdyLi=-mvcm_ph...@mail.gmail.com -- Peter Geoghegan v2-0001-Ensure-nbtree-leaf-tuple-keys-are-always-unique.patch Description: Binary data
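The flipped ordering described above can be sketched roughly (hypothetical Python, not the patch's C; the TID representation here is invented): heap TID acts as a trailing key attribute whose comparisons are inverted, so among equal user keys, larger (newer) TIDs sort first.

```python
# Sketch of v2's "descending" heap TID tie-breaker: the TID participates
# as a trailing key attribute, but with comparisons inverted.

def tid_key(tid):
    block, offset = tid
    return (-block, -offset)      # inverted => descending TID order


def index_sort_key(user_key, heap_tid):
    return (user_key, tid_key(heap_tid))


tuples = [
    ("apple", (10, 3)),
    ("apple", (2, 1)),
    ("apple", (10, 1)),
    ("banana", (1, 1)),
]
tuples.sort(key=lambda t: index_sort_key(*t))
print(tuples)
# duplicates of "apple" appear with the newest (largest) TID first
```

With ascending TID order, a stream of newly inserted duplicates (which tend to have ever-larger TIDs) would land at the right end of the duplicate run; descending order keeps them at the left, closer to the current leftmost-leaf-page split behavior the email describes.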
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
> On Sun, Nov 25, 2018 at 12:14 AM Peter Geoghegan wrote: > > Attached is v8 of the patch series, which has some relatively minor changes: Thank you for working on this patch. Just for your information, cfbot says there are problems on Windows: src/backend/catalog/pg_depend.c(33): error C2065: 'INT32_MAX' : undeclared identifier
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On Sat, Dec 1, 2018 at 4:10 AM Dmitry Dolgov <9erthali...@gmail.com> wrote: > Just for the information, cfbot says there are problems on windows: > > src/backend/catalog/pg_depend.c(33): error C2065: 'INT32_MAX' : > undeclared identifier Thanks. Looks like I should have used PG_INT32_MAX. -- Peter Geoghegan
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On Mon, Dec 3, 2018 at 7:10 PM Peter Geoghegan wrote: > Attached is v9, which does things that way. There are no interesting > changes, though I have set things up so that a later patch in the > series can add "dynamic prefix truncation" -- I do not include any > such patch in v9, though. I'm going to start a new thread on that > topic, and include the patch there, since it's largely unrelated to > this work, and in any case still isn't in scope for Postgres 12 (the > patch is still experimental, for reasons that are of general > interest). The dynamic prefix truncation thread that I started: https://postgr.es/m/cah2-wzn_nayk4pr0hrwo0stwhmxjp5qyu+x8vppt030xpqr...@mail.gmail.com -- Peter Geoghegan
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On 04/12/2018 05:10, Peter Geoghegan wrote: Attached is v9, ... I spent some time reviewing this. I skipped the first patch, to add a column to pg_depend, and I got through patches 2, 3 and 4. Impressive results, and the code looks sane. I wrote a laundry list of little comments on minor things, suggested rewordings of comments etc. I hope they're useful, but feel free to ignore/override my opinions of any of those, as you see best. But first, a few slightly bigger (medium-sized?) issues that caught my eye: 1. How about doing the BTScanInsertData refactoring as a separate commit, first? It seems like a good thing for readability on its own, and would slim the big main patch. (And make sure to credit Andrey for that idea in the commit message.) 2. In the "Treat heap TID as part of the nbtree key space" patch: * Build an insertion scan key that contains comparison data from itup * as well as comparator routines appropriate to the key datatypes. * + * When itup is a non-pivot tuple, the returned insertion scan key is + * suitable for finding a place for it to go on the leaf level. When + * itup is a pivot tuple, the returned insertion scankey is suitable + * for locating the leaf page with the pivot as its high key (there + * must have been one like it at some point if the pivot tuple + * actually came from the tree). + * + * Note that we may occasionally have to share lock the metapage, in + * order to determine whether or not the keys in the index are expected + * to be unique (i.e. whether or not heap TID is treated as a tie-breaker + * attribute). Callers that cannot tolerate this can request that we + * assume that this is a heapkeyspace index. + * * The result is intended for use with _bt_compare(). */ -ScanKey -_bt_mkscankey(Relation rel, IndexTuple itup) +BTScanInsert +_bt_mkscankey(Relation rel, IndexTuple itup, bool assumeheapkeyspace) This 'assumeheapkeyspace' flag feels awkward. What if the caller knows that it is a v3 index? 
There's no way to tell _bt_mkscankey() that. (There's no need for that, currently, but seems a bit weird.) _bt_split() calls _bt_truncate(), which calls _bt_leave_natts(), which calls _bt_mkscankey(). It's holding a lock on the page being split. Do we risk deadlock by locking the metapage at the same time? I don't have any great ideas on what to do about this, but it's awkward as it is. Can we get away without the new argument? Could we somehow arrange things so that rd_amcache would be guaranteed to already be set? 3. In the "Pick nbtree split points discerningly" patch I find the different modes and the logic in _bt_findsplitloc() very hard to understand. I've spent a while looking at it now, and I think I have a vague understanding of what things it takes into consideration, but I don't understand why it performs those multiple stages, what each stage does, and how that leads to an overall strategy. I think a rewrite would be in order, to make that more understandable. I'm not sure what exactly it should look like, though. If _bt_findsplitloc() has to fall back to the MANY_DUPLICATES or SINGLE_VALUE modes, it has to redo a lot of the work that was done in the DEFAULT mode already. That's probably not a big deal in practice, performance-wise, but I feel that it's another hint that some refactoring would be in order. One idea on how to restructure that: Make a single pass over all the offset numbers, considering a split at that location. Like the current code does. For each offset, calculate a "penalty" based on two factors: * free space on each side * the number of attributes in the pivot tuple, and whether it needs to store the heap TID Define the penalty function so that having to add a heap TID to the pivot tuple is considered very expensive, more expensive than anything else, and truncating away other attributes gives a reward of some size. However, naively computing the penalty upfront for every offset would be a bit wasteful. 
Instead, start from the middle of the page, and walk "outwards" towards both ends, until you find a "good enough" penalty. Or something like that... Now, the laundry list of smaller items: - laundry list begins - 1st commits commit message: Make nbtree treat all index tuples as having a heap TID trailing key attribute. Heap TID becomes a first class part of the key space on all levels of the tree. Index searches can distinguish duplicates by heap TID, at least in principle. What do you mean by "at least in principle"? Secondary index insertions will descend straight to the leaf page that they'll insert on to (unless there is a concurrent page split). What is a "Secondary" index insertion? Naively adding a new attribute to every pivot tuple has unacceptable overhead (it blo
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On Fri, Dec 28, 2018 at 10:04 AM Heikki Linnakangas wrote: > I spent some time reviewing this. I skipped the first patch, to add a > column to pg_depend, and I got through patches 2, 3 and 4. Impressive > results, and the code looks sane. Thanks! I really appreciate your taking the time to do such a thorough review. You were right to skip the first patch, because there is a fair chance that it won't be used in the end. Tom is looking into the pg_depend problem that I paper over with the first patch. > I wrote a laundry list of little comments on minor things, suggested > rewordings of comments etc. I hope they're useful, but feel free to > ignore/override my opinions of any of those, as you see best. I think that that feedback is also useful, and I'll end up using 95%+ of it. Much of the information I'm trying to get across is very subtle. > But first, a few slightly bigger (medium-sized?) issues that caught my eye: > > 1. How about doing the BTScanInsertData refactoring as a separate > commit, first? It seems like a good thing for readability on its own, > and would slim the big main patch. (And make sure to credit Andrey for > that idea in the commit message.) Good idea. I'll do that. > This 'assumeheapkeyspace' flag feels awkward. What if the caller knows > that it is a v3 index? There's no way to tell _bt_mkscankey() that. > (There's no need for that, currently, but seems a bit weird.) This is there for CREATE INDEX -- we cannot access the metapage during an index build. We'll only be able to create new v4 indexes with the patch applied, so we can assume that heap TID is part of the key space safely. > _bt_split() calls _bt_truncate(), which calls _bt_leave_natts(), which > calls _bt_mkscankey(). It's holding a lock on the page being split. Do > we risk deadlock by locking the metapage at the same time? I already had vague concerns along the same lines. 
I am also concerned about index_getprocinfo() calls that happen in the same code path, with a buffer lock held. (SP-GiST's doPickSplit() function can be considered a kind of precedent that makes the second issue okay, I suppose.) See also: My later remarks on the use of "authoritative comparisons" from this same e-mail. > I don't have any great ideas on what to do about this, but it's awkward > as it is. Can we get away without the new argument? Could we somehow > arrange things so that rd_amcache would be guaranteed to already be set? These are probably safe in practice, but the way that we rely on them being safe from a distance is a concern. Let me get back to you on this. > 3. In the "Pick nbtree split points discerningly" patch > > I find the different modes and the logic in _bt_findsplitloc() very hard > to understand. I've spent a while looking at it now, and I think I have > a vague understanding of what things it takes into consideration, but I > don't understand why it performs those multiple stages, what each stage > does, and how that leads to an overall strategy. I think a rewrite would > be in order, to make that more understandable. I'm not sure what exactly > it should look like, though. I've already refactored that a little bit for the upcoming v10. The way _bt_findsplitloc() state is initially set up becomes slightly more streamlined. It still works in the same way, though, so you'll probably only think that the new version is a minor improvement. (Actually, v10 focuses on making _bt_splitatnewitem() a bit less magical, at least right now.) > If _bt_findsplitloc() has to fall back to the MANY_DUPLICATES or > SINGLE_VALUE modes, it has to redo a lot of the work that was done in > the DEFAULT mode already. That's probably not a big deal in practice, > performance-wise, but I feel that it's another hint that some > refactoring would be in order. The logic within _bt_findsplitloc() has been very hard to refactor all along. 
You're right that there is a fair amount of redundant-ish work that the alternative modes (MANY_DUPLICATES + SINGLE_VALUE) perform. The idea is to not burden the common DEFAULT case, and to keep the control flow relatively simple. I'm sure that if I was in your position I'd say something similar. It is complicated in subtle ways that look like they might not matter, but actually do. I am working off a fair variety of test cases, which really came in handy. I remember thinking that I'd simplified it a couple of times back in August or September, only to realize that I'd regressed a case that I cared about. I eventually realized that I needed to come up with a comprehensive though relatively fast test suite, which seems essential for refactoring _bt_findsplitloc(), and maybe even for fully understanding how _bt_findsplitloc() works. Another complicating factor is that I have to worry about the number of cycles used under a buffer lock (not just the impact on space utilization). With all of that said, I am willing to give it another try. You've seen opportunities to refactor that I missed before now.
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On 29/12/2018 01:04, Peter Geoghegan wrote: However, naively computing the penalty upfront for every offset would be a bit wasteful. Instead, start from the middle of the page, and walk "outwards" towards both ends, until you find a "good enough" penalty. You can't start at the middle of the page, though. You have to start at the left (though you could probably start at the right instead). This is because of page fragmentation -- it's not correct to assume that the line pointer offset into tuple space on the page (firstright line pointer lp_off for the candidate split point) tells you anything about what the space delta will be after the split. You have to exhaustively add up the free space before the line pointer (the free space for all earlier line pointers) before seeing if the line pointer works as a split point, since each previous line pointer's tuple could be located anywhere in the original page's tuple space (anywhere to the left or to the right of where it would be in the simple/unfragmented case). Right. You'll need to do the free space computations from left to right, but once you have done that, you can compute the penalties in any order. I'm envisioning that you have an array, with one element for each item on the page (including the tuple we're inserting, which isn't really on the page yet). In the first pass, you count up from left to right, filling the array. Next, you compute the complete penalties, starting from the middle, walking outwards. That's not so different from what you're doing now, but I find it more natural to explain the algorithm that way. - Heikki
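The two-pass structure Heikki describes could be sketched like this (hypothetical Python; the penalty function here, free-space imbalance plus a large charge when the split point fails to distinguish the tuples and so forces a heap TID into the new pivot, is an invented stand-in for nbtree's real cost model):

```python
# Two-pass split-point search sketch: pass 1 accumulates left-side byte
# totals from left to right; pass 2 evaluates penalties starting from the
# middle and walking outwards. Not nbtree's actual algorithm or costs.

def choose_split(items, sizes):
    n = len(items)
    total = sum(sizes)
    # pass 1: bytes to the left of a split before each offset 1..n-1
    left = [0] * n
    for i in range(1, n):
        left[i] = left[i - 1] + sizes[i - 1]

    def penalty(i):
        imbalance = abs(total - 2 * left[i])
        # equal keys around the split point => pivot needs a heap TID
        needs_tid = 1000 if items[i - 1] == items[i] else 0
        return imbalance + needs_tid

    # pass 2: walk outwards from the middle; a real implementation could
    # stop early at the first "good enough" penalty
    best, best_pen = None, None
    mid = n // 2
    for delta in range(n):
        for i in (mid - delta, mid + delta):
            if 1 <= i <= n - 1:
                p = penalty(i)
                if best_pen is None or p < best_pen:
                    best, best_pen = i, p
    return best


items = ["a", "b", "b", "b", "c", "d"]
sizes = [16] * len(items)
split = choose_split(items, sizes)
print(split, items[split - 1], items[split])  # 4 b c
```

The search skips the perfectly balanced split between the two "b" items because that split point would force a heap TID into the pivot, and settles for the slightly less balanced split between "b" and "c" instead.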
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On Fri, Dec 28, 2018 at 3:20 PM Heikki Linnakangas wrote: > Right. You'll need to do the free space computations from left to right, > but once you have done that, you can compute the penalties in any order. > > I'm envisioning that you have an array, with one element for each item > on the page (including the tuple we're inserting, which isn't really on > the page yet). In the first pass, you count up from left to right, > filling the array. Next, you compute the complete penalties, starting > from the middle, walking outwards. > > That's not so different from what you're doing now, but I find it more > natural to explain the algorithm that way. Ah, right. I think I see what you mean now. I like that this datastructure explicitly has a place for the new item, so you really do "pretend it's already on the page". Maybe that's what you liked about it as well. I'm a little concerned about the cost of maintaining the data structure. This sounds workable, but we probably don't want to allocate a buffer most of the time, or even hold on to the information most of the time. The current design throws away potentially useful information that it may later have to recreate, but even that has the benefit of having little storage overhead in the common case. Leave it with me. I'll need to think about this some more. -- Peter Geoghegan
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
Hi! I'm starting to look at this patchset. Not ready to post a detailed review, but have a couple of questions. On Wed, Sep 19, 2018 at 9:24 PM Peter Geoghegan wrote: > I still haven't managed to add pg_upgrade support, but that's my next > step. I am more or less happy with the substance of the patch in v5, > and feel that I can now work backwards towards figuring out the best > way to deal with on-disk compatibility. It shouldn't be too hard -- > most of the effort will involve coming up with a good test suite. Yes, it shouldn't be too hard, but it seems like we have to keep two branches of code for different handling of duplicates. Is that true? + * In the worst case (when a heap TID is appended) the size of the returned + * tuple is the size of the first right tuple plus an additional MAXALIGN() + * quantum. This guarantee is important, since callers need to stay under + * the 1/3 of a page restriction on tuple size. If this routine is ever + * taught to truncate within an attribute/datum, it will need to avoid + * returning an enlarged tuple to caller when truncation + TOAST compression + * ends up enlarging the final datum. I didn't get the point of this paragraph. Might it happen that the first right tuple is under the tuple size restriction, but the new pivot tuple is beyond that restriction? If so, would we get an error because of a too-long pivot tuple? If not, I think this needs to be explained better. -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
Hi Alexander, On Fri, Jan 4, 2019 at 7:40 AM Alexander Korotkov wrote: > I'm starting to look at this patchset. Not ready to post a detailed > review, but have a couple of questions. Thanks for taking a look! > Yes, it shouldn't be too hard, but it seems like we have to keep two > branches of code for different handling of duplicates. Is that true? Not really. If you take a look at v9, you'll see the approach I've taken is to make insertion scan keys aware of which rules apply (the "heapkeyspace" field controls this). I think that there are about 5 "if" statements for that outside of amcheck. It's pretty manageable. I like to imagine that the existing code already has unique keys, but nobody ever gets to look at the final attribute. It works that way most of the time -- the only exception is insertion with user keys that aren't unique already. Note that the way we move left on equal pivot tuples, rather than right (rather than following the pivot's downlink) wasn't invented by Postgres to deal with the lack of unique keys. That's actually a part of the Lehman and Yao design itself. Almost all of the special cases are optimizations rather than truly necessary infrastructure. > I didn't get the point of this paragraph. Might it happen that the > first right tuple is under the tuple size restriction, but the new > pivot tuple is beyond that restriction? If so, would we get an error > because of a too-long pivot tuple? If not, I think this needs to be > explained better. The v9 version of the function _bt_check_third_page() shows what it means (comments on this will be improved in v10, too). The old limit of 2712 bytes still applies to pivot tuples, while a new, lower limit of 2704 bytes applies to non-pivot tuples. This difference is necessary because an extra MAXALIGN() quantum could be needed to add a heap TID to a pivot tuple during truncation in the worst case.
To users, the limit is 2704 bytes, because that's the limit that actually needs to be enforced during insertion. We never actually say "1/3 of a page means 2704 bytes" in the docs, since the definition was always a bit fuzzy. There will need to be a compatibility note in the release notes, though. -- Peter Geoghegan
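A back-of-the-envelope check of those numbers, assuming the usual 8-byte MAXALIGN on 64-bit platforms (an assumption; the alignment quantum is platform-dependent, and the exact derivation of the 1/3-of-a-page figure involves page and item header overhead not shown here):

```python
# The non-pivot limit sits one MAXALIGN quantum below the pivot limit,
# leaving room for the heap TID (6 bytes, MAXALIGN'd) that truncation may
# append to a pivot in the worst case. Illustrative arithmetic only.

MAXALIGN_QUANTUM = 8  # typical 64-bit platform


def maxalign(n):
    return (n + MAXALIGN_QUANTUM - 1) & ~(MAXALIGN_QUANTUM - 1)


pivot_limit = 2712                                  # old limit, pivots only
nonpivot_limit = pivot_limit - MAXALIGN_QUANTUM     # 2704, enforced on insert

# worst case: a maximum-size non-pivot tuple plus an appended heap TID
worst_pivot = nonpivot_limit + maxalign(6)
print(nonpivot_limit, worst_pivot)  # 2704 2712
```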
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On Wed, Sep 19, 2018 at 9:56 PM, Andrey Lepikhov wrote: > Note that the interface of the _bt_moveright() and _bt_binsrch() functions, > with their combination of scankey, scantid and nextkey parameters, is > semantically overloaded. > Every time I read the code I have to spend time remembering exactly what > these functions do. > Maybe the comments need to be rewritten. I think that it might be a good idea to create a "BTInsertionScankey" struct, or similar, since keysz, nextkey, the scankey array and now scantid are all part of that, and are all common to these 4 or so functions. It could have a flexible array at the end, so that we still only need a single palloc(). I'll look into that. > What do you think about submitting the patch to the next CF? Clearly the project that you're working on is a difficult one. It's easy for me to understand why you might want to take an iterative approach, with lots of prototyping. Your patch needs attention to advance, and IMV the CF is the best way to get that attention. So, I think that it would be fine to go submit it now. I must admit that I didn't even notice that your patch lacked a CF entry. Everyone has a different process, perhaps. -- Peter Geoghegan
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On Wed, Sep 19, 2018 at 11:23 AM Peter Geoghegan wrote: > 3 modes > --- > > My new approach is to teach _bt_findsplitloc() 3 distinct modes of > operation: Regular/default mode, many duplicates mode, and single > value mode. I think that I'll have to add a fourth mode, since I came up with another strategy that is really effective though totally complementary to the other 3 -- "multiple insertion point" mode. Credit goes to Kevin Grittner for pointing out that this technique exists about 2 years ago [1]. The general idea is to pick a split point just after the insertion point of the new item (the incoming tuple that prompted a page split) when it looks like there are localized monotonically increasing ranges. This is like a rightmost 90:10 page split, except the insertion point is not at the rightmost page on the level -- it's rightmost within some local grouping of values. This makes the two largest TPC-C indexes *much* smaller. Previously, they were shrunk by a little over 5% by using the new generic strategy, a win that now seems like small potatoes. With this new mode, TPC-C's order_line primary key, which is the largest index of all, is ~45% smaller following a standard initial bulk load at scalefactor 50. It shrinks from 99,085 blocks (774.10 MiB) to 55,020 blocks (429.84 MiB). It's actually slightly smaller than it would be after a fresh REINDEX with the new strategy. We see almost as big a win with the second largest TPC-C index, the stock table's primary key -- it's ~40% smaller. Here is the definition of the biggest index, the order line primary key index:

pg@tpcc[3666]=# \d order_line_pkey
     Index "public.order_line_pkey"
  Column   │  Type   │ Key? │ Definition
───────────┼─────────┼──────┼────────────
 ol_w_id   │ integer │ yes  │ ol_w_id
 ol_d_id   │ integer │ yes  │ ol_d_id
 ol_o_id   │ integer │ yes  │ ol_o_id
 ol_number │ integer │ yes  │ ol_number
primary key, btree, for table "public.order_line"

The new strategy/mode works very well because we see monotonically increasing inserts on ol_number (an order's item number), but those are grouped by order. It's kind of an adversarial case for our existing implementation, and yet it seems like it's probably a fairly common scenario in the real world. Obviously these are very significant improvements. They really exceed my initial expectations for the patch. TPC-C is generally considered to be by far the most influential database benchmark of all time, and this is something that we need to pay more attention to. My sense is that the TPC-C benchmark is deliberately designed to almost require that the system under test have this "multiple insertion point" B-Tree optimization, suffix truncation, etc. This is exactly the same index that we've seen reports of out of control bloat on when people run TPC-C over hours or days [2]. My next task is to find heuristics to make the new page split mode/strategy kick in when it's likely to help, but not kick in when it isn't (when we want something close to a generic 50:50 page split). These heuristics should look similar to what I've already done to get cases with lots of duplicates to behave sensibly. Anyone have any ideas on how to do this? I might end up inferring a "multiple insertion point" case from the fact that there are multiple pass-by-value attributes for the index, with the new/incoming tuple having distinct-to-immediate-left-tuple attribute values for the last column, but not the first few. It also occurs to me to consider the fragmentation of the page as a guide, though I'm less sure about that. I'll probably need to experiment with a variety of datasets before I settle on something that looks good.
Forcing the new strategy without considering any of this actually works surprisingly well on cases where you'd think it wouldn't, since a 50:50 page split is already something of a guess about where future insertions will end up. [1] https://postgr.es/m/CACjxUsN5fV0kV=yirxwa0s7lqoojuy7soptipdhucemhgwo...@mail.gmail.com [2] https://www.commandprompt.com/blog/postgres_autovacuum_bloat_tpc-c/ -- Peter Geoghegan
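In outline, the split-point choice being described might look something like the following standalone sketch (all names are hypothetical, and the real _bt_findsplitloc() logic involves much more than this): given the offset at which the incoming tuple would land on the full page, split just after it when the insertions look like a localized ascending pattern, and otherwise fall back to a generic 50:50 split.

```c
/*
 * Simplified sketch of the "multiple insertion point" split choice.
 * newitem_off is the 1-based offset at which the incoming tuple would be
 * inserted on a page holding nitems items.  When the caller has decided
 * that insertions here look like a localized ascending pattern, split just
 * after the insertion point; otherwise use a generic 50:50 split.
 */
static int
choose_split_offset(int newitem_off, int nitems, int looks_locally_ascending)
{
    if (looks_locally_ascending && newitem_off > 1 && newitem_off < nitems)
        return newitem_off;     /* split right at the insertion point */

    return nitems / 2;          /* generic 50:50 split */
}
```

The hard part, as the email notes, is deciding when the "locally ascending" condition should be considered true; the sketch simply takes it as an input.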
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On 19/09/2018 20:23, Peter Geoghegan wrote:
> Attached is v5,

So. I don't know much about the btree code, so don't believe anything I say.

I was very interested in the bloat test case that you posted on 2018-07-09 and I tried to understand it more. The current method for inserting a duplicate value into a btree is to go to the leftmost point for that value and then move right until we find some space or we get "tired" of searching, in which case we just make some space right there. The problem is that it's tricky to decide when to stop searching, and there are scenarios when we stop too soon and repeatedly miss all the good free space to the right, leading to bloat even though the index is perhaps quite empty. I tried playing with the getting-tired factor (it could plausibly be a reloption), but that wasn't very successful. You can use that to postpone the bloat, but you won't stop it, and performance becomes terrible.

You propose to address this by appending the tid to the index key, so each key, even if its "payload" is a duplicate value, is unique and has a unique place, so we never have to do this "tiresome" search. This makes a lot of sense, and the results in the bloat test you posted are impressive and reproducible.

I tried a silly alternative approach by placing a new duplicate key in a random location. This should be equivalent since tids are effectively random. I didn't quite get this to fully work yet, but at least it doesn't blow up, and it gets the same regression test ordering differences for pg_depend scans that you are trying to paper over. ;-)

As far as the code is concerned, I agree with Andrey Lepikhov that one more abstraction layer that somehow combines the scankey and the tid or some combination like that would be useful, instead of passing the tid as a separate argument everywhere.

I think it might help this patch move along if it were split up a bit, for example 1) suffix truncation, 2) tid stuff, 3) new split strategies.
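The move-right-until-tired behavior described above can be modeled in a few lines (a toy standalone sketch, not the actual nbtree code; the per-step stop probability is illustrative -- the sketch uses 1%):

```c
#include <stdlib.h>

/*
 * Toy model of the move-right search for free space among duplicates.
 * pages_free[] says whether each successive right-sibling page in the run
 * of duplicates has room.  We move right while pages are full, but with a
 * small chance per step (1% here) we "get tired" and split the full page
 * we are on instead.  Returns the page index chosen for the insertion.
 */
static int
move_right_until_tired(const int *pages_free, int npages, unsigned seed)
{
    srand(seed);
    for (int page = 0; page < npages; page++)
    {
        if (pages_free[page])
            return page;        /* found free space */
        if (page < npages - 1 && rand() % 100 == 0)
            return page;        /* got tired: split this full page */
    }
    return npages - 1;          /* rightmost page for this value */
}
```

The bloat scenario in the email corresponds to repeatedly getting tired early in a long run of full pages, missing the free space further right.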
That way, it would also be easier to test out each piece separately. For example, how much space does suffix truncation save in what scenario, are there any performance regressions, etc. In the last few versions, the patches have still been growing significantly in size and functionality, and most of the supposed benefits are not readily visible in tests. And of course we need to think about how to handle upgrades, but you have already started a separate discussion about that. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On Fri, Sep 28, 2018 at 7:50 AM Peter Eisentraut wrote:
> So. I don't know much about the btree code, so don't believe anything I
> say.

I think that showing up and reviewing this patch makes you somewhat of an expert, by default. There just isn't enough expertise in this area.

> I was very interested in the bloat test case that you posted on
> 2018-07-09 and I tried to understand it more.

Up until recently, I thought that I would justify the patch primarily as a project to make B-Trees less bloated when there are many duplicates, with maybe as many as a dozen or more secondary benefits. That's what I thought it would say in the release notes, even though the patch was always a broader strategic thing. Now I think that the TPC-C multiple insert point bloat issue might be the primary headline benefit, though. I hate to add more complexity to get it to work well, but just look at how much smaller the indexes are following an initial bulk load (bulk insertions) using my working copy of the patch:

Master

customer_pkey: 75 MB
district_pkey: 40 kB
idx_customer_name: 107 MB
item_pkey: 2216 kB
new_order_pkey: 22 MB
oorder_o_w_id_o_d_id_o_c_id_o_id_key: 60 MB
oorder_pkey: 78 MB
order_line_pkey: 774 MB
stock_pkey: 181 MB
warehouse_pkey: 24 kB

Patch

customer_pkey: 50 MB
district_pkey: 40 kB
idx_customer_name: 105 MB
item_pkey: 2216 kB
new_order_pkey: 12 MB
oorder_o_w_id_o_d_id_o_c_id_o_id_key: 61 MB
oorder_pkey: 42 MB
order_line_pkey: 429 MB
stock_pkey: 111 MB
warehouse_pkey: 24 kB

All of the indexes used by oltpbench to do TPC-C are listed, so you're seeing the full picture for TPC-C bulk loading here (actually, there is another index that has an identical definition to oorder_o_w_id_o_d_id_o_c_id_o_id_key for some reason, which is omitted as redundant). As you can see, all the largest indexes are *significantly* smaller, with the exception of oorder_o_w_id_o_d_id_o_c_id_o_id_key.
You won't be able to see this improvement until I post the next version, though, since this is a brand new development. Note that VACUUM hasn't been run at all, and doesn't need to be run, as there are no dead tuples. Note also that this has *nothing* to do with getting tired -- almost all of these indexes are unique indexes.

Note that I'm also testing TPC-E and TPC-H in a very similar way, which have both been improved noticeably, but to a degree that's much less compelling than what we see with TPC-C. They have "getting tired" cases that benefit quite a bit, but those are the minority. Have you ever used HammerDB? I got this data from oltpbench, but I think that HammerDB might be the way to go for TPC-C testing of Postgres.

> You propose to address this by appending the tid to the index key, so
> each key, even if its "payload" is a duplicate value, is unique and has
> a unique place, so we never have to do this "tiresome" search. This
> makes a lot of sense, and the results in the bloat test you posted are
> impressive and reproducible.

Thanks.

> I tried a silly alternative approach by placing a new duplicate key in a
> random location. This should be equivalent since tids are effectively
> random.

You're never going to get any other approach to work remotely as well, because while the TIDs may seem to be random in some sense, they have various properties that are very useful from a high-level, data life cycle point of view. For insertions of duplicates, heap TID has temporal locality -- you are only going to dirty one or two leaf pages, rather than potentially dirtying dozens or hundreds. Furthermore, heap TID is generally strongly correlated with primary key values, so VACUUM can be much, much more effective at killing duplicates in low cardinality secondary indexes when there are DELETEs with a range predicate on the primary key. This is a lot more realistic than the 2018-07-09 test case, but it could still make just as big a difference.
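The ordering rule under discussion -- heap TID acting as an implicit last key column, so that "duplicates" still have a total order -- can be sketched as a standalone comparator (hypothetical names; the real patch compares IndexTuples and must handle truncated attributes):

```c
/* A heap TID, as a (block number, offset number) pair. */
typedef struct TidSketch
{
    unsigned    block;
    unsigned    offset;
} TidSketch;

/*
 * Compare two index entries: the user-visible key first, then the heap
 * TID as tie-breaker.  Entries with equal keys thus still have a unique
 * position in the key space.
 */
static int
compare_entries(int key_a, TidSketch tid_a, int key_b, TidSketch tid_b)
{
    if (key_a != key_b)
        return key_a < key_b ? -1 : 1;
    if (tid_a.block != tid_b.block)
        return tid_a.block < tid_b.block ? -1 : 1;
    if (tid_a.offset != tid_b.offset)
        return tid_a.offset < tid_b.offset ? -1 : 1;
    return 0;
}
```

Because the tie-breaker is deterministic, an insertion of a duplicate always has exactly one legal leaf position, which is what removes the "tiresome" search.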
> I didn't quite get this to fully work yet, but at least it > doesn't blow up, and it gets the same regression test ordering > differences for pg_depend scans that you are trying to paper over. ;-) FWIW, I actually just added to the papering over, rather than creating a new problem. There are plenty of instances of "\set VERBOSITY terse" in the regression tests already, for the same reason. If you run the regression tests with ignore_system_indexes=on, there are very similar failures [1]. > As far as the code is concerned, I agree with Andrey Lepikhov that one > more abstraction layer that somehow combines the scankey and the tid or > some combination like that would be useful, instead of passing the tid > as a separate argument everywhere. I've already drafted this in my working copy. It is a clear improvement. You can expect it in the next version. > I think it might help this patch move along if it were split up a bit, > for example 1) suffix truncation, 2) tid stuff, 3) new split strategies. > That way, it would also be easier to test out each piece separately. > For example,
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
28.09.2018 23:08, Peter Geoghegan wrote:
> On Fri, Sep 28, 2018 at 7:50 AM Peter Eisentraut wrote:
>> I think it might help this patch move along if it were split up a bit,
>> for example 1) suffix truncation, 2) tid stuff, 3) new split strategies.
>> That way, it would also be easier to test out each piece separately.
>> For example, how much space does suffix truncation save in what
>> scenario, are there any performance regressions, etc.
>
> I'll do my best. I don't think I can sensibly split out suffix
> truncation from the TID stuff -- those seem truly inseparable, since my
> mental model for suffix truncation breaks without fully unique keys. I
> can break out all the cleverness around choosing a split point into its
> own patch, though -- _bt_findsplitloc() has only been changed to give
> weight to several factors that become important. It's the "brain" of
> the optimization, where 90% of the complexity actually lives. Removing
> the _bt_findsplitloc() changes will make the performance of the other
> stuff pretty poor, and in particular will totally remove the benefit
> for cases that "become tired" on the master branch. That could be
> slightly interesting, I suppose.

I am reviewing this patch too, and I join Peter Eisentraut's opinion about splitting the patch into a hierarchy of two or three patches: "functional" (the TID stuff) and "optimizational" (suffix truncation & splitting). My reasons are simplification of code review, investigation and benchmarking. Right now, benchmarking is not clear: possible performance degradation from TID ordering interferes with the positive effects of the optimizations in a non-trivial way.
-- 
Andrey Lepikhov
Postgres Professional
https://postgrespro.com
The Russian Postgres Company
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On Fri, Sep 28, 2018 at 10:58 PM Andrey Lepikhov wrote:
> I am reviewing this patch too. And join to Peter Eisentraut opinion
> about splitting the patch into a hierarchy of two or three patches:
> "functional" - tid stuff and "optimizational" - suffix truncation &
> splitting. My reasons are simplification of code review, investigation
> and benchmarking.

As I mentioned to Peter, I don't think that I can split out the heap TID stuff from the suffix truncation stuff. At least not without making the patch even more complicated, for no benefit. I will split out the "brain" of the patch (the _bt_findsplitloc() stuff, which decides on a split point using sophisticated rules) from the "brawn" (the actual changes to how index scans work, including the heap TID stuff, as well as the code for actually physically performing suffix truncation). The brain of the patch is where most of the complexity is, as well as most of the code. The brawn of the patch is _totally unusable_ without intelligence around split points, but I'll split things up along those lines anyway. Doing so should make the whole design a little easier to follow.

> Now benchmarking is not clear. Possible performance degradation from TID
> ordering interfere with positive effects from the optimizations in
> non-trivial way.

Is there any evidence of a regression in the last 2 versions? I've been using pgbench, which didn't show any. That's not a sympathetic case for the patch, though it would be nice to confirm if there was some small improvement there. I've seen contradictory results (slight improvements and slight regressions), but that was with a much earlier version, so it just isn't relevant now. pgbench is mostly interesting as a thing that we want to avoid regressing. Once I post the next version, it would be great if somebody could use HammerDB's OLTP test, which seems like the best fair-use implementation of TPC-C that's available.
I would like to make that the "this is why you should care, even if you happen to not believe in the patch's strategic importance" benchmark. TPC-C is clearly the most influential database benchmark ever, so I think that that's a fair request. (See the TPC-C commentary at https://www.hammerdb.com/docs/ch03s02.html, for example.) -- Peter Geoghegan
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On Wed, Oct 3, 2018 at 4:39 PM Peter Geoghegan wrote:
> I did find a pretty clear regression, though only with writes to
> unique indexes. Attached is v6, which fixes the issue. More on that
> below.

I've been benchmarking my patch using oltpbench's TPC-C benchmark these past few weeks, which has been very frustrating -- the picture is very mixed. I'm testing a patch that has evolved from v6, but isn't too different.

In one way, the patch does exactly what it's supposed to do when these benchmarks are run: it leaves indexes *significantly* smaller than the master branch will on the same (rate-limited) workload, without affecting the size of tables in any noticeable way. The numbers that I got from my much earlier synthetic single client benchmark mostly hold up. For example, the stock table's primary key is about 35% smaller, and the order line index is only about 20% smaller relative to master, which isn't quite as good as in the synthetic case, but I'll take it (this is all because of the v6-0003-Add-split-at-new-tuple-page-split-optimization.patch stuff).

However, despite significant effort, and despite the fact that the index shrinking is reliable, I cannot yet consistently show an improvement in either transaction throughput or transaction latency. I can show a nice improvement in latency on a slightly-rate-limited TPC-C workload when backend_flush_after=0 (something like a 40% reduction on average), but that doesn't hold up when oltpbench isn't rate-limited and/or has backend_flush_after set. Usually, there is a 1% - 2% regression, despite the big improvements in index size, and despite the big reduction in the amount of buffers that backends must write out themselves.

The obvious explanation is that throughput is decreased due to our doing extra work (truncation) while under an exclusive buffer lock. However, I've worked hard on that, and, as I said, I can sometimes observe a nice improvement in latency. This makes me doubt the obvious explanation.
My working theory is that this has something to do with shared_buffers eviction. Maybe we're making worse decisions about which buffer to evict, or maybe the scalability of eviction is hurt. Perhaps both. You can download results from a recent benchmark to get some sense of this. It includes latency and throughput graphs, plus detailed statistics collector stats:

https://drive.google.com/file/d/1oIjJ3YpSPiyRV_KF6cAfAi4gSm7JdPK1/view?usp=sharing

I would welcome any theories as to what could be the problem here. I think that this is fixable, since the picture for the patch is very positive, provided you only focus on bgwriter/checkpoint activity and on-disk sizes. It seems likely that there is a very specific gap in my understanding of how the patch affects buffer cleaning.
-- 
Peter Geoghegan
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
Hi,

On 2018-10-18 12:54:27 -0700, Peter Geoghegan wrote:
> I can show a nice improvement in latency on a slightly-rate-limited
> TPC-C workload when backend_flush_after=0 (something like a 40%
> reduction on average), but that doesn't hold up when oltpbench isn't
> rate-limited and/or has backend_flush_after set. Usually, there is a
> 1% - 2% regression, despite the big improvements in index size, and
> despite the big reduction in the amount of buffers that backends must
> write out themselves.

What kind of backend_flush_after values were you trying? backend_flush_after=0 obviously is the default, so I'm not clear on that. How large is the database here, and how high is shared_buffers?

> The obvious explanation is that throughput is decreased due to our
> doing extra work (truncation) while under an exclusive buffer lock.
> However, I've worked hard on that, and, as I said, I can sometimes
> observe a nice improvement in latency. This makes me doubt the obvious
> explanation. My working theory is that this has something to do with
> shared_buffers eviction. Maybe we're making worse decisions about
> which buffer to evict, or maybe the scalability of eviction is hurt.
> Perhaps both.

Is it possible that there's new / prolonged cases where a buffer is read from disk after the patch? Because that might require doing *write* IO when evicting the previous contents of the victim buffer, and obviously that can take longer if you're running with backend_flush_after > 0.

I wonder if it'd make sense to hack up a patch that logs when evicting a buffer while already holding another lwlock. That shouldn't be too hard.

> You can download results from a recent benchmark to get some sense of
> this. It includes latency and throughput graphs, plus detailed
> statistics collector stats:
>
> https://drive.google.com/file/d/1oIjJ3YpSPiyRV_KF6cAfAi4gSm7JdPK1/view?usp=sharing

I'm unclear on which runs are what here? I assume "public" is your patchset, and master is master?
Do you reset the stats in between runs?

Greetings,

Andres Freund
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
Shared_buffers is 10gb iirc. The server has 32gb of memory. Yes, 'public' is the patch case. Sorry for not mentioning it initially. -- Peter Geoghegan (Sent from my phone)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On Thu, Oct 18, 2018 at 1:44 PM Andres Freund wrote:
> What kind of backend_flush_after values were you trying?
> backend_flush_after=0 obviously is the default, so I'm not clear on
> that. How large is the database here, and how high is shared_buffers

I *was* trying backend_flush_after=512kB, but it's backend_flush_after=0 in the benchmark I posted. See the "postgres*settings" files.

On the master branch, things looked like this after the last run:

pg@tpcc_oltpbench[15547]=# \dt+
                      List of relations
 Schema │    Name    │ Type  │ Owner │   Size   │ Description
────────┼────────────┼───────┼───────┼──────────┼─────────────
 public │ customer   │ table │ pg    │ 4757 MB  │
 public │ district   │ table │ pg    │ 5240 kB  │
 public │ history    │ table │ pg    │ 1442 MB  │
 public │ item       │ table │ pg    │ 10192 kB │
 public │ new_order  │ table │ pg    │ 140 MB   │
 public │ oorder     │ table │ pg    │ 1185 MB  │
 public │ order_line │ table │ pg    │ 19 GB    │
 public │ stock      │ table │ pg    │ 9008 MB  │
 public │ warehouse  │ table │ pg    │ 4216 kB  │
(9 rows)

pg@tpcc_oltpbench[15547]=# \di+
                                      List of relations
 Schema │                 Name                 │ Type  │ Owner │   Table    │  Size   │ Description
────────┼──────────────────────────────────────┼───────┼───────┼────────────┼─────────┼─────────────
 public │ customer_pkey                        │ index │ pg    │ customer   │ 367 MB  │
 public │ district_pkey                        │ index │ pg    │ district   │ 600 kB  │
 public │ idx_customer_name                    │ index │ pg    │ customer   │ 564 MB  │
 public │ idx_order                            │ index │ pg    │ oorder     │ 715 MB  │
 public │ item_pkey                            │ index │ pg    │ item       │ 2208 kB │
 public │ new_order_pkey                       │ index │ pg    │ new_order  │ 188 MB  │
 public │ oorder_o_w_id_o_d_id_o_c_id_o_id_key │ index │ pg    │ oorder     │ 715 MB  │
 public │ oorder_pkey                          │ index │ pg    │ oorder     │ 958 MB  │
 public │ order_line_pkey                      │ index │ pg    │ order_line │ 9624 MB │
 public │ stock_pkey                           │ index │ pg    │ stock      │ 904 MB  │
 public │ warehouse_pkey                       │ index │ pg    │ warehouse  │ 56 kB   │
(11 rows)

> Is it possible that there's new / prolonged cases where a buffer is read
> from disk after the patch? Because that might require doing *write* IO
> when evicting the previous contents of the victim buffer, and obviously
> that can take longer if you're running with backend_flush_after > 0.

Yes, I suppose that that's possible, because the buffer popularity/usage_count will be affected in ways that cannot easily be predicted. However, I'm not running with "backend_flush_after > 0" here -- that was before.

> I wonder if it'd make sense to hack up a patch that logs when evicting a
> buffer while already holding another lwlock. That shouldn't be too hard.

I'll look into this. Thanks
-- 
Peter Geoghegan
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On Thu, Oct 18, 2018 at 1:44 PM Andres Freund wrote: > I wonder if it'd make sense to hack up a patch that logs when evicting a > buffer while already holding another lwlock. That shouldn't be too hard. I tried this. It looks like we're calling FlushBuffer() with more than a single LWLock held (not just the single buffer lock) somewhat *less* with the patch. This is a positive sign for the patch, but also means that I'm no closer to figuring out what's going on. I tested a case with a 1GB shared_buffers + a TPC-C database sized at about 10GB. I didn't want the extra LOG instrumentation to influence the outcome. -- Peter Geoghegan
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On 19.10.2018 0:54, Peter Geoghegan wrote:
> I would welcome any theories as to what could be the problem here. I
> think that this is fixable, since the picture for the patch is very
> positive, provided you only focus on bgwriter/checkpoint activity and
> on-disk sizes. It seems likely that there is a very specific gap in my
> understanding of how the patch affects buffer cleaning.

I have the same problem with the background heap & index cleaner (based on your patch). In this case the bottleneck is the WAL record which I need to write for each cleaned block, and the locks which are held during the WAL record writing process. Maybe you could do a test without writing any data to disk?
-- 
Andrey Lepikhov
Postgres Professional
https://postgrespro.com
The Russian Postgres Company
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On Tue, Oct 23, 2018 at 11:35 AM Andrey Lepikhov wrote:
> I have same problem with background heap & index cleaner (based on your
> patch). In this case the bottleneck is WAL-record which I need to write
> for each cleaned block and locks which are held during the WAL-record
> writing process.

Part of the problem here is that v6 uses up to 25 candidate split points, even during regular calls to _bt_findsplitloc(). That was based on some synthetic test cases. I've found that I can get most of the benefit in index size with far fewer split points, though. The extra work done with an exclusive buffer lock held will be considerably reduced in v7. I'll probably post that in a couple of weeks, since I'm in Europe for pgConf.EU.

I don't fully understand the problems here, but even still I know that what you were testing wasn't very well optimized for write-heavy workloads. It would be especially bad with pgbench, since there isn't much opportunity to reduce the size of indexes there.

> Maybe you will do a test without writing any data to disk?

Yeah, I should test that on its own. I'm particularly interested in TPC-C, because it's a particularly good target for my patch. I can find a way of only executing the read TPC-C queries, to see where they are on their own. TPC-C is particularly write-heavy, especially compared to the much more recent though less influential TPC-E benchmark.
-- 
Peter Geoghegan
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
I am doing the code review. For now, it covers the first patch -- v6-0001..., dedicated to logical duplicate ordering. The documentation is full and clear. All non-trivial logic is commented accurately. The patch applies cleanly on top of current master. Regression tests pass, and my "Retail Indextuple deletion" use cases work without mistakes. But I have two comments on the code.

The new BTScanInsert structure reduces the parameter lists of many functions and looks fine. But it contains an optimization part (the 'restorebinsrch' field et al.) that is used very locally in the code -- in the _bt_findinsertloc() -> _bt_binsrch() call path. Maybe you could localize this logic in a separate struct, passed to _bt_binsrch() as a pointer; other routines could pass a NULL value. That might simplify the usability of the struct.

Due to the optimization, _bt_binsrch() has doubled in size. Maybe you could move this to some service routine?
-- 
Andrey Lepikhov
Postgres Professional
https://postgrespro.com
The Russian Postgres Company
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On Fri, Nov 2, 2018 at 3:06 AM Andrey Lepikhov wrote: > Documentation is full and clear. All non-trivial logic is commented > accurately. Glad you think so. I had the opportunity to discuss this patch at length with Heikki during pgConf.EU. I don't want to speak on his behalf, but I will say that he seemed to understand all aspects of the patch series, and seemed generally well disposed towards the high level design. The high-level design is the most important aspect -- B-Trees can be optimized in many ways, all at once, and we must be sure to come up with something that enables most or all of them. I really care about the long term perspective. That conversation with Heikki eventually turned into a conversation about reimplementing GIN using the nbtree code, which is actually related to my patch series (sorted on heap TID is the first step to optional run length encoding for duplicates). Heikki seemed to think that we can throw out a lot of the optimizations within GIN, and add a few new ones to nbtree, while still coming out ahead. This made the general nbtree-as-GIN idea (which we've been talking about casually for years) seem a lot more realistic to me. Anyway, he requested that I support this long term goal by getting rid of the DESC TID sort order thing -- that breaks GIN-style TID compression. It also increases the WAL volume unnecessarily when a page is split that contains all duplicates. The DESC heap TID sort order thing probably needs to go. I'll probably have to go fix the regression test failures that occur when ASC heap TID order is used. (Technically those failures are a pre-existing problem, a problem that I mask by using DESC order...which is weird. The problem is masked in the master branch by accidental behaviors around nbtree duplicates, which is something that deserves to die. DESC order is closer to the accidental current behavior.) > Patch applies cleanly on top of current master. 
Regression tests passed > and my "Retail Indextuple deletion" use cases works without mistakes. Cool. > New BTScanInsert structure reduces parameters list of many functions and > look fine. But it contains some optimization part ('restorebinsrch' > field et al.). It is used very locally in the code - > _bt_findinsertloc()->_bt_binsrch() routines calling. May be you localize > this logic into separate struct, which will passed to _bt_binsrch() as > pointer. Another routines may pass NULL value to this routine. It is may > simplify usability of the struct. Hmm. I see your point. I did it that way because the knowledge of having cached an upper and lower bound for a binary search of a leaf page needs to last for a relatively long time. I'll look into it again, though. > Due to the optimization the _bt_binsrch() size has grown twice. May be > you move this to some service routine? Maybe. There are some tricky details that seem to work against it. I'll see if it's possible to polish that some more, though. -- Peter Geoghegan
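The cached-bounds idea being discussed can be illustrated standalone: a first binary search records the bounds it converged to, and a repeat search for the same key over the same (unchanged) array restarts from those bounds instead of the full range. All names here are hypothetical; the real code caches bounds established while checking uniqueness for reuse when finding the insert location.

```c
/* Hypothetical cache of binary-search bounds for a repeat search. */
typedef struct BoundsCache
{
    int     low;
    int     high;
    int     valid;
} BoundsCache;

/*
 * Find the insertion offset of target in sorted arr[0..n).  If cache is
 * valid, restart the search from the previously saved bounds; on return,
 * save the converged bounds so a repeat search for the same target over
 * the same array finishes immediately.
 */
static int
binsrch_cached(const int *arr, int n, int target, BoundsCache *cache)
{
    int     low = 0,
            high = n;

    if (cache && cache->valid)
    {
        low = cache->low;
        high = cache->high;
    }
    while (low < high)
    {
        int     mid = low + (high - low) / 2;

        if (arr[mid] < target)
            low = mid + 1;
        else
            high = mid;
    }
    if (cache)
    {
        cache->low = low;       /* converged: low == high */
        cache->high = high;
        cache->valid = 1;
    }
    return low;
}
```

This sketch only caches the fully-converged point, so it is valid only for repeating the exact same search; the discussion above is about keeping the cached state's lifetime (and invalidation) manageable, which is the hard part.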
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On 03.11.2018 5:00, Peter Geoghegan wrote:
> The DESC heap TID sort order thing probably needs to go. I'll probably
> have to go fix the regression test failures that occur when ASC heap
> TID order is used. (Technically those failures are a pre-existing
> problem, a problem that I mask by using DESC order...which is weird.
> The problem is masked in the master branch by accidental behaviors
> around nbtree duplicates, which is something that deserves to die.
> DESC order is closer to the accidental current behavior.)

I applied your patches on top of master. After test corrections (related to TID ordering in the DROP...CASCADE operation on index relations), 'make check-world' passed successfully many times.

In the case of the 'create view' regression test -- the 'drop cascades to 62 other objects' problem -- I verified Álvaro Herrera's hypothesis [1], and it is true. You can verify it by tracking the return value of the object_address_present_add_flags() routine. Some doubts remain, however, regarding the 'triggers' test. Could you specify which test failures you mean?

[1] https://www.postgresql.org/message-id/20180504022601.fflymidf7eoencb2%40alvherre.pgsql
-- 
Andrey Lepikhov
Postgres Professional
https://postgrespro.com
The Russian Postgres Company