On Mon, Mar 21, 2016 at 9:53 AM, Anastasia Lubennikova <a.lubennik...@postgrespro.ru> wrote: > Thanks to David, > I've missed these letters at first. > I'll answer here.
Sorry about using the wrong thread. > I agree that _bt_insertonpg() is not right for truncation. Cool. > It's a bit more complicated to add it into index creation algorithm. > There's a trick with a "high key". > /* > * We copy the last item on the page into the new page, and then > * rearrange the old page so that the 'last item' becomes its high > key > * rather than a true data item. There had better be at least two > * items on the page already, else the page would be empty of useful > * data. > */ > /* > * Move 'last' into the high key position on opage > */ > > To be consistent with other steps of algorithm ( all high keys must be > truncated tuples), I had to update this high key on place: > delete the old one, and insert truncated high key. Hmm. But the high key comparing equal to the Scankey gives insertion the choice of where to put its IndexTuple (it can go on the page with the high key, or its right-sibling, according only to considerations about fillfactor, etc). Is this changed? Does it not matter? Why not? Is it just worth it? The right-most page on every level has no high-key. But you say those pages have an "imaginary" *positive* infinity high key, just as internal pages have (non-imaginary) minus infinity downlinks as their first item/downlink. So tuples in a (say) leaf page are always bound by the downlink lower bound in parent, while their own high key is an upper bound. Either (and, rarely, both) could be (positive or negative) infinity. Maybe you now see why I talked about special _bt_compare() logic for this. I proposed special logic that is similar to the existing minus infinity thing _bt_compare() does (although _bt_binsrch(), an important caller of _bt_compare() also does special things for internal .vs leaf case, so I'm not sure any new special logic must go in _bt_compare()). > It is a very important issue. But I don't think it's a bug there. > I've read amcheck sources thoroughly and found that the problem appears at > "invariant_key_less_than_equal_nontarget_offset() > It uses scankey, made with _bt_mkscankey() which uses only key attributes, > but calls _bt_compare with wrong keysz. > If we wiil use nkeyatts = state->rel->rd_index->relnatts; instead of natts, > all the checks would be passed successfully. I probably shouldn't have brought amcheck into that particular discussion. I thought amcheck might be a useful way to frame the discussion, because amcheck always cares about specific invariants, and notes a few special cases. > In my view, it's the correct way to fix this problem, because the caller is > responsible for passing proper arguments to the function. > Of course I will add a check into bt_compare, but I'd rather make it an > assertion (see the patch attached). I see what you mean, but I think we need to decide what to do about the key space when leaf high keys are truncated. I do think that truncating the high key was the right idea, though, and it nicely illustrates that nothing special should happen in upper levels. Suffix truncation should only happen when leaf pages are split, generally speaking. As I said, the high key is very similar to the downlinks, in that both bound the things that go on each page. Together with downlinks they represent *discrete* ranges *unambiguously*, so INCLUDING truncation needs to make it clear which page new items go on. As I said, _bt_binsrch() already takes special actions for internal pages, making sure to return the item that is < the scankey, not <= the scankey which is only allowed for leaf pages. (See README, from "Lehman and Yao assume that the key range for a subtree S is described by Ki < v <= Ki+1 where Ki and Ki+1 are the adjacent keys in the parent page..."). To give a specific example, I worry about the case where two sibling downlinks in a parent page are distinct, but per specific-to-Postgres "Ki <= v <= Ki+1" thing (which differs from the classic L&Y invariant), some tuples with all right downlink's attributes matching end up in left child page, not right child page. I worry that since _bt_findsplitloc() doesn't consider this (for example), the split point doesn't *reliably* and unambiguously divide the key space between the new halves of a page being split. I think the "Ki <= v <= Ki+1"/_bt_binsrch() thing might save you in common cases where all downlink attributes are distinct, so maybe that simpler case is okay. But to be even more specific, what about the more complicated case where the downlinks *are* fully _bt_compare()-wise equal? This could happen even though they're constrained to be unique in leaf pages, due to bloat. Unique indexes aren't special here; they just make it far less likely that this would happen in practice, because it takes a lot of bloat. Less importantly, when that bloat happens, you don't want to have to do a linear scan through many leaf pages (that should only happen when there are many fully matching IndexTuples at the leaf level -- not just matching on constrained attributes). The more I think about it, the more I doubt that it's okay to not ensure downlinks are always distinct with their siblings, by sometimes including non-constrained (truncatable) attributes within internal pages, as needed to *distinguish* downlinks (also, we must occasionally have *all* attributes including truncatable attributes in internal pages -- we must truncate nothing to keep the key space sane in the parent). Unfortunately, these requirements are very close to the actual full requirements for a full, complete suffix truncation patch, including storing how many attributes are stored in each and every internal IndexTuple (no general thing for the index), page split code to determine where to truncate to make adjacent downlinks distinct, etc. You may think: But that fully-matching-downlink case is okay, because it only makes us do more linear scanning due to the lack of non-truncatable attributes, which is still correct, if a little more slow when there is bloat -- at the leaf level, we'll start at the correct place (the first place the item could be on), per the "Ki <= v <= Ki+1"/_bt_binsrch() thing. I don't think it's correct, though. We need to be able to reliably detect a concurrent page-split. Otherwise, we'll move right within _bt_search() before even considering if anything of interest for our index scan *might* be on the initial page found from downlink (before even calling _bt_binsrch()). Even this bug wouldn't happen in the common case where nextkey = true, but what about when nextkey = false (e.g. for backwards scans)? We'd skip stuff we are not supposed to by spuriously moving right, I think. I have a bad feeling that even then we'd "accidentally fail to fail", because of how backwards scans work at a higher level, but it's just too hard to prove that that is correct. It's just too complicated to rely on so much from a great distance. This might not be the simplest example of where we could run into trouble, but it's one example that I could see. The assumption that downlinks and highkeys discretely separate ranges in the key space is probably made many times. There could be more problematic spots, and it's really hard to know where they might be. :-( In general, it's common for any modification to the B-Tree code to only break in a very subtle way, like this. I would be more comfortable if I knew the patch received extensive stress-testing, probably involving amcheck, lots of bloat, lots of VACUUMing, etc. But generally, I believe we should not allow the key space to fail to be separated fully by downlinks and high keys, even if our original "Ki <= v <= Ki+1" changes to the L&Y algorithm to make duplicates work happens to mask the problems in simple testing. It's too different to what we have today. > I'll add a flag to distinguish regular and truncated tuples, but it will not > be used in this patch. Please, comment, if I've missed something. > As you've already mentioned, neither high keys, nor tuples on internal pages > are using "itup->t_tid.ip_posid", so I'll take one bit of it. > > It will definitely require changes in the future works on suffix truncation > or something like that, but IMHO for now it's enough. I think that we need to discuss whether or not it's okay that we can have that fully-matching-downlink case before we can be sure either way. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers