Re: Reducing the runtime of the core regression tests
On Fri, Apr 12, 2019 at 10:24 AM Alvaro Herrera wrote:
> I wonder if it would be useful to add --enable-debug. I think I
> purposefully removed that, but I don't remember any details about it.

As usual, this stuff is horribly under-documented. I think it's possible that --enable-debug would help, since llvm-gcov requires it, but that doesn't seem particularly likely.

It's definitely generally recommended that "-O0" be used, so I think that we can agree that that was an improvement, even if it doesn't fix the remaining problem that I noticed when I rechecked nbtutils.c.

-- Peter Geoghegan
Re: Reducing the runtime of the core regression tests
On Fri, Apr 12, 2019 at 10:49 AM Peter Geoghegan wrote:
> It's definitely generally recommended that "-O0" be used, so I think
> that we can agree that that was an improvement, even if it doesn't fix
> the remaining problem that I noticed when I rechecked nbtutils.c.

I'm not sure that we can really assume that "-O0" avoids the behavior I pointed out. Perhaps this counts as "semantic flattening" or something, rather than an optimization. I could have easily written the code in _bt_keep_natts_fast() in the way gcov/gcc/whatever thinks I ought to have, which would have obscured the distinction anyway.

-- Peter Geoghegan
Re: Zedstore - compressed in-core columnar storage
On Mon, Apr 15, 2019 at 11:35 AM Tom Lane wrote:
> We do need to limit what we accept into core PG. I do not buy your
> argument that users expect everything to be in core. Or more accurately,
> the people who do think that way won't be using PG anyway --- they'll
> be using MSSQL because it comes from their OS vendor.

I am also concerned by the broad scope of ZedStore, and I tend to agree that it will be difficult to maintain in core. At the same time, I think that Andres and Robert are probably right about the difficulty of maintaining it outside of core -- that would be difficult to impossible as a practical matter.

Unfortunately, you're both right. I don't know where that leaves us.

-- Peter Geoghegan
Re: Zedstore - compressed in-core columnar storage
On Mon, Apr 15, 2019 at 9:16 AM Ashwin Agrawal wrote:
> Would like to know more specifics on this Peter. We may be having different
> context on hybrid row/column design.

I'm confused about how close your idea of a TID is to the traditional definition from heapam (and even zheap). If it's a purely logical identifier, then why would it have two components like a TID? Is that just a short-term convenience or something?

> Yes, the plan to optimize out TID space per datum, either by prefix
> compression or delta compression or some other trick.

It would be easier to do this if you knew for sure that the TID behaves almost the same as a bigserial column -- a purely logical monotonically increasing identifier. That's why I'm interested in what exactly you mean by TID, the stability of a TID value, etc. If a leaf page almost always stores a range covering no more than a few hundred contiguous logical values, you can justify aggressively compressing the representation in the B-Tree entries. Compression would still be based on prefix compression, but the representation itself can be specialized.

-- Peter Geoghegan
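[Editor's note: a toy Python sketch of the compression idea discussed above -- my own illustration, not ZedStore's actual on-disk format. If leaf-page TIDs are purely logical and monotonically increasing, a page can store one base value plus small per-entry deltas instead of a full 6-byte TID per entry.]

```python
def compress_tids(tids):
    """Delta-encode a sorted run of logical TIDs against the first value."""
    assert tids == sorted(tids)
    base = tids[0]
    return base, [t - base for t in tids]

def decompress_tids(base, deltas):
    return [base + d for d in deltas]

# A few hundred contiguous logical values compress to tiny deltas:
tids = list(range(1_000_000, 1_000_300))
base, deltas = compress_tids(tids)
assert max(deltas) < 512          # each delta fits in ~9 bits, not 48
assert decompress_tids(base, deltas) == tids
```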
Re: Zedstore - compressed in-core columnar storage
On Mon, Apr 15, 2019 at 12:19 PM Tom Lane wrote:
> Perhaps, but we won't know if we don't try. I think we should try,
> and be willing to add hooks and flexibility to core as needed to make
> it possible.

We could approach it without taking a firm position on inclusion in core until the project begins to mature. I have little faith in our ability to predict which approach will be the least painful at this early stage.

-- Peter Geoghegan
Re: Zedstore - compressed in-core columnar storage
On Mon, Apr 15, 2019 at 1:02 PM Andres Freund wrote:
> There's not much of an alternative currently. Indexes require tid
> looking things, and as a consequence (and some other comparatively small
> changes that'd be required) tableam does too.

I'm trying to establish whether or not that's the only reason. It might be okay to use the same item pointer struct as the representation of an integer-like logical identifier. Even if it isn't, I'm still interested in just how logical the TIDs are, because it's an important part of the overall design.

> That's one of the reasons why I've been trying to get you to get on
> board with allowing different leaf-level "item pointer equivalents"
> widths inside nbtree...

Getting me to agree that that would be nice and getting me to do the work are two very different things. ;-)

-- Peter Geoghegan
Re: Improve search for missing parent downlinks in amcheck
On Tue, Apr 16, 2019 at 12:00 PM Peter Geoghegan wrote:
> Can you be more specific? What was the cause of the corruption? I'm
> always very interested in hearing about cases that amcheck could have
> detected, but didn't.

FWIW, v4 indexes in Postgres 12 will support the new "rootdescend" verification option, which isn't lossy, and would certainly have detected your customer's issue in practice. Admittedly the new check is quite expensive, even compared to the other bt_index_parent_check() checks, but it is nice that we now have a verification option that is *extremely* thorough, and uses _bt_search() directly.

-- Peter Geoghegan
Re: Improve search for missing parent downlinks in amcheck
On Mon, Apr 15, 2019 at 7:30 PM Alexander Korotkov wrote:
> Currently we amcheck supports lossy checking for missing parent
> downlinks. It collects bitmap of downlink hashes and use it to check
> subsequent tree level. We've experienced some large corrupted indexes
> which pass this check due to its looseness.

Can you be more specific? What was the cause of the corruption? I'm always very interested in hearing about cases that amcheck could have detected, but didn't. Was the issue that the Bloom filter was simply undersized/ineffective?

> However, it seems to me we can implement this check in non-lossy
> manner without making it significantly slower. We anyway traverse
> downlinks from parent to children in order to verify that hikeys are
> corresponding to downlink keys.

Actually, we don't check the high keys in children against the parent (all other items are checked, though). We probably *should* do something with the high key when verifying consistency across levels, but currently we don't. (We only use the high key for the same-page high key check -- more on this below.)

> We can also traverse from one
> downlink to subsequent using rightlinks. So, if there are some
> intermediate pages between them, they are candidates to have missing
> parent downlinks. The patch is attached.
>
> With this patch amcheck could successfully detect corruption for our
> customer, which unpatched amcheck couldn't find.

Maybe we can be a lot less conservative about sizing the Bloom filter instead? That would be an easier fix IMV -- we can probably change that logic to be a lot more aggressive without anybody noticing, since the Bloom filter is already usually tiny. We are already not very careful about saving work within bt_index_parent_check(), but with this patch we follow each downlink twice instead of once. That seems wasteful.
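[Editor's note: a rough sketch of the sizing trade-off, using the standard Bloom filter false-positive formula rather than amcheck's actual sizing code. The false-positive rate of a filter with m bits, k hash functions, and n fingerprinted items is approximately (1 - e^(-kn/m))^k, so doubling the bits per element makes missed corruption far less likely at very little memory cost.]

```python
import math

def bloom_false_positive_rate(m_bits, k_hashes, n_items):
    """Approximate false-positive rate of a standard Bloom filter."""
    return (1.0 - math.exp(-k_hashes * n_items / m_bits)) ** k_hashes

n = 100_000  # downlinks to fingerprint
tight = bloom_false_positive_rate(m_bits=10 * n, k_hashes=7, n_items=n)
roomy = bloom_false_positive_rate(m_bits=20 * n, k_hashes=14, n_items=n)
assert tight < 0.01    # ~1% misses at 10 bits/element
assert roomy < 0.0001  # far fewer misses at 20 bits/element, only 2x memory
```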
The reason I used a Bloom filter here is because I would eventually like the missing downlink check to fingerprint entire tuples, not just block numbers. In Lehman & Yao terms, the check could in the future fingerprint the separator key and the downlink at the same time, not just the downlink. That way, we could probe the Bloom filter on the next level down for its high key (with the right sibling pointer set to be consistent with the parent) iff we don't see that the page split was interrupted (i.e. iff the P_INCOMPLETE_SPLIT() bit is not set). Obviously this would be a more effective form of verification, since we would notice high key values that don't agree with the parent's values for the same sibling/cousin/child block.

I didn't do it that way for v11 because of "minus infinity" items on internal pages, which don't store the original key (the key remains the high key of the left sibling page, but we truncate the original to 0 attributes in _bt_pgaddtup()). I think that we should eventually stop truncating minus infinity items, and actually store a "low key" at P_FIRSTDATAKEY() within internal pages instead. That would be useful for other things anyway (e.g. prefix compression).

-- Peter Geoghegan
Re: New vacuum option to do only freezing
On Wed, Apr 17, 2019 at 10:46 AM Tom Lane wrote:
> Yeah, if we wanted to expose these complications more directly, we
> could think about adding or changing the main counters. I was wondering
> about whether we should have it report "x bytes and y line pointers
> freed", rather than counting tuples per se.

I like that idea, but I'm pretty sure that there are very few users that are aware of these distinctions at all. It would be a good idea to clearly document them.

-- Peter Geoghegan
Pathological performance when inserting many NULLs into a unique index
I thought of a case that results in pathological performance due to a behavior of my nbtree patch series:

regression=# create table uniquenulls(nully int4, constraint pp unique(nully));
CREATE TABLE
Time: 10.694 ms
regression=# insert into uniquenulls select i from generate_series(1, 1e6) i;
INSERT 0 1000000
Time: 1356.025 ms (00:01.356)
regression=# insert into uniquenulls select null from generate_series(1, 1e6) i;
INSERT 0 1000000
Time: 270834.196 ms (04:30.834)

The issue here is that the duration of the second INSERT statement is wildly excessive, because _bt_stepright() needs to step right many, many times for each tuple inserted. I would expect the second insert to take approximately as long as the first, but it takes ~200x longer here. It could take much longer still if we pushed this example further. What we see here is a limited case in which the O(n ^ 2) performance that "getting tired" used to prevent can occur. Clearly that's totally unacceptable. Mea culpa.

Sure enough, the problem goes away when the index isn't a unique index (i.e. in the UNIQUE_CHECK_NO case):

regression=# alter table uniquenulls drop constraint pp;
ALTER TABLE
Time: 28.968 ms
regression=# create index on uniquenulls (nully);
CREATE INDEX
Time: 1159.958 ms (00:01.160)
regression=# insert into uniquenulls select null from generate_series(1, 1e6) i;
INSERT 0 1000000
Time: 1155.708 ms (00:01.156)

Tentatively, I think that the fix here is to not set "itup_key->scantid = NULL" when a NULL value is involved (i.e. don't reset scantid when we see IndexTupleHasNulls(itup) within _bt_doinsert()). We may also want to avoid calling _bt_check_unique() entirely in this situation. That way, the performance should be the same as the UNIQUE_CHECK_NO case: we descend to the appropriate leaf page directly, and then we're done.
We don't have to find the appropriate leaf page by groveling through indefinitely-many existing leaf pages that are full of NULL values, because we know that there cannot ever be a unique violation for us to detect.

I'll create an open item for this, and begin work on a patch tomorrow.

-- Peter Geoghegan
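[Editor's note: a sketch, in Python rather than the actual nbtree C, of the reasoning behind skipping unique enforcement for NULLs. The function name here is mine, not PostgreSQL's. Per SQL semantics, NULL compares as neither equal nor unequal to anything, so a tuple containing a NULL key value can never cause a unique violation; the insert can keep scantid set and descend straight to its leaf page, as in the UNIQUE_CHECK_NO case.]

```python
def needs_unique_check(is_unique_index, index_tuple_has_nulls):
    """Decide whether _bt_check_unique-style work is required at all."""
    if not is_unique_index:
        return False
    if index_tuple_has_nulls:
        # NULL != NULL per SQL semantics: no duplicate is possible.
        return False
    return True

assert needs_unique_check(True, False) is True
assert needs_unique_check(True, True) is False   # the pathological case
assert needs_unique_check(False, True) is False
```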
Re: Zedstore - compressed in-core columnar storage
On Thu, Apr 11, 2019 at 6:06 AM Rafia Sabih wrote:
> Reading about it reminds me of this work -- TAG column storage(
> http://www09.sigmod.org/sigmod/record/issues/0703/03.article-graefe.pdf ).
> Isn't this storage system inspired from there, with TID as the TAG?
>
> It is not referenced here so made me wonder.

I don't think they're particularly similar, because that paper describes an architecture based on using purely logical row identifiers, which is not what a TID is. TID is a hybrid physical/logical identifier, sometimes called a "physiological" identifier, which will have significant overhead. Ashwin said that ZedStore TIDs are logical identifiers, but I don't see how that's compatible with a hybrid row/column design (unless you map heap TID to logical row identifier using a separate B-Tree).

The big idea with Graefe's TAG design is that there is practically no storage overhead for these logical identifiers, because each entry's identifier is calculated by adding its slot number to the page's tag/low key. The ZedStore design, in contrast, explicitly stores TID for every entry. ZedStore seems more flexible for that reason, but at the same time the per-datum overhead seems very high to me. Maybe prefix compression could help here; that's something that a low key and high key can support rather well.

-- Peter Geoghegan
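[Editor's note: a toy illustration of the storage contrast drawn above -- hypothetical layouts, not either system's real format. TAG-style identifiers are computed from the page low key plus slot number, so they cost no per-entry storage; an explicit-TID layout pays for an identifier with every entry.]

```python
def tag_identifier(page_low_key, slot_number):
    # Implicit identifier: computed on demand, never stored.
    return page_low_key + slot_number

page_low_key = 40_000
entries = ["a", "b", "c", "d"]
ids = [tag_identifier(page_low_key, slot) for slot in range(len(entries))]
assert ids == [40_000, 40_001, 40_002, 40_003]

# Explicit storage, by contrast, pays (say) 6 bytes per entry:
explicit_overhead_bytes = 6 * len(entries)
assert explicit_overhead_bytes == 24  # vs. 0 bytes for the computed scheme
```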
Re: Reducing the runtime of the core regression tests
On Thu, Apr 11, 2019 at 9:55 AM Tom Lane wrote:
> So I concur that indexing.sql's fastpath test
> isn't adding anything useful coverage-wise, and will just nuke it.

Good.

> (It'd be interesting perhaps to check whether the results shown
> by coverage.postgresql.org are similarly unstable. They might be
> less so, since I believe those are taken over the whole check-world
> suite not just the core regression tests.)

I'm almost certain that they're at least slightly unstable. I mostly find the report useful because it shows whether or not something gets hit at all. I don't trust it to be very accurate.

I've noticed that the coverage reported on coverage.postgresql.org sometimes looks contradictory, which can happen due to compiler optimizations. I wonder if that could be addressed in some way, because I find the site to be a useful resource. I would at least like to know the settings used by its builds.

-- Peter Geoghegan
Re: Reducing the runtime of the core regression tests
On Thu, Apr 11, 2019 at 11:00 AM Alvaro Herrera wrote:
> ./configure --enable-depend --enable-coverage --enable-tap-tests --enable-nls
> --with-python --with-perl --with-tcl --with-openssl --with-libxml --with-ldap
> --with-pam >> $LOG 2>&1
>
> make -j4 >> $LOG 2>&1
> make -j4 -C contrib >> $LOG 2>&1
> make check-world PG_TEST_EXTRA="ssl ldap" >> $LOG 2>&1
> make coverage-html >> $LOG 2>&1
>
> There are no environment variables that would affect it.

Could we add "CFLAGS=-O0"? This should prevent the kind of misleading, optimization-affected line counting described here:

https://gcc.gnu.org/onlinedocs/gcc/Gcov-and-Optimization.html#Gcov-and-Optimization

I think that it makes sense to prioritize making it clear which exact lines were executed in terms of the semantics of C. I might prefer to have optimizations enabled if I was optimizing my code, but that's not what the web resource is for, really.

Thanks

-- Peter Geoghegan
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On Tue, Mar 12, 2019 at 11:40 AM Robert Haas wrote:
> I think it's pretty clear that we have to view that as acceptable. I
> mean, we could reduce contention even further by finding a way to make
> indexes 40% larger, but I think it's clear that nobody wants that.

I found this analysis of bloat in the production database of Gitlab in January 2019 fascinating:

https://about.gitlab.com/handbook/engineering/infrastructure/blueprint/201901-postgres-bloat/

They determined that their tables consisted of about 2% bloat, whereas indexes were 51% bloat (determined by running VACUUM FULL, and measuring how much smaller indexes and tables were afterwards). That in itself may not be that telling. What is telling is that index bloat disproportionately affects certain kinds of indexes. As they put it, "Indexes that do not serve a primary key constraint make up 95% of the overall index bloat". In other words, the vast majority of all bloat occurs within non-unique indexes, with most of the remaining bloat in unique indexes.

One factor that could be relevant is that unique indexes get a lot more opportunistic LP_DEAD killing. Unique indexes don't rely on the similar-but-distinct kill_prior_tuple optimization -- a lot more tuples can be killed within _bt_check_unique() than with kill_prior_tuple in realistic cases. That's probably not really that big a factor, though. It seems almost certain that "getting tired" is the single biggest problem.

The blog post drills down further, and cites the examples of several *extremely* bloated indexes on a single column, which is obviously low cardinality. This includes an index on a boolean field, and an index on an enum-like text field. In my experience, having many indexes like that is very common in real world applications, though not at all common in popular benchmarks (with the exception of TPC-E).
It also looks like they may benefit from the "split after new item" optimization, at least among the few unique indexes that were very bloated, such as merge_requests_pkey:

https://gitlab.com/snippets/1812014

Gitlab is open source, so it should be possible to confirm my theory about the "split after new item" optimization (I am certain about "getting tired", though). Their schema is defined here:

https://gitlab.com/gitlab-org/gitlab-ce/blob/master/db/schema.rb

I don't have time to confirm all this right now, but I am pretty confident that they have both problems. And almost as confident that they'd notice substantial benefits from this patch series.

-- Peter Geoghegan
Re: VACUUM can finish an interrupted nbtree page split -- is that okay?
On Mon, May 13, 2019 at 9:09 PM Peter Geoghegan wrote:
> Even when that happens, the index is already considered corrupt by
> VACUUM, so the same VACUUM process that could in theory be adversely
> affected by removing the half-dead internal page check will complain
> about the page when it gets to it later on -- the user will be told to
> REINDEX. And even then, we will never actually get to apply the check
> that I propose to remove, since we're already checking the leaf page
> sibling of the leaf-level target -- the leaf-level test that was added
> by efada2b8e92 was clearly necessary. But it was also sufficient (no
> equivalent internal half-dead right sibling test is needed): a 9.3-era
> half-dead internal page cannot have more than one child, which must be
> undergoing deletion as well.

Actually, now that I look back at how page deletion worked 5+ years ago, I realize that I have this slightly wrong: the leaf level check is not sufficient to figure out if the parent's right sibling is pending deletion (which is represented explicitly as a half-dead internal page prior to 9.4). All the same, I'm going to push ahead with this patch. Bugfix commit efada2b8e92 was always about a bug in 9.4 -- it had nothing to do with 9.3.

And, in the unlikely event that there is a problem on a pg_upgrade'd 9.3 -> 12 database that happens to have half-dead internal pages, we'll still get a useful, correct error from VACUUM one way or another. It might be slightly less friendly as error messages about corruption go, but that seems acceptable to me.

-- Peter Geoghegan
Re: VACUUM can finish an interrupted nbtree page split -- is that okay?
On Thu, May 16, 2019 at 1:05 PM Peter Geoghegan wrote:
> Actually, now that I look back at how page deletion worked 5+ years
> ago, I realize that I have this slightly wrong: the leaf level check
> is not sufficient to figure out if the parent's right sibling is
> pending deletion (which is represented explicitly as a half-dead
> internal page prior to 9.4). All the same, I'm going to push ahead
> with this patch. Bugfix commit efada2b8e92 was always about a bug in
> 9.4 -- it had nothing to do with 9.3.

I meant bugfix commit 8da31837803 (commit efada2b8e92 was the commit that had the bug in question).

-- Peter Geoghegan
Re: Suppressing noise in successful check-world runs
On Wed, May 22, 2019 at 3:57 PM Tom Lane wrote:
> I experimented with the attached quick-hack patch to make pg_regress
> suppress notices from its various initial DROP/CREATE IF [NOT] EXISTS
> commands. I'm not entirely convinced whether suppressing them is
> a good idea though. Perhaps some hack with effects confined to
> pg_upgrade's test would be better. I don't have a good idea what
> that would look like, however.
>
> Or we could just say this isn't annoying enough to fix.

I think it's worth fixing.

-- Peter Geoghegan
Re: Suppressing noise in successful check-world runs
On Fri, May 24, 2019 at 4:18 PM Tom Lane wrote:
> Yes, I see that too with sufficiently high -j. I believe this is
> what Noah was trying to fix in bd1592e85, but that patch evidently
> needs a bit more work :-(

It would be nice if this was fixed, but I don't see a problem when I use the optimum number of jobs, so I don't consider it to be urgent. I'm happy with the new approach, since it avoids the problem of regression.diffs files that get deleted before I have a chance to take a look. I should thank Noah for his work on this.

-- Peter Geoghegan
Re: Suppressing noise in successful check-world runs
On Fri, May 24, 2019 at 12:31 PM Peter Geoghegan wrote:
> On Wed, May 22, 2019 at 3:57 PM Tom Lane wrote:
> > I experimented with the attached quick-hack patch to make pg_regress
> > suppress notices from its various initial DROP/CREATE IF [NOT] EXISTS
> > commands. I'm not entirely convinced whether suppressing them is
> > a good idea though. Perhaps some hack with effects confined to
> > pg_upgrade's test would be better. I don't have a good idea what
> > that would look like, however.
> >
> > Or we could just say this isn't annoying enough to fix.
>
> I think it's worth fixing.

My development machine has 8 logical cores, and like you I only see the NOTICE from pg_upgrade's tests with "-j10":

pg@bat:/code/postgresql/patch/build$ time make check-world -j10 >/dev/null
NOTICE:  database "regression" does not exist, skipping
make check-world -j10 > /dev/null  86.40s user 34.10s system 140% cpu 1:25.94 total

However, I see something else with "-j16", even after a precautionary clean + rebuild:

pg@bat:/code/postgresql/patch/build$ time make check-world -j16 >/dev/null
NOTICE:  database "regression" does not exist, skipping
pg_regress: could not open file "/code/postgresql/patch/build/src/test/regress/regression.diffs" for reading: No such file or directory
make check-world -j16 > /dev/null  96.35s user 37.45s system 152% cpu 1:27.49 total

I suppose this might be because of a pg_regress/make file "regression.diffs" race. This is also a problem for my current workflow for running "make check-world" in parallel [1], though only when there is definitely a regression.diffs file with actual regressions -- there is no regression that I'm missing here, and as far as I know this output about "regression.diffs" is just more noise.

I had intended to figure out a way of keeping "regression.diffs" with my existing workflow, since losing the details of a test failure is a real annoyance. Especially when there is a test that doesn't fail reliably.

[1] https://wiki.postgresql.org/wiki/Committing_checklist#Basic_checks

-- Peter Geoghegan
Re: Sort support for macaddr8
On Tue, Jun 4, 2019 at 2:55 PM Andres Freund wrote:
> Yea, I don't immediately see a problem with doing that on a major
> version boundary. Obviously that'd only be possible for sizeof(Datum) ==
> 8 == sizeof(macaddr8) platforms, but that's the vast majority these
> days. And generally, I think it's just about *always* worth to go for a
> pass-by-value for the cases where that doesn't imply space wastage.

It would be faster to do it that way, I think. You would need a more complicated comparator than a classic abbreviated comparator (i.e. a 3-way unsigned int comparator) that way, but it would very probably be faster on balance. I'm glad to hear that it isn't *obviously* a problem from a compatibility perspective -- I really wasn't sure about that, since retrofitting a type to be pass-by-value like this is something that may never have been attempted before now (at least not since we started to care about pg_upgrade).

> I think it might be worthwhile to optimize things so that all typlen > 0
> && typlen <= sizeof(Datum) are allowed for byval datums.
>
> SELECT typname, typlen FROM pg_type WHERE typlen > 0 AND typlen <= 8 AND NOT typbyval;
> ┌──────────┬────────┐
> │ typname  │ typlen │
> ├──────────┼────────┤
> │ tid      │      6 │
> │ macaddr  │      6 │
> │ macaddr8 │      8 │
> └──────────┴────────┘
> (3 rows)

This is half the reason why I ended up implementing itemptr_encode() to accelerate the TID sort used by CREATE INDEX CONCURRENTLY some years back -- TID is 6 bytes wide, which doesn't have the necessary macro support within postgres.h. There is no reason why that couldn't be added for the benefit of both TID and macaddr types, though it probably wouldn't be worth it.

And, as long as we're not going to those lengths, there may be some value in keeping the macaddr8 code in line with macaddr code -- the two types are currently almost the same (the glaring difference is the lack of macaddr8 sort support). We'll need to draw the line somewhere, and that is likely to be a bit arbitrary. This was what I meant by "weird".
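[Editor's note: the encoding idea behind itemptr_encode() can be sketched like this -- a Python translation of the concept, not PostgreSQL's C implementation. Packing a 6-byte TID (4-byte block number, 2-byte offset) into one integer means ordinary integer comparison gives the same order as comparing the two fields.]

```python
def itemptr_encode(block, offset):
    """Pack (block, offset) into one integer that sorts in TID order."""
    assert 0 <= block < 2**32 and 0 <= offset < 2**16
    return (block << 16) | offset

def itemptr_decode(encoded):
    return encoded >> 16, encoded & 0xFFFF

a = itemptr_encode(7, 3)
b = itemptr_encode(7, 4)
c = itemptr_encode(8, 1)
assert a < b < c                    # (block, offset) order is preserved
assert itemptr_decode(c) == (8, 1)  # round-trips losslessly
```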
-- Peter Geoghegan
Re: Sort support for macaddr8
On Tue, Jun 4, 2019 at 3:23 PM Andres Freund wrote:
> > This is half the reason why I ended up implementing itemptr_encode()
> > to accelerate the TID sort used by CREATE INDEX CONCURRENTLY some
> > years back -- TID is 6 bytes wide, which doesn't have the necessary
> > macro support within postgres.h. There is no reason why that couldn't
> > be added for the benefit of both TID and macaddr types, though it
> > probably wouldn't be worth it.
>
> I think we should definitely do that. It seems not unlikely that other
> people want to write new fixed width types, and we shouldn't have them
> deal with all this complexity unnecessarily.

On second thought, maybe there is something to be said for being exhaustive here. It seems like there is a preference for making macaddr8 pass-by-value instead of adding abbreviated keys support to macaddr8, and possibly doing the same with the original macaddr type.

Do you think that you'll be able to work on the project with this expanded scope, Melanie?

-- Peter Geoghegan
Re: Sort support for macaddr8
On Mon, Jun 3, 2019 at 2:59 PM Chapman Flack wrote:
> 1. (This one seems like a bug.) In the little-endian case, if
>    SIZEOF_DATUM is smaller than the type, I'm not convinced by doing
>    the DatumBigEndianToNative() after the memcpy(). Seems like that's
>    too late to make sure the most-significant bytes got copied.

Uh, when else would you do it? Before the memcpy()?

> 2. (This one seems like an API opportunity.) If it becomes common to
>    add abbreviation support for smallish types such that (as here,
>    when SIZEOF_DATUM >= 8), an abbreviated "equality" result is in fact
>    authoritative, would it be worthwhile to have some way for the sort
>    support routine to announce that fact to the caller? That could
>    spare the caller the effort of re-checking with the authoritative
>    routine.

It's possible that that would make sense, but I don't think that this patch needs to do that. There is at least one pre-existing case that does this -- macaddr.

-- Peter Geoghegan
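[Editor's note: the byte-order point under discussion can be demonstrated with a small sketch -- my own illustration of the abbreviated-key invariant, not the patch's code. An abbreviated key is the first few bytes of the value packed into a Datum, and unsigned-integer comparison of that Datum must agree with bytewise (memcmp) comparison; interpreting the bytes big-endian preserves that order, while little-endian does not, which is why a byte swap like DatumBigEndianToNative() is needed on little-endian machines.]

```python
import struct

def abbrev_big_endian(data8):
    return struct.unpack(">Q", data8)[0]   # order-preserving

def abbrev_little_endian(data8):
    return struct.unpack("<Q", data8)[0]   # breaks memcmp order

lo = bytes([0x00, 0x01] + [0xFF] * 6)      # memcmp-smaller value
hi = bytes([0x00, 0x02] + [0x00] * 6)      # memcmp-larger value
assert lo < hi                             # bytewise order
assert abbrev_big_endian(lo) < abbrev_big_endian(hi)             # preserved
assert not (abbrev_little_endian(lo) < abbrev_little_endian(hi)) # broken
```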
Re: Sort support for macaddr8
On Mon, Jun 3, 2019 at 1:17 PM Melanie Plageman wrote:
> I tried checking to see if there is a performance difference using the
> attached DDL based on src/test/regress/sql/macaddr8.sql. I found
> that the sort function is only exercised when creating an index (not,
> for example, when doing some type of aggregation).

As you know, it's a bit weird that we're proposing adding sort support with abbreviated keys for a type that is 8 bytes, since you'd expect it to also be pass-by-value on most platforms, which largely defeats the purpose of having abbreviated keys (though sort support could still make sense, for the same reason it makes sense to have it for int8). However, macaddr8 isn't actually pass-by-value, and it seems too late to do anything about that now, so abbreviated keys actually make sense.

> With the patch applied to current master and using the DDL attached,
> the timing for creating the index hovered around 20 ms for master and
> 15 ms for the patched version.

I would expect a sufficiently large sort to execute in about half the time with abbreviation, based on previous experience. However, I think that this patch can be justified in a relatively straightforward way. It extends sort support for macaddr to macaddr8, since these two types are almost identical in every other way. We should still validate the performance out of an abundance of caution, but I would be very surprised if there was much difference between the macaddr and macaddr8 cases. In short, users should not be surprised by the big gap in performance between macaddr and macaddr8. It's worth being consistent there.

> I think that that seems like an improvement. I was thinking of
> registering this patch for the next commitfest. Is that okay?

Definitely, yes.

> I was just wondering what the accepted way to test and share
> performance numbers is for a small patch like this. Is sharing DDL
> enough? Do I need to use pg_bench?

I've always used custom microbenchmarks for stuff like this.
Repeatedly executing a particular query and taking the median execution time as representative seems to be the best approach.

-- Peter Geoghegan
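[Editor's note: a minimal harness in that spirit -- names and structure are mine, and a real benchmark would execute SQL against a server (e.g. via psql or psycopg2) rather than a stand-in function. The point is simply to run the same operation several times and report the median, which is far less noisy than a single run or the mean.]

```python
import statistics
import time

def median_runtime(run_once, trials=9):
    """Run a workload `trials` times and return the median wall-clock time."""
    timings = []
    for _ in range(trials):
        start = time.perf_counter()
        run_once()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)

# Stand-in for e.g. executing CREATE INDEX through a database connection:
workload = lambda: sorted(range(100_000), key=lambda x: -x)
t = median_runtime(workload, trials=5)
assert t > 0.0
```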
Re: Sort support for macaddr8
On Mon, Jun 3, 2019 at 2:03 PM Chapman Flack wrote:
> Am I going cross-eyed, or would the memset be serving more of a purpose
> if it were in the SIZEOF_DATUM != 8 branch?

No, it wouldn't -- that's the correct place for it with the macaddr type. However, it isn't actually necessary to memset() at the equivalent point for macaddr8, since we cannot "run out of bytes from the authoritative representation" that go in the Datum/abbreviated key. I suppose that the memset() should simply be removed, since it is superfluous here.

-- Peter Geoghegan
unused_oids script should advertise reserved OID range
The unused_oids script has gone from being something of interest to everybody that wants to write a patch that creates a new catalog entry, to something that patch authors could do without in many cases. I think that its output should prominently advertise that patches should use random OIDs in the range 8000 - 9999. Commit a6417078c4140e51cfd717448430f274b449d687 established that this should be standard practice for patch authors.

Actually, maybe it should even suggest a *particular* random OID in that range, so that the choice of OID is reliably random -- why even require patch authors to pick a number at random?

It also looks like pg_proc.dat should be updated, since it still mentions the old custom of trying to use contiguous OIDs. It also discourages the practice of picking OIDs at random, which is almost the opposite of what it should say.

-- Peter Geoghegan
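[Editor's note: the "suggest a particular random OID" idea could look something like this sketch -- hypothetical Python, not the actual unused_oids Perl script. It picks one concrete unused OID from the development range at random, so the author never chooses.]

```python
import random

RESERVED_LOW, RESERVED_HIGH = 8000, 9999  # range reserved for patches

def suggest_patch_oid(used_oids, rng=random):
    """Return one random unused OID from the reserved development range."""
    candidates = [o for o in range(RESERVED_LOW, RESERVED_HIGH + 1)
                  if o not in used_oids]
    return rng.choice(candidates)

used = {8000, 8001, 9999}
oid = suggest_patch_oid(used)
assert RESERVED_LOW <= oid <= RESERVED_HIGH
assert oid not in used
```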
Re: crash testing suggestions for 12 beta 1
On Wed, Jun 5, 2019 at 2:11 PM Alvaro Herrera wrote:
> REINDEX CONCURRENTLY would be one good area to focus on, I think, as
> well as ALTER TABLE ATTACH PARTITION. Maybe also INCLUDE columns in
> GiST, and the stuff in commits 9155580fd, fe280694d and 7df159a62.

Those all seem like good things to target.

Forgive me for droning on about amcheck once more, but maybe it'll help: amcheck has the capability to detect at least two historic bugs in CREATE INDEX CONCURRENTLY that made it into stable releases. The "heapallindexed" verification option's bt_tuple_present_callback() function has a header comment that talks about this. In short, any "unhandled" broken hot chain (i.e. broken hot chains that are somehow not correctly detected and handled) should be reported as corrupt by amcheck with the "heapallindexed" check, provided the tuple is visible to verification's heap scan.

The CREATE INDEX CONCURRENTLY bug that Pavan found a couple of years back while testing the WARM patch is one example. A bug that was fallout from the DROP INDEX CONCURRENTLY work is another historic example. Alvaro will recall that this same check had a role in the "freeze the dead" business.

-- Peter Geoghegan
Re: Top-N sorts in EXPLAIN, row count estimates, and parallelism
On Thu, May 23, 2019 at 7:48 PM David Rowley wrote:
> > cost_sort() costs sorts as top-N heapsorts. While we cannot make an
> > iron-clad guarantee that it will work out that way from within
> > tuplesort.c, that doesn't seem like it closes off the possibility of
> > more informative EXPLAIN output. For example, can't we at least report
> > that the tuplesort will be "bounded" within EXPLAIN, indicating that we
> > intend to attempt to sort using a top-N heap sort (i.e. we'll
> > definitely do it that way if there is sufficient work_mem)?
>
> I think this really needs more of a concrete proposal. Remember
> LIMIT/OFFSET don't need to be constants, they could be a Param or some
> return value from a subquery, so the bound might not be known until
> after executor startup, to which EXPLAIN is not going to get to know
> about that.

I was merely pointing out that it is clear when a sort *could* be a top-n sort, which could be exposed by EXPLAIN without anyone feeling misled.

> After that, what would we do with it in EXPLAIN? Always show "Bound:
> ", if it's not -1?

I'm not sure. The distinction between a top-n sort and any other sort is an important one (it's certainly more important than the distinction between an internal and external sort), so it's worth being flexible in order to expose more information in EXPLAIN output. I would be willing to accept some kind of qualified or hedged description in the EXPLAIN output for a bounded sort node, even though that approach doesn't seem desirable in general.

-- Peter Geoghegan
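[Editor's note: the difference between a full sort and a bounded top-N heapsort can be illustrated like this -- illustrative Python, not the executor's C implementation. With LIMIT n, only an n-entry heap is maintained over the whole input, rather than sorting everything.]

```python
import heapq
import random

def top_n_sort(rows, bound):
    """Equivalent to ORDER BY x LIMIT bound, via a bound-entry heap."""
    return heapq.nsmallest(bound, rows)

rows = list(range(1_000_000))
random.shuffle(rows)
assert top_n_sort(rows, 5) == [0, 1, 2, 3, 4]
assert top_n_sort(rows, 5) == sorted(rows)[:5]  # same answer, far less work
```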
Re: nbtdesc.c and nbtpage.c are inconsistent with XLOG_BTREE_META_CLEANUP (11~)
On Sun, Jun 16, 2019 at 6:31 PM Michael Paquier wrote: > I think that we could just patch nbtpage.c so as we fetch the > metadata using XLogRecGetBlockData(), and log its data. Don't you mean nbtdesc.c? > The error > comes from 3d92796, which is one year-old and has been visibly > committed untested. I am surprised that we have not seen that > complain yet. Why is that surprising? https://coverage.postgresql.org/src/backend/access/rmgrdesc/index.html -- Peter Geoghegan
Re: nbtdesc.c and nbtpage.c are inconsistent with XLOG_BTREE_META_CLEANUP (11~)
On Sun, Jun 16, 2019 at 7:05 PM Michael Paquier wrote: > I would have supposed that more people scan WAL records using the > description callbacks. The WAL record in question, XLOG_BTREE_META_CLEANUP, is certainly one of the less common record types used by nbtree. I agree that this should have been tested when it went in, but I'm not surprised that the bug remained undetected for a year. Not that many people use pg_waldump. -- Peter Geoghegan
Re: Status of the table access method work
On Tue, Jun 11, 2019 at 8:59 AM Robert Haas wrote: > I don't think that including visibility information in indexes is a > bad idea; we've thought about making zheap do this someday. But I > think that we need to use some more sophisticated approach involving, > maybe, undo pointers, or some other kind of magic, rather than just > widening the tuples. I expect that just widening the tuples would be > good enough to win for some use cases, but I think there would be > others that lose heavily. +1. Limited visibility information would make sense (e.g. maybe a per-tuple all-visible bit), which would have to be backed by something like UNDO, but storing XIDs in tuples seems like a very bad idea. The idea that something like this would have to be usable by any possible table access method seems unworkable in general. Sometimes it seems like the table access method work could use some specific non-goals. Perhaps I just haven't been paying enough attention to have noticed them. -- Peter Geoghegan
Re: ANALYZE: ERROR: tuple already updated by self
On Tue, Jun 18, 2019 at 4:49 PM Justin Pryzby wrote: > Sure: > > (gdb) bt > #0 errfinish (dummy=0) at elog.c:414 > #1 0x0085e834 in elog_finish (elevel=, > fmt=) at elog.c:1376 > #2 0x004b93bd in simple_heap_update (relation=0x7fee161700c8, > otid=0x1fb7f44, tup=0x1fb7f40) at heapam.c:4613 > #3 0x0051bdb7 in CatalogTupleUpdate (heapRel=0x7fee161700c8, > otid=0x1fb7f44, tup=0x1fb7f40) at indexing.c:234 It might be interesting to set a breakpoint within heap_update(), which is called by simple_heap_update() --technically, this is where the reported failure occurs. From there, you could send an image of the page to the list by following the procedure described here: https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD#Dumping_a_page_image_from_within_GDB You'll have to hit "next" a few times, until heap_update()'s "page" variable is initialized. -- Peter Geoghegan
Re: ANALYZE: ERROR: tuple already updated by self
On Tue, Jun 18, 2019 at 5:09 PM Andres Freund wrote: > > It might be interesting to set a breakpoint within heap_update(), > > which is called by simple_heap_update() --technically, this is where > > the reported failure occurs. From there, you could send an image of > > the page to the list by following the procedure described here: > Hm, what are you hoping to glean by doing so? Nothing in particular. I see no reason to assume that we know what that looks like, though. It's easy to check. -- Peter Geoghegan
Re: [PATCH] Incremental sort (was: PoC: Partial sort)
On Tue, Jun 25, 2019 at 9:53 AM James Coleman wrote: > Given the patch contents I don't see any obviously reason why either > of those possibilities (calling comparetup_heap less often, or it's > cheaper) are likely. Is that something we should investigate further? > Or is it just a nice happy accident that we should ignore since it's > dataset specific? Have you actually confirmed that comparetup_heap() is called less often, by instrumenting the number of individual calls specifically? If there has been a reduction in calls to comparetup_heap(), then that's weird, and needs to be explained. If it turns out that it isn't actually called less often, then I would suggest that the speedup might be related to memory fragmentation, which often matters a lot within tuplesort.c. (This is why external sort merging now uses big buffers, and double buffering.) -- Peter Geoghegan
Re: [PATCH] Incremental sort (was: PoC: Partial sort)
On Tue, Jun 25, 2019 at 11:03 AM James Coleman wrote: > No, I haven't confirmed that it's called less frequently, and I'd be > extremely surprised if it were given the diff doesn't suggest any > changes to that at all. I must have misunderstood, then. I thought that you were suggesting that that might have happened. > If you think it's important enough to do so, I can instrument it to > confirm, but I was mostly wanting to know if there were any other > plausible explanations, and I think you've provided one: there *are* > changes in the patch to memory contexts in tuplesort.c, so if memory > fragmentation is a real concern this patch could definitely notice > changes in that regard. Sounds like it's probably fragmentation. That's generally hard to measure. -- Peter Geoghegan
Re: REINDEX locking
On Thu, Jun 13, 2019 at 1:04 PM Robert Haas wrote: > Typing "COMMIT;" or "ROLLBACK;" in S1 unblocks the reindex and it > succeeds, but otherwise it doesn't, contrary to the claim that a > regular REINDEX does not block reads. The reason for this seems to be > that the REINDEX acquires AccessExclusiveLock on all of the indexes of > the table, and a SELECT acquires AccessShareLock on all indexes of the > table (even if the particular plan at issue does not use them); e.g. > in this case the plan is a Seq Scan. REINDEX acquires only ShareLock > on the table itself, but this apparently does nobody wanting to run a > query any good. > > Is it supposed to work this way? Am I confused? I've always thought that this framing was very user-hostile. Theoretically, REINDEX doesn't have to block reads (e.g. it won't with prepared statements when various conditions are met), but in practice the behavior isn't meaningfully different from blocking reads. -- Peter Geoghegan
Re: PG 12 draft release notes
On Wed, Jun 12, 2019 at 1:16 PM Bruce Momjian wrote: > First, my apologies in getting to this so late. Peter Geoghegan > supplied me with slides and a video to study, and I now understand how > complex the btree improvements are. Here is a video of Peter's > presentation at PGCon: > > https://www.youtube.com/watch?v=p5RaATILoiE Thank you for going to the trouble of researching the B-Tree stuff in detail! I wouldn't ask that of anybody in the position of writing release notes, so it's appreciated. It is awkward to take the work that I've done and make it into multiple bullet points; I have a hard time with that myself. > The over-arching improvement to btree in PG 12 is the ordering of index > entries by tid so all entries are unique. Right. Everything good happens as a direct or indirect result of the TID-as-column thing. That is really the kernel of the whole thing, because it means that the implementation now *fully* follows the Lehman and Yao design. > 1. Since all index tuples are ordered, you can move from one leaf page > to the next without keeping a lock on the internal page that references > it, increasing concurrency. I'm not sure what you mean here. We've never had to keep a lock on an internal page while holding a lock on a leaf page -- such "lock coupling" was necessary in earlier B-Tree designs, but Lehman and Yao's algorithm avoids that. Of course, that isn't new. I think that you're talking about the way that we now check the high key during index scans, and find that we don't have to move to the next leaf page, per this commit: https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=29b64d1de7c77ffb5cb10696693e6ed8a6fc481c All of the suffix truncation stuff is good because it makes separator keys in internal pages smaller, but it's also good because the separator keys are simultaneously more "discriminating". 
We tend to get a nice "clean break" between leaf pages, so checking the high key before moving right to find additional matches (once we've already returned some tuples to the executor) is surprisingly effective. It would have been possible to add this optimization even without the suffix truncation stuff, but truncation makes it effective. If you had to cut one thing from this list, then I would suggest that it be this item. It's nice, but it's also very obvious, which makes it hard to explain. I mean, why should there be any ambiguity at all? Unless we have to return *hundreds* of items to the index scan, then a simple "select * from foo where bar = ?" style query should only have to visit one leaf page, even when the constant happens to be on the boundary of a leaf page (maybe a concurrent page split could make this happen, but that should be rare). > 2. When inserting a duplicate value in the index, we used to try a few > pages to see if there was space, then "get tired" and just split a page > containing duplicates. Now that there are no duplicates (because > duplicate key fields are sorted by tid) the system knows exactly what > page the index tuple belongs on, and inserts or splits the page as > necessary. Right -- inserts will descend straight to the correct leaf page, and the "getting tired" dance isn't used anymore. This makes insertions faster, but more importantly is a better high level strategy for storing lots of duplicates. We'll dirty far fewer pages, because insertions automatically end up inserting around the same place as we inserted to a moment ago. Insertions of duplicates behave like serial/auto-incrementing insertions, which was already fast/well-optimized in various ways. It's easy to measure this by looking at index bloat when inserting lots of duplicates -- I came up with the 16% figure in the talk based on a simple insert-only test. > 3. Pivot tuples are used on internal pages and as min/max indicators on > leaf pages. 
These pivot tuples are now trimmed if their trailing key > fields are not significant. For example, if an index is > field1/field2/field3, and the page contains values for which field1==5 > and none that field1==6, there is no need to include field2 and field3 > in the pivot tuple --- it can just list the pivot as field1==5, > field2=infinity. This is called suffix truncation. Right -- that's exactly how it works. Users may find that indexes with lots of extra columns at the end won't get so bloated in Postgres 12. Indexing many columns is typically seen when index-only scans are important. Of course, you could have made those indexes INCLUDE indexes on v11, which is actually a closely related idea, but that makes it impossible to use the trailing/suffix columns in an ORDER BY. And, you have to know about INCLUDE indexes, and remember to use them. (This must be why Oracle can get away with not having INCLUDE indexes.) > Page splits used to just split the page in
Re: New vacuum option to do only freezing
On Tue, Jun 18, 2019 at 10:39 PM Michael Paquier wrote: > +INSERT INTO no_index_cleanup(i, t) VALUES(1, repeat('1234567890',3)); > Do we really need a string as long as that? Specifying EXTERNAL storage might make things easier. I have used PLAIN storage to test the 1/3 of a page restriction within nbtree, and to test a bug in amcheck that was related to TOAST compression. > It seems to me that we'd want tests to make sure that indexes are > actually cleaned up, where pageinspect could prove to be useful. That definitely seems preferable, but it'll be a bit tricky to do it in a way that doesn't run into buildfarm issues due to alignment. I suggest an index on a text column to avoid problems. -- Peter Geoghegan
Re: Index Skip Scan
On Sat, Jun 22, 2019 at 3:15 PM Floris Van Nee wrote: > Perhaps a question: when stepping through code in GDB, is there an easy way > to pretty print for example the contents on an IndexTuple? I saw there's some > tools out there that can pretty print plans, but viewing tuples is more > complicated I guess. Try the attached patch -- it has DEBUG1 traces with the contents of index tuples from key points during index scans, allowing you to see what's going on both as a B-Tree is descended, and as a range scan is performed. It also shows details of the insertion scankey that is set up within _bt_first(). This hasn't been adapted to the patch at all, so you'll probably need to do that. The patch should be considered a very rough hack, for now. It leaks memory like crazy. But I think that you'll find it helpful. -- Peter Geoghegan 0012-Index-scan-positioning-DEBUG1-instrumentation.patch Description: Binary data
Re: PG 12 draft release notes
On Wed, Jun 12, 2019 at 5:22 PM Bruce Momjian wrote: > I had become so confused by this item that I needed a few weeks to > settle on what was actually going on. I put a lot of time into my pgCon talk, especially on the diagrams. Seems like that paid off. Even Heikki was confused by my explanations at one point. I should go add a similar diagram to our documentation, under "Chapter 63. B-Tree Indexes", because diagrams are the only sensible way to explain the concepts. > I was wrong. I was thinking of this commit: > > commit d2086b08b0 > Author: Alexander Korotkov > Date: Sat Jul 28 00:31:40 2018 +0300 > > Reduce path length for locking leaf B-tree pages during insertion > > If you had to cut one thing from this list, then I would suggest that > > it be this item. It's nice, but it's also very obvious, which makes it > > hard to explain. > Right. The commit mentioned a 4.5x speedup in a rare benchmark, so I > added it lower on the list. My remark about cutting an item referred to a lesser item that I worked on (the 'Add nbtree high key "continuescan" optimization' commit), not Alexander's independent B-Tree work. I think that Alexander's optimization is also quite effective. Though FWIW the 4.5x improvement concerned a case involving lots of duplicates... cases with a lot of duplicates will be far far better in Postgres 12. (I never tested my patch without Alexander's commit, since it went in early in the v12 cycle.) > Yes, locality. "Locality" is one of my favorite words. > Attached is an updated patch. I might have missed something, but I > think it might be close. This looks great. I do have a few things: * I would put "Improve performance and space utilization of btree indexes with many duplicates" first (before "Allow multi-column btree indexes to be smaller"). I think that this is far more common than we tend to assume, and is also where the biggest benefits are. 
* The wording of the "many duplicates" item itself is almost perfect, though the "...and inefficiency when VACUUM needed to find a row for removal" part seems a bit off -- this is really about the effectiveness of VACUUM, not the speed at which the VACUUM completes (it's a bit faster, but that's not that important). Perhaps that part should read: "...and often failed to efficiently recycle space made available by VACUUM". Something like that. * The "Allow multi-column btree indexes to be smaller" item is about both suffix truncation, and about the "Split after new tuple" optimization. I think that that makes it more complicated than it needs to be. While the improvements that we saw with TPC-C on account of the "Split after new tuple" optimization were nice, I doubt that users will be looking out for it. I would be okay if you dropped any mention of the "Split after new tuple" optimization, in the interest of making the description more useful to users. We can just lose that. * Once you simplify the item by making it all about suffix truncation, it would make sense to change the single line summary to "Reduce the number of branch blocks needed for multi-column indexes". Then go on to talk about how we now only store those columns that are necessary to guide index scans in tuples stored in branch pages (we tend to call branch pages internal pages, but branch pages seems friendlier to me). Note that the user docs of other database systems reference these details, even in their introductory material on how B-Tree indexes work. The term "suffix truncation" isn't something users have heard of, and we shouldn't use it here, but the *idea* of suffix truncation is very well established. As I mentioned, it matters for things like covering indexes (indexes that are designed to be used by index-only scans, which are not necessarily INCLUDE indexes). Thanks! -- Peter Geoghegan
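The truncation rule being described — keep pivot columns only up to and including the first attribute that distinguishes the last tuple on the left page from the first tuple on the right page — can be sketched as follows. This is a toy illustration with an invented truncate_pivot helper; nbtree's real _bt_truncate() also deals with heap TIDs, "minus infinity" semantics for the omitted attributes, and more:

```python
def truncate_pivot(last_left, first_right):
    """Toy suffix truncation: given the last tuple on the left page and
    the first tuple on the right page (as tuples of column values),
    return the shortest prefix of first_right's columns that still
    separates the two pages."""
    keep = []
    for l, r in zip(last_left, first_right):
        keep.append(r)
        if l != r:  # first attribute that differs decides the split point
            break
    # Any trailing attributes are simply omitted; the real implementation
    # treats them as "minus infinity" when descending the tree.
    return tuple(keep)
```

So for an index on (field1, field2, field3), a split between (5, 'x', 9) and (5, 'y', 1) needs a pivot of only (5, 'y') — which is why multi-column indexes with "extra" trailing columns get smaller, more discriminating internal pages.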
Re: PG 12 draft release notes
On Wed, Jun 12, 2019 at 7:29 PM Bruce Momjian wrote: > OK, I mentioned something about increased locality now. Patch attached. Looks good -- locality is a good catch-all term. Thanks! -- Peter Geoghegan
Re: Draft back-branch release notes are up for review
On Sat, Jun 15, 2019 at 2:11 PM Peter Geoghegan wrote: > On Sat, Jun 15, 2019 at 1:39 PM Noah Misch wrote: > > To me, this text implies a cautious DBA should amcheck every index. Reading > > the thread[1], I no longer think that. It's enough to monitor that VACUUM > > doesn't start failing persistently on any index. I suggest replacing this > > release note text with something like the following: FWIW, amcheck won't help here. It can only access pages through its breadth-first search, and so will not land on any "leaked" page (i.e. page that has no link to the tree). Ideally, amcheck would notice that it hasn't visited certain blocks, and then inspect the blocks/pages in a separate pass, but that doesn't happen right now. As you know, VACUUM can find leaked blocks/pages because nbtree VACUUM has an optimization that allows it to access them in sequential order. -- Peter Geoghegan
Re: Draft back-branch release notes are up for review
On Sat, Jun 15, 2019 at 3:05 PM Tom Lane wrote: > Thanks for the input, guys. What do you think of > > Avoid writing an invalid empty btree index page in the unlikely case > that a failure occurs while processing INCLUDEd columns during a page > split (Peter Geoghegan) > > The invalid page would not affect normal index operations, but it > might cause failures in subsequent VACUUMs. If that has happened to > one of your indexes, recover by reindexing the index. That seems perfect. Thanks -- Peter Geoghegan
Re: Draft back-branch release notes are up for review
On Sat, Jun 15, 2019 at 1:39 PM Noah Misch wrote: > To me, this text implies a cautious DBA should amcheck every index. Reading > the thread[1], I no longer think that. It's enough to monitor that VACUUM > doesn't start failing persistently on any index. I suggest replacing this > release note text with something like the following: > > Avoid writing erroneous btree index data that does not change query results > but causes VACUUM to abort with "failed to re-find parent key". Affected > indexes are rare; REINDEX fixes them. > > (I removed "key truncation during a page split" as being too technical for > release notes.) I agree that this isn't terribly significant in general. Your proposed wording seems better than what we have now, but a reference to INCLUDE indexes also seems like a good idea. They are the only type of index that could possibly have the issue with page deletion/VACUUM becoming confused. Even then, the risk seems minor, because there has to be an OOM at precisely the wrong point. If there was any kind of _bt_split() OOM in 11.3 that involved a non-INCLUDE B-Tree index, then the OOM could only occur when we allocate a temp page buffer. I verified that this causes no significant issue for VACUUM. It is best avoided, since we still "leak" the new page/buffer, although that is almost harmless. -- Peter Geoghegan
Re: _bt_split(), and the risk of OOM before its critical section
On Wed, May 8, 2019 at 3:37 PM Peter Geoghegan wrote: > It makes perfect sense for _bt_split() to call _bt_findsplitloc() > directly, since _bt_findsplitloc() is already aware of almost every > _bt_split() implementation detail, whereas those same details are not > of interest anywhere else. I discovered that it even used to work like that until 1997, when commit 71b3e93c505 added handling of duplicate index tuples. Tom ripped the duplicate handling stuff out a couple of years later, for what seemed to me to be very good reasons, but _bt_findsplitloc() remained outside of _bt_split() until now. I intend to push ahead with the fix for both v11 and HEAD on Monday. -- Peter Geoghegan
Re: pg12 release notes
On Sat, May 11, 2019 at 7:40 AM Bruce Momjian wrote: > > > > I think I have understood the nuances, as listed above --- I just don't > > > > agree with the solution. > > > > > > To be clear, I don't expect you to agree with the solution. > > > > > > Another thing that you missed from my patch is that bugfix commit > > > 9b10926263d831fac5758f1493c929a49b55669b shouldn't be listed. > > > > Why should it not be listed? > > Thinking some more, I try to aggregate all the feature addition commits > together, but often skip "cleanups" of previous feature additions. Are > you saying that 9b10926263d831fac5758f1493c929a49b55669b is a cleanup, > and not part of the feature addition? It was not clear to me of > 9b10926263d831fac5758f1493c929a49b55669b was a further enhancement > made possible by previous commits, or a "fix" for a previous commit. Yes. It's a bug fix that went in after feature freeze. -- Peter Geoghegan
Re: pg12 release notes
On Fri, May 10, 2019 at 7:11 PM Bruce Momjian wrote: > > Whether or not you include more details is not what I care about here > > -- I *agree* that this is insignificant. > Well, we can move the entire item up to the incompatibility section, but > that seems unbalanced since the incompatibility is so small relative to > the entire item, and it is rare, as you mentioned. It also seems odd to > create a stand-alone incompatibility item that really is part of a later > item already in the release notes. That is what we've always done. The list has always been very long, with individual items that are on average totally insignificant. Breaking with that pattern in this instance will be confusing to users. > I think I have understood the nuances, as listed above --- I just don't > agree with the solution. To be clear, I don't expect you to agree with the solution. Another thing that you missed from my patch is that bugfix commit 9b10926263d831fac5758f1493c929a49b55669b shouldn't be listed. > > As things stand, I feel like the question of authorship and credit > > complicates the question of making the release notes useful, which is > > unfortunate. (Not sure what can be done about that.) > > That part I always need big help with, particularly with multiple > commits being lumped into a single release note entry. I just can't > tell which commit is more important when knowing what order to list the > names. I have that problem big-time with the partition commits. I understand that it's a difficult job. It's inevitable that there will need to be corrections. You don't appear to be particularly receptive to feedback, which makes the process harder for everyone -- even in instances where you make the right call. I don't think that I am alone in seeing it this way. -- Peter Geoghegan
Re: pg12 release notes
On Fri, May 10, 2019 at 6:02 PM Bruce Momjian wrote: > Have new btree indexes sort duplicate index entries in heap-storage > order (Peter Geoghegan) > > This slightly reduces the maximum-allowed length of indexed values. > --- > Indexes pg_upgraded from previous releases will not have this > ordering. > > I don't think more details are really needed. Whether or not you include more details is not what I care about here -- I *agree* that this is insignificant. I think that the maximum allowed length thing should be listed in the compatibility section as a formality -- not alongside the item where the feature is listed. I had better be correct in that general assessment, because it's not possible to opt out of the restriction within CREATE INDEX. That was my point here. > What we have already seems like enough detail: > > Improve speed of btree index insertions (Alexander Korotkov, > Peter Geoghegan) Why? I think that it's unfortunate that Heikki wasn't given an authorship credit here, as proposed in my patch. I think that he deserves it. > Locking speed seems like an implementation detail. They're all implementation details, though. If that was the actual standard, then few or no "indexing" items would be listed. > You have given me very good detail, so the new text is: > > Improve speed of btree index insertions (Alexander Korotkov, Peter > Geoghegan) > > The new code improves the space-efficiency of page splits, reduces > locking > overhead, and gives better performance for UPDATEs > and DELETEs on indexes with many duplicates. I can live with that. I'm not trying to be difficult -- reasonable people can disagree on the level of detail that is appropriate (they can even have radically different ideas about where to draw the line). And, I would expect it to be a little arbitrary under the best of circumstances, no matter who was tasked with writing the release notes. 
However, I think the process would be easier and more effective if you took more time to understand the concerns behind the feedback you get. There are sometimes important nuances. As things stand, I feel like the question of authorship and credit complicates the question of making the release notes useful, which is unfortunate. (Not sure what can be done about that.) -- Peter Geoghegan
Re: pg12 release notes
On Sat, May 11, 2019 at 11:02 AM Bruce Momjian wrote: > OK, commit removed. You're mistaken -- nothing has been pushed to master in the last 3 hours. -- Peter Geoghegan
Re: What is an item pointer, anyway?
On Sun, May 5, 2019 at 1:14 PM Peter Geoghegan wrote: > Attached draft patch adjusts code comments and error messages where a > line pointer is referred to as an item pointer. It turns out that this > practice isn't all that prevalent. Here are some specific concerns > that I had to think about when writing the patch, though: Ping? Any objections to pushing ahead with this? -- Peter Geoghegan
Re: What is an item pointer, anyway?
On Mon, May 13, 2019 at 12:38 PM Tom Lane wrote: > /* > - * Prune specified item pointer or a HOT chain originating at that item. > + * Prune specified line pointer or a HOT chain originating at that item. > * > * If the item is an index-referenced tuple (i.e. not a heap-only tuple), > > Should "that item" also be re-worded, for consistency? Yes, it should be -- I'll fix it. I'm going to backpatch the storage.sgml change on its own, while pushing everything else in a separate HEAD-only commit. Thanks -- Peter Geoghegan
Re: VACUUM can finish an interrupted nbtree page split -- is that okay?
On Tue, May 7, 2019 at 9:59 AM Peter Geoghegan wrote: > On Tue, May 7, 2019 at 12:27 AM Heikki Linnakangas wrote: > > I don't understand that reasoning. Yes, _bt_pagedel() will complain if > > it finds a half-dead internal page. But how does that mean that > > _bt_lock_branch_parent() can't encounter one? > > I suppose that in theory it could, but only if you allow that any > possible state could be found -- it doesn't seem any more likely than > any other random illegal state. To be fair, I suppose that the code made more sense when it first went in, because at the time there was a chance that there could be leftover half-dead internal pages. But, that was a long time ago now. I wonder why the code wasn't complaining about corruption loudly, like the top level code, instead of treating half-dead pages as a legitimate reason to not proceed with multi-level page deletion. That would have been overkill, but it would have made much more sense IMV. I would like to proceed with pushing this patch to HEAD in the next few days, since it's clearly removing code that can't be useful. Are there any objections? -- Peter Geoghegan
Re: VACUUM can finish an interrupted nbtree page split -- is that okay?
On Mon, May 13, 2019 at 8:30 PM Tom Lane wrote: > Peter Geoghegan writes: > > To be fair, I suppose that the code made more sense when it first went > > in, because at the time there was a chance that there could be > > leftover half-dead internal pages. But, that was a long time ago now. > > Is there a good reason to assume there are none left anywhere? That is not an assumption that the proposed patch rests upon, though it is true that there are probably going to be virtually no half-dead internal pages that make their way onto a Postgres 12 installation. You'd have to do a CREATE INDEX on Postgres 9.3, and then not VACUUM or REINDEX the index once it was on a 9.4+ installation. I suppose that a 9.3 -> 12 upgrade is the most plausible scenario in which you could actually get a half-dead internal page on a Postgres 12 installation. Even when that happens, the index is already considered corrupt by VACUUM, so the same VACUUM process that could in theory be adversely affected by removing the half-dead internal page check will complain about the page when it gets to it later on -- the user will be told to REINDEX. And even then, we will never actually get to apply the check that I propose to remove, since we're already checking the leaf page sibling of the leaf-level target -- the leaf-level test that was added by efada2b8e92 was clearly necessary. But it was also sufficient (no equivalent internal half-dead right sibling test is needed): a 9.3-era half-dead internal page cannot have more than one child, which must be undergoing deletion as well. If somebody doubted my rationale for why we don't need to do anything more on internal page levels in installations where the user didn't pg_upgrade from a version that's < 9.4, then they'd still have to explain why we haven't heard of any problems in 5 years, and probably offer some alternative fix that considers "logically half-dead internal pages" (i.e. pages that are or will be the top parent in a deletion chain). 
Because the code that I propose to remove obviously cannot be doing much of anything for indexes built on 9.4+. -- Peter Geoghegan
Re: itemptr_encode/itemptr_decode
On Wed, Apr 17, 2019 at 4:22 PM Tom Lane wrote: > As for the general usability argument, I'm not sure --- as we start > to look at alternate AMs, we might have more use for them. When I first > saw the functions, I thought maybe they were part of sort acceleration > for TIDs; evidently they're not (yet), but that seems like another > possible use-case. There is also your join-or-to-union patch, which I thought might make use of this for its TID sort. Maybe it would make sense to put this infrastructure in tuplesort.c, but probably not. TIDs are 6 bytes, which as you once pointed out, is not something that we have appropriate infrastructure for (there isn't a DatumGet*() macro, and so on). The encoding scheme (which you originally suggested as an alternative to my first idea, sort support for item pointers) works particularly well as these things go -- it was about 3x faster when everything fit in memory, and faster still with external sorts. It allowed us to resolve comparisons at the SortTuple level within tuplesort.c, but also allowed tuplesort.c to use the pass-by-value datum qsort specialization. It even allowed sorted array entries (TIDs/int8s) to be fetched without extra pointer chasing -- that can be a big bottleneck these days. The encoding scheme is a bit ugly, but I suspect it would be simpler to stick to the same approach elsewhere than to try and hide all the details within tuplesort.c, or something like that. Unless we're willing to treat TIDs as a whole new type of tuple with its own set of specialized functions in tuplesort.c, which has problems of its own, then it's kind of awkward to do it some other way. -- Peter Geoghegan
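A rough sketch of that encoding idea — not the actual itemptr_encode()/itemptr_decode() definitions, whose exact bit layout should be checked in the source — packs the block number into the high bits of a single integer, so that ordinary integer comparison agrees with TID order and a pass-by-value datum sort applies:

```python
def tid_encode(block, offset):
    """Pack a (block number, offset number) TID into one integer such
    that plain integer comparison matches TID order.  Illustrative
    layout: 32 bits of block number above 16 bits of offset number."""
    assert 0 <= block < 2**32 and 0 <= offset < 2**16
    return (block << 16) | offset

def tid_decode(encoded):
    """Inverse of tid_encode(): recover (block, offset)."""
    return encoded >> 16, encoded & 0xFFFF
```

Sorting the encoded integers and decoding afterwards yields TIDs in (block, offset) order, which is what lets the comparisons be resolved at the SortTuple level without any pointer chasing.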
Re: crash testing suggestions for 12 beta 1
On Thu, May 23, 2019 at 8:24 AM Jeff Janes wrote: > Now that beta is out, I wanted to do some crash-recovery testing where I > inject PANIC-inducing faults and see if it recovers correctly. Thank you for doing this. It's important work. > Making the ctid be tie-breakers in btree index is also tested inherently > (plus I think Peter tested that pretty thoroughly himself with similar > methods). As you may know, the B-Tree code has a tendency to soldier on when an index is corrupt. "Moving right" tends to conceal problems beyond concurrent page splits. I didn't do very much fault injection type testing with the B-Tree enhancements, but I did lean on amcheck heavily during development. Note that a new, extremely thorough option called "rootdescend" verification was added following the v12 work: https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=c1afd175b5b2e5c44f6da34988342e00ecdfb518 It probably wouldn't add noticeable overhead to use this during your testing, and maybe to combine it with the "heapallindexed" option, while using the bt_index_parent_check() variant -- that will detect almost any imaginable index corruption. Admittedly, amcheck didn't find any bugs in my code after the first couple of versions of the patch series, so this approach seems unlikely to find any problems now. Even still, it wouldn't be very difficult to do this extra step. It seems worthwhile to be thorough here, given that we depend on the B-Tree code so heavily. -- Peter Geoghegan
Re: Top-N sorts in EXPLAIN, row count estimates, and parallelism
On Thu, May 23, 2019 at 3:31 PM Tom Lane wrote: > Given the way that's implemented, I doubt that we can report it > reliably in EXPLAIN. Does it have to be totally reliable? cost_sort() costs sorts as top-N heapsorts. While we cannot make an iron-clad guarantee that it will work out that way from within tuplesort.c, that doesn't seem like it closes off the possibility of more informative EXPLAIN output. For example, can't we at least report that the tuplesort will be "bounded" within EXPLAIN, indicating that we intend to attempt to sort using a top-N heap sort (i.e. we'll definitely do it that way if there is sufficient work_mem)? -- Peter Geoghegan
Re: Calling PrepareTempTablespaces in BufFileCreateTemp
On Tue, Apr 30, 2019 at 3:22 PM Tom Lane wrote: > So now I'm feeling more favorable about the idea of adding a > PrepareTempTablespaces call to BufFileCreateTemp, and just stopping > with that. If we want to do more, I feel like it requires a > significant amount of rethinking about what the expectations are for > fd.c, and some rejiggering of PrepareTempTablespaces's API too. > I'm not sufficiently excited about this issue to do that. +1. Let's close this one out. -- Peter Geoghegan
Re: Calling PrepareTempTablespaces in BufFileCreateTemp
On Fri, May 17, 2019 at 6:36 PM Tom Lane wrote: > Will do so tomorrow. Should we back-patch this? I wouldn't, because I see no reason to. Somebody else might. -- Peter Geoghegan
Re: pg12 release notes
On Thu, May 9, 2019 at 6:03 PM Bruce Momjian wrote: > These were all very helpful. I adjusted your changes to create the > attached applied patch. URL updated: I noticed that the compatibility note for Andrew Gierth's RYU floating point patch seems to simply say why the feature is useful. Shouldn't it be listed separately, and its impact on users upgrading be listed here instead? Separately, please find attached suggested changes for items I was involved in. I have attempted to explain them in a way that makes their relevance to users clearer. Even if you don't end up using my wording, you should still change the attribution along the same lines as the patch. Also, I added a compatibility note for the new version of nbtree, which revises the "1/3 of a page" restriction downwards very slightly (by 8 bytes). FWIW, I think it's very unlikely that anyone will be affected, because tuples that are that wide are already compressed in almost all cases -- it seems like it would be hard to be just at the edge of the limit already. Thanks -- Peter Geoghegan 0001-Suggested-changes-to-v12-release-notes.patch Description: Binary data
Re: _bt_split(), and the risk of OOM before its critical section
On Tue, May 7, 2019 at 6:15 PM Peter Geoghegan wrote: > I suppose I'm biased, but I prefer the new approach anyway. Adding the > left high key first, and then the right high key seems simpler and > more logical. It emphasizes the similarities and differences between > leftpage and rightpage. I came up with a better way of doing it in the attached revision. Now, _bt_split() calls _bt_findsplitloc() directly. This makes it possible to significantly simplify the signature of _bt_split(). It makes perfect sense for _bt_split() to call _bt_findsplitloc() directly, since _bt_findsplitloc() is already aware of almost every _bt_split() implementation detail, whereas those same details are not of interest anywhere else. _bt_findsplitloc() also knows all about suffix truncation. It's also nice that the actual _bt_truncate() call is closely tied to the _bt_findsplitloc() call. -- Peter Geoghegan v2-0001-Don-t-leave-behind-junk-nbtree-pages-during-split.patch Description: Binary data
Re: PG 12 draft release notes
On Tue, May 21, 2019 at 1:51 PM Bruce Momjian wrote: > > My concern here (which I believe Alexander shares) is that it doesn't > > make sense to group these two items together. They're two totally > > unrelated pieces of work. Alexander's work does more or less help with > > lock contention with writes, whereas the feature that that was merged > > with is about preventing index bloat, which is mostly helpful for > > reads (it helps writes to the extent that writes are also reads). > > > > The release notes go on to say that this item "gives better > > performance for UPDATEs and DELETEs on indexes with many duplicates", > > which is wrong. That is something that should have been listed below, > > under the "duplicate index entries in heap-storage order" item. > > OK, I understand how the lock stuff improves things, but I have > forgotten how indexes are made smaller. Is it because of better page > split logic? That is clearly the main reason, though suffix truncation (which treats truncated trailing/suffix columns in index tuples from branch pages as having "negative infinity" sentinel values) also contributes to making indexes smaller. The page split stuff was mostly added by commit fab250243 ("Consider secondary factors during nbtree splits"), but commit f21668f32 ("Add "split after new tuple" nbtree optimization") added to that in a way that really helped the TPC-C indexes. The TPC-C indexes are about 40% smaller now. > > > Author: Peter Geoghegan > > > 2019-03-20 [dd299df81] Make heap TID a tiebreaker nbtree index column. > As I remember the benefit currently is that you can find update and > deleted rows faster, right? Yes, that's true when writing to the index. But more importantly, it really helps VACUUM when there are lots of duplicates, which is fairly common in the real world (imagine an index where 20% of the rows are NULL, for example). In effect, there are no duplicates anymore, because all index tuples are unique internally. 
Indexes with lots of duplicates group older rows together, and new rows together, because treating heap TID as a tiebreaker naturally has that effect. VACUUM will generally dirty far fewer pages, because bulk deletions tend to be correlated with heap TID. And, VACUUM has a much better chance of deleting entire leaf pages, because dead tuples end up getting grouped together. -- Peter Geoghegan
Re: pg_upgrade can result in early wraparound on databases with high transaction load
On Mon, May 20, 2019 at 3:10 AM Jason Harvey wrote: > This week I upgraded one of my large(2.8TB), high-volume databases from 9 to > 11. The upgrade itself went fine. About two days later, we unexpectedly hit > transaction ID wraparound. What was perplexing about this was that the age of > our oldest `datfrozenxid` was only 1.2 billion - far away from where I'd > expect a wraparound. Curiously, the wraparound error referred to a mysterious > database of `OID 0`: > > UPDATE ERROR: database is not accepting commands to avoid wraparound data > loss in database with OID 0 > > We were able to recover after a few hours by greatly speeding up our vacuum > on our largest table. > > In a followup investigation I uncovered the reason we hit the wraparound so > early, and also the cause of the mysterious OID 0 message. When pg_upgrade > executes, it calls pg_resetwal to set the next transaction ID. Within > pg_resetwal is the following code: > https://github.com/postgres/postgres/blob/6cd404b344f7e27f4d64555bb133f18a758fe851/src/bin/pg_resetwal/pg_resetwal.c#L440-L450 > > This sets the controldata to have a fake database (OID 0) on the brink of > transaction wraparound. Specifically, after pg_upgrade is ran, wraparound > will occur within around 140 million transactions (provided the autovacuum > doesn't finish first). I confirmed by analyzing our controldata before and > after the upgrade that this was the cause of our early wraparound. > > Given the size and heavy volume of our database, we tend to complete a vacuum > in the time it takes around 250 million transactions to execute. With our > tunings this tends to be rather safe and we stay well away from the > wraparound point under normal circumstances. This does seem like an unfriendly behavior. Moving the thread over to the -hackers list for further discussion... -- Peter Geoghegan
Re: PG 12 draft release notes
On Mon, May 20, 2019 at 3:17 PM Andres Freund wrote: > > > > Improve speed of btree index insertions (Peter Geoghegan, > Alexander Korotkov) > My concern here (which I believe Alexander shares) is that it doesn't make sense to group these two items together. They're two totally unrelated pieces of work. Alexander's work does more or less help with lock contention with writes, whereas the feature that that was merged with is about preventing index bloat, which is mostly helpful for reads (it helps writes to the extent that writes are also reads). The release notes go on to say that this item "gives better performance for UPDATEs and DELETEs on indexes with many duplicates", which is wrong. That is something that should have been listed below, under the "duplicate index entries in heap-storage order" item. > Author: Peter Geoghegan > 2019-03-20 [dd299df81] Make heap TID a tiebreaker nbtree index column. > Author: Peter Geoghegan > 2019-03-20 [fab250243] Consider secondary factors during nbtree splits. > --> > > > Have new btree indexes sort duplicate index entries in heap-storage > order (Peter Geoghegan, Heikki Linnakangas) > > I'm not sure that the grouping here is quite right. And the second entry > probably should have some explanation about the benefits? It could stand to say something about the benefits. As I said, there is already a little bit about the benefits, but that ended up being tied to the "Improve speed of btree index insertions" item. Moving that snippet to the correct item would be a good start. -- Peter Geoghegan
Re: ERROR: failed to add item to the index page
On Tue, Apr 30, 2019 at 9:44 AM Andreas Joseph Krogh wrote: > I have a 1.4GB dump (only one table) which reliably reproduces this error. > Shall I share it off-list? I would be quite interested in this, too, since there is a chance that it's my bug. -- Peter Geoghegan
Re: Re: ERROR: failed to add item to the index page
On Tue, Apr 30, 2019 at 11:54 AM Andreas Joseph Krogh wrote: > Nice, thanks! Thanks for the report! -- Peter Geoghegan
Re: ERROR: failed to add item to the index page
On Tue, Apr 30, 2019 at 9:47 AM Tom Lane wrote: > Andreas Joseph Krogh writes: > > I have a 1.4GB dump (only one table) which reliably reproduces this error. > > Shall I share it off-list? -- > > That's awfully large :-(. How do you have in mind to transmit it? I've sent dumps that were larger than that by providing a Google Drive link. Something like that should work reasonably well. -- Peter Geoghegan
Re: ERROR: failed to add item to the index page
On Tue, Apr 30, 2019 at 9:56 AM Andreas Joseph Krogh wrote: > I've sent you guys a link (Google Drive) off-list. I'll start investigating the problem right away. Thanks -- Peter Geoghegan
Re: ERROR: failed to add item to the index page
On Tue, Apr 30, 2019 at 9:59 AM Peter Geoghegan wrote: > I'll start investigating the problem right away. I have found what the problem is. I simply neglected to make a conservative assumption about suffix truncation needing to add a heap TID to a leaf page's new high key in nbtsort.c (following commit dd299df8189), even though I didn't make the same mistake in nbtsplitloc.c. Not sure how I managed to make such a basic error. Andreas' test case works fine with the attached patch. I won't push a fix for this today. -- Peter Geoghegan 0001-Tentative-fix-for-nbtsort.c-space-bug.patch Description: Binary data
Re: Adding a test for speculative insert abort case
On Tue, Apr 30, 2019 at 5:16 PM Melanie Plageman wrote: > Can anyone think of a good way to put this codepath under test? During the initial development of ON CONFLICT, speculative insertion itself was tested using custom stress testing that you can still get here: https://github.com/petergeoghegan/jjanes_upsert I'm not sure that this is something that you can adopt, but I certainly found it very useful at the time. It tests whether or not there is agreement among concurrent speculative inserters, and whether or not there are "unprincipled deadlocks" (user hostile deadlocks that cannot be fixed by reordering something in application code). -- Peter Geoghegan
Re: Improve search for missing parent downlinks in amcheck
On Sun, Apr 28, 2019 at 10:15 AM Alexander Korotkov wrote: > I think this definitely not bug fix. Bloom filter was designed to be > lossy, no way blaming it for that :) I will think about a simple fix, but after the upcoming point release. There is no hurry. -- Peter Geoghegan
Re: ERROR: failed to add item to the index page
On Tue, Apr 30, 2019 at 10:58 AM Peter Geoghegan wrote: > I have found what the problem is. I simply neglected to make a > conservative assumption about suffix truncation needing to add a heap > TID to a leaf page's new high key in nbtsort.c (following commit > dd299df8189), even though I didn't make the same mistake in > nbtsplitloc.c. Not sure how I managed to make such a basic error. Attached is a much more polished version of the same patch. I tried to make clear how the "page full" test (the test that has been fixed to take heap TID space for high key into account) is related to other close-by code, such as the tuple space limit budget within _bt_check_third_page(), and the code that sets up an actual call to _bt_truncate(). I'll wait a few days before pushing this. This version doesn't feel too far off being committable. I tested it with some of the CREATE INDEX tests that I developed during development of the nbtree unique keys project, including a test with tuples that are precisely at the 1/3 of a page threshold. The new definition of 1/3 of a page takes high key heap TID overhead into account -- see _bt_check_third_page(). -- Peter Geoghegan v2-0001-Fix-nbtsort.c-s-page-space-accounting.patch Description: Binary data
Re: What is an item pointer, anyway?
On Fri, Apr 26, 2019 at 4:23 PM Ashwin Agrawal wrote: > How about we rename ItemPointerData to TupleIdentifier or ItemIdentifier > instead and leave ItemPointer or Item confined to AM term, where item can be > tuple, datum or anything else ? I'm not a fan of that idea, because the reality is that an ItemPointerData is quite explicitly supposed to be a physiological identifier (TID) used by heapam, or a similar heap-like access method such as zheap. This is baked into a number of things. The limitation that pluggable storage engines have to work in terms of item pointers is certainly a problem, especially for things like the Zedstore column store project you're working on. However, I suspect that that problem is best solved by accommodating other types of identifiers that don't work like TIDs. I understand why you've adopted ItemPointerData as a fully-logical identifier in your prototype, but it's not a great long term solution. -- Peter Geoghegan
Re: What is an item pointer, anyway?
On Fri, Apr 26, 2019 at 4:57 PM Tom Lane wrote: > ItemId[Data] is somewhat less widely referenced, but I'm still not > much in favor of renaming that type. I think fixing comments to > uniformly call it an item ID would be more reasonable. (We should > leave the "line pointer" terminology in place, too; if memory serves, > an awful lot of variables of the type are named "lp" or variants. > Renaming all of those is to nobody's benefit.) I was proposing that we not rename any struct at all, and continue to call ItemId[Data]s "line pointers" only. This would involve removing the comment in itemid.h that confusingly refers to line pointers as "item pointers" (plus any other comments that fail to make a clear distinction). I think that the total number of comments that would be affected by this approach is quite low. -- Peter Geoghegan
Re: What is an item pointer, anyway?
On Fri, Apr 26, 2019 at 5:05 PM Tom Lane wrote: > Yeah, I'd be fine with that, although the disconnect between the type > name and the comment terminology might confuse some people. Maybe, but the fact that the ItemIdData struct consists of bit fields that are all named "lp_*" offers a hint. Plus you have the LP_* constants that get stored in ItemIdData.lp_flags. I wouldn't call the struct ItemIdData if I was in a green field situation, but it doesn't seem too bad under the present circumstances. I'd rather not change the struct's name, because that would probably cause problems without any real benefit. OTOH, calling two closely related but distinct things by the same name is atrocious. -- Peter Geoghegan
Re: "long" type is not appropriate for counting tuples
On Mon, Apr 29, 2019 at 8:11 AM Andres Freund wrote: > I think we should start by just removing all uses of long. There's > really no excuse for them today, and a lot of them are bugs waiting to > happen. I like the idea of banning "long" altogether. It will probably be hard to keep it out of third party code that we vendor-in, or even code that interfaces with libraries in some way, but it should be removed from everything else. It actually doesn't seem particularly hard to do so, based on a quick grep of src/backend/. Most uses of "long" are in code that sizes something in local memory, where "long" works for the same reason as it works when calculating the size of a work_mem allocation -- ugly, but correct. A few uses of "long" seem to be real live bugs, albeit bugs that are very unlikely to ever hit. _h_indexbuild() has the same bug as _bt_load(), also due to commit ab0dfc961b6 -- I spotted that in passing when I used grep. > We read from larger files in a few places though. E.g. pg_dump. Most > places really just should use pgoff_t... I wasn't even aware of pgoff_t. It is only used in frontend utilities that I don't know that much about, whereas off_t is used all over the backend code. -- Peter Geoghegan
Re: "long" type is not appropriate for counting tuples
On Mon, Apr 29, 2019 at 11:24 AM Andres Freund wrote: > > I don't think that anybody cares about Win64 very much. > > I seriously doubt this assertion. Note that the postgres packages on > https://www.postgresql.org/download/windows/ do not support 32bit > windows anymore (edb from 11 onwards, bigsql apparently always). And I > think there's a pretty substantial number of windows users out there. I was talking about the motivation behind this thread, and I suppose that I included you in that based on things you've said about Windows in the past (apparently I shouldn't have done so). I am interested in making the code less complicated. If we can remove the work_mem kludge for Windows as a consequence of that, then so much the better. -- Peter Geoghegan
Re: "long" type is not appropriate for counting tuples
On Mon, Apr 29, 2019 at 11:10 AM Tom Lane wrote: > If we don't want to rely on "L" constants then we'll have to write these > cases like "work_mem * (size_t) 1024" which is ugly, lots more keystrokes, > and prone to weird precedence problems unless you throw even more > keystrokes (parentheses) at it. I'm not excited about doing that just > to allow larger work_mem settings on Win64. I don't think that anybody cares about Win64 very much. Simplifying the code might lead to larger work_mem settings on that platform, but that's not the end goal I have in mind. For me, the end goal is simpler code. > I'm not suggesting that we don't need to fix uses of "long" for tuple > counts, and perhaps other things. But I think getting rid of it in memory > size calculations might be a lot of work for not a lot of reward. Whether or not *fully* banning the use of "long" is something that will simplify the code is debatable. However, we could substantially reduce the use of "long" across the backend without any real downside. The work_mem question can be considered later. Does that seem reasonable? -- Peter Geoghegan
Re: "long" type is not appropriate for counting tuples
On Mon, Apr 29, 2019 at 11:20 AM Alvaro Herrera wrote: > Agreed. Here's a patch. I see downthread that you also discovered the > same mistake in _h_indexbuild by grepping for "long"; I got to it by > examining callers of pgstat_progress_update_param and > pgstat_progress_update_multi_param. I didn't find any other mistakes of > the same ilk. Some codesites use "double" instead of "int64", but those > are not broken. This seems fine, though FWIW I probably would have gone with int64 instead of uint64. There is generally no downside to using int64, and being able to support negative integers can be useful in some contexts (though not this context). -- Peter Geoghegan
Re: "long" type is not appropriate for counting tuples
On Mon, Apr 29, 2019 at 10:32 AM Tom Lane wrote: > There's more to that than you might realize. For example, guc.c > enforces a limit on work_mem that's designed to ensure that > expressions like "work_mem * 1024L" won't overflow, and there are > similar choices elsewhere. I was aware of that, but I wasn't aware of how many places that bleeds into until I checked just now. It would be nice if we could figure out how to make it obvious that the idioms around the use of long for work_mem stuff are idioms that have a specific rationale. It's pretty confusing as things stand. -- Peter Geoghegan
Re: Calling PrepareTempTablespaces in BufFileCreateTemp
On Mon, Apr 29, 2019 at 12:31 PM Ashwin Agrawal wrote: > Well the one thing I wish to point out explicitly is just taking fd.c > changes from [1], and running make check hits no assertions and > doesn't flag issue exist for gistbuildbuffers.c. Means its missing > coverage and in future same can happen as well. I believe that the test coverage of GiST index builds is something that is being actively worked on right now. It's a recognized problem [1]. [1] https://postgr.es/m/24954.1556130...@sss.pgh.pa.us -- Peter Geoghegan
Re: _bt_split(), and the risk of OOM before its critical section
On Mon, May 6, 2019 at 5:15 PM Peter Geoghegan wrote: > VACUUM asserts P_FIRSTDATAKEY(opaque) > PageGetMaxOffsetNumber(page) > within _bt_mark_page_halfdead(), but doesn't test that condition in > release builds. This means that the earliest modifications of the > right page, before the high key PageAddItem(), are enough to cause a > subsequent "failed to re-find parent key" failure in VACUUM. Merely > setting the sibling blocks in the right page special area is enough to > cause VACUUM to refuse to run. To be clear, my point here was that this confirms what you said about PageGetTempPage() failing after _bt_getbuf() has initialized the buffer for the new right page -- that is not in itself a problem. However, practically any other change to the right page that might occur before an error is raised within _bt_split() is a problem -- not just adding a new item. (You were right about that, too.) -- Peter Geoghegan
Re: _bt_split(), and the risk of OOM before its critical section
On Mon, May 6, 2019 at 4:11 PM Peter Geoghegan wrote: > The important question is how VACUUM will recognize it. It's clearly > not as bad as something that causes "failed to re-find parent key" > errors, but I think that VACUUM might not be reclaiming it for the FSM > (haven't checked). Note that _bt_unlink_halfdead_page() is perfectly > happy to ignore the fact that the left sibling of a half-dead page has > a rightlink that doesn't point back to the target. Because, uh, there > might have been a concurrent page deletion, somehow. VACUUM asserts P_FIRSTDATAKEY(opaque) > PageGetMaxOffsetNumber(page) within _bt_mark_page_halfdead(), but doesn't test that condition in release builds. This means that the earliest modifications of the right page, before the high key PageAddItem(), are enough to cause a subsequent "failed to re-find parent key" failure in VACUUM. Merely setting the sibling blocks in the right page special area is enough to cause VACUUM to refuse to run. Of course, the problem goes away if you restart the database, because the right page buffer is never marked dirty, and never can be. That factor would probably make the problem appear to be an intermittent issue in the kinds of environments where it is most likely to be seen. -- Peter Geoghegan
Re: _bt_split(), and the risk of OOM before its critical section
On Mon, May 6, 2019 at 3:29 PM Tom Lane wrote: > Yeah, as _bt_split is currently coded, _bt_truncate has to be a "no > errors" function, which it isn't. The pfree for its result is being > done in an ill-chosen place, too. I am tempted to move the call to _bt_truncate() out of _bt_split() entirely on HEAD, possibly relocating it to nbtsplitloc.c/_bt_findsplitloc(). That way, there is a clearer separation between how split points are chosen, suffix truncation, and the mechanical process of executing a legal page split. > Another problem now that I look at it is that the _bt_getbuf for the right > sibling is probably not too safe. And the _bt_vacuum_cycleid() call seems > a bit dangerous from this standpoint as well. Yeah, we can tighten those up without much difficulty. > I'm not really concerned about that one because at that point the > right page is still in a freshly-pageinit'd state. It's perhaps > not quite as nice as having it be zeroes, but it won't look like > it has any interesting data. The important question is how VACUUM will recognize it. It's clearly not as bad as something that causes "failed to re-find parent key" errors, but I think that VACUUM might not be reclaiming it for the FSM (haven't checked). Note that _bt_unlink_halfdead_page() is perfectly happy to ignore the fact that the left sibling of a half-dead page has a rightlink that doesn't point back to the target. Because, uh, there might have been a concurrent page deletion, somehow. We have heard a lot about "failed to re-find parent key" errors from VACUUM before now because that is about the only strong cross-check that it does. (Not that I'm arguing that we need more of that.) > In any case, once we've started to fill the ropaque area, it would really > be better if we don't call anything that could throw errors. > > Maybe we should bite the bullet and use two temp pages, so that none > of the data ends up in the shared buffer arena until we reach the > critical section? 
> The extra copying is slightly annoying, but > it certainly seems like enforcing this invariant over such a > long stretch of code is not very maintainable. While I think that the smarts we have around deciding a split point will probably improve in future releases, and that we'll probably make _bt_truncate() itself do more, the actual business of performing a split has no reason to change that I can think of. I would like to keep _bt_split() as simple as possible anyway -- it should only be copying tuples using simple primitives like the bufpage.c routines. Living with what we have now (not using a temp buffer for the right page) seems fine. -- Peter Geoghegan
_bt_split(), and the risk of OOM before its critical section
Commit 8fa30f906be reduced the elevel of a number of "can't happen" errors from PANIC to ERROR. These were all critical-section-adjacent errors involved in nbtree page splits, and nbtree page deletion. It also established the following convention within _bt_split(), which allowed Tom to keep the length of the critical section just as short as it had always been: /* * origpage is the original page to be split. leftpage is a temporary * buffer that receives the left-sibling data, which will be copied back * into origpage on success. rightpage is the new page that receives the * right-sibling data. If we fail before reaching the critical section, * origpage hasn't been modified and leftpage is only workspace. In * principle we shouldn't need to worry about rightpage either, because it * hasn't been linked into the btree page structure; but to avoid leaving * possibly-confusing junk behind, we are careful to rewrite rightpage as * zeroes before throwing any error. */ The INCLUDE indexes work looks like it subtly broke this, because it allocated memory after the initialization of the right page -- allocating memory can always fail. On the other hand, even when 8fa30f906be went in back in 2010 this "rule" was arguably broken, because we were already calling PageGetTempPage() after the right sibling page is initialized, which palloc()s a full BLCKSZ, which is far more than truncation is ever likely to allocate. On the other other hand, it seems to me that the PageGetTempPage() thing might have been okay, because it happens before the high key is inserted on the new right buffer page. The same cannot be said for the way we generate a new high key for the left/old page via suffix truncation, which happens to occur after the right buffer page is first modified by inserting its high key (the original/left page's original high key). 
I think that there may be a risk that VACUUM's page deletion code will get confused by finding an errant right sibling page from a failed page split when there is a high key. If so, that would be a risk that was introduced in Postgres 11, and made much more likely in practice in Postgres 12. (I haven't got as far as doing an analysis of the risks to page deletion, though. The "fastpath" rightmost page insertion optimization that was also added to Postgres 11 seems like it also might need to be considered here.) -- Peter Geoghegan
Re: _bt_split(), and the risk of OOM before its critical section
On Mon, May 6, 2019 at 12:48 PM Peter Geoghegan wrote: > On the other other hand, it seems to me that the PageGetTempPage() > thing might have been okay, because it happens before the high key is > inserted on the new right buffer page. The same cannot be said for the > way we generate a new high key for the left/old page via suffix > truncation, which happens to occur after the right buffer page is > first modified by inserted its high key (the original/left page's > original high key). I think that there may be a risk that VACUUM's > page deletion code will get confused by finding an errant right > sibling page from a failed page split when there is a high key. If so, > that would be a risk that was introduced in Postgres 11, and made much > more likely in practice in Postgres 12. (I haven't got as far as doing > an analysis of the risks to page deletion, though. The "fastpath" > rightmost page insertion optimization that was also added to Postgres > 11 seems like it also might need to be considered here.) It seems like my fears about page deletion were well-founded, at least if you assume that the risk of an OOM at the wrong time is greater than negligible. If I simulate an OOM error during suffix truncation, then non-rightmost page splits leave the tree in a state that confuses VACUUM/page deletion. When I simulate an OOM on page 42, we will later see the dreaded "failed to re-find parent key in index "foo" for deletion target page 42" error message from a VACUUM. That's not good. It doesn't matter if the same things happens when splitting a rightmost page, which naturally doesn't insert a new high key on the new right half. This confirms my theory that the PageGetTempPage() memory allocation can fail without confusing VACUUM, since that allocation occurs before the critical-but-not-critical point (the point that we really start to modify the new right half of the split). 
Fortunately, this bug seems easy enough to fix: we can simply move the "insert new high key on right page" code so that it comes after suffix truncation. This makes it safe for suffix truncation to hit an OOM, or at least as safe as the PageGetTempPage() allocation, which seems safe to me. -- Peter Geoghegan
Re: VACUUM can finish an interrupted nbtree page split -- is that okay?
On Tue, May 7, 2019 at 12:27 AM Heikki Linnakangas wrote: > I don't understand that reasoning. Yes, _bt_pagedel() will complain if > it finds a half-dead internal page. But how does that mean that > _bt_lock_branch_parent() can't encounter one? I suppose that in theory it could, but only if you allow that any possible state could be found -- it doesn't seem any more likely than any other random illegal state. Even when it happens, we'll get a "failed to re-find parent key" error message when we go a bit further. Isn't that a logical outcome? Actually, maybe we won't get that error, because we're talking about a corrupt index, and all bets are off -- no reason to think that the half-dead internal page would be consistent with other pages in any way. But even then, you'll go on to report it in the usual way, since VACUUM scans nbtree indexes in physical order. -- Peter Geoghegan
Re: New EXPLAIN option: ALL
On Tue, May 7, 2019 at 9:31 AM David Fetter wrote: > If you're tuning a query interactively, it's a lot simpler to prepend, > for example, > > EXPLAIN (ALL, FORMAT JSON) > > to it than to prepend something along the lines of > > EXPLAIN(ANALYZE, VERBOSE, COSTS, BUFFERS, SETTINGS, TIMING, SUMMARY, > PARTRIDGE_IN_A_PEAR_TREE, FORMAT JSON) > > to it. FWIW, I have the following in my psqlrc: \set ea 'EXPLAIN (ANALYZE, SETTINGS, VERBOSE, BUFFERS) ' The idea behind that is that I can prepend ":ea" as needed, rather than doing a lot of typing each time, as in: :ea SELECT ... -- Peter Geoghegan
Re: _bt_split(), and the risk of OOM before its critical section
On Mon, May 6, 2019 at 4:11 PM Peter Geoghegan wrote: > I am tempted to move the call to _bt_truncate() out of _bt_split() > entirely on HEAD, possibly relocating it to > nbtsplitloc.c/_bt_findsplitloc(). That way, there is a clearer > separation between how split points are chosen, suffix truncation, and > the mechanical process of executing a legal page split. I decided against that -- better to make it clear how truncation deals with space overhead within _bt_split(). Besides, the resource management around sharing a maybe-palloc()'d high key across module boundaries seems complicated, and best avoided. Attached draft patch for HEAD fixes the bug by organizing _bt_split() into clear phases. _bt_split() now works as follows, which is a little different: * An initial phase that is entirely concerned with the left page temp buffer itself -- initializes its special area. * Suffix truncation to get left page's new high key, and then add it to left page. * A phase that is mostly concerned with initializing the right page special area, but also finishes off one or two details about the left page that needed to be delayed. This is also where the "shadow critical section" begins. Note also that this is where _bt_vacuum_cycleid() is called, because its contract actually *requires* that caller has a buffer lock on both pages at once. This should not be changed on the grounds that _bt_vacuum_cycleid() might fail (nor for any other reason). * Add new high key to right page if needed. (No change, other than the fact that it happens later now.) * Add other items to both leftpage and rightpage. Critical section that copies leftpage into origpage buffer. (No changes here.) I suppose I'm biased, but I prefer the new approach anyway. Adding the left high key first, and then the right high key seems simpler and more logical. It emphasizes the similarities and differences between leftpage and rightpage. 
Furthermore, this approach fixes a theoretical risk that has existed since 2010: leaving behind a minimally-initialized nbtree page. We don't allocate *any* memory after the point that a new rightpage buffer is acquired. I suppose that this will need to be backpatched. Thoughts? -- Peter Geoghegan v1-0001-Don-t-leave-behind-junk-nbtree-pages-during-split.patch Description: Binary data
Re: Pathological performance when inserting many NULLs into a unique index
On Wed, Apr 17, 2019 at 7:37 PM Peter Geoghegan wrote: > Tentatively, I think that the fix here is to not "itup_key->scantid = > NULL" when a NULL value is involved (i.e. don't "itup_key->scantid = > NULL" when we see IndexTupleHasNulls(itup) within _bt_doinsert()). We > may also want to avoid calling _bt_check_unique() entirely in this > situation. > I'll create an open item for this, and begin work on a patch tomorrow. I came up with the attached patch, which effectively treats a unique index insertion as if the index was not unique at all in the case where the new tuple is IndexTupleHasNulls() within _bt_doinsert(). This is correct because inserting a new tuple with a NULL value can neither contribute to somebody else's duplicate violation, nor have a duplicate violation error of its own. I need to test this some more, though I am fairly confident that I have the right idea. We didn't have this problem before my v12 work due to the "getting tired" strategy, but that's not the full story. We also didn't (and don't) move right within _bt_check_unique() when IndexTupleHasNulls(itup), because _bt_isequal() treats NULL values as not equal to any value, including even NULL -- this meant that there was no useless search to the right within _bt_check_unique(). Actually, the overall effect of how _bt_isequal() is used is that _bt_check_unique() does *no* useful work at all with a NULL scankey. It's far simpler to remove responsibility for new tuples/scankeys with NULL values from _bt_check_unique() entirely, which is what I propose to do with this patch. The patch actually ends up removing quite a lot more code than it adds, because it removes _bt_isequal() outright. I always thought that nbtinsert.c dealt with NULL values in unique indexes at the wrong level, and I'm glad to see _bt_isequal() go. Vadim's accompanying 1997 comment about "not using _bt_compare in real comparison" seems confused to me. 
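The NULL-semantics argument above can be sketched with a toy comparator. This is a hypothetical, heavily simplified illustration -- SketchKey semantics only, not the actual nbtree code or its function names:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Sketch of why a NULL key attribute can never cause (or suffer) a
 * unique violation: under SQL semantics, NULL is not equal to any
 * value, not even another NULL.  A hypothetical conflict test for a
 * single int key therefore returns false whenever either side is NULL,
 * which is what justifies skipping _bt_check_unique() entirely for
 * tuples with a NULL key attribute.
 */
static bool
sketch_unique_conflict(bool a_isnull, int32_t a, bool b_isnull, int32_t b)
{
    if (a_isnull || b_isnull)
        return false;           /* NULL != NULL under SQL semantics */
    return a == b;
}
```

Since no tuple with a NULL key attribute can ever make this return true, the uniqueness machinery has nothing to check for such tuples.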
The new _bt_check_unique() may still need to compare the scankey to some existing, adjacent tuple with a NULL value, but _bt_compare() is perfectly capable of recognizing that that adjacent tuple shouldn't be considered equal. IOW, _bt_compare() is merely incapable of behaving as if "NULL != NULL", which is a bad reason for keeping _bt_isequal() around. -- Peter Geoghegan 0001-Prevent-O-N-2-unique-index-insertion-edge-case.patch Description: Binary data
Re: Pathological performance when inserting many NULLs into a unique index
On Thu, Apr 18, 2019 at 5:46 PM Michael Paquier wrote: > Adding an open item is adapted, nice you found this issue. Testing on > my laptop with v11, the non-NULL case takes 5s and the NULL case 6s. > On HEAD, the non-NULL case takes 6s, and the NULL case takes... I > just gave up and cancelled it :) This was one of those things that occurred to me out of the blue. It just occurred to me that the final patch will need to be more careful about non-key attributes in INCLUDE indexes. It's not okay for it to avoid calling _bt_check_unique() just because a non-key attribute was NULL. It should only do that when a key attribute is NULL. -- Peter Geoghegan
Re: Pathological performance when inserting many NULLs into a unique index
On Thu, Apr 18, 2019 at 8:13 PM Peter Geoghegan wrote: > It just occurred to me that the final patch will need to be more > careful about non-key attributes in INCLUDE indexes. It's not okay for > it to avoid calling _bt_check_unique() just because a non-key > attribute was NULL. It should only do that when a key attribute is > NULL. Attached revision does it that way, specifically by adding a new field to the insertion scankey struct (BTScanInsertData). The field is filled in when initializing an insertion scankey in the usual way. I was able to reuse alignment padding for the new field, so there is no additional space overhead on Linux x86-64. This is a bit more complicated than v1, but there is still a net reduction in overall complexity (and in LOC). I'm going to commit this early next week, barring any objections, and assuming I don't think of any more problems between now and then. -- Peter Geoghegan v2-0001-Prevent-O-N-2-unique-index-insertion-edge-case.patch Description: Binary data
Re: ERROR: failed to add item to the index page
On Tue, Apr 30, 2019 at 6:28 PM Peter Geoghegan wrote: > Attached is a much more polished version of the same patch. I tried to > make clear how the "page full" test (the test that has been fixed to > take heap TID space for high key into account) is related to other > close-by code, such as the tuple space limit budget within > _bt_check_third_page(), and the code that sets up an actual call to > _bt_truncate(). Pushed, though the final version does the test a little differently. It adds the required heap TID space to itupsz, rather than subtracting it from pgspc. This better reflects the underlying logic, and avoids unsigned underflow. -- Peter Geoghegan
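The underflow hazard mentioned here can be sketched in isolation. These are hypothetical helper names, not the committed nbtree code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative only.  Page free space (pgspc) and tuple sizes are
 * unsigned.  Testing "itupsz + tid_space <= pgspc" is safe, whereas
 * "itupsz <= pgspc - tid_space" silently wraps around whenever
 * tid_space exceeds pgspc, making the page look almost infinitely
 * roomy.
 */
static bool
fits_safe(uint16_t pgspc, uint16_t itupsz, uint16_t tid_space)
{
    return (uint32_t) itupsz + tid_space <= pgspc;
}

static bool
fits_buggy(uint16_t pgspc, uint16_t itupsz, uint16_t tid_space)
{
    /* pgspc - tid_space underflows when pgspc < tid_space */
    return itupsz <= (uint16_t) (pgspc - tid_space);
}
```

With pgspc smaller than tid_space, the subtraction wraps to a huge unsigned value and the buggy form wrongly reports a fit; keeping everything on the addition side of the comparison avoids the wrap.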
Re: Improve search for missing parent downlinks in amcheck
On Sat, Apr 27, 2019 at 4:57 PM Alexander Korotkov wrote: > "rootdescend" is cool type of check. Thank you for noticing, I wasn't aware > of it. > But can it detect the missing downlink in following situation? > > A > / \ > B <-> C <-> D > > Here A has downlinks to B and D, which downlink to C is missing, > while B, C and D are correctly connected with leftlinks and rightlinks. > I can see "rootdescend" calls _bt_search(), which would just step > right from C to D as if it was concurrent split. There is a comment about this scenario above bt_rootdescend() in amcheck. You're right -- this is a kind of corruption that even the new rootdescend verification option would miss. We can imagine a version of rootdescend verification that tells the core code to only move right when there was an *interrupted* page split (i.e. P_INCOMPLETE_SPLIT() flag bit is set), but that isn't what happens right now. That said, the lossy downlink check that you want to improve on *should* already catch this situation. Of course it might not because it is lossy (uses a Bloom filter), but I think that that's very unlikely. That's why I would like to understand the problem that you found with the check. -- Peter Geoghegan
Re: Improve search for missing parent downlinks in amcheck
On Sat, Apr 27, 2019 at 5:13 PM Alexander Korotkov wrote: > Yes, increasing of Bloom filter size also helps. But my intention was > to make non-lossy check here. Why is that your intention? Do you want to do this as a feature for Postgres 13, or do you want to treat this as a bug that we need to backpatch a fix for? Can we avoid the problem you saw with the Bloom filter approach by using the real size of the index (i.e. smgrnblocks()/RelationGetNumberOfBlocks()) to size the Bloom filter, and/or by rethinking the work_mem cap? Maybe we can have a WARNING that advertises that work_mem is probably too low? The state->downlinkfilter Bloom filter should be small in almost all cases, so I still don't fully understand your concern. With a 100GB index, we'll have ~13 million blocks. We only need a Bloom filter that is ~250MB to have less than a 1% chance of missing inconsistencies even with such a large index. I admit that it's unfriendly that users are not warned about the shortage currently, but that is something we can probably find a simple (backpatchable) fix for. -- Peter Geoghegan
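For reference, the textbook Bloom filter sizing arithmetic can be sketched as follows. This is the standard formula only -- amcheck's bloom_create() sizes and caps its filter by other rules (including the work_mem cap under discussion), so treat these numbers as illustrative of the general scale, not as what amcheck actually allocates:

```c
#include <assert.h>

/*
 * Textbook Bloom filter sizing: for n elements and false-positive rate
 * p, the optimal bitset size is m = n * ln(1/p) / (ln 2)^2 bits.
 * Constants are precomputed here to avoid a libm dependency; this
 * version is specialized to p = 0.01 (a 1% per-lookup rate).
 */
static const double LN_100 = 4.605170185988092;     /* ln(1/0.01) */
static const double LN_2 = 0.6931471805599453;

static double
bloom_bytes_1pct(double n_elements)
{
    double  bits = n_elements * LN_100 / (LN_2 * LN_2);

    return bits / 8.0;
}
```

By this formula, ~13 million blocks need only about 10 bits per block (on the order of 16MB) for a 1% per-lookup false positive rate, so a filter sized in the hundreds of megabytes carries a very large amount of headroom.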
Re: Improve search for missing parent downlinks in amcheck
On Sat, Apr 27, 2019 at 5:13 PM Alexander Korotkov wrote: > Yes, increasing of Bloom filter size also helps. But my intention was > to make non-lossy check here. I agree that that might be a good goal, but I am interested in knowing if there is something naive about how the downlinkfilter Bloom filter is sized. I think that we could probably do better than this without much work:

    /*
     * Extra readonly downlink check.
     *
     * In readonly case, we know that there cannot be a concurrent
     * page split or a concurrent page deletion, which gives us the
     * opportunity to verify that every non-ignorable page had a
     * downlink one level up. We must be tolerant of interrupted page
     * splits and page deletions, though. This is taken care of in
     * bt_downlink_missing_check().
     */
    total_pages = (int64) state->rel->rd_rel->relpages;
    state->downlinkfilter = bloom_create(total_pages, work_mem, seed);

Maybe we could use "smgrnblocks(index->rd_smgr, MAIN_FORKNUM)" instead of relpages, for example. > It wouldn't be really wasteful, because the same children were > accessed recently. So, they are likely not yet evicted from > shared_buffers. I think we can fit both checks into one loop over > downlinks instead of two. I see your point, but if we're going to treat this as a bug then I would prefer a simple fix. > Yes, we can do more checks with bloom filter. But assuming that we > anyway iterate over children for each non-leaf page, can we do: > 1) All the checks, which bt_downlink_check() does for now > 2) Check there are no missing downlinks > 3) Check hikeys > in one pass, can't we? We can expect every high key in a page to have a copy contained within its parent, either as one of the keys, or as parent's own high key (assuming no concurrent or interrupted page splits or page deletions). This is true today, even though we truncate negative infinity items in internal pages. I think that the simple answer to your question is yes. 
It would be more complicated that way, and the only extra check would be the check of high keys against their parent, but overall this does seem possible. -- Peter Geoghegan
"long" type is not appropriate for counting tuples
Commit ab0dfc961b6 used a "long" variable within _bt_load() to count the number of tuples entered into a B-Tree index as it is built. This will not work as expected on Windows, even on 64-bit Windows, because "long" is only 32-bits wide. It's far from impossible that you'd have ~2 billion index tuples when building a new index. Programmers use "long" because it is assumed to be wider than "int" (even though that isn't required by the standard, and isn't true across all of the platforms we support). The use of "long" seems inherently suspect given our constraints, except perhaps in the context of sizing work_mem-based allocations, where it is used as part of a semi-standard idiom...albeit one that only works because of the restrictions on work_mem size on Windows. ISTM that we should try to come up with a way of making code like this work, rather than placing the burden on new code to get it right. This exact issue has bitten users on a number of occasions that I can recall. There is also a hidden landmine that we know about but haven't fixed: logtape.c, which will break on Windows with very very large index builds. Also, "off_t" is only 32-bits on Windows, which broke parallel CREATE INDEX (issue fixed by commit aa551830). I suppose that "off_t" is really a variant of the same problem. -- Peter Geoghegan
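The Windows hazard can be sketched as follows (made-up helper names; the truncating cast stands in for storing a count into a 32-bit "long", as happens on LLP64 platforms such as 64-bit Windows):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative sketch.  On LLP64 platforms (64-bit Windows), "long" is
 * 32 bits wide, so a tuple counter declared "long" behaves like the
 * int32_t version below once an index build passes INT32_MAX (~2.1
 * billion tuples).  An explicit int64_t counter is portable.
 */
static int32_t
tuples_counted_llp64_long(int64_t actual_count)
{
    return (int32_t) actual_count;  /* truncates, as a 32-bit "long" would */
}

static int64_t
tuples_counted_int64(int64_t actual_count)
{
    return actual_count;            /* full value preserved everywhere */
}
```

A fixed-width type states the intent directly, instead of relying on "long" happening to be 64 bits on the platform at hand.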
Re: "long" type is not appropriate for counting tuples
On Sun, Apr 28, 2019 at 4:25 PM Tom Lane wrote: > > ISTM that we should try to come up with a way of making code like this > > work, rather than placing the burden on new code to get it right. > > Other than "use the right datatype", I'm not sure what we can do? Ambiguity seems like the real problem here. If we could impose a general rule that you cannot use "long" (perhaps with some limited wiggle-room), then a lint-like tool could catch bugs like this. This may not be that difficult. Nobody is particularly concerned about performance on 32-bit platforms these days. > In the meantime, somebody should fix ab0dfc961b6 ... I'll leave this to Alvaro. > Hmm, why is this a problem? We should only use off_t for actual file > access operations, and we don't use files greater than 1GB. (There's a > reason for that.) The issue that was fixed by commit aa551830 showed this assumption to be kind of brittle. Admittedly this is not as clear-cut as the "long" issue, and might not be worth worrying about. I don't want to go as far as requiring explicit width integer types in all situations, since that seems totally impractical, and without any real upside. But it would be nice to identify integer types where there is a real risk of making an incorrect assumption, and then eliminate that risk once and for all. -- Peter Geoghegan
What is an item pointer, anyway?
itemid.h introduces the struct ItemIdData as follows: /* * An item pointer (also called line pointer) on a buffer page Meanwhile, itemptr.h introduces the struct ItemPointerData as follows: /* * ItemPointer: * * This is a pointer to an item within a disk page of a known file * (for example, a cross-link from an index to its parent table). It doesn't seem reasonable to assume that you should know the difference based on context. The two concepts are closely related. An ItemPointerData points to a block, as well as the, uh, item pointer within that block. This ambiguity is avoidable, and should be avoided. ISTM that the least confusing way of removing the ambiguity would be to no longer refer to ItemIds as item pointers, without changing anything else. -- Peter Geoghegan
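For concreteness, here are the two structs paraphrased side by side, with fixed-width stdint types standing in for the PostgreSQL typedefs and field comments condensed from the headers:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Paraphrase of ItemIdData from itemid.h: a "line pointer" entry in a
 * page's line pointer array.  It is page-local -- it locates a tuple
 * within the same page it lives on.
 */
typedef struct ItemIdData
{
    unsigned    lp_off:15,      /* offset of tuple within page */
                lp_flags:2,     /* state of line pointer (LP_* flags) */
                lp_len:15;      /* byte length of tuple */
} ItemIdData;

/*
 * Paraphrase of ItemPointerData from itemptr.h: a TID.  It is global
 * to a relation -- it names a block, plus an offset into that block's
 * line pointer array.
 */
typedef struct BlockIdData
{
    uint16_t    bi_hi;          /* high bits of block number */
    uint16_t    bi_lo;          /* low bits of block number */
} BlockIdData;

typedef struct ItemPointerData
{
    BlockIdData ip_blkid;       /* which block (page) in the relation */
    uint16_t    ip_posid;       /* which line pointer within that page */
} ItemPointerData;
```

The distinction is visible even in the sizes: a line pointer packs into 4 bytes, while a TID needs 6 bytes because it must also identify the block.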
Re: What is an item pointer, anyway?
On Fri, Apr 26, 2019 at 5:13 PM Peter Geoghegan wrote: > On Fri, Apr 26, 2019 at 5:05 PM Tom Lane wrote: > > Yeah, I'd be fine with that, although the disconnect between the type > > name and the comment terminology might confuse some people. > > Maybe, but the fact that the ItemIdData struct consists of bit fields > that are all named "lp_*" offers a hint. Plus you have the LP_* > constants that get stored in ItemIdData.lp_flags. Attached draft patch adjusts code comments and error messages where a line pointer is referred to as an item pointer. It turns out that this practice isn't all that prevalent. Here are some specific concerns that I had to think about when writing the patch, though: * I ended up removing a big indexam.c "old comments" comment paragraph from the Berkeley days, because it used the term item pointer in what seemed like the wrong way, but also because AFAICT it's totally obsolete. * Someone should confirm that I have preserved the original intent of the changes within README.HOT, and the heapam changes that relate to pruning. It's possible that I changed "item pointer" to "line pointer" in one or two places where I should have changed "item pointer" to "tuple" instead. * I changed a few long standing "can't happen" error messages that concern corruption, most of which also relate to pruning. Maybe that's a cost that needs to be considered. * I changed a lazy_scan_heap() log message of long-standing. Another downside that needs to be considered. * I expanded a little on the advantages of using line pointers within bufpage.h. That seemed in scope to me, because it drives home the distinction between item pointers and line pointers. -- Peter Geoghegan v1-0001-Standardize-line-pointers-item-pointer-terminolog.patch Description: Binary data
Re: VACUUM can finish an interrupted nbtree page split -- is that okay?
On Fri, Mar 1, 2019 at 3:59 PM Peter Geoghegan wrote: > /* > * Perform the same check on this internal level that > * _bt_mark_page_halfdead performed on the leaf level. > */ > if (_bt_is_page_halfdead(rel, *rightsib)) > I thought that internal pages were never half-dead after Postgres 9.4. > If that happens, then the check within _bt_pagedel() will throw an > ERRCODE_INDEX_CORRUPTED error, and tell the DBA to REINDEX. Shouldn't > this internal level _bt_is_page_halfdead() check contain a "can't > happen" error, or even a simple assertion? I think that we should get rid of this code on HEAD shortly, because it's effectively dead code. You don't have to know anything about B-Trees to see why this must be true: VACUUM is specifically checking if an internal page is half-dead here, even though it's already treating half-dead internal pages as ipso facto corrupt in higher level code (it's the first thing we check in _bt_pagedel()). This is clearly contradictory. If there is a half-dead internal page, then there is no danger of VACUUM not complaining loudly that you need to REINDEX. This has been true since the page deletion overhaul that went into 9.4. Attached patch removes the internal page check, and adds a comment that explains why it's sufficient to check on the leaf level alone. Admittedly, it's much easier to understand why the _bt_is_page_halfdead() internal page check is useless than it is to understand why replacing it with this comment is helpful. My observation that a half-dead leaf page is representative of the subtree whose root the leaf page has stored as its "top parent" is hardly controversial, though -- that's the whole basis of multi-level page deletion. 
If you *visualize* how multi-level deletion works, and consider its rightmost-in-subtree restriction, then it isn't hard to see why everything works out with just the leaf level right-sibling-is-half-dead check: We can only have two adjacent "skinny" pending-deletion subtrees in cases where the removed check might seem to be helpful -- each page is both the leftmost and the rightmost on its level in its subtree. It's okay to just check if the leaf is half-dead because it "owns" exactly the same range in the keyspace as the internal pages up to and including its top parent, if any, and because it is marked half-dead by the same atomic operation that does initial removal of downlinks in an ancestor page. I'm fine with waiting until we branch-off v12 before pushing the patch, even though it seems low risk. -- Peter Geoghegan 0001-Remove-nbtree-half-dead-internal-page-check.patch Description: Binary data
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Thu, Jul 4, 2019 at 10:38 AM Peter Geoghegan wrote: > I tried this on my own "UK land registry" test data [1], which was > originally used for the v12 nbtree work. My test case has a low > cardinality, multi-column text index. I chose this test case because > it was convenient for me. > > On v12/master, the index is 1100MB. Whereas with your patch, it ends > up being 196MB -- over 5.5x smaller! I also see a huge and consistent space saving for TPC-H. All 9 indexes are significantly smaller. The lineitem orderkey index is "just" 1/3 smaller, which is the smallest improvement among TPC-H indexes in my index bloat test case. The two largest indexes after the initial bulk load are *much* smaller: the lineitem parts supplier index is ~2.7x smaller, while the lineitem ship date index is a massive ~4.2x smaller. Also, the orders customer key index is ~2.8x smaller, and the order date index is ~2.43x smaller. Note that the test involved retail insertions, not CREATE INDEX. I haven't seen any regression in the size of any index so far, including when the number of internal pages is all that we measure. Actually, there seem to be cases where there is a noticeably larger reduction in internal pages than in leaf pages, probably because of interactions with suffix truncation. This result is very impressive. We'll need to revisit what the right trade-off is for the compression scheme, which Heikki had some thoughts on when we left off 3 years ago, but that should be a lot easier now. I am very encouraged by the fact that this relatively simple approach already works quite nicely. It's also great to see that bulk insertions with lots of compression are very clearly faster with this latest revision of your patch, unlike earlier versions from 2016 that made those cases slower (though I haven't tested indexes that don't really use compression). I think that this is because you now do the compression lazily, at the point where it looks like we may need to split the page. 
Previous versions of the patch had to perform compression eagerly, just like GIN, which is not really appropriate for nbtree. -- Peter Geoghegan