Re: Reducing the runtime of the core regression tests

2019-04-12 Thread Peter Geoghegan
On Fri, Apr 12, 2019 at 10:24 AM Alvaro Herrera
 wrote:
> I wonder if it would be useful to add --enable-debug.  I think I
> purposefully removed that, but I don't remember any details about it.

As usual, this stuff is horribly under-documented. I think it's
possible that --enable-debug would help, since llvm-gcov requires it,
but that doesn't seem particularly likely.

It's definitely generally recommended that "-O0" be used, so I think
that we can agree that that was an improvement, even if it doesn't fix
the remaining problem that I noticed when I rechecked nbtutils.c.

-- 
Peter Geoghegan




Re: Reducing the runtime of the core regression tests

2019-04-12 Thread Peter Geoghegan
On Fri, Apr 12, 2019 at 10:49 AM Peter Geoghegan  wrote:
> It's definitely generally recommended that "-O0" be used, so I think
> that we can agree that that was an improvement, even if it doesn't fix
> the remaining problem that I noticed when I rechecked nbtutils.c.

I'm not sure that we can really assume that "-O0" avoids the behavior
I pointed out. Perhaps this counts as "semantic flattening" or
something, rather than an optimization. I could have easily written
the code in _bt_keep_natts_fast() in the way gcov/gcc/whatever thinks
I ought to have, which would have obscured the distinction anyway.

-- 
Peter Geoghegan




Re: Zedstore - compressed in-core columnar storage

2019-04-15 Thread Peter Geoghegan
On Mon, Apr 15, 2019 at 11:35 AM Tom Lane  wrote:
> We do need to limit what we accept into core PG.  I do not buy your
> argument that users expect everything to be in core.  Or more accurately,
> the people who do think that way won't be using PG anyway --- they'll
> be using MSSQL because it comes from their OS vendor.

I am also concerned by the broad scope of ZedStore, and I tend to
agree that it will be difficult to maintain in core. At the same time,
I think that Andres and Robert are probably right about the difficulty
of maintaining it outside of core -- that would be difficult to
impossible as a practical matter.

Unfortunately, you're both right. I don't know where that leaves us.
-- 
Peter Geoghegan




Re: Zedstore - compressed in-core columnar storage

2019-04-15 Thread Peter Geoghegan
On Mon, Apr 15, 2019 at 9:16 AM Ashwin Agrawal  wrote:
> Would like to know more specifics on this Peter. We may be having different 
> context on hybrid row/column design.

I'm confused about how close your idea of a TID is to the traditional
definition from heapam (and even zheap). If it's a purely logical
identifier, then why would it have two components like a TID? Is that
just a short-term convenience or something?

> Yes, the plan to optimize out TID space per datum, either by prefix 
> compression or delta compression or some other trick.

It would be easier to do this if you knew for sure that the TID
behaves almost the same as a bigserial column -- a purely logical
monotonically increasing identifier. That's why I'm interested in what
exactly you mean by TID, the stability of a TID value, etc. If a leaf
page almost always stores a range covering no more than a few hundred
contiguous logical values, you can justify aggressively compressing
the representation in the B-Tree entries. Compression would still be
based on prefix compression, but the representation itself can be
specialized.
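
To make that concrete, here is a rough sketch of the kind of
specialized representation I mean (my own illustration, not ZedStore's
actual on-disk layout): when every TID on a leaf page falls within a
narrow contiguous range, each entry only needs a small delta against a
per-page base value.

#include <stdint.h>
#include <stdio.h>

/*
 * Illustration only: delta-encode purely logical TIDs against a per-page
 * base value ("low key").  When a leaf page covers a few hundred
 * contiguous logical values, each entry needs only a small delta rather
 * than a full-width item pointer.
 */
typedef struct LeafSketch
{
    uint64_t    base_tid;       /* lowest logical TID covered by the page */
    uint16_t    delta[3];       /* per-entry offset from base_tid */
    int         nentries;
} LeafSketch;

int
main(void)
{
    LeafSketch  page = {.base_tid = 1000000, .delta = {0, 1, 7}, .nentries = 3};

    for (int i = 0; i < page.nentries; i++)
        printf("slot %d -> logical TID %llu\n",
               i, (unsigned long long) (page.base_tid + page.delta[i]));
    return 0;
}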

-- 
Peter Geoghegan




Re: Zedstore - compressed in-core columnar storage

2019-04-15 Thread Peter Geoghegan
On Mon, Apr 15, 2019 at 12:19 PM Tom Lane  wrote:
> Perhaps, but we won't know if we don't try.  I think we should try,
> and be willing to add hooks and flexibility to core as needed to make
> it possible.

We could approach it without taking a firm position on inclusion in
core until the project begins to mature. I have little faith in our
ability to predict which approach will be the least painful at this
early stage.

-- 
Peter Geoghegan




Re: Zedstore - compressed in-core columnar storage

2019-04-15 Thread Peter Geoghegan
On Mon, Apr 15, 2019 at 1:02 PM Andres Freund  wrote:
> There's not much of an alternative currently. Indexes require tid
> looking things, and as a consequence (and some other comparatively small
> changes that'd be required) tableam does too.

I'm trying to establish whether or not that's the only reason. It
might be okay to use the same item pointer struct as the
representation of an integer-like logical identifier. Even if it isn't,
I'm still interested in just how logical the TIDs are, because it's an
important part of the overall design.

> That's one of the reasons why I've been trying to get you to get on
> board with allowing different leaf-level "item pointer equivalents"
> widths inside nbtree...

Getting me to agree that that would be nice and getting me to do the
work are two very different things.  ;-)

-- 
Peter Geoghegan




Re: Improve search for missing parent downlinks in amcheck

2019-04-16 Thread Peter Geoghegan
On Tue, Apr 16, 2019 at 12:00 PM Peter Geoghegan  wrote:
> Can you be more specific? What was the cause of the corruption? I'm
> always very interested in hearing about cases that amcheck could have
> detected, but didn't.

FWIW, v4 indexes in Postgres 12 will support the new "rootdescend"
verification option, which isn't lossy, and would certainly have
detected your customer's issue in practice. Admittedly the new check is
quite expensive, even compared to the other bt_index_parent_check()
checks, but it is nice that we now have a verification option that is
*extremely* thorough, and uses _bt_search() directly.

-- 
Peter Geoghegan




Re: Improve search for missing parent downlinks in amcheck

2019-04-16 Thread Peter Geoghegan
On Mon, Apr 15, 2019 at 7:30 PM Alexander Korotkov
 wrote:
> Currently we amcheck supports lossy checking for missing parent
> downlinks.  It collects bitmap of downlink hashes and use it to check
> subsequent tree level.  We've experienced some large corrupted indexes
> which pass this check due to its looseness.

Can you be more specific? What was the cause of the corruption? I'm
always very interested in hearing about cases that amcheck could have
detected, but didn't.

Was the issue that the Bloom filter was simply undersized/ineffective?

> However, it seems to me we can implement this check in non-lossy
> manner without making it significantly slower.  We anyway traverse
> downlinks from parent to children in order to verify that hikeys are
> corresponding to downlink keys.

Actually, we don't check the high keys in children against the parent
(all other items are checked, though). We probably *should* do
something with the high key when verifying consistency across levels,
but currently we don't. (We only use the high key for the same-page
high key check -- more on this below.)

> We can also traverse from one
> downlink to subsequent using rightlinks.  So, if there are some
> intermediate pages between them, they are candidates to have missing
> parent downlinks.  The patch is attached.
>
> With this patch amcheck could successfully detect corruption for our
> customer, which unpatched amcheck couldn't find.

Maybe we can be a lot less conservative about sizing the Bloom filter
instead? That would be an easier fix IMV -- we can probably change
that logic to be a lot more aggressive without anybody noticing, since
the Bloom filter is already usually tiny. We are already not very
careful about saving work within bt_index_parent_check(), but with
this patch we follow each downlink twice instead of once. That seems
wasteful.
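
On the Bloom filter sizing point: the standard formulas show just how
cheap it would be to be a lot less conservative (generic Bloom filter
math, not the amcheck code itself; link with -lm):

#include <math.h>
#include <stdio.h>

/*
 * Standard Bloom filter sizing: for n elements and a target false
 * positive rate p, the optimal bit count is m = -n * ln(p) / (ln 2)^2,
 * using k = (m / n) * ln 2 hash functions.  Generic illustration only,
 * not the amcheck implementation.
 */
static void
bloom_size(double n, double p)
{
    double      m = -n * log(p) / (log(2) * log(2));
    double      k = (m / n) * log(2);

    printf("n=%.0f, p=%.4f -> %.1f MB, k=%.1f hashes\n",
           n, p, m / 8.0 / (1024.0 * 1024.0), k);
}

int
main(void)
{
    bloom_size(1000000.0, 0.01);        /* ~1.1 MB for 1M downlinks at 1% */
    bloom_size(10000000.0, 0.001);      /* ~17 MB for 10M downlinks at 0.1% */
    return 0;
}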

The reason I used a Bloom filter here is because I would eventually
like the missing downlink check to fingerprint entire tuples, not just
block numbers. In Lehman and Yao terms, the check could in the future fingerprint
the separator key and the downlink at the same time, not just the
downlink. That way, we could probe the Bloom filter on the next level
down for its high key (with the right sibling pointer set to be
consistent with the parent) iff we don't see that the page split was
interrupted (i.e. iff the P_INCOMPLETE_SPLIT() bit is not set). Obviously
this would be a more effective form of verification, since we would
notice high key values that don't agree with the parent's values for
the same sibling/cousin/child block.

I didn't do it that way for v11 because of "minus infinity" items on
internal pages, which don't store the original key (the key remains
the high key of the left sibling page, but we truncate the original to
0 attributes in _bt_pgaddtup()). I think that we should eventually
stop truncating minus infinity items, and actually store a "low key"
at P_FIRSTDATAKEY() within internal pages instead. That would be
useful for other things anyway (e.g. prefix compression).

--
Peter Geoghegan




Re: New vacuum option to do only freezing

2019-04-17 Thread Peter Geoghegan
On Wed, Apr 17, 2019 at 10:46 AM Tom Lane  wrote:
> Yeah, if we wanted to expose these complications more directly, we
> could think about adding or changing the main counters.  I was wondering
> about whether we should have it report "x bytes and y line pointers
> freed", rather than counting tuples per se.

I like that idea, but I'm pretty sure that there are very few users
that are aware of these distinctions at all. It would be a good idea
to clearly document them.

-- 
Peter Geoghegan




Pathological performance when inserting many NULLs into a unique index

2019-04-17 Thread Peter Geoghegan
I thought of a case that results in pathological performance due to a
behavior of my nbtree patch series:

regression=# create table uniquenulls(nully int4, constraint pp unique(nully));
CREATE TABLE
Time: 10.694 ms
regression=# insert into uniquenulls select i from generate_series(1, 1e6) i;
INSERT 0 1000000
Time: 1356.025 ms (00:01.356)
regression=# insert into uniquenulls select null from generate_series(1, 1e6) i;
INSERT 0 1000000
Time: 270834.196 ms (04:30.834)

The issue here is that the duration of the second INSERT statement is
wildly excessive, because _bt_stepright() needs to step right many
many times for each tuple inserted. I would expect the second insert
to take approximately as long as the first, but it takes ~200x longer
here. It could take much longer still if we pushed this example
further. What we see here is a limited case in which the O(n ^ 2)
performance that "getting tired" used to prevent can occur. Clearly
that's totally unacceptable. Mea culpa.

Sure enough, the problem goes away when the index isn't a unique index
(i.e. in the UNIQUE_CHECK_NO case):

regression=# alter table uniquenulls drop constraint pp;
ALTER TABLE
Time: 28.968 ms
regression=# create index on uniquenulls (nully);
CREATE INDEX
Time: 1159.958 ms (00:01.160)
regression=# insert into uniquenulls select null from generate_series(1, 1e6) i;
INSERT 0 1000000
Time: 1155.708 ms (00:01.156)

Tentatively, I think that the fix here is to not "itup_key->scantid =
NULL" when a NULL value is involved (i.e. don't "itup_key->scantid =
NULL" when we see IndexTupleHasNulls(itup) within _bt_doinsert()). We
may also want to avoid calling _bt_check_unique() entirely in this
situation. That way, the performance should be the same as the
UNIQUE_CHECK_NO case: we descend to the appropriate leaf page
directly, and then we're done. We don't have to find the appropriate
leaf page by groveling through indefinitely-many existing leaf pages
that are full of NULL values, because we know that there cannot ever
be a unique violation for us to detect.
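
In code terms, the decision I have in mind boils down to something like
this (a self-contained sketch, not the final patch; the parameter names
stand in for checkUnique != UNIQUE_CHECK_NO and IndexTupleHasNulls(itup)
respectively):

#include <stdbool.h>
#include <stdio.h>

/*
 * Rough sketch of the decision I have in mind for _bt_doinsert(), not
 * the final patch.  SQL NULL is never equal to anything, so a new tuple
 * with a NULL key column cannot cause a unique violation.  When no
 * unique check is needed we can leave itup_key->scantid set, descend
 * straight to the one correct leaf page, and skip _bt_check_unique()
 * entirely -- just like the UNIQUE_CHECK_NO case.
 */
static bool
insertion_needs_unique_check(bool unique_check_requested,
                             bool tuple_has_null_key)
{
    return unique_check_requested && !tuple_has_null_key;
}

int
main(void)
{
    /* unique index, but the new tuple's key is NULL: skip the check */
    printf("%d\n", insertion_needs_unique_check(true, true));
    /* unique index, non-NULL key: do the full _bt_check_unique() dance */
    printf("%d\n", insertion_needs_unique_check(true, false));
    return 0;
}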

I'll create an open item for this, and begin work on a patch tomorrow.
-- 
Peter Geoghegan




Re: Zedstore - compressed in-core columnar storage

2019-04-13 Thread Peter Geoghegan
On Thu, Apr 11, 2019 at 6:06 AM Rafia Sabih  wrote:
> Reading about it reminds me of this work -- TAG column storage( 
> http://www09.sigmod.org/sigmod/record/issues/0703/03.article-graefe.pdf ).
> Isn't this storage system inspired from there, with TID as the TAG?
>
> It is not referenced here so made me wonder.

I don't think they're particularly similar, because that paper
describes an architecture based on using purely logical row
identifiers, which is not what a TID is. TID is a hybrid
physical/logical identifier, sometimes called a "physiological"
identifier, which will have significant overhead. Ashwin said that
ZedStore TIDs are logical identifiers, but I don't see how that's
compatible with a hybrid row/column design (unless you map heap TID to
logical row identifier using a separate B-Tree).

The big idea with Graefe's TAG design is that there is practically no
storage overhead for these logical identifiers, because each entry's
identifier is calculated by adding its slot number to the page's
tag/low key. The ZedStore design, in contrast, explicitly stores a TID
for every entry. ZedStore seems more flexible for that reason, but at
the same time the per-datum overhead seems very high to me. Maybe
prefix compression could help here -- something that a low key and high
key support rather well.

-- 
Peter Geoghegan




Re: Reducing the runtime of the core regression tests

2019-04-11 Thread Peter Geoghegan
On Thu, Apr 11, 2019 at 9:55 AM Tom Lane  wrote:
> So I concur that indexing.sql's fastpath test
> isn't adding anything useful coverage-wise, and will just nuke it.

Good.

> (It'd be interesting perhaps to check whether the results shown
> by coverage.postgresql.org are similarly unstable.  They might be
> less so, since I believe those are taken over the whole check-world
> suite not just the core regression tests.)

I'm almost certain that they're at least slightly unstable. I mostly
find the report useful because it shows whether or not something gets
hit at all. I don't trust it to be very accurate.

I've noticed that the coverage reported on coverage.postgresql.org
sometimes looks contradictory, which can happen due to compiler
optimizations. I wonder if that could be addressed in some way,
because I find the site to be a useful resource. I would at least like
to know the settings used by its builds.

-- 
Peter Geoghegan




Re: Reducing the runtime of the core regression tests

2019-04-11 Thread Peter Geoghegan
On Thu, Apr 11, 2019 at 11:00 AM Alvaro Herrera
 wrote:
> ./configure --enable-depend --enable-coverage --enable-tap-tests --enable-nls 
> --with-python --with-perl --with-tcl --with-openssl --with-libxml --with-ldap 
> --with-pam >> $LOG 2>&1
>
> make -j4 >> $LOG 2>&1
> make -j4 -C contrib >> $LOG 2>&1
> make check-world PG_TEST_EXTRA="ssl ldap" >> $LOG 2>&1
> make coverage-html >> $LOG 2>&1
>
> There are no environment variables that would affect it.

Could we add "CFLAGS=-O0"? This should prevent the kind of
machine-code-level line counting described here:

https://gcc.gnu.org/onlinedocs/gcc/Gcov-and-Optimization.html#Gcov-and-Optimization

I think that it makes sense to prioritize making it clear which exact
lines were executed in terms of the semantics of C. I might prefer to
have optimizations enabled if I was optimizing my code, but that's not
what the web resource is for, really.

Thanks
-- 
Peter Geoghegan




Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-13 Thread Peter Geoghegan
On Tue, Mar 12, 2019 at 11:40 AM Robert Haas  wrote:
> I think it's pretty clear that we have to view that as acceptable.  I
> mean, we could reduce contention even further by finding a way to make
> indexes 40% larger, but I think it's clear that nobody wants that.

I found this analysis of bloat in the production database of Gitlab in
January 2019 fascinating:

https://about.gitlab.com/handbook/engineering/infrastructure/blueprint/201901-postgres-bloat/

They determined that their tables consisted of about 2% bloat, whereas
indexes were 51% bloat (determined by running VACUUM FULL, and
measuring how much smaller indexes and tables were afterwards). That
in itself may not be that telling. What is telling is that the index
bloat disproportionately affects certain kinds of indexes. As they put it,
"Indexes that do not serve a primary key constraint make up 95% of the
overall index bloat". In other words, the vast majority of all bloat
occurs within non-unique indexes, with most remaining bloat in unique
indexes.

One factor that could be relevant is that unique indexes get a lot
more opportunistic LP_DEAD killing. Unique indexes don't rely on the
similar-but-distinct kill_prior_tuple optimization --  a lot more
tuples can be killed within _bt_check_unique() than with
kill_prior_tuple in realistic cases. That's probably not really that
big a factor, though. It seems almost certain that "getting tired" is
the single biggest problem.

The blog post drills down further, and cites several examples of
*extremely* bloated indexes on a single column, which is obviously low
cardinality. This includes an index on a boolean field, and an index
on an enum-like text field. In my experience, having many indexes like
that is very common in real world applications, though not at all
common in popular benchmarks (with the exception of TPC-E).

It also looks like they may benefit from the "split after new item"
optimization, at least among the few unique indexes that were very
bloated, such as merge_requests_pkey:

https://gitlab.com/snippets/1812014

Gitlab is open source, so it should be possible to confirm my theory
about the "split after new item" optimization (I am certain about
"getting tired", though). Their schema is defined here:

https://gitlab.com/gitlab-org/gitlab-ce/blob/master/db/schema.rb

I don't have time to confirm all this right now, but I am pretty
confident that they have both problems. And almost as confident that
they'd notice substantial benefits from this patch series.
-- 
Peter Geoghegan



Re: VACUUM can finish an interrupted nbtree page split -- is that okay?

2019-05-16 Thread Peter Geoghegan
On Mon, May 13, 2019 at 9:09 PM Peter Geoghegan  wrote:
> Even when that happens, the index is already considered corrupt by
> VACUUM, so the same VACUUM process that could in theory be adversely
> affected by removing the half-dead internal page check will complain
> about the page when it gets to it later on -- the user will be told to
> REINDEX. And even then, we will never actually get to apply the check
> that I propose to remove, since we're already checking the leaf page
> sibling of the leaf-level target -- the leaf-level test that was added
> by efada2b8e92 was clearly necessary. But it was also sufficient (no
> equivalent internal half-dead right sibling test is needed): a 9.3-era
> half-dead internal page cannot have more than one child, which must be
> undergoing deletion as well.

Actually, now that I look back at how page deletion worked 5+ years
ago, I realize that I have this slightly wrong: the leaf level check
is not sufficient to figure out if the parent's right sibling is
pending deletion (which is represented explicitly as a half-dead
internal page prior to 9.4). All the same, I'm going to push ahead
with this patch. Bugfix commit efada2b8e92 was always about a bug in
9.4 -- it had nothing to do with 9.3. And, in the unlikely event that
there is a problem on a pg_upgrade'd 9.3 -> 12 database that happens
to have half-dead internal pages, we'll still get a useful, correct
error from VACUUM one way or another. It might be slightly less
friendly as error messages about corruption go, but that seems
acceptable to me.

-- 
Peter Geoghegan




Re: VACUUM can finish an interrupted nbtree page split -- is that okay?

2019-05-16 Thread Peter Geoghegan
On Thu, May 16, 2019 at 1:05 PM Peter Geoghegan  wrote:
> Actually, now that I look back at how page deletion worked 5+ years
> ago, I realize that I have this slightly wrong: the leaf level check
> is not sufficient to figure out if the parent's right sibling is
> pending deletion (which is represented explicitly as a half-dead
> internal page prior to 9.4). All the same, I'm going to push ahead
> with this patch. Bugfix commit efada2b8e92 was always about a bug in
> 9.4 -- it had nothing to do with 9.3.

I meant bugfix commit 8da31837803 (commit efada2b8e92 was the commit
that had the bug in question).


-- 
Peter Geoghegan




Re: Suppressing noise in successful check-world runs

2019-05-24 Thread Peter Geoghegan
On Wed, May 22, 2019 at 3:57 PM Tom Lane  wrote:
> I experimented with the attached quick-hack patch to make pg_regress
> suppress notices from its various initial DROP/CREATE IF [NOT] EXISTS
> commands.  I'm not entirely convinced whether suppressing them is
> a good idea though.  Perhaps some hack with effects confined to
> pg_upgrade's test would be better.  I don't have a good idea what
> that would look like, however.
>
> Or we could just say this isn't annoying enough to fix.

I think it's worth fixing.

-- 
Peter Geoghegan




Re: Suppressing noise in successful check-world runs

2019-05-24 Thread Peter Geoghegan
On Fri, May 24, 2019 at 4:18 PM Tom Lane  wrote:
> Yes, I see that too with sufficiently high -j.  I believe this is
> what Noah was trying to fix in bd1592e85, but that patch evidently
> needs a bit more work :-(

It would be nice if this was fixed, but I don't see a problem when I
use the optimum number of jobs, so I don't consider it to be urgent.

I'm happy with the new approach, since it avoids the problem of
regression.diffs files that get deleted before I have a chance to take
a look. I should thank Noah for his work on this.

-- 
Peter Geoghegan




Re: Suppressing noise in successful check-world runs

2019-05-24 Thread Peter Geoghegan
On Fri, May 24, 2019 at 12:31 PM Peter Geoghegan  wrote:
>
> On Wed, May 22, 2019 at 3:57 PM Tom Lane  wrote:
> > I experimented with the attached quick-hack patch to make pg_regress
> > suppress notices from its various initial DROP/CREATE IF [NOT] EXISTS
> > commands.  I'm not entirely convinced whether suppressing them is
> > a good idea though.  Perhaps some hack with effects confined to
> > pg_upgrade's test would be better.  I don't have a good idea what
> > that would look like, however.
> >
> > Or we could just say this isn't annoying enough to fix.
>
> I think it's worth fixing.

My development machine has 8 logical cores, and like you I only see
the NOTICE from pg_upgrade's tests with "-j10":

pg@bat:/code/postgresql/patch/build$ time make check-world -j10 >/dev/null
NOTICE:  database "regression" does not exist, skipping
make check-world -j10 > /dev/null  86.40s user 34.10s system 140% cpu
1:25.94 total

However, I see something else with "-j16", even after a precautionary
clean + rebuild:

pg@bat:/code/postgresql/patch/build$ time make check-world -j16 >/dev/null
NOTICE:  database "regression" does not exist, skipping
pg_regress: could not open file
"/code/postgresql/patch/build/src/test/regress/regression.diffs" for
reading: No such file or directory
make check-world -j16 > /dev/null  96.35s user 37.45s system 152% cpu
1:27.49 total

I suppose this might be because of a race between pg_regress and make
over the "regression.diffs" file. This is also a problem for my current
workflow for running "make check-world" in parallel [1], though only
when there is definitely a regression.diffs file with actual
regressions -- there is no regression that I'm missing here, and as
far as I know this output about "regression.diffs" is just more noise.
I had intended to figure out a way of keeping "regression.diffs" with
my existing workflow, since losing the details of a test failure is a
real annoyance. Especially when there is a test that doesn't fail
reliably.

[1] https://wiki.postgresql.org/wiki/Committing_checklist#Basic_checks
-- 
Peter Geoghegan




Re: Sort support for macaddr8

2019-06-04 Thread Peter Geoghegan
On Tue, Jun 4, 2019 at 2:55 PM Andres Freund  wrote:
> Yea, I don't immediately see a problem with doing that on a major
> version boundary. Obviously that'd only be possible for sizeof(Datum) ==
> 8 == sizeof(macaddr8) platforms, but that's the vast majority these
> days. And generally, I think it's just about *always* worth to go for a
> pass-by-value for the cases where that doesn't imply space wastage.

It would be faster to do it that way, I think. You would need a more
complicated comparator than a classic abbreviated comparator (i.e. a
3-way unsigned int comparator), but it would very probably be faster on
balance.

I'm glad to hear that it isn't *obviously* a problem from a
compatibility perspective -- I really wasn't sure about that, since
retrofitting a type to be pass-by-value like this is something that
may never have been attempted before now (at least not since we
started to care about pg_upgrade).

> I think it might be worthwhile to optimize things so that all typlen > 0
> && typlen <= sizeof(Datum) are allowed for byval datums.
>
> SELECT typname, typlen FROM pg_type WHERE typlen > 0 AND typlen <= 8 AND NOT 
> typbyval;
> ┌──────────┬────────┐
> │ typname  │ typlen │
> ├──────────┼────────┤
> │ tid      │      6 │
> │ macaddr  │      6 │
> │ macaddr8 │      8 │
> └──────────┴────────┘
> (3 rows)

This is half the reason why I ended up implementing itemptr_encode()
to accelerate the TID sort used by CREATE INDEX CONCURRENTLY some
years back -- TID is 6 bytes wide, which doesn't have the necessary
macro support within postgres.h. There is no reason why that couldn't
be added for the benefit of both TID and macaddr types, though it
probably wouldn't be worth it. And, as long as we're not going to
those lengths, there may be some value in keeping the macaddr8 code in
line with macaddr code -- the two types are currently almost the same
(the glaring difference is the lack of macaddr8 sort support).
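
For reference, the trick amounts to something like the following (a
simplified standalone sketch in the spirit of itemptr_encode(), not the
exact function from the tree):

#include <stdint.h>
#include <stdio.h>

/*
 * Simplified sketch in the spirit of itemptr_encode(): pack a 6-byte TID
 * (4-byte block number plus 2-byte offset number) into one 64-bit integer
 * whose numeric order matches (block, offset) order, so the value can be
 * sorted as a cheap pass-by-value integer.
 */
static int64_t
tid_encode(uint32_t block, uint16_t offset)
{
    return ((int64_t) block << 16) | (int64_t) offset;
}

int
main(void)
{
    printf("(0,1) -> %lld, (1,1) -> %lld\n",
           (long long) tid_encode(0, 1), (long long) tid_encode(1, 1));
    return 0;
}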

We'll need to draw the line somewhere, and that is likely to be a bit
arbitrary. This was what I meant by "weird".

-- 
Peter Geoghegan




Re: Sort support for macaddr8

2019-06-04 Thread Peter Geoghegan
On Tue, Jun 4, 2019 at 3:23 PM Andres Freund  wrote:
> > This is half the reason why I ended up implementing itemptr_encode()
> > to accelerate the TID sort used by CREATE INDEX CONCURRENTLY some
> > years back -- TID is 6 bytes wide, which doesn't have the necessary
> > macro support within postgres.h. There is no reason why that couldn't
> > be added for the benefit of both TID and macaddr types, though it
> > probably wouldn't be worth it.
>
> I think we should definitely do that. It seems not unlikely that other
> people want to write new fixed width types, and we shouldn't have them
> deal with all this complexity unnecessarily.

On second thought, maybe there is something to be said for being
exhaustive here.

It seems like there is a preference for making macaddr8 pass-by-value
instead of adding abbreviated keys support to macaddr8, and possibly
doing the same with the original macaddr type.

Do you think that you'll be able to work on the project with this
expanded scope, Melanie?

-- 
Peter Geoghegan




Re: Sort support for macaddr8

2019-06-03 Thread Peter Geoghegan
On Mon, Jun 3, 2019 at 2:59 PM Chapman Flack  wrote:
> 1. (This one seems like a bug.) In the little-endian case, if
>SIZEOF_DATUM is smaller than the type, I'm not convinced by doing
>the DatumBigEndianToNative() after the memcpy(). Seems like that's
>too late to make sure the most-significant bytes got copied.

Uh, when else would you do it? Before the memcpy()?

> 2. (This one seems like an API opportunity.) If it becomes common to
>add abbreviation support for smallish types such that (as here,
>when SIZEOF_DATUM >= 8), an abbreviated "equality" result is in fact
>authoritative, would it be worthwhile to have some way for the sort
>support routine to announce that fact to the caller? That could
>spare the caller the effort of re-checking with the authoritative
>routine.

It's possible that that would make sense, but I don't think that this
patch needs to do that. There is at least one pre-existing case that
does this -- macaddr.

-- 
Peter Geoghegan




Re: Sort support for macaddr8

2019-06-03 Thread Peter Geoghegan
On Mon, Jun 3, 2019 at 1:17 PM Melanie Plageman
 wrote:
> I tried checking to see if there is a performance difference using the
> attached DDL based on src/test/regress/sql/macaddr8.sql. I found
> that the sort function is only exercised when creating an index (not,
> for example, when doing some type of aggregation).

As you know, it's a bit weird that we're proposing adding sort support
with abbreviated keys for a type that is 8 bytes, since you'd expect
it to also be pass-by-value on most platforms, which largely defeats
the purpose of having abbreviated keys (though sort support could
still make sense, for the same reason it makes sense to have it for
int8). However, macaddr8 isn't actually pass-by-value, and it seems
too late to do anything about that now, so abbreviated keys actually
make sense.

> With the patch applied to current master and using the DDL attached,
> the timing for creating the index hovered around 20 ms for master and
> 15 ms for the patched version.

I would expect a sufficiently large sort to execute in about half the
time with abbreviation, based on previous experience. However, I think
that this patch can be justified in a relatively straightforward way.
It extends sort support for macaddr to macaddr8, since these two types
are almost identical in every other way. We should still validate the
performance out of an abundance of caution, but I would be very
surprised if there was much difference between the macaddr and
macaddr8 cases.

In short, users should not be surprised by the big gap in performance
between macaddr and macaddr8. It's worth being consistent there.

> I think that that seems like an improvement. I was thinking of
> registering this patch for the next commitfest. Is that okay?

Definitely, yes.

> I was just wondering what the accepted way to test and share
> performance numbers is for a small patch like this. Is sharing DDL
> enough? Do I need to use pg_bench?

I've always used custom microbenchmarks for stuff like this.
Repeatedly executing a particular query and taking the median
execution time as representative seems to be the best approach.

--
Peter Geoghegan




Re: Sort support for macaddr8

2019-06-03 Thread Peter Geoghegan
On Mon, Jun 3, 2019 at 2:03 PM Chapman Flack  wrote:
> Am I going cross-eyed, or would the memset be serving more of a purpose
> if it were in the SIZEOF_DATUM != 8 branch?

No, it wouldn't -- that's the correct place for it with the macaddr
type. However, it isn't actually necessary to memset() at the
equivalent point for macaddr8, since we cannot "run out of bytes from
the authoritative representation" that go in the Datum/abbreviated
key. I suppose that the memset() should simply be removed, since it is
superfluous here.
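
To spell the reasoning out, here is a toy version of the conversion
(assuming a little-endian machine with 8-byte Datums; gcc/clang's
__builtin_bswap64 stands in for DatumBigEndianToNative(), and none of
this is the actual patch):

#include <stdint.h>
#include <string.h>
#include <stdio.h>

/*
 * Illustration only: building an abbreviated key for a fixed-width
 * byte-string type on a little-endian SIZEOF_DATUM == 8 platform.  The
 * bytes are packed most-significant-first, so that plain unsigned
 * integer comparison of the result agrees with memcmp() ordering.
 *
 * With a 6-byte value (macaddr) the two trailing bytes of the datum
 * must be zeroed first; with an 8-byte value (macaddr8) the memcpy()
 * already overwrites every byte, which is why the memset() equivalent
 * is superfluous there.
 */
static uint64_t
abbrev_bytes(const unsigned char *authoritative, size_t len)
{
    uint64_t    datum = 0;      /* plays the role of the memset() */

    memcpy(&datum, authoritative, len);         /* len is 6 or 8 */
    return __builtin_bswap64(datum);            /* big-endian to native */
}

int
main(void)
{
    unsigned char a[6] = {0x08, 0x00, 0x2b, 0x01, 0x02, 0x03};
    unsigned char b[6] = {0x08, 0x00, 0x2b, 0x01, 0x02, 0x04};

    printf("a < b: %d\n", abbrev_bytes(a, 6) < abbrev_bytes(b, 6));
    return 0;
}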


-- 
Peter Geoghegan




unused_oids script should advertise reserved OID range

2019-06-02 Thread Peter Geoghegan
The unused_oids script has gone from being something of interest to
everybody that wants to write a patch that creates a new catalog
entry, to something that patch authors could do without in many cases.
I think that its output should prominently advertise that patches
should use random OIDs in the range 8000 - 9999. Commit
a6417078c4140e51cfd717448430f274b449d687 established that this should
be standard practice for patch authors.

Actually, maybe it should even suggest a *particular* random OID in
that range, so that the choice of OID is reliably random -- why even
require patch authors to pick a number at random?
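
A minimal sketch of that suggestion (the real script is Perl, and would
also have to skip values that other pending patches have already taken;
this just illustrates the idea):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/*
 * Sketch only: suggest one random OID from the range reserved for
 * development patches (8000-9999).  The real unused_oids script would
 * also need to exclude OIDs that are already assigned.
 */
int
main(void)
{
    srand((unsigned) time(NULL));
    printf("Patches should use a random OID in the 8000-9999 range,\n"
           "for example: %d\n", 8000 + rand() % 2000);
    return 0;
}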

It also looks like pg_proc.dat should be updated, since it still
mentions the old custom of trying to use contiguous OIDs. It also
discourages the practice of picking OIDs at random, which is almost
the opposite of what it should say.

-- 
Peter Geoghegan




Re: crash testing suggestions for 12 beta 1

2019-06-05 Thread Peter Geoghegan
On Wed, Jun 5, 2019 at 2:11 PM Alvaro Herrera  wrote:
> REINDEX CONCURRENTLY would be one good area to focus on, I think, as
> well as ALTER TABLE ATTACH PARTITION.  Maybe also INCLUDE columns in
> GiST, and the stuff in commits 9155580fd, fe280694d and 7df159a62.

Those all seem like good things to target.

Forgive me for droning on about amcheck once more, but maybe it'll
help: amcheck has the capability to detect at least two historic bugs
in CREATE INDEX CONCURRENTLY that made it into stable releases. The
"heapallindexed" verification option's bt_tuple_present_callback()
function has a header comment that talks about this. In short, any
"unhandled" broken hot chain (i.e. broken hot chains that are somehow
not correctly detected and handled) should be reported as corrupt by
amcheck with the "heapallindexed" check, provided the tuple is visible
to verification's heap scan.

The CREATE INDEX CONCURRENTLY bug that Pavan found a couple of years
back while testing the WARM patch is one example. A bug that was
fallout from the DROP INDEX CONCURRENTLY work is another historic
example. Alvaro will recall that this same check had a role in the
"freeze the dead" business.

-- 
Peter Geoghegan




Re: Top-N sorts in EXPLAIN, row count estimates, and parallelism

2019-06-05 Thread Peter Geoghegan
On Thu, May 23, 2019 at 7:48 PM David Rowley
 wrote:
> > cost_sort() costs sorts as top-N heapsorts. While we cannot make an
> > iron-clad guarantee that it will work out that way from within
> > tuplesort.c, that doesn't seem like it closes off the possibility of
> > more informative EXPLAIN output. For example, can't we at report that
> > the tuplesort will be "bounded" within EXPLAIN, indicating that we
> > intend to attempt to sort using a top-N heap sort (i.e. we'll
> > definitely do it that way if there is sufficient work_mem)?
>
> I think this really needs more of a concrete proposal.  Remember
> LIMIT/OFFSET don't need to be constants, they could be a Param or some
> return value from a subquery, so the bound might not be known until
> after executor startup, to which EXPLAIN is not going to get to know
> about that.

I was merely pointing out that it is clear when a sort *could* be a
top-n sort, which could be exposed by EXPLAIN without anyone feeling
misled.

> After that, what would we do with it in EXPLAIN?  Always show "Bound:
> ", if it's not -1?

I'm not sure.

The distinction between a top-n sort and any other sort is an
important one (it's certainly more important than the distinction
between an internal and external sort), so it's worth being flexible
in order to expose more information in EXPLAIN output. I would be
willing to accept some kind of qualified or hedged description in the
EXPLAIN output for a bounded sort node, even though that approach
doesn't seem desirable in general.
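
For anyone less familiar with the term, a top-N heapsort boils down to
something like the following (a generic standalone sketch, not
tuplesort.c's implementation): the sort keeps a bounded heap of the N
best values seen so far and throws everything else away, so memory
stays fixed no matter how large the input is.

#include <stdio.h>

/*
 * Generic sketch of a bounded ("top-N") sort for the N smallest values:
 * keep a max-heap of the N best values seen so far, and discard any
 * value that is not smaller than the heap's root.  Memory stays fixed
 * at N entries regardless of how many rows are scanned.
 */
static void
sift_down(int *heap, int n, int i)
{
    for (;;)
    {
        int         largest = i;
        int         l = 2 * i + 1;
        int         r = 2 * i + 2;

        if (l < n && heap[l] > heap[largest])
            largest = l;
        if (r < n && heap[r] > heap[largest])
            largest = r;
        if (largest == i)
            break;

        int         tmp = heap[i];

        heap[i] = heap[largest];
        heap[largest] = tmp;
        i = largest;
    }
}

int
main(void)
{
    int         input[] = {42, 7, 19, 3, 88, 11, 5, 64, 23, 1};
    int         bound = 3;      /* think "ORDER BY x LIMIT 3" */
    int         heap[3];
    int         nheap = 0;

    for (int i = 0; i < 10; i++)
    {
        if (nheap < bound)
        {
            heap[nheap++] = input[i];
            if (nheap == bound)         /* heapify once the bound is hit */
                for (int j = bound / 2 - 1; j >= 0; j--)
                    sift_down(heap, bound, j);
        }
        else if (input[i] < heap[0])
        {
            heap[0] = input[i];         /* evict the current worst keeper */
            sift_down(heap, bound, 0);
        }
    }

    /* the heap now holds the 3 smallest values, in heap order */
    for (int i = 0; i < nheap; i++)
        printf("%d ", heap[i]);
    printf("\n");
    return 0;
}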

-- 
Peter Geoghegan




Re: nbtdesc.c and nbtpage.c are inconsistent with XLOG_BTREE_META_CLEANUP (11~)

2019-06-16 Thread Peter Geoghegan
On Sun, Jun 16, 2019 at 6:31 PM Michael Paquier  wrote:
> I think that we could just patch nbtpage.c so as we fetch the
> metadata using XLogRecGetBlockData(), and log its data.

Don't you mean nbtdesc.c?

> The error
> comes from 3d92796, which is one year-old and has been visibly
> committed untested.  I am surprised that we  have not seen that
> complain yet.

Why is that surprising?

https://coverage.postgresql.org/src/backend/access/rmgrdesc/index.html

-- 
Peter Geoghegan




Re: nbtdesc.c and nbtpage.c are inconsistent with XLOG_BTREE_META_CLEANUP (11~)

2019-06-16 Thread Peter Geoghegan
On Sun, Jun 16, 2019 at 7:05 PM Michael Paquier  wrote:
> I would have supposed that more people scan WAL records using the
> description callbacks.

The WAL record in question, XLOG_BTREE_META_CLEANUP, is certainly one
of the less common record types used by nbtree. I agree that this
should have been tested when it went in, but I'm not surprised that
the bug remained undetected for a year. Not that many people use
pg_waldump.

-- 
Peter Geoghegan




Re: Status of the table access method work

2019-06-11 Thread Peter Geoghegan
On Tue, Jun 11, 2019 at 8:59 AM Robert Haas  wrote:
> I don't think that including visibility information in indexes is a
> bad idea; we've thought about making zheap do this someday. But I
> think that we need to use some more sophisticated approach involving,
> maybe, undo pointers, or some other kind of magic, rather than just
> widening the tuples. I expect that just widening the tuples would be
> good enough to win for some use cases, but I think there would be
> others that lose heavily.

+1. Limited visibility information would make sense (e.g. maybe a per
tuple all-visible bit), which would have to be backed by something
like UNDO, but storing XIDs in tuples seems like a very bad idea. The
idea that something like this would have to be usable by any possible
table access method seems unworkable in general.

Sometimes it seems like the table access method work could use some
specific non-goals. Perhaps I just haven't being paying enough
attention to have noticed them.

-- 
Peter Geoghegan




Re: ANALYZE: ERROR: tuple already updated by self

2019-06-18 Thread Peter Geoghegan
On Tue, Jun 18, 2019 at 4:49 PM Justin Pryzby  wrote:
> Sure:
>
> (gdb) bt
> #0  errfinish (dummy=0) at elog.c:414
> #1  0x0085e834 in elog_finish (elevel=, 
> fmt=) at elog.c:1376
> #2  0x004b93bd in simple_heap_update (relation=0x7fee161700c8, 
> otid=0x1fb7f44, tup=0x1fb7f40) at heapam.c:4613
> #3  0x0051bdb7 in CatalogTupleUpdate (heapRel=0x7fee161700c8, 
> otid=0x1fb7f44, tup=0x1fb7f40) at indexing.c:234

It might be interesting to set a breakpoint within heap_update(),
which is called by simple_heap_update() -- technically, this is where
the reported failure occurs. From there, you could send an image of
the page to the list by following the procedure described here:

https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD#Dumping_a_page_image_from_within_GDB

You'll have to hit "next" a few times, until heap_update()'s "page"
variable is initialized.
-- 
Peter Geoghegan




Re: ANALYZE: ERROR: tuple already updated by self

2019-06-18 Thread Peter Geoghegan
On Tue, Jun 18, 2019 at 5:09 PM Andres Freund  wrote:
> > It might be interesting to set a breakpoint within heap_update(),
> > which is called by simple_heap_update() --technically, this is where
> > the reported failure occurs. From there, you could send an image of
> > the page to the list by following the procedure described here:

> Hm, what are you hoping to glean by doing so?

Nothing in particular. I see no reason to assume that we know what
that looks like, though. It's easy to check.

-- 
Peter Geoghegan




Re: [PATCH] Incremental sort (was: PoC: Partial sort)

2019-06-25 Thread Peter Geoghegan
On Tue, Jun 25, 2019 at 9:53 AM James Coleman  wrote:
> Given the patch contents I don't see any obviously reason why either
> of those possibilities (calling comparetup_heap less often, or it's
> cheaper) are likely. Is that something we should investigate further?
> Or is it just a nice happy accident that we should ignore since it's
> dataset specific?

Have you actually confirmed that comparetup_heap() is called less
often, by instrumenting the number of individual calls specifically?
If there has been a reduction in calls to comparetup_heap(), then
that's weird, and needs to be explained. If it turns out that it isn't
actually called less often, then I would suggest that the speedup
might be related to memory fragmentation, which often matters a lot
within tuplesort.c. (This is why external sort merging now uses big
buffers, and double buffering.)

-- 
Peter Geoghegan




Re: [PATCH] Incremental sort (was: PoC: Partial sort)

2019-06-25 Thread Peter Geoghegan
On Tue, Jun 25, 2019 at 11:03 AM James Coleman  wrote:
> No, I haven't confirmed that it's called less frequently, and I'd be
> extremely surprised if it were given the diff doesn't suggest any
> changes to that at all.

I must have misunderstood, then. I thought that you were suggesting
that that might have happened.

> If you think it's important enough to do so, I can instrument it to
> confirm, but I was mostly wanting to know if there were any other
> plausible explanations, and I think you've provided one: there *are*
> changes in the patch to memory contexts in tuplesort.c, so if memory
> fragmentation is a real concern this patch could definitely notice
> changes in that regard.

Sounds like it's probably fragmentation. That's generally hard to measure.

-- 
Peter Geoghegan




Re: REINDEX locking

2019-06-13 Thread Peter Geoghegan
On Thu, Jun 13, 2019 at 1:04 PM Robert Haas  wrote:
> Typing "COMMIT;" or "ROLLBACK;" in S1 unblocks the reindex and it
> succeeds, but otherwise it doesn't, contrary to the claim that a
> regular REINDEX does not block reads.  The reason for this seems to be
> that the REINDEX acquires AccessExclusiveLock on all of the indexes of
> the table, and a SELECT acquires AccessShareLock on all indexes of the
> table (even if the particular plan at issue does not use them); e.g.
> in this case the plan is a Seq Scan.  REINDEX acquires only ShareLock
> on the table itself, but this apparently does nobody wanting to run a
> query any good.
>
> Is it supposed to work this way?  Am I confused?

I've always thought that this framing was very user-hostile.
Theoretically, REINDEX doesn't have to block reads (e.g. it won't with
prepared statements when various conditions are met), but in practice
the behavior isn't meaningfully different from blocking reads.

-- 
Peter Geoghegan




Re: PG 12 draft release notes

2019-06-12 Thread Peter Geoghegan
On Wed, Jun 12, 2019 at 1:16 PM Bruce Momjian  wrote:
> First, my apologies in getting to this so late.  Peter Geoghegan
> supplied me with slides and a video to study, and I now understand how
> complex the btree improvements are.  Here is a video of Peter's
> presentation at PGCon:
>
> https://www.youtube.com/watch?v=p5RaATILoiE

Thank you for going to the trouble of researching the B-Tree stuff in
detail! I wouldn't ask that of anybody in the position of writing
release notes, so it's appreciated. It is awkward to take the work
that I've done and make it into multiple bullet points; I have a hard
time with that myself.

> The over-arching improvement to btree in PG 12 is the ordering of index
> entries by tid so all entries are unique.

Right. Everything good happens as a direct or indirect result of the
TID-as-column thing. That is really the kernel of the whole thing,
because it means that the implementation now *fully* follows the
Lehman and Yao design.

> 1.  Since all index tuples are ordered, you can move from one leaf page
> to the next without keeping a lock on the internal page that references
> it, increasing concurrency.

I'm not sure what you mean here. We've never had to keep a lock on an
internal page while holding a lock on a leaf page -- such "lock
coupling" was necessary in earlier B-Tree designs, but Lehman and
Yao's algorithm avoids that. Of course, that isn't new.

I think that you're talking about the way that we now check the high
key during index scans, and find that we don't have to move to the
next leaf page, per this commit:

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=29b64d1de7c77ffb5cb10696693e6ed8a6fc481c

All of the suffix truncation stuff is good because it makes separator
keys in internal pages smaller, but it's also good because the
separator keys are simultaneously more "discriminating". We tend to
get a nice "clean break" between leaf pages, so checking the high key
before moving right to find additional matches (once we've already
returned some tuples to the executor) is surprisingly effective. It
would have been possible to add this optimization even without the
suffix truncation stuff, but truncation makes it effective.

If you had to cut one thing from this list, then I would suggest that
it be this item. It's nice, but it's also very obvious, which makes it
hard to explain. I mean, why should there be any ambiguity at all?
Unless we have to return *hundreds* of items to the index scan, then a
simple "select * from foo where bar = ?" style query should only have
to visit one leaf page, even when the constant happens to be on the
boundary of a leaf page (maybe a concurrent page split could make this
happen, but that should be rare).

> 2.  When inserting a duplicate value in the index, we used to try a few
> pages to see if there was space, then "get tired" and just split a page
> containing duplicates.  Now that there are no duplicates (because
> duplicate key fields are sorted by tid) the system knows exactly what
> page the index tuple belongs on, and inserts or splits the page as
> necessary.

Right -- inserts will descend straight to the correct leaf page, and
the "getting tired" dance isn't used anymore. This makes insertions
faster, but more importantly is a better high level strategy for
storing lots of duplicates. We'll dirty far fewer pages, because
insertions automatically end up inserting around the same place as we
inserted to a moment ago. Insertions of duplicates behave like
serial/auto-incrementing insertions, which was already
fast/well-optimized in various ways.

It's easy to measure this by looking at index bloat when inserting
lots of duplicates -- I came up with the 16% figure in the talk based
on a simple insert-only test.

> 3.  Pivot tuples are used on internal pages and as min/max indicators on
> leaf pages.  These pivot tuples are now trimmed if their trailing key
> fields are not significant.  For example, if an index is
> field1/field2/field3, and the page contains values for which field1==5
> and none that field1==6, there is no need to include field2 and field3
> in the pivot tuple --- it can just list the pivot as field1==5,
> field2=infinity.  This is called suffix truncation.

Right -- that's exactly how it works. Users may find that indexes with
lots of extra columns at the end won't get so bloated in Postgres 12.
Indexing many columns is typically seen when index-only scans are
important. Of course, you could have made those indexes INCLUDE
indexes on v11, which is actually a closely related idea, but that
makes it impossible to use the trailing/suffix columns in an ORDER BY.
And, you have to know about INCLUDE indexes, and remember to use them.

(This must be why Oracle can get away with not having INCLUDE indexes.)

> Page splits used to just split the page in

Re: New vacuum option to do only freezing

2019-06-19 Thread Peter Geoghegan
On Tue, Jun 18, 2019 at 10:39 PM Michael Paquier  wrote:
> +INSERT INTO no_index_cleanup(i, t) VALUES(1, repeat('1234567890',3));
> Do we really need a string as long as that?

Specifying EXTERNAL storage might make things easier. I have used
PLAIN storage to test the 1/3 of a page restriction within nbtree, and
to test a bug in amcheck that was related to TOAST compression.

> It seems to me that we'd want tests to make sure that indexes are
> actually cleaned up, where pageinspect could prove to be useful.

That definitely seems preferable, but it'll be a bit tricky to do it
in a way that doesn't run into buildfarm issues due to alignment. I
suggest an index on a text column to avoid problems.

-- 
Peter Geoghegan




Re: Index Skip Scan

2019-06-22 Thread Peter Geoghegan
On Sat, Jun 22, 2019 at 3:15 PM Floris Van Nee  wrote:
> Perhaps a question: when stepping through code in GDB, is there an easy way 
> to pretty print for example the contents on an IndexTuple? I saw there's some 
> tools out there that can pretty print plans, but viewing tuples is more 
> complicated I guess.

Try the attached patch -- it has DEBUG1 traces with the contents of
index tuples from key points during index scans, allowing you to see
what's going on both as a B-Tree is descended, and as a range scan is
performed. It also shows details of the insertion scankey that is set
up within _bt_first(). This hasn't been adapted to the patch at all,
so you'll probably need to do that.

The patch should be considered a very rough hack, for now. It leaks
memory like crazy. But I think that you'll find it helpful.

-- 
Peter Geoghegan


0012-Index-scan-positioning-DEBUG1-instrumentation.patch
Description: Binary data


Re: PG 12 draft release notes

2019-06-12 Thread Peter Geoghegan
On Wed, Jun 12, 2019 at 5:22 PM Bruce Momjian  wrote:
> I had become so confused by this item that I needed a few weeks to
> settle on what was actually going on.

I put a lot of time into my pgCon talk, especially on the diagrams.
Seems like that paid off. Even Heikki was confused by my explanations
at one point.

I should go add a similar diagram to our documentation, under "Chapter
63. B-Tree Indexes", because diagrams are the only sensible way to
explain the concepts.

> I was wrong.  I was thinking of this commit:
>
> commit d2086b08b0
> Author: Alexander Korotkov 
> Date:   Sat Jul 28 00:31:40 2018 +0300
>
> Reduce path length for locking leaf B-tree pages during insertion

> > If you had to cut one thing from this list, then I would suggest that
> > it be this item. It's nice, but it's also very obvious, which makes it
> > hard to explain.

> Right.  The commit mentioned a 4.5x speedup in a rare benchmark, so I
> added it lower on the list.

My remark about cutting an item referred to a lesser item that I
worked on (the 'Add nbtree high key "continuescan" optimization'
commit), not Alexander's independent B-Tree work. I think that
Alexander's optimization is also quite effective. Though FWIW the 4.5x
improvement concerned a case involving lots of duplicates... cases with
a lot of duplicates will be far, far better in Postgres 12. (I never
tested my patch without Alexander's commit, since it went in early in
the v12 cycle.)

> Yes, locality.

"Locality" is one of my favorite words.

> Attached is an updated patch.  I might have missed something, but I
> think it might be close.

This looks great. I do have a few things:

* I would put "Improve performance and space utilization of btree
indexes with many duplicates" first (before "Allow multi-column btree
indexes to be smaller"). I think that this is far more common than we
tend to assume, and is also where the biggest benefits are.

* The wording of the "many duplicates" item itself is almost perfect,
though the "...and inefficiency when VACUUM needed to find a row for
removal" part seems a bit off -- this is really about the
effectiveness of VACUUM, not the speed at which the VACUUM completes
(it's a bit faster, but that's not that important). Perhaps that part
should read: "...and often failed to efficiently recycle space made
available by VACUUM". Something like that.

* The "Allow multi-column btree indexes to be smaller" item is about
both suffix truncation, and about the "Split after new tuple"
optimization. I think that that makes it more complicated than it
needs to be. While the improvements that we saw with TPC-C on account
of the "Split after new tuple" optimization were nice, I doubt that
users will be looking out for it. I would be okay if you dropped any
mention of the "Split after new tuple" optimization, in the interest
of making the description more useful to users. We can just lose that.

* Once you simplify the item by making it all about suffix truncation,
it would make sense to change the single line summary to "Reduce the
number of branch blocks needed for multi-column indexes". Then go on
to talk about how we now only store those columns that are necessary
to guide index scans in tuples stored in branch pages (we tend to call
branch pages internal pages, but branch pages seems friendlier to me).
Note that the user docs of other database systems reference these
details, even in their introductory material on how B-Tree indexes
work. The term "suffix truncation" isn't something users have heard
of, and we shouldn't use it here, but the *idea* of suffix truncation
is very well established. As I mentioned, it matters for things like
covering indexes (indexes that are designed to be used by index-only
scans, which are not necessarily INCLUDE indexes).

Thanks!
--
Peter Geoghegan




Re: PG 12 draft release notes

2019-06-12 Thread Peter Geoghegan
On Wed, Jun 12, 2019 at 7:29 PM Bruce Momjian  wrote:
> OK, I mentioned something about increased locality now.  Patch attached.

Looks good -- locality is a good catch-all term.

Thanks!
-- 
Peter Geoghegan




Re: Draft back-branch release notes are up for review

2019-06-15 Thread Peter Geoghegan
On Sat, Jun 15, 2019 at 2:11 PM Peter Geoghegan  wrote:
> On Sat, Jun 15, 2019 at 1:39 PM Noah Misch  wrote:
> > To me, this text implies a cautious DBA should amcheck every index.  Reading
> > the thread[1], I no longer think that.  It's enough to monitor that VACUUM
> > doesn't start failing persistently on any index.  I suggest replacing this
> > release note text with something like the following:

FWIW, amcheck won't help here. It can only access pages through its
breadth-first search, and so will not land on any "leaked" page (i.e.
a page that has no link to the tree). Ideally, amcheck would notice that
it hasn't visited certain blocks, and then inspect the blocks/pages in
a separate pass, but that doesn't happen right now.

As you know, VACUUM can find leaked blocks/pages because nbtree VACUUM
has an optimization that allows it to access them in sequential order.

-- 
Peter Geoghegan




Re: Draft back-branch release notes are up for review

2019-06-15 Thread Peter Geoghegan
On Sat, Jun 15, 2019 at 3:05 PM Tom Lane  wrote:
> Thanks for the input, guys.  What do you think of
>
>  Avoid writing an invalid empty btree index page in the unlikely case
>  that a failure occurs while processing INCLUDEd columns during a page
>      split (Peter Geoghegan)
>
>  The invalid page would not affect normal index operations, but it
>  might cause failures in subsequent VACUUMs. If that has happened to
>  one of your indexes, recover by reindexing the index.

That seems perfect.

Thanks
-- 
Peter Geoghegan




Re: Draft back-branch release notes are up for review

2019-06-15 Thread Peter Geoghegan
On Sat, Jun 15, 2019 at 1:39 PM Noah Misch  wrote:
> To me, this text implies a cautious DBA should amcheck every index.  Reading
> the thread[1], I no longer think that.  It's enough to monitor that VACUUM
> doesn't start failing persistently on any index.  I suggest replacing this
> release note text with something like the following:
>
>   Avoid writing erroneous btree index data that does not change query results
>   but causes VACUUM to abort with "failed to re-find parent key".  Affected
>   indexes are rare; REINDEX fixes them.
>
> (I removed "key truncation during a page split" as being too technical for
> release notes.)

I agree that this isn't terribly significant in general. Your proposed
wording seems better than what we have now, but a reference to INCLUDE
indexes also seems like a good idea. They are the only type of index
that could possibly have the issue with page deletion/VACUUM becoming
confused. Even then, the risk seems minor, because there has to be an
OOM at precisely the wrong point.

If there was any kind of _bt_split() OOM in 11.3 that involved a
non-INCLUDE B-Tree index, then the OOM could only occur when we
allocate a temp page buffer. I verified that this causes no
significant issue for VACUUM. It is best avoided, since we still
"leak" the new page/buffer, although that is almost harmless.

-- 
Peter Geoghegan




Re: _bt_split(), and the risk of OOM before its critical section

2019-05-09 Thread Peter Geoghegan
On Wed, May 8, 2019 at 3:37 PM Peter Geoghegan  wrote:
> It makes perfect sense for _bt_split() to call _bt_findsplitloc()
> directly, since _bt_findsplitloc() is already aware of almost every
> _bt_split() implementation detail, whereas those same details are not
> of interest anywhere else.

I discovered that it even used to work like that until 1997, when
commit 71b3e93c505 added handling of duplicate index tuples. Tom
ripped the duplicate handling stuff out a couple of years later, for
what seemed to me to be very good reasons, but _bt_findsplitloc()
remained outside of _bt_split() until now.

I intend to push ahead with the fix for both v11 and HEAD on Monday.
-- 
Peter Geoghegan




Re: pg12 release notes

2019-05-11 Thread Peter Geoghegan
On Sat, May 11, 2019 at 7:40 AM Bruce Momjian  wrote:
> > > > I think I have understood the nuances, as listed above --- I just don't
> > > > agree with the solution.
> > >
> > > To be clear, I don't expect you to agree with the solution.
> > >
> > > Another thing that you missed from my patch is that bugfix commit
> > > 9b10926263d831fac5758f1493c929a49b55669b shouldn't be listed.
> >
> > Why should it not be listed?
>
> Thinking some more, I try to aggregate all the feature addition commits
> together, but often skip "cleanups" of previous feature additions.  Are
> you saying that 9b10926263d831fac5758f1493c929a49b55669b is a cleanup,
> and not part of the feature addition?  It was not clear to me of
> 9b10926263d831fac5758f1493c929a49b55669b was a further enhancement
> made possible by previous commits, or a "fix" for a previous commit.

Yes. It's a bug fix that went in after feature freeze.

-- 
Peter Geoghegan




Re: pg12 release notes

2019-05-10 Thread Peter Geoghegan
On Fri, May 10, 2019 at 7:11 PM Bruce Momjian  wrote:
> >  Whether or not you include more details is not what I care about here
> > -- I *agree* that this is insignificant.

> Well, we can move the entire item up to the incompatibility section, but
> that seems unbalanced since the incompatibility is so small relative to
> the entire item, and it is rare, as you mentioned.  It also seems odd to
> create a stand-alone incompatibility item that really is part of a later
> item already in the release notes.

That is what we've always done. The list has always been very long,
with individual items that are on average totally insignificant.
Breaking with that pattern in this instance will be confusing to
users.

> I think I have understood the nuances, as listed above --- I just don't
> agree with the solution.

To be clear, I don't expect you to agree with the solution.

Another thing that you missed from my patch is that bugfix commit
9b10926263d831fac5758f1493c929a49b55669b shouldn't be listed.

> > As things stand, I feel like the question of authorship and credit
> > complicates the question of making the release notes useful, which is
> > unfortunate. (Not sure what can be done about that.)
>
> That part I always need big help with, particularly with multiple
> commits being lumped into a single release note entry.  I just can't
> tell which commit is more important when deciding what order to list the
> names.  I have that problem big-time with the partition commits.

I understand that it's a difficult job. It's inevitable that there
will need to be corrections. You don't appear to be particularly
receptive to feedback, which makes the process harder for everyone --
even in instances where you make the right call. I don't think that I
am alone in seeing it this way.

-- 
Peter Geoghegan




Re: pg12 release notes

2019-05-10 Thread Peter Geoghegan
On Fri, May 10, 2019 at 6:02 PM Bruce Momjian  wrote:
> Have new btree indexes sort duplicate index entries in heap-storage
> order (Peter Geoghegan)
>
> This slightly reduces the maximum-allowed length of indexed values.
> ---
> Indexes pg_upgraded from previous releases will not have this 
> ordering.
>
> I don't think more details are really needed.

 Whether or not you include more details is not what I care about here
-- I *agree* that this is insignificant.

I think that the maximum allowed length thing should be listed in the
compatibility section as a formality -- not alongside the point where
the feature is listed. I had better be correct in that general
assessment, because it's not possible to opt out of the restriction
within CREATE INDEX. That was my point here.

> What we have already seems like enough detail:
>
> Improve speed of btree index insertions (Alexander Korotkov,
> Peter Geoghegan)

Why?

I think that it's unfortunate that Heikki wasn't given an authorship
credit here, as proposed in my patch. I think that he deserves it.

> Locking speed seems like an implementation detail.

They're all implementation details, though. If that was the actual
standard, then few or no "indexing" items would be listed.

> You have given me very good detail, so the new text is:
>
> Improve speed of btree index insertions (Alexander Korotkov, Peter
> Geoghegan)
>
> The new code improves the space-efficiency of page splits, reduces
> locking
> overhead, and gives better performance for UPDATEs
> and DELETEs on indexes with many duplicates.

I can live with that.

I'm not trying to be difficult -- reasonable people can disagree on
the level of detail that is appropriate (they can even have radically
different ideas about where to draw the line). And, I would expect it
to be a little arbitrary under the best of circumstances, no matter
who was tasked with writing the release notes. However, I think the
process would be easier and more effective if you took more time to
understand the concerns behind the feedback you get. There are
sometimes important nuances.

As things stand, I feel like the question of authorship and credit
complicates the question of making the release notes useful, which is
unfortunate. (Not sure what can be done about that.)

--
Peter Geoghegan




Re: pg12 release notes

2019-05-11 Thread Peter Geoghegan
On Sat, May 11, 2019 at 11:02 AM Bruce Momjian  wrote:
> OK, commit removed.

You're mistaken -- nothing has been pushed to master in the last 3 hours.

-- 
Peter Geoghegan




Re: What is an item pointer, anyway?

2019-05-13 Thread Peter Geoghegan
On Sun, May 5, 2019 at 1:14 PM Peter Geoghegan  wrote:
> Attached draft patch adjusts code comments and error messages where a
> line pointer is referred to as an item pointer. It turns out that this
> practice isn't all that prevalent. Here are some specific concerns
> that I had to think about when writing the patch, though:

Ping? Any objections to pushing ahead with this?

-- 
Peter Geoghegan




Re: What is an item pointer, anyway?

2019-05-13 Thread Peter Geoghegan
On Mon, May 13, 2019 at 12:38 PM Tom Lane  wrote:
>  /*
> - * Prune specified item pointer or a HOT chain originating at that item.
> + * Prune specified line pointer or a HOT chain originating at that item.
>   *
>   * If the item is an index-referenced tuple (i.e. not a heap-only tuple),
>
> Should "that item" also be re-worded, for consistency?

Yes, it should be -- I'll fix it.

I'm going to backpatch the storage.sgml change on its own, while
pushing everything else in a separate HEAD-only commit.

Thanks
-- 
Peter Geoghegan




Re: VACUUM can finish an interrupted nbtree page split -- is that okay?

2019-05-13 Thread Peter Geoghegan
On Tue, May 7, 2019 at 9:59 AM Peter Geoghegan  wrote:
> On Tue, May 7, 2019 at 12:27 AM Heikki Linnakangas  wrote:
> > I don't understand that reasoning. Yes, _bt_pagedel() will complain if
> > it finds a half-dead internal page. But how does that mean that
> > _bt_lock_branch_parent() can't encounter one?
>
> I suppose that in theory it could, but only if you allow that any
> possible state could be found -- it doesn't seem any more likely than
> any other random illegal state.

To be fair, I suppose that the code made more sense when it first went
in, because at the time there was a chance that there could be
leftover half-dead internal pages. But, that was a long time ago now.

I wonder why the code wasn't complaining about corruption loudly, like
the top level code, instead of treating half-dead pages as a
legitimate reason to not proceed with multi-level page deletion. That
would have been overkill, but it would have made much more sense IMV.

I would like to proceed with pushing this patch to HEAD in the next
few days, since it's clearly removing code that can't be useful. Are
there any objections?

-- 
Peter Geoghegan




Re: VACUUM can finish an interrupted nbtree page split -- is that okay?

2019-05-13 Thread Peter Geoghegan
On Mon, May 13, 2019 at 8:30 PM Tom Lane  wrote:
> Peter Geoghegan  writes:
> > To be fair, I suppose that the code made more sense when it first went
> > in, because at the time there was a chance that there could be
> > leftover half-dead internal pages. But, that was a long time ago now.
>
> Is there a good reason to assume there are none left anywhere?

That is not an assumption that the proposed patch rests upon, though
it is true that there are probably going to be virtually no half-dead
internal pages that make their way onto a Postgres 12 installation.
You'd have to do a CREATE INDEX on Postgres 9.3, and then not VACUUM
or REINDEX the index once it was on a 9.4+ installation. I suppose
that a 9.3 -> 12 upgrade is the most plausible scenario in which you
could actually get a half-dead internal page on a Postgres 12
installation.

Even when that happens, the index is already considered corrupt by
VACUUM, so the same VACUUM process that could in theory be adversely
affected by removing the half-dead internal page check will complain
about the page when it gets to it later on -- the user will be told to
REINDEX. And even then, we will never actually get to apply the check
that I propose to remove, since we're already checking the leaf page
sibling of the leaf-level target -- the leaf-level test that was added
by efada2b8e92 was clearly necessary. But it was also sufficient (no
equivalent internal half-dead right sibling test is needed): a 9.3-era
half-dead internal page cannot have more than one child, which must be
undergoing deletion as well.

If somebody doubted my rationale for why we don't need to do anything
more on internal page levels in installations where the user didn't
pg_upgrade from a version that's < 9.4, then they'd still have to
explain why we haven't heard of any problems in 5 years, and probably
offer some alternative fix that considers "logically half-dead
internal pages" (i.e. pages that are or will be the top parent in a
deletion chain). Because the code that I propose to remove obviously
cannot be doing much of anything for indexes built on 9.4+.

-- 
Peter Geoghegan




Re: itemptr_encode/itemptr_decode

2019-05-18 Thread Peter Geoghegan
On Wed, Apr 17, 2019 at 4:22 PM Tom Lane  wrote:
> As for the general usability argument, I'm not sure --- as we start
> to look at alternate AMs, we might have more use for them.  When I first
> saw the functions, I thought maybe they were part of sort acceleration
> for TIDs; evidently they're not (yet), but that seems like another
> possible use-case.

There is also your join-or-to-union patch, which I thought might make
use of this for its TID sort.

Maybe it would make sense to put this infrastructure in tuplesort.c,
but probably not. TIDs are 6 bytes, which, as you once pointed out, is
not something that we have appropriate infrastructure for (there isn't
a DatumGet*() macro, and so on). The encoding scheme (which you
originally suggested as an alternative to my first idea, sort support
for item pointers) works particularly well as these things go -- it
was about 3x faster when everything fit in memory, and faster still
with external sorts. It allowed us to resolve comparisons at the
SortTuple level within tuplesort.c, but also allowed tuplesort.c to
use the pass-by-value datum qsort specialization. It even allowed
sorted array entries (TIDs/int8s) to be fetched without extra pointer
chasing -- that can be a big bottleneck these days.
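
For anyone else reading this thread, the trick is roughly as follows (a
from-memory sketch of what itemptr_encode() does; the helper name here
is made up):

#include "postgres.h"
#include "storage/itemptr.h"

/*
 * Pack a 6 byte TID into an int64 whose ordinary integer comparisons sort
 * in TID order.  That lets tuplesort use the pass-by-value datum path.
 */
static inline int64
tid_encode_sketch(ItemPointer tid)
{
    BlockNumber  block = ItemPointerGetBlockNumber(tid);
    OffsetNumber offset = ItemPointerGetOffsetNumber(tid);

    /* block number in bits 16..47, offset number in bits 0..15 */
    return ((int64) block << 16) | (uint16) offset;
}

Decoding just reverses the shift and mask, so nothing about the sort
itself has to know that the datums started out as TIDs.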

The encoding scheme is a bit ugly, but I suspect it would be simpler
to stick to the same approach elsewhere than to try and hide all the
details within tuplesort.c, or something like that. Unless we're
willing to treat TIDs as a whole new type of tuple with its own set of
specialized functions in tuplesort.c, which has problems of its own,
then it's kind of awkward to do it some other way.


--
Peter Geoghegan




Re: crash testing suggestions for 12 beta 1

2019-05-23 Thread Peter Geoghegan
On Thu, May 23, 2019 at 8:24 AM Jeff Janes  wrote:
> Now that beta is out, I wanted to do some crash-recovery testing where I 
> inject PANIC-inducing faults and see if it recovers correctly.

Thank you for doing this. It's important work.

> Making the ctid be tie-breakers in btree index is also tested inherently 
> (plus I think Peter tested that pretty thoroughly himself with similar 
> methods).

As you may know, the B-Tree code has a tendency to soldier on when an
index is corrupt. "Moving right" tends to conceal problems beyond
concurrent page splits. I didn't do very much fault injection type
testing with the B-Tree enhancements, but I did lean on amcheck
heavily during development. Note that a new, extremely thorough option
called "rootdescend" verification was added following the v12 work:

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=c1afd175b5b2e5c44f6da34988342e00ecdfb518

It probably wouldn't add noticeable overhead to use this during your
testing, and maybe to combine it with the "heapallindexed" option,
while using the bt_index_parent_check() variant -- that will detect
almost any imaginable index corruption. Admittedly, amcheck didn't
find any bugs in my code after the first couple of versions of the
patch series, so this approach seems unlikely to find any problems
now. Even still, it wouldn't be very difficult to do this extra step.
It seems worthwhile to be thorough here, given that we depend on the
B-Tree code so heavily.

-- 
Peter Geoghegan




Re: Top-N sorts in EXPLAIN, row count estimates, and parallelism

2019-05-23 Thread Peter Geoghegan
On Thu, May 23, 2019 at 3:31 PM Tom Lane  wrote:
> Given the way that's implemented, I doubt that we can report it
> reliably in EXPLAIN.

Does it have to be totally reliable?

cost_sort() costs sorts as top-N heapsorts. While we cannot make an
iron-clad guarantee that it will work out that way from within
tuplesort.c, that doesn't seem like it closes off the possibility of
more informative EXPLAIN output. For example, can't we at least report that
the tuplesort will be "bounded" within EXPLAIN, indicating that we
intend to attempt to sort using a top-N heap sort (i.e. we'll
definitely do it that way if there is sufficient work_mem)?

-- 
Peter Geoghegan




Re: Calling PrepareTempTablespaces in BufFileCreateTemp

2019-05-17 Thread Peter Geoghegan
On Tue, Apr 30, 2019 at 3:22 PM Tom Lane  wrote:
> So now I'm feeling more favorable about the idea of adding a
> PrepareTempTablespaces call to BufFileCreateTemp, and just stopping
> with that.  If we want to do more, I feel like it requires a
> significant amount of rethinking about what the expectations are for
> fd.c, and some rejiggering of PrepareTempTablespaces's API too.
> I'm not sufficiently excited about this issue to do that.

+1. Let's close this one out.

--
Peter Geoghegan




Re: Calling PrepareTempTablespaces in BufFileCreateTemp

2019-05-17 Thread Peter Geoghegan
On Fri, May 17, 2019 at 6:36 PM Tom Lane  wrote:
> Will do so tomorrow.  Should we back-patch this?

I wouldn't, because I see no reason to. Somebody else might.

-- 
Peter Geoghegan




Re: pg12 release notes

2019-05-09 Thread Peter Geoghegan
On Thu, May 9, 2019 at 6:03 PM Bruce Momjian  wrote:
> These were all very helpful.  I adjusted your changes to create the
> attached applied patch.  URL updated:

I noticed that the compatibility note for Andrew Gierth's RYU floating
point patch seems to simply say why the feature is useful. Shouldn't
it be listed separately, and its impact on users upgrading be listed
here instead?

Separately, please find attached suggested changes for items I was
involved in. I have attempted to explain them in a way that makes
their relevance to users clearer. Even if you don't end up using my
wording, you should still change the attribution along the same lines
as the patch.

Also, I added a compatibility note for the new version of nbtree,
which revises the "1/3 of a page" restriction downwards very slightly
(by 8 bytes). FWIW, I think it's very unlikely that anyone will be
affected, because tuples that are that wide are already compressed in
almost all cases -- it seems like it would be hard to be just at the
edge of the limit already.

Thanks
-- 
Peter Geoghegan


0001-Suggested-changes-to-v12-release-notes.patch
Description: Binary data


Re: _bt_split(), and the risk of OOM before its critical section

2019-05-08 Thread Peter Geoghegan
On Tue, May 7, 2019 at 6:15 PM Peter Geoghegan  wrote:
> I suppose I'm biased, but I prefer the new approach anyway. Adding the
> left high key first, and then the right high key seems simpler and
> more logical. It emphasizes the similarities and differences between
> leftpage and rightpage.

I came up with a better way of doing it in the attached revision. Now,
_bt_split() calls _bt_findsplitloc() directly. This makes it possible
to significantly simplify the signature of _bt_split().

It makes perfect sense for _bt_split() to call _bt_findsplitloc()
directly, since _bt_findsplitloc() is already aware of almost every
_bt_split() implementation detail, whereas those same details are not
of interest anywhere else. _bt_findsplitloc() also knows all about
suffix truncation. It's also nice that the actual _bt_truncate() call
is closely tied to the _bt_findsplitloc() call.

-- 
Peter Geoghegan


v2-0001-Don-t-leave-behind-junk-nbtree-pages-during-split.patch
Description: Binary data


Re: PG 12 draft release notes

2019-05-21 Thread Peter Geoghegan
On Tue, May 21, 2019 at 1:51 PM Bruce Momjian  wrote:
> > My concern here (which I believe Alexander shares) is that it doesn't
> > make sense to group these two items together. They're two totally
> > unrelated pieces of work. Alexander's work does more or less help with
> > lock contention with writes, whereas the feature that that was merged
> > with is about preventing index bloat, which is mostly helpful for
> > reads (it helps writes to the extent that writes are also reads).
> >
> > The release notes go on to say that this item "gives better
> > performance for UPDATEs and DELETEs on indexes with many duplicates",
> > which is wrong. That is something that should have been listed below,
> > under the "duplicate index entries in heap-storage order" item.
>
> OK, I understand how the lock stuff improves things, but I have
> forgotten how indexes are made smaller.  Is it because of better page
> split logic?

That is clearly the main reason, though suffix truncation (which
removes trailing/suffix columns from index tuples in branch pages,
treating them as "negative infinity" sentinel values) also contributes
to making indexes smaller.

The page split stuff was mostly added by commit fab250243 ("Consider
secondary factors during nbtree splits"), but commit f21668f32 ("Add
"split after new tuple" nbtree optimization") added to that in a way
that really helped the TPC-C indexes. The TPC-C indexes are about 40%
smaller now.

> > > Author: Peter Geoghegan 
> > > 2019-03-20 [dd299df81] Make heap TID a tiebreaker nbtree index column.

> As I remember the benefit currently is that you can find update and
> deleted rows faster, right?

Yes, that's true when writing to the index. But more importantly, it
really helps VACUUM when there are lots of duplicates, which is fairly
common in the real world (imagine an index where 20% of the rows are
NULL, for example). In effect, there are no duplicates anymore,
because all index tuples are unique internally.

Indexes with lots of duplicates group older rows together, and new
rows together, because treating heap TID as a tiebreaker naturally has
that effect. VACUUM will generally dirty far fewer pages, because bulk
deletions tend to be correlated with heap TID. And, VACUUM has a much
better chance of deleting entire leaf pages, because dead tuples end
up getting grouped together.

-- 
Peter Geoghegan




Re: pg_upgrade can result in early wraparound on databases with high transaction load

2019-05-21 Thread Peter Geoghegan
On Mon, May 20, 2019 at 3:10 AM Jason Harvey  wrote:
> This week I upgraded one of my large(2.8TB), high-volume databases from 9 to 
> 11. The upgrade itself went fine. About two days later, we unexpectedly hit 
> transaction ID wraparound. What was perplexing about this was that the age of 
> our oldest `datfrozenxid` was only 1.2 billion - far away from where I'd 
> expect a wraparound. Curiously, the wraparound error referred to a mysterious 
> database of `OID 0`:
>
> UPDATE ERROR:  database is not accepting commands to avoid wraparound data 
> loss in database with OID 0
>
> We were able to recover after a few hours by greatly speeding up our vacuum 
> on our largest table.
>
> In a followup investigation I uncovered the reason we hit the wraparound so 
> early, and also the cause of the mysterious OID 0 message. When pg_upgrade 
> executes, it calls pg_resetwal to set the next transaction ID. Within 
> pg_resetwal is the following code: 
> https://github.com/postgres/postgres/blob/6cd404b344f7e27f4d64555bb133f18a758fe851/src/bin/pg_resetwal/pg_resetwal.c#L440-L450
>
> This sets the controldata to have a fake database (OID 0) on the brink of 
> transaction wraparound. Specifically, after pg_upgrade is ran, wraparound 
> will occur within around 140 million transactions (provided the autovacuum 
> doesn't finish first). I confirmed by analyzing our controldata before and 
> after the upgrade that this was the cause of our early wraparound.
>
> Given the size and heavy volume of our database, we tend to complete a vacuum 
> in the time it takes around 250 million transactions to execute. With our 
> tunings this tends to be rather safe and we stay well away from the 
> wraparound point under normal circumstances.

This does seem like an unfriendly behavior. Moving the thread over to
the -hackers list for further discussion...

-- 
Peter Geoghegan




Re: PG 12 draft release notes

2019-05-20 Thread Peter Geoghegan
On Mon, May 20, 2019 at 3:17 PM Andres Freund  wrote:
> 
>
>
> Improve speed of btree index insertions (Peter Geoghegan,
> Alexander Korotkov)
>

My concern here (which I believe Alexander shares) is that it doesn't
make sense to group these two items together. They're two totally
unrelated pieces of work. Alexander's work does more or less help with
lock contention with writes, whereas the feature that that was merged
with is about preventing index bloat, which is mostly helpful for
reads (it helps writes to the extent that writes are also reads).

The release notes go on to say that this item "gives better
performance for UPDATEs and DELETEs on indexes with many duplicates",
which is wrong. That is something that should have been listed below,
under the "duplicate index entries in heap-storage order" item.

> Author: Peter Geoghegan 
> 2019-03-20 [dd299df81] Make heap TID a tiebreaker nbtree index column.
> Author: Peter Geoghegan 
> 2019-03-20 [fab250243] Consider secondary factors during nbtree splits.
> -->
>
>
> Have new btree indexes sort duplicate index entries in heap-storage
> order (Peter Geoghegan, Heikki Linnakangas)
>

> I'm not sure that the grouping here is quite right. And the second entry
> probably should have some explanation about the benefits?

It could stand to say something about the benefits. As I said, there
is already a little bit about the benefits, but that ended up being
tied to the "Improve speed of btree index insertions" item. Moving
that snippet to the correct item would be a good start.

-- 
Peter Geoghegan




Re: ERROR: failed to add item to the index page

2019-04-30 Thread Peter Geoghegan
On Tue, Apr 30, 2019 at 9:44 AM Andreas Joseph Krogh
 wrote:
> I have a 1.4GB dump (only one table) which reliably reproduces this error.
> Shall I share it off-list?

I would be quite interested in this, too, since there is a chance that
it's my bug.

-- 
Peter Geoghegan




Re: Re: ERROR: failed to add item to the index page

2019-04-30 Thread Peter Geoghegan
On Tue, Apr 30, 2019 at 11:54 AM Andreas Joseph Krogh
 wrote:
> Nice, thanks!

Thanks for the report!

-- 
Peter Geoghegan




Re: ERROR: failed to add item to the index page

2019-04-30 Thread Peter Geoghegan
On Tue, Apr 30, 2019 at 9:47 AM Tom Lane  wrote:
> Andreas Joseph Krogh  writes:
> > I have a 1.4GB dump (only one table) which reliably reproduces this error.
> > Shall I share it off-list? --
>
> That's awfully large :-(.  How do you have in mind to transmit it?

I've sent dumps that were larger than that by providing a Google Drive
link. Something like that should work reasonably well.

-- 
Peter Geoghegan




Re: ERROR: failed to add item to the index page

2019-04-30 Thread Peter Geoghegan
On Tue, Apr 30, 2019 at 9:56 AM Andreas Joseph Krogh  wrote:
> I've sent you guys a link (Google Drive) off-list.

I'll start investigating the problem right away.

Thanks
-- 
Peter Geoghegan




Re: ERROR: failed to add item to the index page

2019-04-30 Thread Peter Geoghegan
On Tue, Apr 30, 2019 at 9:59 AM Peter Geoghegan  wrote:
> I'll start investigating the problem right away.

I have found what the problem is. I simply neglected to make a
conservative assumption about suffix truncation needing to add a heap
TID to a leaf page's new high key in nbtsort.c (following commit
dd299df8189), even though I didn't make the same mistake in
nbtsplitloc.c. Not sure how I managed to make such a basic error.

Andreas' test case works fine with the attached patch. I won't push a
fix for this today.

-- 
Peter Geoghegan


0001-Tentative-fix-for-nbtsort.c-space-bug.patch
Description: Binary data


Re: Adding a test for speculative insert abort case

2019-04-30 Thread Peter Geoghegan
On Tue, Apr 30, 2019 at 5:16 PM Melanie Plageman
 wrote:
> Can anyone think of a good way to put this codepath under test?

During the initial development of ON CONFLICT, speculative insertion
itself was tested using custom stress testing that you can still get
here:

https://github.com/petergeoghegan/jjanes_upsert

I'm not sure that this is something that you can adopt, but I
certainly found it very useful at the time. It tests whether or not
there is agreement among concurrent speculative inserters, and whether
or not there are "unprincipled deadlocks" (user hostile deadlocks that
cannot be fixed by reordering something in application code).

-- 
Peter Geoghegan




Re: Improve search for missing parent downlinks in amcheck

2019-04-30 Thread Peter Geoghegan
On Sun, Apr 28, 2019 at 10:15 AM Alexander Korotkov
 wrote:
> I think this definitely not bug fix.  Bloom filter was designed to be
> lossy, no way blaming it for that :)

I will think about a simple fix, but after the upcoming point release.
There is no hurry.


-- 
Peter Geoghegan




Re: ERROR: failed to add item to the index page

2019-04-30 Thread Peter Geoghegan
On Tue, Apr 30, 2019 at 10:58 AM Peter Geoghegan  wrote:
> I have found what the problem is. I simply neglected to make a
> conservative assumption about suffix truncation needing to add a heap
> TID to a leaf page's new high key in nbtsort.c (following commit
> dd299df8189), even though I didn't make the same mistake in
> nbtsplitloc.c. Not sure how I managed to make such a basic error.

Attached is a much more polished version of the same patch. I tried to
make clear how the "page full" test (the test that has been fixed to
take heap TID space for high key into account) is related to other
close-by code, such as the tuple space limit budget within
_bt_check_third_page(), and the code that sets up an actual call to
_bt_truncate().

I'll wait a few days before pushing this. This version doesn't feel
too far off being committable. I tested it with some of the CREATE
INDEX tests that I developed during development of the nbtree unique
keys project, including a test with tuples that are precisely at the
1/3 of a page threshold. The new definition of 1/3 of a page takes
high key heap TID overhead into account -- see _bt_check_third_page().

-- 
Peter Geoghegan


v2-0001-Fix-nbtsort.c-s-page-space-accounting.patch
Description: Binary data


Re: What is an item pointer, anyway?

2019-04-26 Thread Peter Geoghegan
On Fri, Apr 26, 2019 at 4:23 PM Ashwin Agrawal  wrote:
> How about we rename ItemPointerData to TupleIdentifier or ItemIdentifier 
> instead and leave ItemPointer or Item confined to AM term, where item can be 
> tuple, datum or anything else ?

I'm not a fan of that idea, because the reality is that an
ItemPointerData is quite explicitly supposed to be a physiological
identifier (TID) used by heapam, or a similar heap-like access method
such as zheap. This is baked into a number of things.
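
(For anyone following along: the struct itself is just a physical block
number plus an offset into a page's line pointer array. Roughly, from
itemptr.h, quoting from memory:)

typedef struct ItemPointerData
{
    BlockIdData  ip_blkid;      /* physical block number */
    OffsetNumber ip_posid;      /* line pointer ("item") offset in block */
} ItemPointerData;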

The limitation that pluggable storage engines have to work in terms of
item pointers is certainly a problem, especially for things like the
Zedstore column store project you're working on. However, I suspect
that that problem is best solved by accommodating other types of
identifiers that don't work like TIDs.

I understand why you've adopted ItemPointerData as a fully-logical
identifier in your prototype, but it's not a great long term solution.

-- 
Peter Geoghegan




Re: What is an item pointer, anyway?

2019-04-26 Thread Peter Geoghegan
On Fri, Apr 26, 2019 at 4:57 PM Tom Lane  wrote:
> ItemId[Data] is somewhat less widely referenced, but I'm still not
> much in favor of renaming that type.  I think fixing comments to
> uniformly call it an item ID would be more reasonable.  (We should
> leave the "line pointer" terminology in place, too; if memory serves,
> an awful lot of variables of the type are named "lp" or variants.
> Renaming all of those is to nobody's benefit.)

I was proposing that we not rename any struct at all, and continue to
call ItemId[Data]s "line pointers" only.  This would involve removing
the comment in itemid.h that confusingly refers to line pointers as
"item pointers" (plus any other comments that fail to make a clear
distinction).

I think that the total number of comments that would be affected by
this approach is quite low.

-- 
Peter Geoghegan




Re: What is an item pointer, anyway?

2019-04-26 Thread Peter Geoghegan
On Fri, Apr 26, 2019 at 5:05 PM Tom Lane  wrote:
> Yeah, I'd be fine with that, although the disconnect between the type
> name and the comment terminology might confuse some people.

Maybe, but the fact that the ItemIdData struct consists of bit fields
that are all named "lp_*" offers a hint. Plus you have the LP_*
constants that get stored in ItemIdData.lp_flags.
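
For reference, the struct looks roughly like this (quoting itemid.h from
memory):

typedef struct ItemIdData
{
    unsigned    lp_off:15,      /* offset to tuple (from start of page) */
                lp_flags:2,     /* state of line pointer */
                lp_len:15;      /* byte length of tuple */
} ItemIdData;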

I wouldn't call the struct ItemIdData if I was in a green field
situation, but it doesn't seem too bad under the present
circumstances. I'd rather not change the struct's name, because that
would probably cause problems without any real benefit. OTOH, calling
two closely related but distinct things by the same name is atrocious.

-- 
Peter Geoghegan




Re: "long" type is not appropriate for counting tuples

2019-04-29 Thread Peter Geoghegan
On Mon, Apr 29, 2019 at 8:11 AM Andres Freund  wrote:
> I think we should start by just removing all uses of long. There's
> really no excuse for them today, and a lot of them are bugs waiting to
> happen.

I like the idea of banning "long" altogether. It will probably be hard
to keep it out of third party code that we vendor-in, or even code
that interfaces with libraries in some way, but it should be removed
from everything else. It actually doesn't seem particularly hard to do
so, based on a quick grep of src/backend/. Most uses of "long" are in
code that sizes something in local memory, where "long" works for the
same reason it works when calculating the size of a work_mem
allocation -- ugly, but correct. A few uses of "long" seem to be real
live bugs, albeit bugs that are very unlikely to ever be hit.

_h_indexbuild() has the same bug as _bt_load(), also due to commit
ab0dfc961b6 -- I spotted that in passing when I used grep.

> We read from larger files in a few places though. E.g. pg_dump. Most
> places really just should use pgoff_t...

I wasn't even aware of pgoff_t. It is only used in frontend utilities
that I don't know that much about, whereas off_t is used all over the
backend code.

-- 
Peter Geoghegan




Re: "long" type is not appropriate for counting tuples

2019-04-29 Thread Peter Geoghegan
On Mon, Apr 29, 2019 at 11:24 AM Andres Freund  wrote:
> > I don't think that anybody cares about Win64 very much.
>
> I seriously doubt this assertion. Note that the postgres packages on
> https://www.postgresql.org/download/windows/ do not support 32bit
> windows anymore (edb from 11 onwards, bigsql apparently always). And I
> think there's a pretty substantial number of windows users out there.

I was talking about the motivation behind this thread, and I suppose
that I included you in that based on things you've said about Windows
in the past (apparently I shouldn't have done so).

I am interested in making the code less complicated. If we can remove
the work_mem kludge for Windows as a consequence of that, then so much
the better.

-- 
Peter Geoghegan




Re: "long" type is not appropriate for counting tuples

2019-04-29 Thread Peter Geoghegan
On Mon, Apr 29, 2019 at 11:10 AM Tom Lane  wrote:
> If we don't want to rely on "L" constants then we'll have to write these
> cases like "work_mem * (size_t) 1024" which is ugly, lots more keystrokes,
> and prone to weird precedence problems unless you throw even more
> keystrokes (parentheses) at it.  I'm not excited about doing that just
> to allow larger work_mem settings on Win64.

I don't think that anybody cares about Win64 very much. Simplifying
the code might lead to larger work_mem settings on that platform, but
that's not the end goal I have in mind. For me, the end goal is
simpler code.

> I'm not suggesting that we don't need to fix uses of "long" for tuple
> counts, and perhaps other things.  But I think getting rid of it in memory
> size calculations might be a lot of work for not a lot of reward.

Whether or not *fully* banning the use of "long" is something that
will simplify the code is debatable. However, we could substantially
reduce the use of "long" across the backend without any real downside.
The work_mem question can be considered later. Does that seem
reasonable?

-- 
Peter Geoghegan




Re: "long" type is not appropriate for counting tuples

2019-04-29 Thread Peter Geoghegan
On Mon, Apr 29, 2019 at 11:20 AM Alvaro Herrera
 wrote:
> Agreed.  Here's a patch.  I see downthread that you also discovered the
> same mistake in _h_indexbuild by grepping for "long"; I got to it by
> examining callers of pgstat_progress_update_param and
> pgstat_progress_update_multi_param.  I didn't find any other mistakes of
> the same ilk.  Some codesites use "double" instead of "int64", but those
> are not broken.

This seems fine, though FWIW I probably would have gone with int64
instead of uint64. There is generally no downside to using int64, and
being able to support negative integers can be useful in some contexts
(though not in this context).

-- 
Peter Geoghegan




Re: "long" type is not appropriate for counting tuples

2019-04-29 Thread Peter Geoghegan
On Mon, Apr 29, 2019 at 10:32 AM Tom Lane  wrote:
> There's more to that than you might realize.  For example, guc.c
> enforces a limit on work_mem that's designed to ensure that
> expressions like "work_mem * 1024L" won't overflow, and there are
> similar choices elsewhere.

I was aware of that, but I wasn't aware of how many places that bleeds
into until I checked just now.

It would be nice if we could figure out how to make it obvious that
the idioms around using "long" for work_mem calculations exist for a
specific reason. It's pretty confusing as things stand.
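
To make the hazard concrete, here is a minimal sketch (illustrative
only, not a proposal):

#include "postgres.h"
#include "miscadmin.h"          /* declares work_mem (in kilobytes) */

static Size
work_mem_bytes_sketch(void)
{
    /*
     * The traditional idiom is "work_mem * 1024L", which is only safe
     * because guc.c caps work_mem so that the product cannot overflow a
     * 32-bit "long" (e.g. on Win64).  A width-independent spelling:
     */
    return (Size) work_mem * 1024;
}

Without a comment at the site of each "* 1024L", nothing tells the
reader that the safety comes from the guc.c limit.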

-- 
Peter Geoghegan




Re: Calling PrepareTempTablespaces in BufFileCreateTemp

2019-04-29 Thread Peter Geoghegan
On Mon, Apr 29, 2019 at 12:31 PM Ashwin Agrawal  wrote:
> Well, the one thing I wish to point out explicitly is that just taking the fd.c
> changes from [1] and running make check hits no assertions, and
> doesn't flag that the issue exists for gistbuildbuffers.c. That means it's
> missing coverage, and the same thing can happen again in the future.

I believe that the test coverage of GiST index builds is something
that is being actively worked on right now. It's a recognized problem
[1].

[1] https://postgr.es/m/24954.1556130...@sss.pgh.pa.us
-- 
Peter Geoghegan




Re: _bt_split(), and the risk of OOM before its critical section

2019-05-06 Thread Peter Geoghegan
On Mon, May 6, 2019 at 5:15 PM Peter Geoghegan  wrote:
> VACUUM asserts P_FIRSTDATAKEY(opaque) > PageGetMaxOffsetNumber(page)
> within _bt_mark_page_halfdead(), but doesn't test that condition in
> release builds. This means that the earliest modifications of the
> right page, before the high key PageAddItem(), are enough to cause a
> subsequent "failed to re-find parent key" failure in VACUUM. Merely
> setting the sibling blocks in the right page special area is enough to
> cause VACUUM to refuse to run.

To be clear, my point here was that this confirms what you said about
PageGetTempPage() failing after _bt_getbuf() has initialized the
buffer for the new right page -- that is not in itself a problem.
However, practically any other change to the right page that might
occur before an error is raised within _bt_split() is a problem -- not
just adding a new item. (You were right about that, too.)

-- 
Peter Geoghegan




Re: _bt_split(), and the risk of OOM before its critical section

2019-05-06 Thread Peter Geoghegan
On Mon, May 6, 2019 at 4:11 PM Peter Geoghegan  wrote:
> The important question is how VACUUM will recognize it. It's clearly
> not as bad as something that causes "failed to re-find parent key"
> errors, but I think that VACUUM might not be reclaiming it for the FSM
> (haven't checked). Note that _bt_unlink_halfdead_page() is perfectly
> happy to ignore the fact that the left sibling of a half-dead page has
> a rightlink that doesn't point back to the target. Because, uh, there
> might have been a concurrent page deletion, somehow.

VACUUM asserts P_FIRSTDATAKEY(opaque) > PageGetMaxOffsetNumber(page)
within _bt_mark_page_halfdead(), but doesn't test that condition in
release builds. This means that the earliest modifications of the
right page, before the high key PageAddItem(), are enough to cause a
subsequent "failed to re-find parent key" failure in VACUUM. Merely
setting the sibling blocks in the right page special area is enough to
cause VACUUM to refuse to run.

Of course, the problem goes away if you restart the database, because
the right page buffer is never marked dirty, and never can be. That
factor would probably make the problem appear to be an intermittent
issue in the kinds of environments where it is most likely to be seen.

-- 
Peter Geoghegan




Re: _bt_split(), and the risk of OOM before its critical section

2019-05-06 Thread Peter Geoghegan
On Mon, May 6, 2019 at 3:29 PM Tom Lane  wrote:
> Yeah, as _bt_split is currently coded, _bt_truncate has to be a "no
> errors" function, which it isn't.  The pfree for its result is being
> done in an ill-chosen place, too.

I am tempted to move the call to _bt_truncate() out of _bt_split()
entirely on HEAD, possibly relocating it to
nbtsplitloc.c/_bt_findsplitloc(). That way, there is a clearer
separation between how split points are chosen, suffix truncation, and
the mechanical process of executing a legal page split.

> Another problem now that I look at it is that the _bt_getbuf for the right
> sibling is probably not too safe.  And the _bt_vacuum_cycleid() call seems
> a bit dangerous from this standpoint as well.

Yeah, we can tighten those up without much difficulty.

> I'm not really concerned about that one because at that point the
> right page is still in a freshly-pageinit'd state.  It's perhaps
> not quite as nice as having it be zeroes, but it won't look like
> it has any interesting data.

The important question is how VACUUM will recognize it. It's clearly
not as bad as something that causes "failed to re-find parent key"
errors, but I think that VACUUM might not be reclaiming it for the FSM
(haven't checked). Note that _bt_unlink_halfdead_page() is perfectly
happy to ignore the fact that the left sibling of a half-dead page has
a rightlink that doesn't point back to the target. Because, uh, there
might have been a concurrent page deletion, somehow.

We have heard a lot about "failed to re-find parent key" errors from
VACUUM before now because that is about the only strong cross-check
that it does. (Not that I'm arguing that we need more of that.)

> In any case, once we've started to fill the ropaque area, it would really
> be better if we don't call anything that could throw errors.
>
> Maybe we should bite the bullet and use two temp pages, so that none
> of the data ends up in the shared buffer arena until we reach the
> critical section?  The extra copying is slightly annoying, but
> it certainly seems like enforcing this invariant over such a
> long stretch of code is not very maintainable.

While I think that the smarts we have around deciding a split point
will probably improve in future releases, and that we'll probably make
_bt_truncate() itself do more, the actual business of performing a
split has no reason to change that I can think of. I would like to
keep _bt_split() as simple as possible anyway -- it should only be
copying tuples using simple primitives like the bufpage.c routines.
Living with what we have now (not using a temp buffer for the right
page) seems fine.

-- 
Peter Geoghegan




_bt_split(), and the risk of OOM before its critical section

2019-05-06 Thread Peter Geoghegan
Commit 8fa30f906be reduced the elevel of a number of "can't happen"
errors from PANIC to ERROR. These were all critical-section-adjacent
errors involved in nbtree page splits, and nbtree page deletion. It
also established the following convention within _bt_split(), which
allowed Tom to keep the length of the critical section just as short
as it had always been:

/*
 * origpage is the original page to be split.  leftpage is a temporary
 * buffer that receives the left-sibling data, which will be copied back
 * into origpage on success.  rightpage is the new page that receives the
 * right-sibling data.  If we fail before reaching the critical section,
 * origpage hasn't been modified and leftpage is only workspace. In
 * principle we shouldn't need to worry about rightpage either, because it
 * hasn't been linked into the btree page structure; but to avoid leaving
 * possibly-confusing junk behind, we are careful to rewrite rightpage as
 * zeroes before throwing any error.
 */

The INCLUDE indexes work looks like it subtly broke this, because it
allocated memory after the initialization of the right page --
allocating memory can always fail. On the other hand, even when
8fa30f906be went in back in 2010 this "rule" was arguably broken,
because we were already calling PageGetTempPage() after the right
sibling page is initialized, which palloc()s a full BLCKSZ, which is
far more than truncation is ever likely to allocate.

On the other other hand, it seems to me that the PageGetTempPage()
thing might have been okay, because it happens before the high key is
inserted on the new right buffer page. The same cannot be said for the
way we generate a new high key for the left/old page via suffix
truncation, which happens to occur after the right buffer page is
first modified by inserting its high key (the original/left page's
original high key). I think that there may be a risk that VACUUM's
page deletion code will get confused by finding an errant right
sibling page from a failed page split when there is a high key. If so,
that would be a risk that was introduced in Postgres 11, and made much
more likely in practice in Postgres 12. (I haven't got as far as doing
an analysis of the risks to page deletion, though. The "fastpath"
rightmost page insertion optimization that was also added to Postgres
11 seems like it also might need to be considered here.)

-- 
Peter Geoghegan




Re: _bt_split(), and the risk of OOM before its critical section

2019-05-06 Thread Peter Geoghegan
On Mon, May 6, 2019 at 12:48 PM Peter Geoghegan  wrote:
> On the other other hand, it seems to me that the PageGetTempPage()
> thing might have been okay, because it happens before the high key is
> inserted on the new right buffer page. The same cannot be said for the
> way we generate a new high key for the left/old page via suffix
> truncation, which happens to occur after the right buffer page is
> first modified by inserted its high key (the original/left page's
> original high key). I think that there may be a risk that VACUUM's
> page deletion code will get confused by finding an errant right
> sibling page from a failed page split when there is a high key. If so,
> that would be a risk that was introduced in Postgres 11, and made much
> more likely in practice in Postgres 12. (I haven't got as far as doing
> an analysis of the risks to page deletion, though. The "fastpath"
> rightmost page insertion optimization that was also added to Postgres
> 11 seems like it also might need to be considered here.)

It seems like my fears about page deletion were well-founded, at least
if you assume that the risk of an OOM at the wrong time is greater
than negligible.

If I simulate an OOM error during suffix truncation, then
non-rightmost page splits leave the tree in a state that confuses
VACUUM/page deletion. When I simulate an OOM on page 42, we will later
see the dreaded "failed to re-find parent key in index "foo" for
deletion target page 42" error message from a VACUUM. That's not good.

It doesn't matter if the same thing happens when splitting a
rightmost page, which naturally doesn't insert a new high key on the
new right half. This confirms my theory that the PageGetTempPage()
memory allocation can fail without confusing VACUUM, since that
allocation occurs before the critical-but-not-critical point (the
point that we really start to modify the new right half of the split).

Fortunately, this bug seems easy enough to fix: we can simply move the
"insert new high key on right page" code so that it comes after suffix
truncation. This makes it safe for suffix truncation to have an OOM,
or at least as safe as the PageGetTempPage() allocation that seems
safe to me.

-- 
Peter Geoghegan




Re: VACUUM can finish an interrupted nbtree page split -- is that okay?

2019-05-07 Thread Peter Geoghegan
On Tue, May 7, 2019 at 12:27 AM Heikki Linnakangas  wrote:
> I don't understand that reasoning. Yes, _bt_pagedel() will complain if
> it finds a half-dead internal page. But how does that mean that
> _bt_lock_branch_parent() can't encounter one?

I suppose that in theory it could, but only if you allow that any
possible state could be found -- it doesn't seem any more likely than
any other random illegal state.

Even when it happens, we'll get a "failed to re-find parent key" error
message when we go a bit further. Isn't that a logical outcome?

Actually, maybe we won't get that error, because we're talking about a
corrupt index, and all bets are off -- no reason to think that the
half-dead internal page would be consistent with other pages in any
way. But even then, you'll go on to report it in the usual way, since
VACUUM scans nbtree indexes in physical order.
-- 
Peter Geoghegan




Re: New EXPLAIN option: ALL

2019-05-07 Thread Peter Geoghegan
On Tue, May 7, 2019 at 9:31 AM David Fetter  wrote:
> If you're tuning a query interactively, it's a lot simpler to prepend,
> for example,
>
> EXPLAIN (ALL, FORMAT JSON)
>
> to it than to prepend something along the lines of
>
> EXPLAIN(ANALYZE, VERBOSE, COSTS, BUFFERS, SETTINGS, TIMING, SUMMARY, 
> PARTRIDGE_IN_A_PEAR_TREE, FORMAT JSON)
>
> to it.

FWIW, I have the following in my psqlrc:

\set ea 'EXPLAIN (ANALYZE, SETTINGS, VERBOSE, BUFFERS) '

The idea behind that is that I can prepend ":ea" as needed, rather
than doing a lot of typing each time, as in:

:ea SELECT ...


--
Peter Geoghegan




Re: _bt_split(), and the risk of OOM before its critical section

2019-05-07 Thread Peter Geoghegan
On Mon, May 6, 2019 at 4:11 PM Peter Geoghegan  wrote:
> I am tempted to move the call to _bt_truncate() out of _bt_split()
> entirely on HEAD, possibly relocating it to
> nbtsplitloc.c/_bt_findsplitloc(). That way, there is a clearer
> separation between how split points are chosen, suffix truncation, and
> the mechanical process of executing a legal page split.

I decided against that -- better to make it clear how truncation deals
with space overhead within _bt_split(). Besides, the resource
management around sharing a maybe-palloc()'d high key across module
boundaries seems complicated, and best avoided.

Attached draft patch for HEAD fixes the bug by organizing _bt_split()
into clear phases. _bt_split() now works as follows, which is a little
different:

*  An initial phase that is entirely concerned with the left page temp
buffer itself -- initializes its special area.

* Suffix truncation to get left page's new high key, and then add it
to left page.

* A phase that is mostly concerned with initializing the right page
special area, but also finishes off one or two details about the left
page that needed to be delayed. This is also where the "shadow
critical section" begins. Note also that this is where
_bt_vacuum_cycleid() is called, because its contract actually
*requires* that caller has a buffer lock on both pages at once. This
should not be changed on the grounds that _bt_vacuum_cycleid() might
fail (nor for any other reason).

* Add new high key to right page if needed. (No change, other than the
fact that it happens later now.)

* Add other items to both leftpage and rightpage. Critical section
that copies leftpage into origpage buffer. (No changes here.)

I suppose I'm biased, but I prefer the new approach anyway. Adding the
left high key first, and then the right high key seems simpler and
more logical. It emphasizes the similarities and differences between
leftpage and rightpage. Furthermore, this approach fixes the
theoretical risk of leaving behind a minimally-initialized nbtree page
that has existed since 2010. We don't allocate *any* memory after the
point that a new rightpage buffer is acquired.

I suppose that this will need to be backpatched.

Thoughts?
--
Peter Geoghegan


v1-0001-Don-t-leave-behind-junk-nbtree-pages-during-split.patch
Description: Binary data


Re: Pathological performance when inserting many NULLs into a unique index

2019-04-18 Thread Peter Geoghegan
On Wed, Apr 17, 2019 at 7:37 PM Peter Geoghegan  wrote:
> Tentatively, I think that the fix here is to not "itup_key->scantid =
> NULL" when a NULL value is involved (i.e. don't "itup_key->scantid =
> NULL" when we see IndexTupleHasNulls(itup) within _bt_doinsert()). We
> may also want to avoid calling _bt_check_unique() entirely in this
> situation.

> I'll create an open item for this, and begin work on a patch tomorrow.

I came up with the attached patch, which effectively treats a unique
index insertion as if the index was not unique at all in the case
where the new tuple is IndexTupleHasNulls() within _bt_doinsert(). This is
correct because inserting a new tuple with a NULL value can neither
contribute to somebody else's duplicate violation, nor have a
duplicate violation error of its own. I need to test this some more,
though I am fairly confident that I have the right idea.
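
In other words, the high-level rule is roughly this (a sketch of the
idea, not the patch itself):

#include "postgres.h"
#include "access/genam.h"       /* IndexUniqueCheck */
#include "access/itup.h"        /* IndexTuple, IndexTupleHasNulls() */

/*
 * Skip _bt_check_unique() entirely when the new tuple has a NULL attribute,
 * since such a tuple can neither cause nor suffer a unique violation.
 */
static bool
sketch_should_check_unique(IndexUniqueCheck checkUnique, IndexTuple itup)
{
    return checkUnique != UNIQUE_CHECK_NO && !IndexTupleHasNulls(itup);
}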

We didn't have this problem before my v12 work due to the "getting
tired" strategy, but that's not the full story. We also didn't (and
don't) move right within _bt_check_unique() when
IndexTupleHasNulls(itup), because _bt_isequal() treats NULL values as
not equal to any value, including even NULL -- this meant that there
was no useless search to the right within _bt_check_unique().
Actually, the overall effect of how _bt_isequal() is used is that
_bt_check_unique() does *no* useful work at all with a NULL scankey.
It's far simpler to remove responsibility for new tuples/scankeys with
NULL values from _bt_check_unique() entirely, which is what I propose
to do with this patch.

The patch actually ends up removing quite a lot more code than it
adds, because it removes _bt_isequal() outright. I always thought that
nbtinsert.c dealt with NULL values in unique indexes at the wrong
level, and I'm glad to see _bt_isequal() go. Vadim's accompanying 1997
comment about "not using _bt_compare in real comparison" seems
confused to me. The new _bt_check_unique() may still need to compare
the scankey to some existing, adjacent tuple with a NULL value, but
_bt_compare() is perfectly capable of recognizing that that adjacent
tuple shouldn't be considered equal. IOW, _bt_compare() is merely
incapable of behaving as if "NULL != NULL", which is a bad reason for
keeping _bt_isequal() around.

--
Peter Geoghegan


0001-Prevent-O-N-2-unique-index-insertion-edge-case.patch
Description: Binary data


Re: Pathological performance when inserting many NULLs into a unique index

2019-04-18 Thread Peter Geoghegan
On Thu, Apr 18, 2019 at 5:46 PM Michael Paquier  wrote:
> Adding an open item is adapted, nice you found this issue.  Testing on
> my laptop with v11, the non-NULL case takes 5s and the NULL case 6s.
> On HEAD, the non-NULL case takes 6s, and the NULL case takes...  I
> just gave up and cancelled it :)

This was one of those things that occurred to me out of the blue.

It just occurred to me that the final patch will need to be more
careful about non-key attributes in INCLUDE indexes. It's not okay for
it to avoid calling _bt_check_unique() just because a non-key
attribute was NULL. It should only do that when a key attribute is
NULL.

-- 
Peter Geoghegan




Re: Pathological performance when inserting many NULLs into a unique index

2019-04-19 Thread Peter Geoghegan
On Thu, Apr 18, 2019 at 8:13 PM Peter Geoghegan  wrote:
> It just occurred to me that the final patch will need to be more
> careful about non-key attributes in INCLUDE indexes. It's not okay for
> it to avoid calling _bt_check_unique() just because a non-key
> attribute was NULL. It should only do that when a key attribute is
> NULL.

Attached revision does it that way, specifically by adding a new field
to the insertion scankey struct (BTScanInsertData). The field is
filled-in when initializing an insertion scankey in the usual way. I
was able to reuse alignment padding for the new field, so there is no
additional space overhead on Linux x86-64.
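
Roughly speaking, the struct ends up looking like this (an abridged,
illustrative sketch -- field names and layout here are not a verbatim
quote of the patch):

typedef struct BTScanInsertData
{
    bool        heapkeyspace;   /* heap TID treated as a tiebreaker? */
    bool        anynullkeys;    /* new field: is any *key* attribute NULL?
                                 * Reuses what was alignment padding. */
    ItemPointer scantid;        /* tiebreaker heap TID, or NULL */
    int         keysz;          /* number of entries in scankeys[] */
    ScanKeyData scankeys[INDEX_MAX_KEYS];
} BTScanInsertData;

_bt_doinsert() then skips the _bt_check_unique() call whenever the new
field reports that a key attribute is NULL.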

This is a bit more complicated than v1, but there is still an overall
net reduction in overall complexity (and in LOC). I'm going to commit
this early next week, barring any objections, and assuming I don't
think of any more problems between now and then.
--
Peter Geoghegan


v2-0001-Prevent-O-N-2-unique-index-insertion-edge-case.patch
Description: Binary data


Re: ERROR: failed to add item to the index page

2019-05-02 Thread Peter Geoghegan
On Tue, Apr 30, 2019 at 6:28 PM Peter Geoghegan  wrote:
> Attached is a much more polished version of the same patch. I tried to
> make clear how the "page full" test (the test that has been fixed to
> take heap TID space for high key into account) is related to other
> close-by code, such as the tuple space limit budget within
> _bt_check_third_page(), and the code that sets up an actual call to
> _bt_truncate().

Pushed, though final version does the test a little differently. It
adds the required heap TID space to itupsz, rather than subtracting it
from pgspc. This is actually representative of the underlying logic,
and avoids unsigned underflow.
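
For reference, the committed "page full" test boils down to something
like this (written here as a standalone helper just for illustration):

#include "postgres.h"
#include "storage/itemptr.h"

/*
 * On leaf pages, reserve room for the heap TID that suffix truncation may
 * have to add to the high key of the page being finished.
 */
static bool
sketch_page_is_full(Size pgspc, Size itupsz, bool isleaf)
{
    Size    hikeytidspace = isleaf ? MAXALIGN(sizeof(ItemPointerData)) : 0;

    return pgspc < itupsz + hikeytidspace;
}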

-- 
Peter Geoghegan




Re: Improve search for missing parent downlinks in amcheck

2019-04-27 Thread Peter Geoghegan
On Sat, Apr 27, 2019 at 4:57 PM Alexander Korotkov
 wrote:
> "rootdescend" is cool type of check.  Thank you for noticing, I wasn't aware 
> of it.
> But can it detect the missing downlink in following situation?
>
> A
>  / \
>   B <-> C <-> D
>
> Here A has downlinks to B and D, while the downlink to C is missing;
> B, C and D are correctly connected with leftlinks and rightlinks.
> I can see "rootdescend" calls _bt_search(), which would just step
> right from C to D as if it were a concurrent split.

There is a comment about this scenario above bt_rootdescend() in amcheck.

You're right -- this is a kind of corruption that even the new
rootdescend verification option would miss. We can imagine a version
of rootdescend verification that tells the core code to only move
right when there was an *interrupted* page split (i.e.
P_INCOMPLETE_SPLIT() flag bit is set), but that isn't what happens
right now.

That said, the lossy downlink check that you want to improve on
*should* already catch this situation. Of course it might not because
it is lossy (uses a Bloom filter), but I think that that's very
unlikely. That's why I would like to understand the problem that you
found with the check.

-- 
Peter Geoghegan




Re: Improve search for missing parent downlinks in amcheck

2019-04-27 Thread Peter Geoghegan
On Sat, Apr 27, 2019 at 5:13 PM Alexander Korotkov
 wrote:
> Yes, increasing the Bloom filter size also helps.  But my intention was
> to make a non-lossy check here.

Why is that your intention? Do you want to do this as a feature for
Postgres 13, or do you want to treat this as a bug that we need to
backpatch a fix for?

Can we avoid the problem you saw with the Bloom filter approach by
using the real size of the index (i.e.
smgrnblocks()/RelationGetNumberOfBlocks()) to size the Bloom filter,
and/or by rethinking the work_mem cap? Maybe we can have a WARNING
that advertises that work_mem is probably too low?

The state->downlinkfilter Bloom filter should be small in almost all
cases, so I still don't fully understand your concern. With a 100GB
index, we'll have ~13 million blocks. We only need a Bloom filter that
is ~250MB to have less than a 1% chance of missing inconsistencies
even with such a large index. I admit that it's unfriendly that users
are not warned about the shortage currently, but that is something we
can probably find a simple (backpatchable) fix for.
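
Something along these lines might be enough as the warning -- purely a
hypothetical sketch, not anything that exists in amcheck today, and the
0.5 cutoff is only illustrative:

/*
 * Warn when the downlink Bloom filter ended up overloaded, since missing
 * downlinks are then more likely to go undetected.
 */
if (bloom_prop_bits_set(state->downlinkfilter) > 0.5)
    ereport(WARNING,
            (errmsg("not enough memory for reliable missing downlink check"),
             errhint("Consider increasing \"work_mem\".")));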

-- 
Peter Geoghegan




Re: Improve search for missing parent downlinks in amcheck

2019-04-27 Thread Peter Geoghegan
On Sat, Apr 27, 2019 at 5:13 PM Alexander Korotkov
 wrote:
> Yes, increasing the Bloom filter size also helps.  But my intention was
> to make a non-lossy check here.

I agree that that might be a good goal, but I am interested in knowing
if there is something naive about how the downlinkfilter Bloom filter
is sized. I think that we could probably do better than this without
much work:

/*
 * Extra readonly downlink check.
 *
 * In readonly case, we know that there cannot be a concurrent
 * page split or a concurrent page deletion, which gives us the
 * opportunity to verify that every non-ignorable page had a
 * downlink one level up.  We must be tolerant of interrupted page
 * splits and page deletions, though.  This is taken care of in
 * bt_downlink_missing_check().
 */
total_pages = (int64) state->rel->rd_rel->relpages;
state->downlinkfilter = bloom_create(total_pages, work_mem, seed);

Maybe we could use "smgrnblocks(index->rd_smgr, MAIN_FORKNUM)"
instead of relpages, for example.
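
A minimal sketch of that change (not committed code), assuming the true
physical length is a better guide than the relpages estimate:

total_pages = (int64) RelationGetNumberOfBlocks(state->rel);
state->downlinkfilter = bloom_create(total_pages, work_mem, seed);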

> It wouldn't be really wasteful, because the same children were
> accessed recently.  So, they are likely not yet evicted from
> shared_buffers.  I think we can fit both checks into one loop over
> downlinks instead of two.

I see your point, but if we're going to treat this as a bug then I
would prefer a simple fix.

> Yes, we can do more checks with the Bloom filter.  But assuming that we
> iterate over the children of each non-leaf page anyway, can we do:
> 1) All the checks, which bt_downlink_check() does for now
> 2) Check there are no missing downlinks
> 3) Check hikeys
> in one pass, can't we?

We can expect every high key in a page to have a copy contained within
its parent, either as one of the keys, or as the parent's own high key
(assuming no concurrent or interrupted page splits or page deletions).
This is true today, even though we truncate negative infinity items in
internal pages.

I think that the simple answer to your question is yes. It would be
more complicated that way, and the only extra check would be the check
of high keys against their parent, but overall this does seem
possible.
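
For what it's worth, the extra high key check could look something like
this inside the existing downlink loop -- a sketch only, where
keys_equal() and parent_pivot are invented placeholders rather than
amcheck code:

if (!P_RIGHTMOST(childopaque))
{
    IndexTuple  child_hikey;

    child_hikey = (IndexTuple) PageGetItem(childpage,
                                           PageGetItemId(childpage, P_HIKEY));
    /* compare against the parent key for this downlink, or its high key */
    if (!keys_equal(state, child_hikey, parent_pivot))
        ereport(ERROR,
                (errcode(ERRCODE_INDEX_CORRUPTED),
                 errmsg("high key does not match key in parent")));
}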

-- 
Peter Geoghegan




"long" type is not appropriate for counting tuples

2019-04-28 Thread Peter Geoghegan
Commit ab0dfc961b6 used a "long" variable within _bt_load() to count
the number of tuples entered into a B-Tree index as it is built. This
will not work as expected on Windows, even on 64-bit Windows, because
"long" is only 32-bits wide. It's far from impossible that you'd have
~2 billion index tuples when building a new index.

Programmers use "long" because it is assumed to be wider than "int"
(even though that isn't required by the standard, and isn't true
across all of the platforms we support). The use of "long" seems
inherently suspect given our constraints, except perhaps in the
context of sizing work_mem-based allocations, where it is used as part
of a semi-standard idiom...albeit one that only works because of the
restrictions on work_mem size on Windows.
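
A minimal illustration of the hazard, with invented variable names (not
the code from ab0dfc961b6):

long        tuples_done = 0;        /* 32 bits on 64-bit Windows (LLP64) */
int64       tuples_done_safe = 0;   /* 64 bits on every supported platform */

/* ... per-tuple loop ... */
tuples_done++;              /* silently wraps after ~2.1 billion tuples */
tuples_done_safe++;         /* never wraps in practice */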

ISTM that we should try to come up with a way of making code like this
work, rather than placing the burden on new code to get it right. This
exact issue has bitten users on a number of occasions that I can
recall. There is also a hidden landmine that we know about but haven't
fixed: logtape.c, which will break on Windows with very very large
index builds.

Also, "off_t" is only 32-bits on Windows, which broke parallel CREATE
INDEX (issue fixed by commit aa551830). I suppose that "off_t" is
really a variant of the same problem.

--
Peter Geoghegan




Re: "long" type is not appropriate for counting tuples

2019-04-28 Thread Peter Geoghegan
On Sun, Apr 28, 2019 at 4:25 PM Tom Lane  wrote:
> > ISTM that we should try to come up with a way of making code like this
> > work, rather than placing the burden on new code to get it right.
>
> Other than "use the right datatype", I'm not sure what we can do?

Ambiguity seems like the real problem here. If we could impose a
general rule that you cannot use "long" (perhaps with some limited
wiggle-room), then a lint-like tool could catch bugs like this. This
may not be that difficult. Nobody is particularly concerned about
performance on 32-bit platforms these days.

> In the meantime, somebody should fix ab0dfc961b6 ...

I'll leave this to Alvaro.

> Hmm, why is this a problem?  We should only use off_t for actual file
> access operations, and we don't use files greater than 1GB.  (There's a
> reason for that.)

The issue that was fixed by commit aa551830 showed this assumption to
be kind of brittle. Admittedly this is not as clear-cut as the "long"
issue, and might not be worth worrying about. I don't want to go as
far as requiring explicit width integer types in all situations, since
that seems totally impractical, and without any real upside. But it
would be nice to identify integer types where there is a real risk of
making an incorrect assumption, and then eliminate that risk once and
for all.

-- 
Peter Geoghegan




What is an item pointer, anyway?

2019-04-26 Thread Peter Geoghegan
itemid.h introduces the struct ItemIdData as follows:

/*
 * An item pointer (also called line pointer) on a buffer page

Meanwhile, itemptr.h introduces the struct ItemPointerData as follows:

/*
 * ItemPointer:
 *
 * This is a pointer to an item within a disk page of a known file
 * (for example, a cross-link from an index to its parent table).

It doesn't seem reasonable to assume that you should know the
difference based on context. The two concepts are closely related. An
ItemPointerData points to a block, as well as the, uh, item pointer
within that block.
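
For reference, here are the two structs in simplified form, as they
appear in the headers quoted above (alignment pragmas and surrounding
comments trimmed):

/* itemid.h: a line pointer, stored within the page itself */
typedef struct ItemIdData
{
    unsigned    lp_off:15,      /* offset to tuple (from start of page) */
                lp_flags:2,     /* state of line pointer */
                lp_len:15;      /* byte length of tuple */
} ItemIdData;

/* itemptr.h: a pointer to an item within a disk page of a known file */
typedef struct ItemPointerData
{
    BlockIdData     ip_blkid;   /* which block in the relation */
    OffsetNumber    ip_posid;   /* which line pointer within that block */
} ItemPointerData;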

This ambiguity is avoidable, and should be avoided. ISTM that the
least confusing way of removing the ambiguity would be to no longer
refer to ItemIds as item pointers, without changing anything else.

-- 
Peter Geoghegan




Re: What is an item pointer, anyway?

2019-05-05 Thread Peter Geoghegan
On Fri, Apr 26, 2019 at 5:13 PM Peter Geoghegan  wrote:
> On Fri, Apr 26, 2019 at 5:05 PM Tom Lane  wrote:
> > Yeah, I'd be fine with that, although the disconnect between the type
> > name and the comment terminology might confuse some people.
>
> Maybe, but the fact that the ItemIdData struct consists of bit fields
> that are all named "lp_*" offers a hint. Plus you have the LP_*
> constants that get stored in ItemIdData.lp_flags.

Attached draft patch adjusts code comments and error messages where a
line pointer is referred to as an item pointer. It turns out that this
practice isn't all that prevalent. Here are some specific concerns
that I had to think about when writing the patch, though:

* I ended up removing a big indexam.c "old comments" comment paragraph
from the Berkeley days, because it used the term item pointer in what
seemed like the wrong way, but also because AFAICT it's totally
obsolete.

* Someone should confirm that I have preserved the original intent of
the changes within README.HOT, and the heapam changes that relate to
pruning. It's possible that I changed "item pointer" to "line pointer"
in one or two places where I should have changed "item pointer" to
"tuple" instead.

* I changed a few long-standing "can't happen" error messages that
concern corruption, most of which also relate to pruning. Maybe that's
a cost that needs to be considered.

* I changed a long-standing lazy_scan_heap() log message. Another
downside that needs to be considered.

* I expanded a little on the advantages of using line pointers within
bufpage.h. That seemed in scope to me, because it drives home the
distinction between item pointers and line pointers.

-- 
Peter Geoghegan


v1-0001-Standardize-line-pointers-item-pointer-terminolog.patch
Description: Binary data


Re: VACUUM can finish an interrupted nbtree page split -- is that okay?

2019-05-04 Thread Peter Geoghegan
On Fri, Mar 1, 2019 at 3:59 PM Peter Geoghegan  wrote:
> /*
>  * Perform the same check on this internal level that
>  * _bt_mark_page_halfdead performed on the leaf level.
>  */
> if (_bt_is_page_halfdead(rel, *rightsib))

> I thought that internal pages were never half-dead after Postgres 9.4.
> If that happens, then the check within _bt_pagedel() will throw an
> ERRCODE_INDEX_CORRUPTED error, and tell the DBA to REINDEX. Shouldn't
> this internal level _bt_is_page_halfdead() check contain a "can't
> happen" error, or even a simple assertion?

I think that we should get rid of this code on HEAD shortly, because
it's effectively dead code. You don't have to know anything about
B-Trees to see why this must be true: VACUUM is specifically checking
if an internal page is half-dead here, even though it's already
treating half-dead internal pages as ipso facto corrupt in
higher-level code (it's the first thing we check in _bt_pagedel()).
This is clearly contradictory. If there is a half-dead internal page,
there is no danger of VACUUM failing to complain loudly that you need
to REINDEX. This has been true since the page deletion overhaul that
went
into 9.4.

Attached patch removes the internal page check, and adds a comment
that explains why it's sufficient to check on the leaf level alone.
Admittedly, it's much easier to understand why the
_bt_is_page_halfdead() internal page check is useless than it is to
understand why replacing it with this comment is helpful. My
observation that a half-dead leaf page is representative of the
subtree whose root the leaf page has stored as its "top parent" is
hardly controversial, though -- that's the whole basis of multi-level
page deletion. If you *visualize* how multi-level deletion works, and
consider its rightmost-in-subtree restriction, then it isn't hard to
see why everything works out with just the leaf level
right-sibling-is-half-dead check:

We can only have two adjacent "skinny" pending-deletion subtrees in
cases where the removed check might seem to be helpful -- each page is
both the leftmost and the rightmost on its level in its subtree. It's
okay to just check if the leaf is half-dead because it "owns" exactly
the same range in the keyspace as the internal pages up to and
including its top parent, if any, and because it is marked half-dead
by the same atomic operation that does initial removal of downlinks in
an ancestor page.

I'm fine with waiting until we branch-off v12 before pushing the
patch, even though it seems low risk.

--
Peter Geoghegan


0001-Remove-nbtree-half-dead-internal-page-check.patch
Description: Binary data


Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

2019-07-04 Thread Peter Geoghegan
On Thu, Jul 4, 2019 at 10:38 AM Peter Geoghegan  wrote:
> I tried this on my own "UK land registry" test data [1], which was
> originally used for the v12 nbtree work. My test case has a low
> cardinality, multi-column text index. I chose this test case because
> it was convenient for me.
>
> On v12/master, the index is 1100MB. Whereas with your patch, it ends
> up being 196MB -- over 5.5x smaller!

I also see a huge and consistent space saving for TPC-H. All 9 indexes
are significantly smaller. The lineitem orderkey index is "just" 1/3
smaller, which is the smallest improvement among TPC-H indexes in my
index bloat test case. The two largest indexes after the initial bulk
load are *much* smaller: the lineitem parts supplier index is ~2.7x
smaller, while the lineitem ship date index is a massive ~4.2x
smaller. Also, the orders customer key index is ~2.8x smaller, and the
order date index is ~2.43x smaller. Note that the test involved retail
insertions, not CREATE INDEX.

I haven't seen any regression in the size of any index so far,
including when the number of internal pages is all that we measure.
Actually, there seem to be cases where there is a noticeably larger
reduction in internal pages than in leaf pages, probably because of
interactions with suffix truncation.

This result is very impressive. We'll need to revisit what the right
trade-off is for the compression scheme, which Heikki had some
thoughts on when we left off 3 years ago, but that should be a lot
easier now. I am very encouraged by the fact that this relatively
simple approach already works quite nicely. It's also great to see
that bulk insertions with lots of compression are very clearly faster
with this latest revision of your patch, unlike earlier versions from
2016 that made those cases slower (though I haven't tested indexes
that don't really use compression). I think that this is because you
now do the compression lazily, at the point where it looks like we may
need to split the page. Previous versions of the patch had to perform
compression eagerly, just like GIN, which is not really appropriate
for nbtree.
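
To illustrate the difference in approach, a purely conceptual sketch --
try_dedup_page() and needs_split are invented placeholders, not the
patch's actual code:

if (PageGetFreeSpace(page) < itemsz)
{
    /* merge duplicates lazily, only once a split looks necessary */
    if (!try_dedup_page(rel, buf))
        needs_split = true;     /* dedup couldn't free enough space */
}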

--
Peter Geoghegan



