Re: PG17 optimizations to vacuum

2024-09-02 Thread Peter Geoghegan
On Mon, Sep 2, 2024 at 4:58 PM Peter Geoghegan  wrote:
> On Mon, Sep 2, 2024 at 4:35 PM Pavel Luzanov  wrote:
> > If it helps, without creating index on id column, the numbers will be
> > much closer:
>
> Yes, avoiding all index vacuuming seems useful. It makes the test case
> cleaner, since we don't have to think about the variability from the
> TIDStore work (and from index vacuuming more generally).

It just occurred to me that earlier versions don't have the
HEAP_PAGE_PRUNE_MARK_UNUSED_NOW optimization added by commit
c120550edb. Postgres 17 does have that optimization, though, so it
should easily be able to write far fewer WAL records than earlier
versions. And yet your revised no-indexes test case seems to show that
Postgres 17 is doing slightly worse by that measure (and by others).

-- 
Peter Geoghegan




Re: PG17 optimizations to vacuum

2024-09-02 Thread Peter Geoghegan
On Mon, Sep 2, 2024 at 4:58 PM Heikki Linnakangas  wrote:
> Do you have any non-default settings? "select name,
> current_setting(name), source  from pg_settings where setting <>
> boot_val;" would show that.

What about page checksums?

One simple explanation is that we're writing extra FPIs to set hint
bits. But that explanation only works if you assume that page-level
checksums are in use (or that wal_log_hints is turned on).
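
An easy way to rule that in or out (both settings are visible with
SHOW):

SHOW data_checksums;
SHOW wal_log_hints;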

-- 
Peter Geoghegan




Re: PG17 optimizations to vacuum

2024-09-02 Thread Peter Geoghegan
On Mon, Sep 2, 2024 at 4:35 PM Pavel Luzanov  wrote:
> If it helps, without creating index on id column, the numbers will be
> much closer:

Yes, avoiding all index vacuuming seems useful. It makes the test case
cleaner, since we don't have to think about the variability from the
TIDStore work (and from index vacuuming more generally).

> master:
> WAL usage: 78502 records, 22090 full page images, 196215494 bytes
>
> v15:
> WAL usage: 77437 records, 30872 full page images, 152080268 bytes

To be clear, I wouldn't be all that surprised if some issue with
alignment padding and/or the new WAL format made master write more
bytes of WAL total than on earlier versions. Small inefficiencies like
that were always a possibility. That's not the problem that I'm
principally concerned about right now.

I'm most concerned about the simple fact that we're writing more
individual WAL records than on previous versions, despite the fact
that VACUUM does exactly the same amount of useful pruning and
freezing work across versions. How could a patch that merged together
the previous PRUNE and VACUUM WAL record types into a single record
type actually cause an increase in the number of WAL records written?
I'd have thought that that was simply impossible.
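
For what it's worth, the pg_walinspect contrib module (Postgres 15+)
makes it easy to get a per-record-type breakdown from SQL. A rough
sketch, assuming psql, a test table named "t", and WAL that hasn't
been recycled yet:

CREATE EXTENSION IF NOT EXISTS pg_walinspect;
SELECT pg_current_wal_insert_lsn() AS before_lsn \gset
VACUUM (VERBOSE) t;
SELECT pg_current_wal_insert_lsn() AS after_lsn \gset
SELECT * FROM pg_get_wal_stats(:'before_lsn', :'after_lsn',
                               per_record => true)
WHERE count > 0 ORDER BY count DESC;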

-- 
Peter Geoghegan




Re: PG17 optimizations to vacuum

2024-09-02 Thread Peter Geoghegan
On Mon, Sep 2, 2024 at 3:23 PM Melanie Plageman wrote:
> This is roughly what I get for records by vacuum. Note that I prefixed
> VACUUM with BTREE on master to indicate those records are from index
> vacuuming. By default the heapdesc routine for records emitted by index
> vacuuming prints just VACUUM -- perhaps it would be better to prefix
> it.
>
> Note that these add up to almost the same thing. I don't know yet why
> the number of PRUNE_VACUUM_SCAN records is different than PRUNE on 16.

That is indeed surprising, given that Pavel's VACUUM VERBOSE output
indicates that the number of heap tuples deleted is identical across
versions. The output also strongly suggests that the same heap pages
are pruned on both versions, since the "index scan needed: " line is
also identical across versions.

Might it be that the extra PRUNE_VACUUM_SCAN records originated in
pages that only contained existing LP_UNUSED items when scanned by
VACUUM?

-- 
Peter Geoghegan




Re: PG17 optimizations to vacuum

2024-09-02 Thread Peter Geoghegan
On Mon, Sep 2, 2024 at 1:29 PM Melanie Plageman wrote:
> I'll investigate more tomorrow, but based on my initial investigation,
> there appears to be some interaction related to how much of the
> relation is in shared buffers after creating the table and updating
> it. If you set shared_buffers sufficiently high and prewarm the table
> after the update, master has fewer WAL records reported by vacuum
> verbose.

Fewer of what specific kind of WAL record?

All of the details about useful work done by VACUUM were identical
across versions. It was only the details related to WAL, buffers, and
CPU time that changed.

Perhaps I'm not thinking of something obvious. Maybe it's extra
VISIBILITY records? But I'd expect the number of VISIBILITY records to
match the number of pages frozen, given these particulars. VACUUM
VERBOSE at least shows that that hasn't changed.

-- 
Peter Geoghegan




Re: PG17 optimizations to vacuum

2024-09-01 Thread Peter Geoghegan
On Sun, Sep 1, 2024 at 5:44 PM Pavel Luzanov  wrote:
> I see a perfectly working TID-store optimization.
> With reduced maintenance_work_mem it used only one 'vacuuming indexes'
> phase instead of 21 in v16.
> But I also expected to see a reduction in the number of WAL records
> and the total size of the WAL. Instead, WAL numbers have significantly
> degraded.
>
> What am I doing wrong?

That does seem weird.

CC'ing the authors of the relevant VACUUM enhancements.

-- 
Peter Geoghegan




Re: Query performance going from Oracle to Postgres

2023-09-07 Thread Peter Geoghegan
On Thu, Sep 7, 2023 at 3:48 AM David Rowley  wrote:
> On Thu, 7 Sept 2023 at 19:17, Peter Geoghegan  wrote:
> > It seems likely that the problem here is that some of the predicates
> > appear as so-called "Filter:" conditions, as opposed to true index
> > quals.
>
> hmm, if that were true we'd see "Rows Removed by Filter" in the
> explain analyze.

That's why I hedged, with "seems likely". The problem with using
filter conditions rather than true index quals isn't limited to the
problem of extra heap accesses. It happened to be convenient to make
my point that way, but that isn't particularly fundamental here.

I deliberately chose to make my example involve an index-only scan
(that doesn't require any heap accesses) for this reason.

> I think all that's going on is that each tuple is on a different page
> and the heap accesses are just causing many buffers to be accessed.

This index is an absolute monstrosity. I find it far easier to believe
that the real explanation is the one that Steve intuited: that there
is an issue with the way that the physical data structures (which are
more or less comparable in both systems) are accessed in Postgres.

The index in question
("history_event_display_timesta_prism_guid_display_timestamp_idx1")
has certain columns that are omitted from the query. These columns
nevertheless appear between other columns that the query filters on.
The leading two columns ("prism_guid" and "display_timestamp") are
made into index quals by Postgres, but the next index column after
those two ("unique_lookup_key") is omitted by the query, and so isn't an index
qual. In fact *four* columns are omitted after that one. But the
very-low-order index column "product_sid" *does* appear in the query,
and so also appears as a Postgres index qual.

There is every chance that the true underlying explanation is that
Oracle is able to skip over significant parts of the index structure
dynamically. In fact I'm practically certain that that's the case,
since the "product_sid" column appears as an "access predicate", rather
than as a "filter predicate". These terms are explained here:

https://use-the-index-luke.com/sql/explain-plan/oracle/filter-predicates

https://use-the-index-luke.com/sql/explain-plan/postgresql/filter-predicates

How could "product_sid" be used as an "access predicate" given the
omitted index columns? It seems very likely that parts of the index
can be skipped in Oracle, but not in Postgres -- at least not yet.

Like Markus Winand, I think that it's a real problem that EXPLAIN
doesn't yet expose the difference between access predicates and filter
predicates. Some of the index quals shown by EXPLAIN are marked
SK_BT_REQFWD/SK_BT_REQBKWD by nbtree (meaning that they're what Oracle
calls "access predicates"), while other are not (meaning that they're
what Oracle calls "filter predicates"). That issue makes it far harder to spot
these sorts of problems.

> It seems to me that an IOS would likely fix that problem which is why I
> suggested count(*) instead of
> count()

Probably, but why should we have to use an index-only scan? And what
about my example, which already used one, and still showed a big
disparity where there was no inherent reason why there had to be one? My
example didn't involve jumping to another part of the index because the
problem seems to be more general than that.

--
Peter Geoghegan




Re: Query performance going from Oracle to Postgres

2023-09-06 Thread Peter Geoghegan
On Wed, Sep 6, 2023 at 1:07 PM Dirschel, Steve wrote:
> Oracle will find the same 332 rows using the same index but in Oracle it only 
> does 20 logical reads.  I thought maybe the index was fragmented so I 
> reindexed that index:

It seems likely that the problem here is that some of the predicates
appear as so-called "Filter:" conditions, as opposed to true index
quals. For reasons that aren't particularly fundamental, Postgres
B-Trees cannot push down != (or <>) to the index level. Strangely
enough, this is possible with IS NOT NULL. I'm working to improve
things in this area.

That difference is easy to see in the following example. The first
query does far fewer buffer accesses than the second query, despite
being very similar:

regression=# explain (analyze, buffers) select * from multi_test where
a = 1 and b is not null;
 QUERY PLAN
---------------------------------------------------------------------------
 Index Only Scan using multi_test_idx on multi_test  (cost=0.29..50.10
rows=2191 width=8) (actual time=0.095..0.100 rows=14 loops=1)
   Index Cond: ((a = 1) AND (b IS NOT NULL))
   Heap Fetches: 0
   Buffers: shared hit=3
 Planning Time: 0.056 ms
 Execution Time: 0.145 ms
(6 rows)

regression=# explain (analyze, buffers) select * from multi_test where
a = 1 and b != 42;
  QUERY PLAN
---------------------------------------------------------------------------
 Index Only Scan using multi_test_idx on multi_test
(cost=0.29..222.57 rows=2191 width=8) (actual time=0.087..2.982
rows=14 loops=1)
   Index Cond: (a = 1)
   Filter: (b <> 42)
   Rows Removed by Filter: 1
   Heap Fetches: 0
   Buffers: shared hit=11
 Planning Time: 0.076 ms
 Execution Time: 3.204 ms
(8 rows)

(There are lots of index tuples matching (a, b) = "(1, NULL)" here, you see.)
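
(The setup wasn't shown here; a rough reconstruction that produces the
same general plan shape, though not the exact row counts:)

create table multi_test (a int, b int);
create index multi_test_idx on multi_test (a, b);
insert into multi_test select 1, null from generate_series(1, 2000);
insert into multi_test select 1, g from generate_series(1, 14) g;
vacuum analyze multi_test;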

-- 
Peter Geoghegan




Re: Large pkey index on insert-only table

2023-06-26 Thread Peter Geoghegan
On Mon, Jun 26, 2023 at 8:50 AM Devin Ivy  wrote:
> Any thoughts on why this may be, or where to go next to continue tracking 
> this down?  Also, could the primary key column order `(id, ancestor_id)` vs 
> `(ancestor_id, id)` significantly affect the index size depending on the 
> column cardinalities?

That is certainly possible, yes. I describe one particular pathology
that causes composite B-Tree indexes to only have about 50% space
utilization here:

https://youtu.be/p5RaATILoiE?t=2079

Theoretically this shouldn't be a problem anymore, because (as the
talk describes) Postgres 12 added heuristics that avoid the problem.
But those heuristics are kind of conservative; they only kick in when
it's fairly clearly the right thing to do. I don't think that they'll
work very reliably for varchar columns.

Note that 90% space utilization isn't really the standard case for
B-Trees in general. Even absent updates and deletes, an index with
completely random insertions (e.g., a UUID index) is expected to have
about 70% space utilization. You can only really expect ~90% space
utilization with monotonically increasing insertions.

On the other hand having less than 50% space utilization is pretty
poor, so (assuming that that's what "bloat percentage 47%" means)
I'd say that you're right to suspect that something is a bit off here.
This isn't necessarily a big deal, but I tend to agree that what
you're seeing is something that theoretically can be avoided by the
implementation (if there were deletes involved then that wouldn't
apply, but there aren't).

--
Peter Geoghegan




Re: VACUUM (INDEX_CLEANUP OFF) and GIN indexes

2023-04-25 Thread Peter Geoghegan
On Tue, Apr 25, 2023 at 9:18 AM Christophe Pettus  wrote:
> Does VACUUM (INDEX_CLEANUP OFF) flush the pending list for GIN indexes, or is 
> that skipped as well?

It's skipped by VACUUM, but not by ANALYZE. So if you're using the
reloption version of index_cleanup=off, it isn't necessarily going to
stop autovacuum/autoanalyze from doing pending list cleanup.

The ANALYZE pending list cleanup path has some problems:

https://postgr.es/m/cah2-wzkjrk556envtflmyxedw91xguwiyzvep2kp5yqt_-3...@mail.gmail.com


--
Peter Geoghegan




Re: UPSERT in Postgres

2023-04-06 Thread Peter Geoghegan
On Thu, Apr 6, 2023 at 1:21 PM Louis Tian  wrote:
> An implicit assumption behind this definition is that table must have a 
> primary key for the upsert operation to make sense since it's the primary key 
> that uniquely identifies a row.

It could just be a unique index or a unique constraint. So you can
upsert on any individual unique constraint/index, or the primary key.
Of course there might be several on a given table, but you can only
use one as the "conflict arbiter" per statement.

> But can we achieve the same effect with "INSERT ... ON CONFLICT DO UPDATE" 
> like a lot of references on the internet seems to suggest.
>
> insert into person (id, name) values (0, 'foo') on conflict ("id") do update 
> set id=excluded.id, name=excluded.name
> insert into person (id, is_active) values (0, true) on conflict ("id") do 
> update set id=excluded.id, is_active=excluded.is_active
>
> Unfortunately. the second statement will fail due to violation of the not 
> null constraint on the "name" column.
> PostgreSQL will always try to insert the row into the table first. and only 
> fallback to update when the uniqueness constraint is violated.
> Is this behavior wrong? maybe not, I think it is doing what it reads quite 
> literally.

It sort of has to work that way, though. In general your example might
*not* fail, due to a row-level before trigger in the insert path.

Why doesn't your proposed upsert syntax have the same problem? I mean,
how could it not? I guess it doesn't if you assume that it'll never
take the insert path with your not NULL constraint example? But if you
know that for sure, why not just use a regular update statement? On
the other hand, if you're not sure if the insert path can be taken,
then why is it actually helpful to not just throw an error at the
earliest opportunity?

Surely upsert means "update or insert", so why wouldn't the user expect
to see an error like this, independent of the specifics of the row in question?
Isn't the user tacitly saying "I don't specifically know if the update or insert
path will be taken in respect of any given row" by using ON CONFLICT
DO UPDATE in the first place?

> That being said, I have never had a need for the ON CONFLICT DO UPDATE 
> statement other than where I need upsert.
> But using it as "upsert" is only valid when the table is absent of any NOT 
> NULL constraint on its non-primary-key columns.

I don't know what you mean by that. "Valid"?

> The MERGE command introduced in PG15 in theory can be used to do UPSERT 
> properly that is void of the aforementioned limitation.
> The downside is it is rather verbose.

The MERGE command has various race conditions that are particularly
relevant to UPSERT type use cases. See the wiki page you referenced
for a huge amount of information on this.

> *Feature Request*
> Given that UPSERT is an *idempotent* operator it is extremely useful.

In general UPSERT (or any definition of it that I can think of) does
not imply idempotency.


--
Peter Geoghegan




Re: PG16devel - vacuum_freeze_table_age seems not being taken into account

2023-03-03 Thread Peter Geoghegan
On Fri, Mar 3, 2023 at 2:43 AM Simon Elbaz  wrote:
> I expected it not to be processed by vacuum freeze.
> However it has been entirely frozen.
> Moreover, among the 51 rows, only 1 was eligible for freeze because its XID 
> was older than vacuum_freeze_min_age.

The effect that you noticed is a consequence of page-level freezing,
which is new to Postgres 16. VACUUM will now freeze all of the tuples
on a page whenever it needs to freeze any tuples at all (barring any
tuples that are fundamentally ineligible due to being after the
removable/freezable cutoff). This is justified by the cost profile.
Once we decide to freeze at least one tuple of a page, the added cost
in WAL is low enough that it really doesn't make sense to not just
freeze everything.

The page that gets frozen by your test case is also set all-frozen in
the visibility map. Without the optimization, we'd have frozen that
one tuple and then set the page all-visible. The page would then
likely have to be frozen by the next aggressive VACUUM, which is usually much more
expensive.

-- 
Peter Geoghegan




Re: Automatic aggressive vacuum on almost frozen table takes too long

2023-02-20 Thread Peter Geoghegan
On Mon, Feb 20, 2023 at 9:43 PM Mikhail Balayan  wrote:
> What catches my eye: scanning indexes smaller than 3.1GB is fast, larger ones 
> are slow. For example:
> idx_applications2_policy_id is 3131 MB took just 5 seconds (DETAIL:  CPU: 
> user: 2.99 s, system: 1.65 s, elapsed: 5.32 s)
> but idx_applications2_deleted_at with 3264 MB took 1 minute 22 seconds 
> (DETAIL:  CPU: user: 67.93 s, system: 3.41 s, elapsed: 82.75 s)

I think that I know what this is.

If you delete many index pages during VACUUM, and those pages are all
full of duplicate values, the deletion operation can sometimes be
slower due to the need to relocate a downlink to each to-be-deleted
leaf page. When there are thousands of matches, you'll start to notice
O(n^2) behavior due to the way in which the B-Tree VACUUM code must
grovel through the parent level, which is full of duplicate keys.

If you were on Postgres 12+, then this wouldn't happen, because the
heap TID is treated as a part of the key space there, affecting sort
order. The implementation would immediately relocate the matching
parent downlink using a unique key (unique because heap TID would act
as a unique-ifier in that version). And if you were on 14+, things in
this area would be much better still.

-- 
Peter Geoghegan




Re: Automatic aggressive vacuum on almost frozen table takes too long

2023-02-17 Thread Peter Geoghegan
On Thu, Feb 16, 2023 at 5:40 PM Mikhail Balayan  wrote:
>> >> Do you have any non-btree indexes on the table? Can you show us the 
>> >> details of the
>> >> table, including all of its indexes? In other words, can you show "\d 
>> >> applications" output from psql?
>
> Only btree indexes. Please find the full table schema below:

It's possible that VACUUM had to wait a long time for a cleanup lock
on one individual heap page here, which could have added a long delay.
But...that doesn't seem particularly likely.

Can you run amcheck's bt_index_check() routine against some of the
indexes you've shown? There is perhaps some chance that index
corruption exists and causes VACUUM to take a very long time to delete
index pages. This is pretty much a wild guess, though.

-- 
Peter Geoghegan




Re: Automatic aggressive vacuum on almost frozen table takes too long

2023-02-16 Thread Peter Geoghegan
On Thu, Feb 16, 2023 at 7:44 AM Adrian Klaver  wrote:
> > That is, if I understand it correctly, it says that there were (and
> > actually are) 2013128 pages of which 2008230 were skipped, which leaves
> > 4898 blocks to be scanned. I.e. it seems that the allocated 1GB
> > (autovacuum_work_mem) should be enough to handle that amount of blocks
> > and to avoid multiple scans of the indexes.

It's quite clear that there is only one pass over the indexes, since
"index scans: 1" says exactly that.

> > But, based on buffer usage, one can see that a huge amount of data is
> > read, greatly exceeding not only the number of remaining unfrozen
> > blocks, but also the size of the table and indexes taken together: 2
> > billion blocks, more than 15TB.
> >
> > Is this a bug in Postgresql or am I interpreting the log data wrong?

I think that it might be a bug in Postgres.

I addressed a similar issue in the same "buffer" instrumentation in
commit d3609dd2, but that wasn't backpatched because I imagined that
it only applied to the new VACUUM VERBOSE case (VACUUM VERBOSE only
started sharing the same instrumentation code as log_autovacuum in
Postgres 15). It's not immediately obvious how you could see a problem
like the one you've shown in the autovacuum log output. Even still,
the information about buffers that you've shown does indeed appear to
be total nonsense (while everything else we can see looks plausible).
There has to be some explanation for that.

The only other explanation I can think of is a pathological case where
an index scan by some particular ambulkdelete routine scans a number
of buffers that vastly exceeds the total size of the index. That does
seem just about possible with an access method like GIN. Do you have
any non-btree indexes on the table? Can you show us the details of the
table, including all of its indexes? In other words, can you show "\d
applications" output from psql?

-- 
Peter Geoghegan




Re: ERROR: posting list tuple with 2 items cannot be split at offset 17

2023-02-09 Thread Peter Geoghegan
On Thu, Feb 9, 2023 at 3:55 PM Paul McGarry  wrote:
> Will the amcheck reliably identify all issues that may arise from a collation 
> change?

Theoretically it might not. In practice I'd be very surprised if it
ever failed to detect such an inconsistency. If you want to be extra
careful, and can afford to block concurrent writes, then I suggest
using bt_index_parent_check instead of bt_index_check.

> and therefore I can just use that to identify bad indexes and recreate them, 
> or should I recreate all btree indexes involving text fields?

It might be easier to just reindex them all. Hard to say.

> We planned to do a dump/restore upgrade to PG14 in a month or so (already 
> done in dev).
> I am wondering whether it will be less work overall to bring that forward 
> than rebuild these indexes...

pg_amcheck is available on 14. It offers a much simpler interface for
running the amcheck routines, so maybe look into that once you upgrade.

-- 
Peter Geoghegan




Re: ERROR: posting list tuple with 2 items cannot be split at offset 17

2023-02-09 Thread Peter Geoghegan
On Wed, Feb 8, 2023 at 11:54 PM Paul McGarry  wrote:
> But if it is the problem, why did the update start working after I recreated 
> the other index?

There is no reason why reindexing another index ought to have had that
effect. The likely explanation is that subsequent updates used a
successor version heap TID that didn't overlap with some existing
posting list in whatever way. If you repeat the update again and
again, and get an error each time, the incoming TID will differ each
time. Eventually you won't get an error, because at some point there
won't be a posting-list-TID range overlap for some new successor TID
that leads to the insert/posting list split code detecting a problem.

It's also possible that a concurrent autovacuum "fixed" the issue.

The amcheck error shows a problem in an internal page, which cannot
have posting list tuples -- which suggests broad corruption. An issue
with collation instability due to an OS update does seem likely.

Note that the hardening/defensive checks in this area have increased.
I added an additional defensive check to 13.4, and followed up with
another similar check in 13.5. It looks like the error you've seen
("ERROR:  posting list tuple with 2 items cannot be split at offset
17") comes from the initial 13.4 hardening, since I'd expect the
additional 13.5 hardening to catch the same issue sooner, with a
different error message (something like "table tid from new index
tuple (%u,%u) cannot find insert offset between offsets %u and %u of
block %u in index \"%s\"").

> I think I should now:
> - recreate the widget_name_idx on the problem servers
> - run bt_index_check across all other indexes
>
> Any suggestions on what else I should look into, in particular anything I 
> should check before upgrading the remaining 13.8 DB to 13.9?

I recommend running amcheck on all indexes, or at least all
possibly-affected text indexes.

-- 
Peter Geoghegan




Re: Question regarding UTF-8 data and "C" collation on definition of field of table

2023-02-05 Thread Peter Geoghegan
On Sun, Feb 5, 2023 at 4:19 PM Tom Lane  wrote:
> If there's a predominant language in the data, selecting a collation
> matching that seems like your best bet.  Otherwise, maybe you should
> just shrug your shoulders and stick with C collation.  It's likely
> to be faster than any alternative.

FWIW there are certain "compromise locales" supported by ICU/CLDR.
These include "English (Europe)", and, most notably, EOR (European
Ordering Rules):

https://en.wikipedia.org/wiki/European_ordering_rules

I'm not sure how widely used those are. EOR seems to have been
standardized by the EU or by an adjacent institution, which by itself
doesn't say much about how widely used it really is.

It's also possible to use a custom collation with ICU, which is almost
infinitely flexible:

http://www.unicode.org/reports/tr10/#Customization

As an example, the rules about the relative ordering of each script
can be changed this way. There is also something called merged
tailorings.

The OP should see the Postgres ICU docs for hints on how to use these
facilities to make a custom collation that matches whatever their
requirements are:

https://www.postgresql.org/docs/current/collation.html#COLLATION-MANAGING
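
To give just one minimal example of a custom ICU collation (this one
enables "natural" ordering of embedded numbers):

CREATE COLLATION numeric_sort (provider = icu, locale = 'en-u-kn-true');

SELECT v FROM (VALUES ('a2'), ('a10'), ('a1')) AS t (v)
ORDER BY v COLLATE numeric_sort;  -- returns a1, a2, a10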

-- 
Peter Geoghegan




Re: Lots of read activity on index only scan

2022-11-18 Thread Peter Geoghegan
On Fri, Nov 18, 2022 at 1:50 PM Peter J. Holzer  wrote:
> There should be about 27000 of them, same as for the other index, right?

There aren't that many. The point I'm making is that you can access
each VM page approximately once (and check relatively many index
tuple's TIDs all in one go), or many times. The total number of VM
pages may be constant, but the access patterns are quite different
owing to differences in how the data is clustered in each index.

> > When there is naturally a high correlation (or some kind of
> > clustering) in how we access VM pages, we'll naturally be able to do
> > more visibility checks covering more index tuples per VM page
> > accessed.
>
> So you are saying that these are accesses to the visibility map, not the
> base table?

Yes. I see "Heap Fetches: 0" for both plans, that each query the same
table and scan approximately the same number of index pages. So VM
accesses are the only explanation that makes any sense.

> > Also worth bearing in mind that it's unusual to have a perfectly
> > random and uniformly distributed clustering of index tuples,
>
> Sure. This is a highly contrived example.

FWIW I think that it could be a lot less bad, even with indexes that
you'd think would be almost as bad as the bad one from your test case.
Even things that appear to be random aren't usually nearly as random
as what you've shown.

-- 
Peter Geoghegan




Re: Lots of read activity on index only scan

2022-11-18 Thread Peter Geoghegan
On Fri, Nov 18, 2022 at 12:46 PM Peter J. Holzer  wrote:
> Both do a parallel index only scan. Both perform 0 heap fetches.
> But one reads 27336 buffers (or about 22 bytes per index entry, which
> sounds reasonable) while the other reads 9995216 buffers (or almost one
> full buffer per row). Why? The entries should be dense in the index in
> both cases and since it's an index only scan (and explain says there
> were 0 heap fetches) I would not expect extra accesses. Where do these
> buffer reads come from?

The index-only scan processes an index leaf page at a time. When there
is naturally a high correlation (or some kind of clustering) in how we
access VM pages, we'll naturally be able to do more visibility checks
covering more index tuples per VM page accessed. This is a less
severe problem here than it would be with an equivalent pair of plain
index scans, just because there are so few VM pages relative to
heap pages. But it's more or less an analogous problem. You're
really noticing it here because these index scans have very low
selectivity -- which is kinda unusual in most environments.

Also worth bearing in mind that it's unusual to have a perfectly
random and uniformly distributed clustering of index tuples, which is
what the index built via hashing exhibits. Even a v4 UUID index could
easily have plenty of duplicates, which would probably do
significantly better on the metric you've focussed on.


--
Peter Geoghegan




Re: ON CONFLICT and WHERE

2022-11-13 Thread Peter Geoghegan
On Sun, Nov 13, 2022 at 1:07 PM Tom Lane  wrote:
> A WHERE placed there is an index_predicate attachment to the ON CONFLICT
> clause.  It doesn't have any run-time effect other than to allow partial
> indexes to be chosen as arbiter indexes.  TFM explains
>
> index_predicate
>
> Used to allow inference of partial unique indexes. Any indexes
> that satisfy the predicate (which need not actually be partial
> indexes) can be inferred.
>
> This strikes me as a bit of a foot-gun.  I wonder if we should make
> it safer by insisting that the resolved index be partial when there's
> a WHERE clause here.

I don't think that it would be safer.

Adrian has asked why it's possible to attach an arbitrary
index_predicate type WHERE clause to an ON CONFLICT query, without
that really changing the behavior of the statement. That *is* a little
odd, so it's certainly a fair question (I can recall perhaps as many
as 5 similar questions over the years). But it's not the end of the
world, either -- there are far worse things.

I think that it would be a lot worse (just for example) to have your
ON CONFLICT query suddenly start throwing an ERROR in production, just
because you replaced a partial unique index with a unique constraint.
If we have a suitable unique index or constraint, why wouldn't we use
it in ON CONFLICT? Maybe it won't work out that way (maybe there won't
be any suitable unique index or constraint), but why not do our utmost
to insulate the user from what might be a serious production issue?
That was the guiding principle.

Overall I'm quite happy with the number of foot-guns ON CONFLICT has,
especially compared to other comparable features in other DB systems
(which had plenty). There are one or two ostensibly odd things about
the syntax that are downstream consequences of trying to make the
constraint/unique index inference process maximally forgiving. I'm
pretty happy with that trade-off.

> (This documentation text is about as clear as
> mud, too.  What does "inferred" mean here?  I think it means "chosen as
> arbiter index", but maybe I misunderstand.)

Unique index/constraint inference is the process by which we choose an
arbiter index. See the second paragraph of the "ON CONFLICT Clause"
section of the INSERT docs.

-- 
Peter Geoghegan




Re: Weird planner issue on a standby

2022-10-12 Thread Peter Geoghegan
On Wed, Oct 12, 2022 at 6:47 AM Tom Lane  wrote:
> However, that doesn't explain the downthread report that a
> VACUUM on the primary fixed it.  What I suspect is that that
> caused some in-fact-dead index entries to get cleaned out.

Seems likely.

> But ... if the primary is allowed to vacuum away an index
> entry that it thinks is dead, exactly what is the point of
> making standbys ignore LP_DEAD bits?  There's no additional
> interlock that guarantees the tuple will be there at all.

The interlock doesn't really protect the leaf page or its index tuples
so much as the referenced TIDs themselves. In other words it's a TID
recycling interlock.

That's why we don't need a cleanup lock to perform index tuple
deletions, even though the WAL records for those are almost identical
to the WAL records used by index vacuuming (in the case of nbtree the
only difference is the extra latestRemovedXid field in the deletion
variant WAL record). We know that there is no VACUUM process involved,
and no question of heap vacuuming going ahead for the same TIDs once
index vacuuming is allowed to complete.

We can get away with not having the interlock at all in the case of
nbtree index scans with MVCC snapshots -- but *not* with index-only
scans. See "Making concurrent TID recycling safe" in the nbtree
README. I only got around to documenting all of the details here quite
recently. The index-only scan thing dates back to 9.5.

-- 
Peter Geoghegan




Re: Weird planner issue on a standby

2022-10-11 Thread Peter Geoghegan
On Tue, Oct 11, 2022 at 10:04 PM Tom Lane  wrote:
> Do we propagate visibility-map bits to standbys?

Yes.

-- 
Peter Geoghegan




Re: Weird planner issue on a standby

2022-10-11 Thread Peter Geoghegan
On Tue, Oct 11, 2022 at 9:27 AM Alvaro Herrera  wrote:
> I remember having an hypothesis, upon getting a report of this exact
> problem on a customer system once, that it could be due to killtuple not
> propagating to standbys except by FPIs.  I do not remember if we proved
> that true or not.  I do not remember observing that tables were being
> read, however.

That's true, but it doesn't matter whether or not there are LP_DEAD
bits set on the standby, since in any case they cannot be trusted when
in Hot Standby mode. IndexScanDescData.ignore_killed_tuples will be
set to false on the standby.

-- 
Peter Geoghegan




Re: Monitoring multixact members growth

2022-08-19 Thread Peter Geoghegan
On Fri, Aug 19, 2022 at 8:40 AM Vido Vlahinic wrote:
> However it is not quite clear to me how I can interpret results from above 
> and e.g. conclude: my_table accumulates x more multixact members since I 
> measured last.

You can't. And not just because nobody got around to implementing it
yet -- it's quite a fundamental restriction. The only per-table
MultiXact bookkeeping is relminmxid, and VACUUM must always scan all
unfrozen pages to safely determine that much (it may or may not have
to freeze *any* MultiXacts as part of that process, but it must always
be sure that no Multis < its final relminmxid remain).

> My goal here is to predict where multixact members are growing the fastest so 
> I can perform manual VACUUM FREEZE only on those tables

The problem with that strategy is that you still have to do
anti-wraparound autovacuums when the mxid_age(relminmxid) of a table
crosses the usual threshold, even when in reality there are *zero*
MultiXacts in the table (often the case with the largest tables).
That's just how it works, unfortunately.
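
The mxid_age(relminmxid) side of this is at least easy to keep an eye
on, even though per-table member space consumption is not:

SELECT oid::regclass AS tbl, mxid_age(relminmxid) AS age
FROM pg_class
WHERE relkind = 'r' AND relminmxid <> '0'
ORDER BY mxid_age(relminmxid) DESC
LIMIT 10;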

There is one piece of good news, though: work in Postgres 15 taught
VACUUM to track the oldest extant XID and MXID in the table, and set
relfrozenxid and relminmxid to those oldest values (rather than using the
cutoffs for freezing, which in general might be much older than the
oldest remaining unfrozen XID/MXID).

I expect that this will make it much less likely that anti-wraparound
autovacuums affecting many tables will all stampede, hurting
performance. This will be possible because VACUUM will now be able to
set relminmxid to a value that actually tells us something about
what's really going on in each table, MultiXact-wise (not just what
you set vacuum_multixact_freeze_min_age and
autovacuum_multixact_freeze_max_age to in postgresql.conf, which is
pretty far removed from what matters most of the time). Often VACUUM
will get there simply by noticing that there are no remaining
MultiXacts (and probably never were any in the first place) in the
larger tables.

The timeline for anti-wraparound autovacuums will tend to make a lot
more sense for *your* workload, where huge differences in the rate of
MultiXact consumption among tables is likely the norm. This still
isn't perfect (far from it), but it has the potential to make things
far better here.

-- 
Peter Geoghegan




Re: Automatic autovacuum to prevent wraparound - PG13.5

2022-06-24 Thread Peter Geoghegan
On Wed, Jun 15, 2022 at 4:13 AM Mauro Farracha  wrote:
> The scenario:
> - Out of nowhere (during the weekend), without database activity load or 
> batches running, with previous nightly run of vacuum freeze, in the middle of 
> the day, with xids and mxids below 20M we are seeing autovacuum being 
> triggered to prevent wraparound.
>
> My question is why this is occurring, which condition might be responsible 
> for this behaviour?

There is a behavior that seems like it might be relevant: VACUUM
interprets autovacuum_multixact_freeze_max_age in a way that accounts
for both MultiXactId consumption and the consumption of "member space"
by MultiXacts. Technically there are 2 SLRUs for MultiXacts, either of
which can wraparound.

This behavior was established by commit 53bb309d2d. It is documented.
Admittedly this whole area of the documentation is in dire need of an
overhaul.  :-(

-- 
Peter Geoghegan




Re: autovacuum_freeze_max_age on append-only tables

2022-04-21 Thread Peter Geoghegan
On Thu, Apr 21, 2022 at 8:14 PM Senor  wrote:
> Are the autovacuum_vacuum_cost_* settings handled any differently for
> 'to avoid wraparound' vacuums? I understand that it won't give up a lock
> but I was expecting it to still back off due to cost and allow the query
> with conflicting lock to proceed.

In general, no. For the most part an antiwraparound autovacuum does
exactly the same work as any other autovacuum. Or any other aggressive
VACUUM, at least. But even the extra work that it does over a
non-aggressive VACUUM is still work that any VACUUM might do, if the
circumstances were right.

We still freeze in regular VACUUMs, provided we scan pages with XIDs
that are sufficiently old. The most important difference between
regular and aggressive VACUUM is that the former can skip all-visible pages
that have unfrozen XIDs, putting that work off. This can sometimes
lead to a big balloon payment later on, when you finally have an
aggressive VACUUM. I think that that's a design flaw that ought to be
fixed.

Currently non-aggressive VACUUMs always skip all-visible pages. They
should probably freeze some older all-visible pages eagerly, rather
than skipping them, so that the system never gets too far behind on
freezing.

> Is there any benefit to manually running a vacuum every so many inserts
> as opposed to using autovacuum_freeze_max_age. And in this case should
> it be a vacuum freeze.

Given your restrictions, this is probably the best option available.
But maybe you should just set vacuum_freeze_min_age to 0 at the table
level, instead of using vacuum freeze (so you freeze more without
doing aggressive vacuuming all the time, which FREEZE also forces).
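
That is, something like one of these (table name is a placeholder;
note that the reloption only affects autovacuum-driven VACUUMs, while
the GUC affects manual VACUUM):

ALTER TABLE append_mostly SET (autovacuum_freeze_min_age = 0);

-- or, around a manual VACUUM:
SET vacuum_freeze_min_age = 0;
VACUUM append_mostly;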

Users understandably think that there are several different flavors of
vacuum, but that's not really true (apart from VACUUM FULL, which
really is quite different). The difference between aggressive and
non-aggressive can be big in practice due to an accumulation of
unfrozen pages over multiple non-aggressive vacuums.

-- 
Peter Geoghegan




Re: autovacuum_freeze_max_age on append-only tables

2022-04-20 Thread Peter Geoghegan
On Wed, Apr 20, 2022 at 4:06 PM senor  wrote:
> I'm attempting to mimic a new feature in version 13 where INSERTS will 
> trigger vacuum for an append-only table.

The problem with that idea is that you need to express the idea that
the table needs to be vacuumed now in terms of its "age", denominated
in XIDs -- but XIDs consumed by the entire system, not just those XIDs
that happen to modify your append-only table. It will likely be very
hard for you to figure out a way to relate these logical units (XIDs)
to some kind of physical cost that captures how far behind you are on
freezing (like blocks, or even tuples). Maybe you'll find something
that works through trial and error, but I wouldn't count on it.

> I'm apparently needing an education on how this "to avoid wraparound" vacuum 
> differs from any other. I've seen it referenced as "more aggressive" but I'd 
> like details. An upgrade to 13 is "right around the corner".

It's complicated -- more complicated than it really should be.
Technically an anti-wraparound autovacuum and an aggressive vacuum are
two different things. In practice anti-wraparound autovacuums are
virtually guaranteed to be aggressive, though an aggressive autovacuum
may not be an antiwraparound VACUUM (sometimes we do aggressive
vacuuming because autovacuum launched a worker before
age(relfrozenxid) reached autovacuum_freeze_max_age, but after
age(relfrozenxid) reached vacuum_freeze_table_age).

See my recent response to a similar question here:

https://postgr.es/m/CAH2-WzkFQ-okvVXizpy4dCEVq75N-Qykh=crhzao-eajflv...@mail.gmail.com

--
Peter Geoghegan




Re: oldest xmin is far in the past :: BUT xmin is not available in system

2022-04-19 Thread Peter Geoghegan
On Mon, Apr 18, 2022 at 11:37 PM bhargav kamineni  wrote:
> executing the vacuum on the entire cluster is also giving the same HINTS and 
> WARNING's

You're using Aurora, not PostgreSQL. Perhaps this is actually a bug,
but there is no way for anybody here to know.

-- 
Peter Geoghegan




Re: Packages, inner subprograms, and parameterizable anonymous blocks for PL/pgSQL

2021-12-17 Thread Peter Geoghegan
On Fri, Dec 17, 2021 at 11:43 AM Bryn Llewellyn  wrote:
> Modular design recommends exposing functionality through a purpose oriented 
> interface and hiding all implementation details from the API’s user. A 
> package achieves this with declarative syntax via the spec/body separation. 
> The body encapsulates as many (top-level) subprograms as you want. Each of 
> these is visible to all of its peers. But none is visible outside of the 
> package unless the spec declares that it should be. This is a simple opt-in 
> scheme.

I still don't get it. It sounds like you're mostly talking about
encapsulation, or information hiding, for stored procedures. I can
certainly see how plpgsql doesn't do those things very well, but it
still seems like there might be a lot of nuance that isn't getting
across. The specific features that seem to be missing are not
unreasonable, individually, and yet it feels like I cannot see some
bigger picture that's apparent to you.

Maybe you should explain your position by way of a motivating example,
involving a real world use case. Something that makes the issues
concrete. Are these items compelling because of how they allow an
organization to deploy a program in a production environment, complete
with version control? Does it have something to do with decoupling the
mutable business data stored in tables from the programs contained/run
in the same database?

-- 
Peter Geoghegan




Re: Packages, inner subprograms, and parameterizable anonymous blocks for PL/pgSQL

2021-12-16 Thread Peter Geoghegan
On Thu, Dec 16, 2021 at 6:39 PM Mladen Gogala  wrote:
> I am very well aware of the animosity toward the Oracle community and
> the reasons for that animosity. Oracle wiped the floor with the
> predecessor of Postgres, the database called "Ingres". Sandra Kurtzig,
> the Ingres CEO at the the time, and Michael Stonebraker were both
> involved in very public spat with Larry Ellison.

These events happened several years before I was born.

> Stonebraker is still
> very much a factor in Postgres community and I doubt that his feelings
> toward the Larry and his company have got any warmer with years.

No, he isn't. I think that Stonebraker appeared at a few events that
were hosted by EDB, but to the best of my knowledge that's the extent
of his involvement in the project after university POSTGRES. I've
personally never met the man, or communicated with him online, and
I've worked on Postgres more or less full time for a full decade now.
As far as I'm aware he hasn't ever publicly posted to any of the
mailing lists.

-- 
Peter Geoghegan




Re: Packages, inner subprograms, and parameterizable anonymous blocks for PL/pgSQL

2021-12-16 Thread Peter Geoghegan
On Thu, Dec 16, 2021 at 11:22 AM Bryn Llewellyn  wrote:
> The advantages are self-evident to these programmers; and their lack comes as 
> a shocking disappointment when they start to write application code for 
> PostgreSQL*. The absence of packages and inner subprograms is huge.

Why are those things huge? It's not self-evident to me. I can only
speak for myself, but throwing around terms like "shocking
disappointment" is never going to convince me of anything. You can
make similar statements about many other things.

Any functionality that gets added to PostgreSQL should ideally be
compelling to users that have not worked with Oracle in the past.
Maybe that happens to result in a feature that very much looks like
what you already have in mind, or maybe there are significant
differences. If you're not going to meet others closer to where they
are, then how can you expect it for yourself?

-- 
Peter Geoghegan




Re: Error with Insert from View with ON Conflict

2021-11-03 Thread Peter Geoghegan
On Wed, Nov 3, 2021 at 2:18 PM Steve Baldwin  wrote:
> I'm pretty sure the 'alias' for the '.. on conflict do update ..' needs to be 
> 'excluded' (i.e. checks = excluded.checks, ...). Check the docs.

That's right. The excluded.* pseudo-table isn't exactly the same
thing as the target table -- it is a tuple that has the same "shape",
that represents what the implementation tried (and failed) to insert
into the table. I have to imagine that Alex wants to reference that,
because that's the standard idiomatic approach with ON CONFLICT. And
because the only alternative interpretation is that Alex intends to
update those columns using their current values (not new values),
which won't really change anything -- that seems unlikely to have been
the intent.
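
In other words, presumably something like this (table and column names
adapted from the fragment above):

INSERT INTO t (id, checks) VALUES (1, 'x')
ON CONFLICT (id) DO UPDATE SET checks = EXCLUDED.checks;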

-- 
Peter Geoghegan




Re: Segmentation fault in volatile c function

2021-10-27 Thread Peter Geoghegan
On Wed, Oct 27, 2021 at 12:39 PM Louise Grandjonc wrote:
> I'm creating my first ever extension. The function that I'm trying to write 
> takes the schema and name of a table, and adds it in a table with all the 
> tables the extension follows.
> In that table I want the schema, name and oid of the table.
>
> I created a C function for that. Here is the code

I don't think that using cstring as an argument is appropriate here.
That's only used for "internal" functions that are called through C,
rather than being called through SQL.

It would probably be better to use a regclass argument instead. Take a
look at the example of (say) amcheck's bt_index_check() or
verify_heapam() functions to see how that's done. Not all C functions
use this built-in way of specifying a particular relation from SQL (a
few older ones really do just pass a string some other way), but it's
definitely considered the way to go.

Note that you can do stuff like this from SQL:

pg@regression:5432 [2999218]=# select 'pg_class'::regclass::oid;
  oid
───
 1,259
(1 row)

-- 
Peter Geoghegan




Re: Frequetly updated partial index leads to bloat on index for Postresql 11

2021-09-24 Thread Peter Geoghegan
On Thu, Mar 18, 2021 at 6:51 AM Tom Dearman  wrote:
> Is this a known issue, are they any ways around it, and if it is an
> issue is there a plan to fix it if a fix is possible?

On second thought I do think that the improvements to 14 will fix this
for you. See the test case here:

https://www.postgresql.org/message-id/flat/CAL9smLC%3DSxYiN7yZ4HDyk0RnZyXoP2vaHD-Vg1JskOEHyhMXug%40mail.gmail.com#e79eca5922789de828314e296fdcb82d

--
Peter Geoghegan




Re: Behavior change in PostgreSQL 14Beta3 or bug?

2021-09-06 Thread Peter Geoghegan
On Mon, Sep 6, 2021 at 7:52 AM Daniel Westermann (DWE) wrote:
> >Try running vacuum with index cleanup = on.
>
> Thank you, Peter

Thanks for testing!

-- 
Peter Geoghegan




Re: Behavior change in PostgreSQL 14Beta3 or bug?

2021-09-06 Thread Peter Geoghegan
On Mon, Sep 6, 2021 at 8:59 AM Tom Lane  wrote:
> Assuming that that choice was made appropriately, I think the advice you
> propose here will just cause people to waste lots of cycles on VACUUM
> runs that have only marginal effects.

Right. The advice that they should receive (if any) is to tune
autovacuum aggressively, and enable autovacuum log output. The log
output reports on whether or not the implementation applied the
optimization in each case.
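
For example (0 logs every autovacuum run, which is a reasonable
setting on most systems):

ALTER SYSTEM SET log_autovacuum_min_duration = 0;
SELECT pg_reload_conf();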

As I pointed out to Laurenz just now, users that care about index-only
scans are actually the big beneficiaries here. Now they can set
autovacuum_vacuum_insert_threshold very aggressively, without doing a
useless round of index vacuuming just because one inserting
transaction out of a million aborted. Once indexes are removed from
the equation (to the extent that that makes sense), each round of
vacuuming by autovacuum only needs to do work that is proportional to
the number of unset-in-vm heap pages.
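
An aggressive per-table setting of that kind might look like this
(table name and numbers are just illustrations):

ALTER TABLE insert_mostly SET (
  autovacuum_vacuum_insert_threshold = 10000,
  autovacuum_vacuum_insert_scale_factor = 0
);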

I believe that that trade-off makes a lot of sense. Autovacuum has
little chance of keeping anything like 100% of all pages set in the VM
anyway. But it can get a lot closer to it in some cases now.

-- 
Peter Geoghegan




Re: Behavior change in PostgreSQL 14Beta3 or bug?

2021-09-06 Thread Peter Geoghegan
On Mon, Sep 6, 2021 at 9:21 AM Laurenz Albe  wrote:
> #define BYPASS_THRESHOLD_PAGES  0.02    /* i.e. 2% of rel_pages */
>
> So up to an additional 2% of all pages can have the all-visible bit
> unset with "index_cleanup = auto".
>
> That is probably not worth worrying, right?

I don't think it's worth worrying about. I would say that, since I
chose the exact threshold myself. The threshold was a bit arbitrary,
of course.

Note that Daniel's example had a non-HOT update, even though it's the
kind of update that we imagine can use HOT (because it didn't modify
an indexed column). He could have ensured a HOT update by lowering
heap fill factor, but why should that be necessary if updates are rare
anyway?

The bypass-index-vacuuming feature may have had a bit of a messaging
problem. It was something we usually talked about as being about
skipping index vacuuming, because that's what it actually does.
However, the feature isn't really about doing less work during VACUUM.
It's actually about doing *more* work during VACUUM -- more useful
work. Especially setting visibility map bits. But also freezing. Now
you can very aggressively tune VACUUM to do these things more often,
with no fear of that being way too expensive because of index
vacuuming that has only marginal value.

The threshold is not so much about any one VACUUM operation -- you
have to think about the aggregate effect on the table over time. Most
individual tables will never have the new optimization kick in even
once, because the workload just couldn't possibly allow it -- the 2%
threshold is vastly exceeded every single time. The cases that it
actually applies to are pretty much insert-only tables, perhaps with
some HOT updates. 100% clean inserts are probably very rare in the
real world. I believe that it's *vastly* more likely that such a table
will have heap pages that are ~98%+ free of LP_DEAD line pointers
(i.e., the thing that BYPASS_THRESHOLD_PAGES applies to). To get
to 100% you cannot allow even one single insert transaction to abort
since the last VACUUM.

If you assume that BYPASS_THRESHOLD_PAGES is actually too low for your
workload (which is the opposite problem), then it doesn't matter very
much. The feature as coded should still have the desired effect of
skipping index vacuuming in *most* cases where it's unnecessary
(however you happen to define "unnecessary") -- the number of pages
with LP_DEAD items will naturally increase over time without index
vacuuming, until the threshold is crossed. Maybe unnecessary
index vacuuming will still take place in 1 out of 5 cases with the
feature. This is still much better than 5 out of 5. More importantly,
you can now aggressively tune vacuuming without noticeably increasing
the number of individual vacuums that still have the problem of
unnecessary index vacuuming. So if you go from 5 vacuums per day to 20
through tuning alone, the number of vacuum operations that do
unnecessary index vacuuming doesn't increase at all (except perhaps
due to rounding effects).

-- 
Peter Geoghegan




Re: Behavior change in PostgreSQL 14Beta3 or bug?

2021-09-06 Thread Peter Geoghegan
It's a non-hot update, and so there is a single dead index tuple. You're
seeing the new optimization that makes vacuum skip indexes in marginal
cases.

Try running vacuum with index cleanup = on.
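
That is (table name is a placeholder):

VACUUM (INDEX_CLEANUP ON, VERBOSE) mytable;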

Peter Geoghegan
(Sent from my phone)


Re: gen_random_uuid key collision

2021-09-02 Thread Peter Geoghegan
On Thu, Sep 2, 2021 at 4:48 PM Tom Lane  wrote:
> That is pretty weird, all right.  The only idea that comes to mind
> immediately is that maybe that table's pkey index is corrupt and needs
> to be reindexed.  This isn't a great theory, because I don't see why
> a corrupt index would lead to bogus unique-constraint errors rather
> than missed ones.  But at least it squares with the observation that
> only that table is having issues.

This is easy enough to check using the contrib/amcheck extension.

jesusthefrog could try this, and report back what they see:

CREATE EXTENSION IF NOT EXISTS amcheck;
SELECT bt_index_check('my_uuid_index', true);

If that doesn't show any errors, then there is a chance that this will:

SELECT bt_index_parent_check('my_uuid_index', true);

Note that the parent variant takes a disruptive lock that will block
write DML. You might prefer to just use the first query if this is
running in a production environment.

--
Peter Geoghegan




Re: Frequetly updated partial index leads to bloat on index for Postresql 11

2021-07-16 Thread Peter Geoghegan
On Fri, Jul 16, 2021 at 9:19 AM Tom Dearman  wrote:
> Other indexes do bloat, but the percentage bloat is a lot less

I have to imagine that the remaining problems have a lot to do with
the fact that this is a partial index -- the partial index naturally
gets vacuumed much less frequently than what would generally be
considered ideal for the index itself. In general VACUUM scheduling
makes the naive assumption that the indexes have the same needs as the
table, which is far from the case with this partial index, for your
workload. It's all of the specifics, taken together.

It sounds like this is a case where bottom-up index deletion won't
help -- it will only trigger in those indexes that are not "logically
modified" by updates. But you're logically modifying these values. Or
you're causing them to not need to be in the index anymore, by
modifying the predicate. But that won't trigger bottom-up deletion.
It's a bit like a delete, as far as the physical index structure is
concerned -- the index won't be eagerly modified by the executor.

The overall picture is that you cycle through all of the values in the
table, and no cleanup can take place other than plain VACUUM (at least
not to any significant extent). Although only a few hundred values are
logically required to be indexed by the partial index at any one time,
in practice no cleanup can run for long stretches of time (autovacuum
just doesn't know about cases like this). This is why the partial
index inevitably exceeds its theoretical pristine/high watermark size,
which is actually more than 1 page/8KB, but still probably a lot less
than what you actually see -- the partial index "falls through the
cracks", even with recent enhancements that made cleanup more eager
and more dynamic in certain other cases.

I am afraid that I don't have a good suggestion right now. I can think
of incremental improvements that would address this case, but for now
they're just ideas. Fundamentally, we need to get to the partial index
much more frequently than the other indexes, either within VACUUM or
within some other mechanism. For example a conservative implementation
of retail index tuple deletion might totally fix the issue. It does
very much look like a problem in these partial indexes in particular
-- it's quite possible that the other indexes won't grow at all due to
garbage index tuples, especially on Postgres 14.

--
Peter Geoghegan




Re: hot_standby_feedback implementation

2021-06-15 Thread Peter Geoghegan
On Tue, Jun 15, 2021 at 5:24 PM Christophe Pettus  wrote:
> When a replica sends a hot_standby_feedback message to the primary, does that 
> create an entry in the primary's lock table, or is it flagged to autovacuum 
> some other way?

It pretty much works by making the WAL sender process on the primary
look like it holds a snapshot that's as old as the oldest snapshot on
the replica.
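
You can see this on the primary directly -- the feedback surfaces as
the walsender's xmin horizon:

SELECT application_name, backend_xmin FROM pg_stat_replication;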

A replica can block VACUUM on the primary *directly* by holding a
table-level lock, though in practice only when somebody on the replica
acquires an AccessExclusiveLock -- which is presumably rare in
practice.
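
For example, on the primary, each WAL sender's backend_xmin shows the
horizon being held back on behalf of its standby (a minimal sketch;
backend_xmin will be NULL when feedback is off):

SELECT pid, application_name, backend_xmin
FROM pg_stat_replication;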

-- 
Peter Geoghegan




Re: Frequently updated partial index leads to bloat on index for PostgreSQL 11

2021-03-18 Thread Peter Geoghegan
On Thu, Mar 18, 2021 at 6:51 AM Tom Dearman  wrote:
> Is this a known issue, are they any ways around it, and if it is an
> issue is there a plan to fix it if a fix is possible?

It's not exactly a known issue per se, but I think the problem here is
related to the fact that you have lots of duplicates, which did
perform rather badly prior to Postgres 12. I bet that you'd benefit
from upgrading to Postgres 12, or especially to Postgres 13. The
B-Tree space management is a lot better now. (Actually, it'll be
better again in Postgres 14.)

-- 
Peter Geoghegan




Re: autovacuum verbose?

2021-01-21 Thread Peter Geoghegan
On Thu, Jan 21, 2021 at 12:55 PM Tommy Li  wrote:
> Is there any way to configure autovacuum to log the same information as 
> VACUUM VERBOSE?

No. Though there really should be.
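
For what it's worth, log_autovacuum_min_duration gets you related
(though far less detailed) information in the server log -- a partial
workaround, not a substitute for VERBOSE output:

ALTER SYSTEM SET log_autovacuum_min_duration = 0;
SELECT pg_reload_conf();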

-- 
Peter Geoghegan




Re: Need explanation on index size

2020-09-24 Thread Peter Geoghegan
On Thu, Sep 24, 2020 at 6:55 AM Guillaume Luchet  wrote:
> I don’t understand why after the update where I only update a non indexed 
> column the indexes size is growing. Is it something someone can explain ?

If you reduce the table fillfactor then these updates will all be HOT
updates. That will make the table larger initially, but leaving enough
space behind on the same heap pages for successor tuples makes it
possible to use HOT updates.
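
A minimal sketch of that approach (the table name is made up). Note
that a new fillfactor only affects subsequently-written pages, so a
table rewrite is needed for it to take full effect:

ALTER TABLE my_table SET (fillfactor = 70);
VACUUM FULL my_table;
-- later on, confirm that updates are now mostly HOT:
SELECT n_tup_upd, n_tup_hot_upd
FROM pg_stat_user_tables WHERE relname = 'my_table';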


-- 
Peter Geoghegan




Re: Index tuple deduplication limitations in pg13

2020-08-18 Thread Peter Geoghegan
On Tue, Aug 18, 2020 at 1:31 PM Matthias van de Meent
 wrote:
> Would this extra ordering not effectively be an extra tiebreaker in
> the ordering, applied before the TID? I do not know the full
> implications of that, but I believe that would not result in the
> limitations that you are mentioning.

You could probably do it that way, but again you'd end up with a lot
of new complexity, not to mention overhead that would have to be paid
by everyone. It would require code that supported the old way for
pg_upgrade's sake (even if the new ordering was added in Postgres 13),
which would also be hard to test. And it might defeat certain future
optimizations that rely on heap TID being the only tiebreaker. Having
two types of equality might have to bleed into the optimizer.

It's a question of engineering trade-offs. I don't think that it's worth it.

-- 
Peter Geoghegan




Re: Index tuple deduplication limitations in pg13

2020-08-18 Thread Peter Geoghegan
On Tue, Aug 18, 2020 at 11:52 AM Matthias van de Meent
 wrote:
> Given that the above could work, the current btree tuple ordering is
> not optimized for opclass-equal but datum image-distinct values:
> ordering of opclass-equal values is currently determined only by tid,
> with as an example current ordering ['0.0', '0', '0.00', '0', '0.0',
> '0']. It would be more optimized for deduplication if that was stored
> as e.g. ['0', '0', '0', '0.0', '0.0', '0.00'], which is why I
> suggested to add an ordering by the datum image before the tid
> ordering. Additionally, this extra ordering also prevents the problem
> of [0] by never attempting an insertion of non-equal image datums in a
> posting list of otherwise equal values, as it would be ordered either
> before or after the posting list, never inside the list.

Yeah, that would work, but at the cost of making numeric totally
unusable. Now you cannot rely on unique enforcement detecting that '0'
is a duplicate of '0.0'. In fact, even the most trivial use of the =
operator will be broken in the presence of different display scales.
It's a non-starter.

The numeric2 design that I sketched is a bit ugly, but I can see no
better way. A three-way posting list split (i.e. the other design that
you sketched) is a special case that is very hard to test, very
complicated, and of little value in the grand scheme of things.

-- 
Peter Geoghegan




Re: Index tuple deduplication limitations in pg13

2020-08-18 Thread Peter Geoghegan
On Tue, Aug 18, 2020 at 11:52 AM Matthias van de Meent
 wrote:
> Deduplication does not need to destroy semantic differences? 'equal'
> can (in my book) mean:
> - 'opclass-equal', that is the opclass returns true for an equality check
> - 'binary equal' or 'datum-equal' (? maybe incorrect term), that is
> the unprocessed on-disk representations (datum image is the right term
> I believe?) of the compared values are indistinguishable.
>
> Runs of 'binary equal' datums can be freely deduplicated [0] when found.

> [0]
> Inserting a row in a deduplicated index with in, with TID ntid, can
> encounter a posting list of a opclass-equal but not datum image-equal
> tuples where the lowest TID of the posting list is less than ntid, and
> ntid is less than the highest TID of the posting list. This would
> require a posting list split to accomodate the new tuples' index entry
> in order to not lose data.

But you can't do that easily, because it breaks subtle assumptions
about posting list splits and space utilization. In particular, it
means that you can no longer think of a posting list split as
rewriting an incoming new item such that you can more or less pretend
that there was no overlap in the first place -- code like _bt_split
and nbtsplitloc.c relies on this. Introducing new special cases to
nbtsplitloc.c is very unappealing.

More concretely, if you introduce a posting list split like this then
you need three copies of the key -- the original, the new, and a
second copy of the original. That's much more complicated.

--
Peter Geoghegan




Re: Index tuple deduplication limitations in pg13

2020-08-18 Thread Peter Geoghegan
On Tue, Aug 18, 2020 at 9:44 AM Peter Geoghegan  wrote:
> If we wanted to fix this for numeric, we'd have to invent a new
> numeric datatype (called numeric2, say). That probably isn't as hard
> as it sounds, since it could be part of the same B-Tree operator
> family as numeric. It could also be implicitly cast to numeric.

I forgot to say: numeric2 would be just like numeric, except in one
specific way: it wouldn't care about display scale. The user would be
giving up on display scale by choosing numeric2 over numeric. The "5
vs 5.000" information would always be lost by design, so there'd be
nothing for deduplication to break. Deduplication could then be
enabled.

-- 
Peter Geoghegan




Re: Index tuple deduplication limitations in pg13

2020-08-18 Thread Peter Geoghegan
On Mon, Aug 17, 2020 at 11:44 PM Matthias van de Meent
 wrote:
> But, if the ordering of operator-class equal tuples is already
> system-defined, could the physical ordering of index tuples in a btree
> (with deduplication enabled for "unsafe" opclasses) be updated from
> [index_columns, tid] to [index_columns,
> image_compare(non_datum_equal_columns), tid], giving a stable sorting
> of opclass-equal and image-equal values and enabling safe consistent
> deduplication?

The issue isn't the physical ordering. The issue is that we cannot
allow the implementation to destroy semantic differences among equal
datums. We avoid deduplication with cases where two equal datums are
visibly different. For example, it would not be okay if we forgot that
your numeric datum was originally input as '5.000', and output '5'
later on.
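
You can see both halves of this directly in SQL -- the values are
equal to the = operator, while the display scale is faithfully
preserved on output:

SELECT numeric '5.000' = numeric '5' AS opclass_equal;  -- t
SELECT numeric '5.000' AS a, numeric '5' AS b;          -- 5.000 vs 5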

If we wanted to fix this for numeric, we'd have to invent a new
numeric datatype (called numeric2, say). That probably isn't as hard
as it sounds, since it could be part of the same B-Tree operator
family as numeric. It could also be implicitly cast to numeric.

-- 
Peter Geoghegan




Re: 12.3 replicas falling over during WAL redo

2020-08-03 Thread Peter Geoghegan
On Mon, Aug 3, 2020 at 2:35 PM Alvaro Herrera  wrote:
> You can use pageinspect's page_header() function to obtain the page's
> LSN.  You can use dd to obtain the page from the file,
>
> dd if=16605/16613/60529051 bs=8192 count=1 seek=6501 of=/tmp/page.6501

Ben might find this approach to dumping out a single page image
easier, since it doesn't involve relfilenodes or filesystem files:

https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD#contrib.2Fpageinspect_page_dump
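
Roughly, that approach looks like this (relation name is a
placeholder -- block 6501 just matches the dd example):

SELECT lsn FROM page_header(get_raw_page('some_table', 6501));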

-- 
Peter Geoghegan




Re: 12.3 replicas falling over during WAL redo

2020-08-03 Thread Peter Geoghegan
On Sun, Aug 2, 2020 at 9:39 PM Kyotaro Horiguchi
 wrote:
> All of the cited log lines seem suggesting relation with deleted btree
> page items. As a possibility I can guess, that can happen if the pages
> were flushed out during a vacuum after the last checkpoint and
> full-page-writes didn't restored the page to the state before the
> index-item deletion happened(that is, if full_page_writes were set to
> off.). (If it found to be the cause, I'm not sure why that didn't
> happen on 9.5.)

There is also a Heap/HOT_UPDATE log line with similar errors.

-- 
Peter Geoghegan




Re: 12.2: Why do my Redo Logs disappear??

2020-06-08 Thread Peter Geoghegan
On Mon, Jun 8, 2020 at 5:17 PM Peter  wrote:
> Loosing a RedoLog is very bad, because there is no redundancy,
> loosing a single one of them makes the timeline disappear and it
> will only reappear after another Base Backup. Very very bad.

>In this case, it seems,  Postgres will delete the current log
>without archiving it. :(

I strongly suspect that you were hit by the bug fixed in commit
4e87c483. You should upgrade to Postgres 12.3 ASAP, to get that fix:

"Avoid premature recycling of WAL segments during crash recovery
(Jehan-Guillaume de Rorthais)

WAL segments that become ready to be archived during crash recovery
were potentially recycled without being archived."

Sorry that you were affected by this bug -- it really sucks.

-- 
Peter Geoghegan




Re: create index insist on 2 workers only

2020-04-24 Thread Peter Geoghegan
On Fri, Apr 24, 2020 at 7:32 AM Radoslav Nedyalkov  wrote:
> We 're rebuilding a big table which has set parallel_workers = 6
> system has
>  max_parallel_maintenance_workers | 6   | 
> /var/lib/pgsql/11/data/postgresql.sumup.conf
>  max_parallel_workers | 16  | 
> /var/lib/pgsql/11/data/postgresql.sumup.conf
>  max_parallel_workers_per_gather  | 4   | 
> /var/lib/pgsql/11/data/postgresql.sumup.conf
>
> Also session level on index restore there is
> set max_parallel_maintenance_workers = 6;
>
> Still we get only 2 parallel processes in a free of any other load system.
> It is postgres 11.7

Try increasing maintenance_work_mem from the default of 64MB. MWM
constrains the number of parallel workers used.
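
Something like this (index and table names are made up) gives the
build both the workers and the memory needed to actually use them:

SET maintenance_work_mem = '2GB';
SET max_parallel_maintenance_workers = 6;
CREATE INDEX my_idx ON my_table (my_col);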

-- 
Peter Geoghegan




Re: Querying an index's btree version

2020-03-11 Thread Peter Geoghegan
On Wed, Mar 11, 2020 at 2:13 PM Darren Lafreniere
 wrote:
> when you restore a DB from a backup, does the restored index use the old 
> format or the latest one?

If you use pg_restore, it uses the latest index format.

If you're using pg_upgrade, the version won't change unless and until
you REINDEX. This includes cases where you're running pg_upgrade
against a restored physical backup.

-- 
Peter Geoghegan




Re: Querying an index's btree version

2020-03-11 Thread Peter Geoghegan
On Wed, Mar 11, 2020 at 1:26 PM Darren Lafreniere
 wrote:
> We've read that PG 12 has improved btree index support, and that the latest 
> internal btree version was bumped from 3 to 4. Is it possible to query the 
> btree version that a particular index is using? We'd like to automatically 
> start a concurrent re-index if we detect any btree indexes are still on 
> version 3.

It's possible, but you have to install the superuser-only pageinspect
extension. Here is how you'd determine that an index called
'pg_aggregate_fnoid_index' is on version 4:

regression=# create extension pageinspect;
CREATE EXTENSION
regression=# select version from bt_metap('pg_aggregate_fnoid_index');
 version
---------
       4
(1 row)

-- 
Peter Geoghegan




Re: Query returns no results until REINDEX

2020-02-11 Thread Peter Geoghegan
On Sun, Feb 9, 2020 at 12:50 PM Colin Adler  wrote:
> Looks like it found something. I checked out the contrib/pageinspect docs but
> wasn't too sure what to run. Are incompatible libc versions causing btree
> corruption something you consider a bug? If it's something you'd like to look
> into further I can gladly send over the database files.

No, this is not considered a bug.

It's unfortunate that there is no built-in collation versioning
mechanism, or something like that -- that might have alerted you to
the problem before any real damage occurred. We have that for the ICU
collations, but it currently isn't possible to use ICU as the default
collation provider. You really have to go out of your way to use ICU
collations.

-- 
Peter Geoghegan




Re: Query returns no results until REINDEX

2020-02-07 Thread Peter Geoghegan
On Fri, Feb 7, 2020 at 3:52 PM Colin Adler  wrote:
> Seems to work now. My question is, is this something I should report to the
> maintainers?

I am one of the people that maintains the B-Tree code.

You didn't mention what version of Postgres you're using here. That
could be important. Please let us know. Mention the minor component of
the release version, too (i.e. say 12.2, not just 12).

> I took a snapshot of the data folder before the reindex in case it
> would be helpful. Is index corruption something that should be actively looked
> out for?

Yes -- look for corruption. If I had to guess, I'd say that this has
something to do with upgrading the operating system to use a
different, incompatible glibc. Or perhaps it has something to do with
streaming replication between machines with different glibc versions.

You should try running contrib/amcheck, which should be able to isolate
index corruption, and give you a specific complaint. You may then be
able to inspect the exact index page with the problem using
contrib/pageinspect. Something like this ought to do it on Postgres
11 or 12:

CREATE EXTENSION IF NOT EXISTS amcheck;
SELECT bt_index_check('my_index', true);

If that doesn't show any errors, then perhaps try this:

SELECT bt_index_parent_check('my_index', true);

If you're on Postgres 10, then you should leave out the second
argument, "true", since that version doesn't have the extra
heapallindexed check.

Let us know what you see.

--
Peter Geoghegan




Re: Rows violating Foreign key constraint exists

2019-11-29 Thread Peter Geoghegan
On Fri, Nov 29, 2019 at 7:23 AM Tom Lane  wrote:
> The most likely "corruption" explanation is something wrong with the
> indexes on the referenced and/or referencing column, causing rows to
> not be found when referential actions should have found them.  Random
> querying of the tables wouldn't necessarily expose that --- you'd need
> to be sure that your queries use the questionable indexes, and maybe
> even search for some of the specific rows that seem mis-indexed.

Or try using contrib/amcheck, which is available in Postgres 10.
Perhaps try the query here, modified to verify all B-Tree indexes (not
just those indexes in the pg_catalog schema):

https://www.postgresql.org/docs/10/amcheck.html
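
A sketch of such a query, adapted from the documentation's example by
dropping the pg_catalog restriction:

SELECT bt_index_check(c.oid), c.relname, c.relpages
FROM pg_index i
JOIN pg_opclass op ON i.indclass[0] = op.oid
JOIN pg_am am ON op.opcmethod = am.oid
JOIN pg_class c ON i.indexrelid = c.oid
WHERE am.amname = 'btree'
-- Don't check temp indexes, which may be from another session:
AND c.relpersistence != 't'
-- Function may throw an error when this is omitted:
AND c.relkind = 'i' AND i.indisready AND i.indisvalid
ORDER BY c.relpages DESC;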

--
Peter Geoghegan




Re: PG11 Parallel Thanks!!

2019-10-04 Thread Peter Geoghegan
On Thu, Oct 3, 2019 at 10:31 AM Jason Ralph
 wrote:
> The end of month process that we run at my company was a pg_dump and 
> pg_restore of 3 tables, these tables are around ~(400GB) each.  The entire 
> process on pg93 took 29 hours.
>
> The index creation portion of the restore on the target pg9.3 database took:
> 5) time: -15 hours -4 minute ((-54264 % 60)) seconds
>
> The index creation of the restore on the target db after pg11 upgrade on 
> source and target took:
>  5) time: -5 hours -7 minute ((-18434 % 60)) seconds
>
> We saved 10 hours!!

The sort code received many improvements over the years, really
starting in 9.5, and continuing in 9.6, 10 and 11. FWIW, I think that
that was probably the biggest factor here. Though parallel CREATE
INDEX will have helped as well.

-- 
Peter Geoghegan




Re: Clarification on the release notes of postgresql 12 regarding pg_upgrade

2019-10-04 Thread Peter Geoghegan
On Fri, Oct 4, 2019 at 9:09 AM Tom Lane  wrote:
> > You can't REINDEX safely regarding that note.
>
> Actually running into that problem is quite unlikely; and if you did
> hit it, it'd just mean that the REINDEX fails, not that you have any
> urgent problem to fix.  I'd encourage you to just go ahead and REINDEX,
> if you have indexes that could benefit from the other changes.

Right. It is hard to imagine an application that evolved to fully rely
on the previous slightly higher limit, and cannot tolerate a reduction
in the limit by only 8 bytes. The limit applies to a tuple *after*
TOAST compression has been applied.

-- 
Peter Geoghegan




Re: PG11 Parallel Thanks!!

2019-10-02 Thread Peter Geoghegan
On Wed, Oct 2, 2019 at 8:41 AM Jason Ralph  wrote:
> Since pg11 on both the target and source, the run time has decreased a lot, I 
> chalk it up to the parallel index creations in pg11 which was a very time 
> consuming process on pg9.3.
> The process has finished almost 10 hours earlier than pg93.  So thank you for 
> your hard work and dedication to this awesome piece of software.

How long did it take on 9.3?

I am the author of the parallel CREATE INDEX feature. It's good to get
feedback like this.

-- 
Peter Geoghegan




Re: Whan is it safe to mark a function PARALLEL SAFE?

2019-09-08 Thread Peter Geoghegan
On Sun, Sep 8, 2019 at 12:27 PM Tom Lane  wrote:
> > If an error is raised in one parallel worker, does this
> > cause the other parallel workers to be immediately terminated?
>
> I think not, though I didn't work on that code.  The error will
> be reported to the parent backend, which will cause it to fail
> the query ... but I think it just waits for the other worker
> children to exit first.  That's not something to rely on of course.
> Even if we don't make an attempt to cancel the other workers today
> we probably will in future.  But the cancel attempt would certainly
> be asynchronous, so I'm not sure how "immediate" you are worried
> about it being.

If workers call CHECK_FOR_INTERRUPTS() frequently, which they should,
then it should appear to users as if raising an error in one worker
kills everything almost immediately, or immediately.  For example, if
a parallel CREATE INDEX has a worker that raises a unique violation
error, that must work in a way that at least *appears* to be very
similar to what the user would get with a serial CREATE INDEX. (The
worst that can happen is that they'll very occasionally get two unique
violation errors instead of one, or something like that.)

That said, there are theoretical failures where it could take rather a
long time for the parent/leader to get the memo -- see
WaitForParallelWorkersToAttach() and its caller (note that anything
using a gather node is subject to the same kind of failure that
WaitForParallelWorkersToAttach() handles, even though they won't call
the function themselves). These failures (e.g. fork() failure) are
generally assumed to be rare to non-existent, though. Users will
surely be upset if parallel queries cannot be cancelled almost
immediately. If it happened with any regularity, somebody would have
complained by now.

As Tom said, it's hard to give a useful answer without more context --
how do you define "immediate"?

-- 
Peter Geoghegan




Re: Question from someone who is not trained in computer sciences

2019-09-05 Thread Peter Geoghegan
On Thu, Sep 5, 2019 at 2:00 PM Judith Lacoste  wrote:
> I think PostgreSQL is the solution for my needs, but I am not a 
> programmer/coder.

I don't think that it's restricted to people that are computer
scientists. At least, I certainly hope it isn't. SQL was originally
supposed to be something that is usable by domain experts/analysts,
rather than by computer people (that was at a time when the divide was
far larger than it is today).

> I plan to install the database on a server in the office. Me and my four 
> colleagues will occasionally connect to this database when we are working in 
> other locations (usually hospitals or universities). In such remote 
> locations, we often do not have internet/network, yet we still need to access 
> the database.  Currently, we use a system where a copy of the database lives 
> on each of our laptops.  We can access all the information in the database 
> despite being offline.  This local copy of the database is synchronized with 
> the server once network becomes available again.
>
> My question is whether or not such set up is possible with PostgreSQL?

Since you're a biologist, you may like to play around with the Mouse
Genome database using PostgreSQL:

http://www.informatics.jax.org/downloads/database_backups/

Any supported version of PostgreSQL will work. You'll need to use
pg_restore to restore the databases. Something like this will do it:

  pg_restore -d mgd  /path/to/mgd.postgres.dump

(I'm not sure what operating system you'll use -- something similar to
this invocation ought to work on Windows through cmd.exe, though.)
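
(One other detail: pg_restore invoked this way won't create the target
database for you, so create it first -- e.g. "createdb mgd".)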

From there, you can play around with the database using a GUI tool
such as pgAdmin. I sometimes use this database to test certain things,
since it's the only example of a living, breathing PostgreSQL database
that you can just download that I am aware of. Its schema is probably
not an exemplar of good design, but it does seem reasonably well
thought out. I'm not a domain expert, though, so I can't really be
sure how good it is.

The nice thing about this approach is that you can figure it out using
a "top down" approach, by first understanding how the database is used
in practical terms, and then filling in the details of how the
application that it backs uses the database.

Last I checked, restoring this database will take about 30GB of disk
space on top of the dump file itself.

-- 
Peter Geoghegan




Re: Corrupt index stopping autovacuum system wide

2019-07-18 Thread Peter Geoghegan
On Thu, Jul 18, 2019 at 9:06 AM Aaron Pelz  wrote:
> It's a simple btree expression on a geometry(Point,4326) , no expression no 
> partial no composite.

The cause of the corruption may be a bug in a Postgis B-Tree operator
class. I reported a bug in the Geography type that could lead to
corrupt B-Tree indexes (not Geometry):

https://trac.osgeo.org/postgis/ticket/3841

Though I also see what could be a comparable bug in Geometry:

https://trac.osgeo.org/postgis/ticket/3777

These bugs are from about 3 years ago. If I'm right you should be able
to isolate the bug using amcheck.

-- 
Peter Geoghegan




Re: Corrupt index stopping autovacuum system wide

2019-07-17 Thread Peter Geoghegan
On Wed, Jul 17, 2019 at 10:27 AM Peter Geoghegan  wrote:
> > It's possible that amcheck would have given you an accurate diagnosis
> > of the problem -- especially if you used bt_index_parent_check():
> >
> > https://www.postgresql.org/docs/current/amcheck.html
>
> BTW, be sure to use the 'heapallindexed' option with
> bt_index_parent_check() to detect missing downlinks, which is exactly
> the problem that VACUUM complained about.

Can you tell us more about this index? Can you share its definition
(i.e. what does \d show in psql)?

Is it an expression index, or a partial index? A composite? What
datatypes are indexed?

Thanks
--
Peter Geoghegan




Re: Corrupt index stopping autovacuum system wide

2019-07-17 Thread Peter Geoghegan
On Wed, Jul 17, 2019 at 11:43 AM Alvaro Herrera
 wrote:
> This might make things worse operationally, though.  If searches aren't
> failing but vacuum is, we'd break a production system that currently
> works.

If searches aren't failing and VACUUM works, then that's probably down
to dumb luck. The user's luck could change at any time (actually, it's
quite possible that the index is already giving wrong answers without
anybody realizing). That's not always true, of course -- you could
have an OOM condition in VACUUM, where it really does make sense to
retry. But it should be true for the category of errors where we
behave more aggressively than just giving up, such as the "failed to
re-find parent key" error Aaron noticed.

> Well, vacuum knows what index is being processed.  Maybe you're thinking
> that autovac can get an out-of-memory condition or something like that;
> perhaps we can limit the above only when an ERRCODE_DATA_CORRUPTED
> condition is reported (and make sure all such conditions do that.  As
> far as I remember we have a patch for this particular error to be
> reported as such.)

I don't think that it would be that hard to identify errors that
nbtree VACUUM could throw that clearly indicate corruption, without
any hope of the problem self-correcting without the DBA running a
REINDEX. There will be a small amount of gray area, perhaps, but
probably not enough to matter.

> > (c) automatically disabling constraint indexes seems less than desirable.
>
> Disabling them for writes, yeah.

I think that it's fair to say that all bets are off once you see the
"failed to re-find parent key" error, or any other such error that
indicates corruption. Admittedly it isn't 100% clear that disabling
constraint enforcement to unblock autovacuum for the whole cluster is
better than any available alternative; it's really hard to reason
about things when we already know that the database has corruption.

I think that it's okay that almost anything can break when somebody
creates an index on a non-immutable expression (including VACUUM),
provided that all the problems only affect the table with the broken
index. OTOH, taking down the entire Postgres cluster as an indirect
consequence of one person's ill-considered CREATE INDEX really sucks.
That distinction seems important to me.

-- 
Peter Geoghegan




Re: Corrupt index stopping autovacuum system wide

2019-07-17 Thread Peter Geoghegan
On Wed, Jul 17, 2019 at 10:27 AM Tom Lane  wrote:
> Right, you're eventually going to get to a forced shutdown if vacuum never
> succeeds on one table; no question that that's bad.

It occurs to me that we use operator class/insertion scankey
comparisons within page deletion, to re-find a leaf page that looks
like a candidate for deletion. Despite this, README.HOT claims:

"Standard vacuuming scans the indexes to ensure all such index entries
are removed, amortizing the index scan cost across as many dead tuples
as possible; this approach does not scale down well to the case of
reclaiming just a few tuples.  In principle one could recompute the
index keys and do standard index searches to find the index entries,
but this is risky in the presence of possibly-buggy user-defined
functions in functional indexes.  An allegedly immutable function that
in fact is not immutable might prevent us from re-finding an index
entry"

That probably wasn't the problem in Aaron's case, but it is worth
considering as a possibility.

> My concern here is
> that if we have blinders on to the extent of only processing that one
> table or DB, we're unnecessarily allowing bloat to occur in other tables,
> and causing that missed vacuuming work to pile up so that there's more of
> it to be done once the breakage is cleared.  If the DBA doesn't notice the
> problem until getting into a forced shutdown, that is going to extend his
> outage time --- and, in a really bad worst case, maybe make the difference
> between being able to recover at all and not.

The comment about "...any db at risk of Xid wraparound..." within
do_start_worker() hints at such a problem.

Maybe nbtree VACUUM should do something more aggressive than give up
when there is a "failed to re-find parent key" or similar condition.
Perhaps it would make more sense to make the index inactive (for some
value of "inactive") instead of just complaining. That might be the
least worst option, all things considered.

--
Peter Geoghegan




Re: Corrupt index stopping autovacuum system wide

2019-07-17 Thread Peter Geoghegan
On Wed, Jul 17, 2019 at 9:21 AM Peter Geoghegan  wrote:
> It's possible that amcheck would have given you an accurate diagnosis
> of the problem -- especially if you used bt_index_parent_check():
>
> https://www.postgresql.org/docs/current/amcheck.html

BTW, be sure to use the 'heapallindexed' option with
bt_index_parent_check() to detect missing downlinks, which is exactly
the problem that VACUUM complained about. Hopefully the problem will
be limited to the single index that you've already REINDEXed. In
theory the same problem could be hiding in other indexes, though I
don't consider that particularly likely.

Note that bt_index_parent_check() requires a lock on tables that
effectively blocks writes, but not reads, so verification may require
planning or coordination. bt_index_check() doesn't have any of these
problems, but also won't detect missing downlinks specifically.
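
Something like this (with a made-up index name):

SELECT bt_index_parent_check('my_index', true);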

-- 
Peter Geoghegan




Re: Corrupt index stopping autovacuum system wide

2019-07-17 Thread Peter Geoghegan
On Wed, Jul 17, 2019 at 9:57 AM Tom Lane  wrote:
> It looks that way, but how would a broken non-shared index have held up
> autovacuuming in other databases?  Maybe, as this one's xmin horizon
> got further and further behind, the launcher eventually stopped
> considering launching workers into any other databases?  That seems
> like a bad thing; it's postponing work that will need to be done
> eventually.

I don't know exactly how the launcher would behave offhand, but it's
clear that not being able to VACUUM one table in one database (because
it has a corrupt index) ultimately risks the availability of every
database in the cluster. Many installations receive little to no
supervision, so it may just be a matter of time there. That is
certainly a bad thing.

-- 
Peter Geoghegan




Re: Corrupt index stopping autovacuum system wide

2019-07-17 Thread Peter Geoghegan
On Wed, Jul 17, 2019 at 8:54 AM Aaron Pelz  wrote:
> To me it looks like a *single* corrupt index held up autovacuums across our 
> entire server, even other in other databases on the same server. Am I 
> interpreting this correctly?

Yes -- that is correct.

What PostgreSQL version are you on? Was this an INCLUDE index on PostgreSQL 11?

> Would love guidance on diagnosing this type of thing and strategies for 
> preventing it.

It's possible that amcheck would have given you an accurate diagnosis
of the problem -- especially if you used bt_index_parent_check():

https://www.postgresql.org/docs/current/amcheck.html
-- 
Peter Geoghegan




Re: after restore the size of the database is increased

2019-07-15 Thread Peter Geoghegan
On Mon, Jul 15, 2019 at 6:22 AM Luca Ferrari  wrote:
> What am I missing here?

Sometimes B-Tree indexes can be *larger* after a REINDEX (or after
they're recreated with a CREATE INDEX). It's not that common, but it
does happen. There isn't actually a very large size difference here,
so it seems worth comparing index size in detail.
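
A simple way to do that comparison, assuming a hypothetical table name:

SELECT indexrelname, pg_relation_size(indexrelid) AS bytes
FROM pg_stat_user_indexes
WHERE relname = 'my_table'
ORDER BY bytes DESC;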

-- 
Peter Geoghegan




Re: REINDEX : new parameter to preserve current average leaf density as new implicit FILLFACTOR

2019-07-09 Thread Peter Geoghegan
On Tue, Jul 9, 2019 at 9:04 PM Peter Geoghegan  wrote:
> ISTM that the simplest explanation here is that index fragmentation
> (and even index size) is a red herring, and the real issue is that
> you're suffering from problems similar to those that are described in
> these old threads:
>
> https://www.postgresql.org/message-id/flat/20160524173914.GA11880%40telsasoft.com
> https://www.postgresql.org/message-id/flat/520D6610.8040907%40emulex.com

I can imagine why you found you needed to reduce fillfactor to get
much of any benefit from a REINDEX. Page splits are inherently
expensive, for one thing. Also, in this specific scenario a succession
of page splits might hasten the index returning to having little
correlation with the underlying table within each large group of
duplicates. Splits on fillfactor 90 pages would make new space
available for future insertions on earlier duplicate pages, mixing old
and new rows together before long.

-- 
Peter Geoghegan




Re: REINDEX : new parameter to preserve current average leaf density as new implicit FILLFACTOR

2019-07-09 Thread Peter Geoghegan
On Tue, Jul 9, 2019 at 3:18 PM John Lumby  wrote:
> > Or, it could be that range scan performance benefitted from reduced 
> > fragmentation,
> >
>
> Yes,  I think so.

ISTM that the simplest explanation here is that index fragmentation
(and even index size) is a red herring, and the real issue is that
you're suffering from problems similar to those that are described in
these old threads:

https://www.postgresql.org/message-id/flat/20160524173914.GA11880%40telsasoft.com
https://www.postgresql.org/message-id/flat/520D6610.8040907%40emulex.com

There have been numerous reports from users with problems involving
low cardinality indexes that gradually became less correlated with the
underlying table over time. At least a couple of these users found
that a periodic REINDEX temporarily fixed the problem -- see the first
thread for an example. Postgres 12 maintains the heap/table sort order
among duplicates by treating heap TID as a tiebreaker column, which
may make REINDEXing totally unnecessary for you. It's harder to model
this issue because the problem with heap TID order will only be seen
when there is at least a moderate amount of churn.

-- 
Peter Geoghegan




Re: REINDEX : new parameter to preserve current average leaf density as new implicit FILLFACTOR

2019-07-09 Thread Peter Geoghegan
On Tue, Jul 9, 2019 at 12:29 PM John Lumby  wrote:
> I was not thinking of a new command,  just an extension of the existing 
> REINDEX
> which would apply a fillfactor equal to current average page density,
> by adding a preliminary step to sample that first.

That would be a very different thing to REINDEX no matter how you
spelt it, though. REINDEX creates a new index, from scratch, whereas
you're talking about restructuring what's already there.

> > I believe that this is a lot more important in systems that generally
> > use clustered indexes, such as MS SQL Server. This kind of
> > "fragmentation" isn't usually much of a problem when using Postgres.
> >
> We have found that, for an index which has both experienced large number of 
> page splits
> and whose table has a large number of dead tuples (despite autovacuum),
> REINDEX with FILLFACTOR set to current page_density does produce a 
> performance improvement,
> and also does reduce future growth in number of pages.I don't have 
> numbers to
> hand,  and in fact not sure if any catalog view or pgstattuple tells me about 
> the proportion
> of dead key-tids in the index itself (do you know of any source?) as opposed 
> to the table,
> but based on that recollection,  yes,   REINDEX can reduce fragmentation.

This could help the old "getting tired" behavior with many duplicates,
by making the free space available in earlier leaf pages (those
further to the left) that are full of duplicates -- the original
average space utilization may reflect a very uneven distribution of
free space overall. Or, it could be that range scan performance
benefitted from reduced fragmentation, because your workload happened
to be bottlenecked on large range scans. Though that seems unlikely.

I believe that the effect that you identified is real, but at a
minimum it's not clear why a REINDEX with a fillfactor to match the
original leaf space utilization helped. It would be fairly difficult
to figure it out for sure. If it was a problem with
duplicates/"getting tired", then I'd expect the new v12 code will help
a lot.

> However we did not run a VACUUM command first. Maybe if we had run VACUUM 
> instead of
> the REINDEX commands,   we might have obtained the same degree of 
> improvement,  I don't know.
> I think this was Tom's point earlier on in this thread.

It was. Tom's intuition about that matches my own, though I
acknowledge that the old behavior with duplicates muddies the waters.

> Correct me if I'm wrong but I believe whether an index is "clustered" or not 
> is not relevant for
> this discussion because the clustering in that context is referring to 
> ordering of the
> table pages,  not the index pages.

Right.

> I believe it is quite possible to have a perfectly
> "clustered" table whose clustering index is itself badly disorganized.

Technically the two things are separate metrics, so that is
theoretically possible, but it doesn't seem all that likely. It could
happen with lots of non-HOT updates, where all new index tuples relate
to the same logical row as some existing index tuple, causing many
page splits despite there being no real change in the logical contents
of the index. Even then, the table will itself lose much of its
original order, so the index will become "unclustered" as it becomes
fragmented.

-- 
Peter Geoghegan




Re: REINDEX : new parameter to preserve current average leaf density as new implicit FILLFACTOR

2019-07-09 Thread Peter Geoghegan
On Tue, Jul 9, 2019 at 11:27 AM John Lumby  wrote:
> And the point of the REINDEX at that point (below) is to remove dead tuple 
> keys-tids
> and  reorganize those split pages back into physical order without losing the 
> freespace.

VACUUM already removes the tuples, accounting for all overhead.

You are right that it would be possible for us to "defragment" the
pages, so that they'd be in sequential order on disk from the point of
view of a whole index scan -- this is what the "leaf_fragmentation"
statistic from pgstatindex() reports on. We could in principle come up
with a way of moving pages around, which would have some modest
benefit for certain types of queries (it wouldn't improve the
heap/index correlation, though, which is far more important). That
would either necessitate that the command acquire a disruptive lock on
the index (i.e. no writes, just like regular REINDEX), or that we
drastically rearchitect the B-Tree code to make it support this.
Neither of which seem particularly appealing.

I believe that this is a lot more important in systems that generally
use clustered indexes, such as MS SQL Server. This kind of
"fragmentation" isn't usually much of a problem when using Postgres.

-- 
Peter Geoghegan




Re: REINDEX : new parameter to preserve current average leaf density as new implicit FILLFACTOR

2019-07-09 Thread Peter Geoghegan
On Tue, Jul 9, 2019 at 10:31 AM John Lumby  wrote:
> Yes,   I see that. But surely "making splits occur less often" is a 
> desirable
> objective in itself, is it not? And I believe that a parameter to 
> preserve the "steady-state"
> density in high-traffic indexes would help achieve that goal,   wouldn't you 
> agree?

Anything that reliably reduces page splits without hurting space
utilization is well worthwhile. I can't see how what you describe
could have that effect, though. If you expect the leaf density to be
the same after a REINDEX, then why bother at all? There is no reason
to think that that will be more effective than simple vacuuming.

-- 
Peter Geoghegan




Re: REINDEX : new parameter to preserve current average leaf density as new implicit FILLFACTOR

2019-07-08 Thread Peter Geoghegan
On Mon, Jul 8, 2019 at 12:19 PM Peter Geoghegan  wrote:
> Well, you're still running autovacuum very aggressively here. It'll
> easily keep up when run on a relatively small table such as this.

Also, an exactly equal number of insertions and deletions is rather
likely to result in bloated indexes in a way that probably isn't
representative of many workloads. Even if the number of insertions was
only slightly greater than the number of deletions, then the overall
pattern would be one of continual growth, which is generally
considered much more interesting.

For far far more information on the topic than you want, see the paper
"B-Trees with Inserts and Deletes: Why Free-at-Empty Is Better Than
Merge-at-Half":

https://www.sciencedirect.com/science/article/pii/00229390020W

The salient point made by the paper is that good space utilization
rests on the assumption that there are fewer deletes than inserts,
though maybe only slightly fewer:

"The tendency of the utilization to remain near 69% can be explained
by the following arguments: If there are even just a few more inserts
than deletes, the B-tree will grow at the net insert rate (the rate of
inserts minus the rate of deletes)."

If the volume of data never grows past a certain point, then it's
unlikely that the space utilization is very important. This may even
be premature optimization.
-- 
Peter Geoghegan




Re: REINDEX : new parameter to preserve current average leaf density as new implicit FILLFACTOR

2019-07-08 Thread Peter Geoghegan
On Mon, Jul 8, 2019 at 12:10 PM John Lumby  wrote:
> Actually the test workload does not run any explicit VACUUM command,
> it relies on autovacuum with these settings
> (same settings for 9.4 and 12beta2)

> To correspond to your " more churn between each VACUUM"
> Would you then suggest increasing
> autovacuum_vacuum_cost_delay and/or  autovacuum_vacuum_scale_factor?

Well, you're still running autovacuum very aggressively here. It'll
easily keep up when run on a relatively small table such as this.

BTW, you should definitely run the latest point release of 9.4 -- not
9.4.6. You're missing years of bug fixes by sticking to such an old
point release, including some rather nasty ones -- 9.4.23 is the
current 9.4 point release. Actually, 9.4 is going to lose support this
year, as the oldest stable version that's currently supported by the
community.

-- 
Peter Geoghegan




Re: REINDEX : new parameter to preserve current average leaf density as new implicit FILLFACTOR

2019-07-08 Thread Peter Geoghegan
On Mon, Jul 8, 2019 at 9:23 AM John Lumby  wrote:
> Overall,  pg-12beta2 yielded a 6.7% reduction in sizes (total pages) of 
> indexes,   which was most noticable with the 6 non-unique ones.
> In fact the primary-key index was larger with pg-12.

The avg_leaf_density was actually higher for the primary key index, so
it looks like it really came out slightly ahead on v12. Perhaps you
didn't take deleted_pages into account -- there must be free space
that is reusable by the index that has yet to be reused. It would
probably make sense to subtract that across the board.

> Would you have expected better than 6.7%?

I don't think that a test case that runs VACUUM when there are only
4300 deletions and 4300 insertions is particularly realistic, in
general. You might see a larger difference if there was more churn
between each VACUUM run.

> Although a welcome improvement,  I think it is not enough to justify stopping 
> use of setting a lower explicit FILLFACTOR. Which then brings me back to  
> thinking there is a case for the subject of this thread,  an automatic way to 
> preserve density.

I don't think that such an option would make much sense. The "waves of
misery" paper is about smoothing out the frequency of page splits
following bulk loading and a CREATE INDEX. It is not about making
splits occur less often. It's well understood that a certain amount of
free space is the overhead of B-Tree indexes, albeit an overhead that
can be avoided in certain specific instances.

> And one question :
> I notice that in some pg-11 release,   a new config parameter appeared  :
>   vacuum_cleanup_index_scale_factor

> I have not researched this at all and nor did I set it to anything for my 
> pg-12beta2 run,  but it sounds as though maybe it could be relevant to 
> this kind of workload  -   Is that so?

You seem to be worried about keeping indexes as small as possible.
vacuum_cleanup_index_scale_factor won't help with that.

-- 
Peter Geoghegan




Re: Postgresql 12 Beta2 Crashes for any Insert/Update

2019-06-27 Thread Peter Geoghegan
On Thu, Jun 27, 2019 at 1:28 AM Mohsen Bande  wrote:
> i have a working database in PG 12 Beta 1. today i upgraded it to Beta 2. 
> everything goes fine and server is up and running. but trying to 
> INSERT/UPDATE anything, server crashes:

Is it possible for you to send us a stacktrace?

https://wiki.postgresql.org/wiki/Generating_a_stack_trace_of_a_PostgreSQL_backend

Thanks
-- 
Peter Geoghegan




Re: REINDEX : new parameter to preserve current average leaf density as new implicit FILLFACTOR

2019-06-26 Thread Peter Geoghegan
On Wed, Jun 26, 2019 at 8:05 AM John Lumby  wrote:
> > There is a very recent research paper that discusses the idea of
> > varying fillfactor with a view to ameliorating page splits:
> >
>
> Thanks,  I am chewing my way through that.  As you say,  it does address 
> exactly the issues I raised.
> Do you happen to know if their source-code is available somewhere?
> ( I did see their db is MS SQL Server but it still might provide some 
> portable ideas. )

It's just a research paper. It might never be implemented in any system.

The point of the paper is to make page splits occur at a steady rate
after REINDEX'ing -- not to eliminate or even reduce page splits.

> > I suspect that you might find that the enhancements to B-Tree indexes
> > that went into Postgres 12 would help with this workload, especially
> > if you notice that this happens with indexes that have a lot of duplicates
> >
>
> I had not noticed that,   thanks for pointing it out.  Yes ,  in my workload 
> most of the indexes in question are non-unique and some have very low key 
> card.I will try out the pg-12 beta when I get a chance.

It's easy to show problems with very low cardinality indexes in the
old code. You'll definitely notice a difference there.

> Is there a pdf or text version?

Just the talk slides:
https://www.pgcon.org/2019/schedule/attachments/518_nbtree-arch-pgcon.pdf

-- 
Peter Geoghegan




Re: REINDEX : new parameter to preserve current average leaf density as new implicit FILLFACTOR

2019-06-25 Thread Peter Geoghegan
On Tue, Jun 25, 2019 at 2:45 PM John Lumby  wrote:
> Background :   For some workloads involving high volume of 
> INSERT/UPDATE/DELETE,  It is sometimes beneficial
> to schedule regular REINDEX of high-activity indexes,   so as to improve 
> performance,  or restore performance levels back to what it was earlier,   by 
> removing dead keys etc.  This can result in the average page density of 
> these indexes fluctuating up and down in a saw-tooth fashion,  REINDEX 
> causing large increase in density (large drop in total number of pages) and 
> the workload gradually  decreasing density back to some "steady-state".

I suspect that you might find that the enhancements to B-Tree indexes
that went into Postgres 12 would help with this workload, especially
if you notice that this happens with indexes that have a lot of
duplicates. For the full background, you might take a look at my pgCon
talk:

https://youtu.be/p5RaATILoiE

Fair warning: this is a very technical talk.

Does it seem at all possible that you were affected by either the
issue with duplicates, or the issue that is addressed by the "split
after new tuple" optimization? They're both causes of index bloat that
VACUUM cannot usually prevent.

-- 
Peter Geoghegan




Re: REINDEX : new parameter to preserve current average leaf density as new implicit FILLFACTOR

2019-06-25 Thread Peter Geoghegan
On Tue, Jun 25, 2019 at 2:56 PM Tom Lane  wrote:
> > This would avoid the saw-tooth effect on number of pages,   and also reduce 
> > the number of index page-splits which occur during the period immediately 
> > following a REINDEX done with default FILLFACTOR of 90%.   In effect,  it 
> > lessens the need for the physical reorganization aspect of REINDEX and 
> > focusses more on the function of removing dead  keys.
>
> I think you've confused REINDEX with VACUUM.  It seems like a pretty poor
> substitute for that --- it's much more expensive and has worse locking
> requirements.

There is a very recent research paper that discusses the idea of
varying fillfactor with a view to ameliorating page splits:

https://btw.informatik.uni-rostock.de/download/tagungsband/B2-2.pdf

I found the paper to be fairly convincing. The general idea is to make
page splits occur at a steady rate following a REINDEX, rather than
having "waves" of page splits. This is quite different to changing
leaf fillfactor based on the observed leaf density, though. You can
already do that by looking at pgstattuple's pgstatindex() function,
which reports an avg_leaf_density for the index. Though I agree that
that's not likely to help matters. Apart from anything else, the
steady state of an index is embodied by more than just its
avg_leaf_density. Especially following the v12 enhancements to B-Tree
indexes.
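
For example (index name is made up):

CREATE EXTENSION IF NOT EXISTS pgstattuple;
SELECT avg_leaf_density, leaf_fragmentation
FROM pgstatindex('my_index');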

-- 
Peter Geoghegan




Re: checkpoints taking much longer than expected

2019-06-15 Thread Peter Geoghegan
On Sat, Jun 15, 2019 at 1:50 AM Tiemen Ruiten  wrote:
> During normal operation I don't mind that it takes a long time, but when 
> performing maintenance I want to be able to gracefully bring down the master 
> without long delays to promote one of the standby's.

Maybe an "immediate" mode shutdown is appropriate, then. That will
mean that the primary will need to go through crash recovery if and
when you bring it back up, though.

-- 
Peter Geoghegan




Re: Table partition with primary key in 11.3

2019-06-07 Thread Peter Geoghegan
On Fri, Jun 7, 2019 at 2:35 PM Alvaro Herrera  wrote:
> I envision this happening automatically -- you drop the partition, a
> persistent work item is registered, autovacuum takes care of it
> whenever.  The user doesn't have to do anything about it.

We don't have to agree on anything now, but I think that it's possible
that the page split thing will very effective. Perhaps even so
effective that it won't make much sense to vacuum global indexes just
because there is a pending dropped partition.

--
Peter Geoghegan




Re: Table partition with primary key in 11.3

2019-06-07 Thread Peter Geoghegan
On Fri, Jun 7, 2019 at 1:22 PM Alvaro Herrera  wrote:
> Somehow we ended up discussing this topic in a rather mistitled thread
> ... oh well :-)  (Nowadays I hesitate to change threads' subject lines,
> because gmail).

You can blame me for that, I think.

> > It occurs to me that we could add a code path to nbtree page splits,
> > that considered removing dropped partition tuples to avert a page
> > split. This would be a bit like the LP_DEAD/kill_prior_tuple thing.
> > Technically the space used by index tuples that point to a dropped
> > partitions wouldn't become reclaimable immediately, but it might not
> > matter with this optimization.
>
> This seems useful on the surface: you drop a partition, and slowly and
> incrementally any index items that point to it are removed by processes
> scanning the index.  You can't rely solely on this, though: as pointed
> out by Robert in the indirect index thread, doing this only means that
> non-scanned parts of the index to retain entries for arbitrary long,
> which is bad.  Also, this adds latency to client-connected processes.

Well, we don't have to rely on index scans to set the LP_DEAD bit in
this case. We probably wouldn't do that at all. Rather, we'd have the
page split code refer to a list of dropped partition numbers as
targets for killing immediately. Maybe we'd hint the number of
distinct partitions represented on the page, to make it a bit faster.

> Because you can't rely on that exclusively, and you want to reuse the
> partition ID eventually, you still need a cleanup process that removes
> those remaining index entries.  This cleanup process is a background
> process, so it doesn't affect latency.  I think it's not a good idea to
> add latency to clients in order to optimize a background process.

Ordinarily I would agree, but we're talking about something that takes
place at the point that we're just about to split the page, that will
probably make the page split unnecessary when we can reclaim as few as
one or two tuples. A page split is already a very expensive thing by
any measure, and something that can rarely be "undone", so avoiding
them entirely is very compelling. Delaying a split will often prevent
it altogether. We're already doing foreground processing, just by
having page splits at all.

Other DB systems that don't do much foreground processing will still
do a certain amount of it if that avoids a split in some cases --
"Modern B-Tree techniques" mentions this, and suggests quite a number
of ways that a split might be averted.

> This way, when a partition is dropped, we have to take the time to scan
> all global indexes; when they've been scanned we can remove the catalog
> entry, and at that point the partition ID becomes available to future
> partitions.

It seems worth recycling partition IDs, but it should be possible to
delay that for a very long time if necessary. Ideally, users wouldn't
have to bother with it when they have really huge global indexes.

> > The nbtree heap TID column and partition number column should probably
> > be a single varwidth column (not two separate columns), that is often
> > no wider than 6 bytes, but can be wider when there are many partitions
> > and/or very large partitions. That will be challenging, but it seems
> > like the right place to solve the problem. I think that I could make
> > that happen. Maybe this same representation could be used for all
> > nbtree indexes, not just global nbtree indexes.
>
> Maybe local nbtree indexes would have a partition ID of length 0, since
> that many bits are necessary to identify which table is pointed to by
> each index item.

Right -- special cases are best avoided here. In general, we'd push as
much of the new complexity as we can into this new TID-like table
identifier, while requiring it to work with our existing requirements
for TIDs, plus certain new requirements for global indexes (and maybe
other new requirements, such as relations that are larger than 35GB).
If the complexity is well-encapsulated, then it probably won't be too
bad. Access methods would have to be okay with varwidth table
identifiers, which is a big change, but they at least shouldn't have
to worry about anything else breaking. They'd probably have a
pg_attribute entry for the varwidth table identifier column, too (it
would be the last column in every nbtree index).

We'd expect a space-efficient representation with real-world
relations, one that at least matches what we get with heap TIDs today.
This isn't quite as hard as it sounds. You don't have to be Claude
Shannon to realize that it's kind of silly to reserve 16 bits for the
offset number component of a TID/ItemPointer. We need to continue to
support offset numbers that go that high, but the implementation would
optimize for the common case where offset numbers are less than 512
(or maybe less than 1024).

--
Peter Geoghegan




Re: Table partition with primary key in 11.3

2019-06-07 Thread Peter Geoghegan
On Fri, Jun 7, 2019 at 12:43 PM Alvaro Herrera  wrote:
> Well, "quickly" might mean within a week.  If it takes that long to
> fully remove a monthly partition to make that partition ID available to
> some future month's partition, that seems acceptable.  Blocking
> DROP/DETACH for one hour is certainly not acceptable.

I agree that synchronous clean-up of global indexes wouldn't make
sense there, and might not be very compelling in practice.

It occurs to me that we could add a code path to nbtree page splits,
that considered removing dropped partition tuples to avert a page
split. This would be a bit like the LP_DEAD/kill_prior_tuple thing.
Technically the space used by index tuples that point to a dropped
partitions wouldn't become reclaimable immediately, but it might not
matter with this optimization.

> If this scheme means that you can keep the partition identifiers stored
> in the index to, for instance, 10 bits (allowing for 1024 partitions to
> exist at any one time, including those in the process of being cleaned
> up) instead of having to expand to (say) 24 because that covers a couple
> of years of operation before having to recreate the index, it seems
> worthwhile.

I think that we should have no inherent limit on the number of
partitions available at once, on general principle. Limiting the
number of partitions is a design that probably has a lot of sharp
edges.

The nbtree heap TID column and partition number column should probably
be a single varwidth column (not two separate columns), that is often
no wider than 6 bytes, but can be wider when there are many partitions
and/or very large partitions. That will be challenging, but it seems
like the right place to solve the problem. I think that I could make
that happen. Maybe this same representation could be used for all
nbtree indexes, not just global nbtree indexes.

-- 
Peter Geoghegan




Re: Table partition with primary key in 11.3

2019-06-07 Thread Peter Geoghegan
On Fri, Jun 7, 2019 at 12:18 PM Alvaro Herrera  wrote:
> I was thinking of asynchonously cleaning it up rather than blocking
> DROP/DETACH ... which means you need to keep state somewhere.  I don't
> think blocking DROP/DETACH is valuable -- a global index that blocks
> DROP/DETACH until the index is clean serves no useful purpose.  (You
> could think of a multi-step approach with internal transaction commits,
> similar to CIC, but you still need a plan to clean that up in case the
> server crashes during that operation.)

The advantages of synchronous clean-up of global indexes when DROP'ing
a partition are that you can recycle the partition number (or whatever
we end up calling it) immediately and predictably, and you can reuse
the space in indexes occupied by keys from the dropped partition
immediately and predictably. That seems less useful than asynchronous
processing on average, certainly, but those are still real advantages.
You seemed to be particularly concerned about quickly recycling
partition numbers when we drop a partition.

I hope that we can come up with a very efficient on-disk
representation for global index tuples, where only the bare minimum
amount of space is used for partition numbers. Maybe it won't matter
that much if partition numbers cannot be recycled due to this
asynchronous processing.

-- 
Peter Geoghegan




Re: Table partition with primary key in 11.3

2019-06-07 Thread Peter Geoghegan
On Fri, Jun 7, 2019 at 9:10 AM Alvaro Herrera  wrote:
> I think vacuuming for global indexes is somewhat challenging as well :-)
> Maybe not as much as for indirect indexes, that's true.
>
> In order for it to be sustainable, I think you'll want to reuse
> partition identifiers when the partitions are dropped/detached, which
> means that you need a way to ensure that index entries to those
> partitions are removed from all indexes.

I'm not so sure about that. I see your point, but I think that you can
also make the opposite argument. That is, you can make a good case for
asynchronously cleaning up the dead entries that point to a dropped
partition (probably within VACUUM). Perhaps we should offer *both* as
options.

--
Peter Geoghegan




Re: Table partition with primary key in 11.3

2019-06-07 Thread Peter Geoghegan
On Thu, Jun 6, 2019 at 3:00 PM David Rowley wrote:
> You may already be aware, but another use case for such variable-width
> identifiers was with indirect indexes as discussed in [1]

Right. I went with global indexes because indirect indexes are
probably a lot more risky as a project. I'd be particularly concerned
about the complexity of VACUUM there, whereas that doesn't seem all
that bad in the case of global indexes.

-- 
Peter Geoghegan




Re: Table partition with primary key in 11.3

2019-06-06 Thread Peter Geoghegan
On Thu, Jun 6, 2019 at 10:03 AM Alvaro Herrera  wrote:
> If you are saying that you think that Postgres should support primary
> keys that don't necessarily overlap partition keys, then I agree with
> you.  Please send a patch to implement that capability.

FWIW, I could probably be convinced to work on the nbtree parts of
global indexes if there was a reasonably worked out design that had
some momentum behind it. I would at least need to know what the
partition identifiers will need to look like in indexes. Andres has
suggested that I work on variable-width table identifiers in nbtree
for the benefit of pluggable storage engines, but I don't have much
enthusiasm for the idea of doing that without delivering a clear
benefit to users in the same release.

It seems pretty obvious to me that the right general approach for
global indexes is to make the partition identifier a tiebreaker column
that comes right before the heap TID tiebreaker column (which would be
per-partition). It probably needs to be variable-width, so that the
overhead isn't too bad most of the time. I don't think that it's
necessary to remove the restriction on the size of relations for
global indexes, though maybe global indexes could focus our attention
on that problem.

-- 
Peter Geoghegan




Re: Questions about btree_gin vs btree_gist for low cardinality columns

2019-06-03 Thread Peter Geoghegan
On Sun, Jun 2, 2019 at 4:07 PM Tom Lane  wrote:
> Anyway, I said to Jeremy in the hallway that it might not be that
> hard to bolt IOS support onto GIN for cases where the opclass is
> a non-subdividing one, but after looking at the code I'm less sure
> about that.  GIN hasn't even got an "amgettuple" code path, just
> "amgetbitmap", and a big part of the reason why is the need to merge
> results from the fastupdate pending list with results from the main
> index area.  Not sure how we could deal with that.

I suspect that GIN also avoids several other problems by only offering an
"amgetbitmap", and not an "amgettuple". For example, it doesn't have
to worry about things like numeric display scale, where a datum is
substantively different to another datum, while still being equal
according to opclass semantics (this is an example that I end up
citing in many discussions about indexing). I bet that there are a few
more of those beyond those two that I haven't thought about.
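
The display scale problem is easy to demonstrate:

    SELECT '1.0'::numeric = '1.000'::numeric AS equal_per_opclass,  -- t
           '1.0'::numeric::text   AS one_output,                    -- 1.0
           '1.000'::numeric::text AS other_output;                  -- 1.000

The two datums are equal as far as the opclass is concerned, but an
index-only scan that returned the indexed copy could print 1.0 where a
heap fetch would have printed 1.000.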

> Anyway, the larger point here is that right now btree_gin is just a quick
> hack, and it seems like it might be worth putting some more effort into
> it, because the addition of duplicate-compression changes the calculus
> for whether it's useful.

There was also discussion about making nbtree support deduplication
during the hallway track. Jim Finnerty reminded me that there is a
patch from Anastasia that did deduplication in nbtree that didn't go
anywhere. Heikki independently talked about the possibility that he
would work on this project in the next release, without being prompted
by me.

I think that the fact that nbtree sorts duplicate entries in heap TID
order these days makes that worth looking into again. We can use
something like GIN's varbyte encoding process to compress duplicates
effectively. A lot of the problems (the numeric display scale problem,
pg_upgrade) can be avoided by defining deduplication as something that
happens on a best-effort basis. This is not the case within GIN, where
it's impossible for the main entry tree to have duplicates without at
least storing them in a posting list.
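
The duplicate-compression gap is easy to observe with a low cardinality
column (illustrative; the contrast is starkest on versions whose nbtree
lacks deduplication, and it assumes btree_gin is installed):

    CREATE EXTENSION IF NOT EXISTS btree_gin;
    CREATE TEMP TABLE dup_demo AS
        SELECT (g % 10) AS val FROM generate_series(1, 100000) g;
    CREATE INDEX dup_btree ON dup_demo USING btree (val);
    CREATE INDEX dup_gin   ON dup_demo USING gin (val);

    -- GIN stores each distinct key once, with a compressed posting
    -- list, so it should come out far smaller than the btree here:
    SELECT pg_size_pretty(pg_relation_size('dup_btree')) AS btree_size,
           pg_size_pretty(pg_relation_size('dup_gin'))   AS gin_size;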

-- 
Peter Geoghegan




Re: Upgrading locale issues

2019-05-06 Thread Peter Geoghegan
On Wed, May 1, 2019 at 3:09 PM Thomas Munro  wrote:
> As discussed over on -hackers[1], I think it's worth pursuing that
> though.  FWIW I've proposed locale versioning for FreeBSD's libc[2].
> The reason I haven't gone further with that yet even though the code
> change has been accepted in principle by FreeBSD reviewers is because
> I got stuck on the question of how exactly to model the versions.  If,
> say, just Turkish changes, I don't want to be rebuilding my French
> indexes, which means that I don't think you can use the CLDR version
> string.

The ICU versions can handle that, though. Importantly, ICU decouples
implementation details from actual versioning.

I must say that I am not enthused about the idea of trying to get libc
people of any variety on board. I don't have an objection to it if it
can work for FreeBSD, but I don't think it can scale. ICU is built
around a culture that takes our concerns seriously already, which is
what it boils down to. Also, we can imagine a package manager taking
it upon itself to vendor its own ICU, with platform-specific
guarantees around stability. That seems like a nice option to have, at
least.

> There is also the question of how PostgreSQL should model versions,
> and as I've argued in [1], I think we should track them at the level
> of database object dependencies.

I think you're probably right about that.

> I'm hoping to reopen this can of worms for PostgreSQL 13 (and the
> corresponding support could in theory be in FreeBSD 13... coincidence,
> or a sign!?)

Maybe we should do what Oracle did, and call it PostgreSQL 18c
instead. Actually, any number that isn't of interest to numerologists
will do.

> Unfortunately you can't use ICU collations as a database default yet
> (though there was some WIP code[3]), so ICU only saves you from
> versioning problems if you explicitly set collations for columns or
> expressions, and even then the version tracking is currently just a
> warning that you clear manually with a command, not a mechanism that
> really tracks which database objects were last rebuilt/validated with
> a given version.

Yes, that does seem like a big remaining weakness.
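
The manual clearing you mention is also pretty blunt: it's a
per-collation command that just records the currently reported version,
without validating or rebuilding anything that depends on the
collation, along the lines of:

    ALTER COLLATION "fr-x-icu" REFRESH VERSION;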

-- 
Peter Geoghegan




Re: Upgrading locale issues

2019-05-01 Thread Peter Geoghegan
On Mon, Apr 29, 2019 at 7:45 AM rihad  wrote:
> Hi. Today we run pg_ctl promote on a slave server (10.7) and started
> using it as a master. The OS also got upgraded FreeBSD 10.4 -> FreeBSD
> 11.2. And you guessed it, most varchar indexes got corrupted because
> system local changed in subtle ways. So I created the extension amcheck
> and reindexed all bad indexes one by one. Is there any way to prevent
> such things in the future? Will switching to ICU fix all such issues?

Not necessarily, but it will detect the incompatibility more or less
automatically, making it far more likely that the problem will be
caught before it does any harm. ICU versions collations, giving
Postgres a way to reason about their compatibility over time. The libc
collations are not versioned, though (at least not in any standard way
that Postgres can take advantage of).
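
For example, on 10+ you can see the version each ICU collation had when
it was created, which is what Postgres compares against the runtime ICU
library in order to raise a warning:

    SELECT collname, collversion
    FROM pg_collation
    WHERE collprovider = 'i'   -- 'i' means ICU
    ORDER BY collname;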

> The problem with it is that ICU collations are absent in pg_collation,
> initdb should be run to create them, but pg_basebackup only runs on an
> empty base directory, so I couldn't run initdb + pg_basebackup to
> prepare the replica server. I believe I can run the create collation
> command manually, but what would it look like for en-x-icu?

It is safe to call pg_import_system_collations() directly, which is
all that initdb does. This is documented, so you wouldn't be relying
on a hack.
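
For example, to populate pg_collation the way initdb would have, or to
create just the one collation you need:

    -- Import everything the ICU library (and libc) can provide:
    SELECT pg_import_system_collations('pg_catalog');

    -- Or create a single ICU collation by hand:
    CREATE COLLATION "en-x-icu" (provider = icu, locale = 'en');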

-- 
Peter Geoghegan




Re: Pg analog to DBCC CCHECKDB?

2019-04-05 Thread Peter Geoghegan
On Fri, Apr 5, 2019 at 7:33 AM Ron  wrote:
> In 9.6, does such a thing exist?  (We just restored a VM from snapshot and I
> want to verify the cluster sanity.)

amcheck is available for versions 9.4+, though it only appears in
contrib in Postgres 10. There are both yum and deb packages available.
See:

https://github.com/petergeoghegan/amcheck
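
Basic usage looks like this (the index name is a placeholder):

    CREATE EXTENSION amcheck;

    -- Verify invariants within the B-Tree itself:
    SELECT bt_index_check('some_index'::regclass);

    -- A more thorough check that also verifies parent/child
    -- relationships, at the cost of heavier locking:
    SELECT bt_index_parent_check('some_index'::regclass);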

-- 
Peter Geoghegan




Re: CREATE COLLATION to match pg_collation data

2019-01-12 Thread Peter Geoghegan
On Sat, Jan 12, 2019 at 7:13 AM Tom Lane  wrote:
> That's an ICU collation, so you're out of luck: there is no ICU
> support in 9.6.

FWIW, there is some form of FreeBSD Postgres support for ICU that
predates the officially supported ICU feature. The FreeBSD Postgres
packages have been built with ICU support for many years -- they
modify the source code minimally to make this work. It may well still
be impossible to use "az-x-icu" on a FreeBSD installation of 9.6,
though.

-- 
Peter Geoghegan



Re: Code of Conduct plan

2018-09-14 Thread Peter Geoghegan
On Fri, Sep 14, 2018 at 11:06 AM, Dimitri Maziuk  wrote:
> Personally I would like that. Others might prefer an invitation to
> unsubscribe or forever hold their peace, I could live with that too, but
> I believe explicit opt-ins are preferable to opt-outs.

I think that it's a legitimate position to be opposed to a CoC like
this. I also think it's legitimate to feel so strongly about it, on
philosophical or political grounds, that you are compelled to avoid
participating while subject to the CoC. FWIW, the latter position
seems rather extreme to me personally, but I still respect it.

In all sincerity, if you're compelled to walk away from participating
in mailing list discussions on a point of principle, then I wish you
well. That is your right.

-- 
Peter Geoghegan



Re: Code of Conduct plan

2018-09-14 Thread Peter Geoghegan
On Fri, Sep 14, 2018 at 10:31 AM, Dimitri Maziuk  wrote:
> So let me get this straight: you want to have a "sanctioned" way to deny
> people access to postgresql community support channel?

Yes.

> "Because
> somebody who may or may not be the same person, allegedly said something
> somewhere that some other tweet disagreed with on faceplant"?
>
> Great plan if you do for-pay postgresql support for the living.

You can make your own conclusions about my motivations, just as I'll
make my own conclusions about yours. I'm not going to engage with you
on either, though.

-- 
Peter Geoghegan



Re: Code of Conduct plan

2018-09-14 Thread Peter Geoghegan
On Fri, Sep 14, 2018 at 7:19 AM, Joshua D. Drake  wrote:
> Sure and that is unfortunate but isn't it up to the individual to deal with
> it through appropriate channels for whatever platform they are on? All of
> these platforms are:
>
> 1. Voluntary to use
> 2. Have their own Terms of Use and complaint departments
> 3. If it is abuse there are laws
>
> I agree that within Postgresql.org we must have a professional code of
> conduct but the idea that an arbitrary committee appointed by an unelected
> board can decide the fate of a community member based on actions outside of
> the community is a bit authoritarian don't you think?

The choice of the committee members is hardly arbitrary. Having
committee members be appointed by core is more or less consistent with
how the community has always dealt with disciplinary issues. The
criteria used by core were discussed quite openly. While the risk that
the committee will yield their power in an "authoritarian" way seems
very small, it cannot be ruled out entirely. In fact, it hasn't been
ruled out by the draft CoC itself.

No CoC can possibly provide for every conceivable situation. Somebody
has to interpret the rules, and it has to be possible to impose
sanctions when the CoC is violated -- otherwise, what's the point?
There are several checks and balances in place, and I for one have
confidence in the process as outlined. It's imperfect, but quite a lot
better than either the status quo, or a platitude about inclusivity.

-- 
Peter Geoghegan


