Re: PG17 optimizations to vacuum
On Mon, Sep 2, 2024 at 4:58 PM Peter Geoghegan wrote:
> On Mon, Sep 2, 2024 at 4:35 PM Pavel Luzanov wrote:
> > If it helps, without creating index on id column, the numbers will be
> > much closer:
>
> Yes, avoiding all index vacuuming seems useful. It makes the test case
> cleaner, since we don't have to think about the variability from the
> TIDStore work (and from index vacuuming more generally).

It just occurred to me that earlier versions don't have the HEAP_PAGE_PRUNE_MARK_UNUSED_NOW optimization added by commit c120550edb. Postgres 17 does have that optimization, though, so it should easily be able to write far fewer WAL records than earlier versions. And yet your revised no-indexes test case seems to show that Postgres 17 is doing slightly worse by that measure (and by others).

--
Peter Geoghegan
Re: PG17 optimizations to vacuum
On Mon, Sep 2, 2024 at 4:58 PM Heikki Linnakangas wrote:
> Do you have any non-default settings? "select name,
> current_setting(name), source from pg_settings where setting <>
> boot_val;" would show that.

What about page checksums? One simple explanation is that we're writing extra FPIs to set hint bits. But that explanation only works if you assume that page-level checksums are in use (or that wal_log_hints is turned on).

--
Peter Geoghegan
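To rule that explanation in or out, it may help to check both settings directly on the instance that produced the numbers (a quick sketch):

```sql
-- Are page-level data checksums enabled? (fixed at initdb time)
SHOW data_checksums;

-- Is wal_log_hints on? Either setting makes hint-bit-only changes
-- eligible to generate full-page images
SHOW wal_log_hints;

-- Cluster-wide FPI and record counters, for a before/after comparison
SELECT stats_reset, wal_records, wal_fpi FROM pg_stat_wal;
```

(pg_stat_wal is available on 14+; on older versions pg_waldump's --stats output gives the equivalent breakdown.)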
Re: PG17 optimizations to vacuum
On Mon, Sep 2, 2024 at 4:35 PM Pavel Luzanov wrote:
> If it helps, without creating index on id column, the numbers will be
> much closer:

Yes, avoiding all index vacuuming seems useful. It makes the test case cleaner, since we don't have to think about the variability from the TIDStore work (and from index vacuuming more generally).

> master:
> WAL usage: 78502 records, 22090 full page images, 196215494 bytes
>
> v15:
> WAL usage: 77437 records, 30872 full page images, 152080268 bytes

To be clear, I wouldn't be all that surprised if some issue with alignment padding and/or the new WAL format made master write more bytes of WAL total than on earlier versions. Small inefficiencies like that were always a possibility. That's not the problem that I'm principally concerned about right now.

I'm most concerned about the simple fact that we're writing more individual WAL records than on previous versions, despite the fact that VACUUM does exactly the same amount of useful pruning and freezing work across versions. How could a patch that merged together the previous PRUNE and VACUUM WAL record types into a single record type actually cause an increase in the number of WAL records written? I'd have thought that that was simply impossible.

--
Peter Geoghegan
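For anyone wanting to reproduce the comparison, a minimal no-indexes test along these lines should work (my own sketch of the scenario under discussion; the table shape and row count are illustrative, not Pavel's exact script):

```sql
-- No index on id, so VACUUM does no index vacuuming at all
CREATE TABLE t (id integer, x integer) WITH (autovacuum_enabled = off);
INSERT INTO t SELECT g, g FROM generate_series(1, 1000000) g;
UPDATE t SET x = x + 1;   -- leave dead tuples on every heap page
VACUUM (VERBOSE) t;       -- compare the "WAL usage:" line across versions
```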
Re: PG17 optimizations to vacuum
On Mon, Sep 2, 2024 at 3:23 PM Melanie Plageman wrote:
> This is roughly what I get for records by vacuum. Note that I prefixed
> VACUUM with BTREE on master to indicate those records are from index
> vacuuming. By default the desc routine for records emitted by index
> vacuuming prints just VACUUM -- perhaps it would be better to prefix
> it.
>
> Note that these add up to almost the same thing. I don't know yet why
> the number of PRUNE_VACUUM_SCAN records is different than PRUNE on 16.

That is indeed surprising, given that Pavel's VACUUM VERBOSE output indicates that the number of heap tuples deleted is identical across versions. The output also strongly suggests that the same heap pages are pruned on both versions, since the "index scan needed: " line is also identical across versions.

Might it be that the extra PRUNE_VACUUM_SCAN records originated in pages that only contained existing LP_UNUSED items when scanned by VACUUM?

--
Peter Geoghegan
Re: PG17 optimizations to vacuum
On Mon, Sep 2, 2024 at 1:29 PM Melanie Plageman wrote:
> I'll investigate more tomorrow, but based on my initial investigation,
> there appears to be some interaction related to how much of the
> relation is in shared buffers after creating the table and updating
> it. If you set shared_buffers sufficiently high and prewarm the table
> after the update, master has fewer WAL records reported by vacuum
> verbose.

Fewer of what specific kind of WAL record? All of the details about useful work done by VACUUM were identical across versions. It was only the details related to WAL, buffers, and CPU time that changed. Perhaps I'm not thinking of something obvious.

Maybe it's extra VISIBILITY records? But I'd expect the number of VISIBILITY records to match the number of pages frozen, given these particulars. VACUUM VERBOSE at least shows that that hasn't changed.

--
Peter Geoghegan
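The prewarming step Melanie describes can be done with the pg_prewarm extension (a sketch, assuming the test table is named t):

```sql
CREATE EXTENSION IF NOT EXISTS pg_prewarm;
-- Load the whole table into shared buffers before running VACUUM,
-- so buffer-miss effects don't pollute the WAL/buffers comparison
SELECT pg_prewarm('t');
```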
Re: PG17 optimizations to vacuum
On Sun, Sep 1, 2024 at 5:44 PM Pavel Luzanov wrote:
> I see a perfectly working TID-store optimization.
> With reduced maintenance_work_mem it used only one 'vacuuming indexes'
> phase instead of 21 in v16.
> But I also expected to see a reduction in the number of WAL records
> and the total size of the WAL. Instead, WAL numbers have significantly
> degraded.
>
> What am I doing wrong?

That does seem weird. CC'ing the authors of the relevant VACUUM enhancements.

--
Peter Geoghegan
Re: Query performance going from Oracle to Postgres
On Thu, Sep 7, 2023 at 3:48 AM David Rowley wrote:
> On Thu, 7 Sept 2023 at 19:17, Peter Geoghegan wrote:
> > It seems likely that the problem here is that some of the predicates
> > appear as so-called "Filter:" conditions, as opposed to true index
> > quals.
>
> hmm, if that were true we'd see "Rows Removed by Filter" in the
> explain analyze.

That's why I hedged, with "seems likely". The problem with using filter conditions rather than true index quals isn't limited to the problem of extra heap accesses. It happened to be convenient to make my point that way, but that isn't particularly fundamental here. I deliberately chose to make my example involve an index-only scan (that doesn't require any heap accesses) for this reason.

> I think all that's going on is that each tuple is on a different page
> and the heap accesses are just causing many buffers to be accessed.

This index is an absolute monstrosity. I find it far easier to believe that the real explanation is the one that Steve intuited: that there is an issue with the way that the physical data structures (which are more or less comparable in both systems) are accessed in Postgres.

The index in question ("history_event_display_timesta_prism_guid_display_timestamp_idx1") has certain columns that are omitted from the query. These columns nevertheless appear between other columns that the query filters on. The leading two columns ("prism_guid" and "display_timestamp") are made into index quals by Postgres, but the next index column after those two ("unique_lookup_key") is omitted by the query, and so isn't an index qual. In fact *four* columns are omitted after that one. But, the very-low-order index column "product_sid" *does* appear in the query, and so also appears as Postgres index quals.

There is every chance that the true underlying explanation is that Oracle is able to skip over significant parts of the index structure dynamically.
In fact I'm practically certain that that's the case, since the "product_sid" column appears as an "access predicate", rather than as a "filter predicate". These terms are explained here:

https://use-the-index-luke.com/sql/explain-plan/oracle/filter-predicates
https://use-the-index-luke.com/sql/explain-plan/postgresql/filter-predicates

How could "product_sid" be used as an "access predicate" given the omitted index columns? It seems very likely that parts of the index can be skipped in Oracle, but not in Postgres -- at least not yet.

Like Markus Winand, I think that it's a real problem that EXPLAIN doesn't yet expose the difference between access predicates and filter predicates. Some of the index quals shown by EXPLAIN are marked SK_BT_REQFWD/SK_BT_REQBKWD by nbtree (meaning that they're what Oracle calls "access predicates"), while others are not (meaning that they're what Oracle calls "filter predicates"). That issue makes it far harder to spot these sorts of problems.

> It seems to me that an IOS would likely fix that problem which is why I
> suggested count(*) instead of count()

Probably, but why should we have to use an index-only scan? And what about my example, that already used one, and still showed a big disparity where there was no inherent reason why there had to be? My example didn't involve jumping to another part of the index because the problem seems to be more general than that.

--
Peter Geoghegan
Re: Query performance going from Oracle to Postgres
On Wed, Sep 6, 2023 at 1:07 PM Dirschel, Steve wrote:
> Oracle will find the same 332 rows using the same index but in Oracle it only
> does 20 logical reads. I thought maybe the index was fragmented so I
> reindexed that index:

It seems likely that the problem here is that some of the predicates appear as so-called "Filter:" conditions, as opposed to true index quals. For reasons that aren't particularly fundamental, Postgres B-Trees cannot push down != (or <>) to the index level. Strangely enough, this is possible with IS NOT NULL. I'm working to improve things in this area.

That difference is easy to see in the following example. The first query does far fewer buffer accesses than the second query, despite being very similar:

regression=# explain (analyze, buffers) select * from multi_test where a = 1 and b is not null;
                                                             QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
 Index Only Scan using multi_test_idx on multi_test  (cost=0.29..50.10 rows=2191 width=8) (actual time=0.095..0.100 rows=14 loops=1)
   Index Cond: ((a = 1) AND (b IS NOT NULL))
   Heap Fetches: 0
   Buffers: shared hit=3
 Planning Time: 0.056 ms
 Execution Time: 0.145 ms
(6 rows)

regression=# explain (analyze, buffers) select * from multi_test where a = 1 and b != 42;
                                                             QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
 Index Only Scan using multi_test_idx on multi_test  (cost=0.29..222.57 rows=2191 width=8) (actual time=0.087..2.982 rows=14 loops=1)
   Index Cond: (a = 1)
   Filter: (b <> 42)
   Rows Removed by Filter: 1
   Heap Fetches: 0
   Buffers: shared hit=11
 Planning Time: 0.076 ms
 Execution Time: 3.204 ms
(8 rows)

(There are lots of index tuples matching (a, b) = "(1, NULL)" here, you see.)

--
Peter Geoghegan
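(The multi_test schema isn't shown above. A plausible setup that produces the same shape of plan would be something along these lines -- my guess at the table, not the exact one used:)

```sql
CREATE TABLE multi_test (a integer, b integer);
CREATE INDEX multi_test_idx ON multi_test (a, b);
-- Many (1, NULL) index tuples: the IS NOT NULL qual lets the scan stop
-- at the index level, while the != 42 filter must walk through them all
INSERT INTO multi_test SELECT 1, NULL FROM generate_series(1, 2000);
INSERT INTO multi_test SELECT 1, g FROM generate_series(1, 14) g;
VACUUM ANALYZE multi_test;
```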
Re: Large pkey index on insert-only table
On Mon, Jun 26, 2023 at 8:50 AM Devin Ivy wrote:
> Any thoughts on why this may be, or where to go next to continue tracking
> this down? Also, could the primary key column order `(id, ancestor_id)` vs
> `(ancestor_id, id)` significantly affect the index size depending on the
> column cardinalities?

That is certainly possible, yes. I describe one particular pathology that causes composite B-Tree indexes to only have about 50% space utilization here:

https://youtu.be/p5RaATILoiE?t=2079

Theoretically this shouldn't be a problem anymore, because (as the talk describes) Postgres 12 added heuristics that avoid the problem. But those heuristics are kind of conservative; they only kick in when it's fairly clearly the right thing to do. I don't think that they'll work very reliably for varchar columns.

Note that 90% space utilization isn't really the standard case for B-Trees in general. Even absent updates and deletes, an index with completely random insertions (e.g., a UUID index) is expected to have about 70% space utilization. You can only really expect ~90% space utilization with monotonically increasing insertions. On the other hand, having less than 50% space utilization is pretty poor, so (assuming that that's what "bloat percentage 47%" means) I'd say that you're right to suspect that something is a bit off here.

This isn't necessarily a big deal, but I tend to agree that what you're seeing is something that theoretically can be avoided by the implementation (if there were deletes involved then that wouldn't apply, but there aren't).

--
Peter Geoghegan
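Directly measuring leaf-page space utilization with pgstattuple's pgstatindex() function is more precise than a generic bloat estimate (a sketch; substitute the real index name):

```sql
CREATE EXTENSION IF NOT EXISTS pgstattuple;
-- avg_leaf_density is the percentage of leaf-page space holding live
-- index tuples: ~90 suggests append-only insertions, ~70 random
-- insertions, and ~50 or below suggests a split pathology
SELECT avg_leaf_density, leaf_fragmentation
FROM pgstatindex('my_table_pkey');
```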
Re: VACUUM (INDEX_CLEANUP OFF) and GIN indexes
On Tue, Apr 25, 2023 at 9:18 AM Christophe Pettus wrote:
> Does VACUUM (INDEX_CLEANUP OFF) flush the pending list for GIN indexes, or is
> that skipped as well?

It's skipped by VACUUM, but not by ANALYZE. So if you're using the reloption version of index_cleanup=off, it isn't necessarily going to stop autovacuum/autoanalyze from doing pending list cleanup.

The ANALYZE pending list cleanup path has some problems:

https://postgr.es/m/cah2-wzkjrk556envtflmyxedw91xguwiyzvep2kp5yqt_-3...@mail.gmail.com

--
Peter Geoghegan
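For what it's worth, pending list cleanup can also be forced explicitly, independent of VACUUM and ANALYZE, via the gin_clean_pending_list() function (a sketch; substitute your GIN index's name):

```sql
-- Flush the GIN pending list into the main index structure;
-- returns the number of pending-list pages removed
SELECT gin_clean_pending_list('my_gin_idx'::regclass);
```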
Re: UPSERT in Postgres
On Thu, Apr 6, 2023 at 1:21 PM Louis Tian wrote:
> An implicit assumption behind this definition is that table must have a
> primary key for the upsert operation to make sense since it's the primary key
> that uniquely identifies a row.

It could just be a unique index or a unique constraint. So you can upsert on any individual unique constraint/index, or the primary key. Of course there might be several on a given table, but you can only use one as the "conflict arbiter" per statement.

> But can we achieve the same effect with "INSERT ... ON CONFLICT DO UPDATE"
> like a lot of references on the internet seem to suggest?
>
> insert into person (id, name) values (0, 'foo') on conflict ("id") do update
> set id=excluded.id, name=excluded.name
> insert into person (id, is_active) values (0, true) on conflict ("id") do
> update set id=excluded.id, is_active=excluded.is_active
>
> Unfortunately, the second statement will fail due to violation of the not
> null constraint on the "name" column.
> PostgreSQL will always try to insert the row into the table first, and only
> fall back to update when the uniqueness constraint is violated.
> Is this behavior wrong? Maybe not, I think it is doing what it reads quite
> literally.

It sort of has to work that way, though. In general your example might *not* fail, due to a row-level before trigger in the insert path.

Why doesn't your proposed upsert syntax have the same problem? I mean, how could it not? I guess it doesn't if you assume that it'll never take the insert path with your not NULL constraint example? But if you know that for sure, why not just use a regular update statement?

On the other hand, if you're not sure if the insert path can be taken, then why is it actually helpful to not just throw an error at the earliest opportunity? Surely upsert means "update or insert", so why wouldn't the user expect to see an error like this, independent of the specifics of the row in question?
Isn't the user tacitly saying "I don't specifically know if the update or insert path will be taken in respect of any given row" by using ON CONFLICT DO UPDATE in the first place?

> That being said, I have never had a need for the ON CONFLICT DO UPDATE
> statement other than where I need upsert.
> But using it as "upsert" is only valid when the table is absent of any NOT
> NULL constraint on its non-primary-key columns.

I don't know what you mean by that. "Valid"?

> The MERGE command introduced in PG15 in theory can be used to do UPSERT
> properly that is void of the aforementioned limitation.
> The downside is it is rather verbose.

The MERGE command has various race conditions that are particularly relevant to UPSERT type use cases. See the wiki page you referenced for a huge amount of information on this.

> *Feature Request*
> Given that UPSERT is an *idempotent* operator it is extremely useful.

In general UPSERT (or any definition of it that I can think of) does not imply idempotency.

--
Peter Geoghegan
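To make the verbosity comparison concrete, here is roughly what the two forms look like side by side for the person example (a sketch only; note that the MERGE form lacks ON CONFLICT's protection against concurrent-insert races):

```sql
-- INSERT ... ON CONFLICT: one unique constraint acts as conflict arbiter
INSERT INTO person (id, name) VALUES (0, 'foo')
ON CONFLICT (id) DO UPDATE SET name = excluded.name;

-- MERGE (PG15+): an explicit join condition replaces arbiter inference
MERGE INTO person p
USING (VALUES (0, 'foo')) AS v (id, name) ON p.id = v.id
WHEN MATCHED THEN UPDATE SET name = v.name
WHEN NOT MATCHED THEN INSERT (id, name) VALUES (v.id, v.name);
```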
Re: PG16devel - vacuum_freeze_table_age seems not being taken into account
On Fri, Mar 3, 2023 at 2:43 AM Simon Elbaz wrote:
> I expected it not to be processed by vacuum freeze.
> However it has been entirely frozen.
> Moreover, among the 51 rows, only 1 was eligible for freeze because its XID
> was older than vacuum_freeze_min_age.

The effect that you noticed is a consequence of page-level freezing, which is new to Postgres 16. VACUUM will now freeze all of the tuples on a page whenever it needs to freeze any tuples at all (barring any tuples that are fundamentally ineligible due to being after the removable/freezable cutoff).

This is justified by the cost profile. Once we decide to freeze at least one tuple on a page, the added cost in WAL is low enough that it really doesn't make sense to not just freeze everything. The page that gets frozen by your test case is also set all-frozen in the visibility map. Without the optimization, we'd have frozen that one tuple and then set the page all-visible. The page would likely be frozen again by the next aggressive VACUUM, which is usually much more expensive.

--
Peter Geoghegan
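The all-frozen visibility map bit that gets set here can be observed directly with the pg_visibility extension (a sketch, assuming the test table is named t):

```sql
CREATE EXTENSION IF NOT EXISTS pg_visibility;
-- One row per heap page, showing the all_visible and all_frozen VM bits
SELECT blkno, all_visible, all_frozen FROM pg_visibility_map('t');
```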
Re: Automatic aggressive vacuum on almost frozen table takes too long
On Mon, Feb 20, 2023 at 9:43 PM Mikhail Balayan wrote:
> What catches my eye: scanning indexes smaller than 3.1GB is fast, larger ones
> are slow. For example:
> idx_applications2_policy_id is 3131 MB took just 5 seconds (DETAIL: CPU:
> user: 2.99 s, system: 1.65 s, elapsed: 5.32 s)
> but idx_applications2_deleted_at with 3264 MB took 1 minute 22 seconds
> (DETAIL: CPU: user: 67.93 s, system: 3.41 s, elapsed: 82.75 s)

I think that I know what this is.

If you delete many index pages during VACUUM, and those pages are all full of duplicate values, the deletion operation can sometimes be slower due to the need to relocate a downlink to each to-be-deleted leaf page. When there are thousands of matches, you'll start to notice O(n^2) behavior due to the way in which the B-Tree VACUUM code must grovel through the parent level, which is full of duplicate keys.

If you were on Postgres 12+, then this wouldn't happen, because the heap TID is treated as a part of the key space there, affecting sort order. The implementation would immediately relocate the matching parent downlink using a unique key (unique because heap TID would act as a unique-ifier on that version). And if you were on 14+, things in this area would be much better still.

--
Peter Geoghegan
Re: Automatic aggressive vacuum on almost frozen table takes too long
On Thu, Feb 16, 2023 at 5:40 PM Mikhail Balayan wrote:
> > Do you have any non-btree indexes on the table? Can you show us the
> > details of the table, including all of its indexes? In other words, can
> > you show "\d applications" output from psql?
>
> Only btree indexes. Please find the full table schema below:

It's possible that VACUUM had to wait a long time for a cleanup lock on one individual heap page here, which could have added a long delay. But...that doesn't seem particularly likely.

Can you run amcheck's bt_index_check() routine against some of the indexes you've shown? There is perhaps some chance that index corruption exists and causes VACUUM to take a very long time to delete index pages. This is pretty much a wild guess, though.

--
Peter Geoghegan
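Running the check could look something like this (a sketch, using one of the index names from upthread; the second argument, heapallindexed, makes the check slower but more thorough):

```sql
CREATE EXTENSION IF NOT EXISTS amcheck;
-- Verify index structure; bt_index_check only takes an AccessShareLock,
-- so it can run alongside normal reads and writes
SELECT bt_index_check('idx_applications2_deleted_at'::regclass, true);
```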
Re: Automatic aggressive vacuum on almost frozen table takes too long
On Thu, Feb 16, 2023 at 7:44 AM Adrian Klaver wrote:
> > That is, if I understand it correctly, it says that there were (and
> > actually are) 2013128 pages of which 2008230 were skipped, which leaves
> > 4898 blocks to be scanned. I.e. it seems that the allocated 1GB
> > (autovacuum_work_mem) should be enough to handle that amount of blocks
> > and to avoid multiple scans of the indexes.

It's quite clear that there is only one pass over the indexes, since "index scans: 1" says exactly that.

> > But, based on buffer usage, one can see that a huge amount of data is
> > read, greatly exceeding not only the number of remaining unfrozen
> > blocks, but also the size of the table and indexes taken together: 2
> > billion blocks, more than 15TB.
> >
> > Is this a bug in Postgresql or am I interpreting the log data wrong?

I think that it might be a bug in Postgres. I addressed a similar issue in the same "buffer" instrumentation in commit d3609dd2, but that wasn't backpatched because I imagined that it only applied to the new VACUUM VERBOSE case (VACUUM VERBOSE only started sharing the same instrumentation code as log_autovacuum in Postgres 15). It's not immediately obvious how you could see a problem like the one you've shown in the autovacuum log output.

Even still, the information about buffers that you've shown does indeed appear to be total nonsense (while everything else we can see looks plausible). There has to be some explanation for that.

The only other explanation I can think of is a pathological case where an index scan by some particular ambulkdelete routine scans a number of buffers that vastly exceeds the total size of the index. That does seem just about possible with an access method like GIN. Do you have any non-btree indexes on the table? Can you show us the details of the table, including all of its indexes? In other words, can you show "\d applications" output from psql?

--
Peter Geoghegan
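(A quick way to check for non-btree indexes without posting the whole schema -- a sketch against the system catalogs:)

```sql
-- List each index on the table together with its access method
SELECT c.relname AS index_name, am.amname AS access_method
FROM pg_index i
JOIN pg_class c ON c.oid = i.indexrelid
JOIN pg_am am ON am.oid = c.relam
WHERE i.indrelid = 'applications'::regclass;
```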
Re: ERROR: posting list tuple with 2 items cannot be split at offset 17
On Thu, Feb 9, 2023 at 3:55 PM Paul McGarry wrote:
> Will the amcheck reliably identify all issues that may arise from a collation
> change?

Theoretically it might not. In practice I'd be very surprised if it ever failed to detect such an inconsistency.

If you want to be extra careful, and can afford to block concurrent writes, then I suggest using bt_index_parent_check instead of bt_index_check.

> and therefore I can just use that to identify bad indexes and recreate them,
> or should I recreate all btree indexes involving text fields?

It might be easier to just reindex them all. Hard to say.

> We planned to do a dump/restore upgrade to PG14 in a month or so (already
> done in dev).
> I am wondering whether it will be less work overall to bring that forward
> than rebuild these indexes...

pg_amcheck is available on 14. It offers a much simpler interface for running the amcheck routines, so maybe look into that once you upgrade.

--
Peter Geoghegan
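The more thorough parent-level check could look like this (a sketch; the index name is illustrative, and the check takes a ShareLock that blocks INSERT/UPDATE/DELETE for its duration):

```sql
CREATE EXTENSION IF NOT EXISTS amcheck;
-- heapallindexed: also verify that every heap tuple has an index entry.
-- rootdescend: re-find each tuple by descending from the root, which
-- exercises the very comparisons a collation change would break.
SELECT bt_index_parent_check('widget_name_idx'::regclass,
                             heapallindexed => true,
                             rootdescend => true);
```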
Re: ERROR: posting list tuple with 2 items cannot be split at offset 17
On Wed, Feb 8, 2023 at 11:54 PM Paul McGarry wrote:
> But if it is the problem, why did the update start working after I recreated
> the other index?

There is no reason why reindexing another index ought to have had that effect. The likely explanation is that subsequent updates used a successor version heap TID that didn't overlap with some existing posting list in whatever way. If you repeat the update again and again, and get an error each time, the incoming TID will differ each time. Eventually you won't get an error, because at some point there won't be a posting-list-TID range overlap for some new successor TID that leads to the insert/posting list split code detecting a problem. It's also possible that a concurrent autovacuum "fixed" the issue.

The amcheck error shows a problem in an internal page, which cannot have posting list tuples -- which suggests broad corruption. An issue with collation instability due to an OS update does seem likely.

Note that the hardening/defensive checks in this area have increased. I added an additional defensive check to 13.4, and followed up with another similar check in 13.5. It looks like the error you've seen ("ERROR: posting list tuple with 2 items cannot be split at offset 17") comes from the initial 13.4 hardening, since I'd expect the additional 13.5 hardening to catch the same issue sooner, with a different error message (something like "table tid from new index tuple (%u,%u) cannot find insert offset between offsets %u and %u of block %u in index \"%s\"").

> I think I should now:
> - recreate the widget_name_idx on the problem servers
> - run bt_index_check across all other indexes
>
> Any suggestions on what else I should look into, in particular anything I
> should check before upgrading the remaining 13.8 DB to 13.9?

I recommend running amcheck on all indexes, or at least all possibly-affected text indexes.

--
Peter Geoghegan
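A catalog query along these lines can enumerate the btree indexes that involve collatable (text-like) key columns, which are the ones a libc/ICU change can corrupt (a sketch; it looks at key columns' collations, not at the internals of index expressions):

```sql
-- Btree indexes with at least one collation-dependent column
SELECT DISTINCT c.relname AS index_name
FROM pg_index i
JOIN pg_class c ON c.oid = i.indexrelid
JOIN pg_am am ON am.oid = c.relam
JOIN pg_attribute a ON a.attrelid = c.oid
WHERE am.amname = 'btree'
  AND a.attcollation <> 0;   -- nonzero means the column uses a collation
```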
Re: Question regarding UTF-8 data and "C" collation on definition of field of table
On Sun, Feb 5, 2023 at 4:19 PM Tom Lane wrote:
> If there's a predominant language in the data, selecting a collation
> matching that seems like your best bet. Otherwise, maybe you should
> just shrug your shoulders and stick with C collation. It's likely
> to be faster than any alternative.

FWIW there are certain "compromise locales" supported by ICU/CLDR. These include "English (Europe)", and, most notably, EOR (European Ordering Rules):

https://en.wikipedia.org/wiki/European_ordering_rules

EOR seems to have been standardized by the EU or by an adjacent institution, but I'm not sure how widely used it really is.

It's also possible to use a custom collation with ICU, which is almost infinitely flexible:

http://www.unicode.org/reports/tr10/#Customization

As an example, the rules about the relative ordering of each script can be changed this way. There is also something called merged tailorings. The OP should see the Postgres ICU docs for hints on how to use these facilities to make a custom collation that matches whatever their requirements are:

https://www.postgresql.org/docs/current/collation.html#COLLATION-MANAGING

--
Peter Geoghegan
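Creating such an ICU collation can look like this (a sketch; the locale strings follow BCP 47 / CLDR keyword syntax, and whether a particular tailoring like EOR is present depends on the ICU build):

```sql
-- CLDR's European Ordering Rules tailoring, if the ICU build includes it
CREATE COLLATION eor (provider = icu, locale = 'und-u-co-eor');

-- A script-reordering tailoring: sort Cyrillic before Latin
CREATE COLLATION cyr_first (provider = icu,
                            locale = 'und-u-kr-cyrl-latn');

SELECT name
FROM (VALUES ('apple'), ('яблоко')) AS v (name)
ORDER BY name COLLATE cyr_first;
```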
Re: Lots of read activity on index only scan
On Fri, Nov 18, 2022 at 1:50 PM Peter J. Holzer wrote:
> There should be about 27000 of them, same as for the other index, right?

There aren't that many. The point I'm making is that you can access each VM page approximately once (and check relatively many index tuples' TIDs all in one go), or many times. The total number of VM pages may be constant, but the access patterns are quite different owing to differences in how the data is clustered in each index.

> > When there is naturally a high correlation (or some kind of
> > clustering) in how we access VM pages, we'll naturally be able to do
> > more visibility checks covering more index tuples per VM page
> > accessed.
>
> So you are saying that these are accesses to the visibility map, not the
> base table?

Yes. I see "Heap Fetches: 0" for both plans, and that each query scans the same table and approximately the same number of index pages. So VM accesses are the only explanation that makes any sense.

> > Also worth bearing in mind that it's unusual to have a perfectly
> > random and uniformly distributed clustering of index tuples,
>
> Sure. This is a highly contrived example.

FWIW I think that it could be a lot less bad, even with indexes that you'd think would be almost as bad as the bad one from your test case. Even things that appear to be random aren't usually nearly as random as what you've shown.

--
Peter Geoghegan
Re: Lots of read activity on index only scan
On Fri, Nov 18, 2022 at 12:46 PM Peter J. Holzer wrote:
> Both do a parallel index only scan. Both perform 0 heap fetches.
> But one reads 27336 buffers (or about 22 bytes per index entry, which
> sounds reasonable) while the other reads 9995216 buffers (or almost one
> full buffer per row). Why? The entries should be dense in the index in
> both cases and since it's an index only scan (and explain says there
> were 0 heap fetches) I would not expect extra accesses. Where do these
> buffer reads come from?

The index-only scan processes an index leaf page at a time. When there is naturally a high correlation (or some kind of clustering) in how we access VM pages, we'll naturally be able to do more visibility checks covering more index tuples per VM page accessed.

This is a less severe problem here than it would be with an equivalent pair of plain index scans, just because there are so few VM pages relative to heap pages. But it's more or less an analogous problem. You're really noticing it here because these index scans have very low selectivity -- which is kinda unusual in most environments.

Also worth bearing in mind that it's unusual to have a perfectly random and uniformly distributed clustering of index tuples, which is what the index built via hashing exhibits. Even a v4 UUID index could easily have plenty of duplicates, which would probably do significantly better on the metric you've focussed on.

--
Peter Geoghegan
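The clustering effect can be reproduced with something like the following (my own sketch of the kind of test case under discussion, not the original script):

```sql
-- One column correlated with heap order, one effectively randomized
CREATE TABLE t AS
SELECT g AS id, hashint4(g) AS h
FROM generate_series(1, 10000000) g;
CREATE INDEX t_id_idx ON t (id);  -- index order matches heap order
CREATE INDEX t_h_idx  ON t (h);   -- index order is random w.r.t. heap
VACUUM t;  -- set all-visible bits so index-only scans are possible

-- Compare the "Buffers:" lines: the correlated index revisits each VM
-- page in long runs, the hashed index bounces between VM pages
EXPLAIN (ANALYZE, BUFFERS) SELECT count(id) FROM t;
EXPLAIN (ANALYZE, BUFFERS) SELECT count(h) FROM t;
```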
Re: ON CONFLICT and WHERE
On Sun, Nov 13, 2022 at 1:07 PM Tom Lane wrote:
> A WHERE placed there is an index_predicate attachment to the ON CONFLICT
> clause. It doesn't have any run-time effect other than to allow partial
> indexes to be chosen as arbiter indexes. TFM explains
>
>     index_predicate
>
>         Used to allow inference of partial unique indexes. Any indexes
>         that satisfy the predicate (which need not actually be partial
>         indexes) can be inferred.
>
> This strikes me as a bit of a foot-gun. I wonder if we should make
> it safer by insisting that the resolved index be partial when there's
> a WHERE clause here.

I don't think that it would be safer. Adrian has asked why it's possible to attach an arbitrary index_predicate type WHERE clause to an ON CONFLICT query, without that really changing the behavior of the statement. That *is* a little odd, so it's certainly a fair question (I can recall perhaps as many as 5 similar questions over the years). But it's not the end of the world, either -- there are far worse things.

I think that it would be a lot worse (just for example) to have your ON CONFLICT query suddenly start throwing an ERROR in production, just because you replaced a partial unique index with a unique constraint. If we have a suitable unique index or constraint, why wouldn't we use it in ON CONFLICT? Maybe it won't work out that way (maybe there won't be any suitable unique index or constraint), but why not do our utmost to insulate the user from what might be a serious production issue? That was the guiding principle.

Overall I'm quite happy with the amount of foot-guns ON CONFLICT has, especially compared to other comparable features in other DB systems (which had plenty). There are one or two ostensibly odd things about the syntax that are downstream consequences of trying to make the constraint/unique index inference process maximally forgiving. I'm pretty happy with that trade-off.

> (This documentation text is about as clear as
> mud, too.
> What does "inferred" mean here? I think it means "chosen as
> arbiter index", but maybe I misunderstand.)

Unique index/constraint inference is the process by which we choose an arbiter index. See the second paragraph of the "ON CONFLICT Clause" section of the INSERT docs.

--
Peter Geoghegan
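A concrete sketch of the inference behavior being discussed (hypothetical table; the WHERE clause lets the partial unique index qualify as arbiter, but a full unique index satisfying the predicate would be accepted too):

```sql
CREATE TABLE docs (id bigint, live boolean, body text);
CREATE UNIQUE INDEX docs_live_id ON docs (id) WHERE live;

-- The "WHERE live" here is an index_predicate: it only guides
-- arbiter-index inference, and has no other run-time effect
INSERT INTO docs (id, live, body) VALUES (1, true, 'new')
ON CONFLICT (id) WHERE live DO UPDATE SET body = excluded.body;
```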
Re: Weird planner issue on a standby
On Wed, Oct 12, 2022 at 6:47 AM Tom Lane wrote:
> However, that doesn't explain the downthread report that a
> VACUUM on the primary fixed it. What I suspect is that that
> caused some in-fact-dead index entries to get cleaned out.

Seems likely.

> But ... if the primary is allowed to vacuum away an index
> entry that it thinks is dead, exactly what is the point of
> making standbys ignore LP_DEAD bits? There's no additional
> interlock that guarantees the tuple will be there at all.

The interlock doesn't really protect the leaf page or its index tuples so much as the referenced TIDs themselves. In other words it's a TID recycling interlock.

That's why we don't need a cleanup lock to perform index tuple deletions, even though the WAL records for those are almost identical to the WAL records used by index vacuuming (in the case of nbtree the only difference is the extra latestRemovedXid field in the deletion variant WAL record). We know that there is no VACUUM process involved, and no question of heap vacuuming going ahead for the same TIDs once index vacuuming is allowed to complete.

We can get away with not having the interlock at all in the case of nbtree index scans with MVCC snapshots -- but *not* with index-only scans. See "Making concurrent TID recycling safe" in the nbtree README. I only got around to documenting all of the details here quite recently. The index-only scan thing dates back to 9.5.

--
Peter Geoghegan
Re: Weird planner issue on a standby
On Tue, Oct 11, 2022 at 10:04 PM Tom Lane wrote: > Do we propagate visibility-map bits to standbys? Yes. -- Peter Geoghegan
Re: Weird planner issue on a standby
On Tue, Oct 11, 2022 at 9:27 AM Alvaro Herrera wrote:
> I remember having an hypothesis, upon getting a report of this exact
> problem on a customer system once, that it could be due to killtuple not
> propagating to standbys except by FPIs. I do not remember if we proved
> that true or not. I do not remember observing that tables were being
> read, however.

That's true, but it doesn't matter whether or not there are LP_DEAD bits set on the standby, since in any case they cannot be trusted when in Hot Standby mode. IndexScanDescData.ignore_killed_tuples will be set to false on the standby.

--
Peter Geoghegan
Re: Monitoring multixact members growth
On Fri, Aug 19, 2022 at 8:40 AM Vido Vlahinic wrote: > However it is not quite clear to me how I can interpret results from above > and e.g. conclude: my_table accumulates x more multixact members since I > measured last. You can't. And not just because nobody got around to implementing it yet -- it's quite a fundamental restriction. VACUUM must always make sure of that, and must always scan all unfrozen pages to safely determine that much (it may or may not have to freeze *any* MultiXacts as part of that process, but it must always be sure that no Multis < its final relminmxid remain). > My goal here is to predict where multixact members are growing the fastest so > I can perform manual VACUUM FREEZE only on those tables The problem with that strategy is that you still have to do anti-wraparound autovacuums when the mxid_age(relminmxid) of a table crosses the usual threshold, even when in reality there are *zero* MultiXacts in the table (often the case with the largest tables). That's just how it works, unfortunately. There is one piece of good news, though: work in Postgres 15 taught VACUUM to track the oldest extant XID and MXID in the table, and set relfrozenxid and relminmxid to those oldest values (rather than using the cutoffs for freezing, which in general might be much older than the oldest remaining unfrozen XID/MXID). I expect that this will make it much less likely that anti-wraparound autovacuums affecting many tables will all stampede, hurting performance. This will be possible because VACUUM will now be able to set relminmxid to a value that actually tells us something about what's really going on in each table, MultiXact-wise (not just what you set vacuum_multixact_freeze_min_age and autovacuum_multixact_freeze_max_age to in postgresql.conf, which is pretty far removed from what matters most of the time). It can do this simply by noticing that there are no remaining MultiXacts (and probably never were any in the first place) in the larger tables. 
The timeline for anti-wraparound autovacuums will tend to make a lot more sense for *your* workload, where huge differences in the rate of MultiXact consumption among tables is likely the norm. This still isn't perfect (far from it), but it has the potential to make things far better here. -- Peter Geoghegan
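The per-table ages that these thresholds are compared against can be watched directly. A rough monitoring query along these lines (illustrative only; the LIMIT is arbitrary) shows which tables are closest to an anti-wraparound autovacuum on the MultiXact side:

```sql
-- mxid_age() reports how far each table's relminmxid lags behind the
-- current next-MultiXactId value. Anti-wraparound autovacuum triggers
-- once this exceeds autovacuum_multixact_freeze_max_age.
SELECT oid::regclass AS table_name,
       mxid_age(relminmxid) AS multixact_age,
       age(relfrozenxid) AS xid_age
FROM pg_class
WHERE relkind = 'r'
ORDER BY multixact_age DESC
LIMIT 20;
```

Note that, per the discussion above, a high mxid_age() says nothing about how many MultiXacts (if any) actually remain in the table.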
Re: Automatic autovacuum to prevent wraparound - PG13.5
On Wed, Jun 15, 2022 at 4:13 AM Mauro Farracha wrote: > The scenario: > - Out of nowhere (during the weekend), without database activity load or > batches running, with previous nightly run of vacuum freeze, in the middle of > the day, with xids and mxids below 20M we are seeing autovacuum being > triggered to prevent wraparound. > > My question is why this is occurring, which condition might be responsible > for this behaviour? There is a behavior that seems like it might be relevant: VACUUM interprets autovacuum_multixact_freeze_max_age in a way that accounts for both MultiXactId consumption and the consumption of "member space" by MultiXacts. Technically there are 2 SLRUs for MultiXacts, either of which can wrap around. This behavior was established by commit 53bb309d2d. It is documented. Admittedly this whole area of the documentation is in dire need of an overhaul. :-( -- Peter Geoghegan
Re: autovacuum_freeze_max_age on append-only tables
On Thu, Apr 21, 2022 at 8:14 PM Senor wrote: > Are the autovacuum_vacuum_cost_* settings handled any differently for > 'to avoid wraparound' vacuums? I understand that it won't give up a lock > but I was expecting it to still back off due to cost and allow the query > with conflicting lock to proceed. In general, no. For the most part an antiwraparound autovacuum does exactly the same work as any other autovacuum. Or any other aggressive VACUUM, at least. But even the extra work that it does beyond what a non-aggressive VACUUM does is still work that any VACUUM might do, if the circumstances were right. We still freeze in regular VACUUMs, provided we scan pages with XIDs that are sufficiently old. The most important difference between regular VACUUM and aggressive VACUUM is that only the former can skip all-visible pages that have unfrozen XIDs, putting that work off. This can sometimes lead to a big balloon payment later on, when you finally have an aggressive VACUUM. I think that that's a design flaw that ought to be fixed. Currently non-aggressive VACUUMs always skip all-visible pages. They should probably freeze some older all-visible pages eagerly, rather than skipping them, so that the system never gets too far behind on freezing. > Is there any benefit to manually running a vacuum every so many inserts > as opposed to using autovacuum_freeze_max_age. And in this case should > it be a vacuum freeze. Given your restrictions, this is probably the best option available. But maybe you should just set vacuum_freeze_min_age to 0 at the table level, instead of using vacuum freeze (so you freeze more without doing aggressive vacuuming all the time, which FREEZE also forces). Users understandably think that there are several different flavors of vacuum, but that's not really true (apart from VACUUM FULL, which really is quite different). The difference between aggressive and non-aggressive can be big in practice due to an accumulation of unfrozen pages over multiple non-aggressive vacuums. 
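A minimal sketch of the table-level approach suggested above (the table name is made up; the reloption is spelled autovacuum_freeze_min_age, and it is also consulted by manual VACUUM):

```sql
-- Freeze tuples as soon as VACUUM sees them on this table, instead of
-- waiting until they are vacuum_freeze_min_age transactions old.
ALTER TABLE append_only_log
  SET (autovacuum_freeze_min_age = 0);
```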
-- Peter Geoghegan
Re: autovacuum_freeze_max_age on append-only tables
On Wed, Apr 20, 2022 at 4:06 PM senor wrote: > I'm attempting to mimic a new feature in version 13 where INSERTS will > trigger vacuum for an append-only table. The problem with that idea is that you need to express the idea that the table needs to be vacuumed now in terms of its "age", denominated in XIDs -- but XIDs consumed by the entire system, not just those XIDs that happen to modify your append-only table. It will likely be very hard for you to figure out a way to relate these logical units (XIDs) to some kind of physical cost that captures how far behind you are on freezing (like blocks, or even tuples). Maybe you'll find something that works through trial and error, but I wouldn't count on it. > I'm apparently needing an education on how this "to avoid wraparound" vacuum > differs from any other. I've seen it referenced as "more aggressive" but I'd > like details. An upgrade to 13 is "right around the corner". It's complicated -- more complicated than it really should be. Technically an anti-wraparound autovacuum and an aggressive vacuum are two different things. In practice anti-wraparound autovacuums are virtually guaranteed to be aggressive, though an aggressive autovacuum may not be an antiwraparound VACUUM (sometimes we do aggressive vacuuming because autovacuum launched a worker before age(relfrozenxid) reached autovacuum_freeze_max_age, but after age(relfrozenxid) reached vacuum_freeze_table_age). See my recent response to a similar question here: https://postgr.es/m/CAH2-WzkFQ-okvVXizpy4dCEVq75N-Qykh=crhzao-eajflv...@mail.gmail.com -- Peter Geoghegan
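To see how close each table is to the two thresholds mentioned above, a query like this can help (purely illustrative):

```sql
-- age(relfrozenxid) is the "age" in question: XIDs consumed
-- system-wide since the table's relfrozenxid was last advanced.
SELECT c.oid::regclass AS table_name,
       age(c.relfrozenxid) AS xid_age,
       current_setting('vacuum_freeze_table_age') AS aggressive_at,
       current_setting('autovacuum_freeze_max_age') AS antiwraparound_at
FROM pg_class c
WHERE c.relkind = 'r'
ORDER BY xid_age DESC
LIMIT 10;
```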
Re: oldest xmin is far in the past :: BUT xmin is not available in system
On Mon, Apr 18, 2022 at 11:37 PM bhargav kamineni wrote: > executing the vacuum on the entire cluster is also giving the same HINTS and > WARNING's You're using Aurora, not PostgreSQL. Perhaps this is actually a bug, but there is no way for anybody here to know. -- Peter Geoghegan
Re: Packages, inner subprograms, and parameterizable anonymous blocks for PL/pgSQL
On Fri, Dec 17, 2021 at 11:43 AM Bryn Llewellyn wrote: > Modular design recommends exposing functionality through a purpose oriented > interface and hiding all implementation details from the API’s user. A > package achieves this with declarative syntax via the spec/body separation. > The body encapsulates as many (top-level) subprograms as you want. Each of > these is visible to all of its peers. But none is visible outside of the > package unless the spec declares that it should be. This is a simple opt-in > scheme. I still don't get it. It sounds like you're mostly talking about encapsulation, or information hiding, for stored procedures. I can certainly see how plpgsql doesn't do those things very well, but it still seems like there might be a lot of nuance that isn't getting across. The specific features that seem to be missing are not unreasonable, individually, and yet it feels like I cannot see some bigger picture that's apparent to you. Maybe you should explain your position by way of a motivating example, involving a real world use case. Something that makes the issues concrete. Are these items compelling because of how they allow an organization to deploy a program in a production environment, complete with version control? Does it have something to do with decoupling the mutable business data stored in tables from the programs contained/run in the same database? -- Peter Geoghegan
Re: Packages, inner subprograms, and parameterizable anonymous blocks for PL/pgSQL
On Thu, Dec 16, 2021 at 6:39 PM Mladen Gogala wrote: > I am very well aware of the animosity toward the Oracle community and > the reasons for that animosity. Oracle wiped the floor with the > predecessor of Postgres, the database called "Ingres". Sandra Kurtzig, > the Ingres CEO at the the time, and Michael Stonebraker were both > involved in very public spat with Larry Ellison. These events happened several years before I was born. > Stonebraker is still > very much a factor in Postgres community and I doubt that his feelings > toward the Larry and his company have got any warmer with years. No, he isn't. I think that Stonebraker appeared at a few events that were hosted by EDB, but to the best of my knowledge that's the extent of his involvement in the project after university POSTGRES. I've personally never met the man, or communicated with him online, and I've worked on Postgres more or less full time for a full decade now. As far as I'm aware he hasn't ever publicly posted to any of the mailing lists. -- Peter Geoghegan
Re: Packages, inner subprograms, and parameterizable anonymous blocks for PL/pgSQL
On Thu, Dec 16, 2021 at 11:22 AM Bryn Llewellyn wrote: > The advantages are self-evident to these programmers; and their lack comes as > a shocking disappointment when they start to write application code for > PostgreSQL*. The absence of packages and inner subprograms is huge. Why are those things huge? It's not self-evident to me. I can only speak for myself, but throwing around terms like "shocking disappointment" is never going to convince me of anything. You can make similar statements about many other things. Any functionality that gets added to PostgreSQL should ideally be compelling to users that have not worked with Oracle in the past. Maybe that happens to result in a feature that very much looks like what you already have in mind, or maybe there are significant differences. If you're not going to meet others closer to where they are, then how can you expect it for yourself? -- Peter Geoghegan
Re: Error with Insert from View with ON Conflict
On Wed, Nov 3, 2021 at 2:18 PM Steve Baldwin wrote: > I'm pretty sure the 'alias' for the '.. on conflict do update ..' needs to be > 'excluded' (i.e. checks = excluded.checks, ...). Check the docs. That's right. The excluded.* pseudo-table isn't exactly the same thing as the target table -- it is a tuple that has the same "shape", that represents what the implementation tried (and failed) to insert into the table. I have to imagine that Alex wants to reference that, because that's the standard idiomatic approach with ON CONFLICT. And because the only alternative interpretation is that Alex intends to update those columns using their current values (not new values), which won't really change anything -- that seems unlikely to have been the intent. -- Peter Geoghegan
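A minimal sketch of the idiomatic form (table and column names invented for illustration):

```sql
-- EXCLUDED is a pseudo-table holding the row that failed to insert;
-- referencing it is the standard way to write an upsert.
INSERT INTO accounts (id, checks)
VALUES (1, 42)
ON CONFLICT (id)
DO UPDATE SET checks = EXCLUDED.checks;
```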
Re: Segmentation fault in volatile c function
On Wed, Oct 27, 2021 at 12:39 PM Louise Grandjonc wrote: > I'm creating my first ever extension. The function that I'm trying to write > takes the schema and name of a table, and adds it in a table with all the > tables the extension follows. > In that table I want the schema, name and oid of the table. > > I created a C function for that. Here is the code I don't think that using cstring as an argument is appropriate here. That's only used for "internal" functions that are called through C, rather than being called through SQL. It would probably be better to use a regclass argument instead. Take a look at the example of (say) amcheck's bt_index_check() or verify_heapam() functions to see how that's done. Not all C functions use this built-in way of specifying a particular relation from SQL (a few older ones really do just pass a string some other way), but it's definitely considered the way to go. Note that you can do stuff like this from SQL:

pg@regression:5432 [2999218]=# select 'pg_class'::regclass::oid;
  oid
───────
 1,259
(1 row)

-- Peter Geoghegan
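The SQL-level declaration for such a function might look like this (the function and symbol names are invented; the pattern follows what amcheck's functions do):

```sql
-- With a regclass argument, callers can simply write
-- follow_table('myschema.mytab'), and PostgreSQL resolves the
-- name to an OID before the C function is ever called.
CREATE FUNCTION follow_table(rel regclass)
RETURNS void
AS 'MODULE_PATHNAME', 'follow_table'
LANGUAGE C STRICT;
```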
Re: Frequetly updated partial index leads to bloat on index for Postresql 11
> On Thu, Mar 18, 2021 at 6:51 AM Tom Dearman wrote: > Is this a known issue, are they any ways around it, and if it is an > issue is there a plan to fix it if a fix is possible? On second thought I do think that the improvements to 14 will fix this for you. See the test case here: https://www.postgresql.org/message-id/flat/CAL9smLC%3DSxYiN7yZ4HDyk0RnZyXoP2vaHD-Vg1JskOEHyhMXug%40mail.gmail.com#e79eca5922789de828314e296fdcb82d -- Peter Geoghegan
Re: Behavior change in PostgreSQL 14Beta3 or bug?
On Mon, Sep 6, 2021 at 7:52 AM Daniel Westermann (DWE) wrote: > >Try running vacuum with index cleanup = on. > > Thank you, Peter Thanks for testing! -- Peter Geoghegan
Re: Behavior change in PostgreSQL 14Beta3 or bug?
On Mon, Sep 6, 2021 at 8:59 AM Tom Lane wrote: > Assuming that that choice was made appropriately, I think the advice you > propose here will just cause people to waste lots of cycles on VACUUM > runs that have only marginal effects. Right. The advice that they should receive (if any) is to tune autovacuum aggressively, and enable autovacuum log output. The log output reports on whether or not the implementation applied the optimization in each case. As I pointed out to Laurenz just now, users that care about index-only scans are actually the big beneficiaries here. Now they can set autovacuum_vacuum_insert_threshold very aggressively, without doing a useless round of index vacuuming just because one inserting transaction out of a million aborted. Once indexes are removed from the equation (to the extent that that makes sense), each round of vacuuming by autovacuum only needs to do work that is proportional to the number of unset-in-vm heap pages. I believe that that trade-off makes a lot of sense. Autovacuum has little chance of keeping anything like 100% of all pages set in the VM anyway. But it can get a lot closer to it in some cases now. -- Peter Geoghegan
Re: Behavior change in PostgreSQL 14Beta3 or bug?
On Mon, Sep 6, 2021 at 9:21 AM Laurenz Albe wrote: > #define BYPASS_THRESHOLD_PAGES 0.02/* i.e. 2% of rel_pages */ > > So up to an additional 2% of all pages can have the all-visible bit > unset with "index_cleanup = auto". > > That is probably not worth worrying, right? I don't think it's worth worrying about. I would say that, since I chose the exact threshold myself. The threshold was a bit arbitrary, of course. Note that Daniel's example had a non-HOT update, even though it's the kind of update that we imagine can use HOT (because it didn't modify an indexed column). He could have ensured a HOT update by lowering heap fill factor, but why should that be necessary if updates are rare anyway? The bypass-index-vacuuming feature may have had a bit of a messaging problem. It was something we usually talked about as being about skipping index vacuuming, because that's what it actually does. However, the feature isn't really about doing less work during VACUUM. It's actually about doing *more* work during VACUUM -- more useful work. Especially setting visibility map bits. But also freezing. Now you can very aggressively tune VACUUM to do these things more often, with no fear of that being way too expensive because of index vacuuming that has only marginal value. The threshold is not so much about any one VACUUM operation -- you have to think about the aggregate effect on the table over time. Most individual tables will never have the new optimization kick in even once, because the workload just couldn't possibly allow it -- the 2% threshold is vastly exceeded every single time. The cases that it actually applies to are pretty much insert-only tables, perhaps with some HOT updates. 100% clean inserts are probably very rare in the real world. I believe that it's *vastly* more likely that such a table will have pages that are ~98%+ free of LP_DEAD line pointers in heap pages (i.e., the thing that BYPASS_THRESHOLD_PAGES applies to). 
To get to 100% you cannot allow even one single insert transaction to abort since the last VACUUM. If you assume that BYPASS_THRESHOLD_PAGES is actually too low for your workload (which is the opposite problem), then it doesn't matter very much. The feature as coded should still have the desired effect of skipping index vacuuming in *most* cases where it's unnecessary (however you happen to define "unnecessary") -- the number of pages with LP_DEAD items will naturally increase over time without index vacuuming, until the threshold is crossed. Maybe still-unnecessary index vacuuming will still take place in 1 out of 5 cases with the feature. This is still much better than 5 out of 5. More importantly, you can now aggressively tune vacuuming without noticeably increasing the number of individual vacuums that still have the problem of unnecessary index vacuuming. So if you go from 5 vacuums per day to 20 through tuning alone, the number of vacuum operations that do unnecessary index vacuuming doesn't increase at all (except perhaps due to rounding effects). -- Peter Geoghegan
Re: Behavior change in PostgreSQL 14Beta3 or bug?
It's a non-hot update, and so there is a single dead index tuple. You're seeing the new optimization that makes vacuum skip indexes in marginal cases. Try running vacuum with index cleanup = on. Peter Geoghegan (Sent from my phone)
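Spelled out (table name hypothetical), that suggestion is:

```sql
-- INDEX_CLEANUP ON forces index vacuuming, overriding the new
-- "skip indexes in marginal cases" optimization.
VACUUM (INDEX_CLEANUP ON, VERBOSE) my_table;
```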
Re: gen_random_uuid key collision
On Thu, Sep 2, 2021 at 4:48 PM Tom Lane wrote: > That is pretty weird, all right. The only idea that comes to mind > immediately is that maybe that table's pkey index is corrupt and needs > to be reindexed. This isn't a great theory, because I don't see why > a corrupt index would lead to bogus unique-constraint errors rather > than missed ones. But at least it squares with the observation that > only that table is having issues. This is easy enough to check using the contrib/amcheck extension. jesusthefrog could try this, and report back what they see:

CREATE EXTENSION IF NOT EXISTS amcheck;
SELECT bt_index_check('my_uuid_index', true);

If that doesn't show any errors, then there is a chance that this will:

SELECT bt_index_parent_check('my_uuid_index', true);

Note that the parent variant takes a disruptive lock that will block write DML. You might prefer to just use the first query if this is running in a production environment. -- Peter Geoghegan
Re: Frequetly updated partial index leads to bloat on index for Postresql 11
On Fri, Jul 16, 2021 at 9:19 AM Tom Dearman wrote: > Other indexes do bloat, but the percentage bloat is a lot less I have to imagine that the remaining problems have a lot to do with the fact that this is a partial index -- the partial index naturally gets vacuumed much less frequently than what would generally be considered ideal for the index itself. In general VACUUM scheduling makes the naive assumption that the indexes have the same needs as the table, which is far from the case with this partial index, for your workload. It's all of the specifics, taken together. It sounds like this is a case where bottom-up index deletion won't help -- it will only trigger in those indexes that are not "logically modified" by updates. But you're logically modifying these values. Or you're causing them to not need to be in the index anymore, by modifying the predicate. But that won't trigger bottom-up deletion. It's a bit like a delete, as far as the physical index structure is concerned -- the index won't be eagerly modified by the executor. The overall picture is that you cycle through all of the values in the table, and no cleanup can take place other than plain VACUUM (at least not to any significant extent). Although only a few hundred values are logically required to be indexed by the partial index at any one time, in practice no cleanup can run for long stretches of time (autovacuum just doesn't know about cases like this). This is why the partial index inevitably exceeds its theoretical pristine/high watermark size, which is actually more than 1 page/8KB, but still probably a lot less than what you actually see -- the partial index "falls through the cracks", even with recent enhancements that made cleanup more eager and more dynamic in certain other cases. I am afraid that I don't have a good suggestion right now. I can think of incremental improvements that would address this case, but for now they're just ideas. 
Fundamentally, we need to get to the partial index much more frequently than the other indexes, either within VACUUM or within some other mechanism. For example a conservative implementation of retail index tuple deletion might totally fix the issue. It does very much look like a problem in these partial indexes in particular -- it's quite possible that the other indexes won't grow at all due to garbage index tuples, especially on Postgres 14. -- Peter Geoghegan
Re: hot_standby_feedback implementation
On Tue, Jun 15, 2021 at 5:24 PM Christophe Pettus wrote: > When a replica sends a hot_standby_feedback message to the primary, does that > create an entry in the primary's lock table, or is it flagged to autovacuum > some other way? It pretty much works by making the WAL sender process on the primary look like it holds a snapshot that's as old as the oldest snapshot on the replica. A replica can block VACUUM on the primary *directly* by holding a table-level lock, though in practice only when somebody on the replica acquires an AccessExclusiveLock -- which is presumably rare in practice. -- Peter Geoghegan
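The effect is visible on the primary (backend_xmin is the relevant column here):

```sql
-- Each standby's feedback xmin appears on its walsender's row;
-- VACUUM on the primary cannot remove tuples still visible to it.
SELECT application_name, state, backend_xmin
FROM pg_stat_replication;
```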
Re: Frequetly updated partial index leads to bloat on index for Postresql 11
On Thu, Mar 18, 2021 at 6:51 AM Tom Dearman wrote: > Is this a known issue, are they any ways around it, and if it is an > issue is there a plan to fix it if a fix is possible? It's not exactly a known issue per se, but I think the problem here is related to the fact that you have lots of duplicates, which did perform rather badly prior to Postgres 12. I bet that you'd benefit from upgrading to Postgres 12, or especially to Postgres 13. The B-Tree space management is a lot better now. (Actually, it'll be better again in Postgres 14.) -- Peter Geoghegan
Re: autovacuum verbose?
On Thu, Jan 21, 2021 at 12:55 PM Tommy Li wrote: > Is there any way to configure autovacuum to log the same information as > VACUUM VERBOSE? No. Though there really should be. -- Peter Geoghegan
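The closest available option is to have autovacuum log a summary of each run via log_autovacuum_min_duration, e.g.:

```sql
-- 0 = log every autovacuum run (pages removed/remaining, tuples,
-- and so on); -1 disables this logging entirely.
ALTER SYSTEM SET log_autovacuum_min_duration = 0;
SELECT pg_reload_conf();
```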
Re: Need explanation on index size
On Thu, Sep 24, 2020 at 6:55 AM Guillaume Luchet wrote: > I don’t understand why after the update where I only update a non indexed > column the indexes size is growing. Is it something someone can explain ? If you reduce the table fillfactor then these updates will all be HOT updates. That will make the table larger initially, but leaving enough space behind on the same heap pages for successor tuples makes it possible to use HOT updates. -- Peter Geoghegan
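A sketch of the fillfactor approach (the table name and the 80% figure are arbitrary):

```sql
-- Reserve ~20% of each heap page for successor tuple versions,
-- so that updates can stay on the same page and qualify as HOT.
ALTER TABLE my_table SET (fillfactor = 80);
-- The new fillfactor affects only newly written pages; VACUUM FULL
-- (or another rewrite) applies it to existing data.
VACUUM FULL my_table;
```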
Re: Index tuple deduplication limitations in pg13
On Tue, Aug 18, 2020 at 1:31 PM Matthias van de Meent wrote: > Would this extra ordering not effectively be an extra tiebreaker in > the ordering, applied before the TID? I do not know the full > implications of that, but I believe that would not result in the > limitations that you are mentioning. You could probably do it that way, but again you end up with a lot of new complexity. Not to mention overhead that would have to be paid by everyone. It would require code that supported the old way (even if it was added to Postgres 13) for pg_upgrade, which would also be hard to test. And it might defeat certain future optimizations based on heap TID being the only tiebreaker. Having two types of equality might have to bleed into the optimizer. It's a question of engineering trade-offs. I don't think that it's worth it. -- Peter Geoghegan
Re: Index tuple deduplication limitations in pg13
On Tue, Aug 18, 2020 at 11:52 AM Matthias van de Meent wrote: > Given that the above could work, the current btree tuple ordering is > not optimized for opclass-equal but datum image-distinct values: > ordering of opclass-equal values is currently determined only by tid, > with as an example current ordering ['0.0', '0', '0.00', '0', '0.0', > '0']. It would be more optimized for deduplication if that was stored > as e.g. ['0', '0', '0', '0.0', '0.0', '0.00'], which is why I > suggested to add an ordering by the datum image before the tid > ordering. Additionally, this extra ordering also prevents the problem > of [0] by never attempting an insertion of non-equal image datums in a > posting list of otherwise equal values, as it would be ordered either > before or after the posting list, never inside the list. Yeah, that would work, but at the cost of making numeric totally unusable. Now you cannot rely on unique enforcement detecting that '0' is a duplicate of '0.0'. In fact, even the most trivial use of the = operator will be broken in the presence of different display scales. It's a non-starter. The numeric2 design that I sketched is a bit ugly, but I can see no better way. A three-way posting list split (i.e. the other design that you sketched) is a special case that is very hard to test, very complicated, and of little value in the grand scheme of things. -- Peter Geoghegan
Re: Index tuple deduplication limitations in pg13
On Tue, Aug 18, 2020 at 11:52 AM Matthias van de Meent wrote: > Deduplication does not need to destroy semantic differences? 'equal' > can (in my book) mean: > - 'opclass-equal', that is the opclass returns true for an equality check > - 'binary equal' or 'datum-equal' (? maybe incorrect term), that is > the unprocessed on-disk representations (datum image is the right term > I believe?) of the compared values are indistinguishable. > > Runs of 'binary equal' datums can be freely deduplicated [0] when found. > [0] > Inserting a row in a deduplicated index with in, with TID ntid, can > encounter a posting list of a opclass-equal but not datum image-equal > tuples where the lowest TID of the posting list is less than ntid, and > ntid is less than the highest TID of the posting list. This would > require a posting list split to accomodate the new tuples' index entry > in order to not lose data. But you can't do that easily, because it breaks subtle assumptions about posting list splits and space utilization. In particular, it means that you can no longer think of a posting list split as rewriting an incoming new item such that you can more or less pretend that there was no overlap in the first place -- code like _bt_split and nbtsplitloc.c relies on this. Introducing new special cases to nbtsplitloc.c is very unappealing. More concretely, if you introduce a posting list split like this then you need three copies of the key -- the original, the new, and a second copy of the original. That's much more complicated. -- Peter Geoghegan
Re: Index tuple deduplication limitations in pg13
On Tue, Aug 18, 2020 at 9:44 AM Peter Geoghegan wrote: > If we wanted to fix this for numeric, we'd have to invent a new > numeric datatype (called numeric2, say). That probably isn't as hard > as it sounds, since it could be part of the same B-Tree operator > family as numeric. It could also be implicitly cast to numeric. I forgot to say: numeric2 would be just like numeric, except in one specific way: it wouldn't care about display scale. The user would be giving up on display scale by choosing numeric2 over numeric. The "5 vs 5.000" information would always be lost by design, so there'd be nothing for deduplication to break. Deduplication could then be enabled. -- Peter Geoghegan
Re: Index tuple deduplication limitations in pg13
On Mon, Aug 17, 2020 at 11:44 PM Matthias van de Meent wrote: > But, if the ordering of operator-class equal tuples is already > system-defined, could the physical ordering of index tuples in a btree > (with deduplication enabled for "unsafe" opclasses) be updated from > [index_columns, tid] to [index_columns, > image_compare(non_datum_equal_columns), tid], giving a stable sorting > of opclass-equal and image-equal values and enabling safe consistent > deduplication? The issue isn't the physical ordering. The issue is that we cannot allow the implementation to destroy semantic differences among equal datums. We avoid deduplication with cases where two equal datums are visibly different. For example, it would not be okay if we forgot that your numeric datum was originally input as '5.000', and output '5' later on. If we wanted to fix this for numeric, we'd have to invent a new numeric datatype (called numeric2, say). That probably isn't as hard as it sounds, since it could be part of the same B-Tree operator family as numeric. It could also be implicitly cast to numeric. -- Peter Geoghegan
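The numeric display scale issue is easy to demonstrate:

```sql
-- Equal according to the numeric opclass...
SELECT '5.000'::numeric = '5'::numeric;            -- returns true
-- ...and yet visibly different on output:
SELECT '5.000'::numeric::text, '5'::numeric::text; -- '5.000', '5'
```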
Re: 12.3 replicas falling over during WAL redo
On Mon, Aug 3, 2020 at 2:35 PM Alvaro Herrera wrote: > You can use pageinspect's page_header() function to obtain the page's > LSN. You can use dd to obtain the page from the file, > > dd if=16605/16613/60529051 bs=8192 count=1 seek=6501 of=/tmp/page.6501 Ben might find this approach to dumping out a single page image easier, since it doesn't involve relfilenodes or filesystem files: https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD#contrib.2Fpageinspect_page_dump -- Peter Geoghegan
Re: 12.3 replicas falling over during WAL redo
On Sun, Aug 2, 2020 at 9:39 PM Kyotaro Horiguchi wrote: > All of the cited log lines seem suggesting relation with deleted btree > page items. As a possibility I can guess, that can happen if the pages > were flushed out during a vacuum after the last checkpoint and > full-page-writes didn't restored the page to the state before the > index-item deletion happened(that is, if full_page_writes were set to > off.). (If it found to be the cause, I'm not sure why that didn't > happen on 9.5.) There is also a Heap/HOT_UPDATE log line with similar errors. -- Peter Geoghegan
Re: 12.2: Why do my Redo Logs disappear??
On Mon, Jun 8, 2020 at 5:17 PM Peter wrote: > Loosing a RedoLog is very bad, because there is no redundancy, > loosing a single one of them makes the timeline disappear and it > will only reappear after another Base Backup. Very very bad. >In this case, it seems, Postgres will delete the current log >without archiving it. :( I strongly suspect that you were hit by the bug fixed in commit 4e87c483. You should upgrade to Postgres 12.3 ASAP, to get that fix: "Avoid premature recycling of WAL segments during crash recovery (Jehan-Guillaume de Rorthais) WAL segments that become ready to be archived during crash recovery were potentially recycled without being archived." Sorry that you were affected by this bug -- it really sucks. -- Peter Geoghegan
Re: create index insist on 2 workers only
On Fri, Apr 24, 2020 at 7:32 AM Radoslav Nedyalkov wrote: > We 're rebuilding a big table which has set parallel_workers = 6 > system has > max_parallel_maintenance_workers | 6 | > /var/lib/pgsql/11/data/postgresql.sumup.conf > max_parallel_workers | 16 | > /var/lib/pgsql/11/data/postgresql.sumup.conf > max_parallel_workers_per_gather | 4 | > /var/lib/pgsql/11/data/postgresql.sumup.conf > > Also session level on index restore there is > set max_parallel_maintenance_workers = 6; > > Still we get only 2 parallel processes in a free of any other load system. > It is postgres 11.7 Try increasing maintenance_work_mem from the default of 64MB. MWM constrains the number of parallel workers used. -- Peter Geoghegan
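For example, before the rebuild (names and the memory figure are illustrative):

```sql
-- Parallel CREATE INDEX requires each participant to get a minimum
-- slice of maintenance_work_mem; with the 64MB default, the planned
-- number of workers gets scaled back accordingly.
SET maintenance_work_mem = '1GB';
SET max_parallel_maintenance_workers = 6;
CREATE INDEX ON big_table (some_column);
```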
Re: Querying an index's btree version
On Wed, Mar 11, 2020 at 2:13 PM Darren Lafreniere wrote: > when you restore a DB from a backup, does the restored index use the old > format or the latest one? If you use pg_restore, it uses the latest index format. If you're using pg_upgrade, the version won't change unless and until you REINDEX. This includes cases where you're running pg_upgrade against a restored physical backup. -- Peter Geoghegan
Re: Querying an index's btree version
On Wed, Mar 11, 2020 at 1:26 PM Darren Lafreniere wrote: > We've read that PG 12 has improved btree index support, and that the latest > internal btree version was bumped from 3 to 4. Is it possible to query the > btree version that a particular index is using? We'd like to automatically > start a concurrent re-index if we detect any btree indexes are still on > version 3. It's possible, but you have to install the superuser-only pageinspect extension. Here is how you'd determine that an index called 'pg_aggregate_fnoid_index' is on version 4: regression=# create extension pageinspect; CREATE EXTENSION regression=# select version from bt_metap('pg_aggregate_fnoid_index'); version - 4 (1 row) -- Peter Geoghegan
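Extending that, a sketch of a query that flags every B-Tree index in the current database still on version 3 or earlier (assumes pageinspect is installed and a superuser connection; the regclass cast keeps it working outside the search_path):

SELECT c.oid::regclass AS index_name,
       (bt_metap(c.oid::regclass::text)).version AS btree_version
FROM pg_class c
JOIN pg_am am ON am.oid = c.relam
WHERE c.relkind = 'i'
  AND am.amname = 'btree'
  AND (bt_metap(c.oid::regclass::text)).version < 4;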
Re: Query returns no results until REINDEX
On Sun, Feb 9, 2020 at 12:50 PM Colin Adler wrote: > Looks like it found something. I checked out the contrib/pageinspect docs but > wasn't too sure what to run. Are incompatible libc versions causing btree > corruption something you consider a bug? If it's something you'd like to look > into further I can gladly send over the database files. No, this is not considered a bug. It's unfortunate that there is no built in collation versioning mechanism, or something like that -- that might have alerted you to the problem before any real damage occurred. We have that for the ICU collations, but it currently isn't possible to use ICU as the default collation provider. You really have to go out of your way to use ICU collations. -- Peter Geoghegan
Re: Query returns no results until REINDEX
On Fri, Feb 7, 2020 at 3:52 PM Colin Adler wrote: > Seems to work now. My question is, is this something I should report to the > maintainers? I am one of the people that maintains the B-Tree code. You didn't mention what version of Postgres you're using here. That could be important. Please let us know. Mention the minor component of the release version, too (i.e. say 12.2, not just 12). > I took a snapshot of the data folder before the reindex in case it > would be helpful. Is index corruption something that should be actively looked > out for? Yes -- it's worth actively looking out for. If I had to guess, I'd say that this has something to do with upgrading the operating system to use a different, incompatible glibc. Or perhaps it has something to do with streaming replication between machines with different glibc versions. You should try running contrib/amcheck, which should be able to isolate index corruption, and give you a specific complaint. You may then be able to inspect the exact index page with the problem using contrib/pageinspect. Something like this ought to do it on Postgres 11 or 12:

CREATE EXTENSION IF NOT EXISTS amcheck;
SELECT bt_index_check('my_index', true);

If that doesn't show any errors, then perhaps try this:

SELECT bt_index_parent_check('my_index', true);

If you're on Postgres 10, then you should leave out the second argument, "true", since that version doesn't have the extra heapallindexed check. Let us know what you see. -- Peter Geoghegan
Re: Rows violating Foreign key constraint exists
On Fri, Nov 29, 2019 at 7:23 AM Tom Lane wrote: > The most likely "corruption" explanation is something wrong with the > indexes on the referenced and/or referencing column, causing rows to > not be found when referential actions should have found them. Random > querying of the tables wouldn't necessarily expose that --- you'd need > to be sure that your queries use the questionable indexes, and maybe > even search for some of the specific rows that seem mis-indexed. Or try using contrib/amcheck, which is available in Postgres 10. Perhaps try the query here, modified to verify all B-Tree indexes (not just those indexes in the pg_catalog schema): https://www.postgresql.org/docs/10/amcheck.html -- Peter Geoghegan
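A sketch of that modification to the documented query, with the pg_catalog schema filter removed so that all B-Tree indexes in the current database are verified:

CREATE EXTENSION IF NOT EXISTS amcheck;

SELECT bt_index_check(index => c.oid), c.relname, c.relpages
FROM pg_index i
JOIN pg_opclass op ON i.indclass[0] = op.oid
JOIN pg_am am ON op.opcmethod = am.oid
JOIN pg_class c ON i.indexrelid = c.oid
WHERE am.amname = 'btree'
ORDER BY c.relpages DESC;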
Re: PG11 Parallel Thanks!!
On Thu, Oct 3, 2019 at 10:31 AM Jason Ralph wrote: > The end of month process that we run at my company was a pg_dump and > pg_restore of 3 tables, these tables are around ~(400GB) each. The entire > process on pg93 took 29 hours. > > The index creation portion of the restore on the target pg9.3 database took: > 5) time: -15 hours -4 minute ((-54264 % 60)) seconds > > The index creation of the restore on the target db after pg11 upgrade on > source and target took: > 5) time: -5 hours -7 minute ((-18434 % 60)) seconds > > We saved 10 hours!! The sort code received many improvements over the years, really starting in 9.5, and continuing in 9.6, 10 and 11. FWIW, I think that that was probably the biggest factor here. Though parallel CREATE INDEX will have helped as well. -- Peter Geoghegan
Re: Clarification on the release notes of postgresql 12 regarding pg_upgrade
On Fri, Oct 4, 2019 at 9:09 AM Tom Lane wrote: > > You can't REINDEX safely regarding that note. > > Actually running into that problem is quite unlikely; and if you did > hit it, it'd just mean that the REINDEX fails, not that you have any > urgent problem to fix. I'd encourage you to just go ahead and REINDEX, > if you have indexes that could benefit from the other changes. Right. It is hard to imagine an application that evolved to fully rely on the previous slightly higher limit, and cannot tolerate a reduction in the limit by only 8 bytes. The limit applies to a tuple *after* TOAST compression has been applied. -- Peter Geoghegan
Re: PG11 Parallel Thanks!!
On Wed, Oct 2, 2019 at 8:41 AM Jason Ralph wrote: > Since pg11 on both the target and source, the run time has decreased a lot, I > chalk it up to the parallel index creations in pg11 which was a very time > consuming process on pg9.3. > The process has finished almost 10 hours earlier than pg93. So thank you for > your hard work and dedication to this awesome piece of software. How long did it take on 9.3? I am the author of the parallel CREATE INDEX feature. It's good to get feedback like this. -- Peter Geoghegan
Re: Whan is it safe to mark a function PARALLEL SAFE?
On Sun, Sep 8, 2019 at 12:27 PM Tom Lane wrote: > > If an error is raised in one parallel worker, does this > > cause the other parallel workers to be immediately terminated? > > I think not, though I didn't work on that code. The error will > be reported to the parent backend, which will cause it to fail > the query ... but I think it just waits for the other worker > children to exit first. That's not something to rely on of course. > Even if we don't make an attempt to cancel the other workers today > we probably will in future. But the cancel attempt would certainly > be asynchronous, so I'm not sure how "immediate" you are worried > about it being. If workers call CHECK_FOR_INTERRUPTS() frequently, which they should, then it should appear to users as if raising an error in one worker kills everything almost immediately, or immediately. For example, if a parallel CREATE INDEX has a worker that raises a unique violation error, that must work in a way that at least *appears* to be very similar to what the user would get with a serial CREATE INDEX. (The worst that can happen is that they'll very occasionally get two unique violation errors instead of one, or something like that.) That said, there are theoretical failures where it could take rather a long time for the parent/leader to get the memo -- see WaitForParallelWorkersToAttach() and its caller (note that anything using a gather node is subject to the same kind of failure that WaitForParallelWorkersToAttach() handles, even though they won't call the function themselves). These failures (e.g. fork() failure) are generally assumed to be rare to non-existent, though. Users will surely be upset if parallel queries cannot be cancelled almost immediately. If it happened with any regularity, somebody would have complained by now. As Tom said, it's hard to give a useful answer without more context -- how do you define "immediate"? -- Peter Geoghegan
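The CHECK_FOR_INTERRUPTS() discipline described above is ordinary cooperative cancellation: each worker polls a cancellation flag at safe points in its main loop, so a request raised elsewhere is noticed within one iteration. A minimal sketch of the pattern in Python (purely illustrative -- this is not PostgreSQL code, and all names here are made up):

```python
import threading

cancel_requested = threading.Event()  # set when any worker errors out

def worker(items, results):
    for item in items:
        # The moral equivalent of CHECK_FOR_INTERRUPTS(): poll at a
        # safe point on every iteration, so cancellation *appears*
        # immediate even though it is delivered asynchronously.
        if cancel_requested.is_set():
            return  # unwind cleanly; the leader reports the error
        results.append(item * 2)

results = []
t = threading.Thread(target=worker, args=(range(1_000_000), results))
cancel_requested.set()  # simulate an error raised before this worker runs
t.start()
t.join()
print(len(results))  # -> 0: the worker exited on its first poll
```

The same structure explains why a worker that polls infrequently (or is stuck in a system call, as in the fork()-failure cases above) delays the whole cancellation.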
Re: Question from someone who is not trained in computer sciences
On Thu, Sep 5, 2019 at 2:00 PM Judith Lacoste wrote: > I think PostgreSQL is the solution for my needs, but I am not a > programmer/coder. I don't think that it's restricted to people that are computer scientists. At least, I certainly hope it isn't. SQL was originally supposed to be something that is usable by domain experts/analysts, rather than by computer people (that was at a time when the divide was far larger than it is today). > I plan to install the database on a server in the office. Me and my four > colleagues will occasionally connect to this database when we are working in > other locations (usually hospitals or universities). In such remote > locations, we often do not have internet/network, yet we still need to access > the database. Currently, we use a system where a copy of the database lives > on each of our laptops. We can access all the information in the database > despite being offline. This local copy of the database is synchronized with > the server once network becomes available again. > > My question is whether or not such set up is possible with PostgreSQL? Since you're a biologist, you may like to play around with the Mouse Genome database using PostgreSQL: http://www.informatics.jax.org/downloads/database_backups/ Any supported version of PostgreSQL will work. You'll need to use pg_restore to restore the databases. Something like this will do it:

pg_restore -d mgd /path/to/mgd.postgres.dump

(I'm not sure what operating system you'll use -- something similar to this invocation ought to work on Windows through cmd.exe, though.) From there, you can play around with the database using a GUI tool such as pgAdmin. I sometimes use this database to test certain things, since it's the only example of a living, breathing PostgreSQL database that you can just download that I am aware of. Its schema is probably not an exemplar of good design, but it does seem reasonably well thought out. 
I'm not a domain expert, though, so I can't really be sure how good it is. The nice thing about this approach is that you can figure it out using a "top down" approach, by first understanding how the database is used in practical terms, and then filling in the details of how the application that it backs uses the database. Last I checked, restoring this database will take about 30GB of disk space on top of the dump file itself. -- Peter Geoghegan
Re: Corrupt index stopping autovacuum system wide
On Thu, Jul 18, 2019 at 9:06 AM Aaron Pelz wrote: > It's a simple btree expression on a geometry(Point,4326) , no expression no > partial no composite. The cause of the corruption may be a bug in a Postgis B-Tree operator class. I reported a bug in the Geography type that could lead to corrupt B-Tree indexes (not Geometry): https://trac.osgeo.org/postgis/ticket/3841 Though I also see what could be a comparable bug in Geometry: https://trac.osgeo.org/postgis/ticket/3777 These bugs are from about 3 years ago. If I'm right you should be able to isolate the bug using amcheck. -- Peter Geoghegan
Re: Corrupt index stopping autovacuum system wide
On Wed, Jul 17, 2019 at 10:27 AM Peter Geoghegan wrote: > > It's possible that amcheck would have given you an accurate diagnosis > > of the problem -- especially if you used bt_index_parent_check(): > > > > https://www.postgresql.org/docs/current/amcheck.html > > BTW, be sure to use the 'heapallindexed' option with > bt_index_parent_check() to detect missing downlinks, which is exactly > the problem that VACUUM complained about. Can you tell us more about this index? Can you share its definition (i.e. what does \d show in psql)? Is it an expression index, or a partial index? A composite? What datatypes are indexed? Thanks -- Peter Geoghegan
Re: Corrupt index stopping autovacuum system wide
On Wed, Jul 17, 2019 at 11:43 AM Alvaro Herrera wrote: > This might make things worse operationally, though. If searches aren't > failing but vacuum is, we'd break a production system that currently > works. If searches aren't failing and VACUUM works, then that's probably down to dumb luck. The user's luck could change at any time (actually, it's quite possible that the index is already giving wrong answers without anybody realizing). That's not always true, of course -- you could have an OOM condition in VACUUM, where it really does make sense to retry. But it should be true for the category of errors where we behave more aggressively than just giving up, such as the "failed to re-find parent key" error Aaron noticed. > Well, vacuum knows what index is being processed. Maybe you're thinking > that autovac can get an out-of-memory condition or something like that; > perhaps we can limit the above only when an ERRCODE_DATA_CORRUPTED > condition is reported (and make sure all such conditions do that. As > far as I remember we have a patch for this particular error to be > reported as such.) I don't think that it would be that hard to identify errors that nbtree VACUUM could throw that clearly indicate corruption, without any hope of the problem self-correcting without the DBA running a REINDEX. There will be a small amount of gray area, perhaps, but probably not enough to matter. > > (c) automatically disabling constraint indexes seems less than desirable. > > Disabling them for writes, yeah. I think that it's fair to say that all bets are off once you see the "failed to re-find parent key" error, or any other such error that indicates corruption. Admittedly it isn't 100% clear that disabling constraint enforcement to unblock autovacuum for the whole cluster is better than any available alternative; it's really hard to reason about things when we already know that the database has corruption. 
I think that it's okay that almost anything can break when somebody creates an index on a non-immutable expression (including VACUUM), provided that all the problems only affect the table with the broken index. OTOH, taking down the entire Postgres cluster as an indirect consequence of one person's ill-considered CREATE INDEX really sucks. That distinction seems important to me. -- Peter Geoghegan
Re: Corrupt index stopping autovacuum system wide
On Wed, Jul 17, 2019 at 10:27 AM Tom Lane wrote: > Right, you're eventually going to get to a forced shutdown if vacuum never > succeeds on one table; no question that that's bad. It occurs to me that we use operator class/insertion scankey comparisons within page deletion, to relocate a leaf page that looks like a candidate for deletion. Despite this, README.hot claims: "Standard vacuuming scans the indexes to ensure all such index entries are removed, amortizing the index scan cost across as many dead tuples as possible; this approach does not scale down well to the case of reclaiming just a few tuples. In principle one could recompute the index keys and do standard index searches to find the index entries, but this is risky in the presence of possibly-buggy user-defined functions in functional indexes. An allegedly immutable function that in fact is not immutable might prevent us from re-finding an index entry" That probably wasn't the problem in Aaron's case, but it is worth considering as a possibility. > My concern here is > that if we have blinders on to the extent of only processing that one > table or DB, we're unnecessarily allowing bloat to occur in other tables, > and causing that missed vacuuming work to pile up so that there's more of > it to be done once the breakage is cleared. If the DBA doesn't notice the > problem until getting into a forced shutdown, that is going to extend his > outage time --- and, in a really bad worst case, maybe make the difference > between being able to recover at all and not. The comment about "...any db at risk of Xid wraparound..." within do_start_worker() hints at such a problem. Maybe nbtree VACUUM should do something more aggressive than give up when there is a "failed to re-find parent key" or similar condition. Perhaps it would make more sense to make the index inactive (for some value of "inactive") instead of just complaining. That might be the least worst option, all things considered. -- Peter Geoghegan
Re: Corrupt index stopping autovacuum system wide
On Wed, Jul 17, 2019 at 9:21 AM Peter Geoghegan wrote: > It's possible that amcheck would have given you an accurate diagnosis > of the problem -- especially if you used bt_index_parent_check(): > > https://www.postgresql.org/docs/current/amcheck.html BTW, be sure to use the 'heapallindexed' option with bt_index_parent_check() to detect missing downlinks, which is exactly the problem that VACUUM complained about. Hopefully the problem will be limited to the single index that you've already REINDEXed. In theory the same problem could be hiding in other indexes, though I don't consider that particularly likely. Note that bt_index_parent_check() requires a lock on the table that effectively blocks writes, but not reads, so verification may require planning or coordination. bt_index_check() doesn't have any of these problems, but also won't detect missing downlinks specifically. -- Peter Geoghegan
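Spelled out, the suggested verification might look like this (the index name is hypothetical; the heapallindexed argument requires Postgres 11+):

-- Blocks writes (but not reads) for the duration; detects missing
-- downlinks, plus heap tuples with no corresponding index entry:
SELECT bt_index_parent_check('my_index', heapallindexed => true);

-- Lighter-weight check with a less disruptive lock, but it cannot
-- detect missing downlinks:
SELECT bt_index_check('my_index', heapallindexed => true);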
Re: Corrupt index stopping autovacuum system wide
On Wed, Jul 17, 2019 at 9:57 AM Tom Lane wrote: > It looks that way, but how would a broken non-shared index have held up > autovacuuming in other databases? Maybe, as this one's xmin horizon > got further and further behind, the launcher eventually stopped > considering launching workers into any other databases? That seems > like a bad thing; it's postponing work that will need to be done > eventually. I don't know exactly how the launcher would behave offhand, but it's clear that not being able to VACUUM one table in one database (because it has a corrupt index) ultimately risks the availability of every database in the cluster. Many installations receive little to no supervision, so it may just be a matter of time there. That is certainly a bad thing. -- Peter Geoghegan
Re: Corrupt index stopping autovacuum system wide
On Wed, Jul 17, 2019 at 8:54 AM Aaron Pelz wrote: > To me it looks like a *single* corrupt index held up autovacuums across our > entire server, even other in other databases on the same server. Am I > interpreting this correctly? Yes -- that is correct. What PostgreSQL version are you on? Was this an INCLUDE index on PostgreSQL 11? > Would love guidance on diagnosing this type of thing and strategies for > preventing it. It's possible that amcheck would have given you an accurate diagnosis of the problem -- especially if you used bt_index_parent_check(): https://www.postgresql.org/docs/current/amcheck.html -- Peter Geoghegan
Re: after restore the size of the database is increased
On Mon, Jul 15, 2019 at 6:22 AM Luca Ferrari wrote: > What am I missing here? Sometimes B-Tree indexes can be *larger* after a REINDEX (or after they're recreated with a CREATE INDEX). It's not that common, but it does happen. There isn't actually a very large size difference here, so it seems worth comparing index size in detail. -- Peter Geoghegan
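A sketch of how that comparison might be done, run on both clusters and diffed (an ordinary catalog query; trim the LIMIT to taste):

-- Per-index size, largest first. Comparing this listing before and
-- after the restore shows exactly which indexes grew:
SELECT c.relname,
       pg_size_pretty(pg_relation_size(c.oid)) AS index_size
FROM pg_class c
WHERE c.relkind = 'i'
ORDER BY pg_relation_size(c.oid) DESC
LIMIT 20;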
Re: REINDEX : new parameter to preserve current average leaf density as new implicit FILLFACTOR
On Tue, Jul 9, 2019 at 9:04 PM Peter Geoghegan wrote: > ISTM that the simplest explanation here is that index fragmentation > (and even index size) is a red herring, and the real issue is that > you're suffering from problems similar to those that are described in > these old threads: > > https://www.postgresql.org/message-id/flat/20160524173914.GA11880%40telsasoft.com > https://www.postgresql.org/message-id/flat/520D6610.8040907%40emulex.com I can imagine why you found you needed to reduce fillfactor to get much of any benefit from a REINDEX. Page splits are inherently expensive, for one thing. Also, in this specific scenario a succession of page splits might hasten the index returning to having little correlation with the underlying table within each large group of duplicates. Splits on fillfactor 90 pages would make new space available for future insertions on earlier duplicate pages, mixing old and new rows together before long. -- Peter Geoghegan
Re: REINDEX : new parameter to preserve current average leaf density as new implicit FILLFACTOR
On Tue, Jul 9, 2019 at 3:18 PM John Lumby wrote: > > Or, it could be that range scan performance benefitted from reduced > > fragmentation, > > > > Yes, I think so. ISTM that the simplest explanation here is that index fragmentation (and even index size) is a red herring, and the real issue is that you're suffering from problems similar to those that are described in these old threads: https://www.postgresql.org/message-id/flat/20160524173914.GA11880%40telsasoft.com https://www.postgresql.org/message-id/flat/520D6610.8040907%40emulex.com There have been numerous reports from users with problems involving low cardinality indexes that gradually became less correlated with the underlying table over time. At least a couple of these users found that a periodic REINDEX temporarily fixed the problem -- see the first thread for an example. Postgres 12 maintains the heap/table sort order among duplicates by treating heap TID as a tiebreaker column, which may make REINDEXing totally unnecessary for you. It's harder to model this issue because the problem with heap TID order will only be seen when there is at least a moderate amount of churn. -- Peter Geoghegan
Re: REINDEX : new parameter to preserve current average leaf density as new implicit FILLFACTOR
On Tue, Jul 9, 2019 at 12:29 PM John Lumby wrote: > I was not thinking of a new command, just an extension of the existing > REINDEX > which would apply a fillfactor equal to current average page density, > by adding a preliminary step to sample that first. That would be a very different thing to REINDEX no matter how you spelt it, though. REINDEX creates a new index, from scratch, whereas you're talking about restructuring what's already there. > > I believe that this is a lot more important in systems that generally > > use clustered indexes, such as MS SQL Server. This kind of > > "fragmentation" isn't usually much of a problem when using Postgres. > > > We have found that, for an index which has both experienced large number of > page splits > and whose table has a large number of dead tuples (despite autovacuum), > REINDEX with FILLFACTOR set to current page_density does produce a > performance improvement, > and also does reduce future growth in number of pages. I don't have > numbers to > hand, and in fact not sure if any catalog view or pgstattuple tells me about > the proportion > of dead key-tids in the index itself (do you know of any source?) as opposed > to the table, > but based on that recollection, yes, REINDEX can reduce fragmentation. This could help the old "getting tired" behavior with many duplicates, by making the free space available in earlier leaf pages (those further to the left) that are full of duplicates -- the original average space utilization may reflect a very uneven distribution of free space overall. Or, it could be that range scan performance benefitted from reduced fragmentation, because your workload happened to be bottlenecked on large range scans. Though that seems unlikely. I believe that the effect that you identified is real, but at a minimum it's not clear why a REINDEX with a fillfactor to match the original leaf space utilization helped. It would be fairly difficult to figure it out for sure. 
If it was a problem with duplicates/"getting tired", then I'd expect the new v12 code will help a lot. > However we did not run a VACUUM command first. Maybe if we had run VACUUM > instead of > the REINDEX commands, we might have obtained the same degree of > improvement, I don't know. > I think this was Tom's point earlier on in this thread. It was. Tom's intuition about that matches my own, though I acknowledge that the old behavior with duplicates muddies the waters. > Correct me if I'm wrong but I believe whether an index is "clustered" or not > is not relevant for > this discussion because the clustering in that context is referring to > ordering of the > table pages, not the index pages. Right. > I believe it is quite possible to have a perfectly > "clustered" table whose clustering index is itself badly disorganized. Technically the two things are separate metrics, so that is theoretically possible, but it doesn't seem all that likely. It could happen with lots of non-HOT updates, where all new index tuples relate to the same logical row as some existing index tuple, causing many page splits despite there being no real change in the logical contents of the index. Even then, the table will itself lose much of its original order, so the index will become "unclustered" as it becomes fragmented. -- Peter Geoghegan
Re: REINDEX : new parameter to preserve current average leaf density as new implicit FILLFACTOR
On Tue, Jul 9, 2019 at 11:27 AM John Lumby wrote: > And the point of the REINDEX at that point (below) is to remove dead tuple > keys-tids > and reorganize those split pages back into physical order without losing the > freespace. VACUUM already removes the tuples, accounting for all overhead. You are right that it would be possible for us to "defragment" the pages, so that they'd be in sequential order on disk from the point of view of a whole index scan -- this is what the "leaf_fragmentation" statistic from pgstatindex() reports on. We could in principle come up with a way of moving pages around, which would have some modest benefit for certain types of queries (it wouldn't improve the heap/index correlation, though, which is far more important). That would either necessitate that the command acquire a disruptive lock on the index (i.e. no writes, just like regular REINDEX), or that we drastically rearchitect the B-Tree code to make it support this. Neither of which seem particularly appealing. I believe that this is a lot more important in systems that generally use clustered indexes, such as MS SQL Server. This kind of "fragmentation" isn't usually much of a problem when using Postgres. -- Peter Geoghegan
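For reference, a sketch of how to read the statistic mentioned above (the index name is hypothetical; requires the pgstattuple extension):

CREATE EXTENSION IF NOT EXISTS pgstattuple;

-- leaf_fragmentation: % of leaf pages whose physical next page is not
-- their logical right sibling. avg_leaf_density: % of leaf space in use.
SELECT leaf_fragmentation, avg_leaf_density
FROM pgstatindex('my_index');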
Re: REINDEX : new parameter to preserve current average leaf density as new implicit FILLFACTOR
On Tue, Jul 9, 2019 at 10:31 AM John Lumby wrote: > Yes, I see that. But surely "making splits occur less often" is a > desirable > objective in itself, is it not? And I believe that a parameter to > preserve the "steady-state" > density in high-traffic indexes would help achieve that goal, wouldn't you > agree? Anything that reliably reduces page splits without hurting space utilization is well worthwhile. I can't see how what you describe could have that effect, though. If you expect the leaf density to be the same after a REINDEX, then why bother at all? There is no reason to think that that will be more effective than simple vacuuming. -- Peter Geoghegan
Re: REINDEX : new parameter to preserve current average leaf density as new implicit FILLFACTOR
On Mon, Jul 8, 2019 at 12:19 PM Peter Geoghegan wrote: > Well, you're still running autovacuum very aggressively here. It'll > easily keep up when run on a relatively small table such as this. Also, an exactly equal number of insertions and deletions is rather likely to result in bloated indexes in a way that probably isn't representative of many workloads. Even if the number of insertions was only slightly greater than the number of deletions, then the overall pattern would be one of continual growth, which is generally considered much more interesting. For far far more information on the topic than you want, see the paper "B-Trees with Inserts and Deletes: Why Free-at-Empty Is Better Than Merge-at-Half": https://www.sciencedirect.com/science/article/pii/002200009390020W The salient point made by the paper is that good space utilization rests on the assumption that there are fewer deletes than inserts, though maybe only slightly fewer: "The tendency of the utilization to remain near 69% can be explained by the following arguments: If there are even just a few more inserts than deletes, the B-tree will grow at the net insert rate (the rate of inserts minus the rate of deletes)." If the volume of data never grows past a certain point, then it's unlikely that the space utilization is very important. This may even be premature optimization. -- Peter Geoghegan
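As an aside, the ~69% steady-state figure quoted from the paper is the classic ln(2) result for B-Tree space utilization under random insertions. A quick numeric check (illustrative only):

```python
import math

# Under random insertions, pages split when full and are left roughly
# half-full, and the classic analysis puts steady-state B-Tree space
# utilization at ln(2) -- matching the ~69% quoted from the paper:
steady_state_utilization = math.log(2)
print(f"{steady_state_utilization:.1%}")  # -> 69.3%
```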
Re: REINDEX : new parameter to preserve current average leaf density as new implicit FILLFACTOR
On Mon, Jul 8, 2019 at 12:10 PM John Lumby wrote: > Actually the test workload does not run any explicit VACUUM command, > it relies on autovacuum with these settings > (same settings for 9.4 and 12beta2) > To correspond to your " more churn between each VACUUM" > Would you then suggest increasing > autovacuum_vacuum_cost_delay and/or autovacuum_vacuum_scale_factor? Well, you're still running autovacuum very aggressively here. It'll easily keep up when run on a relatively small table such as this. BTW, you should definitely run the latest point release of 9.4 -- not 9.4.6. You're missing years of bug fixes by sticking to such an old point release, including some rather nasty ones -- 9.4.23 is the current 9.4 point release. Actually, 9.4 is going to lose support this year, as the oldest stable version that's currently supported by the community. -- Peter Geoghegan
Re: REINDEX : new parameter to preserve current average leaf density as new implicit FILLFACTOR
On Mon, Jul 8, 2019 at 9:23 AM John Lumby wrote: > Overall, pg-12beta2 yielded a 6.7% reduction in sizes (total pages) of > indexes, which was most noticable with the 6 non-unique ones. > In fact the primary-key index was larger with pg-12. The avg_leaf_density was actually higher for the primary key index, so it looks like it really came out slightly ahead on v12. Perhaps you didn't take deleted_pages into account -- there must be free space that is reusable by the index that has yet to be reused. It would probably make sense to subtract that across the board. > Would you have expected better than 6.7%? I don't think that a test case that runs VACUUM when there are only 4300 deletions and 4300 insertions is particularly realistic, in general. You might see a larger difference if there was more churn between each VACUUM run. > Although a welcome improvement, I think it is not enough to justify stopping > use of setting a lower explicit FILLFACTOR. Which then brings me back to > thinking there is a case for the subject of this thread, an automatic way to > preserve density. I don't think that such an option would make much sense. The "waves of misery" paper is about smoothing out the frequency of page splits following bulk loading and a CREATE INDEX. It is not about making splits occur less often. It's well understood that a certain amount of free space is the overhead of B-Tree indexes, albeit an overhead that can be avoided in certain specific instances. > And one question : > I notice that in some pg-11 release, a new config parameter appeared : > vacuum_cleanup_index_scale_factor > I have not researched this at all and nor did I set it to anything for my > pg-12beta2 run, but it sounds as though maybe it could be relevant to > this kind of workload - Is that so? You seem to be worried about keeping indexes as small as possible. vacuum_cleanup_index_scale_factor won't help with that. -- Peter Geoghegan
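A sketch of the suggested accounting, using pgstattuple's pgstatindex() (the index name is hypothetical):

-- deleted_pages are free space that the index can reuse internally,
-- so subtract them when comparing index sizes across versions:
SELECT index_size,
       deleted_pages,
       leaf_pages,
       avg_leaf_density
FROM pgstatindex('table_pkey');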
Re: Postgresql 12 Beta2 Crashes for any Insert/Update
On Thu, Jun 27, 2019 at 1:28 AM Mohsen Bande wrote: > i have a working database in PG 12 Beta 1. today i upgraded it to Beta 2. > everything goes fine and server is up and running. but trying to > INSERT/UPDATE anything, server crashes: Is it possible for you to send us a stacktrace? https://wiki.postgresql.org/wiki/Generating_a_stack_trace_of_a_PostgreSQL_backend Thanks -- Peter Geoghegan
Re: REINDEX : new parameter to preserve current average leaf density as new implicit FILLFACTOR
On Wed, Jun 26, 2019 at 8:05 AM John Lumby wrote: > > There is a very recent research paper that discusses the idea of > > varying fillfactor with a view to ameliorating page splits: > > > > Thanks, I am chewing my way through that. As you say, it does address > exactly the issues I raised. > Do you happen to know if their source-code is available somewhere? > ( I did see their db is MS SQL Server but it still might provide some > portable ideas. ) It's just a research paper. It might never be implemented in any system. The point of the paper is to make page splits occur at a steady rate after REINDEX'ing -- not to eliminate or even reduce page splits. > > I suspect that you might find that the enhancements to B-Tree indexes > > that went into Postgres 12 would help with this workload, especially > > if you notice that this happens with indexes that have a lot of duplicates > > > > I had not noticed that, thanks for pointing it out. Yes, in my workload > most of the indexes in question are non-unique and some have very low key > card. I will try out the pg-12 beta when I get a chance. It's easy to show problems with very low cardinality indexes in the old code. You'll definitely notice a difference there. > Is there a pdf or text version? Just the talk slides: https://www.pgcon.org/2019/schedule/attachments/518_nbtree-arch-pgcon.pdf -- Peter Geoghegan
Re: REINDEX : new parameter to preserve current average leaf density as new implicit FILLFACTOR
On Tue, Jun 25, 2019 at 2:45 PM John Lumby wrote: > Background : For some workloads involving high volume of > INSERT/UPDATE/DELETE, It is sometimes beneficial > to schedule regular REINDEX of high-activity indexes, so as to improve > performance, or restore performance levels back to what it was earlier, by > removing dead keys etc. This can result in the average page density of > these indexes fluctuating up and down in a saw-tooth fashion, REINDEX > causing large increase in density (large drop in total number of pages) and > the workload gradually decreasing density back to some "steady-state". I suspect that you might find that the enhancements to B-Tree indexes that went into Postgres 12 would help with this workload, especially if you notice that this happens with indexes that have a lot of duplicates. For the full background, you might take a look at my pgCon talk: https://youtu.be/p5RaATILoiE Fair warning: this is a very technical talk. Does it seem at all possible that you were affected by either the issue with duplicates, or the issue that is addressed by the "split after new tuple" optimization? They're both causes of index bloat that VACUUM cannot usually prevent. -- Peter Geoghegan
Re: REINDEX : new parameter to preserve current average leaf density as new implicit FILLFACTOR
On Tue, Jun 25, 2019 at 2:56 PM Tom Lane wrote: > > This would avoid the saw-tooth effect on number of pages, and also reduce > > the number of index page-splits which occur during the period immediately > > following a REINDEX done with default FILLFACTOR of 90%. In effect, it > > lessens the need for the physical reorganization aspect of REINDEX and > > focusses more on the function of removing dead keys. > > I think you've confused REINDEX with VACUUM. It seems like a pretty poor > substitute for that --- it's much more expensive and has worse locking > requirements. There is a very recent research paper that discusses the idea of varying fillfactor with a view to ameliorating page splits: https://btw.informatik.uni-rostock.de/download/tagungsband/B2-2.pdf I found the paper to be fairly convincing. The general idea is to make page splits occur at a steady rate following a REINDEX, rather than having "waves" of page splits. This is quite different to changing leaf fillfactor based on the observed leaf density, though. You can already do that by looking at pgstattuple's pgstatindex() function, which reports an avg_leaf_density for the index. Though I agree that that's not likely to help matters. Apart from anything else, the steady state of an index is embodied by more than just its avg_leaf_density. Especially following the v12 enhancements to B-Tree indexes. -- Peter Geoghegan
Re: checkpoints taking much longer than expected
On Sat, Jun 15, 2019 at 1:50 AM Tiemen Ruiten wrote: > During normal operation I don't mind that it takes a long time, but when > performing maintenance I want to be able to gracefully bring down the master > without long delays to promote one of the standby's. Maybe an "immediate" mode shutdown is appropriate, then. That will mean that the primary will need to go through crash recovery if and when you bring it back up, though. -- Peter Geoghegan
Re: Table partition with primary key in 11.3
On Fri, Jun 7, 2019 at 2:35 PM Alvaro Herrera wrote: > I envision this happening automatically -- you drop the partition, a > persistent work item is registered, autovacuum takes care of it > whenever. The user doesn't have to do anything about it. We don't have to agree on anything now, but I think that it's possible that the page split thing will be very effective. Perhaps even so effective that it won't make much sense to vacuum global indexes just because there is a pending dropped partition. -- Peter Geoghegan
Re: Table partition with primary key in 11.3
On Fri, Jun 7, 2019 at 1:22 PM Alvaro Herrera wrote: > Somehow we ended up discussing this topic in a rather mistitled thread > ... oh well :-) (Nowadays I hesitate to change threads' subject lines, > because gmail). You can blame me for that, I think. > > It occurs to me that we could add a code path to nbtree page splits, > > that considered removing dropped partition tuples to avert a page > > split. This would be a bit like the LP_DEAD/kill_prior_tuple thing. > > Technically the space used by index tuples that point to a dropped > > partitions wouldn't become reclaimable immediately, but it might not > > matter with this optimization. > > This seems useful on the surface: you drop a partition, and slowly and > incrementally any index items that point to it are removed by processes > scanning the index. You can't rely solely on this, though: as pointed > out by Robert in the indirect index thread, doing this only means that > non-scanned parts of the index to retain entries for arbitrary long, > which is bad. Also, this adds latency to client-connected processes. Well, we don't have to rely on index scans to set the LP_DEAD bit in this case. We probably wouldn't do that at all. Rather, we'd have the page split code refer to a list of dropped partition numbers as targets for killing immediately. Maybe we'd hint the number of distinct partitions represented on the page, to make it a bit faster. > Because you can't rely on that exclusively, and you want to reuse the > partition ID eventually, you still need a cleanup process that removes > those remaining index entries. This cleanup process is a background > process, so it doesn't affect latency. I think it's not a good idea to > add latency to clients in order to optimize a background process. 
Ordinarily I would agree, but we're talking about something that takes place at the point that we're just about to split the page, that will probably make the page split unnecessary when we can reclaim as few as one or two tuples. A page split is already a very expensive thing by any measure, and something that can rarely be "undone", so avoiding them entirely is very compelling. Delaying a split will often prevent it altogether. We're already doing foreground processing, just by having page splits at all. Other DB systems that don't do much foreground processing will still do a certain amount of it if that avoids a split in some cases -- "Modern B-Tree techniques" mentions this, and suggests quite a number of ways that a split might be averted. > This way, when a partition is dropped, we have to take the time to scan > all global indexes; when they've been scanned we can remove the catalog > entry, and at that point the partition ID becomes available to future > partitions. It seems worth recycling partition IDs, but it should be possible to delay that for a very long time if necessary. Ideally, users wouldn't have to bother with it when they have really huge global indexes. > > The nbtree heap TID column and partition number column should probably > > be a single varwidth column (not two separate columns), that is often > > no wider than 6 bytes, but can be wider when there are many partitions > > and/or very large partitions. That will be challenging, but it seems > > like the right place to solve the problem. I think that I could make > > that happen. Maybe this same representation could be used for all > > nbtree indexes, not just global nbtree indexes. > > Maybe local nbtree indexes would have a partition ID of length 0, since > that many bits are necessary to identify which table is pointed to by > each index item. Right -- special cases are best avoided here. 
In general, we'd push as much of the new complexity as we can into this new TID-like table identifier, while requiring it to work with our existing requirements for TIDs, plus certain new requirements for global indexes (and maybe other new requirements, such as relations that are larger than 35GB). If the complexity is well-encapsulated, then it probably won't be too bad. Access methods would have to be okay with varwidth table identifiers, which is a big change, but they at least shouldn't have to worry about anything else breaking. They'd probably have a pg_attribute entry for the varwidth table identifier column, too (it would be the last column in every nbtree index). We'd expect a space efficient representation with real world relations, that at least matches what we get with heap TIDs today. This isn't quite as hard as it sounds. You don't have to be Claude Shannon to realize that it's kind of silly to reserve 16 bits for the offset number component of a TID/ItemPointer. We need to continue to support offset numbers that go that high, but the implementation would optimize for the common case where offset numbers are less than 512 (or maybe less than 1024). -- Peter Geoghegan
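The space-efficient representation described above (a combined partition number plus TID that is usually no wider than 6 bytes, optimized for small offset numbers) can be sketched with a base-128 varint. This is purely an illustrative toy, not a proposed on-disk format; the 11-bit offset field and the bit layout are my own assumptions for the sketch:

```python
def encode_tid(block, offset):
    """Sketch: pack a heap TID into a variable-width byte string.
    Assumption: offsets fit in 11 bits (MaxHeapTuplesPerPage is 291
    for 8KB pages), so small block numbers yield short encodings."""
    assert 0 < offset < 2048
    v = (block << 11) | offset  # combined value, low bits = offset
    out = bytearray()
    while True:  # little-endian base-128 varint
        byte = v & 0x7F
        v >>= 7
        if v:
            out.append(byte | 0x80)  # continuation bit set
        else:
            out.append(byte)
            return bytes(out)

def decode_tid(data):
    """Inverse of encode_tid: recover (block, offset)."""
    v = shift = 0
    for byte in data:
        v |= (byte & 0x7F) << shift
        shift += 7
    return v >> 11, v & 0x7FF
```

A TID early in a small table encodes in one or two bytes, while the fixed-width ItemPointer always takes six; a global index would prepend a partition number to the combined value in the same spirit.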
Re: Table partition with primary key in 11.3
On Fri, Jun 7, 2019 at 12:43 PM Alvaro Herrera wrote: > Well, "quickly" might mean within a week. If it takes that long to > fully remove a monthly partition to make that partition ID available to > some future month's partition, that seems acceptable. Blocking > DROP/DETACH for one hour is certainly not acceptable. I agree that synchronous clean-up of global indexes wouldn't make sense there, and might not be very compelling in practice. It occurs to me that we could add a code path to nbtree page splits, that considered removing dropped partition tuples to avert a page split. This would be a bit like the LP_DEAD/kill_prior_tuple thing. Technically the space used by index tuples that point to dropped partitions wouldn't become reclaimable immediately, but it might not matter with this optimization. > If this scheme means that you can keep the partition identifiers stored > in the index to, for instance, 10 bits (allowing for 1024 partitions to > exist at any one time, including those in the process of being cleaned > up) instead of having to expand to (say) 24 because that covers a couple > of years of operation before having to recreate the index, it seems > worthwhile. I think that we should have no inherent limit on the number of partitions available at once, on general principle. Limiting the number of partitions is a design that probably has a lot of sharp edges. The nbtree heap TID column and partition number column should probably be a single varwidth column (not two separate columns), that is often no wider than 6 bytes, but can be wider when there are many partitions and/or very large partitions. That will be challenging, but it seems like the right place to solve the problem. I think that I could make that happen. Maybe this same representation could be used for all nbtree indexes, not just global nbtree indexes. -- Peter Geoghegan
Re: Table partition with primary key in 11.3
On Fri, Jun 7, 2019 at 12:18 PM Alvaro Herrera wrote: > I was thinking of asynchronously cleaning it up rather than blocking > DROP/DETACH ... which means you need to keep state somewhere. I don't > think blocking DROP/DETACH is valuable -- a global index that blocks > DROP/DETACH until the index is clean serves no useful purpose. (You > could think of a multi-step approach with internal transaction commits, > similar to CIC, but you still need a plan to clean that up in case the > server crashes during that operation.) The advantages of synchronous clean-up of global indexes when DROP'ing a partition are that you can recycle the partition number (or whatever we end up calling it) immediately and predictably, and you can reuse the space in indexes occupied by keys from the dropped partition immediately and predictably. Those seem less useful than asynchronous processing on average, certainly, but they are still real advantages. You seemed to be particularly concerned about quickly recycling partition numbers when we drop a partition. I hope that we can come up with a very efficient on-disk representation for global index tuples, where only the bare minimum amount of space is used for partition numbers. Maybe it won't matter that much if partition numbers cannot be recycled due to this asynchronous processing. -- Peter Geoghegan
Re: Table partition with primary key in 11.3
On Fri, Jun 7, 2019 at 9:10 AM Alvaro Herrera wrote: > I think vacuuming for global indexes is somewhat challenging as well :-) > Maybe not as much as for indirect indexes, that's true. > > In order for it to be sustainable, I think you'll want to reuse > partition identifiers when the partitions are dropped/detached, which > means that you need a way to ensure that index entries to those > partitions are removed from all indexes. I'm not so sure about that. I see your point, but I think that you can also make the opposite argument. That is, you can make a good case for asynchronously cleaning up the dead entries that point to a dropped partition (probably within VACUUM). Perhaps we should offer *both* as options. -- Peter Geoghegan
Re: Table partition with primary key in 11.3
On Thu, Jun 6, 2019 at 3:00 PM David Rowley wrote: > You may already be aware, but another use case for such variable-width > identifiers was with indirect indexes as discussed in [1] Right. I went with global indexes because indirect indexes are probably a lot more risky as a project. I'd be particularly concerned about the complexity of VACUUM there, whereas that doesn't seem all that bad in the case of global indexes. -- Peter Geoghegan
Re: Table partition with primary key in 11.3
On Thu, Jun 6, 2019 at 10:03 AM Alvaro Herrera wrote: > If you are saying that you think that Postgres should support primary > keys that don't necessarily overlap partition keys, then I agree with > you. Please send a patch to implement that capability. FWIW, I could probably be convinced to work on the nbtree parts of global indexes if there was a reasonably worked out design that had some momentum behind it. I would at least need to know what the partition identifiers will need to look like in indexes. Andres has suggested that I work on variable-width table identifiers in nbtree for the benefit of pluggable storage engines, but I don't have much enthusiasm for the idea of doing that without delivering a clear benefit to users in the same release. It seems pretty obvious to me that the right general approach for global indexes is to make the partition identifier a tiebreaker column that comes right before the heap TID tiebreaker column (which would be per-partition). It probably needs to be variable-width, so that the overhead isn't too bad most of the time. I don't think that it's necessary to remove the restriction on the size of relations for global indexes, though maybe global indexes could focus our attention on that problem. -- Peter Geoghegan
Re: Questions about btree_gin vs btree_gist for low cardinality columns
On Sun, Jun 2, 2019 at 4:07 PM Tom Lane wrote: > Anyway, I said to Jeremy in the hallway that it might not be that > hard to bolt IOS support onto GIN for cases where the opclass is > a non-subdividing one, but after looking at the code I'm less sure > about that. GIN hasn't even got an "amgettuple" code path, just > "amgetbitmap", and a big part of the reason why is the need to merge > results from the fastupdate pending list with results from the main > index area. Not sure how we could deal with that. I suspect that GIN also avoids several other problems by only offering an "amgetbitmap", and not an "amgettuple". For example, it doesn't have to worry about things like numeric display scale, where a datum is substantively different to another datum, while still being equal according to opclass semantics (this is an example that I end up citing in many discussions about indexing). I bet that there are a few more of those beyond those two, that I haven't thought about. > Anyway, the larger point here is that right now btree_gin is just a quick > hack, and it seems like it might be worth putting some more effort into > it, because the addition of duplicate-compression changes the calculus > for whether it's useful. There was also discussion about making nbtree support deduplication during the hallway track. Jim Finnerty reminded me that there is a patch from Anastasia that did deduplication in nbtree that didn't go anywhere. Heikki independently talked about the possibility that he would work on this project in the next release, without being prompted by me. I think that the fact that nbtree sorts duplicate entries in heap TID order these days makes that worth looking into again. We can use something like GIN's varbyte encoding process to compress duplicates effectively. A lot of the problems (the numeric display scale problem, pg_upgrade) can be avoided by defining deduplication as something that happens on a best-effort basis.
This is not the case within GIN, where it's impossible for the main entry tree to have duplicates without at least storing them in a posting list. -- Peter Geoghegan
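The "varbyte encoding process" mentioned above amounts to delta-compressing a sorted posting list. Here is a toy sketch of the idea, modeling TIDs as plain integers rather than GIN's actual on-disk (block, offset) format; because nbtree now keeps duplicates in heap TID order, successive deltas are small and typically fit in one byte:

```python
def varbyte_compress(sorted_tids):
    """Delta-encode a strictly increasing list of TIDs (modeled as
    ints), emitting each delta as a little-endian base-128 varint."""
    out = bytearray()
    prev = 0
    for tid in sorted_tids:
        delta = tid - prev
        prev = tid
        while True:
            byte = delta & 0x7F
            delta >>= 7
            if delta:
                out.append(byte | 0x80)  # more bytes follow
            else:
                out.append(byte)
                break
    return bytes(out)

def varbyte_decompress(data):
    """Inverse: accumulate varint deltas back into absolute TIDs."""
    tids, cur, acc, shift = [], 0, 0, 0
    for byte in data:
        acc |= (byte & 0x7F) << shift
        shift += 7
        if not (byte & 0x80):  # last byte of this delta
            cur += acc
            tids.append(cur)
            acc = shift = 0
    return tids
```

Five clustered TIDs like [1000, 1001, 1002, 1050, 9000] compress to 7 bytes here, versus 30 bytes as five fixed-width 6-byte ItemPointers, which is the kind of win that makes deduplication attractive.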
Re: Upgrading locale issues
On Wed, May 1, 2019 at 3:09 PM Thomas Munro wrote: > As discussed over on -hackers[1], I think it's worth pursuing that > though. FWIW I've proposed locale versioning for FreeBSD's libc[2]. > The reason I haven't gone further with that yet even though the code > change has been accepted in principle by FreeBSD reviewers is because > I got stuck on the question of how exactly to model the versions. If, > say, just Turkish changes, I don't want to be rebuilding my French > indexes, which means that I don't think you can use the CLDR version > string. The ICU versions can handle that, though. Importantly, ICU decouples implementation details from actual versioning. I must say that I am not enthused about the idea of trying to get libc people of any variety on board. I don't have an objection to it if it can work for FreeBSD, but I don't think it can scale. ICU is built around a culture that takes our concerns seriously already, which is what it boils down to. Also, we can imagine a package manager taking it upon themselves to vendor their own ICU, with platform-specific guarantees around stability. That seems like a nice option to have, at least. > There is also the question of how PostgreSQL should model versions, > and as I've argued in [1], I think we should track them at the level > of database object dependencies. I think you're probably right about that. > I'm hoping to reopen this can of worms for PostgreSQL 13 (and the > corresponding support could in theory be in FreeBSD 13... coincidence, > or a sign!?) Maybe we should do what Oracle did, and call it PostgreSQL 18c instead. Actually, any number that isn't of interest to numerologists will do. 
> Unfortunately you can't use ICU collations as a database default yet > (though there was some WIP code[3]), so ICU only saves you from > versioning problems if you explicitly set collations for columns or > expressions, and even then the version tracking is currently just a > warning that you clear manually with a command, not a mechanism that > really tracks which database objects were last rebuilt/validated with > a given version. Yes, that does seem like a big remaining weakness. -- Peter Geoghegan
Re: Upgrading locale issues
On Mon, Apr 29, 2019 at 7:45 AM rihad wrote: > Hi. Today we run pg_ctl promote on a slave server (10.7) and started > using it as a master. The OS also got upgraded FreeBSD 10.4 -> FreeBSD > 11.2. And you guessed it, most varchar indexes got corrupted because > system locale changed in subtle ways. So I created the extension amcheck > and reindexed all bad indexes one by one. Is there any way to prevent > such things in the future? Will switching to ICU fix all such issues? Not necessarily, but it will detect the incompatibility more or less automatically, making it far more likely that the problem will be caught before it does any harm. ICU versions collations, giving Postgres a way to reason about their compatibility over time. The libc collations are not versioned, though (at least not in any standard way that Postgres can take advantage of). > The problem with it is that ICU collations are absent in pg_collation, > initdb should be run to create them, but pg_basebackup only runs on an > empty base directory, so I couldn't run initdb + pg_basebackup to > prepare the replica server. I believe I can run the create collation > command manually, but what would it look like for en-x-icu? It is safe to call pg_import_system_collations() directly, which is all that initdb does. This is documented, so you wouldn't be relying on a hack. -- Peter Geoghegan
Re: Pg analog to DBCC CCHECKDB?
On Fri, Apr 5, 2019 at 7:33 AM Ron wrote: > In 9.6, does such a thing exist? (We just restored a VM from snapshot and I > want to verify the cluster sanity.) amcheck is available for versions 9.4+, though it only appears in contrib in Postgres 10. There are both yum and deb packages available. See: https://github.com/petergeoghegan/amcheck -- Peter Geoghegan
Re: CREATE COLLATION to match pg_collation data
On Sat, Jan 12, 2019 at 7:13 AM Tom Lane wrote: > That's an ICU collation, so you're out of luck: there is no ICU > support in 9.6. FWIW, there is some form of FreeBSD Postgres support for ICU that predates the officially supported ICU feature. The FreeBSD Postgres packages have been built with ICU support for many years -- they modify the source code minimally to make this work. It may well still be impossible to use "az-x-icu" on a FreeBSD installation of 9.6, though. -- Peter Geoghegan
Re: Code of Conduct plan
On Fri, Sep 14, 2018 at 11:06 AM, Dimitri Maziuk wrote: > Personally I would like that. Others might prefer an invitation to > unsubscribe or forever hold their peace, I could live with that too, but > I believe explicit opt-ins are preferable to opt-outs. I think that it's a legitimate position to be opposed to a CoC like this. I also think it's legitimate to feel so strongly about it, on philosophical or political grounds, that you are compelled to avoid participating while subject to the CoC. FWIW, the latter position seems rather extreme to me personally, but I still respect it. In all sincerity, if you're compelled to walk away from participating in mailing list discussions on a point of principle, then I wish you well. That is your right. -- Peter Geoghegan
Re: Code of Conduct plan
On Fri, Sep 14, 2018 at 10:31 AM, Dimitri Maziuk wrote: > So let me get this straight: you want to have a "sanctioned" way to deny > people access to postgresql community support channel? Yes. > "Because > somebody who may or may not be the same person, allegedly said something > somewhere that some other tweet disagreed with on faceplant"? > > Great plan if you do for-pay postgresql support for the living. You can make your own conclusions about my motivations, just as I'll make my own conclusions about yours. I'm not going to engage with you on either, though. -- Peter Geoghegan
Re: Code of Conduct plan
On Fri, Sep 14, 2018 at 7:19 AM, Joshua D. Drake wrote: > Sure and that is unfortunate but isn't it up to the individual to deal with > it through appropriate channels for whatever platform they are on? All of > these platforms are: > > 1. Voluntary to use > 2. Have their own Terms of Use and complaint departments > 3. If it is abuse there are laws > > I agree that within Postgresql.org we must have a professional code of > conduct but the idea that an arbitrary committee appointed by an unelected > board can decide the fate of a community member based on actions outside of > the community is a bit authoritarian don't you think? The choice of the committee members is hardly arbitrary. Having committee members be appointed by core is more or less consistent with how the community has always dealt with disciplinary issues. The criteria used by core were discussed quite openly. While the risk that the committee will wield their power in an "authoritarian" way seems very small, it cannot be ruled out entirely. In fact, it hasn't been ruled out by the draft CoC itself. No CoC can possibly provide for every conceivable situation. Somebody has to interpret the rules, and it has to be possible to impose sanctions when the CoC is violated -- otherwise, what's the point? There are several checks and balances in place, and I for one have confidence in the process as outlined. It's imperfect, but quite a lot better than either the status quo, or a platitude about inclusivity. -- Peter Geoghegan