Re: Add 64-bit XIDs into PostgreSQL 15
The following review has been posted through the commitfest application:

    make installcheck-world:  not tested
    Implements feature:       not tested
    Spec compliant:           not tested
    Documentation:            not tested

As someone who has faced transaction ID wraparound in the past, I have a very serious concern about the current patch set. I can start by saying I think it would be helpful (if the other issues are approached reasonably) to have 64-bit XIDs, but there is an important piece of context in preventing XID wraparounds that seems missing from this patch, unless I missed something.

XID wraparound is a symptom, not an underlying problem. It usually occurs when autovacuum or other vacuum strategies have unexpected stalls and therefore fail to work as expected. Shifting to 64-bit XIDs dramatically changes the sorts of problems that these stalls are likely to pose to operational teams -- you can find you are running out of storage rather than facing an imminent database shutdown. Worse, this patch delays the problem until some (possibly far later!) time, when vacuum will take far longer to finish, and options for resolving the problem are diminished. As a result I am concerned that merely changing XIDs from 32-bit to 64-bit will lead to a smaller number of far more serious outages.

What would make a big difference from my perspective would be to combine this with a system for warning that there is a problem, allowing the administrator to throw warnings about XIDs consumed since the last vacuum, with a configurable threshold. We could have this at two billion by default, as that would raise operational warnings not much later than we get them now. Otherwise I can imagine cases where, instead of 30 hours to vacuum a table, it takes 300 hours on a database that is short on space. And I would not want to be facing such a situation.

The new status of this patch is: Waiting on Author
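For reference, the warning behavior described above can already be approximated from today's catalogs with the age() function; the two-billion threshold here is the illustrative default suggested above, not something the patch provides:

```sql
-- Approximate "XIDs consumed since the last aggressive vacuum" per database;
-- past_warning_threshold flags databases over the suggested 2-billion mark.
SELECT datname,
       age(datfrozenxid) AS xid_age,
       age(datfrozenxid) > 2000000000 AS past_warning_threshold
FROM pg_database
ORDER BY xid_age DESC;
```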
Re: Fix comments atop pg_get_replication_slots
Amit, thanks for looking into this!

On Sun, Nov 20, 2022 at 11:38 PM Amit Kapila wrote:
> On Mon, Nov 21, 2022 at 12:45 PM sirisha chamarthi wrote:
> >
> > Hi Hackers,
> >
> > The comments atop seem to indicate that it is only showing active
> > replication slots. The comment is ambiguous as it also shows all the slots
> > including lost and inactive slots. Attached a small patch to fix it.
> >
>
> I agree that it is a bit confusing. How about "SQL SRF showing all
> replication slots that currently exist on the database cluster"?

Looks good to me. Attached a patch for the same.

> --
> With Regards,
> Amit Kapila.

0002-Fix-atop-pg_get_replication_slots-function-to-reflec.patch
Description: Binary data
Re: Reducing power consumption on idle servers
On Sun, 20 Nov 2022 at 22:55, Nathan Bossart wrote:
>
> On Mon, Nov 21, 2022 at 10:31:15AM +1300, Thomas Munro wrote:
> > On Mon, Nov 21, 2022 at 9:00 AM Simon Riggs wrote:
> >> As a 3rd patch, I will work on making logical workers hibernate.
> >
> > Duelling patch warning: Nathan mentioned[1] that he's hacking on a
> > patch for that, along the lines of the recent walreceiver change IIUC.
>
> I coded something up last week, but, like the walreceiver patch, it caused
> check-world to take much longer [0], and I haven't looked into whether it
> could be easily fixed. I'm hoping to make some time for this again in the
> near future.
>
> [0] https://postgr.es/m/20221113222644.GA1269110%40nathanxps13

OK, Nathan, will leave this one to you - remembering that we need to
fix ALL processes to get a useful power reduction when idle.

--
Simon Riggs
http://www.EnterpriseDB.com/
Re: Fix comments atop pg_get_replication_slots
On Mon, Nov 21, 2022 at 12:45 PM sirisha chamarthi wrote:
>
> Hi Hackers,
>
> The comments atop seem to indicate that it is only showing active replication
> slots. The comment is ambiguous as it also shows all the slots including lost
> and inactive slots. Attached a small patch to fix it.
>

I agree that it is a bit confusing. How about "SQL SRF showing all
replication slots that currently exist on the database cluster"?

--
With Regards,
Amit Kapila.
Re: Reducing power consumption on idle servers
On Mon, 21 Nov 2022 at 05:07, Laurenz Albe wrote:
>
> On Mon, 2022-11-21 at 10:13 +1300, Thomas Munro wrote:
> > I'll wait 24 hours before committing, to
> > provide a last chance for anyone who wants to complain about dropping
> > promote_trigger_file.
>
> Remove "promote_trigger_file"? Now I have never seen anybody use that
> parameter, but I don't think that it is a good idea to deviate from our
> usual standard of deprecating a feature for about five years before
> actually removing it.

We aren't removing the ability to promote, just enforcing a change to a
better mechanism, hence I don't see a reason for a long(er) deprecation
period than we have already had.

--
Simon Riggs
http://www.EnterpriseDB.com/
Re: Assertion failure in SnapBuildInitialSnapshot()
On Sat, Nov 19, 2022 at 6:35 AM Andres Freund wrote:
>
> On 2022-11-18 11:20:36 +0530, Amit Kapila wrote:
> > Okay, updated the patch accordingly.
>
> Assuming it passes tests etc, this'd work for me.

Thanks, Pushed.

--
With Regards,
Amit Kapila.
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Fri, Nov 18, 2022 at 2:48 PM I wrote:
> One issue with this patch: The "fanout" member is a uint8, so it can't hold
> 256 for the largest node kind. That's not an issue in practice, since we
> never need to grow it, and we only compare that value with the count in an
> Assert(), so I just set it to zero. That does break an invariant, so it's
> not great. We could use 2 bytes to be strictly correct in all cases, but
> that limits what we can do with the smallest node kind.

Thinking about this part, there's an easy resolution -- use a different macro
for fixed- and variable-sized node kinds to determine if there is a free slot.

Also, I wanted to share some results of adjusting the boundary between the
two smallest node kinds. In the hackish attached patch, I modified the fixed
height search benchmark to search a small (within L1 cache) tree thousands of
times. For the first set I modified node4's maximum fanout and filled it up.
For the second, I set node4's fanout to 1, which causes 2+ to spill to node32
(actually the partially-filled node15 size class as demoed earlier).
node4:

NOTICE:  num_keys = 16, height = 3, n4 = 15, n15 = 0, n32 = 0, n128 = 0, n256 = 0
 fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
      2 |    16 |            16520 |          0 |            3

NOTICE:  num_keys = 81, height = 3, n4 = 40, n15 = 0, n32 = 0, n128 = 0, n256 = 0
 fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
      3 |    81 |            16456 |          0 |           17

NOTICE:  num_keys = 256, height = 3, n4 = 85, n15 = 0, n32 = 0, n128 = 0, n256 = 0
 fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
      4 |   256 |            16456 |          0 |           89

NOTICE:  num_keys = 625, height = 3, n4 = 156, n15 = 0, n32 = 0, n128 = 0, n256 = 0
 fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
      5 |   625 |            16488 |          0 |          327

node32:

NOTICE:  num_keys = 16, height = 3, n4 = 0, n15 = 15, n32 = 0, n128 = 0, n256 = 0
 fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
      2 |    16 |            16488 |          0 |            5
(1 row)

NOTICE:  num_keys = 81, height = 3, n4 = 0, n15 = 40, n32 = 0, n128 = 0, n256 = 0
 fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
      3 |    81 |            16520 |          0 |           28

NOTICE:  num_keys = 256, height = 3, n4 = 0, n15 = 85, n32 = 0, n128 = 0, n256 = 0
 fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
      4 |   256 |            16408 |          0 |           79

NOTICE:  num_keys = 625, height = 3, n4 = 0, n15 = 156, n32 = 0, n128 = 0, n256 = 0
 fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
      5 |   625 |            24616 |          0 |          199

In this test, node32 seems slightly faster than node4 with 4 elements, at the
cost of more memory.

Assuming the smallest node is fixed size (i.e. fanout/capacity member not part
of the common set, so only part of variable-sized nodes), 3 has a nice
property: no wasted padding space:

node4: 5 + 4 + (7) + 4*8 = 48 bytes
node3: 5 + 3       + 3*8 = 32

--
John Naylor
EDB: http://www.enterprisedb.com
Fix comments atop pg_get_replication_slots
Hi Hackers,

The comments atop seem to indicate that it is only showing active replication
slots. The comment is ambiguous as it also shows all the slots including lost
and inactive slots. Attached a small patch to fix it.

Thanks,
Sirisha

0001-Fix-atop-pg_get_replication_slots-function-to-reflec.patch
Description: Binary data
Catalog_xmin is not advanced when a logical slot is lost
Hi Hackers, forking this thread from the discussion [1] as suggested by Amit.

Catalog_xmin is not advanced when a logical slot is invalidated (lost) until
the invalidated slot is dropped. This patch ignores invalidated slots while
computing the oldest xmin. Attached a small patch to address this; the output
after the patch is shown below.

postgres=# select * from pg_replication_slots;
 slot_name |    plugin     | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn | wal_status | safe_wal_size | two_phase
-----------+---------------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------+------------+---------------+-----------
 s2        | test_decoding | logical   |      5 | postgres | f         | f      |            |      |          771 | 0/30466368  | 0/304663A0          | reserved   |      28903824 | f
(1 row)

postgres=# create table t2(c int, c1 char(100));
CREATE TABLE
postgres=# drop table t2;
DROP TABLE
postgres=# vacuum pg_class;
VACUUM
postgres=# select n_dead_tup from pg_stat_all_tables where relname = 'pg_class';
 n_dead_tup
------------
          2
(1 row)

postgres=# select * from pg_stat_replication;
 pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | backend_xmin | state | sent_lsn | write_lsn | flush_lsn | replay_lsn | write_lag | flush_lag | replay_lag | sync_priority | sync_state | reply_time
-----+----------+---------+------------------+-------------+-----------------+-------------+---------------+--------------+-------+----------+-----------+-----------+------------+-----------+-----------+------------+---------------+------------+------------
(0 rows)

postgres=# insert into t1 select * from t1;
INSERT 0 2097152
postgres=# checkpoint;
CHECKPOINT
postgres=# select * from pg_replication_slots;
 slot_name |    plugin     | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn | wal_status | safe_wal_size | two_phase
-----------+---------------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------+------------+---------------+-----------
 s2        | test_decoding | logical   |      5 | postgres | f         | f      |            |      |          771 |             | 0/304663A0          | lost       |               | f
(1 row)

postgres=# vacuum pg_class;
VACUUM
postgres=# select n_dead_tup from pg_stat_all_tables where relname = 'pg_class';
 n_dead_tup
------------
          0
(1 row)

[1]
https://www.postgresql.org/message-id/flat/CAKrAKeW-sGqvkw-2zKuVYiVv%3DEOG4LEqJn01RJPsHfS2rQGYng%40mail.gmail.com

Thanks,
Sirisha

0001-Ignore-invalidated-slots-while-computing-the-oldest-.patch
Description: Binary data
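A compact way to spot the condition this patch addresses (an invalidated slot still pinning catalog_xmin) is a query along these lines, using only standard pg_replication_slots columns:

```sql
-- Invalidated (lost) logical slots that still hold back the catalog xmin
-- horizon; without the patch, these keep vacuum from removing dead catalog
-- rows until the slot is dropped.
SELECT slot_name, slot_type, wal_status, catalog_xmin
FROM pg_replication_slots
WHERE wal_status = 'lost' AND catalog_xmin IS NOT NULL;
```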
Re: Reducing power consumption on idle servers
On Mon, Nov 21, 2022 at 2:43 AM Thomas Munro wrote:
>
> On Mon, Nov 21, 2022 at 3:35 AM Simon Riggs wrote:
> > On Sat, 19 Nov 2022 at 10:59, Simon Riggs wrote:
> > > New version attached.
> >
> > Fix for doc xref
>
> I removed a stray variable declaration from xlogrecovery.h, and wrote
> a draft commit message. I'll wait 24 hours before committing, to
> provide a last chance for anyone who wants to complain about dropping
> promote_trigger_file.
>
> (We could of course change it so that the timeout based wakeup only
> happens if you have actually set promote_trigger_file, but I think
> we've established that we just don't want the filesystem polling
> feature so I'm whispering this in parentheses.)

Thanks. The v11 patch mostly looks good to me and it passes cirrus-ci tests -
https://github.com/BRupireddy/postgres/tree/do_away_with_promote_trigger_file.

I have a comment:

-                 * Wait for more WAL to arrive. Time out after 5 seconds
-                 * to react to a trigger file promptly and to check if the
-                 * WAL receiver is still active.
+                 * Wait for more WAL to arrive, when we will be woken
+                 * immediately by the WAL receiver. Use of trigger file
+                 * via promote_trigger_file is now fully removed.
                  */

Do we need to introduce a reference to the removal of promote_trigger_file in
the code? If left as-is, it'll lie there for many years. Isn't it enough to
specify it in appendix-obsolete-recovery-config.sgml?

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Fri, Nov 18, 2022 at 8:20 PM Masahiko Sawada wrote:
>
> On Thu, Nov 17, 2022 at 12:24 AM Masahiko Sawada wrote:
> >
> > On Wed, Nov 16, 2022 at 4:39 PM John Naylor wrote:

That means my idea for the pointer struct might have some problems, at least
as currently implemented. Maybe in the course of separating out and polishing
that piece, an inefficiency will fall out. Or, it might be another reason to
template local and shared separately. Not sure yet. I also haven't tried to
adjust this test for the shared memory case.

Digging a bit deeper, I see a flaw in my benchmark: even though the total
distribution of node kinds is decently even, the pattern that the benchmark
sees is not terribly random:

     3,343,352      branch-misses:u           #    0.85% of all branches
   393,204,959      branches:u

Recall a previous benchmark [1] where the leaf node was about half node16 and
half node32. Randomizing the leaf node between the two caused branch misses
to go from 1% to 2%, causing a noticeable slowdown. Maybe in this new
benchmark, each level has a skewed distribution of nodes, giving a smart
branch predictor something to work with. We will need a way to efficiently
generate keys that lead to a relatively unpredictable distribution of node
kinds, as seen by a searcher. Especially in the leaves (or just above the
leaves), since those are less likely to be cached.

> > I'll also run the test on my environment and do the investigation tomorrow.
> >
>
> FYI I've not tested the patch you shared today but here are the
> benchmark results I did with the v9 patch in my environment (I used
> the second filter). I split the 0004 patch into two patches: a pure
> refactoring patch to introduce rt_node_ptr and a patch to do
> pointer tagging.

Would you be able to share the refactoring patch? And a fix for the failing
tests? I'm thinking I want to try the templating approach fairly soon.
[1] https://www.postgresql.org/message-id/CAFBsxsFEVckVzsBsfgGzGR4Yz%3DJp%3DUxOtjYvTjOz6fOoLXtOig%40mail.gmail.com -- John Naylor EDB: http://www.enterprisedb.com
Re: Perform streaming logical transactions by background workers and parallel apply
On Fri, Nov 18, 2022 at 6:03 PM Peter Smith wrote:
>
> Here are some review comments for v47-0001
>
> (This review is a WIP - I will post more comments for this patch next week)
>

Here are the rest of my comments for v47-0001

======
doc/src/sgml/monitoring.sgml

1.

@@ -1851,6 +1851,11 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
       Waiting to acquire an advisory user lock.
+      applytransaction
+      Waiting to acquire acquire a lock on a remote transaction being
+      applied on the subscriber side.
+
+

1a. Typo "acquire acquire"

~

1b. Maybe "on the subscriber side" does not mean much without any context.
Maybe better to word it as below.

SUGGESTION
Waiting to acquire a lock on a remote transaction being applied by a logical
replication subscriber.

======
doc/src/sgml/system-views.sgml

2.

@@ -1361,8 +1361,9 @@
       virtualxid,
       spectoken,
       object,
-      userlock, or
-      advisory.
+      userlock,
+      advisory or
+      applytransaction.

This change removed the Oxford comma that was there before. I assume it was
unintended.

======
.../replication/logical/applyparallelworker.c

3. globals

The parallel_apply_XXX functions were all shortened to pa_XXX. I wondered if
the same simplification should be done also to the global statics... e.g.

ParallelApplyWorkersHash -> PAWorkerHash
ParallelApplyWorkersList -> PAWorkerList
ParallelApplyMessagePending -> PAMessagePending
etc...

~~~

4. pa_get_free_worker

+ foreach(lc, active_workers)
+ {
+     ParallelApplyWorkerInfo *winfo = NULL;
+
+     winfo = (ParallelApplyWorkerInfo *) lfirst(lc);

No need to assign NULL because the next line just overwrites that anyhow.

~

5.

+ /*
+  * Try to free the worker first, because we don't wait for the rollback
+  * command to finish so the worker may not be freed at the end of the
+  * transaction.
+  */
+ if (pa_free_worker(winfo, winfo->shared->xid))
+     continue;
+
+ if (!winfo->in_use)
+     return winfo;

Shouldn't the (!winfo->in_use) check be done first as well -- e.g. why are we
trying to free a worker which is maybe not even in_use?

SUGGESTION (this will need some comment to explain what it is doing)
if (!winfo->in_use ||
    !pa_free_worker(winfo, winfo->shared->xid) && !winfo->in_use)
    return winfo;

~~~

6. pa_free_worker

+/*
+ * Remove the parallel apply worker entry from the hash table. Stop the work if
+ * there are enough workers in the pool.

Typo? "work" -> "worker"

~

7.

+ /* Are there enough workers in the pool? */
+ if (napplyworkers > (max_parallel_apply_workers_per_subscription / 2))
+ {

IMO that comment should be something more like "Don't detach/stop the worker
unless..."

~~~

8. pa_send_data

+ /*
+  * Retry after 1s to reduce the cost of getting the system time and
+  * calculating the time difference.
+  */
+ (void) WaitLatch(MyLatch,
+                  WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+                  1000L,
+                  WAIT_EVENT_LOGICAL_PARALLEL_APPLY_STATE_CHANGE);

8a. I am not sure you need to explain the reason in the comment. Just saying
"Wait before retrying." seems sufficient to me.

~

8b. Instead of the hardwired "1s" in the comment, and 1000L in the code, maybe
better to just have another constant.

SUGGESTION
#define SHM_SEND_RETRY_INTERVAL_MS 1000
#define SHM_SEND_TIMEOUT_MS 1

~

9.

+ if (startTime == 0)
+     startTime = GetCurrentTimestamp();
+ else if (TimestampDifferenceExceeds(startTime, GetCurrentTimestamp(),

IMO the initial startTime should be at top of the function otherwise the
timeout calculation seems wrong.

======
src/backend/replication/logical/worker.c

10. handle_streamed_transaction

+ * In streaming case (receiving a block of streamed transaction), for
+ * SUBSTREAM_ON mode, simply redirect it to a file for the proper toplevel
+ * transaction, and for SUBSTREAM_PARALLEL mode, send the changes to parallel
+ * apply workers (LOGICAL_REP_MSG_RELATION or LOGICAL_REP_MSG_TYPE changes
+ * will be applied by both leader apply worker and parallel apply workers).

I'm not sure this function comment should be referring to SUBSTREAM_ON and
SUBSTREAM_PARALLEL because the function body does not use those anywhere in
the logic.

~~~

11. apply_handle_stream_start

+ /*
+  * Increment the number of messages waiting to be processed by
+  * parallel apply worker.
+  */
+ pg_atomic_add_fetch_u32(&(winfo->shared->pending_message_count), 1);

The &() parens are not needed. Just write
&winfo->shared->pending_message_count. Also, search/replace others like this
-- there are a few of them.

~~~

12. apply_handle_stream_stop

+ if (!abort_toplevel_transaction &&
+     pg_atomic_sub_fetch_u32(&(MyParallelShared->pending_message_count), 1) == 0)
+ {
+     pa_lock_stream(MyParallelShared->xid, AccessShareLock);
+     pa_unlock_stream(MyParallelShared->xid, AccessShareLock);
+ }

That lock/unlock seems like it is done just as a way of testing/waiting for an
exclusive lock held on the xid to be released. But the code is too tricky --
IMO it
Re: Reducing power consumption on idle servers
On Mon, Nov 21, 2022 at 10:37 AM Laurenz Albe wrote:
>
> On Mon, 2022-11-21 at 10:13 +1300, Thomas Munro wrote:
> > I'll wait 24 hours before committing, to
> > provide a last chance for anyone who wants to complain about dropping
> > promote_trigger_file.
>
> Remove "promote_trigger_file"? Now I have never seen anybody use that
> parameter, but I don't think that it is a good idea to deviate from our
> usual standard of deprecating a feature for about five years before
> actually removing it.

I'm not sure what the guidelines are here, however years have gone by since
pg_ctl promote [1] in 2011 and pg_promote() [2] in 2018 were added. With two
such alternatives in place for many years, it was sort of an undeclared
deprecation of the promote_trigger_file GUC. And the changes required to move
to newer ways from the GUC aren't that hard for those who're still relying on
the GUC. Therefore, I think it's now time for us to do away with the GUC.

[1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=4695da5ae97bbb58d274887fd68edbe88d03ebcb
[2] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=10074651e3355e2405015f6253602be8344bc829

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
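For anyone still relying on the GUC, the SQL-level alternative mentioned above (available since PG 12, alongside `pg_ctl promote`) is a one-liner on the standby:

```sql
-- Promote this standby; by default waits up to 60 seconds for promotion
-- to complete and returns true on success.
SELECT pg_promote();
```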
Re: Error-safe user functions
Corey Huinker writes:
> On Tue, Nov 15, 2022 at 11:36 AM Andrew Dunstan wrote:
>> Nikita,
>> just checking in, are you making progress on this? I think we really
>> need to get this reviewed and committed ASAP if we are to have a chance
>> to get the SQL/JSON stuff reworked to use it in time for release 16.

> I'm making an attempt at this or something very similar to it. I don't yet
> have a patch ready.

Cool. We can't delay too much longer on this if we want to have a credible
feature in v16. Although I want a minimal initial patch, there will still be
a ton of incremental work to do after the core capability is reviewed and
committed, so there's no time to lose.

			regards, tom lane
Re: Error-safe user functions
On Tue, Nov 15, 2022 at 11:36 AM Andrew Dunstan wrote:
>
> On 2022-10-07 Fr 13:37, Tom Lane wrote:
>
> > [ lots of detailed review ]
>
> > Basically, this patch set should be a lot smaller and not have ambitions
> > beyond "get the API right" and "make one or two datatypes support COPY
> > NULL_ON_ERROR". Add more code once that core functionality gets reviewed
> > and committed.
>
> Nikita,
>
> just checking in, are you making progress on this? I think we really
> need to get this reviewed and committed ASAP if we are to have a chance
> to get the SQL/JSON stuff reworked to use it in time for release 16.
>

I'm making an attempt at this or something very similar to it. I don't yet
have a patch ready.
Re: Multitable insert syntax support on Postgres?
> WITH data_src AS (SELECT * FROM source_tbl),
> insert_a AS (INSERT INTO a SELECT * FROM data_src WHERE d < 5),
> insert_b AS (INSERT INTO b SELECT * FROM data_src WHERE d >= 5)
> INSERT INTO c SELECT * FROM data_src WHERE d < 5

I suppose you could just do a dummy SELECT at the bottom to make it look more
symmetrical:

WITH data_src AS (SELECT * FROM source_tbl),
insert_a AS (INSERT INTO a SELECT * FROM data_src WHERE d < 5),
insert_b AS (INSERT INTO b SELECT * FROM data_src WHERE d >= 5),
insert_c AS (INSERT INTO c SELECT * FROM data_src WHERE d < 5)
SELECT true AS inserts_complete;

Or maybe get some diagnostics out of it:

WITH data_src AS (SELECT * FROM source_tbl),
insert_a AS (INSERT INTO a SELECT * FROM data_src WHERE d < 5 RETURNING NULL),
insert_b AS (INSERT INTO b SELECT * FROM data_src WHERE d >= 5 RETURNING NULL),
insert_c AS (INSERT INTO c SELECT * FROM data_src WHERE d < 5 RETURNING NULL)
SELECT (SELECT COUNT(*) FROM insert_a) AS new_a_rows,
       (SELECT COUNT(*) FROM insert_b) AS new_b_rows,
       (SELECT COUNT(*) FROM insert_c) AS new_c_rows;
Re: Reducing power consumption on idle servers
On Mon, 2022-11-21 at 10:13 +1300, Thomas Munro wrote:
> I'll wait 24 hours before committing, to
> provide a last chance for anyone who wants to complain about dropping
> promote_trigger_file.

Remove "promote_trigger_file"? Now I have never seen anybody use that
parameter, but I don't think that it is a good idea to deviate from our
usual standard of deprecating a feature for about five years before
actually removing it.

Yours,
Laurenz Albe
Re: Multitable insert syntax support on Postgres?
On Mon, Nov 14, 2022 at 7:06 PM Alexandre hadjinlian guerra <alexhgue...@gmail.com> wrote:
> Hello
> Are there any plans to incorporate a formal syntax multitable/conditional
> insert, similar to the syntax below? snowflake does have the same feature
>
> https://oracle-base.com/articles/9i/multitable-inserts
>
> Today, im resorting to a function that receives the necessary parameters
> from the attributes definition/selection area in a sql select query, called
> by each tuple retrieved. A proper syntax show be real cool
>
> Thanks!

I'm not aware of any efforts to implement this at this time, mostly because
I don't think it's supported in the SQL Standard. Being in the standard would
change the question from "why" to "why not".

I've used that feature when I worked with Oracle in a data warehouse
situation. I found it most useful when migrating data dumps from mainframes
where the data file contained subrecords and in cases where one field in a
row changes the meaning of subsequent fields in the same row. That may sound
like a First Normal Form violation, and it is, but such data formats are
common in the IBM VSAM world, or at least they were in the data dumps that I
had to import.
Re: Unstable regression test for contrib/pageinspect
Peter Geoghegan writes:
> On Sun, Nov 20, 2022 at 12:37 PM Tom Lane wrote:
>> The core reloptions.sql and vacuum.sql tests are two places that are
>> also using this option, but they are applying it to temp tables,
>> which I think makes it safe (and the lack of failures, seeing that
>> they run within parallel test groups, reinforces that). Can we apply
>> that idea in pageinspect?

> I believe so. The temp table horizons guarantee isn't all that old, so
> the tests may well have been written before it was possible.

Ah, right, I see that that only dates back to v14 (cf a7212be8b). So we can
fix pageinspect's issue by making that table be temp, but only as far back
as v14. That's probably good enough in terms of reducing the buildfarm noise
level, seeing that mamba has only reported this failure on HEAD so far.
I'd be tempted to propose back-patching a7212be8b, but there would be
ABI-stability issues, and it's probably not worth dealing with that.

>> contrib/amcheck and contrib/pg_visibility are also using
>> DISABLE_PAGE_SKIPPING, so I wonder if they have similar hazards.

> I think that most use of DISABLE_PAGE_SKIPPING by the regression tests
> just isn't necessary.

Apparently not -- see followup discussion with Mark Dilger.

			regards, tom lane
Re: Unstable regression test for contrib/pageinspect
Mark Dilger writes:
> On Nov 20, 2022, at 12:37 PM, Tom Lane wrote:
>> contrib/amcheck and contrib/pg_visibility are also using
>> DISABLE_PAGE_SKIPPING, so I wonder if they have similar hazards.
>> I haven't seen them fall over, though.

> In the amcheck regression test case, it's because the test isn't
> sensitive to whether the freeze actually happens. You can comment
> out that line, and the only test difference is the comment:

Interesting. I tried that with pg_visibility, with the same result:
removing its VACUUM commands altogether changes nothing else in the test
output. I'm not sure this is a good thing. It makes one wonder whether
these tests really test what they claim to. But it certainly explains the
lack of failures.

> The amcheck TAP test is sensitive to commenting out the freeze, though:
> ...
> But the TAP test also disables autovacuum, so a background
> auto-analyze shouldn't be running. Maybe that's why you haven't
> seen amcheck fall over?

Ah, right, I see

    $node->append_conf('postgresql.conf', 'autovacuum=off');

in 001_verify_heapam.pl. So that one's okay too.

Bottom line seems to be that converting pageinspect's test table to a temp
table should fix this. If no objections, I'll do that tomorrow.

			regards, tom lane
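In regression-script terms, the proposed fix is roughly this shape (table name and contents illustrative, not the exact pageinspect script):

```sql
-- A temp table's xmin horizon is not held back by concurrent sessions
-- (as of v14), so an aggressive freeze here behaves deterministically
-- even when other regression tests run in parallel.
CREATE TEMP TABLE test1 (a int8, b text);
INSERT INTO test1 VALUES (1, 'one');
VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) test1;
```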
Re: Unstable regression test for contrib/pageinspect
> On Nov 20, 2022, at 12:37 PM, Tom Lane wrote:
>
> contrib/amcheck and contrib/pg_visibility are also using
> DISABLE_PAGE_SKIPPING, so I wonder if they have similar hazards.
> I haven't seen them fall over, though.

In the amcheck regression test case, it's because the test isn't sensitive
to whether the freeze actually happens. You can comment out that line, and
the only test difference is the comment:

@@ -108,8 +108,8 @@
 ERROR:  ending block number must be between 0 and 0
 SELECT * FROM verify_heapam(relation := 'heaptest', startblock := 1, endblock := 11000);
 ERROR:  starting block number must be between 0 and 0
--- Vacuum freeze to change the xids encountered in subsequent tests
-VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) heaptest;
+-- -- Vacuum freeze to change the xids encountered in subsequent tests
+-- VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) heaptest;
 -- Check that valid options are not rejected nor corruption reported
 -- for a non-empty frozen table
 SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'none');

The amcheck TAP test is sensitive to commenting out the freeze, though:

t/001_verify_heapam.pl .. 42/?
#   Failed test 'all-frozen corrupted table skipping all-frozen'
#   at t/001_verify_heapam.pl line 58.
#          got: '0|3||line pointer redirection to item at offset 21840 exceeds maximum offset 38
# 0|4||line pointer to page offset 21840 with length 21840 ends beyond maximum page offset 8192
# 0|5||line pointer redirection to item at offset 0 precedes minimum offset 1
# 0|6||line pointer length 0 is less than the minimum tuple header size 24
# 0|7||line pointer to page offset 15 is not maximally aligned
# 0|8||line pointer length 15 is less than the minimum tuple header size 24'
#     expected: ''
t/001_verify_heapam.pl .. 211/?
# Looks like you failed 1 test of 272.
t/001_verify_heapam.pl .. Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/272 subtests
t/002_cic.pl ............ ok
t/003_cic_2pc.pl ........ ok

Test Summary Report
-------------------
t/001_verify_heapam.pl (Wstat: 256 (exited 1) Tests: 272 Failed: 1)
  Failed test:  80
  Non-zero exit status: 1
Files=3, Tests=280, 10 wallclock secs ( 0.05 usr  0.02 sys +  3.84 cusr  3.10 csys =  7.01 CPU)
Result: FAIL
make: *** [check] Error 1

But the TAP test also disables autovacuum, so a background auto-analyze
shouldn't be running. Maybe that's why you haven't seen amcheck fall over?

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: CREATE INDEX CONCURRENTLY on partitioned index
I finally found time to digest and integrate your changes into my local
branch. This fixes the three issues you reported: FORCE_RELEASE, the issue
with INVALID partitions (for which I adapted your patch into an earlier patch
in my series), and progress reporting. And rebased.

--
Justin

>From 4ba360eaaac5e1ac169d41c26cf6213b0c6a2432 Mon Sep 17 00:00:00 2001
From: Justin Pryzby
Date: Sat, 6 Jun 2020 17:42:23 -0500
Subject: [PATCH] Allow CREATE INDEX CONCURRENTLY on partitioned table

Note, this effectively reverts 050098b14, so take care to not
reintroduce the bug it fixed.
---
 doc/src/sgml/ddl.sgml                  |   4 +-
 doc/src/sgml/ref/create_index.sgml     |   9 --
 src/backend/commands/indexcmds.c       | 214 +++--
 src/include/catalog/index.h            |   1 +
 src/test/regress/expected/indexing.out | 136 +++-
 src/test/regress/sql/indexing.sql      |  26 ++-
 6 files changed, 320 insertions(+), 70 deletions(-)

diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml
index 03c01937094..fd56e21ef49 100644
--- a/doc/src/sgml/ddl.sgml
+++ b/doc/src/sgml/ddl.sgml
@@ -4131,9 +4131,7 @@ ALTER TABLE measurement ATTACH PARTITION measurement_y2008m02
    so that they are applied automatically to the entire hierarchy.
    This is very convenient, as not only will the existing partitions become
-   also any partitions that are created in the future will.  One limitation is
-   that it's not possible to use the CONCURRENTLY
-   qualifier when creating such a partitioned index.  To avoid long lock
+   also any partitions that are created in the future will.  To avoid long lock
    times, it is possible to use CREATE INDEX ON ONLY
    the partitioned table; such an index is marked invalid, and the partitions
    do not get the index applied automatically.  The indexes on partitions can
diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index 40986aa502f..fc8cda655f0 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -692,15 +692,6 @@ Indexes:
     cannot.
-
-    Concurrent builds for indexes on partitioned tables are currently not
-    supported.  However, you may concurrently build the index on each
-    partition individually and then finally create the partitioned index
-    non-concurrently in order to reduce the time where writes to the
-    partitioned table will be locked out.  In this case, building the
-    partitioned index is a metadata only operation.
-
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 91cee27743d..bb98e745267 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -71,6 +71,7 @@
 
 /* non-export function prototypes */
+static void reindex_invalid_child_indexes(Oid indexRelationId);
 static bool CompareOpclassOptions(Datum *opts1, Datum *opts2, int natts);
 static void CheckPredicate(Expr *predicate);
 static void ComputeIndexAttrs(IndexInfo *indexInfo,
@@ -104,7 +105,9 @@ static void reindex_error_callback(void *arg);
 static void ReindexPartitions(Oid relid, ReindexParams *params, bool isTopLevel);
 static void ReindexMultipleInternal(List *relids,
-                                    ReindexParams *params);
+                                    ReindexParams *params,
+                                    Oid parent,
+                                    int npart);
 static bool ReindexRelationConcurrently(Oid relationOid, ReindexParams *params);
 static void update_relispartition(Oid relationId, bool newval);
@@ -697,17 +700,6 @@ DefineIndex(Oid relationId,
 	partitioned = rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE;
 	if (partitioned)
 	{
-		/*
-		 * Note: we check 'stmt->concurrent' rather than 'concurrent', so that
-		 * the error is thrown also for temporary tables.  Seems better to be
-		 * consistent, even though we could do it on temporary table because
-		 * we're not actually doing it concurrently.
-		 */
-		if (stmt->concurrent)
-			ereport(ERROR,
-					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-					 errmsg("cannot create index on partitioned table \"%s\" concurrently",
-							RelationGetRelationName(rel))));
 		if (stmt->excludeOpNames)
 			ereport(ERROR,
 					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
@@ -1147,6 +1139,11 @@ DefineIndex(Oid relationId,
 		if (pd->nparts != 0)
 			flags |= INDEX_CREATE_INVALID;
 	}
+	else if (concurrent && OidIsValid(parentIndexId))
+	{
+		/* If concurrent, initially build index partitions as "invalid" */
+		flags |= INDEX_CREATE_INVALID;
+	}
 
 	if (stmt->deferrable)
 		constr_flags |= INDEX_CONSTR_CREATE_DEFERRABLE;
@@ -1212,17 +1209,30 @@ DefineIndex(Oid relationId,
 	partdesc = RelationGetPartitionDesc(rel, true);
 	if ((!stmt->relation || stmt->relation->inh) && partdesc->nparts > 0)
 	{
+		/*
+		 * Need to close the relation before recursing into children, so
+		 * copy needed data into a longlived context.
+		 */
+
+		MemoryContext ind_context =
Re: Unable to reset stats using pg_stat_reset_replication_slot
This doesn't quite fit here, but..

At Fri, 18 Nov 2022 18:38:56 +0530, Satya Thirumani wrote in
> I'm unable to reset stats. Please help me to fix this?
>
> testdb => select * from pg_stat_reset_replication_slot(NULL);
> ERROR: permission denied for function pg_stat_reset_replication_slot

Yeah, the user doesn't seem to be allowed to do that. Only superusers can do that by default.

https://www.postgresql.org/docs/devel/monitoring-stats.html
> This function is restricted to superusers by default, but other
> users can be granted EXECUTE to run the function.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
Re: when the startup process doesn't (logging startup delays)
At Fri, 18 Nov 2022 15:55:00 +0530, Bharath Rupireddy wrote in
> On Fri, Nov 18, 2022 at 12:42 AM Robert Haas wrote:
> > On Thu, Nov 17, 2022 at 2:22 AM Bharath Rupireddy wrote:
> > > Duplication is a problem that I agree with and I have an idea here -
> > > how about introducing a new function, say EnableStandbyMode() that
> > > sets StandbyMode to true and disables the startup progress timeout,
> > > something like the attached?
> >
> > That works for me, more or less. But I think that
> > enable_startup_progress_timeout should be amended to either say if
> > (log_startup_progress_interval == 0 || StandbyMode) return; or else it
> > should at least Assert(!StandbyMode), so that we can't accidentally
> > re-enable the timer after we shut it off.
>
> Hm, an assertion may not help in typical production servers running on
> non-assert builds. I've modified the if condition, please see the
> attached v5 patch.

I prefer Robert's approach, as it is simpler and more robust against future changes. I'd rather avoid this kind of piggy-backing, and it doesn't seem to be needed in this case.

XLogShutdownWalRcv() looks like a similar case to me; honestly I don't like it from a robustness standpoint, but it is simpler than checking the walreceiver status at every site that refers to the flag.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
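For readers following along, the guard being discussed can be illustrated outside of C. This is only a sketch of the logic in Python, with names mirroring the C identifiers; it is not the actual PostgreSQL implementation:

```python
# Illustration of the discussed guard; names mirror the C code
# (log_startup_progress_interval, StandbyMode) but this is a toy model.
log_startup_progress_interval = 10  # seconds; 0 disables the feature
standby_mode = False
timeout_active = False

def enable_startup_progress_timeout():
    """Arm the startup-progress timer unless disabled or in standby mode."""
    global timeout_active
    # Robert's suggestion: refuse to (re-)arm the timer once StandbyMode is
    # set, so a later caller cannot accidentally re-enable it.
    if log_startup_progress_interval == 0 or standby_mode:
        return
    timeout_active = True

def enable_standby_mode():
    """Bharath's EnableStandbyMode(): set the flag and stop the timer."""
    global standby_mode, timeout_active
    standby_mode = True
    timeout_active = False

enable_startup_progress_timeout()
assert timeout_active
enable_standby_mode()
enable_startup_progress_timeout()  # now a no-op: the timer stays off
assert not timeout_active
```

The point of folding the StandbyMode check into the enabling function itself, rather than relying on callers, is that any future call site gets the correct behavior for free.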
Re: Understanding WAL - large amount of activity from removing data
On Sun, Nov 20, 2022 at 6:24 PM Isaac Morland wrote:
> What I'm finding is that the UPDATE is taking over an hour for 5000
> records, and tons of WAL is being generated, several files per minute.
> Selecting the non-PDF columns from the entire table takes a few
> milliseconds, and the only thing I'm doing with the records is updating
> them to much smaller values. Why so much activity just to remove data? The
> new rows are tiny.

Simplistic answer (partly because the second part of this isn't spelled out explicitly in the docs that I could find): when you UPDATE, two things happen. The old record is modified to indicate it has been deleted, and a new record is inserted. Both of these are written to the WAL, and a record is always written to the WAL as a self-contained unit, so the old record is full sized in the newly written WAL. TOAST apparently has an optimization if you don't change the TOASTed value, but here you are changing it, so that optimization doesn't apply.

David J.
Understanding WAL - large amount of activity from removing data
I'm encountering some surprising (to me) behaviour related to WAL, and I'm wondering if anybody can point me at an article that might help me understand what is happening, or give a brief explanation.

I'm trying to make a slimmed down version of my database for testing purposes. As part of this, I'm running a query something like this:

UPDATE table1 SET pdfcolumn = 'redacted' WHERE pdfcolumn IS NOT NULL;

(literally 'redacted', not redacted here for your benefit)

The idea is to replace the actual contents of the column, which are PDF documents totalling 70GB, with just a short placeholder value, without affecting the other columns, which are a more ordinary collection - a few integers and short strings. The end result will be a database which is way easier to copy around but which still has all the records of the original; the only change is that an attempt to access one of the PDFs will not return the actual PDF but rather a garbage value. For most testing this will make little to no difference.

What I'm finding is that the UPDATE is taking over an hour for 5000 records, and tons of WAL is being generated, several files per minute. Selecting the non-PDF columns from the entire table takes a few milliseconds, and the only thing I'm doing with the records is updating them to much smaller values. Why so much activity just to remove data? The new rows are tiny.
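As a back-of-envelope illustration of the volume involved (my own numbers and assumptions, not something stated in the thread): if removing the old TOASTed values touches most of the 8 kB pages they occupy, and each touched page is written to WAL as a full-page image after a checkpoint, WAL volume can approach the size of the TOAST data itself:

```python
# Back-of-envelope estimate.  The 8 kB page size is PostgreSQL's default;
# the "every TOAST page gets a full-page image" scenario is an assumption
# for illustration, not a precise accounting of WAL record sizes.
PAGE_SIZE = 8192                  # bytes per heap page
toast_bytes = 70 * 1024**3        # ~70 GB of PDF data living in TOAST
rows = 5000

avg_pdf = toast_bytes / rows
print(f"average PDF size: {avg_pdf / 1024**2:.0f} MiB")  # roughly 14 MiB/row

# Deleting a row's TOAST value marks all of its chunk tuples dead, touching
# nearly every TOAST page the value occupies.  If each touched page is
# WAL-logged as a full-page image, the WAL written approaches the table size.
toast_pages = toast_bytes / PAGE_SIZE
wal_estimate = toast_pages * PAGE_SIZE  # upper bound: one FPI per page
print(f"worst-case WAL from full-page images: {wal_estimate / 1024**3:.0f} GiB")
```

So even though the *new* rows are tiny, the dying multi-megabyte old versions dominate the WAL traffic.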
Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)
Hi,

One good follow-up patch would be to rip out the accounting for pg_stat_bgwriter's buffers_backend, buffers_backend_fsync and perhaps buffers_alloc, and replace it with a subselect getting the equivalent data from pg_stat_io. It might not be quite worth doing for buffers_alloc because of the way that's tied into bgwriter pacing.

On 2022-11-03 13:00:24 -0400, Melanie Plageman wrote:
> > > + again. A high number of repossessions is a sign of contention for the
> > > + blocks operated on by the strategy operation.
> >
> > This (and in general the repossession description) makes sense, but
> > I'm not sure what to do with the information. Maybe Andres is right
> > that we could skip this in the first version?
>
> I've removed repossessed and rejected in attached v37. I am a bit sad
> about this because I don't see a good way forward and I think those
> could be useful for users.

Let's get the basic patch in and then check whether we can find a way to have something providing at least some more information like repossessed and rejected. I think it'll be easier to analyze in isolation.

> I have added the new column Andres recommended in [1] ("io_object") to
> clarify temp and local buffers and pave the way for bypass IO (IO not
> done through a buffer pool), which can be done on temp or permanent
> files for temp or permanent relations, and spill file IO which is done
> on temporary files but isn't related to temporary tables.
>
> IOObject has increased the memory footprint and complexity of the code
> around tracking and accumulating the statistics, though it has not
> increased the number of rows in the view.

It doesn't look too bad from here. Is there a specific portion of the code where it concerns you the most?

> One question I still have about this additional dimension is how much
> enumeration we need of the various combinations of IO operations, IO
> objects, IO contexts, and backend types which are allowed and not allowed.
> Currently because it is only valid to operate on both IOOBJECT_RELATION
> and IOOBJECT_TEMP_RELATION in IOCONTEXT_BUFFER_POOL, the changes to the
> various functions asserting and validating what is "allowed" in terms of
> combinations of ops, objects, contexts, and backend types aren't much
> different than they were without IOObject. However, once we begin
> adding other objects and contexts, we will need to make this logic more
> comprehensive. I'm not sure whether or not I should do that
> preemptively.

I'd not do it preemptively.

> @@ -833,6 +836,22 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
>
>  	isExtend = (blockNum == P_NEW);
>
> +	if (isLocalBuf)
> +	{
> +		/*
> +		 * Though a strategy object may be passed in, no strategy is employed
> +		 * when using local buffers. This could happen when doing, for example,
> +		 * CREATE TEMPORARY TABLE AS ...
> +		 */
> +		io_context = IOCONTEXT_BUFFER_POOL;
> +		io_object = IOOBJECT_TEMP_RELATION;
> +	}
> +	else
> +	{
> +		io_context = IOContextForStrategy(strategy);
> +		io_object = IOOBJECT_RELATION;
> +	}

I think given how frequently ReadBuffer_common() is called in some workloads, it'd be good to make IOContextForStrategy inlinable. But I guess that's not easily doable, because struct BufferAccessStrategyData is only defined in freelist.c.

Could we defer this until later, given that we don't currently need this in case of buffer hits afaict?
> @@ -1121,6 +1144,8 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
>  			BufferAccessStrategy strategy,
>  			bool *foundPtr)
>  {
> +	bool		from_ring;
> +	IOContext	io_context;
>  	BufferTag	newTag;			/* identity of requested block */
>  	uint32		newHash;		/* hash value for newTag */
>  	LWLock	   *newPartitionLock;	/* buffer partition lock for it */
> @@ -1187,9 +1212,12 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
>  	 */
>  	LWLockRelease(newPartitionLock);
>
> +	io_context = IOContextForStrategy(strategy);

Hm - doesn't this mean we do IOContextForStrategy() twice? Once in ReadBuffer_common() and then again here?

>  	/* Loop here in case we have to try another victim buffer */
>  	for (;;)
>  	{
> +
>  		/*
>  		 * Ensure, while the spinlock's not yet held, that there's a free
>  		 * refcount entry.
> @@ -1200,7 +1228,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
>  		 * Select a victim buffer.  The buffer is returned with its header
>  		 * spinlock still held!
>  		 */
> -		buf = StrategyGetBuffer(strategy, &buf_state,
Re: Polyphase merge is obsolete
On 19/11/2022 13:00, Peter Eisentraut wrote:
> On 18.10.21 14:15, Heikki Linnakangas wrote:
>> On 05/10/2021 20:24, John Naylor wrote:
>>> I've had a chance to review and test out the v5 patches.
>>
>> Thanks! I fixed the stray reference to PostgreSQL 14 that Zhihong
>> mentioned, and pushed.
>
> AFAICT, this thread updated the API of LogicalTapeSetCreate() in PG15,
> but did not adequately update the function header comment. The comment
> still mentions the "shared" argument, which has been removed. There is
> a new "preallocate" argument that is not mentioned at all. Also, it's
> not easy to match the actual callers to the call variants described in
> the comment.
>
> Could someone who remembers this work perhaps look this over and
> update the comment?

Is the attached more readable?

I'm not 100% sure of the "preallocate" argument. If I understand correctly, you should pass true if you are writing multiple tapes at the same time, and false otherwise. HashAgg passes true, tuplesort passes false. However, it's not clear to me why we couldn't just always do the preallocation. It seems pretty harmless even if it's not helpful. Or do it when there are multiple writer tapes, and not otherwise. The parameter was added in commit 0758964963, so presumably there was a reason, but at a quick glance at the thread that led to that commit, I couldn't see what it was.

- Heikki

From 1c5d4e28a21f6052d22667e79b5bd443aec3a5b5 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas
Date: Mon, 21 Nov 2022 01:55:05 +0200
Subject: [PATCH 1/1] Fix and clarify function comment on LogicalTapeSetCreate.

Commit c4649cce39 removed the "shared" and "ntapes" arguments, but the
comment still talked about "shared". It also talked about "a shared
file handle", which was technically correct because even before commit
c4649cce39, the "shared file handle" referred to the "fileset"
argument, not "shared". But it was very confusing. Improve the comment.

Also add a comment on what the "preallocate" argument does.
Discussion: https://www.postgresql.org/message-id/af989685-91d5-aad4-8f60-1d066b5ec...@enterprisedb.com
---
 src/backend/utils/sort/logtape.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index c384f98e13..b518a4b9c5 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -544,14 +544,20 @@ ltsInitReadBuffer(LogicalTape *lt)
  * The tape set is initially empty. Use LogicalTapeCreate() to create
  * tapes in it.
  *
- * Serial callers pass NULL argument for shared, and -1 for worker. Parallel
- * worker callers pass a shared file handle and their own worker number.
+ * In a single-process sort, pass NULL argument for fileset, and -1 for
+ * worker.
  *
- * Leader callers pass a shared file handle and -1 for worker. After creating
- * the tape set, use LogicalTapeImport() to import the worker tapes into it.
+ * In a parallel sort, parallel workers pass the shared fileset handle and
+ * their own worker number. After the workers have finished, create the
+ * tape set in the leader, passing the shared fileset handle and -1 for
+ * worker, and use LogicalTapeImport() to import the worker tapes into it.
 *
  * Currently, the leader will only import worker tapes into the set, it does
  * not create tapes of its own, although in principle that should work.
+ *
+ * If preallocate is true, blocks for each individual tape are allocated in
+ * batches. This avoids fragmentation when writing multiple tapes at the
+ * same time.
  */
 LogicalTapeSet *
 LogicalTapeSetCreate(bool preallocate, SharedFileSet *fileset, int worker)
-- 
2.30.2
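To illustrate the fragmentation question the new comment describes (a toy model only, not logtape.c itself): handing out single blocks to tapes in turn scatters each tape's blocks across the file, while batched preallocation keeps longer contiguous runs per tape:

```python
def allocate(num_tapes, total_blocks, batch):
    """Toy allocator: hand out sequential block numbers to tapes in
    round-robin fashion, `batch` blocks at a time."""
    tapes = [[] for _ in range(num_tapes)]
    next_block = 0
    t = 0
    while next_block < total_blocks:
        n = min(batch, total_blocks - next_block)
        tapes[t].extend(range(next_block, next_block + n))
        next_block += n
        t = (t + 1) % num_tapes
    return tapes

def contiguous_runs(blocks):
    """Count maximal runs of consecutive block numbers."""
    return 1 + sum(1 for a, b in zip(blocks, blocks[1:]) if b != a + 1)

# Three tapes written concurrently, 24 blocks total.
interleaved = allocate(num_tapes=3, total_blocks=24, batch=1)
batched = allocate(num_tapes=3, total_blocks=24, batch=4)
print(contiguous_runs(interleaved[0]))  # 8: every block of tape 0 is isolated
print(contiguous_runs(batched[0]))      # 2: two contiguous 4-block extents
```

With a single writer tape there is nothing to interleave, which is consistent with tuplesort passing false and HashAgg (many concurrent spill tapes) passing true.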
Re: Split index and table statistics into different types of stats
Hi,

On 2022-11-18 12:18:38 +0100, Drouvot, Bertrand wrote:
> On 11/16/22 9:12 PM, Andres Freund wrote:
> > This still leaves a fair bit of boilerplate. ISTM that the function body
> > really should just be a single line.
> >
> > Might even be worth defining the whole function via a macro. Perhaps
> > something like
> >
> > PGSTAT_DEFINE_REL_FIELD_ACCESSOR(PGSTAT_KIND_INDEX, pg_stat_get_index, numscans);
>
> Thanks for the feedback!
>
> Right, what about something like the following?
>
> "
> #define PGSTAT_FETCH_STAT_ENTRY(pgstat_entry_kind, pgstat_fetch_stat_function, relid, stat_name) \
> do { \
> 	pgstat_entry_kind *entry = pgstat_fetch_stat_function(relid); \
> 	PG_RETURN_INT64(entry == NULL ? 0 : (int64) (entry->stat_name)); \
> } while (0)
>
> Datum
> pg_stat_get_index_numscans(PG_FUNCTION_ARGS)
> {
> 	PGSTAT_FETCH_STAT_ENTRY(PgStat_StatIndEntry, pgstat_fetch_stat_indentry, PG_GETARG_OID(0), numscans);
> }
> "

That's better, but still seems like quite a bit of repetition, given the number of accessors. I think I like my idea of a macro defining the whole function a bit better. I'd define a "base" macro and then a version that's specific to tables and indexes each, so that the pieces related to that don't have to be repeated as often.

> > This should probably be done in a preparatory commit.
>
> Proposal submitted in [1]. Now merged.

Greetings,

Andres Freund
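For readers less fluent in C macros, the shape of the "macro defines the whole function" idea is analogous to a function factory. A Python sketch of the concept (all names and the sample data below are invented for illustration, not the actual PostgreSQL functions):

```python
# Python analogy for the proposed C macro: instead of hand-writing one
# near-identical accessor per counter, generate them from a single template.
FAKE_INDEX_STATS = {42: {"numscans": 7, "tuples_returned": 130}}

def define_index_accessor(stat_name):
    """Build an accessor for one counter; plays the role the C macro
    would play at compile time."""
    def accessor(relid):
        entry = FAKE_INDEX_STATS.get(relid)  # stand-in for the fetch function
        return 0 if entry is None else entry[stat_name]
    accessor.__name__ = f"pg_stat_get_index_{stat_name}"
    return accessor

pg_stat_get_index_numscans = define_index_accessor("numscans")
print(pg_stat_get_index_numscans(42))  # 7
print(pg_stat_get_index_numscans(99))  # 0 for a missing entry
```

The "base macro plus table- and index-specific variants" refinement corresponds to layering one factory on top of another, so the entry type and fetch function are spelled out once per statistics kind rather than once per counter.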
Re: Getting rid of SQLValueFunction
On Sun, Nov 20, 2022 at 3:12 PM Michael Paquier wrote:
> On Sun, Nov 20, 2022 at 08:21:10AM -0800, Ted Yu wrote:
> > For get_func_sql_syntax(), the code for cases
> > of F_CURRENT_TIME, F_CURRENT_TIMESTAMP, F_LOCALTIME and F_LOCALTIMESTAMP is
> > mostly the same.
> > Maybe we can introduce a helper so that code duplication is reduced.
>
> It would. Thanks for the suggestion.
>
> Do you like something like the patch 0002 attached? This reduces a
> bit the overall size of the patch. Both ought to be merged in the
> same commit, still it is easier to see the simplification created.
> --
> Michael

Hi,
Thanks for the quick response.

+ * timestamp. These require a specific handling with their typmod is given
+ * by the function caller through their SQL keyword.

typo: typmod is given -> typmod given

Other than the above, code looks good to me.

Cheers
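The duplication being factored out comes from each branch printing the SQL keyword plus an optional precision. One possible shape for such a helper, sketched in Python with invented names (the committed C helper may differ):

```python
def render_sql_keyword(keyword, typmod=None):
    """Render CURRENT_TIME / LOCALTIMESTAMP etc., appending the precision
    in parentheses only when a typmod was supplied by the caller."""
    return keyword if typmod is None else f"{keyword}({typmod})"

print(render_sql_keyword("CURRENT_TIMESTAMP"))  # CURRENT_TIMESTAMP
print(render_sql_keyword("CURRENT_TIME", 3))    # CURRENT_TIME(3)
```

With the keyword name and typmod as the only varying inputs, the four near-identical switch cases collapse into four one-line calls.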
Re: Add LZ4 compression in pg_dump
On Sun, Nov 20, 2022 at 11:26:11AM -0600, Justin Pryzby wrote:
> I think this patch record should be closed for now. You can re-open the
> existing patch record once a patch is ready to be reviewed.

Indeed. As things are, this is just a dead entry in the CF, which would be confusing. I have marked it as RwF.

-- 
Michael

signature.asc
Description: PGP signature
Re: Getting rid of SQLValueFunction
On Sun, Nov 20, 2022 at 08:21:10AM -0800, Ted Yu wrote:
> For get_func_sql_syntax(), the code for cases
> of F_CURRENT_TIME, F_CURRENT_TIMESTAMP, F_LOCALTIME and F_LOCALTIMESTAMP is
> mostly the same.
> Maybe we can introduce a helper so that code duplication is reduced.

It would. Thanks for the suggestion.

Do you like something like the patch 0002 attached? This reduces a bit the overall size of the patch. Both ought to be merged in the same commit, still it is easier to see the simplification created.

-- 
Michael

From 3611cfa6f0171dcd65eeb88461f4c48989cd3f1a Mon Sep 17 00:00:00 2001
From: Michael Paquier
Date: Sun, 20 Nov 2022 11:53:28 +0900
Subject: [PATCH v5 1/2] Replace SQLValueFunction by direct function calls

This impacts 9 patterns where the SQL grammar takes priority:
- localtime, that can take a typmod.
- localtimestamp, that can take a typmod.
- current_time, that can take a typmod.
- current_timestamp, that can take a typmod.
- current_date.

Five new entries are added to pg_proc to compensate the removal of
SQLValueFunction to keep compatibility (when a keyword is specified in
a SELECT or in a FROM clause without an alias), with tests to cover
all that.

XXX: bump catalog version.
Reviewed-by: Corey Huinker
Discussion: https://postgr.es/m/yzag3morycguu...@paquier.xyz
---
 src/include/catalog/pg_proc.dat           |  15 +++
 src/include/executor/execExpr.h           |   8 --
 src/include/nodes/primnodes.h             |  33 ---
 src/include/utils/date.h                  |   4 -
 src/include/utils/timestamp.h             |   4 -
 src/backend/catalog/system_functions.sql  |  26 ++
 src/backend/executor/execExpr.c           |  11 ---
 src/backend/executor/execExprInterp.c     |  46 -
 src/backend/jit/llvm/llvmjit_expr.c       |   6 --
 src/backend/jit/llvm/llvmjit_types.c      |   1 -
 src/backend/nodes/nodeFuncs.c             |  27 +-
 src/backend/optimizer/path/costsize.c     |   1 -
 src/backend/optimizer/util/clauses.c      |  39 ++--
 src/backend/parser/gram.y                 |  59 +++-
 src/backend/parser/parse_expr.c           |  52 ---
 src/backend/parser/parse_target.c         |  25 -
 src/backend/utils/adt/date.c              |  81 +---
 src/backend/utils/adt/ruleutils.c         | 109 +-
 src/backend/utils/adt/timestamp.c         |  72 --
 src/backend/utils/misc/queryjumble.c      |   9 --
 src/test/regress/expected/expressions.out |   2 +-
 src/test/regress/sql/expressions.sql      |   2 +-
 src/tools/pgindent/typedefs.list          |   2 -
 23 files changed, 245 insertions(+), 389 deletions(-)

diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index fd2559442e..35dc369728 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -1520,6 +1520,21 @@
 { oid => '9977', descr => 'system user name',
   proname => 'system_user', provolatile => 's', prorettype => 'text',
   proargtypes => '', prosrc => 'system_user' },
+{ oid => '9978', descr => 'current date',
+  proname => 'current_date', provolatile => 's', prorettype => 'date',
+  proargtypes => '', prosrc => 'current_date' },
+{ oid => '9979', descr => 'current time',
+  proname => 'current_time', provolatile => 's', prorettype => 'timetz',
+  proargtypes => 'int4', prosrc => 'current_time', proisstrict => 'f' },
+{ oid => '9980', descr => 'current timestamp',
+  proname => 'current_timestamp', provolatile => 's', prorettype => 'timestamptz',
+  proargtypes => 'int4', prosrc =>
'current_timestamp', proisstrict => 'f' },
+{ oid => '9981', descr => 'local time',
+  proname => 'localtime', provolatile => 's', prorettype => 'time',
+  proargtypes => 'int4', prosrc => 'sql_localtime', proisstrict => 'f' },
+{ oid => '9982', descr => 'local timestamp',
+  proname => 'localtimestamp', provolatile => 's', prorettype => 'timestamp',
+  proargtypes => 'int4', prosrc => 'sql_localtimestamp', proisstrict => 'f' },
 { oid => '744',
   proname => 'array_eq', prorettype => 'bool',
diff --git a/src/include/executor/execExpr.h b/src/include/executor/execExpr.h
index e14f15d435..0557302b92 100644
--- a/src/include/executor/execExpr.h
+++ b/src/include/executor/execExpr.h
@@ -170,7 +170,6 @@ typedef enum ExprEvalOp
 	EEOP_DISTINCT,
 	EEOP_NOT_DISTINCT,
 	EEOP_NULLIF,
-	EEOP_SQLVALUEFUNCTION,
 	EEOP_CURRENTOFEXPR,
 	EEOP_NEXTVALUEEXPR,
 	EEOP_ARRAYEXPR,
@@ -416,12 +415,6 @@ typedef struct ExprEvalStep
 			FunctionCallInfo fcinfo_data_in;
 		}			iocoerce;
 
-		/* for EEOP_SQLVALUEFUNCTION */
-		struct
-		{
-			SQLValueFunction *svf;
-		}			sqlvaluefunction;
-
 		/* for EEOP_NEXTVALUEEXPR */
 		struct
 		{
@@ -741,7 +734,6 @@ extern void ExecEvalParamExec(ExprState *state, ExprEvalStep *op,
 							  ExprContext *econtext);
 extern void ExecEvalParamExtern(ExprState *state, ExprEvalStep *op,
 								ExprContext *econtext);
-extern void ExecEvalSQLValueFunction(ExprState *state, ExprEvalStep *op);
 extern
Re: perform_spin_delay() vs wait events
Hi,

On 2022-11-20 17:26:11 -0500, Robert Haas wrote:
> On Sun, Nov 20, 2022 at 3:43 PM Andres Freund wrote:
> > I couldn't quite decide what wait_event_type to best group this under? In the
> > attached patch I put it under timeouts, which doesn't seem awful.
>
> I think it would be best to make it its own category, like we do with
> buffer pins.

I was wondering about that too - but decided against it because it would only show a single wait event, and it wouldn't really describe spinlocks as a whole, just the "extreme" delays.

If we wanted to report spin waits at a finer granularity, we'd presumably have to fit the wait events into the lwlock, buffers and some new category where we name individual spinlocks. But I guess a single spinlock wait event type with an ExponentialBackoff wait event or such wouldn't be too bad.

Greetings,

Andres Freund
Re: Reducing power consumption on idle servers
On Mon, Nov 21, 2022 at 10:31:15AM +1300, Thomas Munro wrote: > On Mon, Nov 21, 2022 at 9:00 AM Simon Riggs > wrote: >> As a 3rd patch, I will work on making logical workers hibernate. > > Duelling patch warning: Nathan mentioned[1] that he's hacking on a > patch for that, along the lines of the recent walreceiver change IIUC. I coded something up last week, but, like the walreceiver patch, it caused check-world to take much longer [0], and I haven't looked into whether it could be easily fixed. I'm hoping to make some time for this again in the near future. [0] https://postgr.es/m/20221113222644.GA1269110%40nathanxps13 -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: More efficient build farm animal wakeup?
On Sun, Nov 20, 2022 at 2:44 AM Andrew Dunstan wrote: > It might not suit your use case, but one of the things I do to reduce > fetch load is to run a local mirror which runs > >git fetch -q --prune > > every 5 minutes. It also runs a git daemon, and several of my animals > point at that. Thanks. I understand now that my configuration without a local mirror is super inefficient (it spends the first ~25s of each minute running git commands). Still, even though that can be improved by me setting up more stuff, I'd like something event-driven rather than short polling-based for lower latency. > If there's a better git API I'll be happy to try to use it. Cool. Seems like we just have to invent something first... FWIW I'm also trying to chase the short polling out of cfbot. It regularly harasses the git servers at one end (could be fixed with this approach), and wastes a percentage of our allotted CPU slots on the other end by scheduling periodically (could be fixed with webhooks from Cirrus).
Re: perform_spin_delay() vs wait events
On Sun, Nov 20, 2022 at 3:43 PM Andres Freund wrote: > The lwlock wait queue scalability issue [1] was quite hard to find because of > the exponential backoff and because we adjust spins_per_delay over time within > a backend. > > I think the least we could do to make this easier would be to signal spin > delays as wait events. We surely don't want to do so for non-contended spins > because of the overhead, but once we get to the point of calling pg_usleep() > that's not an issue. > > I don't think it's worth trying to hand down more detailed information about > the specific spinlock we're waiting on, at least for now. We'd have to invent > a whole lot of new wait events because most spinlocks don't have ones yet. > > I couldn't quite decide what wait_event_type to best group this under? In the > attached patch I put it under timeouts, which doesn't seem awful. I think it would be best to make it its own category, like we do with buffer pins. -- Robert Haas EDB: http://www.enterprisedb.com
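For context, the sleeps being discussed grow roughly exponentially between a minimum and a maximum. A standalone sketch of that shape (the constants and randomization only approximate s_lock.c's perform_spin_delay; this is not a copy of it):

```python
import random

MIN_DELAY_USEC = 1000        # roughly the 1 ms starting delay
MAX_DELAY_USEC = 1_000_000   # roughly the 1 s cap

def spin_delays(n, seed=0):
    """Yield the first n sleep durations a contended spinlock might see.
    The proposed wait event would bracket each of these pg_usleep() calls,
    while the non-contended (no-sleep) spins stay unreported."""
    rng = random.Random(seed)
    cur = MIN_DELAY_USEC
    for _ in range(n):
        yield cur
        # grow the delay by a random fraction of itself, up to the cap
        cur = min(cur + int(cur * rng.random() + 0.5), MAX_DELAY_USEC)

delays = list(spin_delays(20))
assert delays[0] == MIN_DELAY_USEC
assert all(b >= a for a, b in zip(delays, delays[1:]))  # non-decreasing
assert delays[-1] <= MAX_DELAY_USEC
```

Because only the pg_usleep() path is instrumented, the overhead concern for uncontended spins does not arise, which is the point made above.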
Re: Question concerning backport of CVE-2022-2625
Hi Tom,

On Sun, Nov 20, 2022 at 11:43:41AM -0500, Tom Lane wrote:
> Roberto =?iso-8859-1?Q?C=2E_S=E1nchez?= writes:
> > -- this makes a shell "point <<@@ polygon" operator too
> > CREATE OPERATOR @@>> ( PROCEDURE = poly_contain_pt,
> >    LEFTARG = polygon, RIGHTARG = point,
> >    COMMUTATOR = <<@@ );
> > CREATE EXTENSION test_ext_cor; -- fail
> > ERROR: operator <<@@(point,polygon) is not a member of extension
> > "test_ext_cor"
> > DETAIL: An extension is not allowed to replace an object that it does not
> > own.
> > DROP OPERATOR <<@@ (point, polygon);
> > CREATE EXTENSION test_ext_cor; -- now it should work
> > +ERROR: operator 16427 is not a member of extension "test_ext_cor"
> > +DETAIL: An extension is not allowed to replace an object that it does not
> > own.
>
> That is ... odd. Since 9.4 is long out of support I'm unenthused
> about investigating it myself. (Why is it that people will move heaven
> and earth to fix "security" bugs in dead branches, but ignore even
> catastrophic data-loss bugs?) But if you're stuck with pursuing
> this exercise, I think you'd better figure out exactly what's
> happening. I agree that it smells like c94959d41 could be related,
> but I don't see just how that'd produce this symptom. Before that
> commit, the DROP OPERATOR <<@@ would have left a dangling link
> behind in @@>> 's oprcom field, but there doesn't seem to be a
> reason why that'd affect the test_ext_cor extension: it will not
> be re-using the same operator OID, nor would it have any reason to
> touch @@>>, since there's no COMMUTATOR clause in the extension.

I understand your reluctance to dive into a branch that is long dead from your perspective. That said, I am grateful for the insights you provided here.

> It'd likely be a good idea to reproduce this with a gdb breakpoint
> set at errfinish, and see exactly what's leading up to the error.

Thanks for this suggestion. I will see if I am able to isolate the precise cause of the failure with this.
Regards, -Roberto -- Roberto C. Sánchez
Re: More efficient build farm animal wakeup?
On Mon, Nov 21, 2022 at 10:31 AM Magnus Hagander wrote:
> Um, branches of interest will only pick up when it gets a new *branch*, not a
> new *commit*, so I think that would be a very different problem to solve. And
> I don't think we have new branches *that* often...

Sure, could be done with an extra different request you make from time to time, or by keeping the existing list. No strong opinions on that, I was just observing that it could also be combined, something like:

Client: I have 14@1234, 15@1234, HEAD@1234; what should I do now, boss?
Server: You should fetch 14 (it has a new commit) and 16 (it's a new branch you didn't mention).

> I'd imagine something like a
>
> GET https://git.postgresql.org/buildfarm-branchtips
> X-branch-master: a4adc31f69
> X-branch-REL_14_STABLE: b33283cbd3
> X-longpoll: 120
>
> For that one it would check branch master and rel 14, and if either branch tip
> doesn't match what was in the header, it'd return immediately with a textfile
> that's basically
>
> master:
>
> if master has changed and not REL_14.
>
> If nothing has changed, go into longpoll for 120 seconds based on the header,
> and if nothing at all has changed in that time, return a 304.

LGTM, that's exactly the sort of thing I was imagining.

> We could also use something like a websocket to just stream the changes out
> over.

True. The reason I started on about long polling instead of websockets is that I was imagining that the simpler, dumber protocol, where the client doesn't even really know it's participating in a new kind of magic, would be more cromulent in ye olde perl script (no new cpan dependencies).

> In either case it would also need to change the buildfarm client to run as a
> daemon rather than a cronjob I think? (obviously optional, we don't have to
> remove the current abilities)

Given that the point of the build farm is (these days) to test on weird computers and operating systems, I expect that proper 'run like a service' support would be painful or not get done.
It'd be nice if there were some way to make this work with simple crontab entries...
Re: Precedence of bitwise operators
Bauyrzhan Sakhariyev writes: > Hi! Do I get it right, that bitwise operations have the same precedence? Yes, that is what the documentation says: https://www.postgresql.org/docs/current/sql-syntax-lexical.html#SQL-PRECEDENCE Operator precedence is hard-wired into our parser, so we don't get to have a lot of flexibility in assigning precedences for any except a very small set of operator names. regards, tom lane
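A concrete consequence of this (my example, not from the question): since operators like |, # and & all fall into the same "any other operator" precedence class and associate left to right, PostgreSQL can group an expression differently from C. Python follows C-like precedence (& binds tighter than |), so the two groupings can be compared explicitly:

```python
# PostgreSQL parses  4 | 2 & 1  left to right:  (4 | 2) & 1.
# C and Python give & higher precedence:        4 | (2 & 1).
left_to_right = (4 | 2) & 1  # equal precedence + left associativity
c_like = 4 | (2 & 1)         # C-style precedence
print(left_to_right)  # 0
print(c_like)         # 4
```

When porting bitwise expressions from C-family languages, parenthesizing explicitly sidesteps the difference entirely.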
Re: Reducing power consumption on idle servers
On Mon, Nov 21, 2022 at 9:00 AM Simon Riggs wrote: > As a 3rd patch, I will work on making logical workers hibernate. Duelling patch warning: Nathan mentioned[1] that he's hacking on a patch for that, along the lines of the recent walreceiver change IIUC. [1] https://www.postgresql.org/message-id/flat/CA%2BhUKGJGhX4r2LPUE3Oy9BX71Eum6PBcS8L3sJpScR9oKaTVaA%40mail.gmail.com
Re: More efficient build farm animal wakeup?
On Sun, Nov 20, 2022 at 4:56 AM Thomas Munro wrote:
> On Sun, Nov 20, 2022 at 1:35 AM Magnus Hagander wrote:
> > tl;dr: it's not there now, but yes if we can find a smart way for the BF
> > clients to consume it, it is something we could build and deploy fairly
> > easily.
>
> Cool -- it sounds a lot like you've thought about this already :-)
>
> About the client: currently run_branches.pl makes an HTTP request for
> the "branches of interest" list. Seems like a candidate point for a
> long poll? I don't think it'd have to be much smarter than it is
> today, it'd just have to POST the commits it already has, I think.

Um, branches of interest will only pick up when it gets a new *branch*, not a new *commit*, so I think that would be a very different problem to solve. And I don't think we have new branches *that* often...

> Perhaps as a first step, the server could immediately report which
> branches to bother fetching, considering the client's existing
> commits. That'd almost always be none, but ~11.7 times per day a new
> commit shows up, and once a year there's a new interesting branch.
> That would avoid the need for the 6 git fetches that usually follow in
> the common case, which admittedly might not be a change worth making
> on its own. After all, the git fetches are probably quite similar
> HTTP requests themselves, except that there are 6 of them, one per branch,
> and they hit the public git server instead of some hypothetical
> buildfarm endpoint.

As Andres mentioned downthread, that's not a lot more lightweight than what "git fetch" does. The thing we'd want to avoid is having to do that so much and so often. And getting to that is going to require modification of the buildfarm client to make it more "smart" regardless. In particular, making it do this "right" in the face of multiple branches is probably going to be a big win.
Then you could switch to long polling by letting the client say "if > currently none, I'm prepared to wait up to X seconds for a different > answer", assuming you know how to build the server side of that > (insert magic here). Of course, you can't make it too long or your > session might be dropped in the badlands between client and server, > but that's just a reason to make X configurable. I think RFC6202 says > that 120 seconds probably works fine across most kinds of links, which > means that you lower the total poll rate hitting the server, but--more > interestingly for me as a client--you minimise latency when something > finally happens. (With various keepalive tricks and/or heartbeat > streaming tricks you could possibly make it much higher, who knows... > but you'd have to set it very very low to do worse than what we're > doing today in total request count). Or maybe there is some existing > easy perl library that could be used for this (joke answer: cpan > install Twitter::API and follow @pg_commits). > I also honestly wonder how big a problem a much longer than 120 seconds timeout would be in practice. Since we own both the client and the server in this case, we'd only be at mercy of network equipment in between and I think we're much less exposed to weirdness there than "the average browser". Thus, as long as it's configurable, I think we could go for something much longer by default. I'd imagine something like a GET https://git.postgresql.org/buildfarm-branchtips X-branch-master: a4adc31f69 X-branch-REL_14_STABLE: b33283cbd3 X-longpoll: 120 For that one it would check branch master and rel 14, and if either branchtip doesn't match what was in the header, it'd return immediately with a textfile that's basically master: if master has changed and not REL_14. If nothing has changed, go into longpoll for 120 seconds based on the header, and if nothing at all has changed in that time, return a 304. 
We could also use something like a websocket to just stream the changes out over. In either case we would also need to change the buildfarm client to run as a daemon rather than a cronjob, I think? (obviously optional, we don't have to remove the current abilities) However, when I started this thread I was half expecting such a thing > to exist already, somewhere, I just haven't been able to find it > myself... Don't other people have this problem? Maybe everybody who > has this problem uses webhooks (git server post commit hook opens > connection to client) as you mentioned, but as you also mentioned > that'd never fly for our topology. > Yeah, webhook seems to be what most people use. FWIW, an implementation for us would be a small daemon that receives such webhooks from our git server and redistributes them for the long polling. That's still the easiest way to get the data out of git itself... //Magnus
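To make the shape of the long-poll exchange discussed in this thread concrete, here is a minimal Python model of the client and server logic. The endpoint, the `X-branch-*`/`X-longpoll` header names, and the 304-on-timeout behavior are all assumptions taken from the proposal sketched above, not an existing buildfarm API.

```python
def build_request_headers(branch_tips, longpoll_seconds=120):
    """Build the imagined X-branch-* request headers from the
    client's currently known branch tips."""
    headers = {f"X-branch-{branch}": tip for branch, tip in branch_tips.items()}
    headers["X-longpoll"] = str(longpoll_seconds)
    return headers

def changed_branches(client_tips, server_tips):
    """Server side of the same idea: report only branches whose tip
    differs from what the client already has; None stands in for the
    HTTP 304 returned when nothing changed within the long-poll window."""
    changed = {b: t for b, t in server_tips.items()
               if client_tips.get(b) != t}
    return changed or None

headers = build_request_headers({"master": "a4adc31f69",
                                 "REL_14_STABLE": "b33283cbd3"})
# master moved, REL_14_STABLE did not: only master is reported
print(changed_branches({"master": "a4adc31f69", "REL_14_STABLE": "b33283cbd3"},
                       {"master": "deadbeef01", "REL_14_STABLE": "b33283cbd3"}))
# → {'master': 'deadbeef01'}
```

A real implementation would hold the request open on the server until a tip changes or the `X-longpoll` deadline passes; the functions above only capture the decision logic.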
Precedence of bitwise operators
Hi! Do I get it right that bitwise operators have the same precedence? The query *SELECT 1 & 2 | 3, 3 | 1 & 2* returns 3 and 2 respectively. See also https://www.db-fiddle.com/f/iZHd8zG7A1HjbB6J2y8R7k/1. It looks like the result is calculated from left to right and the operators have the same precedence. I checked the relevant documentation pages ( https://www.postgresql.org/docs/current/functions-bitstring.html and https://www.postgresql.org/docs/current/sql-syntax-lexical.html) and couldn't find any information about bitwise operator precedence, only information about logical operator precedence. I'm not saying it's a bug, rather trying to clarify, as the precedence of bitwise operators is different in programming languages, say C++ ( https://en.cppreference.com/w/c/language/operator_precedence) or Java ( https://docs.oracle.com/javase/tutorial/java/nutsandbolts/operators.html)
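For anyone comparing against C-family languages, the left-to-right behavior the queries above exhibit can be demonstrated with a small Python sketch (Python, like C and Java, gives & higher precedence than |). The helper is purely illustrative of the observed evaluation order:

```python
def pg_left_to_right(tokens):
    """Evaluate a flat [value, op, value, ...] list strictly left to
    right, mimicking the equal-precedence behavior the queries show."""
    result = tokens[0]
    for i in range(1, len(tokens), 2):
        op, operand = tokens[i], tokens[i + 1]
        result = (result & operand) if op == '&' else (result | operand)
    return result

# SELECT 1 & 2 | 3: both conventions happen to agree, (1 & 2) | 3 = 3
assert pg_left_to_right([1, '&', 2, '|', 3]) == 3
assert 1 & 2 | 3 == 3          # Python/C precedence gives the same answer

# SELECT 3 | 1 & 2: left-to-right gives (3 | 1) & 2 = 2,
# while C/Java/Python give 3 | (1 & 2) = 3
assert pg_left_to_right([3, '|', 1, '&', 2]) == 2
assert 3 | 1 & 2 == 3
```

The second pair is the interesting one: the two conventions only diverge when | appears to the left of &.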
Re: Reducing power consumption on idle servers
On Mon, Nov 21, 2022 at 3:35 AM Simon Riggs wrote: > On Sat, 19 Nov 2022 at 10:59, Simon Riggs > wrote: > > New version attached. > > Fix for doc xref I removed a stray variable declaration from xlogrecovery.h, and wrote a draft commit message. I'll wait 24 hours before committing, to provide a last chance for anyone who wants to complain about dropping promote_trigger_file. (We could of course change it so that the timeout based wakeup only happens if you have actually set promote_trigger_file, but I think we've established that we just don't want the filesystem polling feature so I'm whispering this in parentheses.) From 6a6db8862ceb6618fb5f98c6245dc7d834fe89db Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Mon, 21 Nov 2022 09:37:47 +1300 Subject: [PATCH v11] Remove promote_trigger_file. Previously an otherwise idle startup (recovery) process would wake up every 5 seconds to have a chance to poll for promote_trigger_file, even if that GUC was not configured. That signaling mechanism was superseded by pg_ctl promote and pg_promote() a long time ago. There probably aren't many users left and it's very easy to change to the modern mechanisms, so let's remove that feature. This is part of a campaign to reduce wakeups on idle systems. 
Author: Simon Riggs Reviewed-by: Bharath Rupireddy Reviewed-by: Robert Haas Reviewed-by: Thomas Munro Discussion: https://postgr.es/m/CANbhV-FsjnzVOQGBpQ589%3DnWuL1Ex0Ykn74Nh1hEjp2usZSR5g%40mail.gmail.com --- .../appendix-obsolete-recovery-config.sgml| 9 ++ doc/src/sgml/config.sgml | 18 --- doc/src/sgml/high-availability.sgml | 24 ++ src/backend/access/transam/xlogrecovery.c | 32 --- src/backend/utils/misc/guc_tables.c | 10 -- src/backend/utils/misc/postgresql.conf.sample | 1 - src/include/access/xlogrecovery.h | 1 - 7 files changed, 18 insertions(+), 77 deletions(-) diff --git a/doc/src/sgml/appendix-obsolete-recovery-config.sgml b/doc/src/sgml/appendix-obsolete-recovery-config.sgml index 1cf4913114..8ca519b5f1 100644 --- a/doc/src/sgml/appendix-obsolete-recovery-config.sgml +++ b/doc/src/sgml/appendix-obsolete-recovery-config.sgml @@ -35,13 +35,8 @@ The -trigger_file - - trigger_file - promote_trigger_file - -setting has been renamed to -. +trigger_file and promote_trigger_file +have been removed. diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index bd50ea8e48..211dfc27ac 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -4610,24 +4610,6 @@ ANY num_sync ( -promote_trigger_file (string) - - promote_trigger_file configuration parameter - - - - - Specifies a trigger file whose presence ends recovery in the - standby. Even if this value is not set, you can still promote - the standby using pg_ctl promote or calling - pg_promote(). - This parameter can only be set in the postgresql.conf - file or on the server command line. - - - - hot_standby (boolean) diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml index b2b3129397..cac8c71a44 100644 --- a/doc/src/sgml/high-availability.sgml +++ b/doc/src/sgml/high-availability.sgml @@ -653,11 +653,11 @@ protocol to make nodes agree on a serializable transactional order. 
Standby mode is exited and the server switches to normal operation -when pg_ctl promote is run, -pg_promote() is called, or a trigger file is found -(promote_trigger_file). Before failover, -any WAL immediately available in the archive or in pg_wal will be -restored, but no attempt is made to connect to the primary. +when pg_ctl promote is run, or +pg_promote() is called. The parameter +promote_trigger_file has been removed. Before failover, +any WAL immediately available in the archive or in pg_wal +will be restored, but no attempt is made to connect to the primary. @@ -1483,15 +1483,11 @@ synchronous_standby_names = 'ANY 2 (s1, s2, s3)' To trigger failover of a log-shipping standby server, run -pg_ctl promote, call pg_promote(), -or create a trigger file with the file name and path specified by the -promote_trigger_file. If you're planning to use -pg_ctl promote or to call -pg_promote() to fail over, -promote_trigger_file is not required. If you're -setting up the reporting servers that are only used to offload read-only -queries from the primary, not for high availability purposes, you don't -need to promote it. +pg_ctl promote or call pg_promote(). +The parameter promote_trigger_file has been removed. +If you're setting up the reporting servers that are only used to offload +read-only queries from the primary, not for high
Re: Unstable regression test for contrib/pageinspect
On Sun, Nov 20, 2022 at 12:37 PM Tom Lane wrote: > The core reloptions.sql and vacuum.sql tests are two places that are > also using this option, but they are applying it to temp tables, > which I think makes it safe (and the lack of failures, seeing that > they run within parallel test groups, reinforces that). Can we apply > that idea in pageinspect? I believe so. The temp table horizons guarantee isn't all that old, so the tests may well have been written before it was possible. > contrib/amcheck and contrib/pg_visibility are also using > DISABLE_PAGE_SKIPPING, so I wonder if they have similar hazards. > I haven't seen them fall over, though. DISABLE_PAGE_SKIPPING forces aggressive mode (which is also possible with FREEZE), but unlike FREEZE it also forces VACUUM to scan even all-frozen pages. The other difference is that DISABLE_PAGE_SKIPPING doesn't affect FreezeLimit/freeze_min_age, whereas FREEZE sets it to 0. I think that most use of DISABLE_PAGE_SKIPPING by the regression tests just isn't necessary. Especially where it's combined with FREEZE like this, as it often seems to be. Why should the behavior around skipping all-frozen pages (the only thing changed by using DISABLE_PAGE_SKIPPING on top of FREEZE) actually matter to these tests? -- Peter Geoghegan
perform_spin_delay() vs wait events
Hi, The lwlock wait queue scalability issue [1] was quite hard to find because of the exponential backoff and because we adjust spins_per_delay over time within a backend. I think the least we could do to make this easier would be to signal spin delays as wait events. We surely don't want to do so for non-contended spins because of the overhead, but once we get to the point of calling pg_usleep() that's not an issue. I don't think it's worth trying to hand down more detailed information about the specific spinlock we're waiting on, at least for now. We'd have to invent a whole lot of new wait events because most spinlocks don't have ones yet. I couldn't quite decide what wait_event_type to best group this under? In the attached patch I put it under timeouts, which doesn't seem awful. I reverted a4adc31f690 and indeed it shows SpinDelay wait events before the fix and not after. Greetings, Andres Freund [1] https://postgr.es/m/20221027165914.2hofzp4cvutj6gin%40awork3.anarazel.de >From 6d7c62e5ebbe9bf906e5f4e1682ece6fff69cd37 Mon Sep 17 00:00:00 2001 From: Andres Freund Date: Sun, 20 Nov 2022 12:39:14 -0800 Subject: [PATCH] Add WAIT_EVENT_SPIN_DELAY --- src/include/utils/wait_event.h | 3 ++- src/backend/storage/lmgr/s_lock.c | 8 src/backend/utils/activity/wait_event.c | 3 +++ doc/src/sgml/monitoring.sgml| 4 4 files changed, 17 insertions(+), 1 deletion(-) diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h index 6f2d5612e06..3d87d550119 100644 --- a/src/include/utils/wait_event.h +++ b/src/include/utils/wait_event.h @@ -146,7 +146,8 @@ typedef enum WAIT_EVENT_RECOVERY_RETRIEVE_RETRY_INTERVAL, WAIT_EVENT_REGISTER_SYNC_REQUEST, WAIT_EVENT_VACUUM_DELAY, - WAIT_EVENT_VACUUM_TRUNCATE + WAIT_EVENT_VACUUM_TRUNCATE, + WAIT_EVENT_SPIN_DELAY } WaitEventTimeout; /* -- diff --git a/src/backend/storage/lmgr/s_lock.c b/src/backend/storage/lmgr/s_lock.c index 4e473ec27ec..b04930db760 100644 --- a/src/backend/storage/lmgr/s_lock.c +++ 
b/src/backend/storage/lmgr/s_lock.c @@ -53,6 +53,7 @@ #include "common/pg_prng.h" #include "port/atomics.h" #include "storage/s_lock.h" +#include "utils/wait_event.h" #define MIN_SPINS_PER_DELAY 10 #define MAX_SPINS_PER_DELAY 1000 @@ -136,7 +137,14 @@ perform_spin_delay(SpinDelayStatus *status) if (status->cur_delay == 0) /* first time to delay? */ status->cur_delay = MIN_DELAY_USEC; + /* + * Once we start sleeping, the overhead of reporting a wait event is + * justified. Actively spinning easily stands out in profilers, but + * sleeping with an exponential backoff is harder to spot... + */ + pgstat_report_wait_start(WAIT_EVENT_SPIN_DELAY); pg_usleep(status->cur_delay); + pgstat_report_wait_end(); #if defined(S_LOCK_TEST) fprintf(stdout, "*"); diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_event.c index 92f24a6c9bc..02e953420e5 100644 --- a/src/backend/utils/activity/wait_event.c +++ b/src/backend/utils/activity/wait_event.c @@ -503,6 +503,9 @@ pgstat_get_wait_timeout(WaitEventTimeout w) case WAIT_EVENT_VACUUM_TRUNCATE: event_name = "VacuumTruncate"; break; + case WAIT_EVENT_SPIN_DELAY: + event_name = "SpinDelay"; + break; /* no default case, so that compiler will warn */ } diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml index e5d622d5147..0599eac71a6 100644 --- a/doc/src/sgml/monitoring.sgml +++ b/doc/src/sgml/monitoring.sgml @@ -2298,6 +2298,10 @@ postgres 27093 0.0 0.0 30096 2752 ?Ss 11:34 0:00 postgres: ser Waiting to acquire an exclusive lock to truncate off any empty pages at the end of a table vacuumed. + + SpinDelay + Waiting to acquire a contended spinlock. + -- 2.38.0
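For context on why these sleeps are "harder to spot," the randomized exponential backoff that perform_spin_delay() applies between sleeps can be sketched in Python roughly as follows. The constants and the wrap-back-to-minimum behavior are modeled on s_lock.c and should be treated as assumptions of this sketch rather than the exact implementation:

```python
import random

# Assumed stand-ins for s_lock.c's MIN_DELAY_USEC / MAX_DELAY_USEC
MIN_DELAY_USEC = 1000
MAX_DELAY_USEC = 1000 * 1000

def next_delay(cur_delay):
    """One step of the randomized exponential backoff: the first call
    starts at the minimum sleep, later calls grow the sleep by a random
    factor between 1x and 2x, wrapping back to the minimum once the cap
    is exceeded."""
    if cur_delay == 0:          # first time to delay
        return MIN_DELAY_USEC
    cur_delay += int(cur_delay * random.random() + 0.5)
    return MIN_DELAY_USEC if cur_delay > MAX_DELAY_USEC else cur_delay

# Sleeps grow quickly, which is why a sleeping backend barely registers
# in a CPU profile even though it may be badly delayed:
d = 0
for _ in range(12):
    d = next_delay(d)
```

Because nearly all the elapsed time ends up inside pg_usleep() rather than in the spin loop, a wait event around the sleep (as in the patch) is the natural place to make the contention visible.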
Unstable regression test for contrib/pageinspect
My very slow buildfarm animal mamba has failed pageinspect several times [1][2][3][4] with this symptom: diff -U3 /home/buildfarm/bf-data/HEAD/pgsql.build/contrib/pageinspect/expected/page.out /home/buildfarm/bf-data/HEAD/pgsql.build/contrib/pageinspect/results/page.out --- /home/buildfarm/bf-data/HEAD/pgsql.build/contrib/pageinspect/expected/page.out 2022-11-20 10:12:51.780935488 -0500 +++ /home/buildfarm/bf-data/HEAD/pgsql.build/contrib/pageinspect/results/page.out 2022-11-20 14:00:25.818743985 -0500 @@ -92,9 +92,9 @@ SELECT t_infomask, t_infomask2, raw_flags, combined_flags FROM heap_page_items(get_raw_page('test1', 0)), LATERAL heap_tuple_infomask_flags(t_infomask, t_infomask2); - t_infomask | t_infomask2 | raw_flags | combined_flags -+-+---+ - 2816 | 2 | {HEAP_XMIN_COMMITTED,HEAP_XMIN_INVALID,HEAP_XMAX_INVALID} | {HEAP_XMIN_FROZEN} + t_infomask | t_infomask2 |raw_flags| combined_flags ++-+-+ + 2304 | 2 | {HEAP_XMIN_COMMITTED,HEAP_XMAX_INVALID} | {} (1 row) -- tests for decoding of combined flags It's not hard to guess what the problem is here: the immediately preceding bit is hopelessly optimistic. -- If we freeze the only tuple on test1, the infomask should -- always be the same in all test runs. VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) test1; The fact that you asked for a freeze doesn't mean you'll get a freeze. I suppose that a background auto-analyze is holding back global xmin so that the tuple doesn't actually get frozen. The core reloptions.sql and vacuum.sql tests are two places that are also using this option, but they are applying it to temp tables, which I think makes it safe (and the lack of failures, seeing that they run within parallel test groups, reinforces that). Can we apply that idea in pageinspect? contrib/amcheck and contrib/pg_visibility are also using DISABLE_PAGE_SKIPPING, so I wonder if they have similar hazards. I haven't seen them fall over, though. 
regards, tom lane [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2022-11-20%2015%3A13%3A19 [2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2022-10-31%2013%3A33%3A35 [3] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2022-10-19%2016%3A34%3A07 [4] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2022-08-29%2017%3A49%3A02
Re: Reducing power consumption on idle servers
On Thu, 24 Mar 2022 at 16:21, Robert Haas wrote: > > On Thu, Mar 24, 2022 at 12:02 PM Simon Riggs wrote: > > What changes will be acceptable for bgwriter, walwriter and logical worker? > > Hmm, I think it would be fine to introduce some kind of hibernation > mechanism for logical workers. bgwriter and wal writer already have a > hibernation mechanism, so I'm not sure what your concern is there > exactly. In your initial email you said you weren't proposing changes > there, but maybe that changed somewhere in the course of the > subsequent discussion. If you could catch me up on your present > thinking that would be helpful. Now that we seem to have solved the problem for the startup process, let's circle back to the others. Bgwriter does hibernate currently, but only at 50x the bgwriter_delay. At default values that is 5s, but could be much less. So we need to move that up to 60s, same as the others. WALwriter also hibernates currently, but only at 25x the wal_writer_delay. At default values that is 2.5s, but could be much less. So we need to move that up to 60s, same as the others. At the same time, make sure that when we hibernate we use a new WaitEvent, similarly to the way bgwriter reports its hibernation state (which also helps test the patch). As a 3rd patch, I will work on making logical workers hibernate. -- Simon Riggs    http://www.EnterpriseDB.com/ hibernate_bgwriter_walwriter.v5.patch Description: Binary data
Re: HOT chain validation in verify_heapam()
On Tue, Nov 15, 2022 at 10:55 PM Andres Freund wrote: > I'm quite certain that it's possible to end up freezing an earlier row > version in a hot chain in < 14, I got there with careful gdb > orchestration. Of course it's possible I screwed something up, given I did it once, > interactively. Not sure if trying to fix it is worth the risk of backpatching > all the necessary changes to switch to the retry approach. There is code in heap_prepare_freeze_tuple() that treats a raw xmax as "xmax_already_frozen = true", even when the raw xmax value isn't already set to InvalidTransactionId. I'm referring to this code:

    if ( ... ) // process raw xmax
    else if (TransactionIdIsNormal(xid))
        ...
    else if ((tuple->t_infomask & HEAP_XMAX_INVALID) ||
             !TransactionIdIsValid(HeapTupleHeaderGetRawXmax(tuple)))
    {
        freeze_xmax = false;
        xmax_already_frozen = true;
        /* No need for relfrozenxid_out handling for already-frozen xmax */
    }
    else
        ereport(ERROR,
                (errcode(ERRCODE_DATA_CORRUPTED),
                 errmsg_internal("found xmax %u (infomask 0x%04x) not frozen, not multi, not normal",
                                 xid, tuple->t_infomask)));

Why should it be okay to not process xmax during this call (by setting "xmax_already_frozen = true"), just because HEAP_XMAX_INVALID happens to be set? Isn't HEAP_XMAX_INVALID purely a hint? (HEAP_XMIN_FROZEN is *not* a hint, but we're dealing with xmax here.) I'm not sure how relevant this is to the concerns you have about frozen xmax, or even if it's any kind of problem, but it still seems worth fixing. It seems to me that there should be clear rules on what special transaction IDs can appear in xmax. Namely: the only special transaction ID that can ever appear in xmax is InvalidTransactionId. (Also, it's not okay to see *any* other XID in the "xmax_already_frozen = true" path, nor would it be okay to leave any other XID behind in xmax in the nearby "freeze_xmax = true" path.) -- Peter Geoghegan
Re: heavily contended lwlocks with long wait queues scale badly
Hi, On 2022-11-09 17:03:13 -0800, Andres Freund wrote: > On 2022-11-09 09:38:08 -0800, Andres Freund wrote: > > I'm on a hike, without any connectivity, Thu afternoon - Sun. I think it's > > OK > > to push it to HEAD if I get it done in the next few hours. Bigger issues, > > which I do not expect, should show up before tomorrow afternoon. Smaller > > things could wait till Sunday if necessary. > > I didn't get to it in time, so I'll leave it for when I'm back. Took a few days longer, partially because I encountered an independent issue (see 8c954168cff) while testing. I pushed it to HEAD now. I still think it might be worth backpatching in a bit, but so far the votes weren't clear enough for me to feel comfortable doing so. Regards, Andres
Re: ssl tests aren't concurrency safe due to get_free_port()
On 2022-11-20 10:10:38 -0500, Andrew Dunstan wrote: > OK, pushed with a little more tweaking. Thanks! > I didn't upcase top_builddir > because the existing prove_installcheck recipes already export it and I > wanted to stay consistent with those. Makes sense. > If it works ok I will backpatch in a couple of days. +1
Re: Avoid double lookup in pgstat_fetch_stat_tabentry()
Hi, On 2022-11-19 09:38:26 +0100, Drouvot, Bertrand wrote: > I'd vote for V3 for readability, size and "backward compatibility" with > current code. Pushed that. Thanks for the patch and evaluation. Greetings, Andres Freund
Re: Allow placeholders in ALTER ROLE w/o superuser
On Sun, Nov 20, 2022 at 8:48 PM Alexander Korotkov wrote: > I've drafted a patch implementing ALTER ROLE ... SET ... TO ... USER SET > syntax. > > These options work only for USERSET GUC variables, but require > fewer privileges to set. I think there is no problem to implement > > Also it seems that this approach doesn't conflict with future > privileges for USERSET GUCs [1]. I expect that USERSET GUCs should be > available unless explicitly REVOKEd. That means we should be able to > check those privileges during ALTER ROLE. > > Opinions on the patch draft? > > Links > 1. > https://mail.google.com/mail/u/0/?ik=a20b091faa=om=msg-f%3A1749871710745577015 Uh, sorry for the wrong link. I meant https://www.postgresql.org/message-id/2271988.1668807...@sss.pgh.pa.us -- Regards, Alexander Korotkov
Re: Allow placeholders in ALTER ROLE w/o superuser
On Sat, Nov 19, 2022 at 4:02 AM Alexander Korotkov wrote: > On Sat, Nov 19, 2022 at 12:41 AM Tom Lane wrote: > > ... BTW, re-reading the commit message for a0ffa885e: > > > > One caveat is that PGC_USERSET GUCs are unaffected by the SET privilege > > --- one could wish that those were handled by a revocable grant to > > PUBLIC, but they are not, because we couldn't make it robust enough > > for GUCs defined by extensions. > > > > it suddenly struck me to wonder if the later 13d838815 changed the > > situation enough to allow revisiting that problem, and/or if storing > > the source role's OID in pg_db_role_setting would help. > > > > I don't immediately recall all the problems that led us to leave USERSET > > GUCs out of the feature, so maybe this is nuts; but maybe it isn't. > > It'd be worth considering if we're trying to improve matters here. > > I think if we implement the user-visible USERSET flag for ALTER ROLE, > then we might just check permissions for such parameters from the > target role. I've drafted a patch implementing ALTER ROLE ... SET ... TO ... USER SET syntax. These options work only for USERSET GUC variables, but require fewer privileges to set. I think there is no problem to implement Also it seems that this approach doesn't conflict with future privileges for USERSET GUCs [1]. I expect that USERSET GUCs should be available unless explicitly REVOKEd. That means we should be able to check those privileges during ALTER ROLE. Opinions on the patch draft? Links 1. https://mail.google.com/mail/u/0/?ik=a20b091faa=om=msg-f%3A1749871710745577015 -- Regards, Alexander Korotkov 0001-ALTER-ROLE-.-SET-.-TO-.-USER-SET-v1.patch Description: Binary data
Re: Add LZ4 compression in pg_dump
On Fri, Aug 05, 2022 at 02:23:45PM +0000, Georgios Kokolatos wrote: > Thank you for your work during commitfest. > > The patch is still in development. Given vacation status, expect the next > patches to be ready for the November commitfest. > For now it has moved to the September one. Further action will be taken then > as needed. On Sun, Nov 06, 2022 at 02:53:12PM +0000, gkokola...@pm.me wrote: > On Wed, Nov 2, 2022 at 14:28, Justin Pryzby wrote: > > Checking if you'll be able to submit new patches soon ? > > Thank you for checking up. Expect new versions within this commitfest cycle. Hi, I think this patch record should be closed for now. You can re-open the existing patch record once a patch is ready to be reviewed. The commitfest is a time for committing/reviewing patches that were previously submitted, but there's been no new patch since July. Making a patch available for review at the start of the commitfest seems like a requirement for current patch records (same as for new patch records). I wrote essentially the same patch as your early patches 2 years ago (before postgres was ready to consider new compression algorithms), so I'm happy to review a new patch when it's available, regardless of its status in the cfapp. BTW, some of my own review comments from March weren't addressed. Please check. Also, in February, I asked if you knew how to use cirrusci to run checks, but the patches still had compilation errors and warnings on various OSes. https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/39/3571 -- Justin
Re: predefined role(s) for VACUUM and ANALYZE
On Sat, Nov 19, 2022 at 10:50:04AM -0800, Nathan Bossart wrote: > another rebase Another rebase for cfbot. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From 5f716f33e93187491686381b2180894ab2b1b92c Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Wed, 7 Sep 2022 22:25:29 -0700 Subject: [PATCH v12 1/4] Change AclMode from a uint32 to a uint64. --- src/backend/nodes/outfuncs.c| 2 +- src/bin/pg_upgrade/check.c | 35 + src/include/catalog/pg_type.dat | 4 ++-- src/include/nodes/parsenodes.h | 6 +++--- src/include/utils/acl.h | 28 +- 5 files changed, 55 insertions(+), 20 deletions(-) diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c index f05e72f0dc..8f150e9a2e 100644 --- a/src/backend/nodes/outfuncs.c +++ b/src/backend/nodes/outfuncs.c @@ -560,7 +560,7 @@ _outRangeTblEntry(StringInfo str, const RangeTblEntry *node) WRITE_BOOL_FIELD(lateral); WRITE_BOOL_FIELD(inh); WRITE_BOOL_FIELD(inFromCl); - WRITE_UINT_FIELD(requiredPerms); + WRITE_UINT64_FIELD(requiredPerms); WRITE_OID_FIELD(checkAsUser); WRITE_BITMAPSET_FIELD(selectedCols); WRITE_BITMAPSET_FIELD(insertedCols); diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c index f1bc1e6886..615a53a864 100644 --- a/src/bin/pg_upgrade/check.c +++ b/src/bin/pg_upgrade/check.c @@ -28,6 +28,7 @@ static void check_for_incompatible_polymorphics(ClusterInfo *cluster); static void check_for_tables_with_oids(ClusterInfo *cluster); static void check_for_composite_data_type_usage(ClusterInfo *cluster); static void check_for_reg_data_type_usage(ClusterInfo *cluster); +static void check_for_aclitem_data_type_usage(ClusterInfo *cluster); static void check_for_jsonb_9_4_usage(ClusterInfo *cluster); static void check_for_pg_role_prefix(ClusterInfo *cluster); static void check_for_new_tablespace_dir(ClusterInfo *new_cluster); @@ -107,6 +108,13 @@ check_and_dump_old_cluster(bool live_check) check_for_reg_data_type_usage(&old_cluster); check_for_isn_and_int8_passing_mismatch(&old_cluster); + /* 
* PG 16 increased the size of the 'aclitem' type, which breaks the on-disk + * format for existing data. + */ + if (GET_MAJOR_VERSION(old_cluster.major_version) <= 1500) + check_for_aclitem_data_type_usage(&old_cluster); + /* * PG 14 changed the function signature of encoding conversion functions. * Conversions from older versions cannot be upgraded automatically @@ -1319,6 +1327,33 @@ check_for_reg_data_type_usage(ClusterInfo *cluster) check_ok(); } +/* + * check_for_aclitem_data_type_usage + * + * aclitem changed its storage format in 16, so check for it. + */ +static void +check_for_aclitem_data_type_usage(ClusterInfo *cluster) +{ + char output_path[MAXPGPATH]; + + prep_status("Checking for incompatible aclitem data type in user tables"); + + snprintf(output_path, sizeof(output_path), "tables_using_aclitem.txt"); + + if (check_for_data_type_usage(cluster, "pg_catalog.aclitem", output_path)) + { + pg_log(PG_REPORT, "fatal"); + pg_fatal("Your installation contains the \"aclitem\" data type in user tables.\n" + "The internal format of \"aclitem\" changed in PostgreSQL version 16\n" + "so this cluster cannot currently be upgraded. You can drop the\n" + "problem columns and restart the upgrade. 
A list of the problem\n" + "columns is in the file:\n" + "%s", output_path); + } + else + check_ok(); +} /* * check_for_jsonb_9_4_usage() diff --git a/src/include/catalog/pg_type.dat b/src/include/catalog/pg_type.dat index df45879463..0763dfde39 100644 --- a/src/include/catalog/pg_type.dat +++ b/src/include/catalog/pg_type.dat @@ -267,9 +267,9 @@ # OIDS 1000 - 1099 { oid => '1033', array_type_oid => '1034', descr => 'access control list', - typname => 'aclitem', typlen => '12', typbyval => 'f', typcategory => 'U', + typname => 'aclitem', typlen => '16', typbyval => 'f', typcategory => 'U', typinput => 'aclitemin', typoutput => 'aclitemout', typreceive => '-', - typsend => '-', typalign => 'i' }, + typsend => '-', typalign => 'd' }, { oid => '1042', array_type_oid => '1014', descr => 'char(length), blank-padded string, fixed storage length', typname => 'bpchar', typlen => '-1', typbyval => 'f', typcategory => 'S', diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h index 7caff62af7..f4ed9bbff9 100644 --- a/src/include/nodes/parsenodes.h +++ b/src/include/nodes/parsenodes.h @@ -73,12 +73,12 @@ typedef enum SetQuantifier /* * Grantable rights are encoded so that we can OR them together in a bitmask. - * The present representation of AclItem limits us to 16 distinct rights, - * even though AclMode is defined as uint32. See utils/acl.h. + * The present representation of AclItem limits us to 32 distinct rights, + * even though AclMode is defined as uint64. See utils/acl.h. * * Caution: changing these codes breaks stored ACLs, hence forces initdb.
Re: Question concerning backport of CVE-2022-2625
Roberto C. Sánchez writes: > -- this makes a shell "point <<@@ polygon" operator too > CREATE OPERATOR @@>> ( PROCEDURE = poly_contain_pt, >LEFTARG = polygon, RIGHTARG = point, >COMMUTATOR = <<@@ ); > CREATE EXTENSION test_ext_cor; -- fail > ERROR: operator <<@@(point,polygon) is not a member of extension > "test_ext_cor" > DETAIL: An extension is not allowed to replace an object that it does not > own. > DROP OPERATOR <<@@ (point, polygon); > CREATE EXTENSION test_ext_cor; -- now it should work > +ERROR: operator 16427 is not a member of extension "test_ext_cor" > +DETAIL: An extension is not allowed to replace an object that it does not > own. That is ... odd. Since 9.4 is long out of support I'm unenthused about investigating it myself. (Why is it that people will move heaven and earth to fix "security" bugs in dead branches, but ignore even catastrophic data-loss bugs?) But if you're stuck with pursuing this exercise, I think you'd better figure out exactly what's happening. I agree that it smells like c94959d41 could be related, but I don't see just how that'd produce this symptom. Before that commit, the DROP OPERATOR <<@@ would have left a dangling link behind in @@>> 's oprcom field, but there doesn't seem to be a reason why that'd affect the test_ext_cor extension: it will not be re-using the same operator OID, nor would it have any reason to touch @@>>, since there's no COMMUTATOR clause in the extension. It'd likely be a good idea to reproduce this with a gdb breakpoint set at errfinish, and see exactly what's leading up to the error. regards, tom lane
Re: Slow standby snapshot
On Sun, 20 Nov 2022 at 13:45, Michail Nikolaev wrote: > If such approach looks committable - I could do more careful > performance testing to find the best value for > WASTED_SNAPSHOT_WORK_LIMIT_TO_COMPRESS. Nice patch. We seem to have replaced one magic constant with another, so not sure if this is autotuning, but I like it much better than what we had before (i.e. better than my prev patch). A few thoughts: 1. I was surprised that you removed the limits on size and just had the wasted work limit. If there is no read traffic that will mean we hardly ever compress, which means the removal of xids at commit will get slower over time. I would prefer that we forced compression on a regular basis, such as every time we process an XLOG_RUNNING_XACTS message (every 15s), as well as when we hit certain size limits. 2. If there is lots of read traffic but no changes flowing, it would also make sense to force compression when the startup process goes idle rather than wait for the work to be wasted first. Quick patch to add those two compression events also. That should favour the smaller wasted work limits. -- Simon Riggs    http://www.EnterpriseDB.com/ events_that_force_compression.v1.patch Description: Binary data
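A rough Python sketch of how the extra triggers suggested in this message could combine with the wasted-work limit from Michail's patch. Every name and threshold value below is an illustrative stand-in, not taken from either patch:

```python
# Illustrative thresholds only; the real patch tunes its own constant.
WASTED_WORK_LIMIT = 8192   # stand-in for WASTED_SNAPSHOT_WORK_LIMIT_TO_COMPRESS
SIZE_LIMIT = 4096          # suggestion 1: force compression above this size
FORCE_INTERVAL_S = 15      # ~ the XLOG_RUNNING_XACTS cadence

def should_compress(wasted_work, array_size, seconds_since_compress,
                    startup_idle):
    """Return True when any of the discussed triggers fires."""
    return (wasted_work >= WASTED_WORK_LIMIT            # original trigger
            or array_size >= SIZE_LIMIT                 # size cap
            or seconds_since_compress >= FORCE_INTERVAL_S  # periodic force
            or startup_idle)                            # idle startup process

# With no read traffic the wasted-work counter never grows, but the
# periodic trigger still guarantees a compression pass:
assert should_compress(0, 100, 16, False)
```

The point of the combination is that no single workload shape (read-only, write-only, or idle) can starve the compression step.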
Re: Getting rid of SQLValueFunction
On Sat, Nov 19, 2022 at 7:01 PM Michael Paquier wrote:
> On Fri, Nov 18, 2022 at 10:23:58AM +0900, Michael Paquier wrote:
> > Please note that in order to avoid tweaks when choosing the attribute
> > name of function call, this needs a total of 8 new catalog functions
> > mapping to the SQL keywords, which is what the test added by 2e0d80c
> > is about:
> > - current_role
> > - user
> > - current_catalog
> > - current_date
> > - current_time
> > - current_timestamp
> > - localtime
> > - localtimestamp
> >
> > Any objections?
>
> Hearing nothing, I have gone through 0001 again and applied it as
> fb32748 to remove the dependency between names and SQLValueFunction.
> Attached is 0002, to bring back the CI to a green state.
> --
> Michael

Hi,

For get_func_sql_syntax(), the code for the cases F_CURRENT_TIME, F_CURRENT_TIMESTAMP, F_LOCALTIME and F_LOCALTIMESTAMP is mostly the same. Maybe we can introduce a helper so that code duplication is reduced.

Cheers
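The suggested deduplication might look something like this. This is a standalone sketch, not the actual ruleutils.c code (which builds into a StringInfo): the four cases differ only in the keyword emitted and whether a precision typmod is present, so one helper can serve all of them. Function name and buffer handling here are illustrative assumptions.

```c
#include <stdio.h>

/*
 * Sketch of a shared helper for the F_CURRENT_TIME / F_CURRENT_TIMESTAMP /
 * F_LOCALTIME / F_LOCALTIMESTAMP branches: render either "KEYWORD" or
 * "KEYWORD(typmod)".  A negative typmod means "no precision specified",
 * mirroring how these branches distinguish the two output forms.
 */
static void
append_datetime_keyword(char *buf, size_t buflen,
						const char *keyword, int typmod)
{
	if (typmod < 0)
		snprintf(buf, buflen, "%s", keyword);
	else
		snprintf(buf, buflen, "%s(%d)", keyword, typmod);
}
```

Each switch case would then collapse to a single call with its keyword, which is the duplication reduction being proposed.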
Re: ssl tests aren't concurrency safe due to get_free_port()
On 2022-11-19 Sa 15:16, Andres Freund wrote:
> Hi,
>
> On 2022-11-19 10:56:33 -0500, Andrew Dunstan wrote:
>>> Perhaps we should just export a directory in configure instead of this
>>> guessing game?
>> I think the obvious candidate would be to export top_builddir from
>> src/Makefile.global. That would remove the need to infer it from
>> TESTDATADIR.
> I think that'd be good. I'd perhaps rename it in the process so it's
> exported uppercase, but whatever...

OK, pushed with a little more tweaking. I didn't upcase top_builddir because the existing prove_installcheck recipes already export it and I wanted to stay consistent with those. If it works ok I will backpatch in a couple of days.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Re: Reducing power consumption on idle servers
On Sat, 19 Nov 2022 at 10:59, Simon Riggs wrote:
> New version attached.

Fix for doc xref.

--
Simon Riggs
http://www.EnterpriseDB.com/

hibernate_startup.v10.patch
Description: Binary data
Question concerning backport of CVE-2022-2625
Greetings PGSQL hackers,

I am working on a backport of CVE-2022-2625 to PostgreSQL 9.6 and 9.4. I am starting from commit 5919bb5a5989cda232ac3d1f8b9d90f337be2077. The backport to 9.6 was relatively straightforward, the principal change being to omit some of the hunks related to commands in 9.6 that did not have support for 'IF NOT EXISTS'.

When it came to 9.4, things got a little more interesting. There were additional instances of commands that did not have support for 'IF NOT EXISTS' and some of the constructions were slightly different as well, but nothing insurmountable there. I did have to hack at the 9.4 test harness a bit since the test_extensions sub-directory seems to have been introduced post-9.4, and it seemed like a good idea to have the actual tests from the aforementioned commit to help guard against some sort of unintended change on my part.

However, after I got through the CINE changes and started dealing with the COR changes I ran into something fairly peculiar. The test output included this:

DROP VIEW ext_cor_view;
CREATE TYPE test_ext_type;
CREATE EXTENSION test_ext_cor;  -- fail
ERROR:  type test_ext_type is not a member of extension "test_ext_cor"
DETAIL:  An extension is not allowed to replace an object that it does not own.
DROP TYPE test_ext_type;
-- this makes a shell "point <<@@ polygon" operator too
CREATE OPERATOR @@>> ( PROCEDURE = poly_contain_pt,
  LEFTARG = polygon, RIGHTARG = point,
  COMMUTATOR = <<@@ );
CREATE EXTENSION test_ext_cor;  -- fail
ERROR:  operator <<@@(point,polygon) is not a member of extension "test_ext_cor"
DETAIL:  An extension is not allowed to replace an object that it does not own.
DROP OPERATOR <<@@ (point, polygon);
CREATE EXTENSION test_ext_cor;  -- now it should work
+ERROR:  operator 16427 is not a member of extension "test_ext_cor"
+DETAIL:  An extension is not allowed to replace an object that it does not own.
SELECT ext_cor_func();

This made me suspect that there was an issue with 'DROP OPERATOR'.
After a little scavenger hunt, I located a commit which appears to be related, c94959d4110a1965472956cfd631082a96f64a84, and which was made post-9.4.

So then, my question: is the existing behavior that produces "ERROR: operator ... is not a member of extension ..." a sufficient guard against the CVE-2022-2625 vulnerability when it comes to operators? (My thought is that it might be sufficient, and if it is I would need to add something like 'DROP OPERATOR @@>> (point, polygon);' to allow the extension creation to work and the test to complete.)

If the apparently buggy behavior is not a sufficient guard, then is a backport of c94959d4110a1965472956cfd631082a96f64a84 in conjunction with the CVE-2022-2625 fix the correct solution?

Regards,

-Roberto

--
Roberto C. Sánchez
Re: Slow standby snapshot
Oh, it seems like it is not my day :) The image is fixed again.
Re: Slow standby snapshot
Oops, wrong image; this is the correct one. But these are single-run tests, so they show only a basic correlation.
Re: Slow standby snapshot
Hello.

On Wed, Nov 16, 2022 at 3:44 AM Andres Freund wrote:
> Approach 1:
> We could have an atomic variable in ProcArrayStruct that counts the amount of
> wasted effort and have processes update it whenever they've wasted a
> meaningful amount of effort. Something like counting the skipped elements in
> KnownAssignedXidsGetAndSetXmin in a function local static variable and
> updating the shared counter whenever that reaches

I made a WIP patch for that approach and ran some initial tests. It seems to work pretty well; at least it is better than the previous approaches for standbys without a high read-only load. Both the patch and a graph are in the attachments. The strange number is the limit of wasted work required to trigger compression.

I have used the same [1] testing script and configuration as before (two 16-CPU machines, a long transaction on the primary at the 60th second, simple-update and select-only for pgbench).

If such an approach looks committable, I could do more careful performance testing to find the best value for WASTED_SNAPSHOT_WORK_LIMIT_TO_COMPRESS.
[1]: https://gist.github.com/michail-nikolaev/e1dfc70bdd7cfd1b902523dbb3db2f28

--
Michail Nikolaev

Index: src/backend/storage/ipc/procarray.c
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
--- a/src/backend/storage/ipc/procarray.c	(revision fb32748e32e2b6b2fcb32220980b93d5436f855e)
+++ b/src/backend/storage/ipc/procarray.c	(date 1668950653116)
@@ -272,6 +272,7 @@
  */
 static TransactionId *KnownAssignedXids;
 static bool *KnownAssignedXidsValid;
+static pg_atomic_uint32 *KnownAssignedXidsWastedSnapshotWork;
 static TransactionId latestObservedXid = InvalidTransactionId;
 
 /*
@@ -451,6 +452,10 @@
 			ShmemInitStruct("KnownAssignedXidsValid",
 							mul_size(sizeof(bool), TOTAL_MAX_CACHED_SUBXIDS),
 							&found);
+		KnownAssignedXidsWastedSnapshotWork = (pg_atomic_uint32 *)
+			ShmemInitStruct("KnownAssignedXidsWastedSnapshotWork",
+							sizeof(pg_atomic_uint32), &found);
+		pg_atomic_init_u32(KnownAssignedXidsWastedSnapshotWork, 0);
 	}
 }
 
@@ -4616,20 +4621,9 @@
 
 	if (!force)
 	{
-		/*
-		 * If we can choose how much to compress, use a heuristic to avoid
-		 * compressing too often or not often enough.
-		 *
-		 * Heuristic is if we have a large enough current spread and less than
-		 * 50% of the elements are currently in use, then compress. This
-		 * should ensure we compress fairly infrequently. We could compress
-		 * less often though the virtual array would spread out more and
-		 * snapshots would become more expensive.
-		 */
-		int			nelements = head - tail;
-
-		if (nelements < 4 * PROCARRAY_MAXPROCS ||
-			nelements < 2 * pArray->numKnownAssignedXids)
+#define WASTED_SNAPSHOT_WORK_LIMIT_TO_COMPRESS 111
+		if (pg_atomic_read_u32(KnownAssignedXidsWastedSnapshotWork)
+			< WASTED_SNAPSHOT_WORK_LIMIT_TO_COMPRESS)
 			return;
 	}
 
@@ -4650,6 +4644,8 @@
 
 	pArray->tailKnownAssignedXids = 0;
 	pArray->headKnownAssignedXids = compress_index;
+	/* Reset wasted work counter */
+	pg_atomic_write_u32(KnownAssignedXidsWastedSnapshotWork, 0);
 }
 
 /*
@@ -5031,6 +5027,7 @@
 KnownAssignedXidsGetAndSetXmin(TransactionId *xarray, TransactionId *xmin,
 							   TransactionId xmax)
 {
+	ProcArrayStruct *pArray = procArray;
 	int			count = 0;
 	int			head,
 				tail;
@@ -5078,6 +5075,10 @@
 		}
 	}
 
+	/* Add number of invalid items scanned to wasted work counter */
+	pg_atomic_add_fetch_u32(KnownAssignedXidsWastedSnapshotWork,
+							(head - tail) - pArray->numKnownAssignedXids);
+
 	return count;
 }
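The mechanism in the patch above can be modeled compactly with standard C11 atomics in place of PostgreSQL's pg_atomic_uint32 API. This is a simplified standalone sketch, not the procarray.c code: readers accumulate the number of invalid slots they had to skip, and the compressor fires once the accumulated wasted work crosses the threshold, then resets the counter.

```c
#include <stdatomic.h>

/*
 * Model of the wasted-work heuristic: <stdatomic.h> stands in for
 * PostgreSQL's atomics, and the threshold is the placeholder value
 * from the WIP patch.
 */
#define WASTED_SNAPSHOT_WORK_LIMIT_TO_COMPRESS 111

static atomic_uint wasted_snapshot_work;

/* Called by a reader that scanned `scanned` slots, only `nvalid` of them live */
static void
report_wasted_work(unsigned scanned, unsigned nvalid)
{
	atomic_fetch_add(&wasted_snapshot_work, scanned - nvalid);
}

/* Returns 1 if compression should run now; resets the counter when it fires */
static int
maybe_compress(int force)
{
	if (!force &&
		atomic_load(&wasted_snapshot_work) <
		WASTED_SNAPSHOT_WORK_LIMIT_TO_COMPRESS)
		return 0;
	atomic_store(&wasted_snapshot_work, 0);
	return 1;
}
```

Note the trade-off Simon raised: with no read traffic, report_wasted_work() is never called, so only the force path ever compresses.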
Re: How to *really* quit psql?
Hello David,

Question: is there any way to really abort a psql script from an included file? Under what circumstances would it be appropriate for a script to take it on itself to decide that? It has no way of knowing what the next -f option is or what the user intended.

Can we add an exit code argument to the \quit meta-command that could be set to non-zero and, combined with ON_ERROR_STOP, produces the desired effect of aborting everything just like an error under ON_ERROR_STOP does (which is the workaround here I suppose, but an ugly one that involves the server)?

I like the simple idea of adding an optional exit status argument to \quit. I'm unsure whether "ON_ERROR_STOP" should or should not change the behavior, or whether it should just exit(n) with \quit n.

Note that using \quit to abort a psql script is already done when loading extensions, to prevent them from being run directly by psql:

-- from some sql files in "contrib/pg_stat_statements/":
\echo Use "ALTER EXTENSION pg_stat_statements UPDATE TO '1.10'" to load this file. \quit

But the same trick would fail if the guard is reached via an include.

--
Fabien.
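If \quit grew an optional status argument, its handling could be as small as the sketch below. This is purely hypothetical — psql's \quit currently takes no argument, so the function name, the clamping to 0..255 (only the low byte of an exit status is portable), and the choice to treat a malformed argument as status 1 are all assumptions, not psql behavior.

```c
#include <stdlib.h>

/*
 * Hypothetical helper for a "\quit [n]" meta-command: map the optional
 * argument to a process exit status.  No argument keeps the status 0;
 * a malformed or out-of-range argument is reported as failure (1).
 */
static int
quit_exit_status(const char *arg)
{
	long		val;
	char	   *end;

	if (arg == NULL || *arg == '\0')
		return 0;				/* plain \quit keeps status 0 */
	val = strtol(arg, &end, 10);
	if (*end != '\0' || val < 0 || val > 255)
		return 1;				/* treat a malformed argument as failure */
	return (int) val;
}
```

A nonzero status from \quit inside an included file would then propagate straight out of psql, without involving the server at all — which is the appeal over the ON_ERROR_STOP workaround.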