Re: recovery modules
On Tue, Jan 03, 2023 at 11:05:38AM -0800, Nathan Bossart wrote:
> I noticed that cfbot's Windows tests are failing because the backslashes in
> the archive directory path are causing escaping problems. Here is an
> attempt to fix that by converting all backslashes to forward slashes, which
> is what other tests (e.g., 025_stuck_on_old_timeline.pl) do.

+	GetOldestRestartPoint(&restartRedoPtr, &restartTli);
+	XLByteToSeg(restartRedoPtr, restartSegNo, wal_segment_size);
+	XLogFileName(lastRestartPointFname, restartTli, restartSegNo,
+				 wal_segment_size);
+
+	shell_archive_cleanup(lastRestartPointFname);

Hmm. Is passing down the file name used as a cutoff point the best interface for the modules? Perhaps passing down the redo LSN and its TLI would be a cleaner approach in terms of flexibility? I agree with letting the startup process enforce these numbers, as they are easy to mess up for plugin authors, leading to critical problems.

The same code pattern is repeated twice -- for the end-of-recovery callback and for the cleanup commands -- when it comes to building the file name. Not critical, still not really nice.

 MODULES = basic_archive
-PGFILEDESC = "basic_archive - basic archive module"
+PGFILEDESC = "basic_archive - basic archive and recovery module"

"basic_archive" does not reflect what this module does anymore. Using one library simplifies the whole configuration picture and the tests, so perhaps something like basic_wal_module would be better long-term?
--
Michael
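For reference, the backslash normalization that the test fix performs is trivial; below is a standalone C sketch of the same transformation the Perl tests do with `s!\\!/!g` (a hypothetical helper for illustration, not code from the patch):

```c
#include <assert.h>
#include <string.h>

/*
 * Convert Windows-style backslashes to forward slashes in place, as the
 * TAP tests do for archive directory paths to avoid escaping problems.
 */
static void
normalize_path_separators(char *path)
{
	char	   *p;

	for (p = path; *p != '\0'; p++)
	{
		if (*p == '\\')
			*p = '/';
	}
}
```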
RE: [Proposal] Add foreign-server health checks infrastructure
Dear Ted,

Thank you for reviewing! PSA new version.

> + /* quick exit if connection cache has been not initialized yet. */
>
> been not initialized -> not been initialized

Fixed.

> + (errcode(ERRCODE_CONNECTION_FAILURE),
> +  errmsg("could not connect to server \"%s\"",
>
> Currently each server which is not connected would log a warning.
> Is it better to concatenate names for such servers and log one line? This
> would be cleaner when there are multiple such servers.

Sounds good, fixed as you said. The following shows the case where two disconnections are detected by postgres_fdw_verify_connection_states_all():

```
postgres=*# select postgres_fdw_verify_connection_states_all();
WARNING:  could not connect to server "my_external_server2", "my_external_server"
DETAIL:  Socket close is detected.
HINT:  Please check the health of server.
 postgres_fdw_verify_connection_states_all
--------------------------------------------
 f
(1 row)
```

Currently, the server names are concatenated without any uniqueness check. IIUC a backend process cannot connect to the same foreign server using different user mappings, so there is no possibility that the same name appears twice. If the user mapping is altered in the transaction, the cache entry is invalidated and will not be checked.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

Attachments:
- v24-0001-Add-PQConncheck-and-PQCanConncheck-to-libpq.patch
- v24-0002-postgres_fdw-add-postgres_fdw_verify_connection_.patch
- v24-0003-add-test.patch
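The concatenation approach could look roughly like the following C sketch (illustrative only; the real patch would use StringInfo and the fdw's connection hash, and the fixed-size buffer here is purely for demonstration):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Append a quoted server name to a comma-separated list, so that a single
 * WARNING can report every server whose connection check failed.
 */
static void
append_failed_server(char *buf, size_t bufsize, const char *name)
{
	size_t		used = strlen(buf);

	snprintf(buf + used, bufsize - used, "%s\"%s\"",
			 used > 0 ? ", " : "", name);
}
```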
Re: Minimal logical decoding on standbys
On Tue, Jan 10, 2023 at 2:03 PM Drouvot, Bertrand wrote:
>
> Please find attached, V37 taking care of:

Thanks. I started to digest the design specified in the commit message and these patches. Here are some quick comments:

1. Does logical decoding on standby work without any issues if the standby is set up for cascading replication?

2. Do logical decoding output plugins work without any issues on the standby with decoding enabled, say, when the slot is invalidated?

3. Is this feature still a 'minimal logical decoding on standby'? Firstly, why is it 'minimal'?

4. What happens in case of failover to a standby that's already decoding for its clients? Will the decoding work seamlessly? If not, what are the recommended things that users need to take care of during/after failovers?

0002:

1.
-	if (InvalidateObsoleteReplicationSlots(_logSegNo))
+	InvalidateObsoleteOrConflictingLogicalReplicationSlots(_logSegNo, , InvalidOid, NULL);

Isn't the function name too long and verbose? How about just InvalidateLogicalReplicationSlots(), and let the function comment talk about what sorts of replication slots it invalidates?

2.
+	errdetail("Logical decoding on standby requires wal_level to be at least logical on master"));
+ * master wal_level is set back to replica, so existing logical slots need to invalidate such slots. Also do the same thing if wal_level on master

Can we use 'primary server' instead of 'master', like elsewhere? This comment also applies to the other patches, if any.

3. Can we show a new status in pg_get_replication_slots's wal_status for slots invalidated due to the conflict, so that the user can monitor for the new status and take necessary actions?

4. How will the user be notified when logical replication slots are invalidated due to conflict with the primary server? How can they (firstly, is there a way?) repair/restore such replication slots? Or is recreating/reestablishing logical replication the only way out for them?
What can users do to avoid their logical replication slots getting invalidated and running into these situations? Recreating/reestablishing logical replication comes with a cost (sometimes a huge one), as it involves building another instance, syncing tables, etc. Isn't it a good idea to touch on all these aspects in the documentation, at least as to why we're doing this and why we can't do it another way?

5.
@@ -1253,6 +1253,14 @@ StartLogicalReplication(StartReplicationCmd *cmd)

     ReplicationSlotAcquire(cmd->slotname, true);

+    if (!TransactionIdIsValid(MyReplicationSlot->data.xmin)
+        && !TransactionIdIsValid(MyReplicationSlot->data.catalog_xmin))
+        ereport(ERROR,
+                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+                 errmsg("cannot read from logical replication slot \"%s\"",
+                        cmd->slotname),
+                 errdetail("This slot has been invalidated because it was conflicting with recovery.")));

Having the invalidation check in StartLogicalReplication() looks fine; however, what happens if the slot gets invalidated while replication is in progress? Do we need to error out in WalSndLoop() or XLogSendLogical() too? Or is that already taken care of somewhere?

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Re: SQL/JSON revisited
On Wed, Jan 11, 2023 at 2:02 PM Elena Indrupskaya <e.indrupsk...@postgrespro.ru> wrote:
>
> Sorry for upsetting your bot. :(

What I do in these cases is save the incremental patch as a .txt file -- that way people can read it, but the cfbot doesn't try to launch a CI run. And if I forget that detail, well, it's not a big deal; it happens sometimes.

--
John Naylor
EDB: http://www.enterprisedb.com
Re: cutting down the TODO list thread
So, I had intended to spend some time on this at least three times a year. I've clearly failed at that, but now is as good a time as any to pick it back up again.

Over in [1], Tom opined:

> John Naylor writes:
>
>> "WARNING for Developers: Unfortunately this list does not contain all the
>> information necessary for someone to start coding a feature. Some of these
>> items might have become unnecessary since they were added --- others might
>> be desirable but the implementation might be unclear. When selecting items
>> listed below, be prepared to first discuss the value of the feature. Do not
>> assume that you can select one, code it and then expect it to be committed."
>
> I think we could make that even stronger: there's basically nothing on
> the TODO list that isn't problematic in some way. Otherwise it would
> have been done already. The entries involve large amounts of work,
> or things that are subtler than they might appear, or cases where the
> desirable semantics aren't clear, or tradeoffs that there's not
> consensus about, or combinations of those.
>
> IME it's typically a lot more productive to approach things via
> "scratch your own itch". If a problem is biting you directly, then
> at least you have some clear idea of what it is that needs to be fixed.
> You might have to work up to an understanding of how to fix it, but
> you have a clear goal.

I've come up with some revised language, including s/15/16/ and removing the category of "[E]" (easier to implement), since an item wouldn't be here if it were actually easy:

--
WARNING for Developers: This list contains some known PostgreSQL bugs, some feature requests, and some things we are not even sure we want. This is not meant to be a resource for beginning developers to get ideas for things to work on. All of these items are hard, and some are perhaps impossible. Some of these items might have become unnecessary since they were added.
Others might be desirable but:
- a large amount of work is required
- the problems are subtler than they might appear
- the desirable semantics aren't clear
- there are tradeoffs that there's not consensus about
- some combination of the above

If you really need a feature that is listed below, it will be worth reading the linked email thread if there is one, since it will often show the difficulties, or perhaps contain previous failed attempts to get a patch committed. If after that you still want to work on it, be prepared to first discuss the value of the feature. Do not assume that you can start coding and expect it to be committed. Always discuss design on the Hackers list before starting to code.

Over time, it may become clear that a TODO item has become outdated or has otherwise been determined to be either too controversial or not worth the development effort. Such items should be retired to the Not Worth Doing page.

[D] marks changes that are done, and will appear in the PostgreSQL 16 release.
--

We could also revise the developer FAQ:

- remove the phrase "Outstanding features are detailed in Todo."
- add a suggestion to check the Todo or Not_worth_doing pages to see if the desired feature is undesirable or problematic
- rephrase "Working in isolation is not advisable because others might be working on the same TODO item, or you might have misunderstood the TODO item." so it doesn't mention 'TODO' at all.

[1] https://www.postgresql.org/message-id/415636.1673411259%40sss.pgh.pa.us

--
John Naylor
EDB: http://www.enterprisedb.com
Re: SQL/JSON revisited
Tags in the patch follow the markup of the XMLTABLE function:

    XMLTABLE ( XMLNAMESPACES ( namespace_uri AS namespace_name , ... ),
               row_expression PASSING BY {REF|VALUE} document_expression BY {REF|VALUE}
               COLUMNS name { type PATH column_expression DEFAULT default_expression
                              NOT NULL | NULL | FOR ORDINALITY } , ... )
    setof record

In the above, as well as in the signatures of SQL/JSON functions, there are no exact parameter names; otherwise, they should have been followed by the tag, which is not the case. There are no parameter names in the functions' code either. Therefore, the tags chosen seem more appropriate, according to the comment to commit 47046763c3.

Sorry for upsetting your bot. :(

--
Elena Indrupskaya
Lead Technical Writer
Postgres Professional
http://www.postgrespro.com

On 2023-01-10 Tu 07:51, Elena Indrupskaya wrote:
> Hi,
>
> The Postgres Pro documentation team prepared another SQL/JSON documentation
> patch (attached), to apply on top of v1-0009-Documentation-for-SQL-JSON-features.patch.
> The new patch:
> - Fixes minor typos
> - Does some rewording agreed with Nikita Glukhov
> - Updates Docbook markup to make tags consistent across SQL/JSON documentation
>   and across func.sgml, and in particular, consistent with the XMLTABLE function,
>   which resembles SQL/JSON functions pretty much.

That's nice, but please don't post incremental patches like this. It upsets the cfbot. (I wish there were a way to tell the cfbot to ignore patches.)

Also, I'm fairly certain that a good many of your changes are not according to project style. The rule as I understand it is that one tag is used for things that are parameters and the other is only used for things that are not parameters. (I'm not sure where that's documented other than the comment on commit 47046763c3, but it's what I attempted to do with the earlier doc tidy-up.)

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Re: mprove tab completion for ALTER EXTENSION ADD/DROP
On Wed, Jan 11, 2023 at 12:10:33PM +0900, Kyotaro Horiguchi wrote:
> It suggests the *kinds* of objects that are part of the extension, but
> lists the objects of that kind regardless of dependency. I read
> Michael suggested (and I agree) to restrict the objects (not kinds) to
> actually be a part of the extension. (And not for object kinds.)

Yeah, that's what I meant. Now, if Vignesh does not want to extend that, that's fine for me as well on second thought, as this involves much more code for each DROP path depending on the object type involved. Adding the object names after DROP/ADD is useful on its own, and we already have some completion once the object type is specified, so simpler is perhaps just better here.
--
Michael
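For what it's worth, restricting the completion to objects that actually belong to the extension would boil down to filtering through pg_depend on extension membership (deptype = 'e'). A sketch of what such a tab-complete.c query string might look like for functions (illustrative only -- this is not taken from any posted patch):

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical completion query: list only functions that are members of
 * the given extension, via pg_depend's deptype = 'e' (extension membership).
 */
static const char *const Query_for_extension_functions =
	"SELECT p.proname "
	"  FROM pg_catalog.pg_proc p "
	"  JOIN pg_catalog.pg_depend d ON d.classid = p.tableoid AND d.objid = p.oid "
	" WHERE d.refclassid = 'pg_catalog.pg_extension'::pg_catalog.regclass "
	"   AND d.refobjid = (SELECT oid FROM pg_catalog.pg_extension "
	"                      WHERE extname = '%s') "
	"   AND d.deptype = 'e'";
```

A similar query would be needed per object type, which is the "much more code for each DROP path" concern above.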
Re: Avoiding "wrong tuple length" errors at the end of VACUUM on pg_database update (Backpatch of 947789f to v12 and v13)
On Tue, Jan 10, 2023 at 11:05:04AM -0800, Andres Freund wrote:
> What about a define that forces external toasting very aggressively for
> catalog tables, iff they have a toast table? I suspect doing so for
> non-catalog tables as well would trigger test changes. Running a buildfarm
> animal with that would at least make issues like this much easier to discover.

Hmm. That could work. I guess that you mean to do something like that in SearchSysCacheCopy() when we build the tuple copy. We have access to the cacheId there, meaning that we know the catalog involved. Still, we would need a lookup at its pg_class entry to check for a reltoastrelid, meaning an extra relation opened, which would be fine under a specific #define, anyway..
--
Michael
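The decision logic being floated could be sketched as a standalone predicate like the one below. This is purely illustrative: the knob name is hypothetical, and the real change would live in the TOAST insert/update path rather than in a helper like this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical compile-time knob, as floated in the thread. */
#define FORCE_CATALOG_TOAST_EXTERNAL 1

/*
 * Sketch of the decision: with the knob enabled, any catalog tuple whose
 * relation has a TOAST table gets its varlena attributes toasted
 * externally regardless of the usual size threshold; otherwise the normal
 * size-based rule applies.
 */
static bool
should_toast_externally(bool is_catalog, bool has_toast_table,
						uint32_t attr_size, uint32_t threshold)
{
#if FORCE_CATALOG_TOAST_EXTERNAL
	if (is_catalog && has_toast_table)
		return true;
#endif
	return attr_size > threshold;
}
```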
Re: Fix pg_publication_tables to exclude generated columns
On Wed, Jan 11, 2023 at 10:07 AM Tom Lane wrote:
>
> Amit Kapila writes:
> >> On Mon, Jan 9, 2023 11:06 PM Tom Lane wrote:
> >>> We could just not fix it in the back branches. I'd argue that this is
> >>> as much a definition change as a bug fix, so it doesn't really feel
> >>> like something to back-patch anyway.
>
> > So, if we don't backpatch then it could lead to an error when it
> > shouldn't have, which is clearly a bug. I think we should backpatch
> > this unless Tom or others are against it.
>
> This isn't a hill that I'm ready to die on ... but do we have any field
> complaints about this? If not, I still lean against a back-patch.
> I think there's a significant risk of breaking case A while fixing
> case B when we change this behavior, and that's something that's
> better done only in a major release.

Fair enough, but note that there is a somewhat related problem for dropped columns [1] as well. While reviewing that, it occurred to me that generated columns have a similar problem, which leads to this thread (it would have been better if there had been a mention of that in the initial email). Now, as the symptoms are similar, I think we shouldn't back-patch that one either; otherwise, it will appear to be partially fixed. What do you think?

[1] - https://www.postgresql.org/message-id/OSZPR01MB631087C65BA81E1FEE5A60D2FDF59%40OSZPR01MB6310.jpnprd01.prod.outlook.com

--
With Regards,
Amit Kapila.
Re: [PATCH] Improve ability to display optimizer analysis using OPTIMIZER_DEBUG
On 11/01/23 09:57, Tom Lane wrote:
> IME it's typically a lot more productive to approach things via
> "scratch your own itch". If a problem is biting you directly, then
> at least you have some clear idea of what it is that needs to be fixed.
> You might have to work up to an understanding of how to fix it, but
> you have a clear goal.

The question is, how should newcomers start contributing if they are not coming with a problem in hand? The TODO list is possibly the first thing anyone who is willing to contribute is going to read, and for a new contributor it is not easy to judge the situation (whether a TODO item is easy for newcomers or fairly involved). One way to mitigate this issue is to mention the mailing list thread with the relevant discussion under each TODO item. This is done for most TODO items, but sometimes pressing issues are left out.

That being said, I think this is part of the learning process, and it is okay to come up with ideas and fail. Pghackers can possibly bring up issues in the proposed approach (if the discussion for the issue is not mentioned under the TODO item).

--
Regards,
Ankit Kumar Pandey
Re: Avoiding "wrong tuple length" errors at the end of VACUUM on pg_database update (Backpatch of 947789f to v12 and v13)
On Tue, Jan 10, 2023 at 09:54:31AM -0800, Nathan Bossart wrote:
> +1

Okay, thanks. Done this part as of c0ee694 and 72b6098, then.
--
Michael
Re: typos
On Tue, Jan 10, 2023 at 01:55:56PM +0530, Amit Kapila wrote:
> I have not yet started, so please go ahead.

Okay, I have looked at that and fixed all the new things, including the typo you had introduced. 0001~0004 have been left out, for the same reasons as upthread.
--
Michael
Re: Todo: Teach planner to evaluate multiple windows in the optimal order
On 11/01/23 06:18, David Rowley wrote:
> Not sure if we should be trying to improve that in this patch. I just
> wanted to identify it as something else that perhaps could be done.

This could be within reach, but the original problem remains: hashagg removes any gains from this. E.g.:

```
set enable_hashagg=0;
explain select distinct relkind, relname, count(*) over (partition by relkind) from pg_Class;
                                       QUERY PLAN
-----------------------------------------------------------------------------------------
 Unique  (cost=41.26..65.32 rows=412 width=73)
   ->  Incremental Sort  (cost=41.26..62.23 rows=412 width=73)
         Sort Key: relkind, relname, (count(*) OVER (?))
         Presorted Key: relkind
         ->  WindowAgg  (cost=36.01..43.22 rows=412 width=73)
               ->  Sort  (cost=36.01..37.04 rows=412 width=65)
                     Sort Key: relkind
                     ->  Seq Scan on pg_class  (cost=0.00..18.12 rows=412 width=65)
(8 rows)

reset enable_hashagg;
explain select distinct relkind, relname, count(*) over (partition by relkind) from pg_Class;
                                    QUERY PLAN
----------------------------------------------------------------------------------
 HashAggregate  (cost=46.31..50.43 rows=412 width=73)
   Group Key: relkind, relname, count(*) OVER (?)
   ->  WindowAgg  (cost=36.01..43.22 rows=412 width=73)
         ->  Sort  (cost=36.01..37.04 rows=412 width=65)
               Sort Key: relkind
               ->  Seq Scan on pg_class  (cost=0.00..18.12 rows=412 width=65)
(6 rows)
```

HashAgg has a better cost than Unique even with the incremental sort (I tried another case where we have more columns pushed down, but HashAgg still wins).
```
explain select distinct a, b, count(*) over (partition by a order by b) from abcd;
                                       QUERY PLAN
-----------------------------------------------------------------------------------------
 Unique  (cost=345712.12..400370.25 rows=1595 width=16)
   ->  Incremental Sort  (cost=345712.12..395456.14 rows=655214 width=16)
         Sort Key: a, b, (count(*) OVER (?))
         Presorted Key: a, b
         ->  WindowAgg  (cost=345686.08..358790.36 rows=655214 width=16)
               ->  Sort  (cost=345686.08..347324.11 rows=655214 width=8)
                     Sort Key: a, b
                     ->  Seq Scan on abcd  (cost=0.00..273427.14 rows=655214 width=8)

explain select distinct a, b, count(*) over (partition by a order by b) from abcd;
                                    QUERY PLAN
----------------------------------------------------------------------------------
 HashAggregate  (cost=363704.46..363720.41 rows=1595 width=16)
   Group Key: a, b, count(*) OVER (?)
   ->  WindowAgg  (cost=345686.08..358790.36 rows=655214 width=16)
         ->  Sort  (cost=345686.08..347324.11 rows=655214 width=8)
               Sort Key: a, b
               ->  Seq Scan on abcd  (cost=0.00..273427.14 rows=655214 width=8)
(6 rows)
```

> I'm not really all that sure the above query shape makes much sense in
> the real world. Would anyone ever want to use DISTINCT on some results
> containing WindowFuncs?

This could still have been good to have if there were no negative impact and some benefit in a few cases, but as mentioned before, if hashagg removes any sort (which happened due to the pushdown), all gains are lost and we will probably be worse off than before.

--
Regards,
Ankit Kumar Pandey
Re: split TOAST support out of postgres.h
On Tue, Jan 10, 2023 at 12:00:46PM -0500, Robert Haas wrote:
> On Tue, Jan 10, 2023 at 9:46 AM Tom Lane wrote:
> > Now, there is a fair question whether splitting this code out of
> > postgres.h is worth any trouble at all. TBH my initial reaction
> > had been "no". But once we found that only 40-ish backend files
> > need to read this new header, I became a "yes" vote because it
> > seems clear that there will be a total-compilation-time benefit.

A time claim with no benchmarks is a red flag. I've chosen to run one:

```
export CCACHE_DISABLE=1
change=d952373a987bad331c0e499463159dd142ced1ef
for commit in $change $change^; do
  echo ===
  git checkout $commit
  for n in `seq 1 200`; do make -j20 clean; env time make -j20; done
done
```

Results:

```
commit                                      median  mean   count
d952373a987bad331c0e499463159dd142ced1ef    49.35   49.37  200
d952373a987bad331c0e499463159dd142ced1ef^   49.33   49.36  200
```

That is to say, the patch made the build a bit slower, not faster. That's with GCC 4.8.5 (RHEL 7). I likely should have interleaved the run types, but in any case the speed win didn't show up.

> I wasn't totally sure about this, either, but I think on balance it's
> probably a smart thing to do. I always found it kind of weird that we
> put that stuff in postgres.h. It seems to privilege the TOAST
> mechanism to an undue degree; what's the argument, for example, that
> TOAST macros are more generally relevant than CHECK_FOR_INTERRUPTS()
> or LWLockAcquire or HeapTuple? It felt to me like we'd just decided
> that one subsystem gets to go into the main header file and everybody
> else just had to have their own headers.
>
> One thing that's particularly awkward about that is that if you want
> to write some front-end code that knows about how varlenas are stored
> on disk, it was very awkward with the old structure. You're not
> supposed to include "postgres.h" into frontend code, but if the stuff
> you need is defined there then what else can you do? So I generally
> think that anything that's in postgres.h should have a strong claim to
> be not only widely-needed in the backend, but also never needed at all
> in any other executable.

If the patch had just made postgres.h include varatt.h, like it does elog.h, I'd consider that change a nonnegative. Grouping things is nice, even if it makes compilation a bit slower. That also covers your frontend use case. How about that?

I agree fixing any one extension is easy enough. Thinking back to the htup_details.h refactor, I found the aggregate pain unreasonable relative to the alleged benefits, even though each individual extension wasn't too bad. SET_VARSIZE is used more (74 PGXN distributions) than htup_details.h (62).
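To illustrate why frontend code cares about this header at all: the varlena macros decode an on-disk length header. Below is a simplified stand-in for that logic (my own sketch -- the real macros in varatt.h, such as VARSIZE and SET_VARSIZE, also handle 1-byte short headers, compression bits, and big-endian layouts). It assumes the little-endian 4-byte-header case, where the total length lives in the upper 30 bits of the first word:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Store a varlena total length into a 4-byte header (simplified). */
static void
my_set_varsize(unsigned char *datum, uint32_t total_len)
{
	uint32_t	header = total_len << 2;	/* low 2 bits are flag bits */

	memcpy(datum, &header, sizeof(header));
}

/* Read the total length back out of the 4-byte header (simplified). */
static uint32_t
my_varsize(const unsigned char *datum)
{
	uint32_t	header;

	memcpy(&header, datum, sizeof(header));
	return (header >> 2) & 0x3FFFFFFF;
}
```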
Re: [DOCS] Stats views and functions not in order?
On Wed, Jan 4, 2023 at 6:08 PM vignesh C wrote:
>
> On Mon, 2 Jan 2023 at 13:47, Peter Eisentraut wrote:
> >
> > On 08.12.22 03:30, Peter Smith wrote:
> > > PSA patches for v9*
> > >
> > > v9-0001 - Now the table rows are ordered per PeterE's suggestions [1]
> >
> > committed

Thanks for pushing.

> > > v9-0002 - All the review comments from DavidJ [2] are addressed
> >
> > I'm not sure about this one. It removes the "see [link] for details"
> > phrases and instead makes the view name a link. I think this loses the
> > cue that there is more information elsewhere. Otherwise, one could
> > think that, say, the entry about pg_stat_activity is the primary source
> > and the link just links to itself. Also keep in mind that people use
> > media where links are not that apparent (PDF), so the presence of a link
> > by itself cannot be the only cue about the flow of the information.

PSA new patch for v10-0001:
- v9-0001 --> pushed, thanks!
- v9-0002 --> I removed this based on the rejection reason above
- v9-0003 --> v10-0001

> I'm not sure if anything is pending for v9-0003; if there is something
> pending, please post an updated patch for the same.

Thanks for the reminder. PSA v10.

--
Kind Regards,
Peter Smith.
Fujitsu Australia

Attachment: v10-0001-Add-Statistics-Views-section-and-refentry-for-ea.patch
low wal_retrieve_retry_interval causes missed signals on Windows
I discussed this elsewhere [0], but I thought it deserved its own thread.

After setting wal_retrieve_retry_interval to 1ms in the tests, I noticed that some of the archiving tests began consistently failing on Windows. I believe the problem is that WaitForWALToBecomeAvailable() depends on the call to WaitLatch() for wal_retrieve_retry_interval to ensure that signals are dispatched (i.e., pgwin32_dispatch_queued_signals()). With a low retry interval, WaitForWALToBecomeAvailable() might skip the call to WaitLatch(), and the signals are never processed. The attached patch fixes this by always calling WaitLatch(), even if wal_retrieve_retry_interval milliseconds have already elapsed, in which case the timeout is 0.

[0] https://postgr.es/m/20221231235019.GA1223171%40nathanxps13

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

>From d08f01156ce1ce1290bad440da97852c2aa6c24b Mon Sep 17 00:00:00 2001
From: Nathan Bossart
Date: Sat, 31 Dec 2022 15:16:54 -0800
Subject: [PATCH v1 1/1] ensure signals are dispatched in startup process on
 Windows

---
 src/backend/access/transam/xlogrecovery.c | 36 ++++++++++++++---------
 1 file changed, 22 insertions(+), 14 deletions(-)

diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index bc3c3eb3e7..b1a4c4ab6d 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -3466,6 +3466,8 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 		 */
 		if (lastSourceFailed)
 		{
+			long		wait_time;
+
 			/*
 			 * Don't allow any retry loops to occur during nonblocking
 			 * readahead.  Let the caller process everything that has been
@@ -3556,33 +3558,39 @@
 			 * and retry from the archive, but if it hasn't been long
 			 * since last attempt, sleep wal_retrieve_retry_interval
 			 * milliseconds to avoid busy-waiting.
+			 *
+			 * NB: Even if it's already been wal_retrieve_retry_interval
+			 * milliseconds since the last attempt, we still call
+			 * WaitLatch() with the timeout set to 0 to make sure any
+			 * queued signals are dispatched on Windows builds.
 			 */
 			now = GetCurrentTimestamp();
 			if (!TimestampDifferenceExceeds(last_fail_time, now,
 											wal_retrieve_retry_interval))
 			{
-				long		wait_time;
-
 				wait_time = wal_retrieve_retry_interval -
 					TimestampDifferenceMilliseconds(last_fail_time, now);
 
 				elog(LOG, "waiting for WAL to become available at %X/%X",
 					 LSN_FORMAT_ARGS(RecPtr));
+			}
+			else
+				wait_time = 0;
 
-				/* Do background tasks that might benefit us later. */
-				KnownAssignedTransactionIdsIdleMaintenance();
+			/* Do background tasks that might benefit us later. */
+			KnownAssignedTransactionIdsIdleMaintenance();
 
-				(void) WaitLatch(&XLogRecoveryCtl->recoveryWakeupLatch,
-								 WL_LATCH_SET | WL_TIMEOUT |
-								 WL_EXIT_ON_PM_DEATH,
-								 wait_time,
-								 WAIT_EVENT_RECOVERY_RETRIEVE_RETRY_INTERVAL);
-				ResetLatch(&XLogRecoveryCtl->recoveryWakeupLatch);
-				now = GetCurrentTimestamp();
+			(void) WaitLatch(&XLogRecoveryCtl->recoveryWakeupLatch,
+							 WL_LATCH_SET | WL_TIMEOUT |
+							 WL_EXIT_ON_PM_DEATH,
+							 wait_time,
+							 WAIT_EVENT_RECOVERY_RETRIEVE_RETRY_INTERVAL);
+			ResetLatch(&XLogRecoveryCtl->recoveryWakeupLatch);
+			now = GetCurrentTimestamp();
+
+			/* Handle interrupt signals of startup process */
+			HandleStartupProcInterrupts();
 
-				/* Handle interrupt signals of startup process */
-				HandleStartupProcInterrupts();
-			}
 			last_fail_time = now;
 			currentSource = XLOG_FROM_ARCHIVE;
 			break;
-- 
2.25.1
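The core of the fix is the timeout computation: compute how much of wal_retrieve_retry_interval remains and clamp to zero, so WaitLatch() is always called -- on Windows, WaitLatch() is what ends up dispatching any queued signals, even with a zero timeout. A minimal sketch of that arithmetic (hypothetical helper name, for illustration):

```c
#include <assert.h>

/*
 * Remaining wait before the next WAL retrieval retry, clamped to zero so
 * the caller always reaches WaitLatch() even when the interval has
 * already elapsed.
 */
static long
remaining_retry_wait(long retry_interval_ms, long elapsed_ms)
{
	long		wait_time = retry_interval_ms - elapsed_ms;

	return (wait_time > 0) ? wait_time : 0;
}
```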
Re: Perform streaming logical transactions by background workers and parallel apply
On Wed, Jan 11, 2023 at 9:34 AM houzj.f...@fujitsu.com wrote:
>
> > I was looking into 0001, IMHO the pid should continue to represent the main
> > apply worker. So the pid will always show the main apply worker which is
> > actually receiving all the changes for the subscription (in short, working
> > as the logical receiver), and if it is applying changes through a parallel
> > worker then it should put the parallel worker pid in a new column called
> > 'parallel_worker_pid' or 'parallel_apply_worker_pid', otherwise NULL.
> > Thoughts?
>
> Thanks for the comment.
>
> IIRC, you mean something like the following, right?
> (sorry if I misunderstood)
> --
> For a parallel apply worker:
> the 'pid' column shows the pid of the leader; the new column
> parallel_worker_pid shows its own pid
>
> For the leader apply worker:
> the 'pid' column shows its own pid; the new column parallel_worker_pid shows 0
> --
>
> If so, I am not sure if the above is better, because it changes the
> existing 'pid' column's meaning: the 'pid' would no longer represent the pid
> of the worker itself. Besides, it seems not consistent with what we have for
> parallel query workers in pg_stat_activity. What do you think?

Actually, I always imagined the pid as the process id of the worker that is actually receiving the changes for the subscriber, so keeping the pid to represent the leader makes more sense to me. But as you said, parallel query workers already follow the terminology you have in your patch -- showing the pid as the pid of the worker itself -- so I am fine with the way you have it.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: MultiXact\SLRU buffers configuration
On Mon, Jan 9, 2023 at 9:49 AM Andrey Borodin wrote:
>
> On Tue, Jan 3, 2023 at 5:02 AM vignesh C wrote:
> > does not apply on top of HEAD as in [1], please post a rebased patch:
>
> Thanks! Here's the rebase.

I was looking into this patch. It seems like three different optimizations are squeezed into a single patch:
1) dividing the buffer space into banks to reduce the sequential search cost
2) a GUC parameter for the buffer size scale
3) some of the buffer size values are modified compared to what is on HEAD

I think these are three patches which should be independently committable.

While looking into the first idea of dividing the buffer space into banks, I see that it will speed up finding buffers, but OTOH it can actually hurt performance when searching for a victim buffer, if the frequently accessed SLRU pages are not evenly distributed across the banks. Imagine cases where some banks have a lot of empty slots while we frequently have to evict pages out of other banks in order to bring new pages in.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
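The banked-lookup idea, and the eviction concern raised above, can be sketched as follows (my own illustration, not code from the patch; BANK_SIZE is a hypothetical value). Each SLRU page number maps to one fixed-size bank, so a lookup scans BANK_SIZE slots instead of the whole buffer array -- but victim selection is then confined to that same bank, so a hot bank may evict pages while other banks sit idle:

```c
#include <assert.h>
#include <stdint.h>

#define BANK_SIZE 8				/* buffers per bank; hypothetical value */

/*
 * Map an SLRU page number to the first slot of its bank.  Both lookup and
 * victim search are then limited to slots [start, start + BANK_SIZE).
 */
static int
slru_bank_start(int64_t pageno, int num_slots)
{
	int			num_banks = num_slots / BANK_SIZE;

	return (int) (pageno % num_banks) * BANK_SIZE;
}
```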
Re: delay starting WAL receiver
On Wed, Jan 11, 2023 at 05:20:38PM +1300, Thomas Munro wrote:
> Is the problem here that SIGCHLD is processed ...
>
>     PG_SETMASK(&UnBlockSig);   <--- here?
>
>     selres = select(nSockets, &rmask, NULL, NULL, &timeout);
>
> Meanwhile the SIGCHLD handler code says:
>
>     * Was it the wal receiver?  If exit status is zero (normal) or one
>     * (FATAL exit), we assume everything is all right just like normal
>     * backends.  (If we need a new wal receiver, we'll start one at the
>     * next iteration of the postmaster's main loop.)
>
> ... which is true, but that won't be reached for a while in this case
> if the timeout has already been set to 60s. Your patch makes that
> 100ms in that case, a time delay that by now attracts my attention
> like a red rag to a bull (I don't know why you didn't make it 0).

I think this is right. At the very least, it seems consistent with my observations.

> I'm not sure, but if I got that right, then I think the whole problem
> might automatically go away with CF #4032. The SIGCHLD processing
> code will run not when signals are unblocked before select() (that is
> gone), but instead *after* the event loop wakes up with WL_LATCH_SET,
> and runs the handler code in the regular user context before dropping
> through to the rest of the main loop.

Yeah, with those patches, the problem goes away. IIUC the key part is that the postmaster's latch gets set when SIGCHLD is received, so even if SIGUSR1 and SIGCHLD are processed out of order, WalReceiverPID gets cleared and we try to restart it shortly thereafter. I find this much easier to reason about.

I'll go ahead and withdraw this patch from the commitfest. Thanks for chiming in.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Re: Spinlock is missing when updating two_phase of ReplicationSlot
On Wed, Jan 11, 2023 at 11:07:05AM +0900, Masahiko Sawada wrote:
> I think we should acquire the spinlock when updating fields of the
> replication slot even by its owner. Otherwise readers could see
> inconsistent results. Looking at another place where we update
> two_phase_at, we acquire the spinlock:
>
>     SpinLockAcquire(&slot->mutex);
>     slot->data.confirmed_flush = ctx->reader->EndRecPtr;
>     if (slot->data.two_phase)
>         slot->data.two_phase_at = ctx->reader->EndRecPtr;
>     SpinLockRelease(&slot->mutex);
>
> It seems to me an oversight of commit a8fd13cab0b. I've attached the
> small patch to fix it.

Looks right to me; the paths updating the data related to the slots are
careful about that, even when it comes to fetching a slot from
MyReplicationSlot.  I have been looking around the slot code to see if
there are other inconsistencies, and did not notice anything standing
out.  Will fix..
--
Michael

signature.asc
Description: PGP signature
Re: [PATCH] Support using "all" for the db user in pg_ident.conf
On Wed, Dec 28, 2022 at 09:11:05AM +0000, Jelte Fennema wrote:
> That's totally fair, I attached a new iteration of this patchset where this
> refactoring and the new functionality are split up in two patches.

The confusion that 0001 is addressing is fair (cough, fc579e1, cough),
still I am wondering whether we could do a bit better to be more
consistent with the lines of the ident file, as of:
- renaming "pg_role" to "pg_user", as we use pg-username or
database-username when referring to it in the docs or the conf sample
file.
- renaming "systemuser" to "system_user_token" to outline that this is
not a simple string but an AuthToken with potentially a regexp?
- changing the order of the elements in IdentLine to map with the
actual ident lines: usermap, systemuser and pg_user.

> I'm not sure if I'm understanding your concern correctly, but
> the system username will still be compared case sensitively if requested.
> The only thing this changes is that: before comparing the pg_role
> column to the requested pg_role case sensitively, it now checks if
> the value of the pg_role column is lowercase "all". If that's the case,
> then the pg_role column is not compared to the requested
> pg_role anymore, and instead access is granted.

Yeah, still my spider sense reacts on this proposal, and I think that
I know why..  How is your proposal different from the following entry
in pg_ident.conf?  As of:

mapname  /^(.*)$  \1

This would allow everything, and use for the PG user the same user as
the one provided by the client.  That's what your proposal with "all"
does, no?  The heart of the problem is that you still require PG roles
that have a 1:1 mapping with the user names provided by the clients.
--
Michael

signature.asc
Description: PGP signature
Re: Add proper planner support for ORDER BY / DISTINCT aggregates
On Wed, 11 Jan 2023 at 17:32, Tom Lane wrote:
> David Rowley writes:
> > I think whatever the fix is here, we should likely ensure that the
> > results are consistent regardless of which Aggrefs are the presorted
> > ones.  Perhaps the easiest way to do that, and to ensure that the
> > volatile functions are called the same number of times, would just be
> > to never choose Aggrefs with volatile functions when doing
> > make_pathkeys_for_groupagg().
>
> There's existing logic in equivclass.c and other places that tries
> to draw very tight lines around what we'll assume about volatile
> sort expressions (pathkeys).  It sounds like there's someplace in
> this recent patch that didn't get that memo.

I'm not sure I did a good job of communicating my thoughts there.  What
I mean is, having volatile functions in the aggregate's ORDER BY or
DISTINCT clause didn't seem very well behaved prior to the presorted
aggregates patch.  If I go and fix the bug with the missing targetlist
items, then a query such as:

select string_agg(random()::text, ',' order by random()) from generate_series(1,3);

should start putting the random() numbers in order, where it didn't
prior to 1349d279.  Perhaps users might be happy that those are in
order; however, if they then go and change the query to:

select sum(a order by a),string_agg(random()::text, ',' order by random()) from generate_series(1,3);

then they might become unhappy again that their string_agg is not
ordered the way they specified, because the planner opted to sort by
"a" rather than "random()" after the initial scan.

I'm wondering if 1349d279 should just never have opted to presort
Aggrefs which have volatile functions, so that the existing behaviour of
unordered output is always given and nobody is fooled into thinking this
works correctly only to be disappointed later when they add some other
aggregate to their query, or if we should fix both.  Certainly, it seems
much easier to do the former.

David
Re: Add a new pg_walinspect function to extract FPIs from WAL records
On Tue, Jan 10, 2023 at 09:29:03AM +0100, Drouvot, Bertrand wrote: > Thanks for updating the patch! > > +-- Compare FPI from WAL record and page from table, they must be same > > I think "must be the same" or "must be identical" sounds better (but not 100% > sure). > > Except this nit, V4 looks good to me. +postgres=# SELECT lsn, tablespace_oid, database_oid, relfile_number, block_number, fork_name, length(fpi) > 0 as fpi_ok FROM pg_get_wal_fpi_info('0/7418E60', '0/7518218'); This query in the docs is too long IMO. Could you split that across multiple lines for readability? + pg_get_wal_fpi_info(start_lsn pg_lsn, + end_lsn pg_lsn, + lsn OUT pg_lsn, + tablespace_oid OUT oid, + database_oid OUT oid, + relfile_number OUT oid, + block_number OUT int8, + fork_name OUT text, + fpi OUT bytea) I am a bit surprised by this format, used to define the functions part of the module in the docs, while we have examples that actually show what's printed out. I understand that this comes from the original commit of the module, but the rendered docs are really hard to parse as well, no? FWIW, I think that this had better be fixed as well in the docs of v15.. Showing a full set of attributes for the returned record is fine by me, still if these are too long we could just use \x. For this one, I think that there is little point in showing 14 records, so I would stick with a style similar to pageinspect. +CREATE FUNCTION pg_get_wal_fpi_info(IN start_lsn pg_lsn, +IN end_lsn pg_lsn, +OUT lsn pg_lsn, + OUT tablespace_oid oid, Slight indentation issue here. Using "relfile_number" would be a first, for what is defined in the code and the docs as a filenode. 
+SELECT pg_current_wal_lsn() AS wal_lsn4 \gset
+-- Get FPI from WAL record
+SELECT fpi AS page_from_wal FROM pg_get_wal_fpi_info(:'wal_lsn3', :'wal_lsn4')
+       WHERE relfile_number = :'sample_tbl_oid' \gset

I would be tempted to keep the checks run here minimal, with only a
basic set of checks on the LSN, without the dependencies on pageinspect
(tuple_data_split and get_raw_page), which would be fine enough to
check the execution of the function.

FWIW, I am surprised by the design choice behind ValidateInputLSNs()
to allow data to be gathered until the end of WAL in some cases, but
not to allow it in others.  It is likely too late to come back to this
choice for the existing functions in v15 (though?), but couldn't it be
useful to make this new FPI function work at least with an insanely
high LSN value, to make sure that we fetch all the FPIs from a given
start position up to the end of WAL?  That looks like a pretty good
default behavior to me, rather than issuing an error when an LSN is
defined as in the future..  I am really wondering why we have
ValidateInputLSNs(till_end_of_wal=false) to begin with, while we could
just allow any LSN value in the future automatically, as we can know
the current insert or replay LSNs (depending on the recovery state).
--
Michael

signature.asc
Description: PGP signature
Re: Fix pg_publication_tables to exclude generated columns
Amit Kapila writes: >> On Mon, Jan 9, 2023 11:06 PM Tom Lane wrote: >>> We could just not fix it in the back branches. I'd argue that this is >>> as much a definition change as a bug fix, so it doesn't really feel >>> like something to back-patch anyway. > So, if we don't backpatch then it could lead to an error when it > shouldn't have which is clearly a bug. I think we should backpatch > this unless Tom or others are against it. This isn't a hill that I'm ready to die on ... but do we have any field complaints about this? If not, I still lean against a back-patch. I think there's a significant risk of breaking case A while fixing case B when we change this behavior, and that's something that's better done only in a major release. regards, tom lane
Re: Add proper planner support for ORDER BY / DISTINCT aggregates
David Rowley writes:
> I think whatever the fix is here, we should likely ensure that the
> results are consistent regardless of which Aggrefs are the presorted
> ones.  Perhaps the easiest way to do that, and to ensure that the
> volatile functions are called the same number of times, would just be
> to never choose Aggrefs with volatile functions when doing
> make_pathkeys_for_groupagg().

There's existing logic in equivclass.c and other places that tries
to draw very tight lines around what we'll assume about volatile
sort expressions (pathkeys).  It sounds like there's someplace in
this recent patch that didn't get that memo.

			regards, tom lane
Re: [PATCH] Improve ability to display optimizer analysis using OPTIMIZER_DEBUG
John Naylor writes: > Note that the TODO list has accumulated some cruft over the years. Some > time ago I started an effort to remove outdated/undesirable entries, and I > should get back to that, but for the present, please take the warning at > the top to heart: > "WARNING for Developers: Unfortunately this list does not contain all the > information necessary for someone to start coding a feature. Some of these > items might have become unnecessary since they were added --- others might > be desirable but the implementation might be unclear. When selecting items > listed below, be prepared to first discuss the value of the feature. Do not > assume that you can select one, code it and then expect it to be committed. > " I think we could make that even stronger: there's basically nothing on the TODO list that isn't problematic in some way. Otherwise it would have been done already. The entries involve large amounts of work, or things that are subtler than they might appear, or cases where the desirable semantics aren't clear, or tradeoffs that there's not consensus about, or combinations of those. IME it's typically a lot more productive to approach things via "scratch your own itch". If a problem is biting you directly, then at least you have some clear idea of what it is that needs to be fixed. You might have to work up to an understanding of how to fix it, but you have a clear goal. regards, tom lane
Re: delay starting WAL receiver
On Wed, Jan 11, 2023 at 5:20 PM Thomas Munro wrote:
> (I don't know why you didn't make it 0)

(Oh, I see why it had to be non-zero to avoid burning CPU, ignore that
part.)
Re: Fix pg_publication_tables to exclude generated columns
On Tue, Jan 10, 2023 at 8:38 AM shiy.f...@fujitsu.com wrote: > > On Mon, Jan 9, 2023 11:06 PM Tom Lane wrote: > > > > Amit Kapila writes: > > > On Mon, Jan 9, 2023 at 5:29 PM shiy.f...@fujitsu.com > > > wrote: > > >> I think one way to fix it is to modify pg_publication_tables query to > > >> exclude > > >> generated columns. But in this way, we need to bump catalog version when > > fixing > > >> it in back-branch. Another way is to modify function > > >> pg_get_publication_tables()'s return value to contain all supported > > >> columns > > if > > >> no column list is specified, and we don't need to change system view. > > > > > That sounds like a reasonable approach to fix the issue. > > > > We could just not fix it in the back branches. I'd argue that this is > > as much a definition change as a bug fix, so it doesn't really feel > > like something to back-patch anyway. > > > > If this is not fixed in back-branch, in some cases we will get an error when > creating/refreshing subscription because we query pg_publication_tables in > column list check. > > e.g. > > -- publisher > CREATE TABLE test_mix_4 (a int PRIMARY KEY, b int, c int, d int GENERATED > ALWAYS AS (a + 1) STORED); > CREATE PUBLICATION pub_mix_7 FOR TABLE test_mix_4 (a, b, c); > CREATE PUBLICATION pub_mix_8 FOR TABLE test_mix_4; > > -- subscriber > CREATE TABLE test_mix_4 (a int PRIMARY KEY, b int, c int, d int); > > postgres=# CREATE SUBSCRIPTION sub1 CONNECTION 'port=5432' PUBLICATION > pub_mix_7, pub_mix_8; > ERROR: cannot use different column lists for table "public.test_mix_4" in > different publications > > I think it might be better to fix it in back-branch. And if we fix it by > modifying pg_get_publication_tables(), we don't need to bump catalog version > in > back-branch, I think this seems acceptable. > So, if we don't backpatch then it could lead to an error when it shouldn't have which is clearly a bug. I think we should backpatch this unless Tom or others are against it. 
-- With Regards, Amit Kapila.
Re: Perform streaming logical transactions by background workers and parallel apply
On Wed, Jan 11, 2023 at 9:34 AM houzj.f...@fujitsu.com wrote: > > On Tuesday, January 10, 2023 7:48 PM Dilip Kumar > wrote: > > > > I was looking into 0001, IMHO the pid should continue to represent the main > > apply worker. So the pid will always show the main apply worker which is > > actually receiving all the changes for the subscription (in short working as > > logical receiver) and if it is applying changes through a parallel worker > > then it > > should put the parallel worker pid in a new column called > > 'parallel_worker_pid' > > or 'parallel_apply_worker_pid' otherwise NULL. Thoughts? > > Thanks for the comment. > > IIRC, you mean something like following, right ? > (sorry if I misunderstood) > -- > For parallel apply worker: > 'pid' column shows the pid of the leader, new column parallel_worker_pid > shows its own pid > > For leader apply worker: > 'pid' column shows its own pid, new column parallel_worker_pid shows 0 > -- > > If so, I am not sure if the above is better, because it is changing the > existing column's('pid') meaning, the 'pid' will no longer represent the pid > of > the worker itself. Besides, it seems not consistent with what we have for > parallel query workers in pg_stat_activity. What do you think ? > +1. I think it makes sense to keep it similar to pg_stat_activity. + + Process ID of the leader apply worker, if this process is a apply + parallel worker. NULL if this process is a leader apply worker or a + synchronization worker. Can we change the above description to something like: "Process ID of the leader apply worker, if this process is a parallel apply worker. NULL if this process is a leader apply worker or does not participate in parallel apply, or a synchronization worker."? -- With Regards, Amit Kapila.
Re: delay starting WAL receiver
On Wed, Jan 11, 2023 at 2:08 PM Nathan Bossart wrote:
> I discussed this a bit in a different thread [0], but I thought it deserved
> its own thread.
>
> After setting wal_retrieve_retry_interval to 1ms in the tests, I noticed
> that the recovery tests consistently take much longer.  Upon further
> inspection, it looks like a similar race condition to the one described in
> e5d494d's commit message.  With some added debug logs, I see that all of
> the callers of MaybeStartWalReceiver() complete before SIGCHLD is
> processed, so ServerLoop() waits for a minute before starting the WAL
> receiver.
>
> The attached patch fixes this by adjusting DetermineSleepTime() to limit
> the sleep to at most 100ms when WalReceiverRequested is set, similar to how
> the sleep is limited when background workers must be restarted.

Is the problem here that SIGCHLD is processed ...

        PG_SETMASK(&UnBlockSig);   <--- here?

        selres = select(nSockets, &rmask, NULL, NULL, &timeout);

Meanwhile the SIGCHLD handler code says:

 * Was it the wal receiver?  If exit status is zero (normal) or one
 * (FATAL exit), we assume everything is all right just like normal
 * backends.  (If we need a new wal receiver, we'll start one at the
 * next iteration of the postmaster's main loop.)

... which is true, but that won't be reached for a while in this case
if the timeout has already been set to 60s.  Your patch makes that
100ms, in that case, a time delay that by now attracts my attention
like a red rag to a bull (I don't know why you didn't make it 0).

I'm not sure, but if I got that right, then I think the whole problem
might automatically go away with CF #4032.  The SIGCHLD processing
code will run not when signals are unblocked before select() (that is
gone), but instead *after* the event loop wakes up with WL_LATCH_SET,
and runs the handler code in the regular user context before dropping
through to the rest of the main loop.
Re: Strengthen pg_waldump's --save-fullpage tests
On Wed, Jan 11, 2023 at 6:32 AM Michael Paquier wrote: > > On Tue, Jan 10, 2023 at 05:25:44PM +0100, Drouvot, Bertrand wrote: > > I like the idea of comparing the full page (and not just the LSN) but > > I'm not sure that adding the pageinspect dependency is a good thing. > > > > What about extracting the block directly from the relation file and > > comparing it with the one extracted from the WAL? (We'd need to skip the > > first 8 bytes to skip the LSN though). > > Byte-by-byte counting for the page hole? The page checksum would > matter as well, FWIW, as it is not set in WAL and a FPW logged in WAL > means that the page got modified. It means that it could have been > flushed, updating its pd_lsn and its pd_checksum on the way. Right. LSN of FPI from the WAL record and page from the table won't be the same, essentially FPI LSN <= table page. Since the LSNs are different, checksums too. This is the reason we have masking functions common/bufmask.c and rm_mask functions defined for some of the resource managers while verifying FPI consistency in verifyBackupPageConsistency(). Note that pageinspect can give only unmasked/raw page data, which means, byte-by-byte comparison isn't possible with pageinspect too, hence I was comparing only the rows with tuple_data_split(). Therefore, reading bytes from the table page and comparing byte-by-byte with FPI requires us to invent new masking functions in the tests - simply a no-go IMO. As the concern here is to not establish pageinspect dependency with pg_waldump, I'm fine to withdraw this patch and be happy with what we have today. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Add proper planner support for ORDER BY / DISTINCT aggregates
On Wed, 11 Jan 2023 at 15:46, Richard Guo wrote:
> However the scan/join plan's tlist does not contain random(), which I
> think we need to fix.

I was wondering if that's true and considered that we don't want to
evaluate random() for the sort and then again when doing the aggregate
transitions, but I see that does not really work before 1349d279, per:

postgres=# set enable_presorted_aggregate=0;
SET
postgres=# select string_agg(random()::text, ',' order by random()) from generate_series(1,3);
                        string_agg
-----------------------------------------------------------
 0.8659110018246505,0.15612649559563474,0.2022878955613403
(1 row)

I'd have expected those random numbers to be concatenated in ascending
order.  Running:

select random() from generate_series(1,3) order by random();

gives me the results in the order I'd have expected.

I think whatever the fix is here, we should likely ensure that the
results are consistent regardless of which Aggrefs are the presorted
ones.  Perhaps the easiest way to do that, and to ensure that the
volatile functions are called the same number of times, would just be
to never choose Aggrefs with volatile functions when doing
make_pathkeys_for_groupagg().

David
RE: [PATCH] Support % wildcard in extension upgrade filenames
> Sandro Santilli writes:
> > On Mon, Jan 09, 2023 at 05:51:49PM -0500, Tom Lane wrote:
> >> ... you still need one script file for each supported upgrade step
>
> > That's exactly the problem we're trying to solve here.
> > The include support is nice on itself, but won't solve our problem.
>
> The script-file-per-upgrade-path aspect solves a problem that you have,
> whether you admit it or not; I think you simply aren't realizing that because
> you have not had to deal with the consequences of your proposed feature.
> Namely that you won't have any control over what the backend will try to do in
> terms of upgrade paths.

It would be nice if there were some way to apply at least a range match
to upgrades, or to explicitly state in the control file what paths
should be applied for an upgrade.  Ultimately, as Sandro mentioned, it's
the 1000-files issue we are trying to address.  The only way we can fix
that in the current setup is to move to a minor version mode, which
means we can never do micro updates.

> As an example, suppose that a database has foo 4.0 installed, and the DBA
> decides to try to downgrade to 3.0.  With the system as it stands, if you've
> provided foo--4.0--3.0.sql then the conversion will go through, and presumably
> it will work because you tested that that script does what it is intended to.  If
> you haven't provided any such downgrade script, then ALTER EXTENSION
> UPDATE will say "Sorry Dave, I'm afraid I can't do that" and no harm is done.

In our case we've already addressed that in our script, because our
upgrades don't rely on what the extension model tells us is the version;
ultimately what our postgis..version() tells us is the true determinant
(one for the lib file and one for the script).  But you are right,
that's a selfish way of thinking about it, because we have internal
plumbing to handle that issue already and other projects probably don't.
What I was hoping for was having a section in the control file to say
something like

Upgrade patterns: 3.0.%--4.0.0.sql, 2.0.%--4.0.0.sql

(and perhaps some logic to guarantee there is no way to match two
patterns), so you wouldn't be able to have a set of patterns like

Upgrade patterns: %--4.0.0.sql, 3.0.%--4.0.0.sql, 2.0.%--4.0.0.sql

That would allow your proposed include something-or-other to work, and
for us to be able to use that and still reduce our file footprint.  I'd
even settle for absolute paths stated in the control file if we can
dispense with the need for a file path to push you forward.  In that
mode, your downgrade issue would never happen, even with the way people
normally create scripts now.

> So I really think this is a case of "be careful what you ask for, you might get it".
> Even if PostGIS is willing to put in the amount of infrastructure legwork needed
> to make such a design bulletproof, I'm quite sure nobody else will manage to
> use such a thing successfully.  I'd rather spend our development effort on a
> feature that has more than one use-case.
>
> regards, tom lane

I think you are underestimating the number of extensions that have this
issue and could benefit (agreed, not in the current incarnation of the
patch).  It's not just PostGIS; it's pgRouting (has 56 files) and the h3
extension (has 37 files in the last release, most of them do-nothing
ones, except at the minor update steps) that have the same issue (and
both pgRouting and h3 do have little bitty script updates that follow
the best-practice way of doing this).  For pgRouting alone there are 56
files for the last release (which could easily be reduced to about 5
files if the paths could be explicitly stated in the control file).
Yes, all those extensions should dispense with their dreams of having
micro updates (I honestly wish they would).
Perhaps I'm a little obsessive, but it annoys me to see 1000s of files
in my extension folder when I know that, even if I followed best
practice, I'd only need about 5 files for each extension.  And as a
packager, having to ship these files is even more annoying.  The
reality is that for micros, no one ships new functions (or at least
shouldn't be), so all functions just need to be replaced, perhaps with
a micro update file as you propose.

Heck, if we could even have the option to itemize our own upgrade file
paths explicitly in the control file, like:

paths: 3.0.1--4.0.0 = 3--4.0.0.sql, 3.0.2--4.0.0 = 3--4.0.0.sql, 2.0.2--4.0.0 = 2--4.0.0.sql

I'd be happy, and PostgreSQL could do the math pretending there are
the files it thinks it needs.

So if we could somehow rework this patch, perhaps adding something in
the control file to explicitly state the upgrade paths, I think that
would satisfy your concern?  And I'd be eternally grateful.

Thanks,
Regina
RE: Perform streaming logical transactions by background workers and parallel apply
On Tuesday, January 10, 2023 7:48 PM Dilip Kumar wrote: > > On Tue, Jan 10, 2023 at 10:26 AM houzj.f...@fujitsu.com > wrote: > > > > On Monday, January 9, 2023 4:51 PM Amit Kapila > wrote: > > > > > > On Sun, Jan 8, 2023 at 11:32 AM houzj.f...@fujitsu.com > > > wrote: > > > > > > > > On Sunday, January 8, 2023 11:59 AM houzj.f...@fujitsu.com > > > wrote: > > > > > Attach the updated patch set. > > > > > > > > Sorry, the commit message of 0001 was accidentally deleted, just > > > > attach the same patch set again with commit message. > > > > > > > > > > Pushed the first (0001) patch. > > > > Thanks for pushing, here are the remaining patches. > > I reordered the patch number to put patches that are easier to commit > > in the front of others. > > I was looking into 0001, IMHO the pid should continue to represent the main > apply worker. So the pid will always show the main apply worker which is > actually receiving all the changes for the subscription (in short working as > logical receiver) and if it is applying changes through a parallel worker > then it > should put the parallel worker pid in a new column called > 'parallel_worker_pid' > or 'parallel_apply_worker_pid' otherwise NULL. Thoughts? Thanks for the comment. IIRC, you mean something like following, right ? (sorry if I misunderstood) -- For parallel apply worker: 'pid' column shows the pid of the leader, new column parallel_worker_pid shows its own pid For leader apply worker: 'pid' column shows its own pid, new column parallel_worker_pid shows 0 -- If so, I am not sure if the above is better, because it is changing the existing column's('pid') meaning, the 'pid' will no longer represent the pid of the worker itself. Besides, it seems not consistent with what we have for parallel query workers in pg_stat_activity. What do you think ? Best regards, Hou zj
Re: releasing ParallelApplyTxnHash when pa_launch_parallel_worker returns NULL
On Tue, Jan 10, 2023 at 7:55 PM houzj.f...@fujitsu.com <houzj.f...@fujitsu.com> wrote:
> On Wednesday, January 11, 2023 10:21 AM Ted Yu wrote:
> > /* First time through, initialize parallel apply worker state hashtable. */
> > if (!ParallelApplyTxnHash)
> >
> > I think it would be better if `ParallelApplyTxnHash` is created by the first
> > successful parallel apply worker.
>
> Thanks for the suggestion. But I am not sure if it's worth changing the
> order here, because it will only optimize the case where the user enables
> parallel apply but never gets an available worker, which should be rare.
> And in such a case, it'd be better to increase the number of workers or
> disable the parallel mode.
>
> Best Regards,
> Hou zj

I think that even though the chance is rare, we shouldn't leak
resources.  The `ParallelApplyTxnHash` shouldn't be created if not a
single parallel apply worker could be launched.
Re: [PATCH] Improve ability to display optimizer analysis using OPTIMIZER_DEBUG
On Tue, Jan 3, 2023 at 1:59 PM Ankit Kumar Pandey wrote: > > > On 03/01/23 08:38, David Rowley wrote: > > > > Do you actually have a need for this or are you just trying to tick > > off some TODO items? > > > I would say Iatter but reason I picked it up was more on side of > learning optimizer better. Note that the TODO list has accumulated some cruft over the years. Some time ago I started an effort to remove outdated/undesirable entries, and I should get back to that, but for the present, please take the warning at the top to heart: "WARNING for Developers: Unfortunately this list does not contain all the information necessary for someone to start coding a feature. Some of these items might have become unnecessary since they were added --- others might be desirable but the implementation might be unclear. When selecting items listed below, be prepared to first discuss the value of the feature. Do not assume that you can select one, code it and then expect it to be committed. " -- John Naylor EDB: http://www.enterprisedb.com
RE: releasing ParallelApplyTxnHash when pa_launch_parallel_worker returns NULL
On Wednesday, January 11, 2023 10:21 AM Ted Yu wrote:
> /* First time through, initialize parallel apply worker state hashtable. */
> if (!ParallelApplyTxnHash)
>
> I think it would be better if `ParallelApplyTxnHash` is created by the first
> successful parallel apply worker.

Thanks for the suggestion. But I am not sure if it's worth changing the
order here, because it will only optimize the case where the user
enables parallel apply but never gets an available worker, which should
be rare. And in such a case, it'd be better to increase the number of
workers or disable the parallel mode.

Best Regards,
Hou zj
Re: Can we let extensions change their dumped catalog schemas?
Jacob Champion writes: > We'd like to be allowed to change the schema for a table that's been > marked in the past with pg_extension_config_dump(). > Unless I'm missing something obvious (please, let it be that) there's no > way to do this safely. Once you've marked an internal table as dumpable, > its schema is effectively frozen if you want your dumps to work across > versions, because otherwise you'll try to restore that "catalog" data > into a table that has different columns. And while sometimes you can > make that work, it doesn't in the general case. I agree that's a problem, but it's not that we're arbitrarily prohibiting something that would work. What, concretely, do you think could be done to improve the situation? > 2) Provide a way to record the exact version of an extension in a dump. Don't really see how that helps? I also fear that it will break a bunch of use-cases that work fine today, which are exactly the ones for which we originally defined pg_dump as *not* committing to a particular extension version. It feels like what we really need here is some way to mutate the old format of an extension config table into the new format. Simple addition of new columns shouldn't be a problem (in fact, I think that works already, or could easily be made to). If you want some ETL processing then it's harder :-(. Could an ON INSERT trigger on an old config table transpose converted data into a newer config table? Another point that ought to be made here is that pg_dump is not the only outside consumer of extension config data. You're likely to break some applications if you change a config table too much. That's not an argument that we shouldn't try to make pg_dump more forgiving, but I'm not sure that we need to move heaven and earth. regards, tom lane
Re: [PATCH] Simple code cleanup in tuplesort.c.
On Mon, Jan 9, 2023 at 7:29 PM Xing Guo wrote: > > Thank you John. This is my first patch, I'll keep it in mind that > adding a version number next time I sending the patch. Welcome to the community! You may also consider reviewing a patch from the current commitfest, since we can always use additional help there. -- John Naylor EDB: http://www.enterprisedb.com
Re: ATTACH PARTITION seems to ignore column generation status
On Wed, Jan 11, 2023 at 7:13 AM Tom Lane wrote: > I wrote: > > Amit Langote writes: > >> Thanks for the patch. It looks good, though I guess you said that we > >> should also change the error message that CREATE TABLE ... PARTITION > >> OF emits to match the other cases while we're here. I've attached a > >> delta patch. > > > Thanks. I hadn't touched that issue because I wasn't entirely sure > > which error message(s) you were unhappy with. These changes look > > fine offhand. > > So, after playing with that a bit ... removing the block in > parse_utilcmd.c allows you to do > > regression=# CREATE TABLE gtest_parent (f1 date NOT NULL, f2 bigint, f3 > bigint GENERATED ALWAYS AS (f2 * 2) STORED) PARTITION BY RANGE (f1); > CREATE TABLE > regression=# CREATE TABLE gtest_child PARTITION OF gtest_parent ( > regression(#f3 WITH OPTIONS GENERATED ALWAYS AS (f2 * 3) STORED > regression(# ) FOR VALUES FROM ('2016-07-01') TO ('2016-08-01'); > CREATE TABLE > regression=# \d gtest_child > Table "public.gtest_child" > Column | Type | Collation | Nullable | Default > ++---+--+- > f1 | date | | not null | > f2 | bigint | | | > f3 | bigint | | | generated always as (f2 * 3) stored > Partition of: gtest_parent FOR VALUES FROM ('2016-07-01') TO ('2016-08-01') > > regression=# insert into gtest_parent values('2016-07-01', 42); > INSERT 0 1 > regression=# table gtest_parent; > f1 | f2 | f3 > ++- > 2016-07-01 | 42 | 126 > (1 row) > > That is, you can make a partition with a different generated expression > than the parent has. Do we really want to allow that? 
I think it works > as far as the backend is concerned, but it breaks pg_dump, which tries > to dump this state of affairs as > > CREATE TABLE public.gtest_parent ( > f1 date NOT NULL, > f2 bigint, > f3 bigint GENERATED ALWAYS AS ((f2 * 2)) STORED > ) > PARTITION BY RANGE (f1); > > CREATE TABLE public.gtest_child ( > f1 date NOT NULL, > f2 bigint, > f3 bigint GENERATED ALWAYS AS ((f2 * 3)) STORED > ); > > ALTER TABLE ONLY public.gtest_parent ATTACH PARTITION public.gtest_child FOR > VALUES FROM ('2016-07-01') TO ('2016-08-01'); > > and that fails at reload because the ATTACH PARTITION code path > checks for equivalence of the generation expressions. > > This different-generated-expression situation isn't really morally > different from different ordinary DEFAULT expressions, which we > do endeavor to support. Ah, right, we are a bit more flexible in allowing that. Though, partition-specific defaults, unlike generated columns, are not respected when inserting/updating via the parent: create table partp (a int, b int generated always as (a+1) stored, c int default 0) partition by list (a); create table partc1 partition of partp (b generated always as (a+2) stored, c default 1) for values in (1); insert into partp values (1); table partp; a | b | c ---+---+--- 1 | 3 | 0 (1 row) create table partc2 partition of partp (b generated always as (a+2) stored) for values in (2); update partp set a = 2; table partp; a | b | c ---+---+--- 2 | 4 | 0 (1 row) > So maybe we should deem this a supported > case and remove ATTACH PARTITION's insistence that the generation > expressions match I tend to agree now that we have 3f7836ff6. > ... which I think would be a good thing anyway, > because that check-for-same-reverse-compiled-expression business > is pretty cheesy in itself. Hmm, yeah, we usually transpose a parent's expression into one that has a child's attribute numbers and vice versa when checking their equivalence. 
> AFAIK, 3f7836ff6 took care of the > only problem that the backend would have with this, and pg_dump > looks like it will work as long as the backend will take the > ATTACH command. Right. > I also looked into making CREATE TABLE ... PARTITION OF reject > this case; but that's much harder than it sounds, because what we > have at the relevant point is a raw (unanalyzed) expression for > the child's generation expression but a cooked one for the > parent's, so there is no easy way to match the two. > > In short, it's seeming like the rule for both partitioning and > traditional inheritance ought to be "a child column must have > the same generated-or-not property as its parent, but their > generation expressions need not be the same". Thoughts? Agreed. I've updated your disallow-generated-child-columns-2.patch to do this, and have also merged the delta patch that I had attached with my last email, whose contents you seemed to agree with. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com disallow-generated-child-columns-3.patch Description: Binary data
Re: pgsql: Add new GUC createrole_self_grant.
Robert Haas writes: > On Tue, Jan 10, 2023 at 9:40 PM Tom Lane wrote: >> The scenario I'm worried about could be closed, mostly, if we were willing >> to invent an intermediate GUC privilege level "can be set interactively >> but only by CREATEROLE holders" ("PGC_CRSET"?). > Of course, if it's possible for a non-CREATEROLE user to set the value > that a CREATEROLE user experiences, that'd be more of a problem -- That's exactly the case I'm worried about, and it's completely reachable if a CREATEROLE user makes a SECURITY DEFINER function that executes an affected GRANT and is callable by an unprivileged user. Now, that probably isn't terribly likely, and it's unclear that there'd be any serious security consequence even if the GRANT did do something different than the author of the SECURITY DEFINER function was expecting. Nonetheless, I'm feeling itchy about this chain of assumptions. > To answer your question directly, though, I don't know of any other > setting where that would be a useful level. Yeah, I didn't think of one either. Also, even if we invented PGC_CRSET, it can't stop one CREATEROLE user from attacking another one, assuming that there is some interesting attack that can be constructed here. I think the whole point of your recent patches is to not assume that CREATEROLE users are mutually trusting, so that's bad. Bottom line is that a GUC doesn't feel like the right mechanism to use. What do you think about inventing a role property, or a grantable role that controls this behavior? regards, tom lane
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Tue, Jan 10, 2023 at 7:08 PM Masahiko Sawada wrote: > It looks no problem in terms of vacuum integration, although I've not > fully tested yet. TID store uses the radix tree as the main storage, > and with the template radix tree, the data types for shared and > non-shared will be different. TID store can have an union for the > radix tree and the structure would be like follows: > /* Storage for Tids */ > union tree > { > local_radix_tree*local; > shared_radix_tree *shared; > }; We could possibly go back to using a common data type for this, but with unused fields in each setting, as before. We would have to be more careful of things like the 32-bit crash from a few weeks ago. > In the functions of TID store, we need to call either local or shared > radix tree functions depending on whether TID store is shared or not. > We need if-branch for each key-value pair insertion, but I think it > would not be a big performance problem in TID store use cases, since > vacuum is an I/O intensive operation in many cases. Also, the branch will be easily predicted. That was still true in earlier patches, but with many more branches and fatter code paths. > Overall, I think > there is no problem and I'll investigate it in depth. Okay, great. If the separate-functions approach turns out to be ugly, we can always go back to the branching approach for shared memory. I think we'll want to keep this as a template overall, at least to allow different value types and to ease adding variable-length keys if someone finds a need. > Apart from that, I've been considering the lock support for shared > radix tree. As we discussed before, the current usage (i.e, only > parallel index vacuum) doesn't require locking support at all, so it > would be enough to have a single lock for simplicity. Right, that should be enough for PG16. 
> If we want to > use the shared radix tree for other use cases such as the parallel > heap vacuum or the replacement of the hash table for shared buffers, > we would need better lock support. For future parallel pruning, I still think a global lock is "probably" fine if the workers buffer in local arrays. Highly concurrent applications will need additional work, of course. > For example, if we want to support > Optimistic Lock Coupling[1], Interesting, from the same authors! > we would need to change not only the node > structure but also the logic. Which probably leads to widen the gap > between the code for non-shared and shared radix tree. In this case, > once we have a better radix tree optimized for shared case, perhaps we > can replace the templated shared radix tree with it. I'd like to hear > your opinion on this line. I'm not in a position to speculate on how best to do scalable concurrency, much less how it should coexist with the local implementation. It's interesting that their "ROWEX" scheme gives up maintaining order in the linear nodes. > > One review point I'll mention: Somehow I didn't notice there is no use for the "chunk" field in the rt_node type -- it's only set to zero and copied when growing. What is the purpose? Removing it would allow the smallest node to take up only 32 bytes with a fanout of 3, by eliminating padding. > > Oh, I didn't notice that. The chunk field was originally used when > redirecting the child pointer in the parent node from old to new > (grown) node. When redirecting the pointer, since the corresponding > chunk surely exists on the parent we can skip existence checks. > Currently we use RT_NODE_UPDATE_INNER() for that (see > RT_REPLACE_NODE()) but having a dedicated function to update the > existing chunk and child pointer might improve the performance. Or > reducing the node size by getting rid of the chunk field might be > better. I see. 
IIUC from a brief re-reading of the code, saving that chunk would only save us from re-loading "parent->shift" from L1 cache and shifting the key. The cycles spent doing that seem small compared to the rest of the work involved in growing a node. Expressions like "if (idx < 0) return false;" return to an asserts-only variable, so in production builds, I would hope that branch gets elided (I haven't checked). I'm quite keen on making the smallest node padding-free, (since we don't yet have path compression or lazy path expansion), and this seems the way to get there. -- John Naylor EDB: http://www.enterprisedb.com
Re: pgsql: Add new GUC createrole_self_grant.
On Tue, Jan 10, 2023 at 9:40 PM Tom Lane wrote: > Yeah. I concur that a SUSET GUC isn't much fun for a non-superuser > CREATEROLE holder who might wish to adjust the default behavior they get. > I also concur that it seems a bit far-fetched that a CREATEROLE holder > might create a SECURITY DEFINER function that would do something that > would be affected by this setting. Still, we have no field experience > with how these mechanisms will actually be used, so I'm worried. All right. I'm not that worried because I think any problems that crop up probably won't be that bad, primarily due to the extremely restricted set of circumstances in which the GUC operates -- but that's a judgement call, and reasonable people can think differently. > The scenario I'm worried about could be closed, mostly, if we were willing > to invent an intermediate GUC privilege level "can be set interactively > but only by CREATEROLE holders" ("PGC_CRSET"?). But that's an awful lot > of infrastructure to add for one GUC. Are there any other GUCs where > that'd be a more useful choice than any we have now? I don't quite understand what that would do. If a non-CREATEROLE user sets the GUC, absolutely nothing happens, because the code that is controlled by the GUC cannot be reached without CREATEROLE privileges. Of course, if it's possible for a non-CREATEROLE user to set the value that a CREATEROLE user experiences, that'd be more of a problem -- though still insufficient to create a security vulnerability in and of itself -- but if user A can change the GUC settings that user B experiences, why screw around with this when you could just set search_path? To answer your question directly, though, I don't know of any other setting where that would be a useful level. Up until this morning, CREATEROLE was not usable for any serious purpose because we've been shipping something that was broken by design for years, so it's probably fortunate that not much depends on it. 
-- Robert Haas EDB: http://www.enterprisedb.com
Re: Improve tab completion for ALTER EXTENSION ADD/DROP
At Mon, 2 Jan 2023 13:19:50 +0530, vignesh C wrote in > On Mon, 5 Dec 2022 at 06:53, Michael Paquier wrote: > > > > The DROP could be matched with the objects that are actually part of > > the so-said extension? > > The modified v2 version has the changes to handle the same. Sorry for > the delay as I was working on another project. It suggests the *kinds* of objects that are part of the extension, but lists the objects of each kind regardless of dependency. I read Michael as suggesting (and I agree) that we restrict the *objects* (not the kinds) to those actually part of the extension. However, I'm not sure it is useful to restrict the object kinds, since the operator already knows what to drop. If you still want to do that, the use of completion_dont_quote looks ugly, since the function (requote_identifier) assumes an identifier as input. I haven't looked closer, but maybe we need another way to do that. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Using WaitEventSet in the postmaster
On Sun, Jan 8, 2023 at 11:55 AM Andres Freund wrote: > On 2023-01-07 18:08:11 +1300, Thomas Munro wrote: > > On Sat, Jan 7, 2023 at 12:25 PM Andres Freund wrote: > > > On 2023-01-07 11:08:36 +1300, Thomas Munro wrote: > > > > 3. Is it OK to clobber the shared pending flag for SIGQUIT, SIGTERM, > > > > SIGINT? If you send all of these extremely rapidly, it's > > > > indeterminate which one will be seen by handle_shutdown_request(). > > > > > > That doesn't seem optimal. I'm mostly worried that we can end up > > > downgrading a > > > shutdown request. > > > > I was contemplating whether I needed to do some more push-ups to > > prefer the first delivered signal (instead of the last), but you're > > saying that it would be enough to prefer the fastest shutdown type, in > > cases where more than one signal was handled between server loops. > > WFM. > > I don't see any need for such an order requirement - in case of receiving a > "less severe" shutdown request first, we'd process the more severe one soon > after. There's nothing to be gained by trying to follow the order of the > incoming signals. Oh, I fully agree. I was working through the realisation that I might need to serialise the handlers to implement the priority logic correctly (upgrades good, downgrades bad), but your suggestion fast-forwards to the right answer and doesn't require blocking, so I prefer it, and had already gone that way in v9. In this version I've added a comment to explain that the outcome is the same in the end, and also fixed the flag clearing logic which was subtly wrong before. > > > I wonder if it'd be good to have a _pm_ in the name. > > > > I dunno about this one, it's all static stuff in a file called > > postmaster.c and one (now) already has pm in it (see below). > > I guess stuff like signal handlers and their state somehow seems more global > to me than their C linkage type suggests. Hence the desire to be a bit more > "namespaced" in their naming. 
I do find it somewhat annoying when reasonably > important global variables aren't uniquely named when using a debugger... Alright, renamed. > A few more code review comments: > > DetermineSleepTime() still deals with struct timeval, which we maintain at > some effort. Just to then convert it away from struct timeval in the > WaitEventSetWait() call. That doesn't seem quite right, and is basically > introduced in this patch. I agree, but I was trying to minimise the patch: signals and events stuff is a lot already. I didn't want to touch DetermineSleepTime()'s time logic in the same commit. But here's a separate patch for that. > I think ServerLoop still has an outdated comment: > > * > * NB: Needs to be called with signals blocked Fixed. > > + /* Process work scheduled by signal handlers. > > */ > > Very minor: It feels a tad off to say that the work was scheduled by signal > handlers, it's either from other processes or by the OS. But ... OK, now it's "requested via signal handlers". > > +/* > > + * Child processes use SIGUSR1 to send 'pmsignals'. pg_ctl uses SIGUSR1 > > to ask > > + * postmaster to check for logrotate and promote files. > > + */ > > s/send/notify us of/, since the concrete "pmsignal" is actually transported > outside of the "OS signal" level? Fixed. > LGTM. Thanks. Here's v10. I'll wait a bit longer to see if anyone else has feedback. From 01bd9b3ef6029d5e5d568b514874a23a167d826d Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Wed, 9 Nov 2022 22:59:58 +1300 Subject: [PATCH v10 1/2] Give the postmaster a WaitEventSet and a latch. Switch to an architecture similar to regular backends, where signal handlers just set flags instead of doing real work. Changes: * Allow the postmaster to set up its own local latch. For now, we don't want other backends setting a shared memory latch directly (but that could be made to work with more research on robustness). 
* The existing signal handlers are cut in two: a handle_pm_XXX part that sets a pending_pm_XXX variable and sets the local latch, and a process_pm_XXX part. * Signal handlers are now installed with the regular pqsignal() function rather than the special pqsignal_pm() function; the concerns about the portability of SA_RESTART vs select() are no longer relevant, as we are not using select() or relying on EINTR. Reviewed-by: Andres Freund Discussion: https://postgr.es/m/CA%2BhUKG%2BZ-HpOj1JsO9eWUP%2Bar7npSVinsC_npxSy%2BjdOMsx%3DGg%40mail.gmail.com --- src/backend/libpq/pqcomm.c| 3 +- src/backend/libpq/pqsignal.c | 40 --- src/backend/postmaster/fork_process.c | 18 +- src/backend/postmaster/postmaster.c | 412 +++--- src/backend/tcop/postgres.c | 1 - src/backend/utils/init/miscinit.c | 13 +- src/include/libpq/pqsignal.h | 3 - src/include/miscadmin.h | 1 + 8 files changed, 264 insertions(+), 227 deletions(-) diff --git
Re: Add SHELL_EXIT_CODE to psql
On Tue, 10 Jan 2023 at 00:06, Corey Huinker wrote: > > On Mon, Jan 9, 2023 at 10:01 AM Maxim Orlov wrote: >> >> Hi! >> >> In overall, I think we move in the right direction. But we could make code >> better, should we? >> >> + /* Capture exit code for SHELL_EXIT_CODE */ >> + close_exit_code = pclose(fd); >> + if (close_exit_code == -1) >> + { >> + pg_log_error("%s: %m", cmd); >> + error = true; >> + } >> + if (WIFEXITED(close_exit_code)) >> + exit_code=WEXITSTATUS(close_exit_code); >> + else if(WIFSIGNALED(close_exit_code)) >> + exit_code=WTERMSIG(close_exit_code); >> + else if(WIFSTOPPED(close_exit_code)) >> + exit_code=WSTOPSIG(close_exit_code); >> + if (exit_code) >> + error = true; >> I think, it's better to add spaces around middle if block. It will be easy >> to read. >> Also, consider, adding spaces around assignment in this block. > > > Noted and will implement. > >> >> + /* >> + snprintf(exit_code_buf, sizeof(exit_code_buf), "%d", >> WEXITSTATUS(exit_code)); >> + */ >> Probably, this is not needed. > > > Heh. Oops. > >> >> > 1. pg_regress now creates an environment variable called PG_OS_TARGET >> Maybe, we can use env "OS"? I do not know much about Windows, but I think >> this is kind of standard environment variable there. > > > I chose a name that would avoid collisions with anything a user might > potentially throw into their environment, so if the var "OS" is fairly > standard is a reason to avoid using it. Also, going with our own env var > allows us to stay in perfect synchronization with the build's #ifdef WIN32 > ... and whatever #ifdefs may come in the future for new OSes. If there is > already an environment variable that does that for us, I would rather use > that, but I haven't found it. > > The 0001 patch is unchanged from last time (aside from anything rebasing > might have done). 
The patch does not apply on top of HEAD as in [1], please post a rebased patch: === Applying patches on top of PostgreSQL commit ID 3c6fc58209f24b959ee18f5d19ef96403d08f15c === === applying patch ./v4-0002-Add-psql-variables-SHELL_ERROR-and-SHELL_EXIT_COD.patch patching file doc/src/sgml/ref/psql-ref.sgml Hunk #1 FAILED at 4264. 1 out of 1 hunk FAILED -- saving rejects to file doc/src/sgml/ref/psql-ref.sgml.rej [1] - http://cfbot.cputube.org/patch_41_4073.log Regards, Vignesh
Re: Allow tailoring of ICU locales with custom rules
On Thu, 5 Jan 2023 at 20:45, Peter Eisentraut wrote: > > Patch needed a rebase; no functionality changes. The patch does not apply on top of HEAD as in [1], please post a rebased patch: === Applying patches on top of PostgreSQL commit ID d952373a987bad331c0e499463159dd142ced1ef === === applying patch ./v2-0001-Allow-tailoring-of-ICU-locales-with-custom-rules.patch patching file doc/src/sgml/catalogs.sgml patching file doc/src/sgml/ref/create_collation.sgml patching file doc/src/sgml/ref/create_database.sgml Hunk #1 FAILED at 192. 1 out of 1 hunk FAILED -- saving rejects to file doc/src/sgml/ref/create_database.sgml.rej [1] - http://cfbot.cputube.org/patch_41_4075.log Regards, Vignesh
Re: Add proper planner support for ORDER BY / DISTINCT aggregates
On Tue, Jan 10, 2023 at 6:12 PM Dean Rasheed wrote: > While doing some random testing, I noticed that the following is broken in > HEAD: > > SELECT COUNT(DISTINCT random()) FROM generate_series(1,10); > > ERROR: ORDER/GROUP BY expression not found in targetlist > > It appears to have been broken by 1349d279, though I haven't looked at > the details. Yeah, this is definitely broken. For this query, we try to sort the scan/join path by random() before performing the Aggregate, which is an optimization implemented in 1349d2790b. However the scan/join plan's tlist does not contain random(), which I think we need to fix. Thanks Richard
Re: pgsql: Add new GUC createrole_self_grant.
Robert Haas writes: > On Tue, Jan 10, 2023 at 8:47 PM Tom Lane wrote: >> [ squint ... ] Are you sure it's not a security *hazard*, though? > I think you have to squint pretty hard to find a security hazard here. Maybe, but I'd be sad if somebody manages to find one after this is out in the wild. > That said, in my original design, this was controlled via a different > mechanism which was superuser-only. I was informed that made no sense, > so I changed it. Now here we are. Yeah. I concur that a SUSET GUC isn't much fun for a non-superuser CREATEROLE holder who might wish to adjust the default behavior they get. I also concur that it seems a bit far-fetched that a CREATEROLE holder might create a SECURITY DEFINER function that would do something that would be affected by this setting. Still, we have no field experience with how these mechanisms will actually be used, so I'm worried. The scenario I'm worried about could be closed, mostly, if we were willing to invent an intermediate GUC privilege level "can be set interactively but only by CREATEROLE holders" ("PGC_CRSET"?). But that's an awful lot of infrastructure to add for one GUC. Are there any other GUCs where that'd be a more useful choice than any we have now? regards, tom lane
Re: [PATCH] support tab-completion for single quote input with equal sign
I wrote: > I've spent some effort previously on getting tab-completion to deal > sanely with single-quoted strings, but everything I've tried has > crashed and burned :-(, mainly because it's not clear when to take > the whole literal as one "word" and when not. After a little further thought, a new idea occurred to me: maybe we could push some of the problem down into the Matches functions. Consider inventing a couple of new match primitives: * MatchLiteral matches one or more parse tokens that form a single complete, valid SQL literal string (either single-quoted or dollar style). Use it like else if (Matches("CREATE", "SUBSCRIPTION", MatchAny, "CONNECTION", MatchLiteral)) COMPLETE_WITH("PUBLICATION"); I think there's no question that most Matches calls that might subsume a quoted literal would prefer to treat the literal as a single word, and this'd let them do that correctly. * MatchLiteralBegin matches the opening of a literal string (either ' or $...$). Handwaving freely, we might do else if (Matches("CREATE", "SUBSCRIPTION", MatchAny, "CONNECTION", MatchLiteralBegin)) COMPLETE_WITH(List_of_connection_keywords); This part of the idea still needs some thought, because it remains unclear how we might offer completion for connection keywords after the first one. Implementing these primitives might be a little tricky too. If memory serves, readline and libedit have different behaviors around quote marks. But at least it seems like a framework that could solve a number of problems, if we can make it go. regards, tom lane
Re: pgsql: Add new GUC createrole_self_grant.
On Tue, Jan 10, 2023 at 8:47 PM Tom Lane wrote: > [ squint ... ] Are you sure it's not a security *hazard*, though? I think you have to squint pretty hard to find a security hazard here. The effect of this GUC is to control the set of privileges that a CREATEROLE user automatically grants to themselves. But granting yourself privileges does not normally lead to any sort of security privilege. It's not completely impossible, I believe. For example, suppose that I, as a CREATEROLE user who is not a superuser, execute "CREATE ROLE shifty". I then set up a schema that shifty can access and I cannot. I then put that schema into my search_path despite the fact that I haven't given myself access to it. Now, depending on the value of this setting, I might either implicitly inherit shifty's privileges, or I might not. So, if I was expecting that I wouldn't, and I do, then I have now created a situation where if I do more dumb things I could execute some shifty code that lets that shifty user take over my account. But, you know, if I'm that dumb, I could also hit myself in the head with a hammer and the shifty guy could use the fact that I'm unconscious to fish the sticky note out of my wallet where, presumably, I keep my database password. The bigger point here, I think, is that this GUC only controls default privileges -- and we already have a system for default privileges that allows any user to give away privileges on virtually any object that they create to anyone. Nothing about that system is superuser-only. This system is far more restricted in its scope. It only allows you to give privileges to yourself, not anyone else, and only if you're a CREATEROLE user who is not a SUPERUSER. It seems a bit crazy to say that it's not a hazard for Alice to automatically grant every permission in the book to Emil every time she creates a table or schema or type or sequence or a function, but it is a hazard if Bob can grant INHERIT and SET to himself on roles that he creates. 
That said, in my original design, this was controlled via a different mechanism which was superuser-only. I was informed that made no sense, so I changed it. Now here we are. -- Robert Haas EDB: http://www.enterprisedb.com
Re: releasing ParallelApplyTxnHash when pa_launch_parallel_worker returns NULL
On Tue, Jan 10, 2023 at 6:13 PM houzj.f...@fujitsu.com < houzj.f...@fujitsu.com> wrote: > On Wednesday, January 11, 2023 1:25 AM Ted Yu wrote: > > > I was reading src/backend/replication/logical/applyparallelworker.c . > > In `pa_allocate_worker`, when pa_launch_parallel_worker returns NULL, I > think the `ParallelApplyTxnHash` should be released. > > Thanks for reporting. > > ParallelApplyTxnHash is used to cache the state of streaming transactions > being > applied. There could be multiple streaming transactions being applied in > parallel and their information were already saved in ParallelApplyTxnHash, > so > we should not release them just because we don't have a worker available to > handle a new transaction here. > > Best Regards, > Hou zj > Hi, /* First time through, initialize parallel apply worker state hashtable. */ if (!ParallelApplyTxnHash) I think it would be better if `ParallelApplyTxnHash` is created by the first successful parallel apply worker. Please take a look at the new patch and see if it makes sense. Cheers create-parallel-apply-txn-hash-after-the-first-worker.patch Description: Binary data
Re: Handle infinite recursion in logical replication setup
On Tue, Jan 10, 2023 at 11:24 PM Jonathan S. Katz wrote: > > On 1/10/23 10:17 AM, Amit Kapila wrote: > > On Tue, Jan 10, 2023 at 8:13 AM Jonathan S. Katz > > wrote: > > > One can use local or higher > > for reducing the latency for COMMIT when synchronous replication is > > used in the publisher. Won't using 'local' while creating subscription > > would suffice the need to consistently replicate the data? I mean it > > is equivalent to somebody using levels greater than local in case of > > physical replication. I think in the case of physical replication, we > > won't wait for standby to replicate to another node before sending a > > response, so why to wait in the case of logical replication? If this > > understanding is correct, then probably it is sufficient to support > > 'local' for a subscription. > > I think this is a good explanation. We should incorporate some version > of this into the docs, and add some clarity around the use of > `synchronous_commit` option in `CREATE SUBSCRIPTION` in particular with > the origin feature. I can make an attempt at this. > Okay, thanks! > Perhaps another question: based on this, should we disallow setting > values of `synchronous_commit` greater than `local` when using > "origin=none"? > I think it should be done irrespective of the value of origin because it can create a deadlock in those cases as well. I think one idea as you suggest is to block levels higher than local and the other is to make it behave like 'local' even if the level is higher than 'local' which would be somewhat similar to our physical replication. > That might be too strict, but maybe we should warn around > the risk of deadlock either in the logs or in the docs. > Yeah, we can mention it in docs as well. We can probably document as part of the docs work for this thread but it would be better to start a separate thread to fix this behavior by either of the above two approaches. -- With Regards, Amit Kapila.
Re: [PATCH] Add overlaps geometric operators that ignore point overlaps
Hello. At Sun, 1 Jan 2023 01:13:24 +0530, Ankit Kumar Pandey wrote in > This is patch for todo item: Add overlaps geometric operators that > ignore point overlaps > > Issue: > > SELECT circle '((0,0), 1)' && circle '((2,0),1) returns True > > Expectation: In above case, both figures touch other but do not > overlap (i.e. touching != overlap). Hence, it should return false. Even if this is slightly off from the common definition in other geometric processing systems, it is the established behavior of PostgreSQL and should already have users. About the behavior itself, the words "touch" and "overlap" seem to have no rigorous mathematical definitions, so the answer depends on which definition we adopt. The following discussion is mere word play. If circle ((0,0),1) means a circumference, i.e. the set of points satisfying "x^2 + y^2 = 1" (or it may be a disc containing the area inside (<=) here), and "overlap" means "share at least one point", then the two circles are overlapping. This seems to be our current standpoint and what is expressed in the doc. If it meant the area exclusively inside the outline (i.e. x^2 + y^2 < 1), the two circles could be said to touch but not to overlap. Or, if a circle is defined as "(<)= 1" but "overlap" means "share at least an area", they could be said to touch but not to overlap? (I'm not sure about the border between a point and an area here, and the distinction would be connected with the annoying EPSILON..) The same discussion holds for boxes and other shapes.
> > If current behavior is correct, this todo item might need some update > (unless I missed anything) otherwise any suggestion is welcomed. I read the todo description as we may want *another set* of operators to do that, not to change the current behavior of the existing operators. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
RE: releasing ParallelApplyTxnHash when pa_launch_parallel_worker returns NULL
On Wednesday, January 11, 2023 1:25 AM Ted Yu wrote: > I was reading src/backend/replication/logical/applyparallelworker.c . > In `pa_allocate_worker`, when pa_launch_parallel_worker returns NULL, I think > the `ParallelApplyTxnHash` should be released. Thanks for reporting. ParallelApplyTxnHash is used to cache the state of streaming transactions being applied. There could be multiple streaming transactions being applied in parallel and their information were already saved in ParallelApplyTxnHash, so we should not release them just because we don't have a worker available to handle a new transaction here. Best Regards, Hou zj
Spinlock is missing when updating two_phase of ReplicationSlot
Hi,

I realized that in the CreateDecodingContext() function, we update both
slot->data.two_phase and two_phase_at without acquiring the spinlock:

    /* Mark slot to allow two_phase decoding if not already marked */
    if (ctx->twophase && !slot->data.two_phase)
    {
        slot->data.two_phase = true;
        slot->data.two_phase_at = start_lsn;
        ReplicationSlotMarkDirty();
        ReplicationSlotSave();
        SnapBuildSetTwoPhaseAt(ctx->snapshot_builder, start_lsn);
    }

I think we should acquire the spinlock when updating fields of the
replication slot, even by its owner. Otherwise readers could see
inconsistent results. Looking at another place where we update
two_phase_at, we acquire the spinlock:

    SpinLockAcquire(&slot->mutex);
    slot->data.confirmed_flush = ctx->reader->EndRecPtr;
    if (slot->data.two_phase)
        slot->data.two_phase_at = ctx->reader->EndRecPtr;
    SpinLockRelease(&slot->mutex);

It seems to me an oversight of commit a8fd13cab0b. I've attached a small
patch to fix it.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

fix_spinlock.patch
Description: Binary data
Re: Fixing a couple of buglets in how VACUUM sets visibility map bits
On Tue, Jan 10, 2023 at 4:39 PM Peter Geoghegan wrote:
> * Run VACUUM FREEZE. We need FREEZE in order to be able to hit the
> relevant visibilitymap_set() call site (the one that passes
> VISIBILITYMAP_ALL_FROZEN as its flags, without also passing
> VISIBILITYMAP_ALL_VISIBLE).
>
> Now all_visible_according_to_vm is set to true, but we don't have a
> lock/pin on the same heap page just yet.
>
> * Acquire several non-conflicting row locks on a row on the block in
> question, so that a new multi is allocated.

Forgot to mention that there needs to be a HOT update mixed in with these
SELECT ... FOR SHARE row lockers, too, which must abort once its XID has
been added to a multi. Obviously heap_lock_tuple() won't ever unset
VISIBILITYMAP_ALL_VISIBLE or PD_ALL_VISIBLE (it only ever clears
VISIBILITYMAP_ALL_FROZEN) -- so we need a heap_update() to clear all of
these status bits.

This enables the assertion to fail because:

* Pruning can get rid of the aborted successor heap-only tuple right
away, so it is not going to block us from setting the page all_visible
(that just leaves the original tuple to consider).

* The original tuple's xmax is a Multi, so it won't automatically be
ineligible for freezing because it's > OldestXmin in this scenario.

* FreezeMultiXactId() processing will completely remove xmax, without
caring too much about cutoffs like OldestXmin -- it only cares about
whether each individual XID needs to be kept or not.

(Granted, FreezeMultiXactId() will only remove xmax like this because I
deliberately removed its FRM_NOOP handling, but that is a very delicate
thing to rely on, especially from such a great distance. I can't imagine
that it doesn't fail on HEAD for any reason beyond sheer luck.)

-- 
Peter Geoghegan
Re: Infinite Interval
On Sun, Jan 8, 2023 at 11:17 PM jian he wrote:
>
> On Sun, Jan 8, 2023 at 4:22 AM Joseph Koshakow wrote:
>>
>> On Sat, Jan 7, 2023 at 3:05 PM Joseph Koshakow wrote:
>> >
>> > On Sat, Jan 7, 2023 at 3:04 PM Joseph Koshakow wrote:
>> > >
>> > > I think this patch is just about ready for review, except for the
>> > > following two questions:
>> > > 1. Should finite checks on intervals only look at months or all three
>> > > fields?
>> > > 2. Should we make the error messages for adding/subtracting infinite
>> > > values more generic or leave them as is?
>> > >
>> > > My opinions are
>> > > 1. We should only look at months.
>> > > 2. We should make the errors more generic.
>> > >
>> > > Anyone else have any thoughts?
>>
>> Here's a patch with the more generic error messages.
>>
>> - Joe
>
> HI.
>
> I just found out another problem.
>
> select * from generate_series(timestamp'-infinity', timestamp 'infinity',
> interval 'infinity');
> ERROR: timestamp out of range
>
> select * from generate_series(timestamp'-infinity',timestamp 'infinity',
> interval '-infinity'); --return following
>
> generate_series
> -
> (0 rows)
>
> select * from generate_series(timestamp 'infinity',timestamp 'infinity',
> interval 'infinity');
> --will run all the time.
>
> select * from generate_series(timestamp 'infinity',timestamp 'infinity',
> interval '-infinity');
> ERROR: timestamp out of range
>
> select * from generate_series(timestamp'-infinity',timestamp'-infinity',
> interval 'infinity');
> ERROR: timestamp out of range
>
> select * from generate_series(timestamp'-infinity',timestamp'-infinity',
> interval '-infinity');
> --will run all the time.

Good catch, I didn't think to check non date/time functions.
Unfortunately, I think you may have opened Pandora's box. I went through
pg_proc.dat and found the following functions that all involve intervals.
We should probably investigate all of them and make sure that they handle
infinite intervals properly.
{ oid => '1026', descr => 'adjust timestamp to new time zone',
  proname => 'timezone', prorettype => 'timestamp',
  proargtypes => 'interval timestamptz', prosrc => 'timestamptz_izone' },
{ oid => '4133', descr => 'window RANGE support',
  proname => 'in_range', prorettype => 'bool',
  proargtypes => 'date date interval bool bool',
  prosrc => 'in_range_date_interval' },
{ oid => '1305', descr => 'intervals overlap?',
  proname => 'overlaps', prolang => 'sql', proisstrict => 'f',
  provolatile => 's', prorettype => 'bool',
  proargtypes => 'timestamptz interval timestamptz interval',
  prosrc => 'see system_functions.sql' },
{ oid => '1306', descr => 'intervals overlap?',
  proname => 'overlaps', prolang => 'sql', proisstrict => 'f',
  provolatile => 's', prorettype => 'bool',
  proargtypes => 'timestamptz timestamptz timestamptz interval',
  prosrc => 'see system_functions.sql' },
{ oid => '1307', descr => 'intervals overlap?',
  proname => 'overlaps', prolang => 'sql', proisstrict => 'f',
  provolatile => 's', prorettype => 'bool',
  proargtypes => 'timestamptz interval timestamptz timestamptz',
  prosrc => 'see system_functions.sql' },
{ oid => '1308', descr => 'intervals overlap?',
  proname => 'overlaps', proisstrict => 'f', prorettype => 'bool',
  proargtypes => 'time time time time', prosrc => 'overlaps_time' },
{ oid => '1309', descr => 'intervals overlap?',
  proname => 'overlaps', prolang => 'sql', proisstrict => 'f',
  prorettype => 'bool', proargtypes => 'time interval time interval',
  prosrc => 'see system_functions.sql' },
{ oid => '1310', descr => 'intervals overlap?',
  proname => 'overlaps', prolang => 'sql', proisstrict => 'f',
  prorettype => 'bool', proargtypes => 'time time time interval',
  prosrc => 'see system_functions.sql' },
{ oid => '1311', descr => 'intervals overlap?',
  proname => 'overlaps', prolang => 'sql', proisstrict => 'f',
  prorettype => 'bool', proargtypes => 'time interval time time',
  prosrc => 'see system_functions.sql' },
{ oid => '1386',
  descr => 'date difference from today preserving months and years',
  proname => 'age', prolang => 'sql', provolatile => 's',
  prorettype => 'interval', proargtypes => 'timestamptz',
  prosrc => 'see system_functions.sql' },
{ oid => '2042', descr => 'intervals overlap?',
  proname => 'overlaps', prolang => 'sql', proisstrict => 'f',
  prorettype => 'bool',
  proargtypes => 'timestamp interval timestamp interval',
  prosrc => 'see system_functions.sql' },
{ oid => '2043', descr => 'intervals overlap?',
  proname => 'overlaps', prolang => 'sql', proisstrict => 'f',
  prorettype => 'bool',
  proargtypes => 'timestamp timestamp timestamp interval',
  prosrc => 'see system_functions.sql' },
{ oid => '2044', descr => 'intervals overlap?',
  proname => 'overlaps', prolang =>
Re: Allow +group in pg_ident.conf
On Tue, Jan 10, 2023 at 09:42:19AM -0500, Andrew Dunstan wrote:
> Ok, that sounds reasonable, but the cfbot doesn't like patches that
> depend on other patches in a different email. Maybe you can roll this up
> as an extra patch in your next version? It's pretty small.

This can go two ways if both of you agree, by sending an updated patch on
this thread based on the other one..  And actually, Jelte's patch has
less C code than this thread's proposal, even though it lacks tests.

-- 
Michael

signature.asc
Description: PGP signature
delay starting WAL receiver
I discussed this a bit in a different thread [0], but I thought it
deserved its own thread.  After setting wal_retrieve_retry_interval to
1ms in the tests, I noticed that the recovery tests consistently take
much longer.  Upon further inspection, it looks like a similar race
condition to the one described in e5d494d's commit message.  With some
added debug logs, I see that all of the callers of MaybeStartWalReceiver()
complete before SIGCHLD is processed, so ServerLoop() waits for a minute
before starting the WAL receiver.

The attached patch fixes this by adjusting DetermineSleepTime() to limit
the sleep to at most 100ms when WalReceiverRequested is set, similar to
how the sleep is limited when background workers must be restarted.

[0] https://postgr.es/m/20221215224721.GA694065%40nathanxps13

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

>From 6f32238f119236322dfb16262a88c3a9c5141413 Mon Sep 17 00:00:00 2001
From: Nathan Bossart
Date: Thu, 15 Dec 2022 14:11:43 -0800
Subject: [PATCH v1 1/1] handle race condition when restarting wal receivers

---
 src/backend/postmaster/postmaster.c | 28 ++--
 1 file changed, 22 insertions(+), 6 deletions(-)

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index eac3450774..9d6f58e0b3 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -1601,9 +1601,9 @@ checkControlFile(void)
  *
  * In normal conditions we wait at most one minute, to ensure that the other
  * background tasks handled by ServerLoop get done even when no requests are
- * arriving.  However, if there are background workers waiting to be started,
- * we don't actually sleep so that they are quickly serviced.  Other exception
- * cases are as shown in the code.
+ * arriving.  However, if there are background workers or a WAL receiver
+ * waiting to be started, we make sure they are quickly serviced.  Other
+ * exception cases are as shown in the code.
  */
 static void
 DetermineSleepTime(struct timeval *timeout)
@@ -1611,11 +1611,12 @@ DetermineSleepTime(struct timeval *timeout)
 	TimestampTz next_wakeup = 0;
 
 	/*
-	 * Normal case: either there are no background workers at all, or we're in
-	 * a shutdown sequence (during which we ignore bgworkers altogether).
+	 * Normal case: either there are no background workers and no WAL receiver
+	 * at all, or we're in a shutdown sequence (during which we ignore
+	 * bgworkers and the WAL receiver altogether).
 	 */
 	if (Shutdown > NoShutdown ||
-		(!StartWorkerNeeded && !HaveCrashedWorker))
+		(!StartWorkerNeeded && !HaveCrashedWorker && !WalReceiverRequested))
 	{
 		if (AbortStartTime != 0)
 		{
@@ -1674,6 +1675,21 @@ DetermineSleepTime(struct timeval *timeout)
 		}
 	}
 
+	/*
+	 * If WalReceiverRequested is set, we're probably waiting on a SIGCHLD to
+	 * arrive to clear WalReceiverPID before starting the new WAL receiver.  We
+	 * don't expect that to take long, so limit the sleep to 100ms so that we
+	 * start the new WAL receiver promptly.
+	 */
+	if (WalReceiverRequested)
+	{
+		TimestampTz walrcv_wakeup;
+
+		walrcv_wakeup = TimestampTzPlusMilliseconds(GetCurrentTimestamp(), 100);
+		if (next_wakeup == 0 || walrcv_wakeup < next_wakeup)
+			next_wakeup = walrcv_wakeup;
+	}
+
 	if (next_wakeup != 0)
 	{
 		long		secs;
-- 
2.25.1
Re: Strengthen pg_waldump's --save-fullpage tests
On Tue, Jan 10, 2023 at 05:25:44PM +0100, Drouvot, Bertrand wrote:
> I like the idea of comparing the full page (and not just the LSN) but
> I'm not sure that adding the pageinspect dependency is a good thing.
>
> What about extracting the block directly from the relation file and
> comparing it with the one extracted from the WAL? (We'd need to skip the
> first 8 bytes to skip the LSN though).

Byte-by-byte counting for the page hole?  The page checksum would matter
as well, FWIW, as it is not set in WAL, and a FPW logged in WAL means
that the page got modified.  It means that it could have been flushed,
updating its pd_lsn and its pd_checksum on the way.

-- 
Michael

signature.asc
Description: PGP signature
Re: [PATCH] support tab-completion for single quote input with equal sign
torikoshia writes:
> I updated the patch going along with the v3 direction.

I think this adds about as many failure modes as it removes, if not more.

* The connection string doesn't necessarily end with "'"; it could be a
dollar-quoted string.

* If it is a dollar-quoted string, there could be "'" marks internal to
it, allowing PUBLICATION to be falsely offered when we're really still
within the connection string.

* The following WITH options could contain "'", allowing PUBLICATION to
be falsely offered within that clause.

I've spent some effort previously on getting tab-completion to deal
sanely with single-quoted strings, but everything I've tried has crashed
and burned :-(, mainly because it's not clear when to take the whole
literal as one "word" and when not.  A relevant example here is that
somebody might wish that we could tab-complete within the connection
string, e.g. that

	CREATE SUBSCRIPTION sub CONNECTION 'db

would complete with "name=".  We have the info available from libpq to
do that, if only we could figure out when to apply it.

I think we need some pretty fundamental design work to figure out what
we want to do in this area, and that in the meantime putting band-aids
on specific cases is probably not very productive.

			regards, tom lane
Re: Todo: Teach planner to evaluate multiple windows in the optimal order
On Tue, 10 Jan 2023 at 18:36, Tom Lane wrote:
>
> David Rowley writes:
> > Ideally, our sort costing would just be better, but I think that
> > raises the bar a little too high to start thinking of making
> > improvements to that for this patch.
>
> It's trickier than it looks, cf f4c7c410e.  But if you just want
> to add a small correction based on number of columns being sorted
> by, that seems within reach.  See the comment for cost_sort though.
> Also, I suppose for incremental sorts we'd want to consider only
> the number of newly-sorted columns, but I'm not sure if that info
> is readily at hand either.

Yeah, I had exactly that in mind when I mentioned about setting the bar
higher.  It seems like a worthy enough goal to improve the sort costs
separately from this work.

I'm starting to consider if we might need to revisit cost_sort() anyway.
There's been quite a number of performance improvements made to sort in
the past few years and I don't recall if anything has been done to check
if the sort costs are still realistic.  I'm aware that it's a difficult
problem as the number of comparisons is highly dependent on the order of
the input rows.

David
Re: Todo: Teach planner to evaluate multiple windows in the optimal order
On Wed, 11 Jan 2023 at 08:17, Ankit Kumar Pandey wrote:
>
> On 10/01/23 10:53, David Rowley wrote:
> > the total cost is the same for both of these, but the execution time
> > seems to vary quite a bit.
>
> This is really weird, I tried it different ways (to rule out any issues
> due to caching) and execution time varied in spite of having same cost.
>
> > Maybe we should try and do this for DISTINCT queries if the
> > distinct_pathkeys match the orderby_pathkeys. That seems a little less
> > copout-ish. If the ORDER BY is the same as the DISTINCT then it seems
> > likely that the ORDER BY might opt to use the Unique path for DISTINCT
> > since it'll already have the correct pathkeys.
> >
> > However, if the ORDER BY has fewer columns then it might be cheaper to
> > Hash Aggregate and then sort all over again, especially so when the
> > DISTINCT removes a large proportion of the rows.
>
> Isn't order by pathkeys are always fewer than distinct pathkeys?

Just thinking about this again, I remembered why I thought DISTINCT was
uninteresting to start with.  The problem is that if the query has
WindowFuncs and also has a DISTINCT clause, then the WindowFunc results
*must* be in the DISTINCT clause and, optionally, also in the ORDER BY
clause.  There's no other place to write WindowFuncs IIRC.

Since we cannot push down the sort when the more strict version of the
pathkeys has WindowFuncs, we must still perform the additional sort if
the planner chooses to do a non-hashed DISTINCT.  The aim of this patch
is to reduce the total number of sorts, and I don't think that's possible
in this case, as you can't have WindowFuncs in the ORDER BY when they're
not in the DISTINCT clause:

postgres=# select distinct relname from pg_Class order by row_number()
over (order by oid);
ERROR:  for SELECT DISTINCT, ORDER BY expressions must appear in select list
LINE 1: select distinct relname from pg_Class order by row_number() ...
Another type of query which is still suboptimal is when there's a
DISTINCT and WindowClause but no ORDER BY.  We'll reorder the DISTINCT
clause so that the leading columns of the ORDER BY come first in
transformDistinctClause(), but we've nothing to do the same for
WindowClauses.  It can't happen around when transformDistinctClause() is
called, as we've yet to decide the WindowClause evaluation order, so if
we were to try to make that better it would maybe have to be done in the
upper planner somewhere.  It's possible it's too late by that time to
adjust the DISTINCT clause.  Here's an example of it.

# set enable_hashagg=0;
# explain select distinct relname,relkind,count(*) over (partition by relkind) from pg_Class;
                                     QUERY PLAN
------------------------------------------------------------------------------------
 Unique  (cost=61.12..65.24 rows=412 width=73)
   ->  Sort  (cost=61.12..62.15 rows=412 width=73)
         Sort Key: relname, relkind, (count(*) OVER (?))
         ->  WindowAgg  (cost=36.01..43.22 rows=412 width=73)
               ->  Sort  (cost=36.01..37.04 rows=412 width=65)
                     Sort Key: relkind
                     ->  Seq Scan on pg_class  (cost=0.00..18.12 rows=412 width=65)
(7 rows)

We can simulate the optimisation by swapping the column order in the
targetlist.  Note the planner can use Incremental Sort (at least since
3c6fc5820, from about 2 hours ago)

# explain select distinct relkind,relname,count(*) over (partition by relkind) from pg_Class;
                                        QUERY PLAN
------------------------------------------------------------------------------------------
 Unique  (cost=41.26..65.32 rows=412 width=73)
   ->  Incremental Sort  (cost=41.26..62.23 rows=412 width=73)
         Sort Key: relkind, relname, (count(*) OVER (?))
         Presorted Key: relkind
         ->  WindowAgg  (cost=36.01..43.22 rows=412 width=73)
               ->  Sort  (cost=36.01..37.04 rows=412 width=65)
                     Sort Key: relkind
                     ->  Seq Scan on pg_class  (cost=0.00..18.12 rows=412 width=65)
(8 rows)

Not sure if we should be trying to improve that in this patch.  I just
wanted to identify it as something else that perhaps could be done.  I'm
not really all that sure the above query shape makes much sense in the
real world.  Would anyone ever want to use DISTINCT on some results
containing WindowFuncs?

David
Re: Fixing a couple of buglets in how VACUUM sets visibility map bits
On Tue, Jan 10, 2023 at 12:08 PM Peter Geoghegan wrote:
> Actually, FreezeMultiXactId() can fully remove an xmax that has some
> member XIDs >= OldestXmin, provided FRM_NOOP processing isn't
> possible, at least when no individual member is still running. Doesn't
> have to involve transaction aborts at all.
>
> Let me go try to break it that way...

Attached patch shows how this could break.  It adds an assertion that
checks that the expected PD_ALL_VISIBLE/VM_ALL_VISIBLE() conditions hold
at the right point.  It also comments out FreezeMultiXactId()'s FRM_NOOP
handling.  The FRM_NOOP case is really just an optimization, and
shouldn't be needed for correctness.  This is amply demonstrated by
running "meson test" with the patch applied, which will pass without
incident.

I can get the PD_ALL_VISIBLE assertion to fail by following this
procedure with the patch applied:

* Run a plain VACUUM to set all the pages from a table all-visible, but
not all-frozen.

* Set a breakpoint that will hit after all_visible_according_to_vm is
set to true, for an interesting blkno.

* Run VACUUM FREEZE.  We need FREEZE in order to be able to hit the
relevant visibilitymap_set() call site (the one that passes
VISIBILITYMAP_ALL_FROZEN as its flags, without also passing
VISIBILITYMAP_ALL_VISIBLE).  Now all_visible_according_to_vm is set to
true, but we don't have a lock/pin on the same heap page just yet.

* Acquire several non-conflicting row locks on a row on the block in
question, so that a new multi is allocated.

* End every session whose XID is stored in our multi (commit/abort).

* Within GDB, continue from before -- observe assertion failure.

Obviously this scenario doesn't demonstrate the presence of a bug -- not
quite.  But it does prove that we rely on FRM_NOOP to not allow the VM
to become corrupt, which just doesn't make any sense, and can't have
been intended.  At a minimum, it strongly suggests that the current
approach is very fragile.
-- 
Peter Geoghegan

0001-Debug-freeze-map-issue.patch
Description: Binary data
Re: Use windows VMs instead of windows containers on the CI
On Wed, Jan 11, 2023 at 8:20 AM Andres Freund wrote:
> On 2023-01-10 09:22:12 -0600, Justin Pryzby wrote:
> > > There is more than 2x speed gain when VMs are used.
> >
> > One consideration is that if windows runs twice as fast, we'll suddenly
> > start using twice as many resources at cirrus/google/amazon - the
> > windows task has been throttling everything else. Not sure if we should
> > to do anything beyond the limits that cfbot already uses.
>
> I'm not sure we would. cfbot has a time based limit for how often it
> tries to rebuild entries, and I think we were just about keeping up with
> that. In which case we shouldn't, on average, schedule more jobs than we
> currently do. Although peak "job throughput" would be higher.
>
> Thomas?

It currently tries to re-test each patch every 24 hours, but doesn't
achieve that.  It looks like it's currently re-testing every ~30 hours.
Justin's right, we'll consume more non-Windows resources if Windows
speeds up, but not 2x, more like 1.25x when cfbot's own throttling kicks
in.  Or I could change the cycle target to 36 or 48 hours, to spread the
work out more.

Back-of-a-napkin maths:

* there are currently 240 entries in a testable status
* it takes ~0.5 hours to test (because that's the slow Windows time)
* therefore it takes ~120 hours to test them all
* but we can do 4 at a time, so that's ~30 hours to get through them all
  and start again
* that matches what we see:

cfbot=> select created - lag(created) over (order by created) from branch
        where submission_id = 4068;
       ?column?
-----------------------
 
 1 day 06:30:00.265047
 1 day 05:43:59.978949
 1 day 04:13:59.754048
 1 day 05:28:59.811916
 1 day 07:00:00.651655
(6 rows)

If, with this change, we can test in only ~0.25 hours, then we'll only
need 60 hours of Cirrus time to test them all.  With a target of
re-testing every 24 hours, it should now only have to run ~2.5 jobs at
all times.
Having free slots would be kind to Cirrus, and also lower the latency when a new patch is posted (which currently has to wait for a free slot before it can begin). Great news.
Can we let extensions change their dumped catalog schemas?
Hi all,

I've been talking to other Timescale devs about a requested change to
pg_dump, and there's been quite a bit of back-and-forth to figure out
what, exactly, we want.  Any mistakes here are mine, but I think we've
been able to distill it down to the following request:

We'd like to be allowed to change the schema for a table that's been
marked in the past with pg_extension_config_dump().

Unless I'm missing something obvious (please, let it be that) there's no
way to do this safely.  Once you've marked an internal table as dumpable,
its schema is effectively frozen if you want your dumps to work across
versions, because otherwise you'll try to restore that "catalog" data
into a table that has different columns.  And while sometimes you can
make that work, it doesn't in the general case.

We (Timescale) do already change the schemas today, but we pay the
associated costs in that dump/restore doesn't work without manual version
bookkeeping and user fiddling -- and in the worst cases, it appears to
"work" across versions but leaves our catalog tables in an inconsistent
state.  So the request is to come up with a way to support this case.
Some options that have been proposed so far:

1) Don't ask for a new feature, and instead try to ensure infinite
backwards compatibility for those tables.

For extension authors who have already done this -- and have likely done
some heavy architectural lifting to make it work -- this is probably the
first thing that will come to mind, and it was the first thing I said,
too.  But the more I say it, the less acceptable it feels.  Not even
Postgres is expected to maintain infinite catalog compatibility into the
future.  We need to evolve our catalogs, too -- and we already provide
the standard update scripts to perform migrations of those tables, but a
dump/restore doesn't have any way to use them today.

2) Provide a way to record the exact version of an extension in a dump.
Brute-force, but pretty much guaranteed to fix the cross-version problem,
because the dump can't be accidentally restored to an extension version
with a different catalog schema.  Users then manually ALTER EXTENSION ...
UPDATE (or we could even include that in the dump itself, as the final
action).

Doing this by default would punish extensions that don't have this
problem, so it would have to be opt-in in some way.  It's also
unnecessarily strict IMO -- even if we don't have a config table change
in a new version, we'll still require the old extension version to be
available alongside the new version during a restore.  Maybe a tweak on
this idea would be to introduce a catversion for extensions.

3) Provide a way to record the entire internal state of an extension in
a dump.

Every extension is already expected to handle the case where the internal
state is at version X but the installed extension is at version X+N, and
the update scripts we provide will perform the necessary migrations.  But
there's no way to reproduce this case using dump/restore, because dumping
an extension omits its internals.

If a dump could instead include the entire internal state of an
extension, then we'd be guaranteed to reproduce the exact situation that
we already have to support for an in-place upgrade.  After a restore, the
SQL is at version X, the installed extension is some equal or later
version, and all that remains is to run the update scripts, either
manually or within the dump itself.

Like (2), I think there's no way you'd all accept this cost for every
extension.  It'd have to be opt-in.

--

Hopefully that makes a certain amount of sense.  Does it seem like a
reasonable thing to ask?  I'm happy to clarify anything above, and if you
know of an obvious solution I'm missing, I would love to be corrected. :D

Thanks,
--Jacob
Re: [PATCH] Support % wildcard in extension upgrade filenames
Sandro Santilli writes:
> On Mon, Jan 09, 2023 at 05:51:49PM -0500, Tom Lane wrote:
>> ... you still need one script file for each supported upgrade step

> That's exactly the problem we're trying to solve here.
> The include support is nice on itself, but won't solve our problem.

The script-file-per-upgrade-path aspect solves a problem that you have,
whether you admit it or not; I think you simply aren't realizing that
because you have not had to deal with the consequences of your proposed
feature.  Namely that you won't have any control over what the backend
will try to do in terms of upgrade paths.

As an example, suppose that a database has foo 4.0 installed, and the
DBA decides to try to downgrade to 3.0.  With the system as it stands,
if you've provided foo--4.0--3.0.sql then the conversion will go
through, and presumably it will work because you tested that that script
does what it is intended to.  If you haven't provided any such downgrade
script, then ALTER EXTENSION UPDATE will say "Sorry Dave, I'm afraid I
can't do that" and no harm is done.

With the proposed % feature, if foo--%--3.0.sql exists then the system
will invoke it and expect the end result to be a valid 3.0 installation,
whether or not the script actually has any ability to do a downgrade.
Moreover, there isn't any very good way to detect or prevent unsupported
version transitions.  (I suppose you could add code to look at
pg_extension.extversion, but I'm not sure if that works: it looks to me
like we update that before we run the extension script.  Besides which,
if you have to add such code, is that really better than having a number
of one-liner scripts implementing the same check declaratively?)

It gets worse though, because above I'm supposing that 4.0 at least
existed when this copy of foo--%--3.0.sql was made.
Suppose that somebody fat-fingered a package upgrade, such that the
extension fileset available to a database containing foo 4.0 now
corresponds to foo 3.0, and there's no knowledge of 4.0 at all in the
extension scripts.  The DBA trustingly issues ALTER EXTENSION UPDATE,
which will conclude from foo.control that it should update to 3.0, and
invoke foo--%--3.0.sql to do it.  Maybe the odds of success are higher
than zero, but not by much; almost certainly you are going to end with
an extension containing some leftover 4.0 objects, some 3.0 objects, and
maybe some objects with properties that don't exactly match either 3.0
or 4.0.  Even if that state of affairs manages not to cause immediate
problems, it'll surely be a mess whenever somebody tries to re-update to
4.0 or later.

So I really think this is a case of "be careful what you ask for, you
might get it".  Even if PostGIS is willing to put in the amount of
infrastructure legwork needed to make such a design bulletproof, I'm
quite sure nobody else will manage to use such a thing successfully.
I'd rather spend our development effort on a feature that has more than
one use-case.

			regards, tom lane
Re: verbose mode for pg_input_error_message?
On Wed, Jan 04, 2023 at 04:18:59PM -0500, Andrew Dunstan wrote:
> On 2023-01-02 Mo 10:44, Tom Lane wrote:
>> I don't think that just concatenating those strings would make for a
>> pleasant API. More sensible, perhaps, to have a separate function
>> that returns a record. Or we could redefine the existing function
>> that way, but I suspect that "just the primary error" will be a
>> principal use-case.
>>
>> Being able to get the SQLSTATE is likely to be interesting too.
>
> OK, here's a patch along those lines.

My vote would be to redefine the existing pg_input_error_message()
function to return a record, but I recognize that this would inflate the
patch quite a bit due to all the existing uses in the tests.  If this is
the only argument against this approach, I'm happy to help with the
patch.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
wal_compression = method:level
Is it desirable to support specifying a level ?  Maybe there's a concern
about using high compression levels, but I'll start by asking if the
feature is wanted at all.

Previous discussion at: 20210614012412.ga31...@telsasoft.com

>From cb30e17cf19fffa370a887d28d6d7e683d588b71 Mon Sep 17 00:00:00 2001
From: Justin Pryzby
Date: Sun, 14 Mar 2021 17:12:07 -0500
Subject: [PATCH] Use GUC hooks to support compression:level..

..which is useful for zstd, but less so for lz4.

TODO: windows, pglz, zlib case
---
 doc/src/sgml/config.sgml                    |   6 +-
 src/backend/access/transam/xlog.c           |  20
 src/backend/access/transam/xloginsert.c     |   6 +-
 src/backend/access/transam/xlogreader.c     | 124
 src/backend/utils/misc/guc_tables.c         |  39 ++
 src/common/compression.c                    |   3 -
 src/include/access/xlog.h                   |   6 +
 src/include/access/xlogreader.h             |   2 +
 src/include/utils/guc_hooks.h               |   3 +
 src/test/regress/expected/compression.out   |  19 +++
 src/test/regress/expected/compression_1.out |  19 +++
 src/test/regress/sql/compression.sql        |  10 ++
 12 files changed, 220 insertions(+), 37 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 77574e2d4ec..7b34ed630d5 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3161,7 +3161,7 @@ include_dir 'conf.d'
-      wal_compression (enum)
+      wal_compression (string)
       
        wal_compression configuration parameter
       
@@ -3169,7 +3169,7 @@ include_dir 'conf.d'
        This parameter enables compression of WAL using the specified
-       compression method.
+       compression method and optional compression level.
        When enabled, the PostgreSQL server compresses full page images
        written to WAL when  is on or during a base backup.
@@ -3179,6 +3179,8 @@ include_dir 'conf.d'
        was compiled with --with-lz4) and zstd (if
        PostgreSQL was compiled with --with-zstd).
+       A compression level may optionally be specified, by appending the level
+       number after a colon (:)
        The default value is off.
        Only superusers and users with the appropriate SET privilege
        can change this setting.
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 74bd1f5fbe2..e2d81129549 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -125,6 +125,8 @@ bool EnableHotStandby = false; bool fullPageWrites = true; bool wal_log_hints = false; int wal_compression = WAL_COMPRESSION_NONE; +char *wal_compression_string = "no"; /* Overwritten by GUC */ +int wal_compression_level = 1; char *wal_consistency_checking_string = NULL; bool *wal_consistency_checking = NULL; bool wal_init_zero = true; @@ -8118,6 +8120,24 @@ assign_xlog_sync_method(int new_sync_method, void *extra) } } +bool +check_wal_compression(char **newval, void **extra, GucSource source) +{ + int tmp; + if (get_compression_level(*newval, ) != -1) + return true; + + return false; +} + +/* Parse the GUC into integers for wal_compression and wal_compression_level */ +void +assign_wal_compression(const char *newval, void *extra) +{ + wal_compression = get_compression_level(newval, _compression_level); + Assert(wal_compression >= 0); +} + /* * Issue appropriate kind of fsync (if any) for an XLOG output file. 
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c index 47b6d16eaef..93003744ed0 100644 --- a/src/backend/access/transam/xloginsert.c +++ b/src/backend/access/transam/xloginsert.c @@ -906,8 +906,8 @@ XLogCompressBackupBlock(char *page, uint16 hole_offset, uint16 hole_length, case WAL_COMPRESSION_LZ4: #ifdef USE_LZ4 - len = LZ4_compress_default(source, dest, orig_len, - COMPRESS_BUFSIZE); + len = LZ4_compress_fast(source, dest, orig_len, + COMPRESS_BUFSIZE, wal_compression_level); if (len <= 0) len = -1; /* failure */ #else @@ -918,7 +918,7 @@ XLogCompressBackupBlock(char *page, uint16 hole_offset, uint16 hole_length, case WAL_COMPRESSION_ZSTD: #ifdef USE_ZSTD len = ZSTD_compress(dest, COMPRESS_BUFSIZE, source, orig_len, -ZSTD_CLEVEL_DEFAULT); +wal_compression_level); if (ZSTD_isError(len)) len = -1; /* failure */ #else diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c index a61b4a99219..728ca9f18c4 100644 --- a/src/backend/access/transam/xlogreader.c +++ b/src/backend/access/transam/xlogreader.c @@ -17,7 +17,9 @@ */ #include "postgres.h" +#include #include + #ifdef USE_LZ4 #include #endif @@ -30,6 +32,7 @@ #include "access/xlogreader.h" #include "access/xlogrecord.h" #include "catalog/pg_control.h" +#include "common/compression.h" #include
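To make the proposed GUC syntax concrete: the check hook has to split a string like "zstd:3" into a method name and an optional level. Below is a minimal, self-contained sketch of that parsing — the function name, error convention, and default level of 1 are assumptions for illustration, not the patch's actual API (the patch centralizes this in a get_compression_level() helper):

```c
#include <stdlib.h>
#include <string.h>

/*
 * Parse a "method[:level]" compression spec.  On success, copies the
 * method name into *method* and sets *level* (defaulting to 1 when no
 * level is given).  Returns 0 on success, -1 on a syntax error.
 */
static int
parse_compression_spec(const char *spec, char *method, size_t method_size,
                       int *level)
{
    const char *colon = strchr(spec, ':');
    size_t      method_len = colon ? (size_t) (colon - spec) : strlen(spec);

    if (method_len == 0 || method_len >= method_size)
        return -1;              /* empty method name, or buffer too small */
    memcpy(method, spec, method_len);
    method[method_len] = '\0';

    if (colon == NULL)
    {
        *level = 1;             /* no level given: use a default */
        return 0;
    }

    /* everything after the colon must be a non-empty integer */
    {
        char   *end;
        long    l = strtol(colon + 1, &end, 10);

        if (end == colon + 1 || *end != '\0')
            return -1;
        *level = (int) l;
    }
    return 0;
}
```

In GUC terms, the check hook would reject the string on -1, and the assign hook would store the two parsed values, mirroring the check/assign split in the patch.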
Re: [PATCH] Add `verify-system` sslmode to use system CA pool for server cert
On Mon, Jan 9, 2023 at 7:07 AM Jelte Fennema wrote: > I also took a closer look at the code, and the only comment I have is: > > > appendPQExpBuffer(>errorMessage, > > These calls can all be replaced by the recently added libpq_append_conn_error Argh, thanks for the catch. Fixed. > Finally I tested this against a Postgres server I created on Azure and > the new value works as expected. The only thing that I think would be > good to change is the error message when sslmode=verify-full, and > sslrootcert is not provided, but ~/.postgresql/root.crt is also not available. > I think it would be good for the error to mention sslrootcert=system Good idea. The wording I chose in v6 is Either provide the file, use the system's trusted roots with sslrootcert=system, or change sslmode to disable server certificate verification. What do you think? Thanks! --Jacob 1: 7618037c86 ! 1: e7e2d43b18 libpq: add sslrootcert=system to use default CAs @@ src/interfaces/libpq/fe-secure-openssl.c: initialize_SSL(PGconn *conn) + { + char *err = SSLerrmessage(ERR_get_error()); + -+ appendPQExpBuffer(>errorMessage, -+libpq_gettext("could not load system root certificate paths: %s\n"), -+err); ++ libpq_append_conn_error(conn, "could not load system root certificate paths: %s", ++ err); + SSLerrfree(err); + SSL_CTX_free(SSL_context); + return -1; @@ src/interfaces/libpq/fe-secure-openssl.c: initialize_SSL(PGconn *conn) { X509_STORE *cvstore; +@@ src/interfaces/libpq/fe-secure-openssl.c: initialize_SSL(PGconn *conn) +*/ + if (fnbuf[0] == '\0') + libpq_append_conn_error(conn, "could not get home directory to locate root certificate file\n" +- "Either provide the file or change sslmode to disable server certificate verification."); ++ "Either provide the file, use the system's trusted roots with sslrootcert=system, or change sslmode to disable server certificate verification."); + else + libpq_append_conn_error(conn, "root certificate file \"%s\" does not exist\n" +- "Either provide the file or 
change sslmode to disable server certificate verification.", fnbuf); ++ "Either provide the file, use the system's trusted roots with sslrootcert=system, or change sslmode to disable server certificate verification.", fnbuf); + SSL_CTX_free(SSL_context); + return -1; + } ## src/test/ssl/ssl/server-cn-only+server_ca.crt (new) ## @@ 2: c392e1796e ! 2: e4d9731e1e libpq: force sslmode=verify-full for system CAs @@ src/interfaces/libpq/fe-connect.c: connectOptions2(PGconn *conn) + && strcmp(conn->sslrootcert, "system") == 0) + { + conn->status = CONNECTION_BAD; -+ appendPQExpBuffer(>errorMessage, -+libpq_gettext("sslrootcert value \"%s\" invalid when SSL support is not compiled in\n"), -+conn->sslrootcert); ++ libpq_append_conn_error(conn, "sslrootcert value \"%s\" invalid when SSL support is not compiled in", ++ conn->sslrootcert); + return false; + } +#endif @@ src/interfaces/libpq/fe-connect.c: connectOptions2(PGconn *conn) + && strcmp(conn->sslmode, "verify-full") != 0) + { + conn->status = CONNECTION_BAD; -+ appendPQExpBuffer(>errorMessage, -+libpq_gettext("weak sslmode \"%s\" may not be used with sslrootcert=system\n"), -+conn->sslmode); ++ libpq_append_conn_error(conn, "weak sslmode \"%s\" may not be used with sslrootcert=system", ++ conn->sslmode); + return false; + } +#endif From e7e2d43b186a32a414783baa2ca55d88a5d7fe9e Mon Sep 17 00:00:00 2001 From: Jacob Champion Date: Mon, 24 Oct 2022 15:30:25 -0700 Subject: [PATCH v6 1/2] libpq: add sslrootcert=system to use default CAs Based on a patch by Thomas Habets. Note the workaround in .cirrus.yml for a strange interaction between brew and the openssl@3
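For readers skimming the diff above, the guard added in patch 2 boils down to a small compatibility rule between two connection options. Here is a standalone restatement of that rule (simplified: the real code lives in connectOptions2() in fe-connect.c and operates on the PGconn struct; the function below is invented for illustration):

```c
#include <string.h>
#include <stdbool.h>

/*
 * sslrootcert=system means "trust every CA in the system store", which
 * is only safe when the server's hostname is fully verified.  So any
 * sslmode weaker than verify-full is rejected for that setting.
 */
static bool
sslrootcert_system_allowed(const char *sslrootcert, const char *sslmode)
{
    if (sslrootcert == NULL || strcmp(sslrootcert, "system") != 0)
        return true;            /* rule only applies to sslrootcert=system */

    return strcmp(sslmode, "verify-full") == 0;
}
```

Any other sslrootcert value (a file path, or unset) keeps the existing sslmode semantics unchanged.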
Re: ATTACH PARTITION seems to ignore column generation status
I wrote:
> Amit Langote writes:
>> Thanks for the patch.  It looks good, though I guess you said that we
>> should also change the error message that CREATE TABLE ... PARTITION
>> OF emits to match the other cases while we're here.  I've attached a
>> delta patch.

> Thanks.  I hadn't touched that issue because I wasn't entirely sure
> which error message(s) you were unhappy with.  These changes look
> fine offhand.

So, after playing with that a bit ... removing the block in
parse_utilcmd.c allows you to do

regression=# CREATE TABLE gtest_parent (f1 date NOT NULL, f2 bigint, f3 bigint GENERATED ALWAYS AS (f2 * 2) STORED) PARTITION BY RANGE (f1);
CREATE TABLE
regression=# CREATE TABLE gtest_child PARTITION OF gtest_parent (
regression(#    f3 WITH OPTIONS GENERATED ALWAYS AS (f2 * 3) STORED
regression(# ) FOR VALUES FROM ('2016-07-01') TO ('2016-08-01');
CREATE TABLE
regression=# \d gtest_child
                            Table "public.gtest_child"
 Column |  Type  | Collation | Nullable |               Default
--------+--------+-----------+----------+--------------------------------------
 f1     | date   |           | not null |
 f2     | bigint |           |          |
 f3     | bigint |           |          | generated always as (f2 * 3) stored
Partition of: gtest_parent FOR VALUES FROM ('2016-07-01') TO ('2016-08-01')

regression=# insert into gtest_parent values('2016-07-01', 42);
INSERT 0 1
regression=# table gtest_parent;
     f1     | f2 | f3
------------+----+-----
 2016-07-01 | 42 | 126
(1 row)

That is, you can make a partition with a different generated expression
than the parent has.  Do we really want to allow that?
I think it works as far as the backend is concerned, but it breaks
pg_dump, which tries to dump this state of affairs as

CREATE TABLE public.gtest_parent (
    f1 date NOT NULL,
    f2 bigint,
    f3 bigint GENERATED ALWAYS AS ((f2 * 2)) STORED
)
PARTITION BY RANGE (f1);

CREATE TABLE public.gtest_child (
    f1 date NOT NULL,
    f2 bigint,
    f3 bigint GENERATED ALWAYS AS ((f2 * 3)) STORED
);

ALTER TABLE ONLY public.gtest_parent ATTACH PARTITION public.gtest_child FOR VALUES FROM ('2016-07-01') TO ('2016-08-01');

and that fails at reload because the ATTACH PARTITION code path checks
for equivalence of the generation expressions.

This different-generated-expression situation isn't really morally
different from different ordinary DEFAULT expressions, which we do
endeavor to support.  So maybe we should deem this a supported case and
remove ATTACH PARTITION's insistence that the generation expressions
match ... which I think would be a good thing anyway, because that
check-for-same-reverse-compiled-expression business is pretty cheesy in
itself.  AFAIK, 3f7836ff6 took care of the only problem that the backend
would have with this, and pg_dump looks like it will work as long as the
backend will take the ATTACH command.

I also looked into making CREATE TABLE ... PARTITION OF reject this
case; but that's much harder than it sounds, because what we have at the
relevant point is a raw (unanalyzed) expression for the child's
generation expression but a cooked one for the parent's, so there is no
easy way to match the two.

In short, it's seeming like the rule for both partitioning and
traditional inheritance ought to be "a child column must have the same
generated-or-not property as its parent, but their generation
expressions need not be the same".

Thoughts?

			regards, tom lane
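As an aside for readers following along: the "check-for-same-reverse-compiled-expression" approach criticized above can be sketched as a toy, standalone model (this is not the PostgreSQL implementation; the function name and expression strings are invented for illustration). It shows why textual comparison of deparsed expressions is brittle — two semantically identical expressions that deparse with different parenthesization fail the check:

```c
#include <string.h>
#include <stdbool.h>

/*
 * Toy stand-in for ATTACH PARTITION's equivalence test: compare the
 * reverse-compiled (deparsed) text of the parent's and child's
 * generation expressions.  Purely textual, so it is stricter than
 * semantic equivalence.
 */
static bool
generation_exprs_match(const char *parent_expr, const char *child_expr)
{
    return strcmp(parent_expr, child_expr) == 0;
}
```

Note how `"(f2 * 2)"` and `"((f2) * 2)"` are the same expression but would be rejected — one reason the check is "cheesy".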
Wasted Vacuum cycles when OldestXmin is not moving
Hi Hackers,

Vacuum is not able to clean up dead tuples when OldestXmin is not moving
(because of a long-running transaction or when hot_standby_feedback is
behind).  Even though OldestXmin has not moved since the last time it
checked, vacuum keeps retrying every autovacuum_naptime, wasting CPU
cycles and I/O when pages are not in memory.  Can we not bypass the dead
tuple collection and cleanup step until OldestXmin is advanced?

The log below shows the vacuum running every minute:

2023-01-09 08:13:01.364 UTC [727219] LOG:  automatic vacuum of table "postgres.public.t1": index scans: 0
	pages: 0 removed, 6960 remain, 6960 scanned (100.00% of total)
	tuples: 0 removed, 1572864 remain, 786432 are dead but not yet removable
	removable cutoff: 852, which was 2 XIDs old when operation ended
	frozen: 0 pages from table (0.00% of total) had 0 tuples frozen
	index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed
	avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s
	buffer usage: 13939 hits, 0 misses, 0 dirtied
	WAL usage: 0 records, 0 full page images, 0 bytes
	system usage: CPU: user: 0.15 s, system: 0.00 s, elapsed: 0.29 s
2023-01-09 08:14:01.363 UTC [727289] LOG:  automatic vacuum of table "postgres.public.t1": index scans: 0
	pages: 0 removed, 6960 remain, 6960 scanned (100.00% of total)
	tuples: 0 removed, 1572864 remain, 786432 are dead but not yet removable
	removable cutoff: 852, which was 2 XIDs old when operation ended
	frozen: 0 pages from table (0.00% of total) had 0 tuples frozen
	index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed
	avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s
	buffer usage: 13939 hits, 0 misses, 0 dirtied
	WAL usage: 0 records, 0 full page images, 0 bytes
	system usage: CPU: user: 0.14 s, system: 0.00 s, elapsed: 0.29 s

Thanks,
Sirisha
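For illustration, the proposed bypass could be sketched as bookkeeping like the following. This is a hypothetical toy model, not a patch: the struct, field names, and function are invented, and real PostgreSQL code would compare XIDs with TransactionIdFollows() rather than plain equality to handle wraparound.

```c
#include <stdint.h>
#include <stdbool.h>

typedef uint32_t TransactionId;

/* Hypothetical per-table state remembered from the previous vacuum. */
typedef struct VacSkipState
{
    TransactionId last_oldest_xmin; /* removable cutoff used last time */
    bool          had_unremovable;  /* were dead-but-not-removable tuples left? */
} VacSkipState;

/*
 * Sketch of the proposed bypass: if the previous vacuum left behind
 * tuples it could not remove, and the horizon has not advanced since
 * then, scanning the table again cannot remove anything either.
 */
static bool
should_attempt_vacuum(const VacSkipState *st, TransactionId oldest_xmin)
{
    if (!st->had_unremovable)
        return true;            /* nothing was blocked last time */

    /* horizon moved: a retry may now be able to clean up */
    return oldest_xmin != st->last_oldest_xmin;
}
```

With the log above in mind: both runs used removable cutoff 852, so the second run would be skipped under this scheme.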
Re: Allow DISTINCT to use Incremental Sort
On Tue, 10 Jan 2023 at 16:07, Richard Guo wrote:
> Sorry I didn't make myself clear. I mean currently on HEAD in planner.c
> from line 4847 to line 4857, we have the code to make sure we always use
> the more rigorous clause for explicit-sort case. I think this code is
> not necessary, because we have already done that in line 4791 to 4796,
> for both DISTINCT ON and standard DISTINCT.

Thanks for explaining. I'm unsure if that code ever did anything useful,
but I agree that it does nothing now.

>> I've attached an updated patch
>
> The patch looks good to me.

Thanks for having another look. I've now pushed the patch.

David
Re: allowing for control over SET ROLE
On Tue, 2023-01-10 at 11:45 -0500, Robert Haas wrote: > So the risks, which in theory are all very similar, are in practice > far greater in the PostgreSQL context, basically because our default > setup is about 40 years behind the times in terms of implementing > best > practices. I agree that huge improvements could be made with improvements to best practices/defaults. But there are some differences that are harder to fix that way. In postgres, one can attach arbitrary code to pretty much anything, so you need to trust everything you touch. There is no safe postgres equivalent to grepping an untrusted file. > It might be best to repost some of these ideas on a new thread with a > relevant subject line, but I agree that there's some potential here. > Your first idea reminds me a lot of the proposal Tom made in > https://www.postgresql.org/message-id/19327.1533748...@sss.pgh.pa.us > -- except that his mechanism is more general, since you can say whose > code you trust and whose code you don't trust. Noah had a competing > version of that patch, too. But we never settled on an approach. I > still think something like this would be a good idea, and the fact > that you've apparently-independently come up with a similar notion > just reinforces that. Will do, thank you for the reference. -- Jeff Davis PostgreSQL Contributor Team - AWS
Re: [PATCH] Add `verify-system` sslmode to use system CA pool for server cert
On Mon, Jan 9, 2023 at 7:40 AM Andrew Dunstan wrote:
> I'm confused. A client cert might not have a hostname at all, and isn't
> used to verify the connecting address, but to verify the username. It
> needs to have a CN/DN equal to the user name of the connection, or that
> maps to that name via pg_ident.conf.

Right. But I don't know anything about the security model for using a
publicly issued server certificate as a client certificate. So if you
tell me that your only requirement is that the hostname/CN matches an
entry in your ident file, and that you don't need to verify that the
certificate identifying example.org is actually coming from example.org,
or do any sort of online revocation processing to help mitigate the
risks from that, or even handle wildcards or SANs in the cert -- fine,
but I don't know the right questions to ask to review that case for
safety or correctness. It'd be better to ask someone who is already
comfortable with it.

From my perspective, ssl_ca_file=system sure *looks* like something
reasonable for me to choose as a DBA, but I'm willing to guess it's not
actually reasonable for 99% of people. (If you get your pg_ident rule
wrong, for example, the number of people who can attack you is scoped by
the certificates issued by your CA... which for 'system' would be the
entire world.)

By contrast I would have no problem recommending sslrootcert=system as a
default: a certificate can still be misissued, but a would-be attacker
would still have to get you to connect to them. That model and its
risks are, I think, generally well understood by the community.

--Jacob
Re: [PATCH] Support % wildcard in extension upgrade filenames
On Mon, Jan 09, 2023 at 05:51:49PM -0500, Tom Lane wrote:
> Have you considered the idea of instead inventing a "\include" facility
[...]
> cases, you still need one script file for each supported upgrade step

That's exactly the problem we're trying to solve here. The include
support is nice in itself, but won't solve our problem.

--strk;
Re: Fixing a couple of buglets in how VACUUM sets visibility map bits
On Tue, Jan 10, 2023 at 12:19 PM Robert Haas wrote: > I don't understand what distinction you're making. It seems like > hair-splitting to me. We should be able to reproduce problems like > this reliably, at least with the aid of a debugger and some > breakpoints, before we go changing the code. So we can *never* change something defensively, on the basis of a suspected or theoretical hazard, either in backbranches or just on HEAD? Not under any circumstances, ever? > The risk of being wrong > is quite high because the code is subtle, and the consequences of > being wrong are potentially very bad because the code is critical to > data integrity. If the reproducer doesn't require a debugger or other > extreme contortions, then we should consider reducing it to a TAP test > that can be committed. If you agree with that, then I'm not sure what > your last email was complaining about. I was complaining about your prescribing conditions on proceeding with a commit, based on an understanding of things that you yourself acknowledged as incomplete. I cannot imagine how you read that as an unwillingness to test the issue, especially given that I agreed to work on that before you chimed in. > > I have been unable to reproduce the problem, and think it's possible > > that the issue cannot be triggered in practice. Though only through > > sheer luck. Here's why that is: > I guess I'm not very sure that this is sheer luck. That's just my characterization. Other people can make up their own minds. > For the purposes of clarifying my understanding, is this the code > you're principally worried about? > visibilitymap_set(vacrel->rel, blkno, buf, InvalidXLogRecPtr, > vmbuffer, InvalidTransactionId, > VISIBILITYMAP_ALL_FROZEN); Obviously I meant this call site, since it's the only one that passes VISIBILITYMAP_ALL_FROZEN as its flags, without also passing VISIBILITYMAP_ALL_VISIBLE -- in vacuumlazy.c, and in general. 
The other visibilitymap_set() callsite that you quoted is from the second heap pass, where LP_DEAD items are vacuumed and become LP_UNUSED items. That isn't buggy, but it is a silly approach, in that it cares about what the visibility map says about the page being all-visible, as if it might take a dissenting view that needs to be taken into consideration (obviously we know what's going on with the page because we just scanned it ourselves, and determined that it was at least all-visible). -- Peter Geoghegan
Re: BUG: Postgres 14 + vacuum_defer_cleanup_age + FOR UPDATE + UPDATE
On Tue, 10 Jan 2023 at 20:14, Andres Freund wrote: > > Hi, > > On 2023-01-10 15:03:42 +0100, Matthias van de Meent wrote: > > On Mon, 9 Jan 2023 at 20:34, Andres Freund wrote: > > > It's not too hard to fix in individual places, but I suspect that we'll > > > introduce the bug in future places without some more fundamental > > > protection. > > > > > > Locally I fixed it by clamping vacuum_defer_cleanup_age to a reasonable > > > value > > > in ComputeXidHorizons() and GetSnapshotData(). > > > > I don't think that clamping the value with oldestXid (as seen in patch > > 0001, in GetSnapshotData) is right. > > I agree that using oldestXid to clamp is problematic. > > > > It would clamp the value relative to the oldest frozen xid of all > > databases, which can be millions of transactions behind oldestXmin, > > and thus severely skew the amount of transaction's changes you keep on > > disk (that is, until oldestXid moves past 1000_000). > > What precisely do you mean with "skew" here? Do you just mean that it'd take a > long time until vacuum_defer_cleanup_age takes effect? Somehow it sounds like > you might mean more than that? h->oldest_considered_running can be extremely old due to the global nature of the value and the potential existence of a snapshot in another database that started in parallel to a very old running transaction. Example: With vacuum_defer_cleanup_age set to 100, it is possible that a snapshot in another database (thus another backend) would result in a local intermediate status result of h->o_c_r = 20, h->s_o_n = 20, h->d_o_n = 10030. The clamped offset would then be 20 (clamped using h->o_c_r), which updates h->data_oldest_nonremovable to 10010. The obvious result is that all but the last 20 transactions from this database's data files are available for cleanup, which contradicts with the intention of the vacuum_defer_cleanup_age GUC. > I'm tempted to go with reinterpreting 64bit xids as signed. 
Except that it > seems like a mighty invasive change to backpatch. I'm not sure either. Protecting against underflow by halving the effective valid value space is quite the intervention, but if it is necessary to make this work in a performant manner, it would be worth it. Maybe someone else with more experience can provide their opinion here. Kind regards, Matthias van de Meent
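As an illustration of the underflow hazard being discussed: subtracting vacuum_defer_cleanup_age from a horizon XID must not wrap below the first normal XID. The clamp approach can be sketched as the toy function below. This is deliberately simplified — it uses plain 32-bit arithmetic on a normal XID, ignores the epoch/FullTransactionId machinery, and is not the patch's code (which, per the thread, may instead reinterpret 64-bit XIDs as signed):

```c
#include <stdint.h>

typedef uint32_t TransactionId;

/* XIDs below this are reserved (invalid, bootstrap, frozen). */
#define FirstNormalTransactionId ((TransactionId) 3)

/*
 * Subtract a defer age from a horizon XID, clamping the offset so the
 * result never underflows into the reserved/special XID range.
 */
static TransactionId
subtract_clamped(TransactionId horizon, uint32_t defer_age)
{
    if (defer_age > horizon - FirstNormalTransactionId)
        defer_age = horizon - FirstNormalTransactionId;
    return horizon - defer_age;
}
```

With the clamp, a freshly initialized cluster (small horizon) and a large vacuum_defer_cleanup_age yield FirstNormalTransactionId rather than a wrapped-around "future" XID — which is exactly the failure mode the bug report describes.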
Re: Add SHELL_EXIT_CODE to psql
On Tue, Jan 10, 2023 at 3:54 AM Maxim Orlov wrote:
> On Mon, 9 Jan 2023 at 21:36, Corey Huinker wrote:
>> I chose a name that would avoid collisions with anything a user might
>> potentially throw into their environment, so if the var "OS" is fairly
>> standard, that is a reason to avoid using it. Also, going with our own
>> env var allows us to stay in perfect synchronization with the build's
>> #ifdef WIN32 ... and whatever #ifdefs may come in the future for new
>> OSes. If there is already an environment variable that does that for
>> us, I would rather use that, but I haven't found it.
>
> Perhaps, I didn't make myself clear. Your solution is perfectly adapted
> to our needs. But all Windows since 2000 already have an environment
> variable OS=Windows_NT. So, if env OS is defined and equals Windows_NT,
> it has to be Windows. May we use it in our case? I don't insist, just
> asking.

Ah, that makes more sense. I don't have a strong opinion on which we
should use, and I would probably defer to someone who does windows (and
possibly cygwin) builds more often than I do.
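For concreteness, Maxim's suggestion amounts to a check like the one below. This is just an illustrative sketch, not proposed patch code; the function name is invented, and the caller would pass in getenv("OS"), which is NULL when the variable is unset (as on non-Windows systems):

```c
#include <stdbool.h>
#include <string.h>

/*
 * Treat the environment as Windows when the OS environment variable
 * (set by Windows itself since Windows 2000) equals "Windows_NT".
 */
static bool
env_says_windows(const char *os_value)
{
    return os_value != NULL && strcmp(os_value, "Windows_NT") == 0;
}
```

The trade-off discussed above still applies: this reflects the runtime environment, whereas a variable set by psql itself would track the build-time #ifdef WIN32 decision exactly.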
Re: heapgettup refactoring
On Thu, Jan 5, 2023 at 8:52 AM Peter Eisentraut wrote: > > Ok, let's look through these patches starting from the top then. > > v4-0001-Add-no-movement-scan-helper.patch > > This makes sense overall; there is clearly some duplicate code that can > be unified. > > It appears that during your rebasing you have effectively reverted your > earlier changes that have been committed as > 8e1db29cdbbd218ab6ba53eea56624553c3bef8c. You should undo that. Thanks. I think I have addressed this. I've attached a rebased v5. > I don't understand the purpose of the noinline maker. If it's not > necessary to inline, we can just leave it off, but there is no need to > outright prevent inlining AFAICT. > I have removed it. > I don't know why you changed the if/else sequences. Before, the > sequence was effectively > > if (forward) > { > ... > } > else if (backward) > { > ... > } > else > { > /* it's no movement */ > } > > Now it's changed to > > if (no movement) > { > ... > return; > } > > if (forward) > { > ... > } > else > { > Assert(backward); > ... > } > > Sure, that's the same thing, but it looks less elegant to me. In this commit, you could keep the original ordering of if statements. I preferred no movement scan first because then backwards and forwards scans' code is physically closer to the rest of the code without the intrusion of the no movement scan code. Ultimately, the refactor (in later patches) flips the ordering of if statements at the top from if (scan direction) to if (initial or continue) and this isn't a very interesting distinction for no movement scans. By dealing with no movement scan at the top, I didn't have to handle no movement scans in the initial and continue branches in the new structure. Also, I will note that patches 4-6 at least and perhaps 4-7 do not make sense as separate commits. I separated them for ease of review. Each is correct and passes tests but is not really an improvement without the others. 
- Melanie From 9b51959c483812c30ca848b66a0e6e3a807ab03f Mon Sep 17 00:00:00 2001 From: Melanie Plageman Date: Mon, 9 Jan 2023 17:33:43 -0500 Subject: [PATCH v5 3/7] Add heapgettup_initial_block() helper --- src/backend/access/heap/heapam.c | 212 --- 1 file changed, 82 insertions(+), 130 deletions(-) diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index c80b547809..b9a1aac3ca 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -521,6 +521,57 @@ heapgettup_no_movement(HeapScanDesc scan) } +static inline BlockNumber +heapgettup_initial_block(HeapScanDesc scan, ScanDirection dir) +{ + Assert(!ScanDirectionIsNoMovement(dir)); + Assert(!scan->rs_inited); + + /* return null immediately if relation is empty */ + if (scan->rs_nblocks == 0 || scan->rs_numblocks == 0) + return InvalidBlockNumber; + + scan->rs_inited = true; + + /* forward and serial */ + if (ScanDirectionIsForward(dir) && scan->rs_base.rs_parallel == NULL) + return scan->rs_startblock; + + /* forward and parallel */ + if (ScanDirectionIsForward(dir)) + { + table_block_parallelscan_startblock_init(scan->rs_base.rs_rd, + scan->rs_parallelworkerdata, + (ParallelBlockTableScanDesc) scan->rs_base.rs_parallel); + + return table_block_parallelscan_nextpage(scan->rs_base.rs_rd, + scan->rs_parallelworkerdata, + (ParallelBlockTableScanDesc) scan->rs_base.rs_parallel); + } + + /* backward parallel scan not supported */ + Assert(scan->rs_base.rs_parallel == NULL); + + /* + * Disable reporting to syncscan logic in a backwards scan; it's not very + * likely anyone else is doing the same thing at the same time, and much + * more likely that we'll just bollix things for forward scanners. + */ + scan->rs_base.rs_flags &= ~SO_ALLOW_SYNC; + + /* + * Start from last page of the scan. Ensure we take into account + * rs_numblocks if it's been adjusted by heap_setscanlimits(). 
+ */ + if (scan->rs_numblocks != InvalidBlockNumber) + return (scan->rs_startblock + scan->rs_numblocks - 1) % scan->rs_nblocks; + + if (scan->rs_startblock > 0) + return scan->rs_startblock - 1; + + return scan->rs_nblocks - 1; +} + /* * heapgettup - fetch next heap tuple * @@ -574,41 +625,16 @@ heapgettup(HeapScanDesc scan, { if (!scan->rs_inited) { - /* - * return null immediately if relation is empty - */ - if (scan->rs_nblocks == 0 || scan->rs_numblocks == 0) + block = heapgettup_initial_block(scan, dir); + + if (block == InvalidBlockNumber) { Assert(!BufferIsValid(scan->rs_cbuf)); tuple->t_data = NULL; return; } - if (scan->rs_base.rs_parallel != NULL) - { -ParallelBlockTableScanDesc pbscan = -(ParallelBlockTableScanDesc) scan->rs_base.rs_parallel; -ParallelBlockTableScanWorker pbscanwork = -scan->rs_parallelworkerdata; - -
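The block-selection logic in the heapgettup_initial_block() helper above can be modeled standalone for the serial-scan cases (parallel scans delegate to table_block_parallelscan_nextpage() and are omitted). This is a simplified illustration, not the patch's code — note in particular the backward case, which must honor a heap_setscanlimits() window that can wrap past the end of the relation:

```c
#include <stdint.h>

typedef uint32_t BlockNumber;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

/*
 * Simplified model of heapgettup_initial_block() for serial scans.
 * numblocks is InvalidBlockNumber when no scan limit was set.
 */
static BlockNumber
initial_block(BlockNumber nblocks, BlockNumber startblock,
              BlockNumber numblocks, int forward)
{
    if (nblocks == 0 || numblocks == 0)
        return InvalidBlockNumber;  /* empty relation or empty limit */

    if (forward)
        return startblock;

    /*
     * Backward: start from the last page of the scan, taking a possibly
     * wrapping heap_setscanlimits() window into account.
     */
    if (numblocks != InvalidBlockNumber)
        return (startblock + numblocks - 1) % nblocks;
    if (startblock > 0)
        return startblock - 1;
    return nblocks - 1;
}
```

For example, a backward scan over a 100-block relation limited to 10 blocks starting at block 95 begins at block 4, because the scan window wraps around the end of the heap.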
Re: Fixing a couple of buglets in how VACUUM sets visibility map bits
On Tue, Jan 10, 2023 at 2:48 PM Peter Geoghegan wrote: > What I actually said was that there is no reason to declare up front > that the only circumstances under which a fix could be committed is > when a clean repro is available. I never said that a test case has > little or no value, and I certainly didn't assert that we definitely > don't need a test case to proceed with a commit -- since I am not in > the habit of presumptuously attaching conditions to such things well > in advance. I don't understand what distinction you're making. It seems like hair-splitting to me. We should be able to reproduce problems like this reliably, at least with the aid of a debugger and some breakpoints, before we go changing the code. The risk of being wrong is quite high because the code is subtle, and the consequences of being wrong are potentially very bad because the code is critical to data integrity. If the reproducer doesn't require a debugger or other extreme contortions, then we should consider reducing it to a TAP test that can be committed. If you agree with that, then I'm not sure what your last email was complaining about. If you disagree, then I don't know why. > I have been unable to reproduce the problem, and think it's possible > that the issue cannot be triggered in practice. Though only through > sheer luck. Here's why that is: > > While pruning will remove aborted dead tuples, freezing will not > remove the xmax of an aborted update unless the XID happens to be < > OldestXmin. With my problem scenario, the page will be all_visible in > prunestate, but not all_frozen -- so it dodges the relevant > visibilitymap_set() call site. > > That just leaves inserts that abort, I think. An aborted insert will > be totally undone by pruning, but that does still leave behind an > LP_DEAD item that needs to be vacuumed in the second heap pass. 
> This means that we can only set the page all-visible/all-frozen in the
> VM in the second heap pass, which also dodges the relevant
> visibilitymap_set() call site.

I guess I'm not very sure that this is sheer luck. It seems like we
could equally well suppose that the people who wrote the code correctly
understood the circumstances under which we needed to avoid calling
visibilitymap_set(), and wrote the code in a way that accomplished that
purpose. Maybe there's contrary evidence or maybe it is actually broken
somehow, but that's not currently clear to me.

For the purposes of clarifying my understanding, is this the code you're
principally worried about?

        /*
         * If the all-visible page is all-frozen but not marked as such yet,
         * mark it as all-frozen.  Note that all_frozen is only valid if
         * all_visible is true, so we must check both prunestate fields.
         */
        else if (all_visible_according_to_vm && prunestate.all_visible &&
                 prunestate.all_frozen &&
                 !VM_ALL_FROZEN(vacrel->rel, blkno, &vmbuffer))
        {
            /*
             * We can pass InvalidTransactionId as the cutoff XID here,
             * because setting the all-frozen bit doesn't cause recovery
             * conflicts.
             */
            visibilitymap_set(vacrel->rel, blkno, buf, InvalidXLogRecPtr,
                              vmbuffer, InvalidTransactionId,
                              VISIBILITYMAP_ALL_FROZEN);
        }

Or maybe this one?

    if (PageIsAllVisible(page))
    {
        uint8       flags = 0;
        uint8       vm_status = visibilitymap_get_status(vacrel->rel,
                                                         blkno, vmbuffer);

        /* Set the VM all-frozen bit to flag, if needed */
        if ((vm_status & VISIBILITYMAP_ALL_VISIBLE) == 0)
            flags |= VISIBILITYMAP_ALL_VISIBLE;
        if ((vm_status & VISIBILITYMAP_ALL_FROZEN) == 0 && all_frozen)
            flags |= VISIBILITYMAP_ALL_FROZEN;

        Assert(BufferIsValid(*vmbuffer));
        if (flags != 0)
            visibilitymap_set(vacrel->rel, blkno, buffer, InvalidXLogRecPtr,
                              *vmbuffer, visibility_cutoff_xid, flags);
    }

These are the only two call sites in vacuumlazy.c where I can see there
being a theoretical risk of the kind of problem that you're describing.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
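For readers unfamiliar with the visibility map: the second call site quoted above computes which VM bits are missing and need to be set. That flag computation can be restated standalone as follows (the bit values match PostgreSQL's visibilitymap.h; the wrapper function is invented for illustration):

```c
#include <stdint.h>

/* Bit values as defined in src/include/access/visibilitymap.h */
#define VISIBILITYMAP_ALL_VISIBLE 0x01
#define VISIBILITYMAP_ALL_FROZEN  0x02

/*
 * Given the VM's current status byte for a page and whether the page
 * was found to be all-frozen, return the flag bits that still need to
 * be set (0 means the VM is already up to date).
 */
static uint8_t
vm_bits_to_set(uint8_t vm_status, int page_all_frozen)
{
    uint8_t flags = 0;

    if ((vm_status & VISIBILITYMAP_ALL_VISIBLE) == 0)
        flags |= VISIBILITYMAP_ALL_VISIBLE;
    if ((vm_status & VISIBILITYMAP_ALL_FROZEN) == 0 && page_all_frozen)
        flags |= VISIBILITYMAP_ALL_FROZEN;

    return flags;
}
```

In other words, visibilitymap_set() is only called when at least one bit is actually missing, which is the behavior under discussion.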
Re: Fixing a couple of buglets in how VACUUM sets visibility map bits
On Tue, Jan 10, 2023 at 11:47 AM Peter Geoghegan wrote:
> In summary, I think that there is currently no way that we can have
> the VM (or the PD_ALL_VISIBLE flag) concurrently unset, while leaving
> the page all_frozen. It can happen and leave the page all_visible, but
> not all_frozen, due to these very fine details. (Assuming I haven't
> missed another path to the problem with aborted Multis or something,
> but looks like I haven't.)

Actually, FreezeMultiXactId() can fully remove an xmax that has some
member XIDs >= OldestXmin, provided FRM_NOOP processing isn't possible,
at least when no individual member is still running. Doesn't have to
involve transaction aborts at all.

Let me go try to break it that way...

-- 
Peter Geoghegan
Re: logical decoding and replication of sequences, take 2
On Tue, Jan 10, 2023 at 1:32 PM Tomas Vondra wrote: > 0001 is a fix for the pre-existing issue in logicalmsg_decode, > attempting to build a snapshot before getting into a consistent state. > AFAICS this only affects assert-enabled builds and is otherwise > harmless, because we are not actually using the snapshot (apply gets a > valid snapshot from the transaction). > > This is mostly the fix I shared in November, except that I kept the call > in decode.c (per comment from Andres). I haven't added any argument to > SnapBuildProcessChange because we may need to backpatch this (and it > didn't seem much simpler, IMHO). I tend to associate transactional behavior with snapshots, so it looks odd to see code that builds a snapshot only when the message is non-transactional. I think that a more detailed comment spelling out the reasoning would be useful here. > This however brings me to the original question what's the purpose of > this patch - and that's essentially keeping sequences up to date to make > them usable after a failover. We can't generate values from the sequence > on the subscriber, because it'd just get overwritten. And from this > point of view, it's also fine that the sequence is slightly ahead, > because that's what happens after crash recovery anyway. And we're not > guaranteeing the sequences to be gap-less. I agree that it's fine for the sequence to be slightly ahead, but I think that it can't be too far ahead without causing problems. Suppose for example that transaction #1 creates a sequence. Transaction #2 does nextval on the sequence a bunch of times and inserts rows into a table using the sequence values as the PK. It's fine if the nextval operations are replicated ahead of the commit of transaction #2 -- in fact I'd say it's necessary for correctness -- but they can't precede the commit of transaction #1, since then the sequence won't exist yet. 
Likewise, if there's an ALTER SEQUENCE that creates a new relfilenode, I think that needs to act as a barrier: non-transactional changes that happened before that transaction must be replicated before it is replicated, and those that happened after it must be replayed after it is replicated. Otherwise, at the very least, there will be states visible on the standby that were never visible on the origin server, and maybe we'll just straight up get the wrong answer. For instance:

1. nextval, setting last_value to 3
2. ALTER SEQUENCE, getting a new relfilenode, and also setting last_value to 19
3. nextval, setting last_value to 20

If 3 happens before 2, the sequence ends up in the wrong state. Maybe you've already got this and similar cases totally correctly handled, I'm not sure, just throwing it out there.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
Re: Fixing a couple of buglets in how VACUUM sets visibility map bits
On Tue, Jan 10, 2023 at 10:50 AM Robert Haas wrote:
> Look, I don't want to spend time arguing about what seem to me to be
> basic principles of good software engineering. When I don't put test
> cases into my patches, people complain at me and tell me that I'm a
> bad software engineer because I didn't include test cases. Your
> argument here seems to be that you're such a good software engineer
> that you don't need any test cases to know what the bug is or that
> you've fixed it correctly.

That's not what I said. This is a straw man. What I actually said was that there is no reason to declare up front that the only circumstances under which a fix could be committed is when a clean repro is available. I never said that a test case has little or no value, and I certainly didn't assert that we definitely don't need a test case to proceed with a commit -- since I am not in the habit of presumptuously attaching conditions to such things well in advance.

> I don't particularly appreciate the implication that I either lack
> relevant expertise or don't actually take time, either.

The implication was only that you didn't take the time. Clearly you have the expertise. Obviously you're very far from stupid.

I have been unable to reproduce the problem, and think it's possible that the issue cannot be triggered in practice, though only through sheer luck. Here's why that is:

While pruning will remove aborted dead tuples, freezing will not remove the xmax of an aborted update unless the XID happens to be < OldestXmin. With my problem scenario, the page will be all_visible in prunestate, but not all_frozen -- so it dodges the relevant visibilitymap_set() call site. That just leaves inserts that abort, I think. An aborted insert will be totally undone by pruning, but that does still leave behind an LP_DEAD item that needs to be vacuumed in the second heap pass.
This means that we can only set the page all-visible/all-frozen in the VM in the second heap pass, which also dodges the relevant visibilitymap_set() call site. In summary, I think that there is currently no way that we can have the VM (or the PD_ALL_VISIBLE flag) concurrently unset, while leaving the page all_frozen. It can happen and leave the page all_visible, but not all_frozen, due to these very fine details. (Assuming I haven't missed another path to the problem with aborted Multis or something, but looks like I haven't.) -- Peter Geoghegan
Re: Transparent column encryption
> On Jan 10, 2023, at 9:26 AM, Mark Dilger wrote:
>
>     -- Cryptographically connected to the encrypted record
>     patient_id BIGINT NOT NULL,
>     patient_ssn CHAR(11),
>
>     -- The encrypted record
>     patient_record TEXT ENCRYPTED WITH (column_encryption_key = cek1,
>         column_encryption_salt = (patient_id, patient_ssn)),

As you mention upthread, tying columns together creates problems for statements that only operate on a subset of columns. Allowing schema designers a choice about tying the encrypted column to zero or more other columns allows them to choose which works best for their security needs. The example above would make a statement like

    UPDATE patient_record SET patient_record = $1 \bind '{some json whatever}'

raise an exception at the libpq client level, but maybe that's what schema designers want it to do. If not, they should omit the column_encryption_salt option in the create table statement; but if so, they should expect to have to specify the other columns as part of the update statement, possibly as part of the where clause, like

    UPDATE patient_record SET patient_record = $1
    WHERE patient_id = 12345 AND patient_ssn = '111-11-'
    \bind '{some json record}'

and have libpq get the salt column values from the where clause (which may be tricky to implement), or perhaps use some new bind syntax like

    UPDATE patient_record SET patient_record = ($1:$2,$3) -- new, wonky syntax
    WHERE patient_id = $2 AND patient_ssn = $3
    \bind '{some json record}' 12345 '111-11-'

which would be error prone, since the sql statement could specify the ($1:$2,$3) inconsistently with the where clause, or perhaps specify the "new" salt columns even when not changed, like

    UPDATE patient_record SET patient_record = $1,
        patient_id = 12345, patient_ssn = "111-11-"
    WHERE patient_id = 12345 AND patient_ssn = "111-11-"
    \bind '{some json record}'

which looks kind of nuts at first glance, but is grammatically consistent with cases where one or both of the patient_id or patient_ssn are also being changed, like

    UPDATE patient_record SET patient_record = $1,
        patient_id = 98765, patient_ssn = "999-99-"
    WHERE patient_id = 12345 AND patient_ssn = "111-11-"
    \bind '{some json record}'

Or, of course, you can ignore these suggestions or punt them to some future patch that extends the current work, rather than trying to get it all done in the first column encryption commit. But it seems useful to think about what future directions would be, to avoid coding ourselves into a corner, making such future work harder.

— Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
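One way to picture why tying the record to salt columns forces them into every statement: conceptually, the client derives per-row associated data from the salt columns, and any decryption supplied with different salt values derives a different value and fails validation. Below is a toy model of that binding only — it uses FNV-1a hashing, which is NOT real cryptography, and none of the names here are the proposed patch's API; a real implementation would feed the salt columns into the AEAD associated data.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Toy FNV-1a fold; stands in for real AEAD associated-data binding. */
static uint64_t
fnv1a(uint64_t h, const void *data, size_t len)
{
    const unsigned char *p = data;

    while (len-- > 0)
    {
        h ^= *p++;
        h *= UINT64_C(1099511628211);
    }
    return h;
}

/*
 * Derive the value that (in this toy model) cryptographically ties
 * patient_record to the salt columns.  An UPDATE that cannot supply the
 * same patient_id and patient_ssn derives a different tag, so the
 * encrypted record would fail to validate.
 */
uint64_t
row_salt_tag(int64_t patient_id, const char *patient_ssn)
{
    uint64_t h = UINT64_C(14695981039346656037);

    h = fnv1a(h, &patient_id, sizeof(patient_id));
    return fnv1a(h, patient_ssn, strlen(patient_ssn));
}
```

The point of the sketch is only that the tag depends on every salt column, which is why the client must see all of them in the statement.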
Re: RFC: logical publication via inheritance root?
On Mon, Jan 9, 2023 at 12:41 AM Aleksander Alekseev wrote:
> I would like to point out that we shouldn't necessarily support
> multiple inheritance in all the possible cases, at least not in the
> first implementation. Supporting simple cases of inheritance would be
> already a valuable feature even if it will have certain limitations.

I agree. What I'm trying to avoid is the case where replication works nicely for a table until someone performs an ALTER TABLE ... [NO] INHERIT, and then Something Bad happens because we can't support the new edge case. If every inheritance tree is automatically opted into this new publication behavior, I think it'll be easier to hit that by accident, making the whole thing feel brittle. By contrast, if we have to opt specific tables into this feature by marking them in the catalog, then not only will it be harder to hit by accident (because we can document the requirements for the marker function, and then it's up to the callers/extension authors/DBAs to maintain those requirements), but we even have the chance to bail out during an inheritance change if we see that the table is marked in this way.

Two general pieces of progress to report:

1) I'm playing around with a marker in pg_inherits, where the inhseqno is set to a sentinel value (0) for an inheritance relationship that has been marked for logical publication. The intent is that the pg_inherits helpers will prevent further inheritance relationships when they see that marker, and reusing inhseqno means we can make use of the existing index to do the lookups. An example:

    =# CREATE TABLE root (a int);
    =# CREATE TABLE root_p1 () INHERITS (root);
    =# SELECT pg_set_logical_root('root_p1', 'root');

and then any data written to root_p1 gets replicated via root instead, if publish_via_partition_root = true. If root_p1 is set up with extra columns, they'll be omitted from replication.
2) While this strategy works well for ongoing replication, it's not enough to get the initial synchronization correct. The subscriber still does a COPY of the root table directly, missing out on all the logical descendant data. The publisher will have to tell the subscriber about the relationship somehow, and older subscriber versions won't understand how to use that (similar to how old subscribers can't correctly handle row filters). --Jacob
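A minimal sketch of the guard the pg_inherits helpers could apply under the sentinel scheme described above — the sentinel value, function name, and parameters here are all hypothetical illustrations, not the actual patch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical sentinel: inhseqno == 0 marks a relationship used for
 * logical publication via the inheritance root.
 */
#define LOGICAL_ROOT_SENTINEL 0

/*
 * Decide whether a new inheritance relationship may be added, given the
 * child's existing inhseqno entries.  If any entry carries the
 * sentinel, further inheritance is refused -- the "bail out during an
 * inheritance change" behavior described in the message above.
 */
bool
can_add_inheritance(const int32_t *inhseqnos, int n)
{
    for (int i = 0; i < n; i++)
    {
        if (inhseqnos[i] == LOGICAL_ROOT_SENTINEL)
            return false;
    }
    return true;
}
```

In the real catalog code this check would run inside the existing pg_inherits scan, reusing the index lookup the message mentions.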
Re: Show various offset arrays for heap WAL records
Hi, On 2023-01-09 19:59:42 -0800, Peter Geoghegan wrote: > On Mon, Jan 9, 2023 at 1:58 PM Andres Freund wrote: > > A couple times when investigating data corruption issues, the last time just > > yesterday in [1], I needed to see the offsets affected by PRUNE and VACUUM > > records. As that's probably not just me, I think we should make that change > > in-tree. > > I remember how useful this was when we were investigating that early > bug in 14, that turned out to be in parallel VACUUM. So I'm all in > favor of it. Cool. > > The attached patch adds details to XLOG_HEAP2_PRUNE, XLOG_HEAP2_VACUUM, > > XLOG_HEAP2_FREEZE_PAGE. > > I'm bound to end up doing the same in index access methods. Might make > sense for the utility routines to live somewhere more centralized, at > least when code reuse is likely. Practically every index AM has WAL > records that include a sorted page offset number array, just like > these ones. It's a very standard thing, obviously. Hm, there doesn't seem to be a great location for them today. I guess we could add something like src/include/access/rmgrdesc_utils.h? And put the implementation in src/backend/access/rmgrdesc/rmgrdesc_utils.c? I first was thinking of just rmgrdesc.[ch], but custom rmgrs added src/bin/pg_waldump/rmgrdesc.[ch] ... > > I chose to include infomask[2] for the different freeze plans mainly because > > it looks odd to see different plans without a visible reason. But I'm not > > sure > > that's the right choice. > > I don't think that it is particularly necessary to do so in order for > the output to make sense -- pg_waldump is inherently a tool for > experts. What it comes down to for me is whether or not this > information is sufficiently useful to display, and/or can be (or needs > to be) controlled via some kind of verbosity knob. It seemed useful enough to me, but I likely also stare more at this stuff than most. Compared to the list of offsets it's not that much content. 
> How hard would it be to invent a general mechanism to control the verbosity > of what we'll show for each WAL record? Nontrivial, I'm afraid. We don't pass any relevant parameters to rm_desc: void(*rm_desc) (StringInfo buf, XLogReaderState *record); so we'd need to patch all of them. That might be worth doing at some point, but I don't want to tackle it right now. Greetings, Andres Freund
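For illustration, a shared helper of the kind being discussed for rmgrdesc_utils.c might look roughly like this — all names are hypothetical, and the real code would append to a StringInfo rather than the plain snprintf buffer used here to keep the sketch self-contained:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical shared rmgrdesc utility: append a sorted page-offset
 * array (as found in PRUNE/VACUUM-style WAL records) to a description
 * buffer.  Heap and index AM desc routines could share this instead of
 * open-coding the loop in each rmgr.
 */
void
offset_array_desc(char *buf, size_t bufsz, const uint16_t *offsets, int n)
{
    size_t      used;

    used = snprintf(buf, bufsz, "offsets: [");
    for (int i = 0; i < n; i++)
        used += snprintf(buf + used, bufsz - used, "%s%u",
                         i > 0 ? ", " : "", (unsigned) offsets[i]);
    snprintf(buf + used, bufsz - used, "]");
}
```

The output format shown is just one plausible choice; whatever convention is picked, centralizing it is what makes the pg_waldump output consistent across rmgrs.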
Re: Use windows VMs instead of windows containers on the CI
Hi, On 2023-01-10 09:22:12 -0600, Justin Pryzby wrote: > > There is more than 2x speed gain when VMs are used. > > One consideration is that if windows runs twice as fast, we'll suddenly > start using twice as many resources at cirrus/google/amazon - the > windows task has been throttling everything else. Not sure if we should > to do anything beyond the limits that cfbot already uses. I'm not sure we would. cfbot has a time based limit for how often it tries to rebuild entries, and I think we were just about keeping up with that. In which case we shouldn't, on average, schedule more jobs than we currently do. Although peak "job throughput" would be higher. Thomas? Greetings, Andres Freund
Re: psql: Add role's membership options to the \du+ command
Added the patch to the open commitfest: https://commitfest.postgresql.org/42/4116/ Feel free to reject if it's not interesting. -- Pavel Luzanov
Re: Todo: Teach planner to evaluate multiple windows in the optimal order
On 10/01/23 10:53, David Rowley wrote:
> the total cost is the same for both of these, but the execution time
> seems to vary quite a bit.

This is really weird. I tried it different ways (to rule out any issues due to caching) and the execution time varied in spite of having the same cost.

> Maybe we should try and do this for DISTINCT queries if the
> distinct_pathkeys match the orderby_pathkeys. That seems a little less
> copout-ish. If the ORDER BY is the same as the DISTINCT then it seems
> likely that the ORDER BY might opt to use the Unique path for DISTINCT
> since it'll already have the correct pathkeys. However, if the ORDER BY
> has fewer columns then it might be cheaper to Hash Aggregate and then
> sort all over again, especially so when the DISTINCT removes a large
> proportion of the rows.

Aren't the ORDER BY pathkeys always fewer than the DISTINCT pathkeys? distinct pathkeys = order by pathkeys + window function pathkeys.

I take your point that it is okay to push down the ORDER BY clause if DISTINCT is doing a unique sort. But the problem (at least from what I am facing) is that DISTINCT does not care about the pushed-down sort keys; it goes with hashagg or unique based on some other logic (mentioned below). Consider the following (with the distinct clause restriction removed):

    if (parse->distinctClause)
    {
        List       *distinct_pathkeys =
            make_pathkeys_for_sortclauses(root, parse->distinctClause,
                                          root->processed_tlist);

        if (!compare_pathkeys(distinct_pathkeys, orderby_pathkeys)==1) // distinct key > order by key
            skip = true;    // this is used to skip order by pushdown
    }

CASE #1:

    explain (costs off) select distinct a,b, min(a) over (partition by a),
        sum (a) over (partition by a) from abcd order by a,b;
                            QUERY PLAN
    -----------------------------------------------------------
     Sort
       Sort Key: a, b
       ->  HashAggregate
             Group Key: a, b, min(a) OVER (?), sum(a) OVER (?)
             ->  WindowAgg
                   ->  Sort
                         Sort Key: a, b
                         ->  Seq Scan on abcd
    (8 rows)

    explain (costs off) select distinct a,b,c, min(a) over (partition by a),
        sum (a) over (partition by a) from abcd order by a,b,c;
                             QUERY PLAN
    --------------------------------------------------------------
     Sort
       Sort Key: a, b, c
       ->  HashAggregate
             Group Key: a, b, c, min(a) OVER (?), sum(a) OVER (?)
             ->  WindowAgg
                   ->  Sort
                         Sort Key: a, b, c
                         ->  Seq Scan on abcd
    (8 rows)

No matter how many columns are pushed down, it does hashagg. On the other hand:

CASE #2:

    EXPLAIN (costs off) SELECT DISTINCT depname, empno,
        min(salary) OVER (PARTITION BY depname) depminsalary,
        sum(salary) OVER (PARTITION BY depname) depsalary
    FROM empsalary ORDER BY depname, empno;
                                   QUERY PLAN
    ------------------------------------------------------------------------
     Unique
       ->  Sort
             Sort Key: depname, empno, (min(salary) OVER (?)), (sum(salary) OVER (?))
             ->  WindowAgg
                   ->  Sort
                         Sort Key: depname, empno
                         ->  Seq Scan on empsalary
    (7 rows)

    EXPLAIN (costs off) SELECT DISTINCT depname, empno, enroll_date,
        min(salary) OVER (PARTITION BY depname) depminsalary,
        sum(salary) OVER (PARTITION BY depname) depsalary
    FROM empsalary ORDER BY depname, empno, enroll_date;
                                   QUERY PLAN
    ------------------------------------------------------------------------
     Unique
       ->  Sort
             Sort Key: depname, empno, enroll_date, (min(salary) OVER (?)), (sum(salary) OVER (?))
             ->  WindowAgg
                   ->  Sort
                         Sort Key: depname, empno, enroll_date
                         ->  Seq Scan on empsalary
    (7 rows)

It keeps doing Unique. In both cases, compare_pathkeys(distinct_pathkeys, orderby_pathkeys) returns 1.

Looking a bit further, the planner is choosing things correctly. I can see the cost of Unique being higher in the 1st case and lower in the 2nd case. But the point is: if the sort for the ORDER BY is pushed down, shouldn't there be some discount on the cost of the Unique sort (so that there is more possibility of it being favorable compared to HashAgg in certain cases)? Again, the cost of the Unique node is taken as the cost of the sort node as-is, but for HashAgg, a new cost is computed.
If we do an incremental sort here (for the Unique node), since we have pushed down the ORDER BY, the Unique cost could be reduced and our optimization could be made worthwhile in the DISTINCT case (I assume this is what you intended here). E.g.:

    EXPLAIN SELECT DISTINCT depname, empno, enroll_date,
        min(salary) OVER (PARTITION BY depname) depminsalary,
        sum(salary) OVER (PARTITION BY depname) depsalary
    FROM empsalary ORDER BY depname, empno, enroll_date;
    QUERY PLAN
Re: BUG: Postgres 14 + vacuum_defer_cleanup_age + FOR UPDATE + UPDATE
Hi, On 2023-01-10 15:03:42 +0100, Matthias van de Meent wrote: > On Mon, 9 Jan 2023 at 20:34, Andres Freund wrote: > > On 2023-01-09 17:50:10 +0100, Matthias van de Meent wrote: > > > Wouldn't it be enough to only fix the constructions in > > > FullXidRelativeTo() and widen_snapshot_xid() (as attached, $topic does > > > not occur with the patch), and (optionally) bump the first XID > > > available for any cluster to (FirstNormalXid + 1) to retain the 'older > > > than any running transaction' property? > > > > It's not too hard to fix in individual places, but I suspect that we'll > > introduce the bug in future places without some more fundamental protection. > > > > Locally I fixed it by clamping vacuum_defer_cleanup_age to a reasonable > > value > > in ComputeXidHorizons() and GetSnapshotData(). > > I don't think that clamping the value with oldestXid (as seen in patch > 0001, in GetSnapshotData) is right. I agree that using oldestXid to clamp is problematic. > It would clamp the value relative to the oldest frozen xid of all > databases, which can be millions of transactions behind oldestXmin, > and thus severely skew the amount of transaction's changes you keep on > disk (that is, until oldestXid moves past 1000_000). What precisely do you mean with "skew" here? Do you just mean that it'd take a long time until vacuum_defer_cleanup_age takes effect? Somehow it sounds like you might mean more than that? I'm tempted to go with reinterpreting 64bit xids as signed. Except that it seems like a mighty invasive change to backpatch. Greetings, Andres Freund
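To make the underflow concrete: if vacuum_defer_cleanup_age exceeds the distance between the horizon and the first normal XID, an unclamped 64-bit subtraction wraps into the reserved special-XID range, which is the class of bug under discussion. A minimal sketch of the clamping idea follows — the function name and shape are illustrative only, not the actual patch, and the caller is assumed to pass a horizon in the normal-XID range:

```c
#include <assert.h>
#include <stdint.h>

/* First normal transaction ID; XIDs 0-2 are reserved special values. */
#define FIRST_NORMAL_XID  UINT64_C(3)

/*
 * Subtract a defer age from a 64-bit horizon without underflowing into
 * the special-XID range: clamp to FIRST_NORMAL_XID instead of letting
 * the unsigned subtraction wrap around on a young cluster.
 */
uint64_t
clamp_deferred_horizon(uint64_t horizon, uint64_t defer_age)
{
    if (defer_age >= horizon - FIRST_NORMAL_XID)
        return FIRST_NORMAL_XID;
    return horizon - defer_age;
}
```

Whatever the eventual fix looks like (clamping at the callers versus reinterpreting 64-bit XID arithmetic as signed), the invariant it must preserve is the same: the deferred horizon never falls below the first normal XID.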
Re: Avoiding "wrong tuple length" errors at the end of VACUUM on pg_database update (Backpatch of 947789f to v12 and v13)
Hi, On 2023-01-10 02:57:43 -0500, Tom Lane wrote: > No objections to back-patching the fix, but I wonder if we can find > some mechanical way to prevent this sort of error in future. What about a define that forces external toasting very aggressively for catalog tables, iff they have a toast table? I suspect doing so for non-catalog tables as well would trigger test changes. Running a buildfarm animal with that would at least make issues like this much easier to discover. Greetings, Andres Freund
Re: can while loop in ClockSweepTick function be kind of infinite loop in some cases?
Hi,

On 2023-01-10 13:11:35 -0500, Robert Haas wrote:
> On Tue, Jan 10, 2023 at 12:40 PM Andres Freund wrote:
> > > I think. `expected = originalVictim + 1;` line should be in while loop
> > > (before acquiring spin lock) so that, even in the case above, expected
> > > variable is incremented for each loop and CAS operation will be successful
> > > at some point.
> > > Is my understanding correct? If so, I will send PR for fixing this issue.
> >
> > Yes, I think your understanding might be correct. Interesting that this
> > apparently has never occurred.
>
> Doesn't pg_atomic_compare_exchange_u32 set expected if it fails?

Indeed, so there's no problem.

I wonder if we should make ->nextVictimBuffer a 64bit atomic. At the time the changes went in we didn't (or rather, couldn't) rely on it, but these days we could. I think with a 64bit number we could get rid of ->completePasses and just infer it from ->nextVictimBuffer/NBuffers, removing the need for the spinlock. It's very unlikely that 64bit would ever wrap, and even if it did, it'd just be a small inaccuracy in the assumed number of passes.

OTOH, it's doubtful the overflow handling / the spinlock matters performance wise.

Greetings,

Andres Freund
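To spell out the property being relied on here: like PostgreSQL's pg_atomic_compare_exchange_u32, C11 compare-exchange writes the value actually found back into *expected on failure, so a ClockSweepTick-style retry loop always retries against fresh state and cannot spin forever on a stale snapshot. A standalone sketch (not the actual freelist.c code; the wrap-at-nbuffers behavior is simplified):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/*
 * Sketch of a ClockSweepTick-style retry loop.  On CAS failure,
 * atomic_compare_exchange_strong stores the current value of *counter
 * into expected, so each iteration competes against up-to-date state
 * rather than the value read before the loop started.
 */
uint32_t
advance_tick(_Atomic uint32_t *counter, uint32_t nbuffers)
{
    uint32_t expected = atomic_load(counter);

    for (;;)
    {
        uint32_t wanted = (expected + 1) % nbuffers;

        if (atomic_compare_exchange_strong(counter, &expected, wanted))
            return wanted;
        /* expected now holds whatever value another backend installed */
    }
}
```

This is why the suggested fix of re-reading `originalVictim + 1` inside the loop is unnecessary: the failure path of the CAS already refreshes `expected`.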
Re: Fixing a couple of buglets in how VACUUM sets visibility map bits
On Mon, Jan 9, 2023 at 5:59 PM Peter Geoghegan wrote: > On Mon, Jan 9, 2023 at 12:51 PM Robert Haas wrote: > > I feel that you should at least have a reproducer for these problems > > posted to the thread, and ideally a regression test, before committing > > things. I think it's very hard to understand what the problems are > > right now. > > Hard to understand relative to what, exactly? We're talking about a > very subtle race condition here. > > I'll try to come up with a reproducer, but I *utterly* reject your > assertion that it's a hard requirement, sight unseen. Why should those > be the parameters of the discussion? > > For one thing I'm quite confident that I'm right, with or without a > reproducer. And my argument isn't all that hard to follow, if you have > relevant expertise, and actually take the time. Look, I don't want to spend time arguing about what seem to me to be basic principles of good software engineering. When I don't put test cases into my patches, people complain at me and tell me that I'm a bad software engineer because I didn't include test cases. Your argument here seems to be that you're such a good software engineer that you don't need any test cases to know what the bug is or that you've fixed it correctly. That seems like a surprising argument, but even if it's true, test cases can have considerable value to future code authors, because it allows them to avoid reintroducing bugs that have previously been fixed. In my opinion, it's not worth trying to have automated test cases for absolutely every bug we fix, because many of them would be really hard to develop and executing all of them every time we do anything would be unduly time-consuming. But I can't remember the last time before this that someone wanted to commit a patch for a data corruption issue without even providing a test case that other people can run manually. If you think that is or ought to be standard practice, I can only say that I disagree. 
I don't particularly appreciate the implication that I either lack relevant expertise or don't actually take time, either. I spent an hour looking at your patches yesterday and didn't feel I was very close to understanding 0002 in that time. I feel that if the patches were better written, with relevant comments and test cases and really good commit messages and a lack of extraneous changes, I probably would have gotten a lot further in the same amount of time. There is certainly an alternate explanation, which is that I am stupid. I'm inclined to think that's not the correct explanation, but most stupid people believe that they aren't, so that doesn't really prove anything.

> But even this is unlikely to matter much. Even if I somehow turn out
> to have been completely wrong about the race condition, it is still
> self-evident that the problem of uselessly WAL logging non-changes to
> the VM exists. That doesn't require any concurrent access at all. It's
> a natural consequence of calling visibilitymap_set() with
> VISIBILITYMAP_ALL_FROZEN-only flags. You need only look at the code
> for 2 minutes to see it.

Apparently not, because I spent more time than that.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
Re: releasing ParallelApplyTxnHash when pa_launch_parallel_worker returns NULL
On Tue, Jan 10, 2023 at 9:43 AM Ted Yu wrote:
> On Tue, Jan 10, 2023 at 9:26 AM Ted Yu wrote:
>> On Tue, Jan 10, 2023 at 9:25 AM Ted Yu wrote:
>>> Hi,
>>> I was reading src/backend/replication/logical/applyparallelworker.c .
>>> In `pa_allocate_worker`, when pa_launch_parallel_worker returns NULL, I
>>> think the `ParallelApplyTxnHash` should be released.
>>>
>>> Please see the patch.
>>>
>>> Thanks
>>
>> Here is the patch :-)
>
> In `pa_process_spooled_messages_if_required`, the `pa_unlock_stream` call
> immediately follows `pa_lock_stream`.
> I assume the following is the intended sequence of calls. If this is the
> case, I can add it to the patch.
>
> Cheers
>
> diff --git a/src/backend/replication/logical/applyparallelworker.c
> b/src/backend/replication/logical/applyparallelworker.c
> index 2e5914d5d9..9879b3fff2 100644
> --- a/src/backend/replication/logical/applyparallelworker.c
> +++ b/src/backend/replication/logical/applyparallelworker.c
> @@ -684,9 +684,9 @@ pa_process_spooled_messages_if_required(void)
>      if (fileset_state == FS_SERIALIZE_IN_PROGRESS)
>      {
>          pa_lock_stream(MyParallelShared->xid, AccessShareLock);
> -        pa_unlock_stream(MyParallelShared->xid, AccessShareLock);
>
>          fileset_state = pa_get_fileset_state();
> +        pa_unlock_stream(MyParallelShared->xid, AccessShareLock);
>      }
>
>      /*

Looking closer at the comment above this code and other parts of the file, it seems the order is intentional. Please disregard my email about `pa_process_spooled_messages_if_required`.
Re: can while loop in ClockSweepTick function be kind of infinite loop in some cases?
On Tue, Jan 10, 2023 at 12:40 PM Andres Freund wrote: > > I think. `expected = originalVictim + 1;` line should be in while loop > > (before acquiring spin lock) so that, even in the case above, expected > > variable is incremented for each loop and CAS operation will be successful > > at some point. > > Is my understanding correct? If so, I will send PR for fixing this issue. > > Yes, I think your understanding might be correct. Interesting that this > apparently has never occurred. Doesn't pg_atomic_compare_exchange_u32 set expected if it fails? -- Robert Haas EDB: http://www.enterprisedb.com
Re: Handle infinite recursion in logical replication setup
On 1/10/23 10:17 AM, Amit Kapila wrote:
> On Tue, Jan 10, 2023 at 8:13 AM Jonathan S. Katz wrote:
>> This consistently created the deadlock in my testing. Discussing with
>> Masahiko off-list, this is due to a deadlock from 4 processes: the
>> walsenders on A and B, and the apply workers on A and B. The walsenders
>> are waiting for feedback from the apply workers, and the apply workers
>> are waiting for the walsenders to synchronize (I may be
>> oversimplifying). He suggested I try the above example instead with
>> `synchronous_commit` set to `local`. In this case, I verified that
>> there is no more deadlock, but he informed me that we would not be able
>> to use cascading synchronous replication when "origin=none".
>
> This has nothing to do with the origin feature. I mean this should
> happen with origin=any or even in PG15 without using origin at all.
> Am I missing something? One related point to note is that in physical
> replication cascading replication is asynchronous. See docs [1]
> (Cascading replication is currently asynchronous)

This is not directly related to the origin feature, but the origin feature makes it apparent. There is a common use-case where people want to replicate data synchronously between two tables, and this is enabled by the "origin" feature.

To be clear, my bigger concern is that it's fairly easy for users to create a deadlock situation based on how they set "synchronous_commit" with the origin feature -- this is the main reason why I brought it up. I'm less concerned about the "cascading" case, though I want to try out sync rep between 3 instances to see what happens.

>> If we decide that this is a documentation issue, I'd suggest we improve
>> the guidance around using `synchronous_commit`[1] on the CREATE
>> SUBSCRIPTION page, as the GUC page[2] warns against using `local`:
>
> Yeah, but on Create Subscription page, we have mentioned that it is
> safe to use off for logical replication.
While I think you can infer that it's "safe" for synchronous replication, I don't think it's clear. We say it's "safe to use `off` for logical replication", but provide a lengthy explanation around synchronous logical replication that concludes that it's "advantageous to set synchronous_commit to local or higher" but does not address safety. The first line in the explanation of the parameter links to the `synchronous_commit` GUC which specifically advises against using "local" for synchronous replication.

Additionally, because we say "local" or higher, we increase the risk of the aforementioned in HEAD with the origin feature. I know I'm hammering on this point a bit, but it feels like this is relatively easy to misconfigure with the upcoming "origin" change (I did so myself from reading the devel docs) and we should ensure we guide our users appropriately.

> One can use local or higher for reducing the latency for COMMIT when
> synchronous replication is used in the publisher. Won't using 'local'
> while creating a subscription suffice the need to consistently
> replicate the data? I mean it is equivalent to somebody using levels
> greater than local in case of physical replication. I think in the case
> of physical replication, we won't wait for standby to replicate to
> another node before sending a response, so why wait in the case of
> logical replication? If this understanding is correct, then probably it
> is sufficient to support 'local' for a subscription.

I think this is a good explanation. We should incorporate some version of this into the docs, and add some clarity around the use of the `synchronous_commit` option in `CREATE SUBSCRIPTION`, in particular with the origin feature. I can make an attempt at this.

Perhaps another question: based on this, should we disallow setting values of `synchronous_commit` greater than `local` when using "origin=none"? That might be too strict, but maybe we should warn about the risk of deadlock either in the logs or in the docs.
Thanks,

Jonathan
Re: Avoiding "wrong tuple length" errors at the end of VACUUM on pg_database update (Backpatch of 947789f to v12 and v13)
On Tue, Jan 10, 2023 at 02:57:43AM -0500, Tom Lane wrote: > Michael Paquier writes: >> Any objections about getting 947789f applied to REL_13_STABLE and >> REL_12_STABLE and see this issue completely gone from all the versions >> supported? > > No objections to back-patching the fix... +1 -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: fixing CREATEROLE
On Thu, Jan 5, 2023 at 2:53 PM Robert Haas wrote: > On Tue, Jan 3, 2023 at 3:11 PM Robert Haas wrote: > > Committed and back-patched 0001 with fixes for the issues that you pointed > > out. > > > > Here's a trivial rebase of the rest of the patch set. > > I committed 0001 and 0002 after improving the commit messages a bit. > Here's the remaining two patches back. I've done a bit more polishing > of these as well, specifically in terms of fleshing out the regression > tests. I'd like to move forward with these soon, if nobody's too > vehemently opposed to that. Done now. -- Robert Haas EDB: http://www.enterprisedb.com