Re: recovery modules
On Fri, Jan 27, 2023 at 08:17:46PM -0800, Nathan Bossart wrote:
> On Fri, Jan 27, 2023 at 05:55:42PM -0800, Andres Freund wrote:
>> On 2023-01-27 16:59:10 -0800, Nathan Bossart wrote:
>>> I think it would be weird for the archive module and
>>> recovery module interfaces to look so different, but if that's okay, I can
>>> change it.
>>
>> I'm a bit sad about the archive module case - I wonder if we should change it
>> now, there can't be many users of it out there. And I think it's more likely
>> that we'll eventually want multiple archiving scripts to run concurrently -
>> which will be quite hard with the current interface (no private state).
>
> I'm open to that. IIUC it wouldn't require too many changes to existing
> archive modules, and if it gets us closer to batching or parallelism, it's
> probably worth doing sooner than later.

Here is a work-in-progress patch set for adjusting the archive modules interface. Is this roughly what you had in mind?

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

From f865032283d0c937ff1c47441d70a2b11e89d3f4 Mon Sep 17 00:00:00 2001
From: Nathan Bossart
Date: Fri, 27 Jan 2023 21:01:22 -0800
Subject: [PATCH 1/4] s/ArchiveContext/ArchiveCallbacks

---
 src/backend/postmaster/pgarch.c | 20 ++--
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 8ecdb9ca23..36800127e8 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -97,7 +97,7 @@ char *XLogArchiveLibrary = "";
  */
 static time_t last_sigterm_time = 0;
 static PgArchData *PgArch = NULL;
-static ArchiveModuleCallbacks ArchiveContext;
+static ArchiveModuleCallbacks ArchiveCallbacks;

 /*
@@ -416,8 +416,8 @@ pgarch_ArchiverCopyLoop(void)
 			HandlePgArchInterrupts();

 			/* can't do anything if not configured ... */
-			if (ArchiveContext.check_configured_cb != NULL &&
-				!ArchiveContext.check_configured_cb())
+			if (ArchiveCallbacks.check_configured_cb != NULL &&
+				!ArchiveCallbacks.check_configured_cb())
 			{
 				ereport(WARNING,
 						(errmsg("archive_mode enabled, yet archiving is not configured")));
@@ -518,7 +518,7 @@ pgarch_archiveXlog(char *xlog)
 	snprintf(activitymsg, sizeof(activitymsg), "archiving %s", xlog);
 	set_ps_display(activitymsg);

-	ret = ArchiveContext.archive_file_cb(xlog, pathname);
+	ret = ArchiveCallbacks.archive_file_cb(xlog, pathname);
 	if (ret)
 		snprintf(activitymsg, sizeof(activitymsg), "last was %s", xlog);
 	else
@@ -824,7 +824,7 @@ HandlePgArchInterrupts(void)
 /*
  * LoadArchiveLibrary
  *
- * Loads the archiving callbacks into our local ArchiveContext.
+ * Loads the archiving callbacks into our local ArchiveCallbacks.
  */
 static void
 LoadArchiveLibrary(void)
@@ -837,7 +837,7 @@ LoadArchiveLibrary(void)
 				 errmsg("both archive_command and archive_library set"),
 				 errdetail("Only one of archive_command, archive_library may be set.")));

-	memset(&ArchiveContext, 0, sizeof(ArchiveModuleCallbacks));
+	memset(&ArchiveCallbacks, 0, sizeof(ArchiveModuleCallbacks));

 	/*
 	 * If shell archiving is enabled, use our special initialization function.
@@ -854,9 +854,9 @@ LoadArchiveLibrary(void)
 		ereport(ERROR,
 				(errmsg("archive modules have to define the symbol %s",
 						"_PG_archive_module_init")));

-	(*archive_init) (&ArchiveContext);
+	(*archive_init) (&ArchiveCallbacks);

-	if (ArchiveContext.archive_file_cb == NULL)
+	if (ArchiveCallbacks.archive_file_cb == NULL)
 		ereport(ERROR,
 				(errmsg("archive modules must register an archive callback")));

@@ -869,6 +869,6 @@ LoadArchiveLibrary(void)
 static void
 pgarch_call_module_shutdown_cb(int code, Datum arg)
 {
-	if (ArchiveContext.shutdown_cb != NULL)
-		ArchiveContext.shutdown_cb();
+	if (ArchiveCallbacks.shutdown_cb != NULL)
+		ArchiveCallbacks.shutdown_cb();
 }
-- 
2.25.1

From 1ea6fb293a1e69e6fdd39e1d6b96a0af601d8542 Mon Sep 17 00:00:00 2001
From: Nathan Bossart
Date: Fri, 27 Jan 2023 21:16:34 -0800
Subject: [PATCH 2/4] move archive module exports to dedicated header

---
 contrib/basic_archive/basic_archive.c   |  2 +-
 src/backend/postmaster/pgarch.c         |  1 +
 src/backend/postmaster/shell_archive.c  |  2 +-
 src/backend/utils/misc/guc_tables.c     |  1 +
 src/include/postmaster/archive_module.h | 54 +
 src/include/postmaster/pgarch.h         | 39 --
 6 files changed, 58 insertions(+), 41 deletions(-)
 create mode 100644 src/include/postmaster/archive_module.h

diff --git a/contrib/basic_archive/basic_archive.c b/contrib/basic_archive/basic_archive.c
index 3d29711a31..87bbb2174d 100644
--- a/contrib/basic_archive/basic_archive.c
+++ b/contrib/basic_archive/basic_archive.c
@@ -32,7 +32,7 @@
 #include "common/int.h"
 #include "miscadmin.h"
-#include "postmaster/pgarch.h"
+#include "postmaster/archive_module.h"
 #include "storage/copydir.h"
 #include "storage/fd.h"
 #include "utils/guc.h"
diff --g
Re: Deadlock between logrep apply worker and tablesync worker
On Sat, Jan 28, 2023 at 9:36 AM houzj.f...@fujitsu.com wrote:
>
> On Friday, January 27, 2023 8:16 PM Amit Kapila wrote:
> >
> > On Fri, Jan 27, 2023 at 3:45 PM vignesh C wrote:
> > >
> > > On Mon, 23 Jan 2023 at 10:52, Amit Kapila wrote:
> > > >
> > > > IIRC, this is done to prevent concurrent drops of origin drop say by
> > > > exposed API pg_replication_origin_drop(). See the discussion in [1]
> > > > related to it. If we want we can optimize it so that we can acquire
> > > > the lock on the specific origin as mentioned in comments
> > > > replorigin_drop_by_name() but it was not clear that this operation
> > > > would be frequent enough.
> > >
> > > Here is an attached patch to lock the replication origin record using
> > > LockSharedObject instead of locking pg_replication_origin relation in
> > > ExclusiveLock mode. Now tablesync worker will wait only if the
> > > tablesync worker is trying to drop the same replication origin which
> > > has already been dropped by the apply worker, the other tablesync
> > > workers will be able to successfully drop the replication origin
> > > without any wait.
> > >
> >
> > There is a code in the function replorigin_drop_guts() that uses the
> > functionality introduced by replorigin_exists(). Can we reuse this function for
> > the same?
>
> Maybe we can use SearchSysCacheExists1 to check the existence instead of
> adding a new function.
>

Yeah, I think that would be better.

> One comment about the patch.
>
> @@ -430,23 +445,21 @@ replorigin_drop_by_name(const char *name, bool missing_ok, bool nowait)
> ...
> +	/* Drop the replication origin if it has not been dropped already */
> +	if (replorigin_exists(roident))
> 		replorigin_drop_guts(rel, roident, nowait);
>
> If developer pass missing_ok as false, should we report an ERROR here
> instead of silently return ?
>

One thing that looks a bit odd is that we will anyway have a similar check in replorigin_drop_guts(), which is a static function called from only one place, so, will it be required to check at both places?

-- 
With Regards,
Amit Kapila.
Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)
On Sat, Jan 28, 2023 at 4:42 PM Andres Freund wrote:
> Did you use the same compiler / compilation flags as when elver hit it?
> Clearly Tomas' case was with at least some optimizations enabled.

I did use the same compiler version and optimisation level, clang llvmorg-13.0.0-0-gd7b669b3a303 at -O2.
Re: Something is wrong with wal_compression
Thomas Munro writes:
> Reading Andres's comments and realising how relatively young
> txid_status() is compared to txid_current(), I'm now wondering if we
> shouldn't just disclaim the whole thing in back branches.

My thoughts were trending in that direction too. It's starting to sound like we aren't going to be able to make a fix that we'd be willing to risk back-patching, even if it were completely compatible at the user level.

Still, the idea that txid_status() isn't trustworthy is rather scary. I wonder whether there is a failure mode here that's exhibitable without using that.

			regards, tom lane
Re: Something is wrong with wal_compression
On Sat, Jan 28, 2023 at 4:57 PM Andrey Borodin wrote:
> It's not trustworthy anyway. Xid wraparound might happen during
> reconnect. I suspect we can design a test that will show that it does
> not always show correct results during xid->2pc conversion (there is a
> point in time when xid is not in regular and not in 2pc, and I'm not
> sure ProcArrayLock is held). Maybe there are other edge cases.

I'm not sure I understand the edge cases, but it is true that this can only give you the answer until the CLOG is truncated, which is pretty arbitrary and you could be unlucky. I guess a reliable version of this would have new policies about CLOG retention, and CLOG segment filenames derived from 64 bit xids so they don't wrap around.

> Anyway, if a user wants to know the status of xid in case of
> disconnection they have prepared xacts.

Yeah. The original proposal mentioned that, but that this was a "lighter" alternative. Reading Andres's comments and realising how relatively young txid_status() is compared to txid_current(), I'm now wondering if we shouldn't just disclaim the whole thing in back branches. Maybe if we want to rescue it in master, there could be a "reliable" argument, defaulting to false, or whatever, and we could eventually make the amortisation improvement.
Re: suppressing useless wakeups in logical/worker.c
On Fri, Jan 27, 2023 at 4:07 AM Tom Lane wrote:
>
> Thanks, pushed.
>
> Returning to the prior patch ... I don't much care for this:
>
> +/* Maybe there will be a free slot in a second... */
> +retry_time = TimestampTzPlusSeconds(now, 1);
> +LogRepWorkerUpdateSyncStartWakeup(retry_time);
>
> We're not moving the goalposts very far on unnecessary wakeups if
> we have to do that. Do we need to get a wakeup on sync slot free?
> Although having to send that to every worker seems ugly. Maybe this
> is being done in the wrong place and we need to find a way to get
> the launcher to handle it.
>
> As for the business about process_syncing_tables being only called
> conditionally, I was already of the opinion that the way it's
> getting called is loony. Why isn't it called from LogicalRepApplyLoop
> (and noplace else)?

Currently, it seems to be called after processing transaction end commands or when we are not in any transaction. As per my understanding, that is when we can ensure the sync between the tablesync and apply workers. For example, say when the tablesync worker is doing the initial copy, the apply worker went ahead and processed some additional xacts (WAL); now the tablesync worker needs to process all those transactions after the initial sync and before it can mark the state as SYNCDONE. So that can be checked only at transaction boundaries.

However, it is not very clear to me why the patch needs the below code.

@@ -3615,7 +3639,33 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 		if (!dlist_is_empty(&lsn_mapping))
 			wait_time = WalWriterDelay;
 		else
-			wait_time = NAPTIME_PER_CYCLE;
+		{
+			TimestampTz nextWakeup = DT_NOEND;
+
+			/*
+			 * Since process_syncing_tables() is called conditionally, the
+			 * tablesync worker start wakeup time might be in the past, and we
+			 * can't know for sure when it will be updated again. Rather than
+			 * spinning in a tight loop in this case, bump this wakeup time by
+			 * a second.
+			 */
+			now = GetCurrentTimestamp();
+			if (wakeup[LRW_WAKEUP_SYNC_START] < now)
+				wakeup[LRW_WAKEUP_SYNC_START] = TimestampTzPlusSeconds(wakeup[LRW_WAKEUP_SYNC_START], 1);

Do we see unnecessary wakeups without this, or delay in sync? BTW, do we need to do something about wakeups in wait_for_relation_state_change()?

-- 
With Regards,
Amit Kapila.
Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)
Hi,

On 2023-01-27 23:18:39 -0500, Tom Lane wrote:
> I also saw it on florican, which is/was an i386 machine using clang and
> pretty standard build options other than
> 	'CFLAGS' => '-msse2 -O2',
> so I think this isn't too much about machine architecture or compiler
> flags.

Ah. Florican dropped off the BF status page and I was too lazy to look deeper. You have a penchant for odd architectures, so it didn't seem too crazy :)

> Machine speed might matter though. elver is a good deal faster than
> florican was, and dikkop is slower yet. I gather Thomas has seen this
> only once on elver, but I saw it maybe a dozen times over a couple of
> years on florican, and now dikkop has hit it after not so many runs.

Re-reading the old thread, it is interesting that you tried hard to reproduce it outside of the BF, without success:
https://postgr.es/m/2398828.1646000688%40sss.pgh.pa.us

Such problems are quite annoying. Last time I hit such a case was
https://postgr.es/m/20220325052654.3xpbmntatyofau2w%40alap3.anarazel.de
but I can't see anything like that being the issue here.

Greetings,

Andres Freund
RE: Time delayed LR (WAS Re: logical replication restrictions)
Hi,

On Friday, January 27, 2023 8:00 PM Amit Kapila wrote:
> On Fri, Jan 27, 2023 at 1:39 PM Takamichi Osumi (Fujitsu) wrote:
> >
> > On Wednesday, January 25, 2023 11:24 PM I wrote:
> > > Attached the updated v22.
> >
> > During self-review, I noticed some changes are required for some
> > variable types related to 'min_apply_delay' value, so have conducted
> > the adjustment changes for the same.
> >
>
> So, you have changed min_apply_delay from int64 to int32, but you haven't
> mentioned the reason for the same? We use 'int' for the similar parameter
> recovery_min_apply_delay, so, ideally, it makes sense but still better to tell your
> reason explicitly.

Yes. It's because I thought I needed to make this feature consistent with recovery_min_apply_delay. This feature now handles the same range as recovery_min_apply_delay, from 0 to INT_MAX, so it should be adjusted to match.

> Few comments
> =
> 1.
> @@ -70,6 +70,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
>  	XLogRecPtr	subskiplsn;		/* All changes finished at this LSN are
>  								 * skipped */
>
> +	int32		subminapplydelay;	/* Replication apply delay (ms) */
> +
>  	NameData	subname;		/* Name of the subscription */
>
>  	Oid			subowner BKI_LOOKUP(pg_authid);	/* Owner of the subscription */
>
> Why are you placing this after subskiplsn? Earlier it was okay because we want
> the 64 bit value to be aligned but now, isn't it better to keep it after
> subowner?

Moved it after subowner.

> 2.
> +
> +	diffms = TimestampDifferenceMilliseconds(GetCurrentTimestamp(),
> +	                                         TimestampTzPlusMilliseconds(finish_ts,
> +	                                                                     MySubscription->minapplydelay));
>
> The above code appears a bit unreadable. Can we store the result of
> TimestampTzPlusMilliseconds() in a separate variable say "TimestampTz
> delayUntil;"?

Agreed. Fixed.

Attached the updated patch v24.

Best Regards,
	Takamichi Osumi

v24-0001-Time-delayed-logical-replication-subscriber.patch
Re: Something is wrong with wal_compression
Hi,

On 2023-01-27 19:57:35 -0800, Andrey Borodin wrote:
> On Fri, Jan 27, 2023 at 7:40 PM Tom Lane wrote:
> >
> > What are you using it for, that you don't care whether the answer
> > is trustworthy?
>
> It's not trustworthy anyway. Xid wraparound might happen during
> reconnect.

I think that part would be approximately fine, as long as you can live with an answer of "too old". The xid returned by txid_status/pg_current_xact_id() is 64bit, and there is code to verify that the relevant range is covered by the clog.

However - there's nothing preventing the xid from becoming too old in case of a crash. If you have an open connection, you can prevent the clog from being truncated by having an open snapshot. But you can't really without using e.g. 2PC if you want to handle crashes - obviously snapshots don't survive them.

I really don't think txid_status() can be used for anything but informational probing of the clog / procarray.

> I suspect we can design a test that will show that it does not always show
> correct results during xid->2pc conversion (there is a point in time when
> xid is not in regular and not in 2pc, and I'm not sure ProcArrayLock is
> held). Maybe there are other edge cases.

Unless I am missing something, that would be very bad [TM], completely independent of txid_status(). The underlying functions like TransactionIdIsInProgress() are used for MVCC.

Greetings,

Andres Freund
Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)
Andres Freund writes:
> Except that you're saying that you hit this on elver (amd64), I think it'd be
> interesting that we see the failure on an arm host, which has a less strict
> memory order model than x86.

I also saw it on florican, which is/was an i386 machine using clang and pretty standard build options other than

	'CFLAGS' => '-msse2 -O2',

so I think this isn't too much about machine architecture or compiler flags.

Machine speed might matter though. elver is a good deal faster than florican was, and dikkop is slower yet. I gather Thomas has seen this only once on elver, but I saw it maybe a dozen times over a couple of years on florican, and now dikkop has hit it after not so many runs.

			regards, tom lane
Re: recovery modules
On Fri, Jan 27, 2023 at 05:55:42PM -0800, Andres Freund wrote:
> On 2023-01-27 16:59:10 -0800, Nathan Bossart wrote:
>> I think it would be weird for the archive module and
>> recovery module interfaces to look so different, but if that's okay, I can
>> change it.
>
> I'm a bit sad about the archive module case - I wonder if we should change it
> now, there can't be many users of it out there. And I think it's more likely
> that we'll eventually want multiple archiving scripts to run concurrently -
> which will be quite hard with the current interface (no private state).

I'm open to that. IIUC it wouldn't require too many changes to existing archive modules, and if it gets us closer to batching or parallelism, it's probably worth doing sooner than later.

> I was wondering why we implement "shell" via a separate mechanism from
> restore_library. I.e. a) why doesn't restore_library default to 'shell',
> instead of an empty string, b) why aren't restore_command et al implemented
> using a restore module.

I think that's the long-term idea. For archive modules, there were concerns about backward compatibility [0].

[0] https://postgr.es/m/CABUevEx8cKy%3D%2BYQU_3NaeXnZV2bSB7Lk6EE%2B-FEcmE4JO4V1hg%40mail.gmail.com

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Re: Something is wrong with wal_compression
Hi,

On 2023-01-27 19:49:17 -0800, Andres Freund wrote:
> It's quite commonly used as part of trigger based replication tools (IIRC
> that's its origin), monitoring, as part of client side logging, as part of
> snapshot management.

Forgot one: Queues.

The way it's used for trigger based replication, queues and also some materialized aggregation tooling, is that there's a trigger that inserts into a "log" table. And that log table has a column into which txid_current() will be inserted. Together with txid_current_snapshot() etc that's used to get a (at least semi) "transactional" order out of such log tables.

I believe that's originally been invented by londiste / skytool, later slony migrated to it. The necessary C code was added as contrib/txid in 1f92630fc4e 2007-10-07 and then moved into core a few days later in 18e3fcc31e7.

For those cases making txid_current() flush would approximately double the WAL flush rate.

Greetings,

Andres Freund
RE: Deadlock between logrep apply worker and tablesync worker
On Friday, January 27, 2023 8:16 PM Amit Kapila wrote:
>
> On Fri, Jan 27, 2023 at 3:45 PM vignesh C wrote:
> >
> > On Mon, 23 Jan 2023 at 10:52, Amit Kapila wrote:
> > >
> > > IIRC, this is done to prevent concurrent drops of origin drop say by
> > > exposed API pg_replication_origin_drop(). See the discussion in [1]
> > > related to it. If we want we can optimize it so that we can acquire
> > > the lock on the specific origin as mentioned in comments
> > > replorigin_drop_by_name() but it was not clear that this operation
> > > would be frequent enough.
> >
> > Here is an attached patch to lock the replication origin record using
> > LockSharedObject instead of locking pg_replication_origin relation in
> > ExclusiveLock mode. Now tablesync worker will wait only if the
> > tablesync worker is trying to drop the same replication origin which
> > has already been dropped by the apply worker, the other tablesync
> > workers will be able to successfully drop the replication origin
> > without any wait.
> >
>
> There is a code in the function replorigin_drop_guts() that uses the
> functionality introduced by replorigin_exists(). Can we reuse this function for
> the same?

Maybe we can use SearchSysCacheExists1 to check the existence instead of adding a new function.

One comment about the patch.

@@ -430,23 +445,21 @@ replorigin_drop_by_name(const char *name, bool missing_ok, bool nowait)
...
+	/* Drop the replication origin if it has not been dropped already */
+	if (replorigin_exists(roident))
 		replorigin_drop_guts(rel, roident, nowait);

If the developer passes missing_ok as false, should we report an ERROR here instead of silently returning?

Best Regards,
Hou zj
Re: Something is wrong with wal_compression
On Fri, Jan 27, 2023 at 7:40 PM Tom Lane wrote:
>
> What are you using it for, that you don't care whether the answer
> is trustworthy?
>

It's not trustworthy anyway. Xid wraparound might happen during reconnect. I suspect we can design a test that will show that it does not always show correct results during xid->2pc conversion (there is a point in time when the xid is not in regular and not in 2pc, and I'm not sure ProcArrayLock is held). Maybe there are other edge cases.

Anyway, if a user wants to know the status of an xid in case of disconnection, they have prepared xacts.

Best regards,

Andrey Borodin.
Re: Something is wrong with wal_compression
Hi,

On 2023-01-27 22:39:56 -0500, Tom Lane wrote:
> Andres Freund writes:
> > On 2023-01-28 11:38:50 +0900, Michael Paquier wrote:
> >> FWIW, my vote goes for a more expensive but reliable function even in
> >> stable branches.
>
> > I very strenuously object. If we make txid_current() (by way of
> > pg_current_xact_id()) flush WAL, we'll cause outages.
>
> What are you using it for, that you don't care whether the answer
> is trustworthy?

It's quite commonly used as part of trigger based replication tools (IIRC that's its origin), monitoring, as part of client side logging, as part of snapshot management.

txid_current() predates pg_xact_status() by well over 10 years. Clearly we had lots of uses for it before pg_xact_status() was around.

Greetings,

Andres Freund
Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)
Hi,

On 2023-01-27 22:23:58 +1300, Thomas Munro wrote:
> After 1000 make check loops, and 1000 make -C src/test/modules/test_shm_mq
> check loops, on the same FBSD 13.1 machine as elver which has failed
> like this once before, I haven't been able to reproduce this on
> REL_12_STABLE.

Did you use the same compiler / compilation flags as when elver hit it? Clearly Tomas' case was with at least some optimizations enabled.

Except that you're saying that you hit this on elver (amd64), I think it'd be interesting that we see the failure on an arm host, which has a less strict memory order model than x86. IIUC elver previously hit this on 12?

Greetings,

Andres Freund
Re: Something is wrong with wal_compression
Andres Freund writes:
> On 2023-01-28 11:38:50 +0900, Michael Paquier wrote:
>> FWIW, my vote goes for a more expensive but reliable function even in
>> stable branches.

> I very strenuously object. If we make txid_current() (by way of
> pg_current_xact_id()) flush WAL, we'll cause outages.

What are you using it for, that you don't care whether the answer is trustworthy?

			regards, tom lane
Re: heapgettup refactoring
On Wed, 25 Jan 2023 at 10:17, Melanie Plageman wrote:
> I've added a comment but I didn't include the function name in it -- I
> find it repetitive when the comments above functions do that -- however,
> I'm not strongly attached to that.

I think the general format for header comments is:

/*
 * <function name>
 *		<summary of what the function does>
 *
 * [Further details]
 */

We've certainly got places that don't follow that, but I don't think that's any reason to have no comment or invent some new format. heapam.c seems to have some other format where we do: "<function name> - <summary>". I generally just try to copy the style from the surrounding code. I think generally, people won't argue if you follow the style from the surrounding code, but there'd be exceptions to that, I'm sure.

I'll skip further review of 0001 here as the whole ScanDirectionNoMovement case is being discussed on the other thread.

v6-0002:

1. heapgettup_initial_block() needs a header comment to mention what it does and what it returns. It would be good to make it obvious that it returns InvalidBlockNumber when there are no blocks to scan.

2. After heapgettup_initial_block(), you're checking "if (block == InvalidBlockNumber)". It might be worth a mention something like:

/*
 * Check if we got to the end of the scan already. This could happen for
 * an empty relation or if parallel workers scanned everything before we
 * got a chance.
 */

the backward scan comment wouldn't mention parallel workers.

v6-0003:

3. Can you explain why you removed the snapshot local variable in heapgettup()?

4. I think it might be a good idea to use unlikely() in "if (!scan->rs_inited)". The idea is to help coax the compiler into moving that code off to a cold path. That's likely especially important if heapgettup_initial_block is inlined, which I see it is marked as.

v6-0004:

5. heapgettup_start_page() and heapgettup_continue_page() both need a header comment to explain what they do and what the inputs and output arguments are.

6. I'm not too sure what the following comment means:

/* block and lineoff now reference the physically next tid */

"block" is just a parameter to the function and its value is not adjusted. The comment makes it sound like something was changed. (I think these might be just not well updated from having split this out from the 0006 patch, as the same comment makes more sense in 0006.)

v6-0005:

7. heapgettup_advance_block() needs a header comment.

8. Is there a reason why heapgettup_advance_block() handles backward scans first? I'd expect you should just follow the lead of the other functions and do ScanDirectionIsForward first.

v6-0006

David
Re: Generating code for query jumbling through gen_node_support.pl
On Tue, Jan 24, 2023 at 03:57:56PM +0900, Michael Paquier wrote:
> Still, my plan here is to enforce the loading of
> pg_stat_statements with compute_query_id = regress and
> utility_query_id = jumble (if needed) in a new buildfarm machine,

Actually, about this specific point, I have been able to set up a buildfarm machine that uses shared_preload_libraries = pg_stat_statements and compute_query_id = regress in the base configuration, which is not covered yet. This works as long as one sets up EXTRA_INSTALL => "contrib/pg_stat_statements" in build_env.

-- 
Michael
Re: Something is wrong with wal_compression
Hi,

On 2023-01-28 11:38:50 +0900, Michael Paquier wrote:
> On Fri, Jan 27, 2023 at 06:06:05AM +0100, Laurenz Albe wrote:
> > On Fri, 2023-01-27 at 16:15 +1300, Thomas Munro wrote:
> >> There is no doubt that the current situation is unacceptable, though, so
> >> maybe we really should just do it and make a faster one later. Anyone else
> >> want to vote on this?
> >
> > I wasn't aware of the existence of pg_xact_status, so I suspect that it
> > is not a widely known and used feature. After reading the documentation,
> > I'd say that anybody who uses it will want it to give a reliable answer.
> > So I'd agree that it is better to make it more expensive, but live up to
> > its promise.
>
> A code search within the Debian packages (codesearch.debian.net) and
> github does not show that it is actually used; pg_xact_status() is
> reported only as part of copies of the Postgres code in the regression
> tests.

Not finding a user at codesearch.debian.net provides useful information for C APIs, but a negative result for an SQL-exposed function doesn't provide any information. Those callers will largely be in application code, which largely won't be in debian.

And as noted two messages up, we wouldn't need to flush in pg_xact_status(), we'd need to flush in pg_current_xact_id()/txid_current().

> FWIW, my vote goes for a more expensive but reliable function even in
> stable branches.

I very strenuously object. If we make txid_current() (by way of pg_current_xact_id()) flush WAL, we'll cause outages.

Greetings,

Andres Freund
Re: Something is wrong with wal_compression
On Fri, Jan 27, 2023, 18:58 Andres Freund wrote:
> Hi,
>
> On 2023-01-27 16:15:08 +1300, Thomas Munro wrote:
> > It would be pg_current_xact_id() that would have to pay the cost of
> > the WAL flush, not pg_xact_status() itself, but yeah that's what the
> > patch does (with some optimisations). I guess one question is whether
> > there are any other reasonable real world uses of
> > pg_current_xact_id(), other than the original goal[1].
>
> txid_current() is a lot older than pg_current_xact_id(), and they're backed by
> the same code afaict. 8.4 I think.
>
> Unfortunately txid_current() is used in plenty monitoring setups IME.
>
> I don't think it's a good idea to make a function that was quite cheap for 15
> years, suddenly be several orders of magnitude more expensive...

As someone working on a monitoring tool that uses it (well, both), +1. We'd have to rethink a few things if this becomes a performance concern.

Thanks,
Maciek
Re: Something is wrong with wal_compression
Hi,

On 2023-01-27 16:15:08 +1300, Thomas Munro wrote:
> It would be pg_current_xact_id() that would have to pay the cost of
> the WAL flush, not pg_xact_status() itself, but yeah that's what the
> patch does (with some optimisations). I guess one question is whether
> there are any other reasonable real world uses of
> pg_current_xact_id(), other than the original goal[1].

txid_current() is a lot older than pg_current_xact_id(), and they're backed by the same code afaict. 8.4 I think.

Unfortunately txid_current() is used in plenty of monitoring setups IME.

I don't think it's a good idea to make a function that was quite cheap for 15 years suddenly become several orders of magnitude more expensive...

Greetings,

Andres Freund
Re: bug: copy progress reporting of backends which run multiple COPYs
On Sat, Jan 21, 2023 at 02:45:40AM +0100, Matthias van de Meent wrote:
> Let me think about it. I think it would work, but I'm not sure that's
> a great option - it adds backend state that we would need to add
> cleanup handles for. But then again, COPY ... TO could use TRIGGER to
> trigger actual COPY FROM statements, which would also break progress
> reporting in a vanilla instance without extensions.
>
> I'm not sure what the right thing to do is here.

Simply disabling COPY reporting for file_fdw does not sound appealing to me, because that can be really useful for users. As long as you rely on two code paths that call the progress reporting separately, things are doomed with the current infrastructure. For example, thinking about some fancy cases, you could create an index that uses a function as an expression, a function that includes utility commands that do progress reporting. Things would equally go wrong in the progress view.

What are the assertions/warnings that get triggered in file_fdw? Joining file_fdw with a plain COPY is surely a fancy case, even if COPY TO would allow that.

>> Would you do anything different in the master branch, with no
>> compatibility constraints? I think the progress reporting would still
>> be limited to one row per backend, not one per CopyFrom().
>
> I think I would at least introduce another parameter to BeginCopyFrom
> for progress reporting (instead of relying on pstate != NULL), like
> how we have a bit in reindex_index's params->options that specifies
> whether we want progress reporting (which is unset for parallel
> workers iirc).

This makes sense, independently of the discussion about what should be done with cross-runs of these APIs. This could be extended with user-controllable options for each one of them. It does not take care of the root of the problem, just bypasses it.

-- 
Michael
Re: Add n_tup_newpage_upd to pg_stat table views
On Fri, Jan 27, 2023 at 6:44 PM Andres Freund wrote: > I don't think that'd make it particularly easy to figure out how often > out-of-space causes non-HOT updates to go out of page, and how often it causes > potential HOT updates to go out of page. If you just have a single index, > it's not too hard, but after that seems decidedly nontrivial. > > But I might just be missing what you're suggesting. It would be useless for that, of course. But it would be a good proxy for what specific indexes force non-hot updates due to HOT safety issues. This would work independently of the issue of what's going on in the heap. That matters too, of course, but in practice the main problem is likely the specific combination of indexes and updates. (Maybe it would just be an issue with heap fill factor, at times, but even then you'd still want to rule out basic HOT safety issues first.) If you see one particular index that gets a far larger number of non-hot updates that are reported as "logical changes to the indexed columns", then dropping that index has the potential to make the HOT update situation far better. -- Peter Geoghegan
Re: Add n_tup_newpage_upd to pg_stat table views
Hi, On 2023-01-27 17:59:32 -0800, Peter Geoghegan wrote: > > I think this might cause some trouble for existing monitoring setups after > > an > > upgrade. Suddenly the number of updates will appear way lower than > > before... And if we end up eventually distinguishing between different > > reasons > > for out-of-page updates, or hot/non-hot out-of-page that'll happen again. > > Uh...no it won't? The new counter is totally independent of the existing > HOT counter, and the transactional all-updates counter. It's just that > there is an enum that encodes which of the two non-transactional "sub > counters" to use (either for HOT updates or new-page-migration > updates). > > > I wish we'd included HOT updates in the total number of updates, and just > > kept > > HOT updates a separate counter that'd always be less than updates in total. > > Uh...we did in fact do it that way to begin with? Sorry, I misread the diff, and then misremembered some old issue. > > From that angle: Perhaps it'd be better to have counter for how many times a > > page is found to be full during an update? > > Didn't Corey propose a patch to add just that? Do you mean something > more specific, like a tracker for when an UPDATE leaves a page full, > without needing to go to a new page itself? Nope, I just had a brainfart. > > Similarly, it's a bit sad that we can't distinguish between the number of > > potential-HOT out-of-page updates and the other out-of-page updates. But > > that'd mean even more counters. > > ISTM that it would make more sense to do that at the index level > instead. It wouldn't be all that hard to teach ExecInsertIndexTuples() > to remember whether each index received the indexUnchanged hint used > by bottom-up deletion, which is approximately the same thing, but > works at the index level. 
I don't think that'd make it particularly easy to figure out how often out-of-space causes non-HOT updates to go out of page, and how often it causes potential HOT updates to go out of page. If you just have a single index, it's not too hard, but after that seems decidedly nontrivial. But I might just be missing what you're suggesting. Greetings, Andres Freund
Re: Something is wrong with wal_compression
On Fri, Jan 27, 2023 at 06:06:05AM +0100, Laurenz Albe wrote: > On Fri, 2023-01-27 at 16:15 +1300, Thomas Munro wrote: >> There is no >> doubt that the current situation is unacceptable, though, so maybe we >> really should just do it and make a faster one later. Anyone else >> want to vote on this? > > I wasn't aware of the existence of pg_xact_status, so I suspect that it > is not a widely known and used feature. After reading the documentation, > I'd say that anybody who uses it will want it to give a reliable answer. > So I'd agree that it is better to make it more expensive, but live up to > its promise. A code search within the Debian packages (codesearch.debian.net) and GitHub does not show that it is actually used; pg_xact_status() only shows up as part of copies of the Postgres code in the regression tests. FWIW, my vote goes for a more expensive but reliable function even in stable branches. Even 857ee8e mentions that this could be used on a lost connection, so we don't even satisfy the use case of the original commit as things stand (right?), because a lost connection could just be a result of a crash, and if crash recovery reassigns the XID, then the client gets it wrong. -- Michael
Re: Small omission in type_sanity.sql
Hi, On 2023-01-27 20:39:04 -0500, Tom Lane wrote: > Andres Freund writes: > > Tom, is there a reason we run the various sanity tests early-ish in the > > schedule? It does seem to reduce their effectiveness a bit... > > Originally, those tests were mainly needed to sanity-check the > hand-maintained initial catalog data, so it made sense to run them > early. It's also kinda useful to have some basic validity testing early on, because if there's something wrong with the catalog values, it'll cause lots of issues later. > > Problems: > > - "Cross-check against pg_type entry" is far too strict about legal > > combinations > > of typstorage > > Perhaps, but it's enforcing policy about what we want in the > initial catalog data, not what is possible to support. True in general, but I don't think it matters much in this specific case. We don't gain much by forbidding 'e' -> 'x' mismatches, given that we allow 'x' -> 'p'. There's a lot more such cases in opr_sanity. There's a lot of tests in it that only make sense for validating the initial catalog contents. It might be useful to run a more lenient version of it later though. > So there's a bit of divergence of goals here too. Maybe we need to split up > the tests into initial-data-only tests (run early) and tests that should > hold for user-created objects too (run late)? Yea, I think so. A bit worried about the duplication that might require. But the *sanity tests do also encode a lot of good cross-checks that are somewhat easy to break in code (and / or have been broken in the past), so I think it's worth pursuing. Greetings, Andres Freund
Re: Add n_tup_newpage_upd to pg_stat table views
On Fri, Jan 27, 2023 at 3:55 PM Andres Freund wrote: > I wonder if it's quite detailed enough - we can be forced to do out-of-page > updates because we actually are out of space, or because we reach the max > number of line pointers we allow in a page. HOT pruning can't remove dead line > pointers, so that doesn't necessarily help. It would be hard to apply that kind of information, I suspect. Maybe it's still worth having, though. > Similarly, it's a bit sad that we can't distinguish between the number of > potential-HOT out-of-page updates and the other out-of-page updates. But > that'd mean even more counters. ISTM that it would make more sense to do that at the index level instead. It wouldn't be all that hard to teach ExecInsertIndexTuples() to remember whether each index received the indexUnchanged hint used by bottom-up deletion, which is approximately the same thing, but works at the index level. This is obviously more useful, because you have index-granularity information that can guide users in how to index to maximize the number of HOT updates. And, even if changing things around didn't lead to the hoped-for improvement in the rate of HOT updates, it would at least still allow the indexes on the table to use bottom-up deletion more often, on average. Admittedly this has some problems. The index_unchanged_by_update() logic probably isn't as sophisticated as it ought to be because it's driven by the statement-level extraUpdatedCols bitmap set, and not a per-tuple test, like the HOT safety test in heap_update() is. But...that should probably be fixed anyway. > I think this might cause some trouble for existing monitoring setups after an > upgrade. Suddenly the number of updates will appear way lower than > before... And if we end up eventually distinguishing between different reasons > for out-of-page updates, or hot/non-hot out-of-page that'll happen again. Uh...no it won't? 
The new counter is totally independent of the existing HOT counter, and the transactional all-updates counter. It's just that there is an enum that encodes which of the two non-transactional "sub counters" to use (either for HOT updates or new-page-migration updates). > I wish we'd included HOT updates in the total number of updates, and just kept > HOT updates a separate counter that'd always be less than updates in total. Uh...we did in fact do it that way to begin with? > From that angle: Perhaps it'd be better to have counter for how many times a > page is found to be full during an update? Didn't Corey propose a patch to add just that? Do you mean something more specific, like a tracker for when an UPDATE leaves a page full, without needing to go to a new page itself? If so, then that does require defining what that really means, because it isn't trivial. Do you assume that all updates have a successor version that is equal in size to that of the UPDATE that gets counted by this hypothetical other counter of yours? -- Peter Geoghegan
Re: Using WaitEventSet in the postmaster
Hi, On 2023-01-28 14:25:38 +1300, Thomas Munro wrote: > The nearby thread about searching for uses of volatile reminded me: we > can now drop a bunch of these in postmaster.c. The patch I originally > wrote to do that as part of this series somehow morphed into an > experimental patch to nuke all global variables[1], Hah. > but of course we should at least drop the now redundant use of volatile and > sig_atomic_t. See attached. +1 Greetings, Andres Freund
Re: recovery modules
Hi, On 2023-01-27 16:59:10 -0800, Nathan Bossart wrote: > On Fri, Jan 27, 2023 at 04:23:19PM -0800, Andres Freund wrote: > >> +typedef bool (*RecoveryRestoreCB) (const char *file, const char *path, > >> + const char > >> *lastRestartPointFileName); > >> +typedef void (*RecoveryArchiveCleanupCB) (const char > >> *lastRestartPointFileName); > >> +typedef void (*RecoveryEndCB) (const char *lastRestartPointFileName); > >> +typedef void (*RecoveryShutdownCB) (void); > > > > I think the signature of these forces bad coding practices, because there's > > no > > way to have state within a recovery module (as there's no parameter for it). > > > > It's possible we would eventually support multiple modules, e.g. restoring > > from shorter term file based archiving and from longer term archiving in > > some > > blob store. Then we'll regret not having a variable for this. > > Are you suggesting that we add a "void *arg" to each one of these? Yes. Or pass a pointer to a struct with a "private_data" style field to all of them. > >> +extern RecoveryModuleCallbacks RecoveryContext; > > > > I think that'll typically be interpreted as a MemoryContext by readers. > > How about RecoveryCallbacks? Callbacks is better than Context here imo, but I think 'Recovery' makes it sound like this actually performs WAL replay or such. Seems like it should be RestoreCallbacks at the very least? > > Also, why is this a global var? Exported too? > > It's needed in xlog.c, xlogarchive.c, and xlogrecovery.c. Would you rather > it be static to xlogarchive.c and provide accessors for the others? Maybe? Something about this feels wrong to me, but I can't entirely put my finger on it. > >> +/* > >> + * Type of the shared library symbol _PG_recovery_module_init that is > >> looked up > >> + * when loading a recovery library. > >> + */ > >> +typedef void (*RecoveryModuleInit) (RecoveryModuleCallbacks *cb); > > > > I think this is a bad way to return callbacks. 
This way the > > RecoveryModuleCallbacks needs to be modifiable, which makes the job for the > > compiler harder (and isn't the greatest for security). > > > > I strongly encourage you to follow the model used e.g. by tableam. The init > > function should return a pointer to a *constant* struct. Which is > > compile-time > > initialized with the function pointers. > > > > See the bottom of heapam_handler.c for how that ends up looking. > > Hm. I used the existing strategy for archive modules and logical decoding > output plugins here. Unfortunately I didn't realize the problem when I was designing the output plugin interface. But there's probably too many users of it out there now to change it. The interface does at least provide a way to have its own "per instance" state, via the startup callback and LogicalDecodingContext->output_plugin_private. The worst interface in this area is index AMs - the handler returns a pointer to a palloced struct with callbacks. That then is copied into a new allocation in the relcache entry. We have hundreds to thousands of copies of what bthandler() sets up in memory. Without any sort of benefit. > I think it would be weird for the archive module and > recovery module interfaces to look so different, but if that's okay, I can > change it. I'm a bit sad about the archive module case - I wonder if we should change it now, there can't be many users of it out there. And I think it's more likely that we'll eventually want multiple archiving scripts to run concurrently - which will be quite hard with the current interface (no private state). Just btw: It's imo a bit awkward for the definition of the archiving plugin interface to be in pgarch.h: "Exports from postmaster/pgarch.c" doesn't describe that well. A dedicated header seems cleaner. > > >> +void > >> +LoadRecoveryCallbacks(void) > >> +{ > >> + RecoveryModuleInit init; > >> + > >> + /* > >> + * If the shell command is enabled, use our special initialization > >> + * function. 
Otherwise, load the library and call its > >> + * _PG_recovery_module_init(). > >> + */ > >> + if (restoreLibrary[0] == '\0') > >> + init = shell_restore_init; > >> + else > >> + init = (RecoveryModuleInit) > >> + load_external_function(restoreLibrary, > >> "_PG_recovery_module_init", > >> + false, NULL); > > > > Why a special rule for shell, instead of just defaulting the GUC to it? > > I'm not following this one. The default value of the restore_library GUC > is an empty string, which means that the shell commands should be used. I was wondering why we implement "shell" via a separate mechanism from restore_library. I.e. a) why doesn't restore_library default to 'shell', instead of an empty string, b) why aren't restore_command et al implemented using a restore module. > >> + /* > >> + * If using shell commands, remove
Re: Using WaitEventSet in the postmaster
Thomas Munro writes: > The nearby thread about searching for uses of volatile reminded me: we > can now drop a bunch of these in postmaster.c. The patch I originally > wrote to do that as part of this series somehow morphed into an > experimental patch to nuke all global variables[1], but of course we > should at least drop the now redundant use of volatile and > sig_atomic_t. See attached. +1 regards, tom lane
Re: Small omission in type_sanity.sql
Andres Freund writes: > Tom, is there a reason we run the various sanity tests early-ish in the > schedule? It does seem to reduce their effectiveness a bit... Originally, those tests were mainly needed to sanity-check the hand-maintained initial catalog data, so it made sense to run them early. Since we taught genbki.pl to do a bunch more work, that's perhaps a bit less pressing. There's at least one test that intentionally sets up a bogus btree opclass, which we'd have to drop again if we wanted to run the sanity checks later. Not sure what other issues might surface. You could find out easily enough, of course ... > Problems: > - "Cross-check against pg_type entry" is far too strict about legal > combinations > of typstorage Perhaps, but it's enforcing policy about what we want in the initial catalog data, not what is possible to support. So there's a bit of divergence of goals here too. Maybe we need to split up the tests into initial-data-only tests (run early) and tests that should hold for user-created objects too (run late)? regards, tom lane
Re: Using WaitEventSet in the postmaster
The nearby thread about searching for uses of volatile reminded me: we can now drop a bunch of these in postmaster.c. The patch I originally wrote to do that as part of this series somehow morphed into an experimental patch to nuke all global variables[1], but of course we should at least drop the now redundant use of volatile and sig_atomic_t. See attached. [1] https://www.postgresql.org/message-id/flat/CA%2BhUKGKH_RPAo%3DNgPfHKj--565aL1qiVpUGdWt1_pmJehY%2Bdmw%40mail.gmail.com From 74f8b703f1a23b47bf613813014bc432c62c75d8 Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Sat, 28 Jan 2023 14:08:23 +1300 Subject: [PATCH] Remove unneeded volatile qualifiers from postmaster.c. Several flags were marked volatile and in some cases used sig_atomic_t because they were accessed from signal handlers. After commit 7389aad6, we can just use unqualified bool. --- src/backend/postmaster/postmaster.c | 10 +- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c index 62fba5fcee..f92dbc2270 100644 --- a/src/backend/postmaster/postmaster.c +++ b/src/backend/postmaster/postmaster.c @@ -359,17 +359,17 @@ bool ClientAuthInProgress = false; /* T during new-client bool redirection_done = false; /* stderr redirected for syslogger? 
*/ /* received START_AUTOVAC_LAUNCHER signal */ -static volatile sig_atomic_t start_autovac_launcher = false; +static bool start_autovac_launcher = false; /* the launcher needs to be signaled to communicate some condition */ -static volatile bool avlauncher_needs_signal = false; +static bool avlauncher_needs_signal = false; /* received START_WALRECEIVER signal */ -static volatile sig_atomic_t WalReceiverRequested = false; +static bool WalReceiverRequested = false; /* set when there's a worker that needs to be started up */ -static volatile bool StartWorkerNeeded = true; -static volatile bool HaveCrashedWorker = false; +static bool StartWorkerNeeded = true; +static bool HaveCrashedWorker = false; /* set when signals arrive */ static volatile sig_atomic_t pending_pm_pmsignal; -- 2.38.1
Re: Small omission in type_sanity.sql
Hi, On 2023-01-18 14:51:32 -0500, Melanie Plageman wrote: > I only changed these few lines in type_sanity to be more correct; I > didn't change anything else in regress to actually exercise them (e.g. > ensuring a partitioned table is around when running type_sanity). It > might be worth moving type_sanity down in the parallel schedule? Unfortunately it's not entirely trivial to do that. Despite the comment at the top of type_sanity.sql: -- None of the SELECTs here should ever find any matching entries, -- so the expected output is easy to maintain ;-). there are actually a few queries in there that are expected to return objects. And would return more at the end of the tests. That doesn't seem great. Tom, is there a reason we run the various sanity tests early-ish in the schedule? It does seem to reduce their effectiveness a bit... Problems: - shell types show up in a bunch of queries, e.g. "Look for illegal values in pg_type fields." due to NOT t1.typisdefined - the omission of various relkinds noted by Melanie shows in "Look for illegal values in pg_class fields" - pg_attribute query doesn't know about dropped columns - "Cross-check against pg_type entry" is far too strict about legal combinations of typstorage > It does seem a bit hard to remember to update these tests in > type_sanity.sql when adding some new value for a pg_class field. I > wonder if there is a better way of testing this. As evidenced by the above list of failures, moving the test to the end of the regression tests would help, if we can get there. > --- All tables and indexes should have an access method. > -SELECT c1.oid, c1.relname > -FROM pg_class as c1 > -WHERE c1.relkind NOT IN ('S', 'v', 'f', 'c') and > -c1.relam = 0; > - oid | relname > --+- > +-- All tables and indexes except partitioned tables should have an access > +-- method. 
> +SELECT oid, relname, relkind, relam > +FROM pg_class > +WHERE relkind NOT IN ('S', 'v', 'f', 'c', 'p') and > +relam = 0; > + oid | relname | relkind | relam > +-+-+-+--- > (0 rows) Don't think that one is right, a partitioned table doesn't have an AM. > --- Conversely, sequences, views, types shouldn't have them > -SELECT c1.oid, c1.relname > -FROM pg_class as c1 > -WHERE c1.relkind IN ('S', 'v', 'f', 'c') and > -c1.relam != 0; > - oid | relname > --+- > +-- Conversely, sequences, views, types, and partitioned tables shouldn't have > +-- them > +SELECT oid, relname, relkind, relam > +FROM pg_class > +WHERE relkind IN ('S', 'v', 'f', 'c', 'p') and > +relam != 0; > + oid | relname | relkind | relam > +-+-+-+--- > (0 rows) Particularly because you include them again here :) Greetings, Andres Freund
Re: recovery modules
On Fri, Jan 27, 2023 at 04:23:19PM -0800, Andres Freund wrote: >> +typedef bool (*RecoveryRestoreCB) (const char *file, const char *path, >> + const char >> *lastRestartPointFileName); >> +typedef void (*RecoveryArchiveCleanupCB) (const char >> *lastRestartPointFileName); >> +typedef void (*RecoveryEndCB) (const char *lastRestartPointFileName); >> +typedef void (*RecoveryShutdownCB) (void); > > I think the signature of these forces bad coding practices, because there's no > way to have state within a recovery module (as there's no parameter for it). > > It's possible we would eventually support multiple modules, e.g. restoring > from shorter term file based archiving and from longer term archiving in some > blob store. Then we'll regret not having a variable for this. Are you suggesting that we add a "void *arg" to each one of these? Or put the arguments into a struct? Or something else? >> +extern RecoveryModuleCallbacks RecoveryContext; > > I think that'll typically be interpreted as a MemoryContext by readers. How about RecoveryCallbacks? > Also, why is this a global var? Exported too? It's needed in xlog.c, xlogarchive.c, and xlogrecovery.c. Would you rather it be static to xlogarchive.c and provide accessors for the others? >> +/* >> + * Type of the shared library symbol _PG_recovery_module_init that is >> looked up >> + * when loading a recovery library. >> + */ >> +typedef void (*RecoveryModuleInit) (RecoveryModuleCallbacks *cb); > > I think this is a bad way to return callbacks. This way the > RecoveryModuleCallbacks needs to be modifiable, which makes the job for the > compiler harder (and isn't the greatest for security). > > I strongly encourage you to follow the model used e.g. by tableam. The init > function should return a pointer to a *constant* struct. Which is compile-time > initialized with the function pointers. > > See the bottom of heapam_handler.c for how that ends up looking. Hm. 
I used the existing strategy for archive modules and logical decoding output plugins here. I think it would be weird for the archive module and recovery module interfaces to look so different, but if that's okay, I can change it. >> +void >> +LoadRecoveryCallbacks(void) >> +{ >> +RecoveryModuleInit init; >> + >> +/* >> + * If the shell command is enabled, use our special initialization >> + * function. Otherwise, load the library and call its >> + * _PG_recovery_module_init(). >> + */ >> +if (restoreLibrary[0] == '\0') >> +init = shell_restore_init; >> +else >> +init = (RecoveryModuleInit) >> +load_external_function(restoreLibrary, >> "_PG_recovery_module_init", >> + false, NULL); > > Why a special rule for shell, instead of just defaulting the GUC to it? I'm not following this one. The default value of the restore_library GUC is an empty string, which means that the shell commands should be used. >> +/* >> + * If using shell commands, remove callbacks for any commands that are >> not >> + * set. >> + */ >> +if (restoreLibrary[0] == '\0') >> +{ >> +if (recoveryRestoreCommand[0] == '\0') >> +RecoveryContext.restore_cb = NULL; >> +if (archiveCleanupCommand[0] == '\0') >> +RecoveryContext.archive_cleanup_cb = NULL; >> +if (recoveryEndCommand[0] == '\0') >> +RecoveryContext.recovery_end_cb = NULL; > > I'd just mandate that these are implemented and that the module has to handle > if it doesn't need to do anything. Wouldn't this just force module authors to write empty functions for the parts they don't need? -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Optimizing PostgreSQL with LLVM's PGO+LTO
Hi, On 2023-01-27 18:28:16 -0500, Tom Lane wrote: > Andres Freund writes: > > I'm sure we have a few places that aren't that careful, but I would hope > > it's > > not a large number. Are you thinking of specific "patterns" we've repeated > > all > > over, or just a few cases you recall? > > I recall that we used to have dependencies on, for example, the LWLock > functions being out-of-line. Probably that specific pain point has > been cleaned up, but it surprises me not at all to hear that there > are more. We did clean up a fair bit, some via "infrastructure" fixes. E.g. our spinlocks didn't use to be a barrier a good while back (c.f. 0709b7ee72e), and that required putting volatile on things that couldn't move across the lock boundaries. I think that in turn was what caused the LWLock issue you mention, as back then lwlocks used spinlocks. The increased use of atomics instead of "let's just do a dirty read", fixed a few instances too. > I agree that there are probably not a huge number of places that would > need to be fixed, but I'm not sure how we'd go about finding them. Yea, that's the annoying part... One thing we can look for is the use of volatile, which we used to use a lot for preventing code rearrangement (for lack of barrier primitives in the bad old days). Both Robert and I removed a bunch of that kind of use of volatile, and from memory some of them wouldn't have been safe with LTO. It's really too bad that we [have to] use volatile around signal handlers and for PG_TRY too, otherwise it'd be easier to search for. Kinda wondering if we ought to add a sig_volatile, err_volatile or such. But the main thing probably is to just regularly test LTO and look for problems. Perhaps worth adding a BF animal that uses -O3 + LTO? I don't immediately see how to squeeze using PGO into the BF build process (since we'd have to build without PGO, run some workload, build with PGO - without any source modifications inbetween)... Greetings, Andres Freund
Re: Syncrep and improving latency due to WAL throttling
On 1/27/23 22:19, Andres Freund wrote: > Hi, > > On 2023-01-27 12:06:49 +0100, Jakub Wartak wrote: >> On Thu, Jan 26, 2023 at 4:49 PM Andres Freund wrote: >> >>> Huh? Why did you remove the GUC? >> >> After reading previous threads, my optimism level of getting it ever >> in shape of being widely accepted degraded significantly (mainly due >> to the discussion of wider category of 'WAL I/O throttling' especially >> in async case, RPO targets in async case and potentially calculating >> global bandwidth). > > I think it's quite reasonable to limit this to a smaller scope. Particularly > because those other goals are pretty vague but ambitious goals. IMO the > problem with a lot of the threads is precisely that that they aimed at a level > of generallity that isn't achievable in one step. > +1 to that -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Syncrep and improving latency due to WAL throttling
On 1/27/23 22:33, Andres Freund wrote: > Hi, > > On 2023-01-27 21:45:16 +0100, Tomas Vondra wrote: >> On 1/27/23 08:18, Bharath Rupireddy wrote: I think my idea of only forcing to flush/wait an LSN some distance in the past would automatically achieve that? >>> >>> I'm sorry, I couldn't get your point, can you please explain it a bit more? >>> >> >> The idea is that we would not flush the exact current LSN, because >> that's likely somewhere in the page, and we always write the whole page >> which leads to write amplification. >> >> But if we backed off a bit, and wrote e.g. to the last page boundary, >> that wouldn't have this issue (either the page was already flushed - >> noop, or we'd have to flush it anyway). > > Yep. > > >> We could even back off a bit more, to increase the probability it was >> actually flushed / sent to standby. > > That's not the sole goal, from my end: I'd like to avoid writing out + > flushing the WAL in too small chunks. Imagine a few concurrent vacuums or > COPYs or such - if we're unlucky they'd each end up exceeding their "private" > limit close to each other, leading to a number of small writes of the > WAL. Which could end up increasing local commit latency / iops. > > If we instead decide to only ever flush up to something like > last_page_boundary - 1/8 * throttle_pages * XLOG_BLCKSZ > > we'd make sure that the throttling mechanism won't cause a lot of small > writes. > I'm not saying we shouldn't do this, but I still don't see how this could make a measurable difference. At least assuming a sensible value of the throttling limit (say, more than 256kB per backend), and OLTP workload running concurrently. That means ~64 extra flushes/writes per 16MB segment (at most). Yeah, a particular page might get unlucky and be flushed by multiple backends, but the average still holds. Meanwhile, the OLTP transactions will generate (at least) an order of magnitude more flushes. 
> >>> Keeping replication lag under check enables one to provide a better >>> RPO guarantee as discussed in the other thread >>> https://www.postgresql.org/message-id/CAHg%2BQDcO_zhgBCMn5SosvhuuCoJ1vKmLjnVuqUEOd4S73B1urw%40mail.gmail.com. >>> >> >> Isn't that a bit over-complicated? RPO generally only cares about xacts >> that committed (because that's what you want to not lose), so why not to >> simply introduce a "sync mode" that simply uses a bit older LSN when >> waiting for the replica? Seems much simpler and similar to what we >> already do. > > I don't think that really helps you that much. If there's e.g. a huge VACUUM / > COPY emitting loads of WAL you'll suddenly see commit latency of a > concurrently committing transactions spike into oblivion. Whereas a general > WAL throttling mechanism would throttle the VACUUM, without impacting the > commit latency of normal transactions. > True, but it solves the RPO goal which is what the other thread was about. IMHO it's useful to look at this as a resource scheduling problem: limited WAL bandwidth consumed by backends, with the bandwidth distributed using some sort of policy. The patch discussed in this thread uses fundamentally unfair policy, with throttling applied only on backends that produce a lot of WAL. And trying to leave the OLTP as unaffected as possible. The RPO thread seems to be aiming for a "fair" policy, providing the same fraction of bandwidth to all processes. This will affect all xacts the same way (sleeps proportional to amount of WAL generated by the xact). Perhaps we want such alternative scheduling policies, but it'll probably require something like the autovacuum throttling. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: recovery modules
On Fri, Jan 27, 2023 at 04:09:39PM -0800, Andres Freund wrote: > I think it would be hard to write a good module that isn't just implementing > the existing commands without it. Needing to clean up archives and reacting to > the end of recovery seems a pretty core task. FWIW, recovery_end_command is straight-forward as it is done by the startup process, so that's an easy take. You could split the cake into two parts, as well, aka first focus on restore_command and recovery_end_command as a first step, then we could try to figure out how archive_cleanup_command would fit in this picture with the checkpointer or a different process. There are a bunch of deployments where WAL archive retention is controlled by the age of the backups, not by the backend deciding when these should be removed as a checkpoint runs depending on the computed redo LSN, so recovery modules would still be useful with just coverage for restore_command and recovery_end_command. > I was briefly wondering whether it'd be worth trying to offload things like > archive_cleanup_command from checkpointer to a different process, for > robustness. But given that it's pretty much required for performance that the > module runs in the startup process, that ship probably has sailed. Yeah, agreed that this could be interesting. That could leverage the work of the checkpointer. Nathan has proposed a patch for that recently, as far as I recall, to offload some tasks from the startup and checkpointer processes: https://commitfest.postgresql.org/41/3448/ So that pretty much goes into the same area? -- Michael
Re: recovery modules
Hi, On 2023-01-25 16:34:21 +0900, Michael Paquier wrote: > diff --git a/src/include/access/xlogarchive.h > b/src/include/access/xlogarchive.h > index 299304703e..71c9b88165 100644 > --- a/src/include/access/xlogarchive.h > +++ b/src/include/access/xlogarchive.h > @@ -30,9 +30,45 @@ extern bool XLogArchiveIsReady(const char *xlog); > extern bool XLogArchiveIsReadyOrDone(const char *xlog); > extern void XLogArchiveCleanup(const char *xlog); > > -extern bool shell_restore(const char *file, const char *path, > - const char > *lastRestartPointFileName); > -extern void shell_archive_cleanup(const char *lastRestartPointFileName); > -extern void shell_recovery_end(const char *lastRestartPointFileName); > +/* > + * Recovery module callbacks > + * > + * These callback functions should be defined by recovery libraries and > + * returned via _PG_recovery_module_init(). For more information about the > + * purpose of each callback, refer to the recovery modules documentation. > + */ > +typedef bool (*RecoveryRestoreCB) (const char *file, const char *path, > +const char > *lastRestartPointFileName); > +typedef void (*RecoveryArchiveCleanupCB) (const char > *lastRestartPointFileName); > +typedef void (*RecoveryEndCB) (const char *lastRestartPointFileName); > +typedef void (*RecoveryShutdownCB) (void); I think the signature of these forces bad coding practices, because there's no way to have state within a recovery module (as there's no parameter for it). It's possible we would eventually support multiple modules, e.g. restoring from shorter-term file-based archiving and from longer-term archiving in some blob store. Then we'll regret not having a variable for this. 
> +typedef struct RecoveryModuleCallbacks > +{ > + RecoveryRestoreCB restore_cb; > + RecoveryArchiveCleanupCB archive_cleanup_cb; > + RecoveryEndCB recovery_end_cb; > + RecoveryShutdownCB shutdown_cb; > +} RecoveryModuleCallbacks; > + > +extern RecoveryModuleCallbacks RecoveryContext; I think that'll typically be interpreted as a MemoryContext by readers. Also, why is this a global var? Exported too? > +/* > + * Type of the shared library symbol _PG_recovery_module_init that is looked > up > + * when loading a recovery library. > + */ > +typedef void (*RecoveryModuleInit) (RecoveryModuleCallbacks *cb); I think this is a bad way to return callbacks. This way the RecoveryModuleCallbacks needs to be modifiable, which makes the job for the compiler harder (and isn't the greatest for security). I strongly encourage you to follow the model used e.g. by tableam. The init function should return a pointer to a *constant* struct. Which is compile-time initialized with the function pointers. See the bottom of heapam_handler.c for how that ends up looking. > +void > +LoadRecoveryCallbacks(void) > +{ > + RecoveryModuleInit init; > + > + /* > + * If the shell command is enabled, use our special initialization > + * function. Otherwise, load the library and call its > + * _PG_recovery_module_init(). > + */ > + if (restoreLibrary[0] == '\0') > + init = shell_restore_init; > + else > + init = (RecoveryModuleInit) > + load_external_function(restoreLibrary, > "_PG_recovery_module_init", > +false, NULL); Why a special rule for shell, instead of just defaulting the GUC to it? > + /* > + * If using shell commands, remove callbacks for any commands that are > not > + * set. 
> + */ > + if (restoreLibrary[0] == '\0') > + { > + if (recoveryRestoreCommand[0] == '\0') > + RecoveryContext.restore_cb = NULL; > + if (archiveCleanupCommand[0] == '\0') > + RecoveryContext.archive_cleanup_cb = NULL; > + if (recoveryEndCommand[0] == '\0') > + RecoveryContext.recovery_end_cb = NULL; I'd just mandate that these are implemented and that the module has to handle if it doesn't need to do anything. > + /* > + * Check for invalid combinations of the command/library parameters and > + * load the callbacks. > + */ > + CheckMutuallyExclusiveGUCs(restoreLibrary, "restore_library", > + > recoveryRestoreCommand, "restore_command"); > + CheckMutuallyExclusiveGUCs(restoreLibrary, "restore_library", > +recoveryEndCommand, > "recovery_end_command"); > + before_shmem_exit(call_recovery_module_shutdown_cb, 0); > + LoadRecoveryCallbacks(); This kind of sequence is duplicated into several places. Greetings, Andres Freund
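For reference, the tableam-style pattern recommended above (see the bottom of heapam_handler.c) looks roughly like this in miniature. The struct and function names below are hypothetical simplifications, not the proposed recovery-module API:

```c
#include <assert.h>
#include <stddef.h>

/* Callback table: filled in at compile time, never modified at runtime. */
typedef struct RecoveryModuleCallbacks
{
    int  (*restore_cb) (const char *file);
    void (*shutdown_cb) (void);
} RecoveryModuleCallbacks;

static int
my_restore(const char *file)
{
    return file != NULL;        /* pretend the restore succeeded */
}

static void
my_shutdown(void)
{
    /* nothing to clean up in this sketch */
}

/* Compile-time-initialized constant table, like heapam_methods. */
static const RecoveryModuleCallbacks my_recovery_callbacks = {
    .restore_cb = my_restore,
    .shutdown_cb = my_shutdown,
};

/* The module init function returns a pointer to the constant struct. */
const RecoveryModuleCallbacks *
my_recovery_module_init(void)
{
    return &my_recovery_callbacks;
}
```

Because the table is a compile-time-initialized constant, the compiler can place it in read-only memory, and the function pointers cannot be overwritten at runtime.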
Re: How to find the number of cached pages for a relation?
Thank you Andres. If I want to do "a" (do one probe of the buffer mapping table for each block of the relation, thus O(#relation blocks)), what function calls can I use, assuming I only have access to the relation id? How can I access and scan the buffer mapping table? On Fri, Jan 13, 2023 at 6:27 PM Andres Freund wrote: > Hi, > > On 2023-01-13 17:28:31 -0800, Amin wrote: > > Before scanning a relation, in the planner stage, I want to make a call > to > > retrieve information about how many pages will be a hit for a specific > > relation. The module pg_buffercache seems to be doing a similar thing. > > Also, pg_statio_all_tables seems to have that information, but it is > > updated after execution. However, I want the information before > execution. > > Also not sure how pg_statio_all_tables is created and how I can access it > > in the code. > > There's no cheap way to do that. Currently the only ways are to: > > a) Do one probe of the buffer mapping table for each block of the >relation. Thus O(#relation blocks). > > b) Scan all of the buffer headers, check which are for the relation. Thus >O(#NBuffers) > > Neither of which is a good idea during planning. > > > It might be a bit more realistic to get very rough estimates: > > You could compute the table's historic cache hit ratio from pgstats (i.e. > use > the data backing pg_statio_all_tables). Of course that's not going to be > specific to your query (for index scans etc), and might have changed more > recently. It'd also be completely wrong after a restart. > > If we had information about *recent* cache hit patterns for the relation, > it'd > be a lot better, but we don't have the infrastructure for that, and > introducing it would increase the size of the stats entries noticeably. > > Another way could be to probe the buffer mapping table for a small subset > of > the blocks and infer the likelihood of other blocks being in shared buffers > that way. 
> > A third way could be to track the cache hits for relations in backend local > memory, likely in the relcache entry. The big disadvantage would be that > query > plans would differ between connections and that connections would need to > "warm up" to have good plans. But it'd handle restarts nicely. > > Greetings, > > Andres Freund >
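To make option (a) concrete: it is one lookup per block of the relation, accumulated into a hit count. The sketch below is a standalone model only; in the server the lookup would build a BufferTag for each (relation, fork, block) and probe the shared buffer mapping hash table under the matching partition lock:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Standalone model of option (a): probe a "buffer mapping table" once per
 * relation block and count the hits.  The shared hash table is modeled
 * here as a plain array of cached block numbers.
 */
static bool
block_is_cached(const int *cached, int ncached, int blkno)
{
    for (int i = 0; i < ncached; i++)
        if (cached[i] == blkno)
            return true;
    return false;
}

static int
count_cached_blocks(const int *cached, int ncached, int nblocks)
{
    int hits = 0;

    /* O(#relation blocks): one probe per block of the relation */
    for (int blkno = 0; blkno < nblocks; blkno++)
        if (block_is_cached(cached, ncached, blkno))
            hits++;
    return hits;
}
```

This makes the cost of option (a) visible: a large relation forces a probe per block even if almost nothing is cached, which is why it is unattractive during planning.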
Re: recovery modules
Hi, On 2023-01-27 15:28:21 -0800, Nathan Bossart wrote: > The more I think about this, the more I wonder whether we really need to > include archive_cleanup_command and recovery_end_command in recovery > modules. I think it would be hard to write a good module that isn't just implementing the existing commands without it. Needing to clean up archives and reacting to the end of recovery seems a pretty core task. > Another weird thing with the checkpointer is that the restore_library will > stay loaded long after recovery is finished, and it'll be loaded regardless > of whether recovery is required in the first place. I don't see a problem with that. And I suspect we might even end up there for other reasons. I was briefly wondering whether it'd be worth trying to offload things like archive_cleanup_command from checkpointer to a different process, for robustness. But given that it's pretty much required for performance that the module runs in the startup process, that ship probably has sailed. Greetings, Andres Freund
Re: Add n_tup_newpage_upd to pg_stat table views
Hi, On 2023-01-27 18:23:39 -0500, Corey Huinker wrote: > This patch adds the n_tup_newpage_upd to all the table stat views. > > Just as we currently track HOT updates, it should be beneficial to track > updates where the new tuple cannot fit on the existing page and must go to > a different one. I like that idea. I wonder if it's quite detailed enough - we can be forced to do out-of-page updates because we actually are out of space, or because we reach the max number of line pointers we allow in a page. HOT pruning can't remove dead line pointers, so that doesn't necessarily help. Which e.g. means that: > Hopefully this can give users some insight as to whether their current > fillfactor settings need to be adjusted. Isn't that easy, because you can have a page with just a visible single tuple on, but still be unable to do a same-page update. The fix instead is to VACUUM (more aggressively). OTOH, just seeing that there's high percentage "out-of-page updates" provides more information than we have right now. And the alternative would be to add yet another counter. Similarly, it's a bit sad that we can't distinguish between the number of potential-HOT out-of-page updates and the other out-of-page updates. But that'd mean even more counters. I guess we could try to add tracepoints to allow to distinguish between those cases instead? Not a lot of people use those though. > @@ -372,8 +372,11 @@ pgstat_count_heap_update(Relation rel, bool hot) > pgstat_info->trans->tuples_updated++; > > /* t_tuples_hot_updated is nontransactional, so just advance it > */ > - if (hot) > + if (hut == PGSTAT_HEAPUPDATE_HOT) > pgstat_info->t_counts.t_tuples_hot_updated++; > + else if (hut == PGSTAT_HEAPUPDATE_NEW_PAGE) > + pgstat_info->t_counts.t_tuples_newpage_updated++; > + > } > } > I think this might cause some trouble for existing monitoring setups after an upgrade. Suddenly the number of updates will appear way lower than before... 
And if we end up eventually distinguishing between different reasons for out-of-page updates, or hot/non-hot out-of-page that'll happen again. I wish we'd included HOT updates in the total number of updates, and just kept HOT updates a separate counter that'd always be less than updates in total. >From that angle: Perhaps it'd be better to have counter for how many times a page is found to be full during an update? > --- a/src/backend/access/heap/heapam.c > +++ b/src/backend/access/heap/heapam.c > @@ -3155,7 +3155,8 @@ heap_update(Relation relation, ItemPointer otid, > HeapTuple newtup, > pagefree; > boolhave_tuple_lock = false; > booliscombo; > - booluse_hot_update = false; > + PgStat_HeapUpdateType update_type = PGSTAT_HEAPUPDATE_NON_HOT; > + > boolkey_intact; > boolall_visible_cleared = false; > boolall_visible_cleared_new = false; > @@ -3838,10 +3839,11 @@ l2: >* changed. >*/ > if (!bms_overlap(modified_attrs, hot_attrs)) > - use_hot_update = true; > + update_type = PGSTAT_HEAPUPDATE_HOT; > } > else > { > + update_type = PGSTAT_HEAPUPDATE_NEW_PAGE; > /* Set a hint that the old page could use prune/defrag */ > PageSetFull(page); > } > @@ -3875,7 +3877,7 @@ l2: >*/ > PageSetPrunable(page, xid); > > - if (use_hot_update) > + if (update_type == PGSTAT_HEAPUPDATE_HOT) It's a bit weird that heap_update() uses a pgstat type to decide what to do. But not sure there's a much better alternative. Greetings, Andres Freund
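As a standalone model of the counting logic being discussed (the enum names mirror the patch; the stats machinery is reduced to plain counters):

```c
#include <assert.h>

/* Mirrors the patch's three-way update classification. */
typedef enum PgStat_HeapUpdateType
{
    PGSTAT_HEAPUPDATE_NON_HOT,      /* same page, but not HOT */
    PGSTAT_HEAPUPDATE_HOT,          /* same page, HOT chain extended */
    PGSTAT_HEAPUPDATE_NEW_PAGE      /* new tuple placed on a different page */
} PgStat_HeapUpdateType;

static long tuples_updated = 0;
static long tuples_hot_updated = 0;
static long tuples_newpage_updated = 0;

static void
count_heap_update(PgStat_HeapUpdateType hut)
{
    tuples_updated++;           /* every update counts toward the total */
    if (hut == PGSTAT_HEAPUPDATE_HOT)
        tuples_hot_updated++;
    else if (hut == PGSTAT_HEAPUPDATE_NEW_PAGE)
        tuples_newpage_updated++;
}
```

Each update increments the total and at most one of the two sub-counters, so the sub-counters can never sum to more than the total.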
Re: recovery modules
On Thu, Jan 26, 2023 at 09:40:58PM -0800, Nathan Bossart wrote: > Yeah, there are some problems here. If we ERROR, we'll just bounce back to > the sigsetjmp() block once a second, and we'll never pick up configuration > reloads, shutdown signals, etc. If we FATAL, we'll just rapidly restart > over and over. Given the discussion about misconfigured archiving > parameters [0], I doubt folks will be okay with giving priority to one or > the other. > > I'm currently thinking that the checkpointer should set a flag and clear > the recovery callbacks when a misconfiguration is detected. Anytime the > checkpointer tries to use the archive-cleanup callback, a WARNING would be > emitted. This is similar to an approach I proposed for archiving > misconfigurations (that we didn't proceed with) [1]. Given the > aforementioned problems, this approach might be more suitable for the > checkpointer than it is for the archiver. So, by doing that, archive_library would be ignored. What should the checkpointer do when a configuration error is found? Would archive_cleanup_command be equally ignored, or could there be a risk of it being used by the checkpointer? Ignoring it would be fine, as the user intended the use of a library, rather than enforcing its use via a value one did not really want. So, on top of cleaning the callbacks, archive_cleanup_command should also be cleaned up in the checkpointer? Issuing one WARNING per checkpoint would indeed be much better than looping over and over, impacting the system reliability. Thoughts or comments from anyone? -- Michael
Re: heapgettup() with NoMovementScanDirection unused in core?
David Rowley writes: > My personal preference would have been to call it > ScanDirectionCombine, so the naming is more aligned to the 4 other > macro names that start with ScanDirection in sdir.h, but I'm not going > to fuss over it. No objection to that. regards, tom lane
Re: heapgettup() with NoMovementScanDirection unused in core?
On Sat, 28 Jan 2023 at 12:15, Tom Lane wrote: > /* > * Determine the net effect of two direction specifications. > * This relies on having ForwardScanDirection = +1, BackwardScanDirection = > -1, > * and will probably not do what you want if applied to any other values. > */ > #define CombineScanDirections(a, b) ((a) * (b)) > > The main thing this'd buy us is being able to grep for uses of the > trick. If it's written as just multiplication, good luck being > able to find what's depending on that, should you ever need to. Yeah, I think the multiplication macro is a good way of doing it. Having the definition of it close to the ScanDirection enum's definition is likely a very good idea so that anyone adjusting the enum values is more likely to notice that it'll cause an issue. A small note on the enum declaration about the -1, +1 values being exploited in various places might be a good idea too. I see v6-0006 in [1] further exploits this, so that's further reason to document that. My personal preference would have been to call it ScanDirectionCombine, so the naming is more aligned to the 4 other macro names that start with ScanDirection in sdir.h, but I'm not going to fuss over it. David [1] https://postgr.es/m/caakru_zyixwws1wxszneoy+sjoh_+f5uho-1ufhyi-u0d6z...@mail.gmail.com
Re: recovery modules
On Thu, Jan 26, 2023 at 09:40:58PM -0800, Nathan Bossart wrote: > On Wed, Jan 25, 2023 at 04:34:21PM +0900, Michael Paquier wrote: >> The loop part is annoying... I've never been a fan of adding these >> cross-value checks for the archiver modules in the first place, and it >> would make things much simpler in the checkpointer if we need to think >> about that as we want these values to be reloadable. Perhaps this >> could just be an exception where we just give priority on one over the >> other archive_cleanup_command? The startup process has a well-defined >> sequence after a failure, while the checkpointer is designed to remain >> robust. > > Yeah, there are some problems here. If we ERROR, we'll just bounce back to > the sigsetjmp() block once a second, and we'll never pick up configuration > reloads, shutdown signals, etc. If we FATAL, we'll just rapidly restart > over and over. Given the discussion about misconfigured archiving > parameters [0], I doubt folks will be okay with giving priority to one or > the other. > > I'm currently thinking that the checkpointer should set a flag and clear > the recovery callbacks when a misconfiguration is detected. Anytime the > checkpointer tries to use the archive-cleanup callback, a WARNING would be > emitted. This is similar to an approach I proposed for archiving > misconfigurations (that we didn't proceed with) [1]. Given the > aforementioned problems, this approach might be more suitable for the > checkpointer than it is for the archiver. The more I think about this, the more I wonder whether we really need to include archive_cleanup_command and recovery_end_command in recovery modules. Another weird thing with the checkpointer is that the restore_library will stay loaded long after recovery is finished, and it'll be loaded regardless of whether recovery is required in the first place. 
Of course, that typically won't cause any problems, and we could wait until we need to do archive cleanup to load the library (and call its shutdown callback when recovery is finished), but this strikes me as potentially more complexity than the feature is worth. Perhaps we should just focus on covering the restore_command functionality for now and add the rest later. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Optimizing PostgreSQL with LLVM's PGO+LTO
Andres Freund writes: > On 2023-01-27 15:06:37 -0500, Tom Lane wrote: >> There are a lot of places where we're implicitly relying on >> cross-compilation-unit optimizations NOT happening, because the code isn't >> adequately decorated with memory barriers and the like. > We have a fallback compiler barrier implementation doing that, but it > shouldn't be used on any halfway reasonable compiler. Cross-compilation-unit > calls don't provide a memory barrier - I assume you're thinking about a > compiler barrier? Sorry, yeah, I was being sloppy there. > I'm sure we have a few places that aren't that careful, but I would hope it's > not a large number. Are you thinking of specific "patterns" we've repeated all > over, or just a few cases you recall? I recall that we used to have dependencies on, for example, the LWLock functions being out-of-line. Probably that specific pain point has been cleaned up, but it surprises me not at all to hear that there are more. I agree that there are probably not a huge number of places that would need to be fixed, but I'm not sure how we'd go about finding them. regards, tom lane
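For context, the compiler-barrier primitive in question (as opposed to a CPU memory barrier) is conventionally an empty asm statement with a memory clobber on GCC-like compilers; a minimal sketch:

```c
#include <assert.h>

/*
 * A compiler barrier prevents the compiler from reordering or caching
 * loads/stores across it, without emitting any CPU fence instruction.
 * On GCC/Clang this is conventionally an empty asm with a "memory"
 * clobber.
 */
#define compiler_barrier()  __asm__ __volatile__("" ::: "memory")

static int shared_flag = 0;

static int
read_flag_twice(void)
{
    int a = shared_flag;

    compiler_barrier();         /* forces a genuine re-read of shared_flag */
    int b = shared_flag;

    return a + b;
}
```

An out-of-line function call has a similar effect only as long as the compiler cannot see the callee's body, which is exactly what LTO and cross-compilation-unit optimization change.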
Add n_tup_newpage_upd to pg_stat table views
This patch adds the n_tup_newpage_upd column to all the table stat views. Just as we currently track HOT updates, it should be beneficial to track updates where the new tuple cannot fit on the existing page and must go to a different one. Hopefully this can give users some insight as to whether their current fillfactor settings need to be adjusted. My chosen implementation replaces the hot-update boolean with an update_type, which is currently a three-value enum. I favored that only slightly over adding a separate newpage-update boolean because the two events are mutually exclusive, and fewer parameters mean less overhead and one fewer assertion check. The relative wisdom of this choice may not come to light until we add a new measurement and see whether that new measurement overlaps either is-hot or is-new-page. From 55204b3d2f719f5dd8c308ea722606a40b3d09b8 Mon Sep 17 00:00:00 2001 From: coreyhuinker Date: Fri, 27 Jan 2023 17:56:59 -0500 Subject: [PATCH v1] Add n_tup_newpage_upd to pg_stat_all_tables and pg_stat_xact_all_tables. This value reflects the number of tuples updated where the new tuple was placed on a different page than the previous version. 
--- doc/src/sgml/monitoring.sgml | 9 + src/backend/access/heap/heapam.c | 10 ++ src/backend/catalog/system_views.sql | 4 +++- src/backend/utils/activity/pgstat_relation.c | 8 ++-- src/backend/utils/adt/pgstatfuncs.c | 18 ++ src/include/catalog/pg_proc.dat | 9 + src/include/pgstat.h | 14 -- src/test/regress/expected/rules.out | 12 +--- 8 files changed, 72 insertions(+), 12 deletions(-) diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml index 1756f1a4b6..f0291a21fb 100644 --- a/doc/src/sgml/monitoring.sgml +++ b/doc/src/sgml/monitoring.sgml @@ -4523,6 +4523,15 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i + + + n_tup_newpage_upd bigint + + + Number of rows updated where the new row is on a different page than the previous version + + + n_live_tup bigint diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index e6024a980b..f5aa429a9a 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -3155,7 +3155,8 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup, pagefree; bool have_tuple_lock = false; bool iscombo; - bool use_hot_update = false; + PgStat_HeapUpdateType update_type = PGSTAT_HEAPUPDATE_NON_HOT; + bool key_intact; bool all_visible_cleared = false; bool all_visible_cleared_new = false; @@ -3838,10 +3839,11 @@ l2: * changed. 
*/ if (!bms_overlap(modified_attrs, hot_attrs)) - use_hot_update = true; + update_type = PGSTAT_HEAPUPDATE_HOT; } else { + update_type = PGSTAT_HEAPUPDATE_NEW_PAGE; /* Set a hint that the old page could use prune/defrag */ PageSetFull(page); } @@ -3875,7 +3877,7 @@ l2: */ PageSetPrunable(page, xid); - if (use_hot_update) + if (update_type == PGSTAT_HEAPUPDATE_HOT) { /* Mark the old tuple as HOT-updated */ HeapTupleSetHotUpdated(&oldtup); @@ -3986,7 +3988,7 @@ l2: if (have_tuple_lock) UnlockTupleTuplock(relation, &(oldtup.t_self), *lockmode); - pgstat_count_heap_update(relation, use_hot_update); + pgstat_count_heap_update(relation, update_type); /* * If heaptup is a private copy, release it. Don't forget to copy t_self diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql index 8608e3fa5b..292a9b88b3 100644 --- a/src/backend/catalog/system_views.sql +++ b/src/backend/catalog/system_views.sql @@ -665,6 +665,7 @@ CREATE VIEW pg_stat_all_tables AS pg_stat_get_tuples_updated(C.oid) AS n_tup_upd, pg_stat_get_tuples_deleted(C.oid) AS n_tup_del, pg_stat_get_tuples_hot_updated(C.oid) AS n_tup_hot_upd, +pg_stat_get_tuples_newpage_updated(C.oid) AS n_tup_newpage_upd, pg_stat_get_live_tuples(C.oid) AS n_live_tup, pg_stat_get_dead_tuples(C.oid) AS n_dead_tup, pg_stat_get_mod_since_analyze(C.oid) AS n_mod_since_analyze, @@ -696,7 +697,8 @@ CREATE VIEW pg_stat_xact_all_tables AS pg_stat_get_xact_tuples_inserted(C.oid) AS n_tup_ins, pg_stat_get_xact_tuples_updated(C.oid) AS n_tup_upd, pg_stat_get_xact_tuples_deleted(C.oid) AS n_tup_del, -pg_stat_get_xact_tuples_hot_updated(C.oid) AS n_tup_hot_upd +pg_stat_get_xact_tuples_hot_updated(C.oid) AS n_tup_hot_upd, +pg_stat_get_xact_tuples_newpage_updated(C.oid) AS n_tup_newpage_upd FROM pg_class C LEFT JOIN pg_index I ON C.oid = I.indrelid LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace) diff --git a/src/backend/utils/ac
Re: heapgettup() with NoMovementScanDirection unused in core?
Melanie Plageman writes: > On Fri, Jan 27, 2023 at 05:05:16PM -0500, Tom Lane wrote: >> AFAIR, there is noplace today that depends on the exact encoding >> of ForwardScanDirection and BackwardScanDirection, and I'm not >> sure that we want to introduce such a dependency. > I think you mean the enum value when you say encoding? I actually > started using the ScanDirection value in the refactor of heapgettup() > and heapgettup_pagemode() which I proposed here [1]. Why wouldn't we > want to introduce such a dependency? It's just that in general, depending on the numeric values of an enum isn't a great coding practice. After thinking about it for awhile, I'd be happier if we added something like this to sdir.h, and then used it rather than directly depending on multiplication: /* * Determine the net effect of two direction specifications. * This relies on having ForwardScanDirection = +1, BackwardScanDirection = -1, * and will probably not do what you want if applied to any other values. */ #define CombineScanDirections(a, b) ((a) * (b)) The main thing this'd buy us is being able to grep for uses of the trick. If it's written as just multiplication, good luck being able to find what's depending on that, should you ever need to. >> 4. I don't think the proposed test case is worth the cycles. > Just the one I wrote or any test case? I think that all this code is quite well-tested already, so I'm not sure what's the point of adding another test for it. regards, tom lane
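A self-contained version of the macro under discussion, using David's preferred ScanDirectionCombine name:

```c
#include <assert.h>

/*
 * Minimal standalone version of sdir.h's ScanDirection plus the proposed
 * combining macro.  The trick depends on BackwardScanDirection = -1,
 * NoMovementScanDirection = 0, ForwardScanDirection = +1; any other
 * encoding breaks it.
 */
typedef enum ScanDirection
{
    BackwardScanDirection = -1,
    NoMovementScanDirection = 0,
    ForwardScanDirection = 1
} ScanDirection;

/* Net effect of two direction specifications. */
#define ScanDirectionCombine(a, b)  ((a) * (b))
```

Because NoMovementScanDirection is 0, combining any direction with it yields NoMovementScanDirection, which falls naturally out of the multiplication.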
Re: improving user.c error messages
On Fri, Jan 27, 2023 at 07:31:19PM +0100, Alvaro Herrera wrote: > On 2023-Jan-26, Nathan Bossart wrote: >> ereport(ERROR, >> >> (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), >> - errmsg("permission denied: bootstrap >> user must be superuser"))); >> + errmsg("permission denied to alter >> role"), >> + errdetail("The bootstrap user must be >> superuser."))); > > I think this one isn't using the right errcode; this is not a case of > insufficient privileges. There's no priv you can acquire that lets you > do it. So I'd change it to unsupported operation. I fixed this in v4. I've also attached a second patch in which I've adjusted the messages that Peter mentioned upthread. One thing that feels a bit odd is how some of the DETAILs mention the operation being attempted while others do not. For example, we have ERROR: permission denied to drop role DETAIL: You must have SUPERUSER privilege to drop roles with SUPERUSER. In this case, the DETAIL explains the action that is prohibited. In other cases, we have something like ERROR: permission denied to alter role DETAIL: You must have CREATEROLE privilege and ADMIN OPTION on role "myrole". which does not. I think this is okay because adding "to alter the role" to the end of the DETAIL seems kind of awkward. But in other cases, such as ERROR: permission denied to use replication slots DETAIL: You must have REPLICATION privilege. adding the operation to the end seems less awkward (i.e., "You must have REPLICATION privilege to use replication slots."). I don't think there's any information lost by omitting the action in the DETAIL, so perhaps this is just a stylistic choice. I think I'm inclined to add the action to the DETAIL whenever it doesn't make the message lengthy and awkward, and leave it out otherwise. Thoughts? 
-- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From d369e35e6a44cb9708a08a1f32d1755e04f04de1 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Thu, 26 Jan 2023 11:05:13 -0800 Subject: [PATCH v4 1/2] Improve user.c error messages. --- src/backend/commands/user.c | 168 -- src/test/regress/expected/create_role.out | 77 ++ src/test/regress/expected/dependency.out | 4 + src/test/regress/expected/privileges.out | 23 +-- 4 files changed, 197 insertions(+), 75 deletions(-) diff --git a/src/backend/commands/user.c b/src/backend/commands/user.c index 3a92e930c0..26b533f1be 100644 --- a/src/backend/commands/user.c +++ b/src/backend/commands/user.c @@ -316,23 +316,33 @@ CreateRole(ParseState *pstate, CreateRoleStmt *stmt) if (!has_createrole_privilege(currentUserId)) ereport(ERROR, (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), - errmsg("permission denied to create role"))); + errmsg("permission denied to create role"), + errdetail("You must have %s privilege to create roles.", + "CREATEROLE"))); if (issuper) ereport(ERROR, (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), - errmsg("must be superuser to create superusers"))); + errmsg("permission denied to create role"), + errdetail("You must have %s privilege to create roles with %s.", + "SUPERUSER", "SUPERUSER"))); if (createdb && !have_createdb_privilege()) ereport(ERROR, (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), - errmsg("must have createdb permission to create createdb users"))); + errmsg("permission denied to create role"), + errdetail("You must have %s privilege to create roles with %s.", + "CREATEDB", "CREATEDB"))); if (isreplication && !has_rolreplication(currentUserId)) ereport(ERROR, (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), - errmsg("must have replication permission to create replication users"))); + errmsg("permission denied to create role"), + errdetail("You must have %s privilege to create roles with %s.", + "REPLICATION", "REPLICATION"))); if (bypassrls && !has_bypassrls_privilege(currentUserId)) 
ereport(ERROR, (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), - errmsg("must have bypassrls to create bypassrls users"))); + errmsg("permission denied to create role"), + errdetail("You must have %s privilege to create roles with %s.", + "BYPASSRLS", "BYPASSRLS"))); } /* @@ -744,10 +754,18 @@ AlterRole(ParseState *pstate, AlterRoleStmt *stmt) roleid = authform->oid; /* To mess with a superuser in any way you gotta be superuser. */ - if (!superuser() && (authform->rolsuper || dissuper)) + if (!superuser() && authform->rolsuper) ereport(ERROR, (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), - errmsg("must be superuser to alter superuser roles or change superuser attribute"))); + errmsg("permission denied to alter
Re: Optimizing PostgreSQL with LLVM's PGO+LTO
Hi, On 2023-01-27 15:06:37 -0500, Tom Lane wrote: > There are a lot of places where we're implicitly relying on > cross-compilation-unit optimizations NOT happening, because the code isn't > adequately decorated with memory barriers and the like. We have a fallback compiler barrier implementation doing that, but it shouldn't be used on any halfway reasonable compiler. Cross-compilation-unit calls don't provide a memory barrier - I assume you're thinking about a compiler barrier? I'm sure we have a few places that aren't that careful, but I would hope it's not a large number. Are you thinking of specific "patterns" we've repeated all over, or just a few cases you recall? Greetings, Andres Freund
Re: Optimizing PostgreSQL with LLVM's PGO+LTO
Hi, On 2023-01-27 10:05:09 -0700, João Paulo Labegalini de Carvalho wrote: > I am investigating the benefits of different profile-guided optimizations > (PGO) and link-time optimizations (LTO) versus binary optimizers (e.g. > BOLT) for applications such as PostgreSQL. > > I am facing issues when applying LTO to PostgreSQL as the produced binary > seems broken (the server dies quickly after it has started). This is > definitely a compiler bug, but I was wondering if anyone here have > experimented with LTO for PostgreSQL. What compiler / version / flags / OS did you try? FWIW, I've experimented with LTO and PGO a bunch, both with gcc and clang. I did hit a crash in gcc, but that did turn out to be a compiler bug, and actually reduced to something not even needing LTO. I saw quite substantial speedups with PGO, but I only tested very specific workloads. IIRC it was >15% gain in concurrent readonly pgbench. I dimly recall failing to get some benefit out of bolt for some reason that I unfortunately don't even vaguely recall. Greetings, Andres Freund
Re: Non-superuser subscription owners
> On Jan 27, 2023, at 1:35 PM, Robert Haas wrote: > >> I started out asking what benefits it provides to own a subscription one >> cannot modify. But I think it is a good capability to have, to restrict the >> set of relations that replication could target. Although perhaps it'd be >> better to set the "replay user" as a separate property on the subscription? > > That's been proposed previously, but for reasons I don't quite > remember it seems not to have happened. I don't think it achieved > consensus. > >> Is owning a subscription one isn't allowed to modify useful outside of >> that? > > Uh, possibly that's a question for Mark or Jeff. I don't know. I can't > see what they would be, but I just work here. If the owner cannot modify the subscription, then the owner degenerates into a mere "run-as" user. Note that this isn't how things work now, and even if we disallowed owners from modifying the connection string, there would still be other attributes the owner could modify, such as the set of publications subscribed. More generally, my thinking on this thread is that there need to be two non-superuser roles: a higher-privileged role which can create a subscription, and a lower-privileged role serving the "run-as" function. Those shouldn't be the same, because the "run-as" concept doesn't logically need to have subscription creation power, and likely *shouldn't* have that power. Depending on which sorts of attributes a subscription object has, such as the connection string, the answer differs for whether the owner/"run-as" user should get to change those attributes. One advantage of Jeff's idea of using a server object rather than a string is that it becomes more plausibly safe to allow the subscription owner to make changes to that attribute of the subscription. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: heapgettup() with NoMovementScanDirection unused in core?
On Fri, Jan 27, 2023 at 05:05:16PM -0500, Tom Lane wrote:
> Melanie Plageman writes:
>> I have written the patch to remove the unreachable code in
>> heapgettup() and heapgettup_pagemode().
>
> A few thoughts ...
>
> 1. Do we really need quite so many Asserts? I'd kind of lean
> to just having one, at some higher level of the executor.

Yes, perhaps I was a bit overzealous putting them in functions called for every tuple. I'm not sure where in the executor would make the most sense. ExecInitSeqScan() comes to mind, but I'm not sure that covers all of the desired cases.

> 2. I'm not sure if we want to do this:
>
> - direction = estate->es_direction;
> - /* flip direction if this is an overall backward scan */
> - if (ScanDirectionIsBackward(((IndexScan *) node->ss.ps.plan)->indexorderdir))
> - {
> -     if (ScanDirectionIsForward(direction))
> -         direction = BackwardScanDirection;
> -     else if (ScanDirectionIsBackward(direction))
> -         direction = ForwardScanDirection;
> - }
> + direction = estate->es_direction * ((IndexScan *) node->ss.ps.plan)->indexorderdir;
>
> AFAIR, there is noplace today that depends on the exact encoding
> of ForwardScanDirection and BackwardScanDirection, and I'm not
> sure that we want to introduce such a dependency. If we do it
> at least deserves a comment here, and you probably ought to adjust
> the wishy-washy comment in sdir.h as well. Taking out the existing
> comment explaining what this code is doing is not an improvement
> either.

I think you mean the enum value when you say encoding? I actually started using the ScanDirection value in the refactor of heapgettup() and heapgettup_pagemode() which I proposed here [1]. Why wouldn't we want to introduce such a dependency?

> 3. You didn't update the header comment for heapgettup, nor the
> one in pathnodes.h for IndexPath.indexscandir.

Oops -- thanks for catching those!

> 4. I don't think the proposed test case is worth the cycles.

Just the one I wrote or any test case?
- Melanie [1] https://www.postgresql.org/message-id/flat/CAAKRu_ZJg_N7zHtWP%2BJoSY_hrce4%2BGKioL137Y2c2En-kuXQ7g%40mail.gmail.com#8a106c6625bc069cf439230cd9fa1000
Re: heapgettup() with NoMovementScanDirection unused in core?
Melanie Plageman writes:
> I have written the patch to remove the unreachable code in
> heapgettup() and heapgettup_pagemode().

A few thoughts ...

1. Do we really need quite so many Asserts? I'd kind of lean to just having one, at some higher level of the executor.

2. I'm not sure if we want to do this:

- direction = estate->es_direction;
- /* flip direction if this is an overall backward scan */
- if (ScanDirectionIsBackward(((IndexScan *) node->ss.ps.plan)->indexorderdir))
- {
-     if (ScanDirectionIsForward(direction))
-         direction = BackwardScanDirection;
-     else if (ScanDirectionIsBackward(direction))
-         direction = ForwardScanDirection;
- }
+ direction = estate->es_direction * ((IndexScan *) node->ss.ps.plan)->indexorderdir;

AFAIR, there is noplace today that depends on the exact encoding of ForwardScanDirection and BackwardScanDirection, and I'm not sure that we want to introduce such a dependency. If we do it at least deserves a comment here, and you probably ought to adjust the wishy-washy comment in sdir.h as well. Taking out the existing comment explaining what this code is doing is not an improvement either.

3. You didn't update the header comment for heapgettup, nor the one in pathnodes.h for IndexPath.indexscandir.

4. I don't think the proposed test case is worth the cycles.

regards, tom lane
Re: Non-superuser subscription owners
Hi,

On 2023-01-27 16:35:11 -0500, Robert Haas wrote:
>> Maybe a daft question:
>>
>> Have we considered using a "normal grant", e.g. on the database, instead of a
>> role? Could it e.g. be useful to grant a user the permission to create a
>> subscription in one database, but not in another?
>
> Potentially, but I didn't think we'd want to burn through permissions
> bits that fast, even given 7b378237aa805711353075de142021b1d40ff3b0.
> Still, if the consensus is otherwise, I can change it.

I don't really have an opinion on what's better. I looked briefly into whether there was discussion around this but I didn't see anything.

pg_create_subscription feels a bit different than most of the other pg_* roles. For most of those there is no schema object to tie permissions to. But here there is.

But I think there's good arguments against a GRANT approach, too. GRANT ALL ON DATABASE would suddenly be dangerous. How does it interact with database ownership? Etc.

> Then I guess we'd end up with GRANT CREATE ON DATABASE and GRANT CREATE
> SUBSCRIPTION ON DATABASE, which I'm sure wouldn't be confusing at all.

Heh. I guess it could just be GRANT SUBSCRIBE.

> Or, another thought, maybe this should be checking for CREATE on the
> current database + also pg_create_subscription. That seems like it
> might be the right idea, actually.

Yes, that seems like a good idea.

>> The subscription code already does ownership checks before locking and now
>> there's also the passwordrequired before. Is it possible that this could open
>> up some sort of race? Could e.g. the user change the ownership to the
>> superuser in one session, do an ALTER in the other?
>>
>> It looks like your change won't increase the danger of that, as the
>> superuser() check just checks the current users permissions.
>
> I'm not entirely clear whether there's a hazard there.

I'm not at all either.
It's just a code pattern that makes me anxious - I suspect there's a few places it makes us more vulnerable. > If there is, I think we could fix it by moving the LockSharedObject call up > higher, above object_ownercheck. The only problem with that is it lets you > lock an object on which you have no permissions: see > 2ad36c4e44c8b513f6155656e1b7a8d26715bb94. To really fix that, we'd need an > analogue of RangeVarGetRelidExtended. Yea, we really should have something like RangeVarGetRelidExtended() for other kinds of objects. It'd take a fair bit of work / time to use it widely, but it'll take even longer if we start in 5 years ;) Perhaps the bulk of RangeVarGetRelidExtended() could be generalized by having a separate name->oid lookup callback? Greetings, Andres Freund
Re: heapgettup() with NoMovementScanDirection unused in core?
Hi,

I have written the patch to remove the unreachable code in heapgettup() and heapgettup_pagemode().

On Wed, Jan 25, 2023 at 10:02 AM Tom Lane wrote:
>
> I wonder if we couldn't also get rid of this confusingly-inconsistent
> alternative usage in the planner:
>
> * 'indexscandir' is one of:
> *    ForwardScanDirection: forward scan of an ordered index
> *    BackwardScanDirection: backward scan of an ordered index
> *    NoMovementScanDirection: scan of an unordered index, or don't care
> * (The executor doesn't care whether it gets ForwardScanDirection or
> * NoMovementScanDirection for an indexscan, but the planner wants to
> * distinguish ordered from unordered indexes for building pathkeys.)
>
> While that comment's claim is plausible, I think it's been wrong for
> years. AFAICS indxpath.c makes pathkeys before it ever does this:
>
> index_is_ordered ?
> ForwardScanDirection :
> NoMovementScanDirection,
>
> and nothing depends on that later either. So I think we could
> simplify this to something like "indexscandir is either
> ForwardScanDirection or BackwardScanDirection. (Unordered
> index types need not support BackwardScanDirection.)"

I also did what I *think* Tom is suggesting here -- make index scan's scan direction always forward or backward... Maybe the set should be two patches...dunno.

- Melanie

From be8119af72499aec03e59200c9db44f8520ccebf Mon Sep 17 00:00:00 2001
From: Melanie Plageman
Date: Fri, 27 Jan 2023 14:02:03 -0500
Subject: [PATCH v1] Remove NoMovementScanDirection inconsistencies

Remove use of NoMovementScanDirection for index scans. Unordered indexes will now always use ForwardScanDirection.

Remove unreachable code in heapgettup() and heapgettup_pagemode() handling NoMovementScanDirection.

Add assertions in case table AMs try to use scan directions other than forward and backward.
Discussion: https://www.postgresql.org/message-id/flat/CAAKRu_bvkhka0CZQun28KTqhuUh5ZqY%3D_T8QEqZqOL02rpi2bw%40mail.gmail.com --- src/backend/access/heap/heapam.c | 73 src/backend/commands/explain.c | 3 - src/backend/executor/nodeIndexonlyscan.c | 17 +++--- src/backend/executor/nodeIndexscan.c | 17 +++--- src/backend/optimizer/path/indxpath.c| 8 +-- src/backend/optimizer/util/pathnode.c| 5 +- src/include/access/tableam.h | 6 ++ src/test/regress/expected/hash_index.out | 27 + src/test/regress/sql/hash_index.sql | 23 9 files changed, 87 insertions(+), 92 deletions(-) diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index e6024a980b..be0555f5c4 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -523,6 +523,9 @@ heapgettup(HeapScanDesc scan, int linesleft; ItemId lpp; + Assert(dir == ForwardScanDirection || + dir == BackwardScanDirection); + /* * calculate next starting lineoff, given scan direction */ @@ -583,7 +586,7 @@ heapgettup(HeapScanDesc scan, linesleft = lines - lineoff + 1; } - else if (backward) + else { /* backward parallel scan not supported */ Assert(scan->rs_base.rs_parallel == NULL); @@ -653,34 +656,6 @@ heapgettup(HeapScanDesc scan, linesleft = lineoff; } - else - { - /* - * ``no movement'' scan direction: refetch prior tuple - */ - if (!scan->rs_inited) - { - Assert(!BufferIsValid(scan->rs_cbuf)); - tuple->t_data = NULL; - return; - } - - block = ItemPointerGetBlockNumber(&(tuple->t_self)); - if (block != scan->rs_cblock) - heapgetpage((TableScanDesc) scan, block); - - /* Since the tuple was previously fetched, needn't lock page here */ - page = BufferGetPage(scan->rs_cbuf); - TestForOldSnapshot(snapshot, scan->rs_base.rs_rd, page); - lineoff = ItemPointerGetOffsetNumber(&(tuple->t_self)); - lpp = PageGetItemId(page, lineoff); - Assert(ItemIdIsNormal(lpp)); - - tuple->t_data = (HeapTupleHeader) PageGetItem(page, lpp); - tuple->t_len = ItemIdGetLength(lpp); - - return; - } /* * 
advance the scan until we find a qualifying tuple or run out of stuff @@ -861,6 +836,9 @@ heapgettup_pagemode(HeapScanDesc scan, int linesleft; ItemId lpp; + Assert(dir == ForwardScanDirection || + dir == BackwardScanDirection); + /* * calculate next starting lineindex, given scan direction */ @@ -918,7 +896,7 @@ heapgettup_pagemode(HeapScanDesc scan, linesleft = lines - lineindex; } - else if (backward) + else { /* backward parallel scan not supported */ Assert(scan->rs_base.rs_parallel == NULL); @@ -978,38 +956,6 @@ heapgettup_pagemode(HeapScanDesc scan, linesleft = lineindex + 1; } - else - { - /* - * ``no movement'' scan direction: refetch prior tuple - */ - if (!scan->rs_inited) - { - Assert(!BufferIsValid(scan->rs_cbuf)); - tuple->t_data = NULL; - return; - } - - block = ItemPo
Re: Non-superuser subscription owners
On Fri, Jan 27, 2023 at 4:09 PM Andres Freund wrote:
> Hm, compared to postgres_fdw, the user has far less control over what's
> happening using that connection. Is there a way a subscription owner can
> trigger evaluation of near-arbitrary SQL on the publisher side?

I'm not aware of one, but what I think it would let you do is exfiltrate data you're not entitled to see.

> I started out asking what benefits it provides to own a subscription one
> cannot modify. But I think it is a good capability to have, to restrict the
> set of relations that replication could target. Although perhaps it'd be
> better to set the "replay user" as a separate property on the subscription?

That's been proposed previously, but for reasons I don't quite remember it seems not to have happened. I don't think it achieved consensus.

> Is owning a subscription one isn't allowed to modify useful outside of that?

Uh, possibly that's a question for Mark or Jeff. I don't know. I can't see what they would be, but I just work here.

> Maybe a daft question:
>
> Have we considered using a "normal grant", e.g. on the database, instead of a
> role? Could it e.g. be useful to grant a user the permission to create a
> subscription in one database, but not in another?

Potentially, but I didn't think we'd want to burn through permissions bits that fast, even given 7b378237aa805711353075de142021b1d40ff3b0. Still, if the consensus is otherwise, I can change it. Then I guess we'd end up with GRANT CREATE ON DATABASE and GRANT CREATE SUBSCRIPTION ON DATABASE, which I'm sure wouldn't be confusing at all.

Or, another thought, maybe this should be checking for CREATE on the current database + also pg_create_subscription. That seems like it might be the right idea, actually.

> The subscription code already does ownership checks before locking and now
> there's also the passwordrequired before. Is it possible that this could open
> up some sort of race? Could e.g. the user change the ownership to the
> superuser in one session, do an ALTER in the other?
>
> It looks like your change won't increase the danger of that, as the
> superuser() check just checks the current users permissions.

I'm not entirely clear whether there's a hazard there. If there is, I think we could fix it by moving the LockSharedObject call up higher, above object_ownercheck. The only problem with that is it lets you lock an object on which you have no permissions: see 2ad36c4e44c8b513f6155656e1b7a8d26715bb94. To really fix that, we'd need an analogue of RangeVarGetRelidExtended.

> and throwing an error like that will at the very least leak the connection,
> fd, fd reservation. Which I just had fixed :). At the very least you'd need to
> copy the stuff that "bad_connection:" does.

OK.

> I did wonder whether we should make libpqrcv_connect() use errsave() to return
> errors. Or whether we should make libpqrcv register a memory context reset
> callback that'd close the libpq connection.

Yeah. Using errsave() might be better, but not sure I want to tackle that just now.

> That is a somewhat odd API. Why does it throw for some things, but not
> others? Seems a bit cleaner to pass in a parameter indicating whether it
> should throw when not finding a password? Particularly because you already
> pass that to walrcv_connect().

Will look into that.

--
Robert Haas
EDB: http://www.enterprisedb.com
Re: Syncrep and improving latency due to WAL throttling
Hi,

On 2023-01-27 21:45:16 +0100, Tomas Vondra wrote:
> On 1/27/23 08:18, Bharath Rupireddy wrote:
>>> I think my idea of only forcing to flush/wait an LSN some distance in the past
>>> would automatically achieve that?
>>
>> I'm sorry, I couldn't get your point, can you please explain it a bit more?
>
> The idea is that we would not flush the exact current LSN, because
> that's likely somewhere in the page, and we always write the whole page
> which leads to write amplification.
>
> But if we backed off a bit, and wrote e.g. to the last page boundary,
> that wouldn't have this issue (either the page was already flushed -
> noop, or we'd have to flush it anyway).

Yep.

> We could even back off a bit more, to increase the probability it was
> actually flushed / sent to standby.

That's not the sole goal, from my end: I'd like to avoid writing out + flushing the WAL in too small chunks. Imagine a few concurrent vacuums or COPYs or such - if we're unlucky they'd each end up exceeding their "private" limit close to each other, leading to a number of small writes of the WAL. Which could end up increasing local commit latency / iops.

If we instead decide to only ever flush up to something like

last_page_boundary - 1/8 * throttle_pages * XLOG_BLCKSZ

we'd make sure that the throttling mechanism won't cause a lot of small writes.

>> Keeping replication lag under check enables one to provide a better
>> RPO guarantee as discussed in the other thread
>> https://www.postgresql.org/message-id/CAHg%2BQDcO_zhgBCMn5SosvhuuCoJ1vKmLjnVuqUEOd4S73B1urw%40mail.gmail.com.
>
> Isn't that a bit over-complicated? RPO generally only cares about xacts
> that committed (because that's what you want to not lose), so why not simply
> introduce a "sync mode" that simply uses a bit older LSN when
> waiting for the replica? Seems much simpler and similar to what we
> already do.

I don't think that really helps you that much. If there's e.g. a huge VACUUM / COPY emitting loads of WAL you'll suddenly see the commit latency of concurrently committing transactions spike into oblivion. Whereas a general WAL throttling mechanism would throttle the VACUUM, without impacting the commit latency of normal transactions.

Greetings,

Andres Freund
Re: Syncrep and improving latency due to WAL throttling
Hi, On 2023-01-27 12:48:43 +0530, Bharath Rupireddy wrote: > Looking at the patch, the feature, in its current shape, focuses on > improving replication lag (by throttling WAL on the primary) only when > synchronous replication is enabled. Why is that? Why can't we design > it for replication in general (async, sync, and logical replication)? > > Keeping replication lag under check enables one to provide a better > RPO guarantee as discussed in the other thread > https://www.postgresql.org/message-id/CAHg%2BQDcO_zhgBCMn5SosvhuuCoJ1vKmLjnVuqUEOd4S73B1urw%40mail.gmail.com. I think something narrower and easier to achieve is a good thing. We've already had loads of discussion for the more general problem, without a lot of actual progress. Greetings, Andres Freund
Re: Syncrep and improving latency due to WAL throttling
Hi,

On 2023-01-27 12:06:49 +0100, Jakub Wartak wrote:
> On Thu, Jan 26, 2023 at 4:49 PM Andres Freund wrote:
>> Huh? Why did you remove the GUC?
>
> After reading previous threads, my optimism level of getting it ever
> in shape of being widely accepted degraded significantly (mainly due
> to the discussion of wider category of 'WAL I/O throttling' especially
> in async case, RPO targets in async case and potentially calculating
> global bandwidth).

I think it's quite reasonable to limit this to a smaller scope. Particularly because those other goals are pretty vague but ambitious goals. IMO the problem with a lot of the threads is precisely that they aimed at a level of generality that isn't achievable in one step.

> I've assumed that it is a working sketch, and as such not having GUC name
> right now (just for sync case) would still allow covering various other
> async cases in future without breaking compatibility with potential GUC name
> changes (see my previous "wal_throttle_larger_transactions="
> proposal).

It's harder to benchmark something like this without a GUC, so I think it's worth having, even if it's not the final name.

>> SyncRepWaitForLSN() has this comment:
>> * 'lsn' represents the LSN to wait for. 'commit' indicates whether this LSN
>> * represents a commit record. If it doesn't, then we wait only for the WAL
>> * to be flushed if synchronous_commit is set to the higher level of
>> * remote_apply, because only commit records provide apply feedback.
>
> Hm, not sure if I understand: are you saying that we should (in the
> throttled scenario) have some special feedback msgs or not --
> irrespective of the setting? To be honest the throttling shouldn't
> wait for the standby full setting, it's just about slowdown fact (so
> IMHO it would be fine even in remote_write/remote_apply scenario if
> the remote walreceiver just received the data, not necessarily write
> it into file or wait for applying it).
Just this waiting for a > round-trip ack about LSN progress would be enough to slow down the > writer (?). I've added some timing log into the draft and it shows > more or less constantly solid RTT even as it stands: My problem was that the function header for SyncRepWaitForLSN() seems to say that we don't wait at all if commit=false and synchronous_commit < remote_apply. But I think that might just be bad formulation. [...] 'commit' indicates whether this LSN * represents a commit record. If it doesn't, then we wait only for the WAL * to be flushed if synchronous_commit is set to the higher level of * remote_apply, because only commit records provide apply feedback. because the code does something entirely different afaics: /* Cap the level for anything other than commit to remote flush only. */ if (commit) mode = SyncRepWaitMode; else mode = Min(SyncRepWaitMode, SYNC_REP_WAIT_FLUSH); Greetings, Andres Freund
Re: Non-superuser subscription owners
Hi,

On 2023-01-27 14:42:01 -0500, Robert Haas wrote:
> At first, I found it a bit tempting to see this as a further
> indication that the force-a-password approach is not the right idea,
> because the test case clearly memorializes a desire *not* to require a
> password in this situation. However, the loopback-to-superuser attack
> is just as viable for subscription as it is in other cases, and my
> previous patch would have done nothing to block it.

Hm, compared to postgres_fdw, the user has far less control over what's happening using that connection. Is there a way a subscription owner can trigger evaluation of near-arbitrary SQL on the publisher side?

> So what I did instead is add a password_required attribute, just like what
> postgres_fdw has. As in the case of postgres_fdw, the actual rule is that if
> the attribute is false, a password is not required, and if the attribute is
> true, a password is required unless you are a superuser. If you're a
> superuser, it still isn't. This is a slightly odd set of semantics but it
> has precedent and practical advantages. Also, as in the case of
> postgres_fdw, only a superuser can set password_required=false, and a
> subscription that has that setting can only be modified by a superuser, no
> matter who owns it.

I started out asking what benefits it provides to own a subscription one cannot modify. But I think it is a good capability to have, to restrict the set of relations that replication could target. Although perhaps it'd be better to set the "replay user" as a separate property on the subscription?

Is owning a subscription one isn't allowed to modify useful outside of that?

> Even though I hate the require-a-password stuff with the intensity of
> a thousand suns, I think this is better than the previous patch,
> because it's more consistent with what we do elsewhere and because it
> blocks the loopback-connection-to-superuser attack.
I think we > *really* need to develop a better system for restricting proxied > connections (no matter how proxied) and I hope that we do that soon. > But inventing something for this purpose that differs from what we do > elsewhere will make that task harder, not easier. > > Thoughts? I think it's reasonable to mirror behaviour from elsewhere, and it'd let us have this feature relatively soon - I think it's a common need to do this as a non-superuser. It's IMO a very good idea to not subscribe as a superuser, even if set up by a superuser... But I also would understand if you / somebody else chose to focus on implementing a less nasty connection model. > Subject: [PATCH v2] Add new predefined role pg_create_subscriptions. Maybe a daft question: Have we considered using a "normal grant", e.g. on the database, instead of a role? Could it e.g. be useful to grant a user the permission to create a subscription in one database, but not in another? > @@ -1039,6 +1082,16 @@ AlterSubscription(ParseState *pstate, > AlterSubscriptionStmt *stmt, > > sub = GetSubscription(subid, false); > > + /* > + * Don't allow non-superuser modification of a subscription with > + * password_required=false. > + */ > + if (!sub->passwordrequired && !superuser()) > + ereport(ERROR, > + (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), > + > errmsg("password_required=false is superuser-only"), > + errhint("Subscriptions with > the password_required option set to false may only be created or modified by > the superuser."))); > + > /* Lock the subscription so nobody else can do anything with it. */ > LockSharedObject(SubscriptionRelationId, subid, 0, AccessExclusiveLock); The subscription code already does ownership checks before locking and now there's also the passwordrequired before. Is it possible that this could open up some sort of race? Could e.g. the user change the ownership to the superuser in one session, do an ALTER in the other? 
It looks like your change won't increase the danger of that, as the superuser() check just checks the current users permissions. > @@ -180,6 +180,13 @@ libpqrcv_connect(const char *conninfo, bool logical, > const char *appname, > if (PQstatus(conn->streamConn) != CONNECTION_OK) > goto bad_connection_errmsg; > > + if (must_use_password && !PQconnectionUsedPassword(conn->streamConn)) > + ereport(ERROR, > + > (errcode(ERRCODE_S_R_E_PROHIBITED_SQL_STATEMENT_ATTEMPTED), > + errmsg("password is required"), > + errdetail("Non-superuser cannot connect if the > server does not request a password."), > + errhint("Target server's authentication method > must be changed."))); > + The documentation of libpqrcv_connect() says that: * Returns NULL on error and fills the err with pal
Re: postgres_fdw, dblink, and CREATE SUBSCRIPTION security
On Wed, Jan 25, 2023 at 6:22 PM Jacob Champion wrote: > Sure: Ambient authority [1] means that something is granted access based > on some aspect of its existence that it can't remove (or even > necessarily enumerate). Up above, when you said "I cannot choose not to > be myself," that's a clear marker that ambient authority is involved. > Examples of ambient authn/z factors might include an originating IP > address, the user ID of a connected peer process, the use of a loopback > interface, a GPS location, and so on. So 'peer' and 'ident' are ambient > authentication methods. OK. > 1) Forwarding the original ambient context along with the request, so > the server can check it too. Right, so a protocol extension. Reasonable idea, but a big lift. Not only do you need everyone to be running a new enough version of PostgreSQL, but existing proxies like pgpool and pgbouncer need updates, too. > 2) Explicitly combining the request with the proof of authority needed > to make it, as in capability-based security [3]. As far as I can see, that link doesn't address how you'd make this approach work across a network. > 3) Dropping as many implicitly-held privileges as possible before making > a request. This doesn't solve the problem but may considerably reduce > the practical attack surface. Right. I definitely don't object to this kind of approach, but I don't think it can ever be sufficient by itself. > > e.g. > > > > all all all local all all - deny # block access through UNIX sockets > > all all all 127.0.0.0/8 all all - deny # block loopback interface via IPv4 > > > > Or: > > > > postgres_fdw all all all all all authentication=cleartext,md5,sasl > > allow # allow postgres_fdw with password-ish authentication > > I think this style focuses on absolute configuration flexibility at the > expense of usability. It obfuscates the common use cases. (I have the > exact same complaint about our HBA and ident configs, so I may be > fighting uphill.) 
That's probably somewhat true, but on the other hand, it also is more powerful than what you're describing. In your system, is there some way the DBA can say "hey, you can connect to any of the machines on this list of subnets, but nothing else"? Or equally, "hey, you may NOT connect to any machine on this list of subnets, but anything else is fine"? Or "you can connect to these subnets without SSL, but if you want to talk to anything else, you need to use SSL"? I would feel a bit bad saying that those are just use cases we don't care about. Most people likely wouldn't use that kind of flexibility, so maybe it doesn't really matter, but it seems kind of nice to have.

Your idea seems to rely on us being able to identify all of the policies that a user is likely to want and give names to each one, and I don't feel very confident that that's realistic. But maybe I'm misinterpreting your idea?

--
Robert Haas
EDB: http://www.enterprisedb.com
Re: Syncrep and improving latency due to WAL throttling
On 1/27/23 08:18, Bharath Rupireddy wrote:
> On Thu, Jan 26, 2023 at 9:21 PM Andres Freund wrote:
>>
>>> 7. I think we need to not let backends throttle too frequently even
>>> though they have crossed wal_throttle_threshold bytes. The best way is
>>> to rely on replication lag, after all the goal of this feature is to
>>> keep replication lag under check - say, throttle only when
>>> wal_distance > wal_throttle_threshold && replication_lag >
>>> wal_throttle_replication_lag_threshold.
>>
>> I think my idea of only forcing to flush/wait an LSN some distance in the past
>> would automatically achieve that?
>
> I'm sorry, I couldn't get your point, can you please explain it a bit more?

The idea is that we would not flush the exact current LSN, because that's likely somewhere in the page, and we always write the whole page which leads to write amplification.

But if we backed off a bit, and wrote e.g. to the last page boundary, that wouldn't have this issue (either the page was already flushed - noop, or we'd have to flush it anyway). We could even back off a bit more, to increase the probability it was actually flushed / sent to standby.

That would still work, because the whole point is not to allow one process to generate too much unflushed WAL, forcing the other (small) xacts to wait at commit.

Imagine we have the limit set to 8MB, i.e. the backend flushes WAL after generating 8MB of WAL. If we flush to the exact current LSN, the other backends will wait for ~4MB on average. If we back off to 1MB, the wait average increases to ~4.5MB. (This is simplified, as it ignores WAL from the small xacts. But those flush regularly, which limits the amount. It also ignores that there might be multiple large xacts.)

> Looking at the patch, the feature, in its current shape, focuses on
> improving replication lag (by throttling WAL on the primary) only when
> synchronous replication is enabled. Why is that?
> Why can't we design it for replication in general (async, sync, and logical replication)?

This focuses on sync rep, because that's where the commit latency comes from. Async doesn't have that issue, because it doesn't wait for the standby. In particular, the trick is in penalizing the backends generating a lot of WAL, while leaving the small xacts alone.

> Keeping replication lag under check enables one to provide a better
> RPO guarantee as discussed in the other thread
> https://www.postgresql.org/message-id/CAHg%2BQDcO_zhgBCMn5SosvhuuCoJ1vKmLjnVuqUEOd4S73B1urw%40mail.gmail.com.

Isn't that a bit over-complicated? RPO generally only cares about xacts that committed (because that's what you want to not lose), so why not simply introduce a "sync mode" that simply uses a bit older LSN when waiting for the replica? Seems much simpler and similar to what we already do.

Yeah, if someone generates a lot of WAL in an uncommitted transaction, all of that may be lost. But who cares (from the RPO point of view)?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: GUCs to control abbreviated sort keys
On Fri, 2023-01-27 at 11:41 -0800, Peter Geoghegan wrote:
> I cannot recreate the issue you describe.

Interesting. For my test:

glibc 2.35
ICU 70.1
gcc 11.3.0
LLVM 14.0.0

> It's not impossible that the perl program you wrote produces
> non-deterministic output

It is non-deterministic, but I tried with two generated files, and got similar results. Right now I suspect the ICU version might be the reason. I'll try with 72.

--
Jeff Davis
PostgreSQL Contributor Team - AWS
Re: Authentication fails for md5 connections if ~/.postgresql/postgresql.{crt and key} exist
> I think the sslcertmode=disable option that I introduced in [1] solves
> this issue too; would it work for your case? That whole patchset is
> meant to tackle the general case of the problem you've described.
>
> (Eventually I'd like to teach the server not to ask for a client
> certificate if it's not going to use it.)

There is an option in pg_hba.conf on the server side called "clientcert" that can be specified besides the auth method and controls whether certain client connections are required to send a client certificate for additional verification. The value of "clientcert" can be "verify-ca" or "verify-full". For example:

hostssl all all 127.0.0.1/32 md5 clientcert=verify-full

If clientcert is not requested by the server, but the client still sends the certificate, the server will still verify it. This is the case in this discussion. I agree that it is a more elegant approach to add "sslcertmode=disable" on the client side to prevent sending the default certificate. But, if the server does request a client certificate and the client uses "sslcertmode=disable" to connect and does not give a certificate, it would also result in authentication failure. In this case, we actually would want to ignore "sslcertmode=disable" and send default certificates if found.

It would perhaps be better to change the parameter to "defaultclientcert=on-demand" on the client side that will:

1. not send the existing default certificate if the server does not request a certificate
2. send the existing default certificate if the server does request a certificate while the client does not use the "sslcert" parameter to specify another non-default certificate

I put "default" in the parameter name to indicate that it only applies to the default certificate. If the user specifies a non-default certificate using the "sslcert" parameter, "defaultclientcert" should not be used, and the client should give an error if both exist.

Cary Huang
HighGo Software Canada
www.highgo.ca
Re: Optimizing PostgreSQL with LLVM's PGO+LTO
=?UTF-8?Q?Jo=C3=A3o_Paulo_Labegalini_de_Carvalho?= writes: > I am facing issues when applying LTO to PostgreSQL as the produced binary > seems broken (the server dies quickly after it has started). This is > definitely a compiler bug, but I was wondering if anyone here have > experimented with LTO for PostgreSQL. There are a lot of places where we're implicitly relying on cross-compilation-unit optimizations NOT happening, because the code isn't adequately decorated with memory barriers and the like. So I wouldn't necessarily assume that the misbehavior you're seeing represents anything that the compiler folks would consider a bug. In the long run we might be interested in trying to make this work better, but I don't know of anyone working on it now. regards, tom lane
Re: Set arbitrary GUC options during initdb
On Fri, Jan 27, 2023 at 3:02 PM Tom Lane wrote: > The string-hacking was fully as tedious as I expected. However, the > output looks pretty nice, and this does have the advantage that the > pre-programmed substitutions become a lot more robust: they are no > longer dependent on the initdb code exactly matching what is in > postgresql.conf.sample. Awesome! Thank you very much. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Set arbitrary GUC options during initdb
I wrote: >>> Anyway, it seems like I gotta work harder. I'll produce a >>> new patch. The string-hacking was fully as tedious as I expected. However, the output looks pretty nice, and this does have the advantage that the pre-programmed substitutions become a lot more robust: they are no longer dependent on the initdb code exactly matching what is in postgresql.conf.sample. regards, tom lane diff --git a/doc/src/sgml/ref/initdb.sgml b/doc/src/sgml/ref/initdb.sgml index 5b2bdac101..724188b1b5 100644 --- a/doc/src/sgml/ref/initdb.sgml +++ b/doc/src/sgml/ref/initdb.sgml @@ -433,6 +433,23 @@ PostgreSQL documentation Other, less commonly used, options are also available: + + -c name=value + --set name=value + + +Forcibly set the server parameter name +to value during initdb, +and also install that setting in the +generated postgresql.conf file, +so that it will apply during future server runs. +This option can be given more than once to set several parameters. +It is primarily useful when the environment is such that the server +will not start at all using the default parameters. 
+ + + + -d --debug diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c index 7a58c33ace..c955c0837e 100644 --- a/src/bin/initdb/initdb.c +++ b/src/bin/initdb/initdb.c @@ -48,6 +48,7 @@ #include "postgres_fe.h" +#include #include #include #include @@ -82,6 +83,13 @@ /* Ideally this would be in a .h file, but it hardly seems worth the trouble */ extern const char *select_default_timezone(const char *share_path); +/* simple list of strings */ +typedef struct _stringlist +{ + char *str; + struct _stringlist *next; +} _stringlist; + static const char *const auth_methods_host[] = { "trust", "reject", "scram-sha-256", "md5", "password", "ident", "radius", #ifdef ENABLE_GSS @@ -142,6 +150,8 @@ static char *pwfilename = NULL; static char *superuser_password = NULL; static const char *authmethodhost = NULL; static const char *authmethodlocal = NULL; +static _stringlist *extra_guc_names = NULL; +static _stringlist *extra_guc_values = NULL; static bool debug = false; static bool noclean = false; static bool noinstructions = false; @@ -242,7 +252,10 @@ static char backend_exec[MAXPGPATH]; static char **replace_token(char **lines, const char *token, const char *replacement); - +static char **replace_guc_value(char **lines, +const char *guc_name, const char *guc_value, +bool mark_as_comment); +static bool guc_value_requires_quotes(const char *guc_value); static char **readfile(const char *path); static void writefile(char *path, char **lines); static FILE *popen_check(const char *command, const char *mode); @@ -253,6 +266,7 @@ static void check_input(char *path); static void write_version_file(const char *extrapath); static void set_null_conf(void); static void test_config_settings(void); +static bool test_specific_config_settings(int test_conns, int test_buffs); static void setup_config(void); static void bootstrap_template1(void); static void setup_auth(FILE *cmdfd); @@ -360,9 +374,34 @@ escape_quotes_bki(const char *src) } /* - * make a copy of the array of lines, 
with token replaced by replacement + * Add an item at the end of a stringlist. + */ +static void +add_stringlist_item(_stringlist **listhead, const char *str) +{ + _stringlist *newentry = pg_malloc(sizeof(_stringlist)); + _stringlist *oldentry; + + newentry->str = pg_strdup(str); + newentry->next = NULL; + if (*listhead == NULL) + *listhead = newentry; + else + { + for (oldentry = *listhead; oldentry->next; oldentry = oldentry->next) + /* skip */ ; + oldentry->next = newentry; + } +} + +/* + * Make a copy of the array of lines, with token replaced by replacement * the first time it occurs on each line. * + * The original data structure is not changed, but we share any unchanged + * strings with it. (This definition lends itself to memory leaks, but + * we don't care too much about leaks in this program.) + * * This does most of what sed was used for in the shell script, but * doesn't need any regexp stuff. */ @@ -416,6 +455,168 @@ replace_token(char **lines, const char *token, const char *replacement) return result; } +/* + * Make a copy of the array of lines, replacing the possibly-commented-out + * assignment of parameter guc_name with a live assignment of guc_value. + * The value will be suitably quoted. + * + * If mark_as_comment is true, the replacement line is prefixed with '#'. + * This is used for fixing up cases where the effective default might not + * match what is in postgresql.conf.sample. + * + * We assume there's at most one matching assignment. If we find no match, + * append a new line with the desired assignment. + * + * The original data structure is not changed, but we share any unchanged + * stri
Re: Show various offset arrays for heap WAL records
On Fri, Jan 27, 2023 at 12:24 PM Melanie Plageman wrote: > I believe I have addressed this in the attached patch. I'm not sure what's best in terms of formatting details but I definitely like the idea of making pg_waldump show more details. I'd even like to have a way to extract the tuple data, when it's operations on tuples and we have those tuples in the payload. That'd be a lot more verbose than what you are doing here, though, and I'm not saying you should go do it right now or anything like that. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Non-superuser subscription owners
On Thu, Jan 19, 2023 at 8:46 PM Andres Freund wrote:
> > If we already had (or have) that logic someplace else, it would
> > probably make sense to reuse it
>
> We have. See at least postgres_fdw's check_conn_params(), dblink's
> dblink_connstr_check() and dblink_security_check().

In the patch I posted previously, I had some other set of checks, more or less along the lines suggested by Jeff. I looked into revising that approach and making the behavior match exactly what we do in those places instead. I find that it breaks 027_nosuperuser.pl. Specifically, where without the patch I get "ok 6 - nosuperuser admin with all table privileges can replicate into unpartitioned", with the patch it goes boom, because the subscription can't connect any more due to the password requirement.

At first, I found it a bit tempting to see this as a further indication that the force-a-password approach is not the right idea, because the test case clearly memorializes a desire *not* to require a password in this situation. However, the loopback-to-superuser attack is just as viable for subscriptions as it is in other cases, and my previous patch would have done nothing to block it.

So what I did instead is add a password_required attribute, just like what postgres_fdw has. As in the case of postgres_fdw, the actual rule is that if the attribute is false, a password is not required, and if the attribute is true, a password is required unless you are a superuser. If you're a superuser, it still isn't required. This is a slightly odd set of semantics, but it has precedent and practical advantages. Also, as in the case of postgres_fdw, only a superuser can set password_required=false, and a subscription that has that setting can only be modified by a superuser, no matter who owns it.
Even though I hate the require-a-password stuff with the intensity of a thousand suns, I think this is better than the previous patch, because it's more consistent with what we do elsewhere and because it blocks the loopback-connection-to-superuser attack. I think we *really* need to develop a better system for restricting proxied connections (no matter how proxied) and I hope that we do that soon. But inventing something for this purpose that differs from what we do elsewhere will make that task harder, not easier. Thoughts? -- Robert Haas EDB: http://www.enterprisedb.com v2-0001-Add-new-predefined-role-pg_create_subscriptions.patch Description: Binary data
Re: GUCs to control abbreviated sort keys
On Thu, Jan 26, 2023 at 3:29 PM Jeff Davis wrote: > On Thu, 2023-01-26 at 22:39 +0100, Peter Eisentraut wrote: > > Maybe an easier way to enable or disable it in the source code with a > > #define would serve this. Making it a GUC right away seems a bit > > heavy-handed. Further exploration and tweaking might well require > > further source code changes, so relying on a source code level toggle > > would seem appropriate. > > I am using these GUCs for testing the various collation paths in my > collation refactoring branch. I'm fine with adding the GUC as a developer option. I think that there is zero chance of the changes to tuplesort.c having appreciable overhead. > I find them pretty useful, and when I saw a regression, I thought > others might think it was useful, too. But if not I'll just leave them > in my branch and withdraw from this thread. I cannot recreate the issue you describe. With abbreviated keys, your exact test case takes 00:16.620 on my system. Without abbreviated keys, it takes 00:21.255. To me it appears to be a moderately good case for abbreviated keys, though certainly not as good as some cases that I've seen -- ~3x improvements are common enough. As a point of reference, the same test case with the C collation and with abbreviated keys takes 00:10.822. When I look at the "trace_sort" output for the C collation with abbreviated keys, and compare it to the equivalent "trace_sort" output for the original "en-US-x-icu" collation from your test case, it is clear that the overhead of generating collated abbreviated keys within ICU is relatively high -- the initial scan of the table (which is where we generate all abbreviated keys here) takes 4.45 seconds in the ICU case, and only 1.65 seconds in the "C" locale case. I think that you should look into that same difference on your own system, so that we can compare and contrast. 
The underlying issue might well have something to do with the ICU version you're using, or some other detail of your environment. I'm using Debian unstable here. Postgres links to the system ICU, which is ICU 72. It's not impossible that the perl program you wrote produces non-deterministic output, which should be controlled for, since it might just be significant. I see this on my system, having run the perl program as outlined in your test case: $ ls -l /tmp/strings.txt -rw-r--r-- 1 pg pg 431886574 Jan 27 11:13 /tmp/strings.txt $ sha1sum /tmp/strings.txt 22f60dc12527c215c8e3992e49d31dc531261a83 /tmp/strings.txt Does that match what you see on your system? -- Peter Geoghegan
Re: Optimizing PostgreSQL with LLVM's PGO+LTO
Hi,

We implemented LTO in PostGIS's build system a couple of releases ago. It definitely gives +10% on heavy maths. Unfortunately, we did not manage to get it running under FreeBSD because of default system linker issues, so we had to hide it behind a --with-lto switch, which we recommend to everyone. I did not experiment with Postgres itself, but there are definitely traces of numerous LTO-enabled private builds on the web.

On Fri, Jan 27, 2023 at 8:05 PM João Paulo Labegalini de Carvalho < jaopaul...@gmail.com> wrote:
> Hi all,
>
> I am investigating the benefits of different profile-guided optimizations
> (PGO) and link-time optimizations (LTO) versus binary optimizers (e.g.
> BOLT) for applications such as PostgreSQL.
>
> I am facing issues when applying LTO to PostgreSQL as the produced binary
> seems broken (the server dies quickly after it has started). This is
> definitely a compiler bug, but I was wondering if anyone here have
> experimented with LTO for PostgreSQL.
>
> Thanks,
>
> --
> João Paulo L. de Carvalho
> Ph.D Computer Science | IC-UNICAMP | Campinas , SP - Brazil
> Postdoctoral Research Fellow | University of Alberta | Edmonton, AB - Canada
> joao.carva...@ic.unicamp.br
> joao.carva...@ualberta.ca
Re: improving user.c error messages
While we're here, On 2023-Jan-26, Nathan Bossart wrote: > @@ -838,7 +867,8 @@ AlterRole(ParseState *pstate, AlterRoleStmt *stmt) > if (!should_be_super && roleid == BOOTSTRAP_SUPERUSERID) > ereport(ERROR, > > (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), > - errmsg("permission denied: bootstrap > user must be superuser"))); > + errmsg("permission denied to alter > role"), > + errdetail("The bootstrap user must be > superuser."))); I think this one isn't using the right errcode; this is not a case of insufficient privileges. There's no priv you can acquire that lets you do it. So I'd change it to unsupported operation. I was confused a bit by this one: > /* an unprivileged user can change their own password */ > if (dpassword && roleid != currentUserId) > ereport(ERROR, > (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), > -errmsg("must have CREATEROLE privilege to change another > user's password"))); > +errmsg("permission denied to alter role"), > +errdetail("To change another role's password, you must > have %s privilege and %s on the role.", > + "CREATEROLE", "ADMIN OPTION"))); > } In no other message do we say what operation is being attempted in the errdetail; all the others start with "You must have" and that's it. However, looking closer I think this one being different is okay, because the errmsg() you're using is vague, and I think the error report would be confusing if you were to remove the "To change another role's password" bit. The patch looks good to me. -- Álvaro Herrera — Breisgau, Deutschland — https://www.EnterpriseDB.com/
Re: New strategies for freezing, advancing relfrozenxid early
On Fri, Jan 27, 2023 at 12:52 AM Andres Freund wrote: > I agree with bringing high-level context into the decision about whether to > freeze agressively - my problem with the eager freezing strategy patch isn't > that it did that too much, it's that it didn't do it enough. > > > But I also don't think what I describe above is really comparable to "table > level" eager freezing though - the potential worst case overhead is a small > fraction of the WAL volume, and there's zero increase in data write volume. All I meant was that I initially thought that you were trying to replace the FPI thing with something at the same level of ambition, that could work in a low context way. But I now see that you're actually talking about something quite a bit more ambitious for Postgres 16, which is structurally similar to a freezing strategy, from a code point of view -- it relies on high-level context for the VACUUM/table as a whole. I wasn't equating it with the eager freezing strategy in any other way. It might also be true that this other thing happens to render the FPI mechanism redundant. I'm actually not completely sure that it will just yet. Let me verify my understanding of your proposal: You mean that we'd take the page LSN before doing anything with the page, right at the top of lazy_scan_prune, at the same point that "fpi_before" is initialized currently. Then, if we subsequently dirtied the page (as determined by its LSN, so as to focus on "dirtied via WAL logged operation") during pruning, *and* if the "lsn_before" of the page was from before our cutoff (derived via " lsn_threshold = insert_lsn - (insert_lsn - lsn_of_last_vacuum) * 0.1" or similar), *and* if the page is eligible to become all-frozen, then we'd freeze the page. That's it, right? It's about pages that *we* (VACUUM) dirtied, and wrote records and/or FPIs for already? 
> I suspect the absolute worst case of "always freeze dirty pages" is when a > single tuple on the page gets updated immediately after every time we freeze > the page - a single tuple is where the freeze record is the least space > efficient. The smallest update is about the same size as the smallest freeze > record. For that to amount to a large WAL increase you'd need a crazy rate of such > updates interspersed with vacuums. In slightly more realistic cases (i.e. not > column-less tuples that constantly get updated and freezing happening all the > time) you end up with a reasonably small WAL rate overhead. The other thing is that we'd be doing this in situations where we already know that a VISIBLE record is required, which is comparable in size to a FREEZE_PAGE record with one tuple/plan (around 64 bytes). The smallest WAL records are mostly just generic WAL record header overhead. > Obviously that's a pointless workload, but I do think that > analyzing the "outer boundaries" of the regression something can cause, can be > helpful. I agree about the "outer boundaries" being a useful guide. > I think one way forward with the eager strategy approach would be to have a > very narrow gating condition for now, and then incrementally expand it in > later releases. > > One use-case where the eager strategy is particularly useful is > [nearly-]append-only tables - and it's also the one workload that's reasonably > easy to detect using stats. Maybe something like > (dead_tuples_since_last_vacuum / inserts_since_last_vacuum) < 0.05 > or so. > > That'll definitely leave out loads of workloads where eager freezing would be > useful - but are there semi-reasonable workloads where it'll hurt badly? I > don't *think* so. I have no further plans to work on eager freezing strategy, or anything of the sort, in light of recent developments.
My goal at this point is very unambitious: to get the basic page-level freezing work into a form that makes sense as a standalone thing for Postgres 16. To put things on a good footing, so that I can permanently bow out of all work on VACUUM having left everything in good order. That's all. Now, that might still mean that I'd facilitate future work of this sort, by getting the right basic structure in place. But my involvement in any work on freezing or anything of the sort ends here, both as a patch author and a committer of anybody else's work. I'm proud of the work I've done on VACUUM, but I'm keen to move on from it. > > What about unlogged/temporary tables? The obvious thing to do there is > > what I did in the patch that was reverted (freeze whenever the page > > will thereby become all-frozen), and forget about LSNs. But you have > > already objected to that part, specifically. > > My main concern about that is the data write amplification it could cause when > page is clean when we start freezing. But I can't see a large potential > downside to always freezing unlogged/temp tables when the page is already > dirty. But we have to dirty the page anyway, just to set PD_ALL_VISIBLE. That was always a gating condition. Actually, that may have depen
Re: New strategies for freezing, advancing relfrozenxid early
Hi, On 2023-01-27 12:53:58 -0500, Robert Haas wrote: > On Fri, Jan 27, 2023 at 12:58 AM Andres Freund wrote: > > Essentially the "any fpi" logic is a very coarse grained way of using the > > page > > LSN as a measurement. As I said, I don't think "has a checkpoint occurred > > since the last write" is a good metric to avoid unnecessary freezing - it's > > too coarse. But I think using the LSN is the right thought. What about > > something like > > > > lsn_threshold = insert_lsn - (insert_lsn - lsn_of_last_vacuum) * 0.1 > > if (/* other conds */ && PageGetLSN(page) <= lsn_threshold) > > FreezeMe(); > > > > I probably got some details wrong, what I am going for with lsn_threshold is > > that we'd freeze an already dirty page if it's not been updated within 10% > > of > > the LSN distance to the last VACUUM. > > I think this might not be quite the right idea for a couple of reasons. It's definitely not perfect. If we had an approximate LSN->time map as general infrastructure, we could do a lot better. I think it'd be reasonably easy to maintain that in the autovacuum launcher, for example. One thing worth calling out here, because it's not obvious from the code quoted above in isolation, is that what I was trying to refine here was the decision when to perform opportunistic freezing *of already dirty pages that do not require an FPI*. So all that we need to prevent here is freezing very hotly updated data, where the WAL overhead of the freeze records would be noticable, because we constantly VACUUM, due to the high turnover. > First, suppose that the table is being processed just by autovacuum > (no manual VACUUM operations) and that the rate of WAL generation is > pretty even, so that LSN age is a good proxy for time. If autovacuum > processes the table once per hour, this will freeze if it hasn't been > updated in the last six minutes. That sounds good. 
But if autovacuum > processes the table once per day, then this will freeze if it hasn't > been updated in 2.4 hours. That might be OK, but it sounds a little on > the long side. You're right. I was thinking of the "lsn_since_last_vacuum" because I was postulating it being useful elsewhere in the thread (but for eager strategy logic) - but here that's really not very relevant. Given that we're dealing with already dirty pages not requiring an FPI, I think a much better "reference LSN" would be the LSN of the last checkpoint (LSN of the last checkpoint record, not the current REDO pointer). > Second, and more seriously, I think this would, in some circumstances, > lead to tremendously unstable behavior. Suppose somebody does a bunch > of work on a table and then they're like "oh, we should clean up, > VACUUM" and it completes quickly because it's been a while since the > last vacuum and so it doesn't freeze much. Then, for whatever reason, > they decide to run it one more time, and it goes bananas and starts > freezing all kinds of stuff because the LSN distance since the last > vacuum is basically zero. Or equally, you run a manual VACUUM, and you > get completely different behavior depending on how long it's been > since the last autovacuum ran. I don't think this quite applies to the scenario at hand, because it's restricted to already dirty pages. And the max increased overhead is also small due to that - so occasionally getting it wrong isn't that impactful. Greetings, Andres Freund
Re: improving user.c error messages
On 2023-Jan-27, Tom Lane wrote: > Good point. My vote is for standardizing on *not* mentioning it. > Error messages should say "you need privilege X". That is not > the place to go into all the ways you could hold privilege X > (one of which is being superuser). +1 -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "El sabio habla porque tiene algo que decir; el tonto, porque tiene que decir algo" (Platon).
Re: New strategies for freezing, advancing relfrozenxid early
On Fri, Jan 27, 2023 at 12:58 AM Andres Freund wrote: > Essentially the "any fpi" logic is a very coarse grained way of using the page > LSN as a measurement. As I said, I don't think "has a checkpoint occurred > since the last write" is a good metric to avoid unnecessary freezing - it's > too coarse. But I think using the LSN is the right thought. What about > something like > > lsn_threshold = insert_lsn - (insert_lsn - lsn_of_last_vacuum) * 0.1 > if (/* other conds */ && PageGetLSN(page) <= lsn_threshold) > FreezeMe(); > > I probably got some details wrong, what I am going for with lsn_threshold is > that we'd freeze an already dirty page if it's not been updated within 10% of > the LSN distance to the last VACUUM. I think this might not be quite the right idea for a couple of reasons. First, suppose that the table is being processed just by autovacuum (no manual VACUUM operations) and that the rate of WAL generation is pretty even, so that LSN age is a good proxy for time. If autovacuum processes the table once per hour, this will freeze if it hasn't been updated in the last six minutes. That sounds good. But if autovacuum processes the table once per day, then this will freeze if it hasn't been updated in 2.4 hours. That might be OK, but it sounds a little on the long side. If autovacuum processes the table once per week, then this will freeze if it hasn't been updated in 16.8 hours. That sounds too conservative. Conversely, if autovacuum processes the table every 3 minutes, then this will freeze the data if it hasn't been updated in the last 18 seconds, which sounds awfully aggressive. Maybe I'm wrong here, but I feel like the absolute amount of wall-clock time we're talking about here probably matters to some degree. I'm not sure whether a strict time-based threshold like, say, 10 minutes would be a good idea, leaving aside the difficulties of implementation. 
It might be right to think that if the table is being vacuumed a lot, freezing more aggressively is smart, and if it's being vacuumed infrequently, freezing less aggressively is smart, because if the table has enough activity that it's being vacuumed frequently, that might also be a sign that we need to freeze more aggressively in order to avoid having things go sideways. However, I'm not completely sure about that, and I think it's possible that we need some guardrails to avoid going too far in either direction. Second, and more seriously, I think this would, in some circumstances, lead to tremendously unstable behavior. Suppose somebody does a bunch of work on a table and then they're like "oh, we should clean up, VACUUM" and it completes quickly because it's been a while since the last vacuum and so it doesn't freeze much. Then, for whatever reason, they decide to run it one more time, and it goes bananas and starts freezing all kinds of stuff because the LSN distance since the last vacuum is basically zero. Or equally, you run a manual VACUUM, and you get completely different behavior depending on how long it's been since the last autovacuum ran. In some ways, I think this proposal has many of the same problems as vacuum_freeze_min_age. In both cases, the instinct is that we should use something on the page to let us know how long it's been since the page was modified, and proceed on the theory that if the page has not been modified recently, it probably isn't about to be modified again. That's a reasonable instinct, but the rate of XID advancement and the rate of LSN advancement are both highly variable, even on a system that's always under some load. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Show various offset arrays for heap WAL records
Hi,

I have taken a stab at doing some of the tasks listed in this email. I have made the new files rmgr_utils.c/h. I have come up with a standard format that I like for the output and used it in all the heap record types. Examples below:

snapshotConflictHorizon: 2184, nplans: 2, plans [ { xmax: 0, infomask: 2816, infomask2: 2, ntuples: 5, offsets: [ 10, 11, 12, 18, 71 ] }, { xmax: 0, infomask: 11008, infomask2: 2, ntuples: 2, offsets: [ 72, 73 ] } ]

snapshotConflictHorizon: 2199, nredirected: 4, ndead: 0, nunused: 4, redirected: [ 1->38, 2->39, 3->40, 4->41 ], dead: [], unused: [ 24, 25, 26, 27, 37 ]

I started documenting it in the rmgr_utils.h header file in a comment, however it may be worth a README? I haven't polished this description of the format (or added examples, etc) or used it in the btree-related functions because I assume the format and helper function API will need more discussion. This is still a rough draft, as I anticipate changes will be requested. I would split it into multiple patches, etc. But I am looking for feedback on the suggested format and the array formatting helper function API. Perhaps there should also be example output of the offset arrays in pgwalinspect docs?

I've changed the array format helper functions that Andres added to be a single function with an additional layer of indirection so that any record with an array can use it regardless of type and format of the individual elements. The signature is based somewhat off of qsort_r() and allows the user to pass a function with the desired format of the elements.

On a semi-unrelated note, I think it might be nice to have a comment in heapam_xlog.h about what the infobits fields actually are and why they exist -- e.g. we only need a subset of infomask[2] bits in these records. I put a random comment in the code where I think it should go. I will delete it later, of course.
On Mon, Jan 9, 2023 at 11:00 PM Peter Geoghegan wrote: > > On Mon, Jan 9, 2023 at 1:58 PM Andres Freund wrote: > > The attached patch adds details to XLOG_HEAP2_PRUNE, XLOG_HEAP2_VACUUM, > > XLOG_HEAP2_FREEZE_PAGE. > > I'm bound to end up doing the same in index access methods. Might make > sense for the utility routines to live somewhere more centralized, at > least when code reuse is likely. Practically every index AM has WAL > records that include a sorted page offset number array, just like > these ones. It's a very standard thing, obviously. I plan to add these if the format and API I suggested seems like the right direction. On Tue, Jan 10, 2023 at 2:35 PM Andres Freund wrote: > > > > I chose to include infomask[2] for the different freeze plans mainly > > > because > > > it looks odd to see different plans without a visible reason. But I'm not > > > sure > > > that's the right choice. > > > > I don't think that it is particularly necessary to do so in order for > > the output to make sense -- pg_waldump is inherently a tool for > > experts. What it comes down to for me is whether or not this > > information is sufficiently useful to display, and/or can be (or needs > > to be) controlled via some kind of verbosity knob. > > It seemed useful enough to me, but I likely also stare more at this stuff than > most. Compared to the list of offsets it's not that much content. > Personally, I like having the infomasks for the freeze plans. If we someday have a more structured input to rmgr_desc, we could then easily have them in their own column and use functions like heap_tuple_infomask_flags() on them. > > How hard would it be to invent a general mechanism to control the verbosity > > of what we'll show for each WAL record? > > Nontrivial, I'm afraid. We don't pass any relevant parameters to rm_desc: > void(*rm_desc) (StringInfo buf, XLogReaderState *record); > > so we'd need to patch all of them. 
That might be worth doing at some point, > but I don't want to tackle it right now. In terms of a more structured format, it seems like it would make the most sense to pass a JSON or composite datatype structure to rm_desc instead of that StringInfo. I would also like to see functions like XLogRecGetBlockRefInfo() pass something more useful than a stringinfo buffer so that we could easily extract out the relfilenode in pgwalinspect. On Mon, Jan 16, 2023 at 10:09 PM Peter Geoghegan wrote: > > On Wed, Jan 11, 2023 at 3:11 PM Peter Geoghegan wrote: > > On Wed, Jan 11, 2023 at 3:00 PM Andres Freund wrote: > > > What are your thoughts about the place for the helper functions? You're ok > > > with rmgrdesc_utils.[ch]? > > > > Yeah, that seems okay. > > BTW, while playing around with this patch today, I noticed that it > won't display the number of elements in each offset array directly. > Perhaps it's worth including that, too? I believe I have addressed this in the attached patch. - Melanie v1-0001-Add-rmgr_desc-utilities.patch Description: Binary data
Re: Add LZ4 compression in pg_dump
On Thu, Jan 26, 2023 at 12:22:45PM -0600, Justin Pryzby wrote:
> That commit also added this to pg_dump.c:
>
> + case PG_COMPRESSION_ZSTD:
> +     pg_fatal("compression with %s is not yet supported", "ZSTD");
> +     break;
> + case PG_COMPRESSION_LZ4:
> +     pg_fatal("compression with %s is not yet supported", "LZ4");
> +     break;
>
> In 002, that could be simplified by re-using the supports_compression()
> function. (And maybe the same in WriteDataToArchive()?)

The first patch aims to minimize references to ".gz", "GZIP", and ZLIB.
pg_backup_directory.c comments still refer to ".gz". I think the patch
should ideally change to refer to "the compressed file extension" (similar
to compress_io.c), avoiding the need to update it later.

I think the file extension stuff could be generalized, so it doesn't need
to be updated in multiple places (pg_backup_directory.c and compress_io.c).
Maybe it's useful to add a function to return the extension of a given
compression method. It could go in compression.c, and be useful in
basebackup.

For the 2nd patch: I might be in the minority, but I still think some
references to "gzip" should say "zlib":

+} GzipCompressorState;
+
+/* Private routines that support gzip compressed data I/O */
+static void
+DeflateCompressorGzip(ArchiveHandle *AH, CompressorState *cs, bool flush)

In my mind, three things here are misleading, because it doesn't use gzip
headers: GzipCompressorState, DeflateCompressorGzip, "gzip compressed".
This comment is about exactly that:

 * underlying stream. The second API is a wrapper around fopen/gzopen and
 * friends, providing an interface similar to those, but abstracts away
 * the possible compression. Both APIs use libz for the compression, but
 * the second API uses gzip headers, so the resulting files can be easily
 * manipulated with the gzip utility.

AIUI, Michael says that it's fine that the user-facing command-line options
use "-Z gzip" (even though the "custom" format doesn't use gzip headers).
I'm okay with that, as long as that's discussed/understood. -- Justin
Re: Improve GetConfigOptionValues function
Nitin Jadhav writes: >> Both of you are arguing as though GUC_NO_SHOW_ALL is a security >> property. It is not, or at least it's so trivially bypassable >> that it's useless to consider it one. All it is is a de-clutter >> mechanism. > Understood. If that is the case, then I am ok with the patch. Pushed v4, then. regards, tom lane
Optimizing PostgreSQL with LLVM's PGO+LTO
Hi all,

I am investigating the benefits of different profile-guided optimizations
(PGO) and link-time optimizations (LTO) versus binary optimizers (e.g.
BOLT) for applications such as PostgreSQL. I am facing issues when
applying LTO to PostgreSQL, as the produced binary seems broken (the
server dies quickly after it has started). This is definitely a compiler
bug, but I was wondering if anyone here has experimented with LTO for
PostgreSQL.

Thanks,

--
João Paulo L. de Carvalho
Ph.D Computer Science | IC-UNICAMP | Campinas, SP - Brazil
Postdoctoral Research Fellow | University of Alberta | Edmonton, AB - Canada
joao.carva...@ic.unicamp.br
joao.carva...@ualberta.ca
Re: Perform streaming logical transactions by background workers and parallel apply
On Wed, Jan 25, 2023 at 3:27 PM Amit Kapila wrote: > > On Wed, Jan 25, 2023 at 10:05 AM Amit Kapila wrote: > > > > On Wed, Jan 25, 2023 at 3:15 AM Peter Smith wrote: > > > > > > 1. > > > @@ -210,7 +210,7 @@ int logical_decoding_work_mem; > > > static const Size max_changes_in_memory = 4096; /* XXX for restore only > > > */ > > > > > > /* GUC variable */ > > > -int logical_decoding_mode = LOGICAL_DECODING_MODE_BUFFERED; > > > +int logical_replication_mode = LOGICAL_REP_MODE_BUFFERED; > > > > > > > > > I noticed that the comment /* GUC variable */ is currently only above > > > the logical_replication_mode, but actually logical_decoding_work_mem > > > is a GUC variable too. Maybe this should be rearranged somehow then > > > change the comment "GUC variable" -> "GUC variables"? > > > > > > > I think moving these variables together doesn't sound like a good idea > > because logical_decoding_work_mem variable is defined with other > > related variable. Also, if we are doing the last comment, I think that > > will obviate the need for this. > > > > > == > > > src/backend/utils/misc/guc_tables.c > > > > > > @@ -4908,13 +4908,13 @@ struct config_enum ConfigureNamesEnum[] = > > > }, > > > > > > { > > > - {"logical_decoding_mode", PGC_USERSET, DEVELOPER_OPTIONS, > > > + {"logical_replication_mode", PGC_USERSET, DEVELOPER_OPTIONS, > > > gettext_noop("Allows streaming or serializing each change in logical > > > decoding."), > > > NULL, > > > GUC_NOT_IN_SAMPLE > > > }, > > > - &logical_decoding_mode, > > > - LOGICAL_DECODING_MODE_BUFFERED, logical_decoding_mode_options, > > > + &logical_replication_mode, > > > + LOGICAL_REP_MODE_BUFFERED, logical_replication_mode_options, > > > NULL, NULL, NULL > > > }, > > > > > > That gettext_noop string seems incorrect. I think Kuroda-san > > > previously reported the same, but then you replied it has been fixed > > > already [1] > > > > > > > I felt the description seems not to be suitable for current behavior. 
> > > > A short description should be like "Sets a behavior of logical
> > > > replication", and further descriptions can be added in the long
> > > > description.
> > >
> > > I adjusted the description here.
> >
> > But this doesn't look fixed to me. (??)
> >
> > Okay, so, how about the following for the 0001 patch:
> > short desc: Controls when to replicate each change.
> > long desc: On the publisher, it allows streaming or serializing each
> > change in logical decoding.
>
> I have updated the patch accordingly and it looks good to me. I'll
> push this first patch early next week (Monday) unless there are more
> comments.

The patch looks good to me too. Thank you for the patch.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Re: Set arbitrary GUC options during initdb
"David G. Johnston" writes:
> On Fri, Jan 27, 2023 at 8:53 AM Tom Lane wrote:
>> Anyway, it seems like I gotta work harder. I'll produce a
>> new patch.

> How about just adding a "section" to the end of the file as needed:
> # AdHoc Settings Specified During InitDB
> max_connections=75
> ...

Nah, I think that would be impossibly confusing. One way or another
the live setting has to be near where the GUC is documented. We will
have to do add-at-the-end for custom GUCs, of course, but in that case
there's no matching comment to confuse you.

			regards, tom lane
Re: New strategies for freezing, advancing relfrozenxid early
On Fri, Jan 27, 2023 at 6:48 AM Robert Haas wrote: > > One of the key strengths of systems like Postgres is the ability to > > inexpensively store a relatively large amount of data that has just > > about zero chance of being read, let alone modified. While at the same > > time having decent OLTP performance for the hot data. Not nearly as > > good as an in-memory system, mind you -- and yet in-memory systems > > remain largely a niche thing. > > I think it's interesting that TPC-C suffers from the kind of problem > that your patch was intended to address. I hadn't considered that. But > I do not think it detracts from the basic point I was making, which is > that you need to think about the downsides of your patch, not just the > upsides. > > If you want to argue that there is *no* OLTP workload that will be > harmed by freezing as aggressively as possible, then that would be a > good argument in favor of your patch, because it would be arguing that > the downside simply doesn't exist, at least for OLTP workloads. The > fact that you can think of *one particular* OLTP workload that can > benefit from the patch is just doubling down on the "my patch has an > upside" argument, which literally no one is disputing. You've treated me to another multi paragraph talking down, as if I was still clinging to my original position, which is of course not the case. I've literally said I'm done with VACUUM for good, and that I just want to put a line under this. Yet you still persist in doing this sort of thing. I'm not fighting you, I'm not fighting Andres. I was making a point about the need to do something in this area in general. That's all. -- Peter Geoghegan
Re: Timeline ID hexadecimal format
On 27/01/2023 15:55, Peter Eisentraut wrote:
> On 27.01.23 14:52, Sébastien Lardière wrote:
> > The attached patch proposes to change the format of timelineid from
> > %u to %X.
>
> I think your complaint has merit. But note that if we did a change like
> this, then log files or reports from different versions would have
> different meaning without a visual difference, which is kind of what
> you complained about in the first place. At least we should do
> something like 0x%X.

Indeed, but the messages that puzzled me were together in one log file,
not in different versions. But yes, it should be documented somewhere;
actually, I can't find any good place for that.

While digging, it seems that recovery_target_timeline should be given in
decimal, not in hexadecimal, which seems odd to me; and pg_controldata
uses decimal too, not hexadecimal. So, if this idea is correct, the given
patch is not enough. Anyway, do you think it is a good idea or not?

> > Regarding .po files, I don't know how to manage them. Is there any
> > routine to spread the modifications? Or should I identify and change
> > each message?
>
> Don't worry about this. This is handled elsewhere.

Nice,

regards,

--
Sébastien
Re: improving user.c error messages
Robert Haas writes: > I almost hate to bring this up since I'm not sure how far we want to > go down this rat hole, but what should be our policy about mentioning > superuser? I don't think we're entirely consistent right now, and I'm > not sure whether every error message needs to mention that if you were > the superuser you could do everything. Is that something we should > mention always, never, or in some set of circumstances? Good point. My vote is for standardizing on *not* mentioning it. Error messages should say "you need privilege X". That is not the place to go into all the ways you could hold privilege X (one of which is being superuser). regards, tom lane
Re: Set arbitrary GUC options during initdb
On Fri, Jan 27, 2023 at 8:53 AM Tom Lane wrote:
> Robert Haas writes:
> > The idea is that instead of:
> > replace_token(conflines, "#max_connections = 100", repltok);
> > You'd write something like:
> > replace_guc_value(conflines, "max_connections", repltok);
> > Which would look for a line matching /^#max_connections\s+=\s/, and
> > then identify everything following that point up to the first #. It
> > would replace all that stuff with repltok, but if the replacement is
> > shorter than the original, it would pad with spaces to get back to the
> > original length. And otherwise it would add a single space, so that if
> > you set a super long GUC value there's still at least one space
> > between the end of the value and the comment that follows.
>
> Well, yeah, I was trying to avoid writing that ;-). There's even
> one more wrinkle: we might already have removed the initial '#',
> if one does say "-c max_connections=N", because this logic won't
> know whether the -c switch matches one of initdb's predetermined
> substitutions.
>
> > There might be some quoting-related problems with this idea, not sure.
>
> '#' in a value might confuse it, but we could probably take the last '#'
> not the first.
>
> Anyway, it seems like I gotta work harder. I'll produce a
> new patch.

How about just adding a "section" to the end of the file as needed:

# AdHoc Settings Specified During InitDB
max_connections=75
...

David J.
Re: improving user.c error messages
On Fri, Jan 27, 2023 at 10:52 AM Nathan Bossart wrote: > IMHO superuser should typically only be mentioned when it is the only way > to do something. Since superusers have all privileges, I think logs like > "superuser or privileges of X" are kind of redundant. If Robert has > privileges of X, we wouldn't say "privileges of X or Robert." We'd just > point to X. Ultimately, I feel like mentioning superuser in error messages > usually just makes the message longer without adding any useful > information. That's kind of my opinion too, but I'm not sure whether there are cases where it will lead to confusion. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Set arbitrary GUC options during initdb
Robert Haas writes: > The idea is that instead of: > replace_token(conflines, "#max_connections = 100", repltok); > You'd write something like: > replace_guc_value(conflines, "max_connections", repltok); > Which would look for a line matching /^#max_connections\s+=\s/, and > then identify everything following that point up to the first #. It > would replace all that stuff with repltok, but if the replacement is > shorter than the original, it would pad with spaces to get back to the > original length. And otherwise it would add a single space, so that if > you set a super long GUC value there's still at least one space > between the end of the value and the comment that follows. Well, yeah, I was trying to avoid writing that ;-). There's even one more wrinkle: we might already have removed the initial '#', if one does say "-c max_connections=N", because this logic won't know whether the -c switch matches one of initdb's predetermined substitutions. > There might be some quoting-related problems with this idea, not sure. '#' in a value might confuse it, but we could probably take the last '#' not the first. Anyway, it seems like I gotta work harder. I'll produce a new patch. regards, tom lane
Re: improving user.c error messages
On Fri, Jan 27, 2023 at 08:31:32AM -0500, Robert Haas wrote: > I almost hate to bring this up since I'm not sure how far we want to > go down this rat hole, but what should be our policy about mentioning > superuser? I don't think we're entirely consistent right now, and I'm > not sure whether every error message needs to mention that if you were > the superuser you could do everything. Is that something we should > mention always, never, or in some set of circumstances? IMHO superuser should typically only be mentioned when it is the only way to do something. Since superusers have all privileges, I think logs like "superuser or privileges of X" are kind of redundant. If Robert has privileges of X, we wouldn't say "privileges of X or Robert." We'd just point to X. Ultimately, I feel like mentioning superuser in error messages usually just makes the message longer without adding any useful information. I recognize that this is a bold opinion and that the policy to mention superuser might need to be more nuanced in practice... -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Set arbitrary GUC options during initdb
On Fri, Jan 27, 2023 at 10:34 AM Tom Lane wrote: > One idea if we want to make it work like that could be to stop > trying to edit out the default value, and instead make the file > contents look like, say, > > #huge_pages = try # on, off, or try > huge_pages = off# set by initdb How about just making replace_token() a little smarter, and maybe renaming it? The idea is that instead of: replace_token(conflines, "#max_connections = 100", repltok); You'd write something like: replace_guc_value(conflines, "max_connections", repltok); Which would look for a line matching /^#max_connections\s+=\s/, and then identify everything following that point up to the first #. It would replace all that stuff with repltok, but if the replacement is shorter than the original, it would pad with spaces to get back to the original length. And otherwise it would add a single space, so that if you set a super long GUC value there's still at least one space between the end of the value and the comment that follows. There might be some quoting-related problems with this idea, not sure. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Named Operators
On Fri, 27 Jan 2023 at 16:26, Peter Eisentraut wrote: > > On 12.01.23 14:55, Matthias van de Meent wrote: > >> Matter of taste, I guess. But more importantly, defining an operator > >> gives you many additional features that the planner can use to > >> optimize your query differently, which it can't do with functions. See > >> the COMMUTATOR, HASHES, etc. clause in the CREATE OPERATOR command. > > I see. Wouldn't it be better then to instead make it possible for the > > planner to detect the use of the functions used in operators and treat > > them as aliases of the operator? Or am I missing something w.r.t. > > differences between operator and function invocation? > > > > E.g. indexes on `int8pl(my_bigint, 1)` does not match queries for > > `my_bigint + 1` (and vice versa), while they should be able to support > > that, as OPERATOR(pg_catalog.+(int8, int8)) 's function is int8pl. > > I have been thinking about something like this for a long time. > Basically, we would merge pg_proc and pg_operator internally. Then, all > the special treatment for operators would also be available to > two-argument functions. And single-argument functions in case of prefix operators, right?
Re: Set arbitrary GUC options during initdb
Peter Eisentraut writes:
> On 25.01.23 22:25, Tom Lane wrote:
>> The specified settings are applied on the command line of the initial
>> probe calls (which happen before we've made any config files), and then
>> they are added to postgresql.auto.conf, which causes them to take
>> effect for the bootstrap backend runs as well as subsequent postmaster
>> starts.

> I would have expected them to be edited into postgresql.conf. What are
> the arguments for one or the other?

TBH, the driving reason was that the string-munging code we have in
initdb isn't up to snuff for that: it wants to substitute for an
exactly-known string, which we won't have in the case of an arbitrary
GUC.

One idea if we want to make it work like that could be to stop trying
to edit out the default value, and instead make the file contents look
like, say,

#huge_pages = try		# on, off, or try
huge_pages = off		# set by initdb

Then we just need to be able to find the GUC's entry.

> Btw., something that I have had in my notes for a while, but with this
> it would now be officially exposed: Not all options can be safely set
> during bootstrap. For example,
>     initdb -D data -c track_commit_timestamp=on
> will fail an assertion. This might be an exception, or there might be
> others.

Interesting. We'd probably want to sprinkle some more
do-nothing-in-bootstrap-mode tests as we discover that sort of thing.

			regards, tom lane
Re: Set arbitrary GUC options during initdb
On Wed, Jan 25, 2023 at 4:26 PM Tom Lane wrote: > So this invents an initdb switch "-c NAME=VALUE" just like the > one that the server itself has long had. HUGE +1 from me. This will, I think, be extremely convenient in many situations. > The specified settings > are applied on the command line of the initial probe calls > (which happen before we've made any config files), and then they > are added to postgresql.auto.conf, which causes them to take > effect for the bootstrap backend runs as well as subsequent > postmaster starts. I agree with others that it would seem more natural to edit them in postgresql.conf itself, but I also think it doesn't matter nearly as much as getting the feature in some form. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Named Operators
On 12.01.23 14:55, Matthias van de Meent wrote: Matter of taste, I guess. But more importantly, defining an operator gives you many additional features that the planner can use to optimize your query differently, which it can't do with functions. See the COMMUTATOR, HASHES, etc. clause in the CREATE OPERATOR command. I see. Wouldn't it be better then to instead make it possible for the planner to detect the use of the functions used in operators and treat them as aliases of the operator? Or am I missing something w.r.t. differences between operator and function invocation? E.g. indexes on `int8pl(my_bigint, 1)` does not match queries for `my_bigint + 1` (and vice versa), while they should be able to support that, as OPERATOR(pg_catalog.+(int8, int8)) 's function is int8pl. I have been thinking about something like this for a long time. Basically, we would merge pg_proc and pg_operator internally. Then, all the special treatment for operators would also be available to two-argument functions.
Re: Set arbitrary GUC options during initdb
On Fri, 27 Jan 2023 at 09:49, Peter Eisentraut < peter.eisentr...@enterprisedb.com> wrote: > On 25.01.23 22:25, Tom Lane wrote: > > So this invents an initdb switch "-c NAME=VALUE" just like the > > one that the server itself has long had. > > This seems useful. > > > The specified settings > > are applied on the command line of the initial probe calls > > (which happen before we've made any config files), and then they > > are added to postgresql.auto.conf, which causes them to take > > effect for the bootstrap backend runs as well as subsequent > > postmaster starts. > > I would have expected them to be edited into postgresql.conf. What are > the arguments for one or the other? > That would be my expectation also. I believe that is how it works now for options which can be set by initdb, such as locale and port. I view postgresql.auto.conf being for temporary changes, or changes related to different instances within a replication setup, or whatever other uses people come up with - but not for the permanent configuration established by initdb. In particular, I would be surprised if removing a postgresql.auto.conf completely disabled an instance. Obviously, in my replication setup example, the replication would be broken, but the basic operation of the instance would still be possible. Also, if somebody wants to put a change in postgresql.auto.conf, they can easily do it using ALTER SYSTEM once the instance is running, or by just writing out their own postgresql.auto.conf before starting it. Putting a change in postgresql.conf programmatically is a bit of a pain.
Re: Lazy JIT IR code generation to increase JIT speed with partitions
On Fri, Jan 27, 2023 at 10:02:32AM +0100, David Geier wrote:
> It's very curious as to why we didn't really see that when dumping the
> bitcode. It seems like the bitcode is always different enough to not
> spot that.

As I've noted off-list, there was a noticeable difference in the dumped
bitcode, which I haven't noticed since we were talking mostly about
differences between executions of the same query.
Re: old_snapshot_threshold bottleneck on replica
On Fri, Jan 27, 2023 at 9:30 AM Maxim Orlov wrote:
> I thank you for your advices. I've dived deeper into the problem and I
> think the v2 patch is wrong.

Cool!

> Accessing threshold_timestamp and threshold_xid in
> TransactionIdLimitedForOldSnapshots without a lock would lead to an
> improper xlimit calculation.

That would be a bummer.

> So, my choice would be (3b). My goal is to optimize access to the
> threshold_timestamp to avoid multiple spinlock acquisitions on read. At
> the same time, simultaneous access to these variables
> (threshold_timestamp and threshold_xid) should be protected with a
> spinlock.
>
> I removed the atomic for threshold_xid and added comments on
> mutex_threshold. PFA, v3.

Interesting, but it's still not entirely clear to me from reading the
comments why we should think that this is safe.

--
Robert Haas
EDB: http://www.enterprisedb.com