Re: pg_upgrade and logical replication
On Mon, 24 Apr 2023 at 12:52, Julien Rouhaud wrote:
>
> Hi,
>
> On Tue, Apr 18, 2023 at 01:40:51AM +, Hayato Kuroda (Fujitsu) wrote:
> >
> > I found a cfbot failure on macOS [1]. According to the log,
> > "SELECT count(*) FROM t2" was executed before synchronization was done.
> >
> > ```
> > [09:24:21.018](0.132s) not ok 18 - Table t2 should now have 3 rows on the
> > new subscriber
> > ```
> >
> > With the patch present, wait_for_catchup() is executed after REFRESH, but
> > it may not be sufficient because it does not check pg_subscription_rel.
> > wait_for_subscription_sync() seems better for the purpose.
>
> Fixed, thanks!
>
> v5 attached with all previously mentioned fixes.

A few comments:

1) Should we document this command:

+		case ALTER_SUBSCRIPTION_ADD_TABLE:
+			{
+				if (!IsBinaryUpgrade)
+					ereport(ERROR,
+							(errcode(ERRCODE_SYNTAX_ERROR)),
+							errmsg("ALTER SUBSCRIPTION ... ADD TABLE is not supported"));
+
+				supported_opts = SUBOPT_RELID | SUBOPT_STATE | SUBOPT_LSN;
+				parse_subscription_options(pstate, stmt->options,
+										   supported_opts, &opts);
+
+				/* relid and state should always be provided. */
+				Assert(IsSet(opts.specified_opts, SUBOPT_RELID));
+				Assert(IsSet(opts.specified_opts, SUBOPT_STATE));
+
+				AddSubscriptionRelState(subid, opts.relid, opts.state,
+										opts.lsn);

Should we document something like: "This command is for use by in-place upgrade utilities. Its use for other purposes is not recommended or supported. The behavior of the command may change in future releases without notice."

2) Similarly in pg_dump too:

@@ -431,6 +431,7 @@ main(int argc, char **argv)
 		{"table-and-children", required_argument, NULL, 12},
 		{"exclude-table-and-children", required_argument, NULL, 13},
 		{"exclude-table-data-and-children", required_argument, NULL, 14},
+		{"preserve-subscription-state", no_argument, _subscriptions, 1},

Should we document something like: "This option is for use by in-place upgrade utilities. Its use for other purposes is not recommended or supported. The behavior of the option may change in future releases without notice."

3) The same error is possible for a ready-state table with an invalid remote_lsn; should we include that case in the error message too:

+	if (is_error)
+		pg_fatal("--preserve-subscription-state is incompatible with "
+				 "subscription relations in non-ready state");
+
+	check_ok();
+}

Regards,
Vignesh
Re: [PoC] Federated Authn/z with OAUTHBEARER
On Sat, May 20, 2023 at 10:01 AM Jacob Champion wrote:
> - The client implementation is currently epoll-/Linux-specific. I think
> kqueue shouldn't be too much trouble for the BSDs, but it's even more
> code to maintain.

I guess you also need a fallback that uses plain old POSIX poll()? I see you're not just using epoll but also timerfd. Could that be converted to plain old timeout bookkeeping? That should be enough to get every other Unix and *possibly* also Windows to work with the same code path.

> - Unless someone is aware of some amazing Winsock magic, I'm pretty sure
> the multiplexed-socket approach is dead in the water on Windows. I think
> the strategy there probably has to be a background thread plus a fake
> "self-pipe" (loopback socket) for polling... which may be controversial?

I am not a Windows user or hacker, but there are certainly several ways to multiplex sockets. First there is the WSAEventSelect() + WaitForMultipleObjects() approach that latch.c uses. It has the advantage that it allows socket readiness to be multiplexed with various other things that use Windows "events". But if you don't need that, ie you *only* need readiness-based wakeup for a bunch of sockets and no other kinds of fd or object, you can use winsock's plain old select() or its fairly faithful poll() clone called WSAPoll(). It looks a bit like that'd be true here if you could kill the timerfd?

It's a shame to write modern code using select(), but you can find lots of shouting all over the internet about WSAPoll()'s defects, most famously from the cURL folks[1], whose blog post is widely cited, so people still do it. Possibly some good news on that front: by my reading of the docs, it looks like that problem was fixed in Windows 10 2004[2], which itself is by now EOL, so all systems should have the fix. I suspect that means that, finally, you could probably just use the same poll() code path for Unix (when epoll is not available) *and* Windows these days, making porting a lot easier. But I've never tried it, so I don't know what other problems there might be.

Another thing people complain about is the lack of socketpair() or similar in winsock, which means you unfortunately can't easily make anonymous select/poll-compatible local sockets, but that doesn't seem to be needed here.

[1] https://daniel.haxx.se/blog/2012/10/10/wsapoll-is-broken/
[2] https://learn.microsoft.com/en-us/windows/win32/api/winsock2/nf-winsock2-wsapoll
Re: RFC: pg_stat_logmsg
Hi

On Sat, Jul 1, 2023 at 1:57 AM Joe Conway wrote:
> Greetings,
>
> Attached please find a tarball (rather than a patch) for a proposed new
> contrib extension, pg_stat_logmsg.
>
> The basic idea is to mirror how pg_stat_statements works, except the
> logged messages keyed by filename, lineno, and elevel are saved with an
> aggregate count. The format string is displayed (similar to a query
> jumble) for context, along with function name and sqlerrcode.
>
> I threw this together rather quickly over the past couple of days
> between meetings, so I am not claiming that it is committable (and it
> lacks documentation and regression tests as well), but I would love to
> get feedback on:
>
> 1/ the general concept
> 2/ the pg_stat_statement-like implementation
> 3/ contrib vs core vs external project
>
> Some samples and data:
>
> `make installcheck` with the extension loaded:
> 8<--
> # All 215 tests passed.
>
> real    2m24.854s
> user    0m0.086s
> sys     0m0.283s
> 8<--
>
> `make installcheck` without the extension loaded:
> 8<--
> # All 215 tests passed.
>
> real    2m26.765s
> user    0m0.076s
> sys     0m0.293s
> 8<--
>
> Sample output after running make installcheck a couple times (plus a few
> manually generated ERRORs):
>
> 8<--
> test=# select sum(count) from pg_stat_logmsg where elevel > 20;
>   sum
> ---
>  10554
> (1 row)
>
> test=# \x
> Expanded display is on.
> test=# select * from pg_stat_logmsg where elevel > 20 order by count desc;
> -[ RECORD 1 ]---
> filename   | aclchk.c
> lineno     | 2811
> elevel     | 21
> funcname   | aclcheck_error
> sqlerrcode | 42501
> message    | permission denied for schema %s
> count      | 578
> -[ RECORD 2 ]---
> filename   | scan.l
> lineno     | 1241
> elevel     | 21
> funcname   | scanner_yyerror
> sqlerrcode | 42601
> message    | %s at or near "%s"
> count      | 265
> ...
>
> test=# select * from pg_stat_logmsg where elevel > 20 and sqlerrcode = 'XX000';
> -[ RECORD 1 ]---
> filename   | tid.c
> lineno     | 352
> elevel     | 21
> funcname   | currtid_for_view
> sqlerrcode | XX000
> message    | ctid isn't of type TID
> count      | 2
> -[ RECORD 2 ]---
> filename   | pg_locale.c
> lineno     | 2493
> elevel     | 21
> funcname   | pg_ucol_open
> sqlerrcode | XX000
> message    | could not open collator for locale "%s": %s
> count      | 2
> ...
> 8<--
>
> Part of the thinking is that people with fleets of postgres instances
> can use this to scan for various errors that they care about.
> Additionally it would be useful to look for "should not happen" errors.
>
> I will register this in the July CF and will appreciate feedback.

This can be a very interesting feature. I like it.

Regards

Pavel

> Thanks!
>
> --
> Joe Conway
> PostgreSQL Contributors Team
> RDS Open Source Databases
> Amazon Web Services: https://aws.amazon.com
Re: Extensible storage manager API - SMGR hook Redux
Hi,

On 2023-06-30 14:26:44 +0200, Matthias van de Meent wrote:
> diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
> index 4c49393fc5..8685b9fde6 100644
> --- a/src/backend/postmaster/postmaster.c
> +++ b/src/backend/postmaster/postmaster.c
> @@ -1002,6 +1002,11 @@ PostmasterMain(int argc, char *argv[])
>  	 */
>  	ApplyLauncherRegister();
>  
> +	/*
> +	 * Register built-in managers that are not part of static arrays
> +	 */
> +	register_builtin_dynamic_managers();
> +
>  	/*
>  	 * process any libraries that should be preloaded at postmaster start
>  	 */

That doesn't strike me as a good place to initialize this, we'll need it in multiple places that way. How about putting it into BaseInit()?

> -static const f_smgr smgrsw[] = {
> +static f_smgr *smgrsw;

This adds another level of indirection. I would rather limit the number of registerable smgrs than do that.

> +SMgrId
> +smgr_register(const f_smgr *smgr, Size smgrrelation_size)
> +{
> +	MemoryContextSwitchTo(old);
> +
> +	pg_compiler_barrier();

Huh, what's that about?

> @@ -59,14 +63,8 @@ typedef struct SMgrRelationData
>  	 * Fields below here are intended to be private to smgr.c and its
>  	 * submodules. Do not touch them from elsewhere.
>  	 */
> -	int			smgr_which;		/* storage manager selector */
> -
> -	/*
> -	 * for md.c; per-fork arrays of the number of open segments
> -	 * (md_num_open_segs) and the segments themselves (md_seg_fds).
> -	 */
> -	int			md_num_open_segs[MAX_FORKNUM + 1];
> -	struct _MdfdVec *md_seg_fds[MAX_FORKNUM + 1];
> +	SMgrId		smgr_which;		/* storage manager selector */
> +	int			smgrrelation_size;	/* size of this struct, incl. smgr-specific data */

It looked to me like you determined this globally - why do we need it in every entry then?

Greetings,

Andres Freund
Re: Synchronizing slots from primary to standby
On Thu, Jun 29, 2023 at 3:52 PM Hayato Kuroda (Fujitsu) wrote:
>
> 2. I want to confirm the reason why a new replication command is added.

Are you referring to the LIST_SLOTS command? If so, I don't think that is required; instead, we can use a query to fetch the required information.

> IIUC the launcher connects to the primary by using the primary_conninfo
> connection string, but it establishes a physical replication connection,
> so no SQL can be executed. Is that right? Another approach, without a new
> command, would be to specify the target database via a GUC, though that
> is not smart. What do you think?
>
> 3. You chose the per-db worker approach; however, it is difficult to extend
> the feature to support physical slots. This may be problematic. Was there
> any reason for that? I suspected that ReplicationSlotCreate() or the advance
> functions might not be usable from other databases and that this may be the
> reason, but I am not sure. If these operations can be done without connecting
> to a specific database, I think the architecture can be changed.

I think this point needs some investigation, but do we want just one worker that syncs all the slots? That may lead to lag in keeping the slots up-to-date. We probably need some tests.

> 4. Currently the launcher establishes the connection every time. Isn't it
> better to reuse the same one instead?

I feel it is not the launcher but a separate sync slot worker that establishes the connection. It is not clear to me what exactly you have in mind. Can you please explain a bit more?

> The following comments assume this configuration, maybe the most
> straightforward one:
>
> primary -> standby
>    |-> subscriber
>
> 5. After constructing the system, I dropped the subscription on the
> subscriber. In this case the logical slot on the primary was removed, but
> that was not replicated to the standby server. Did you support this
> workload or not?

This should work.

> 6. The current approach may delay the start point of the sync.
>
> Assume that the physical replication system is created first, and then the
> subscriber connects to the publisher node. In this case the launcher connects
> to the primary earlier than the apply worker, and reads the slots. At that
> time there are no slots on the primary, so the launcher disconnects from the
> primary and waits a time period (up to 3 min).
> Even if the apply worker then creates the slot on the publisher, the launcher
> on the standby cannot notice that. The synchronization may start 3 min later.

I feel this should be based on some GUC like 'wal_retrieve_retry_interval', which we are already using in the launcher, or probably a new one if that doesn't seem to match.

--
With Regards,
Amit Kapila.
Re: cataloguing NOT NULL constraints
Hi,

On 2023-06-30 13:44:03 +0200, Alvaro Herrera wrote:
> OK, so here's a new attempt to get this working correctly.

Thanks for continuing to work on this!

> The main novelty in this version of the patch, is that we now emit
> "throwaway" NOT NULL constraints when a column is part of the primary
> key. Then, after the PK is created, we run a DROP for that constraint.
> That lets us create the PK without having to scan the table during
> pg_upgrade.

Have you considered extending the DDL statement for this purpose? We have ALTER TABLE ... ADD CONSTRAINT ... PRIMARY KEY USING INDEX ...; we could just do something similar for the NOT NULL constraint, which would then delete the separate NOT NULL constraint.

Greetings,

Andres Freund
RFC: pg_stat_logmsg
Greetings,

Attached please find a tarball (rather than a patch) for a proposed new contrib extension, pg_stat_logmsg.

The basic idea is to mirror how pg_stat_statements works, except the logged messages keyed by filename, lineno, and elevel are saved with an aggregate count. The format string is displayed (similar to a query jumble) for context, along with function name and sqlerrcode.

I threw this together rather quickly over the past couple of days between meetings, so I am not claiming that it is committable (and it lacks documentation and regression tests as well), but I would love to get feedback on:

1/ the general concept
2/ the pg_stat_statement-like implementation
3/ contrib vs core vs external project

Some samples and data:

`make installcheck` with the extension loaded:
8<--
# All 215 tests passed.

real    2m24.854s
user    0m0.086s
sys     0m0.283s
8<--

`make installcheck` without the extension loaded:
8<--
# All 215 tests passed.

real    2m26.765s
user    0m0.076s
sys     0m0.293s
8<--

Sample output after running make installcheck a couple times (plus a few manually generated ERRORs):

8<--
test=# select sum(count) from pg_stat_logmsg where elevel > 20;
  sum
---
 10554
(1 row)

test=# \x
Expanded display is on.
test=# select * from pg_stat_logmsg where elevel > 20 order by count desc;
-[ RECORD 1 ]---
filename   | aclchk.c
lineno     | 2811
elevel     | 21
funcname   | aclcheck_error
sqlerrcode | 42501
message    | permission denied for schema %s
count      | 578
-[ RECORD 2 ]---
filename   | scan.l
lineno     | 1241
elevel     | 21
funcname   | scanner_yyerror
sqlerrcode | 42601
message    | %s at or near "%s"
count      | 265
...

test=# select * from pg_stat_logmsg where elevel > 20 and sqlerrcode = 'XX000';
-[ RECORD 1 ]---
filename   | tid.c
lineno     | 352
elevel     | 21
funcname   | currtid_for_view
sqlerrcode | XX000
message    | ctid isn't of type TID
count      | 2
-[ RECORD 2 ]---
filename   | pg_locale.c
lineno     | 2493
elevel     | 21
funcname   | pg_ucol_open
sqlerrcode | XX000
message    | could not open collator for locale "%s": %s
count      | 2
...
8<--

Part of the thinking is that people with fleets of postgres instances can use this to scan for various errors that they care about. Additionally it would be useful to look for "should not happen" errors.

I will register this in the July CF and will appreciate feedback.

Thanks!

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

pg_stat_logmsg-000.tgz
Description: application/compressed-tar
Re: Initdb-time block size specification
Hi,

On 2023-06-30 16:28:59 -0500, David Christensen wrote:
> On Fri, Jun 30, 2023 at 3:29 PM Andres Freund wrote:
> > And indeed. Comparing e.g. TPC-H, I see *massive* regressions. Some queries
> > are the same, but others regress by up to 70% (although more commonly around
> > 10-20%).
>
> Hmm, that is definitely not good.
>
> > That's larger than I thought, which makes me suspect that there's some bug in
> > the new code.
>
> Will do a little profiling here to see if I can figure out the
> regression. Which build optimization settings are you seeing this
> under?

gcc 12 with:

meson setup \
  -Doptimization=3 -Ddebug=true \
  -Dc_args="-ggdb -g3 -march=native -mtune=native -fno-plt -fno-semantic-interposition -Wno-array-bounds" \
  -Dc_link_args="-fuse-ld=mold -Wl,--gdb-index,--Bsymbolic" \
  ...

Relevant postgres settings:
-c huge_pages=on
-c shared_buffers=12GB
-c max_connections=120
-c work_mem=32MB
-c autovacuum=0 # I always do that for comparative benchmarks, too much variance
-c track_io_timing=on

The later run where I saw the smaller regression was with work_mem=1GB. I just had forgotten to adjust that. I had loaded tpch scale 5 before, which is why I just used that.

FWIW, even just "SELECT count(*) FROM lineitem;" shows a substantial regression. I disabled parallelism, prewarmed the data and pinned postgres to a single core to reduce noise. The result is the best of three (variance was low in all cases).

                 HEAD      patch
index only scan  1896.364  2242.288
seq scan         1586.990  1628.042

A profile shows that 20% of the runtime in the IOS case is in visibilitymap_get_status():

+   20.50%  postgres.new  postgres.new  [.] visibilitymap_get_status
+   19.54%  postgres.new  postgres.new  [.] ExecInterpExpr
+   14.47%  postgres.new  postgres.new  [.] IndexOnlyNext
+    6.47%  postgres.new  postgres.new  [.] index_deform_tuple_internal
+    4.67%  postgres.new  postgres.new  [.] ExecScan
+    4.12%  postgres.new  postgres.new  [.] btgettuple
+    3.97%  postgres.new  postgres.new  [.] ExecAgg
+    3.92%  postgres.new  postgres.new  [.] _bt_next
+    3.71%  postgres.new  postgres.new  [.] _bt_readpage
+    3.43%  postgres.new  postgres.new  [.] fetch_input_tuple
+    2.87%  postgres.new  postgres.new  [.] index_getnext_tid
+    2.45%  postgres.new  postgres.new  [.] MemoryContextReset
+    2.35%  postgres.new  postgres.new  [.] tts_virtual_clear
+    1.37%  postgres.new  postgres.new  [.] index_deform_tuple
+    1.14%  postgres.new  postgres.new  [.] ExecStoreVirtualTuple
+    1.13%  postgres.new  postgres.new  [.] PredicateLockPage
+    1.12%  postgres.new  postgres.new  [.] int8inc
+    1.04%  postgres.new  postgres.new  [.] ExecIndexOnlyScan
+    0.57%  postgres.new  postgres.new  [.] BufferGetBlockNumber

mostly due to

      │ BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
 2.46 │   lea   -0x60(,%rdx,4),%rcx
      │   xor   %edx,%edx
59.79 │   div   %rcx

You can't have divisions for this kind of thing in the vicinity of a performance critical path. With compile time constants the compiler can turn this into shifts, but that's not possible as-is after the change.

While not quite as bad as divisions, the paths with multiplications are also not going to be ok. E.g.

    return (Block) (BufferBlocks + ((Size) (buffer - 1)) * CLUSTER_BLOCK_SIZE);

is going to be noticeable. You'd have to turn all of this into shifts (and enforce power of 2 sizes, if you aren't yet).

I don't think pre-computed tables are a viable answer FWIW. Even just going through a memory indirection is going to be noticeable. This stuff is in a crapton of hot code paths.

Greetings,

Andres Freund
Re: Initdb-time block size specification
On Sat, Jul 1, 2023 at 01:18:34AM +0200, Tomas Vondra wrote:
> What does "smartctl -a /dev/nvme0n1" say? There should be something like
> this:
>
> Supported LBA Sizes (NSID 0x1)
> Id Fmt  Data  Metadt  Rel_Perf
>  0 -    4096       0         0
>  1 +     512       0         0
>
> which says the drive supports 4k and 512B sectors, and is currently
> configured to use 512B sectors.

It says:

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         2
 1 -    4096       0         0

--
Bruce Momjian        https://momjian.us
EDB                  https://enterprisedb.com

Only you can decide what is important to you.
Re: Initdb-time block size specification
On 7/1/23 01:16, Bruce Momjian wrote:
> On Fri, Jun 30, 2023 at 04:04:57PM -0700, Andres Freund wrote:
>> Hi,
>>
>> On 2023-06-30 18:58:20 -0400, Bruce Momjian wrote:
>>>> [1] On linux I think you need to use stat() to figure out the st_dev for a
>>>> file, then look in /proc/self/mountinfo for the block device, use the name
>>>> of the file to look in /sys/block/$d/queue/physical_block_size.
>>>
>>> I just got a new server:
>>>
>>> https://momjian.us/main/blogs/blog/2023.html#June_28_2023
>>>
>>> so tested this on my new M.2 NVME storage device:
>>>
>>> $ /sys/block/nvme0n1/queue/physical_block_size
>>> 262144
>>
>> Ah, I got the relevant filename wrong. I think it's logical_block_size, not
>> physical one (that's the size of addressing). I didn't realize because the
>> devices I looked at have the same...
>
> That one reports 512 _bytes_ for me:
>
> $ cat /sys/block/nvme0n1/queue/logical_block_size
> 512

What does "smartctl -a /dev/nvme0n1" say? There should be something like this:

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 -    4096       0         0
 1 +     512       0         0

which says the drive supports 4k and 512B sectors, and is currently configured to use 512B sectors.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: Initdb-time block size specification
On Fri, Jun 30, 2023 at 04:04:57PM -0700, Andres Freund wrote:
> Hi,
>
> On 2023-06-30 18:58:20 -0400, Bruce Momjian wrote:
> > > [1] On linux I think you need to use stat() to figure out the st_dev for a
> > > file, then look in /proc/self/mountinfo for the block device, use the name
> > > of the file to look in /sys/block/$d/queue/physical_block_size.
> >
> > I just got a new server:
> >
> > https://momjian.us/main/blogs/blog/2023.html#June_28_2023
> >
> > so tested this on my new M.2 NVME storage device:
> >
> > $ /sys/block/nvme0n1/queue/physical_block_size
> > 262144
>
> Ah, I got the relevant filename wrong. I think it's logical_block_size, not
> physical one (that's the size of addressing). I didn't realize because the
> devices I looked at have the same...

That one reports 512 _bytes_ for me:

$ cat /sys/block/nvme0n1/queue/logical_block_size
512

--
Bruce Momjian        https://momjian.us
EDB                  https://enterprisedb.com

Only you can decide what is important to you.
Re: Initdb-time block size specification
On 7/1/23 00:59, Andres Freund wrote:
> On 2023-06-30 18:37:39 -0400, Bruce Momjian wrote:
>> On Sat, Jul 1, 2023 at 12:21:03AM +0200, Tomas Vondra wrote:
>>> On 6/30/23 23:53, Bruce Momjian wrote:
>>>> For a 4kB write, to say it is not partially written would be to require
>>>> the operating system to guarantee that the 4kB write is not split into
>>>> smaller writes which might each be atomic, because smaller atomic writes
>>>> would not help us.
>>>
>>> Right, that's the dance we do to protect against torn pages. But Andres
>>> suggested that if you have modern storage and configure it correctly,
>>> writing with 4kB pages would be atomic. So we wouldn't need to do this
>>> FPI stuff, eliminating a pretty significant source of write amplification.
>>
>> I agree the hardware is atomic for 4k writes, but do we know the OS
>> always issues 4k writes?
>
> When using a sector size of 4K you *can't* make smaller writes via normal
> paths. The addressing unit is in sectors. The details obviously differ between
> storage protocols, but you pretty much always just specify a start sector and a
> number of sectors to be operated on.
>
> Obviously the kernel could read 4k, modify 512 bytes in-memory, and then write
> 4k back, but that shouldn't be a danger here. There might also be debug
> interfaces to allow reading/writing in different increments, but that'd not be
> something happening during normal operation.

I think it's important to point out that there's a physical and a logical sector size. The "physical" one is what the drive does internally, the "logical" one defines what the OS does. Some drives have 4k physical sectors but only 512B logical sectors. AFAIK most "old" SATA SSDs do it that way, for compatibility reasons. New drives may have 4k physical sectors but typically support both 512B and 4k logical sectors - my nvme SSDs do this, for example.

My understanding is that for drives with 4k physical+logical sectors, the OS would only issue "full" 4k writes.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: Initdb-time block size specification
Hi,

On 2023-06-30 18:58:20 -0400, Bruce Momjian wrote:
> > [1] On linux I think you need to use stat() to figure out the st_dev for a
> > file, then look in /proc/self/mountinfo for the block device, use the name
> > of the file to look in /sys/block/$d/queue/physical_block_size.
>
> I just got a new server:
>
> https://momjian.us/main/blogs/blog/2023.html#June_28_2023
>
> so tested this on my new M.2 NVME storage device:
>
> $ /sys/block/nvme0n1/queue/physical_block_size
> 262144

Ah, I got the relevant filename wrong. I think it's logical_block_size, not the physical one (that's the size of addressing). I didn't realize because the devices I looked at have the same...

Regards,

Andres
Re: Initdb-time block size specification
On Fri, Jun 30, 2023 at 06:58:20PM -0400, Bruce Momjian wrote:
> I just got a new server:
>
> https://momjian.us/main/blogs/blog/2023.html#June_28_2023
>
> so tested this on my new M.2 NVME storage device:
>
> $ /sys/block/nvme0n1/queue/physical_block_size
> 262144
>
> that's 256k, not 4k.

I have another approach to this. My storage device has power protection, so even though it has a 256k physical block size, it should be fine with 4k write atomicity.

--
Bruce Momjian        https://momjian.us
EDB                  https://enterprisedb.com

Only you can decide what is important to you.
Re: Initdb-time block size specification
On 2023-06-30 18:37:39 -0400, Bruce Momjian wrote:
> On Sat, Jul 1, 2023 at 12:21:03AM +0200, Tomas Vondra wrote:
> > On 6/30/23 23:53, Bruce Momjian wrote:
> > > For a 4kB write, to say it is not partially written would be to require
> > > the operating system to guarantee that the 4kB write is not split into
> > > smaller writes which might each be atomic, because smaller atomic writes
> > > would not help us.
> >
> > Right, that's the dance we do to protect against torn pages. But Andres
> > suggested that if you have modern storage and configure it correctly,
> > writing with 4kB pages would be atomic. So we wouldn't need to do this
> > FPI stuff, eliminating a pretty significant source of write amplification.
>
> I agree the hardware is atomic for 4k writes, but do we know the OS
> always issues 4k writes?

When using a sector size of 4K you *can't* make smaller writes via normal paths. The addressing unit is in sectors. The details obviously differ between storage protocols, but you pretty much always just specify a start sector and a number of sectors to be operated on.

Obviously the kernel could read 4k, modify 512 bytes in-memory, and then write 4k back, but that shouldn't be a danger here. There might also be debug interfaces to allow reading/writing in different increments, but that'd not be something happening during normal operation.
Re: Initdb-time block size specification
On Fri, Jun 30, 2023 at 03:51:18PM -0700, Andres Freund wrote:
> > For a 4kB write, to say it is not partially written would be to require
> > the operating system to guarantee that the 4kB write is not split into
> > smaller writes which might each be atomic, because smaller atomic writes
> > would not help us.
>
> That's why we're talking about drives with 4k sector size - you *can't* split
> the writes below that.

Okay, good point.

> The problem is that, as far as I know, it's not always obvious what block size
> is being used on the actual storage level. It's not even trivial when
> operating on a filesystem directly stored on a single block device ([1]). Once
> there's things like LVM or disk encryption involved, it gets pretty hairy
> ([2]). Once you know all the block devices, it's not too bad, but ...
>
> Greetings,
>
> Andres Freund
>
> [1] On linux I think you need to use stat() to figure out the st_dev for a
> file, then look in /proc/self/mountinfo for the block device, use the name
> of the file to look in /sys/block/$d/queue/physical_block_size.

I just got a new server:

https://momjian.us/main/blogs/blog/2023.html#June_28_2023

so tested this on my new M.2 NVME storage device:

$ /sys/block/nvme0n1/queue/physical_block_size
262144

that's 256k, not 4k.

> [2] The above doesn't work because e.g. a device mapper target might only
> support 4k sectors, even though the sectors on the underlying storage device
> are 512b sectors. E.g. my root filesystem is encrypted, and if you follow the
> above recipe (with the added step of resolving the symlink to know the actual
> device name), you would see a 4k sector size. Even though the underlying NVMe
> disk only supports 512b sectors.

Good point.

--
Bruce Momjian        https://momjian.us
EDB                  https://enterprisedb.com

Only you can decide what is important to you.
Re: Initdb-time block size specification
On 7/1/23 00:05, Andres Freund wrote:
> Hi,
>
> On 2023-06-30 23:27:45 +0200, Tomas Vondra wrote:
>> On 6/30/23 23:11, Andres Freund wrote:
>>> ...
>>>
>>> If we really wanted to do this - but I don't think we do - I'd argue for
>>> working on the buildsystem support to build the postgres binary multiple
>>> times, for 4, 8, 16 kB BLCKSZ and having a wrapper postgres binary that just
>>> exec's the relevant "real" binary based on the pg_control value. I really
>>> don't see us ever wanting to make BLCKSZ runtime configurable within one
>>> postgres binary. There's just too much intrinsic overhead associated with
>>> that.
>>
>> I don't quite understand why we shouldn't do this (or at least try to).
>>
>> IMO the benefits of using smaller blocks were substantial (especially
>> for 4kB, most likely due to matching the internal SSD page size). The other
>> benefits (reducing WAL volume) seem rather interesting too.
>
> Mostly because I think there are bigger gains to be had elsewhere.

I think that decision is up to whoever chooses to work on it, especially if performance is not their primary motivation (IIRC this was discussed as part of the TDE session).

> IME not a whole lot of storage ships by default with externally visible 4k
> sectors, but needs to be manually reformatted [1], which loses all data, so it
> has to be done initially.

I don't see why "you have to configure stuff" would be a reason against improvements in this area. I don't know how prevalent storage with 4k sectors is now, but AFAIK it's not hard to get and it's likely to get yet more common in the future.

FWIW I don't think the benefits of different (lower) page sizes hinge on 4k sectors - it's just that not having to do FPIs would make it even more interesting.

> Then postgres would also need OS specific trickery
> to figure out that indeed the IO stack is entirely 4k (checking sector size is
> not enough).

I haven't suggested we should be doing that automatically (it would be nice, but I'd be happy with knowing when it's safe to disable FPW using the GUC in the config). But knowing when it's safe would make it yet more interesting to be able to use a different block size at initdb.

> And you run into the issue that suddenly the #column and
> index-tuple-size limits are lower, which won't make it easier.

True. This limit is annoying, but no one is proposing to change the default page size. initdb would just provide a more convenient way to do that, but the user would have to check. (I rather doubt many people actually index such large values.)

> I think we should change the default of the WAL blocksize to 4k
> though. There's practically no downsides, and it drastically reduces
> postgres-side write amplification in many transactional workloads, by only
> writing out partially filled 4k pages instead of partially filled 8k pages.

+1 (although in my tests the benefits were much smaller than for BLCKSZ)

>> Sure, there are challenges (e.g. the overhead due to making it dynamic).
>> No doubt about that.
>
> I don't think the runtime-dynamic overhead is avoidable with reasonable effort
> (leaving aside compiling code multiple times and switching between).
>
> If we were to start building postgres for multiple compile-time settings, I
> think there are uses other than switching between BLCKSZ, potentially more
> interesting. E.g. you can see substantially improved performance by being able
> to use SSE4.2 without CPU dispatch (partially because it allows compiler
> autovectorization, partially because it allows the compiler to use newer
> non-vectorized math instructions (when targeting AVX IIRC), partially because
> the dispatch overhead is not insubstantial). Another example: ARMv8
> performance is substantially better if you target ARMv8.1-A instead of
> ARMv8.0, due to having atomic instructions instead of LL/SC (it still baffles
> me that they didn't do this earlier, ll/sc is just inherently inefficient).

Maybe, although I think it depends on what parts of the code this would affect. If it's sufficiently small/isolated, it'd be possible to have multiple paths, specialized to a particular page size (a pretty common technique for GPU/SIMD, I believe).

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: Initdb-time block size specification
Hi, On 2023-06-30 17:53:34 -0400, Bruce Momjian wrote: > On Fri, Jun 30, 2023 at 11:42:30PM +0200, Tomas Vondra wrote: > > On 6/30/23 23:11, Andres Freund wrote: > > > I suspect you're going to see more benefits from going to a *lower* > > > setting > > > than a higher one. Some practical issues aside, plenty of storage hardware > > > these days would allow to get rid of FPIs if you go to 4k blocks > > > (although it > > > often requires explicit sysadmin action to reformat the drive into that > > > mode > > > etc). But obviously that's problematic from the "postgres limits" POV. > > > > > > > I wonder what are the conditions/options for disabling FPI. I kinda > > assume it'd apply to new drives with 4k sectors, with properly aligned > > partitions etc. But I haven't seen any particularly clear confirmation > > that's correct. > > I don't think we have ever had to study this --- we just request the > write to the operating system, and we either get a successful reply or > we go into WAL recovery to reread the pre-image. We never really care > if the write is atomic, e.g., an 8k write can be done in 2 4kB writes or 4 > 2kB writes --- we don't care --- we only care if they are all done or > not. Well, that works because we have FPI. This sub-discussion is motivated by getting rid of FPIs. > For a 4kB write, to say it is not partially written would be to require > the operating system to guarantee that the 4kB write is not split into > smaller writes which might each be atomic because smaller atomic writes > would not help us. That's why we're talking about drives with 4k sector size - you *can't* split the writes below that. The problem is that, as far as I know, it's not always obvious what block size is being used on the actual storage level. It's not even trivial when operating on a filesystem directly stored on a single block device ([1]). Once there's things like LVM or disk encryption involved, it gets pretty hairy ([2]). 
Once you know all the block devices, it's not too bad, but ... Greetings, Andres Freund [1] On linux I think you need to use stat() to figure out the st_dev for a file, then look in /proc/self/mountinfo for the block device, use the name of the file to look in /sys/block/$d/queue/physical_block_size. [2] The above doesn't work because e.g. a device mapper target might only support 4k sectors, even though the sectors on the underlying storage device are 512b sectors. E.g. my root filesystem is encrypted, and if you follow the above recipe (with the added step of resolving the symlink to know the actual device name), you would see a 4k sector size. Even though the underlying NVMe disk only supports 512b sectors.
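The recipe in footnote [1] above can be sketched as follows. This is a hedged illustration only: the helper names are invented, the mountinfo parsing is simplified, and the device-mapper/LUKS complication from footnote [2] is deliberately ignored.

```python
# Sketch of the recipe from footnote [1]: map a file to its mount's block
# device via /proc/self/mountinfo, then read the device's physical block
# size from sysfs. Symlink resolution and stacked devices are ignored.
import os

def mount_source(mountinfo_text, path):
    """Return (mount_point, source_device) for the longest mount-point
    prefix of path, using the /proc/self/mountinfo line format:
    ... mount_point options [optional fields] - fstype source super_opts"""
    best = ("", None)
    for line in mountinfo_text.splitlines():
        fields = line.split()
        if len(fields) < 10 or "-" not in fields:
            continue
        mount_point = fields[4]
        sep = fields.index("-")          # separator before fstype/source
        source = fields[sep + 2]
        if path.startswith(mount_point) and len(mount_point) > len(best[0]):
            best = (mount_point, source)
    return best

def physical_block_size(device):
    """E.g. '/dev/nvme0n1' -> int contents of
    /sys/block/nvme0n1/queue/physical_block_size (whole-disk devices only)."""
    name = os.path.basename(device)
    with open(f"/sys/block/{name}/queue/physical_block_size") as f:
        return int(f.read())
```

As the footnotes note, this falls apart as soon as partitions, LVM, or dm-crypt sit between the filesystem and the disk.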
Re: Initdb-time block size specification
On Sat, Jul 1, 2023 at 12:21:03AM +0200, Tomas Vondra wrote: > On 6/30/23 23:53, Bruce Momjian wrote: > > For a 4kB write, to say it is not partially written would be to require > > the operating system to guarantee that the 4kB write is not split into > > smaller writes which might each be atomic because smaller atomic writes > > would not help us. > > Right, that's the dance we do to protect against torn pages. But Andres > suggested that if you have modern storage and configure it correctly, > writing with 4kB pages would be atomic. So we wouldn't need to do this > FPI stuff, eliminating a pretty significant source of write amplification. I agree the hardware is atomic for 4k writes, but do we know the OS always issues 4k writes? -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Only you can decide what is important to you.
Re: pg_upgrade instructions involving "rsync --size-only" might lead to standby corruption?
On Fri, Jun 30, 2023 at 03:18:03PM -0700, Nikolay Samokhvalov wrote: > I wonder, if we ensure that standbys are fully caught up before upgrading the > primary, if we check the latest checkpoint positions, are we good to use > "rsync > --size-only", or there are still some concerns? It seems so to me, but maybe > I'm missing something. Yes, I think you are correct. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Only you can decide what is important to you.
Re: Should we remove db_user_namespace?
On Sat, Jul 01, 2023 at 12:13:26AM +0200, Magnus Hagander wrote: > Strong +1 from here for removing it, assuming you don't find a bunch > of users on -general who are using it. Having never come across one > myself, I think it's unlikely, but I agree it's good to ask. Cool. I'll let that thread [0] sit for a while. [0] https://postgr.es/m/20230630215608.GD2941194%40nathanxps13 -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Initdb-time block size specification
Hi, On 2023-06-30 23:42:30 +0200, Tomas Vondra wrote: > I wonder what are the conditions/options for disabling FPI. I kinda > assume it'd apply to new drives with 4k sectors, with properly aligned > partitions etc. But I haven't seen any particularly clear confirmation > that's correct. Yea, it's not trivial. And the downsides are also substantial from a replication / crash recovery performance POV - even with reading blocks ahead of WAL replay, it's hard to beat just sequentially reading nearly all the data you're going to need. > On 6/30/23 23:11, Andres Freund wrote: > > If we really wanted to do this - but I don't think we do - I'd argue for > > working on the buildsystem support to build the postgres binary multiple > > times, for 4, 8, 16 kB BLCKSZ and having a wrapper postgres binary that just > > exec's the relevant "real" binary based on the pg_control value. I really > > don't see us ever wanting to make BLCKSZ runtime configurable within one > > postgres binary. There's just too much intrinsic overhead associated with > > that. > > How would that work for extensions which may be built for a particular > BLCKSZ value (like pageinspect)? I think we'd need to do something similar for extensions, likely loading them from a path that includes the "subvariant" the server currently is running. Or alternatively adding a suffix to the filename indicating the variant. Something like pageinspect.x86-64-v4-4kB.so. The x86-64-v* stuff is something gcc and clang added a couple years ago, so that not every project has to define different "baseline" levels. I think it was done in collaboration with the System V/Linux AMD64 ABI specification group ([1]). Greetings, Andres [1] https://gitlab.com/x86-psABIs/x86-64-ABI/-/jobs/artifacts/master/raw/x86-64-ABI/abi.pdf?job=build section 3.1.1.
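The variant-suffix naming floated above (e.g. pageinspect.x86-64-v4-4kB.so) is easy to picture as a small name builder. This is purely illustrative: the suffix scheme is a proposal in this thread, not an actual PostgreSQL convention, and the function name is invented.

```python
# Hypothetical builder for variant-suffixed extension library names,
# following the pageinspect.x86-64-v4-4kB.so scheme sketched above.
def variant_library_name(extension, isa_level, blcksz):
    if blcksz % 1024 != 0:
        raise ValueError("BLCKSZ must be a whole number of kB")
    return f"{extension}.{isa_level}-{blcksz // 1024}kB.so"
```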
Re: Initdb-time block size specification
On 6/30/23 23:53, Bruce Momjian wrote: > On Fri, Jun 30, 2023 at 11:42:30PM +0200, Tomas Vondra wrote: >> >> >> On 6/30/23 23:11, Andres Freund wrote: >>> Hi, >>> >>> ... >>> >>> I suspect you're going to see more benefits from going to a *lower* setting >>> than a higher one. Some practical issues aside, plenty of storage hardware >>> these days would allow to get rid of FPIs if you go to 4k blocks (although >>> it >>> often requires explicit sysadmin action to reformat the drive into that mode >>> etc). But obviously that's problematic from the "postgres limits" POV. >>> >> >> I wonder what are the conditions/options for disabling FPI. I kinda >> assume it'd apply to new drives with 4k sectors, with properly aligned >> partitions etc. But I haven't seen any particularly clear confirmation >> that's correct. > > I don't think we have ever had to study this --- we just request the > write to the operating system, and we either get a successful reply or > we go into WAL recovery to reread the pre-image. We never really care > if the write is atomic, e.g., an 8k write can be done in 2 4kB writes or 4 > 2kB writes --- we don't care --- we only care if they are all done or > not. > > For a 4kB write, to say it is not partially written would be to require > the operating system to guarantee that the 4kB write is not split into > smaller writes which might each be atomic because smaller atomic writes > would not help us. > Right, that's the dance we do to protect against torn pages. But Andres suggested that if you have modern storage and configure it correctly, writing with 4kB pages would be atomic. So we wouldn't need to do this FPI stuff, eliminating a pretty significant source of write amplification. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
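To put rough numbers on why full-page images are such a significant source of write amplification, here is a back-of-the-envelope model. All parameters are made up for illustration; WAL record headers, FPI compression, and checkpoint spacing are ignored.

```python
# Toy WAL-volume model: with full_page_writes on, the first update of each
# page after a checkpoint emits a full-page image instead of a small record.
def wal_bytes(pages_touched, updates_per_page, record_bytes, page_size, fpi):
    if fpi:
        per_page = page_size + (updates_per_page - 1) * record_bytes
    else:
        per_page = updates_per_page * record_bytes
    return pages_touched * per_page
```

With, say, 1000 pages each updated 5 times per checkpoint and 100-byte records, 8kB pages give ~8.6 MB of WAL with FPIs versus ~0.5 MB without, which is the kind of gap motivating the 4k-atomic-writes discussion.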
Re: pg_upgrade instructions involving "rsync --size-only" might lead to standby corruption?
On Fri, Jun 30, 2023 at 14:33 Bruce Momjian wrote: > On Fri, Jun 30, 2023 at 04:16:31PM -0400, Robert Haas wrote: > > I'm not quite clear on how Nikolay got into trouble here. I don't > > think I understand under exactly what conditions the procedure is > > reliable and under what conditions it isn't. But there is no way in > > heck I would ever advise anyone to use this procedure on a database > > they actually care about. This is a great party trick or something to > > show off in a lightning talk at PGCon, not something you ought to be > > doing with valuable data that you actually care about. > > Well, it does get used, and if we remove it perhaps we can have it on > our wiki and point to it from our docs. In my case, we performed some additional writes on the primary before running "pg_upgrade -k" and we did it *after* we shut down all the standbys. So those changes were not replicated and then "rsync --size-only" ignored them. (By the way, that cluster has wal_log_hints=on to allow Patroni to run pg_rewind when needed.) 
But this can happen to anyone who follows the procedure from the docs as is and doesn't do any additional steps, because in step 9 "Prepare for standby server upgrades": 1) there is no requirement to follow a specific order to shut down the nodes - "Streaming replication and log-shipping standby servers can remain running until a later step" should probably be changed to a requirement-like "keep them running" 2) checking the latest checkpoint position with pg_controldata now looks like a thing that is good to do, but with an uncertain purpose -- it does not seem to be used to support any decision - "There will be a mismatch if old standby servers were shut down before the old primary or if the old standby servers are still running" should probably be rephrased to say that if there is a mismatch, it's a big problem. So following the steps as is, if some writes on the primary are not replicated (due to whatever reason) before execution of pg_upgrade -k + rsync --size-only, then those writes are going to be silently lost on standbys. I wonder, if we ensure that standbys are fully caught up before upgrading the primary, if we check the latest checkpoint positions, are we good to use "rsync --size-only", or are there still some concerns? It seems so to me, but maybe I'm missing something. > -- Thanks, Nikolay Samokhvalov Founder, Postgres.ai
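The checkpoint-position check discussed above can be sketched as follows: parse the "Latest checkpoint location" LSNs reported by pg_controldata on primary and standbys and insist they match before proceeding. The helper names are invented; this is an illustration of the decision the docs currently leave implicit, not a supported tool.

```python
# Hypothetical pre-rsync sanity check: compare "Latest checkpoint location"
# values (PostgreSQL LSNs like '0/3000148') from pg_controldata output.
def parse_lsn(lsn):
    """Convert a textual LSN 'hi/lo' into an absolute WAL byte position."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def standbys_caught_up(primary_lsn, standby_lsns):
    """After a clean shutdown, every standby's latest checkpoint location
    should equal the primary's; any mismatch means writes may be lost."""
    p = parse_lsn(primary_lsn)
    return all(parse_lsn(s) == p for s in standby_lsns)
```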
Re: Should we remove db_user_namespace?
On Fri, Jun 30, 2023 at 11:43 PM Nathan Bossart wrote: > > On Fri, Jun 30, 2023 at 05:40:18PM -0400, Tom Lane wrote: > > Might be worth asking on pgsql-general whether anyone knows of > > active use of this feature. If not, I'm good with killing it. > > Will do. Strong +1 from here for removing it, assuming you don't find a bunch of users on -general who are using it. Having never come across one myself, I think it's unlikely, but I agree it's good to ask. -- Magnus Hagander Me: https://www.hagander.net/ Work: https://www.redpill-linpro.com/
Re: Extension Enhancement: Buffer Invalidation in pg_buffercache
On Fri, Jun 30, 2023 at 10:47 PM Palak Chaturvedi wrote: > pgbench=# select count(pg_buffercache_invalidate(bufferid)) from > pg_buffercache where relfilenode = > pg_relation_filenode('pgbench_accounts'::regclass); Hi Palak, Thanks for working on this! I think this will be very useful for testing existing workloads but also for testing future work on prefetching with AIO (and DIO), work on putting SLRUs (or anything else) into the buffer pool, nearby proposals for caching buffer mapping information, etc etc. Palak and I talked about this idea a bit last week (stimulated by a recent thread[1], but the topic has certainly come up before), and we discussed some different ways one could specify which pages are dropped. For example, perhaps the pg_prewarm extension could have an 'unwarm' option instead. I personally thought the buffer ID-based approach was quite good because it's extremely simple, while giving the user the full power of SQL to say which buffers. Half a table? Visibility map? Everything? Root page of an index? I think that's probably better than something that requires more code and complication but is less flexible in the end. It feels like the right level of rawness for something primarily of interest to hackers and advanced users. I don't think it matters that there is a window between selecting a buffer ID and invalidating it, for the intended use cases. That's my vote, anyway, let's see if others have other ideas... We also talked a bit about how one might control the kernel page cache in more fine-grained ways for testing purposes, but it seems like the pgfincore project has that covered with its pgfadvise_willneed() and pgfadvise_dontneed(). IMHO that project could use more page-oriented operations (instead of just counts and coarse-grained operations) but that's something that could be material for patches to send to the extension maintainers. 
This work, in contrast, is more tangled up with bufmgr.c internals, so it feels like this feature belongs in a core contrib module. Some initial thoughts on the patch: I wonder if we should include a simple exercise in contrib/pg_buffercache/sql/pg_buffercache.sql. One problem is that it's not guaranteed to succeed in general. It doesn't wait for pins to go away, and it doesn't retry cleaning dirty buffers after one attempt, it just returns false, which I think is probably the right approach, but it makes the behaviour too non-deterministic for simple tests. Perhaps it's enough to include an exercise where we call it a few times to hit a couple of cases, but not verify what effect it has. It should be restricted by role, but I wonder which role it should be. Testing for superuser is now out of fashion. Where the Makefile mentions 1.4--1.5.sql, the meson.build file needs to do the same. That's because PostgreSQL is currently in transition from autoconf/gmake to meson/ninja[2], so for now we have to maintain both build systems. That's why it fails to build in some CI tasks[3]. You can enable CI in your own GitHub account if you want to run test builds on several operating systems, see [4] for info. [1] https://www.postgresql.org/message-id/flat/CAFSGpE3y_oMK1uHhcHxGxBxs%2BKrjMMdGrE%2B6HHOu0vttVET0UQ%40mail.gmail.com [2] https://wiki.postgresql.org/wiki/Meson [3] http://cfbot.cputube.org/palak-chaturvedi.html [4] https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob_plain;f=src/tools/ci/README;hb=HEAD
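The behaviour described above (don't wait for pins, make a single attempt to clean a dirty buffer, return false rather than retry) can be modelled with a toy sketch. All names here are hypothetical and this is not the patch's code; it only illustrates why the result is non-deterministic enough to complicate regression tests.

```python
# Toy model of the described invalidation semantics, not bufmgr.c internals.
class Buffer:
    def __init__(self, pinned=False, dirty=False, flush_ok=True):
        self.pinned, self.dirty, self.flush_ok = pinned, dirty, flush_ok
        self.valid = True

def try_invalidate(buf):
    """Skip pinned buffers, try cleaning a dirty buffer exactly once,
    and report False on failure instead of waiting or retrying."""
    if buf.pinned:
        return False
    if buf.dirty:
        if not buf.flush_ok:          # the single write attempt failed
            return False
        buf.dirty = False
    buf.valid = False                 # drop the buffer from the pool
    return True
```

Because any call can legitimately return false, a test can exercise the function but cannot assert a specific post-state, which matches the concern about pg_buffercache's regression script.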
Re: Initdb-time block size specification
Hi, On 2023-06-30 23:27:45 +0200, Tomas Vondra wrote: > On 6/30/23 23:11, Andres Freund wrote: > > ... > > > > If we really wanted to do this - but I don't think we do - I'd argue for > > working on the buildsystem support to build the postgres binary multiple > > times, for 4, 8, 16 kB BLCKSZ and having a wrapper postgres binary that just > > exec's the relevant "real" binary based on the pg_control value. I really > > don't see us ever wanting to make BLCKSZ runtime configurable within one > > postgres binary. There's just too much intrinsic overhead associated with > > that. > > I don't quite understand why we shouldn't do this (or at least try to). > > IMO the benefits of using smaller blocks were substantial (especially > for 4kB, most likely due to matching the internal SSD page size). The other > benefits (reducing WAL volume) seem rather interesting too. Mostly because I think there are bigger gains to be had elsewhere. IME not a whole lot of storage ships by default with externally visible 4k sectors, but needs to be manually reformatted [1], which loses all data, so it has to be done initially. Then postgres would also need OS specific trickery to figure out that indeed the IO stack is entirely 4k (checking sector size is not enough). And you run into the issue that suddenly the #column and index-tuple-size limits are lower, which won't make it easier. I think we should change the default of the WAL blocksize to 4k though. There's practically no downsides, and it drastically reduces postgres-side write amplification in many transactional workloads, by only writing out partially filled 4k pages instead of partially filled 8k pages. > Sure, there are challenges (e.g. the overhead due to making it dynamic). > No doubt about that. I don't think the runtime-dynamic overhead is avoidable with reasonable effort (leaving aside compiling code multiple times and switching between). 
If we were to start building postgres for multiple compile-time settings, I think there are uses other than switching between BLCKSZ, potentially more interesting. E.g. you can see substantially improved performance by being able to use SSE4.2 without CPU dispatch (partially because it allows compiler autovectorization, partially because it allows the compiler to use newer non-vectorized math instructions (when targeting AVX IIRC), partially because the dispatch overhead is not insubstantial). Another example: ARMv8 performance is substantially better if you target ARMv8.1-A instead of ARMv8.0, due to having atomic instructions instead of LL/SC (it still baffles me that they didn't do this earlier, ll/sc is just inherently inefficient). Greetings, Andres Freund [1] To see the available LBA formats: for d in /dev/nvme*n1; do echo "$d:"; sudo nvme id-ns -H $d|grep '^LBA Format';echo ;done
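The WAL-block-size point above ("only writing out partially filled 4k pages instead of partially filled 8k pages") reduces to simple padding arithmetic: a flush that ends mid-page must pad the tail out to a full WAL block. An illustrative sketch:

```python
# Bytes physically written when flushing WAL whose tail ends mid-page:
# the final partial page is padded out to a full WAL block.
def flush_bytes(wal_tail_bytes, wal_block_size):
    full, rem = divmod(wal_tail_bytes, wal_block_size)
    return (full + (1 if rem else 0)) * wal_block_size
```

A small commit record of a few hundred bytes thus costs a whole 8192-byte write with the default XLOG_BLCKSZ, but only 4096 bytes at 4k, which is the write-amplification reduction being claimed.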
Re: Initdb-time block size specification
On Fri, Jun 30, 2023 at 11:42:30PM +0200, Tomas Vondra wrote: > > > On 6/30/23 23:11, Andres Freund wrote: > > Hi, > > > > ... > > > > I suspect you're going to see more benefits from going to a *lower* setting > > than a higher one. Some practical issues aside, plenty of storage hardware > > these days would allow to get rid of FPIs if you go to 4k blocks (although > > it > > often requires explicit sysadmin action to reformat the drive into that mode > > etc). But obviously that's problematic from the "postgres limits" POV. > > > > I wonder what are the conditions/options for disabling FPI. I kinda > assume it'd apply to new drives with 4k sectors, with properly aligned > partitions etc. But I haven't seen any particularly clear confirmation > that's correct. I don't think we have ever had to study this --- we just request the write to the operating system, and we either get a successful reply or we go into WAL recovery to reread the pre-image. We never really care if the write is atomic, e.g., an 8k write can be done in 2 4kB writes or 4 2kB writes --- we don't care --- we only care if they are all done or not. For a 4kB write, to say it is not partially written would be to require the operating system to guarantee that the 4kB write is not split into smaller writes which might each be atomic because smaller atomic writes would not help us. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Only you can decide what is important to you.
Re: Should we remove db_user_namespace?
On Fri, Jun 30, 2023 at 05:40:18PM -0400, Tom Lane wrote: > Might be worth asking on pgsql-general whether anyone knows of > active use of this feature. If not, I'm good with killing it. Will do. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Should we remove db_user_namespace?
On Fri, Jun 30, 2023 at 05:29:04PM -0400, Bruce Momjian wrote: > I am the original author, and it was a hack designed to support > per-database user names. I am fine with its removal. Thanks, Bruce. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Initdb-time block size specification
On 6/30/23 23:11, Andres Freund wrote: > Hi, > > ... > > I suspect you're going to see more benefits from going to a *lower* setting > than a higher one. Some practical issues aside, plenty of storage hardware > these days would allow to get rid of FPIs if you go to 4k blocks (although it > often requires explicit sysadmin action to reformat the drive into that mode > etc). But obviously that's problematic from the "postgres limits" POV. > I wonder what are the conditions/options for disabling FPI. I kinda assume it'd apply to new drives with 4k sectors, with properly aligned partitions etc. But I haven't seen any particularly clear confirmation that's correct. > > If we really wanted to do this - but I don't think we do - I'd argue for > working on the buildsystem support to build the postgres binary multiple > times, for 4, 8, 16 kB BLCKSZ and having a wrapper postgres binary that just > exec's the relevant "real" binary based on the pg_control value. I really > don't see us ever wanting to make BLCKSZ runtime configurable within one > postgres binary. There's just too much intrinsic overhead associated with > that. > How would that work for extensions which may be built for a particular BLCKSZ value (like pageinspect)? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Initdb-time block size specification
On Fri, Jun 30, 2023 at 4:17 PM Andres Freund wrote: > > Hi, > > On 2023-06-30 15:05:54 -0500, David Christensen wrote: > > > I am fairly certain this is going to be causing substantial performance > > > regressions. I think we should reject this even if we don't immediately > > > find > > > them, because it's almost guaranteed to cause some. > > > > What would be considered substantial? Some overhead would be expected, > > but I think having an actual patch to evaluate lets us see what > > potential there is. > > Anything beyond 1-2%, although even that imo is a hard sell. I'd agree that that threshold seems like a reasonable target, and anything much above that would be regressive. > > > Besides this, I've not really heard any convincing justification for > > > needing > > > this in the first place. > > > > Doing this would open up experiments in larger block sizes, so we > > would be able to have larger indexable tuples, say, or be able to > > store data types that are larger than currently supported for tuple > > row limits without dropping to toast (native vector data types come to > > mind as a candidate here). > > You can do experiments today with the compile time option. Which does not > require regressing performance for everyone. Sure, not arguing that this is more performant than the current approach. > > We've had 8k blocks for a long time while hardware has improved over 20+ > > years, and it would be interesting to see how tuning things would open up > > additional avenues for performance without requiring packagers to make a > > single choice on this regardless of use-case. > > I suspect you're going to see more benefits from going to a *lower* setting > than a higher one. Some practical issues aside, plenty of storage hardware > these days would allow to get rid of FPIs if you go to 4k blocks (although it > often requires explicit sysadmin action to reformat the drive into that mode > etc). 
But obviously that's problematic from the "postgres limits" POV. > > > If we really wanted to do this - but I don't think we do - I'd argue for > working on the buildsystem support to build the postgres binary multiple > times, for 4, 8, 16 kB BLCKSZ and having a wrapper postgres binary that just > exec's the relevant "real" binary based on the pg_control value. I really > don't see us ever wanting to make BLCKSZ runtime configurable within one > postgres binary. There's just too much intrinsic overhead associated with > that. You may well be right, but I think if we haven't tried to do that and measure it, it's hard to say exactly. There are of course more parts of the system that are about BLCKSZ than the backend, plus you'd need to build other extensions to support each option, so there is a lot more that would need to change. (That's neither here nor there, as my approach also involves changing all those places, so change isn't inherently bad; just saying it's not a trivial solution to merely iterate over the block size for binaries.) David
Re: Should we remove db_user_namespace?
Nathan Bossart writes: > I'm personally not aware of anyone using this parameter. Might be worth asking on pgsql-general whether anyone knows of active use of this feature. If not, I'm good with killing it. regards, tom lane
Re: [PATCH] Extend the length of BackgroundWorker.bgw_library_name
On Wed, Apr 26, 2023 at 03:07:18PM +0300, Aleksander Alekseev wrote: > The commit message may require a bit of tweaking by the committer but > other than that the patch seems to be fine. I'm going to mark it as > RfC in a bit unless anyone objects. In v4, I've introduced a new BGW_LIBLEN macro and set it to the default value of MAXPGPATH (1024). This way, the value can live in bgworker.h like the other BGW_* macros do. Plus, this should make the assertion that checks for backward compatibility unnecessary. Since bgw_library_name is essentially a path, I can see the argument that we should just set BGW_LIBLEN to MAXPGPATH directly. I'm curious what folks think about this. I also changed the added sizeofs to use the macro for consistency with the surrounding code. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From 3760f8efd0a20f60f7eda19450f97c5b2e5daf8f Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Fri, 30 Jun 2023 14:25:20 -0700 Subject: [PATCH v4 1/1] extend bgw_library_name --- doc/src/sgml/bgworker.sgml | 2 +- src/backend/postmaster/bgworker.c | 2 +- src/backend/replication/logical/launcher.c | 4 ++-- src/include/postmaster/bgworker.h | 3 ++- 4 files changed, 6 insertions(+), 5 deletions(-) diff --git a/doc/src/sgml/bgworker.sgml b/doc/src/sgml/bgworker.sgml index 7ba5da27e5..f890216333 100644 --- a/doc/src/sgml/bgworker.sgml +++ b/doc/src/sgml/bgworker.sgml @@ -59,7 +59,7 @@ typedef struct BackgroundWorker int bgw_flags; BgWorkerStartTime bgw_start_time; int bgw_restart_time; /* in seconds, or BGW_NEVER_RESTART */ -char bgw_library_name[BGW_MAXLEN]; +char bgw_library_name[BGW_LIBLEN]; char bgw_function_name[BGW_MAXLEN]; Datum bgw_main_arg; char bgw_extra[BGW_EXTRALEN]; diff --git a/src/backend/postmaster/bgworker.c b/src/backend/postmaster/bgworker.c index 0dd22b2351..bee2b895e3 100644 --- a/src/backend/postmaster/bgworker.c +++ b/src/backend/postmaster/bgworker.c @@ -362,7 +362,7 @@ BackgroundWorkerStateChange(bool allow_new_workers) 
ascii_safe_strlcpy(rw->rw_worker.bgw_type, slot->worker.bgw_type, BGW_MAXLEN); ascii_safe_strlcpy(rw->rw_worker.bgw_library_name, - slot->worker.bgw_library_name, BGW_MAXLEN); + slot->worker.bgw_library_name, BGW_LIBLEN); ascii_safe_strlcpy(rw->rw_worker.bgw_function_name, slot->worker.bgw_function_name, BGW_MAXLEN); diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c index 8395ae7b23..e1ed820b31 100644 --- a/src/backend/replication/logical/launcher.c +++ b/src/backend/replication/logical/launcher.c @@ -456,7 +456,7 @@ retry: bgw.bgw_flags = BGWORKER_SHMEM_ACCESS | BGWORKER_BACKEND_DATABASE_CONNECTION; bgw.bgw_start_time = BgWorkerStart_RecoveryFinished; - snprintf(bgw.bgw_library_name, BGW_MAXLEN, "postgres"); + snprintf(bgw.bgw_library_name, BGW_LIBLEN, "postgres"); if (is_parallel_apply_worker) snprintf(bgw.bgw_function_name, BGW_MAXLEN, "ParallelApplyWorkerMain"); @@ -910,7 +910,7 @@ ApplyLauncherRegister(void) bgw.bgw_flags = BGWORKER_SHMEM_ACCESS | BGWORKER_BACKEND_DATABASE_CONNECTION; bgw.bgw_start_time = BgWorkerStart_RecoveryFinished; - snprintf(bgw.bgw_library_name, BGW_MAXLEN, "postgres"); + snprintf(bgw.bgw_library_name, BGW_LIBLEN, "postgres"); snprintf(bgw.bgw_function_name, BGW_MAXLEN, "ApplyLauncherMain"); snprintf(bgw.bgw_name, BGW_MAXLEN, "logical replication launcher"); diff --git a/src/include/postmaster/bgworker.h b/src/include/postmaster/bgworker.h index 845d4498e6..f8c401093e 100644 --- a/src/include/postmaster/bgworker.h +++ b/src/include/postmaster/bgworker.h @@ -84,6 +84,7 @@ typedef enum #define BGW_DEFAULT_RESTART_INTERVAL 60 #define BGW_NEVER_RESTART-1 #define BGW_MAXLEN 96 +#define BGW_LIBLEN 1024 #define BGW_EXTRALEN 128 typedef struct BackgroundWorker @@ -93,7 +94,7 @@ typedef struct BackgroundWorker int bgw_flags; BgWorkerStartTime bgw_start_time; int bgw_restart_time; /* in seconds, or BGW_NEVER_RESTART */ - char bgw_library_name[BGW_MAXLEN]; + char bgw_library_name[BGW_LIBLEN]; char 
bgw_function_name[BGW_MAXLEN]; Datum bgw_main_arg; char bgw_extra[BGW_EXTRALEN]; -- 2.25.1
Re: PG 16 draft release notes ready
On Fri, Jun 30, 2023 at 05:29:17PM +0900, Masahiro Ikeda wrote: > Hi, > > Thanks for making the release notes. I found that the release notes for > PG16 beta2 mention the following reverted feature. > > ``` > > > > > Have initdb use ICU by default if ICU is enabled in the binary (Jeff Davis) > > > > Option --locale-provider=libc can be used to disable ICU. > > > ``` > > Unfortunately, the feature was reverted by the following commit. > * > https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=2535c74b1a6190cc42e13f6b6b55d94bff4b7dd6 Oh, I didn't notice the revert --- item removed. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Only you can decide what is important to you.
Re: Initdb-time block size specification
On Fri, Jun 30, 2023 at 4:27 PM Tomas Vondra wrote: > On 6/30/23 23:11, Andres Freund wrote: > > ... > > > > If we really wanted to do this - but I don't think we do - I'd argue for > > working on the buildsystem support to build the postgres binary multiple > > times, for 4, 8, 16 kB BLCKSZ and having a wrapper postgres binary that just > > exec's the relevant "real" binary based on the pg_control value. I really > > don't see us ever wanting to make BLCKSZ runtime configurable within one > > postgres binary. There's just too much intrinsic overhead associated with > > that. > > > > I don't quite understand why we shouldn't do this (or at least try to). > IMO the benefits of using smaller blocks were substantial (especially > for 4kB, most likely due matching the internal SSD page size). The other > benefits (reducing WAL volume) seem rather interesting too. If it's dynamic, we could also be able to add detection of the best block size at initdb time, leading to improvements all around, say. :-) > Sure, there are challenges (e.g. the overhead due to making it dynamic). > No doubt about that. I definitely agree that it seems worth looking into. If nothing else, being able to quantify just what/where the overhead comes from may be instructive for future efforts. David
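The wrapper idea discussed in this thread (one stub binary that exec's the right per-BLCKSZ build based on the pg_control value) can be illustrated with a hypothetical dispatcher. A real wrapper would read pg_control directly rather than scrape pg_controldata output, and the binary names here are invented.

```python
# Hypothetical dispatch logic for a wrapper binary: determine the cluster's
# block size, then pick the matching per-BLCKSZ build to exec.
import re

def pick_binary(pg_controldata_output, binaries):
    """binaries maps a BLCKSZ value to the path of the build compiled
    for it, e.g. {4096: 'postgres-4k', 8192: 'postgres-8k'}."""
    m = re.search(r"Database block size:\s*(\d+)", pg_controldata_output)
    if not m:
        raise ValueError("block size not found in pg_controldata output")
    blcksz = int(m.group(1))
    if blcksz not in binaries:
        raise ValueError(f"no build available for BLCKSZ={blcksz}")
    return binaries[blcksz]
```

The wrapper would then os.exec* the returned path; extensions would need the same per-variant treatment, as noted elsewhere in the thread.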
Re: pg_upgrade instructions involving "rsync --size-only" might lead to standby corruption?
On Fri, Jun 30, 2023 at 04:16:31PM -0400, Robert Haas wrote: > On Fri, Jun 30, 2023 at 1:41 PM Bruce Momjian wrote: > > I think --size-only was chosen only because it is the minimal comparison > > option. > > I think it's worse than that. I think that the procedure relies on > using the --size-only option to intentionally trick rsync into > thinking that files are identical when they're not. > > Say we have a file like base/23246/78901 on the primary. Unless > wal_log_hints=on, the standby version is very likely different, but > only in ways that don't matter to WAL replay. So the procedure aims to > trick rsync into hard-linking the version of that file that exists on > the standby in the old cluster into the new cluster on the standby, > instead of copying the slightly-different version from the master, > thus making the upgrade very fast. If rsync actually checksummed the > files, it would realize that they're different and copy the file from > the original primary, which the person who wrote this procedure does > not want. What is the problem with having different hint bits between the two servers? > That's kind of a crazy thing for us to be documenting. I think we > really ought to consider removing from this documentation. If somebody > wants to write a reliable tool for this to ship as part of PostgreSQL, > well and good. But this procedure has no real sanity checks and is > based on very fragile assumptions. That doesn't seem suitable for > end-user use. > > I'm not quite clear on how Nikolay got into trouble here. I don't > think I understand under exactly what conditions the procedure is > reliable and under what conditions it isn't. But there is no way in > heck I would ever advise anyone to use this procedure on a database > they actually care about. This is a great party trick or something to > show off in a lightning talk at PGCon, not something you ought to be > doing with valuable data that you actually care about. 
Well, it does get used, and if we remove it perhaps we can have it on our wiki and point to it from our docs. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Only you can decide what is important to you.
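For readers who have not seen it, the procedure being debated is the rsync step from the pg_upgrade documentation for upgrading standbys. A rough sketch of its shape is below; the paths and host name are placeholders, not values from this thread:

```shell
# Sketch of the documented standby-upgrade step (paths/host are hypothetical).
# Run on the primary after pg_upgrade, with old and new clusters side by side.
# --size-only is the flag at issue: rsync treats same-sized files as identical,
# so most standby files become hard links into the new cluster instead of
# being copied from the primary.
rsync --archive --delete --hard-links --size-only --no-inc-recursive \
    /opt/PostgreSQL/15 /opt/PostgreSQL/16 standby.example.com:/opt/PostgreSQL
```

As the thread notes, the speed of this step comes precisely from rsync *not* comparing file contents, which is also what makes it fragile.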
Re: Initdb-time block size specification
On Fri, Jun 30, 2023 at 4:12 PM Tomas Vondra wrote: > > > > On 6/30/23 22:05, David Christensen wrote: > > On Fri, Jun 30, 2023 at 2:39 PM Andres Freund wrote: > >> > >> ... > >> > >> Besides this, I've not really heard any convincing justification for > >> needing > >> this in the first place. > > > > Doing this would open up experiments in larger block sizes, so we > > would be able to have larger indexable tuples, say, or be able to > > store data types that are larger than currently supported for tuple > > row limits without dropping to toast (native vector data types come to > > mind as a candidate here). We've had 8k blocks for a long time while > > hardware has improved over 20+ years, and it would be interesting to > > see how tuning things would open up additional avenues for performance > > without requiring packagers to make a single choice on this regardless > > of use-case. (The fact that we allow compiling this at a different > > value suggests there is thought to be some utility having this be > > something other than the default value.) > > > > I just think it's one of those things that is hard to evaluate without > > actually having something specific, which is why we have this patch > > now. > > > > But it's possible to evaluate that - you just need to rebuild with a > different configuration option. Yes, allowing doing that at initdb is > simpler and allows testing this on systems where rebuilding is not > convenient. And having a binary that can deal with any block size would > be nice too. > > In fact, I did exactly that a year ago for a conference, and I spoke > about it at the 2022 unconference too. Not sure if there's recording > from pgcon, but there is one from the other conference [1][2]. > > The short story is that the possible gains are significant (say +50%) > for data sets that don't fit into RAM. 
But that was with block size set > at compile time, the question is what's the impact of making it a > variable instead of a macro > > [1] https://www.youtube.com/watch?v=mVKpoQxtCXk > [2] https://blog.pgaddict.com/pdf/block-sizes-postgresvision-2022.pdf Cool, thanks for the video links; will review. David
Re: Initdb-time block size specification
On Fri, Jun 30, 2023 at 3:29 PM Andres Freund wrote: >> And indeed. Comparing e.g. TPC-H, I see *massive* regressions. Some queries > are the same, but others regress by up to 70% (although more commonly around > 10-20%). Hmm, that is definitely not good. > That's larger than I thought, which makes me suspect that there's some bug in > the new code. Will do a little profiling here to see if I can figure out the regression. Which build optimization settings are you seeing this under? > Interestingly, repeating the benchmark with a larger work_mem setting, the > regressions are still quite present, but smaller. I suspect the planner > chooses smarter plans which move bottlenecks more towards hashjoin code etc, > which won't be affected by this change. Interesting. > IOW, you seriously need to evaluate analytics queries before this is worth > looking at further. Makes sense, thanks for reviewing. David
Re: Should we remove db_user_namespace?
On Fri, Jun 30, 2023 at 01:05:09PM -0700, Nathan Bossart wrote: > I think this is the second decennial thread [0] for removing this GUC. > This topic came up at PGCon, so I thought I'd start the discussion on the > lists. > > I'm personally not aware of anyone using this parameter. A couple of my > colleagues claimed to have used it in the aughts, but they also noted that > users were confused by the current implementation, and they seemed > generally in favor of removing it. AFAICT the strongest reason for keeping > it is that there is presently no viable replacement. Does this opinion > still stand? If so, perhaps we can look into adding a viable replacement > for v17. I am the original author, and it was a hack designed to support per-database user names. I am fine with its removal. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Only you can decide what is important to you.
Re: Initdb-time block size specification
On 6/30/23 23:11, Andres Freund wrote: > ... > > If we really wanted to do this - but I don't think we do - I'd argue for > working on the buildsystem support to build the postgres binary multiple > times, for 4, 8, 16 kB BLCKSZ and having a wrapper postgres binary that just > exec's the relevant "real" binary based on the pg_control value. I really > don't see us ever wanting to make BLCKSZ runtime configurable within one > postgres binary. There's just too much intrinsic overhead associated with > that. > I don't quite understand why we shouldn't do this (or at least try to). IMO the benefits of using smaller blocks were substantial (especially for 4kB, most likely due to matching the internal SSD page size). The other benefits (reducing WAL volume) seem rather interesting too. Sure, there are challenges (e.g. the overhead due to making it dynamic). No doubt about that. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Initdb-time block size specification
Hi, On 2023-06-30 15:05:54 -0500, David Christensen wrote: > > I am fairly certain this is going to be causing substantial performance > > regressions. I think we should reject this even if we don't immediately > > find > > them, because it's almost guaranteed to cause some. > > What would be considered substantial? Some overhead would be expected, > but I think having an actual patch to evaluate lets us see what > potential there is. Anything beyond 1-2%, although even that imo is a hard sell. > > Besides this, I've not really heard any convincing justification for needing > > this in the first place. > > Doing this would open up experiments in larger block sizes, so we > would be able to have larger indexable tuples, say, or be able to > store data types that are larger than currently supported for tuple > row limits without dropping to toast (native vector data types come to > mind as a candidate here). You can do experiments today with the compile time option. Which does not require regressing performance for everyone. > We've had 8k blocks for a long time while hardware has improved over 20+ > years, and it would be interesting to see how tuning things would open up > additional avenues for performance without requiring packagers to make a > single choice on this regardless of use-case. I suspect you're going to see more benefits from going to a *lower* setting than a higher one. Some practical issues aside, plenty of storage hardware these days would allow to get rid of FPIs if you go to 4k blocks (although it often requires explicit sysadmin action to reformat the drive into that mode etc). But obviously that's problematic from the "postgres limits" POV. If we really wanted to do this - but I don't think we do - I'd argue for working on the buildsystem support to build the postgres binary multiple times, for 4, 8, 16 kB BLCKSZ and having a wrapper postgres binary that just exec's the relevant "real" binary based on the pg_control value. 
I really don't see us ever wanting to make BLCKSZ runtime configurable within one postgres binary. There's just too much intrinsic overhead associated with that. Greetings, Andres Freund
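To make the wrapper-binary idea concrete, here is a minimal sketch. The pg_controldata output line is canned here so the selection logic can be shown without a live cluster, and the per-size binary naming scheme (`postgres.<blcksz>`) is invented purely for illustration:

```shell
# Hypothetical sketch of the wrapper Andres describes: read the block size
# from pg_control (normally via `pg_controldata $PGDATA`) and exec the
# matching real binary. A canned line stands in for real pg_controldata
# output below.
controldata_line='Database block size:                  8192'

# Extract the block size (field 4 of the "Database block size" line).
blcksz=$(printf '%s\n' "$controldata_line" | awk '/Database block size/ {print $4}')

# A real wrapper would now do something like:
#   exec "/usr/lib/postgresql/bin/postgres.${blcksz}" "$@"
echo "postgres.${blcksz}"
```

The point of the design is that each real binary keeps BLCKSZ as a compile-time constant, avoiding the per-access indirection a runtime-variable block size would impose.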
Re: Initdb-time block size specification
On 6/30/23 22:05, David Christensen wrote: > On Fri, Jun 30, 2023 at 2:39 PM Andres Freund wrote: >> >> ... >> >> Besides this, I've not really heard any convincing justification for needing >> this in the first place. > > Doing this would open up experiments in larger block sizes, so we > would be able to have larger indexable tuples, say, or be able to > store data types that are larger than currently supported for tuple > row limits without dropping to toast (native vector data types come to > mind as a candidate here). We've had 8k blocks for a long time while > hardware has improved over 20+ years, and it would be interesting to > see how tuning things would open up additional avenues for performance > without requiring packagers to make a single choice on this regardless > of use-case. (The fact that we allow compiling this at a different > value suggests there is thought to be some utility having this be > something other than the default value.) > > I just think it's one of those things that is hard to evaluate without > actually having something specific, which is why we have this patch > now. > But it's possible to evaluate that - you just need to rebuild with a different configuration option. Yes, allowing doing that at initdb is simpler and allows testing this on systems where rebuilding is not convenient. And having a binary that can deal with any block size would be nice too. In fact, I did exactly that a year ago for a conference, and I spoke about it at the 2022 unconference too. Not sure if there's recording from pgcon, but there is one from the other conference [1][2]. The short story is that the possible gains are significant (say +50%) for data sets that don't fit into RAM. 
But that was with block size set at compile time, the question is what's the impact of making it a variable instead of a macro [1] https://www.youtube.com/watch?v=mVKpoQxtCXk [2] https://blog.pgaddict.com/pdf/block-sizes-postgresvision-2022.pdf regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
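For anyone wanting to repeat that kind of experiment today, the compile-time route Tomas describes looks roughly like this (prefix and paths are illustrative; `--with-blocksize` takes the size in kilobytes):

```shell
# Build a 4kB-block postgres from a source tree (assumed paths)
./configure --with-blocksize=4 --prefix=/opt/pg-4k
make -j"$(nproc)" && make install
/opt/pg-4k/bin/initdb -D /opt/pg-4k/data   # cluster is tied to this block size
```

A cluster initialized this way can only be served by binaries built with the same block size, which is exactly the inconvenience the patch under discussion aims to remove.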
Re: Should we remove db_user_namespace?
On Fri, Jun 30, 2023 at 01:05:09PM -0700, Nathan Bossart wrote: > The attached patch simply removes the GUC. And here's a new version of the patch with docs that actually build. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From 3b7fdd41eb429bc9bb03dcecf38126fbc63dafa3 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Fri, 30 Jun 2023 12:46:08 -0700 Subject: [PATCH v2 1/1] remove db_user_namespace --- doc/src/sgml/client-auth.sgml | 5 -- doc/src/sgml/config.sgml | 52 --- src/backend/libpq/auth.c | 5 -- src/backend/libpq/hba.c | 12 - src/backend/postmaster/postmaster.c | 19 --- src/backend/utils/misc/guc_tables.c | 9 src/backend/utils/misc/postgresql.conf.sample | 1 - src/include/libpq/pqcomm.h| 2 - 8 files changed, 105 deletions(-) diff --git a/doc/src/sgml/client-auth.sgml b/doc/src/sgml/client-auth.sgml index 204d09df67..6c95f0df1e 100644 --- a/doc/src/sgml/client-auth.sgml +++ b/doc/src/sgml/client-auth.sgml @@ -1253,11 +1253,6 @@ omicron bryanh guest1 attacks. - - The md5 method cannot be used with - the feature. - - To ease transition from the md5 method to the newer SCRAM method, if md5 is specified as a method diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 6262cb7bb2..e6cea8ddfc 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -1188,58 +1188,6 @@ include_dir 'conf.d' - - - db_user_namespace (boolean) - - db_user_namespace configuration parameter - - - - -This parameter enables per-database user names. It is off by default. -This parameter can only be set in the postgresql.conf -file or on the server command line. - - - -If this is on, you should create users as username@dbname. -When username is passed by a connecting client, -@ and the database name are appended to the user -name and that database-specific user name is looked up by the -server. Note that when you create users with names containing -@ within the SQL environment, you will need to -quote the user name. 
- - - -With this parameter enabled, you can still create ordinary global -users. Simply append @ when specifying the user -name in the client, e.g., joe@. The @ -will be stripped off before the user name is looked up by the -server. - - - -db_user_namespace causes the client's and -server's user name representation to differ. -Authentication checks are always done with the server's user name -so authentication methods must be configured for the -server's user name, not the client's. Because -md5 uses the user name as salt on both the -client and server, md5 cannot be used with -db_user_namespace. - - - - - This feature is intended as a temporary measure until a - complete solution is found. At that time, this option will - be removed. - - - - diff --git a/src/backend/libpq/auth.c b/src/backend/libpq/auth.c index a98b934a8e..65d452f099 100644 --- a/src/backend/libpq/auth.c +++ b/src/backend/libpq/auth.c @@ -873,11 +873,6 @@ CheckMD5Auth(Port *port, char *shadow_pass, const char **logdetail) char *passwd; int result; - if (Db_user_namespace) - ereport(FATAL, -(errcode(ERRCODE_INVALID_AUTHORIZATION_SPECIFICATION), - errmsg("MD5 authentication is not supported when \"db_user_namespace\" is enabled"))); - /* include the salt to use for computing the response */ if (!pg_strong_random(md5Salt, 4)) { diff --git a/src/backend/libpq/hba.c b/src/backend/libpq/hba.c index f89f138f3c..5d4ddbb04d 100644 --- a/src/backend/libpq/hba.c +++ b/src/backend/libpq/hba.c @@ -1741,19 +1741,7 @@ parse_hba_line(TokenizedAuthLine *tok_line, int elevel) else if (strcmp(token->string, "reject") == 0) parsedline->auth_method = uaReject; else if (strcmp(token->string, "md5") == 0) - { - if (Db_user_namespace) - { - ereport(elevel, - (errcode(ERRCODE_CONFIG_FILE_ERROR), - errmsg("MD5 authentication is not supported when \"db_user_namespace\" is enabled"), - errcontext("line %d of configuration file \"%s\"", -line_num, file_name))); - *err_msg = "MD5 authentication is not supported when 
\"db_user_namespace\" is enabled"; - return NULL; - } parsedline->auth_method = uaMD5; - } else if (strcmp(token->string, "scram-sha-256") == 0) parsedline->auth_method = uaSCRAM; else if (strcmp(token->string, "pam") == 0) diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
Re: Initdb-time block size specification
Hi, On 2023-06-30 14:09:55 -0500, David Christensen wrote: > On Fri, Jun 30, 2023 at 1:14 PM Tomas Vondra > > I wonder how to best evaluate/benchmark this. AFAICS what we want to > > measure is the extra cost of making the values dynamic (which means the > > compiler can't just optimize them out). I'd say a "pgbench -S" seems > > like a good test. > > Yep, I tested 100 runs apiece with pgbench -S at scale factor 100, > default settings for optimized builds of the same base commit with and > without the patch; saw a reduction of TPS around 1% in that case, but > I do think we'd want to look at different workloads; I presume the > largest impact would be seen when it's all in shared_buffers and no IO > is required. I think pgbench -S indeed isn't a good workload - the overhead for it is much more in context switches and instantiating executor state etc than code that is affected by the change. And indeed. Comparing e.g. TPC-H, I see *massive* regressions. Some queries are the same, but others regress by up to 70% (although more commonly around 10-20%). That's larger than I thought, which makes me suspect that there's some bug in the new code. Interestingly, repeating the benchmark with a larger work_mem setting, the regressions are still quite present, but smaller. I suspect the planner chooses smarter plans which move bottlenecks more towards hashjoin code etc, which won't be affected by this change. IOW, you seriously need to evaluate analytics queries before this is worth looking at further. Greetings, Andres Freund
Re: pg_upgrade instructions involving "rsync --size-only" might lead to standby corruption?
On Fri, Jun 30, 2023 at 1:41 PM Bruce Momjian wrote: > I think --size-only was chosen only because it is the minimal comparison > option. I think it's worse than that. I think that the procedure relies on using the --size-only option to intentionally trick rsync into thinking that files are identical when they're not. Say we have a file like base/23246/78901 on the primary. Unless wal_log_hints=on, the standby version is very likely different, but only in ways that don't matter to WAL replay. So the procedure aims to trick rsync into hard-linking the version of that file that exists on the standby in the old cluster into the new cluster on the standby, instead of copying the slightly-different version from the master, thus making the upgrade very fast. If rsync actually checksummed the files, it would realize that they're different and copy the file from the original primary, which the person who wrote this procedure does not want. That's kind of a crazy thing for us to be documenting. I think we really ought to consider removing it from this documentation. If somebody wants to write a reliable tool for this to ship as part of PostgreSQL, well and good. But this procedure has no real sanity checks and is based on very fragile assumptions. That doesn't seem suitable for end-user use. I'm not quite clear on how Nikolay got into trouble here. I don't think I understand under exactly what conditions the procedure is reliable and under what conditions it isn't. But there is no way in heck I would ever advise anyone to use this procedure on a database they actually care about. This is a great party trick or something to show off in a lightning talk at PGCon, not something you ought to be doing with valuable data that you actually care about. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Initdb-time block size specification
On Fri, Jun 30, 2023 at 2:39 PM Andres Freund wrote: > > Hi, > > On 2023-06-30 12:35:09 -0500, David Christensen wrote: > > As discussed somewhat at PgCon, enclosed is v1 of a patch to provide > > variable block sizes; basically instead of BLCKSZ being a compile-time > > constant, a single set of binaries can support all of the block sizes > > Pg can support, using the value stored in pg_control as the basis. > > (Possible future plans would be to make this something even more > > dynamic, such as configured per tablespace, but this is out of scope; > > this just sets up the infrastructure for this.) > > I am extremely doubtful this is a good idea. For one it causes a lot of churn, > but more importantly it turns currently cheap code paths into more expensive > ones. > > Changes like > > > -#define BufHdrGetBlock(bufHdr) ((Block) (BufferBlocks + ((Size) > > (bufHdr)->buf_id) * BLCKSZ)) > > +#define BufHdrGetBlock(bufHdr) ((Block) (BufferBlocks + ((Size) > > (bufHdr)->buf_id) * CLUSTER_BLOCK_SIZE)) > > Note that CLUSTER_BLOCK_SIZE, despite looking like a macro that's constant, is > actually variable. Correct; that is mainly a notational device which would be easy enough to change (and presumably would follow along the lines of the commit Tomas pointed out above). > I am fairly certain this is going to be causing substantial performance > regressions. I think we should reject this even if we don't immediately find > them, because it's almost guaranteed to cause some. What would be considered substantial? Some overhead would be expected, but I think having an actual patch to evaluate lets us see what potential there is. Seems like this will likely be optimized as an offset stored in a register, so wouldn't expect huge changes here. (There may be other approaches I haven't thought of in terms of getting this.) > Besides this, I've not really heard any convincing justification for needing > this in the first place. 
Doing this would open up experiments in larger block sizes, so we would be able to have larger indexable tuples, say, or be able to store data types that are larger than currently supported for tuple row limits without dropping to toast (native vector data types come to mind as a candidate here). We've had 8k blocks for a long time while hardware has improved over 20+ years, and it would be interesting to see how tuning things would open up additional avenues for performance without requiring packagers to make a single choice on this regardless of use-case. (The fact that we allow compiling this at a different value suggests there is thought to be some utility in having this be something other than the default value.) I just think it's one of those things that is hard to evaluate without actually having something specific, which is why we have this patch now. Thanks, David
Should we remove db_user_namespace?
I think this is the second decennial thread [0] for removing this GUC. This topic came up at PGCon, so I thought I'd start the discussion on the lists. I'm personally not aware of anyone using this parameter. A couple of my colleagues claimed to have used it in the aughts, but they also noted that users were confused by the current implementation, and they seemed generally in favor of removing it. AFAICT the strongest reason for keeping it is that there is presently no viable replacement. Does this opinion still stand? If so, perhaps we can look into adding a viable replacement for v17. The attached patch simply removes the GUC. [0] https://postgr.es/m/CAA-aLv6wnwp6Qr5fZo%2B7K%3DVSr51qFMuLsCeYvTkSF3E5qEPvqA%40mail.gmail.com -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From 6677f4b98fd0b1bd68e07d773b04caf45cf27715 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Fri, 30 Jun 2023 12:46:08 -0700 Subject: [PATCH v1 1/1] remove db_user_namespace --- doc/src/sgml/config.sgml | 52 --- src/backend/libpq/auth.c | 5 -- src/backend/libpq/hba.c | 12 - src/backend/postmaster/postmaster.c | 19 --- src/backend/utils/misc/guc_tables.c | 9 src/backend/utils/misc/postgresql.conf.sample | 1 - src/include/libpq/pqcomm.h| 2 - 7 files changed, 100 deletions(-) diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 6262cb7bb2..e6cea8ddfc 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -1188,58 +1188,6 @@ include_dir 'conf.d' - - - db_user_namespace (boolean) - - db_user_namespace configuration parameter - - - - -This parameter enables per-database user names. It is off by default. -This parameter can only be set in the postgresql.conf -file or on the server command line. - - - -If this is on, you should create users as username@dbname. -When username is passed by a connecting client, -@ and the database name are appended to the user -name and that database-specific user name is looked up by the -server. 
Note that when you create users with names containing -@ within the SQL environment, you will need to -quote the user name. - - - -With this parameter enabled, you can still create ordinary global -users. Simply append @ when specifying the user -name in the client, e.g., joe@. The @ -will be stripped off before the user name is looked up by the -server. - - - -db_user_namespace causes the client's and -server's user name representation to differ. -Authentication checks are always done with the server's user name -so authentication methods must be configured for the -server's user name, not the client's. Because -md5 uses the user name as salt on both the -client and server, md5 cannot be used with -db_user_namespace. - - - - - This feature is intended as a temporary measure until a - complete solution is found. At that time, this option will - be removed. - - - - diff --git a/src/backend/libpq/auth.c b/src/backend/libpq/auth.c index a98b934a8e..65d452f099 100644 --- a/src/backend/libpq/auth.c +++ b/src/backend/libpq/auth.c @@ -873,11 +873,6 @@ CheckMD5Auth(Port *port, char *shadow_pass, const char **logdetail) char *passwd; int result; - if (Db_user_namespace) - ereport(FATAL, -(errcode(ERRCODE_INVALID_AUTHORIZATION_SPECIFICATION), - errmsg("MD5 authentication is not supported when \"db_user_namespace\" is enabled"))); - /* include the salt to use for computing the response */ if (!pg_strong_random(md5Salt, 4)) { diff --git a/src/backend/libpq/hba.c b/src/backend/libpq/hba.c index f89f138f3c..5d4ddbb04d 100644 --- a/src/backend/libpq/hba.c +++ b/src/backend/libpq/hba.c @@ -1741,19 +1741,7 @@ parse_hba_line(TokenizedAuthLine *tok_line, int elevel) else if (strcmp(token->string, "reject") == 0) parsedline->auth_method = uaReject; else if (strcmp(token->string, "md5") == 0) - { - if (Db_user_namespace) - { - ereport(elevel, - (errcode(ERRCODE_CONFIG_FILE_ERROR), - errmsg("MD5 authentication is not supported when \"db_user_namespace\" is enabled"), - errcontext("line 
%d of configuration file \"%s\"", -line_num, file_name))); - *err_msg = "MD5 authentication is not supported when \"db_user_namespace\" is enabled"; - return NULL; - } parsedline->auth_method = uaMD5; - } else if (strcmp(token->string, "scram-sha-256") == 0) parsedline->auth_method = uaSCRAM; else if (strcmp(token->string, "pam") == 0) diff --git a/src/backend/postmaster/postmaster.c
Re: Initdb-time block size specification
Hi, On 2023-06-30 12:35:09 -0500, David Christensen wrote: > As discussed somewhat at PgCon, enclosed is v1 of a patch to provide > variable block sizes; basically instead of BLCKSZ being a compile-time > constant, a single set of binaries can support all of the block sizes > Pg can support, using the value stored in pg_control as the basis. > (Possible future plans would be to make this something even more > dynamic, such as configured per tablespace, but this is out of scope; > this just sets up the infrastructure for this.) I am extremely doubtful this is a good idea. For one it causes a lot of churn, but more importantly it turns currently cheap code paths into more expensive ones. Changes like > -#define BufHdrGetBlock(bufHdr) ((Block) (BufferBlocks + ((Size) > (bufHdr)->buf_id) * BLCKSZ)) > +#define BufHdrGetBlock(bufHdr) ((Block) (BufferBlocks + ((Size) > (bufHdr)->buf_id) * CLUSTER_BLOCK_SIZE)) Note that CLUSTER_BLOCK_SIZE, despite looking like a macro that's constant, is actually variable. I am fairly certain this is going to be causing substantial performance regressions. I think we should reject this even if we don't immediately find them, because it's almost guaranteed to cause some. Besides this, I've not really heard any convincing justification for needing this in the first place. Greetings, Andres Freund
Re: Initdb-time block size specification
On Fri, Jun 30, 2023 at 1:14 PM Tomas Vondra wrote: > Do we really want to prefix the option with CLUSTER_? That seems to just > add a lot of noise into the patch, and I don't see much value in this > rename. I'd prefer keeping BLCKSZ and tweak just the couple places that > need "default" to use BLCKSZ_DEFAULT or something like that. > > But more importantly, I'd say we use CAPITALIZED_NAMES for compile-time > values, so after making this initdb-time parameter we should abandon > that (just like fc49e24fa69a did for segment sizes). Perhaps something > like cluster_block_size would work ... (yes, that touches all the places > too). Yes, I can see that being an equivalent change; thanks for the pointer there. Definitely the "cluster_block_size" could be an approach, though since it's just currently a #define for GetBlockSize(), maybe we just replace with the equivalent instead. I was mainly trying to make it something that was conceptually similar and easy to reason about without getting bogged down in the details locally, but can see that ALL_CAPS does have a specific meaning. Also eliminating the BLCKSZ symbol meant it was easier to catch anything which depended on that value. If we wanted to keep BLCKSZ, I'd be okay with that vs the CLUSTER_BLOCK_SIZE approach; it could help make the patch smaller at this point. > > Initial (basic) performance testing shows only minor changes with the > > pgbench -S > > benchmark, though this is obviously an area that will need considerable > > testing/verification across multiple workloads. > > > > I wonder how to best evaluate/benchmark this. AFAICS what we want to > measure is the extra cost of making the values dynamic (which means the > compiler can't just optimize them out). I'd say a "pgbench -S" seems > like a good test.
Yep, I tested 100 runs apiece with pgbench -S at scale factor 100, default settings for optimized builds of the same base commit with and without the patch; saw a TPS reduction of around 1% in that case, but I do think we'd want to look at different workloads; I presume the largest impact would be seen when it's all in shared_buffers and no IO is required. Thanks, David
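The read-only benchmark described above has roughly this shape (client count and duration are assumptions; the thread only specifies -S and scale factor 100):

```shell
# Initialize a pgbench database at scale factor 100 (~1.5 GB of data)
pgbench -i -s 100 bench
# Select-only run; repeat against each build and compare the reported TPS
pgbench -S -T 60 -c 8 -j 8 bench
```

As Andres points out later in the thread, this workload is dominated by per-transaction overhead rather than block-size-sensitive code paths, so analytics-style queries are a better stress test for this patch.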
Re: Avoid unncessary always true test (src/backend/storage/buffer/bufmgr.c)
Hi, Good catch. Indeed, eb.rel shouldn't be NULL there and the tests should be unnecessary. However, it doesn't follow from the code of these functions. From what I can see eb.rel can be NULL in both of these functions. There is the following Assert at the beginning of the ExtendBufferedRelTo() function: Assert((eb.rel != NULL) != (eb.smgr != NULL)); And ExtendBufferedRelShared() is only called from ExtendBufferedRelCommon() which can be called from ExtendBufferedRelTo() or ExtendBufferedRelBy() that also has the same Assert(). And none of these functions assigns eb.rel, so it can be NULL from the very beginning and it stays the same. And there is the following call in xlogutils.c, which is exactly the case when eb.rel is NULL: buffer = ExtendBufferedRelTo(EB_SMGR(smgr, RELPERSISTENCE_PERMANENT), forknum, NULL, EB_PERFORMING_RECOVERY | EB_SKIP_EXTENSION_LOCK, blkno + 1, mode); So as for me, calling LockRelationForExtension() and UnlockRelationForExtension() without testing eb.rel first looks more like a bug here. However, they are never actually called with eb.rel=NULL because of the EB_* flags, so there is no bug here. I believe we should add Asserts checking that when eb.rel is NULL, the flags are such that we won't use eb.rel. And yes, we can remove the unnecessary checks where the flags value guarantees that eb.rel is not NULL. And another minor observation. It seems to me that we don't need the "could have been closed while waiting for lock" comment in ExtendBufferedRelShared(), because I believe the comment above already explains why eb.smgr is updated: * Note that another backend might have extended the relation by the time * we get the lock. I attached the new version of the patch as I see it. 
Best regards, Karina Litskevich Postgres Professional: http://postgrespro.com/ From 7425d5b1b9d30f0da75aabd94e862906f9635a60 Mon Sep 17 00:00:00 2001 From: Karina Litskevich Date: Fri, 30 Jun 2023 21:26:18 +0300 Subject: [PATCH v4] Remove always true checks Also add asserts that help understanding these checks are always true --- src/backend/storage/buffer/bufmgr.c | 10 ++ 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index 3c59bbd04e..7b1e2e1284 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -851,6 +851,8 @@ ExtendBufferedRelBy(ExtendBufferedWhat eb, { Assert((eb.rel != NULL) != (eb.smgr != NULL)); Assert(eb.smgr == NULL || eb.relpersistence != 0); + Assert(eb.rel != NULL || (flags & EB_SKIP_EXTENSION_LOCK) && + !(flags & EB_CREATE_FORK_IF_NEEDED)); Assert(extend_by > 0); if (eb.smgr == NULL) @@ -887,6 +889,8 @@ ExtendBufferedRelTo(ExtendBufferedWhat eb, Assert((eb.rel != NULL) != (eb.smgr != NULL)); Assert(eb.smgr == NULL || eb.relpersistence != 0); + Assert(eb.rel != NULL || (flags & EB_SKIP_EXTENSION_LOCK) && + !(flags & EB_CREATE_FORK_IF_NEEDED)); Assert(extend_to != InvalidBlockNumber && extend_to > 0); if (eb.smgr == NULL) @@ -908,8 +912,7 @@ ExtendBufferedRelTo(ExtendBufferedWhat eb, LockRelationForExtension(eb.rel, ExclusiveLock); /* could have been closed while waiting for lock */ - if (eb.rel) - eb.smgr = RelationGetSmgr(eb.rel); + eb.smgr = RelationGetSmgr(eb.rel); /* recheck, fork might have been created concurrently */ if (!smgrexists(eb.smgr, fork)) @@ -1875,8 +1878,7 @@ ExtendBufferedRelShared(ExtendBufferedWhat eb, if (!(flags & EB_SKIP_EXTENSION_LOCK)) { LockRelationForExtension(eb.rel, ExclusiveLock); - if (eb.rel) - eb.smgr = RelationGetSmgr(eb.rel); + eb.smgr = RelationGetSmgr(eb.rel); } /* -- 2.25.1
Re: pgsql: Fix search_path to a safe value during maintenance operations.
On Thu, Jun 29, 2023 at 10:09:21PM -0700, Nathan Bossart wrote: > On Thu, Jun 29, 2023 at 08:53:56PM -0400, Tom Lane wrote: >> I'm leaning to Robert's thought that we need to revert this for now, >> and think harder about how to make it work cleanly and safely. > > Since it sounds like this is headed towards a revert, here's a patch for > removing MAINTAIN and pg_maintain. I will revert this next week unless opinions change before then. I'm currently planning to revert on both master and REL_16_STABLE, but another option would be to keep it on master while we sort out the remaining issues. Thoughts? -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Move un-parenthesized syntax docs to "compatibility" for few SQL commands
On Mon, May 15, 2023 at 12:36:51PM -0700, Nathan Bossart wrote: > On Tue, Apr 25, 2023 at 03:18:44PM +0900, Michael Paquier wrote: >> FWIW, I agree to hold on this stuff for v17~ once it opens for >> business. There is no urgency here. > > There's still some time before we'll be able to commit any of these, but > here is an attempt at addressing all the feedback thus far. I think these patches are in decent shape, so I'd like to commit them soon, but I will wait at least a couple more weeks in case anyone has additional feedback. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Initdb-time block size specification
On 6/30/23 19:35, David Christensen wrote: > Hi, > > As discussed somewhat at PgCon, enclosed is v1 of a patch to provide > variable block sizes; basically instead of BLCKSZ being a compile-time > constant, a single set of binaries can support all of the block sizes > Pg can support, using the value stored in pg_control as the basis. > (Possible future plans would be to make this something even more > dynamic, such as configured per tablespace, but this is out of scope; > this just sets up the infrastructure for this.) > > Whereas we had traditionally used BLCKSZ to indicate the compile-time selected > block size, this commit adjusted things so the cluster block size can be > selected at initdb time. > > In order to code for this, we introduce a few new defines: > > - CLUSTER_BLOCK_SIZE is the blocksize for this cluster itself. This is not > valid until BlockSizeInit() has been called in the given backend, which we do > as > early as possible by parsing the ControlFile and using the blcksz field. > > - MIN_BLOCK_SIZE and MAX_BLOCK_SIZE are the limits for the selectable block > size. It is required that CLUSTER_BLOCK_SIZE is a power of 2 between these two > constants. > > - DEFAULT_BLOCK_SIZE is the moral equivalent of BLCKSZ; it is the built-in > default value. This is used in a few places that just needed a buffer of an > arbitrary size, but the dynamic value CLUSTER_BLOCK_SIZE should almost always > be > used instead. > > - CLUSTER_RELSEG_SIZE is used instead of RELSEG_SIZE, since internally we are > storing the segment size in terms of number of blocks. RELSEG_SIZE is still > kept, but is used in terms of the number of blocks of DEFAULT_BLOCK_SIZE; > CLUSTER_RELSEG_SIZE scales appropriately (and is the only thing used > internally) > to keep the same target total segment size regardless of block size. > Do we really want to prefix the option with CLUSTER_? That seems to just add a lot of noise into the patch, and I don't see much value in this rename. 
I'd prefer keeping BLCKSZ and tweaking just the couple of places that need "default" to use BLCKSZ_DEFAULT or something like that. But more importantly, I'd say we use CAPITALIZED_NAMES for compile-time values, so after making this an initdb-time parameter we should abandon that (just like fc49e24fa69a did for segment sizes). Perhaps something like cluster_block_size would work ... (yes, that touches all the places too). > This patch uses a precalculated table to store the block size itself, as well > as > additional derived values that have traditionally been compile-time > constants (example: MaxHeapTuplesPerPage). The traditional macro names are > kept > so code that doesn't care about it should not need to change, however the > definition of these has changed (see the CalcXXX() routines in blocksize.h for > details). > > A new function, BlockSizeInit() populates the appropriate values based on the > target block size. This should be called as early as possible in any code > that > utilizes block sizes. This patch adds this in the appropriate place on the > handful of src/bin/ programs that used BLCKSZ, so this caveat mainly impacts > new > code. > > Code which had previously used BLCKSZ should likely be able to get away with > changing these instances to CLUSTER_BLOCK_SIZE, unless you're using a > structure > allocated on the stack. In these cases, the compiler will complain about a > dynamically sized structure. The solution is to utilize an expression with > MAX_BLOCK_SIZE > instead of BLCKSZ, ensuring enough stack space is allocated for the maximum > size. This also does require using CLUSTER_BLOCK_SIZE or an expression based > on > it when actually using this structure, so in practice more stack space may be > allocated than used in principle; as long as there is plenty of stack this > should have no specific impact on code.
> > Initial (basic) performance testing shows only minor changes with the pgbench > -S > benchmark, though this is obviously an area that will need considerable > testing/verification across multiple workloads. > I wonder how to best evaluate/benchmark this. AFAICS what we want to measure is the extra cost of making the values dynamic (which means the compiler can't just optimize them out). I'd say a "pgbench -S" seems like a good test. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
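[Editor's note] To make the precalculated-table idea concrete, here is a minimal standalone sketch. The function and variable names, the size limits, and the derived-value formula are all invented for illustration and are not the patch's actual definitions:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MIN_BLOCK_SIZE  1024u
#define MAX_BLOCK_SIZE 32768u

static uint32_t cluster_block_size;   /* would come from pg_control's blcksz */
static uint32_t max_tuples_per_page;  /* a MaxHeapTuplesPerPage-style derived value */

static bool
is_power_of_2(uint32_t v)
{
    return v != 0 && (v & (v - 1)) == 0;
}

/* Called once, as early as possible, before any code consults block sizes. */
static void
BlockSizeInitSketch(uint32_t blcksz)
{
    assert(is_power_of_2(blcksz));
    assert(blcksz >= MIN_BLOCK_SIZE && blcksz <= MAX_BLOCK_SIZE);

    cluster_block_size = blcksz;

    /*
     * Derived constants become precomputed globals instead of compile-time
     * macros; the divisor here is illustrative only.
     */
    max_tuples_per_page = blcksz / 64;
}
```

Code that previously read a compile-time macro would instead read the cached global, which is what makes the "can the compiler still optimize this out" benchmarking question interesting.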
Re: sslinfo extension - add notbefore and notafter timestamps
> This needs to adjust the tests in src/test/ssl which now fail due to SELECT > * > returning a row which doesn't match what the test was coded for. Thank you so much for pointing that out. I have adjusted the extra ssl test to account for the extra columns returned. It should not fail now. > The new patchset isn't updating contrib/sslinfo/meson with the 1.3 update so > it > fails to build with Meson. Thanks again for pointing that out; I have adjusted the meson build file to include the 1.3 update. Please see attached patches for the fixes. Thank you so much! Cary Huang - HighGo Software Inc. (Canada) cary.hu...@highgo.ca www.highgo.ca v3-0001-sslinfo-add-notbefore-and-notafter-timestamps.patch Description: Binary data v3-0002-pg-stat-ssl-add-notbefore-and-notafter-timestamps.patch Description: Binary data
Re: Adding SHOW CREATE TABLE
On Wed, Jun 21, 2023 at 8:52 PM Kirk Wolak wrote: > On Mon, Jun 5, 2023 at 7:43 AM Jelte Fennema wrote: > >> On Thu, 1 Jun 2023 at 18:57, Kirk Wolak wrote: >> > Can this get turned into a Patch? Were you offering this code up for >> others (me?) to pull, and work into a patch? >> > [If I do the patch, I am not sure it gives you the value of reducing >> what CITUS has to maintain. But it dawns on >> > me that you might be pushing a much bigger patch... But I would take >> that, as I think there is other value in there] >> > > Yeah, the Citus code only handles things that Citus supports in >> distributed tables. Which is quite a lot, but indeed not everything >> yet. Temporary and inherited tables are not supported in this code >> afaik. Possibly more. See the commented out >> EnsureRelationKindSupported for what should be supported (normal >> tables and partitioned tables afaik). >> > > Okay, apologies for the long delay on this. I have the code Jelte submitted working. And I have (almost) figured out how to add the function so it shows up in the pg_catalog... (I edited files I should not have, I need to know the proper process... Anyone...) Not sure if it is customary to attach the code when asking about stuff. For the most part, it was what Jelte Gave us with a pg_get_tabledef() wrapper to call... 
Here is the output it produces for *select pg_get_tabledef('pg_class'::regclass); * (Feedback Welcome) CREATE TABLE pg_class (oid oid NOT NULL, relname name NOT NULL COLLATE "C", relnamespace oid NOT NULL, reltype oid NOT NULL, reloftype oid NOT NULL, relowner oid NOT NULL, relam oid NOT NULL, relfilenode oid NOT NULL, reltablespace oid NOT NULL, relpages integer NOT NULL, reltuples real NOT NULL, relallvisible integer NOT NULL, reltoastrelid oid NOT NULL, relhasindex boolean NOT NULL, relisshared boolean NOT NULL, relpersistence "char" NOT NULL, relkind "char" NOT NULL, relnatts smallint NOT NULL, relchecks smallint NOT NULL, relhasrules boolean NOT NULL, relhastriggers boolean NOT NULL, relhassubclass boolean NOT NULL, relrowsecurity boolean NOT NULL, relforcerowsecurity boolean NOT NULL, relispopulated boolean NOT NULL, relreplident "char" NOT NULL, relispartition boolean NOT NULL, relrewrite oid NOT NULL, relfrozenxid xid NOT NULL, relminmxid xid NOT NULL, relacl aclitem[], reloptions text[] COLLATE "C", relpartbound pg_node_tree COLLATE "C") USING heap == My Comments/Questions: 1) I would prefer Legible output, like below 2) I would prefer to leave off COLLATE "C" IFF that is the DB Default 3) The USING heap... I want to pull UNLESS the value is NOT the default (That's a theme in my comments) 4) I *THINK* including the schema would be nice? 5) This version will work with a TEMP table, but NOT EMIT "TEMPORARY"... Thoughts? Is emitting [pg_temp.] good enough? 6) This version enumerates sequence values (Drop always, or Drop if they are the default values?) 7) Should I enable the pg_get_seqdef() code 8) It does NOT handle Inheritance (Yet... Is this important? Is it okay to just give the table structure for this table?) 9) I have not tested against Partitions, etc... I SIMPLY want initial feedback on Formatting -- Legible: CREATE TABLE pg_class (oid oid NOT NULL, relname name NOT NULL COLLATE "C", relnamespace oid NOT NULL, reltype oid NOT NULL, ... 
reloptions text[] COLLATE "C", relpartbound pg_node_tree COLLATE "C" ) -- Too verbose with "*DEFAULT*" Sequence Values: CREATE TABLE t1 (id bigint GENERATED BY DEFAULT AS IDENTITY *(INCREMENT BY 1 MINVALUE 1 MAXVALUE 9223372036854775807 START WITH 1 CACHE 1 NO CYCLE)* NOT NULL, f1 text ) WITH (autovacuum_vacuum_cost_delay='0', fillfactor='80', autovacuum_vacuum_insert_threshold='-1', autovacuum_analyze_threshold='5', autovacuum_vacuum_threshold='5', autovacuum_vacuum_scale_factor='1.5') Thanks, Kirk... PS: After I get feedback on Formatting the output, etc. I will gladly generate a new .patch file and send it along. Otherwise Jelte gets 100% of the credit, and I don't want to look like I am changing that.
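[Editor's note] On point 1 (legible output): the flat column list can be broken into one column per line with a parenthesis-aware split, since commas may appear inside type modifiers such as numeric(10,2). A toy sketch of that idea, not the pg_get_tabledef code itself:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Split a comma-separated column list into one entry per slot, ignoring
 * commas nested inside parentheses (e.g. numeric(10,2)). Returns the
 * number of entries written.
 */
static int
split_columns(const char *cols, char out[][128], int maxout)
{
    int     depth = 0, n = 0;
    size_t  len = 0;

    for (const char *p = cols; *p; p++)
    {
        if (*p == '(')
            depth++;
        else if (*p == ')')
            depth--;

        if (*p == ',' && depth == 0)
        {
            out[n][len] = '\0';
            if (++n >= maxout)
                return n;       /* out of room: stop early */
            len = 0;
            if (p[1] == ' ')
                p++;            /* swallow the space after the comma */
            continue;
        }
        if (len < sizeof(out[0]) - 1)
            out[n][len++] = *p;
    }
    out[n][len] = '\0';
    return n + 1;
}
```

Emitting each entry indented on its own line would produce the "Legible" layout proposed above.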
Re: pg_upgrade instructions involving "rsync --size-only" might lead to standby corruption?
On Thu, Jun 29, 2023 at 02:38:58PM -0400, Robert Haas wrote: > On Thu, Jun 29, 2023 at 1:50 PM Nikolay Samokhvalov wrote: > > Does this make sense or I'm missing something and the current docs describe > > a reliable process? (As I said, we have deviated from the process, to > > involve logical replication, so I'm not 100% sure I'm right suspecting the > > original procedure in having standby corruption risks.) > > I'm very suspicious about this section of the documentation. It > doesn't explain why --size-only is used or why --no-inc-recursive is > used. I think --size-only was chosen only because it is the minimal comparison option. > > > 9. Prepare for standby server upgrades > > > If you are upgrading standby servers using methods outlined in section > > > Step 11, verify that the old standby servers are caught up by running > > > pg_controldata against the old primary and standby clusters. Verify that > > > the “Latest checkpoint location” values match in all clusters. (There > > > will be a mismatch if old standby servers were shut down before the old > > > primary or if the old standby servers are still running.) Also, make sure > > > wal_level is not set to minimal in the postgresql.conf file on the new > > > primary cluster. > > > > – admitting that there might be mismatch. But if there is mismatch, rsync > > --size-only is not going to help synchronize properly, right? > > I think the idea is that you shouldn't use the procedure in this case. > But honestly I don't think it's probably a good idea to use this > procedure at all. It's not clear enough under what circumstances, if > any, it's safe to use, and there's not really any way to know if > you've done it correctly. You couldn't pay me enough to recommend this > procedure to anyone. I think it would be good to revisit all the steps outlined in that procedure and check which ones are still valid or need adjusting. 
It is very possible the original steps have bugs or that new Postgres features added since the steps were created don't work with these steps. I think we need to bring Stephen Frost into the discussion, so I have CC'ed him. Frankly, I didn't think the documented procedure would work either, but people say it does, so it is in the docs. I do think it is overdue for a re-analysis. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Only you can decide what is important to you.
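[Editor's note] The core hazard with --size-only is easy to state: any block that changes without changing the file size is invisible to it. A toy model of the two comparison policies (this is an illustration, not rsync itself):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* What --size-only effectively checks: nothing but the lengths. */
static bool
size_only_unchanged(size_t alen, size_t blen)
{
    return alen == blen;
}

/* What a content comparison (e.g. rsync's default delta check) sees. */
static bool
content_unchanged(const char *a, size_t alen, const char *b, size_t blen)
{
    return alen == blen && memcmp(a, b, alen) == 0;
}
```

A divergent data page or WAL segment on the standby has exactly the same size as the good copy, so a size-only pass reports it as already synchronized; that is the corruption scenario being worried about here.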
Re: ProcessStartupPacket(): database_name and user_name truncation
Hi, On 6/30/23 5:54 PM, Tom Lane wrote: Nathan Bossart writes: After taking another look at this, I wonder if it'd be better to fail as soon as we see the database or user name is too long instead of lugging them around when authentication is destined to fail. If we're agreed that we aren't going to truncate these identifiers, that seems like a reasonable way to handle it. Yeah, agreed; thanks Nathan for the idea. I'll work on a new patch version proposal. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
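[Editor's note] The fail-fast policy under discussion amounts to a simple length check against NAMEDATALEN before any authentication work. A minimal sketch follows; the function name is invented, and the real code would ereport(FATAL) rather than return false:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NAMEDATALEN 64  /* as in PostgreSQL: names must fit in 63 bytes + NUL */

/* Reject over-length startup-packet names outright instead of truncating. */
static bool
startup_name_fits(const char *name)
{
    return strlen(name) < NAMEDATALEN;
}
```

Failing here, right after parsing the startup packet, avoids carrying a name around that can only make authentication fail later.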
Re: SPI isolation changes
Seino Yuki writes: > Of course, executing SET TRANSACTION ISOLATION LEVEL with SPI_execute > will result in error. > --- > SPI_execute("SET TRANSACTION ISOLATION LEVEL SERIALIZABLE", false, 0); > (Log Output) > ERROR: SET TRANSACTION ISOLATION LEVEL must be called before any query > CONTEXT: SQL statement "SET TRANSACTION ISOLATION LEVEL SERIALIZABLE" Even if you just did SPI_commit? That *should* fail if you just do it right off the bat in a SPI-using procedure, because you're already within the transaction that called the procedure. But I think it will work if you do SPI_commit followed by this SPI_execute. regards, tom lane
Re: SPI isolation changes
On 2023-07-01 00:06, Tom Lane wrote: Seino Yuki writes: I also thought that using SPI_start_transaction would be more readable than using SPI_commit/SPI_rollback to implicitly start a transaction. What do you think? I think you're trying to get us to undo commit 2e517818f, which is not going to happen. See the threads that led up to that: Discussion: https://postgr.es/m/3375ffd8-d71c-2565-e348-a597d6e73...@enterprisedb.com Discussion: https://postgr.es/m/17416-ed8fe5d7213d6...@postgresql.org It looks to me like you can just change the transaction property settings immediately after SPI_start_transaction if you want to. Compare this bit in SnapBuildExportSnapshot: StartTransactionCommand(); /* There doesn't seem to a nice API to set these */ XactIsoLevel = XACT_REPEATABLE_READ; XactReadOnly = true; Also look at the implementation of SPI_commit_and_chain, particularly RestoreTransactionCharacteristics. regards, tom lane Thanks for sharing the past threads. I now understand how SPI_start_transaction became a no-op. I also understand how to set the transaction properties. However, it was a little disappointing that the transaction properties cannot be changed by SPI commands alone. Of course, executing SET TRANSACTION ISOLATION LEVEL with SPI_execute will result in an error. --- SPI_execute("SET TRANSACTION ISOLATION LEVEL SERIALIZABLE", false, 0); (Log Output) ERROR: SET TRANSACTION ISOLATION LEVEL must be called before any query CONTEXT: SQL statement "SET TRANSACTION ISOLATION LEVEL SERIALIZABLE" --- Thanks for answering.
Re: Avoid unused value (src/fe_utils/print.c)
On Fri, Jun 30, 2023 at 13:00, Alexander Lakhin wrote: > Hello Karina, > > 30.06.2023 17:25, Karina Litskevich wrote: > > Hi, > > > > Alexander wrote: > > > >> It also aligns the code with print_unaligned_vertical(), but I can't > see why > >> need_recordsep = true; > >> is a no-op here (scan-build dislikes only need_recordsep = false;). > >> I suspect that removing that line will change the behaviour in cases > when > >> need_recordsep = false after the loop "print cells" and the loop > >> "for (footers)" is executed. > > As I understand cont->cells is supposed to have all cont->ncolumns * > cont->nrows > > entries filled so the loop "print cells" always assigns need_recordsep = > true, > > except when there are no cells at all or cancel_pressed == true. > > If cancel_pressed == true then footers are not printed. So to have > > need_recordsep == false before the loop "for (footers)" the table should be > empty, > > and need_recordsep should be false before the loop "print cells". It can > only > > be false there when cont->opt->start_table == true and opt_tuples_only > == true > > so that headers are not printed. But when opt_tuples_only == true > footers are > > not printed either. > > > > So technically removing "need_recordsep = true;" won't change the > outcome. But > > it's not obvious, so I also have doubts about removing this line. If > someday > > print options are changed, for example to support printing footers and > not > > printing headers, or anything else changes in this function, the output > might > > be unexpected with this line removed. > > Hi Alexander, > I think that the question that should be answered before moving forward > here is: what this discussion is expected to result in? > I hope to make the function more readable and maintainable.
> If the goal is to avoid an unused value to make Coverity/clang's scan-build > a little happier, then maybe just don't touch the other line, that is not > recognized as dead (at least by scan-build; > I wonder what Coverity says > about that line). > if (cont->opt->stop_table) 477 { 478 printTableFooter *footers = footers_with_default(cont); 479 480 if (!opt_tuples_only && footers != NULL && !cancel_pressed) 481 { 482 printTableFooter *f; 483 484 for (f = footers; f; f = f->next) 485 { 486 if (need_recordsep) 487 { 488 print_separator(cont->opt->recordSep, fout); CID 1512766 (#1 of 1): Unused value (UNUSED_VALUE) assigned_value: Assigning value false to need_recordsep here, but that stored value is overwritten before it can be used. 489 need_recordsep = false; 490 } 491 fputs(f->data, fout); value_overwrite: Overwriting previous write to need_recordsep with value true. 492 need_recordsep = true; 493 } 494 } 495 > Otherwise, if the goal is to review the code and make it cleaner, then > why not get rid of "if (need_recordsep)" there? > I don't agree with removing this line, as it is essential for printing the separators in the footers. best regards, Ranier Vilela
Re: Avoid unused value (src/fe_utils/print.c)
On Fri, Jun 30, 2023 at 11:26, Karina Litskevich < litskevichkar...@gmail.com> wrote: > Hi, > > Alexander wrote: > > > It also aligns the code with print_unaligned_vertical(), but I can't see > why > > need_recordsep = true; > > is a no-op here (scan-build dislikes only need_recordsep = false;). > > I suspect that removing that line will change the behaviour in cases when > > need_recordsep = false after the loop "print cells" and the loop > > "for (footers)" is executed. > > Hi Karina, > As I understand cont->cells is supposed to have all cont->ncolumns * > cont->nrows > entries filled so the loop "print cells" always assigns need_recordsep = > true, > except when there are no cells at all or cancel_pressed == true. > If cancel_pressed == true then footers are not printed. So to have > need_recordsep == false before the loop "for (footers)" the table should be > empty, > and need_recordsep should be false before the loop "print cells". It can > only > be false there when cont->opt->start_table == true and opt_tuples_only == > true > so that headers are not printed. But when opt_tuples_only == true footers > are > not printed either. > > So technically removing "need_recordsep = true;" won't change the outcome. Thanks for the confirmation. > But > it's not obvious, so I also have doubts about removing this line. If > someday > print options are changed, for example to support printing footers and not > printing headers, or anything else changes in this function, the output > might > be unexpected with this line removed. That part I didn't understand. How are we going to make this function less readable by removing the complicating part? best regards, Ranier Vilela
Re: Avoid unused value (src/fe_utils/print.c)
Hello Karina, 30.06.2023 17:25, Karina Litskevich wrote: Hi, Alexander wrote: It also aligns the code with print_unaligned_vertical(), but I can't see why need_recordsep = true; is a no-op here (scan-build dislikes only need_recordsep = false;). I suspect that removing that line will change the behaviour in cases when need_recordsep = false after the loop "print cells" and the loop "for (footers)" is executed. As I understand cont->cells is supoused to have all cont->ncolumns * cont->nrows entries filled so the loop "print cells" always assigns need_recordsep = true, except when there are no cells at all or cancel_pressed == true. If cancel_pressed == true then footers are not printed. So to have need_recordsep == false before the loop "for (footers)" table should be empty, and need_recordsep should be false before the loop "print cells". It can only be false there when cont->opt->start_table == true and opt_tuples_only == true so that headers are not printed. But when opt_tuples_only == true footers are not printed either. So technically removing "need_recordsep = true;" won't change the outcome. But it's not obvious, so I also have doubts about removing this line. If someday print options are changed, for example to support printing footers and not printing headers, or anything else changes in this function, the output might be unexpected with this line removed. I think that the question that should be answered before moving forward here is: what this discussion is expected to result in? If the goal is to avoid an unused value to make Coverity/clang`s scan-build a little happier, then maybe just don't touch other line, that is not recognized as dead (at least by scan-build; I wonder what Coverity says about that line). Otherwise, if the goal is to do review the code and make it cleaner, then why not get rid of "if (need_recordsep)" there? Best regards, Alexander
Re: ProcessStartupPacket(): database_name and user_name truncation
Nathan Bossart writes: > After taking another look at this, I wonder if it'd be better to fail as > soon as we see the database or user name is too long instead of lugging > them around when authentication is destined to fail. If we're agreed that we aren't going to truncate these identifiers, that seems like a reasonable way to handle it. regards, tom lane
Re: ProcessStartupPacket(): database_name and user_name truncation
After taking another look at this, I wonder if it'd be better to fail as soon as we see the database or user name is too long instead of lugging them around when authentication is destined to fail. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: On /*----- comments
Heikki Linnakangas writes: > Except for the translator comments, I think those others forgot about > the end-guards by accident. But they look just as nice to me. It's > probably not worth the code churn to remove them from existing comments, > but how about we stop requiring them in new code, and update the > pgindent README accordingly? Seems reasonable; the trailing dashes eat a line without adding much. Should we also provide specific guidance about how many leading dashes to use for this? I vaguely recall that pgindent might only need one, but I think using somewhere around 5 to 10 looks better. regards, tom lane
On /*----- comments
I spotted this comment in walsender.c: /*--- * When reading from a historic timeline, and there is a timeline switch * [.. long comment omitted ...] * portion of the old segment is copied to the new file. --- */ Note the bogus dashes at the end. This was introduced in commit 0dc8ead463, which moved and reformatted the comment. It's supposed to end the pgindent-guard at the beginning of the comment like this: /*--- * When reading from a historic timeline, and there is a timeline switch * [.. long comment omitted ...] * portion of the old segment is copied to the new file. *--- */ But that got me wondering, do we care about those end-guards? pgindent doesn't need them. We already have a bunch of comments that don't have the end-guard, for example: analyze.c: /*-- translator: %s is a SQL row locking clause such as FOR UPDATE */ gistproc.c: /* * The goal is to form a left and right interval, so that every entry * interval is contained by either left or right interval (or both). * * For example, with the intervals (0,1), (1,3), (2,3), (2,4): * * 0 1 2 3 4 * +-+ * +---+ * +-+ * +---+ * * The left and right intervals are of the form (0,a) and (b,4). * We first consider splits where b is the lower bound of an entry. * We iterate through all entries, and for each b, calculate the * smallest possible a. Then we consider splits where a is the * upper bound of an entry, and for each a, calculate the greatest * possible b. * * In the above example, the first loop would consider splits: * b=0: (0,1)-(0,4) * b=1: (0,1)-(1,4) * b=2: (0,3)-(2,4) * * And the second loop: * a=1: (0,1)-(1,4) * a=3: (0,3)-(2,4) * a=4: (0,4)-(2,4) */ predicate.c: /*-- * The SLRU is no longer needed. Truncate to head before we set head * invalid. * * XXX: It's possible that the SLRU is not needed again until XID * wrap-around has happened, so that the segment containing headPage * that we leave behind will appear to be new again. 
In that case it * won't be removed until XID horizon advances enough to make it * current again. * * XXX: This should happen in vac_truncate_clog(), not in checkpoints. * Consider this scenario, starting from a system with no in-progress * transactions and VACUUM FREEZE having maximized oldestXact: * - Start a SERIALIZABLE transaction. * - Start, finish, and summarize a SERIALIZABLE transaction, creating * one SLRU page. * - Consume XIDs to reach xidStopLimit. * - Finish all transactions. Due to the long-running SERIALIZABLE * transaction, earlier checkpoints did not touch headPage. The * next checkpoint will change it, but that checkpoint happens after * the end of the scenario. * - VACUUM to advance XID limits. * - Consume ~2M XIDs, crossing the former xidWrapLimit. * - Start, finish, and summarize a SERIALIZABLE transaction. * SerialAdd() declines to create the targetPage, because headPage * is not regarded as in the past relative to that targetPage. The * transaction instigating the summarize fails in * SimpleLruReadPage(). */ indexcmds.c: /*- * Now we have all the indexes we want to process in indexIds. * * The phases now are: * * 1. create new indexes in the catalog * 2. build new indexes * 3. let new indexes catch up with tuples inserted in the meantime * 4. swap index names * 5. mark old indexes as dead * 6. drop old indexes * * We process each phase for all indexes before moving to the next phase, * for efficiency. */ Except for the translator comments, I think those others forgot about the end-guards by accident. But they look just as nice to me. It's probably not worth the code churn to remove them from existing comments, but how about we stop requiring them in new code, and update the pgindent README accordingly?
Re: SPI isolation changes
Seino Yuki writes: > I also thought that using SPI_start_transaction would be more readable > than using SPI_commit/SPI_rollback to implicitly start a transaction. > What do you think? I think you're trying to get us to undo commit 2e517818f, which is not going to happen. See the threads that led up to that: Discussion: https://postgr.es/m/3375ffd8-d71c-2565-e348-a597d6e73...@enterprisedb.com Discussion: https://postgr.es/m/17416-ed8fe5d7213d6...@postgresql.org It looks to me like you can just change the transaction property settings immediately after SPI_start_transaction if you want to. Compare this bit in SnapBuildExportSnapshot: StartTransactionCommand(); /* There doesn't seem to a nice API to set these */ XactIsoLevel = XACT_REPEATABLE_READ; XactReadOnly = true; Also look at the implementation of SPI_commit_and_chain, particularly RestoreTransactionCharacteristics. regards, tom lane
Re: Assert !bms_overlap(joinrel->relids, required_outer)
Richard Guo writes: > On Fri, Jun 30, 2023 at 12:20 AM Tom Lane wrote: >> Yeah, while looking at this I was wondering why try_mergejoin_path and >> try_hashjoin_path don't do the same "Paths are parameterized by >> top-level parents, so run parameterization tests on the parent relids" >> dance that try_nestloop_path does. This omission is consistent with >> that, but it's not obvious why it'd be okay to skip it for >> non-nestloop joins. I guess we'd have noticed by now if it wasn't >> okay, but ... > I think it just makes these two assertions meaningless to skip it for > non-nestloop joins if the input paths are for otherrels, because paths > would never be parameterized by the member relations. So these two > assertions would always be true for otherrel paths. I think this is why > we have not noticed any problem by now. After studying this some more, I think that maybe it's the "run parameterization tests on the parent relids" bit that is misguided. I believe the way it's really working is that all paths arriving here are parameterized by top parents, because that's the only thing we generate to start with. A path can only become parameterized by an otherrel when we apply reparameterize_path_by_child to it. But the only place that happens is in try_nestloop_path itself (or try_partial_nestloop_path), and then we immediately wrap it in a nestloop join node, which becomes a child of an append that's forming a partitionwise join. The partitionwise join as a whole won't be parameterized by any child rels. So I think that a path that's parameterized by a child rel can't exist "in the wild" in a way that would allow it to get fed to one of the try_xxx_path functions. This explains why the seeming oversights in the merge and hash cases aren't causing a problem. If this theory is correct, we could simplify try_nestloop_path a bit. I doubt the code savings would matter, but maybe it's worth changing for clarity's sake. regards, tom lane
Re: SPI isolation changes
Thanks for the reply! On 2023-06-30 23:26, Heikki Linnakangas wrote: On 30/06/2023 17:15, Seino Yuki wrote: Hi, When I read the documents and coding of SPI, [1] I found that the following the SPI_start_transaction does not support transaciton_mode(ISOLATION LEVEL, READ WRITE/READ ONLY) like BEGIN command. [2] Is there a reason for this? Per the documentation for SPI_start_transaction that you linked to: "SPI_start_transaction does nothing, and exists only for code compatibility with earlier PostgreSQL releases." I haven't tested it, but perhaps you can do "SET TRANSACTION ISOLATION LEVEL" in the new transaction after calling SPI_commit() though. Or "SET DEFAULT TRANSACTION ISOLATION LEVEL" before committing. I understand that too. However, I thought SPI_start_transaction was the function equivalent to BEGIN (or START TRANSACTION). Therefore, I did not understand why the same option could not be specified. I also thought that using SPI_start_transaction would be more readable than using SPI_commit/SPI_rollback to implicitly start a transaction. What do you think? Regards, -- Seino Yuki NTT DATA CORPORATION
Re: SPI isolation changes
On 30/06/2023 17:15, Seino Yuki wrote: Hi, When I read the documents and coding of SPI, [1] I found that the following the SPI_start_transaction does not support transaciton_mode(ISOLATION LEVEL, READ WRITE/READ ONLY) like BEGIN command. [2] Is there a reason for this? Per the documentation for SPI_start_transaction that you linked to: "SPI_start_transaction does nothing, and exists only for code compatibility with earlier PostgreSQL releases." I haven't tested it, but perhaps you can do "SET TRANSACTION ISOLATION LEVEL" in the new transaction after calling SPI_commit() though. Or "SET DEFAULT TRANSACTION ISOLATION LEVEL" before committing. -- Heikki Linnakangas Neon (https://neon.tech)
Re: Avoid unused value (src/fe_utils/print.c)
Hi, Alexander wrote: > It also aligns the code with print_unaligned_vertical(), but I can't see why > need_recordsep = true; > is a no-op here (scan-build dislikes only need_recordsep = false;). > I suspect that removing that line will change the behaviour in cases when > need_recordsep = false after the loop "print cells" and the loop > "for (footers)" is executed. As I understand it, cont->cells is supposed to have all cont->ncolumns * cont->nrows entries filled, so the loop "print cells" always assigns need_recordsep = true, except when there are no cells at all or cancel_pressed == true. If cancel_pressed == true then footers are not printed. So to have need_recordsep == false before the loop "for (footers)", the table should be empty, and need_recordsep should be false before the loop "print cells". It can only be false there when cont->opt->start_table == true and opt_tuples_only == true, so that headers are not printed. But when opt_tuples_only == true, footers are not printed either. So technically removing "need_recordsep = true;" won't change the outcome. But it's not obvious, so I also have doubts about removing this line. If someday the print options are changed, for example to support printing footers without printing headers, or anything else changes in this function, the output might be unexpected with this line removed. Best regards, Karina Litskevich Postgres Professional: http://postgrespro.com/
SPI isolation changes
Hi, When I read the documentation and code of SPI, [1] I found that SPI_start_transaction does not support transaction_mode (ISOLATION LEVEL, READ WRITE/READ ONLY) like the BEGIN command. [2] Is there a reason for this? I would like to be able to set transaction_mode in SPI_start_transaction. What do you think? [1] https://www.postgresql.org/docs/devel/spi-spi-start-transaction.html [2] https://www.postgresql.org/docs/devel/sql-begin.html Regards, -- Seino Yuki NTT DATA CORPORATION
RE: [PoC] pg_upgrade: allow to upgrade publisher node
Dear Amit, Thank you for giving comments!

> > > Sorry for the delay, I didn't have time to come back to it until this
> > > afternoon.
> >
> > No issues, everyone is busy :-).
> >
> > > I don't think that your analysis is correct. Slots are guaranteed to be
> > > stopped after all the normal backends have been stopped, exactly to
> > > avoid such extraneous records.
> > >
> > > What is happening here is that the slot's confirmed_flush_lsn is
> > > properly updated in memory and ends up being the same as the current
> > > LSN before the shutdown. But as it's a logical slot and those records
> > > aren't decoded, the slot isn't marked as dirty and therefore isn't
> > > saved to disk. You don't see that behavior when doing a manual
> > > checkpoint before (per your script comment), as in that case the
> > > checkpoint also tries to save the slot to disk but then finds a slot
> > > that was marked as dirty and therefore saves it.
> >
> > Here, why is the behavior different for manual and non-manual checkpoints?

I have analyzed this further, and concluded that there is no difference between a manual and a shutdown checkpoint. The difference is whether the CHECKPOINT record has been decoded or not. The overall workflow of this test was:

1. do INSERT
(2. do CHECKPOINT)
(3. decode the CHECKPOINT record)
4. receive a feedback message from the standby
5. do the shutdown CHECKPOINT

At step 3, the walsender decoded that WAL and set candidate_xmin_lsn. The stack trace was: standby_decode() -> SnapBuildProcessRunningXacts() -> LogicalIncreaseXminForSlot(). At step 4, the confirmed_flush of the slot was updated, but ReplicationSlotSave() was executed only when slot->candidate_xmin_lsn held a valid LSN. If steps 2 and 3 are missed, the dirty flag is not set and the change exists only in memory. Finally, the CHECKPOINT was executed at step 5. If steps 2 and 3 are missed and the patch from Julien is not applied, the updated value will be discarded. This is what I observed.
The patch forces the logical slot to be saved at the shutdown checkpoint, so the confirmed_lsn is saved to disk at step 5.

> Can you please explain what led to updating the confirmed_flush in
> memory but not in the disk?

The code-level workflow was described above. The slot info is saved only after decoding a CHECKPOINT record. I'm not sure about the initial motivation, but I suspect we wanted to reduce the number of writes to disk.

> BTW, have we ensured that the discarded additional records are already
> sent to the subscriber? If so, why has the confirmed_flush LSN not
> progressed for those records?

In this case, the apply worker requests an LSN greater than confirmed_lsn via START_REPLICATION. Therefore, according to CreateDecodingContext(), the walsender starts sending from the appropriate record, doesn't it? I think nothing is discarded on the subscriber. Best Regards, Hayato Kuroda FUJITSU LIMITED
Extensible storage manager API - SMGR hook Redux
Hi hackers, At Neon, we've been working on removing the file system dependency from PostgreSQL and replacing it with a distributed storage layer. So far, we've seen the most success in this by replacing the implementation of the smgr API, but it did require some core modifications like those proposed early last year by Anastasia [0]. As mentioned in the previous thread, there are several reasons why you would want to use a non-default storage manager: storage-level compression, encryption, and disk limit quotas [0]; offloading of cold relation data was also mentioned [1]. In the thread on Anastasia's patch, Yura Sokolov mentioned that instead of a hook-based smgr extension, a registration-based smgr would be preferred, with integration into namespaces. Please find attached an as-of-yet incomplete patch that starts to do that. The patch is still incomplete (and it isn't derived from Anastasia's patch), but I would like comments on it regardless, as this is a fairly fundamental component of PostgreSQL that is being modified, and it is often better to get comments early in the development cycle. One significant issue that I've seen so far is that catcache is not guaranteed to be available in all backends that need to do smgr operations, and I've not yet found a good solution.

Changes compared to HEAD:
- smgrsw is now dynamically allocated and grows as new storage managers are loaded (during shared_preload_libraries)
- CREATE TABLESPACE has new optional syntax USING smgrname (option [, ...])
- tablespace storage is (planned) fully managed by smgr through some new smgr APIs

Changes compared to Anastasia's patch:
- extensions do not get to hook and replace the API of the smgr code directly - they are hidden behind the smgr registry.

Successes:
- 0001 passes tests (make check-world)
- 0002 builds without warnings (make)

TODO:
- fix dependency failures when catcache is unavailable
- tablespace redo is currently broken with 0002
- fix tests for 0002
- ensure that pg_dump etc. works with the new tablespace storage manager options

Looking forward to any comments, suggestions and reviews. Kind regards, Matthias van de Meent Neon (https://neon.tech/) [0] https://www.postgresql.org/message-id/CAP4vRV6JKXyFfEOf%3Dn%2Bv5RGsZywAQ3CTM8ESWvgq%2BS87Tmgx_g%40mail.gmail.com [1] https://www.postgresql.org/message-id/d365f19f-bc3e-4f96-a91e-8db130497...@yandex-team.ru v1-0001-Expose-f_smgr-to-extensions-for-manual-implementa.patch Description: Binary data v1-0002-Prototype-Allow-tablespaces-to-specify-which-SMGR.patch Description: Binary data
Re: BUG #17946: LC_MONETARY & DO LANGUAGE plperl - BUG
On 6/29/23 22:13, Tristan Partin wrote: On Mon Jun 5, 2023 at 11:00 AM CDT, Heikki Linnakangas wrote: I think the uselocale() call renders ineffective the setlocale() calls that we make later. Maybe we should replace our setlocale() calls with uselocale(), too. For what it's worth to everyone else in the thread (especially Joe), I have a patch locally that fixes the mentioned bug using uselocale(). I am not sure that it is worth committing for v16 given how _large_ (the patch is actually quite small, +216 -235) of a change it is. I am going to spend tomorrow combing over it a bit more and evaluating other setlocale uses in the codebase. (moving thread to hackers) I don't see a patch attached -- how is it different from what I posted a week ago and added to the commitfest here? https://commitfest.postgresql.org/43/4413/ FWIW, if you are proposing replacing all uses of setlocale() with uselocale() as Heikki suggested: 1/ I don't think that is pg16 material, and it is almost certainly not back-patchable to earlier releases. 2/ It probably does not solve all of the identified issues caused by the newer perl libraries by itself, i.e. I believe the patch posted to the CF is still needed. 3/ I believe it is probably the right way to go for pg17+, but I would love to hear opinions from Jeff Davis, Peter Eisentraut, and/or Thomas Munro (the locale code "usual suspects" ;-)), and others, about that. -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com OpenPGP_signature Description: OpenPGP digital signature
Re: index prefetching
Hi, attached is a v4 of the patch, with a fairly major shift in the approach. Until now the patch relied very much on the AM to provide information about which blocks to prefetch next (based on the current leaf index page). This seemed like a natural approach when I started working on the PoC, but over time I ran into various drawbacks:

* a lot of the logic is at the AM level
* can't prefetch across the index page boundary (have to wait until the next index leaf page is read by the indexscan)
* doesn't work for distance searches (gist/spgist)

After thinking about this, I decided to ditch this whole idea of exchanging prefetch information through an API, and to do the prefetching almost entirely in the indexam code. The new patch maintains a queue of TIDs (read from index_getnext_tid), with up to effective_io_concurrency entries - calling getnext_slot() adds a TID at the queue tail, issues a prefetch for the block, and then returns the TID from the queue head. Maintaining the queue is up to index_getnext_slot() - it can't be done in index_getnext_tid(), because then it'd affect IOS (and prefetching the heap would mostly defeat the whole point of IOS). And we can't do it above index_getnext_slot(), because that has already fetched the heap page. I still think prefetching for IOS is doable (and desirable), in mostly the same way - except that we'd need to maintain the queue from some other place, as IOS doesn't use index_getnext_slot(). FWIW there's also the "index-only filters without IOS" patch [1] which switches even regular index scans to index_getnext_tid(), so maybe relying on index_getnext_slot() is a lost cause anyway. Anyway, this has the nice consequence that it makes the AM code entirely oblivious of prefetching - there's no need for an API, we just get TIDs as before, and the prefetching magic happens after that. Thus it also works for searches ordered by distance (gist/spgist). The patch got much smaller (about 40kB, down from 80kB), which is nice.
I ran the benchmarks [2] with this v4 patch, and the results for the "point" queries are almost exactly the same as for v3. The SAOP part is still running - I'll add those results in a day or two, but I expect a similar outcome as for point queries. regards [1] https://commitfest.postgresql.org/43/4352/ [2] https://github.com/tvondra/index-prefetch-tests-2/ -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company

diff --git a/src/backend/access/gist/gistget.c b/src/backend/access/gist/gistget.c
index e2c9b5f069c..9045c6eb7aa 100644
--- a/src/backend/access/gist/gistget.c
+++ b/src/backend/access/gist/gistget.c
@@ -678,7 +678,6 @@ gistgettuple(IndexScanDesc scan, ScanDirection dir)
 			scan->xs_hitup = so->pageData[so->curPageData].recontup;
 
 			so->curPageData++;
-
 			return true;
 		}
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 0755be83901..f0412da94ae 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -44,6 +44,7 @@
 #include "storage/smgr.h"
 #include "utils/builtins.h"
 #include "utils/rel.h"
+#include "utils/spccache.h"
 
 static void reform_and_rewrite_tuple(HeapTuple tuple,
 									 Relation OldHeap, Relation NewHeap,
@@ -751,6 +752,9 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 			PROGRESS_CLUSTER_INDEX_RELID
 		};
 		int64		ci_val[2];
+		int			prefetch_target;
+
+		prefetch_target = get_tablespace_io_concurrency(OldHeap->rd_rel->reltablespace);
 
 		/* Set phase and OIDOldIndex to columns */
 		ci_val[0] = PROGRESS_CLUSTER_PHASE_INDEX_SCAN_HEAP;
@@ -759,7 +763,8 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 
 		tableScan = NULL;
 		heapScan = NULL;
-		indexScan = index_beginscan(OldHeap, OldIndex, SnapshotAny, 0, 0);
+		indexScan = index_beginscan(OldHeap, OldIndex, SnapshotAny, 0, 0,
+									prefetch_target, prefetch_target);
 		index_rescan(indexScan, NULL, 0, NULL, 0);
 	}
 	else
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 722927aebab..264ebe1d8e5 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -126,6 +126,9 @@ RelationGetIndexScan(Relation indexRelation, int nkeys, int norderbys)
 	scan->xs_hitup = NULL;
 	scan->xs_hitupdesc = NULL;
 
+	/* set in each AM when applicable */
+	scan->xs_prefetch = NULL;
+
 	return scan;
 }
@@ -440,8 +443,9 @@ systable_beginscan(Relation heapRelation,
 				elog(ERROR, "column is not in index");
 		}
 
+		/* no index prefetch for system catalogs */
 		sysscan->iscan = index_beginscan(heapRelation, irel,
-										 snapshot, nkeys, 0);
+										 snapshot, nkeys, 0, 0, 0);
 		index_rescan(sysscan->iscan, key, nkeys, NULL, 0);
 		sysscan->scan = NULL;
 	}
@@ -696,8 +700,9 @@ systable_beginscan_ordered(Relation heapRelation,
 				elog(ERROR, "column is not in index");
 		}
 
+		/* no index prefetch for system catalogs */
Re: Removing unneeded self joins
On 4/4/2023 02:30, Gregory Stark (as CFM) wrote: On Mon, 6 Mar 2023 at 00:30, Michał Kłeczek wrote: Hi All, I just wanted to ask about the status and plans for this patch. I can see it being stuck at “Waiting for Author” status in several commit tests. Sadly it seems to now be badly in need of a rebase. There are large hunks failing in the guts of analyzejoins.c as well as minor failures elsewhere and lots of offsets which need to be reviewed. I think given the lack of activity it's out of time for this release at this point. I'm moving it ahead to the next CF.

Hi, Version 41 is a heavy rework of the feature:

1. In previous versions, I tried to reuse remove_rel_from_query() for both left-join and self-join removal. But now I realize that they are somewhat different procedures that handle different operations. In this patch, only the common stages of the PlannerInfo fixing process are united in one function.
2. Transferring clauses from the removed table to the kept one is more transparent now and contains comments.
3. The equivalence-class update procedure was changed according to David's commit 3373c71. As I see, Tom has added remove_rel_from_eclass since the last v40 version, and it looks pretty similar to the update_eclass routine in this patch.

It passes regression tests, but some questions are still open:

1. Should we look for duplicated or redundant clauses (the same for eclasses) during the clause transfer procedure? On the one hand, we reduce the length of restrict lists, which can impact planning or execution time. Additionally, we improve the accuracy of cardinality estimation. On the other hand, it is one more place that can make planning time much longer in specific cases. It would have been better to avoid calling the equal() function here, but it's the only way to detect duplicated inequality expressions.
2. Could we reuse ChangeVarNodes instead of sje_walker(), and merge remove_rel_from_restrictinfo with replace_varno?
3.
Also, I am still not finished with split_selfjoin_quals: some improvements could be made.

--
regards, Andrey Lepikhov
Postgres Professional

From 4a342b9789f5be209318c13fb7ec336fcbd2aee5 Mon Sep 17 00:00:00 2001
From: Andrey Lepikhov
Date: Mon, 15 May 2023 09:04:51 +0500
Subject: [PATCH] Remove self-joins.

The Self Join Elimination (SJE) feature removes an inner join of a plain table to itself in the query tree if it can be proved that the join can be replaced with a scan. The proof is based on the innerrel_is_unique machinery.

We can remove a self-join when, for each outer row:
1. At most one inner row matches the join clause.
2. If the join target list contains any inner vars, an inner row must be (physically) the same row as the outer one.

In this patch we use the following approach to identify a self-join:
1. Collect all mergejoinable join quals which look like a.x = b.x
2. Check innerrel_is_unique() for the qual list from (1). If it returns true, then each outer row matches only the same row from the inner relation.
3. If uniqueness of the outer relation is deduced from a baserestrictinfo clause like 'x = const' on unique field(s), check that the inner relation has the same clause with the same constant(s).
4. Compare the RowMarks of the inner and outer relations. They must have the same strength.

Some regression tests changed due to the self-join removal logic.
--- doc/src/sgml/config.sgml | 16 + src/backend/optimizer/path/indxpath.c | 38 + src/backend/optimizer/plan/analyzejoins.c | 1077 - src/backend/optimizer/plan/planmain.c |5 + src/backend/utils/misc/guc_tables.c | 22 + src/include/optimizer/paths.h |3 + src/include/optimizer/planmain.h |7 + src/test/regress/expected/equivclass.out | 32 + src/test/regress/expected/join.out| 798 +++ src/test/regress/expected/sysviews.out|3 +- src/test/regress/sql/equivclass.sql | 16 + src/test/regress/sql/join.sql | 351 +++ src/tools/pgindent/typedefs.list |2 + 13 files changed, 2319 insertions(+), 51 deletions(-) diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 6262cb7bb2..68215e1093 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -5419,6 +5419,22 @@ ANY num_sync ( + enable_self_join_removal (boolean) + + enable_self_join_removal configuration parameter + + + + + Enables or disables the query planner's optimization which analyses +query tree and replaces self joins with semantically equivalent single +scans. Take into consideration only plain tables. +The default is on. + + + + enable_seqscan (boolean) diff --git a/src/backend/optimizer/path/indxpath.c b/src/backend/optimizer/path/indxpath.c index 0065c8992b..57bdc6811f 100644 --- a/src/backend/optimizer/path/indxpath.c +++ b/src/backend/optimizer/path/indxpath.c @@ -3491,6 +3491,21
Extension Enhancement: Buffer Invalidation in pg_buffercache
I hope this email finds you well. I am excited to share that I have extended the functionality of the `pg_buffercache` extension by implementing a buffer invalidation capability, as requested by some PostgreSQL contributors for improved testing scenarios. This marks my first time submitting a patch to pgsql-hackers, and I am eager to receive your feedback on the changes; any review or comments you provide will be greatly appreciated. The primary objective of this enhancement is to enable explicit buffer invalidation from the `pg_buffercache` extension. By doing so, we can simulate scenarios where buffers are invalidated and observe the resulting behavior in PostgreSQL. As part of this patch, a new function has been introduced to perform the buffer invalidation. I would like to hear your thoughts on whether this approach provides a good user interface for the functionality. Additionally, I would appreciate an evaluation of the buffer locking protocol employed in the extension, to ensure its correctness and efficiency. Please note that I plan to add comprehensive documentation once the details of this enhancement are agreed upon. I have attached the patch file to this email. How the extension works: 1. Create the pg_buffercache extension, run a SELECT query on a table, and note the buffers to be cleared.
pgbench=# create extension pg_buffercache;
CREATE EXTENSION
pgbench=# select count(*) from pgbench_accounts;
 count
-------
    10
(1 row)

pgbench=# SELECT * FROM pg_buffercache WHERE relfilenode = pg_relation_filenode('pgbench_accounts'::regclass);
 bufferid | relfilenode | reltablespace | reldatabase | relforknumber | relblocknumber | isdirty | usagecount | pinning_backends
----------+-------------+---------------+-------------+---------------+----------------+---------+------------+------------------
      233 |       16397 |          1663 |       16384 |             0 |              0 | f       |          1 |                0
      234 |       16397 |          1663 |       16384 |             0 |              1 | f       |          1 |                0
      235 |       16397 |          1663 |       16384 |             0 |              2 | f       |          1 |                0
      236 |       16397 |          1663 |       16384 |             0 |              3 | f       |          1 |                0
      237 |       16397 |          1663 |       16384 |             0 |              4 | f       |          1 |                0

2. Clearing a single buffer by entering the bufferid.

pgbench=# SELECT count(*) FROM pg_buffercache WHERE relfilenode = pg_relation_filenode('pgbench_accounts'::regclass);
 count
-------
  1660
(1 row)

pgbench=# select pg_buffercache_invalidate(233);
 pg_buffercache_invalidate
---------------------------
 t
(1 row)

pgbench=# SELECT count(*) FROM pg_buffercache WHERE relfilenode = pg_relation_filenode('pgbench_accounts'::regclass);
 count
-------
  1659
(1 row)

3. Clearing the entire buffer for a relation using the function.

pgbench=# SELECT count(*) FROM pg_buffercache WHERE relfilenode = pg_relation_filenode('pgbench_accounts'::regclass);
 count
-------
  1659
(1 row)

pgbench=# select count(pg_buffercache_invalidate(bufferid)) from pg_buffercache where relfilenode = pg_relation_filenode('pgbench_accounts'::regclass);
 count
-------
  1659
(1 row)

pgbench=# SELECT count(*) FROM pg_buffercache WHERE relfilenode = pg_relation_filenode('pgbench_accounts'::regclass);
 count
-------
     0
(1 row)

Best regards,
Palak

v1-0001-Invalidate-Buffer-By-Bufnum.patch
Description: Binary data
Re: [PATCH] Using named captures in Catalog::ParseHeader()
Michael Paquier writes: > On Wed, Jun 14, 2023 at 10:30:23AM +0100, Dagfinn Ilmari Mannsåker wrote: >> Thanks for the review! > > v17 is now open, so applied this one. Thanks for committing! - ilmari
Re: Tab completion for CREATE SCHEMA AUTHORIZATION
Michael Paquier writes: > On Tue, May 09, 2023 at 12:26:16PM +0900, Michael Paquier wrote: >> That looks pretty much OK to me. One tiny comment I have is that this >> lacks brackets for the inner blocks, so I have added some in the v4 >> attached. > > The indentation was a bit wrong, so fixed it, and applied on HEAD. Thanks! - ilmari
Re: [HACKERS] make async slave to wait for lsn to be replayed
All rebased and tested -- Ivan Kartyshov Postgres Professional: http://www.postgrespro.com The Russian Postgres comp...@postgrespro.ru> diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c index becc2bda62..c7460bd9b8 100644 --- a/src/backend/access/transam/xlogrecovery.c +++ b/src/backend/access/transam/xlogrecovery.c @@ -43,6 +43,7 @@ #include "backup/basebackup.h" #include "catalog/pg_control.h" #include "commands/tablespace.h" +#include "commands/wait.h" #include "common/file_utils.h" #include "miscadmin.h" #include "pgstat.h" @@ -1752,6 +1753,15 @@ PerformWalRecovery(void) break; } + /* + * If we replayed an LSN that someone was waiting for, + * set latches in shared memory array to notify the waiter. + */ + if (XLogRecoveryCtl->lastReplayedEndRecPtr >= GetMinWaitedLSN()) + { +WaitSetLatch(XLogRecoveryCtl->lastReplayedEndRecPtr); + } + /* Else, try to fetch the next WAL record */ record = ReadRecord(xlogprefetcher, LOG, false, replayTLI); } while (record != NULL); diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile index 48f7348f91..d8f6965d8c 100644 --- a/src/backend/commands/Makefile +++ b/src/backend/commands/Makefile @@ -61,6 +61,7 @@ OBJS = \ vacuum.o \ vacuumparallel.o \ variable.o \ - view.o + view.o \ + wait.o include $(top_srcdir)/src/backend/common.mk diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build index 42cced9ebe..ec6ab7722a 100644 --- a/src/backend/commands/meson.build +++ b/src/backend/commands/meson.build @@ -50,4 +50,5 @@ backend_sources += files( 'vacuumparallel.c', 'variable.c', 'view.c', + 'wait.c', ) diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c new file mode 100644 index 00..2f9be4f9ad --- /dev/null +++ b/src/backend/commands/wait.c @@ -0,0 +1,275 @@ +/*- + * + * wait.c + * Implements wait lsn, which allows waiting for events such as + * LSN having been replayed on replica. 
+ * + * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group + * Portions Copyright (c) 2023, Regents of PostgresPro + * + * IDENTIFICATION + * src/backend/commands/wait.c + * + *- + */ +#include "postgres.h" + +#include + +#include "access/xact.h" +#include "access/xlogrecovery.h" +#include "access/xlogdefs.h" +#include "commands/wait.h" +#include "funcapi.h" +#include "miscadmin.h" +#include "pgstat.h" +#include "storage/backendid.h" +#include "storage/pmsignal.h" +#include "storage/proc.h" +#include "storage/shmem.h" +#include "storage/sinvaladt.h" +#include "storage/spin.h" +#include "utils/builtins.h" +#include "utils/pg_lsn.h" +#include "utils/timestamp.h" + +/* Add to shared memory array */ +static void AddWaitedLSN(XLogRecPtr lsn_to_wait); + +/* Shared memory structure */ +typedef struct +{ + int backend_maxid; + pg_atomic_uint64 min_lsn; /* XLogRecPtr of minimal waited for LSN */ + slock_t mutex; + /* LSNs that different backends are waiting */ + XLogRecPtr lsn[FLEXIBLE_ARRAY_MEMBER]; +} WaitState; + +static WaitState *state; + +/* + * Add the wait event of the current backend to shared memory array + */ +static void +AddWaitedLSN(XLogRecPtr lsn_to_wait) +{ + SpinLockAcquire(>mutex); + if (state->backend_maxid < MyBackendId) + state->backend_maxid = MyBackendId; + + state->lsn[MyBackendId] = lsn_to_wait; + + if (lsn_to_wait < state->min_lsn.value) + state->min_lsn.value = lsn_to_wait; + SpinLockRelease(>mutex); +} + +/* + * Delete wait event of the current backend from the shared memory array. 
+ */ +void +DeleteWaitedLSN(void) +{ + int i; + XLogRecPtr lsn_to_delete; + + SpinLockAcquire(>mutex); + + lsn_to_delete = state->lsn[MyBackendId]; + state->lsn[MyBackendId] = InvalidXLogRecPtr; + + /* If we are deleting the minimal LSN, then choose the next min_lsn */ + if (lsn_to_delete != InvalidXLogRecPtr && + lsn_to_delete == state->min_lsn.value) + { + state->min_lsn.value = PG_UINT64_MAX; + for (i = 2; i <= state->backend_maxid; i++) + if (state->lsn[i] != InvalidXLogRecPtr && +state->lsn[i] < state->min_lsn.value) +state->min_lsn.value = state->lsn[i]; + } + + /* If deleting from the end of the array, shorten the array's used part */ + if (state->backend_maxid == MyBackendId) + for (i = (MyBackendId); i >= 2; i--) + if (state->lsn[i] != InvalidXLogRecPtr) + { +state->backend_maxid = i; +break; + } + + SpinLockRelease(>mutex); +} + +/* + * Report amount of shared memory space needed for WaitState + */ +Size +WaitShmemSize(void) +{ + Size size; + + size = offsetof(WaitState, lsn); + size = add_size(size, mul_size(MaxBackends + 1, sizeof(XLogRecPtr))); + return size; +} + +/* + * Initialize an array of events to wait for in shared memory + */ +void +WaitShmemInit(void) +{ + bool found; + uint32 i; + +
Re: PG 16 draft release notes ready
Hi, Thanks for preparing the release notes. I found that the PG16 beta2 release notes mention the following feature, which has since been reverted. ``` Have initdb use ICU by default if ICU is enabled in the binary (Jeff Davis) Option --locale-provider=libc can be used to disable ICU. ``` The feature was reverted by the following commit: * https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=2535c74b1a6190cc42e13f6b6b55d94bff4b7dd6 Regards, -- Masahiro Ikeda NTT DATA CORPORATION
Re: [PATCH] Honor PG_TEST_NOCLEAN for tempdirs
> On 30 Jun 2023, at 09:09, Michael Paquier wrote: > > On Thu, Jun 29, 2023 at 09:05:59AM -0700, Jacob Champion wrote: >> Sounds good -- 0002 can be ignored as needed, then. (Or I can resend a >> v3 for CI purposes, if you'd like.) > > I am assuming that this is 0001 posted here: > https://www.postgresql.org/message-id/8be3d35a-9608-b1d5-e5e6-7a744ea45...@timescale.com > > And that looks OK to me. This is something I'd rather backpatch down > to v11 on usability ground for developers. Any comments or objections > about that? Agreed, I'd prefer all branches to work the same for this. Reading the patch, only one thing stood out: -variable PG_TEST_NOCLEAN is set, data directories will be retained -regardless of test status. +variable PG_TEST_NOCLEAN is set, those directories will be retained I would've written "the data directories" instead of "those directories" here. -- Daniel Gustafsson
Re: pgsql: Fix search_path to a safe value during maintenance operations.
On Thu, 2023-06-29 at 20:53 -0400, Tom Lane wrote:
> I think that's a seriously awful kluge. It will mean that things
> behave differently for the owner than for MAINTAIN grantees, which
> pretty much destroys the use-case for that privilege, as well as
> being very confusing and hard to debug.

In version 15, try this:

CREATE USER foo;
CREATE SCHEMA foo AUTHORIZATION foo;
CREATE USER bar;
CREATE SCHEMA bar AUTHORIZATION bar;
\c - foo
CREATE FUNCTION foo.mod10(INT) RETURNS INT IMMUTABLE
  LANGUAGE plpgsql AS $$ BEGIN RETURN mod($1,10); END; $$;
CREATE TABLE t(i INT);
-- units digit must be unique
CREATE UNIQUE INDEX t_idx ON t (foo.mod10(i));
INSERT INTO t VALUES(7); -- success
INSERT INTO t VALUES(17); -- fails
GRANT USAGE ON SCHEMA foo TO bar;
GRANT INSERT ON t TO bar;
\c - bar
CREATE FUNCTION bar.mod(INT, INT) RETURNS INT IMMUTABLE
  LANGUAGE plpgsql AS $$ BEGIN RETURN $1 + 100; END; $$;
SET search_path = bar, pg_catalog;
INSERT INTO foo.t VALUES(7); -- succeeds
\c - foo
SELECT * FROM t;
 i
---
 7
 7
(2 rows)

I'm not sure that everyone in this thread realizes just how broken it is to depend on search_path in a functional index at all. And doubly so if it depends on a schema other than pg_catalog in the search_path.

Let's also not forget that logical replication always uses search_path=pg_catalog, so if you depend on a different search_path for any function attached to the table (not just functional indexes, but also functions inside expressions or trigger functions), then those are already broken in version 15. And if a superuser is executing maintenance commands, there's little reason to think they'll have the same search path as the user that created the table.

At some point in the very near future (though I realize that point may come after version 16), we need to lock down the search path in a lot of cases (not just maintenance commands), and I don't see any way around that.

Regards, Jeff Davis
Re: [PATCH] Honor PG_TEST_NOCLEAN for tempdirs
On Thu, Jun 29, 2023 at 09:05:59AM -0700, Jacob Champion wrote: > Sounds good -- 0002 can be ignored as needed, then. (Or I can resend a > v3 for CI purposes, if you'd like.) I am assuming that this is 0001 posted here: https://www.postgresql.org/message-id/8be3d35a-9608-b1d5-e5e6-7a744ea45...@timescale.com And that looks OK to me. This is something I'd rather backpatch down to v11 on usability ground for developers. Any comments or objections about that? -- Michael signature.asc Description: PGP signature
Re: ExecRTCheckPerms() and many prunable partitions (checkAsUser)
On Wed, Jun 28, 2023 at 4:30 PM Amit Langote wrote: > > Hi, > > On Tue, Feb 21, 2023 at 4:12 PM Amit Langote wrote: > > On Tue, Feb 21, 2023 at 12:40 AM Tom Lane wrote: > > > Alvaro Herrera writes: > > > > On 2023-Feb-20, Amit Langote wrote: > > > >> One more thing we could try is come up with a postgres_fdw test case, > > > >> because it uses the RelOptInfo.userid value for remote-costs-based > > > >> path size estimation. But adding a test case to contrib module's > > > >> suite test a core planner change might seem strange, ;-). > > > > > > > Maybe. Perhaps adding it in a separate file there is okay? > > > > > > There is plenty of stuff in contrib module tests that is really > > > there to test core-code behavior. (You could indeed argue that > > > *all* of contrib is there for that purpose.) If it's not > > > convenient to test something without an extension, just do it > > > and don't sweat about that. > > > > OK. Attached adds a test case to postgres_fdw's suite. You can see > > that it fails without a316a3bc. > > Noticed that there's an RfC entry for this in the next CF. Here's an > updated version of the patch where I updated the comments a bit and > the commit message. > > I'm thinking of pushing this on Friday barring objections. Seeing none, I've pushed this to HEAD and 16. Marking the CF entry as committed. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com
Re: Allow pg_archivecleanup to remove backup history files
On Fri, Jun 23, 2023 at 05:37:09PM +0900, torikoshia wrote: > On 2023-06-22 16:47, Kyotaro Horiguchi wrote: > Thanks for your review! I have begun cleaning up my board, and applied 0001 for the moment. -- Michael
Re: Todo: Teach planner to evaluate multiple windows in the optimal order
On Wed, Feb 15, 2023 at 11:23 AM John Naylor wrote: > > > On Tue, Feb 14, 2023 at 5:46 PM David Rowley wrote: > > > > I didn't do a detailed review of the sort patch, but I did wonder > > about the use of the name "fallback" in the new functions. The > > comment in the following snippet from qsort_tuple_unsigned_compare() > > makes me think "tiebreak" is a better name. > > I agree "tiebreak" is better. Here is v3 with that change. I still need to make sure the tests cover all cases, so I'll do that as time permits. Also creating a CF entry. -- John Naylor EDB: http://www.enterprisedb.com

From be88d7cab46e44385762365f1ba23d7684f76d3a Mon Sep 17 00:00:00 2001
From: John Naylor
Date: Mon, 30 Jan 2023 17:10:00 +0700
Subject: [PATCH v3] Split out tiebreaker comparisons from comparetup_* functions

Previously, if a specialized comparator found equal datum1 keys, the
"comparetup" function would repeat the comparison on the datum before
proceeding with the unabbreviated first key and/or additional sort
keys. Move comparing additional sort keys into "tiebreak" functions so
that specialized comparators can call these directly if needed,
avoiding duplicate work.

Reviewed by David Rowley
Discussion: https://www.postgresql.org/message-id/CAFBsxsGaVfUrjTghpf%3DkDBYY%3DjWx1PN-fuusVe7Vw5s0XqGdGw%40mail.gmail.com
---
 src/backend/utils/sort/tuplesort.c         |   6 +-
 src/backend/utils/sort/tuplesortvariants.c | 112 -
 src/include/utils/tuplesort.h              |   7 ++
 3 files changed, 95 insertions(+), 30 deletions(-)

diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index e5a4e5b371..3e872d7518 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -513,7 +513,7 @@ qsort_tuple_unsigned_compare(SortTuple *a, SortTuple *b, Tuplesortstate *state)
 	if (state->base.onlyKey != NULL)
 		return 0;

-	return state->base.comparetup(a, b, state);
+	return state->base.comparetup_tiebreak(a, b, state);
 }

 #if SIZEOF_DATUM >= 8
@@ -537,7 +537,7 @@ qsort_tuple_signed_compare(SortTuple *a, SortTuple *b, Tuplesortstate *state)
 	if (state->base.onlyKey != NULL)
 		return 0;

-	return state->base.comparetup(a, b, state);
+	return state->base.comparetup_tiebreak(a, b, state);
 }
 #endif

@@ -561,7 +561,7 @@ qsort_tuple_int32_compare(SortTuple *a, SortTuple *b, Tuplesortstate *state)
 	if (state->base.onlyKey != NULL)
 		return 0;

-	return state->base.comparetup(a, b, state);
+	return state->base.comparetup_tiebreak(a, b, state);
 }

 /*
diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
index eb6cfcfd00..dd9d4d3764 100644
--- a/src/backend/utils/sort/tuplesortvariants.c
+++ b/src/backend/utils/sort/tuplesortvariants.c
@@ -47,18 +47,24 @@ static void removeabbrev_datum(Tuplesortstate *state, SortTuple *stups,
 							   int count);
 static int	comparetup_heap(const SortTuple *a, const SortTuple *b,
 							Tuplesortstate *state);
+static int	comparetup_heap_tiebreak(const SortTuple *a, const SortTuple *b,
+							Tuplesortstate *state);
 static void writetup_heap(Tuplesortstate *state, LogicalTape *tape,
 						  SortTuple *stup);
 static void readtup_heap(Tuplesortstate *state, SortTuple *stup,
 						 LogicalTape *tape, unsigned int len);
 static int	comparetup_cluster(const SortTuple *a, const SortTuple *b,
 							   Tuplesortstate *state);
+static int	comparetup_cluster_tiebreak(const SortTuple *a, const SortTuple *b,
+							   Tuplesortstate *state);
 static void writetup_cluster(Tuplesortstate *state, LogicalTape *tape,
 							 SortTuple *stup);
 static void readtup_cluster(Tuplesortstate *state, SortTuple *stup,
 							LogicalTape *tape, unsigned int tuplen);
 static int	comparetup_index_btree(const SortTuple *a, const SortTuple *b,
 								   Tuplesortstate *state);
+static int	comparetup_index_btree_tiebreak(const SortTuple *a, const SortTuple *b,
+								   Tuplesortstate *state);
 static int	comparetup_index_hash(const SortTuple *a, const SortTuple *b,
 								  Tuplesortstate *state);
 static void writetup_index(Tuplesortstate *state, LogicalTape *tape,
@@ -67,6 +73,8 @@ static void readtup_index(Tuplesortstate *state, SortTuple *stup,
 						  LogicalTape *tape, unsigned int len);
 static int	comparetup_datum(const SortTuple *a, const SortTuple *b,
 							 Tuplesortstate *state);
+static int	comparetup_datum_tiebreak(const SortTuple *a, const SortTuple *b,
+							 Tuplesortstate *state);
 static void writetup_datum(Tuplesortstate *state, LogicalTape *tape,
 						   SortTuple *stup);
 static void readtup_datum(Tuplesortstate *state, SortTuple *stup,
@@ -165,6 +173,7 @@ tuplesort_begin_heap(TupleDesc tupDesc,

 	base->removeabbrev = removeabbrev_heap;
 	base->comparetup = comparetup_heap;
+	base->comparetup_tiebreak = comparetup_heap_tiebreak;
 	base->writetup = writetup_heap;
 	base->readtup = readtup_heap;
 	base->haveDatum1 = true;
@@ -242,6 +251,7 @@
Re: Synchronizing slots from primary to standby
Hi Kuroda-san, On 6/29/23 12:22 PM, Hayato Kuroda (Fujitsu) wrote: Dear Drouvot, Hi, I'm also interested in the feature. Followings are my high-level comments. I did not mention some detailed notations because this patch is not at the stage. And very sorry that I could not follow all of this discussions. Thanks for looking at it and your feedback! All I've done so far is to provide a re-based version in April of the existing patch. I'll have a closer look at the code, at your feedback and Amit's, while working on the new version that will: - take care of slot invalidation - ensure that synchronized slots can't be consumed until the standby gets promoted, as discussed up-thread. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com