Re: Refactor backup related code (was: Is it correct to say, "invalid data in file \"%s\"", BACKUP_LABEL_FILE in do_pg_backup_stop?)
On Wed, Sep 21, 2022 at 03:45:49PM +0900, Fujii Masao wrote: > Instead of making allocate_backup_state() or reset_backup_state() > store the label name in BackupState before do_pg_backup_start(), > how about making do_pg_backup_start() do that after checking its length? > Seems this can simplify the code very much. > > If so, ISTM that we can replace allocate_backup_state() and > reset_backup_state() with just palloc0() and MemSet(), respectively. > Also we can remove fill_backup_label_name(). Yep, agreed. Having all these routines feels a bit overengineered. -- Michael signature.asc Description: PGP signature
Re: Refactor backup related code (was: Is it correct to say, "invalid data in file \"%s\"", BACKUP_LABEL_FILE in do_pg_backup_stop?)
On Tue, Sep 20, 2022 at 05:13:49PM +0530, Bharath Rupireddy wrote: > On Tue, Sep 20, 2022 at 7:20 AM Michael Paquier wrote: > I've separated out these variables from the backup state, please see > the attached v9 patch. Thanks, the separation looks cleaner. >>> I've also taken help of the error callback mechanism to clean up the >>> allocated memory in case of a failure. For do_pg_abort_backup() cases, >>> I think we don't need to clean the memory because that gets called on >>> proc exit (before_shmem_exit()). >> >> Memory could still bloat while the process running the SQL functions >> is running depending on the error code path, anyway. > > I didn't get your point. Can you please elaborate it? I think adding > error callbacks at the right places would free up the memory for us. > Please note that we already are causing memory leaks on HEAD today. I mean that HEAD makes no effort in freeing this memory in TopMemoryContext on session ERROR. > I addressed the above review comments. I also changed a wrong comment > [1] that lies before pg_backup_start() even after the removal of > exclusive backup. > > I'm attaching v9 patch set herewith, 0001 - refactors the backup code > with backup state, 0002 - adds error callbacks to clean up the memory > allocated for backup variables. Please review them further. I have a few comments on 0001. +#include Double quotes wanted here. deallocate_backup_variables() is the only part of xlogbackup.c that includes references to the tablespace map and backup_label StringInfos. I would be tempted to fully decouple that from xlogbackup.c/h for the time being. - tblspc_map_file = makeStringInfo(); Not sure that there is a need for a rename here. +void +build_backup_content(BackupState *state, bool ishistoryfile, StringInfo str) +{ Would it not be more natural to have build_backup_content() do the initialization of the StringInfo for the contents of backup_label itself, and return it as the result of the routine?
This is short-lived in xlogfuncs.c when the backup ends. @@ -248,18 +250,25 @@ perform_base_backup(basebackup_options *opt, bbsink *sink) [...] + /* Construct backup_label contents. */ + build_backup_content(backup_state, false, backup_label); Actually, for base backups, perhaps it would be more intuitive to build and free the StringInfo of the backup_label when we send it for base.tar rather than initializing it at the beginning and freeing it at the end? - * pg_backup_start: set up for taking an on-line backup dump + * pg_backup_start: start taking an on-line backup. * - * Essentially what this does is to create a backup label file in $PGDATA, - * where it will be archived as part of the backup dump. The label file - * contains the user-supplied label string (typically this would be used - * to tell where the backup dump will be stored) and the starting time and - * starting WAL location for the dump. + * This function starts the backup and creates tablespace_map contents. The last part of the comment is still correct while the former is not, so this loses some information. -- Michael signature.asc Description: PGP signature
TupleTableSlot->tts_tableOid is assigned redundantly in table AM.
Hi hackers, Recently, we discovered that the tts_tableOid field of TupleTableSlot is assigned twice in the table AM interface, which is unnecessary. For example, in table_scan_getnextslot:

```
static inline bool
table_scan_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableSlot *slot)
{
	slot->tts_tableOid = RelationGetRelid(sscan->rs_rd);

	/*
	 * We don't expect direct calls to table_scan_getnextslot with valid
	 * CheckXidAlive for catalog or regular tables. See detailed comments in
	 * xact.c where these variables are declared.
	 */
	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
		elog(ERROR, "unexpected table_scan_getnextslot call during logical decoding");

	return sscan->rs_rd->rd_tableam->scan_getnextslot(sscan, direction, slot);
}
```

we can see that it assigns tts_tableOid, while the call to sscan->rs_rd->rd_tableam->scan_getnextslot, which is implemented by heap_getnextslot, also assigns tts_tableOid in its call of ExecStoreBufferHeapTuple:

```
TupleTableSlot *
ExecStoreBufferHeapTuple(HeapTuple tuple, TupleTableSlot *slot, Buffer buffer)
{
	/*
	 * sanity checks
	 */
	Assert(tuple != NULL);
	Assert(slot != NULL);
	Assert(slot->tts_tupleDescriptor != NULL);
	Assert(BufferIsValid(buffer));

	if (unlikely(!TTS_IS_BUFFERTUPLE(slot)))
		elog(ERROR, "trying to store an on-disk heap tuple into wrong type of slot");

	tts_buffer_heap_store_tuple(slot, tuple, buffer, false);

	slot->tts_tableOid = tuple->t_tableOid;

	return slot;
}
```

Reading the code shows that the two assigned values are the same. Maybe we should remove one? What's more, we think that assigning slot->tts_tableOid in the outer interface, such as table_scan_getnextslot, may be better and friendlier when importing other pluggable storage formats: it avoids duplicated assignments in every implementation of the table AM interfaces. Regards, Wenchao
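The proposal, assigning tts_tableOid only once in the outer interface so per-AM callbacks need not repeat it, can be sketched with a self-contained toy model of the dispatch. All types and names here are illustrative mocks, not the real executor structures:

```c
#include <stdint.h>

typedef uint32_t Oid;

/* Minimal mock of a slot and a table-AM scan with a callback pointer. */
typedef struct
{
	Oid		tts_tableOid;
	int		value;
} ToySlot;

typedef struct ToyScan ToyScan;
typedef int (*scan_getnextslot_fn) (ToyScan *scan, ToySlot *slot);

struct ToyScan
{
	Oid					relid;
	int					next;
	scan_getnextslot_fn	scan_getnextslot;
};

/* A "heap-like" AM callback: fills the slot but leaves tts_tableOid alone. */
static int
toyam_getnextslot(ToyScan *scan, ToySlot *slot)
{
	if (scan->next >= 3)
		return 0;				/* end of scan */
	slot->value = scan->next++;
	return 1;
}

/* Outer interface: the single place where tts_tableOid is assigned. */
static int
toy_table_scan_getnextslot(ToyScan *scan, ToySlot *slot)
{
	slot->tts_tableOid = scan->relid;
	return scan->scan_getnextslot(scan, slot);
}
```

The design point is that every pluggable AM then gets the correct tts_tableOid for free, and the field is written exactly once per call.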
Re: ICU for global collation
On 2022-09-20 12:59, Peter Eisentraut wrote: On 17.09.22 10:33, Marina Polyakova wrote: 3. The locale provider is ICU, but it has not yet been set from the template database: $ initdb --locale-provider icu --icu-locale en-US -D data && pg_ctl -D data -l logfile start && createdb --icu-locale ru-RU --template template0 mydb ... createdb: error: database creation failed: ERROR: ICU locale cannot be specified unless locale provider is ICU Please see attached patch for a fix. Does that work for you? Yes, it works. The following test checks this fix: diff --git a/src/bin/scripts/t/020_createdb.pl b/src/bin/scripts/t/020_createdb.pl index b87d8fc63b5246b02bcd4499aae815269b60df7c..c2464a99618cd7ca5616cc21121e1e4379b52baf 100644 --- a/src/bin/scripts/t/020_createdb.pl +++ b/src/bin/scripts/t/020_createdb.pl @@ -71,6 +71,14 @@ if ($ENV{with_icu} eq 'yes') $node2->command_ok( [ 'createdb', '-T', 'template0', '--locale-provider=libc', 'foobar55' ], 'create database with libc provider from template database with icu provider'); + + $node2->command_ok( + [ + 'createdb', '-T', 'template0', '--icu-locale', + 'en-US', 'foobar56' + ], + 'create database with icu locale from template database with icu provider' + ); } else { -- Marina Polyakova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: Refactor backup related code (was: Is it correct to say, "invalid data in file \"%s\"", BACKUP_LABEL_FILE in do_pg_backup_stop?)
On 2022/09/20 20:43, Bharath Rupireddy wrote: - if (strlen(backupidstr) > MAXPGPATH) + if (state->name_overflowed == true) ereport(ERROR, (errcode(ERRCODE_INVALID_PARAMETER_VALUE), errmsg("backup label too long (max %d bytes)", It does not strike me as a huge issue to force a truncation of such backup label names. 1024 is large enough for basically all users, in my opinion. Or you could just truncate in the internal logic, but still complain at MAXPGPATH - 1 as the last byte would be for the zero termination. In short, there is no need to complicate things with name_overflowed. We currently allow MAXPGPATH bytes of label name excluding null termination. I don't want to change this behaviour. In the attached v9 patch, I'm truncating the larger names to MAXPGPATH + 1 bytes in backup state (one extra byte for representing that the name has overflowed, and another extra byte for null termination). This looks too complicated to me. Instead of making allocate_backup_state() or reset_backup_state() store the label name in BackupState before do_pg_backup_start(), how about making do_pg_backup_start() do that after checking its length? Seems this can simplify the code very much. If so, ISTM that we can replace allocate_backup_state() and reset_backup_state() with just palloc0() and MemSet(), respectively. Also we can remove fill_backup_label_name(). Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Re: make additional use of optimized linear search routines
On Wed, Sep 21, 2022 at 1:40 PM Michael Paquier wrote: > In short, switching those code paths to use the linear search routines > looks like a good thing in the long-term, so I would like to apply > this patch. If you have any comments or objections, please feel > free. Yeah, I agree that the changes in the patch are meaningful even if the performance gain is limited. I wonder if there are other code paths we can replace with the linear search routines. I tried to search for them but no luck. Thanks Richard
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Tue, Sep 20, 2022 at 3:19 PM Masahiko Sawada wrote:
> On Fri, Sep 16, 2022 at 4:54 PM John Naylor wrote:
> > Here again, I'd rather put this off and focus on getting the "large
> > details" in good enough shape so we can go towards integrating with
> > vacuum.
>
> Thank you for the comments! These above comments are addressed by
> Nathan in a newly derived thread. I'll work on the patch.

I still seem to be out-voted on when to tackle this particular optimization, so I've extended the v6 benchmark code with a hackish function that populates a fixed number of keys, but with different fanouts. (diff attached as a text file) I didn't take particular care to make this scientific, but the following seems pretty reproducible. Note what happens to load and search performance when node16 has 15 entries versus 16:

 fanout | nkeys  | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+--------+------------------+------------+--------------
     15 | 327680 |          3776512 |         39 |           20
(1 row)

num_keys = 327680, height = 4, n4 = 1, n16 = 23408, n32 = 0, n128 = 0, n256 = 0

 fanout | nkeys  | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+--------+------------------+------------+--------------
     16 | 327680 |          3514368 |         25 |           11
(1 row)

num_keys = 327680, height = 4, n4 = 0, n16 = 21846, n32 = 0, n128 = 0, n256 = 0

In trying to wrap the SIMD code behind layers of abstraction, the latest patch (and Nathan's cleanup) threw it away in almost all cases. To explain, we need to talk about how vectorized code deals with the "tail" that is too small for the register:

1. Use a one-by-one algorithm, like we do for the pg_lfind* variants.
2. Read some junk into the register and mask off false positives from the result.

There are advantages to both depending on the situation. Patch v5 and earlier used #2. Patch v6 used #1, so if a node16 has 15 elements or less, it will iterate over them one-by-one exactly like a node4. Only when full with 16 will the vector path be taken. When another entry is added, the elements are copied to the next bigger node, so there's a *small* window where it's fast.
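Strategy #2, comparing the full register width and masking off the junk lanes, can be shown with a portable sketch. This version uses a fixed-trip-count scalar loop standing in for a single vector compare-plus-movemask (real code would use SSE2/NEON intrinsics); the node layout and names are hypothetical:

```c
#include <stdint.h>

/*
 * Illustrative sketch of tail handling strategy #2: always compare all 16
 * lanes of a node16, then mask off lanes beyond the node's count, instead
 * of falling back to a one-by-one loop for partially filled nodes.
 */
typedef struct
{
	uint8_t		count;			/* number of valid entries, 0..16 */
	uint8_t		chunks[16];		/* always 16 bytes; lanes >= count hold junk */
} ToyNode16;

/* Return the index of 'key' among the valid lanes, or -1 if absent. */
static int
toy_node16_search(const ToyNode16 *node, uint8_t key)
{
	uint32_t	matches = 0;

	/*
	 * Fixed-trip-count loop over all 16 lanes; a vectorizing compiler can
	 * turn this into one compare plus movemask.  Junk lanes may match too.
	 */
	for (int i = 0; i < 16; i++)
		matches |= (uint32_t) (node->chunks[i] == key) << i;

	/* Mask off false positives coming from the junk lanes. */
	matches &= (1u << node->count) - 1;

	if (matches == 0)
		return -1;
	return __builtin_ctz(matches);	/* index of the first match */
}
```

The point of the sketch is that a node with 15 entries takes exactly the same path as a full node; no scalar fallback window exists.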
In short, this code needs to be lower level so that we still have full control while being portable. I will work on this, and also the related code for node dispatch. Since v6 has some good infrastructure to do low-level benchmarking, I also want to do some experiments with memory management. (I have further comments about the code, but I will put that off until later) > I'll consider how to integrate with vacuum as the next step. One > concern for me is how to limit the memory usage to > maintenance_work_mem. Unlike using a flat array, memory space for > adding one TID varies depending on the situation. If we want strictly > not to allow using memory more than maintenance_work_mem, probably we > need to estimate the memory consumption in a conservative way. +1 -- John Naylor EDB: http://www.enterprisedb.com commit 18407962e96ccec6c9aeeba97412edd762a5a4fe Author: John Naylor Date: Wed Sep 21 11:44:43 2022 +0700 Add special benchmark function to test effect of fanout diff --git a/contrib/bench_radix_tree/Makefile b/contrib/bench_radix_tree/Makefile index b8f70e12d1..952bb0ceae 100644 --- a/contrib/bench_radix_tree/Makefile +++ b/contrib/bench_radix_tree/Makefile @@ -7,7 +7,7 @@ OBJS = \ EXTENSION = bench_radix_tree DATA = bench_radix_tree--1.0.sql -REGRESS = bench +REGRESS = bench_fixed_height ifdef USE_PGXS PG_CONFIG = pg_config diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql index 6663abe6a4..f2fee15b17 100644 --- a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql +++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql @@ -40,3 +40,15 @@ OUT load_ms int8) returns record as 'MODULE_PATHNAME' LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE; + +create function bench_fixed_height_search( +fanout int4, +OUT fanout int4, +OUT nkeys int8, +OUT rt_mem_allocated int8, +OUT rt_load_ms int8, +OUT rt_search_ms int8 +) +returns record +as 'MODULE_PATHNAME' +LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE; diff --git 
a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c index 5806ef7519..0778da2d7b 100644 --- a/contrib/bench_radix_tree/bench_radix_tree.c +++ b/contrib/bench_radix_tree/bench_radix_tree.c @@ -13,6 +13,7 @@ #include "fmgr.h" #include "funcapi.h" #include "lib/radixtree.h" +#include #include "miscadmin.h" #include "utils/timestamp.h" @@ -24,6 +25,7 @@ PG_MODULE_MAGIC; PG_FUNCTION_INFO_V1(bench_seq_search); PG_FUNCTION_INFO_V1(bench_shuffle_search); PG_FUNCTION_INFO_V1(bench_load_random_int); +PG_FUNCTION_INFO_V1(bench_fixed_height_search); static radix_tree *rt = NULL; static ItemPointer itemptrs = NULL; @@ -299,3 +301,108 @@ bench_load_random_int(PG_FUNCTION_ARGS) rt_free(rt); PG_RETURN_DATUM(HeapTupleGetDatum(h
Re: Fix snapshot name for SET TRANSACTION documentation
On 2022/09/21 12:01, Japin Li wrote: Hi hackers, In 6c2003f8a1bbc7c192a2e83ec51581c018aa162f, we change the snapshot name when exporting snapshot, however, there is one place we missed update the snapshot name in documentation. Attach a patch to fix it. Thanks for the patch! Looks good to me. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Re: make additional use of optimized linear search routines
On Sat, Sep 03, 2022 at 10:06:58AM +0900, Michael Paquier wrote: > Ohoh. This sounds like a good idea to me, close to what John has > applied lately. I'll take a closer look.. So, the two code paths patched here are rather isolated. The one in TransactionIdIsInProgress() requires an overflowed set of subxids still running, something similar to what the isolation test subxid-overflow does. XidIsConcurrent() is also kind of hard to reason about with a benchmark. Anyway, I did not know about the work done with SIMD instructions in pg_lfind.h, and after playing with the API I have run some micro-benchmarks on pg_lfind32() and I can see some improvements. With a range of 100~10k elements in a fixed number of repeated calls with a for loop and lfind(), I could not get up to the 40% speedup. That was somewhat closer to 15%~20% on x86 and 20%~25% with arm64. There is a trend where things got better with a higher number of elements with lfind(). In short, switching those code paths to use the linear search routines looks like a good thing in the long-term, so I would like to apply this patch. If you have any comments or objections, please feel free. -- Michael signature.asc Description: PGP signature
Re: [PATCH]Feature improvement for MERGE tab completion
On 2022/09/21 0:51, Alvaro Herrera wrote: The rules starting at line 4111 make me a bit nervous, since nowhere we're restricting them to operating only on MERGE lines. I don't think it's a real problem since USING is not terribly common anyway. Likewise for the ones with WHEN [NOT] MATCHED. I kinda wish we had a way to search for stuff like "keyword MERGE appears earlier in the command", but we don't have that. Yeah, I was thinking the same when updating the patch. How about adding something like PartialMatches() that checks whether the keywords are included in the input string or not? If so, we can restrict some tab-completion rules to operating only on MERGE, as follows. I attached the WIP patch (0002 patch) that introduces PartialMatches(). Is this approach over-complicated? Thought? + else if (PartialMatches("MERGE", "INTO", MatchAny, "USING") || +PartialMatches("MERGE", "INTO", MatchAny, "AS", MatchAny, "USING") || +PartialMatches("MERGE", "INTO", MatchAny, MatchAny, "USING")) + { + /* Complete MERGE INTO ... ON with target table attributes */ + if (TailMatches("INTO", MatchAny, "USING", MatchAny, "ON")) + COMPLETE_WITH_ATTR(prev4_wd); + else if (TailMatches("INTO", MatchAny, "AS", MatchAny, "USING", MatchAny, "AS", MatchAny, "ON")) + COMPLETE_WITH_ATTR(prev8_wd); + else if (TailMatches("INTO", MatchAny, MatchAny, "USING", MatchAny, MatchAny, "ON")) + COMPLETE_WITH_ATTR(prev6_wd); Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATIONFrom 671b4dff2e259b4d148dcbfaee0611fc2bddce85 Mon Sep 17 00:00:00 2001 From: Fujii Masao Date: Wed, 21 Sep 2022 11:46:59 +0900 Subject: [PATCH 1/2] psql: Improve tab-completion for MERGE. 
--- src/bin/psql/tab-complete.c | 102 +++- 1 file changed, 67 insertions(+), 35 deletions(-) diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c index f3465adb85..820f47d23a 100644 --- a/src/bin/psql/tab-complete.c +++ b/src/bin/psql/tab-complete.c @@ -1669,7 +1669,7 @@ psql_completion(const char *text, int start, int end) "COMMENT", "COMMIT", "COPY", "CREATE", "DEALLOCATE", "DECLARE", "DELETE FROM", "DISCARD", "DO", "DROP", "END", "EXECUTE", "EXPLAIN", "FETCH", "GRANT", "IMPORT FOREIGN SCHEMA", "INSERT INTO", "LISTEN", "LOAD", "LOCK", - "MERGE", "MOVE", "NOTIFY", "PREPARE", + "MERGE INTO", "MOVE", "NOTIFY", "PREPARE", "REASSIGN", "REFRESH MATERIALIZED VIEW", "REINDEX", "RELEASE", "RESET", "REVOKE", "ROLLBACK", "SAVEPOINT", "SECURITY LABEL", "SELECT", "SET", "SHOW", "START", @@ -3641,7 +3641,7 @@ psql_completion(const char *text, int start, int end) */ else if (Matches("EXPLAIN")) COMPLETE_WITH("SELECT", "INSERT INTO", "DELETE FROM", "UPDATE", "DECLARE", - "MERGE", "EXECUTE", "ANALYZE", "VERBOSE"); + "MERGE INTO", "EXECUTE", "ANALYZE", "VERBOSE"); else if (HeadMatches("EXPLAIN", "(*") && !HeadMatches("EXPLAIN", "(*)")) { @@ -3660,12 +3660,12 @@ psql_completion(const char *text, int start, int end) } else if (Matches("EXPLAIN", "ANALYZE")) COMPLETE_WITH("SELECT", "INSERT INTO", "DELETE FROM", "UPDATE", "DECLARE", - "MERGE", "EXECUTE", "VERBOSE"); + "MERGE INTO", "EXECUTE", "VERBOSE"); else if (Matches("EXPLAIN", "(*)") || Matches("EXPLAIN", "VERBOSE") || Matches("EXPLAIN", "ANALYZE", "VERBOSE")) COMPLETE_WITH("SELECT", "INSERT INTO", "DELETE FROM", "UPDATE", "DECLARE", - "MERGE", "EXECUTE"); + "MERGE INTO", "EXECUTE"); /* FETCH && MOVE */ @@ -4065,58 +4065,90 @@ psql_completion(const char *text, int start, int end) else if (HeadMatches("LOCK") && TailMatches("IN", "SHARE")) COMPLETE_WITH("MODE", "ROW EXCLUSIVE MODE", "UPDATE EXCLUSIVE MODE"); + + /* Complete LOCK [TABLE] [ONLY] [IN lockmode MODE] with "NOWAIT" */ + else if (HeadMatches("LOCK") && 
TailMatches("MODE")) + COMPLETE_WITH("NOWAIT"); + /* MERGE --- can be inside EXPLAIN */ else if (TailMatches("MERGE")) COMPLETE_WITH("INTO"); else if (TailMatches("MERGE", "INTO")) COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_mergetargets); + + /* Complete MERGE INTO [[AS] ] with USING */ else if (TailMatches("MERGE", "INTO", MatchAny))
Re: Add common function ReplicationOriginName.
On Tue, Sep 20, 2022 at 2:06 PM Aleksander Alekseev wrote: > > Hi Amit, > > > I think it is better to use Size. Even though, it may not fail now as > > the size of names for origin will always be much lesser but it is > > better if we are consistent. If we agree with this, then as a first > > patch, we can make it to use Size in existing places and then > > introduce this new function. > > OK, here is the updated patchset. > > * 0001 replaces int's with Size's in the existing code > Pushed this one. > * 0002 applies Peter's patch on top of 0001 > Can't we use the existing function ReplicationOriginNameForTablesync() by passing relid as InvalidOid for this purpose? We need a check inside to decide which name to construct, otherwise, it should be fine. If we agree with this, then we can change the name of the function to something like ReplicationOriginNameForLogicalRep or ReplicationOriginNameForLogicalRepWorkers. -- With Regards, Amit Kapila.
Re: [RFC] building postgres with meson - v13
Hi, On 2022-09-21 09:52:48 +0700, John Naylor wrote: > On Wed, Sep 21, 2022 at 7:11 AM Andres Freund wrote: > > > > Hi, > > > > On 2022-09-19 19:16:30 -0700, Andres Freund wrote: > > > To have some initial "translation" for other developers I've started a > wiki > > > page with a translation table. Still very WIP: > > > https://wiki.postgresql.org/wiki/Meson > > > > > > For now, a bit of polishing aside, I'm just planning to add a minimal > > > explanation of what's happening, and a reference to this thread. > > > > I added installation instructions for meson for a bunch of platforms, but > > Small typo: The homebrew section is still labeled with "find MacPorts > libraries". Thanks, fixed. I wrote these blindly, so there's probably more wrong than this - although Thomas was helpful enough to provide some information / testing. Greetings, Andres Freund
Re: predefined role(s) for VACUUM and ANALYZE
On Wed, Sep 21, 2022 at 10:31:47AM +0900, Michael Paquier wrote: > Did you just run an aclupdate()? 4% for aclitem[] sounds like quite a > number to me :/ It may be worth looking at if these operations could > be locally optimized more, as well. I'd like to think that we could > live with that to free up enough bits in AclItems for the next 20 > years, anyway. Any opinions? Yes, the test was mostly for aclupdate(). Looking at that function, I bet most of its time is spent in palloc0() and memcpy(). It might be possible to replace the linear search if the array was sorted, but I'm skeptical that will help much. In the end, I'm not sure it's worth worrying too much about 2,000 calls to aclupdate() with an array of 2,000 ACLs taking 5.3 seconds instead of 5.1 seconds. I bet a more pressing concern is the calls to aclmask() since checking privileges is probably done more frequently than updating them. That appears to use a linear search, too, so maybe sorting the aclitem arrays is actually worth exploring. I still doubt there will be much noticeable impact from expanding AclMode outside of the most extreme cases. > For the column sizes of the catalogs, I was wondering about how > pg_column_size() changes when they hold ACL information. Unoptimized > alignment could cause an unnecessary increase in the structure sizes, > so the addition of new fields or changes in object size could have > unexpected side effects. After a few tests, I haven't discovered any changes to the output of pg_column_size(). -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
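The idea of keeping the aclitem array sorted so aclmask()'s linear scan becomes a binary search can be sketched with stdlib bsearch(). The AclEntry struct here is a hypothetical simplification, not the real AclItem layout:

```c
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t Oid;

/* Hypothetical simplified ACL entry: grantee OID plus a privilege bitmask. */
typedef struct
{
	Oid			grantee;
	uint64_t	privs;
} AclEntry;

static int
aclentry_cmp(const void *a, const void *b)
{
	Oid			ga = ((const AclEntry *) a)->grantee;
	Oid			gb = ((const AclEntry *) b)->grantee;

	return (ga > gb) - (ga < gb);
}

/*
 * Binary search over a grantee-sorted array, replacing the linear scan.
 * O(log n) lookups at the cost of keeping the array sorted on update.
 */
static const AclEntry *
acl_lookup(const AclEntry *acl, size_t n, Oid grantee)
{
	AclEntry	key = { grantee, 0 };

	return bsearch(&key, acl, n, sizeof(AclEntry), aclentry_cmp);
}
```

The trade-off mentioned in the thread applies: updates (aclupdate()) would pay an insertion-sort cost, while the much more frequent privilege checks get the speedup.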
Re: Virtual tx id
Hi, On Wed, Sep 21, 2022 at 01:58:47PM +1000, James Sewell wrote: > Hello Hackers! > > Is it possible to get the current virtual txid from C somehow? > > I've looked through the code, but can't seem to find anything other than > getting a NULL when there is no (real) xid assigned. Maybe I'm missing > something? It should be MyProc->lxid, and prepend it with MyBackendId if you want to compare something like pg_locks.virtualtransaction.
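Putting those two fields together, the virtual txid string as it appears in pg_locks.virtualtransaction is just "backendId/lxid". A self-contained mock of the composition (the MockProc struct is a stand-in for the relevant PGPROC fields, which in the backend are MyProc->backendId and MyProc->lxid):

```c
#include <stdio.h>

/*
 * Mock of the two PGPROC fields that make up a virtual transaction id.
 * In a real backend you would read MyProc->backendId and MyProc->lxid;
 * this struct exists only so the sketch compiles standalone.
 */
typedef struct
{
	int				backendId;
	unsigned int	lxid;
} MockProc;

/* Format "backendId/lxid", matching pg_locks.virtualtransaction. */
static void
format_vxid(char *buf, size_t buflen, const MockProc *proc)
{
	snprintf(buf, buflen, "%d/%u", proc->backendId, proc->lxid);
}
```

Note, as the replies point out, that MyProc->lxid is only assigned once a transaction has actually started doing work, so the string is meaningful only inside an active transaction.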
Re: Virtual tx id
Hi, On Sep 21, 2022, 11:59 +0800, James Sewell , wrote: > Hello Hackers! > > Is it possible to get the current virtual txid from C somehow? > > I've looked through the code, but can't seem to find anything other than > getting a NULL when there is no (real) xid assigned. Maybe I'm missing > something? > > Cheers, > James Virtual xid is meaningful only inside a read-only transaction. It's made up of MyProc->backendId and MyProc->lxid. To catch it, you can begin a transaction, but don't exec SQL that could change the db. And gdb on the process to see (you must exec a read-only SQL to call StartTransaction, else MyProc->lxid is not assigned). ``` Begin; // do nothing ``` gdb on it and see ``` p MyProc->lxid p MyProc->backendId ``` Regards, Zhang Mingli
Re: Virtual tx id
On Wed, 21 Sep 2022 at 11:58, James Sewell wrote: > Hello Hackers! > > Is it possible to get the current virtual txid from C somehow? > The virtual txid consists of MyProc->backendId and MyProc->lxid. Do you mean a C function that returns the virtual txid? > I've looked through the code, but can't seem to find anything other than > getting a NULL when there is no (real) xid assigned. Maybe I'm missing > something? > Do you mean using a SQL function to check the virtual txid? IIRC, there are no such functions. Maybe you can use pg_export_snapshot() to get the virtual txid (v10 and later); the filename of the exported snapshot contains MyProc->backendId and MyProc->lxid. -- Regards, Japin Li. ChengDu WenWu Information Technology Co.,Ltd.
Virtual tx id
Hello Hackers! Is it possible to get the current virtual txid from C somehow? I've looked through the code, but can't seem to find anything other than getting a NULL when there is no (real) xid assigned. Maybe I'm missing something? Cheers, James
Fix snapshot name for SET TRANSACTION documentation
Hi hackers, In 6c2003f8a1bbc7c192a2e83ec51581c018aa162f, we changed the snapshot name when exporting a snapshot; however, there is one place where we missed updating the snapshot name in the documentation. Attached is a patch to fix it. -- Regards, Japin Li. ChengDu WenWu Information Technology Co.,Ltd. >From 5eac3c0ca274da30b2e2cf0e19887892d5f92788 Mon Sep 17 00:00:00 2001 From: Japin Li Date: Wed, 21 Sep 2022 10:51:22 +0800 Subject: [PATCH v1] Fix snapshot name for set transaction documentation --- doc/src/sgml/ref/set_transaction.sgml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/src/sgml/ref/set_transaction.sgml b/doc/src/sgml/ref/set_transaction.sgml index e062e2461e..d394e622f8 100644 --- a/doc/src/sgml/ref/set_transaction.sgml +++ b/doc/src/sgml/ref/set_transaction.sgml @@ -164,7 +164,7 @@ SET SESSION CHARACTERISTICS AS TRANSACTION transa snapshot identifier, which must be given to SET TRANSACTION SNAPSHOT to specify which snapshot is to be imported. The identifier must be written as a string literal in this command, for example - '03A1-1'. + '0003-001B-1'. SET TRANSACTION SNAPSHOT can only be executed at the start of a transaction, before the first query or data-modification statement (SELECT, -- 2.25.1
Re: [RFC] building postgres with meson - v13
On Wed, Sep 21, 2022 at 7:11 AM Andres Freund wrote: > > Hi, > > On 2022-09-19 19:16:30 -0700, Andres Freund wrote: > > To have some initial "translation" for other developers I've started a wiki > > page with a translation table. Still very WIP: > > https://wiki.postgresql.org/wiki/Meson > > > > For now, a bit of polishing aside, I'm just planning to add a minimal > > explanation of what's happening, and a reference to this thread. > > I added installation instructions for meson for a bunch of platforms, but Small typo: The homebrew section is still labeled with "find MacPorts libraries". -- John Naylor EDB: http://www.enterprisedb.com
RE: why can't a table be part of the same publication as its schema
On Wednesday, September 21, 2022 4:06 AM Mark Dilger wrote: > > On Sep 20, 2022, at 12:36 PM, Jonathan S. Katz > wrote: > > > > This behavior exists "FOR ALL TABLES" without the "IN SCHEMA" qualifier. > This was discussed multiple times on the original thread[1]. > > Yes, nobody is debating that as far as I can see. And I do take your point that > this stuff was discussed in other threads quite a while back. > > > I tried to diligently read the sections where we talk about granting + privileges[2][3] to see what it says about "ALL * IN SCHEMA". Unless I missed it, > and I read through it twice, it does not explicitly state whether or not "GRANT" > applies to all objects at only that given moment, or to future objects of that > type which are created in that schema. Maybe the behavior is implied or is part > of the standard, but it's not currently documented. > > Interesting. Thanks for that bit of research. > > > We do link to "ALTER DEFAULT PRIVILEGES" at the bottom of the GRANT[2] > docs, but we don't give any indication as to why. > > > > (This is also to say we should document in GRANT that ALL * IN SCHEMA does > not apply to future objects; > > Yes, I agree this should be documented. > > > if you need that behavior use ALTER DEFAULT PRIVILEGES. Separate thread :) > > > > I understand there is a risk of confusion of the similar grammar across > commands, but the current command in logical replication has this is building > on the existing behavior. > > I don't complain that it is building on the existing behavior. I'm *only* > concerned about the keywords we're using for this. Consider the following: > >-- AS ADMIN >CREATE USER bob NOSUPERUSER; >GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA foo TO bob; >SET ROLE bob; >CREATE PUBLICATION bobs_pub FOR ALL TABLES IN SCHEMA foo; > > We're going to have that fail in pg15 because the FOR ALL TABLES IN SCHEMA > option is reserved to superusers.
But we agreed that was a stop-gap solution > that we'd potentially loosen in the future. Certainly we'll need wiggle room > in > the syntax to perform that loosening: > >--- Must be superuser for this in pg15, and in subsequent releases. >CREATE PUBLICATION bobs_pub FOR ALL FUTURE TABLES IN SCHEMA foo; > >--- Not supported in pg15, but reserved for some future pg versions to > allow >--- non-superusers to create publications on tables currently in schema > foo, >--- assuming they have sufficient privileges on those tables >CREATE PUBLICATION bobs_pub FOR ALL TABLES IN SCHEMA foo; > > Doing it this way makes the syntax consistent between the GRANT...TO bob and > the CREATE PUBLICATION bobs_pub. Surely this makes more sense? Thanks for the suggestion. My concern is that I am not sure do we really want to add a feature that only publish all the current tables(not future tables). I think, if possible, it would be better to find an approach that can release the superuser restriction for both FOR ALL TABLES and FOR ALL TABLES IN SCHEMA in the future release. I think another solution might be introduce a new publication option (like: include_future). When user execute: CREATE PUBLICATION ... FOR ALL TABLES IN SCHEMA ... WITH (include_future) it means we publish all current and future tables and require superuser. We can set the default value of this option to 'true' and user can set it to false if they only want to publish the current tables and don't want to use superuser. And in this approach, we don't need to change the syntax. Best regards, Hou zj
Re: Make ON_ERROR_STOP stop on shell script failure
On 2022/09/20 15:15, bt22nakamorit wrote: I thought that this action is rather unexpected since, based on the word """ON_ERROR_STOP""", ones may expect that failures of shell scripts should halt the incoming instructions as well. One clear solution is to let failures of shell script stop incoming queries just like how errors of SQLs do currently. Thoughts? +1 I edited the documentation for ON_ERROR_STOP. Any other suggestions? Thanks for the patch! Could you add it to the next CommitFest so that we don't forget it? We can execute the shell commands via psql in various ways other than \! meta-command. For example:

* `command`
* \g | command
* \gx | command
* \o | command
* \w | command
* \copy ... program 'command'

ON_ERROR_STOP should handle not only \! but also all the above in the same way? One concern about this patch is that some applications already depend on the current behavior of ON_ERROR_STOP, i.e., psql doesn't stop even when the shell command returns a non-zero exit code. If so, we might need to extend ON_ERROR_STOP so that it accepts the following setting values.

* off - don't stop even when either sql or shell fails (same as the current behavior)
* on or sql - stop only when sql fails (same as the current behavior)
* shell - stop only when shell fails
* all - stop when either sql or shell fails

Thoughts? Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Re: pg_receivewal fail to streams when the partial file to write is not fully initialized present in the wal receiver directory
On Fri, Aug 19, 2022 at 5:27 PM Bharath Rupireddy wrote: > > Please review the attached v6 patch. I'm attaching the v7 patch rebased on to the latest HEAD. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com From 5ccee7e8ed82952bad6ef410176952aec2733e4f Mon Sep 17 00:00:00 2001 From: Bharath Rupireddy Date: Wed, 21 Sep 2022 01:44:44 + Subject: [PATCH v7] Make WAL file initialization by pg_receivewal atomic While initializing WAL files with zeros (i.e. padding), pg_receivewal can fail for any reason, for instance, no space left on device or a host server crash, which may leave partially padded WAL files because of which pg_receivewal won't come up after the host server restarts. The error it emits is "error: write-ahead log file "foo.partial" has x bytes, should be 0 or y" (where y is typically the WAL file size and x < y). To fix this, one needs to manually intervene and delete the partially padded WAL file, which requires an understanding of the underlying issue and is often a time-consuming process in production environments. The above problem can also exist for pg_basebackup as it uses the same walmethods.c function for initialization. To address the above problem, make WAL file initialization atomic, i.e., first create a temporary file, pad it, and then rename it to the target file. This is similar to what postgres does to make WAL file initialization atomic. 
Authors: Bharath Rupireddy, SATYANARAYANA NARLAPURAM Reviewed-by: Cary Huang, mahendrakar s Reviewed-by: Kyotaro Horiguchi, Michael Paquier Discussion: https://www.postgresql.org/message-id/CAHg%2BQDcVUss6ocOmbLbV5f4YeGLhOCt%2B1x2yLNfG2H_eDwU8tw%40mail.gmail.com --- src/bin/pg_basebackup/pg_receivewal.c | 22 ++ src/bin/pg_basebackup/walmethods.c| 96 --- 2 files changed, 109 insertions(+), 9 deletions(-) diff --git a/src/bin/pg_basebackup/pg_receivewal.c b/src/bin/pg_basebackup/pg_receivewal.c index f98ec557db..1cbc992e67 100644 --- a/src/bin/pg_basebackup/pg_receivewal.c +++ b/src/bin/pg_basebackup/pg_receivewal.c @@ -237,6 +237,28 @@ is_xlogfilename(const char *filename, bool *ispartial, return true; } + /* + * File looks like a temporary partial uncompressed WAL file that was + * leftover during pre-padding phase from a previous crash, delete it now + * as we don't need it. + */ + if (fname_len == XLOG_FNAME_LEN + strlen(".partial.tmp") && + strcmp(filename + XLOG_FNAME_LEN, ".partial.tmp") == 0) + { + /* 12 is length of string ".partial.tmp" */ + char path[MAXPGPATH + XLOG_FNAME_LEN + 12]; + + snprintf(path, sizeof(path), "%s/%s", basedir, filename); + + if (unlink(path) < 0) + pg_log_error("could not remove file \"%s\"", path); + + if (verbose) + pg_log_info("removed file \"%s\"", path); + + return false; + } + /* File does not look like something we know */ return false; } diff --git a/src/bin/pg_basebackup/walmethods.c b/src/bin/pg_basebackup/walmethods.c index bc2e83d02b..4d34d4346a 100644 --- a/src/bin/pg_basebackup/walmethods.c +++ b/src/bin/pg_basebackup/walmethods.c @@ -118,7 +118,10 @@ dir_open_for_write(WalWriteMethod *wwmethod, const char *pathname, const char *temp_suffix, size_t pad_to_size) { DirectoryMethodData *dir_data = (DirectoryMethodData *) wwmethod; - char tmppath[MAXPGPATH]; + char targetpath[MAXPGPATH]; + char tmpsuffixpath[MAXPGPATH]; + bool shouldcreatetempfile = false; + int flags; char *filename; int fd; DirectoryMethodFile *f; @@ 
-134,7 +137,7 @@ dir_open_for_write(WalWriteMethod *wwmethod, const char *pathname, clear_error(wwmethod); filename = dir_get_file_name(wwmethod, pathname, temp_suffix); - snprintf(tmppath, sizeof(tmppath), "%s/%s", + snprintf(targetpath, sizeof(targetpath), "%s/%s", dir_data->basedir, filename); pg_free(filename); @@ -143,12 +146,57 @@ dir_open_for_write(WalWriteMethod *wwmethod, const char *pathname, * the file descriptor is important for dir_sync() method as gzflush() * does not do any system calls to fsync() to make changes permanent on * disk. + * + * Create a temporary file for padding and no compression cases to ensure + * a fully initialized file is created. */ - fd = open(tmppath, O_WRONLY | O_CREAT | PG_BINARY, pg_file_create_mode); + if (pad_to_size > 0 && + wwmethod->compression_algorithm == PG_COMPRESSION_NONE) + { + shouldcreatetempfile = true; + flags = O_WRONLY | PG_BINARY; + } + else + { + shouldcreatetempfile = false; + flags = O_WRONLY | O_CREAT | PG_BINARY; + } + + fd = open(targetpath, flags, pg_file_create_mode); if (fd < 0) { - wwmethod->lasterrno = errno; - return NULL; + if (errno == ENOENT && shouldcreatetempfile) + { + /* + * Actual file doesn't exist. Now, create a temporary file pad it + * and rename to the target file. The temporary file may exist from + * the last failed attempt (failed during partial padding or + * renaming or some ot
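The commit message above describes the fix as create-pad-rename: fully pad a temporary file, make it durable, then rename it into place, so a crash can never leave a partially padded target file behind. Here is a minimal Python sketch of that pattern under stated assumptions (the function name, suffix, and segment size are illustrative, not the patch's actual values, and the real implementation is in C in walmethods.c):

```python
import os
import tempfile

WAL_SEG_SIZE = 16 * 1024  # illustrative; real WAL segments default to 16 MB

def init_wal_file_atomically(basedir, name, pad_to_size=WAL_SEG_SIZE):
    """Create `name` fully zero-padded, or not at all: pad a temporary
    file first, fsync it, then rename it into place (rename is atomic)."""
    target = os.path.join(basedir, name)
    tmp = target + ".tmp"  # mirrors the patch's ".partial.tmp" suffix idea
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        os.write(fd, b"\x00" * pad_to_size)  # pre-padding with zeros
        os.fsync(fd)                         # make the padding durable
    finally:
        os.close(fd)
    os.rename(tmp, target)  # readers now see either no file or a full one
    return target

basedir = tempfile.mkdtemp()
path = init_wal_file_atomically(basedir, "000000010000000000000001.partial")
```

A crash before the rename leaves only a `.tmp` leftover, which is exactly the case the patch's new cleanup code in pg_receivewal.c handles by deleting stray `.partial.tmp` files at startup.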
RE: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher
> One last thing - do you think there is any need to mention this > behaviour in the pgdocs, or is it OK just to be a hidden performance > improvement? FYI - here is my opinion. We have the following sentence in logical-replication.sgml: ``` ... If the table does not have any suitable key, then it can be set to replica identity full, which means the entire row becomes the key. This, however, is very inefficient and should only be used as a fallback if no other solution is possible. ... ``` Here the phrase "very inefficient" may mean that sequential scans will be executed every time. I think some description could be added around here. Best Regards, Hayato Kuroda FUJITSU LIMITED
Re: why can't a table be part of the same publication as its schema
[personal views, not RMT] On 9/20/22 4:06 PM, Mark Dilger wrote: I don't complain that it is building on the existing behavior. I'm *only* concerned about the keywords we're using for this. Consider the following: -- AS ADMIN CREATE USER bob NOSUPERUSER; GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA foo TO bob; SET ROLE bob; CREATE PUBLICATION bobs_pub FOR ALL TABLES IN SCHEMA foo; We're going to have that fail in pg15 because the FOR ALL TABLES IN SCHEMA option is reserved to superusers. But we agreed that was a stop-gap solution that we'd potentially loosen in the future. Certainly we'll need wiggle room in the syntax to perform that loosening: --- Must be superuser for this in pg15, and in subsequent releases. CREATE PUBLICATION bobs_pub FOR ALL FUTURE TABLES IN SCHEMA foo; --- Not supported in pg15, but reserved for some future pg versions to allow --- non-superusers to create publications on tables currently in schema foo, --- assuming they have sufficient privileges on those tables CREATE PUBLICATION bobs_pub FOR ALL TABLES IN SCHEMA foo; Doing it this way makes the syntax consistent between the GRANT...TO bob and the CREATE PUBLICATION bobs_pub. Surely this makes more sense? When you put it that way, I see your point. However, for the lesser-privileged user, will the behavior be that it will continue to add all future tables in a schema to the publication so long as they have sufficient privileges on those tables? Or would that mirror the current behavior with GRANT? While I understand it makes the syntax consistent, the one concern I raise is that the less privileged user could have a less convenient user experience than the privileged user. Perhaps that's OK, but it is worth noting. I'm not a huge fan of the keyword "FUTURE" here, but I found a reference to another database that uses that keyword for what I think is a similar purpose. I did try doing research on this prior, but hadn't thought to incorporate "future" into my searches. 
Doing so, I probably found the same database that you did that used the "FUTURE" keyword for adding permissions to future objects (and this is fresh, as the docs for it were published last week). That's definitely interesting. I did see some notes on a legacy database system that offered similar advice to what we do for GRANT if you're not using ALTER DEFAULT PRIVILEGES. We should choose *something* for this, though, if we want things to be rational going forward. That all said, while I understand your point and am open to the suggestion of "FUTURE", I'm not convinced by the syntax change. But I'll sleep on it. Jonathan OpenPGP_signature Description: OpenPGP digital signature
RE: Perform streaming logical transactions by background workers and parallel apply
> FYI - > > The latest patch 30-0001 fails to apply, it seems due to a recent commit [1]. > > [postgres@CentOS7-x64 oss_postgres_misc]$ git apply > ../patches_misc/v30-0001-Perform-streaming-logical-transactions-by- > parall.patch > error: patch failed: src/include/replication/logicalproto.h:246 > error: src/include/replication/logicalproto.h: patch does not apply Thanks for the kind reminder. I rebased the patch set and attached it in [1]. [1] - https://www.postgresql.org/message-id/OS3PR01MB6275298521AE1BBEF5A055EE9E4F9%40OS3PR01MB6275.jpnprd01.prod.outlook.com Regards, Wang wei
Re: Perform streaming logical transactions by background workers and parallel apply
FYI - The latest patch 30-0001 fails to apply, it seems due to a recent commit [1]. [postgres@CentOS7-x64 oss_postgres_misc]$ git apply ../patches_misc/v30-0001-Perform-streaming-logical-transactions-by-parall.patch error: patch failed: src/include/replication/logicalproto.h:246 error: src/include/replication/logicalproto.h: patch does not apply -- [1] https://github.com/postgres/postgres/commit/bfcf1b34805f70df48eedeec237230d0cc1154a6 Kind Regards, Peter Smith. Fujitsu Australia
Re: Hash index build performance tweak from sorting
On Tue, 2 Aug 2022 at 03:37, Simon Riggs wrote: > Using the above test case, I'm getting a further 4-7% improvement on > already committed code with the attached patch, which follows your > proposal. > > The patch passes info via a state object, useful to avoid API churn in > later patches. Hi Simon, I took this patch for a spin and saw a 2.5% performance increase using the random INT test that Tom posted. The index took an average of 7227.47 milliseconds on master and 7045.05 with the patch applied. On making a pass of the changes, I noted down a few things. 1. In _h_spoolinit() the HSpool is allocated with palloc and then you're setting the istate field to a pointer to the HashInsertState which is allocated on the stack by the only calling function (hashbuild()). Looking at hashbuild(), it looks like the return value of _h_spoolinit is never put anywhere to make it available outside of the function, so it does not seem like there is an actual bug there. However, it just seems like a bug waiting to happen. If _h_spoolinit() is pallocing memory, then we really shouldn't be setting pointer fields in that memory to point to something on the stack. It might be nicer if the istate field in HSpool was a HashInsertStateData and _h_spoolinit() just memcpy'd the contents of that parameter. That would make HSpool 4 bytes smaller and save additional pointer dereferences in _hash_doinsert(). 2. There are quite a few casts that are not required. e.g: _hash_doinsert(rel, itup, heapRel, (HashInsertState) &istate); buildstate.spool = _h_spoolinit(heap, index, num_buckets, (HashInsertState) &insertstate); buildstate.istate = (HashInsertState) &insertstate; This is just my opinion, but I don't really see the value in having a typedef for a pointer to HashInsertStateData. I can understand that if the struct was local to a .c file, but you've got the struct and pointer typedef in the same header. 
I understand we often do this in the code, but I feel like we do it less often in newer code. e.g we do it in aset.c but not generation.c (which is much newer than aset.c). My personal preference would be just to name the struct HashInsertState and have no extra pointer typedefs. 3. Just a minor nitpick. Line wraps at 80 chars. You're doing this sometimes but not others. This seems just to be due to the additional function parameters that have been added. 4. I added the following Assert to _hash_pgaddtup() as I expected the itup_off to be set to the same thing before and after this change. I see the Assert is failing in the regression tests. Assert(PageGetMaxOffsetNumber(page) + 1 == _hash_binsearch(page, _hash_get_indextuple_hashkey(itup))); I think this is because _hash_binsearch() returns the offset with the first tuple with the given hashkey, so if there are duplicate hashkey values then it looks like PageAddItemExtended() will set needshuffle and memmove() the existing item(s) up one slot. I don't know this hash index building code very well, but I wonder if it's worth having another version of _hash_binsearch() that can be used to make _hash_pgaddtup() put any duplicate hashkeys after the existing ones rather than before and shuffle the others up? It sounds like that might speed up normal insertions when there are many duplicate values to hash. I wonder if this might be the reason the random INT test didn't come out as good as your original test which had unique values. The unique values test would do less shuffling during PageAddItemExtended(). If so, that implies that skipping the binary search is only part of the gains here and that not shuffling tuples accounts for quite a bit of the gain you're seeing. If so, then it would be good to not have to shuffle duplicate hashkey tuples up in the page during normal insertions as well as when building the index. 
In any case, it would be nice to have some way to assert that we don't accidentally pass sorted==true to _hash_pgaddtup() when there's an existing item on the page with a higher hash value. Maybe we could just look at the hash value of the last tuple on the page and ensure it's <= to the current one? 5. I think it would be nicer to move the insertstate.sorted = false; into the else branch in hashbuild(). However, you might have to do that anyway if you were to do what I mentioned in #1. David
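The duplicate-hashkey point in item 4 above can be sketched with Python's bisect module (this is only an analogy to _hash_binsearch(), not PostgreSQL's code): a lower-bound search returns the offset of the first entry with a given key, so inserting another duplicate there forces the existing ones to shuffle up, whereas an upper-bound search places the new entry after its duplicates with no shuffling.

```python
from bisect import bisect_left, bisect_right

# Hashkeys already on a (sorted) bucket page, with duplicates of 20.
hashkeys = [10, 20, 20, 20, 30]

# Lower bound: first slot whose key >= 20. Inserting a new 20 here means
# every existing 20 (and the 30) must be moved up one slot.
lower = bisect_left(hashkeys, 20)   # -> 1

# Upper bound: first slot whose key > 20. Inserting here puts the new 20
# after its duplicates, so nothing needs to be shuffled.
upper = bisect_right(hashkeys, 20)  # -> 4
```

Either insertion position keeps the page sorted by hashkey; the difference is purely how much memmove() work PageAddItemExtended() has to do for the existing items.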
Re: predefined role(s) for VACUUM and ANALYZE
On Tue, Sep 20, 2022 at 04:50:10PM -0700, Nathan Bossart wrote: > On Tue, Sep 20, 2022 at 04:31:17PM -0700, Nathan Bossart wrote: >> On Tue, Sep 20, 2022 at 11:05:33AM -0700, Nathan Bossart wrote: >>> On Tue, Sep 20, 2022 at 02:45:52PM +0900, Michael Paquier wrote: Any impact for the column sizes of the catalogs holding ACL information? Just asking while browsing the patch set. >>> >>> Since each aclitem requires 16 bytes instead of 12, I assume so. However, >>> in my testing, I hit a "row is too big" error with the same number of >>> aclitems in a pg_class row before and after the change. I might be missing >>> something in my patch, or maybe I am misunderstanding how arrays of >>> aclitems are stored on disk. >> >> Ah, it looks like relacl is compressed. The column is marked "extended," >> but pg_class doesn't appear to have a TOAST table, so presumably no >> out-of-line storage can be used. I found a couple of threads about this >> [0] [1] [2]. Adding a toast table to pg_class has been a sensitive topic over the years. Based on my recollection of the events, there were worries about the potential cross-dependencies with pg_class and pg_attribute that this would create. > I suppose there is some risk that folks with really long aclitem arrays > might be unable to pg_upgrade to a version with uint64 AclModes, but I > suspect that risk is limited to extreme cases (i.e., multiple thousands of > aclitems). I'm not sure whether that's worth worrying about too much. Did you just run an aclupdate()? 4% for aclitem[] sounds like quite a number to me :/ It may be worth looking at if these operations could be locally optimized more, as well. I'd like to think that we could live with that to free up enough bits in AclItems for the next 20 years, anyway. Any opinions? For the column sizes of the catalogs, I was wondering about how pg_column_size() changes when they hold ACL information. 
Unoptimized alignment could cause an unnecessary increase in the structure sizes, so the addition of new fields or changes in object size could have unexpected side effects. -- Michael signature.asc Description: PGP signature
Re: Support pg_attribute_aligned and noreturn in MSVC
On Tue, Sep 20, 2022 at 08:01:20AM -0400, James Coleman wrote: > I don't have access to a Windows machine for testing, but re-reading > the documentation it looks like the issue is that our noreturn macro > is used after the definition while the MSVC equivalent is used before. A CI setup would do the job for example, see src/tools/ci/README that explains how to set up things. > I've removed that for now (and commented about it); it's not as > valuable anyway since it's mostly an indicator for code analysis > (human and machine). Except for the fact that the patch missed to define pg_attribute_noreturn() in the MSVC branch, this looks fine to me. I have been looking at what you meant with packing, and I can see the business with PACK(), something actually doable with gcc. That's a first step, at least, and I see no reason not to do it, so applied. -- Michael signature.asc Description: PGP signature
Re: Summary function for pg_buffercache
Hi,

On 2022-09-20 12:45:24 +0300, Aleksander Alekseev wrote:
> > I'm not sure how to avoid any undefined behaviour without locks though.
> > Even with locks, performance is much better. But is it good enough for
> > production?
>
> Potentially you could avoid taking locks by utilizing atomic
> operations and lock-free algorithms. But these algorithms are
> typically error-prone and do not always produce faster code than the
> lock-based ones. I'm pretty confident this is out of scope of this
> particular patch.

Why would you need lock-free operations? All you need to do is to read
BufferDesc->state into a local variable and then make decisions based on
that?

> +	for (int i = 0; i < NBuffers; i++)
> +	{
> +		BufferDesc *bufHdr;
> +		uint32		buf_state;
> +
> +		bufHdr = GetBufferDescriptor(i);
> +
> +		/* Lock each buffer header before inspecting. */
> +		buf_state = LockBufHdr(bufHdr);
> +
> +		/* Invalid RelFileNumber means the buffer is unused */
> +		if (RelFileNumberIsValid(BufTagGetRelNumber(&bufHdr->tag)))
> +		{
> +			buffers_used++;
> +			usagecount_avg += BUF_STATE_GET_USAGECOUNT(buf_state);
> +
> +			if (buf_state & BM_DIRTY)
> +				buffers_dirty++;
> +		}
> +		else
> +			buffers_unused++;
> +
> +		if (BUF_STATE_GET_REFCOUNT(buf_state) > 0)
> +			buffers_pinned++;
> +
> +		UnlockBufHdr(bufHdr, buf_state);
> +	}

I.e. instead of locking the buffer header as done above, this could just
do something along these lines:

	BufferDesc *bufHdr;
	uint32		buf_state;

	bufHdr = GetBufferDescriptor(i);
	buf_state = pg_atomic_read_u32(&bufHdr->state);

	if (buf_state & BM_VALID)
	{
		buffers_used++;
		usagecount_avg += BUF_STATE_GET_USAGECOUNT(buf_state);

		if (buf_state & BM_DIRTY)
			buffers_dirty++;
	}
	else
		buffers_unused++;

	if (BUF_STATE_GET_REFCOUNT(buf_state) > 0)
		buffers_pinned++;

Without a memory barrier you can get very slightly "out-of-date" values
of the state, but that's fine in this case.

Greetings,

Andres Freund
Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher
Hi Onder, Thanks for addressing all my previous feedback. I checked the latest v12-0001, and have no more comments at this time. One last thing - do you think there is any need to mention this behaviour in the pgdocs, or is it OK just to be a hidden performance improvement? -- Kind Regards, Peter Smith. Fujitsu Australia
Re: [RFC] building postgres with meson - v13
Hi, On 2022-09-19 19:16:30 -0700, Andres Freund wrote: > To have some initial "translation" for other developers I've started a wiki > page with a translation table. Still very WIP: > https://wiki.postgresql.org/wiki/Meson > > For now, a bit of polishing aside, I'm just planning to add a minimal > explanation of what's happening, and a reference to this thread. I added installation instructions for meson for a bunch of platforms, but failed to figure out how to do so in a rhel9 container. I don't have a rhel subscription, and apparently the repos with developer tools now require a subscription. Great way to make it easy for projects to test anything on RHEL. Greetings, Andres Freund
Re: predefined role(s) for VACUUM and ANALYZE
On Tue, Sep 20, 2022 at 04:31:17PM -0700, Nathan Bossart wrote: > On Tue, Sep 20, 2022 at 11:05:33AM -0700, Nathan Bossart wrote: >> On Tue, Sep 20, 2022 at 02:45:52PM +0900, Michael Paquier wrote: >>> Any impact for the column sizes of the catalogs holding ACL >>> information? Just asking while browsing the patch set. >> >> Since each aclitem requires 16 bytes instead of 12, I assume so. However, >> in my testing, I hit a "row is too big" error with the same number of >> aclitems in a pg_class row before and after the change. I might be missing >> something in my patch, or maybe I am misunderstanding how arrays of >> aclitems are stored on disk. > > Ah, it looks like relacl is compressed. The column is marked "extended," > but pg_class doesn't appear to have a TOAST table, so presumably no > out-of-line storage can be used. I found a couple of threads about this > [0] [1] [2]. I suppose there is some risk that folks with really long aclitem arrays might be unable to pg_upgrade to a version with uint64 AclModes, but I suspect that risk is limited to extreme cases (i.e., multiple thousands of aclitems). I'm not sure whether that's worth worrying about too much. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
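The 12-versus-16-byte growth discussed above falls out of alignment: an aclitem is two 4-byte Oids plus an AclMode, and widening AclMode from uint32 to uint64 forces 8-byte alignment and padding. A ctypes sketch of the two layouts (field names follow PostgreSQL's AclItem struct; the sizes assume a typical 64-bit ABI):

```python
import ctypes

class AclItem32(ctypes.Structure):
    """Current layout: AclMode is a uint32, so the struct packs to 12 bytes."""
    _fields_ = [("ai_grantee", ctypes.c_uint32),  # Oid
                ("ai_grantor", ctypes.c_uint32),  # Oid
                ("ai_privs", ctypes.c_uint32)]    # AclMode

class AclItem64(ctypes.Structure):
    """Proposed layout: AclMode widened to a uint64; 8-byte alignment
    pushes ai_privs to offset 8 and the struct size to 16 bytes."""
    _fields_ = [("ai_grantee", ctypes.c_uint32),
                ("ai_grantor", ctypes.c_uint32),
                ("ai_privs", ctypes.c_uint64)]
```

That 4-byte-per-item growth is why long relacl arrays get measurably larger on disk after the change, even though TOAST compression hides some of it.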
Re: default sorting behavior for index
Zhihong Yu writes: > I was looking at this check in src/backend/parser/parse_utilcmd.c w.r.t. > constraint: > ... > If the index has DESC sorting order, why it cannot be used to back a > constraint ? > Some concrete sample would help me understand this. Please read the nearby comments, particularly * Insist on default opclass, collation, and sort options. * While the index would still work as a constraint with * non-default settings, it might not provide exactly the same * uniqueness semantics as you'd get from a normally-created * constraint; and there's also the dump/reload problem * mentioned above. The "mentioned above" refers to this: * Insist on it being a btree. That's the only kind that supports * uniqueness at the moment anyway; but we must have an index that * exactly matches what you'd get from plain ADD CONSTRAINT syntax, * else dump and reload will produce a different index (breaking * pg_upgrade in particular). The concern about whether the uniqueness semantics are the same probably mostly applies to just the opclass and collation properties. However, rd_indoption contains AM-specific options, and we have little ability to be sure in this code exactly what those bits might do. In any case we'd definitely have a risk of things breaking during pg_upgrade if we ignore rd_indoption. regards, tom lane
Re: predefined role(s) for VACUUM and ANALYZE
On Tue, Sep 20, 2022 at 11:05:33AM -0700, Nathan Bossart wrote: > On Tue, Sep 20, 2022 at 02:45:52PM +0900, Michael Paquier wrote: >> Any impact for the column sizes of the catalogs holding ACL >> information? Just asking while browsing the patch set. > > Since each aclitem requires 16 bytes instead of 12, I assume so. However, > in my testing, I hit a "row is too big" error with the same number of > aclitems in a pg_class row before and after the change. I might be missing > something in my patch, or maybe I am misunderstanding how arrays of > aclitems are stored on disk. Ah, it looks like relacl is compressed. The column is marked "extended," but pg_class doesn't appear to have a TOAST table, so presumably no out-of-line storage can be used. I found a couple of threads about this [0] [1] [2]. [0] https://postgr.es/m/17245.964897719%40sss.pgh.pa.us [1] https://postgr.es/m/200309040531.h845ViP05881%40candle.pha.pa.us [2] https://postgr.es/m/29061.1265327626%40sss.pgh.pa.us -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: [PoC] Federated Authn/z with OAUTHBEARER
Hi Mahendrakar, thanks for your interest and for the patch! On Mon, Sep 19, 2022 at 10:03 PM mahendrakar s wrote: > The changes for each component are summarized below. > > 1. Provider-specific extension: > Each OAuth provider implements their own token validator as an > extension. Extension registers an OAuth provider hook which is matched > to a line in the HBA file. How easy is it to write a Bearer validator using C? My limited understanding was that most providers were publishing libraries in higher-level languages. Along those lines, sample validators will need to be provided, both to help in review and to get the pytest suite green again. (And coverage for the new code is important, too.) > 2. Add support to pass on the OAuth bearer token. In this > obtaining the bearer token is left to 3rd party application or user. > > ./psql -U -d 'dbname=postgres > oauth_client_id= oauth_bearer_token= This hurts, but I think people are definitely going to ask for it, given the frightening practice of copy-pasting these (incredibly sensitive secret) tokens all over the place... Ideally I'd like to implement sender constraints for the Bearer token, to *prevent* copy-pasting (or, you know, outright theft). But I'm not sure that sender constraints are well-implemented yet for the major providers. > 3. HBA: An additional param ‘provider’ is added for the oauth method. > Defining "oauth" as method + passing provider, issuer endpoint > and expected audience > > * * * * oauth provider= > issuer= scope= Naming aside (this conflicts with Samay's previous proposal, I think), I have concerns about the implementation. There's this code: > + if (oauth_provider && oauth_provider->name) > + { > + ereport(ERROR, > + (errmsg("OAuth provider \"%s\" is already > loaded.", > + oauth_provider->name))); > + } which appears to prevent loading more than one global provider. But there's also code that deals with a provider list? (Again, it'd help to have test code covering the new stuff.) 
>b) libpq optionally compiled for the clients which > explicitly need libpq to orchestrate OAuth communication with the > issuer (it depends heavily on 3rd party library iddawc as Jacob > already pointed out. The library seems to be supporting all the OAuth > flows.) Speaking of iddawc, I don't think it's a dependency we should choose to rely on. For all the code that it has, it doesn't seem to provide compatibility with several real-world providers. Google, for one, chose not to follow the IETF spec it helped author, and iddawc doesn't support its flavor of Device Authorization. At another point, I think iddawc tried to decode Azure's Bearer tokens, which is incorrect... I haven't been able to check if those problems have been fixed in a recent version, but if we're going to tie ourselves to a huge dependency, I'd at least like to believe that said dependency is battle-tested and solid, and personally I don't feel like iddawc is. > - auth_method = I_TOKEN_AUTH_METHOD_NONE; > - if (conn->oauth_client_secret && *conn->oauth_client_secret) > - auth_method = I_TOKEN_AUTH_METHOD_SECRET_BASIC; This code got moved, but I'm not sure why? It doesn't appear to have made a change to the logic. > + if (conn->oauth_client_secret && *conn->oauth_client_secret) > + { > + session_response_type = I_RESPONSE_TYPE_CLIENT_CREDENTIALS; > + } Is this an Azure-specific requirement? Ideally a public client (which psql is) shouldn't have to provide a secret to begin with, if I understand that bit of the protocol correctly. I think Google also required provider-specific changes in this part of the code, and unfortunately I don't think they looked the same as yours. We'll have to figure all that out... Standards are great; everyone has one of their own. :) Thanks, --Jacob
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
On Tue, Sep 20, 2022 at 04:00:26PM -0700, Nathan Bossart wrote: > On Mon, Aug 08, 2022 at 06:10:23PM +0530, Bharath Rupireddy wrote: >> I'm attaching v5 patch-set. I've addressed review comments received so >> far and fixed a compiler warning that CF bot complained about. >> >> Please review it further. > > 0001 looks reasonable to me. > > +errno = 0; > +rc = pg_pwritev_zeros(fd, pad_to_size); > > Do we need to reset errno? pg_pwritev_zeros() claims to set errno > appropriately. > > +/* > + * PWRITEV_BLCKSZ is same as XLOG_BLCKSZ for now, however it may change if > + * writing more bytes per pg_pwritev_with_retry() call is proven to be more > + * performant. > + */ > +#define PWRITEV_BLCKSZ XLOG_BLCKSZ > > This seems like something we should sort out now instead of leaving as > future work. Given your recent note, I think we should just use > XLOG_BLCKSZ and PGAlignedXLogBlock and add a comment about the performance > findings with different buffer sizes. I also noticed that the latest patch set no longer applies, so I've marked the commitfest entry as waiting-on-author. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
On Mon, Aug 08, 2022 at 06:10:23PM +0530, Bharath Rupireddy wrote: > I'm attaching v5 patch-set. I've addressed review comments received so > far and fixed a compiler warning that CF bot complained about. > > Please review it further. 0001 looks reasonable to me. +errno = 0; +rc = pg_pwritev_zeros(fd, pad_to_size); Do we need to reset errno? pg_pwritev_zeros() claims to set errno appropriately. +/* + * PWRITEV_BLCKSZ is same as XLOG_BLCKSZ for now, however it may change if + * writing more bytes per pg_pwritev_with_retry() call is proven to be more + * performant. + */ +#define PWRITEV_BLCKSZ XLOG_BLCKSZ This seems like something we should sort out now instead of leaving as future work. Given your recent note, I think we should just use XLOG_BLCKSZ and PGAlignedXLogBlock and add a comment about the performance findings with different buffer sizes. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
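To make the buffer-size discussion above concrete, here is a hedged Python sketch of the pg_pwritev_zeros() idea: zero-fill a file region by passing pwritev() many iovecs that all point at the same zeroed block, rather than issuing one write() per block. The constants and the helper name are illustrative, not the patch's actual code, and os.pwritev() requires Linux and Python 3.7+.

```python
import os
import tempfile

BLCKSZ = 8192    # XLOG_BLCKSZ is typically 8 kB
IOV_MAX = 1024   # typical Linux limit; real code would query sysconf()
zeros = bytes(BLCKSZ)  # one zeroed block, reused for every iovec

def pwritev_zeros(fd, pad_to_size):
    """Zero-fill fd from offset 0 up to pad_to_size using pwritev()."""
    offset = 0
    while offset < pad_to_size:
        remaining = pad_to_size - offset
        if remaining < BLCKSZ:
            iov = [zeros[:remaining]]          # short tail, under one block
        else:
            iov = [zeros] * min(remaining // BLCKSZ, IOV_MAX)
        offset += os.pwritev(fd, iov, offset)  # may write fewer bytes; loop

fd, path = tempfile.mkstemp()
pwritev_zeros(fd, 2 * BLCKSZ + 100)
os.close(fd)
```

The retry loop also covers partial writes, which is the same concern pg_pwritev_with_retry() addresses in core.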
Re: Auto explain after query timeout
On Tue, Sep 20, 2022 at 5:08 PM James Coleman wrote: > - A safe explain (e.g., disallow catalog access) that is potentially > missing information. This would be pretty useless I think, because you'd be missing all relation names. > - A safe way to interrupt queries such as "safe shutdown" of a node > (e.g., a seq scan could stop returning tuples early) and allow a > configurable buffer of time after the statement timeout before firing > a hard abort of the query (and transaction). This might be useful, but it seems extremely difficult to get working. You'd not only have to design the safe shutdown mechanism itself, but also find a way to safely engage it at the right times. > Alternatively I wonder if it's possible (this would maybe assume no > catalog changes in the current transaction -- or at least none that > would be referenced by the current query) to open a new transaction > (with the same horizon information) and duplicate the plan over to > that transaction and run the explain there. This way you do it *after* > the error is raised. That's some serious spit-balling -- I'm not > saying that's doable, just trying to imagine how one might > comprehensively address the concerns. Doesn't work, because the new transaction's snapshot wouldn't be the same as that of the old one. Imagine that you create a table and run a query on it in the same transaction. Then you migrate the plan tree to a new transaction and try to find out the table name. But in the new transaction, that table doesn't exist: it was destroyed by the previous rollback. Honestly I have no very good ideas how to create the feature you want here. 
I guess the only thing I can think of is to separate the EXPLAIN process into two phases: a first phase that runs when the plan tree is set up and gathers all of the information that we might need later, like relation names, and then a second phase that runs later when you want to generate the output and does nothing that can fail, or at least no database access: maybe it's allowed to allocate memory, for example. But that sounds like a big and perhaps painful refactoring exercise, and I can imagine that there might be reasons why it doesn't work out. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Reducing the WAL overhead of freezing in VACUUM by deduplicating per-tuple freeze plans
On Mon, Sep 12, 2022 at 2:01 PM Peter Geoghegan wrote: > I'd like to talk about one such technique on this thread. The attached > WIP patch reduces the size of xl_heap_freeze_page records by applying > a simple deduplication process. Attached is v2, which I'm just posting to keep CFTester happy. No real changes here. -- Peter Geoghegan From c555f716a79c7de1f26a5e1e87265f2702b980ee Mon Sep 17 00:00:00 2001 From: Peter Geoghegan Date: Sat, 30 Jul 2022 10:47:59 -0700 Subject: [PATCH v2] Shrink freeze WAL records via deduplication. --- src/include/access/heapam.h| 25 ++ src/include/access/heapam_xlog.h | 34 +-- src/backend/access/heap/heapam.c | 310 - src/backend/access/heap/vacuumlazy.c | 39 +--- src/backend/access/rmgrdesc/heapdesc.c | 4 +- 5 files changed, 289 insertions(+), 123 deletions(-) diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h index 9dab35551..579ef32b0 100644 --- a/src/include/access/heapam.h +++ b/src/include/access/heapam.h @@ -99,6 +99,19 @@ typedef enum HEAPTUPLE_DELETE_IN_PROGRESS /* deleting xact is still in progress */ } HTSV_Result; +/* heap_prepare_freeze_tuple state describing how to freeze a tuple */ +typedef struct HeapFreezeTuple +{ + /* Fields describing how to process tuple */ + uint16 t_infomask2; + uint16 t_infomask; + TransactionId xmax; + uint8 frzflags; + + /* Page offset number for tuple */ + OffsetNumber offset; +} HeapFreezeTuple; + /* * function prototypes for heap access method * @@ -164,6 +177,18 @@ extern TM_Result heap_lock_tuple(Relation relation, HeapTuple tuple, Buffer *buffer, struct TM_FailureData *tmfd); extern void heap_inplace_update(Relation relation, HeapTuple tuple); +extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple, + TransactionId relfrozenxid, + TransactionId relminmxid, + TransactionId cutoff_xid, + TransactionId cutoff_multi, + HeapFreezeTuple *frz, + bool *totally_frozen, + TransactionId *relfrozenxid_out, + MultiXactId *relminmxid_out); +extern void 
heap_freeze_prepared_tuples(Relation rel, Buffer buffer, + TransactionId latestRemovedXid, + HeapFreezeTuple *tuples, int ntuples); extern bool heap_freeze_tuple(HeapTupleHeader tuple, TransactionId relfrozenxid, TransactionId relminmxid, TransactionId cutoff_xid, TransactionId cutoff_multi); diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h index 34220d93c..4e3a023eb 100644 --- a/src/include/access/heapam_xlog.h +++ b/src/include/access/heapam_xlog.h @@ -315,34 +315,36 @@ typedef struct xl_heap_inplace /* * This struct represents a 'freeze plan', which is what we need to know about - * a single tuple being frozen during vacuum. + * a group of tuples being frozen from the same page. */ /* 0x01 was XLH_FREEZE_XMIN */ #define XLH_FREEZE_XVAC 0x02 #define XLH_INVALID_XVAC 0x04 -typedef struct xl_heap_freeze_tuple +typedef struct xl_heap_freeze_plan { TransactionId xmax; - OffsetNumber offset; + uint16 ntuples; uint16 t_infomask2; uint16 t_infomask; uint8 frzflags; -} xl_heap_freeze_tuple; +} xl_heap_freeze_plan; /* * This is what we need to know about a block being frozen during vacuum * - * Backup block 0's data contains an array of xl_heap_freeze_tuple structs, - * one for each tuple. + * Backup block 0's data contains an array of xl_heap_freeze_plan structs. 
*/ typedef struct xl_heap_freeze_page { - TransactionId cutoff_xid; - uint16 ntuples; + TransactionId latestRemovedXid; + uint16 nplans; + + /* FREEZE PLANS FOLLOW */ + /* OFFSET NUMBERS FOLLOW */ } xl_heap_freeze_page; -#define SizeOfHeapFreezePage (offsetof(xl_heap_freeze_page, ntuples) + sizeof(uint16)) +#define SizeOfHeapFreezePage (offsetof(xl_heap_freeze_page, nplans) + sizeof(uint16)) /* * This is what we need to know about setting a visibility map bit @@ -401,20 +403,6 @@ extern void heap2_desc(StringInfo buf, XLogReaderState *record); extern const char *heap2_identify(uint8 info); extern void heap_xlog_logical_rewrite(XLogReaderState *r); -extern XLogRecPtr log_heap_freeze(Relation reln, Buffer buffer, - TransactionId cutoff_xid, xl_heap_freeze_tuple *tuples, - int ntuples); -extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple, - TransactionId relfrozenxid, - TransactionId relminmxid, - TransactionId cutoff_xid, - TransactionId cutoff_multi, - xl_heap_freeze_tuple *frz, - bool *totally_frozen, - TransactionId *relfrozenxid_out, - MultiXactId *relminmxid_out); -extern void heap_execute_freeze_tuple(HeapTupleHeader tuple, - xl_heap_freeze_tuple *frz); extern XLogRecPtr log_heap_visible(RelFileLocator rlocator, Buffer heap_buffer, Buffer vm_buffer, TransactionId cutoff_xid, u
Re: Auto explain after query timeout
On Tue, Sep 20, 2022 at 3:06 PM Robert Haas wrote: > > On Tue, Sep 20, 2022 at 2:35 PM James Coleman wrote: > > Either I'm missing something (and/or this was fixed in a later PG > > version), but I don't think this is how it works. We have this > > specific problem now: we set auto_explain.log_min_duration to 200 (ms) > > and statement_timeout set to 30s, but when a statement times out we do > > not get the plan logged with auto-explain. > > I think you're correct. auto_explain uses the ExecutorEnd hook, but > that hook will not be fired in the case of an error. Indeed, if an > error has already been thrown, it would be unsafe to try to > auto-explain anything. For instance -- and this is just one problem of > probably many -- ExplainTargetRel() performs catalog lookups, which is > not OK in a failed transaction. > > To make this work, I think you'd need a hook that fires just BEFORE > the error is thrown. However, previous attempts to introduce hooks > into ProcessInterrupts() have not met with a warm response from Tom, so > it might be a tough sell. And maybe for good reason. I see at least > two problems. The first is that explaining a query is a pretty > complicated operation that involves catalog lookups and lots of > complicated stuff, and I don't think that it would be safe to do all > of that at any arbitrary point in the code where ProcessInterrupts() > happened to fire. What if one of the syscache lookups misses the cache > and we have to open the underlying catalog? Then > AcceptInvalidationMessages() will fire, but we don't currently assume > that any old CHECK_FOR_INTERRUPTS() can process invalidations. What if > the running query has called a user-defined function or procedure > which is running DDL which is in the middle of changing catalog state > for some relation involved in the query at the moment that the > statement timeout arrives? I feel like there are big problems here. 
> > The other problem I see is that ProcessInterrupts() is our mechanism > for allowing people to escape from queries that would otherwise run > forever by hitting ^C. But what if the explain code goes crazy and > itself needs to be interrupted, when we're already inside > ProcessInterrupts()? Maybe that would work out OK... or maybe it > wouldn't. I'm not really sure. But it's another reason to be very, > very cautious about putting complicated logic inside > ProcessInterrupts(). This is exactly the kind of background I was hoping someone would provide; thank you, Robert. It seems like one could imagine addressing all of these by having one of: - A safe explain (e.g., disallow catalog access) that is potentially missing information. - A safe way to interrupt queries such as "safe shutdown" of a node (e.g., a seq scan could stop returning tuples early) and allow a configurable buffer of time after the statement timeout before firing a hard abort of the query (and transaction). Both of those seem like a significant amount of work. Alternatively I wonder if it's possible (this would maybe assume no catalog changes in the current transaction -- or at least none that would be referenced by the current query) to open a new transaction (with the same horizon information) and duplicate the plan over to that transaction and run the explain there. This way you do it *after* the error is raised. That's some serious spit-balling -- I'm not saying that's doable, just trying to imagine how one might comprehensively address the concerns. Does any of that seem at all like a path you could imagine being fruitful? Thanks, James Coleman
Re: Making C function declaration parameter names consistent with corresponding definition names
On Mon, Sep 19, 2022 at 11:36 PM Peter Geoghegan wrote: > Attached revision v4 fixes those pg_dump patch items. > > It also breaks out the ecpg changes into their own patch. I pushed much of this just now. All that remains to bring the entire codebase into compliance is the ecpg patch and the pg_dump patch. Those two areas are relatively tricky. But it's now unlikely that I'll need to push a commit that makes relatively many CF patches stop applying against HEAD -- that part is over. Once we're done with ecpg and pg_dump, we can talk about the actual practicalities of formally adopting a project policy on consistent parameter names. I mostly use clang-tidy via my editor's support for the clangd language server -- clang-tidy is primarily a linter, so it isn't necessarily run in bulk all that often. I'll need to come up with instructions for running clang-tidy from the command line that are easy to follow. I've found that the run_clang_tidy script (AKA run-clang-tidy.py) works, but the whole experience feels cobbled together. I think that we really need something like a build target for this -- something comparable to what we do to support GCOV. That would also allow us to use additional clang-tidy checks, which might be useful. We might even find it useful to come up with some novel check of our own. Apparently it's not all that difficult to write one from scratch, to implement custom rules. There are already custom rules for big open source projects such as the Linux Kernel, Chromium, and LLVM itself. -- Peter Geoghegan
Re: why can't a table be part of the same publication as its schema
> On Sep 20, 2022, at 12:36 PM, Jonathan S. Katz wrote: > > This behavior exists "FOR ALL TABLES" without the "IN SCHEMA" qualifier. This > was discussed multiple times on the original thread[1]. Yes, nobody is debating that as far as I can see. And I do take your point that this stuff was discussed in other threads quite a while back. > I tried to diligently read the sections where we talk about granting + > privileges[2][3] to see what it says about "ALL * IN SCHEMA". Unless I missed > it, and I read through it twice, it does not explicitly state whether or not > "GRANT" applies to all objects at only that given moment, or to future > objects of that type which are created in that schema. Maybe the behavior is > implied or is part of the standard, but it's not currently documented. Interesting. Thanks for that bit of research. > We do link to "ALTER DEFAULT PRIVILEGES" at the bottom of the GRANT[2] docs, > but we don't give any indication as to why. > > (This is also to say we should document in GRANT that ALL * IN SCHEMA does > not apply to future objects; Yes, I agree this should be documented. > if you need that behavior use ALTER DEFAULT PRIVILEGES. Separate thread :) > > I understand there is a risk of confusion of the similar grammar across > commands, but the current command in logical replication is building > on the existing behavior. I don't complain that it is building on the existing behavior. I'm *only* concerned about the keywords we're using for this. Consider the following: -- AS ADMIN CREATE USER bob NOSUPERUSER; GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA foo TO bob; SET ROLE bob; CREATE PUBLICATION bobs_pub FOR ALL TABLES IN SCHEMA foo; We're going to have that fail in pg15 because the FOR ALL TABLES IN SCHEMA option is reserved to superusers. But we agreed that was a stop-gap solution that we'd potentially loosen in the future. 
Certainly we'll need wiggle room in the syntax to perform that loosening: --- Must be superuser for this in pg15, and in subsequent releases. CREATE PUBLICATION bobs_pub FOR ALL FUTURE TABLES IN SCHEMA foo; --- Not supported in pg15, but reserved for some future pg versions to allow --- non-superusers to create publications on tables currently in schema foo, --- assuming they have sufficient privileges on those tables CREATE PUBLICATION bobs_pub FOR ALL TABLES IN SCHEMA foo; Doing it this way makes the syntax consistent between the GRANT...TO bob and the CREATE PUBLICATION bobs_pub. Surely this makes more sense? I'm not a huge fan of the keyword "FUTURE" here, but I found a reference to another database that uses that keyword for what I think is a similar purpose. We should choose *something* for this, though, if we want things to be rational going forward. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: why can't a table be part of the same publication as its schema
(RMT hat on, unless otherwise noted) On 9/20/22 9:42 AM, Robert Haas wrote: On Mon, Sep 19, 2022 at 11:03 PM Jonathan S. Katz wrote: For #1 (allowing calls that have schema/table overlap...), there appears to be both a patch that allows this (reversing[8]), and a suggestion for dealing with a corner-case that is reasonable, i.e. disallowing adding schemas to a publication when specifying column-lists. Do we think we can have consensus on this prior to the RC1 freeze? I am not sure whether we can or should rush a fix in that fast, but I agree with this direction. The RMT met today to discuss this. We did agree that the above is an open item that should be resolved before this release. While it is an accepted pattern for us to "ERROR" on unsupported behavior and then later introduce said behavior, we do agree with Peter's original post in this thread and would like it resolved. As for the state of the fix, the patch has been iterated on and Amit felt ready to commit it[1]. We do want to hear how others feel about this, but the folks behind this feature have been working on this patch since this was reported. For #2 ("ALL TABLES IN SCHEMA" syntax), this was heavily discussed on the original thread[1][3][4][5][7]. I thought Tom's proposal on the syntax[3] was reasonable as it "future proofs" for when we allow other schema-scoped objects to be published and give control over which ones can be published. All right, well, I still don't like it and think it's confusing, but perhaps I'm in the minority. The RMT discussed this as well. The RMT feels that there should not be any changes to syntax/behavior for this release. This doesn't preclude future work in this area (e.g. having a toggle for "all future behavior"), but based on all the discussions and existing behavior in this feature, we do not see a need to make changes or delay the release on this. I don't think we should change this behavior that's already in logical replication. 
While I understand the reasons why "GRANT ... ALL TABLES IN SCHEMA" has a different behavior (i.e. it's not applied to future objects) and do not advocate to change it, I have personally been affected where I thought a permission would be applied to all future objects, only to discover otherwise. I believe it's more intuitive to think that "ALL" applies to "everything, always." Nah, there's room for multiple behaviors here. It's reasonable to want to add all the tables currently in the schema to a publication (or grant permissions on them) and it's reasonable to want to include all current and future tables in the schema in a publication (or grant permissions on them) too. The reason I don't like the ALL TABLES IN SCHEMA syntax is that it sounds like the former, but actually is the latter. Based on your link to the email from Tom, I understand now the reason why it's like that, but it's still counterintuitive to me. I understand your view on "multiple behaviors" and I do agree with your reasoning. I still think we should leave this as is, but perhaps this opens up an option we add later to specify the behavior. For #3 (more superuser-only), in general I do agree that we shouldn't be adding more of these. However, we have in this release, and not just to this feature. ALTER SUBSCRIPTION ... SKIP[11] requires superuser. I think it's easier for us to "relax" privileges (e.g. w/predefined roles) than to make something "superuser-only" in the future, so I don't see this being a blocker for v15. The feature will continue to work for users even if we remove "superuser-only" in the future. Yeah, this is clearly not a release blocker, I think. The RMT concurs. We do recommend future work on "relaxing" the superuser-only requirement. Thanks, Jonathan [1] https://www.postgresql.org/message-id/CAA4eK1LDhoBM8K5uVme8PZ%2BkxNOfVpRh%3DoO42JtFdqBgBuj1bA%40mail.gmail.com
Re: why can't a table be part of the same publication as its schema
On 9/20/22 10:55 AM, Mark Dilger wrote: On Sep 19, 2022, at 8:03 PM, Jonathan S. Katz wrote: "When a partitioned table is added to a publication, all of its existing and future partitions are implicitly considered to be part of the publication."[10] Additionally, this is the behavior that is already present in "FOR ALL TABLES": "Marks the publication as one that replicates changes for all tables in the database, including tables created in the future."[10] I don't think we should change this behavior that's already in logical replication. The existing behavior in logical replication doesn't have any "IN SCHEMA" qualifiers. This behavior exists "FOR ALL TABLES" without the "IN SCHEMA" qualifier. This was discussed multiple times on the original thread[1]. While I understand the reasons why "GRANT ... ALL TABLES IN SCHEMA" has a different behavior (i.e. it's not applied to future objects) and do not advocate to change it, I have personally been affected where I thought a permission would be applied to all future objects, only to discover otherwise. I believe it's more intuitive to think that "ALL" applies to "everything, always." The conversation is focusing on what "ALL TABLES" means, but the ambiguous part is what "IN SCHEMA" means. In GRANT it means "currently in schema, computed now." We are about to create confusion by adding the "IN SCHEMA" phrase to publication commands meaning "later in schema, computed then." A user who diligently consults the documentation for one command to discover what "IN SCHEMA" means may fairly, but wrongly, assume it means the same thing in another command. I tried to diligently read the sections where we talk about granting + privileges[2][3] to see what it says about "ALL * IN SCHEMA". Unless I missed it, and I read through it twice, it does not explicitly state whether or not "GRANT" applies to all objects at only that given moment, or to future objects of that type which are created in that schema. 
Maybe the behavior is implied or is part of the standard, but it's not currently documented. We do link to "ALTER DEFAULT PRIVILEGES" at the bottom of the GRANT[2] docs, but we don't give any indication as to why. (This is also to say we should document in GRANT that ALL * IN SCHEMA does not apply to future objects; if you need that behavior use ALTER DEFAULT PRIVILEGES. Separate thread :) I understand there is a risk of confusion of the similar grammar across commands, but the current command in logical replication is building on the existing behavior. Thanks, Jonathan [1] https://www.postgresql.org/message-id/flat/CALDaNm0OANxuJ6RXqwZsM1MSY4s19nuH3734j4a72etDwvBETQ%40mail.gmail.com [2] https://www.postgresql.org/docs/current/sql-grant.html [3] https://www.postgresql.org/docs/current/ddl-priv.html
Support logical replication of large objects
Hello, I’m working on a patch to support logical replication of large objects (LOBs). This is a useful feature when a database in logical replication has lots of tables, functions, and other objects that change over time, such as in an online cross-major-version upgrade. As an example, this lets users replicate large objects between different PostgreSQL major versions. The topic was previously discussed in [1]. We need to address the following three challenges. I worked on some designs and would appreciate feedback:

1. Replication of the change stream of LOBs

My suggestion is that we can add a check when any LOB function or API is called and executed in the backend, and then add a simple SQL command to the WAL to do the replication. Take lo_unlink as an example[2]: we can create a “simple” SQL command like SELECT lo_unlink(<oid>); and log it in WAL, so that the replica only needs to replay the “simple” SQL command. We can unlink the LOBs on the replica accordingly.

Pros:
a. We do not need to add much additional functionality.
b. For most of the LOB-related APIs, we don’t need to touch whole LOBs, except for the creation of LOBs.

Cons:
a. For the creation of LOBs, we may need to split the whole LOB content into WAL records, which will increase the volume of WAL and replicated writes dramatically. This could be prevented if we can make sure the original file is publicly shared, like a URL from cloud storage, or exists on the replica's host as well.

2. Initializing replication of LOBs

When a subscription is established, LOBs in the source should be replicated even if they were not created on the replica. Here are two options to approach this problem:

Option 1: Support LOB-related system catalogs in logical replication. We can make an exception in this line[3] in the check_publication_add_schema function.
Pros: All required LOBs can be transferred to the replica.
Cons: There is currently no support for allowing logical replication of system catalogs. 
Option 2: Run a function or a script on the source instance when it detects that logical replication has been established. The point is that we can ask the source to replicate the whole set of LOBs when a new subscription is created. Maybe we can copy the LOB-related system catalogs, replicate the copy to the replica, and then restore the original LOBs on the replica from the copy.
Cons: This will increase the volume of WAL and replicated writes dramatically.

I currently do not have a preference for either option. I would like to see if others have thoughts on how we could approach this.

3. OID conflicts

A common case is that the OID we want to publish is already used on the subscriber.

Option 1 (my recommendation): Create/update an existing system catalog to map the OID if a conflict happens. Maybe we could add another column, named something like mapping_oid, to the system catalog pg_largeobject_metadata on the replica. When the replica detects that the OID the source is replicating (e.g., 16400) is already used on the replica, the replica could store 16400 as mapping_oid and create a new OID (e.g., 16500) as the oid to be used on the replica, so whatever operation is done on 16400 on the source, on the replica we just need to perform on 16500.
Cons: We would need to add additional columns to the system catalog.

Option 2: Report an error on the replica and let the user handle it manually.
Cons: This is not user-friendly.

Please let me know your thoughts. Borui Yang Amazon RDS/Aurora for PostgreSQL [1] https://www.postgresql.org/message-id/VisenaEmail.16.35d9e854e3626e81.15cd0de93df%40tc7-visena [2] https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/libpq/be-fsstubs.c#l312 [3] https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/catalog/pg_publication.c#l98
Re: Auto explain after query timeout
On Tue Sep 20, 2022 at 11:34 AM PDT, James Coleman wrote: > On Tue, Sep 20, 2022 at 2:12 PM Gurjeet wrote: > > > > For someone who would like to achieve this in the field today, I believe > > they can set auto_explain.log_min_duration equal to, or less than, > > statement_timeout. > > Either I'm missing something (and/or this was fixed in a later PG > version), but I don't think this is how it works. We have this > specific problem now: we set auto_explain.log_min_duration to 200 (ms) > and statement_timeout set to 30s, but when a statement times out we do > not get the plan logged with auto-explain. My DBA skills are rusty, so I'll take your word for it. If this is the current behaviour, I'm inclined to treat this as a bug, or at least a desirable improvement, and see if auto_explain can be improved to emit the plan on statement_timeout. From what I understand, the behaviour of auto_explain is that it waits for the query to finish before it emits the plan. This information is very useful for diagnosing long-running queries that are still running. Many times, you encounter such queries in production workloads, and reproducing such a scenario later on is either undesirable, expensive, or even impossible. So being able to see the plan of a query that has crossed auto_explain.log_min_duration as soon as possible would be highly desirable. Best regards, Gurjeet http://Gurje.et
Re: Auto explain after query timeout
On Tue, Sep 20, 2022 at 2:35 PM James Coleman wrote: > Either I'm missing something (and/or this was fixed in a later PG > version), but I don't think this is how it works. We have this > specific problem now: we set auto_explain.log_min_duration to 200 (ms) > and statement_timeout set to 30s, but when a statement times out we do > not get the plan logged with auto-explain. I think you're correct. auto_explain uses the ExecutorEnd hook, but that hook will not be fired in the case of an error. Indeed, if an error has already been thrown, it would be unsafe to try to auto-explain anything. For instance -- and this is just one problem of probably many -- ExplainTargetRel() performs catalog lookups, which is not OK in a failed transaction. To make this work, I think you'd need a hook that fires just BEFORE the error is thrown. However, previous attempts to introduce hooks into ProcessInterrupts() have not met with a warm response from Tom, so it might be a tough sell. And maybe for good reason. I see at least two problems. The first is that explaining a query is a pretty complicated operation that involves catalog lookups and lots of complicated stuff, and I don't think that it would be safe to do all of that at any arbitrary point in the code where ProcessInterrupts() happened to fire. What if one of the syscache lookups misses the cache and we have to open the underlying catalog? Then AcceptInvalidationMessages() will fire, but we don't currently assume that any old CHECK_FOR_INTERRUPTS() can process invalidations. What if the running query has called a user-defined function or procedure which is running DDL which is in the middle of changing catalog state for some relation involved in the query at the moment that the statement timeout arrives? I feel like there are big problems here. 
But what if the explain code goes crazy and itself needs to be interrupted, when we're already inside ProcessInterrupts()? Maybe that would work out OK... or maybe it wouldn't. I'm not really sure. But it's another reason to be very, very cautious about putting complicated logic inside ProcessInterrupts(). -- Robert Haas EDB: http://www.enterprisedb.com
Re: A question about wording in messages
Alvaro Herrera writes: > After spending way too much time editing this line, I ended up with > exactly what Tom proposed, so +1 for his version. I think "This > controls" adds nothing very useful, and we don't have it anywhere else, > except tcp_keepalives_count from where I also propose to remove it. LGTM. regards, tom lane
Re: Support tls-exporter as channel binding for TLSv1.3
On Tue, Sep 20, 2022 at 11:01 AM Jacob Champion wrote: > Well, I'm working on a next version, but it's ballooning in complexity > as I try to navigate the fix for OpenSSL 1.0.1 (which is currently > failing the tests, unsurprisingly). To be more specific: I think I'm hitting the case that Heikki pointed out several years ago [1]: > The problematic case is when e.g. the server > only supports tls-unique and the client only supports > tls-server-end-point. What we would (usually) like to happen, is to fall > back to not using channel binding. But it's not clear how to make that > work, and still protect from downgrade attacks. The problem was deferred when tls-unique was removed. We might have to actually solve it now. bcc: Heikki, in case he would like to weigh in. --Jacob [1] https://www.postgresql.org/message-id/ec787074-2305-c6f4-86aa-6902f98485a4%40iki.fi
Re: Auto explain after query timeout
On Tue, Sep 20, 2022 at 2:12 PM Gurjeet wrote: > > On Tue Sep 20, 2022 at 10:34 AM PDT, James Coleman wrote: > > Hopefully I'm not missing something obvious, but as far as I know > > there's no way to configure auto explain to fire when > > statement_timeout fires. > > I believe you're correct. > > > I'd like to look into this at some point, but I'm wondering if anyone > > has thought about it before, and, if so, is there any obvious > > impediment to doing so? > > This would be a neat feature. Since the changes would be fairly > localized to the contrib module, this would be a great first patch for > someone new to contributing. > > This can be exposed at a new GUC auto_explain.log_on_statement_timeout. > I wish our conventions allowed for creating hierarchies of GUC > parameters, e.g. auto_explain.when.statement_timeout. > > For someone who would like to achieve this in the field today, I believe > they can set auto_explain.log_min_duration equal to, or less than, > statement_timeout. Either I'm missing something (and/or this was fixed in a later PG version), but I don't think this is how it works. We have this specific problem now: we set auto_explain.log_min_duration to 200 (ms) and statement_timeout set to 30s, but when a statement times out we do not get the plan logged with auto-explain. James Coleman
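For anyone reproducing this, the configuration under discussion amounts to something like the following (values taken from the messages above):

```
# postgresql.conf (excerpt)
shared_preload_libraries = 'auto_explain'
statement_timeout = '30s'
auto_explain.log_min_duration = 200    # ms; today this does NOT log the plan
                                       # when statement_timeout aborts the query
```

Because auto_explain logs from the ExecutorEnd hook, a statement cancelled by the timeout never reaches that hook, which is the behavior being reported here.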
Re: XMAX_LOCK_ONLY and XMAX_COMMITTED (fk/multixact code)
Here is a rebased patch for cfbot. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From 9ae1f5409ddeee17b99a9565247da885cbb86be9 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Tue, 22 Mar 2022 15:35:34 -0700 Subject: [PATCH v6 1/1] Disallow setting XMAX_COMMITTED and XMAX_LOCK_ONLY together. Even though Postgres doesn't set both the XMAX_COMMITTED and XMAX_LOCK_ONLY infomask bits on the same tuple header, the heap visibility logic still accepts it. However, other functions like compute_new_xmax_infomask() seem to assume that this bit pattern is not possible. This change marks this bit pattern as disallowed, removes the heap visibility logic that handles it, and adds checks like those for other disallowed infomask bit combinations (e.g., XMAX_COMMITTED and XMAX_IS_MULTI). Besides simplifying the heap visibility logic a bit, this change aims to reduce ambiguity about the legal tuple header states. Note that this change also disallows XMAX_COMMITTED together with the special pre-v9.3 locked-only bit pattern that HEAP_XMAX_IS_LOCKED_ONLY checks for. This locked-only bit pattern may still be present on servers pg_upgraded from pre-v9.3 versions. --- contrib/amcheck/verify_heapam.c | 19 src/backend/access/heap/README.tuplock | 2 +- src/backend/access/heap/heapam.c| 10 +++ src/backend/access/heap/heapam_visibility.c | 97 ++--- src/backend/access/heap/hio.c | 2 + 5 files changed, 96 insertions(+), 34 deletions(-) diff --git a/contrib/amcheck/verify_heapam.c b/contrib/amcheck/verify_heapam.c index d33f33f170..f74f88afc5 100644 --- a/contrib/amcheck/verify_heapam.c +++ b/contrib/amcheck/verify_heapam.c @@ -663,6 +663,25 @@ check_tuple_header(HeapCheckContext *ctx) */ } + if ((ctx->tuphdr->t_infomask & HEAP_XMAX_COMMITTED) && + (ctx->tuphdr->t_infomask & HEAP_XMAX_LOCK_ONLY)) + { + report_corruption(ctx, + pstrdup("locked-only should not be marked committed")); + + /* As above, do not skip further checks. 
*/ + } + + /* also check for pre-v9.3 lock-only bit pattern */ + if ((ctx->tuphdr->t_infomask & HEAP_XMAX_COMMITTED) && + HEAP_XMAX_IS_LOCKED_ONLY(ctx->tuphdr->t_infomask)) + { + report_corruption(ctx, + pstrdup("tuple with HEAP_XMAX_EXCL_LOCK set and HEAP_XMAX_IS_MULTI unset should not be marked committed")); + + /* As above, do not skip further checks. */ + } + if (infomask & HEAP_HASNULL) expected_hoff = MAXALIGN(SizeofHeapTupleHeader + BITMAPLEN(ctx->natts)); else diff --git a/src/backend/access/heap/README.tuplock b/src/backend/access/heap/README.tuplock index 6441e8baf0..4e565e60ab 100644 --- a/src/backend/access/heap/README.tuplock +++ b/src/backend/access/heap/README.tuplock @@ -152,4 +152,4 @@ The following infomask bits are applicable: whether the XMAX is a TransactionId or a MultiXactId. We currently never set the HEAP_XMAX_COMMITTED when the HEAP_XMAX_IS_MULTI bit -is set. +or the HEAP_XMAX_LOCK_ONLY bit is set. diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index eb811d751e..616df576c3 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -5115,6 +5115,16 @@ l5: MultiXactStatus status; MultiXactStatus new_status; + /* + * Currently we don't allow XMAX_LOCK_ONLY and XMAX_COMMITTED to both be + * set in a tuple header, so cross-check. + * + * Note that this also checks for the special locked-only bit pattern + * that may be present on servers that were pg_upgraded from pre-v9.3 + * versions. + */ + Assert(!HEAP_XMAX_IS_LOCKED_ONLY(old_infomask)); + if (old_infomask2 & HEAP_KEYS_UPDATED) status = MultiXactStatusUpdate; else diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c index 6e33d1c881..e6ee8ff1fa 100644 --- a/src/backend/access/heap/heapam_visibility.c +++ b/src/backend/access/heap/heapam_visibility.c @@ -78,6 +78,31 @@ #include "utils/snapmgr.h" +/* + * InfomaskAssertions() + * + * Checks for invalid infomask bit patterns. 
+ */ +static inline void +InfomaskAssertions(HeapTupleHeader tuple) +{ + /* + * XMAX_COMMITTED and XMAX_LOCK_ONLY cannot both be set in a tuple header. + * + * Note that this also checks for the special locked-only bit pattern that + * may be present on servers that were pg_upgraded from pre-v9.3 versions. + */ + Assert(!((tuple->t_infomask & HEAP_XMAX_COMMITTED) && + HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask))); + + /* + * XMAX_COMMITTED and XMAX_IS_MULTI cannot both be set in a tuple header. + */ + Assert(!((tuple->t_infomask & HEAP_XMAX_COMMITTED) && + (tuple->t_infomask & HEAP_XMAX_IS_MULTI))); +} + + /* * SetHintBits() * @@ -113,6 +138,8 @@ static inline void SetHintBits(HeapTupleHeader tuple, Buffer buffer, uint16 infomask, TransactionId xid) { + InfomaskAssertions(tuple); + if (TransactionIdIsValid(xid)) { /* NB: xid must be known committed here!
Re: Auto explain after query timeout
On Tue Sep 20, 2022 at 10:34 AM PDT, James Coleman wrote: > Hopefully I'm not missing something obvious, but as far as I know > there's no way to configure auto explain to work when > statement_timeout fires. I believe you're correct. > I'd like to look into this at some point, but I'm wondering if anyone > has thought about it before, and, if so, is there any obvious > impediment to doing so? This would be a neat feature. Since the changes would be fairly localized to the contrib module, this would be a great first patch for someone new to contributing. This can be exposed as a new GUC auto_explain.log_on_statement_timeout. I wish our conventions allowed for creating hierarchies of GUC parameters, e.g. auto_explain.when.statement_timeout. For someone who would like to achieve this in the field today, I believe they can set auto_explain.log_min_duration equal to, or less than, statement_timeout. Best regards, Gurjeet http://Gurje.et
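The field workaround Gurjeet describes above can be sketched as session settings. This is only a sketch: the GUC names are real, the values are illustrative, and whether this actually captures plans of timed-out queries is his stated belief rather than something verified here.

```sql
-- Illustrative settings for the workaround described above.
LOAD 'auto_explain';                        -- or preload via session_preload_libraries
SET statement_timeout = '5s';
SET auto_explain.log_min_duration = '5s';   -- equal to, or less than, statement_timeout
```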
Re: predefined role(s) for VACUUM and ANALYZE
On Tue, Sep 20, 2022 at 02:45:52PM +0900, Michael Paquier wrote: > I have gone through the thread, and I'd agree with getting more > granularity when it comes to assigning ACLs to relations rather than > just an on/off switch for the objects of a given type would be nice. > I've been looking at the whole use of AclMode and AclItem in the code, > and I don't quite see why a larger size could have a noticeable > impact. There are a few things that could handle a large number of > AclItems, though, say for array operations like aclupdate(). These > could be easily checked with some micro-benchmarking or some SQL > queries that emulate a large number of items in aclitem[] arrays. I performed a few quick tests with a couple thousand ACLs on my laptop, and I'm consistently seeing a 4.3% regression. > Any impact for the column sizes of the catalogs holding ACL > information? Just asking while browsing the patch set. Since each aclitem requires 16 bytes instead of 12, I assume so. However, in my testing, I hit a "row is too big" error with the same number of aclitems in a pg_class row before and after the change. I might be missing something in my patch, or maybe I am misunderstanding how arrays of aclitems are stored on disk. > Some comments in utils/acl.h need a refresh as the number of lower and > upper bits looked at from ai_privs changes. Oops, I missed that one. I fixed it in the attached patch set. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From 15e9a60b990070d36f4e1f749cbfbb638b37a666 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Wed, 7 Sep 2022 22:25:29 -0700 Subject: [PATCH v6 1/4] Change AclMode from a uint32 to a uint64. 
--- src/backend/nodes/outfuncs.c| 2 +- src/bin/pg_upgrade/check.c | 35 + src/include/catalog/pg_type.dat | 2 +- src/include/nodes/parsenodes.h | 6 +++--- src/include/utils/acl.h | 28 +- 5 files changed, 54 insertions(+), 19 deletions(-) diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c index 60610e3a4b..dbb9ad52da 100644 --- a/src/backend/nodes/outfuncs.c +++ b/src/backend/nodes/outfuncs.c @@ -528,7 +528,7 @@ _outRangeTblEntry(StringInfo str, const RangeTblEntry *node) WRITE_BOOL_FIELD(lateral); WRITE_BOOL_FIELD(inh); WRITE_BOOL_FIELD(inFromCl); - WRITE_UINT_FIELD(requiredPerms); + WRITE_UINT64_FIELD(requiredPerms); WRITE_OID_FIELD(checkAsUser); WRITE_BITMAPSET_FIELD(selectedCols); WRITE_BITMAPSET_FIELD(insertedCols); diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c index f1bc1e6886..615a53a864 100644 --- a/src/bin/pg_upgrade/check.c +++ b/src/bin/pg_upgrade/check.c @@ -28,6 +28,7 @@ static void check_for_incompatible_polymorphics(ClusterInfo *cluster); static void check_for_tables_with_oids(ClusterInfo *cluster); static void check_for_composite_data_type_usage(ClusterInfo *cluster); static void check_for_reg_data_type_usage(ClusterInfo *cluster); +static void check_for_aclitem_data_type_usage(ClusterInfo *cluster); static void check_for_jsonb_9_4_usage(ClusterInfo *cluster); static void check_for_pg_role_prefix(ClusterInfo *cluster); static void check_for_new_tablespace_dir(ClusterInfo *new_cluster); @@ -107,6 +108,13 @@ check_and_dump_old_cluster(bool live_check) check_for_reg_data_type_usage(&old_cluster); check_for_isn_and_int8_passing_mismatch(&old_cluster); + /* + * PG 16 increased the size of the 'aclitem' type, which breaks the on-disk + * format for existing data. + */ + if (GET_MAJOR_VERSION(old_cluster.major_version) <= 1500) + check_for_aclitem_data_type_usage(&old_cluster); + /* * PG 14 changed the function signature of encoding conversion functions. 
* Conversions from older versions cannot be upgraded automatically @@ -1319,6 +1327,33 @@ check_for_reg_data_type_usage(ClusterInfo *cluster) check_ok(); } +/* + * check_for_aclitem_data_type_usage + * + * aclitem changed its storage format in 16, so check for it. + */ +static void +check_for_aclitem_data_type_usage(ClusterInfo *cluster) +{ + char output_path[MAXPGPATH]; + + prep_status("Checking for incompatible aclitem data type in user tables"); + + snprintf(output_path, sizeof(output_path), "tables_using_aclitem.txt"); + + if (check_for_data_type_usage(cluster, "pg_catalog.aclitem", output_path)) + { + pg_log(PG_REPORT, "fatal"); + pg_fatal("Your installation contains the \"aclitem\" data type in user tables.\n" + "The internal format of \"aclitem\" changed in PostgreSQL version 16\n" + "so this cluster cannot currently be upgraded. You can drop the\n" + "problem columns and restart the upgrade. A list of the problem\n" + "columns is in the file:\n" + "%s", output_path); + } + else + check_ok(); +} /* * check_for_jsonb_9_4_usage() diff --git a/src/include/catalog/pg_type.dat b/src/include/catalog/pg_type.dat index df45879463..484dec39e8 100644 --- a/src/include/catalog/pg_type.dat +
Re: Support tls-exporter as channel binding for TLSv1.3
On Mon, Sep 19, 2022 at 5:54 PM Michael Paquier wrote: > X509_get_signature_nid() has been introduced in 1.0.2. > SSL_export_keying_material() is older than that, being present since > 1.0.1. Considering the fact that we want to always have > tls-server-end-point as default, it seems to me that we should always > publish SCRAM-SHA-256-PLUS and support channel binding when building > with OpenSSL >= 1.0.2. The same can be said about the areas where we > have HAVE_BE_TLS_GET_[PEER_]CERTIFICATE_HASH. Should we advertise support even if the client is too old to provide an extended master secret? > I was planning to continue working on this patch based on your latest > review. Anyway, as that's originally your code, I am fine to let you > take the lead here. Just let me know which way you prefer, and I'll > stick to it :) Well, I'm working on a next version, but it's ballooning in complexity as I try to navigate the fix for OpenSSL 1.0.1 (which is currently failing the tests, unsurprisingly). You mentioned not wanting to add maintenance burden for 1.0.1, and I'm curious to see whether the approach you have in mind would be easier than what mine is turning out to be. Maybe we can compare notes? --Jacob
Re: oat_post_create expected behavior
On Tue, 2022-08-02 at 13:30 -0700, Mary Xu wrote: > > Right, same thing I'm saying. I also think we should discourage > > people from doing cowboy CCIs inside their OAT hooks, because that > > makes the testability problem even worse. Maybe that means we > > need to uniformly move the CREATE hooks to after a system-provided > > CCI, but I've not thought hard about the implications of that. > > I like this approach, however, I am relatively new to the PG scene > and > am not sure how or what I should look into in terms of the > implications that Tom mentioned. Are there any tips? What should be > the next course of action here? I could update my patch to always > call > CCI before the create hooks. I didn't see a clear consensus that we should call OAT_POST_CREATE after CCI, so I went ahead and updated the comment. We can always update the behavior later when we do have consensus, but until that time, at least the comment will be more helpful. If you are satisfied you can mark the CF issue as "committed", or you can leave it open if you think it's still unresolved. -- Jeff Davis PostgreSQL Contributor Team - AWS
Re: CFM Manager
On Thu, Sep 8, 2022 at 2:34 PM Jacob Champion wrote: > I still have yet to update the section "5 to 7 days before end of CF" > and onward. Well, I've saved the hardest part for last... Ibrar, Hamid, have the checklist rewrites been helpful so far? Are you planning on doing an (optional!) triage, and if so, are there any pieces in particular you'd like me to document? --Jacob
Tighten pg_get_object_address argument checking
For publication schemas (OBJECT_PUBLICATION_NAMESPACE) and user mappings (OBJECT_USER_MAPPING), pg_get_object_address() checked the array length of the second argument, but not of the first argument. If the first argument was too long, it would just silently ignore everything but the first argument. Fix that by checking the length of the first argument as well. I wouldn't be surprised if there were more holes like this in this area. I just happened to find these while working on something related.From eb80c87a083464160a1436e5f983df840b282085 Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Tue, 20 Sep 2022 13:37:27 -0400 Subject: [PATCH] Tighten pg_get_object_address argument checking For publication schemas (OBJECT_PUBLICATION_NAMESPACE) and user mappings (OBJECT_USER_MAPPING), pg_get_object_address() checked the array length of the second argument, but not of the first argument. If the first argument was too long, it would just silently ignore everything but the first argument. Fix that by checking the length of the first argument as well. 
--- src/backend/catalog/objectaddress.c | 10 -- src/test/regress/expected/object_address.out | 16 +++- src/test/regress/sql/object_address.sql | 2 +- 3 files changed, 20 insertions(+), 8 deletions(-) diff --git a/src/backend/catalog/objectaddress.c b/src/backend/catalog/objectaddress.c index 8377b4f7d4d1..27616ac2ad26 100644 --- a/src/backend/catalog/objectaddress.c +++ b/src/backend/catalog/objectaddress.c @@ -2239,10 +2239,16 @@ pg_get_object_address(PG_FUNCTION_ARGS) */ switch (type) { + case OBJECT_PUBLICATION_NAMESPACE: + case OBJECT_USER_MAPPING: + if (list_length(name) != 1) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), +errmsg("name list length must be exactly %d", 1))); + /* fall through to check args length */ + /* FALLTHROUGH */ case OBJECT_DOMCONSTRAINT: case OBJECT_CAST: - case OBJECT_USER_MAPPING: - case OBJECT_PUBLICATION_NAMESPACE: case OBJECT_PUBLICATION_REL: case OBJECT_DEFACL: case OBJECT_TRANSFORM: diff --git a/src/test/regress/expected/object_address.out b/src/test/regress/expected/object_address.out index 4117fc27c9a5..cbb99c7b9f94 100644 --- a/src/test/regress/expected/object_address.out +++ b/src/test/regress/expected/object_address.out @@ -105,7 +105,7 @@ BEGIN ('text search template'), ('text search configuration'), ('policy'), ('user mapping'), ('default acl'), ('transform'), ('operator of access method'), ('function of access method'), - ('publication relation') + ('publication namespace'), ('publication relation') LOOP FOR names IN VALUES ('{eins}'), ('{addr_nsp, zwei}'), ('{eins, zwei, drei}') LOOP @@ -285,10 +285,10 @@ WARNING: error for policy,{eins,zwei,drei},{}: schema "eins" does not exist WARNING: error for policy,{eins,zwei,drei},{integer}: schema "eins" does not exist WARNING: error for user mapping,{eins},{}: argument list length must be exactly 1 WARNING: error for user mapping,{eins},{integer}: user mapping for user "eins" on server "integer" does not exist -WARNING: error for user mapping,{addr_nsp,zwei},{}: 
argument list length must be exactly 1 -WARNING: error for user mapping,{addr_nsp,zwei},{integer}: user mapping for user "addr_nsp" on server "integer" does not exist -WARNING: error for user mapping,{eins,zwei,drei},{}: argument list length must be exactly 1 -WARNING: error for user mapping,{eins,zwei,drei},{integer}: user mapping for user "eins" on server "integer" does not exist +WARNING: error for user mapping,{addr_nsp,zwei},{}: name list length must be exactly 1 +WARNING: error for user mapping,{addr_nsp,zwei},{integer}: name list length must be exactly 1 +WARNING: error for user mapping,{eins,zwei,drei},{}: name list length must be exactly 1 +WARNING: error for user mapping,{eins,zwei,drei},{integer}: name list length must be exactly 1 WARNING: error for default acl,{eins},{}: argument list length must be exactly 1 WARNING: error for default acl,{eins},{integer}: unrecognized default ACL object type "i" WARNING: error for default acl,{addr_nsp,zwei},{}: argument list length must be exactly 1 @@ -313,6 +313,12 @@ WARNING: error for function of access method,{addr_nsp,zwei},{}: name list leng WARNING: error for function of access method,{addr_nsp,zwei},{integer}: name list length must be at least 3 WARNING: error for function of access method,{eins,zwei,drei},{}: argument list length must be exactly 2 WARNING: error for function of access meth
Auto explain after query timeout
Hopefully I'm not missing something obvious, but as far as I know there's no way to configure auto explain to work when statement_timeout fires. I'd like to look into this at some point, but I'm wondering if anyone has thought about it before, and, if so, is there any obvious impediment to doing so? Thanks, James Coleman
Re: A question about wording in messages
On 2022-Sep-16, Amit Kapila wrote: > We only want to commit the changes to 15 (a) if those fixes a problem > introduced in 15, or (b) those are for a bug fix. I think the error > message improvements fall into none of those categories, we can map it > to (b) but I feel those are an improvement in the current messages and > don't seem critical to me. IMO at least the GUC one does fix a problem related to the wording of a user-visible message, which also flows into the translations. I prefer to have that one fixed in 15 also. The other messages (errors) don't seem very interesting because they're not as visible, so I don't care if those are not backpatched. -- Álvaro HerreraBreisgau, Deutschland — https://www.EnterpriseDB.com/ "This is a foot just waiting to be shot"(Andrew Dunstan)
Re: Patch proposal: make use of regular expressions for the username in pg_hba.conf
On Mon, Sep 19, 2022 at 9:09 PM Tom Lane wrote: > You have to assume that somebody (a) has a role or DB name starting > with slash, (b) has an explicit reference to that name in their > pg_hba.conf, (c) doesn't read the release notes, and (d) doesn't > notice that things are misbehaving until after some hacker manages > to break into their installation on the strength of the misbehaving > entry. OK, I'll grant that the probability of (c) is depressingly > close to unity; but each of the other steps seems quite low probability. > All four of them happening in one installation is something I doubt > will happen. I can't argue with (a) or (b), but (d) seems decently likely to me. If your normal user base consists of people who are authorized to access your system, what clues would you have that your HBA is silently failing open? --Jacob
Re: making relfilenodes 56 bits
On Fri, Sep 9, 2022 at 6:02 AM Dilip Kumar wrote: > [ new patch ] +typedef pg_int64 RelFileNumber; This seems really random to me. First, why isn't this an unsigned type? OID is unsigned and I don't see a reason to change to a signed type. But even if we were going to change to a signed type, why pg_int64? That is declared like this: /* Define a signed 64-bit integer type for use in client API declarations. */ typedef PG_INT64_TYPE pg_int64; Surely this is not a client API declaration Note that if we change this a lot of references to INT64_FORMAT will need to become UINT64_FORMAT. I think we should use int64 at the SQL level, because we don't have an unsigned 64-bit SQL type, and a signed 64-bit type can hold 56 bits. So it would still be Int64GetDatum((int64) rd_rel->relfilenode) or similar. But internally I think using unsigned is cleaner. + * RelFileNumber is unique within a cluster. Not really, because of CREATE DATABASE. Probably just drop this line. Or else expand it: we never assign the same RelFileNumber twice within the lifetime of the same cluster, but there can be multiple relations with the same RelFileNumber e.g. because CREATE DATABASE duplicates the RelFileNumber values from the template database. But maybe we don't need this here, as it's already explained in relfilelocator.h. +ret = (int8) (tag->relForkDetails[0] >> BUFTAG_RELNUM_HIGH_BITS); Why not declare ret as ForkNumber instead of casting twice? +uint64 relnum; + +Assert(relnumber <= MAX_RELFILENUMBER); +Assert(forknum <= MAX_FORKNUM); + +relnum = relnumber; Perhaps it'd be better to write uint64 relnum = relnumber instead of initializing on a separate line. +#define RELNUMBERCHARS 20 /* max chars printed by %llu */ Maybe instead of %llu we should say UINT64_FORMAT (or INT64_FORMAT if there's some reason to stick with a signed type). +elog(ERROR, "relfilenumber is out of bound"); It would have to be "out of bounds", with an "s". But maybe "is too large" would be better. 
+nextRelFileNumber = ShmemVariableCache->nextRelFileNumber; +loggedRelFileNumber = ShmemVariableCache->loggedRelFileNumber; +flushedRelFileNumber = ShmemVariableCache->flushedRelFileNumber; Maybe it would be a good idea to assert that next <= flushed and flushed <= logged? +#ifdef USE_ASSERT_CHECKING + +{ +RelFileLocatorBackend rlocator; +char *rpath; Let's add a comment here, like "Because the RelFileNumber counter only ever increases and never wraps around, it should be impossible for the newly-allocated RelFileNumber to already be in use. But, if Asserts are enabled, double check that there's no main-fork relation file with the new RelFileNumber already on disk." +elog(ERROR, "cannot forward RelFileNumber during recovery"); forward -> set (or advance) +if (relnumber >= ShmemVariableCache->loggedRelFileNumber) It probably doesn't make any difference, but to me it seems better to test flushedRelFileNumber rather than loggedRelFileNumber here. What do you think? /* * We set up the lockRelId in case anything tries to lock the dummy - * relation. Note that this is fairly bogus since relNumber may be - * different from the relation's OID. It shouldn't really matter though. - * In recovery, we are running by ourselves and can't have any lock - * conflicts. While syncing, we already hold AccessExclusiveLock. + * relation. Note we are setting relId to just FirstNormalObjectId which + * is completely bogus. It shouldn't really matter though. In recovery, + * we are running by ourselves and can't have any lock conflicts. While + * syncing, we already hold AccessExclusiveLock. */ rel->rd_lockInfo.lockRelId.dbId = rlocator.dbOid; -rel->rd_lockInfo.lockRelId.relId = rlocator.relNumber; +rel->rd_lockInfo.lockRelId.relId = FirstNormalObjectId; Boy, this makes me uncomfortable. The existing logic is pretty bogus, and we're replacing it with some other bogus thing. Do we know whether anything actually does try to use this for locking? 
One notable difference between the existing logic and your change is that, with the existing logic, we use a bogus value that will differ from one relation to the next, whereas with this change, it will always be the same value. Perhaps rel->rd_lockInfo.lockRelId.relId = (Oid) rlocator.relNumber would be a more natural adaptation? +#define CHECK_RELFILENUMBER_RANGE(relfilenumber)\ +do {\ +if ((relfilenumber) < 0 || (relfilenumber) > MAX_RELFILENUMBER) \ +ereport(ERROR, \ +errcode(ERRCODE_INVALID_PARAMETER_VALUE), \ +errmsg("relfilenumber %lld is out of range",\ +(long long) (relfilenumber))); \ +} while (0) Here, you take the approach of casting the relfilenumber to
Re: cataloguing NOT NULL constraints
On Tue, Sep 20, 2022 at 10:39 AM Alvaro Herrera wrote: > That said, the patch I posted for this ~10 years ago used a separate > contype and was simpler than what I ended up with now, but amusingly > enough it was returned at the time with the argument that it would be > better to treat them as normal CHECK constraints; so I want to be very > sure that we're not just going around in circles. I don't have an intrinsic view on whether we ought to have one contype or two, but I think this comment of yours from a few messages ago is right on point: ".. though I'm now wondering if there's additional overhead from checking the constraint twice on each row: first the attnotnull bit, then the CHECK itself. Hmm. That's probably quite bad." For that exact reason, it seems absolutely necessary to be able to somehow identify the "redundant" check constraints, so that we don't pay the expense of revalidating them. Another contype would be one way of identifying such constraints, but it could probably be done in other ways, too. Perhaps it could even be discovered dynamically, like when we build a relcache entry. I actually have no idea what design is best. I am a little confused as to why we want to do this, though. While we're on the topic of what is more complicated and simpler, what functionality do we get out of adding all of these additional catalog entries that then have to be linked back to the corresponding attnotnull markings? And couldn't we get that functionality in some much simpler way? Like, if you want to track whether the NOT NULL constraint has been validated, we could add an attnotnullvalidated column, or probably better, change the existing attnotnull column to a character used as an enum, or maybe an integer bit-field of some kind. 
I'm not trying to blow up your patch with dynamite or anything, but I'm a little suspicious that this may be one of those cases where pgsql-hackers discussion turns a complicated project into an even more complicated project to no real benefit. One thing that I don't particularly like about this whole design is that it feels like it creates a bunch of catalog bloat. Now all of the attnotnull flags also generate additional pg_constraint rows. The catalogs in the default install will be bigger than before, and the catalogs after user tables are created will be bigger still. If we get some nifty benefit out of all that, cool! But if we're just anti-optimizing the catalog storage out of some feeling that the result will be intellectually purer than some competing design, maybe we should reconsider. It's not stupid to optimize for common special cases, and marking a column as NOT NULL is probably at least one and maybe several orders of magnitude more common than putting some other CHECK constraint on it. -- Robert Haas EDB: http://www.enterprisedb.com
default sorting behavior for index
Hi, I was looking at this check in src/backend/parser/parse_utilcmd.c w.r.t. constraints: if (indclass->values[i] != defopclass || attform->attcollation != index_rel->rd_indcollation[i] || attoptions != (Datum) 0 || index_rel->rd_indoption[i] != 0) ereport(ERROR, (errcode(ERRCODE_WRONG_OBJECT_TYPE), errmsg("index \"%s\" column number %d does not have default sorting behavior", index_name, i + 1), errdetail("Cannot create a primary key or unique constraint using such an index."), It seems this first came in via `Indexes with INCLUDE columns and their support in B-tree`. If the index has DESC sorting order, why can't it be used to back a constraint? A concrete example would help me understand this. Thanks
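A minimal sketch of the behavior being asked about (assuming a live PostgreSQL session; the error text is the parse_utilcmd.c message quoted above):

```sql
CREATE TABLE t (id int NOT NULL);
CREATE UNIQUE INDEX t_id_desc_idx ON t (id DESC);

-- A non-default sort ordering makes the index ineligible to back a constraint:
ALTER TABLE t ADD CONSTRAINT t_pkey PRIMARY KEY USING INDEX t_id_desc_idx;
-- ERROR:  index "t_id_desc_idx" column number 1 does not have default sorting behavior
-- DETAIL:  Cannot create a primary key or unique constraint using such an index.
```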
Re: A question about wording in messages
On 2022-Sep-14, Kyotaro Horiguchi wrote: > At Tue, 13 Sep 2022 22:38:46 -0400, Tom Lane wrote in > > Kyotaro Horiguchi writes: > > > I saw the following message recently modified. > > >> This controls the maximum distance we can read ahead in the WAL to > > >> prefetch referenced data blocks. > > > Maybe the "we" means "PostgreSQL program and you" but I see it > > > somewhat out of place. > > > > +1, I saw that today and thought it was outside our usual style. > > The whole thing is awfully verbose for a GUC description, too. > > Maybe > > > > "Maximum distance to read ahead in WAL to prefetch data blocks." I failed to notice this issue. I agree it's unusual and +1 for changing it. > It seems to sufficiently work for average users and rather easy to > read, but it looks a short description. > So, taking the middle of them, how about the following? > > Short: Buffer size for reading ahead in the WAL during recovery. > Extra: This controls the maximum distance to read ahead in WAL to prefetch > data blocks." But why do we care that it's short? We don't need it to be long .. we only need it to explain what it needs to explain. After spending way too much time editing this line, I ended up with exactly what Tom proposed, so +1 for his version. I think "This controls" adds nothing very useful, and we don't have it anywhere else, except tcp_keepalives_count from where I also propose to remove it. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "Hay quien adquiere la mala costumbre de ser infeliz" (M. A. 
Evans) >From 5b8bf15ed5d9f1a21150039da33a557f027640d4 Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Tue, 20 Sep 2022 18:19:34 +0200 Subject: [PATCH] fix some GUC strings --- src/backend/utils/misc/guc_tables.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c index 550e95056c..ab3140ff3a 100644 --- a/src/backend/utils/misc/guc_tables.c +++ b/src/backend/utils/misc/guc_tables.c @@ -2591,7 +2591,7 @@ struct config_int ConfigureNamesInt[] = { {"wal_decode_buffer_size", PGC_POSTMASTER, WAL_RECOVERY, gettext_noop("Buffer size for reading ahead in the WAL during recovery."), - gettext_noop("This controls the maximum distance we can read ahead in the WAL to prefetch referenced data blocks."), + gettext_noop("Maximum distance to read ahead in the WAL to prefetch referenced data blocks."), GUC_UNIT_BYTE }, &wal_decode_buffer_size, @@ -3248,7 +3248,7 @@ struct config_int ConfigureNamesInt[] = { {"tcp_keepalives_count", PGC_USERSET, CONN_AUTH_TCP, gettext_noop("Maximum number of TCP keepalive retransmits."), - gettext_noop("This controls the number of consecutive keepalive retransmits that can be " + gettext_noop("Number of consecutive keepalive retransmits that can be " "lost before a connection is considered dead. A value of 0 uses the " "system default."), }, -- 2.30.2
Re: Tree-walker callbacks vs -Wdeprecated-non-prototype
I wrote: > (That verbiage is from the gcc manual; clang seems to act the same > except that -Wcast-function-type is selected by -Wall, or perhaps is > even on by default.) Nah, scratch that: the reason -Wcast-function-type is on is that we explicitly enable it, and have done so since de8feb1f3 (v14). I did not happen to see this warning with gcc because the test runs I made with this patch already had c35ba141d, whereas I did my clang test on another machine that wasn't quite up to HEAD. So we should have good warning coverage for bogus walker signatures on both compilers. regards, tom lane
Re: [PATCH]Feature improvement for MERGE tab completion
On 2022-Sep-18, Fujii Masao wrote: > The tab-completion code for MERGE was added in the middle of that for LOCK > TABLE. > This would be an oversight of the commit that originally supported > tab-completion > for MERGE. I fixed this issue. Argh, thanks. > "MERGE" was tab-completed with just after "EXPLAIN" or "EXPLAIN ANALYZE", etc. > Since "INTO" always follows "MERGE", it's better to complete with "MERGE INTO" > there. I replaced "MERGE" with "MERGE INTO" in those tab-completions. OK, that would be similar to REFRESH MATERIALIZED VIEW. The rules starting at line 4111 make me a bit nervous, since nowhere we're restricting them to operating only on MERGE lines. I don't think it's a real problem since USING is not terribly common anyway. Likewise for the ones with WHEN [NOT] MATCHED. I kinda wish we had a way to search for stuff like "keyword MERGE appears earlier in the command", but we don't have that. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "I'm always right, but sometimes I'm more right than other times." (Linus Torvalds)
Re: Proposal to use JSON for Postgres Parser format
On 2022-Sep-20, Matthias van de Meent wrote: > Allow me to add: compressability > > In the thread surrounding [0] there were complaints about the size of > catalogs, and specifically the template database. Significant parts of > that (688kB of 8080kB a fresh PG14 database) are in pg_rewrite, which > consists mostly of serialized Nodes. If we're going to replace our > current NodeToText infrastructure, we'd better know we can effectively > compress this data. True. Currently, the largest ev_action values compress pretty well. I think if we wanted this to be more succinct, we would have to invent some binary format -- perhaps something like Protocol Buffers: it'd be stored in the binary format in catalogs, but for output it would be converted into something easy to read (we already do this for pg_statistic_ext_data for example). We'd probably lose compressibility, but that'd be okay because the binary format would already remove most of the redundancy by nature. Do we want to go there? -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "Java is clearly an example of money oriented programming" (A. Stepanov)
Re: why can't a table be part of the same publication as its schema
> On Sep 19, 2022, at 8:03 PM, Jonathan S. Katz wrote: > > "When a partitioned table is added to a publication, all of its existing and > future partitions are implicitly considered to be part of the > publication."[10] > > Additionally, this is the behavior that is already present in "FOR ALL > TABLES": > > "Marks the publication as one that replicates changes for all tables in the > database, including tables created in the future."[10] > > I don't think we should change this behavior that's already in logical > replication. The existing behavior in logical replication doesn't have any "IN SCHEMA" qualifiers. > While I understand the reasons why "GRANT ... ALL TABLES IN SCHEMA" has a > different behavior (i.e. it's not applied to future objects) and do not > advocate to change it, I have personally been affected where I thought a > permission would be applied to all future objects, only to discover > otherwise. I believe it's more intuitive to think that "ALL" applies to > "everything, always." The conversation is focusing on what "ALL TABLES" means, but the ambiguous part is what "IN SCHEMA" means. In GRANT it means "currently in schema, computed now." We are about to create confusion by adding the "IN SCHEMA" phrase to publication commands meaning "later in schema, computed then." A user who diligently consults the documentation for one command to discover what "IN SCHEMA" means may fairly, but wrongly, assume it means the same thing in another command. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: RFC: Logging plan of the running query
On 2022-09-19 17:47, Алена Рыбакина wrote: Thanks for your review and comments! Hi, I'm sorry if this message duplicates the previous one, but I'm not sure the previous message was sent correctly. I sent it from the email address a.rybak...@postgrespro.ru and I couldn't send this one from that address. I've successfully received your mail from both a.rybak...@postgrespro.ru and lena.riback...@yandex.ru. I like the idea of creating a patch for logging the query plan. After reviewing this code I noticed some points and would like to ask you some questions. Firstly, I suggest some editing of the commit message. I think it turned out more laconic and just as clear. I wrote it below since I can't think of any other way to add it. ``` Currently, we have to wait for the query execution to finish to check its plan. This is not so convenient when investigating long-running queries on production environments where we cannot use debuggers. To improve this situation, this patch proposes a pg_log_query_plan() function which requests logging the plan of the specified backend process. By default, only superusers are allowed to request logging of the plan; otherwise, allowing any user to issue this request could cause lots of log messages and could lead to denial of service. At the next CHECK_FOR_INTERRUPTS(), the target backend logs its plan at LOG_SERVER_ONLY level, so this plan will appear in the server log only and will not be sent to the client. ``` Thanks, I have incorporated your comments. Since the latter part of the original message comes from the commit message of pg_log_backend_memory_contexts() (43620e328617c), I left it as it was for consistency. Secondly, I have a question about deleting USE_ASSERT_CHECKING in lock.h. It was supposed to be checked in other places of the code by matching values.
I worry about errors being skipped, due to not testing with the assert option, in the places where it (GetLockMethodLocalHash) participates, and that we won't be able to get a core file in segfault cases. I might not understand something; could you please explain it to me? Since GetLockMethodLocalHash() was only used for assertions, it was previously defined only when USE_ASSERT_CHECKING is enabled. This patch uses GetLockMethodLocalHash() not only for the assertion purpose, so I removed "ifdef USE_ASSERT_CHECKING" for this function. I believe it does not lead to skipped errors. Thirdly, I don't understand why save_ActiveQueryDesc is declared in pquery.h. It seems to me that save_ActiveQueryDesc is used only once, in the ExecutorRun function, so its declaration there is superfluous. I added it in the attached patch. Exactly. Fourthly, it seems to me there are not enough explanatory comments in the code. I also added them in the attached patch. Thanks! | + /* | +* Save value of ActiveQueryDesc before having called | +* ExecutorRun_hook function due to having reset by | +* AbortSubTransaction. | +*/ | + | save_ActiveQueryDesc = ActiveQueryDesc; | ActiveQueryDesc = queryDesc; | | @@ -314,6 +320,7 @@ ExecutorRun(QueryDesc *queryDesc, | else | standard_ExecutorRun(queryDesc, direction, count, execute_once); | | + /* We set the actual value of ActiveQueryDesc */ | ActiveQueryDesc = save_ActiveQueryDesc; Since these processes are needed for nested queries, not only for AbortSubTransaction[1], I added comments on it. | +/* Function to handle the signal to output the query plan. */ | extern void HandleLogQueryPlanInterrupt(void); I feel this comment is unnecessary since the explanation of HandleLogQueryPlanInterrupt() is written in explain.c and no functions in explain.h have comments. Lastly, I don't understand the handling of signals, since they haven't been used this way before. Could another signal prevent this signal from logging the query plan?
I noticed this signal is checked last in the procsignal_sigusr1_handler function. Are you concerned that one signal will not be processed when multiple signals are sent in succession? AFAIU both of them are processed since SendProcSignal flags ps_signalFlags for each signal. ``` SendProcSignal(pid_t pid, ProcSignalReason reason, BackendId backendId) { volatile ProcSignalSlot *slot; ...(snip)... 278 if (slot->pss_pid == pid) 279 { 280 /* Atomically set the proper flag */ 281 slot->pss_signalFlags[reason] = true; 282 /* Send signal */ 283 return kill(pid, SIGUSR1); ``` The comments on ProcSignalReason also say 'We can cope with concurrent signals for different reasons'. ```C /* * Reasons for signaling a Postgres child process (a backend or an auxiliary * process, like checkpointer). We can cope with concurrent signals for different * reasons. However, if the same reason is signaled multiple times in quick * succession, the process is likely to observe only one notification of
Re: cataloguing NOT NULL constraints
On 2022-Sep-20, Isaac Morland wrote: > On Tue, 20 Sept 2022 at 06:56, Alvaro Herrera > wrote: > > > .. though I'm now wondering if there's additional overhead from checking > > the constraint twice on each row: first the attnotnull bit, then the > > CHECK itself. Hmm. That's probably quite bad. > > Another reason to treat NOT NULL-implementing constraints differently. Yeah. > My thinking is that pg_constraint entries for NOT NULL columns are mostly > an implementation detail. I've certainly never cared whether I had an > actual constraint corresponding to my NOT NULL columns. Naturally, all catalog entries are implementation details; a user never really cares if an entry exists or not, only that the desired semantics are provided. In this case, we want the constraint row because it gives us some additional features, such as the ability to mark NOT NULL constraints NOT VALID and validate them later, which is a useful thing to do in large production databases. We have some hacks to provide part of that functionality using straight CHECK constraints, but you cannot cleanly get the `attnotnull` flag set for a column (which means it's hard to add a primary key, for example). It is also supposed to fix some inconsistencies, such as disallowing the removal of a constraint on a table when it is implied by a constraint on an ancestor table. Right now we have ad-hoc protections for partitions, but we don't do that for legacy inheritance. That said, the patch I posted for this ~10 years ago used a separate contype and was simpler than what I ended up with now, but amusingly enough it was returned at the time with the argument that it would be better to treat them as normal CHECK constraints; so I want to be very sure that we're not just going around in circles. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
Re: making relfilenodes 56 bits
On Fri, Sep 9, 2022 at 3:32 PM Dilip Kumar wrote: > > On Thu, Sep 8, 2022 at 4:10 PM Dilip Kumar wrote: > > > On a separate note, while reviewing the latest patch I see there is some > > risk of using the unflushed relfilenumber in GetNewRelFileNumber() > > function. Basically, in the current code, the flushing logic is tightly > > coupled with the logging new relfilenumber logic and that might not work > > with all the values of the VAR_RELNUMBER_NEW_XLOG_THRESHOLD. So the idea > > is we need to keep the flushing logic separate from the logging, I am > > working on the idea and I will post the patch soon. > > I have fixed the issue, so now we will track nextRelFileNumber, > loggedRelFileNumber and flushedRelFileNumber. So whenever > nextRelFileNumber is just VAR_RELNUMBER_NEW_XLOG_THRESHOLD behind the > loggedRelFileNumber we will log VAR_RELNUMBER_PER_XLOG more > relfilenumbers. And whenever nextRelFileNumber reaches the > flushedRelFileNumber then we will do XlogFlush for WAL upto the last > loggedRelFileNumber. Ideally flushedRelFileNumber should always be > VAR_RELNUMBER_PER_XLOG number behind the loggedRelFileNumber so we can > avoid tracking the flushedRelFileNumber. But I feel keeping track of > the flushedRelFileNumber looks cleaner and easier to understand. For > more details refer to the code in GetNewRelFileNumber(). > Here are a few minor suggestions I came across while reading this patch that might be useful: +#ifdef USE_ASSERT_CHECKING + + { Unnecessary space after USE_ASSERT_CHECKING. -- + return InvalidRelFileNumber;/* placate compiler */ I don't think we need this after the error, on the latest branches.
-- + LWLockAcquire(RelFileNumberGenLock, LW_SHARED); + if (shutdown) + checkPoint.nextRelFileNumber = ShmemVariableCache->nextRelFileNumber; + else + checkPoint.nextRelFileNumber = ShmemVariableCache->loggedRelFileNumber; + + LWLockRelease(RelFileNumberGenLock); This is done for a good reason, I think; it should have a comment describing why the checkPoint.nextRelFileNumber assignment differs, from a crash-recovery perspective. -- +#define SizeOfRelFileLocatorBackend \ + (offsetof(RelFileLocatorBackend, backend) + sizeof(BackendId)) Could we append empty parentheses "()" to the macro name, so that it looks like a function call at the point of use, or change the macro name to uppercase? -- + if (val < 0 || val > MAX_RELFILENUMBER) .. if ((relfilenumber) < 0 || (relfilenumber) > MAX_RELFILENUMBER) \ How about adding a macro for this condition as RelFileNumberIsValid()? We can replace all the checks referring to MAX_RELFILENUMBER with this. Regards, Amul
Re: cataloguing NOT NULL constraints
On Tue, 20 Sept 2022 at 06:56, Alvaro Herrera wrote: The NULL checks would still be mostly done by the attnotnull checks > internally, so there shouldn't be too much of a difference. > > .. though I'm now wondering if there's additional overhead from checking > the constraint twice on each row: first the attnotnull bit, then the > CHECK itself. Hmm. That's probably quite bad. > Another reason to treat NOT NULL-implementing constraints differently. My thinking is that pg_constraint entries for NOT NULL columns are mostly an implementation detail. I've certainly never cared whether I had an actual constraint corresponding to my NOT NULL columns. So I think marking them as such, or a different contype, and excluding them from \d+ display, probably makes sense. Just need to deal with the issue of trying to create a constraint and having its name conflict with a NOT NULL constraint. Could it work to reserve [field name]_notnull for NOT NULL-implementing constraints? I'd be worried about what happens with field renames; renaming the constraint automatically seems a bit weird, but maybe…
Re: On login trigger: take three
> On 20 Sep 2022, at 15:43, Sergey Shinderuk wrote: > > On 02.09.2022 18:36, Daniel Gustafsson wrote: >> This had bitrotted a fair bit, attached is a rebase along with (mostly) >> documentation fixes. 0001 adds a generic GUC for ignoring event triggers and >> 0002 adds the login event with event trigger support, and hooks it up to the >> GUC such that broken triggers won't require single-user mode. Moving the CF >> entry back to Needs Review. > There is a race around setting and clearing of dathasloginevt. Thanks for the report! The whole dathasloginevt mechanism is IMO too clunky and inelegant to go ahead with; I wouldn't be surprised if there are other bugs lurking there. Since the original authors seem to have retired from the patch (I've only rebased it to try and help) I am inclined to mark it as returned with feedback. -- Daniel Gustafsson https://vmware.com/
Re: why can't a table be part of the same publication as its schema
On 2022-Sep-13, Kyotaro Horiguchi wrote: > At Mon, 12 Sep 2022 04:26:48 +, "houzj.f...@fujitsu.com" > wrote in > > I feel we'd better compare the syntax with the existing publication command: > > FOR ALL TABLES. If you create a publication FOR ALL TABLES, it means > > publishing > > all the tables in the database *including* tables created in the future. I > > think both the syntax and meaning of ALL TABLES IN SCHEMA are consistent > > with > > the existing FOR ALL TABLES. > > IMHO, I feel closer to Robert. "ALL TABLES IN SCHEMA" sounds like the > concrete tables at the time of invocation. While I agree that it is > not directly comparable to GRANT, What if we remove the ALL keyword from there? That would leave us with "FOR TABLES IN SCHEMA", which seems to better convey that it doesn't restrict to current tables in there. > but if I see "ALTER PUBLICATION p1 ADD SCHEMA s1", I automatically > translate that into "all tables in the schema s1 at the time of using > this publication". ... but that translation is wrong if replication supports other kinds of objects, as it inevitably will in the near future. Clearly the fact that we spell out TABLES there is important. When we add support for sequences, we could have combinations ADD [ALL] TABLES IN SCHEMA s ADD [ALL] SEQUENCES IN SCHEMA s ADD [ALL] TABLES AND SEQUENCES IN SCHEMA s and at that point, the unadorned ADD SCHEMA one will become ambiguous. > At least, it would cause less confusion when it were "ALT PUB p1 DROP > SCEMA s1" aginst "DROP ALL TABLES IN SCHEMA s1". I'm not sure what you mean here. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "En las profundidades de nuestro inconsciente hay una obsesiva necesidad de un universo lógico y coherente. Pero el universo real se halla siempre un paso más allá de la lógica" (Irulan)
Re: On login trigger: take three
On 02.09.2022 18:36, Daniel Gustafsson wrote: This had bitrotted a fair bit, attached is a rebase along with (mostly) documentation fixes. 0001 adds a generic GUC for ignoring event triggers and 0002 adds the login event with event trigger support, and hooks it up to the GUC such that broken triggers won't require single-user mode. Moving the CF entry back to Needs Review. Hello! There is a race around setting and clearing of dathasloginevt. Steps to reproduce: 1. Create a trigger: CREATE FUNCTION on_login_proc() RETURNS event_trigger AS $$ BEGIN RAISE NOTICE 'You are welcome!'; END; $$ LANGUAGE plpgsql; CREATE EVENT TRIGGER on_login_trigger ON login EXECUTE PROCEDURE on_login_proc(); 2. Then drop it, but don't start new sessions: DROP EVENT TRIGGER on_login_trigger; 3. Create another trigger, but don't commit yet: BEGIN; CREATE EVENT TRIGGER on_login_trigger ON login EXECUTE PROCEDURE on_login_proc(); 4. Start a new session. This clears dathasloginevt. 5. Commit the transaction: COMMIT; Now we have a trigger, but it doesn't fire, because dathasloginevt=false. If two sessions create triggers concurrently, one of them will fail. Steps: 1. In the first session, start a transaction and create a trigger: BEGIN; CREATE EVENT TRIGGER on_login_trigger1 ON login EXECUTE PROCEDURE on_login_proc(); 2. In the second session, create another trigger (this query blocks): CREATE EVENT TRIGGER on_login_trigger2 ON login EXECUTE PROCEDURE on_login_proc(); 3. Commit in the first session: COMMIT; The second session fails: postgres=# CREATE EVENT TRIGGER on_login_trigger2 ON login EXECUTE PROCEDURE on_login_proc(); ERROR: tuple concurrently updated What else bothers me is that login triggers execute in an environment under user control and one has to be very careful. The example trigger from the documentation +DECLARE + hour integer = EXTRACT('hour' FROM current_time); + rec boolean; +BEGIN +-- 1. Forbid logging in between 2AM and 4AM. 
+IF hour BETWEEN 2 AND 4 THEN + RAISE EXCEPTION 'Login forbidden'; +END IF; can be bypassed with PGOPTIONS='-c timezone=...'. Probably this is nothing new and concerns any SECURITY DEFINER function, but still... Best regards, -- Sergey Shinderuk https://postgrespro.com/
Re: why can't a table be part of the same publication as its schema
On Mon, Sep 19, 2022 at 11:03 PM Jonathan S. Katz wrote: > For #1 (allowing calls that have schema/table overlap...), there appears > to be both a patch that allows this (reversing[8]), and a suggestion for > dealing with a corner-case that is reasonable, i.e. disallowing adding > schemas to a publication when specifying column-lists. Do we think we > can have consensus on this prior to the RC1 freeze? I am not sure whether we can or should rush a fix in that fast, but I agree with this direction. > For #2 ("ALL TABLES IN SCHEMA" syntax), this was heavily discussed on > the original thread[1][3][4][5][7]. I thought Tom's proposal on the > syntax[3] was reasonable as it "future proofs" for when we allow other > schema-scoped objects to be published and give control over which ones > can be published. All right, well, I still don't like it and think it's confusing, but perhaps I'm in the minority. > I don't think we should change this behavior that's already in logical > replication. While I understand the reasons why "GRANT ... ALL TABLES IN > SCHEMA" has a different behavior (i.e. it's not applied to future > objects) and do not advocate to change it, I have personally been > affected where I thought a permission would be applied to all future > objects, only to discover otherwise. I believe it's more intuitive to > think that "ALL" applies to "everything, always." Nah, there's room for multiple behaviors here. It's reasonable to want to add all the tables currently in the schema to a publication (or grant permissions on them) and it's reasonable to want to include all current and future tables in the schema in a publication (or grant permissions on them) too. The reason I don't like the ALL TABLES IN SCHEMA syntax is that it sounds like the former, but actually is the latter. Based on your link to the email from Tom, I understand now the reason why it's like that, but it's still counterintuitive to me. 
> For #3 (more superuser-only), in general I do agree that we shouldn't be > adding more of these. However, we have in this release, and not just to > this feature. ALTER SUBSCRIPTION ... SKIP[11] requires superuser. I > think it's easier for us to "relax" privileges (e.g. w/predefined roles) > than to make something "superuser-only" in the future, so I don't see > this being a blocker for v15. The feature will continue to work for > users even if we remove "superuser-only" in the future. Yeah, this is clearly not a release blocker, I think. -- Robert Haas EDB: http://www.enterprisedb.com
Re: HOT chain validation in verify_heapam()
On Tue, Sep 20, 2022 at 5:00 AM Himanshu Upadhyaya wrote: > Please find it attached. This patch still has no test cases. Just as we have test cases for the existing corruption checks, we should have test cases for these new corruption checks, showing cases where they actually fire. I think I would be inclined to set lp_valid[x] = true in both the redirected and non-redirected case, and then have the very first thing that the second loop does be if (nextoffnum == 0 || !lp_valid[ctx.offnum]) continue. I think that would be more clear about the intent to ignore line pointers that failed validation. Also, if you did it that way, then you could catch the case of a redirected line pointer pointing to another redirected line pointer, which is a corruption condition that the current code does not appear to check. +/* + * Validation via the predecessor array. 1) If the predecessor's + * xmin is aborted or in progress, the current tuples xmin should + * be aborted or in progress respectively. Also both xmin's must + * be equal. 2) If the predecessor's xmin is not frozen, then + * current tuple's shouldn't be either. 3) If the predecessor's + * xmin is equal to the current tuple's xmin, the current tuple's + * cmin should be greater than the predecessor's cmin. 4) If the + * current tuple is not HOT then its predecessor's tuple must not + * be HEAP_HOT_UPDATED. 5) If the current tuple is HOT then its + * predecessor's tuple must be HEAP_HOT_UPDATED. + */ This comment needs to be split up into pieces and the pieces need to be moved closer to the tests to which they correspond. + psprintf("unfrozen tuple was updated to produce a tuple at offset %u which is not frozen", Shouldn't this say "which is frozen"? + * Not a corruption if current tuple is updated/deleted by a + * different transaction, means t_cid will point to cmax (that is + * command id of deleting transaction) and cid of predecessor not + * necessarily will be smaller than cid of current tuple. 
t_cid I think that the next person who reads this code is likely to understand that the CIDs of different transactions are numerically unrelated. What's less obvious is that if the XID is the same, the newer update must have a higher CID. + * can hold combo command id but we are not worrying here since + * combo command id of the next updated tuple (if present) must be + * greater than combo command id of the current tuple. So here we + * are not checking HEAP_COMBOCID flag and simply doing t_cid + * comparison. I disapprove of ignoring the HEAP_COMBOCID flag. Emitting a message claiming that the CID has a certain value when that's actually a combo CID is misleading, so at least a different message wording is needed in such cases. But it's also not clear to me that the newer update has to have a higher combo CID, because combo CIDs can be reused. If you have multiple cursors open in the same transaction, the updates can be interleaved, and it seems to me that it might be possible for an older CID to have created a certain combo CID after a newer CID, and then both cursors could update the same page in succession and end up with combo CIDs out of numerical order. Unless somebody can make a convincing argument that this isn't possible, I think we should just skip this check for cases where either tuple has a combo CID. +if (TransactionIdEquals(pred_xmin, curr_xmin) && +(TransactionIdEquals(curr_xmin, curr_xmax) || + !TransactionIdIsValid(curr_xmax)) && pred_cmin >= curr_cmin) I don't understand the reason for the middle part of this condition -- TransactionIdEquals(curr_xmin, curr_xmax) || !TransactionIdIsValid(curr_xmax). I suppose the comment is meant to explain this, but I still don't get it. If a tuple with XMIN 12345 CMIN 2 is updated to produce a tuple with XMIN 12345 CMIN 1, that's corruption, regardless of what the XMAX of the second tuple may happen to be. 
+if (HeapTupleHeaderIsHeapOnly(curr_htup) && +!HeapTupleHeaderIsHotUpdated(pred_htup)) +if (!HeapTupleHeaderIsHeapOnly(curr_htup) && +HeapTupleHeaderIsHotUpdated(pred_htup)) I think it would be slightly clearer to write these tests the other way around i.e. check the previous tuple's state first. +if (!TransactionIdIsValid(curr_xmax) && HeapTupleHeaderIsHotUpdated(tuphdr)) +{ +report_corruption(ctx, + psprintf("tuple has been updated, but xmax is 0")); +result = false; +} I guess this message needs to say "tuple has been HOT updated, but xmax is 0" or something like that. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Summary function for pg_buffercache
Hi, On Sep 20, 2022, 20:49 +0800, Melih Mutlu , wrote: > Hi Zhang, > > Those are two different locks. > The locks that are taken in the patch are for buffer headers. This locks only > the current buffer and makes that particular buffer's info consistent within > itself. > > However, the lock mentioned in the doc is for buffer manager which would > prevent changes on any buffer if it's held. > pg_buffercache_summary (and pg_buffercache_pages) does not hold buffer > manager lock. Therefore, consistency across all buffers is not guaranteed. > > For pg_buffercache_pages, self-consistent buffer information is useful since > it shows each buffer separately. > > For pg_buffercache_summary, even self-consistency may not matter much since > results are aggregated and we can't see individual buffer information. > Consistency across all buffers is also not a concern since its purpose is to > give an overall idea about the state of buffers. > > I see that these two different locks in the same context can be confusing. I > hope it is a bit more clear now. > > Best, > Melih Thanks for your explanation, LGTM.
Re: Summary function for pg_buffercache
Hi Zhang, Those are two different locks. The locks that are taken in the patch are for buffer headers. This locks only the current buffer and makes that particular buffer's info consistent within itself. However, the lock mentioned in the doc is for buffer manager which would prevent changes on any buffer if it's held. pg_buffercache_summary (and pg_buffercache_pages) does not hold buffer manager lock. Therefore, consistency across all buffers is not guaranteed. For pg_buffercache_pages, self-consistent buffer information is useful since it shows each buffer separately. For pg_buffercache_summary, even self-consistency may not matter much since results are aggregated and we can't see individual buffer information. Consistency across all buffers is also not a concern since its purpose is to give an overall idea about the state of buffers. I see that these two different locks in the same context can be confusing. I hope it is a bit more clear now. Best, Melih >
Re: Summary function for pg_buffercache
Hi, Regards, Zhang Mingli On Sep 20, 2022, 20:43 +0800, Aleksander Alekseev , wrote: > > Correct, the procedure doesn't take the locks of the buffer manager. > It does take the locks of every individual buffer. Ah, now I get it, thanks.
Re: Summary function for pg_buffercache
Hi Zhang, > The doc says we don’t take lock during pg_buffercache_summary, but I see > locks in the v8 patch, Isn’t it? > > ``` > Similar to pg_buffercache_pages function > pg_buffercache_summary doesn't take buffer manager > locks [...] > ``` Correct, the procedure doesn't take the locks of the buffer manager. It does take the locks of every individual buffer. I agree that the text is somewhat confusing, but it is consistent with the current description of pg_buffercache [1]. I think this is a problem worth addressing but it also seems to be out of scope of the proposed patch. [1]: https://www.postgresql.org/docs/current/pgbuffercache.html -- Best regards, Aleksander Alekseev
Re: Summary function for pg_buffercache
Hi, Correct me if I’m wrong. The doc says we don’t take lock during pg_buffercache_summary, but I see locks in the v8 patch, Isn’t it? ``` Similar to pg_buffercache_pages function pg_buffercache_summary doesn't take buffer manager locks, thus the result is not consistent across all buffers. This is intentional. The purpose of this function is to provide a general idea about the state of shared buffers as fast as possible. Additionally, pg_buffercache_summary allocates much less memory. ``` Regards, Zhang Mingli On Sep 20, 2022, 20:10 +0800, Melih Mutlu , wrote: > Hi, > > Seems like cfbot tests are passing now: > https://cirrus-ci.com/build/4727923671302144 > > Best, > Melih > > > Melih Mutlu , 20 Eyl 2022 Sal, 14:00 tarihinde şunu > > yazdı: > > > Aleksander Alekseev , 20 Eyl 2022 Sal, 13:57 > > > tarihinde şunu yazdı: > > > > > There was a missing empty line in pg_buffercache.out which made the > > > > > tests fail. Here is a corrected v8 patch. > > > > > > > > I was just sending a corrected patch without the missing line. > > > > > > > > Thanks a lot for all these reviews and the corrected patch. > > > > > > > > Best, > > > > Melih
Re: Add SPLIT PARTITION/MERGE PARTITIONS commands
On Mon, Sep 19, 2022 at 4:42 PM Dmitry Koval wrote: > Thanks for comments and advice! > I thought about this problem and discussed it with colleagues. > Unfortunately, I don't know of a good general solution. Yeah, me neither. > But for a specific situation like this (a certain partition is not changing) > we can add a CONCURRENTLY modifier. > Our DDL query can be like > > ALTER TABLE...SPLIT PARTITION [CONCURRENTLY]; > > With the CONCURRENTLY modifier we can lock the partitioned table in > ShareUpdateExclusiveLock mode and the split partition in > AccessExclusiveLock mode. So we don't lock the partitioned table in > AccessExclusiveLock mode and can modify other partitions during the SPLIT > operation (except the split partition). > If somebody tries to modify the split partition, they will receive the error > "relation does not exist" at the end of the operation (because the split partition will be dropped). I think that a built-in DDL command can't really assume that the user won't modify anything. You'd have to take a ShareLock. But you might be able to have a CONCURRENTLY variant of the command that does the same kind of multi-transaction thing as, e.g., CREATE INDEX CONCURRENTLY. You would probably have to be quite careful about race conditions (e.g. you commit the first transaction and before you start the second one, someone drops or detaches the partition you were planning to merge or split). Might take some thought, but feels possibly doable. I've never been excited enough about this kind of thing to want to put a lot of energy into engineering it, because doing it "manually" feels so much nicer to me, and doubly so given that we now have ATTACH CONCURRENTLY and DETACH CONCURRENTLY, but it does seem like a thing some people would probably use and value. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Add LSN along with offset to error messages reported for WAL file read/write/validate header failures
On Tue, Sep 20, 2022 at 12:57 PM Alvaro Herrera wrote: > > On 2022-Sep-19, Bharath Rupireddy wrote: > > > We have a bunch of messages [1] that have an offset, but not LSN in > > the error message. Firstly, is there an easiest way to figure out LSN > > from offset reported in the error messages? If not, is adding LSN to > > these messages along with offset a good idea? Of course, we can't just > > convert offset to LSN using XLogSegNoOffsetToRecPtr() and report, but > > something meaningful like reporting the LSN of the page that we are > > reading-in or writing-out etc. > > Maybe add errcontext() somewhere that reports the LSN would be > appropriate. For example, the page_read() callbacks have the LSN > readily available, so the ones in backend could install the errcontext > callback; or perhaps ReadPageInternal can do it #ifndef FRONTEND. Not > sure what is best of those options, but either of those sounds better > than sticking the LSN in a lower-level routine that doesn't necessarily > have the info already. All of the error messages [1] have the LSN from which offset was calculated, I think we can just append that to the error messages (something like " offset %u, LSN %X/%X: %m") and not complicate it. Thoughts? 
[1] errmsg("could not read from WAL segment %s, offset %u: %m", errmsg("could not read from WAL segment %s, offset %u: %m", errmsg("could not write to log file %s " "at offset %u, length %zu: %m", errmsg("unexpected timeline ID %u in WAL segment %s, offset %u", errmsg("could not read from WAL segment %s, offset %u: read %d of %zu", pg_log_error("received write-ahead log record for offset %u with no file open", "invalid magic number %04X in WAL segment %s, offset %u", "invalid info bits %04X in WAL segment %s, offset %u", "invalid info bits %04X in WAL segment %s, offset %u", "unexpected pageaddr %X/%X in WAL segment %s, offset %u", "out-of-sequence timeline ID %u (after %u) in WAL segment %s, offset %u", -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Summary function for pg_buffercache
Hi, Seems like cfbot tests are passing now: https://cirrus-ci.com/build/4727923671302144 Best, Melih Melih Mutlu , 20 Eyl 2022 Sal, 14:00 tarihinde şunu yazdı: > Aleksander Alekseev , 20 Eyl 2022 Sal, 13:57 > tarihinde şunu yazdı: > >> There was a missing empty line in pg_buffercache.out which made the >> tests fail. Here is a corrected v8 patch. >> > > I was just sending a corrected patch without the missing line. > > Thanks a lot for all these reviews and the corrected patch. > > Best, > Melih >
Re: Add LSN along with offset to error messages reported for WAL file read/write/validate header failures
On Tue, Sep 20, 2022 at 9:02 AM Nathan Bossart wrote: > > On Mon, Sep 19, 2022 at 03:16:42PM -0700, Nathan Bossart wrote: > > It seems like you want the opposite of pg_walfile_name_offset(). Perhaps > > we could add a function like pg_walfile_offset_lsn() that accepts a WAL > > file name and byte offset and returns the LSN. > > Like so... Yeah, something like this will be handy for sure, but I'm not sure if we want this to be in core. Let's hear from others. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Support pg_attribute_aligned and noreturn in MSVC
On Mon, Sep 19, 2022 at 11:21 PM Michael Paquier wrote:
>
> On Mon, Sep 19, 2022 at 08:51:37PM -0400, James Coleman wrote:
> > Yes, fixed.
>
> The CF bot is failing compilation on Windows:
> http://commitfest.cputube.org/james-coleman.html
> https://api.cirrus-ci.com/v1/task/5376566577332224/logs/build.log
>
> There is something going on with noreturn() after applying it to
> elog.h:
> [01:11:00.468] c:\cirrus\src\include\utils\elog.h(410,45): error C2085:
> 'ThrowErrorData': not in formal parameter list (compiling source file
> src/common/hashfn.c) [c:\cirrus\libpgcommon.vcxproj]
> [01:11:00.468] c:\cirrus\src\include\mb\pg_wchar.h(701,80): error
> C2085: 'pgwin32_message_to_UTF16': not in formal parameter list
> (compiling source file src/common/encnames.c)
> [c:\cirrus\libpgcommon.vcxproj]
> [01:11:00.468] c:\cirrus\src\include\utils\elog.h(411,54): error
> C2085: 'pg_re_throw': not in formal parameter list (compiling source
> file src/common/hashfn.c) [c:\cirrus\libpgcommon.vcxproj]
>
> align() seems to look fine, at least. I'd be tempted to apply the
> align part first, as that's independently useful for itemptr.h.

I don't have access to a Windows machine for testing, but re-reading
the documentation it looks like the issue is that our noreturn macro is
used after the definition while the MSVC equivalent is used before.
I've removed that for now (and commented about it); it's not as
valuable anyway since it's mostly an indicator for code analysis (human
and machine).

James Coleman

v3-0001-Support-pg_attribute_aligned-in-MSVC.patch
Description: Binary data
Re: why can't a table be part of the same publication as its schema
On Tue, Sep 20, 2022 at 2:57 PM Alvaro Herrera wrote:
>
> On 2022-Sep-20, Amit Kapila wrote:
>
> > On Mon, Sep 19, 2022 at 8:46 PM Alvaro Herrera wrote:
> > > This seems a pretty arbitrary restriction. It feels like you're adding
> > > this restriction precisely so that you don't have to write the code to
> > > reject the ALTER .. SET SCHEMA if an incompatible configuration is
> > > detected. But we already have such checks in several cases, so I don't
> > > see why this one does not seem a good idea.
> >
> > I agree that we have such checks at other places as well and one
> > somewhat similar is in ATPrepChangePersistence().
> >
> > ATPrepChangePersistence()
> > {
> > ...
> > ...
> > /*
> >  * Check that the table is not part of any publication when changing to
> >  * UNLOGGED, as UNLOGGED tables can't be published.
> >  */
>
> Right, I think this is a sensible approach.
>
> > However, another angle to look at it is that we try to avoid adding
> > restrictions in other DDL commands for defined publications.
>
> Well, it makes sense to avoid restrictions wherever possible. But here,
> the consequence is that you end up with a restriction in the publication
> definition that is not very sensible. Imagine if you said "you can't
> add schema S because it contains an unlogged table". It's absurd.
>
> Maybe this can be relaxed in a future release, but it's quite odd.

Yeah, we can relax it in a future release based on some field
experience, or maybe we can keep the current restriction of not
allowing to add a table when the schema of the table is part of the
same publication and try to relax that in a future release based on
field experience.

> > The intention was to be in sync with FOR ALL TABLES.
>
> I guess we can change both (FOR ALL TABLES and IN SCHEMA) later.

That sounds reasonable.

--
With Regards,
Amit Kapila.
Re: Refactor backup related code (was: Is it correct to say, "invalid data in file \"%s\"", BACKUP_LABEL_FILE in do_pg_backup_stop?)
On Tue, Sep 20, 2022 at 7:20 AM Michael Paquier wrote:
>
> The main regression test suite should not include direct calls to
> pg_backup_start() or pg_backup_stop() as these depend on wal_level,
> and we spend a certain amount of resources in keeping the tests a
> maximum portable across such configurations, wal_level = minimal being
> one of them. One portable example is in 001_stream_rep.pl.

Understood.

> > That's a good idea. I'm marking a flag if the label name overflows (>
> > MAXPGPATH), later using it in do_pg_backup_start() to error out. We
> > could've thrown error directly, but that changes the behaviour - right
> > now, first "wal_level must be set to \"replica\" or \"logical\" at
> > server start." gets emitted and then label name overflow error - I
> > don't want to do that.
>
> -   if (strlen(backupidstr) > MAXPGPATH)
> +   if (state->name_overflowed == true)
>         ereport(ERROR,
>             (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
>              errmsg("backup label too long (max %d bytes)",
>
> It does not strike me as a huge issue to force a truncation of such
> backup label names. 1024 is large enough for basically all users,
> in my opinion. Or you could just truncate in the internal logic, but
> still complain at MAXPGPATH - 1 as the last byte would be for the zero
> termination. In short, there is no need to complicate things with
> name_overflowed.

We currently allow MAXPGPATH bytes of label name excluding null
termination. I don't want to change this behaviour. In the attached v9
patch, I'm truncating the larger names to MAXPGPATH + 1 bytes in backup
state (one extra byte for representing that the name has overflown, and
another extra byte for null termination).

> +   StringInfo  backup_label;   /* backup_label file contents */
> +   StringInfo  tablespace_map; /* tablespace_map file contents */
> +   StringInfo  history_file;   /* history file contents */
>
> IMV, repeating a point I already made once upthread, BackupState
> should hold none of these.
> Let's just generate the contents of these
> files in the contexts where they are needed, making BackupState
> something to rely on to build them in the code paths where they are
> necessary. This is going to make the reasoning around the memory
> contexts where each one of them is stored much easier and reduce the
> chances of bugs in the long-term.

I've separated out these variables from the backup state, please see
the attached v9 patch.

> > I've also taken help of the error callback mechanism to clean up the
> > allocated memory in case of a failure. For do_pg_abort_backup() cases,
> > I think we don't need to clean the memory because that gets called on
> > proc exit (before_shmem_exit()).
>
> Memory could still bloat while the process running the SQL functions
> is running depending on the error code path, anyway.

I didn't get your point. Can you please elaborate it? I think adding
error callbacks at the right places would free up the memory for us.
Please note that we already are causing memory leaks on HEAD today.

I addressed the above review comments. I also changed a wrong comment
[1] that lies before pg_backup_start() even after the removal of
exclusive backup.

I'm attaching v9 patch set herewith, 0001 - refactors the backup code
with backup state, 0002 - adds error callbacks to clean up the memory
allocated for backup variables. Please review them further.

[1]
 * Essentially what this does is to create a backup label file in $PGDATA,
 * where it will be archived as part of the backup dump. The label file
 * contains the user-supplied label string (typically this would be used
 * to tell where the backup dump will be stored) and the starting time and
 * starting WAL location for the dump.
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases Amazon Web Services: https://aws.amazon.com

From ab1e86ac7fb75a2d2219c7681ead40faf8c01446 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy
Date: Tue, 20 Sep 2022 10:04:29 +
Subject: [PATCH v9] Refactor backup related code

Refactor backup related code; the advantages of doing so are the
following:

1) Backup state is more structured now - all in a single structure;
   callers can create backup_label contents whenever required, either
   during pg_backup_start or pg_backup_stop or in between.
2) No parsing of backup_label file lines now, no error checking for
   invalid parsing.
3) backup_label and history file contents have most things in common;
   they can now be created within a single function.
4) Makes backup related code extensible and readable.

This introduces new source files xlogbackup.c/.h for backup related
code and adds the new code there. The xlog.c file has already grown to
~9000 LOC (as of this time). Eventually, we would want to move all the
backup related code from xlog.c, xlogfuncs.c, and elsewhere to here.

Author: Bharath Rupireddy
Reviewed-by: Michael Paquier
Reviewed-by: Fujii Masao
Discussion: ht
Re: binary version of pg_current_wal_insert_lsn and pg_walfile_name functions
On Mon, Sep 19, 2022 at 8:19 PM Ashutosh Sharma wrote:
>
> Hi All,
>
> Currently, we have pg_current_wal_insert_lsn and pg_walfile_name sql
> functions which give us information about the next wal insert
> location and the WAL file that the next wal insert location belongs
> to. Can we have a binary version of these sql functions?

+1 for the idea in general. As said, pg_waldump seems to be the right
candidate. I think we want the lsn of the last WAL record and its info
and the WAL file name, given an input data directory or just the pg_wal
directory or any directory where WAL files are located. For instance,
one can use this on an archive location containing archived WAL files
or on a node where pg_receivewal is receiving WAL files. Am I missing
any other use-cases?

pg_waldump currently can't understand compressed and partial files. I
think that we need to fix this as well.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Proposal to use JSON for Postgres Parser format
On Tue, 20 Sept 2022 at 12:00, Alexander Korotkov wrote:
> On Tue, Sep 20, 2022 at 7:48 AM Tom Lane wrote:
> > Peter Geoghegan writes:
> > > On Mon, Sep 19, 2022 at 8:39 PM Tom Lane wrote:
> > > > Our existing format is certainly not great on those metrics, but
> > > > I do not see how "let's use JSON!" is a route to improvement.
> > >
> > > The existing format was designed with developer convenience as a goal,
> > > though -- despite my complaints, and in spite of your objections.
> >
> > As Munro adduces nearby, it'd be a stretch to conclude that the current
> > format was designed with any Postgres-related goals in mind at all.
> > I think he's right that it's a variant of some Lisp-y dump format that's
> > probably far hoarier than even Berkeley Postgres.
> >
> > > If it didn't have to be easy (or even practical) for developers to
> > > directly work with the output format, then presumably the format used
> > > internally could be replaced with something lower level and faster. So
> > > it seems like the two goals (developer ergonomics and faster
> > > interchange format for users) might actually be complementary.
> >
> > I think the principal mistake in what we have now is that the storage
> > format is identical to the "developer friendly" text format (plus or
> > minus some whitespace). First we need to separate those. We could
> > have more than one equivalent text format perhaps, and I don't have
> > any strong objection to basing the text format (or one of them) on
> > JSON.
>
> +1 for considering storage format and text format separately.
>
> Let's consider what our criteria could be for the storage format.
>
> 1) Storage effectiveness (shorter is better) and
> serialization/deserialization effectiveness (faster is better). On
> this criterion, the custom binary format looks perfect.
> 2) Robustness in the case of corruption. It seems much easier to
> detect the data corruption and possibly make some partial manual
> recovery for textual format.
> 3) Standardness. It's better to use something known worldwide or at
> least used in other parts of PostgreSQL than something completely
> custom. From this perspective, JSON/JSONB is better than custom
> things.

Allow me to add: compressibility.

In the thread surrounding [0] there were complaints about the size of
catalogs, and specifically the template database. Significant parts of
that (688kB of 8080kB of a fresh PG14 database) are in pg_rewrite,
which consists mostly of serialized Nodes. If we're going to replace
our current NodeToText infrastructure, we'd better know we can
effectively compress this data.

In that same thread, I also suggested that we could try to not emit a
Node's fields if they contain their default values while serializing,
such as the common `:location -1` or `:mynodefield <>`. Those fields
still take up space in the format, while conveying no interesting
information (the absence of that field in the struct definition would
convey the same). It would be useful if this new serialized format
would allow us to do similar tricks cheaply.

As for JSON vs JSONB for storage: I'm fairly certain that JSONB is less
compact than JSON (without taking compression into the picture) due to
the 4-byte guaranteed overhead for each jsonb element, while for JSON
that is only 2 bytes for each (up to 3 when you consider separators,
plus potential extra overhead for escaped values that are unlikely to
appear in our catalogs). Some numbers can be stored more efficiently in
JSONB, but only large numbers and small fractions that we're unlikely
to hit in system views: a back-of-the-envelope calculation puts the
cutoff point of efficient storage between strings-of-decimals and
Numeric at >10^12, < -10^11, or very precise fractional values.

Kind regards,

Matthias van de Meent

[0] https://www.postgresql.org/message-id/CAEze2WgGexDM63dOvndLdAWwA6uSmSsc97jmrCuNmrF1JEDK7w%40mail.gmail.com
Re: pg_create_logical_replication_slot argument incongruency
Thank you!

On Tue, 20 Sep 2022, 12:29, Michael Paquier wrote:

> On Tue, Sep 20, 2022 at 08:41:56AM +0200, Florin Irion wrote:
> > OK, patch only for the docs attached.
>
> Thanks, applied.
> --
> Michael
Re: Patch proposal: make use of regular expressions for the username in pg_hba.conf
Hi,

On 9/20/22 6:30 AM, Michael Paquier wrote:
> On Tue, Sep 20, 2022 at 12:09:33AM -0400, Tom Lane wrote:
> > You have to assume that somebody (a) has a role or DB name starting
> > with slash, (b) has an explicit reference to that name in their
> > pg_hba.conf, (c) doesn't read the release notes, and (d) doesn't
> > notice that things are misbehaving until after some hacker manages
> > to break into their installation on the strength of the misbehaving
> > entry. OK, I'll grant that the probability of (c) is depressingly
> > close to unity; but each of the other steps seems quite low
> > probability. All four of them happening in one installation is
> > something I doubt will happen.
>
> It is the kind of things that could blow up as a CVE and some bad PR
> for the project, so I cannot get excited about enforcing this new rule
> in an authentication file (aka before a role is authenticated) while
> we are talking about 3~4 code paths (?) that would need an extra check
> to make sure that no instances have such object names.

I also have the feeling that having (a), (b) and (d) is low
probability.

That said, if the user "/john" already exists and has an hba entry,
then this entry will still match with the patch. The problem is that
all the users that contain "john" would also now match. But things get
worse if, say, /a is an existing user and hba entry, as the entry would
match any user that contains "a" with the patch.

I assume (maybe I should not) that if objects starting with / already
exist, there are very good reason(s) behind them. Then I don't think
that preventing their creation in the DDL would help (quite the
contrary for the ones that really need them).

It looks to me that adding a GUC (off by default) to enable/disable the
regexp usage in the hba could be a fair compromise. It won't block any
creation starting with a / and won't open more doors (if such objects
exist) by default.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Summary function for pg_buffercache
On Tue, 20 Sep 2022 at 13:57, Aleksander Alekseev wrote:

> There was a missing empty line in pg_buffercache.out which made the
> tests fail. Here is a corrected v8 patch.

I was just sending a corrected patch without the missing line.

Thanks a lot for all these reviews and the corrected patch.

Best,
Melih
Re: Summary function for pg_buffercache
Hi hackers,

> Here is a corrected patch v7. To me it seems to be in pretty good
> shape, unless cfbot and/or other hackers will report any issues.

There was a missing empty line in pg_buffercache.out which made the
tests fail. Here is a corrected v8 patch.

--
Best regards,
Aleksander Alekseev

v8-0001-Added-pg_buffercache_summary-function.patch
Description: Binary data
Re: cataloguing NOT NULL constraints
On 2022-Sep-19, Matthias van de Meent wrote:

> I'm not sure on the 'good' part of this alternative, but we could go
> with a single row-based IS NOT NULL to reduce such clutter, utilizing
> the `ROW() IS NOT NULL` requirement of a row only matching IS NOT NULL
> when all attributes are also IS NOT NULL:
>
> Check constraints:
> "tab_notnull_check" CHECK (ROW(a, b, c, d, e) IS NOT NULL)

There's no way to mark this NOT VALID individually or validate it
afterwards, though.

> But the performance of repeated row-casting would probably not be as
> good as our current NULL checks

The NULL checks would still be mostly done by the attnotnull checks
internally, so there shouldn't be too much of a difference.

... though I'm now wondering if there's additional overhead from
checking the constraint twice on each row: first the attnotnull bit,
then the CHECK itself. Hmm. That's probably quite bad.

--
Álvaro Herrera
PostgreSQL Developer — https://www.EnterpriseDB.com/
Re: cataloguing NOT NULL constraints
On 2022-Sep-19, Isaac Morland wrote:

> I thought I saw some discussion about the SQL standard saying that
> there is a difference between putting NOT NULL in a column definition,
> and CHECK (column_name NOT NULL). So if we're going to take this
> seriously,

Was it Peter E.'s reply to this thread?
https://postgr.es/m/bac841ed-b86d-e3c2-030d-02a3db067...@enterprisedb.com
because if it wasn't there, then I would like to know what your source
is.

--
Álvaro Herrera
PostgreSQL Developer — https://www.EnterpriseDB.com/
"Thou shalt not follow the NULL pointer, for chaos and madness await
thee at its end."  (2nd Commandment for C programmers)
Re: cataloguing NOT NULL constraints
On 2022-Sep-19, Robert Haas wrote:

> On Wed, Aug 17, 2022 at 2:12 PM Alvaro Herrera wrote:
> > 55489 16devel 1776237=# \d tab
> >                  Tabla «public.tab»
> >  Columna │  Tipo   │ Ordenamiento │ Nulable  │ Por omisión
> > ─────────┼─────────┼──────────────┼──────────┼─────────────
> >  a       │ integer │              │ not null │
> > Restricciones CHECK:
> >     "tab_a_not_null" CHECK (a IS NOT NULL)
>
> In a table with many columns, most of which are NOT NULL, this is
> going to produce a ton of clutter. I don't like that.
>
> I'm not sure what a good alternative would be, though.

Perhaps that can be solved by displaying the constraint name in the
table:

55489 16devel 1776237=# \d tab
                     Tabla «public.tab»
 Columna │  Tipo   │ Ordenamiento │    Nulable     │ Por omisión
─────────┼─────────┼──────────────┼────────────────┼─────────────
 a       │ integer │              │ tab_a_not_null │

(Perhaps we can change the column title also, not sure.)

--
Álvaro Herrera        Breisgau, Deutschland — https://www.EnterpriseDB.com/
"The Gord often wonders why people threaten never to come back after
they've been told never to return"  (www.actsofgord.com)
Re: pg_create_logical_replication_slot argument incongruency
On Tue, Sep 20, 2022 at 08:41:56AM +0200, Florin Irion wrote:
> OK, patch only for the docs attached.

Thanks, applied.
--
Michael

signature.asc
Description: PGP signature