Reduce useless changes before reassembly during logical replication
Hi hackers,

During logical replication, if there is a large write transaction, some spill files will be written to disk, depending on the setting of logical_decoding_work_mem. This behavior effectively avoids OOM, but if the transaction generates many changes before commit, the spill files may fill the disk — for example, when updating a TB-scale table. That much is unavoidable. But I found an inelegant phenomenon: if the updated large table is not published, its changes are still written out as a large number of spill files. Look at the example below:

publisher:
```
create table tbl_pub(id int, val1 text, val2 text, val3 text);
create table tbl_t1(id int, val1 text, val2 text, val3 text);
CREATE PUBLICATION mypub FOR TABLE public.tbl_pub;
```

subscriber:
```
create table tbl_pub(id int, val1 text, val2 text, val3 text);
create table tbl_t1(id int, val1 text, val2 text, val3 text);
CREATE SUBSCRIPTION mysub CONNECTION 'host=127.0.0.1 port=5432 user=postgres dbname=postgres' PUBLICATION mypub;
```

publisher:
```
begin;
insert into tbl_t1 select i, repeat('xyzzy', i), repeat('abcba', i), repeat('dfds', i) from generate_series(0,99) i;
```

Later you will see a large number of spill files in the "$PGDATA/pg_replslot/mysub/" directory.

```
$ ll -sh
total 4.5G
4.0K -rw------- 1 postgres postgres 200 Nov 30 09:24 state
 17M -rw------- 1 postgres postgres 17M Nov 30 08:22 xid-750-lsn-0-1000.spill
 12M -rw------- 1 postgres postgres 12M Nov 30 08:20 xid-750-lsn-0-100.spill
 17M -rw------- 1 postgres postgres 17M Nov 30 08:23 xid-750-lsn-0-1100.spill
..
```

We can see that table tbl_t1 is not published in mypub, and its changes won't be sent downstream because it is not subscribed. After the transaction is reassembled, the pgoutput decoding plugin filters out changes to these unpublished relations when sending logical changes; see the function pgoutput_change.
Most importantly, if we filter out changes related to unpublished relations after constructing the changes but before queuing them into a transaction, will it reduce the workload of logical decoding and avoid disk or memory growth as much as possible? Attached is the patch I used to implement this optimization.

Design:
1. Added a callback LogicalDecodeFilterByRelCB for the output plugin.
2. Added this callback function pgoutput_table_filter for the pgoutput plugin. Its implementation is based on the table filter in the pgoutput_change function: it determines whether a change needs to be published based on the parameters of the publication, and if not, filters it out.
3. After constructing a change and before queuing it into a transaction, use RelidByRelfilenumber to obtain the relation associated with the change, just like obtaining the relation in the ReorderBufferProcessTXN function.
4. The relation may be a TOAST relation, and there is no good way to get the real table relation from a toast relation. Here, I get the real table OID through the toast relname, and then get the real table relation.
5. This filtering takes INSERT/UPDATE/DELETE into account. Other change types have not been considered yet and can be added in the future.

Test:
1. Added a test case 034_table_filter.pl.
2. As in the example above, create two tables: the published table tbl_pub and the non-published table tbl_t1.
3. Insert 10,000 rows of toast data into tbl_t1 on the publisher, and use pg_ls_replslotdir to record the total size of the slot directory every second.
4. Compare the size of the slot directory at the beginning of the transaction (size1), the size at the end of the transaction (size2), and the average size over the entire process (size3).
5. Assert(size1 == size2 == size3).

Sincerely looking forward to your feedback.

Regards, lijie

v1-Reduce-useless-changes-before-reassemble-during-logi.patch
Description: Binary data
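To illustrate the filtering decision described in the design above, here is a minimal Python sketch (not the patch's C code; `should_queue_change` and `toast_parent` are hypothetical names invented for illustration) of the per-change check performed before a change is queued into the reorder buffer:

```python
# Hypothetical sketch of the per-change publication filter; all names
# here are made up for illustration, this is not the patch itself.

def should_queue_change(relid, change_type, publication, toast_parent=None):
    """Return True if the change must be queued for decoding."""
    # A TOAST relation is resolved to its parent table first (design step 4).
    if toast_parent is not None:
        relid = toast_parent
    # Only INSERT/UPDATE/DELETE are filtered (design step 5); anything
    # else is conservatively kept.
    if change_type not in ("insert", "update", "delete"):
        return True
    # Keep the change only if its relation is published.
    return relid in publication

pub = {"tbl_pub"}
assert should_queue_change("tbl_pub", "insert", pub)      # published: kept
assert not should_queue_change("tbl_t1", "insert", pub)   # filtered, never spilled
assert not should_queue_change("pg_toast_1", "update", pub,
                               toast_parent="tbl_t1")     # toast follows its parent
assert should_queue_change("tbl_t1", "truncate", pub)     # not handled yet: kept
```

In the real patch the equivalent check runs in C through the proposed LogicalDecodeFilterByRelCB callback, with RelidByRelfilenumber used to map a change back to its relation.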
Removing const-false IS NULL quals and redundant IS NOT NULL quals
(Moving discussion from -bugs [1] to -hackers for more visibility.)

Background: This started out as a performance fix for bug #17540 but has now extended beyond that, as fixing that only requires that we don't add redundant IS NOT NULL quals to Min/Max aggregate rewrites.

The attached gets rid of all IS NOT NULL quals on columns that are provably not null, and replaces any IS NULL quals on NOT NULL columns with a const-false gating qual, which could result in not having to scan the relation at all.

explain (costs off) select * from pg_class where oid is null;
        QUERY PLAN
--------------------------
 Result
   One-Time Filter: false

The need for this is slightly higher than it once was, as the self-join removal code must add IS NOT NULL quals when removing self-joins where the join condition is strict.

explain select c1.* from pg_class c1 inner join pg_class c2 on c1.oid=c2.oid;
                          QUERY PLAN
--------------------------------------------------------------
 Seq Scan on pg_class c2  (cost=0.00..18.15 rows=415 width=273)

master would contain an oid IS NOT NULL filter condition.

On Fri, 1 Dec 2023 at 23:07, Richard Guo wrote:
>
> On Wed, Nov 29, 2023 at 8:48 AM David Rowley wrote:
>> On looking deeper, I see you're overwriting the rinfo_serial of the
>> const-false RestrictInfo with the one from the original RestrictInfo.
>> If that's the correct thing to do then the following comment would
>> need to be updated to mention this exception of why the rinfo_serial
>> isn't unique.
>
> Right, that's what we need to do.
>
>> Looking at the tests, I see:
>>
>> select * from pred_tab t1 left join pred_tab t2 on true left join
>> pred_tab t3 on t2.a is null;
>>
>> I'm wondering if you can come up with a better test for this? I don't
>> quite see any reason why varnullingrels can't be empty for t2.a in the
>> join qual, as the "ON true" join condition between t1 and t2 means that
>> there shouldn't ever be any NULL t2.a rows. My thoughts are that if
>> we improve how varnullingrels are set in the future then this test
>> will be broken.
>> Also, I like to write exactly what each test is testing so that
>> it's easier in the future to maintain the expected results. It's
>> often tricky when making planner changes to know if some planner
>> change makes a test completely useless or if the expected results
>> just need to be updated. If someone changes varnullingrels to be
>> empty for this case, then if they accept the actual results as
>> expected results then the test becomes useless. I tend to do this
>> with comments in the .sql file along the lines of "-- Ensure ..."
>>
>> I also would rather see the SQLs in the test wrap their lines before
>> each join and the keywords be upper case.
>
> Thanks for the suggestions on the tests. I had a go at improving the
> test queries and their comments.

Thanks. I made a pass over this patch which resulted in just adding and tweaking some comments.

The other thing that bothers me about this patch now is the lack of simplification of OR clauses with a redundant condition. For example:

postgres=# explain (costs off) select * from pg_class where oid is null or relname = 'non-existent';
                            QUERY PLAN
-------------------------------------------------------------------
 Bitmap Heap Scan on pg_class
   Recheck Cond: ((oid IS NULL) OR (relname = 'non-existent'::name))
   ->  BitmapOr
         ->  Bitmap Index Scan on pg_class_oid_index
               Index Cond: (oid IS NULL)
         ->  Bitmap Index Scan on pg_class_relname_nsp_index
               Index Cond: (relname = 'non-existent'::name)
(7 rows)

oid is null is const-false, and if we simplified that to remove the redundant OR branch and ran it through the constant folding code, we'd end up with just relname = 'non-existent' and a simpler plan as a result.

I don't think that's a blocker; I think the patch is ready to go even without doing anything to improve that. Happy to hear other people's thoughts on this patch. Otherwise, I currently don't think the missed optimisation is a reason to block what we've ended up with so far.
David

[1] https://postgr.es/m/flat/17540-7aa1855ad5ec18b4%40postgresql.org

From e690e0f622f69e6896f16f1729ff1b531b65ef4e Mon Sep 17 00:00:00 2001
From: David Rowley
Date: Thu, 7 Dec 2023 22:52:34 +1300
Subject: [PATCH v10] Reduce NullTest quals to constant TRUE or FALSE
---
 .../postgres_fdw/expected/postgres_fdw.out |  16 +-
 contrib/postgres_fdw/sql/postgres_fdw.sql  |   4 +-
 src/backend/optimizer/plan/initsplan.c     | 202 +--
 src/backend/optimizer/util/joininfo.c      |  28 ++
 src/backend/optimizer/util/plancat.c       |  10 +
 src/backend/optimizer/util/relnode.c       |   3 +
 src/include/nodes/pathnodes.h              |   7 +-
 src/include/optimizer/planmain.h           |   4 +
 src/test/regress/expected/equivclass.out   |  18 +-
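The two reductions discussed in this thread — folding NullTest quals using NOT NULL knowledge, and the not-yet-done simplification of OR clauses containing a const-false arm — can be sketched outside the planner as follows (illustrative Python, not planner code; the tuple representation of quals is invented for this sketch):

```python
# Illustrative sketch of the NullTest reductions; quals are modeled as
# ("is_null" | "is_not_null" | <other>, column) tuples, which is made up.

def reduce_nulltest(qual, notnull_cols):
    kind, col = qual
    if kind in ("is_null", "is_not_null") and col in notnull_cols:
        # col IS NOT NULL -> constant true; col IS NULL -> constant false
        return kind == "is_not_null"
    return qual  # not provable from catalog knowledge: keep as-is

def simplify_or(quals, notnull_cols):
    kept = []
    for q in quals:
        r = reduce_nulltest(q, notnull_cols)
        if r is True:
            return True       # one true arm makes the whole OR true
        if r is False:
            continue          # drop the redundant const-false arm
        kept.append(r)
    return kept or False      # all arms false -> const-false gating qual

notnull = {"oid"}
# oid IS NULL OR relname = 'non-existent' -> just the relname qual
assert simplify_or([("is_null", "oid"), ("eq", "relname")], notnull) == [("eq", "relname")]
# oid IS NOT NULL -> constant true, i.e. the qual can be removed entirely
assert reduce_nulltest(("is_not_null", "oid"), notnull) is True
# oid IS NULL alone -> const-false, i.e. "One-Time Filter: false"
assert simplify_or([("is_null", "oid")], notnull) is False
```

The patch above implements the first function's behavior; the OR-arm removal is the missed optimisation David describes.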
Re: remaining sql/json patches
On 2023-Dec-07, Erik Rijkers wrote:

> Hm, this set doesn't apply for me. 0003 gives an error, see below (sorry for
> my interspersed bash echoing - seemed best to leave it in).
> (I'm using patch; should be all right, no? Am I doing it wrong?)

There's definitely something wrong with the patch file; that binary file should not be there. OTOH, clearly, if we ever start including binary files in our tree, `patch` is no longer going to cut it. Maybe we won't ever do that, though.

There's also a complaint about whitespace.

--
Álvaro Herrera, PostgreSQL Developer — https://www.EnterpriseDB.com/
"El destino baraja y nosotros jugamos" (A. Schopenhauer)
Re: Transaction timeout
Hi,

I read the V6 patch and found some things that need to be improved.

Prepared transactions should also be documented:

     A value of zero (the default) disables the timeout.
    +This timeout is not applied to prepared transactions. Only transactions
    +with user connections are affected.

Missing 'time':

    - gettext_noop("Sets the maximum allowed in a transaction."),
    + gettext_noop("Sets the maximum allowed time in a transaction."),

16 is already released. It's 17 now:

    - if (AH->remoteVersion >= 16)
    + if (AH->remoteVersion >= 17)
          ExecuteSqlStatement(AH, "SET transaction_timeout = 0");

And I tested the V6 patch, and it works as expected.

--
Yuhang Qiu
Re: remaining sql/json patches
There are two JsonCoercionState entries in src/tools/pgindent/typedefs.list:

+JsonCoercionState
 JsonConstructorExpr
 JsonConstructorExprState
 JsonConstructorType
 JsonEncoding
+JsonExpr
+JsonExprOp
+JsonExprPostEvalState
+JsonExprState
+JsonCoercionState

+ post_eval->jump_eval_coercion = jsestate->jump_eval_result_coercion;
+ if (jbv == NULL)
+ {
+     /* Will be coerced with result_coercion. */
+     *op->resvalue = (Datum) 0;
+     *op->resnull = true;
+ }
+ else if (!error && !empty)
+ {
+     Assert(jbv != NULL);

Will the above "Assert(jbv != NULL);" always be true?

Based on:

json_behavior_clause_opt:
    json_behavior ON EMPTY_P                            { $$ = list_make2($1, NULL); }
    | json_behavior ON ERROR_P                          { $$ = list_make2(NULL, $1); }
    | json_behavior ON EMPTY_P json_behavior ON ERROR_P { $$ = list_make2($1, $4); }
    | /* EMPTY */                                       { $$ = list_make2(NULL, NULL); }
    ;

so

+ if (func->behavior)
+ {
+     on_empty = linitial(func->behavior);
+     on_error = lsecond(func->behavior);
+ }

will `if (func->behavior)` always be true?

By the way, in the above "{ $$ = list_make2($1, $4); }", what does $4 refer to? (I don't know gram.y.)

+ jsexpr->formatted_expr = transformJsonValueExpr(pstate, constructName,
+                                                 func->context_item,
+                                                 JS_FORMAT_JSON,
+                                                 InvalidOid, false);
+
+ Assert(jsexpr->formatted_expr != NULL);

Is this Assert unnecessary? The transformJsonValueExpr function already has an assert at the end — won't it fail that one first?
Re: remaining sql/json patches
On 12/7/23 10:32, Amit Langote wrote:
> On Thu, Dec 7, 2023 at 12:26 AM Alvaro Herrera wrote:
>> On 2023-Dec-06, Amit Langote wrote:
>>> I think I'm inclined toward adapting the LA-token fix (attached 0005),
>
> This one needs to be fixed, so done.
>
> On Thu, Dec 7, 2023 at 5:25 PM Peter Eisentraut wrote:
>> Here are a couple of small patches to tidy up the parser a bit in your
>> v28-0004 (JSON_TABLE) patch. It's not a lot; the rest looks okay to me.
>
> Thanks Peter. I've merged these into 0004.

Hm, this set doesn't apply for me. 0003 gives an error, see below (sorry for my interspersed bash echoing - seemed best to leave it in). (I'm using patch; should be all right, no? Am I doing it wrong?)

--
[2023.12.07 11:29:39 json_table2] patch 1 of 5 (json_table2) [/home/aardvark/download/pgpatches/0170/json_table/20231207/v29-0001-Add-soft-error-handling-to-some-expression-nodes.patch]
rv [] # [ok] OK, patch returned [0] so now break and continue (all is well)
--
[2023.12.07 11:29:39 json_table2] patch 2 of 5 (json_table2) [/home/aardvark/download/pgpatches/0170/json_table/20231207/v29-0002-Add-soft-error-handling-to-populate_record_field.patch]
rv [0] # [ok] OK, patch returned [0] so now break and continue (all is well)
--
[2023.12.07 11:29:39 json_table2] patch 3 of 5 (json_table2) [/home/aardvark/download/pgpatches/0170/json_table/20231207/v29-0003-SQL-JSON-query-functions.patch]
rv [0] # [ok] File src/interfaces/ecpg/test/sql/sqljson_queryfuncs: git binary diffs are not supported.
patch apply failed: rv = 1
patch file: /home/aardvark/download/pgpatches/0170/json_table/20231207/v29-0003-SQL-JSON-query-functions.patch
rv [1] # [ok]

The text leading up to this was:
--
|From 712b95c8a1a3dd683852ac151e229440af783243 Mon Sep 17 00:00:00 2001
|From: Amit Langote
|Date: Tue, 5 Dec 2023 14:33:25 +0900
|Subject: [PATCH v29 3/5] SQL/JSON query functions
|MIME-Version: 1.0
|Content-Type: text/plain; charset=UTF-8
|Content-Transfer-Encoding: 8bit
|
|This introduces the SQL/JSON functions for querying JSON data using
|jsonpath expressions. The functions are:
|
|JSON_EXISTS()
|JSON_QUERY()
|JSON_VALUE()
|

Erik

--
Thanks, Amit Langote
EDB: http://www.enterprisedb.com

[1] https://www.postgresql.org/message-id/CA%2BHiwqGsByGXLUniPxBgZjn6PeDr0Scp0jxxQOmBXy63tiJ60A%40mail.gmail.com
Re: Improve WALRead() to suck data directly from WAL buffers when possible
On Mon, Nov 13, 2023 at 7:02 PM Bharath Rupireddy wrote:
>
> On Fri, Nov 10, 2023 at 2:28 AM Andres Freund wrote:
> > I think the code needs to make sure that *never* happens. That seems
> > unrelated to holding or not holding WALBufMappingLock. Even if the page
> > header is already valid, I don't think it's ok to just read/parse WAL
> > data that's concurrently being modified.
> >
> > We can never allow WAL being read that's past
> > XLogBytePosToRecPtr(XLogCtl->Insert->CurrBytePos)
> > as it does not exist.
>
> Agreed. Erroring out in XLogReadFromBuffers() if the passed-in WAL is past
> CurrBytePos is an option. Another, cleaner way is to let the
> caller decide what it needs to do (retry or error out) - fill an error
> message in XLogReadFromBuffers() and return as if nothing was read, or
> return a special negative error code like XLogDecodeNextRecord does, so that
> the caller can deal with it.

In the attached v17 patch, I've ensured that XLogReadFromBuffers returns when the caller requests WAL that's past the current insert position at the moment.

> Also, reading CurrBytePos with the insertpos_lck spinlock can get in the
> way of concurrent inserters. A possible way is to turn both
> CurrBytePos and PrevBytePos into 64-bit atomics so that
> XLogReadFromBuffers() can read CurrBytePos atomically without any lock
> and leave it to the caller to deal with non-existing WAL reads.
>
> > And if the to-be-read LSN is between
> > XLogCtl->LogwrtResult->Write and XLogBytePosToRecPtr(Insert->CurrBytePos)
> > we need to call WaitXLogInsertionsToFinish() before copying the data.
>
> Agreed on waiting for all in-flight insertions to the pages we're about to
> read to finish. But reading XLogCtl->LogwrtRqst.Write requires either
> the XLogCtl->info_lck spinlock or WALWriteLock. Maybe turn
> XLogCtl->LogwrtRqst.Write into a 64-bit atomic and read it without any
> lock, and rely on WaitXLogInsertionsToFinish()'s return value, i.e. if
> WaitXLogInsertionsToFinish() returns a value >= Insert->CurrBytePos,
> then go read that page from WAL buffers.

In the attached v17 patch, I've ensured that XLogReadFromBuffers waits for all in-progress insertions to finish when the caller requests WAL that's past the current write position and before the current insert position.

I've also ensured that XLogReadFromBuffers returns special return codes for the various scenarios (when asked to read in recovery, read from a different TLI, read non-existent WAL, and so on) instead of erroring out. This gives the caller the flexibility to decide what to do.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

v17-0001-Use-64-bit-atomics-for-xlblocks-array-elements.patch
Description: Binary data
v17-0002-Allow-WAL-reading-from-WAL-buffers.patch
Description: Binary data
v17-0003-Add-test-module-for-verifying-WAL-read-from-WAL-.patch
Description: Binary data
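The gating rules described in this thread can be summarized in a small Python sketch (hypothetical names and return conventions, not the v17 patch's actual API): reads past the insert position fail, and reads between the write position and the insert position must first wait for in-flight insertions to finish.

```python
# Hypothetical sketch of the read-gating logic for reading WAL directly
# from buffers; names and return conventions are made up for illustration.

def read_from_buffers(start, count, insert_pos, write_pos, wait_insertions):
    """Return the byte count if the range is safe to read, else -1."""
    upto = start + count
    if upto > insert_pos:
        return -1  # WAL past the current insert position does not exist
    if upto > write_pos:
        # Between the write position and the insert position: wait for
        # in-flight insertions to finish before copying from the buffers.
        finished_upto = wait_insertions(upto)
        if finished_upto < upto:
            return -1
    return count

# Fully written-out WAL can be read directly.
assert read_from_buffers(100, 50, insert_pos=200, write_pos=180,
                         wait_insertions=lambda u: u) == 50
# A request past the insert position is refused outright.
assert read_from_buffers(100, 150, insert_pos=200, write_pos=180,
                         wait_insertions=lambda u: u) == -1
```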
Re: Reducing output size of nodeToString
On 06.12.23 22:08, Matthias van de Meent wrote:
> PFA a patch that reduces the output size of nodeToString by 50%+ in
> most cases (measured on pg_rewrite), which on my system reduces the
> total size of pg_rewrite by 33% to 472KiB. This does keep the textual
> pg_node_tree format alive, but reduces its size significantly.
>
> The basic techniques used are
> - Don't emit scalar fields when they contain a default value, and
>   make the reading code aware of this.
> - Reasonable defaults are set for most datatypes, and overrides can
>   be added with new pg_node_attr() attributes. No introspection into
>   non-null Node/Array/etc. is being done though.
> - Reset more fields to their default values before storing the values.
> - Don't write trailing 0s in outDatum calls for by-ref types. This
>   saves many bytes for Name fields, but also some other pre-existing
>   entry points.
>
> Future work will probably have to be on a significantly different
> storage format, as the textual format is about to hit its entropy
> limits.

One thing that was mentioned repeatedly is that we might want different formats for human consumption and for machine storage. For human consumption, I would like some format like what you propose, because it generally omits the "unset" or "uninteresting" fields.

But since you also talk about the size of pg_rewrite, I wonder whether it would be smaller if we just didn't write the field names at all but instead all the field values. (This should be pretty easy to test, since the read functions currently ignore the field names anyway; you could just write out all field names as "x" and see what happens.)

I don't much like the way your patch uses the term "default". Most of these default values are not defaults at all, but perhaps "most common values". In theory, I would expect a default value to be initialized by makeNode(). (That could be an interesting feature, but let's stay focused here.) But even then, most of these "defaults" wouldn't be appropriate as real default values.
This part seems quite controversial to me, and I would like to see some more details about how much this specifically really saves.

I don't quite understand why in your patch you have some fields as optional and some not. Or is that what WRITE_NODE_FIELD() vs. WRITE_NODE_FIELD_OPT() means? How is it decided which one to use?

The part that clears out the location fields in pg_rewrite entries might be worth considering as a separate patch. Could you explain it more? Does it affect location pointers when using views at all?
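For illustration, the "don't emit default-valued fields" technique amounts to the following (a Python sketch with invented field names and defaults; the real outfuncs.c/readfuncs.c machinery works on C structs, not dicts):

```python
# A sketch of writing only non-default fields; the field names and
# defaults here are invented for illustration.

DEFAULTS = {"location": -1, "inh": False, "relid": 0}

def node_to_string(fields):
    """Emit only fields that differ from their per-field default."""
    return {k: v for k, v in fields.items() if DEFAULTS.get(k) != v}

def string_to_node(compact):
    """The reading code must know the defaults to restore elided fields."""
    out = dict(DEFAULTS)
    out.update(compact)
    return out

node = {"location": -1, "inh": False, "relid": 424242}
compact = node_to_string(node)
assert compact == {"relid": 424242}      # two default-valued fields elided
assert string_to_node(compact) == node   # and restored losslessly on read
```

This also makes concrete why the choice of "defaults" matters: the reduction only pays off when the chosen per-field value really is the most common one in stored trees.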
Re: Proposal to add page headers to SLRU pages
Hi Yong!

+1 to the idea of protecting SLRUs from corruption. I'm slightly leaning towards the idea of separating checksums from data pages, but in any case these checksums are better than no checksums.

> On 7 Dec 2023, at 10:06, Li, Yong wrote:
>
> I am still working on patching pg_upgrade. I'd just love to hear your
> thoughts on the idea and the current patch.

FWIW, you can take the upgrade code from this patch [0], which does all the same stuff :)

Best regards, Andrey Borodin.

[0] https://www.postgresql.org/message-id/b8f85f94-ecb0-484e-96b8-21d928c8e...@yandex-team.ru
Re: backtrace_on_internal_error
On 05.12.23 11:55, Peter Eisentraut wrote:
> I think the implementation would be very simple, something like
>
> diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c
> index 6aeb855e49..45d40abe92 100644
> --- a/src/backend/utils/error/elog.c
> +++ b/src/backend/utils/error/elog.c
> @@ -498,9 +498,11 @@ errfinish(const char *filename, int lineno, const char *funcname)
>      /* Collect backtrace, if enabled and we didn't already */
>      if (!edata->backtrace &&
> -        edata->funcname &&
> -        backtrace_functions &&
> -        matches_backtrace_functions(edata->funcname))
> +        ((edata->funcname &&
> +          backtrace_functions &&
> +          matches_backtrace_functions(edata->funcname)) ||
> +         (edata->sqlerrcode == ERRCODE_INTERNAL_ERROR &&
> +          backtrace_on_internal_error)))
>          set_backtrace(edata, 2);
>
> where backtrace_on_internal_error would be a GUC variable.

It looks like many people found this idea worthwhile. Several people asked for a way to set the minimum log level for this treatment.

Something else to note: I wrote the above code to check the error code; it doesn't check whether the original code wrote elog() or ereport(). There are some internal errors that are written as ereport() now. Others might be changed from time to time; until now there would have been no external effect from this. I think it would be weird to introduce a difference between these forms now.

But then, elog() only uses the error code ERRCODE_INTERNAL_ERROR if the error level is >= ERROR. So this already excludes everything below. Do people want a way to distinguish ERROR/FATAL/PANIC? Or do people want a way to enable backtraces for elog(LOG)? This didn't look too interesting to me. (Many current elog(LOG) calls are behind additional guards like TRACE_SORT or LOCK_DEBUG.)

If neither of these two is very interesting, then the above code would already appear to do what was asked.
Configure problem when cross-compiling PostgreSQL 16.1
Hi all,

I was trying to cross-compile PostgreSQL 16.1 using Buildroot, which worked fine until I executed "initdb" on my target device, where I faced the following error:

2023-12-06 07:10:18.568 UTC [31] FATAL: could not load library "/usr/lib/postgresql/dict_snowball.so": /usr/lib/postgresql/dict_snowball.so: undefined symbol: CurrentMemoryContext

It turned out that the "postgres" binary was missing the symbol "CurrentMemoryContext" because the configure script assumed that my toolchain's (GCC 9.4) linker does not support "--export-dynamic", although it supports it. I had a quick look at the configure script, where the following lines piqued my interest:

if test "$cross_compiling" = yes; then :
  pgac_cv_prog_cc_LDFLAGS_EX_BE__Wl___export_dynamic="assuming no"

Apparently, when cross-compiling, the linker is automatically assumed not to understand "--export-dynamic", leading to the aforementioned problem on my end. A workaround of mine is to override pgac_cv_prog_cc_LDFLAGS_EX_BE__Wl___export_dynamic with "yes", which makes everything work as expected. There is also at least one additional linker flag, "--as-needed", that is not being used when cross-compiling.

Is this a bug, or am I misunderstanding the implications that PostgreSQL has when "$cross_compiling=yes"?

Best regards, Dominik
Re: Synchronizing slots from primary to standby
Hi,

On 12/7/23 10:07 AM, shveta malik wrote:
> On Thu, Dec 7, 2023 at 1:19 PM Drouvot, Bertrand wrote:
>> Might be worth to add comments in the code (around the WalRcv->latestWalEnd
>> checks) that no "lagging" sync are possible if the walreceiver is not started
>> though?
>
> I am a bit confused. Do you mean as a TODO item? Otherwise the comment
> will be opposite of the code we are writing.

Sorry for the confusion: what I meant to say is that synchronization (should it be lagging) is not possible if the walreceiver is not started (as XLogRecPtrIsInvalid(WalRcv->latestWalEnd) would be true). More precisely here (in synchronize_slots()):

/* The primary_slot_name is not set yet or WALs not received yet */
SpinLockAcquire(&WalRcv->mutex);
if (!WalRcv ||
    (WalRcv->slotname[0] == '\0') ||
    XLogRecPtrIsInvalid(WalRcv->latestWalEnd))
{
    SpinLockRelease(&WalRcv->mutex);
    return naptime;
}

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Re: Synchronizing slots from primary to standby
On Thu, Dec 7, 2023 at 1:19 PM Drouvot, Bertrand wrote:
>
> Hi,
>
> On 12/6/23 11:58 AM, shveta malik wrote:
> > On Wed, Dec 6, 2023 at 3:00 PM Drouvot, Bertrand wrote:
> >>
> >> Hi,
> >>
> >> On 12/6/23 7:18 AM, shveta malik wrote:
> >>> On Wed, Dec 6, 2023 at 10:56 AM Amit Kapila wrote:
> >>>> I feel that is indirectly relying on the fact that the primary won't
> >>>> advance logical slots unless the physical standby has consumed data.
> >>>
> >>> Yes, that is the basis of this discussion.
> >>
> >> Yes.
> >>
> >>> But now on rethinking, if the user has not set 'standby_slot_names' on
> >>> the primary in the first place, then even if the walreceiver on the
> >>> standby is down, slots on the primary will keep on advancing
> >>
> >> Oh right, good point.
> >>
> >>> and thus we need to sync.
> >>
> >> Yes, and I think our current check
> >> "XLogRecPtrIsInvalid(WalRcv->latestWalEnd)"
> >> in synchronize_slots() prevents us from doing so (as I think
> >> WalRcv->latestWalEnd would be invalid for a non-started walreceiver).
> >
> > But I think we do not need to deal with the case that the walreceiver is
> > not started at all on the standby. It is always started. The walreceiver
> > not getting started or being down for long is a rare scenario. We have
> > other checks too for 'latestWalEnd' in the slotsync worker, and I think
> > we should retain those as is.
>
> Agree to not deal with the walreceiver being down for now (we can
> still improve that part later if we encounter the case in the real
> world).

yes, agreed.

> Might be worth to add comments in the code (around the WalRcv->latestWalEnd
> checks) that no "lagging" sync are possible if the walreceiver is not started
> though?

I am a bit confused. Do you mean as a TODO item? Otherwise the comment will be the opposite of the code we are writing.

> Regards,
>
> --
> Bertrand Drouvot
> PostgreSQL Contributors Team
> RDS Open Source Databases
> Amazon Web Services: https://aws.amazon.com
Re: postgres_fdw test timeouts
Hello Thomas,

03.12.2023 09:00, Thomas Munro wrote:
> On Sun, Dec 3, 2023 at 6:00 PM Alexander Lakhin wrote:
>> I've managed to reproduce the failure locally when running postgres_fdw_x/
>> regress in parallel (--num-processes 10). It reproduced for me on 04a09ee94
>> (iterations 1, 2, 4), but not on 04a09ee94~1 (30 iterations passed).
>> I'm going to investigate this case within days. Maybe we could find a
>> better fix for the issue.
>
> Thanks. One thing I can recommend to anyone trying to understand the
> change is that you view it with:

I think I've found out what's going on here. The culprit is WSAEnumNetworkEvents(), assisted by the non-trivial logic of ExecAppendAsyncEventWait().

For the case noccurred > 1, ExecAppendAsyncEventWait() performs a loop where ExecAsyncNotify() is called for the first AsyncRequest, but the second one is also processed inside, through a recursive call to ExecAppendAsyncEventWait():

-> ExecAsyncNotify -> produce_tuple_asynchronously -> ExecScan -> ExecInterpExpr -> ExecSetParamPlan -> ExecProcNodeFirst -> ExecAgg -> agg_retrieve_direct -> ExecProcNodeInstr -> ExecAppend -> ExecAppendAsyncEventWait

Here we get into the first loop and call ExecAsyncConfigureWait() for the second AsyncRequest (because we haven't reset its callback_pending yet), and this leads to creating another WaitEvent for the second socket inside postgresForeignAsyncConfigureWait():

AddWaitEventToSet(set, WL_SOCKET_READABLE, PQsocket(fsstate->conn), ...

This WaitEvent seemingly misses an event that we should get for that socket. It's not that important to get noccurred > 1 in ExecAppendAsyncEventWait() to see the failure; it's enough to call WSAEnumNetworkEvents() inside WaitEventSetWaitBlock() for the second socket (I tried to exit from WaitEventSetWaitBlock's new loop prematurely, without touching occurred_events and returned_events on a second iteration of the loop). So it looks like we have the same issue with multiple event handles associated with a single socket here.
And v2-0003-Redesign-Windows-socket-event-management.patch from [1] "surprisingly" helps in this case as well (I could not see a failure for 100 iterations of 10 tests in parallel). [1] https://www.postgresql.org/message-id/CA%2BhUKGL0bikWSC2XW-zUgFWNVEpD_gEWXndi2PE5tWqmApkpZQ%40mail.gmail.com Best regards, Alexander
Re: Make COPY format extendable: Extract COPY TO format implementations
On Thu, Dec 7, 2023 at 1:05 PM Sutou Kouhei wrote:
>
> Hi,
>
> In "Re: Make COPY format extendable: Extract COPY TO format implementations"
> on Wed, 6 Dec 2023 22:07:51 +0800, Junwang Zhao wrote:
>
> > Should we extract both *copy to* and *copy from* for the first step? In
> > that case we can add the pg_copy_handler catalog smoothly later.
>
> I don't object to it (mixing TO/FROM changes in one patch) but
> it may make review difficult. Is it acceptable?
>
> FYI: I planned to implement the TO part, then the FROM part,
> and then unify the TO/FROM parts if needed. [1]

I'm fine with step-by-step refactoring, let's just wait for more suggestions.

> > Attached V4 adds 'extract copy from' and it passed the cirrus ci,
> > please take a look.
>
> Thanks. Here are my comments:
>
> > +     /*
> > +      * Error is relevant to a particular line.
> > +      *
> > +      * If line_buf still contains the correct line, print it.
> > +      */
> > +     if (cstate->line_buf_valid)
>
> We need to fix the indentation.
>
> > +CopyFromFormatBinaryStart(CopyFromState cstate, TupleDesc tupDesc)
> > +{
> > +     FmgrInfo   *in_functions;
> > +     Oid        *typioparams;
> > +     Oid         in_func_oid;
> > +     AttrNumber  num_phys_attrs;
> > +
> > +     /*
> > +      * Pick up the required catalog information for each attribute in the
> > +      * relation, including the input function, the element type (to pass to
> > +      * the input function), and info about defaults and constraints. (Which
> > +      * input function we use depends on text/binary format choice.)
> > +      */
> > +     num_phys_attrs = tupDesc->natts;
> > +     in_functions = (FmgrInfo *) palloc(num_phys_attrs * sizeof(FmgrInfo));
> > +     typioparams = (Oid *) palloc(num_phys_attrs * sizeof(Oid));
>
> We need to update the comment because defaults and
> constraints aren't picked up here.
>
> > +CopyFromFormatTextStart(CopyFromState cstate, TupleDesc tupDesc)
> ...
> > +     /*
> > +      * Pick up the required catalog information for each attribute in the
> > +      * relation, including the input function, the element type (to pass to
> > +      * the input function), and info about defaults and constraints. (Which
> > +      * input function we use depends on text/binary format choice.)
> > +      */
> > +     in_functions = (FmgrInfo *) palloc(num_phys_attrs * sizeof(FmgrInfo));
> > +     typioparams = (Oid *) palloc(num_phys_attrs * sizeof(Oid));
>
> Ditto.
>
> > @@ -1716,15 +1776,6 @@ BeginCopyFrom(ParseState *pstate,
> >                     ReceiveCopyBinaryHeader(cstate);
> >     }
>
> I think that this block should be moved to
> CopyFromFormatBinaryStart() too. But we need to run it after
> we set up inputs such as data_source_cb, pipe and filename...
>
> +/* Routines for a COPY HANDLER implementation. */
> +typedef struct CopyHandlerOps
> +{
> +     /* Called when COPY TO is started. This will send a header. */
> +     void    (*copy_to_start) (CopyToState cstate, TupleDesc tupDesc);
> +
> +     /* Copy one row for COPY TO. */
> +     void    (*copy_to_one_row) (CopyToState cstate, TupleTableSlot *slot);
> +
> +     /* Called when COPY TO is ended. This will send a trailer. */
> +     void    (*copy_to_end) (CopyToState cstate);
> +
> +     void    (*copy_from_start) (CopyFromState cstate, TupleDesc tupDesc);
> +     bool    (*copy_from_next) (CopyFromState cstate, ExprContext *econtext,
> +                                Datum *values, bool *nulls);
> +     void    (*copy_from_error_callback) (CopyFromState cstate);
> +     void    (*copy_from_end) (CopyFromState cstate);
> +} CopyHandlerOps;
>
> It seems that the "copy_" prefix is redundant. Should we use
> "to_start" instead of "copy_to_start" and so on?
>
> BTW, it seems that "COPY FROM (FORMAT json)" may not be implemented. [2]
> We may need to care about NULL copy_from_* cases.
>
> > I added a hook *copy_from_end* but this might be removed later if not used.
>
> It may be useful to clean up resources for COPY FROM, but the
> patch doesn't call copy_from_end. How about removing it for now,
> because it's not needed yet? We can add it and call it from
> EndCopyFrom() later.
>
> I think that we should focus on refactoring instead of
> adding a new feature in this patch.
>
> [1]: https://www.postgresql.org/message-id/20231204.153548.2126325458835528809.kou%40clear-code.com
> [2]: https://www.postgresql.org/message-id/flat/CALvfUkBxTYy5uWPFVwpk_7ii2zgT07t3d-yR_cy4sfrrLU%3Dkcg%40mail.gmail.com
>
> Thanks,
> --
> kou

--
Regards
Junwang Zhao
Re: Test 002_pg_upgrade fails with olddump on Windows
On Wed, Dec 06, 2023 at 11:00:01AM +0300, Alexander Lakhin wrote:
> So it looks like
>     my $newregresssrc = "$srcdir/src/test/regress";
> is incorrect for meson. Maybe it should be?:
>     my $newregresssrc = dirname($ENV{REGRESS_SHLIB});
> (With this line the test passes for me on Windows and Linux.)

Hmm. Yes, it looks like you're right here. That should allow all the
scenarios we expect to work to update the paths for the functions.
--
Michael

signature.asc
Description: PGP signature
Re: Make COPY format extendable: Extract COPY TO format implementations
On Thu, Dec 7, 2023 at 8:39 AM Michael Paquier wrote:
>
> On Wed, Dec 06, 2023 at 10:07:51PM +0800, Junwang Zhao wrote:
> > I read the thread[1] you posted and I think Andres's suggestion sounds
> > great.
> >
> > Should we extract both *copy to* and *copy from* for the first step? In
> > that case we can add the pg_copy_handler catalog smoothly later.
> >
> > Attached V4 adds 'extract copy from' and it passed the cirrus ci,
> > please take a look.
> >
> > I added a hook *copy_from_end* but this might be removed later if not used.
> >
> > [1]: https://www.postgresql.org/message-id/20180211211235.5x3jywe5z3lkgcsr%40alap3.anarazel.de
>
> I was looking at the differences between v3 posted by Sutou-san and
> v4 from you, seeing that:
>
> +/* Routines for a COPY HANDLER implementation. */
> +typedef struct CopyHandlerOps
>  {
>  	/* Called when COPY TO is started. This will send a header. */
> -	void		(*start) (CopyToState cstate, TupleDesc tupDesc);
> +	void		(*copy_to_start) (CopyToState cstate, TupleDesc tupDesc);
>
>  	/* Copy one row for COPY TO. */
> -	void		(*one_row) (CopyToState cstate, TupleTableSlot *slot);
> +	void		(*copy_to_one_row) (CopyToState cstate, TupleTableSlot *slot);
>
>  	/* Called when COPY TO is ended. This will send a trailer. */
> -	void		(*end) (CopyToState cstate);
> -} CopyToFormatOps;
> +	void		(*copy_to_end) (CopyToState cstate);
> +
> +	void		(*copy_from_start) (CopyFromState cstate, TupleDesc tupDesc);
> +	bool		(*copy_from_next) (CopyFromState cstate, ExprContext *econtext,
> +								   Datum *values, bool *nulls);
> +	void		(*copy_from_error_callback) (CopyFromState cstate);
> +	void		(*copy_from_end) (CopyFromState cstate);
> +} CopyHandlerOps;
>
> And we've spent a good deal of time refactoring the copy code so as
> the logic behind TO and FROM is split. Having a set of routines that
> groups both does not look like a step in the right direction to me,

The point of this refactoring (from my view) is to make it possible to add
new copy handlers in extensions, just like access methods. As Andres
suggested, with a system catalog like *pg_copy_handler*, if we split TO
and FROM into two sets of routines, does that mean we have to create two
catalogs (pg_copy_from_handler and pg_copy_to_handler)?

> and v4 is an attempt at solving two problems, while v3 aims to improve
> one case. It seems to me that each callback portion should be focused
> on staying in its own area of the code, aka copyfrom*.c or copyto*.c.
> --
> Michael

--
Regards
Junwang Zhao
Re: Something seems weird inside tts_virtual_copyslot()
On Fri, 1 Dec 2023 at 14:30, David Rowley wrote:
>
> On Fri, 1 Dec 2023 at 13:14, Andres Freund wrote:
> > So I think adding an assert to ExecCopySlot(), perhaps with a comment
> > saying that the restriction could be lifted with a bit of work, would
> > be fine.
>
> How about the attached? I wrote the comment you mentioned and also
> removed the Assert from tts_virtual_copyslot().

I looked over this again and didn't see any issues, so pushed the patch.

Thanks for helping with this.

David
Re: remaining sql/json patches
Here are a couple of small patches to tidy up the parser a bit in your
v28-0004 (JSON_TABLE) patch. It's not a lot; the rest looks okay to me.
(I don't have an opinion on the concurrent discussion on resolving some
precedence issues.)

From 0dc7e7852702272f0bf12aaa4b56b9ac60c4d969 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut
Date: Thu, 7 Dec 2023 08:56:26 +0100
Subject: [PATCH 1/3] Fix spurious tab characters

---
 src/backend/parser/gram.y | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 3755434af0..78aaa7a32f 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -16661,7 +16661,7 @@ json_quotes_clause_opt:
 json_table:
 			JSON_TABLE '('
 			json_value_expr ',' a_expr json_passing_clause_opt
-			COLUMNS '(' json_table_column_definition_list ')'
+			COLUMNS '(' json_table_column_definition_list ')'
 			json_table_plan_clause_opt
 			json_behavior_clause_opt
 			')'
@@ -16680,7 +16680,7 @@ json_table:
 			}
 		| JSON_TABLE '('
 			json_value_expr ',' a_expr AS name json_passing_clause_opt
-			COLUMNS '(' json_table_column_definition_list ')'
+			COLUMNS '(' json_table_column_definition_list ')'
 			json_table_plan_clause_opt
 			json_behavior_clause_opt
 			')'
@@ -16772,7 +16772,7 @@ json_table_column_definition:
 				$$ = (Node *) n;
 			}
 		| NESTED path_opt Sconst
-			COLUMNS '(' json_table_column_definition_list ')'
+			COLUMNS '(' json_table_column_definition_list ')'
 			{
 				JsonTableColumn *n = makeNode(JsonTableColumn);
@@ -16784,7 +16784,7 @@ json_table_column_definition:
 				$$ = (Node *) n;
 			}
 		| NESTED path_opt Sconst AS name
-			COLUMNS '(' json_table_column_definition_list ')'
+			COLUMNS '(' json_table_column_definition_list ')'
 			{
 				JsonTableColumn *n = makeNode(JsonTableColumn);
--
2.43.0

From 5ee8a90940d107fc3bf93cccfde2e6a511218377 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut
Date: Thu, 7 Dec 2023 09:15:21 +0100
Subject: [PATCH 2/3] Light reformatting

---
 src/backend/parser/gram.y | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 78aaa7a32f..4fad661ff0 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -16883,12 +16883,14 @@ json_table_plan_sibling:
 		;

 json_table_default_plan_choices:
-			json_table_default_plan_inner_outer		{ $$ = $1 | JSTPJ_UNION; }
-			| json_table_default_plan_union_cross	{ $$ = $1 | JSTPJ_OUTER; }
-			| json_table_default_plan_inner_outer ','
-			  json_table_default_plan_union_cross	{ $$ = $1 | $3; }
-			| json_table_default_plan_union_cross ','
-			  json_table_default_plan_inner_outer	{ $$ = $1 | $3; }
+			json_table_default_plan_inner_outer
+				{ $$ = $1 | JSTPJ_UNION; }
+			| json_table_default_plan_union_cross
+				{ $$ = $1 | JSTPJ_OUTER; }
+			| json_table_default_plan_inner_outer ',' json_table_default_plan_union_cross
+				{ $$ = $1 | $3; }
+			| json_table_default_plan_union_cross ',' json_table_default_plan_inner_outer
+				{ $$ = $1 | $3; }
 		;

 json_table_default_plan_inner_outer:
--
2.43.0

From a57cea1ea87fe78e6b4e90cef6bbecde67fa Mon Sep 17 00:00:00 2001
From: Peter Eisentraut
Date: Thu, 7 Dec 2023 09:15:36 +0100
Subject: [PATCH 3/3] Remove some unnecessary intermediate rules

---
 src/backend/parser/gram.y | 18 ++++--------------
 1 file changed, 4 insertions(+), 14 deletions(-)

diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 4fad661ff0..b5eb73acf9 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -657,10 +657,8 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
Re: Wrong results with grouping sets
On Mon, Sep 25, 2023 at 3:11 PM Richard Guo wrote:
> If the grouping expression is a Var or PHV, we can just set its
> nullingrels, very straightforward. For an expression that is neither a
> Var nor a PHV, I'm not quite sure how to set the nullingrels. I tried
> the idea of wrapping it in a new PHV to carry the nullingrels, but that
> may cause some unnecessary plan diffs. In the patch, for such an
> expression I just set the nullingrels of the Vars or PHVs that are
> contained in it. This is not really 'correct' in theory, because it is
> the whole expression that can be nullable by grouping sets, not its
> individual vars. But it works in practice, because what we need is that
> the expression can be somehow distinguished from the same expression in
> ECs, and marking its vars is sufficient for this purpose. But what if
> the expression is variable-free? This is the point that needs more
> work. For now the patch just handles variable-free expressions of type
> Const, by wrapping them in a new PHV.

For a variable-free expression, if it contains volatile functions, SRFs,
aggregates, or window functions, it would not be treated as a member of an
EC that is redundant (see get_eclass_for_sort_expr()). That means it would
not be removed from the pathkeys list, so we do not need to set the
nullingrels for it. Otherwise we can just wrap the expression in a
PlaceHolderVar.

Attached is an updated patch to do that.

BTW, this wrong-results issue has existed ever since grouping sets were
introduced in v9.5, and there have been field reports complaining about
it. I think it would be great if we can get rid of it. I'm still not sure
what we should do in back branches, but let's fix it at least on v16 and
later.

Thanks
Richard

v2-0001-Mark-expressions-nullable-by-grouping-sets.patch
Description: Binary data
Re: Synchronizing slots from primary to standby
Hi. Here are my review comments for patch v43-0002.

==
Commit message

1.
The nap time of worker is tuned according to the activity on the primary.
The worker starts with nap time of 10ms and if no activity is observed on
the primary for some time, then nap time is increased to 10sec. And if
activity is observed again, nap time is reduced back to 10ms.

~

/nap time of worker/nap time of the worker/

/And if/If/

~~~

2.
Slots synced on the standby can be identified using 'sync_state' column of
pg_replication_slots view. The values are:
'n': none for user slots,
'i': sync initiated for the slot but waiting for the remote slot on the
primary server to catch up.
'r': ready for periodic syncs.

~

/identified using/identified using the/

The meaning of "identified by" is unclear to me. It also seems to clash
with later descriptions in system-views.sgml. Please see my later review
comment about it (in the sgml file).

==
doc/src/sgml/bgworker.sgml

3.
bgw_start_time is the server state during which postgres should start the
process; it can be one of BgWorkerStart_PostmasterStart (start as soon as
postgres itself has finished its own initialization; processes requesting
this are not eligible for database connections),
BgWorkerStart_ConsistentState (start as soon as a consistent state has
been reached in a hot standby, allowing processes to connect to databases
and run read-only queries), and BgWorkerStart_RecoveryFinished (start as
soon as the system has entered normal read-write state. Note that the
BgWorkerStart_ConsistentState and BgWorkerStart_RecoveryFinished are
equivalent in a server that's not a hot standby), and
BgWorkerStart_ConsistentState_HotStandby (same meaning as
BgWorkerStart_ConsistentState but it is more strict in terms of the
server, i.e. start the worker only if it is hot-standby; if it is
consistent state in non-standby, worker will not be started). Note that
this setting only indicates when the processes are to be started; they do
not stop when a different state is reached.

~

3a.
This seems to have grown to become just one enormous sentence that is too
hard to read. IMO this should be changed to be a list of possible values
instead of a big slab of text. I suspect it could also be simplified quite
a lot -- something like below.

SUGGESTION
bgw_start_time is the server state during which postgres should start the
process. Note that this setting only indicates when the processes are to
be started; they do not stop when a different state is reached. Possible
values are:

- BgWorkerStart_PostmasterStart (start as soon as postgres itself has
finished its own initialization; processes requesting this are not
eligible for database connections)

- BgWorkerStart_ConsistentState (start as soon as a consistent state has
been reached in a hot-standby, allowing processes to connect to databases
and run read-only queries)

- BgWorkerStart_RecoveryFinished (start as soon as the system has entered
normal read-write state. Note that the BgWorkerStart_ConsistentState and
BgWorkerStart_RecoveryFinished are equivalent in a server that's not a
hot standby)

- BgWorkerStart_ConsistentState_HotStandby (same meaning as
BgWorkerStart_ConsistentState but it is more strict in terms of the
server, i.e. start the worker only if it is hot-standby; if it is a
consistent state in non-standby, the worker will not be started)

~~~

3b.
"i.e. start the worker only if it is hot-standby; if it is consistent
state in non-standby, worker will not be started"

~

Why is it even necessary to say the 2nd part, "if it is consistent state
in non-standby, worker will not be started"? It seems redundant given the
1st part says the same, right?

==
doc/src/sgml/config.sgml

4.
+
+        The standbys corresponding to the physical replication slots in
+        standby_slot_names must enable
+        enable_syncslot for the standbys to receive
+        failover logical slots changes from the primary.
+

4a.
Somehow "must enable enable_syncslot" seemed strange. Maybe re-word like:

"must enable slot synchronization (see enable_syncslot)"

OR

"must configure enable_syncslot = true"

~~~

4b.
(seems like repetitive use of "the standbys")

/for the standbys to/to/

OR

/for the standbys to/so they can/

~~~

5.
         primary_conninfo string, or in a separate
-        ~/.pgpass file on the standby server (use
+        ~/.pgpass file on the standby server. (use

This rearranged period seems unrelated to the current patch. Maybe don't
touch this.

~~~

6.
+
+        Specify dbname in
+        primary_conninfo string to allow synchronization
+        of slots from the primary server to the standby server.
+        This will only be used for slot synchronization. It is ignored
+        for streaming.

The wording "to allow synchronization of slots" seemed misleading to me.
Isn't that more the purpose of the 'enable_syncslot' GUC? I think the
intended wording is more like below: