Re: postgres_fdw test timeouts
Hello Nathan, 08.03.2024 01:00, Nathan Bossart wrote: On Sun, Dec 10, 2023 at 12:00:01PM +0300, Alexander Lakhin wrote: So I would not say that it's a dominant failure for now, and given that 04a09ee lives in master only, maybe we can save two commits (Revert + Revert of revert) while moving to a more persistent solution. I just checked in on this one to see whether we needed to create an "open item" for v17. While I'm not seeing the failures anymore, the patch that Alexander claimed should fix it [0] doesn't appear to have been committed, either. Perhaps this was fixed by something else... [0] https://postgr.es/m/CA%2BhUKGL0bikWSC2XW-zUgFWNVEpD_gEWXndi2PE5tWqmApkpZQ%40mail.gmail.com I have re-run the tests and found out that the issue was fixed by d3c5f37dd. It changed the body of the loop "while (PQisBusy(conn))" in pgfdw_get_result() as follows:

    /* Data available in socket? */
    if (wc & WL_SOCKET_READABLE)
    {
        if (!PQconsumeInput(conn))
            pgfdw_report_error(ERROR, NULL, conn, false, query);
    }

->

    /* Consume whatever data is available from the socket */
    if (PQconsumeInput(conn) == 0)
    {
        /* trouble; expect PQgetResult() to return NULL */
        break;
    }

That is, the unconditional call to PQconsumeInput() eliminates the test timeout. Best regards, Alexander
Re: Add comment to specify timeout unit in ConditionVariableTimedSleep()
On Tue, Mar 05, 2024 at 03:20:48PM +0900, Michael Paquier wrote: > That sounds like a good idea to me, so I'm OK with your suggestion. Applied this one as f160bf06f72a. Thanks. -- Michael signature.asc Description: PGP signature
Re: Exposing the lock manager's WaitForLockers() to SQL
Rebased and fixed conflicts. FWIW re: Andrey's comment in his excellent CF summary email[0]: we're currently using vanilla Postgres (via Gentoo) on single nodes, and not anything fancy like Citus. The Citus relationship is just that we were inspired by Marco's blog post there. We have a variety of clients written in different languages that generally don't coordinate their table modifications, and Marco's scheme merely requires them to use sequences idiomatically, which we can just about manage. :-) This feature is then a performance optimization to support this scheme while avoiding the case where one writer holding a RowExclusiveLock blocks the reader from taking a ShareLock which in turn prevents other writers from taking a RowExclusiveLock for a long time. Instead, the reader can wait for the first writer without taking any locks or blocking later writers. I've illustrated this difference in the isolation tests. Still hoping we can get this into 17. :-) [0] https://www.postgresql.org/message-id/C8D65462-0888-4484-A72C-C99A94381ECD%40yandex-team.ru v10-0003-Add-pg_wait_for_lockers-function.patch Description: Binary data v10-0001-Refactor-GetLockConflicts-into-more-general-GetL.patch Description: Binary data v10-0002-Allow-specifying-single-lockmode-in-WaitForLocke.patch Description: Binary data
Re: Remove unnecessary code from psql's watch command
On Fri, 08 Mar 2024 12:09:12 -0500 Tom Lane wrote: > Yugo NAGATA writes: > > On Wed, 06 Mar 2024 13:03:39 -0500 > > Tom Lane wrote: > >> I don't have Windows here to test on, but does the WIN32 code > >> path work at all? > > > The outer loop is eventually exited even because PSQLexecWatch returns 0 > > when cancel_pressed = 0. However, it happens after executing an extra > > query in this function not just after exit of the inner loop. Therefore, > > it would be better to adding set and check of "done" in WIN32, too. > > Ah, I see now. Agreed, that could stand improvement. > > > I've attached the updated patch > > (v2_remove_unnecessary_code_in_psql_watch.patch). > > Pushed with minor tidying. Thanks! -- Yugo NAGATA
Re: pg_column_toast_chunk_id: a function to get a chunk ID of a TOASTed value
On Fri, 8 Mar 2024 16:17:58 -0600 Nathan Bossart wrote: > On Fri, Mar 08, 2024 at 03:31:55PM +0900, Yugo NAGATA wrote: > > On Thu, 7 Mar 2024 16:56:17 -0600 > > Nathan Bossart wrote: > >> to me. Do you think it's worth adding something like a > >> pg_column_toast_num_chunks() function that returns the number of chunks for > >> the TOASTed value, too? > > > > If we want to know the number of chunks of a specified chunk_id, > > we can get this by the following query. > > > > postgres=# SELECT id, (SELECT count(*) FROM pg_toast.pg_toast_16384 WHERE > > chunk_id = id) > > FROM (SELECT pg_column_toast_chunk_id(v) AS id FROM t); > > Good point. Overall, I think this patch is in decent shape, so I'll aim to > commit it sometime next week. Thank you. > > > +{ oid => '8393', descr => 'chunk ID of on-disk TOASTed value', > > + proname => 'pg_column_toast_chunk_id', provolatile => 's', prorettype => > > 'oid', > > + proargtypes => 'any', prosrc => 'pg_column_toast_chunk_id' }, > > Note to self: this change requires a catversion bump. > > > +INSERT INTO test_chunk_id(v1,v2) > > + VALUES (repeat('x', 1), repeat('x', 2048)); > > Is this guaranteed to be TOASTed for all possible page sizes? Should we use block_size? SHOW block_size \gset INSERT INTO test_chunk_id(v1,v2) VALUES (repeat('x', 1), repeat('x', (:block_size / 4))); I think this will work in various page sizes. I've attached a patch in which the test is updated. Regards, Yugo Nagata > > -- > Nathan Bossart > Amazon Web Services: https://aws.amazon.com -- Yugo NAGATA >From c3b99418d1ae3a8235db9bedd95e7d6ac1556f90 Mon Sep 17 00:00:00 2001 From: Yugo Nagata Date: Wed, 29 Mar 2023 09:59:25 +0900 Subject: [PATCH v8] Add pg_column_toast_chunk_id function This function returns the chunk_id of an on-disk TOASTed value, or NULL if the value is un-TOASTed or not on disk. This enables users to know which columns are actually TOASTed. 
This function is also useful to identify a problematic row when an error like "ERROR: unexpected chunk number ... (expected ...) for toast value" occurs. --- doc/src/sgml/func.sgml | 17 src/backend/utils/adt/varlena.c | 41 src/include/catalog/pg_proc.dat | 3 ++ src/test/regress/expected/misc_functions.out | 27 + src/test/regress/sql/misc_functions.sql | 24 5 files changed, 112 insertions(+) diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index 0bb7aeb40e..3169e2aeb7 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -28552,6 +28552,23 @@ postgres=# SELECT '0/0'::pg_lsn + pd.segment_number * ps.setting::int + :offset + + + + pg_column_toast_chunk_id + +pg_column_toast_chunk_id ( "any" ) +oid + + +Shows the chunk_id of an on-disk +TOASTed value. Returns NULL +if the value is un-TOASTed or not on-disk. +See for details about +TOAST. + + + diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c index 543afb66e5..84d36781a4 100644 --- a/src/backend/utils/adt/varlena.c +++ b/src/backend/utils/adt/varlena.c @@ -5105,6 +5105,47 @@ pg_column_compression(PG_FUNCTION_ARGS) PG_RETURN_TEXT_P(cstring_to_text(result)); } +/* + * Return the chunk_id of the on-disk TOASTed value. + * Return NULL if the value is unTOASTed or not on disk. 
+ */ +Datum +pg_column_toast_chunk_id(PG_FUNCTION_ARGS) +{ + int typlen; + struct varlena *attr; + struct varatt_external toast_pointer; + + /* On first call, get the input type's typlen, and save at *fn_extra */ + if (fcinfo->flinfo->fn_extra == NULL) + { + /* Lookup the datatype of the supplied argument */ + Oid argtypeid = get_fn_expr_argtype(fcinfo->flinfo, 0); + + typlen = get_typlen(argtypeid); + if (typlen == 0) /* should not happen */ + elog(ERROR, "cache lookup failed for type %u", argtypeid); + + fcinfo->flinfo->fn_extra = MemoryContextAlloc(fcinfo->flinfo->fn_mcxt, + sizeof(int)); + *((int *) fcinfo->flinfo->fn_extra) = typlen; + } + else + typlen = *((int *) fcinfo->flinfo->fn_extra); + + if (typlen != -1) + PG_RETURN_NULL(); + + attr = (struct varlena *) DatumGetPointer(PG_GETARG_DATUM(0)); + + if (!VARATT_IS_EXTERNAL_ONDISK(attr)) + PG_RETURN_NULL(); + + VARATT_EXTERNAL_GET_POINTER(toast_pointer, attr); + + PG_RETURN_OID(toast_pointer.va_valueid); +} + /* * string_agg - Concatenates values and returns string. * diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat index 291ed876fc..443ca854a6 100644 --- a/src/include/catalog/pg_proc.dat +++ b/src/include/catalog/pg_proc.dat @@ -7447,6 +7447,9 @@ { oid => '2121', descr => 'compression method for the compressed datum', proname => 'pg_column_compression', provolatile => 's', prorettype => 'text', proargtypes => 'any',
Re: Patch: Add parse_type Function
On Mar 7, 2024, at 23:39, Erik Wienhold wrote: > I think you need to swap the examples. The text mentions the error case > first and the NULL case second. 🤦‍♂️ Thanks, fixed in the attached patch. David v11-0001-Add-to_regtypemod-SQL-function.patch Description: Binary data
Re: Emitting JSON to file using COPY TO
On Sat, Mar 9, 2024 at 2:03 AM Joe Conway wrote: > > On 3/8/24 12:28, Andrey M. Borodin wrote: > > Hello everyone! > > > > Thanks for working on this, really nice feature! > > > >> On 9 Jan 2024, at 01:40, Joe Conway wrote: > >> > >> Thanks -- will have a look > > > > Joe, recently folks proposed a lot of patches in this thread that seem like > > diverted from original way of implementation. > > As an author of CF entry [0] can you please comment on which patch version > > needs review? > > > I don't know if I agree with the proposed changes, but I have also been > waiting to see how the parallel discussion regarding COPY extensibility > shakes out. > > And there were a couple of issues found that need to be tracked down. > > Additionally I have had time/availability challenges recently. > > Overall, chances seem slim that this will make it into 17, but I have > not quite given up hope yet either. Hi. A summary of the changes I've made in the v9 patches at [0]: meta: rebased. You now need to use `git apply` or `git am`; previously, for copyto_json.007.diff, you needed to use GNU patch. At [1], Dean Rasheed found some corner cases where the returned slot's tts_tupleDescriptor from ` ExecutorRun(cstate->queryDesc, ForwardScanDirection, 0, true); processed = ((DR_copy *) cstate->queryDesc->dest)->processed; ` cannot be used for composite_to_json. Generally, DestReceiver->rStartup is what sends the TupleDesc to the DestReceiver; the COPY TO DestReceiver's rStartup function is copy_dest_startup, but copy_dest_startup is a no-op. That means that to make the final TupleDesc of a COPY TO (FORMAT JSON) operation bulletproof, we need to copy the tupDesc from CopyToState's queryDesc. This only applies when the COPY TO source is a query (example: copy (select 1) to stdout), not a table. The above is my interpretation. At [2], Masahiko Sawada made several points, mainly to split the patch in two: one for format json, the second for the force_array option. Splitting it in two makes it easier to review, I think. 
My changes also addressed all the points Masahiko Sawada had mentioned. [0] https://postgr.es/m/cacjufxhd6zrmjjbsdogpovavaekms-u6aorcw0ja-wyi-0k...@mail.gmail.com [1] https://postgr.es/m/CAEZATCWh29787xf=4ngkoixeqrhrqi0qd33z6_-f8t2dz0y...@mail.gmail.com [2] https://postgr.es/m/cad21aocb02zhzm3vxb8hsw8fwosl+irdefb--kmunv8pjpa...@mail.gmail.com
Re: sslinfo extension - add notbefore and notafter timestamps
Hello Thank you for the review and your patch. I have tested with minimum OpenSSL version 1.0.2 support and incorporated your changes into the v9 patch as attached. > In my -08 timezone, the date doesn't match what's recorded either > (it's my "tomorrow"). I think those probably just need to be converted > to UTC explicitly? I've attached a sample diff on top of v8 that > passes tests on my machine. Yes, I noticed this in the SSL test too. I am also in GMT-8, so for me the tests would fail too due to the time zone differences from GMT. It should be okay to explicitly set the outputs of pg_stat_ssl, ssl_client_get_notbefore, and ssl_client_get_notafter to the GMT time zone, since the notBefore and notAfter timestamps in a client certificate are generally expressed in GMT. Thank you! Cary Huang - HighGo Software Inc. (Canada) cary.hu...@highgo.ca www.highgo.ca v9-0001-Add-notBefore-and-notAfter-to-SSL-cert-info-displ.patch Description: Binary data
Re: POC, WIP: OR-clause support for indexes
+ if (!IsA(lfirst(lc), Invalid)) + { + or_list = lappend(or_list, lfirst(lc)); + continue; + } Currently `IsA(lfirst(lc), Invalid)` works, but is this generally OK? I didn't find any other examples. Do you need to do a cast, like `(Node *) lfirst(lc)`? If I understand the logic correctly: in `foreach(lc, args)`, if everything goes well, it will reach `hashkey.type = T_Invalid;`, which will make `IsA(lfirst(lc), Invalid)` be true. Adding some comments to the above code would be great.
Re: pipe_read_line for reading arbitrary strings
> On 8 Mar 2024, at 19:38, Daniel Gustafsson wrote: > >> On 8 Mar 2024, at 18:13, Peter Eisentraut wrote: > >>> Good catch, that's an incorrect copy/paste, it should use ERRCODE_NO_DATA. >> >> Also it shouldn't print %m, was my point. > > Absolutely, I removed that in the patch upthread, it was clearly wrong. Pushed the fix for the incorrect logline and errcode. -- Daniel Gustafsson
Re: Call perror on popen failure
> On 8 Mar 2024, at 18:08, Peter Eisentraut wrote: > > On 08.03.24 11:05, Daniel Gustafsson wrote: >> If popen fails in pipe_read_line we invoke perror for the error message, and >> pipe_read_line is in turn called by find_other_exec which is used in both >> frontend and backend code. This is an old codepath, and it seems like it >> ought >> to be replaced with the more common logging tools we now have like in the >> attached diff (which also makes the error translated as opposed to now). Any >> objections to removing this last perror() call? > > This change makes it consistent with other popen() calls, so it makes sense > to me. Thanks for review, committed that way. -- Daniel Gustafsson
Re: Make query cancellation keys longer
On 01/03/2024 07:19, Jelte Fennema-Nio wrote: I think this behaviour makes more sense as a minor protocol version bump instead of a parameter. Ok, the consensus is to bump the minor protocol version for this. Works for me. + if (strcmp(conn->workBuffer.data, "_pq_.extended_query_cancel") == 0) + { + /* that's ok */ + continue; + } Please see this thread[1], which in the first few patches makes pqGetNegotiateProtocolVersion3 actually usable for extending the protocol. I started that because every protocol change that's proposed on the list has similar changes to pqGetNegotiateProtocolVersion3, and I think we shouldn't make these changes ad-hoc hacked into the current code, but actually do them once in a way that makes sense for all protocol changes. Thanks, I will take a look and respond on that thread. Documentation is still missing I think at least protocol message type documentation would be very helpful in reviewing, because that is really a core part of this change. Added some documentation. There's more work to be done there, but at least the message type descriptions are now up-to-date. Based on the current code I think it should have a few changes: + int cancel_key_len = 5 + msgLength - (conn->inCursor - conn->inStart); + + conn->be_cancel_key = malloc(cancel_key_len); + if (conn->be_cancel_key == NULL) This is using the message length to determine the length of the cancel key in BackendKey. That is not something we generally do in the protocol. It's even documented: "Notice that although each message includes a byte count at the beginning, the message format is defined so that the message end can be found without reference to the byte count." So I think the patch should be changed to include the length of the cancel key explicitly in the message. 
The nice thing about relying on the message length was that we could just redefine the CancelRequest message to have a variable length secret, and old CancelRequest with 4-byte secret was compatible with the new definition too. But it doesn't matter much, so I added an explicit length field. FWIW I don't think that restriction makes sense. Any code that parses the messages needs to have the message length available, and I don't think it helps with sanity checking that much. I think the documentation is a little anachronistic. The real reason that all the message types include enough information to find the message end is that in protocol version 2, there was no message length field. The only exception that doesn't have that property is CopyData, and it's no coincidence that it was added in protocol version 3. On 03/03/2024 16:27, Jelte Fennema-Nio wrote: On Fri, 1 Mar 2024 at 06:19, Jelte Fennema-Nio wrote: This is a preliminary review. I'll look at this more closely soon. Some more thoughts after looking some more at the proposed changes +#define EXTENDED_CANCEL_REQUEST_CODE PG_PROTOCOL(1234,5677) nit: I think the code should be 1234,5679 because it's the newer message type, so to me it makes more sense if the code is a larger number. Unfortunately 1234,5679 is already in use by NEGOTIATE_SSL_CODE. That's why I went the other direction. If we want to add this to "the end", it needs to be 1234,5681. But I wanted to keep the cancel requests together. + * FIXME: we used to use signal_child. I believe kill() is + * maybe even more correct, but verify that. signal_child seems to be the correct one, not kill. signal_child has this relevant comment explaining why it's better than plain kill: * On systems that have setsid(), each child process sets itself up as a * process group leader. For signals that are generally interpreted in the * appropriate fashion, we signal the entire process group not just the * direct child process. 
This allows us to, for example, SIGQUIT a blocked * archive_recovery script, or SIGINT a script being run by a backend via * system(). I changed it to signal the process group if HAVE_SETSID, like pg_signal_backend() does. We don't need to signal both the process group and the process itself like signal_child() does, because we don't have the race condition with recently-forked children that signal_child() talks about. +SendCancelRequest(int backendPID, int32 cancelAuthCode) I think this name of the function is quite confusing, it's not sending a cancel request, it is processing one. It sends a SIGINT. Heh, well, it's sending the cancel request signal to the right backend, but I see your point. Renamed to ProcessCancelRequest. While we're at it, switch to using atomics in pmsignal.c for the state field. That feels easier to reason about than volatile pointers. I feel like this refactor would benefit from being a separate commit. That would make it easier to follow which change to pmsignal.c is necessary for what. Point taken. I didn't do that yet, but it makes sense. + MyCancelKeyLength = (MyProcPort != NULL && MyProcPort->extended_query_cancel)
Re: pg_column_toast_chunk_id: a function to get a chunk ID of a TOASTed value
On Fri, Mar 08, 2024 at 03:31:55PM +0900, Yugo NAGATA wrote: > On Thu, 7 Mar 2024 16:56:17 -0600 > Nathan Bossart wrote: >> to me. Do you think it's worth adding something like a >> pg_column_toast_num_chunks() function that returns the number of chunks for >> the TOASTed value, too? > > If we want to know the number of chunks of a specified chunk_id, > we can get this by the following query. > > postgres=# SELECT id, (SELECT count(*) FROM pg_toast.pg_toast_16384 WHERE > chunk_id = id) > FROM (SELECT pg_column_toast_chunk_id(v) AS id FROM t); Good point. Overall, I think this patch is in decent shape, so I'll aim to commit it sometime next week. > +{ oid => '8393', descr => 'chunk ID of on-disk TOASTed value', > + proname => 'pg_column_toast_chunk_id', provolatile => 's', prorettype => > 'oid', > + proargtypes => 'any', prosrc => 'pg_column_toast_chunk_id' }, Note to self: this change requires a catversion bump. > +INSERT INTO test_chunk_id(v1,v2) > + VALUES (repeat('x', 1), repeat('x', 2048)); Is this guaranteed to be TOASTed for all possible page sizes? -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: vacuumdb/clusterdb/reindexdb: allow specifying objects to process in all databases
On Fri, Mar 08, 2024 at 09:33:19AM +, Dean Rasheed wrote: > If I'm nitpicking, "[--verbose | -v]" in the clusterdb synopsis should > be replaced with "[option...]", like the other commands, because there > are other general-purpose options like --quiet and --echo. Good catch. I fixed that in v4. We could probably back-patch this particular change, but since it's been this way for a while, I don't think it's terribly important to do so. > I think this is good to go. Thanks. In v4, I've added a first draft of the commit messages, and I am planning to commit this early next week. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From 460da2161265b042079727c1178eff92b3d537b6 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Fri, 8 Mar 2024 14:35:07 -0600 Subject: [PATCH v4 1/3] vacuumdb: Allow specifying objects to process in all databases. Presently, vacuumdb's --table, --schema, and --exclude-schema options cannot be used together with --all, i.e., you cannot specify tables or schemas to process in all databases. This commit removes this unnecessary restriction, thus enabling potentially useful command like "vacuumdb --all --schema pg_catalog". 
Reviewed-by: Kyotaro Horiguchi, Dean Rasheed Discussion: https://postgr.es/m/20230628232402.GA1954626%40nathanxps13 --- doc/src/sgml/ref/vacuumdb.sgml| 60 ++- src/bin/scripts/t/100_vacuumdb.pl | 24 ++--- src/bin/scripts/vacuumdb.c| 19 +++--- 3 files changed, 52 insertions(+), 51 deletions(-) diff --git a/doc/src/sgml/ref/vacuumdb.sgml b/doc/src/sgml/ref/vacuumdb.sgml index 09356ea4fa..66fccb30a2 100644 --- a/doc/src/sgml/ref/vacuumdb.sgml +++ b/doc/src/sgml/ref/vacuumdb.sgml @@ -36,7 +36,13 @@ PostgreSQL documentation - dbname + + + dbname + -a + --all + + @@ -47,40 +53,44 @@ PostgreSQL documentation - - - - -n - --schema - - schema - - - - - - - -N - --exclude-schema - - schema - - + -n + --schema + schema - dbname + + + dbname + -a + --all + + vacuumdb connection-option option - --a ---all - + + + + + -N + --exclude-schema + + schema + + + + + + dbname + -a + --all + + diff --git a/src/bin/scripts/t/100_vacuumdb.pl b/src/bin/scripts/t/100_vacuumdb.pl index 0601fde205..1d8558c780 100644 --- a/src/bin/scripts/t/100_vacuumdb.pl +++ b/src/bin/scripts/t/100_vacuumdb.pl @@ -184,18 +184,18 @@ $node->command_fails_like( [ 'vacuumdb', '-n', 'pg_catalog', '-N', '"Foo"', 'postgres' ], qr/cannot vacuum all tables in schema\(s\) and exclude schema\(s\) at the same time/, 'cannot use options -n and -N at the same time'); -$node->command_fails_like( - [ 'vacuumdb', '-a', '-N', '"Foo"' ], - qr/cannot exclude specific schema\(s\) in all databases/, - 'cannot use options -a and -N at the same time'); -$node->command_fails_like( - [ 'vacuumdb', '-a', '-n', '"Foo"' ], - qr/cannot vacuum specific schema\(s\) in all databases/, - 'cannot use options -a and -n at the same time'); -$node->command_fails_like( - [ 'vacuumdb', '-a', '-t', '"Foo".bar' ], - qr/cannot vacuum specific table\(s\) in all databases/, - 'cannot use options -a and -t at the same time'); +$node->issues_sql_like( + [ 'vacuumdb', '-a', '-N', 'pg_catalog' ], + qr/(?:(?!VACUUM \(SKIP_DATABASE_STATS\) 
pg_catalog.pg_class).)*/, + 'vacuumdb -a -N'); +$node->issues_sql_like( + [ 'vacuumdb', '-a', '-n', 'pg_catalog' ], + qr/VACUUM \(SKIP_DATABASE_STATS\) pg_catalog.pg_class/, + 'vacuumdb -a -n'); +$node->issues_sql_like( + [ 'vacuumdb', '-a', '-t', 'pg_class' ], + qr/VACUUM \(SKIP_DATABASE_STATS\) pg_catalog.pg_class/, + 'vacuumdb -a -t'); $node->command_fails_like( [ 'vacuumdb', '-a', '-d', 'postgres' ], qr/cannot vacuum all databases and a specific one at the same time/, diff --git a/src/bin/scripts/vacuumdb.c b/src/bin/scripts/vacuumdb.c index 291766793e..7138c6e97e 100644 --- a/src/bin/scripts/vacuumdb.c +++ b/src/bin/scripts/vacuumdb.c @@ -72,6 +72,7 @@ static void vacuum_one_database(ConnParams *cparams, static void vacuum_all_databases(ConnParams *cparams, vacuumingOptions *vacopts, bool analyze_in_stages, + SimpleStringList *objects, int concurrentCons, const char *progname, bool echo, bool quiet); @@ -378,6 +379,7 @@ main(int argc, char *argv[]) vacuum_all_databases(, , analyze_in_stages, + , concurrentCons, progname, echo, quiet); } @@ -429,18 +431,6 @@ check_objfilter(void) (objfilter & OBJFILTER_DATABASE)) pg_fatal("cannot vacuum all databases and a specific one at the same time"); - if ((objfilter & OBJFILTER_ALL_DBS) && - (objfilter &
Re: Streaming read-ready sequential scan code
On Sat, Mar 02, 2024 at 06:07:48PM -0500, Melanie Plageman wrote: > On Wed, Feb 28, 2024 at 12:30 PM Melanie Plageman > wrote: > > > > On Mon, Feb 26, 2024 at 03:56:57PM -0500, Melanie Plageman wrote: > > > On Mon, Feb 19, 2024 at 6:05 PM Melanie Plageman > > > wrote: > > > > > > > > On Mon, Jan 29, 2024 at 4:17 PM Melanie Plageman > > > > wrote: > > > > > > > > > > There is an outstanding question about where to allocate the > > > > > PgStreamingRead object for sequential scans > > > > > > > > I've written three alternative implementations of the actual streaming > > > > read user for sequential scan which handle the question of where to > > > > allocate the streaming read object and how to handle changing scan > > > > direction in different ways. > > > > > > > > Option A) > > > > https://github.com/melanieplageman/postgres/tree/seqscan_pgsr_initscan_allocation > > > > - Allocates the streaming read object in initscan(). Since we do not > > > > know the scan direction at this time, if the scan ends up not being a > > > > forwards scan, the streaming read object must later be freed -- so > > > > this will sometimes allocate a streaming read object it never uses. > > > > - Only supports ForwardScanDirection and once the scan direction > > > > changes, streaming is never supported again -- even if we return to > > > > ForwardScanDirection > > > > - Must maintain a "fallback" codepath that does not use the streaming > > > > read API > > > > > > Attached is a version of this patch which implements a "reset" > > > function for the streaming read API which should be cheaper than the > > > full pg_streaming_read_free() on rescan. This can easily be ported to > > > work on any of my proposed implementations (A/B/C). I implemented it > > > on A as an example. > > > > Attached is the latest version of this patchset -- rebased in light of > > Thomas' updates to the streaming read API [1]. I have chosen the > > approach I think we should go with. 
It is a hybrid of my previously > > proposed approaches. > > While investigating some performance concerns, Andres pointed out that > the members I add to HeapScanDescData in this patch push rs_cindex and > rs_ntuples to another cacheline and introduce a 4-byte hole. Attached > v4's HeapScanDescData is as well-packed as on master and its members > are reordered so that rs_cindex and rs_ntuples are back on the second > cacheline of the struct's data. I did some additional profiling and realized that dropping the unlikely() from the places we check rs_inited frequently was negatively impacting performance. v5 adds those back and also makes a few other very minor cleanups. Note that this patch set has a not yet released version of Thomas Munro's Streaming Read API with a new ramp-up logic which seems to fix a performance issue I saw with my test case when all of the sequential scan's blocks are in shared buffers. Once he sends the official new version, I will rebase this and point to his explanation in that thread. - Melanie >From 29827c23a11061846a7c145f430aa4712c1c30f3 Mon Sep 17 00:00:00 2001 From: Melanie Plageman Date: Sat, 27 Jan 2024 18:39:37 -0500 Subject: [PATCH v5 1/5] Split heapgetpage into two parts heapgetpage(), a per-block utility function used in heap scans, read a passed-in block into a buffer and then, if page-at-a-time processing was enabled, pruned the page and built an array of its visible tuples. This was used for sequential and sample scans. Future commits will add support for streaming reads. The streaming read API will read in the blocks specified by a callback, but any significant per-page processing should be done synchronously on the buffer yielded by the streaming read API. To support this, separate the logic in heapgetpage() to read a block into a buffer from that which prunes the page and builds an array of the visible tuples. The former is now heapfetchbuf() and the latter is now heapbuildvis(). 
Future commits will push the logic for selecting the next block into heapfetchbuf() in cases when streaming reads are not supported (such as backwards sequential scans). Because this logic differs for sample scans and sequential scans, inline the code to read the block into a buffer for sample scans. This has the added benefit of allowing for a bit of refactoring in heapam_scan_sample_next_block(), including unpinning the previous buffer before invoking the callback to select the next block. --- src/backend/access/heap/heapam.c | 74 ++-- src/backend/access/heap/heapam_handler.c | 40 + src/include/access/heapam.h | 2 +- 3 files changed, 72 insertions(+), 44 deletions(-) diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index 34bc60f625f..aef90d28473 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -363,17 +363,18 @@ heap_setscanlimits(TableScanDesc sscan, BlockNumber startBlk, BlockNumber numBlk } /* - *
Re: Proposal to add page headers to SLRU pages
On Wed, 2024-03-06 at 12:01 +, Li, Yong wrote: > Rebase the patch against the latest HEAD. The upgrade logic could use more comments explaining what's going on and why. As I understand it, it's a one-time conversion that needs to happen between 16 and 17. Is that right? Was the way CLOG is upgraded already decided in some earlier discussion? Given that the CLOG is append-only and gets truncated occasionally, I wonder whether we can just have some marker that xids before some number are the old CLOG, and xids beyond that number are in the new CLOG. I'm not necessarily suggesting that; just an idea. Regards, Jeff Davis
Re: Proposal to add page headers to SLRU pages
On Fri, 2024-03-08 at 12:39 -0800, Jeff Davis wrote: > Making a better secondary structure seems doable to me. Just to > brainstorm: We can also keep the group_lsns, and then just update both the CLOG page LSN and the group_lsn when setting the transaction status. The former would be used for all of the normal WAL-related stuff, and the latter would be used by TransactionIdGetStatus() to return the more precise LSN for that group. Regards, Jeff Davis
Re: Failures in constraints regression test, "read only 0 of 8192 bytes"
On 3/8/24 21:29, Thomas Munro wrote: > On Sat, Mar 9, 2024 at 2:36 AM Tomas Vondra > wrote: >> On 3/8/24 13:21, Tomas Vondra wrote: >>> My guess would be 8af25652489, as it's the only storage-related commit. >>> >>> I'm currently running tests to verify this. >>> >> >> Yup, the breakage starts with this commit. I haven't looked into the >> root cause, or whether the commit maybe just made some pre-existing >> issue easier to hit. Also, I haven't followed the discussion on the >> pgsql-bugs thread [1], maybe there are some interesting findings. > > Adding Heikki. I spent a bit of time investigating this today, but haven't made much progress due to (a) my unfamiliarity with the smgr code in general and the patch in particular, and (b) CLOBBER_CACHE_ALWAYS making it quite time consuming to iterate and experiment. However, the smallest schedule that still reproduces the issue is: --- test: test_setup test: create_aggregate create_function_sql create_cast constraints triggers select inherit typed_table vacuum drop_if_exists updatable_views roleattributes create_am hash_func errors infinite_recurse --- I tried to reduce the second step to just "constraints", but that does not fail. Clearly there's some concurrency at play, and having just a single backend makes that go away. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Proposal to add page headers to SLRU pages
On Fri, 2024-03-08 at 13:58 +0100, Alvaro Herrera wrote: > In the new code we effectively store only one LSN per page, which I > understand is strictly worse. To quote from the README: "Instead, we defer the setting of the hint bit if we have not yet flushed WAL as far as the LSN associated with the transaction. This requires tracking the LSN of each unflushed async commit." In other words, the problem the group_lsns are solving is that we can't set the hint bit on a tuple until the commit record for that transaction has actually been flushed. For ordinary sync commit, that's fine, because the CLOG bit isn't set until after the commit record is flushed. But for async commit, the CLOG may be updated before the WAL is flushed, and group_lsns are one way to track the right information to hold off updating the hint bits. "It is convenient to associate this data with clog buffers: because we will flush WAL before writing a clog page, we know that we do not need to remember a transaction's LSN longer than the clog page holding its commit status remains in memory." It's not clear to me that it is so convenient, if it's preventing the SLRU from fitting in with the rest of the system architecture. "The worst case is where a sync-commit xact shares a cached LSN with an async-commit xact that commits a bit later; even though we paid to sync the first xact to disk, we won't be able to hint its outputs until the second xact is sync'd, up to three walwriter cycles later." Perhaps we can revisit alternatives to the group_lsn? If we accept Yong's proposal, and the SLRU has a normal LSN and was used in the normal way, we would just need some kind of secondary structure to hold a mapping from XID->LSN only for async transactions. The characteristics we are looking for in this secondary structure are: 1. cheap to test if it's empty, so it doesn't interfere with a purely sync workload at all 2. 
expire old entries (where the LSN has already been flushed) cheaply enough so the data structure doesn't bloat 3. look up an LSN given an XID cheaply enough that it doesn't interfere with setting hint bits Making a better secondary structure seems doable to me. Just to brainstorm: * Have an open-addressed hash table mapping async XIDs to their commit LSN. If you have a hash collision, opportunistically see if the entry is old and can be removed. Try K probes, and if they are all recent, then you need to XLogFlush. The table could get pretty big, because it needs to hold enough async transactions for a wal writer cycle or two, but it seems reasonable to make async workloads pay that memory cost. * Another idea, if the size of the structure is a problem, is to group K async xids into a bloom filter that points at a single LSN. When more transactions come along, create another bloom filter for the next K async xids. This could interfere with setting hint bits for sync xids if the bloom filters are undersized, but that doesn't seem like a big problem. Regards, Jeff Davis
Re: Failures in constraints regression test, "read only 0 of 8192 bytes"
On Sat, Mar 9, 2024 at 2:36 AM Tomas Vondra wrote: > On 3/8/24 13:21, Tomas Vondra wrote: > > My guess would be 8af25652489, as it's the only storage-related commit. > > > > I'm currently running tests to verify this. > > > > Yup, the breakage starts with this commit. I haven't looked into the > root cause, or whether the commit maybe just made some pre-existing > issue easier to hit. Also, I haven't followed the discussion on the > pgsql-bugs thread [1], maybe there are some interesting findings. Adding Heikki.
Re: type cache cleanup improvements
Yep, exactly. One time in 2^32 we reset the whole cache instead of just the entry (or entries) with hash value = 0. On 08.03.2024 18:35, Tom Lane wrote: Aleksander Alekseev writes: One thing that I couldn't immediately figure out is why 0 hash value is treated as a magic invalid value in TypeCacheTypCallback(): I've not read this patch, but IIRC in some places we have a convention that hash value zero is passed for an sinval reset event (that is, "flush all cache entries"). regards, tom lane -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: btree: downlink right separator/HIKEY optimization
On Fri, Mar 8, 2024 at 2:14 PM Peter Geoghegan wrote:
> What benchmarking have you done here?

I think that the memcmp() test is subtly wrong:

> +		if (PointerIsValid(rightsep))
> +		{
> +			/*
> +			 * Note: we're not in the rightmost page (see branchpoint earlier in
> +			 * the loop), so we always have a HIKEY on this page.
> +			 */
> +			ItemId		hikeyid = PageGetItemId(page, P_HIKEY);
> +			IndexTuple	highkey = (IndexTuple) PageGetItem(page, hikeyid);
> +
> +			if (ItemIdGetLength(hikeyid) == IndexTupleSize(rightsep) &&
> +				memcmp(&highkey[1], &rightsep[1],
> +					   IndexTupleSize(rightsep) - sizeof(IndexTupleData)) == 0)
> +			{
> +				break;
> +			}
> +		}

Unlike amcheck's bt_pivot_tuple_identical, your memcmp() does not compare relevant metadata fields from struct IndexTupleData. It wouldn't make sense for it to compare the block number, of course (if it did then the optimization would simply never work), but ISTM that you still need to compare ItemPointerData.ip_posid.

Suppose, for example, that you had two very similar pivot tuples from a multicolumn index on (a int4, b int2) columns. The first/earlier tuple is (a,b) = "(42, -inf)", due to the influence of suffix truncation. The second/later tuple is (a,b) = "(42, 0)", since suffix truncation couldn't take place when the second pivot tuple was created. (Actually, suffix truncation would have been possible, but it would have only managed to truncate-away the tiebreak heap TID attribute value in the case of our second tuple.)

There'll be more alignment padding (more zero padding bytes) in the second tuple, compared to the first. But the tuples are still the same size. When you go to memcmp() this pair of tuples using the approach in v3, the bytes that are actually compared will be identical, despite the fact that these are really two distinct tuples, with distinct values. (As I said, you'd have to actually compare the ItemPointerData.ip_posid metadata to notice this small difference.)

-- Peter Geoghegan
Re: speed up a logical replica setup
On 3/8/24 10:44, Hayato Kuroda (Fujitsu) wrote: > Dear Tomas, Euler, > > Thanks for starting to read the thread! Since I'm not an original author, > I want to reply partially. > >> I decided to take a quick look at this patch today, to see how it works >> and do some simple tests. I've only started to get familiar with it, so >> I have only some comments / questions regarding usage, not on the code. >> It's quite possible I didn't understand some finer points, or maybe it >> was already discussed earlier in this very long thread, so please feel >> free to push back or point me to the past discussion. >> >> Also, some of this is rather opinionated, but considering I didn't see >> this patch before, my opinions may easily be wrong ... > > I felt your comments were quite valuable. > >> 1) SGML docs >> >> It seems the SGML docs are more about explaining how this works on the >> inside, rather than how to use the tool. Maybe that's intentional, but >> as someone who didn't work with pg_createsubscriber before I found it >> confusing and not very helpful. >> >> For example, the first half of the page is prerequisites+warning, and >> sure those are useful details, but prerequisites are checked by the >> tool (so I can't really miss this) and warnings go into a lot of details >> about different places where things may go wrong. Sure, worth knowing >> and including in the docs, but maybe not right at the beginning, before >> I learn how to even run the tool? > > Hmm, right. I considered the improvements below. Tomas and Euler, what do you > think? > > * Adds more descriptions in the "Description" section. > * Moves prerequisites+warning to the "Notes" section. > * Adds a "Usage" section which describes starting from a single node. > >> I'm not sure FOR ALL TABLES is a good idea. Or said differently, I'm >> sure it won't work for a number of use cases.
I know in large databases >> it's common to create "work tables" (not necessarily temporary) as part >> of a batch job, but there's no need to replicate those tables. > > Indeed, the documentation does not describe that all tables in the database > would be included in the publication. > >> I do understand that FOR ALL TABLES is the simplest approach, and for v1 >> it may be an acceptable limitation, but maybe it'd be good to also >> support restricting which tables should be replicated (e.g. blacklist or >> whitelist based on table/schema name?). > > This may not be directly related, but we considered that accepting options was a > next step [1]. > >> Note: I now realize this might fall under the warning about DDL, which >> says this: >> >> Executing DDL commands on the source server while running >> pg_createsubscriber is not recommended. If the target server has >> already been converted to logical replica, the DDL commands must >> not be replicated so an error would occur. > > Yeah, they would not be replicated, but would not lead to an ERROR. > So should we say something like "Creating tables on the source server..."? > Perhaps. Clarifying the docs would help, but it depends on the wording. For example, I doubt this should talk about "creating tables" because there are other DDL that (probably) could cause issues (like adding a column to the table, or something like that). >> 5) slot / publication / subscription name >> >> I find it somewhat annoying it's not possible to specify names for >> objects created by the tool - replication slots, publication and >> subscriptions. If this is meant to be a replica running for a while, >> after a while I'll have no idea what pg_createsubscriber_569853 or >> pg_createsubscriber_459548_2348239 was meant for. >> >> This is particularly annoying because renaming these objects later is >> either not supported at all (e.g. for replication slots), or may be >> quite difficult (e.g. publications).
>> >> I do realize there are challenges with custom names (say, if there are >> multiple databases to replicate), but can't we support some simple >> formatting with basic placeholders? So we could specify >> >> --slot-name "myslot_%d_%p" >> >> or something like that? > > Not sure we can do this in the first version, but it looks nice. One concern is that I > cannot find applications which accept escape strings like log_line_prefix. > (It may just be because we do not have a use case.) Do you know of examples? > I can't think of a tool already doing that, but I think that's simply because it was not needed. Why should we be concerned about this? >> BTW what will happen if we convert multiple standbys? Can't they all get >> the same slot name (they all have the same database OID, and I'm not >> sure how much entropy the PID has)? > > I tested and the second try did not work. The primary reason was the name of > publication > - pg_createsubscriber_%u (oid). > FYI - previously we could reuse the same publications, but based on my comment [2] > the > feature was removed. It might be too optimistic. > OK. I could be convinced the other limitations are reasonable for v1 and
Re: Statistics Import and Export
> > Perhaps it's just me ... but it doesn't seem like it's all that many > parameters. > It's more than I can memorize, so that feels like a lot to me. Clearly it's not insurmountable. > > I am a bit concerned about the number of locks on pg_statistic and the > > relation itself, doing CatalogOpenIndexes/CatalogCloseIndexes once per > > attribute rather than once per relation. But I also see that this will > > mostly get used at a time when no other traffic is on the machine, and > > whatever it costs, it's still faster than the smallest table sample > (insert > > joke about "don't have to be faster than the bear" here). > > I continue to not be too concerned about this until and unless it's > actually shown to be an issue. Keeping things simple and > straight-forward has its own value. > Ok, I'm sold on that plan. > > > +/* > > + * Set statistics for a given pg_class entry. > > + * > > + * pg_set_relation_stats(relation Oid, reltuples double, relpages int) > > + * > > + * This does an in-place (i.e. non-transactional) update of pg_class, > just as > > + * is done in ANALYZE. > > + * > > + */ > > +Datum > > +pg_set_relation_stats(PG_FUNCTION_ARGS) > > +{ > > + const char *param_names[] = { > > + "relation", > > + "reltuples", > > + "relpages", > > + }; > > + > > + Oid relid; > > + Relationrel; > > + HeapTuple ctup; > > + Form_pg_class pgcform; > > + > > + for (int i = 0; i <= 2; i++) > > + if (PG_ARGISNULL(i)) > > + ereport(ERROR, > > + > (errcode(ERRCODE_INVALID_PARAMETER_VALUE), > > + errmsg("%s cannot be NULL", > param_names[i]))); > > Why not just mark this function as strict..? Or perhaps we should allow > NULLs to be passed in and just not update the current value in that > case? Strict could definitely apply here, and I'm inclined to make it so. > Also, in some cases we allow the function to be called with a > NULL but then make it a no-op rather than throwing an ERROR (eg, if the > OID ends up being NULL). 
Thoughts on it emitting a WARN or NOTICE before returning false? > Not sure if that makes sense here or not > offhand but figured I'd mention it as something to consider. > > > + pgcform = (Form_pg_class) GETSTRUCT(ctup); > > + pgcform->reltuples = PG_GETARG_FLOAT4(1); > > + pgcform->relpages = PG_GETARG_INT32(2); > > Shouldn't we include relallvisible? > Yes. No idea why I didn't have that in there from the start. > Also, perhaps we should use the approach that we have in ANALYZE, and > only actually do something if the values are different rather than just > always doing an update. > That was how it worked back in v1, more for the possibility that there was no matching JSON to set values. Looking again at analyze.c (currently lines 1751-1780), we just check if there is a row in the way, and if so we replace it. I don't see where we compare existing values to new values. > > > +/* > > + * Import statistics for a given relation attribute > > + * > > + * pg_set_attribute_stats(relation Oid, attname name, stainherit bool, > > + *stanullfrac float4, stawidth int, stadistinct > float4, > > + *stakind1 int2, stakind2 int2, stakind3 int3, > > + *stakind4 int2, stakind5 int2, stanumbers1 > float4[], > > + *stanumbers2 float4[], stanumbers3 float4[], > > + *stanumbers4 float4[], stanumbers5 float4[], > > + *stanumbers1 float4[], stanumbers2 float4[], > > + *stanumbers3 float4[], stanumbers4 float4[], > > + *stanumbers5 float4[], stavalues1 text, > > + *stavalues2 text, stavalues3 text, > > + *stavalues4 text, stavalues5 text); > > + * > > + * > > + */ > > Don't know that it makes sense to just repeat the function declaration > inside a comment like this- it'll just end up out of date. > Historical artifact - previous versions had a long explanation of the JSON format. 
> > > +Datum > > +pg_set_attribute_stats(PG_FUNCTION_ARGS) > > > + /* names of columns that cannot be null */ > > + const char *required_param_names[] = { > > + "relation", > > + "attname", > > + "stainherit", > > + "stanullfrac", > > + "stawidth", > > + "stadistinct", > > + "stakind1", > > + "stakind2", > > + "stakind3", > > + "stakind4", > > + "stakind5", > > + }; > > Same comment here as above wrt NULL being passed in. > In this case, the last 10 params (stanumbersN and stavaluesN) can be null, and are NULL more often than not. > > > + for (int k = 0; k < 5; k++) > > Shouldn't we use STATISTIC_NUM_SLOTS here? > Yes, I
Re: btree: downlink right separator/HIKEY optimization
On Thu, Feb 22, 2024 at 10:42 AM Matthias van de Meent wrote: > I forgot to address this in the previous patch, so here's v3 which > fixes the issue warning. What benchmarking have you done here? Have you tried just reordering things in _bt_search() instead? If we delay the check until after the binary search, then the result of the binary search is usually proof enough that we cannot possibly need to move right. That has the advantage of not requiring that we copy anything to the stack. Admittedly, it's harder to make the "binary search first" approach work on the leaf level, under the current code structure. But maybe that doesn't matter very much. And even if it does matter, maybe we should just move the call to _bt_binsrch() that currently takes place in _bt_first into _bt_search() itself -- so that _bt_binsrch() is strictly under the control of _bt_search() (obviously not doable for non-_bt_first callers, which need to call _bt_binsrch_insert instead). This whole approach will have been made easier by the refactoring I did late last year, in commit c9c0589fda. -- Peter Geoghegan
Re: pipe_read_line for reading arbitrary strings
> On 8 Mar 2024, at 18:13, Peter Eisentraut wrote: >> Good catch, that's an incorrect copy/paste, it should use ERRCODE_NO_DATA. > > Also it shouldn't print %m, was my point. Absolutely, I removed that in the patch upthread, it was clearly wrong. -- Daniel Gustafsson
Re: Add new error_action COPY ON_ERROR "log"
On Fri, Mar 8, 2024 at 4:42 PM Michael Paquier wrote: > > On Fri, Mar 08, 2024 at 03:36:30PM +0530, Bharath Rupireddy wrote: > > Please see the attached v5-0001 patch implementing LOG_VERBOSITY with > > options 'default' and 'verbose'. v5-0002 adds the detailed info to > > ON_ERROR 'ignore' option. > > I may be reading this patch set incorrectly, but why doesn't this > patch generate one NOTICE per attribute, as suggested by Sawada-san, > incrementing num_errors once per row when the last attribute has been > processed? Also, why not have a test that checks that multiple rows > spawn more than one message in some distributed fashion? Did you > look at this idea?

If a NOTICE per attribute and incrementing num_errors per row is implemented, it ends up erroring out with ERROR: missing data for column "m" for an all-columns-empty row. Shall we treat this ERROR softly too if on_error ignore is specified? Or shall we discuss this idea separately?

-- tests for options on_error and log_verbosity
COPY check_ign_err FROM STDIN WITH (on_error ignore, log_verbosity 'verbose');
1	{1}	1
a	{2}	2
3	{3}	33
4	{a, 4}	4
5	{5}	5
\.
NOTICE:  detected data type incompatibility at line number 2 for column n; COPY check_ign_err
NOTICE:  detected data type incompatibility at line number 2 for column m; COPY check_ign_err
NOTICE:  detected data type incompatibility at line number 2 for column k; COPY check_ign_err
NOTICE:  detected data type incompatibility at line number 3 for column k; COPY check_ign_err
NOTICE:  detected data type incompatibility at line number 4 for column m; COPY check_ign_err
NOTICE:  detected data type incompatibility at line number 4 for column k; COPY check_ign_err
NOTICE:  detected data type incompatibility at line number 5 for column n; COPY check_ign_err
ERROR:  missing data for column "m"
CONTEXT:  COPY check_ign_err, line 5: ""

-- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Confine vacuum skip logic to lazy_scan_skip
On Fri, Mar 8, 2024 at 12:34 PM Melanie Plageman wrote:
>
> On Fri, Mar 08, 2024 at 06:07:33PM +0200, Heikki Linnakangas wrote:
> > On 08/03/2024 02:46, Melanie Plageman wrote:
> > > On Wed, Mar 06, 2024 at 10:00:23PM -0500, Melanie Plageman wrote:
> > > > On Wed, Mar 06, 2024 at 09:55:21PM +0200, Heikki Linnakangas wrote:
> > > I will say that now all of the variable names are *very* long. I didn't
> > > want to remove the "state" from LVRelState->next_block_state. (In fact, I
> > > kind of miss the "get". But I had to draw the line somewhere.) I think
> > > without "state" in the name, next_block sounds too much like a function.
> > >
> > > Any ideas for shortening the names of next_block_state and its members
> > > or are you fine with them?
> >
> > Hmm, we can remove the inner struct and add the fields directly into
> > LVRelState. LVRelState already contains many groups of variables, like
> > "Error reporting state", with no inner structs. I did it that way in the
> > attached patch. I also used local variables more.
>
> +1; I like the result of this.

I did some perf testing of 0002 and 0003 using that fully-in-SB vacuum test I mentioned in an earlier email. 0002 is a vacuum time reduction from an average of 11.5 ms on master to 9.6 ms with 0002 applied. And 0003 reduces the time vacuum takes from 11.5 ms on master to 7.4 ms with 0003 applied.

I profiled them and 0002 seems to simply spend less time in heap_vac_scan_next_block() than master did in lazy_scan_skip(). And 0003 reduces the time vacuum takes because vacuum_delay_point() shows up pretty high in the profile.

Here are the profiles for my test.

profile of master:

+ 29.79%  postgres  postgres  [.] visibilitymap_get_status
+ 27.35%  postgres  postgres  [.] vacuum_delay_point
+ 17.00%  postgres  postgres  [.] lazy_scan_skip
+  6.59%  postgres  postgres  [.] heap_vacuum_rel
+  6.43%  postgres  postgres  [.] BufferGetBlockNumber

profile with 0001-0002:

+ 40.30%  postgres  postgres  [.] visibilitymap_get_status
+ 20.32%  postgres  postgres  [.] vacuum_delay_point
+ 20.26%  postgres  postgres  [.] heap_vacuum_rel
+  5.17%  postgres  postgres  [.] BufferGetBlockNumber

profile with 0001-0003:

+ 59.77%  postgres  postgres  [.] visibilitymap_get_status
+ 23.86%  postgres  postgres  [.] heap_vacuum_rel
+  6.59%  postgres  postgres  [.] StrategyGetBuffer

Test DDL and setup:

psql -c "ALTER SYSTEM SET shared_buffers = '16 GB';"
psql -c "CREATE TABLE foo(id INT, a INT, b INT, c INT, d INT, e INT, f INT, g INT) with (autovacuum_enabled=false, fillfactor=25);"
psql -c "INSERT INTO foo SELECT i, i, i, i, i, i, i, i FROM generate_series(1, 4600)i;"
psql -c "VACUUM (FREEZE) foo;"
pg_ctl restart
psql -c "SELECT pg_prewarm('foo');"
# make sure there isn't an ill-timed checkpoint
psql -c "\timing on" -c "vacuum (verbose) foo;"

- Melanie
Re: Emitting JSON to file using COPY TO
On 3/8/24 12:28, Andrey M. Borodin wrote: Hello everyone! Thanks for working on this, really nice feature! On 9 Jan 2024, at 01:40, Joe Conway wrote: Thanks -- will have a look Joe, recently folks proposed a lot of patches in this thread that seem like diverted from original way of implementation. As an author of CF entry [0] can you please comment on which patch version needs review? I don't know if I agree with the proposed changes, but I have also been waiting to see how the parallel discussion regarding COPY extensibility shakes out. And there were a couple of issues found that need to be tracked down. Additionally I have had time/availability challenges recently. Overall, chances seem slim that this will make it into 17, but I have not quite given up hope yet either. -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: CF entries for 17 to be reviewed
> On 6 Mar 2024, at 18:49, Andrey M. Borodin wrote:
>
> Here are statuses for "Refactoring" section:

I've made a pass through the "Replication and Recovery" and "SQL commands" sections. The "SQL commands" section seems to me the most interesting stuff on the commitfest (but I'm far from having inspected every single thread yet). But reviewing patches from this section is definitely not work for one CF. Be prepared that this is a long run, with many iterations and occasional architectural pivots. Typically reviewers work on these items for much more than one month.

Replication and Recovery

* Synchronizing slots from primary to standby
Titanic work. A lot of stuff already pushed, v108 is now in the discussion. ISTM that the entry barrier for jumping into the discussion with something useful is really high.

* CREATE SUBSCRIPTION ... SERVER
The discussion stopped in January. Authors posted a new version recently.

* speed up a logical replication setup (pg_createsubscriber)
Newest version posted recently, but it fails CFbot's tests. Pinged authors.

* Have pg_basebackup write "dbname" in "primary_conninfo"?
Some feedback and discussion provided. Switched to WoA.

SQL Commands

* Add SPLIT PARTITION/MERGE PARTITIONS commands
Cool new commands, very useful for sharding. CF item was RfC recently, needs a review update after rebase.

* Add CANONICAL option to xmlserialize
Vignesh C and Chapman Flack provided some feedback back in October 2023, but the patch still needs review.

* Incremental View Maintenance
This is a super cool feature. IMO at this point it is too late for 17, but you should consider reviewing this not because it's easy, but because it's hard. It's real rocket science. Fine-grained 11-step patchset, which can change a lot of workloads if committed. CFbot finds some failures, but this should not stop anyone from reviewing in this case.

* Implement row pattern recognition feature
SQL:2016 feature, carefully split into 8 steps. Needs review, probably review in a long run. The feature seems big.
3 reviewers are working on this, but no recent input for some months.

* COPY TO json
Active thread with many different authors proposing different patches. I could not summarize it, so I asked the CF entry author for help.

* Add new error_action COPY ON_ERROR "log"
There's an active discussion in the thread.

* add COPY option REJECT_LIMIT
While the patch seems quite similar to the previous one, there's no discussion in this thread...

Thanks!

Best regards, Andrey Borodin.
Re: Confine vacuum skip logic to lazy_scan_skip
On Fri, Mar 08, 2024 at 06:07:33PM +0200, Heikki Linnakangas wrote: > On 08/03/2024 02:46, Melanie Plageman wrote: > > On Wed, Mar 06, 2024 at 10:00:23PM -0500, Melanie Plageman wrote: > > > On Wed, Mar 06, 2024 at 09:55:21PM +0200, Heikki Linnakangas wrote: > > I will say that now all of the variable names are *very* long. I didn't > > want to remove the "state" from LVRelState->next_block_state. (In fact, I > > kind of miss the "get". But I had to draw the line somewhere.) I think > > without "state" in the name, next_block sounds too much like a function. > > > > Any ideas for shortening the names of next_block_state and its members > > or are you fine with them? > > Hmm, we can remove the inner struct and add the fields directly into > LVRelState. LVRelState already contains many groups of variables, like > "Error reporting state", with no inner structs. I did it that way in the > attached patch. I also used local variables more. +1; I like the result of this. > > However, by adding a vmbuffer to next_block_state, the callback may be > > able to avoid extra VM fetches from one invocation to the next. > > That's a good idea, holding separate VM buffer pins for the next-unskippable > block and the block we're processing. I adopted that approach. Cool. It can't be avoided with streaming read vacuum, but I wonder if there would ever be adverse effects to doing it on master? Maybe if we are doing a lot of skipping and the block of the VM for the heap blocks we are processing ends up changing each time but we would have had the right block of the VM if we used the one from heap_vac_scan_next_block()? Frankly, I'm in favor of just doing it now because it makes lazy_scan_heap() less confusing. > My compiler caught one small bug when I was playing with various > refactorings of this: heap_vac_scan_next_block() must set *blkno to > rel_pages, not InvalidBlockNumber, after the last block. 
The caller uses the > 'blkno' variable also after the loop, and assumes that it's set to > rel_pages. Oops! Thanks for catching that. > I'm pretty happy with the attached patches now. The first one fixes the > existing bug I mentioned in the other email (based on the ongoing > discussion that might not be how we want to fix it, though). ISTM we should still do the fix you mentioned -- seems like it has more upsides than downsides? > From b68cb29c547de3c4acd10f31aad47b453d154666 Mon Sep 17 00:00:00 2001 > From: Heikki Linnakangas > Date: Fri, 8 Mar 2024 16:00:22 +0200 > Subject: [PATCH v8 1/3] Set all_visible_according_to_vm correctly with > DISABLE_PAGE_SKIPPING > > It's important for 'all_visible_according_to_vm' to correctly reflect > whether the VM bit is set or not, even when we are not trusting the VM > to skip pages, because contrary to what the comment said, > lazy_scan_prune() relies on it. > > If it's incorrectly set to 'false', when the VM bit is in fact set, > lazy_scan_prune() will try to set the VM bit again and dirty the page > unnecessarily. As a result, if you used DISABLE_PAGE_SKIPPING, all > heap pages were dirtied, even if there were no changes. We would also > fail to clear any VM bits that were set incorrectly. > > This was broken in commit 980ae17310, so backpatch to v16. LGTM.
> From 47af1ca65cf55ca876869b43bff47f9d43f0750e Mon Sep 17 00:00:00 2001
> From: Heikki Linnakangas
> Date: Fri, 8 Mar 2024 17:32:19 +0200
> Subject: [PATCH v8 2/3] Confine vacuum skip logic to lazy_scan_skip()
> ---
>  src/backend/access/heap/vacuumlazy.c | 256 +++
>  1 file changed, 141 insertions(+), 115 deletions(-)
>
> diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
> index ac55ebd2ae5..0aa08762015 100644
> --- a/src/backend/access/heap/vacuumlazy.c
> +++ b/src/backend/access/heap/vacuumlazy.c
> @@ -204,6 +204,12 @@ typedef struct LVRelState
> 	int64		live_tuples;	/* # live tuples remaining */
> 	int64		recently_dead_tuples;	/* # dead, but not yet removable */
> 	int64		missed_dead_tuples; /* # removable, but not removed */

Perhaps we should add a comment to the blkno member of LVRelState indicating that it is used for error reporting and logging?

> +	/* State maintained by heap_vac_scan_next_block() */
> +	BlockNumber current_block;	/* last block returned */
> +	BlockNumber next_unskippable_block; /* next unskippable block */
> +	bool		next_unskippable_allvis;	/* its visibility status */
> +	Buffer		next_unskippable_vmbuffer;	/* buffer containing its VM bit */
> } LVRelState;
>
> /*
> -static BlockNumber
> -lazy_scan_skip(LVRelState *vacrel, Buffer *vmbuffer, BlockNumber next_block,
> -			   bool *next_unskippable_allvis, bool *skipping_current_range)
> +static bool
> +heap_vac_scan_next_block(LVRelState *vacrel, BlockNumber *blkno,
> +						 bool
Re: Emitting JSON to file using COPY TO
Hello everyone! Thanks for working on this, really nice feature! > On 9 Jan 2024, at 01:40, Joe Conway wrote: > > Thanks -- will have a look Joe, recently folks proposed a lot of patches in this thread that seem like diverted from original way of implementation. As an author of CF entry [0] can you please comment on which patch version needs review? Thanks! Best regards, Andrey Borodin. [0] https://commitfest.postgresql.org/47/4716/
Re: pipe_read_line for reading arbitrary strings
On 06.03.24 10:54, Daniel Gustafsson wrote:

On 6 Mar 2024, at 10:07, Peter Eisentraut wrote:

On 22.11.23 13:47, Alvaro Herrera wrote:

On 2023-Mar-07, Daniel Gustafsson wrote:

The attached POC diff replaces fgets() with pg_get_line(), which may not be an OK way to cross the streams (it's clearly not a great fit), but as a POC it provided a neater interface for reading one-off lines from a pipe IMO. Does anyone else think this is worth fixing before too many callsites use it, or is this another case of my fear of silent subtle truncation bugs? =)

I think this is generally a good change.

I think pipe_read_line should have a "%m" in the "no data returned" error message.

pg_read_line is careful to retain errno (and it was already zero at start), so this should be okay ... or should we set errno again to zero after popen(), even if it works?

Is this correct? The code now looks like this:

    line = pg_get_line(pipe_cmd, NULL);
    if (line == NULL)
    {
        if (ferror(pipe_cmd))
            log_error(errcode_for_file_access(),
                      _("could not read from command \"%s\": %m"), cmd);
        else
            log_error(errcode_for_file_access(),
                      _("no data was returned by command \"%s\": %m"), cmd);
    }

We already handle the case where an error happened in the first branch, so there cannot be an error set in the second branch (unless something nonobvious is going on?). It seems to me that if the command being run just happens to print nothing but is otherwise successful, this would print a bogus error code (or "Success")?

Good catch, that's an incorrect copy/paste, it should use ERRCODE_NO_DATA.

Also it shouldn't print %m, was my point.
Re: Introduce XID age and inactive timeout based replication slot invalidation
On Wed, Mar 6, 2024 at 4:49 PM Amit Kapila wrote:
>
> You might want to consider its interaction with sync slots on standby.
> Say, there is no activity on slots in terms of processing the changes
> for slots. Now, we won't perform sync of such slots on standby, showing
> them inactive as per your new criteria, whereas the same slots could still
> be valid on the primary as the walsender is still active. This may be more
> of a theoretical point as in a running system there will probably be
> some activity, but I think this needs some thoughts.

I believe the xmin and catalog_xmin of the sync slots on the standby keep advancing depending on the slots on the primary, no? If yes, the XID age based invalidation shouldn't be a problem.

I believe there are no walsenders started for the sync slots on the standbys, right? If yes, the inactive timeout based invalidation also shouldn't be a problem, because the inactive timeouts for a slot are tracked only for walsenders: they are the ones that typically hold replication slots for longer durations and for real replication use. We did a similar thing in a recent commit [1].

Is my understanding right? Do you still see any problems with it?

[1] commit 7c3fb505b14e86581b6a052075a294c78c91b123
Author: Amit Kapila
Date: Tue Nov 21 07:59:53 2023 +0530

    Log messages for replication slot acquisition and release.

    Note that these messages are emitted only for walsenders but not for
    backends. This is because walsenders are the ones that typically hold
    replication slots for longer durations, unlike backends which hold them
    for executing replication related functions.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Re: Remove unnecessary code from psql's watch command
Yugo NAGATA writes:
> On Wed, 06 Mar 2024 13:03:39 -0500
> Tom Lane wrote:
>> I don't have Windows here to test on, but does the WIN32 code
>> path work at all?

> The outer loop is eventually exited even so, because PSQLexecWatch returns 0
> when cancel_pressed = 0. However, it happens after executing an extra
> query in this function, not just after exit of the inner loop. Therefore,
> it would be better to add the setting and checking of "done" in the WIN32
> path, too.

Ah, I see now. Agreed, that could stand improvement.

> I've attached the updated patch
> (v2_remove_unnecessary_code_in_psql_watch.patch).

Pushed with minor tidying.

regards, tom lane
Re: Call perror on popen failure
On 08.03.24 11:05, Daniel Gustafsson wrote:
> If popen fails in pipe_read_line we invoke perror for the error message,
> and pipe_read_line is in turn called by find_other_exec which is used in
> both frontend and backend code. This is an old codepath, and it seems like
> it ought to be replaced with the more common logging tools we now have,
> like in the attached diff (which also makes the error message translatable,
> which it currently is not). Any objections to removing this last perror()
> call?

This change makes it consistent with other popen() calls, so it makes sense to me.
Re: Confine vacuum skip logic to lazy_scan_skip
On Fri, Mar 8, 2024 at 11:31 AM Melanie Plageman wrote: > > On Fri, Mar 8, 2024 at 11:00 AM Peter Geoghegan wrote: > > > > On Fri, Mar 8, 2024 at 10:48 AM Melanie Plageman > > wrote: > > > Not that it will be fun to maintain another special case in the VM > > > update code in lazy_scan_prune(), but we could have a special case > > > that checks if DISABLE_PAGE_SKIPPING was passed to vacuum and if > > > all_visible_according_to_vm is true and all_visible is true, we update > > > the VM but don't dirty the page. > > > > It wouldn't necessarily have to be a special case, I think. > > > > We already conditionally set PD_ALL_VISIBLE/call PageIsAllVisible() in > > the block where lazy_scan_prune marks a previously all-visible page > > all-frozen -- we don't want to dirty the page unnecessarily there. > > Making it conditional is defensive in that particular block (this was > > also added by this same commit of mine), and avoids dirtying the page. > > Ah, I see. I got confused. Even if the VM is suspect, if the page is > all visible and the heap block is already set all-visible in the VM, > there is no need to update it. > > This did make me realize that it seems like there is a case we don't > handle in master with the current code that would be fixed by changing > that code Heikki mentioned: > > Right now, even if the heap block is incorrectly marked all-visible in > the VM, if DISABLE_PAGE_SKIPPING is passed to vacuum, > all_visible_according_to_vm will be passed to lazy_scan_prune() as > false. Then even if lazy_scan_prune() finds that the page is not > all-visible, we won't call visibilitymap_clear(). 
> If we revert the code setting next_unskippable_allvis to false in
> lazy_scan_skip() when vacrel->skipwithvm is false and allow
> all_visible_according_to_vm to be true when the VM has it incorrectly
> set to true, then once lazy_scan_prune() discovers the page is not
> all-visible and assuming PD_ALL_VISIBLE is not marked so
> PageIsAllVisible() returns false, we will call visibilitymap_clear()
> to clear the incorrectly set VM bit (without dirtying the page).
>
> Here is a table of the variable states at the end of lazy_scan_prune()
> for clarity:
>
> master:
>   all_visible_according_to_vm: false
>   all_visible:                 false
>   VM says all vis:             true
>   PageIsAllVisible:            false
>
> if fixed:
>   all_visible_according_to_vm: true
>   all_visible:                 false
>   VM says all vis:             true
>   PageIsAllVisible:            false

Okay, I now see from Heikki's v8-0001 that he was already aware of this.

- Melanie
Re: Identify transactions causing highest wal generation
On Fri, Mar 8, 2024 at 9:10 PM Tomas Vondra wrote:
>
> On 3/8/24 15:50, Gayatri Singh wrote:
> > Hello Team,
> >
> > Can you help me with steps to identify transactions which caused wal
> > generation to surge ?
>
> You should probably take a look at pg_waldump, which prints information
> about WAL contents, including which XID generated each record.

Right. pg_walinspect too can help get the same info for the available WAL if you are on a production database with PG15 without any access to the host instance.

> I don't know what exactly your goal is,

Yeah, it's good to know the use-case if possible.

> but sometimes it's not an entirely direct relationship. For example, a
> transaction may delete a record, which generates just a little bit of WAL.
> But then after a checkpoint a VACUUM comes around, vacuums the page to
> reclaim the space of the entry, and ends up writing an FPI (which is much
> larger). You could argue this WAL is also attributable to the original
> transaction, but that's not what pg_waldump will allow you to do. FPIs in
> general may inflate the numbers unpredictably, and it's not something the
> original transaction can affect very much.

Nice. If one knows that there can be WAL generated without an associated transaction (no XID), there won't be surprises when the amount of WAL generated by all transactions is compared against the total WAL on the database.
Alternatively, one can get the correct amount of WAL generated, including the WAL without an XID, before and after doing some operations as shown below:

postgres=# SELECT pg_current_wal_lsn();
 pg_current_wal_lsn
--------------------
 0/52EB488
(1 row)

postgres=# create table foo as select i from generate_series(1, 100) i;
SELECT 100
postgres=# update foo set i = i +1 where i%2 = 0;
UPDATE 50
postgres=# SELECT pg_current_wal_lsn();
 pg_current_wal_lsn
--------------------
 0/D2B8000
(1 row)

postgres=# SELECT pg_wal_lsn_diff('0/D2B8000', '0/52EB488');
 pg_wal_lsn_diff
-----------------
       134007672
(1 row)

postgres=# SELECT pg_size_pretty(pg_wal_lsn_diff('0/D2B8000', '0/52EB488'));
 pg_size_pretty
----------------
 128 MB
(1 row)

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Re: Identify transactions causing highest wal generation
On Fri, Mar 8, 2024, at 12:40 PM, Tomas Vondra wrote:
> On 3/8/24 15:50, Gayatri Singh wrote:
> > Hello Team,
> >
> > Can you help me with steps to identify transactions which caused wal
> > generation to surge ?
>
> You should probably take a look at pg_waldump, which prints information
> about WAL contents, including which XID generated each record.

You can also use pg_stat_statements to obtain this information.

postgres=# select * from pg_stat_statements order by wal_bytes desc;
-[ RECORD 1 ]----+--------------------------------------------------------------------
userid           | 10
dbid             | 16385
toplevel         | t
queryid          | -8403979585082616547
query            | UPDATE pgbench_accounts SET abalance = abalance + $1 WHERE aid = $2
plans            | 0
total_plan_time  | 0
min_plan_time    | 0
max_plan_time    | 0
mean_plan_time   | 0
stddev_plan_time | 0
calls            | 238260
total_exec_time  | 4642.59929618
min_exec_time    | 0.011094
max_exec_time    | 0.872748
mean_exec_time   | 0.01948543312347807
stddev_exec_time | 0.006370786385582063
rows             | 238260
...
wal_records      | 496659
wal_fpi          | 19417
wal_bytes        | 208501147
...

--
Euler Taveira
EDB
https://www.enterprisedb.com/
Re: Confine vacuum skip logic to lazy_scan_skip
On Fri, Mar 8, 2024 at 11:00 AM Peter Geoghegan wrote: > > On Fri, Mar 8, 2024 at 10:48 AM Melanie Plageman > wrote: > > Not that it will be fun to maintain another special case in the VM > > update code in lazy_scan_prune(), but we could have a special case > > that checks if DISABLE_PAGE_SKIPPING was passed to vacuum and if > > all_visible_according_to_vm is true and all_visible is true, we update > > the VM but don't dirty the page. > > It wouldn't necessarily have to be a special case, I think. > > We already conditionally set PD_ALL_VISIBLE/call PageIsAllVisible() in > the block where lazy_scan_prune marks a previously all-visible page > all-frozen -- we don't want to dirty the page unnecessarily there. > Making it conditional is defensive in that particular block (this was > also added by this same commit of mine), and avoids dirtying the page. Ah, I see. I got confused. Even if the VM is suspect, if the page is all visible and the heap block is already set all-visible in the VM, there is no need to update it. This did make me realize that it seems like there is a case we don't handle in master with the current code that would be fixed by changing that code Heikki mentioned: Right now, even if the heap block is incorrectly marked all-visible in the VM, if DISABLE_PAGE_SKIPPING is passed to vacuum, all_visible_according_to_vm will be passed to lazy_scan_prune() as false. Then even if lazy_scan_prune() finds that the page is not all-visible, we won't call visibilitymap_clear(). If we revert the code setting next_unskippable_allvis to false in lazy_scan_skip() when vacrel->skipwithvm is false and allow all_visible_according_to_vm to be true when the VM has it incorrectly set to true, then once lazy_scan_prune() discovers the page is not all-visible and assuming PD_ALL_VISIBLE is not marked so PageIsAllVisible() returns false, we will call visibilitymap_clear() to clear the incorrectly set VM bit (without dirtying the page). 
Here is a table of the variable states at the end of lazy_scan_prune() for clarity:

master:
  all_visible_according_to_vm: false
  all_visible:                 false
  VM says all vis:             true
  PageIsAllVisible:            false

if fixed:
  all_visible_according_to_vm: true
  all_visible:                 false
  VM says all vis:             true
  PageIsAllVisible:            false

> Seems like it might be possible to simplify/consolidate the VM-setting
> code that's now located at the end of lazy_scan_prune. Perhaps the two
> distinct blocks that call visibilitymap_set() could be combined into
> one.

I agree. I have some code to do that in an unproposed patch which combines the VM updates into the prune record. We will definitely want to reorganize the code when we do that record combining.

- Melanie
Re: Support a wildcard in backtrace_functions
On Fri, Mar 8, 2024 at 9:25 PM Jelte Fennema-Nio wrote:
>
> On Fri, 8 Mar 2024 at 15:51, Peter Eisentraut wrote:
> > What is the relationship of these changes with the recently added
> > backtrace_on_internal_error?
>
> I think that's a reasonable question. And the follow up ones too.
>
> I think it all depends on how close we consider
> backtrace_on_internal_error and backtrace_functions. While they
> obviously have similar functionality, I feel like
> backtrace_on_internal_error is probably a function that we'd want to
> turn on by default in the future.

Hm, we may not want backtrace_on_internal_error to be on by default. AFAICS, all ERRCODE_INTERNAL_ERROR are identifiable with the error message, so it's sort of easy for one to track down the cause without actually needing to log a backtrace by default.

On Fri, Mar 8, 2024 at 8:21 PM Peter Eisentraut wrote:
>
> What is the relationship of these changes with the recently added
> backtrace_on_internal_error? We had similar discussions there, I feel
> like we are doing similar things here but slightly differently. Like,
> shouldn't backtrace_functions_min_level also affect
> backtrace_on_internal_error? Don't you really just want
> backtrace_on_any_error? You are sneaking that in through the backdoor
> via backtrace_functions. Can we somehow combine all these use cases
> more elegantly? backtrace_on_error = {all|internal|none}?

I see merit in Peter's point. I like the idea of backtrace_functions_min_level controlling backtrace_on_internal_error too. Fewer GUCs for similar functionality is always a good idea IMHO.

Here's what I think:

1. Rename the new GUC backtrace_functions_min_level to backtrace_min_level.

2. Add 'internal' to the backtrace_min_level_options enum:

+static const struct config_enum_entry backtrace_functions_level_options[] = {
+	{"internal", INTERNAL, false},
+	{"debug5", DEBUG5, false},
+	{"debug4", DEBUG4, false},

3.
Remove the backtrace_on_internal_error GUC, which is now effectively covered by backtrace_min_level = 'internal'.

Does anyone see a problem with it?

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Re: Confine vacuum skip logic to lazy_scan_skip
On 08/03/2024 02:46, Melanie Plageman wrote: On Wed, Mar 06, 2024 at 10:00:23PM -0500, Melanie Plageman wrote: On Wed, Mar 06, 2024 at 09:55:21PM +0200, Heikki Linnakangas wrote: I will say that now all of the variable names are *very* long. I didn't want to remove the "state" from LVRelState->next_block_state. (In fact, I kind of miss the "get". But I had to draw the line somewhere.) I think without "state" in the name, next_block sounds too much like a function. Any ideas for shortening the names of next_block_state and its members or are you fine with them? Hmm, we can remove the inner struct and add the fields directly into LVRelState. LVRelState already contains many groups of variables, like "Error reporting state", with no inner structs. I did it that way in the attached patch. I also used local variables more. I was wondering if we should remove the "get" and just go with heap_vac_scan_next_block(). I didn't do that originally because I didn't want to imply that the next block was literally the sequentially next block, but I think maybe I was overthinking it. Another idea is to call it heap_scan_vac_next_block() and then the order of the words is more like the table AM functions that get the next block (e.g. heapam_scan_bitmap_next_block()). Though maybe we don't want it to be too similar to those since this isn't a table AM callback. I've done a version of this. +1 However, by adding a vmbuffer to next_block_state, the callback may be able to avoid extra VM fetches from one invocation to the next. That's a good idea, holding separate VM buffer pins for the next-unskippable block and the block we're processing. I adopted that approach. My compiler caught one small bug when I was playing with various refactorings of this: heap_vac_scan_next_block() must set *blkno to rel_pages, not InvalidBlockNumber, after the last block. The caller uses the 'blkno' variable also after the loop, and assumes that it's set to rel_pages. 
I'm pretty happy with the attached patches now. The first one fixes the existing bug I mentioned in the other email (based on the ongoing discussion, that might not be how we want to fix it, though). Second commit is a squash of most of the patches. Third patch is the removal of the delay point, which seems worthwhile to keep separate.

--
Heikki Linnakangas
Neon (https://neon.tech)

From b68cb29c547de3c4acd10f31aad47b453d154666 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas
Date: Fri, 8 Mar 2024 16:00:22 +0200
Subject: [PATCH v8 1/3] Set all_visible_according_to_vm correctly with
 DISABLE_PAGE_SKIPPING

It's important for 'all_visible_according_to_vm' to correctly reflect
whether the VM bit is set or not, even when we are not trusting the VM
to skip pages, because contrary to what the comment said,
lazy_scan_prune() relies on it.

If it's incorrectly set to 'false', when the VM bit is in fact set,
lazy_scan_prune() will try to set the VM bit again and dirty the page
unnecessarily. As a result, if you used DISABLE_PAGE_SKIPPING, all heap
pages were dirtied, even if there were no changes. We would also fail
to clear any VM bits that were set incorrectly.

This was broken in commit 980ae17310, so backpatch to v16.

Backpatch-through: 16
Reviewed-by: Melanie Plageman
Discussion: https://www.postgresql.org/message-id/3df2b582-dc1c-46b6-99b6-38eddd1b2...@iki.fi
---
 src/backend/access/heap/vacuumlazy.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 8b320c3f89a..ac55ebd2ae5 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -1136,11 +1136,7 @@ lazy_scan_skip(LVRelState *vacrel, Buffer *vmbuffer, BlockNumber next_block,
 
 		/* DISABLE_PAGE_SKIPPING makes all skipping unsafe */
 		if (!vacrel->skipwithvm)
-		{
-			/* Caller shouldn't rely on all_visible_according_to_vm */
-			*next_unskippable_allvis = false;
 			break;
-		}
 
 		/*
 		 * Aggressive VACUUM caller can't skip pages just because they are
-- 
2.39.2

From 47af1ca65cf55ca876869b43bff47f9d43f0750e Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas
Date: Fri, 8 Mar 2024 17:32:19 +0200
Subject: [PATCH v8 2/3] Confine vacuum skip logic to lazy_scan_skip()

Rename lazy_scan_skip() to heap_vac_scan_next_block() and move more
code into the function, so that the caller doesn't need to know about
ranges or skipping anymore. heap_vac_scan_next_block() returns the
next block to process, and the logic for determining that block is all
within the function. This makes the skipping logic easier to
understand, as it's all in the same function, and makes the calling
code easier to understand as it's less cluttered. The state variables
needed to manage the skipping logic are moved to LVRelState.

heap_vac_scan_next_block() now manages its own VM buffer separately
from the caller's vmbuffer variable. The caller's vmbuffer holds the
VM page for the current block it's processing,
Re: Confine vacuum skip logic to lazy_scan_skip
On Fri, Mar 8, 2024 at 11:00 AM Peter Geoghegan wrote: > Seems like it might be possible to simplify/consolidate the VM-setting > code that's now located at the end of lazy_scan_prune. Perhaps the two > distinct blocks that call visibilitymap_set() could be combined into > one. FWIW I think that my error here might have had something to do with hallucinating that the code already did things that way. At the time this went in, I was working on a patchset that did things this way (more or less). It broke the dependency on all_visible_according_to_vm entirely, which simplified the set-and-check-VM code that's now at the end of lazy_scan_prune. Not sure how practical it'd be to do something like that now (not offhand), but something to consider. -- Peter Geoghegan
Re: meson: Specify -Wformat as a common warning flag for extensions
On Fri Mar 8, 2024 at 12:32 AM CST, Michael Paquier wrote:
> On Thu, Mar 07, 2024 at 11:39:39PM -0600, Tristan Partin wrote:
> > It sounds like a legitimate issue. I have confirmed the issue exists with a
> > pg_config compiled with Meson. I can also confirm that this issue exists in
> > the autotools build.
>
> First time I'm hearing about that, but I'll admit that I am cheating
> because -Wformat is forced in my local builds for some time now. I'm
> failing to see the issue with meson and ./configure even if I remove the
> switch, though, using a recent version of gcc at 13.2.0, but perhaps
> Debian does something underground. Are there version and/or environment
> requirements to be aware of? Forcing -Wformat implies more stuff that can
> be disabled with -Wno-format-contains-nul, -Wno-format-extra-args, and
> -Wno-format-zero-length, but the thing is that we're usually very
> conservative with such additions in the scripts. See also 8b6f5f25102f,
> done, I guess, as an answer to this thread:
> https://www.postgresql.org/message-id/4D431505.9010002%40dunslane.net
> A quick look at the past history of pgsql-hackers does not mention that
> as a problem, either, but I may have missed something.

Ok, I figured this out. -Wall implies -Wformat=1. We set warning_level to 1 in the Meson project() call, which implies -Wall, and set -Wall in CFLAGS for autoconf. That's the reason we don't get issues building Postgres. A user making use of the pg_config --cflags option, as Sutou is, *will* run into the aforementioned issues, since we don't propagate -Wall into pg_config.

$ gcc $(pg_config --cflags) -E - < /dev/null > /dev/null
cc1: warning: ‘-Wformat-security’ ignored without ‘-Wformat’ [-Wformat-security]

$ gcc -Wall $(pg_config --cflags) -E - < /dev/null > /dev/null
(nothing printed)

--
Tristan Partin
Neon (https://neon.tech)
Re: Confine vacuum skip logic to lazy_scan_skip
On Fri, Mar 8, 2024 at 10:48 AM Melanie Plageman wrote: > Not that it will be fun to maintain another special case in the VM > update code in lazy_scan_prune(), but we could have a special case > that checks if DISABLE_PAGE_SKIPPING was passed to vacuum and if > all_visible_according_to_vm is true and all_visible is true, we update > the VM but don't dirty the page. It wouldn't necessarily have to be a special case, I think. We already conditionally set PD_ALL_VISIBLE/call PageIsAllVisible() in the block where lazy_scan_prune marks a previously all-visible page all-frozen -- we don't want to dirty the page unnecessarily there. Making it conditional is defensive in that particular block (this was also added by this same commit of mine), and avoids dirtying the page. Seems like it might be possible to simplify/consolidate the VM-setting code that's now located at the end of lazy_scan_prune. Perhaps the two distinct blocks that call visibilitymap_set() could be combined into one. -- Peter Geoghegan
Re: Support a wildcard in backtrace_functions
On Fri, 8 Mar 2024 at 15:51, Peter Eisentraut wrote:
> What is the relationship of these changes with the recently added
> backtrace_on_internal_error?

I think that's a reasonable question. And the follow up ones too.

I think it all depends on how close we consider backtrace_on_internal_error and backtrace_functions. While they obviously have similar functionality, I feel like backtrace_on_internal_error is probably something that we'd want to turn on by default in the future, while backtrace_functions seems like it's mostly useful for developers. (i.e. the current grouping of backtrace_on_internal_error under DEVELOPER_OPTIONS seems wrong to me)

> shouldn't backtrace_functions_min_level also affect
> backtrace_on_internal_error?

I guess that depends on the default behaviour that we want. Would we want warnings with ERRCODE_INTERNAL_ERROR to be backtraced by default or not? There is at least one example of such a warning in the codebase:

    ereport(WARNING,
            errcode(ERRCODE_INTERNAL_ERROR),
            errmsg_internal("could not parse XML declaration in stored value"),
            errdetail_for_xml_code(res_code));

> Btw., your code/documentation sometimes writes "stack trace". Let's
> stick to backtrace for consistency.

Fixed that in the latest patchset in the email I sent right before this one.
Re: Confine vacuum skip logic to lazy_scan_skip
On Fri, Mar 8, 2024 at 10:41 AM Peter Geoghegan wrote: > > On Fri, Mar 8, 2024 at 8:49 AM Heikki Linnakangas wrote: > > ISTM we should revert the above hunk, and backpatch it to v16. I'm a > > little wary because I don't understand why that change was made in the > > first place, though. I think it was just an ill-advised attempt at > > tidying up the code as part of the larger commit, but I'm not sure. > > Peter, do you remember? > > I think that it makes sense to set the VM when indicated by > lazy_scan_prune, independent of what either the visibility map or the > page's PD_ALL_VISIBLE marking say. The whole point of > DISABLE_PAGE_SKIPPING is to deal with VM corruption, after all. Not that it will be fun to maintain another special case in the VM update code in lazy_scan_prune(), but we could have a special case that checks if DISABLE_PAGE_SKIPPING was passed to vacuum and if all_visible_according_to_vm is true and all_visible is true, we update the VM but don't dirty the page. The docs on DISABLE_PAGE_SKIPPING say it is meant to deal with VM corruption -- it doesn't say anything about dealing with incorrectly set PD_ALL_VISIBLE markings. - Melanie
Re: [DOC] Add detail regarding resource consumption wrt max_connections
Hi,

On Fri, Jan 12, 2024 at 10:14:38PM +, Cary Huang wrote:
> I think it is good to warn the user about the increased allocation of
> memory for certain parameters so that they do not abuse it by setting
> it to a huge number without knowing the consequences.

Right, and I think it might be useful to log (i.e. at LOG not DEBUG3 level, with a nicer message) the amount of memory we allocate on startup. That is just one additional line per instance lifetime, but might be quite useful to admins. Or maybe two lines, if we also log whether we could allocate it as huge pages or not:

|2024-03-08 16:46:13.117 CET [237899] DEBUG:  invoking IpcMemoryCreate(size=145145856)
|2024-03-08 16:46:13.117 CET [237899] DEBUG:  mmap(146800640) with MAP_HUGETLB failed, huge pages disabled: Cannot allocate memory

> It is true that max_connections can increase the size of proc array
> and other resources, which are allocated in the shared buffer, which
> also means less shared buffer to perform regular data operations.

AFAICT, those resources are allocated on top of shared_buffers, i.e. the total allocated memory is shared_buffers + (some resources) * max_connections + (other resources) * other_factors.

> Instead of stating that higher max_connections results in higher
> allocation, It may be better to tell the user that if the value needs
> to be set much higher, consider increasing the "shared_buffers"
> setting as well.

That would only be needed if what you said above is true and my understanding is wrong.

Michael
Re: Confine vacuum skip logic to lazy_scan_skip
On Fri, Mar 8, 2024 at 8:49 AM Heikki Linnakangas wrote:
>
> On 08/03/2024 02:46, Melanie Plageman wrote:
> > On Wed, Mar 06, 2024 at 10:00:23PM -0500, Melanie Plageman wrote:
> >>> I feel heap_vac_scan_get_next_block() function could use some love. Maybe
> >>> just some rewording of the comments, or maybe some other refactoring; not
> >>> sure. But I'm pretty happy with the function signature and how it's
> >>> called.
> >
> > I've cleaned up the comments on heap_vac_scan_next_block() in the first
> > couple patches (not so much in the streaming read user). Let me know if
> > it addresses your feelings or if I should look for other things I could
> > change.
>
> Thanks, that is better. I think I now finally understand how the
> function works, and now I can see some more issues and refactoring
> opportunities :-).
>
> Looking at current lazy_scan_skip() code in 'master', one thing now
> caught my eye (and it's the same with your patches):
>
> > 	*next_unskippable_allvis = true;
> > 	while (next_unskippable_block < rel_pages)
> > 	{
> > 		uint8	mapbits = visibilitymap_get_status(vacrel->rel,
> > 								next_unskippable_block,
> > 								vmbuffer);
> >
> > 		if ((mapbits & VISIBILITYMAP_ALL_VISIBLE) == 0)
> > 		{
> > 			Assert((mapbits & VISIBILITYMAP_ALL_FROZEN) == 0);
> > 			*next_unskippable_allvis = false;
> > 			break;
> > 		}
> >
> > 		/*
> > 		 * Caller must scan the last page to determine whether it has tuples
> > 		 * (caller must have the opportunity to set vacrel->nonempty_pages).
> > 		 * This rule avoids having lazy_truncate_heap() take access-exclusive
> > 		 * lock on rel to attempt a truncation that fails anyway, just because
> > 		 * there are tuples on the last page (it is likely that there will be
> > 		 * tuples on other nearby pages as well, but those can be skipped).
> > 		 *
> > 		 * Implement this by always treating the last block as unsafe to skip.
> > 		 */
> > 		if (next_unskippable_block == rel_pages - 1)
> > 			break;
> >
> > 		/* DISABLE_PAGE_SKIPPING makes all skipping unsafe */
> > 		if (!vacrel->skipwithvm)
> > 		{
> > 			/* Caller shouldn't rely on all_visible_according_to_vm */
> > 			*next_unskippable_allvis = false;
> > 			break;
> > 		}
> >
> > 		/*
> > 		 * Aggressive VACUUM caller can't skip pages just because they are
> > 		 * all-visible. They may still skip all-frozen pages, which can't
> > 		 * contain XIDs < OldestXmin (XIDs that aren't already frozen by now).
> > 		 */
> > 		if ((mapbits & VISIBILITYMAP_ALL_FROZEN) == 0)
> > 		{
> > 			if (vacrel->aggressive)
> > 				break;
> >
> > 			/*
> > 			 * All-visible block is safe to skip in non-aggressive case. But
> > 			 * remember that the final range contains such a block for later.
> > 			 */
> > 			skipsallvis = true;
> > 		}
> >
> > 		/* XXX: is it OK to remove this? */
> > 		vacuum_delay_point();
> > 		next_unskippable_block++;
> > 		nskippable_blocks++;
> > 	}
>
> Firstly, it seems silly to check DISABLE_PAGE_SKIPPING within the loop.
> When DISABLE_PAGE_SKIPPING is set, we always return the next block and
> set *next_unskippable_allvis = false regardless of the visibility map,
> so why bother checking the visibility map at all?
>
> Except at the very last block of the relation! If you look carefully,
> at the last block we do return *next_unskippable_allvis = true, if the
> VM says so, even if DISABLE_PAGE_SKIPPING is set. I think that's wrong.
> Surely the intention was to pretend that none of the VM bits were set if
> DISABLE_PAGE_SKIPPING is used, also for the last block.

I agree that having next_unskippable_allvis and, as a consequence, all_visible_according_to_vm set to true for the last block seems wrong. And it makes sense from a loop efficiency standpoint also to move it up to the top. However, making that change would have us end up dirtying all pages in your example.
> This was changed in commit 980ae17310:
>
> @@ -1311,7 +1327,11 @@ lazy_scan_skip(LVRelState *vacrel, Buffer *vmbuffer,
> BlockNumber next_block,
>
> /*
Re: Support a wildcard in backtrace_functions
On Fri, 8 Mar 2024 at 14:42, Daniel Gustafsson wrote: > My documentation comments from upthread still > stands, but other than those this version LGTM. Ah yeah, I forgot about those. Fixed now. v6-0001-Add-backtrace_functions_min_level.patch Description: Binary data v6-0002-Add-wildcard-support-to-backtrace_functions-GUC.patch Description: Binary data
Re: Confine vacuum skip logic to lazy_scan_skip
On Fri, Mar 8, 2024 at 8:49 AM Heikki Linnakangas wrote: > ISTM we should revert the above hunk, and backpatch it to v16. I'm a > little wary because I don't understand why that change was made in the > first place, though. I think it was just an ill-advised attempt at > tidying up the code as part of the larger commit, but I'm not sure. > Peter, do you remember? I think that it makes sense to set the VM when indicated by lazy_scan_prune, independent of what either the visibility map or the page's PD_ALL_VISIBLE marking say. The whole point of DISABLE_PAGE_SKIPPING is to deal with VM corruption, after all. In retrospect I didn't handle this particular aspect very well in commit 980ae17310. The approach I took is a bit crude (and in any case slightly wrong in that it is inconsistent in how it handles the last page). But it has the merit of fixing the case where we just have the VM's all-frozen bit set for a given block (not the all-visible bit set) -- which is always wrong. There was good reason to be concerned about that possibility when 980ae17310 went in. -- Peter Geoghegan
Re: Identify transactions causing highest wal generation
On 3/8/24 15:50, Gayatri Singh wrote: > Hello Team, > > Can you help me with steps to identify transactions which caused WAL > generation to surge? > You should probably take a look at pg_waldump, which prints information about WAL contents, including which XID generated each record. I don't know exactly what your goal is, but sometimes it's not an entirely direct relationship. For example, a transaction may delete a record, which generates just a little bit of WAL. But then after a checkpoint a VACUUM comes around, vacuums the page to reclaim the space of the entry, and ends up writing an FPI (which is much larger). You could argue this WAL is also attributable to the original transaction, but that's not what pg_waldump will allow you to do. FPIs in general may inflate the numbers unpredictably, and they're not something the original transaction can affect very much. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
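As a concrete starting point for the "which XID generated each record" approach, here is a toy helper that pulls the per-record total size and XID out of pg_waldump's text output. The line format ("len (rec/tot): r/ t, tx: x") is assumed from current pg_waldump output, and the function name is invented for this sketch:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Toy parser for one line of pg_waldump text output; the function name
 * is invented for this sketch.  It extracts the record's total size
 * (the second number of "len (rec/tot): r/ t") and the XID following
 * "tx:".  Returns 1 on success, 0 for lines lacking either field. */
static int
toy_parse_waldump_line(const char *line, uint32_t *xid, uint32_t *tot)
{
    const char *lenp = strstr(line, "(rec/tot):");
    const char *txp = strstr(line, "tx:");
    unsigned int rec, t, x;

    if (lenp == NULL || txp == NULL)
        return 0;
    if (sscanf(lenp, "(rec/tot): %u/ %u", &rec, &t) != 2)
        return 0;
    if (sscanf(txp, "tx: %u", &x) != 1)
        return 0;
    *xid = x;
    *tot = t;
    return 1;
}
```

Feeding each line of pg_waldump output through this and accumulating tot per xid gives a rough per-transaction WAL volume, with the caveat noted above that FPIs can inflate the numbers unpredictably.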
Re: type cache cleanup improvements
Aleksander Alekseev writes: > One thing that I couldn't immediately figure out is why 0 hash value > is treated as a magic invalid value in TypeCacheTypCallback(): I've not read this patch, but IIRC in some places we have a convention that hash value zero is passed for an sinval reset event (that is, "flush all cache entries"). regards, tom lane
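For readers unfamiliar with the convention, here is a toy model (not PostgreSQL code; the names are invented) of a callback that treats hashvalue == 0 as "flush everything". It also shows the hazard being asked about: if a real entry hashes to 0, a targeted invalidation for it degenerates into a full-cache flush, which is still correct, just occasionally more work than strictly necessary:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model (not PostgreSQL code; names invented) of the sinval
 * convention: an invalidation callback that receives hashvalue == 0
 * flushes the whole cache, while any other value flushes only the
 * entries whose hash matches. */

typedef struct ToyEntry
{
    uint32_t hash;
    bool     valid;
} ToyEntry;

static void
toy_inval_callback(ToyEntry *entries, size_t n, uint32_t hashvalue)
{
    for (size_t i = 0; i < n; i++)
    {
        if (hashvalue == 0 || entries[i].hash == hashvalue)
            entries[i].valid = false;   /* force rebuild on next lookup */
    }
}
```

An entry whose real hash happens to be 0 can never be invalidated individually under this convention: asking for hash 0 is, by definition, the full reset.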
Re: type cache cleanup improvements
Hi, > > I would like to tweak the patch a little bit - change some comments, > > add some Asserts, etc. Don't you mind? > You are welcome! Thanks. PFA the updated patch with some tweaks by me. I added the commit message as well. One thing that I couldn't immediately figure out is why 0 hash value is treated as a magic invalid value in TypeCacheTypCallback(): ``` - hash_seq_init(&status, TypeCacheHash); + if (hashvalue == 0) + hash_seq_init(&status, TypeCacheHash); + else + hash_seq_init_with_hash_value(&status, TypeCacheHash, hashvalue); ``` Is there anything that prevents the actual hash value from being zero? I don't think so, but maybe I missed something. If zero is indeed an invalid hash value I would like to reference the corresponding code. If zero is a correct hash value we should either change this by adding something like `if(!hash) hash++` or use an additional boolean argument here. -- Best regards, Aleksander Alekseev v3-0001-Improve-performance-of-type-cache-cleanup.patch Description: Binary data
Re: Call perror on popen failure
Daniel Gustafsson writes: > If popen fails in pipe_read_line we invoke perror for the error message, and > pipe_read_line is in turn called by find_other_exec which is used in both > frontend and backend code. This is an old codepath, and it seems like it > ought > to be replaced with the more common logging tools we now have like in the > attached diff (which also makes the error translated as opposed to now). Any > objections to removing this last perror() call? I didn't check your replacement code in detail, but I think we have a policy against using perror, mainly because we can't count on it to localize consistently with ereport et al. My grep confirms this is the only use, so +1 for removing it. regards, tom lane
Re: Spurious pgstat_drop_replslot() call
Hi, On Fri, Mar 08, 2024 at 07:55:39PM +0900, Michael Paquier wrote: > On Fri, Mar 08, 2024 at 10:19:11AM +, Bertrand Drouvot wrote: > > Indeed, it does not seem appropriate to remove stats during slot > > invalidation as > > one could still be interested to look at them. > > Yeah, my take is that this can still be interesting to know, at least > for debugging. This would limit the stats to be dropped when the slot > is dropped, and that looks like a sound idea. Thanks for looking at it! > > This spurious call has been introduced in be87200efd. I think that > > initially we > > designed to drop slots when a recovery conflict occurred during logical > > decoding > > on a standby. But we changed our mind to invalidate such a slot instead. > > > > The spurious call is probably due to the initial design but has not been > > removed. > > This is not a subject that has really been touched on the original > thread mentioned on the commit, so it is a bit hard to be sure. The > only comment is that a previous version of the patch did the stats > drop in the slot invalidation path at an incorrect location. The switch in the patch from "drop" to "invalidation" is in [1], see: " Given the precedent of max_slot_wal_keep_size, I think it's wrong to just drop the logical slots. Instead we should just mark them as invalid, like InvalidateObsoleteReplicationSlots(). Makes fully sense and done that way in the attached patch. " But yeah, hard to be sure why this call is there, at least I don't remember... > > I don't think it's worth adding a test but can do if one feels the need. > > Well, that would not be complicated while being cheap, no? We have a > few paths in the 035 test where we know that a slot has been > invalidated so it is just a matter of querying once > pg_stat_replication_slot and see if some data is still there. We cannot be 100% sure that the stats are up to date when the process holding the active slot is killed.
So v2 attached adds a test where we ensure first that we have non empty stats before triggering the invalidation. [1]: https://www.postgresql.org/message-id/26c6f320-98f0-253c-f8b5-df1e7c1f6315%40amazon.com Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com >From b97a42edd66ccb7f8f4d5d2fa24df0b02b6f4f68 Mon Sep 17 00:00:00 2001 From: Bertrand Drouvot Date: Fri, 8 Mar 2024 09:29:29 + Subject: [PATCH v2] Remove spurious pgstat_drop_replslot() call There is no need to remove stats for an invalidated slot as one could still be interested to look at them. --- src/backend/replication/slot.c | 1 - .../recovery/t/035_standby_logical_decoding.pl | 18 ++ 2 files changed, 18 insertions(+), 1 deletion(-) 3.5% src/backend/replication/ 96.4% src/test/recovery/t/ diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c index b8bf98b182..91ca397857 100644 --- a/src/backend/replication/slot.c +++ b/src/backend/replication/slot.c @@ -1726,7 +1726,6 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlotInvalidationCause cause, ReplicationSlotMarkDirty(); ReplicationSlotSave(); ReplicationSlotRelease(); - pgstat_drop_replslot(s); ReportSlotInvalidation(conflict, false, active_pid, slotname, restart_lsn, diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl index 0710bccc17..855f37016e 100644 --- a/src/test/recovery/t/035_standby_logical_decoding.pl +++ b/src/test/recovery/t/035_standby_logical_decoding.pl @@ -494,6 +494,8 @@ $node_subscriber->stop; ## # Recovery conflict: Invalidate conflicting slots, including in-use slots # Scenario 1: hot_standby_feedback off and vacuum FULL +# In passing, ensure that replication slot stats are not removed when the active +# slot is invalidated. 
## # One way to produce recovery conflict is to create/drop a relation and @@ -502,6 +504,15 @@ $node_subscriber->stop; reactive_slots_change_hfs_and_wait_for_xmins('behaves_ok_', 'vacuum_full_', 0, 1); +# Ensure replication slot stats are not empty before triggering the conflict +$node_primary->safe_psql('testdb', + qq[INSERT INTO decoding_test(x,y) SELECT 100,'100';] +); + +$node_standby->poll_query_until('testdb', + qq[SELECT total_txns > 0 FROM pg_stat_replication_slots WHERE slot_name = 'vacuum_full_activeslot'] +) or die "slot stat total_txns not > 0 for vacuum_full_activeslot"; + # This should trigger the conflict wait_until_vacuum_can_remove( 'full', 'CREATE TABLE conflict_test(x integer, y text); @@ -515,6 +526,13 @@ check_for_invalidation('vacuum_full_', 1, 'with vacuum FULL on pg_class'); # Verify conflict_reason is 'rows_removed' in pg_replication_slots check_slots_conflict_reason('vacuum_full_', 'rows_removed'); +# Ensure replication slot stats
Re: [DOC] Add detail regarding resource consumption wrt max_connections
On Mon, Jan 22, 2024 at 8:58 AM wrote: > On Fri, 2024-01-19 at 17:37 -0500, reid.thomp...@crunchydata.com wrote: > > On Sat, 2024-01-13 at 10:31 -0700, Roberto Mello wrote: > > > > > > I can add a suggestion for the user to consider increasing > > > shared_buffers in accordance with higher max_connections, but it > > > would be better if there was a "rule of thumb" guideline to go > > > along. I'm open to suggestions. > > > > > > I can revise with a similar warning in max_prepared_xacts as well. > > > > > > Sincerely, > > > > > > Roberto > > > > Can a "close enough" rule of thumb be calculated from: > > postgresql.conf -> log_min_messages = debug3 > > > > start postgresql with varying max_connections to get > > CreateSharedMemoryAndSemaphores() sizes to generate a rough equation > > > > or maybe it would be sufficient to advise to set log_min_messages = > debug3 on a test DB and start/stop it with varying values of > max_connections and look at the differing values in > DEBUG: invoking IpcMemoryCreate(size=...) log messages for themselves. > > I'm of the opinion that advice suggesting DBAs set things to debug3 is unfriendly at best. If you really want to add more, there is an existing unfriendly section of the docs at https://www.postgresql.org/docs/devel/kernel-resources.html#LINUX-MEMORY-OVERCOMMIT that mentions this problem, specifically: "If PostgreSQL itself is the cause of the system running out of memory, you can avoid the problem by changing your configuration. In some cases, it may help to lower memory-related configuration parameters, particularly shared_buffers, work_mem, and hash_mem_multiplier. In other cases, the problem may be caused by allowing too many connections to the database server itself. In many cases, it may be better to reduce max_connections and instead make use of external connection-pooling software." I couldn't really find a spot to add in your additional info, but maybe you can find a spot that fits?
Or maybe a well written walk-through of this would make for a good wiki page in case people really want to dig in. In any case, I think Roberto's original language is an improvement over what we have now, so I'd probably recommend just going with that, along with a similar note to max_prepared_xacts, and optionally a pointer to the shared mem section of the docs. Robert Treat https://xzilla.net
Re: Support a wildcard in backtrace_functions
On 08.03.24 12:25, Jelte Fennema-Nio wrote: On Fri, 8 Mar 2024 at 10:59, Alvaro Herrera wrote: On 2024-Mar-08, Bharath Rupireddy wrote: This works only if '*' is specified as the only character in backtrace_functions = '*', right? If yes, what if someone sets backtrace_functions = 'foo, bar, *, baz'? It throws an error, as expected. This is a useless waste of resources: checking for "foo" and "bar" is pointless, since the * is going to give a positive match anyway. And the "baz" is a waste of memory which is never going to be checked. Makes sense. Attached is a new patchset that implements it that way. I've not included Bharath's 0003 patch, since it's a much bigger change than the others, and thus might need some more discussion. What is the relationship of these changes with the recently added backtrace_on_internal_error? We had similar discussions there; I feel like we are doing similar things here but slightly differently. Like, shouldn't backtrace_functions_min_level also affect backtrace_on_internal_error? Don't you really just want backtrace_on_any_error? You are sneaking that in through the backdoor via backtrace_functions. Can we somehow combine all these use cases more elegantly? backtrace_on_error = {all|internal|none}? Btw., your code/documentation sometimes writes "stack trace". Let's stick to backtrace for consistency.
Identify transactions causing highest wal generation
Hello Team, Can you help me with steps to identify transactions which caused WAL generation to surge? Regards, Gayatri.
Re: Introduce XID age and inactive timeout based replication slot invalidation
On Wed, Mar 6, 2024 at 4:28 PM Amit Kapila wrote: > > IIUC, the current conflict_reason is primarily used to determine > logical slots on standby that got invalidated due to recovery time > conflict. On the primary, it will also show logical slots that got > invalidated due to the corresponding WAL being removed. Is that > understanding correct? That's right. > If so, we are already sort of overloading this > column. However, now adding more invalidation reasons that won't > happen during recovery conflict handling will change entirely the > purpose (as per the name we use) of this variable. I think > invalidation_reason could depict this column correctly but OTOH I > guess it would lose its original meaning/purpose. Hm. I get the concern. Are you okay with having invalidation_reason separately for both logical and physical slots? In such a case, logical slots that got invalidated on the standby will have duplicate info in conflict_reason and invalidation_reason; is this fine? Another idea is to make 'conflict_reason text' a 'conflicting boolean' again (revert 007693f2a3), and have 'invalidation_reason text' for both logical and physical slots. So, whenever 'conflicting' is true, one can look at invalidation_reason for the reason for conflict. How does this sound? -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Support a wildcard in backtrace_functions
> On 8 Mar 2024, at 15:01, Bharath Rupireddy > wrote: > So, to get backtraces of all functions at > backtrace_functions_min_level level, one has to specify > backtrace_functions = '*'; combining it with function names is not > allowed. This looks cleaner. > > postgres=# ALTER SYSTEM SET backtrace_functions = '*, > pg_create_restore_point'; > ERROR: invalid value for parameter "backtrace_functions": "*, > pg_create_restore_point" > DETAIL: Invalid character If we want to be extra helpful here we could add something like the below to give an errhint when a wildcard was found. Also, the errdetail should read like a full sentence so it should be slightly expanded anyways. diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c index ca621ea3ff..7bc655ecd2 100644 --- a/src/backend/utils/error/elog.c +++ b/src/backend/utils/error/elog.c @@ -2151,7 +2151,9 @@ check_backtrace_functions(char **newval, void **extra, GucSource source) ", \n\t"); if (validlen != newvallen) { - GUC_check_errdetail("Invalid character"); + GUC_check_errdetail("Invalid character in function name."); + if ((*newval)[validlen] == '*') + GUC_check_errhint("For wildcard matching, use a single \"*\" without any other function names."); return false; } -- Daniel Gustafsson
Re: Support a wildcard in backtrace_functions
On Fri, Mar 8, 2024 at 7:12 PM Daniel Gustafsson wrote: > > > On 8 Mar 2024, at 12:25, Jelte Fennema-Nio wrote: > > > > On Fri, 8 Mar 2024 at 10:59, Alvaro Herrera wrote: > >> > >> On 2024-Mar-08, Bharath Rupireddy wrote: > >> > >>> This works only if '*' is specified as the only character in > >>> backtrace_functions = '*', right? If yes, what if someone sets > >>> backtrace_functions = 'foo, bar, *, baz'? > >> > >> It throws an error, as expected. This is a useless waste of resources: > >> checking for "foo" and "bar" is pointless, since the * is going to give > >> a positive match anyway. And the "baz" is a waste of memory which is > >> never going to be checked. > > > > Makes sense. Attached is a new patchset that implements it that way. > > This version addresses the concerns raised by Alvaro, and even simplifies the > code over earlier revisions. My documentation comments from upthread still > stand, but other than those this version LGTM. So, to get backtraces of all functions at backtrace_functions_min_level level, one has to specify backtrace_functions = '*'; combining it with function names is not allowed. This looks cleaner. postgres=# ALTER SYSTEM SET backtrace_functions = '*, pg_create_restore_point'; ERROR: invalid value for parameter "backtrace_functions": "*, pg_create_restore_point" DETAIL: Invalid character I have one comment on 0002, otherwise all looks good. + +A single * character can be used instead of a list +of C functions. This * is interpreted as a wildcard +and will cause all errors in the log to contain backtraces. + It's not always the ERRORs for which backtraces get logged, it really depends on the new GUC backtrace_functions_min_level. If my understanding is right, can we specify that in the above note? > > I've not included Bharath's 0003 patch, since it's a much bigger > change than the others, and thus might need some more discussion. +1. I'll see if I can start a new thread for this.
-- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Confine vacuum skip logic to lazy_scan_skip
On 08/03/2024 02:46, Melanie Plageman wrote: On Wed, Mar 06, 2024 at 10:00:23PM -0500, Melanie Plageman wrote: I feel heap_vac_scan_get_next_block() function could use some love. Maybe just some rewording of the comments, or maybe some other refactoring; not sure. But I'm pretty happy with the function signature and how it's called. I've cleaned up the comments on heap_vac_scan_next_block() in the first couple patches (not so much in the streaming read user). Let me know if it addresses your feelings or if I should look for other things I could change. Thanks, that is better. I think I now finally understand how the function works, and now I can see some more issues and refactoring opportunities :-). Looking at current lazy_scan_skip() code in 'master', one thing now caught my eye (and it's the same with your patches): *next_unskippable_allvis = true; while (next_unskippable_block < rel_pages) { uint8 mapbits = visibilitymap_get_status(vacrel->rel, next_unskippable_block, vmbuffer); if ((mapbits & VISIBILITYMAP_ALL_VISIBLE) == 0) { Assert((mapbits & VISIBILITYMAP_ALL_FROZEN) == 0); *next_unskippable_allvis = false; break; } /* * Caller must scan the last page to determine whether it has tuples * (caller must have the opportunity to set vacrel->nonempty_pages). * This rule avoids having lazy_truncate_heap() take access-exclusive * lock on rel to attempt a truncation that fails anyway, just because * there are tuples on the last page (it is likely that there will be * tuples on other nearby pages as well, but those can be skipped). * * Implement this by always treating the last block as unsafe to skip. */ if (next_unskippable_block == rel_pages - 1) break; /* DISABLE_PAGE_SKIPPING makes all skipping unsafe */ if (!vacrel->skipwithvm) { /* Caller shouldn't rely on all_visible_according_to_vm */ *next_unskippable_allvis = false; break; } /* * Aggressive VACUUM caller can't skip pages just because they are * all-visible. 
They may still skip all-frozen pages, which can't * contain XIDs < OldestXmin (XIDs that aren't already frozen by now). */ if ((mapbits & VISIBILITYMAP_ALL_FROZEN) == 0) { if (vacrel->aggressive) break; /* * All-visible block is safe to skip in non-aggressive case. But * remember that the final range contains such a block for later. */ skipsallvis = true; } /* XXX: is it OK to remove this? */ vacuum_delay_point(); next_unskippable_block++; nskippable_blocks++; } Firstly, it seems silly to check DISABLE_PAGE_SKIPPING within the loop. When DISABLE_PAGE_SKIPPING is set, we always return the next block and set *next_unskippable_allvis = false regardless of the visibility map, so why bother checking the visibility map at all? Except at the very last block of the relation! If you look carefully, at the last block we do return *next_unskippable_allvis = true, if the VM says so, even if DISABLE_PAGE_SKIPPING is set. I think that's wrong. Surely the intention was to pretend that none of the VM bits were set if DISABLE_PAGE_SKIPPING is used, also for the last block. This was changed in commit 980ae17310: @@ -1311,7 +1327,11 @@ lazy_scan_skip(LVRelState *vacrel, Buffer *vmbuffer, BlockNumber next_block, /* DISABLE_PAGE_SKIPPING makes all skipping unsafe */ if (!vacrel->skipwithvm) + { + /* Caller shouldn't rely on all_visible_according_to_vm */ + *next_unskippable_allvis = false; break; + } Before that, *next_unskippable_allvis was set correctly according to the VM, even when DISABLE_PAGE_SKIPPING was used. It's not clear to me why that was changed. And I think setting it to 'true' would be a more failsafe value than 'false'. When *next_unskippable_allvis is set to true, the caller cannot rely on it because a concurrent
Re: Support a wildcard in backtrace_functions
> On 8 Mar 2024, at 12:25, Jelte Fennema-Nio wrote: > > On Fri, 8 Mar 2024 at 10:59, Alvaro Herrera wrote: >> >> On 2024-Mar-08, Bharath Rupireddy wrote: >> >>> This works only if '*' is specified as the only character in >>> backtrace_functions = '*', right? If yes, what if someone sets >>> backtrace_functions = 'foo, bar, *, baz'? >> >> It throws an error, as expected. This is a useless waste of resources: >> checking for "foo" and "bar" is pointless, since the * is going to give >> a positive match anyway. And the "baz" is a waste of memory which is >> never going to be checked. > > Makes sense. Attached is a new patchset that implements it that way. This version addresses the concerns raised by Alvaro, and even simplifies the code over earlier revisions. My documentation comments from upthread still stand, but other than those this version LGTM. > I've not included Bharath's 0003 patch, since it's a much bigger > change than the others, and thus might need some more discussion. Agreed. -- Daniel Gustafsson
Re: Failures in constraints regression test, "read only 0 of 8192 bytes"
On Friday, March 8, 2024 at 14:36:48 CET, Tomas Vondra wrote: > > My guess would be 8af25652489, as it's the only storage-related commit. > > > > I'm currently running tests to verify this. > Yup, the breakage starts with this commit. I haven't looked into the > root cause, or whether the commit maybe just made some pre-existing > issue easier to hit. Also, I haven't followed the discussion on the > pgsql-bugs thread [1], maybe there are some interesting findings. > If that happens only on HEAD and not on 16, and doesn't involve WAL replay, then it's not the same bug. -- Ronan Dunklau
Re: Failures in constraints regression test, "read only 0 of 8192 bytes"
On 3/8/24 13:21, Tomas Vondra wrote: > On 3/8/24 09:33, Thomas Munro wrote: >> Happened again. I see this is OpenSUSE. Does that mean the file >> system is Btrfs? > > > It is, but I don't think that matters - I've been able to reproduce this > locally on my laptop using ext4 filesystem. I'd bet the important piece > here is -DCLOBBER_CACHE_ALWAYS (and it seems avocet/trilobite are the > only animals running with this). > > Also, if this really is a filesystem (or environment) issue, it seems > very strange it'd only affect HEAD and not the other branches. So it > seems quite likely this is actually triggered by a commit. > > Looking at the commits from the good/bad runs, I see this: > > avocet: good=4c2369a bad=f5a465f > trilobite: good=d13ff82 bad=def0ce3 > > That means the commit would have to be somewhere here: > > f5a465f1a07 Promote assertion about !ReindexIsProcessingIndex to ... > 57b28c08305 Doc: fix minor typos in two ECPG function descriptions. > 28e858c0f95 Improve documentation and GUC description for ... > a661bf7b0f5 Remove flaky isolation tests for timeouts > 874d817baa1 Multiple revisions to the GROUP BY reordering tests > 466979ef031 Replace lateral references to removed rels in subqueries > a6b2a51e16d Avoid dangling-pointer problem with partitionwise ... > d360e3cc60e Fix compiler warning on typedef redeclaration > 8af25652489 Introduce a new smgr bulk loading facility. > e612384fc78 Fix mistake in SQL features list > d13ff82319c Fix BF failure in commit 93db6cbda0. > > My guess would be 8af25652489, as it's the only storage-related commit. > > I'm currently running tests to verify this. > Yup, the breakage starts with this commit. I haven't looked into the root cause, or whether the commit maybe just made some pre-existing issue easier to hit. Also, I haven't followed the discussion on the pgsql-bugs thread [1], maybe there are some interesting findings. 
[1] https://www.postgresql.org/message-id/flat/1878547.tdWV9SEqCh%40aivenlaptop regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Proposal to add page headers to SLRU pages
On 2024-Mar-08, Stephen Frost wrote: > Thanks for testing! That strikes me as perfectly reasonable and seems > unlikely that we'd get much benefit from parallelizing it, so I'd say it > makes sense to keep this code simple. Okay, agreed, that amount of time sounds reasonable to me too; but I don't want to be responsible for this at least for pg17. If some other committer wants to take it, be my guest. However, I think this is mostly a foundation for building other things on top, so committing during the last commitfest is perhaps not very useful. Another aspect of this patch is the removal of the LSN groups. There's an explanation of the LSN groups in src/backend/access/transam/README, and while this patch removes the LSN group feature, it doesn't update that text. That's already a problem which needs to be fixed, but the text says : In fact, we store more than one LSN for each clog page. This relates to : the way we set transaction status hint bits during visibility tests. : We must not set a transaction-committed hint bit on a relation page and : have that record make it to disk prior to the WAL record of the commit. : Since visibility tests are normally made while holding buffer share locks, : we do not have the option of changing the page's LSN to guarantee WAL : synchronization. Instead, we defer the setting of the hint bit if we have : not yet flushed WAL as far as the LSN associated with the transaction. : This requires tracking the LSN of each unflushed async commit. : It is convenient to associate this data with clog buffers: because we : will flush WAL before writing a clog page, we know that we do not need : to remember a transaction's LSN longer than the clog page holding its : commit status remains in memory. However, the naive approach of storing : an LSN for each clog position is unattractive: the LSNs are 32x bigger : than the two-bit commit status fields, and so we'd need 256K of : additional shared memory for each 8K clog buffer page.
We choose : instead to store a smaller number of LSNs per page, where each LSN is : the highest LSN associated with any transaction commit in a contiguous : range of transaction IDs on that page. This saves storage at the price : of some possibly-unnecessary delay in setting transaction hint bits. In the new code we effectively store only one LSN per page, which I understand is strictly worse. Maybe the idea of doing away with these LSN groups should be reconsidered ... unless I completely misunderstand the whole thing. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
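For reference, the LSN-group bookkeeping the README describes can be sketched as a toy. The constants mirror those in clog.c (32768 two-bit status slots per 8K page, 32 xacts per LSN group, hence 1024 LSNs per page instead of one per xact), but the code itself is invented for illustration and is not PostgreSQL's:

```c
#include <assert.h>
#include <stdint.h>

/* Toy illustration of the clog "LSN group" scheme quoted above from
 * the README; NOT PostgreSQL code.  One LSN is kept per contiguous
 * range of transaction IDs, recording the highest commit LSN seen in
 * that range. */

#define TOY_XACTS_PER_PAGE      32768
#define TOY_XACTS_PER_LSN_GROUP 32
#define TOY_LSN_GROUPS_PER_PAGE (TOY_XACTS_PER_PAGE / TOY_XACTS_PER_LSN_GROUP)

typedef uint64_t ToyLsn;

static ToyLsn toy_group_lsns[TOY_LSN_GROUPS_PER_PAGE];

/* which LSN group covers a given xid's slot on its clog page */
static int
toy_lsn_group(uint32_t xid)
{
    return (int) ((xid % TOY_XACTS_PER_PAGE) / TOY_XACTS_PER_LSN_GROUP);
}

/* record an async commit: keep only the highest LSN per group */
static void
toy_remember_commit(uint32_t xid, ToyLsn lsn)
{
    int g = toy_lsn_group(xid);

    if (lsn > toy_group_lsns[g])
        toy_group_lsns[g] = lsn;
}
```

Collapsing TOY_XACTS_PER_LSN_GROUP to the whole page is effectively what the patch under discussion does: one LSN per page, the coarsest possible grouping, at the cost of more possibly-unnecessary delay in setting hint bits.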
Re: a wrong index choose when statistics is out of date
After some more thought about the difference between the two ideas, I find we are resolving two different issues; in the wrong-index-choice cases, both of them should work generally. Your idea adds some rule-based logic named certainty_factor, and the implementation is very graceful. For the example in this case, it takes effect *when both indexes have the same cost*. I believe that can resolve the index choice here, but how about the rows-estimation issue? Given that the design will not fudge the cost, I assume you will not fudge the rows or selectivity either. Then if the optimizer statistics are missing, what can we do for both index choice and rows estimation? I think that's where my idea comes in. Since the optimizer statistics can't be up to date by design, and assuming we have a situation where the customer's queries need those statistics often, how about doing the prediction with the historical statistics? That can cover both index choice and rows estimation. The following arguments may then arise: a) we can't decide *automatically* when the missing optimizer statistics are wanted; b) if we do the estimation with the historical statistics, the value of the MCV information is lost. My answers: a) it is controlled by a human with "alter table t alter column a set (force_generic=on)"; b) it can't be resolved, I think, and it only matters when the real Const is very different from the ones in history; a generic plan has the same issue. I just reviewed the bad query plans from the past half year internally, and I found many queries used a Nested Loop, which was the direct cause. Now I think I've found a new reason for this: the missing optimizer statistics cause the row estimate for the outer relation to be 1, which makes the Nested Loop be chosen. I'm not sure whether your idea could help with this, or help more than mine in this respect. -- Best Regards Andy Fan
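The nested-loop effect described in the last paragraph can be seen with a toy cost comparison (invented numbers, not PostgreSQL's cost model): when the outer-side row estimate collapses to 1, a nested loop looks cheaper than a hash join, and the decision flips once the true row count is used.

```c
#include <assert.h>

/* Toy join costs, NOT PostgreSQL's costing code.  The shapes are the
 * usual ones: a nested loop pays the inner cost once per outer row,
 * while a hash join pays a fixed build cost plus per-row work. */
static double
toy_nestloop_cost(double outer_rows, double inner_cost)
{
    return outer_rows * inner_cost;
}

static double
toy_hashjoin_cost(double outer_rows, double inner_rows)
{
    return 100.0 + outer_rows + inner_rows;    /* 100.0 = toy build cost */
}
```

With outer_rows misestimated as 1, the nested loop's total is one inner scan, which undercuts the hash join's fixed build cost; with the real cardinality, repeating the inner scan per outer row dominates and the hash join wins.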
Re: Failures in constraints regression test, "read only 0 of 8192 bytes"
On 3/8/24 09:33, Thomas Munro wrote: > Happened again. I see this is OpenSUSE. Does that mean the file > system is Btrfs? It is, but I don't think that matters - I've been able to reproduce this locally on my laptop using ext4 filesystem. I'd bet the important piece here is -DCLOBBER_CACHE_ALWAYS (and it seems avocet/trilobite are the only animals running with this). Also, if this really is a filesystem (or environment) issue, it seems very strange it'd only affect HEAD and not the other branches. So it seems quite likely this is actually triggered by a commit. Looking at the commits from the good/bad runs, I see this: avocet: good=4c2369a bad=f5a465f trilobite: good=d13ff82 bad=def0ce3 That means the commit would have to be somewhere here: f5a465f1a07 Promote assertion about !ReindexIsProcessingIndex to ... 57b28c08305 Doc: fix minor typos in two ECPG function descriptions. 28e858c0f95 Improve documentation and GUC description for ... a661bf7b0f5 Remove flaky isolation tests for timeouts 874d817baa1 Multiple revisions to the GROUP BY reordering tests 466979ef031 Replace lateral references to removed rels in subqueries a6b2a51e16d Avoid dangling-pointer problem with partitionwise ... d360e3cc60e Fix compiler warning on typedef redeclaration 8af25652489 Introduce a new smgr bulk loading facility. e612384fc78 Fix mistake in SQL features list d13ff82319c Fix BF failure in commit 93db6cbda0. My guess would be 8af25652489, as it's the only storage-related commit. I'm currently running tests to verify this. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Statistics Import and Export
Greetings, * Corey Huinker (corey.huin...@gmail.com) wrote: > > Having some discussion around that would be useful. Is it better to > > have a situation where there are stats for some columns but no stats for > > other columns? There would be a good chance that this would lead to a > > set of queries that were properly planned out and a set which end up > > with unexpected and likely poor query plans due to lack of stats. > > Arguably that's better overall, but either way an ANALYZE needs to be > > done to address the lack of stats for those columns and then that > > ANALYZE is going to blow away whatever stats got loaded previously > > anyway and all we did with a partial stats load was maybe have a subset > > of queries have better plans in the interim, after having expended the > > cost to try and individually load the stats and dealing with the case of > > some of them succeeding and some failing. > > It is my (incomplete and entirely second-hand) understanding that > pg_upgrade doesn't STOP autovacuum, but sets a delay to a very long value > and then resets it on completion, presumably because analyzing a table > before its data is loaded and indexes are created would just be a waste of > time. No, pg_upgrade starts the postmaster with -b (undocumented binary-upgrade mode) which prevents autovacuum (and logical replication workers) from starting, so we don't need to worry about autovacuum coming in and causing issues during binary upgrade. That doesn't entirely eliminate the concerns around pg_dump vs. autovacuum because we may restore a dump into a non-binary-upgrade-in-progress system though, of course.
> > Overall, I'd suggest we wait to see what Corey comes up with in terms of > > doing the stats load for all attributes in a single function call, > > perhaps using the VALUES construct as you suggested up-thread, and then > > we can contemplate if that's clean enough to work or if it's so grotty > > that the better plan would be to do per-attribute function calls. If it > > ends up being the latter, then we can revisit this discussion and try to > > answer some of the questions raised above. > > In the patch below, I ended up doing per-attribute function calls, mostly > because it allowed me to avoid creating a custom data type for the portable > version of pg_statistic. This comes at the cost of a very high number of > parameters, but that's the breaks. Perhaps it's just me ... but it doesn't seem like it's all that many parameters. > I am a bit concerned about the number of locks on pg_statistic and the > relation itself, doing CatalogOpenIndexes/CatalogCloseIndexes once per > attribute rather than once per relation. But I also see that this will > mostly get used at a time when no other traffic is on the machine, and > whatever it costs, it's still faster than the smallest table sample (insert > joke about "don't have to be faster than the bear" here). I continue to not be too concerned about this until and unless it's actually shown to be an issue. Keeping things simple and straight-forward has its own value. > This raises questions about whether a failure in one attribute update > statement should cause the others in that relation to roll back or not, and > I can see situations where both would be desirable. > > I'm putting this out there ahead of the pg_dump / fe_utils work, mostly > because what I do there heavily depends on how this is received. > > Also, I'm still seeking confirmation that I can create a pg_dump TOC entry > with a chain of commands (e.g. BEGIN; ... COMMIT; ) or if I have to fan > them out into multiple entries. 
If we do go with this approach, we'd certainly want to make sure to do it in a manner which would allow pg_restore's single-transaction mode to still work, which definitely complicates this some. Given that and the other general feeling that the locking won't be a big issue, I'd suggest the simple approach on the pg_dump side of just dumping the stats out without the BEGIN/COMMIT. > Anyway, here's v7. Eagerly awaiting feedback. > Subject: [PATCH v7] Create pg_set_relation_stats, pg_set_attribute_stats. > diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat > index 291ed876fc..d12b6e3ca3 100644 > --- a/src/include/catalog/pg_proc.dat > +++ b/src/include/catalog/pg_proc.dat > @@ -8818,7 +8818,6 @@ > { oid => '3813', descr => 'generate XML text node', >proname => 'xmltext', proisstrict => 't', prorettype => 'xml', >proargtypes => 'text', prosrc => 'xmltext' }, > - > { oid => '2923', descr => 'map table contents to XML', >proname => 'table_to_xml', procost => '100', provolatile => 's', >proparallel => 'r', prorettype => 'xml', > @@ -12163,8 +12162,24 @@ > > # GiST stratnum implementations > { oid => '8047', descr => 'GiST support', > - proname => 'gist_stratnum_identity', prorettype => 'int2', > + proname => 'gist_stratnum_identity',prorettype => 'int2', >proargtypes => 'int2', >prosrc
Re: Support a wildcard in backtrace_functions
On Fri, 8 Mar 2024 at 10:59, Alvaro Herrera wrote:
>
> On 2024-Mar-08, Bharath Rupireddy wrote:
>
> > This works only if '*' is specified as the only character in
> > backtrace_functions = '*', right? If yes, what if someone sets
> > backtrace_functions = 'foo, bar, *, baz'?
>
> It throws an error, as expected. This is a useless waste of resources:
> checking for "foo" and "bar" is pointless, since the * is going to give
> a positive match anyway. And the "baz" is a waste of memory which is
> never going to be checked.

Makes sense. Attached is a new patchset that implements it that way. I've not included Bharath's 0003 patch, since it's a much bigger change than the others, and thus might need some more discussion.

v5-0002-Add-wildcard-support-to-backtrace_functions-GUC.patch
Description: Binary data

v5-0001-Add-backtrace_functions_min_level.patch
Description: Binary data
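For illustration, the rule agreed on above ('*' must stand alone in the list, and a lone '*' matches every function) could be sketched like this. This is purely a sketch in Python, not the actual GUC check_hook code, and the function names are hypothetical:

```python
def parse_backtrace_functions(value):
    """Parse a comma-separated function list; '*' must be the only entry.

    Illustrative sketch of the validation rule discussed above, not
    PostgreSQL's actual implementation.
    """
    names = [n.strip() for n in value.split(",") if n.strip()]
    if "*" in names and len(names) > 1:
        # mirrors the "throws an error, as expected" behavior
        raise ValueError("'*' must be the only entry in backtrace_functions")
    return names


def matches(names, funcname):
    """A lone '*' matches everything; otherwise exact name match."""
    return names == ["*"] or funcname in names
```

Under these (assumed) semantics, 'foo, bar, *, baz' is rejected at parse time rather than wasting per-error checks on names the wildcard would shadow anyway.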
Re: Dump-restore loosing 'attnotnull' bit for DEFERRABLE PRIMARY KEY column(s).
On Thu, 7 Mar 2024 at 17:32, Alvaro Herrera wrote: > > This seems to work okay. > Yes, this looks good. I tested it against CREATE TABLE ... LIKE, and it worked as expected. It might be worth adding a test case for that, to ensure that it doesn't get broken in the future. Do we also want a test case that does what pg_dump would do: - Add a NOT NULL constraint - Add a deferrable PK constraint - Drop the NOT NULL constraint - Check the column is still not nullable Looking at rel.h, I think that the new field should probably come after rd_pkindex, under the comment "data managed by RelationGetIndexList", and have its own comment. Also, if I'm nitpicking, the new field and local variables should use the term "deferrable" rather than "deferred". A DEFERRABLE constraint can be set to be either DEFERRED or IMMEDIATE within a transaction, but "deferrable" is the right term to use to describe the persistent property of an index/constraint that can be deferred. (The same objection applies to the field name "indimmediate", but it's too late to change that.) Also, for neatness/consistency, the new field should probably be reset in load_relcache_init_file(), alongside rd_pkindex, though I don't think it can matter in practice. Regards, Dean
Re: Add new error_action COPY ON_ERROR "log"
On Fri, Mar 08, 2024 at 03:36:30PM +0530, Bharath Rupireddy wrote:
> Please see the attached v5-0001 patch implementing LOG_VERBOSITY with
> options 'default' and 'verbose'. v5-0002 adds the detailed info to
> ON_ERROR 'ignore' option.

I may be reading this patch set incorrectly, but why doesn't this patch generate one NOTICE per attribute, as suggested by Sawada-san, incrementing num_errors once per row when the last attribute has been processed? Also, why not have a test that checks that multiple rows spawn more than one message in some distributed fashion? Did you look at this idea?

> We have a CF entry https://commitfest.postgresql.org/47/4798/ for the
> original idea proposed in this thread, that is, to have the ON_ERROR
> 'log' option. I'll probably start a new thread and add a new CF entry
> in the next commitfest if there's no objection from anyone.

Hmm. You are referring to the part where you'd want to control where the errors are sent, right? At least, what you have here would address point 1) mentioned in the first message of this thread.
--
Michael

signature.asc
Description: PGP signature
Re: Commitfest Manager for March
> On 4 Mar 2024, at 17:09, Aleksander Alekseev wrote:
>
> If you need any help please let me know.

Aleksander, I would greatly appreciate it if you join me in managing the CF. Together we can move more stuff :)

Currently, I'm going through "SQL Commands". And so far I have not come to "Performance" and "Server Features" at all... So if you can handle updating the statuses of those sections - that would be great.

Best regards, Andrey Borodin.
Re: Regarding the order of the header file includes
On Thu, Mar 7, 2024 at 12:39 PM Richard Guo wrote:
>
> While rebasing one of my patches I noticed that the header file includes
> in relnode.c are not sorted in order. So I wrote a naive script to see
> if any other C files have the same issue. The script is:
>
> #!/bin/bash
>
> find . -name "*.c" | while read -r file; do
>     headers=$(grep -o '#include "[^>]*"' "$file" |
>               grep -v "postgres.h" | grep -v "postgres_fe.h" |
>               sed 's/\.h"//g')
>
>     sorted_headers=$(echo "$headers" | sort)
>
>     results=$(diff <(echo "$headers") <(echo "$sorted_headers"))
>
>     if [[ $? != 0 ]]; then
>         echo "Headers in '$file' are out of order"
>         echo $results
>         echo
>     fi
> done

Cool. Isn't it a better idea to improve this script to auto-order the header files and land it under src/tools/pginclude/headerssort? It can then be reusable and be another code beautification weapon one can use before the code release.

FWIW, I'm getting a syntax error when running the above shell script:

headerssort.sh: 10: Syntax error: "(" unexpected

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
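(As an aside: the reported "(" syntax error is what dash prints for bash process substitution, so the script was likely run with sh rather than bash.) A hypothetical headerssort tool along the lines Bharath suggests could be sketched as below. This is an illustration only, not the project's actual tooling; it reorders each contiguous run of quoted #include lines and keeps postgres.h / postgres_fe.h pinned first:

```python
import re

# A quoted-include line, e.g.: #include "utils/rel.h"
INCLUDE = re.compile(r'\s*#include "[^"]+"')
PINNED = ('#include "postgres.h"', '#include "postgres_fe.h"')


def sort_includes(lines):
    """Sort each contiguous run of #include "..." lines alphabetically,
    keeping postgres.h / postgres_fe.h at the top of their run.
    Sketch only; real tooling would also need to handle #ifdef blocks."""
    out, run = [], []

    def flush():
        pinned = [l for l in run if l.strip() in PINNED]
        rest = sorted(l for l in run if l.strip() not in PINNED)
        out.extend(pinned + rest)
        run.clear()

    for line in lines:
        if INCLUDE.fullmatch(line):
            run.append(line)
        else:
            flush()          # non-include line ends the current run
            out.append(line)
    flush()
    return out
```

A driver script would read each .c file, apply sort_includes, and rewrite the file in place when the result differs.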
Re: Stack overflow issue
On Thu, Mar 7, 2024 at 1:49 AM Tom Lane wrote:
>
> Alexander Korotkov writes:
> > Sorry for tediousness, but isn't pre-order a variation of depth-first order
> > [1]?
>
> To me, depth-first implies visiting children before parents.
> Do I have the terminology wrong?

According to Wikipedia, depth-first is a general term describing the tree traversal algorithm, which goes as deep as possible in one branch before visiting other branches. The order between parents and children, and between siblings, specifies the variation of depth-first search, and pre-order is one of them. But "pre-order" is the most accurate term for MemoryContextTraverseNext() anyway.

--
Regards, Alexander Korotkov
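To illustrate the terminology at issue: both functions below are depth-first traversals; the one Tom describes (children before parents) is post-order, while pre-order visits a parent before its children. A minimal sketch, with a dict-based tree standing in for the memory-context hierarchy:

```python
def preorder(node, visit):
    """Depth-first, parent visited before its children (pre-order)."""
    visit(node["name"])
    for child in node.get("children", []):
        preorder(child, visit)


def postorder(node, visit):
    """Depth-first, children visited before their parent (post-order)."""
    for child in node.get("children", []):
        postorder(child, visit)
    visit(node["name"])
```

Both go as deep as possible down one branch before moving to a sibling branch; only the parent/child visit order differs.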
Re: Spurious pgstat_drop_replslot() call
On Fri, Mar 08, 2024 at 10:19:11AM +, Bertrand Drouvot wrote:
> Indeed, it does not seem appropriate to remove stats during slot invalidation, as
> one could still be interested in looking at them.

Yeah, my take is that this can still be interesting to know, at least for debugging. This would limit the stats to be dropped when the slot is dropped, and that looks like a sound idea.

> This spurious call has been introduced in be87200efd. I think that initially we
> designed to drop slots when a recovery conflict occurred during logical decoding
> on a standby. But we changed our mind to invalidate such a slot instead.
>
> The spurious call is probably due to the initial design but has not been removed.

This is not a subject that has really been touched on the original thread mentioned on the commit, so it is a bit hard to be sure. The only comment is that a previous version of the patch did the stats drop in the slot invalidation path at an incorrect location.

> I don't think it's worth adding a test but can do so if one feels the need.

Well, that would not be complicated while being cheap, no? We have a few paths in the 035 test where we know that a slot has been invalidated, so it is just a matter of querying pg_stat_replication_slots once and seeing if some data is still there.
--
Michael

signature.asc
Description: PGP signature
Re: Missing LWLock protection in pgstat_reset_replslot()
On Fri, Mar 08, 2024 at 10:26:21AM +, Bertrand Drouvot wrote: > Yeah, good point: I'll create a dedicated patch for that. Sounds good to me. > Note that currently pgstat_drop_replslot() would not satisfy this new Assert > when being called from InvalidatePossiblyObsoleteSlot(). I think this call > should be removed and created a dedicated thread for that [1]. > > [1]: > https://www.postgresql.org/message-id/ZermH08Eq6YydHpO%40ip-10-97-1-34.eu-west-3.compute.internal Thanks. -- Michael signature.asc Description: PGP signature
Re: New Table Access Methods for Multi and Single Inserts
On Sat, Mar 2, 2024 at 12:02 PM Bharath Rupireddy wrote: > > On Mon, Jan 29, 2024 at 5:16 PM Bharath Rupireddy > wrote: > > > > > Please find the attached v9 patch set. > > I've had to rebase the patches due to commit 874d817, please find the > attached v11 patch set. Rebase needed. Please see the v12 patch set. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com From 8a3552e65e62afc40db99fbd7bf4f98990d45390 Mon Sep 17 00:00:00 2001 From: Bharath Rupireddy Date: Fri, 8 Mar 2024 10:11:17 + Subject: [PATCH v12 1/4] New TAMs for inserts --- src/backend/access/heap/heapam.c | 224 +++ src/backend/access/heap/heapam_handler.c | 9 + src/include/access/heapam.h | 49 + src/include/access/tableam.h | 138 ++ 4 files changed, 420 insertions(+) diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index 34bc60f625..497940d74a 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -64,6 +64,7 @@ #include "storage/standby.h" #include "utils/datum.h" #include "utils/inval.h" +#include "utils/memutils.h" #include "utils/relcache.h" #include "utils/snapmgr.h" #include "utils/spccache.h" @@ -2442,6 +2443,229 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples, pgstat_count_heap_insert(relation, ntuples); } +/* + * Initialize state required for an insert a single tuple or multiple tuples + * into a heap. 
+ */ +TableInsertState * +heap_insert_begin(Relation rel, CommandId cid, int am_flags, int insert_flags) +{ + TableInsertState *tistate; + + tistate = palloc0(sizeof(TableInsertState)); + tistate->rel = rel; + tistate->cid = cid; + tistate->am_flags = am_flags; + tistate->insert_flags = insert_flags; + + if ((am_flags & TABLEAM_MULTI_INSERTS) != 0 || + (am_flags & TABLEAM_BULKWRITE_BUFFER_ACCESS_STRATEGY)) + tistate->am_data = palloc0(sizeof(HeapInsertState)); + + if ((am_flags & TABLEAM_MULTI_INSERTS) != 0) + { + HeapMultiInsertState *mistate; + + mistate = palloc0(sizeof(HeapMultiInsertState)); + mistate->slots = palloc0(sizeof(TupleTableSlot *) * HEAP_MAX_BUFFERED_SLOTS); + + mistate->context = AllocSetContextCreate(CurrentMemoryContext, + "heap_multi_insert_v2 memory context", + ALLOCSET_DEFAULT_SIZES); + + ((HeapInsertState *) tistate->am_data)->mistate = mistate; + } + + if ((am_flags & TABLEAM_BULKWRITE_BUFFER_ACCESS_STRATEGY) != 0) + ((HeapInsertState *) tistate->am_data)->bistate = GetBulkInsertState(); + + return tistate; +} + +/* + * Insert a single tuple into a heap. + */ +void +heap_insert_v2(TableInsertState * state, TupleTableSlot *slot) +{ + bool shouldFree = true; + HeapTuple tuple = ExecFetchSlotHeapTuple(slot, true, ); + BulkInsertState bistate = NULL; + + Assert(state->am_data != NULL && + ((HeapInsertState *) state->am_data)->mistate == NULL); + + /* Update tuple with table oid */ + slot->tts_tableOid = RelationGetRelid(state->rel); + tuple->t_tableOid = slot->tts_tableOid; + + if (state->am_data != NULL && + ((HeapInsertState *) state->am_data)->bistate != NULL) + bistate = ((HeapInsertState *) state->am_data)->bistate; + + /* Perform insertion, and copy the resulting ItemPointer */ + heap_insert(state->rel, tuple, state->cid, state->insert_flags, +bistate); + ItemPointerCopy(>t_self, >tts_tid); + + if (shouldFree) + pfree(tuple); +} + +/* + * Create/return next free slot from multi-insert buffered slots array. 
+ */ +TupleTableSlot * +heap_multi_insert_next_free_slot(TableInsertState * state) +{ + TupleTableSlot *slot; + HeapMultiInsertState *mistate; + + Assert(state->am_data != NULL && + ((HeapInsertState *) state->am_data)->mistate != NULL); + + mistate = ((HeapInsertState *) state->am_data)->mistate; + slot = mistate->slots[mistate->cur_slots]; + + if (slot == NULL) + { + slot = table_slot_create(state->rel, NULL); + mistate->slots[mistate->cur_slots] = slot; + } + else + ExecClearTuple(slot); + + return slot; +} + +/* + * Store passed-in tuple into in-memory buffered slots. When full, insert + * multiple tuples from the buffers into heap. + */ +void +heap_multi_insert_v2(TableInsertState * state, TupleTableSlot *slot) +{ + TupleTableSlot *dstslot; + HeapMultiInsertState *mistate; + + Assert(state->am_data != NULL && + ((HeapInsertState *) state->am_data)->mistate != NULL); + + mistate = ((HeapInsertState *) state->am_data)->mistate; + dstslot = mistate->slots[mistate->cur_slots]; + + if (dstslot == NULL) + { + dstslot = table_slot_create(state->rel, NULL); + mistate->slots[mistate->cur_slots] = dstslot; + } + + /* + * Caller may have got the slot using heap_multi_insert_next_free_slot, + * filled it and passed. So, skip copying in such a case. + */ + if ((state->am_flags & TABLEAM_SKIP_MULTI_INSERTS_FLUSH) == 0) + { + ExecClearTuple(dstslot); + ExecCopySlot(dstslot, slot); + } + else +
Re: Add SPLIT PARTITION/MERGE PARTITIONS commands
> On 26 Jan 2024, at 23:36, Dmitry Koval wrote: > > The CF entry was in Ready for Committer state not so long ago. Stephane, you might want to review the recent version after it was rebased on current HEAD. CFbot's tests passed successfully. Thanks! Best regards, Andrey Borodin.
Re: Missing LWLock protection in pgstat_reset_replslot()
Hi,

On Thu, Mar 07, 2024 at 02:17:53PM +0900, Michael Paquier wrote:
> On Wed, Mar 06, 2024 at 09:05:59AM +, Bertrand Drouvot wrote:
> > Yeah, all of the above done in v3 attached.
>
> In passing.. pgstat_create_replslot() and pgstat_drop_replslot() rely
> on the assumption that the LWLock ReplicationSlotAllocationLock is
> taken while calling these routines. Perhaps that would be worth some
> extra Assert(LWLockHeldByMeInMode()) in pgstat_replslot.c for these
> two? Not directly related to this problem.

Yeah, good point: I'll create a dedicated patch for that.

Note that currently pgstat_drop_replslot() would not satisfy this new Assert when being called from InvalidatePossiblyObsoleteSlot(). I think this call should be removed, and I have created a dedicated thread for that [1].

[1]: https://www.postgresql.org/message-id/ZermH08Eq6YydHpO%40ip-10-97-1-34.eu-west-3.compute.internal

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Re: a wrong index choose when statistics is out of date
On 7/3/2024 17:32, David Rowley wrote:
> On Thu, 7 Mar 2024 at 21:17, Andrei Lepikhov wrote:
>> I would like to ask David why the var_eq_const estimator doesn't have
>> an option for estimation with a histogram. Having that would relieve
>> a problem with skewed data. Detecting that an incoming const lies
>> outside the covered area would allow us to fall back to ndistinct
>> estimation or something else. At least, histogram usage could be
>> restricted by the reltuples value and the ratio between the total
>> number of MCV values and the total number of distinct values in the
>> table.
>
> If you can think of a way to calculate it, you should propose a patch.
> IIRC, we try to make the histogram buckets evenly sized based on the
> number of occurrences. I've not followed the code in detail, but I'd
> guess that doing that allows us to just subtract off the MCV
> frequencies and assume the remainder is evenly split over each
> histogram bucket, so unless we had an n_distinct per histogram bucket,
> or at the very least n_distinct_for_histogram_values, then how would
> the calculation look for what we currently record?

Yeah, it is my mistake; I see nothing special here with that kind of histogram: with a coarse histogram net, the level of uncertainty within one bin is too high to make a better estimation.

I am just pondering how to detect situations where the constant being estimated lies outside the scope of the statistics, so that we could apply alternative, more expensive logic involving the number of index pages beyond the boundary, the index tuple width, and the number of distinct values. The left and right boundaries of the histogram are suitable detectors for such a situation.

--
regards, Andrei Lepikhov
Postgres Professional
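The detection discussed above (noticing that a constant falls outside the histogram's left/right boundaries and switching to a different estimate) can be sketched roughly as follows. This is a toy illustration, not PostgreSQL's var_eq_const; the names and the 1/ndistinct fallback are assumptions for the sake of the example:

```python
def estimate_eq_selectivity(const, mcv, histogram_bounds, ndistinct):
    """Toy equality-selectivity estimate.

    mcv: dict mapping value -> frequency; histogram_bounds: sorted bin
    edges covering the non-MCV values. If const lies outside the
    histogram's boundaries, treat the statistics as stale and fall back
    to a flat 1/ndistinct estimate instead of near-zero.
    """
    if const in mcv:
        return mcv[const]
    if histogram_bounds and (const < histogram_bounds[0]
                             or const > histogram_bounds[-1]):
        # out of the covered area: fall back to ndistinct estimation
        return 1.0 / ndistinct
    # otherwise spread the non-MCV mass evenly over the remaining values
    mcv_total = sum(mcv.values())
    rest = max(ndistinct - len(mcv), 1)
    return (1.0 - mcv_total) / rest
```

The point of the sketch is only the branch structure: the histogram's endpoints give a cheap signal that the incoming const is outside what the statistics cover.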
Spurious pgstat_drop_replslot() call
Hi hackers,

Please find attached a tiny patch to remove a $SUBJECT.

Indeed, it does not seem appropriate to remove stats during slot invalidation, as one could still be interested in looking at them.

This spurious call was introduced in be87200efd. I think that initially we intended to drop slots when a recovery conflict occurred during logical decoding on a standby, but we changed our mind to invalidate such a slot instead. The spurious call is probably a leftover from that initial design that was never removed.

I don't think it's worth adding a test, but I can do so if one feels the need.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

>From 56ade7f561c180ee0120cb8c33d1c39e32ab7863 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot
Date: Fri, 8 Mar 2024 09:29:29 +
Subject: [PATCH v1] Remove spurious pgstat_drop_replslot() call

There is no need to remove stats for an invalidated slot as one could
still be interested to look at them.
---
 src/backend/replication/slot.c | 1 -
 1 file changed, 1 deletion(-)
 100.0% src/backend/replication/

diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index b8bf98b182..91ca397857 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1726,7 +1726,6 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlotInvalidationCause cause,
 			ReplicationSlotMarkDirty();
 			ReplicationSlotSave();
 			ReplicationSlotRelease();
-			pgstat_drop_replslot(s);
 
 			ReportSlotInvalidation(conflict, false, active_pid,
 								   slotname, restart_lsn,
-- 
2.34.1
Re: Proposal to add page headers to SLRU pages
Greetings, * Li, Yong (y...@ebay.com) wrote: > > On Mar 7, 2024, at 03:09, Stephen Frost wrote: > > * Alvaro Herrera (alvhe...@alvh.no-ip.org) wrote: > >> I suppose this is important to do if we ever want to move SLRUs into > >> shared buffers. However, I wonder about the extra time this adds to > >> pg_upgrade. Is this something we should be concerned about? Is there > >> any measurement/estimates to tell us how long this would be? Right now, > >> if you use a cloning strategy for the data files, the upgrade should be > >> pretty quick ... but the amount of data in pg_xact and pg_multixact > >> could be massive, and the rewrite is likely to take considerable time. > > > > While I definitely agree that there should be some consideration of > > this concern, it feels on-par with the visibility-map rewrite which was > > done previously. Larger systems will likely have more to deal with than > > smaller systems, but it's still a relatively small portion of the data > > overall. > > > > The benefit of this change, beyond just the possibility of moving them > > into shared buffers some day in the future, is that this would mean that > > SLRUs will have checksums (if the cluster has them enabled). That > > benefit strikes me as well worth the cost of the rewrite taking some > > time and the minor loss of space due to the page header. > > > > Would it be useful to consider parallelizing this work? There's already > > parts of pg_upgrade which can be parallelized and so this isn't, > > hopefully, a big lift to add, but I'm not sure if there's enough work > > being done here CPU-wise, compared to the amount of IO being done, to > > have it make sense to run it in parallel. Might be worth looking into > > though, at least, as disks have gotten to be quite fast. > > Thank Alvaro and Stephen for your thoughtful comments. > > I did a quick benchmark regarding pg_upgrade time, and here are the results. 
> For clog, 2048 segments can host about 2 billion transactions, right at the
> limit for wraparound. That’s the maximum we can have. 2048 segments are also
> big for the pg_multixact SLRUs.
>
> Therefore, on modern hardware, in the worst case, pg_upgrade will run for 7
> seconds longer.

Thanks for testing! That strikes me as perfectly reasonable, and it seems unlikely that we'd get much benefit from parallelizing it, so I'd say it makes sense to keep this code simple.

Thanks again!

Stephen

signature.asc
Description: PGP signature
Re: Invalid query generated by postgres_fdw with UNION ALL and ORDER BY
On Fri, Mar 8, 2024 at 10:13 AM David Rowley wrote: > The fix could also be to use deparseConst() in appendOrderByClause() > and have that handle Const EquivalenceMember instead. I'd rather just > skip them. To me, that seems less risky than ensuring deparseConst() > handles all Const types correctly. I've looked at this patch a bit. I once wondered why we don't check pathkey->pk_eclass->ec_has_const with EC_MUST_BE_REDUNDANT to see if the pathkey is not needed. Then I realized that a child member would not be marked as constant even if the child expr is a Const, as explained in add_child_rel_equivalences(). BTW, I wonder if it is possible that we have a pseudoconstant expression that is not of type Const. In such cases we would need to check 'bms_is_empty(pull_varnos(em_expr))' instead of 'IsA(em_expr, Const)'. However, I'm unable to think of such an expression in this context. The patch looks good to me otherwise. Thanks Richard
Re: Add new error_action COPY ON_ERROR "log"
On Thu, Mar 7, 2024 at 12:54 PM Michael Paquier wrote:
>
> On Thu, Mar 07, 2024 at 12:48:12PM +0530, Bharath Rupireddy wrote:
> > I'm okay with it. But, help me understand it better. We want the
> > 'log_verbosity' clause to have options 'default' and 'verbose', right?
> > And, later it can also be extended to contain all the LOG levels like
> > 'notice', 'error', 'info', 'debugX' etc. depending on the need,
> > right?
>
> You could, or names that have some status like row_details, etc.
>
> > One more thing, how does it sound using both verbosity and verbose in
> > log_verbosity verbose something like below? Is this okay?
>
> There's some history with this pattern in psql at least with \set
> VERBOSITY verbose. For the patch, I would tend to choose these two,
> but that's as far as my opinion goes and I am OK if other ideas gather
> more votes.

Please see the attached v5-0001 patch implementing LOG_VERBOSITY with options 'default' and 'verbose'. v5-0002 adds the detailed info to the ON_ERROR 'ignore' option.

We have a CF entry https://commitfest.postgresql.org/47/4798/ for the original idea proposed in this thread, that is, to have the ON_ERROR 'log' option. I'll probably start a new thread and add a new CF entry in the next commitfest if there's no objection from anyone.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

v5-0001-Add-LOG_VERBOSITY-option-to-COPY-command.patch
Description: Binary data

v5-0002-Add-detailed-info-when-COPY-skips-soft-errors.patch
Description: Binary data
Call perror on popen failure
If popen fails in pipe_read_line we invoke perror for the error message, and pipe_read_line is in turn called by find_other_exec which is used in both frontend and backend code. This is an old codepath, and it seems like it ought to be replaced with the more common logging tools we now have like in the attached diff (which also makes the error translated as opposed to now). Any objections to removing this last perror() call? -- Daniel Gustafsson no_perror.diff Description: Binary data
Re: speed up a logical replica setup
> On 8 Mar 2024, at 12:03, Shlok Kyal wrote:

I haven't dug into the thread, but the recent version fails some of CFbot's tests.

http://commitfest.cputube.org/euler-taveira.html
https://cirrus-ci.com/task/4833499115421696

==29928==ERROR: AddressSanitizer: heap-use-after-free on address 0x61a01458 at pc 0x7f3b29fdedce bp 0x7ffe68fcf1c0 sp 0x7ffe68fcf1b8

Thanks!

Best regards, Andrey Borodin.
Re: Support a wildcard in backtrace_functions
On 2024-Mar-08, Bharath Rupireddy wrote:

> This works only if '*' is specified as the only character in
> backtrace_functions = '*', right? If yes, what if someone sets
> backtrace_functions = 'foo, bar, *, baz'?

It throws an error, as expected. This is a useless waste of resources: checking for "foo" and "bar" is pointless, since the * is going to give a positive match anyway. And the "baz" is a waste of memory which is never going to be checked.

--
Álvaro Herrera               Breisgau, Deutschland — https://www.EnterpriseDB.com/
"I love the Postgres community. It's all about doing things _properly_. :-)" (David Garamond)
Re: a wrong index choose when statistics is out of date
David Rowley writes:
> On Thu, 7 Mar 2024 at 23:40, Andy Fan wrote:
>>
>> David Rowley writes:
>> > If you don't want the planner to use the statistics for the column why
>> > not just do the following?
>>
>> Actually I didn't want the planner to ignore the statistics totally; I
>> want the planner to treat the "Const" that the statistics probably miss
>> as an average value, which is just like what we did for the generic plan
>> for the query below.
>
> I'm with Andrei on this one and agree with his "And it is just luck
> that you've got the right answer".

Do you have any example to support this conclusion? And what new problem would this cause?

--
Best Regards
Andy Fan
RE: speed up a logical replica setup
Dear Tomas, Euler,

Thanks for starting to read the thread! Since I'm not an original author, I want to reply partially.

> I decided to take a quick look on this patch today, to see how it works
> and do some simple tests. I've only started to get familiar with it, so
> I have only some comments / questions regarding usage, not on the code.
> It's quite possible I didn't understand some finer points, or maybe it
> was already discussed earlier in this very long thread, so please feel
> free to push back or point me to the past discussion.
>
> Also, some of this is rather opinionated, but considering I didn't see
> this patch before, my opinions may easily be wrong ...

I felt your comments were quite valuable.

> 1) SGML docs
>
> It seems the SGML docs are more about explaining how this works on the
> inside, rather than how to use the tool. Maybe that's intentional, but
> as someone who didn't work with pg_createsubscriber before I found it
> confusing and not very helpful.
>
> For example, the first half of the page is prerequisites+warning, and
> sure those are useful details, but prerequisites are checked by the
> tool (so I can't really miss this) and warnings go into a lot of details
> about different places where things may go wrong. Sure, worth knowing
> and including in the docs, but maybe not right at the beginning, before
> I learn how to even run the tool?

Hmm, right. I considered the improvements below. Tomas and Euler, what do you think?

* Add more descriptions in the "Description" section.
* Move the prerequisites+warning to the "Notes" section.
* Add a "Usage" section which describes starting from a single node.

> I'm not sure FOR ALL TABLES is a good idea. Or said differently, I'm
> sure it won't work for a number of use cases. I know in large databases
> it's common to create "work tables" (not necessarily temporary) as part
> of a batch job, but there's no need to replicate those tables. 
Indeed, the documentation does not describe that all tables in the database would be included in the publication.

> I do understand that FOR ALL TABLES is the simplest approach, and for v1
> it may be an acceptable limitation, but maybe it'd be good to also
> support restricting which tables should be replicated (e.g. blacklist or
> whitelist based on table/schema name?).

This may not be directly related, but we considered accepting such options as a next step [1].

> Note: I now realize this might fall under the warning about DDL, which
> says this:
>
>     Executing DDL commands on the source server while running
>     pg_createsubscriber is not recommended. If the target server has
>     already been converted to logical replica, the DDL commands must
>     not be replicated so an error would occur.

Yeah, they would not be replicated, but that would not lead to an ERROR. So should we say something like "Creating tables on the source server..."?

> 5) slot / publication / subscription name
>
> I find it somewhat annoying it's not possible to specify names for
> objects created by the tool - replication slots, publication and
> subscriptions. If this is meant to be a replica running for a while,
> after a while I'll have no idea what pg_createsubscriber_569853 or
> pg_createsubscriber_459548_2348239 was meant for.
>
> This is particularly annoying because renaming these objects later is
> either not supported at all (e.g. for replication slots), or may be
> quite difficult (e.g. publications).
>
> I do realize there are challenges with custom names (say, if there are
> multiple databases to replicate), but can't we support some simple
> formatting with basic placeholders? So we could specify
>
>     --slot-name "myslot_%d_%p"
>
> or something like that?

Not sure we can do this in the first version, but it looks nice. One concern is that I cannot find applications which accept escape strings like log_line_prefix. (It may just be because we do not have a use case.) Do you know of examples? 
> BTW what will happen if we convert multiple standbys? Can't they all get
> the same slot name (they all have the same database OID, and I'm not
> sure how much entropy the PID has)?

I tested this, and the second try did not work. The primary reason was the publication name, pg_createsubscriber_%u (built from the database OID). FYI, previously we could reuse the same publications, but based on my comment [2] that behavior was removed. It might have been too optimistic.

[1]: https://www.postgresql.org/message-id/TY3PR01MB9889CCBD4D9DAF8BD2F18541F56F2%40TY3PR01MB9889.jpnprd01.prod.outlook.com
[2]: https://www.postgresql.org/message-id/TYCPR01MB12077756323B79042F29DDAEDF54C2%40TYCPR01MB12077.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/
Re: vacuumdb/clusterdb/reindexdb: allow specifying objects to process in all databases
On Wed, 6 Mar 2024 at 22:22, Nathan Bossart wrote:
>
> Thanks for taking a look. I updated the synopsis sections in v3.

OK, that looks good. The vacuumdb synopsis in particular looks a lot better now that "-N | --exclude-schema" is on its own line, because it was hard to read previously, and easy to mistakenly think that -n could be combined with -N.

If I'm nitpicking, "[--verbose | -v]" in the clusterdb synopsis should be replaced with "[option...]", like the other commands, because there are other general-purpose options like --quiet and --echo.

> I also spent some more time on the reindexdb patch (0003). I previously
> had decided to restrict combinations of tables, schemas, and indexes
> because I felt it was "ambiguous and inconsistent with vacuumdb," but
> looking closer, I think that's the wrong move. reindexdb already supports
> such combinations, which it interprets to mean it should reindex each
> listed object. So, I removed that change in v3.

Makes sense.

> Even though reindexdb allows combinations of tables, schema, and indexes,
> it doesn't allow combinations of "system catalogs" and other objects, and
> it's not clear why. In v3, I've removed this restriction, which ended up
> simplifying the 0003 patch a bit. Like combinations of tables, schemas,
> and indexes, reindexdb will now interpret combinations that include
> --system to mean it should reindex each listed object as well as the system
> catalogs.

OK, that looks useful, especially given that most people will still probably use this against a single database, and it's making that more flexible. I think this is good to go.

Regards,
Dean
Re: meson: Specify -Wformat as a common warning flag for extensions
Hi,

In "Re: meson: Specify -Wformat as a common warning flag for extensions" on Fri, 8 Mar 2024 15:32:22 +0900, Michael Paquier wrote:
> Are there version and/or
> environment requirements to be aware of?

I'm using Debian GNU/Linux sid and I can reproduce this with gcc 8-13:

$ for x in {8..13}; do; echo gcc-${x}; gcc-${x} -Wformat-security -E - < /dev/null > /dev/null; done
gcc-8
cc1: warning: -Wformat-security ignored without -Wformat [-Wformat-security]
gcc-9
cc1: warning: '-Wformat-security' ignored without '-Wformat' [-Wformat-security]
gcc-10
cc1: warning: '-Wformat-security' ignored without '-Wformat' [-Wformat-security]
gcc-11
cc1: warning: '-Wformat-security' ignored without '-Wformat' [-Wformat-security]
gcc-12
cc1: warning: '-Wformat-security' ignored without '-Wformat' [-Wformat-security]
gcc-13
cc1: warning: '-Wformat-security' ignored without '-Wformat' [-Wformat-security]
$

I tried this on Ubuntu 22.04 too, but it isn't reproduced there:

$ gcc-11 -Wformat-security -E - < /dev/null > /dev/null
$

It seems that Ubuntu enables -Wformat by default:

$ gcc-11 -Wno-format -Wformat-security -E - < /dev/null > /dev/null
cc1: warning: '-Wformat-security' ignored without '-Wformat' [-Wformat-security]

I tried this on AlmaLinux 9 too, and it is reproduced there:

$ gcc --version
gcc (GCC) 11.4.1 20230605 (Red Hat 11.4.1-2)
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

$ gcc -Wformat-security -E - < /dev/null > /dev/null
cc1: warning: '-Wformat-security' ignored without '-Wformat' [-Wformat-security]

> Forcing -Wformat implies more stuff that can be disabled with
> -Wno-format-contains-nul, -Wno-format-extra-args, and
> -Wno-format-zero-length, but the thing is that we're usually very
> conservative with such additions in the scripts.
> See also 8b6f5f25102f, done, I guess, as an answer to this thread:
> https://www.postgresql.org/message-id/4D431505.9010002%40dunslane.net

I think that this is not a problem, because the comment added by 8b6f5f25102f ("This was included in -Wall/-Wformat in older GCC versions") implies that we want to always use -Wformat-security, and -Wformat-security does not work without -Wformat:

https://gcc.gnu.org/onlinedocs/gcc/Warning-Options.html#index-Wformat-security

> If -Wformat is specified, also warn about uses of format
> functions that represent possible security problems.

Thanks,
-- 
kou
Re: MERGE ... RETURNING
Jeff Davis:
> To summarize, most of the problem has been in retrieving the action
> (INSERT/UPDATE/DELETE) taken or the WHEN-clause number applied to a
> particular matched row. The reason this is important is because the row
> returned is the old row for a DELETE action, and the new row for an
> INSERT or UPDATE action. Without a way to distinguish the particular
> action, the RETURNING clause returns a mixture of old and new rows,
> which would be hard to use sensibly.

It seems to me that all of this is only a problem because there is only one RETURNING clause. Dean Rasheed wrote in the very first post to this thread:

> I considered allowing a separate RETURNING list at the end of each
> action, but rapidly dismissed that idea. Firstly, it introduces
> shift/reduce conflicts to the grammar. These can be resolved by making
> the "AS" before column aliases non-optional, but that's pretty ugly, and
> there may be a better way. More serious drawbacks are that this syntax
> is much more cumbersome for the end user, having to repeat the RETURNING
> clause several times, and the implementation is likely to be pretty
> complex, so I didn't pursue it.

I can't judge the grammar and complexity issues, but as a potential user it seems to me less complex to have multiple RETURNING clauses, where I could inject my own constants about the specific actions, than to have to deal with any of the suggested functions / clauses. More repetitive, yes - but not more complex. More importantly, I could add RETURNING to only some of the actions, and not always all at the same time - which seems pretty useful to me.

Best,
Wolfgang
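A sketch of the per-action variant Wolfgang describes might look like this. The syntax is hypothetical (it is the rejected design, not anything implemented), and the table and column names are invented for illustration:

```sql
MERGE INTO target t
USING source s ON t.id = s.id
WHEN MATCHED THEN
    UPDATE SET val = s.val
    RETURNING 'update' AS action, t.*   -- hypothetical per-action clause
WHEN NOT MATCHED THEN
    INSERT (id, val) VALUES (s.id, s.val)
    RETURNING 'insert' AS action, t.*;
```

Each action carries its own caller-chosen constant, so old rows can be told apart from new ones without a dedicated function, and an action written without a RETURNING list would simply return nothing.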
Re: speed up a logical replica setup
On Thu, 7 Mar 2024 at 18:31, Shlok Kyal wrote:
>
> Hi,
>
> > Thanks for the feedback. I'm attaching v26 that addresses most of your
> > comments and some issues pointed out by Vignesh [1].
>
> I have created a top-up patch v27-0004. It contains additional test
> cases for pg_createsubscriber.
>
> Currently, two testcases (in the v27-0004 patch) are failing. These
> failures are related to running pg_createsubscriber on nodes in
> cascade physical replication and are already reported in [1] and [2].
> I think these cases should be fixed. Thoughts?

We can fix these issues; if we are not planning to fix any of them, we can add documentation for the same.

> The idea of this patch is to keep track of testcases, so that any
> future patch does not break any scenario which has already been worked
> on. These testcases can be used for testing in our development process,
> but which tests should actually be committed can be discussed later.
> Thoughts?

A few comments for v27-0004-Add-additional-testcases.patch:

1) We could use command_fails_like to verify the reason for the error:

+# set max_replication_slots
+$node_p->append_conf('postgresql.conf', 'max_replication_slots = 3');
+$node_p->restart;
+command_fails(
+	[
+		'pg_createsubscriber', '--verbose',
+		'--dry-run', '--pgdata',
+		$node_s->data_dir, '--publisher-server',
+		$node_p->connstr('pg1'), '--socket-directory',
+		$node_s->host, '--subscriber-port',
+		$node_s->port, '--database',
+		'pg1', '--database',
+		'pg2',
+	],
+	'max_replication_slots are less in number in publisher');

2) Add a comment saying what is being verified:

+# set max_replication_slots
+$node_p->append_conf('postgresql.conf', 'max_replication_slots = 3');
+$node_p->restart;
+command_fails(
+	[
+		'pg_createsubscriber', '--verbose',
+		'--dry-run', '--pgdata',
+		$node_s->data_dir, '--publisher-server',
+		$node_p->connstr('pg1'), '--socket-directory',
+		$node_s->host, '--subscriber-port',
+		$node_s->port, '--database',
+		'pg1', '--database',
+		'pg2',
+	],
+	'max_replication_slots are less in number in publisher');

3) We could rename this file to something like pg_create_subscriber_failure_cases, or something better:

 src/bin/pg_basebackup/t/041_tests.pl | 285 +++
 1 file changed, 285 insertions(+)
 create mode 100644 src/bin/pg_basebackup/t/041_tests.pl

diff --git a/src/bin/pg_basebackup/t/041_tests.pl b/src/bin/pg_basebackup/t/041_tests.pl
new file mode 100644
index 00..2889d60d54
--- /dev/null
+++ b/src/bin/pg_basebackup/t/041_tests.pl
@@ -0,0 +1,285 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group

4) This success case is not required, as this would have already been covered in 040_pg_createsubscriber.pl:

+$node_p->append_conf('postgresql.conf', 'max_replication_slots = 4');
+$node_p->restart;
+command_ok(
+	[
+		'pg_createsubscriber', '--verbose',
+		'--dry-run', '--pgdata',
+		$node_s->data_dir, '--publisher-server',
+		$node_p->connstr('pg1'), '--socket-directory',
+		$node_s->host, '--subscriber-port',
+		$node_s->port, '--database',
+		'pg1', '--database',
+		'pg2',
+	],
+	'max_replication_slots are accurate on publisher');

5) We could use command_fails_like to verify the reason for the error:

$node_s->append_conf('postgresql.conf', 'max_replication_slots = 1');
$node_s->restart;
command_fails(
	[
		'pg_createsubscriber', '--verbose',
		'--dry-run', '--pgdata',
		$node_s->data_dir, '--publisher-server',
		$node_p->connstr('pg1'), '--socket-directory',
		$node_s->host, '--subscriber-port',
		$node_s->port, '--database',
		'pg1', '--database',
		'pg2',
	],
	'max_replication_slots are less in number in subscriber');

6) Add a comment saying what is being verified:

$node_s->append_conf('postgresql.conf', 'max_replication_slots = 1');
$node_s->restart;
command_fails(
	[
		'pg_createsubscriber', '--verbose',
		'--dry-run', '--pgdata',
		$node_s->data_dir, '--publisher-server',
		$node_p->connstr('pg1'), '--socket-directory',
		$node_s->host, '--subscriber-port',
		$node_s->port, '--database',
		'pg1', '--database',
		'pg2',
	],
	'max_replication_slots are less in number in subscriber');

7) This success case is not required, as this would have already been covered in 040_pg_createsubscriber.pl:

$node_s->append_conf('postgresql.conf', 'max_replication_slots = 2');
$node_s->restart;
command_ok(
	[
		'pg_createsubscriber', '--verbose',
		'--dry-run', '--pgdata',
		$node_s->data_dir, '--publisher-server',
		$node_p->connstr('pg1'), '--socket-directory',
		$node_s->host, '--subscriber-port',
		$node_s->port, '--database',
		'pg1', '--database',
		'pg2',
	],
	'max_replication_slots are less in number in subscriber');

8) We could use
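For reference, the command_fails_like variant suggested above for cases (1) and (5) might look like this; the error-message pattern is illustrative only and should be taken from pg_createsubscriber's actual output:

```perl
command_fails_like(
	[
		'pg_createsubscriber', '--verbose',
		'--dry-run', '--pgdata',
		$node_s->data_dir, '--publisher-server',
		$node_p->connstr('pg1'), '--socket-directory',
		$node_s->host, '--subscriber-port',
		$node_s->port, '--database',
		'pg1', '--database',
		'pg2',
	],
	qr/replication slots/,
	'fails when the publisher has too few max_replication_slots');
```

Matching on the message rather than just the exit status guards against the test passing for an unrelated failure.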
Re: Failures in constraints regression test, "read only 0 of 8192 bytes"
Happened again. I see this is OpenSUSE. Does that mean the file system is Btrfs?
Re: [17] CREATE SUBSCRIPTION ... SERVER
On Wed, 2024-01-31 at 11:10 +0530, Ashutosh Bapat wrote:
> > I like the idea -- it further decouples the logic from the core
> > server. I suspect it will make postgres_fdw the primary way (though
> > not the only possible way) to use this feature. There would be little
> > need to create a new builtin FDW to make this work.
>
> That's what I see as well. I am glad that we are on the same page.

Implemented in v11, attached. Is this what you had in mind? It leaves a lot of the work to postgres_fdw, and it's almost unusable without postgres_fdw. That's not a bad thing, but it makes the core functionality a bit harder to test standalone. I can work on the core tests some more. The postgres_fdw tests passed without modification, though, and offer a simple example of how to use it.

Regards,
Jeff Davis

From 88fa1333ace4d15d72534d20d2cccb37748277f2 Mon Sep 17 00:00:00 2001
From: Jeff Davis
Date: Tue, 2 Jan 2024 13:42:48 -0800
Subject: [PATCH v11] CREATE SUBSCRIPTION ... SERVER.

Allow specifying a foreign server for CREATE SUBSCRIPTION, rather than a
raw connection string with CONNECTION.

Using a foreign server as a layer of indirection improves management of
multiple subscriptions to the same server. It also provides integration
with user mappings in case different subscriptions have different owners
or a subscription changes owners.
Discussion: https://postgr.es/m/61831790a0a937038f78ce09f8dd4cef7de7456a.ca...@j-davis.com
Reviewed-by: Ashutosh Bapat
---
 contrib/postgres_fdw/Makefile                 |   4 +-
 contrib/postgres_fdw/connection.c             |  74
 .../postgres_fdw/expected/postgres_fdw.out    |   8 +
 contrib/postgres_fdw/meson.build              |   6 +
 .../postgres_fdw/postgres_fdw--1.1--1.2.sql   |  11 ++
 contrib/postgres_fdw/postgres_fdw.control     |   2 +-
 contrib/postgres_fdw/sql/postgres_fdw.sql     |   7 +
 contrib/postgres_fdw/t/010_subscription.pl    |  71
 doc/src/sgml/ref/alter_subscription.sgml      |  18 +-
 doc/src/sgml/ref/create_subscription.sgml     |  11 +-
 src/backend/catalog/pg_subscription.c         |  38 +++-
 src/backend/commands/foreigncmds.c            |  57 +-
 src/backend/commands/subscriptioncmds.c       | 167 --
 src/backend/foreign/foreign.c                 |  42 +
 src/backend/parser/gram.y                     |  22 +++
 src/backend/replication/logical/worker.c      |  16 +-
 src/bin/pg_dump/pg_dump.c                     |  27 ++-
 src/bin/pg_dump/pg_dump.h                     |   1 +
 src/bin/psql/tab-complete.c                   |   2 +-
 src/include/catalog/pg_foreign_data_wrapper.h |   3 +
 src/include/catalog/pg_subscription.h         |   7 +-
 src/include/foreign/foreign.h                 |   3 +
 src/include/nodes/parsenodes.h                |   3 +
 src/test/regress/expected/oidjoins.out        |   1 +
 24 files changed, 563 insertions(+), 38 deletions(-)
 create mode 100644 contrib/postgres_fdw/postgres_fdw--1.1--1.2.sql
 create mode 100644 contrib/postgres_fdw/t/010_subscription.pl

diff --git a/contrib/postgres_fdw/Makefile b/contrib/postgres_fdw/Makefile
index c1b0cad453..995a30c297 100644
--- a/contrib/postgres_fdw/Makefile
+++ b/contrib/postgres_fdw/Makefile
@@ -14,10 +14,12 @@ PG_CPPFLAGS = -I$(libpq_srcdir)
 SHLIB_LINK_INTERNAL = $(libpq)
 
 EXTENSION = postgres_fdw
-DATA = postgres_fdw--1.0.sql postgres_fdw--1.0--1.1.sql
+DATA = postgres_fdw--1.0.sql postgres_fdw--1.0--1.1.sql postgres_fdw--1.1--1.2.sql
 
 REGRESS = postgres_fdw
 
+TAP_TESTS = 1
+
 ifdef USE_PGXS
 PG_CONFIG = pg_config
 PGXS := $(shell $(PG_CONFIG) --pgxs)
diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index 4931ebf591..a011e6df5f 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -113,6 +113,7 @@ static uint32 pgfdw_we_get_result = 0;
 PG_FUNCTION_INFO_V1(postgres_fdw_get_connections);
 PG_FUNCTION_INFO_V1(postgres_fdw_disconnect);
 PG_FUNCTION_INFO_V1(postgres_fdw_disconnect_all);
+PG_FUNCTION_INFO_V1(postgres_fdw_connection);
 
 /* prototypes of private functions */
 static void make_new_connection(ConnCacheEntry *entry, UserMapping *user);
@@ -1972,6 +1973,79 @@ pgfdw_finish_abort_cleanup(List *pending_entries, List *cancel_requested,
 	}
 }
 
+/*
+ * Values in connection strings must be enclosed in single quotes. Single
+ * quotes and backslashes must be escaped with backslash. NB: these rules are
+ * different from the rules for escaping a SQL literal.
+ */
+static void
+appendEscapedValue(StringInfo str, const char *val)
+{
+	appendStringInfoChar(str, '\'');
+	for (int i = 0; val[i] != '\0'; i++)
+	{
+		if (val[i] == '\\' || val[i] == '\'')
+			appendStringInfoChar(str, '\\');
+		appendStringInfoChar(str, val[i]);
+	}
+	appendStringInfoChar(str, '\'');
+}
+
+Datum
+postgres_fdw_connection(PG_FUNCTION_ARGS)
+{
+	/* TODO: consider memory usage */
+	Oid			userid = PG_GETARG_OID(0);
+	Oid			serverid = PG_GETARG_OID(1);
+	ForeignServer