Re: verify predefined LWLocks have entries in wait_event_names.txt
Hi,

On Fri, Jan 05, 2024 at 12:11:44AM -0600, Nathan Bossart wrote:
> On Wed, Jan 03, 2024 at 07:59:45AM +, Bertrand Drouvot wrote:
> > +1 to add a test and put in a place that would produce failures at build
> > time. I think that having the test in the script that generates the header
> > file is more appropriate (as building the documentation looks less usual
> > to me when working on a patch).
>
> Okay, I did that in v2.

Thanks!

> +# NB: Predefined locks (those declared in lwlocknames.txt) must be listed in
> +# the top section of locks and must be listed in the same order as in
> +# lwlocknames.txt.
> +#
>
> Section: ClassName - WaitEventLWLock
>
> @@ -326,6 +330,12 @@ NotifyQueueTail "Waiting to update limit on NOTIFY message st
> WaitEventExtension "Waiting to read or update custom wait events information for extensions."
> WALSummarizer "Waiting to read or update WAL summarization state."
>
> +#
> +# Predefined LWLocks (those declared in lwlocknames.txt) must be listed in the
> +# section above and must be listed in the same order as in lwlocknames.txt.
> +# Other LWLocks must be listed in the section below.
> +#
> +

Another option could be to create a sub-section for predefined LWLocks that are
part of lwlocknames.txt and then sort both lists (the one in the sub-section
and the one in lwlocknames.txt). That would avoid the "must be listed in the
same order" constraint. That said, I think the way it is done in the patch is
fine, because if one does not follow the constraint then the build would fail.

I did a few tests leading to:

CommitBDTTsSLRALock defined in lwlocknames.txt but missing from wait_event_names.txt at ./generate-lwlocknames.pl line 107, <$lwlocknames> line 58.

OR

lists of predefined LWLocks in lwlocknames.txt and wait_event_names.txt do not match at ./generate-lwlocknames.pl line 109, <$lwlocknames> line 46.

OR

CommitBDTTsSLRALock defined in wait_event_names.txt but missing from lwlocknames.txt at ./generate-lwlocknames.pl line 126, <$lwlocknames> line 57.

So, that looks good to me, except one remark regarding:

+ die "lists of predefined LWLocks in lwlocknames.txt and wait_event_names.txt do not match"
+   unless $wait_event_lwlocks[$i] eq $lockname;

What about printing $wait_event_lwlocks[$i] and $lockname in the error
message? Something like:

"
die "lists of predefined LWLocks in lwlocknames.txt and wait_event_names.txt do not match (comparing $lockname and $wait_event_lwlocks[$i])"
  unless $wait_event_lwlocks[$i] eq $lockname;
"

I think that would give more clues for debugging purposes.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
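The suggested check amounts to walking two ordered lock lists in lockstep and
failing loudly on the first mismatch, naming both sides. A minimal sketch of
that idea in Python (the actual build-time check is Perl inside
generate-lwlocknames.pl; names here are illustrative):

```python
def cross_check(lwlocknames, wait_event_lwlocks):
    """Verify two ordered predefined-LWLock lists match, naming the first mismatch."""
    for i, lockname in enumerate(lwlocknames):
        if i >= len(wait_event_lwlocks):
            raise SystemExit(f"{lockname} defined in lwlocknames.txt "
                             "but missing from wait_event_names.txt")
        if wait_event_lwlocks[i] != lockname:
            # Include both names, as suggested, for easier debugging.
            raise SystemExit("lists of predefined LWLocks do not match "
                             f"(comparing {lockname} and {wait_event_lwlocks[i]})")
    # Anything left over exists only in wait_event_names.txt.
    for extra in wait_event_lwlocks[len(lwlocknames):]:
        raise SystemExit(f"{extra} defined in wait_event_names.txt "
                         "but missing from lwlocknames.txt")
```

Because the check runs during the build, any divergence between the two files
fails immediately with a message that identifies the offending entry.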
Re: Adding facility for injection points (or probe points?) for more advanced tests
On Fri, Jan 5, 2024 at 12:49 PM Michael Paquier wrote: > > I mean why? We test a bunch of stuff in src/test/modules/, and this > is not intended to be released to the outside world. > > Putting that in contrib/ has a lot of extra cost. One is > documentation and more complexity regarding versioning when it comes > to upgrading it to a new version. I don't think that it is a good > idea to deal with this extra load of work for something that I'd aim > to be used for having improved *test* coverage, and the build switch > should stay. Saying that, I'd be OK with renaming the module to > injection_points, Ok. Thanks. > but I will fight hard about keeping that in > src/test/modules/. That's less maintenance headache to think about > when having to deal with complex racy bugs. For me getting this feature in code in a usable manner is more important than its place in the code. I have no plans to fight over it. :). -- Best Wishes, Ashutosh Bapat
Re: [17] CREATE SUBSCRIPTION ... SERVER
On Fri, Jan 5, 2024 at 6:26 AM Jeff Davis wrote: > > > 2. Can one use {FDW, user_mapping, foreign_server} combo other than > > the built-in pg_connection_fdw? > > Yes, you can use any FDW for which you have USAGE privileges, passes > the validations, and provides enough of the expected fields to form a > connection string. > > There was some discussion on this point already. Initially, I > implemented it with more catalog and grammar support, which improved > error checking, but others objected that the grammar wasn't worth it > and that it was too inflexible. See: > > https://www.postgresql.org/message-id/172273.1693403385%40sss.pgh.pa.us > https://www.postgresql.org/message-id/CAExHW5unvpDv6yMSmqurHP7Du1PqoJFWVxeK-4YNm5EnoNJiSQ%40mail.gmail.com > Can you please provide an example using postgres_fdw to create a subscription using this patch. I think we should document it in postgres_fdw and add a test for the same. -- Best Wishes, Ashutosh Bapat
Re: Adding facility for injection points (or probe points?) for more advanced tests
On Fri, Jan 05, 2024 at 12:41:33PM +0530, Ashutosh Bapat wrote: > Well, you have already showed that the SQL interface created for the > test module is being used for testing a core feature. The tests for > that should stay somewhere near the other tests for those features. > Using an extension named "test_injection_point" and which resides in a > test module for testing core features doesn't look great. Hence > suggestion to move it to contrib. I mean why? We test a bunch of stuff in src/test/modules/, and this is not intended to be released to the outside world. Putting that in contrib/ has a lot of extra cost. One is documentation and more complexity regarding versioning when it comes to upgrading it to a new version. I don't think that it is a good idea to deal with this extra load of work for something that I'd aim to be used for having improved *test* coverage, and the build switch should stay. Saying that, I'd be OK with renaming the module to injection_points, but I will fight hard about keeping that in src/test/modules/. That's less maintenance headache to think about when having to deal with complex racy bugs. -- Michael signature.asc Description: PGP signature
Re: Adding facility for injection points (or probe points?) for more advanced tests
On Fri, Jan 5, 2024 at 5:08 AM Michael Paquier wrote: > > >> I suggest we move test_injection_points from src/test/modules to > >> contrib/ and rename it as "injection_points". The test files may still > >> be named as test_injection_point. The TAP tests in 0003 and 0004 once > >> moved to their appropriate places, will load injection_point extension > >> and use it. That way predefined injection point callbacks will also be > >> available for others to use. > > > > Rather than defining a module somewhere that tests would need to load, > > should we just put the common callbacks in the core server? Unless there's > > a strong reason to define them elsewhere, that could be a nice way to save > > a step in the tests. > > Nah, having some pre-existing callbacks existing in the backend is > against the original minimalistic design spirit. These would also > require an SQL interface, and the interface design also depends on the > functions registering them when pushing down custom conditions. > Pushing that down to extensions to do what they want will lead to less > noise, particularly if you consider that we will most likely want to > tweak the callback interfaces for backpatched bugs. That's also why I > think contrib/ is not a good idea, src/test/modules/ serving the > actual testing purpose here. Well, you have already showed that the SQL interface created for the test module is being used for testing a core feature. The tests for that should stay somewhere near the other tests for those features. Using an extension named "test_injection_point" and which resides in a test module for testing core features doesn't look great. Hence suggestion to move it to contrib. -- Best Wishes, Ashutosh Bapat
Re: speed up a logical replica setup
On Thu, 4 Jan 2024 at 16:46, Amit Kapila wrote:
>
> On Thu, Jan 4, 2024 at 12:22 PM Shlok Kyal wrote:
> >
> > Hi,
> > I was testing the patch with following test cases:
> >
> > Test 1:
> > - Create a 'primary' node
> > - Setup physical replica using pg_basebackup
> >   "./pg_basebackup -h localhost -X stream -v -R -W -D ../standby"
> > - Insert data before and after pg_basebackup
> > - Run pg_subscriber and then insert some data to check logical replication
> >   "./pg_subscriber -D ../standby -S "host=localhost port=9000 dbname=postgres"
> >   -P "host=localhost port=9000 dbname=postgres" -d postgres"
> > - Also check pg_publication, pg_subscriber and pg_replication_slots tables.
> >
> > Observation:
> > Data is not lost. Replication is happening correctly. pg_subscriber is
> > working as expected.
> >
> > Test 2:
> > - Create a 'primary' node
> > - Use normal pg_basebackup but don't set up physical replication
> >   "./pg_basebackup -h localhost -v -W -D ../standby"
> > - Insert data before and after pg_basebackup
> > - Run pg_subscriber
> >
> > Observation:
> > pg_subscriber command is not completing and is stuck with following
> > log repeating:
> > LOG: waiting for WAL to become available at 0/3000168
> > LOG: invalid record length at 0/3000150: expected at least 24, got 0
> >
>
> I think probably the required WAL is not copied. Can you use the -X
> option to stream WAL as well and then test? But I feel in this case
> also, we should wait for some threshold time and then exit with
> failure, removing new objects created, if any.

I have tested with the -X stream option in pg_basebackup as well. In this case
also the pg_subscriber command is getting stuck.

logs:
2024-01-05 11:49:34.436 IST [61948] LOG: invalid resource manager ID 102 at 0/3000118
2024-01-05 11:49:34.436 IST [61948] LOG: waiting for WAL to become available at 0/3000130

> > Test 3:
> > - Create a 'primary' node
> > - Use normal pg_basebackup but don't set up physical replication
> >   "./pg_basebackup -h localhost -v -W -D ../standby"
> > - Insert data before pg_basebackup but not after pg_basebackup
> > - Run pg_subscriber
> >
> > Observation:
> > pg_subscriber command is not completing and is stuck with following
> > log repeating:
> > LOG: waiting for WAL to become available at 0/3000168
> > LOG: invalid record length at 0/3000150: expected at least 24, got 0
> >
>
> This is similar to the previous test and you can try the same option
> here as well.

For this test as well, I tried with the -X stream option in pg_basebackup. It
is getting stuck here as well with a similar log. Will investigate the issue
further.

Thanks and regards
Shlok Kyal
Re: verify predefined LWLocks have entries in wait_event_names.txt
On Wed, Jan 03, 2024 at 11:34:25AM +0900, Michael Paquier wrote: > On Tue, Jan 02, 2024 at 11:31:20AM -0600, Nathan Bossart wrote: >> +# Find the location of lwlocknames.h. >> +my $include_dir = $node->config_data('--includedir'); >> +my $lwlocknames_file = "$include_dir/server/storage/lwlocknames.h"; > > I am afraid that this is incorrect because an installation could > decide to install server-side headers in a different path than > $include/server/. Using --includedir-server would be the correct > answer, appending "storage/lwlocknames.h" to the path retrieved from > pg_config. Ah, good to know, thanks. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: verify predefined LWLocks have entries in wait_event_names.txt
On Wed, Jan 03, 2024 at 07:59:45AM +, Bertrand Drouvot wrote:
> +1 to add a test and put in a place that would produce failures at build time.
> I think that having the test in the script that generates the header file is
> more appropriate (as building the documentation looks less usual to me when
> working on a patch).

Okay, I did that in v2.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

>From 587f6a2c108fc7496fcb3d5764ca06ac5fb21326 Mon Sep 17 00:00:00 2001
From: Nathan Bossart
Date: Fri, 5 Jan 2024 00:07:40 -0600
Subject: [PATCH v2 1/1] verify lists of predefined LWLocks in lwlocknames.txt
 and wait_event_names.txt match

---
 src/backend/Makefile                          |  2 +-
 src/backend/storage/lmgr/Makefile             |  4 +-
 .../storage/lmgr/generate-lwlocknames.pl      | 63 +++
 .../utils/activity/wait_event_names.txt       | 10 +++
 src/include/storage/meson.build               |  2 +-
 5 files changed, 77 insertions(+), 4 deletions(-)

diff --git a/src/backend/Makefile b/src/backend/Makefile
index 816461aa17..7d2ea7d54a 100644
--- a/src/backend/Makefile
+++ b/src/backend/Makefile
@@ -130,7 +130,7 @@ $(top_builddir)/src/port/libpgport_srv.a: | submake-libpgport
 parser/gram.h: parser/gram.y
 	$(MAKE) -C parser gram.h
 
-storage/lmgr/lwlocknames.h: storage/lmgr/generate-lwlocknames.pl storage/lmgr/lwlocknames.txt
+storage/lmgr/lwlocknames.h: storage/lmgr/generate-lwlocknames.pl storage/lmgr/lwlocknames.txt utils/activity/wait_event_names.txt
 	$(MAKE) -C storage/lmgr lwlocknames.h lwlocknames.c
 
 utils/activity/wait_event_types.h: utils/activity/generate-wait_event_types.pl utils/activity/wait_event_names.txt
diff --git a/src/backend/storage/lmgr/Makefile b/src/backend/storage/lmgr/Makefile
index c48ba943c4..504480e847 100644
--- a/src/backend/storage/lmgr/Makefile
+++ b/src/backend/storage/lmgr/Makefile
@@ -39,8 +39,8 @@ s_lock_test: s_lock.c $(top_builddir)/src/common/libpgcommon.a $(top_builddir)/s
 lwlocknames.c: lwlocknames.h
 	touch $@
 
-lwlocknames.h: $(top_srcdir)/src/backend/storage/lmgr/lwlocknames.txt generate-lwlocknames.pl
-	$(PERL) $(srcdir)/generate-lwlocknames.pl $<
+lwlocknames.h: $(top_srcdir)/src/backend/storage/lmgr/lwlocknames.txt $(top_srcdir)/src/backend/utils/activity/wait_event_names.txt generate-lwlocknames.pl
+	$(PERL) $(srcdir)/generate-lwlocknames.pl $^
 
 check: s_lock_test
 	./s_lock_test
diff --git a/src/backend/storage/lmgr/generate-lwlocknames.pl b/src/backend/storage/lmgr/generate-lwlocknames.pl
index bef5e16e11..9afee7984b 100644
--- a/src/backend/storage/lmgr/generate-lwlocknames.pl
+++ b/src/backend/storage/lmgr/generate-lwlocknames.pl
@@ -15,6 +15,7 @@ my $continue = "\n";
 GetOptions('outdir:s' => \$output_path);
 
 open my $lwlocknames, '<', $ARGV[0] or die;
+open my $wait_event_names, '<', $ARGV[1] or die;
 
 # Include PID in suffix in case parallel make runs this multiple times.
 my $htmp = "$output_path/lwlocknames.h.tmp$$";
@@ -30,6 +31,59 @@ print $c $autogen, "\n";
 
 print $c "const char *const IndividualLWLockNames[] = {";
 
+#
+# First, record the predefined LWLocks listed in wait_event_names.txt.  We'll
+# cross-check those with the ones in lwlocknames.txt.
+#
+my @wait_event_lwlocks;
+my $found_lwlock_section = 0;
+my $recording_lwlocks = 0;
+
+while (<$wait_event_names>)
+{
+	chomp;
+
+	# Skip comments
+	next if /^#/;
+
+	# Skip empty lines
+	if (/^\s*$/)
+	{
+		#
+		# If we encounter an empty line while recording the LWLocks listed in
+		# wait_event_names.txt, we are done.
+		#
+		last if $recording_lwlocks;
+
+		#
+		# Begin recording the LWLocks listed in wait_event_names.txt following
+		# the empty line after the WaitEventLWLock section declaration.
+		#
+		$recording_lwlocks = 1 if $found_lwlock_section;
+		next;
+	}
+
+	#
+	# Prepare to record the LWLocks when we find the WaitEventLWLock section
+	# declaration.
+	#
+	if (/^Section: ClassName(.*)/)
+	{
+		my $section_name = $_;
+		$section_name =~ s/^.*- //;
+		$found_lwlock_section = 1 if $section_name eq "WaitEventLWLock";
+		next;
+	}
+
+	# Go to the next line if we are not yet recording LWLocks.
+	next if not $recording_lwlocks;
+
+	# Record the LWLock.
+	(my $waiteventname, my $waitevendocsentence) = split(/\t/, $_);
+	push(@wait_event_lwlocks, $waiteventname . "Lock");
+}
+
+my $i = 0;
 while (<$lwlocknames>)
 {
 	chomp;
@@ -50,6 +104,12 @@ while (<$lwlocknames>)
 	die "lwlocknames.txt not in order" if $lockidx < $lastlockidx;
 	die "lwlocknames.txt has duplicates" if $lockidx == $lastlockidx;
 
+	die $lockname . " defined in lwlocknames.txt but missing from wait_event_names.txt"
+	  unless $i < scalar(@wait_event_lwlocks);
+	die "lists of predefined LWLocks in lwlocknames.txt and wait_event_names.txt do not match"
+	  unless $wait_event_lwlocks[$i] eq $lockname;
+	$i++;
+
 	while ($lastlockidx < $lockidx - 1)
 	{
 		++$lastlockidx;
@@ -63,6 +123,9 @@ while (<$lwlocknames>)
 
 	print $h "#define $lockname (&MainLWLockArray[$lockidx].lock)\n";
 }
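The parsing added by the patch is a small state machine over
wait_event_names.txt: arm recording once the WaitEventLWLock section header is
seen, start collecting after the following empty line, and stop at the next
empty line. A rough Python equivalent of that flow (illustrative only; the real
implementation is the Perl in the patch, and the sample file format below is a
simplification):

```python
def collect_lwlock_wait_events(lines):
    """Collect names from the WaitEventLWLock section, suffixed with 'Lock'."""
    lwlocks = []
    found_section = False   # saw the WaitEventLWLock section header
    recording = False       # past the blank line that follows the header
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("#"):          # skip comments
            continue
        if line.strip() == "":            # blank line: section boundary
            if recording:
                break                     # end of the LWLock section
            if found_section:
                recording = True          # start after the header's blank line
            continue
        if line.startswith("Section: ClassName"):
            if line.split("- ", 1)[-1] == "WaitEventLWLock":
                found_section = True
            continue
        if recording:
            # Entries are "<name>\t<doc sentence>"; keep the name only.
            lwlocks.append(line.split("\t", 1)[0] + "Lock")
    return lwlocks
```

The collected list is then compared, entry by entry and in order, against the
locks read from lwlocknames.txt, which is what enforces the "same order"
constraint at build time.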
Re: SET ROLE x NO RESET
> On 3 Jan 2024, at 18:22, Jelte Fennema-Nio wrote:
>
>> In my case I have scripts that I want to execute with limited privileges
>> and make sure the scripts cannot escape the sandbox via RESET ROLE.
>
> Depending on the desired workflow I think that could work for you too.
> Because it allows you to do this (and use -f script.sql instead of -c
> 'select ...'):
>
> ❯ psql "user=postgres _pq_.protocol_managed_params=role options='-c
> role=pg_read_all_data'" -c 'select current_user; set role postgres'
>    current_user
> ──────────────────
>  pg_read_all_data
> (1 row)
>
> ERROR: 42501: parameter can only be set at the protocol level "role"
> LOCATION: set_config_with_handle, guc.c:3583
> Time: 0.667 ms

My scripts are actually Liquibase change logs. I've extended Liquibase so that
each change set is executed with limited privileges. While doable with a
protocol-level implementation, it would require support from PgJDBC.

--
Michal
Re: Documentation to upgrade logical replication cluster
On Fri, Jan 5, 2024 at 2:38 PM Hayato Kuroda (Fujitsu) wrote: ... > 2. > I'm not sure it should be listed as step 10. I felt that it should be new > section. > At that time other steps like "Prepare for {publisher|subscriber} upgrades" > can be moved as well. > Thought? During my review, I also felt that step 10 is now so long that it is a distraction from the other content on this page. == Kind Regards, Peter Smith. Fujitsu Australia
Re: Documentation to upgrade logical replication cluster
Here are some review comments for patch v1-0001.

==
doc/src/sgml/ref/pgupgrade.sgml

1. GENERAL - blank lines

Most (but not all) of your procedure steps are preceded by blank lines to make
them more readable in the SGML. Add the missing blank lines for the steps that
didn't have them.

~~~

2. GENERAL - for e.g.:

All the "for e.g:" that precedes the code examples can just say "e.g.:" like
in other examples on this page.

~~~

3. GENERAL - reference from elsewhere

I was wondering if "Chapter 30. Logical Replication" should include a section
that references back to all this just to make it easier to find.

~~~

4.
+
+ Migration of logical replication clusters can be done when all the members
+ of the old logical replication clusters are version 17.0 or later.
+

/can be done when/is possible only when/

~~~

5.
+
+ The prerequisites of publisher upgrade applies to logical Replication
+ cluster upgrades also. See
+ for the details of publisher upgrade prerequisites.
+

/applies to/apply to/
/logical Replication/logical replication/

~~~

6.
+
+ The prerequisites of subscriber upgrade applies to logical Replication
+ cluster upgrades also. See
+ for the details of subscriber upgrade prerequisites.
+
+

/applies to/apply to/
/logical Replication/logical replication/

~~~

7.
+
+ The steps to upgrade logical replication clusters in various scenarios are
+ given below.
+

The 3 titles do not render very prominently, so it is too easy to get lost
scrolling up and down looking for the different scenarios. If the title
rendering can't be improved, at least a list of 3 links here (like a TOC)
would be helpful.

~~~

// Steps to Upgrade 2 node logical replication cluster //

8. GENERAL - server names

I noticed in this set of steps you called the servers 'pub_data' and
'pub_upgraded_data' and 'sub_data' and 'sub_upgraded_data'.

While that is easy to read, it is also different from all the subsequent
procedures where the names are just like 'data1', 'data2', 'data3', and
'data1_upgraded', 'data2_upgraded', 'data3_upgraded'. I felt maybe it is
better to use consistent naming for all the procedures.

~~~

9.
+
+ Steps to Upgrade 2 node logical replication cluster

SUGGESTION
Steps to upgrade a two-node logical replication cluster

~~~

10.
+
+
+
+
+ Let's say publisher is in node1 and subscriber is
+ in node2.
+

10a.
This renders as Step 1. But IMO this should not be a "step" at all -- it's
just a description of the scenario.

~

10b.
The subsequent steps refer to subscriptions 'sub1_node1_node2' and
'sub2_node1_node2'. IMO it would help with the example code if those are named
up front here too. e.g.

node2 has two subscriptions for changes from node1:
sub1_node1_node2
sub2_node1_node2

~~~

11.
+
+
+ Upgrade the publisher node node1's server to the
+ required newer version, for e.g.:

The wording repeating node/node1 seems complicated.

SUGGESTION
Upgrade the publisher node's server to the required newer version, e.g.:

~~~

12.
+
+
+ Start the upgraded publisher node node1's server, for e.g.:

IMO better to use similar wording to the "Stop" step.

SUGGESTION
Start the upgraded publisher server in node1, e.g.:

~~~

13.
+
+
+ Upgrade the subscriber node node2's server to
+ the required new version, for e.g.:

The wording repeating node/node2 seems complicated.

SUGGESTION
Upgrade the subscriber node's server to the required newer version, e.g.:

~~~

14.
+
+
+ Start the upgraded subscriber node node2's server,
+ for e.g.:

IMO better to use similar wording to the "Stop" step.

SUGGESTION
Start the upgraded subscriber server in node2, e.g.:

~~~

15.
+
+
+ Create any tables that were created in the upgraded publisher node1
+ server between step-5 and now, for e.g.:
+
+node2=# CREATE TABLE distributors (
+node2(# did integer CONSTRAINT no_null NOT NULL,
+node2(# name varchar(40) NOT NULL
+node2(# );
+CREATE TABLE
+
+

15a.
Maybe it is better to have a link to step 5 instead of just hardwiring
"Step-5" in the text.

~

15b.
I didn't think it was needed to spread the CREATE TABLE across multiple lines.
It is just a dummy example anyway, so IMO better to use up less space.

~~~

16.
+
+
+ Refresh the publications using
+ ALTER SUBSCRIPTION ... REFRESH PUBLICATION,

/Refresh the publications/Refresh the subscription's publications/

~~~

// Steps to upgrade cascaded logical replication clusters //
(these comments are similar to those in the previous procedure, but I will
give them all again)

17.
+
+
+ Steps to upgrade cascaded logical replication clusters
+
+
+
Re: SQL:2011 application time
On Tue, Jan 2, 2024 at 9:59 AM Paul Jungwirth wrote:
>
> On 12/31/23 00:51, Paul Jungwirth wrote:
> > That's it for now.
>
> Here is another update. I fixed FOR PORTION OF on partitioned tables, in
> particular when the attnums are different from the root partition.
>
> Rebased to cea89c93a1.

Hi.

+/*
+ * range_without_portion_internal - Sets outputs and outputn to the ranges
+ * remaining and their count (respectively) after subtracting r2 from r1.
+ * The array should never contain empty ranges.
+ * The outputs will be ordered. We expect that outputs is an array of
+ * RangeType pointers, already allocated with two slots.
+ */
+void
+range_without_portion_internal(TypeCacheEntry *typcache, RangeType *r1,
+                               RangeType *r2, RangeType **outputs, int *outputn)
+{
+    int         cmp_l1l2,
+                cmp_l1u2,
+                cmp_u1l2,
+                cmp_u1u2;
+    RangeBound  lower1,
+                lower2;
+    RangeBound  upper1,
+                upper2;
+    bool        empty1,
+                empty2;
+
+    range_deserialize(typcache, r1, &lower1, &upper1, &empty1);
+    range_deserialize(typcache, r2, &lower2, &upper2, &empty2);
+
+    if (empty1)
+    {
+        /* if r1 is empty then r1 - r2 is empty, so return zero results */
+        *outputn = 0;
+        return;
+    }
+    else if (empty2)
+    {
+        /* r2 is empty so the result is just r1 (which we know is not empty) */
+        outputs[0] = r1;
+        *outputn = 1;
+        return;
+    }
+
+    /*
+     * Use the same logic as range_minus_internal,
+     * but support the split case
+     */
+    cmp_l1l2 = range_cmp_bounds(typcache, &lower1, &lower2);
+    cmp_l1u2 = range_cmp_bounds(typcache, &lower1, &upper2);
+    cmp_u1l2 = range_cmp_bounds(typcache, &upper1, &lower2);
+    cmp_u1u2 = range_cmp_bounds(typcache, &upper1, &upper2);
+
+    if (cmp_l1l2 < 0 && cmp_u1u2 > 0)
+    {
+        lower2.inclusive = !lower2.inclusive;
+        lower2.lower = false;   /* it will become the upper bound */
+        outputs[0] = make_range(typcache, &lower1, &lower2, false, NULL);
+
+        upper2.inclusive = !upper2.inclusive;
+        upper2.lower = true;    /* it will become the lower bound */
+        outputs[1] = make_range(typcache, &upper2, &upper1, false, NULL);
+
+        *outputn = 2;
+    }
+    else if (cmp_l1u2 > 0 || cmp_u1l2 < 0)
+    {
+        outputs[0] = r1;
+        *outputn = 1;
+    }
+    else if (cmp_l1l2 >= 0 && cmp_u1u2 <= 0)
+    {
+        *outputn = 0;
+    }
+    else if (cmp_l1l2 <= 0 && cmp_u1l2 >= 0 && cmp_u1u2 <= 0)
+    {
+        lower2.inclusive = !lower2.inclusive;
+        lower2.lower = false;   /* it will become the upper bound */
+        outputs[0] = make_range(typcache, &lower1, &lower2, false, NULL);
+        *outputn = 1;
+    }
+    else if (cmp_l1l2 >= 0 && cmp_u1u2 >= 0 && cmp_l1u2 <= 0)
+    {
+        upper2.inclusive = !upper2.inclusive;
+        upper2.lower = true;    /* it will become the lower bound */
+        outputs[0] = make_range(typcache, &upper2, &upper1, false, NULL);
+        *outputn = 1;
+    }
+    else
+    {
+        elog(ERROR, "unexpected case in range_without_portion");
+    }
+}

(Note: the `&`-prefixed arguments above were stripped in transit and have been
restored from the surrounding declarations and comments.)

I am confused. Say the condition is "(cmp_l1l2 <= 0 && cmp_u1l2 >= 0 &&
cmp_u1u2 <= 0)". Won't the following code only run PartA, and never PartB?

`
else if (cmp_l1l2 >= 0 && cmp_u1u2 <= 0)
    PartA
else if (cmp_l1l2 <= 0 && cmp_u1l2 >= 0 && cmp_u1u2 <= 0)
    PartB
`

minimum example:

#include <stdio.h>
#include <assert.h>

int main(void)
{
    int cmp_l1l2;
    int cmp_u1u2;
    int cmp_u1l2;
    int cmp_l1u2;

    cmp_l1u2 = -1;
    cmp_l1l2 = 0;
    cmp_u1u2 = 0;
    cmp_u1l2 = 0;

    assert(cmp_u1l2 == 0);
    if (cmp_l1u2 > 0 || cmp_u1l2 < 0)
        printf("calling partA\n");
    else if (cmp_l1l2 >= 0 && cmp_u1u2 <= 0)
        printf("calling partB\n");
    else if (cmp_l1l2 <= 0 && cmp_u1l2 >= 0 && cmp_u1u2 <= 0)
        printf("calling partC\n");
}

I am confused by the name "range_without_portion"; I think
"range_not_overlap" would be better.

select numrange(1.1, 2.2) @- numrange(2.0, 3.0);
does not give the same result as
select numrange(2.0, 3.0) @- numrange(1.1, 2.2);

So is categorizing oprkind as 'b' for operator "@-" wrong?

select oprname, oprkind, oprcanhash, oprcanmerge, oprleft, oprright, oprresult, oprcode
from pg_operator where oprname = '@-';

also
select count(*), oprkind from pg_operator group by oprkind;
only 5% are prefix operators.

maybe we should design it as:
1. if both inputs are empty ranges, the result array is empty.
2. if both inputs are non-empty and never overlap, put both of them into the
   result array.
3. if one input is empty and the other one is not, then put the non-empty one
   into the result array.

after applying the patch, the catalog data does not seem correct to me:

SELECT a1.amopfamily,
       a1.amoplefttype::regtype,
       a1.amoprighttype,
       a1.amopstrategy,
       amoppurpose,
       amopsortfamily,
       amopopr,
       op.oprname,
       am.amname
FROM pg_amop as a1
     join pg_operator op on op.oid = a1.amopopr
     join pg_am am on am.oid = a1.amopmethod
where amoppurpose = 'p';

output:
 amopfamily | amoplefttype | amoprighttype | amopstrategy | amoppurpose | amopsortfamily | amopopr | oprname | amname
------------+--------------+---------------+--------------+-------------+----------------+---------+---------+--------
       2593 | box          |           603 |           31 | p           |              0 |     803 | #       | gist
       3919 | anyrange     |          3831 |
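The case analysis being questioned is ordinary interval subtraction: removing
r2 from r1 leaves zero, one, or two remaining pieces, with two pieces in the
"split" case. A small Python model of that semantics over half-open numeric
intervals (the representation and names here are illustrative, not the patch's
C API, and inclusivity of bounds is deliberately simplified):

```python
def range_without_portion(r1, r2):
    """Subtract half-open interval r2 from r1; return the 0, 1, or 2 leftovers."""
    (l1, u1), (l2, u2) = r1, r2
    out = []
    if l1 >= u1:                            # r1 empty: nothing remains
        return out
    if l2 >= u2 or u2 <= l1 or l2 >= u1:    # r2 empty or disjoint: r1 survives whole
        return [r1]
    if l1 < l2:                             # a piece of r1 sticks out on the left
        out.append((l1, l2))
    if u2 < u1:                             # a piece of r1 sticks out on the right
        out.append((u2, u1))
    return out                              # both pieces present -> the split case
```

For example, subtracting (2, 3) from (1, 5) yields the two pieces (1, 2) and
(3, 5), which is the split case that range_minus_internal cannot represent and
that motivates the two-slot outputs array.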
Re: speed up a logical replica setup
On Thu, Jan 4, 2024 at 9:27 PM Euler Taveira wrote: > > On Thu, Jan 4, 2024, at 3:05 AM, Amit Kapila wrote: > > Won't it be a better user experience that after setting up the target > server as a logical replica (subscriber), it started to work > seamlessly without user intervention? > > > If we have an option to control the replication slot removal (default is on), > it seems a good UI. Even if the user decides to disable the replication slot > removal, it should print a message saying that these replication slots can > cause WAL retention. > As pointed out in the previous response, I think we should not proceed with such a risk of WAL retention and other nodes dependency, we should either give an ERROR (default) or remove slots, if the user provides an option. If we do so, do you think by default we can keep the server started or let the user start it later? I think one advantage of letting the user start it later is that she gets a chance to adjust config parameters in postgresql.conf and by default we won't be using system resources. -- With Regards, Amit Kapila.
Re: PL/pgSQL: Incomplete item Allow handling of %TYPE arrays, e.g. tab.col%TYPE[]
On Thu, Jan 4, 2024 at 22:02, Tom Lane wrote:

> Pavel Stehule writes:
> > Now, I think so this simple patch is ready for committers
>
> I pushed this with some editorialization -- mostly, rewriting the
> documentation and comments. I found that the existing docs for %TYPE
> were not great. There are two separate use-cases, one for referencing
> a table column and one for referencing a previously-declared variable,
> and the docs were about as clear as mud about explaining that.
>
> I also looked into the problem Pavel mentioned that it doesn't work
> for RECORD. If you just write "record[]" you get an error message
> that at least indicates it's an unsupported case:
>
> regression=# do $$declare r record[]; begin end$$;
> ERROR: variable "r" has pseudo-type record[]
> CONTEXT: compilation of PL/pgSQL function "inline_code_block" near line 1
>
> Maybe we could improve on that, but it would be a lot of work and
> I'm not terribly excited about it. However, %TYPE fails entirely
> for both "record" and named composite types, and the reason turns
> out to be just that plpgsql_parse_wordtype fails to handle the
> PLPGSQL_NSTYPE_REC case. So that's easily fixed.
>
> I also wonder what the heck the last half of plpgsql_parse_wordtype
> is for at all. It looks for a named type, which means you can do
>
> regression=# do $$declare x float8%type; begin end$$;
> DO
>
> but that's just stupid. You could leave off the %TYPE and get
> the same result. Moreover, it is inconsistent because
> plpgsql_parse_cwordtype has no equivalent behavior:
>
> regression=# do $$declare x pg_catalog.float8%type; begin end$$;
> ERROR: syntax error at or near "%"
> LINE 1: do $$declare x pg_catalog.float8%type; begin end$$;
>                                         ^
> CONTEXT: invalid type name "pg_catalog.float8%type"
>
> It's also undocumented and untested (the code coverage report
> shows this part is never reached). So I propose we remove it.
>
> That leads me to the attached proposed follow-on patch.
>
> Another thing we could think about, but I've not done it here,
> is to make plpgsql_parse_wordtype and friends throw error
> instead of just returning NULL when they don't find the name.
> Right now, if NULL is returned, we end up passing the whole
> string to parse_datatype, leading to unhelpful errors like
> the one shown above. We could do better than that I think,
> perhaps like "argument of %TYPE is not a known variable".

+1

Regards

Pavel

> regards, tom lane
Re: speed up a logical replica setup
On Thu, Jan 4, 2024 at 9:18 PM Euler Taveira wrote: > > On Thu, Jan 4, 2024, at 2:41 AM, Amit Kapila wrote: > > > I think asking users to manually remove such slots won't be a good > idea. We might want to either remove them by default or provide an > option to the user. > > > Am I correct that the majority of the use cases these replication slots will > be > useless? > I am not so sure about it. Say, if some sync slots are present this means the user wants this replica to be used later as a publisher. Now, if the existing primary/publisher node is still alive then we don't have these slots but if the user wants to switch over to this new node as well then they may be required. Is there a possibility that a cascading standby also has a slot on the current physical replica being converted to a new subscriber? > If so, let's remove them by default and add an option to control this > behavior (replication slot removal). > The presence of slots on the physical replica indicates that the other nodes/clusters could be dependent on it, so, I feel by default we should give an error and if the user uses some option to remove slots then it is fine to remove them. -- With Regards, Amit Kapila.
Re: pg_upgrade test failure
On Sun, 29 Oct 2023 at 11:14, Hayato Kuroda (Fujitsu) wrote: > > Dear Andres, > > While tracking BF failures related with pg_ugprade, I found the same failure > has still happened [1] - [4]. > According to the log, the output directory was remained even after the > successful upgrade [5]. > I analyzed and attached the fix patch, and below is my analysis... how do you > think? > The same failure occurs randomly at [1] for a newly added test too: pg_upgrade: warning: could not remove directory "C:/tools/nmsys64/home/pgrunner/bf/root/HEAD/pgsql.build/testrun/pg_upgrade/004_subscription/data/t_004_subscription_new_sub_data/pgdata/pg_upgrade_output.d/20240104T215133.796/log": Directory not empty pg_upgrade: warning: could not remove directory "C:/tools/nmsys64/home/pgrunner/bf/root/HEAD/pgsql.build/testrun/pg_upgrade/004_subscription/data/t_004_subscription_new_sub_data/pgdata/pg_upgrade_output.d/20240104T215133.796": Directory not empty pg_upgrade: warning: could not stat file "C:/tools/nmsys64/home/pgrunner/bf/root/HEAD/pgsql.build/testrun/pg_upgrade/004_subscription/data/t_004_subscription_new_sub_data/pgdata/pg_upgrade_output.d/20240104T215133.796/log/pg_upgrade_internal.log": No such file or directory pg_upgrade: warning: could not remove directory "C:/tools/nmsys64/home/pgrunner/bf/root/HEAD/pgsql.build/testrun/pg_upgrade/004_subscription/data/t_004_subscription_new_sub_data/pgdata/pg_upgrade_output.d/20240104T215133.796/log": Directory not empty pg_upgrade: warning: could not remove directory "C:/tools/nmsys64/home/pgrunner/bf/root/HEAD/pgsql.build/testrun/pg_upgrade/004_subscription/data/t_004_subscription_new_sub_data/pgdata/pg_upgrade_output.d/20240104T215133.796": Directory not empty [1] - https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren=2024-01-04%2019%3A56%3A20 Regards, Vignesh
Re: Synchronizing slots from primary to standby
On Fri, Jan 5, 2024 at 8:59 AM shveta malik wrote: > > On Thu, Jan 4, 2024 at 7:24 PM Bertrand Drouvot > wrote: > > > > 4 === > > > > Looking closer, the only place where walrcv_connect() is called with > > replication > > set to false and logical set to false is in ReplSlotSyncWorkerMain(). > > > > That does make sense, but what do you think about creating dedicated > > libpqslotsyncwrkr_connect > > and slotsyncwrkr_connect (instead of using the libpqrcv_connect / > > walrcv_connect ones)? > > > > That way we could make use of slotsyncwrkr_connect() in > > ReplSlotSyncWorkerMain() > > as I think it's confusing to use "rcv" functions while the process using > > them is > > not of backend type walreceiver. > > > > I'm not sure that worth the extra complexity though, what do you think? > > I gave it a thought earlier, but then I was not sure even if I create > a new function w/o "rcv" in it then where should it be placed as the > existing file name itself is libpq'walreceiver'.c. Shall we be > creating a new file then? But it does not seem good to create a new > setup (new file, function pointers other stuff) around 1 function. > And thus reusing the same function with 'replication' (new arg) felt > like a better choice than other options. If in future, there is any > other module trying to do the same, then it can use current > walrcv_connect() with rep=false. If I make it specific to slot-sync > worker, then it will not be reusable by other modules (if needed). > I agree that the benefit of creating a new API is not very clear. How about adjusting the description in the file header of libpqwalreceiver.c. I think apart from walreceiver, it is now also used by logical replication workers and with this patch by the slotsync worker as well. -- With Regards, Amit Kapila.
Re: Random pg_upgrade test failure on drongo
On Thu, Jan 4, 2024 at 5:30 PM Alexander Lakhin wrote: > > 03.01.2024 14:42, Amit Kapila wrote: > > > > >> And the internal process is ... background writer (BgBufferSync()). > >> > >> So, I tried just adding bgwriter_lru_maxpages = 0 to postgresql.conf and > >> got 20 x 10 tests passing. > >> > >> Thus, if we want just to get rid of the test failure, maybe it's enough to > >> add this to the test's config... > >> > > What about checkpoints? Can't it do the same while writing the buffers? > > As we deal here with pg_upgrade/pg_restore, it may not be very easy to get > the desired effect, but I think it's not impossible in principle. > More details below. > What happens during the pg_upgrade execution is essentially: > 1) CREATE DATABASE "postgres" WITH TEMPLATE = template0 OID = 5 ...; > -- this command flushes file buffers as well > 2) ALTER DATABASE postgres OWNER TO ... > 3) COMMENT ON DATABASE "postgres" IS ... > 4) -- For binary upgrade, preserve pg_largeobject and index relfilenodes > SELECT > pg_catalog.binary_upgrade_set_next_index_relfilenode('2683'::pg_catalog.oid); > SELECT > pg_catalog.binary_upgrade_set_next_heap_relfilenode('2613'::pg_catalog.oid); > TRUNCATE pg_catalog.pg_largeobject; > -- ^^^ here we can get the error "could not create file "base/5/2683": File > exists" > ... > > We get the effect discussed when the background writer process decides to > flush a file buffer for pg_largeobject during stage 1. > (Thus, if a checkpoint somehow happened to occur during CREATE DATABASE, > the result must be the same.) > And another important factor is shared_buffers = 1MB (set during the test). > With the default setting of 128MB I couldn't see the failure. > > It can be reproduced easily (on old Windows versions) just by running > pg_upgrade in a loop (I've got failures on iterations 22, 37, 17 (with the > default cluster)).
> If an old cluster contains dozens of databases, this increases the failure > probability significantly (with 10 additional databases I've got failures > on iterations 4, 1, 6). > I don't have an old Windows environment to test but I agree with your analysis and theory. The question is what should we do for these new random BF failures? I think we should set bgwriter_lru_maxpages to 0 and checkpoint_timeout to 1hr for these new tests. Doing some invasive fix as part of this doesn't sound reasonable because this is an existing problem and there seems to be another patch by Thomas that probably deals with the root cause of the existing problem [1] as pointed out by you. [1] - https://commitfest.postgresql.org/40/3951/ -- With Regards, Amit Kapila.
RE: Documentation to upgrade logical replication cluster
Dear Vignesh, Thanks for making a patch! Below are my comments. 1. Only two steps were given an id, but I think it should be done for all the steps. See [1]. 2. I'm not sure it should be listed as step 10. I felt that it should be a new section. At that time, other steps like "Prepare for {publisher|subscriber} upgrades" can be moved as well. Thoughts? 3. ``` + The prerequisites of publisher upgrade applies to logical Replication ``` Replication -> replication 4. ``` + + Let's say publisher is in node1 and subscriber is + in node2. + ``` I felt it would be more friendly if you added the name of the directory for each instance. 5. You did not write the initialization of the new node. Was it intentional? 6. ``` + + Disable all the subscriptions on node2 that are + subscribing the changes from node1 by using + ALTER SUBSCRIPTION ... DISABLE, + for e.g.: + +node2=# ALTER SUBSCRIPTION sub1_node1_node2 DISABLE; +ALTER SUBSCRIPTION +node2=# ALTER SUBSCRIPTION sub2_node1_node2 DISABLE; +ALTER SUBSCRIPTION + + ``` Subscriptions are disabled after stopping a publisher, but that leads to ERRORs on the publisher. I think it's better to swap these steps. 7. ``` + +dba@node1:/opt/PostgreSQL/postgres//bin$ pg_ctl -D /opt/PostgreSQL/pub_data stop -l logfile + ``` Hmm. I thought you did not have to show the current directory. You were in the bin dir, but it is not our requirement, right? 8. ``` + +dba@node1:/opt/PostgreSQL/postgres//bin$ pg_upgrade +--old-datadir "/opt/PostgreSQL/postgres/17/pub_data" +--new-datadir "/opt/PostgreSQL/postgres//pub_upgraded_data" +--old-bindir "/opt/PostgreSQL/postgres/17/bin" +--new-bindir "/opt/PostgreSQL/postgres//bin" + ``` For PG17, both old and new bindir look the same. Can we use 18 as new-bindir? 9.
``` + + Create any tables that were created in node2 + between step-2 and now, for e.g.: + +node2=# CREATE TABLE distributors ( +node2(# did integer CONSTRAINT no_null NOT NULL, +node2(# name varchar(40) NOT NULL +node2(# ); +CREATE TABLE + + ``` I think these SQLs must be done on node1, because it has not been running between step-2 and step-7. 10. ``` + + + Enable all the subscriptions on node2 that are + subscribing the changes from node1 by using + ALTER SUBSCRIPTION ... ENABLE, + for e.g.: + +node2=# ALTER SUBSCRIPTION sub1_node1_node2 ENABLE; +ALTER SUBSCRIPTION +node2=# ALTER SUBSCRIPTION sub2_node1_node2 ENABLE; +ALTER SUBSCRIPTION + + + + + + + Refresh the publications using + ALTER SUBSCRIPTION ... REFRESH PUBLICATION, + for e.g.: + +node2=# ALTER SUBSCRIPTION sub1_node1_node2 REFRESH PUBLICATION; +ALTER SUBSCRIPTION +node2=# ALTER SUBSCRIPTION sub2_node1_node2 REFRESH PUBLICATION; +ALTER SUBSCRIPTION + + + ``` I was very confused about the location where they would really be executed. If my above comment is correct, should they be executed on node1 as well? Could you please check all the notations again? 11. ``` + + Disable all the subscriptions on node1 that are + subscribing the changes from node2 by using + ALTER SUBSCRIPTION ... DISABLE, + for e.g.: + +node2=# ALTER SUBSCRIPTION sub1_node2_node1 DISABLE; +ALTER SUBSCRIPTION +node2=# ALTER SUBSCRIPTION sub2_node2_node1 DISABLE; +ALTER SUBSCRIPTION + + ``` They should be on node1, but noted as node2. 12. ``` + + Enable all the subscriptions on node1 that are + subscribing the changes from node2 by using + ALTER SUBSCRIPTION ... ENABLE, + for e.g.: + +node2=# ALTER SUBSCRIPTION sub1_node2_node1 ENABLE; +ALTER SUBSCRIPTION +node2=# ALTER SUBSCRIPTION sub2_node2_node1 ENABLE; +ALTER SUBSCRIPTION + + ``` You said to "enable all the subscriptions on node1", but the SQLs are done on node2.
[1]: https://www.postgresql.org/message-id/tyapr01mb58667ae04d291924671e2051f5...@tyapr01mb5866.jpnprd01.prod.outlook.com Best Regards, Hayato Kuroda FUJITSU LIMITED
Re: Synchronizing slots from primary to standby
On Thu, Jan 4, 2024 at 7:24 PM Bertrand Drouvot wrote: > > Hi, > > On Thu, Jan 04, 2024 at 10:27:31AM +0530, shveta malik wrote: > > On Thu, Jan 4, 2024 at 9:18 AM shveta malik wrote: > > > > > > On Wed, Jan 3, 2024 at 6:33 PM Zhijie Hou (Fujitsu) > > > wrote: > > > > > > > > On Tuesday, January 2, 2024 6:32 PM shveta malik > > > > wrote: > > > > > On Fri, Dec 29, 2023 at 10:25 AM Amit Kapila > > > > > > > > > > The topup patch has also changed app_name to > > > > > {cluster_name}_slotsyncworker so that we do not confuse between > > > > > walreceiver > > > > > and slotsyncworker entry. > > > > > > > > > > Please note that there is no change in rest of the patches, changes > > > > > are in > > > > > additional 0004 patch alone. > > > > > > > > Attach the V56 patch set which supports ALTER SUBSCRIPTION SET > > > > (failover). > > > > This is useful when user want to refresh the publication tables, they > > > > can now alter the > > > > failover option to false and then execute the refresh command. > > > > > > > > Best Regards, > > > > Hou zj > > > > > > The patches no longer apply to HEAD due to a recent commit 007693f. I > > > am working on rebasing and will post the new patches soon > > > > > > thanks > > > Shveta > > > > Commit 007693f has changed 'conflicting' to 'conflict_reason', so > > adjusted the code around that in the slotsync worker. > > > > Also removed function 'pg_get_slot_invalidation_cause' as now > > conflict_reason tells the same. > > > > PFA rebased patches with above changes. > > > > Thanks! > > Looking at 0004: > > 1 > > -libpqrcv_connect(const char *conninfo, bool logical, bool must_use_password, > -const char *appname, char **err) > +libpqrcv_connect(const char *conninfo, bool replication, bool logical, > +bool must_use_password, const char *appname, > char **err) > > What about adjusting the preceding comment a bit to describe what the new > replication > parameter is for? 
> > 2 > > + /* We can not have logical w/o replication */ > > what about replacing w/o by without? > > 3 === > > + if(!replication) > + Assert(!logical); > + > + if (replication) > { > > what about using "if () else" instead (to avoid unnecessary test)? > > > Having said that the patch seems a reasonable way to implement non-replication > connection in slotsync worker. > > 4 === > > Looking closer, the only place where walrcv_connect() is called with > replication > set to false and logical set to false is in ReplSlotSyncWorkerMain(). > > That does make sense, but what do you think about creating dedicated > libpqslotsyncwrkr_connect > and slotsyncwrkr_connect (instead of using the libpqrcv_connect / > walrcv_connect ones)? > > That way we could make use of slotsyncwrkr_connect() in > ReplSlotSyncWorkerMain() > as I think it's confusing to use "rcv" functions while the process using them > is > not of backend type walreceiver. > > I'm not sure that worth the extra complexity though, what do you think? I gave it a thought earlier, but then I was not sure: even if I create a new function w/o "rcv" in it, where should it be placed, as the existing file name itself is libpq'walreceiver'.c? Shall we be creating a new file then? But it does not seem good to create a new setup (new file, function pointers, other stuff) around one function. And thus reusing the same function with the new 'replication' arg felt like a better choice than the other options. If in the future there is any other module trying to do the same, it can use the current walrcv_connect() with rep=false. If I make it specific to the slot-sync worker, then it will not be reusable by other modules (if needed). thanks Shveta
Re: the s_lock_stuck on perform_spin_delay
Hi, Andres Freund writes: > > On 2024-01-04 14:59:06 +0800, Andy Fan wrote: >> My question is if someone doesn't obey the rule by mistake (everyone >> can make mistakes), shall we PANIC on a production environment? IMO I >> think it can be a WARNING on a production environment and be stuck >> when 'ifdef USE_ASSERT_CHECKING'. >> >> [...] >> >> I noticed this issue actually because of the patch "Cache relation >> sizes?" from Thomas Munro [1], where the latest patch[2] still has the >> following code. >> +sr = smgr_alloc_sr(); <-- HERE a spin lock is held >> + >> +/* Upgrade to exclusive lock so we can create a mapping. */ >> +LWLockAcquire(mapping_lock, LW_EXCLUSIVE); <-- HERE a complex >> operation is needed. it may take a long time. >> >> Our internal testing system found more misuses on our own PG version. > >> I think an experienced engineer like Thomas can make this mistake and the >> patch was reviewed by 3 people, the bug is still there. It is not easy >> to say just don't do it. > > I don't follow this argument - just ignoring the problem, I agree with you but I feel you ignored my post at [1], where I said for the known issue, it should be fixed at the first chance. > which emitting a > WARNING basically is, doesn't reduce the impact of the bug, it *increases* the > impact, because now the system will not recover from the bug without explicit > operator intervention. During that time the system might be essentially > unresponsive, because all backends end up contending for some spinlock, which > makes investigating such issues very hard. Actually they are doing pg_usleep most of the time. Besides what Robert said, one more reason to question PANIC is that: PANIC can't always make the system recover faster because: a). On most systems, PANIC makes a core dump, which takes time and space. b). After the reboot, all the caches like relcache, plancache, fdcache need to be rebuilt. c).
Customers need to handle failures better or else they will be hurt *more often*. All of these cause slowness as well. > > I think we should add infrastructure to detect bugs like this during > development, Commit 2 in [1] does something like this. For the details: I missed the check for the memory allocation case as you suggested at [2], but checked heavyweight locks as well; others should be the same, IIUC. > but not PANICing when this happens in production seems completely > non-viable. > Not sure what *this* exactly means. If it means the bug in Thomas's patch, I absolutely agree with you (since it is a known bug and it should be fixed). If it means the general *unknown* case, it's something we talked about above. I also agree that some LLVM static checker would be pretty good ideally; it just requires more knowledge and future maintenance effort. I am willing to have a try shortly. [1] https://www.postgresql.org/message-id/871qaxp3ly.fsf%40163.com [2] https://www.postgresql.org/message-id/20240104225403.dgmbbfffmm3srpgq%40awork3.anarazel.de -- Best Regards Andy Fan
Re: the s_lock_stuck on perform_spin_delay
Hi, On 2024-01-04 17:03:18 -0600, Jim Nasby wrote: > On 1/4/24 10:33 AM, Tom Lane wrote: > > Robert Haas writes: > > On Thu, Jan 4, 2024 at 10:22 AM Tom Lane wrote: > > We should be making an effort to ban coding patterns like > "return with spinlock still held", because they're just too prone > to errors similar to this one. > > I agree that we don't want to add overhead, and also about how > spinlocks should be used, but I dispute "easily detectable > statically." I mean, if you or I look at some code that uses a > spinlock, we'll know whether the pattern that you mention is being > followed or not, modulo differences of opinion in debatable cases. But > you and I cannot be there to look at all the code all the time. If we > had a static checking tool that was run as part of every build, or in > the buildfarm, or by cfbot, or somewhere else that raised the alarm if > this rule was violated, then we could claim to be effectively > enforcing this rule. > > I was indeed suggesting that maybe we could find a way to detect > such things automatically. While I've not been paying close > attention, I recall there's been some discussions of using LLVM/clang > infrastructure for customized static analysis, so maybe it'd be > possible without an undue amount of effort. I played around with this a while back. One reference with a link to a playground to experiment with attributes: https://www.postgresql.org/message-id/20200616233105.sm5bvodo6unigno7%40alap3.anarazel.de Unfortunately clang's thread safety analysis doesn't handle conditionally acquired locks, which made it far less promising than I initially thought. I think there might be some other approaches, but they will all suffer from not understanding "danger" encountered indirectly, via function calls doing dangerous things. Which we would like to exclude, but I don't think that's trivial either. 
> FWIW, the lackey[1] tool in Valgrind is able to do some kinds of instruction > counting, so it might be possible to measure how many instructions are > actually > being executed while holding a spinlock. Might be easier than code analysis. I don't think that's particularly promising. Lackey is *slow*. And it requires actually reaching problematic states. Consider e.g. the case that was reported upthread, an lwlock acquired within a spinlock protected section - most of the time that's not going to result in a lot of cycles, because the lwlock is free. Greetings, Andres Freund
Re: add AVX2 support to simd.h
On Wed, Jan 3, 2024 at 10:29 PM Nathan Bossart wrote: > If the requirement is that normal builds use AVX2, then I fear we will be > waiting a long time. IIUC the current proposals (building multiple > binaries or adding a configuration option that maps to compiler flags) > would still be opt-in, If and when we get one of those, I would consider that a "normal" build. Since there are no concrete proposals yet, I'm still waiting for you to justify imposing an immediate maintenance cost for zero benefit.
Re: Improve WALRead() to suck data directly from WAL buffers when possible
On Wed, 2023-12-20 at 15:36 +0530, Bharath Rupireddy wrote: > Thanks. Attaching remaining patches as v18 patch-set after commits > c3a8e2a7cb16 and 766571be1659. Comments: I still think the right thing for this patch is to call XLogReadFromBuffers() directly from the callers who need it, and not change WALRead(). I am open to changing this later, but for now that makes sense to me so that we can clearly identify which callers benefit and why. I have brought this up a few times before[1][2], so there must be some reason that I don't understand -- can you explain it? The XLogReadFromBuffersResult is never used. I can see how it might be useful for testing or asserts, but it's not used even in the test module. I don't think we should clutter the API with that kind of thing -- let's just return the nread. I also do not like the terminology "partial hit" to be used in this way. Perhaps "short read" or something about hitting the end of readable WAL would be better? I like how the callers of WALRead() are being more precise about the bytes they are requesting. You've added several spinlock acquisitions to the loop. Two explicitly, and one implicitly in WaitXLogInsertionsToFinish(). These may allow you to read slightly further, but introduce performance risk. Was this discussed? The callers are not checking for XLREADBUGS_UNINITIALIZED_WAL, so it seems like there's a risk of getting partially-written data? And it's not clear to me the check of the wal page headers is the right one anyway. It seems like all of this would be simpler if you checked first how far you can safely read data, and then just loop and read that far. I'm not sure that it's worth it to try to mix the validity checks with the reading of the data. Regards, Jeff Davis [1] https://www.postgresql.org/message-id/4132fe48f831ed6f73a9eb191af5fe475384969c.camel%40j-davis.com [2] https://www.postgresql.org/message-id/2ef04861c0f77e7ae78b703770cc2bbbac3d85e6.ca...@j-davis.com
Re: Build versionless .so for Android
Hi, Attached a patch with a (hopefully) better wording of the comment. I have unsuccessfully tried to find an official source for this policy. So for reference some discussions about the topic: - https://stackoverflow.com/questions/11491065/linking-with-versioned-shared-library-in-android-ndk - https://stackoverflow.com/questions/18681401/how-can-i-remove-all-versioning-information-from-shared-object-files Please let me know if I can help in any way Matthias On Tue, Jan 2, 2024 at 6:43 PM Robert Haas wrote: > On Sun, Dec 31, 2023 at 1:24 AM Michael Paquier > wrote: > > FWIW, I have mixed feelings about patching the code to treat > > android-linux as an exception while changing nothing in the > > documentation to explain why this is required. A custom patch may > > serve you better here, and note that you did not even touch the meson > > paths. See libpq_c_args in src/interfaces/libpq/meson.build as one > > example. That's just my opinion, of course, still there are no > > buildfarm members that would cover your patch. > > It's reasonable to want good comments -- the one in the patch (1) > doesn't explain why this is required and (2) suggests that it is only > needed when cross-compiling which seems surprising and (3) has a typo. > But if it's true that this is needed, I tentatively think we might do > better to take the patch than force people to carry it out of tree. > > -- > Robert Haas > EDB: http://www.enterprisedb.com > diff --git a/doc/src/sgml/installation.sgml b/doc/src/sgml/installation.sgml index 75dc81a..9c88558 100644 --- a/doc/src/sgml/installation.sgml +++ b/doc/src/sgml/installation.sgml @@ -1989,6 +1989,14 @@ build-postgresql: among developers is to use PROFILE for one-time flag adjustments, while COPT might be kept set all the time. + + + When building libpq as shared libraries, the resulting files are suffixed + with the version libpq.so.5.[version] and an unversioned + symlink libpq.so to the versioned file is created.
An + exception to that is Android where library file names must end with + .so. Building for Android is only supported using make. + diff --git a/src/Makefile.shlib b/src/Makefile.shlib index 92d893e..4801b7a 100644 --- a/src/Makefile.shlib +++ b/src/Makefile.shlib @@ -187,7 +187,13 @@ endif ifeq ($(PORTNAME), linux) LINK.shared = $(COMPILER) -shared ifdef soname -LINK.shared += -Wl,-soname,$(soname) +ifeq (linux-android,$(findstring linux-android,$(host_os))) +# Android does not support versioned soname +shlib = lib$(NAME)$(DLSUFFIX) +LINK.shared += -Wl,-soname,lib$(NAME)$(DLSUFFIX) +else +LINK.shared += -Wl,-soname,$(soname) +endif endif BUILD.exports = ( echo '{ global:'; $(AWK) '/^[^\#]/ {printf "%s;\n",$$1}' $<; echo ' local: *; };' ) >$@ exports_file = $(SHLIB_EXPORTS:%.txt=%.list)
Re: Add a perl function in Cluster.pm to generate WAL
On Thu, Jan 04, 2024 at 04:00:01PM +0300, Alexander Lakhin wrote: > Reproduced here. Did you just make the run slow enough to show the failure with valgrind? > As I can see in the failure logs you referenced, the first problem is: > # Failed test 'inactiveslot slot invalidation is logged with vacuum on > pg_authid' > # at t/035_standby_logical_decoding.pl line 224. > It reminded me of: > https://www.postgresql.org/message-id/b2a1f7d0-7629-72c0-2da7-e9c4e336b010%40gmail.com > > It seems that it's not something new, and maybe that my analysis is still > valid. If so, VACUUM FREEZE/autovacuum = off should fix the issue. Not sure about that yet. -- Michael
Re: Improve rowcount estimate for UNNEST(column)
Paul Jungwirth writes: > Here is a patch with an improved test. With the old "10" estimate we get a > Merge Join, but now that > the planner can see there are only ~4 elements per array, we get a Nested > Loop. Pushed with minor editorialization. I ended up not using the test case, because I was afraid it wouldn't be all that stable, and code coverage shows that we are exercising the added code path even without a bespoke test case. > On 11/29/23 20:35, jian he wrote: >>> I did a minor change. change estimate_array_length return type to double > I'm not sure I want to change estimate_array_length from returning > ints to returning doubles, since it's called in many places. But your patch forces every one of those places to be touched anyway, as a consequence of adding the "root" argument. I looked at the callers and saw that every single one of them (in core anyway) ends up using the result in a "double" rowcount calculation, so we're really not buying anything by converting to integer and back again. There's also a question of whether the number from DECHIST could be big enough to overflow an int. Perhaps not given the current calculation method, but since it's a float4 there's at least a theoretical risk. Hence, I adopted Jian's suggestion. One other point is that examine_variable is capable of succeeding on things that aren't Vars, so I thought the restriction to Vars was inappropriate. regards, tom lane
Re: Adding facility for injection points (or probe points?) for more advanced tests
On Thu, Jan 04, 2024 at 04:31:02PM -0600, Nathan Bossart wrote: > On Thu, Jan 04, 2024 at 06:04:20PM +0530, Ashutosh Bapat wrote: >> 0003 and 0004 are using the extension in this module for some serious >> testing. The name of the extension test_injection_point indicates that >> it's for testing injection points and not for some serious use of >> injection callbacks it adds. Changes 0003 and 0004 suggest otherwise. > > Yeah, I think test_injection_point should be reserved for testing the > injection point machinery. Sure. FWIW, it makes sense to me to keep the SQL interface and the callbacks in the module, per the reasons below. >> I suggest we move test_injection_points from src/test/modules to >> contrib/ and rename it as "injection_points". The test files may still >> be named as test_injection_point. The TAP tests in 0003 and 0004 once >> moved to their appropriate places, will load injection_point extension >> and use it. That way predefined injection point callbacks will also be >> available for others to use. > > Rather than defining a module somewhere that tests would need to load, > should we just put the common callbacks in the core server? Unless there's > a strong reason to define them elsewhere, that could be a nice way to save > a step in the tests. Nah, having some pre-existing callbacks existing in the backend is against the original minimalistic design spirit. These would also require an SQL interface, and the interface design also depends on the functions registering them when pushing down custom conditions. Pushing that down to extensions to do what they want will lead to less noise, particularly if you consider that we will most likely want to tweak the callback interfaces for backpatched bugs. That's also why I think contrib/ is not a good idea, src/test/modules/ serving the actual testing purpose here. -- Michael
Re: Confine vacuum skip logic to lazy_scan_skip
On 1/4/24 2:23 PM, Andres Freund wrote: On 2024-01-02 12:36:18 -0500, Melanie Plageman wrote: Subject: [PATCH v2 1/6] lazy_scan_skip remove unnecessary local var rel_pages Subject: [PATCH v2 2/6] lazy_scan_skip remove unneeded local var nskippable_blocks I think these may lead to worse code - the compiler has to reload vacrel->rel_pages/next_unskippable_block for every loop iteration, because it can't guarantee that they're not changed within one of the external functions called in the loop body. Admittedly I'm not up to speed on recent vacuum changes, but I have to wonder if the concept of skipping should go away in the context of vector IO? Instead of thinking about "we can skip this range of blocks", why not maintain a list of "here's the next X number of blocks that we need to vacuum"? -- Jim Nasby, Data Architect, Austin TX
Re: the s_lock_stuck on perform_spin_delay
Hi, On 2024-01-04 14:59:06 +0800, Andy Fan wrote: > My question is if someone doesn't obey the rule by mistake (everyone > can make mistakes), shall we PANIC on a production environment? IMO I > think it can be a WARNING on a production environment and be stuck > when 'ifdef USE_ASSERT_CHECKING'. > > [...] > > I noticed this issue actually because of the patch "Cache relation > sizes?" from Thomas Munro [1], where the latest patch[2] still has the > following code. > + sr = smgr_alloc_sr(); <-- HERE a spin lock is held > + > + /* Upgrade to exclusive lock so we can create a mapping. */ > + LWLockAcquire(mapping_lock, LW_EXCLUSIVE); <-- HERE a complex > operation is needed. it may take a long time. > > Our internal testing system found more misuses on our own PG version. > I think an experienced engineer like Thomas can make this mistake and the > patch was reviewed by 3 people, the bug is still there. It is not easy > to say just don't do it. I don't follow this argument - just ignoring the problem, which emitting a WARNING basically is, doesn't reduce the impact of the bug, it *increases* the impact, because now the system will not recover from the bug without explicit operator intervention. During that time the system might be essentially unresponsive, because all backends end up contending for some spinlock, which makes investigating such issues very hard. I think we should add infrastructure to detect bugs like this during development, but not PANICing when this happens in production seems completely non-viable. Greetings, Andres Freund
Re: Emit fewer vacuum records by reaping removable tuples during pruning
On Thu, Jan 4, 2024 at 12:31 PM Robert Haas wrote: > > On Wed, Dec 27, 2023 at 11:27 AM Melanie Plageman > wrote: > > Do you have specific concerns about its correctness? I understand it > > is an area where we have to be sure we are correct. But, to be fair, > > that is true of all the pruning and vacuuming code. > > I'm kind of concerned that 0002 might be a performance regression. It > pushes more branches down into the heap-pruning code, which I think > can sometimes be quite hot, for the sake of a case that rarely occurs > in practice. I take your point about it improving things when there > are no indexes, but what about when there are? And even if there are > no adverse performance consequences, is it really worth complicating > the logic at such a low level? Regarding the additional code complexity, I think the extra call to lazy_vacuum_heap_page() in lazy_scan_heap() actually represents a fair amount of code complexity. It is a special case of page-level processing that should be handled by heap_page_prune() and not lazy_scan_heap(). lazy_scan_heap() is responsible for three main things -- loop through the blocks in a relation and process each page (pruning, freezing, etc), invoke index vacuuming, invoke functions to loop through dead_items and vacuum pages. The logic to do the per-page processing is spread across three places, though. When a single page is being processed, page pruning happens in heap_page_prune(). Freezing, dead items recording, and visibility checks happen in lazy_scan_prune(). Visibility map updates and freespace map updates happen back in lazy_scan_heap(). Except, if the table has no indexes, in which case, lazy_scan_heap() also invokes lazy_vacuum_heap_page() to set dead line pointers unused and do another separate visibility check and VM update. I maintain that all page-level processing should be done in the page-level processing functions (like lazy_scan_prune()). 
And lazy_scan_heap() shouldn't be directly responsible for special case page-level processing. > Also, I find "pronto_reap" to be a poor choice of name. "pronto" is an > informal word that seems to have no advantage over something like > "immediate" or "now," and I don't think "reap" has a precise, > universally-understood meaning. You could call this "mark_unused_now" > or "immediately_mark_unused" or something and it would be far more > self-documenting, IMHO. Yes, I see how pronto is unnecessarily informal. If there are no cases other than when the table has no indexes that we would consider immediately marking LPs unused, then perhaps it is better to call it "no_indexes" (per andres' suggestion)? - Melanie
Re: the s_lock_stuck on perform_spin_delay
On 1/4/24 10:33 AM, Tom Lane wrote: Robert Haas writes: On Thu, Jan 4, 2024 at 10:22 AM Tom Lane wrote: We should be making an effort to ban coding patterns like "return with spinlock still held", because they're just too prone to errors similar to this one. I agree that we don't want to add overhead, and also about how spinlocks should be used, but I dispute "easily detectable statically." I mean, if you or I look at some code that uses a spinlock, we'll know whether the pattern that you mention is being followed or not, modulo differences of opinion in debatable cases. But you and I cannot be there to look at all the code all the time. If we had a static checking tool that was run as part of every build, or in the buildfarm, or by cfbot, or somewhere else that raised the alarm if this rule was violated, then we could claim to be effectively enforcing this rule. I was indeed suggesting that maybe we could find a way to detect such things automatically. While I've not been paying close attention, I recall there's been some discussion of using LLVM/clang infrastructure for customized static analysis, so maybe it'd be possible without an undue amount of effort. FWIW, the lackey[1] tool in Valgrind is able to do some kinds of instruction counting, so it might be possible to measure how many instructions are actually being executed while holding a spinlock. Might be easier than code analysis. Another possibility might be using the CPU's timestamp counter. 1: https://valgrind.org/docs/manual/lk-manual.html -- Jim Nasby, Data Architect, Austin TX
Re: the s_lock_stuck on perform_spin_delay
Hi, On 2024-01-04 12:04:07 -0500, Robert Haas wrote: > On Thu, Jan 4, 2024 at 11:33 AM Tom Lane wrote: > > I believe it's (a). No matter what the reason for a stuck spinlock > > is, the only reliable method of getting out of the problem is to > > blow things up and start over. The patch proposed at the top of this > > thread would leave the system unable to recover on its own, with the > > only recourse being for the DBA to manually force a crash/restart ... > > once she figured out that that was necessary, which might take a long > > while if the only external evidence is an occasional WARNING that > > might not even be making it to the postmaster log. How's that better? > > It's a fair question. I think you're correct if we assume that > everyone's following the coding rule ... at least assuming that the > target system isn't too insanely slow, and I've seen some pretty > crazily overloaded machines. But if the coding rule is not being > followed, then "the only reliable method of getting out of the problem > is to blow things up and start over" becomes a dubious conclusion. If the coding rule isn't being followed, a crash restart is the least of one's problems... But that doesn't mean we shouldn't add infrastructure to make it easier to detect violations of the spinlock rules - we've had lots of buglets around this over the years ourselves, so we can hardly expect extension authors to get this right. Particularly because we don't even document the rules well, AFAIR. I think we should add cassert-only infrastructure tracking whether we currently hold spinlocks, are in a signal handler, and perhaps a few other states. That'd allow us to add assertions like: - no memory allocations when holding spinlocks or in signal handlers - no lwlocks while holding spinlocks - no lwlocks or spinlocks while in signal handlers > Also, I wonder if many or even all uses of spinlocks ought to be > replaced with either LWLocks or atomics.
An LWLock might be slightly > slower when contention is low, but it scales better when contention is > high, displays a wait event so that you can see that you have > contention if you do, and doesn't PANIC the system if the contention > gets too bad. And an atomic will just be faster, in the cases where > it's adequate. I tried to replace all - unfortunately the results were not great. The problem isn't primarily the lack of spinning (although it might be worth adding that to lwlocks) or the cost of error recovery; the problem is that a reader-writer lock is inherently more expensive than simpler locks that don't have multiple levels. One example of such increased overhead is that on x86 an lwlock unlock has to be an atomic operation (to maintain the lock count), whereas a spinlock unlock can just be a write + compiler barrier. Unfortunately the added atomic operation turns out to matter in some performance-critical cases like the insertpos_lck. I think we ought to split lwlocks into a reader/writer lock and a simpler mutex. The simpler mutex will still be slower than s_lock in some relevant cases, e.g. due to the error recovery handling, but it'd be "local" overhead, rather than something affecting scalability. FWIW, these days spinlocks do report a wait event when in perform_spin_delay() - albeit without detail on which lock is being held. Greetings, Andres Freund
Re: UUID v7
First of all, I'm a huge fan of UUID v7. So I'm very excited that this is progressing. I'm definitely going to look closer at this patch soon. Some tiny initial feedback: (bikeshed) I'd prefer renaming `get_uuid_v7_time` to the shorter `uuid_v7_time`; the `get_` prefix seems rarely used in Postgres functions (e.g. `date_part` is not called `get_date_part`). Also it's visually very similar to the gen_ prefix. On Thu, 4 Jan 2024 at 19:20, Andrey M. Borodin wrote: > > On 3 Jan 2024, at 04:37, Przemysław Sztoch wrote: > > 1. Is it possible to add a function that returns the version of the > > generated uuid? > > It will be very useful. > > I don't know if it's possible, but I think there are bits in the UUID that > > inform about the version. > What do you think if we have functions get_uuid_v7_ver(uuid) and > get_uuid_v7_var(uuid) to extract bit fields according to [0] ? Or, perhaps, > this should be one function with two return parameters? > It's not in a patch yet, I'm just considering what this functionality should > look like. I do agree that those functions would be useful, especially now that we're introducing a function that errors when it's passed a UUID that's not of version 7. With the version extraction function you could return something else for other uuids if you have many and not all of them are version 7. I do think though that these functions should not have v7 in their name, since they would apply to all uuids of all versions (so if also removing the get_ prefix they would be called uuid_ver and uuid_var). > > 4. Sometimes you will need to generate a uuid for historical time. There > > should be an additional function gen_uuid_v7(timestamp). > Done, please see patch attached. But I changed signature to > gen_uuid_v7(int8), to avoid messing with bytes from user who knows what they > want. Or do you think gen_uuid_v7(timestamp) would be more convenient? I think timestamp would be quite useful.
timestamp would encode the time in the same way as gen_uuid_v7() would, but based on the given time instead of the current time.
Re: Hide exposed impl detail of wchar.c
On Mon, Nov 20, 2023 at 10:39:43PM -0600, Nathan Bossart wrote: > Alright. The next minor release isn't until February, so I'll let this one > sit a little while longer in case anyone objects to back-patching something > like this [0]. > > [0] https://postgr.es/m/attachment/152305/move_is_valid_ascii_v2.patch Barring objections, I plan to commit this and back-patch it to v16 in the next few days. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: doing also VM cache snapshot and restore with pg_prewarm, having more information of the VM inside PostgreSQL
On 1/3/24 5:57 PM, Cedric Villemain wrote: for 15 years pgfincore has been sitting quietly and being used in large setups to help in HA and resources management. It can perfectly stay as is; to be honest I was expecting to one day include Windows support and propose that to PostgreSQL, but it appears getting support on Linux and BSD is more than enough today. So I wonder if there is interest in having virtual memory snapshot and restore operations with, for example, pg_prewarm/autowarm? IMHO, to improve the user experience here we'd need something that combined the abilities of all these extensions into a cohesive interface that allowed users to simply say "please get this data into cache". Simply moving pgfincore into core Postgres wouldn't satisfy that need. So I think the real question is whether the community feels support for better cache (buffercache + filesystem) management is a worthwhile feature to add to Postgres. Micromanaging cache contents for periodic jobs seems almost like a mis-feature. While it's a useful tool to have in the toolbox, it's also a non-trivial layer of complexity. IMHO not something we'd want to add. Though, there might be smaller items that would make creating tools to do that easier, such as some ability to see what blocks a backend is accessing (perhaps via a hook). On the surface, improving RTO via cache warming sounds interesting ... but I'm not sure how useful it would be in reality. Users that care about RTO would almost always have some form of hot standby, and generally those will already have a lot of data in cache. While they won't have read-only data in cache, I have to wonder if the answer to that problem is allowing writers to tell a replica what blocks are being read, so the replica can keep them in cache. Also, most (all?) operations that require a restart could be handled via a failover, so I'm not sure how much cache management moves the needle there.
Some use cases covered: snapshot/restore cache around cronjobs, around dumps, switchover, failover, on stop/start of postgres (think kernel upgrade with a cold restart), ... pgfincore also provides some nice information with mincore (on FreeBSD mincore is more interesting) or cachestat; again, it can remain as an out-of-tree extension, but I will be happy to add it to the commitfest if there is interest from the community. An example of cachestat output:

postgres=# select * from vm_relation_cachestat('foo', range:=1024*32);
 block_start | block_count | nr_pages | nr_cache | nr_dirty | nr_writeback | nr_evicted | nr_recently_evicted
-------------+-------------+----------+----------+----------+--------------+------------+---------------------
           0 |       32768 |    65536 |    62294 |        0 |            0 |       3242 |                3242
       32768 |       32768 |    65536 |    39279 |        0 |            0 |      26257 |               26257
       65536 |       32768 |    65536 |    22516 |        0 |            0 |      43020 |               43020
       98304 |       32768 |    65536 |    24944 |        0 |            0 |      40592 |               40592
      131072 |        1672 |     3344 |      487 |        0 |            0 |       2857 |                2857

Comments? --- Cédric Villemain +33 (0)6 20 30 22 52 https://Data-Bene.io PostgreSQL Expertise, Support, Training, R&D -- Jim Nasby, Data Architect, Austin TX
Re: Emit fewer vacuum records by reaping removable tuples during pruning
Thanks for the review! On Thu, Jan 4, 2024 at 3:03 PM Andres Freund wrote: > > On 2023-11-17 18:12:08 -0500, Melanie Plageman wrote: > > Assert(ItemIdIsNormal(lp)); > > htup = (HeapTupleHeader) PageGetItem(dp, lp); > > @@ -715,7 +733,17 @@ heap_prune_chain(Buffer buffer, OffsetNumber > > rootoffnum, > >* redirect the root to the correct chain member. > >*/ > > if (i >= nchain) > > - heap_prune_record_dead(prstate, rootoffnum); > > + { > > + /* > > + * If the relation has no indexes, we can remove dead > > tuples > > + * during pruning instead of marking their line > > pointers dead. Set > > + * this tuple's line pointer LP_UNUSED. > > + */ > > + if (prstate->pronto_reap) > > + heap_prune_record_unused(prstate, rootoffnum); > > + else > > + heap_prune_record_dead(prstate, rootoffnum); > > + } > > else > > heap_prune_record_redirect(prstate, rootoffnum, > > chainitems[i]); > > } > > @@ -726,9 +754,12 @@ heap_prune_chain(Buffer buffer, OffsetNumber > > rootoffnum, > >* item. This can happen if the loop in heap_page_prune > > caused us to > >* visit the dead successor of a redirect item before > > visiting the > >* redirect item. We can clean up by setting the redirect > > item to > > - * DEAD state. > > + * DEAD state. If pronto_reap is true, we can set it > > LP_UNUSED now. > >*/ > > - heap_prune_record_dead(prstate, rootoffnum); > > + if (prstate->pronto_reap) > > + heap_prune_record_unused(prstate, rootoffnum); > > + else > > + heap_prune_record_dead(prstate, rootoffnum); > > } > > > > return ndeleted; > > There's three new calls to heap_prune_record_unused() and the logic got more > nested. Is there a way to get to a nicer end result? So, I could do another loop through the line pointers in heap_page_prune() (after the loop calling heap_prune_chain()) and, if pronto_reap is true, set dead line pointers LP_UNUSED. Then, when constructing the WAL record, I would just not add the prstate.nowdead that I saved from heap_prune_chain() to the prune WAL record. 
This would eliminate the extra if statements from heap_prune_chain(). It will be more performant than sticking with the original (master) call to lazy_vacuum_heap_page(). However, I'm not convinced that the extra loop to set line pointers LP_DEAD -> LP_UNUSED is less confusing than keeping the if pronto_reap test in heap_prune_chain(). heap_prune_chain() is where line pointers' new values are decided. It seems weird to pick one new value for a line pointer in heap_prune_chain() and then pick another, different new value in a loop after heap_prune_chain(). I don't see any way to eliminate the if pronto_reap tests without a separate loop setting LP_DEAD->LP_UNUSED, though. > > From 608658f2cbc0acde55aac815c0fdb523ec24c452 Mon Sep 17 00:00:00 2001 > > From: Melanie Plageman > > Date: Mon, 13 Nov 2023 16:47:08 -0500 > > Subject: [PATCH v2 1/2] Indicate rel truncation unsafe in lazy_scan[no]prune > > > > Both lazy_scan_prune() and lazy_scan_noprune() must determine whether or > > not there are tuples on the page making rel truncation unsafe. > > LVRelState->nonempty_pages is updated to reflect this. Previously, both > > functions set an output parameter or output parameter member, hastup, to > > indicate that nonempty_pages should be updated to reflect the latest > > non-removable page. There doesn't seem to be any reason to wait until > > lazy_scan_[no]prune() returns to update nonempty_pages. Plenty of other > > counters in the LVRelState are updated in lazy_scan_[no]prune(). > > This allows us to get rid of the output parameter hastup. 
> > > > @@ -972,20 +970,21 @@ lazy_scan_heap(LVRelState *vacrel) > > continue; > > } > > > > - /* Collect LP_DEAD items in dead_items array, count > > tuples */ > > - if (lazy_scan_noprune(vacrel, buf, blkno, page, > > , > > + /* > > + * Collect LP_DEAD items in dead_items array, count > > tuples, > > + * determine if rel truncation is safe > > + */ > > + if (lazy_scan_noprune(vacrel, buf, blkno, page, > > > > )) > > { > > Sizefreespace = 0; > > > > /* > >* Processed page successfully (without > >
Re: Adding facility for injection points (or probe points?) for more advanced tests
On Thu, Jan 04, 2024 at 06:04:20PM +0530, Ashutosh Bapat wrote: > 0003 and 0004 are using the extension in this module for some serious > testing. The name of the extension test_injection_point indicates that > it's for testing injection points and not for some serious use of > injection callbacks it adds. Changes 0003 and 0004 suggest otherwise. Yeah, I think test_injection_point should be reserved for testing the injection point machinery. > I suggest we move test_injection_points from src/test/modules to > contrib/ and rename it as "injection_points". The test files may still > be named as test_injection_point. The TAP tests in 0003 and 0004 once > moved to their appropriate places, will load injection_point extension > and use it. That way predefined injection point callbacks will also be > available for others to use. Rather than defining a module somewhere that tests would need to load, should we just put the common callbacks in the core server? Unless there's a strong reason to define them elsewhere, that could be a nice way to save a step in the tests. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Adding facility for injection points (or probe points?) for more advanced tests
On Thu, Jan 04, 2024 at 08:53:11AM +0900, Michael Paquier wrote: > On Tue, Jan 02, 2024 at 11:14:56PM -0600, Nathan Bossart wrote: >> I'm wondering how important it is to cache the callbacks locally. >> load_external_function() won't reload an already-loaded library, so AFAICT >> this is ultimately just saving a call to dlsym(). > > This keeps a copy to a callback under the same address space, and I > guess that it would matter if the code where a callback is added gets > very hot because this means less function pointers. At the end I > would keep the cache as the code to handle it is neither complex nor > long, while being isolated in its own paths. Fair enough. >> 0003 and 0004 add tests to the test_injection_points module. Is the idea >> that we'd add any tests that required injection points here? I think it'd >> be better if we could move the tests closer to the logic they're testing, >> but perhaps that is difficult because you also need to define the callback >> functions somewhere. Hm... > > Yeah. Agreed that the final result should not have these tests in the > module test_injection_points. What I was thinking here is to move > 002_invalid_checkpoint_after_promote.pl to src/test/recovery/ and pull > the module with the callbacks with an EXTRA_INSTALL. +1 -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Add new for_each macros for iterating over a List that do not require ListCell pointer
Committed after some additional light edits. Thanks for the patch! On Wed, Jan 03, 2024 at 11:36:36PM +0100, Jelte Fennema-Nio wrote: > On Wed, 3 Jan 2024 at 23:13, Tom Lane wrote: >> I like Nathan's wording. > > To be clear, I don't want to block this patch on the wording of that > single comment. So, if you feel Nathan's wording was better, I'm fine > with that too. But let me respond to your arguments anyway: I decided to keep the v8 wording, if for no other reason than I didn't see the need for lots of detail about how it compiles. IMHO even the vague mention of loop unrolling is probably more than is really necessary. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: SET ROLE x NO RESET
On 1/3/24 5:47 PM, Nico Williams wrote: Though maybe `NO RESET` isn't really needed to build these, since after all one could use an unprivileged role and a SECURITY DEFINER function that does the `SET ROLE` following some user-defined authentication method, and so what if the client can RESET the role, since that brings it back to the otherwise unprivileged role. An unprivileged role that has the ability to assume any other role ;p Also, last I checked you can't do SET ROLE in a security definer function. Who needs to RESET roles anyways? Answer: connection pools, but not every connection is used via a pool. This brings up something: attempts to reset a NO RESET session need to fail in such a way that a connection pool can detect this and disconnect, or else it needs to fail by terminating the connection altogether. RESET ROLE is also useful for setting up a "superuser role" that DBAs have access to via a NO INHERIT role. It allows them to do things like... SET ROLE su; -- Do some superuserly thing RESET ROLE; Admittedly, the last step could be just to set their role back to themselves, but RESET ROLE removes the risk of typos. -- Jim Nasby, Data Architect, Austin TX
Re: PL/pgSQL: Incomplete item Allow handling of %TYPE arrays, e.g. tab.col%TYPE[]
Pavel Stehule writes: > Now, I think so this simple patch is ready for committers I pushed this with some editorialization -- mostly, rewriting the documentation and comments. I found that the existing docs for %TYPE were not great. There are two separate use-cases, one for referencing a table column and one for referencing a previously-declared variable, and the docs were about as clear as mud about explaining that. I also looked into the problem Pavel mentioned that it doesn't work for RECORD. If you just write "record[]" you get an error message that at least indicates it's an unsupported case: regression=# do $$declare r record[]; begin end$$; ERROR: variable "r" has pseudo-type record[] CONTEXT: compilation of PL/pgSQL function "inline_code_block" near line 1 Maybe we could improve on that, but it would be a lot of work and I'm not terribly excited about it. However, %TYPE fails entirely for both "record" and named composite types, and the reason turns out to be just that plpgsql_parse_wordtype fails to handle the PLPGSQL_NSTYPE_REC case. So that's easily fixed. I also wonder what the heck the last half of plpgsql_parse_wordtype is for at all. It looks for a named type, which means you can do regression=# do $$declare x float8%type; begin end$$; DO but that's just stupid. You could leave off the %TYPE and get the same result. Moreover, it is inconsistent because plpgsql_parse_cwordtype has no equivalent behavior: regression=# do $$declare x pg_catalog.float8%type; begin end$$; ERROR: syntax error at or near "%" LINE 1: do $$declare x pg_catalog.float8%type; begin end$$; ^ CONTEXT: invalid type name "pg_catalog.float8%type" It's also undocumented and untested (the code coverage report shows this part is never reached). So I propose we remove it. That leads me to the attached proposed follow-on patch. 
Another thing we could think about, but I've not done it here, is to make plpgsql_parse_wordtype and friends throw error instead of just returning NULL when they don't find the name. Right now, if NULL is returned, we end up passing the whole string to parse_datatype, leading to unhelpful errors like the one shown above. We could do better than that I think, perhaps like "argument of %TYPE is not a known variable". regards, tom lane diff --git a/src/pl/plpgsql/src/expected/plpgsql_record.out b/src/pl/plpgsql/src/expected/plpgsql_record.out index afb922df29..36d65e4286 100644 --- a/src/pl/plpgsql/src/expected/plpgsql_record.out +++ b/src/pl/plpgsql/src/expected/plpgsql_record.out @@ -306,6 +306,32 @@ NOTICE: r1 = (1,2) ERROR: record "r1" has no field "nosuchfield" CONTEXT: SQL expression "r1.nosuchfield" PL/pgSQL function inline_code_block line 9 at RAISE +-- check that type record can be passed through %type +do $$ +declare r1 record; +r2 r1%type; +begin + r2 := row(1,2); + raise notice 'r2 = %', r2; + r2 := row(3,4,5); + raise notice 'r2 = %', r2; +end$$; +NOTICE: r2 = (1,2) +NOTICE: r2 = (3,4,5) +-- arrays of record are not supported at the moment +do $$ +declare r1 record[]; +begin +end$$; +ERROR: variable "r1" has pseudo-type record[] +CONTEXT: compilation of PL/pgSQL function "inline_code_block" near line 2 +do $$ +declare r1 record; +r2 r1%type[]; +begin +end$$; +ERROR: variable "r2" has pseudo-type record[] +CONTEXT: compilation of PL/pgSQL function "inline_code_block" near line 3 -- check repeated assignments to composite fields create table some_table (id int, data text); do $$ diff --git a/src/pl/plpgsql/src/pl_comp.c b/src/pl/plpgsql/src/pl_comp.c index b745eaa3f8..c63719c796 100644 --- a/src/pl/plpgsql/src/pl_comp.c +++ b/src/pl/plpgsql/src/pl_comp.c @@ -1596,8 +1596,8 @@ plpgsql_parse_tripword(char *word1, char *word2, char *word3, /* -- - * plpgsql_parse_wordtype The scanner found word%TYPE. word can be - *a variable name or a basetype. 
+ * plpgsql_parse_wordtype The scanner found word%TYPE. word should be + *a pre-existing variable name. * * Returns datatype struct, or NULL if no match found for word. * -- @@ -1605,10 +1605,7 @@ plpgsql_parse_tripword(char *word1, char *word2, char *word3, PLpgSQL_type * plpgsql_parse_wordtype(char *ident) { - PLpgSQL_type *dtype; PLpgSQL_nsitem *nse; - TypeName *typeName; - HeapTuple typeTup; /* * Do a lookup in the current namespace stack @@ -1623,39 +1620,13 @@ plpgsql_parse_wordtype(char *ident) { case PLPGSQL_NSTYPE_VAR: return ((PLpgSQL_var *) (plpgsql_Datums[nse->itemno]))->datatype; - -/* XXX perhaps allow REC/ROW here? */ - + case PLPGSQL_NSTYPE_REC: +return ((PLpgSQL_rec *) (plpgsql_Datums[nse->itemno]))->datatype; default: return NULL; } } - /* - * Word wasn't found in the namespace stack. Try to find a data type with - * that name, but ignore shell types and complex types. - */ - typeName = makeTypeName(ident); - typeTup =
Re: Confine vacuum skip logic to lazy_scan_skip
Hi, On 2024-01-02 12:36:18 -0500, Melanie Plageman wrote: > Subject: [PATCH v2 1/6] lazy_scan_skip remove unnecessary local var rel_pages > Subject: [PATCH v2 2/6] lazy_scan_skip remove unneeded local var > nskippable_blocks I think these may lead to worse code - the compiler has to reload vacrel->rel_pages/next_unskippable_block for every loop iteration, because it can't guarantee that they're not changed within one of the external functions called in the loop body. > Subject: [PATCH v2 3/6] Add lazy_scan_skip unskippable state > > Future commits will remove all skipping logic from lazy_scan_heap() and > confine it to lazy_scan_skip(). To make those commits more clear, first > introduce the struct, VacSkipState, which will maintain the variables > needed to skip ranges less than SKIP_PAGES_THRESHOLD. Why not add this to LVRelState, possibly as a struct embedded within it? > From 335faad5948b2bec3b83c2db809bb9161d373dcb Mon Sep 17 00:00:00 2001 > From: Melanie Plageman > Date: Sat, 30 Dec 2023 16:59:27 -0500 > Subject: [PATCH v2 4/6] Confine vacuum skip logic to lazy_scan_skip > > In preparation for vacuum to use the streaming read interface (and eventually > AIO), refactor vacuum's logic for skipping blocks such that it is entirely > confined to lazy_scan_skip(). This turns lazy_scan_skip() and the VacSkipState > it uses into an iterator which yields blocks to lazy_scan_heap(). Such a > structure is conducive to an async interface. And it's cleaner - I find the current code extremely hard to reason about. > By always calling lazy_scan_skip() -- instead of only when we have reached the > next unskippable block, we no longer need the skipping_current_range variable. > lazy_scan_heap() no longer needs to manage the skipped range -- checking if we > reached the end in order to then call lazy_scan_skip(). 
And lazy_scan_skip() > can derive the visibility status of a block from whether or not we are in a > skippable range -- that is, whether or not the next_block is equal to the next > unskippable block. I wonder if it should be renamed as part of this - the name is somewhat confusing now (and perhaps before)? lazy_scan_get_next_block() or such? > + while (true) > { > Buffer buf; > Pagepage; > - boolall_visible_according_to_vm; > LVPagePruneState prunestate; > > - if (blkno == vacskip.next_unskippable_block) > - { > - /* > - * Can't skip this page safely. Must scan the page. > But > - * determine the next skippable range after the page > first. > - */ > - all_visible_according_to_vm = > vacskip.next_unskippable_allvis; > - lazy_scan_skip(vacrel, , blkno + 1); > - > - Assert(vacskip.next_unskippable_block >= blkno + 1); > - } > - else > - { > - /* Last page always scanned (may need to set > nonempty_pages) */ > - Assert(blkno < rel_pages - 1); > - > - if (vacskip.skipping_current_range) > - continue; > + blkno = lazy_scan_skip(vacrel, , blkno + 1, > + > _visible_according_to_vm); > > - /* Current range is too small to skip -- just scan the > page */ > - all_visible_according_to_vm = true; > - } > + if (blkno == InvalidBlockNumber) > + break; > > vacrel->scanned_pages++; > I don't like that we still do determination about the next block outside of lazy_scan_skip() and have duplicated exit conditions between lazy_scan_skip() and lazy_scan_heap(). I'd probably change the interface to something like while (lazy_scan_get_next_block(vacrel, )) { ... } > From b6603e35147c4bbe3337280222e6243524b0110e Mon Sep 17 00:00:00 2001 > From: Melanie Plageman > Date: Sun, 31 Dec 2023 09:47:18 -0500 > Subject: [PATCH v2 5/6] VacSkipState saves reference to LVRelState > > The streaming read interface can only give pgsr_next callbacks access to > two pieces of private data. As such, move a reference to the LVRelState > into the VacSkipState. 
> > This is a separate commit (as opposed to as part of the commit > introducing VacSkipState) because it is required for using the streaming > read interface but not a natural change on its own. VacSkipState is per > block and the LVRelState is referenced for the whole relation vacuum. I'd do it the other way round, i.e. either embed VacSkipState ino LVRelState or point to it from VacSkipState. LVRelState is already tied to the iteration state, so I don't think there's a reason not to do so. Greetings, Andres Freund
Re: Emit fewer vacuum records by reaping removable tuples during pruning
Hi, On 2023-11-17 18:12:08 -0500, Melanie Plageman wrote: > diff --git a/src/backend/access/heap/heapam.c > b/src/backend/access/heap/heapam.c > index 14de8158d49..b578c32eeb6 100644 > --- a/src/backend/access/heap/heapam.c > +++ b/src/backend/access/heap/heapam.c > @@ -8803,8 +8803,13 @@ heap_xlog_prune(XLogReaderState *record) > nunused = (end - nowunused); > Assert(nunused >= 0); > > - /* Update all line pointers per the record, and repair > fragmentation */ > - heap_page_prune_execute(buffer, > + /* > + * Update all line pointers per the record, and repair > fragmentation. > + * We always pass pronto_reap as true, because we don't know > whether > + * or not this option was used when pruning. This reduces the > + * validation done on replay in an assert build. > + */ Hm, that seems not great. Both because we loose validation and because it seems to invite problems down the line, due to pronto_reap falsely being set to true in heap_page_prune_execute(). > @@ -581,7 +589,17 @@ heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum, >* function.) >*/ > if (ItemIdIsDead(lp)) > + { > + /* > + * If the relation has no indexes, we can set dead line > pointers > + * LP_UNUSED now. We don't increment ndeleted here > since the LP > + * was already marked dead. > + */ > + if (prstate->pronto_reap) > + heap_prune_record_unused(prstate, offnum); > + > break; > + } I wasn't immediately sure whether this is reachable - but it is, e.g. after on-access pruning (which currently doesn't yet use pronto reaping), after pg_upgrade or dropping an index. > Assert(ItemIdIsNormal(lp)); > htup = (HeapTupleHeader) PageGetItem(dp, lp); > @@ -715,7 +733,17 @@ heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum, >* redirect the root to the correct chain member. >*/ > if (i >= nchain) > - heap_prune_record_dead(prstate, rootoffnum); > + { > + /* > + * If the relation has no indexes, we can remove dead > tuples > + * during pruning instead of marking their line > pointers dead. 
Set > + * this tuple's line pointer LP_UNUSED. > + */ > + if (prstate->pronto_reap) > + heap_prune_record_unused(prstate, rootoffnum); > + else > + heap_prune_record_dead(prstate, rootoffnum); > + } > else > heap_prune_record_redirect(prstate, rootoffnum, > chainitems[i]); > } > @@ -726,9 +754,12 @@ heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum, >* item. This can happen if the loop in heap_page_prune caused > us to >* visit the dead successor of a redirect item before visiting > the >* redirect item. We can clean up by setting the redirect item > to > - * DEAD state. > + * DEAD state. If pronto_reap is true, we can set it LP_UNUSED > now. >*/ > - heap_prune_record_dead(prstate, rootoffnum); > + if (prstate->pronto_reap) > + heap_prune_record_unused(prstate, rootoffnum); > + else > + heap_prune_record_dead(prstate, rootoffnum); > } > > return ndeleted; There's three new calls to heap_prune_record_unused() and the logic got more nested. Is there a way to get to a nicer end result? > From 608658f2cbc0acde55aac815c0fdb523ec24c452 Mon Sep 17 00:00:00 2001 > From: Melanie Plageman > Date: Mon, 13 Nov 2023 16:47:08 -0500 > Subject: [PATCH v2 1/2] Indicate rel truncation unsafe in lazy_scan[no]prune > > Both lazy_scan_prune() and lazy_scan_noprune() must determine whether or > not there are tuples on the page making rel truncation unsafe. > LVRelState->nonempty_pages is updated to reflect this. Previously, both > functions set an output parameter or output parameter member, hastup, to > indicate that nonempty_pages should be updated to reflect the latest > non-removable page. There doesn't seem to be any reason to wait until > lazy_scan_[no]prune() returns to update nonempty_pages. Plenty of other > counters in the LVRelState are updated in lazy_scan_[no]prune(). > This allows us to get rid of the output parameter hastup. > @@ -972,20 +970,21 @@ lazy_scan_heap(LVRelState *vacrel) > continue; > } > > - /* Collect LP_DEAD items in dead_items array, count > tuples
Re: Emit fewer vacuum records by reaping removable tuples during pruning
Hi, On 2024-01-04 12:31:36 -0500, Robert Haas wrote: > On Wed, Dec 27, 2023 at 11:27 AM Melanie Plageman > wrote: > > Do you have specific concerns about its correctness? I understand it > > is an area where we have to be sure we are correct. But, to be fair, > > that is true of all the pruning and vacuuming code. > > I'm kind of concerned that 0002 might be a performance regression. It > pushes more branches down into the heap-pruning code, which I think > can sometimes be quite hot, for the sake of a case that rarely occurs > in practice. I was wondering the same when talking about this with Melanie. But I guess there are some scenarios that aren't unrealistic, consider e.g. bulk data loading with COPY with an UPDATE to massage the data afterwards, before creating the indexes. Where I could see this becoming more interesting / powerful is if we were able to do 'pronto reaping' not just from vacuum but also during on-access pruning. For ETL workloads VACUUM might often not run between modifications of the same page, but on-access pruning will. Being able to reclaim dead items at that point seems like it could be a pretty sizable improvement? > I take your point about it improving things when there are no indexes, but > what about when there are? I suspect that branch prediction should take care of the additional branches. heap_prune_chain() indeed can be very hot, but I think we're primarily bottlenecked on data cache misses. For a single VACUUM, whether we'd do pronto reaping would be constant -> very well predictable. We could add an unlikely() to make sure the branch-predictor-is-empty case optimizes for the much more common case of having indexes. Falsely assuming we'd not pronto reap wouldn't be particularly bad, as the wins for the case are so much bigger. If we were to use pronto reaping for on-access pruning, it's perhaps a bit less predictable, as pruning for pages of a relation with indexes could be interleaved with pruning for relations without.
But even there I suspect it'd not be the primary bottleneck: We call heap_page_prune_chain() in a loop for every tuple on a page, the branch predictor should quickly learn whether we're using pronto reaping. Whereas we're not becoming less cache-miss heavy when looking at subsequent tuples. > And even if there are no adverse performance consequences, is it really > worth complicating the logic at such a low level? Yes, I think this is the main question here. It's not clear though that the state after the patch is meaningfully more complicated? It removes nontrivial code from lazy_scan_heap() and pays for that with a bit more complexity in heap_prune_chain(). > Also, I find "pronto_reap" to be a poor choice of name. "pronto" is an > informal word that seems to have no advantage over something like > "immediate" or "now," I didn't like that either :) > and I don't think "reap" has a precise, universally-understood meaning. Less concerned about that. > You could call this "mark_unused_now" or "immediately_mark_unused" or > something and it would be far more self-documenting, IMHO. How about 'no_indexes' or such? Greetings, Andres Freund
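To illustrate the branch-hint idea in the paragraphs above: a self-contained sketch of a hot prune-style loop where the rare branch is marked with unlikely(). The flag name `no_indexes`, the item states, and the loop are all invented for illustration; this is not the actual patch code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* GCC/Clang branch hint, the usual definition behind unlikely(). */
#define unlikely(x) __builtin_expect((x) != 0, 0)

/* Toy stand-ins for line-pointer states on a heap page. */
enum item_state { ITEM_NORMAL, ITEM_DEAD, ITEM_UNUSED };

/*
 * Sketch of the hot-loop shape discussed above: the per-scan flag
 * (here "no_indexes") is constant for a whole VACUUM, so the branch
 * predictor learns it quickly; unlikely() additionally biases code
 * layout toward the common with-indexes case.
 */
static int
prune_items(enum item_state *items, size_t n, bool no_indexes)
{
    int ndeleted = 0;

    for (size_t i = 0; i < n; i++)
    {
        if (items[i] != ITEM_DEAD)
            continue;

        if (unlikely(no_indexes))
            items[i] = ITEM_UNUSED; /* reap immediately */
        /* else: leave the item dead for a later index-vacuum pass */

        ndeleted++;
    }
    return ndeleted;
}
```

The point of the sketch is only that the extra branch is uniform across a page's tuples, which is the well-predictable case Andres describes.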
Re: ALTER COLUMN ... SET EXPRESSION to alter stored generated column's expression
On 25.12.23 13:10, Amul Sul wrote: > I think we can't support that (like alter type) since we need to place > this new > pass before AT_PASS_OLD_INDEX & AT_PASS_OLD_CONSTR to re-add indexes and > constraints for the validation. Could we have AT_PASS_ADD_COL before AT_PASS_OLD_*? So overall it would be:

    ...
    AT_PASS_ALTER_TYPE,
    AT_PASS_ADD_COL,        // moved
    AT_PASS_SET_EXPRESSION, // new
    AT_PASS_OLD_INDEX,
    AT_PASS_OLD_CONSTR,
    AT_PASS_ADD_CONSTR,
    ...

This appears to not break any existing tests. (Sorry for the delay) Agree. I did that change in the 0001 patch. I have committed this patch set. A couple of notes: You had included the moving of the AT_PASS_ADD_COL enum in the first patch. This is not a good style. Refactoring patches should not include semantic changes. I have moved that change to the final patch that introduced the new feature. I did not commit the 0002 patch that renamed some functions. I think names like AlterColumn are too general, which makes this renaming possibly counterproductive. I don't know a better name, maybe AlterTypeOrSimilar, but that's obviously silly. I think leaving it at AlterType isn't too bad, since most of the code is indeed for ALTER TYPE support. We can reconsider this if we have a better idea. In RememberAllDependentForRebuilding(), I dropped some of the additional errors that you introduced for the AT_SetExpression cases. These didn't seem useful. For example, it is not possible for a generated column to depend on another generated column, so there is no point checking for it. Also, there were no test cases that covered any of these situations. If we do want any of these, we should have tests and documentation for them. For the tests that examine the EXPLAIN plans, I had to add an ANALYZE after the SET EXPRESSION. Otherwise, I got unstable test results, presumably because in some cases the analyze happened in the background.
Re: add AVX2 support to simd.h
On Tue, Jan 02, 2024 at 10:11:23AM -0600, Nathan Bossart wrote: > (In case it isn't clear, I'm volunteering to set up such a buildfarm > machine.) I set up "akepa" to run with -march=x86-64-v3. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Emit fewer vacuum records by reaping removable tuples during pruning
On Wed, Dec 27, 2023 at 11:27 AM Melanie Plageman wrote: > Do you have specific concerns about its correctness? I understand it > is an area where we have to be sure we are correct. But, to be fair, > that is true of all the pruning and vacuuming code. I'm kind of concerned that 0002 might be a performance regression. It pushes more branches down into the heap-pruning code, which I think can sometimes be quite hot, for the sake of a case that rarely occurs in practice. I take your point about it improving things when there are no indexes, but what about when there are? And even if there are no adverse performance consequences, is it really worth complicating the logic at such a low level? Also, I find "pronto_reap" to be a poor choice of name. "pronto" is an informal word that seems to have no advantage over something like "immediate" or "now," and I don't think "reap" has a precise, universally-understood meaning. You could call this "mark_unused_now" or "immediately_mark_unused" or something and it would be far more self-documenting, IMHO. -- Robert Haas EDB: http://www.enterprisedb.com
Re: the s_lock_stuck on perform_spin_delay
On Thu, Jan 4, 2024 at 11:33 AM Tom Lane wrote: > I believe it's (a). No matter what the reason for a stuck spinlock > is, the only reliable method of getting out of the problem is to > blow things up and start over. The patch proposed at the top of this > thread would leave the system unable to recover on its own, with the > only recourse being for the DBA to manually force a crash/restart ... > once she figured out that that was necessary, which might take a long > while if the only external evidence is an occasional WARNING that > might not even be making it to the postmaster log. How's that better? It's a fair question. I think you're correct if we assume that everyone's following the coding rule ... at least assuming that the target system isn't too insanely slow, and I've seen some pretty crazily overloaded machines. But if the coding rule is not being followed, then "the only reliable method of getting out of the problem is to blow things up and start over" becomes a dubious conclusion. Also, I wonder if many or even all uses of spinlocks ought to be replaced with either LWLocks or atomics. An LWLock might be slightly slower when contention is low, but it scales better when contention is high, displays a wait event so that you can see that you have contention if you do, and doesn't PANIC the system if the contention gets too bad. And an atomic will just be faster, in the cases where it's adequate. The trouble with trying to do a replacement is that some of the spinlock-using code is ancient and quite hairy. info_lck in particular looks like a hot mess -- it's used in complex ways and in performance-critical paths, with terrifying comments like this: * To read XLogCtl->LogwrtResult, you must hold either info_lck or * WALWriteLock. To update it, you need to hold both locks. The point of * this arrangement is that the value can be examined by code that already * holds WALWriteLock without needing to grab info_lck as well.
In addition * to the shared variable, each backend has a private copy of LogwrtResult, * which is updated when convenient. * * The request bookkeeping is simpler: there is a shared XLogCtl->LogwrtRqst * (protected by info_lck), but we don't need to cache any copies of it. * * info_lck is only held long enough to read/update the protected variables, * so it's a plain spinlock. The other locks are held longer (potentially * over I/O operations), so we use LWLocks for them. These locks are: But info_lck was introduced in 1999 and this scheme was introduced in 2012, and a lot has changed since then. Whatever benchmarking was done to validate this locking regime is probably obsolete at this point. Back then, LWLocks were built on top of spinlocks, and were, I believe, a lot slower than they are now. Plus CPU performance characteristics have changed a lot. So who really knows if the way we're doing things here makes any sense at all these days? But one doesn't want to do a naive replacement and pessimize things, either. -- Robert Haas EDB: http://www.enterprisedb.com
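As a rough illustration of the "atomics" alternative mentioned above, here is a self-contained C11 sketch of publishing a monotonically advancing 64-bit position without a spinlock. This is not the actual PostgreSQL code: the real LogwrtResult is a struct of two positions, and Postgres uses its own pg_atomic_* wrappers rather than `<stdatomic.h>`; the names below are invented.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/*
 * Hypothetical shared state: the last WAL byte written, published
 * with atomics so readers never need to take info_lck.
 */
typedef struct
{
    _Atomic uint64_t write_pos;
} WalProgress;

static void
advance_write_pos(WalProgress *p, uint64_t upto)
{
    /* Writer side: release-store so readers see prior writes first. */
    atomic_store_explicit(&p->write_pos, upto, memory_order_release);
}

static uint64_t
read_write_pos(WalProgress *p)
{
    /* Reader side: lock-free acquire-load from any backend. */
    return atomic_load_explicit(&p->write_pos, memory_order_acquire);
}
```

A single 64-bit field is the easy case; the hard part Robert alludes to is that LogwrtResult couples two positions under one lock, which a naive per-field atomic replacement would not preserve.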
Re: the s_lock_stuck on perform_spin_delay
Robert Haas writes: > On Thu, Jan 4, 2024 at 10:22 AM Tom Lane wrote: >> We should be making an effort to ban coding patterns like >> "return with spinlock still held", because they're just too prone >> to errors similar to this one. > I agree that we don't want to add overhead, and also about how > spinlocks should be used, but I dispute "easily detectable > statically." I mean, if you or I look at some code that uses a > spinlock, we'll know whether the pattern that you mention is being > followed or not, modulo differences of opinion in debatable cases. But > you and I cannot be there to look at all the code all the time. If we > had a static checking tool that was run as part of every build, or in > the buildfarm, or by cfbot, or somewhere else that raised the alarm if > this rule was violated, then we could claim to be effectively > enforcing this rule. I was indeed suggesting that maybe we could find a way to detect such things automatically. While I've not been paying close attention, I recall there's been some discussions of using LLVM/clang infrastructure for customized static analysis, so maybe it'd be possible without an undue amount of effort. > I think the question we should be asking here is what the purpose of > the PANIC is. I can think of two possible purposes. It could be either > (a) an attempt to prevent real-world harm by turning database hangs > into database panics, so that at least the system will restart and get > moving again instead of sitting there stuck for all eternity or (b) an > attempt to punish people for writing bad code by turning coding rule > violations into panics on production systems. I believe it's (a). No matter what the reason for a stuck spinlock is, the only reliable method of getting out of the problem is to blow things up and start over. The patch proposed at the top of this thread would leave the system unable to recover on its own, with the only recourse being for the DBA to manually force a crash/restart ... 
once she figured out that that was necessary, which might take a long while if the only external evidence is an occasional WARNING that might not even be making it to the postmaster log. How's that better? > ... (b3) if the PANIC does fire, it > gives you basically zero help in figuring out where the actual problem > is. The PostgreSQL code base is way too big for "ERROR: you screwed > up" to be an acceptable diagnostic. Ideally I agree with the latter, but that doesn't mean that doing better is easy or even possible. (The proposed patch certainly does nothing to help diagnose such issues.) As for the former point, panicking here at least offers the chance of getting a stack trace, which might help a developer find the problem. regards, tom lane
Re: the s_lock_stuck on perform_spin_delay
On Thu, Jan 4, 2024 at 10:22 AM Tom Lane wrote: > I'm not a fan of adding overhead to such a performance-critical > thing as spinlocks in order to catch coding errors that are easily > detectable statically. IMV the correct usage of spinlocks is that > they should only be held across *short, straight line* code segments. > We should be making an effort to ban coding patterns like > "return with spinlock still held", because they're just too prone > to errors similar to this one. Note that trying to take another > lock is far from the only bad thing that can happen if you're > not very conservative about what code can execute with a spinlock > held. I agree that we don't want to add overhead, and also about how spinlocks should be used, but I dispute "easily detectable statically." I mean, if you or I look at some code that uses a spinlock, we'll know whether the pattern that you mention is being followed or not, modulo differences of opinion in debatable cases. But you and I cannot be there to look at all the code all the time. If we had a static checking tool that was run as part of every build, or in the buildfarm, or by cfbot, or somewhere else that raised the alarm if this rule was violated, then we could claim to be effectively enforcing this rule. But with 20-30 active committers and ~100 active developers at any given point in time, any system that relies on every relevant person knowing all the rules and remembering to enforce them on every commit is bound to be less than 100% effective. Some people won't know what the rule is, some people will decide that their particular situation is Very Special, some people will just forget to check for violations, and some people will check for violations but miss something. I think the question we should be asking here is what the purpose of the PANIC is. I can think of two possible purposes. 
It could be either (a) an attempt to prevent real-world harm by turning database hangs into database panics, so that at least the system will restart and get moving again instead of sitting there stuck for all eternity or (b) an attempt to punish people for writing bad code by turning coding rule violations into panics on production systems. If it's (a), that's defensible, though we can still ask whether it does more harm than good. If it's (b), that's not a good way of handling that problem, because (b1) it affects production builds and not just development builds, (b2) many coding rule violations are vanishingly unlikely to trigger that PANIC in practice, and (b3) if the PANIC does fire, it gives you basically zero help in figuring out where the actual problem is. The PostgreSQL code base is way too big for "ERROR: you screwed up" to be an acceptable diagnostic. -- Robert Haas EDB: http://www.enterprisedb.com
Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)
On Thu, 28 Dec 2023 at 09:27, jian he wrote: > > On Wed, Dec 20, 2023 at 8:27 PM Masahiko Sawada wrote: > > > > > > Why do we need to use SPI? I think we can form heap tuples and insert > > them to the error table. Creating the error table also doesn't need to > > use SPI. > > > Thanks for pointing it out. I figured out how to form heap tuples and > insert them to the error table. > but I don't know how to create the error table without using SPI. > Please pointer it out. > > > > > > > copy_errors one per schema. > > > foo.copy_errors will be owned by the schema: foo owner. > > > > It seems that the error table is created when the SAVE_ERROR is used > > for the first time. It probably blocks concurrent COPY FROM commands > > with SAVE_ERROR option to different tables if the error table is not > > created yet. > > > I don't know how to solve this problem Maybe we can document this. > but it will block the COPY FROM immediately. > > > > > > > if you can insert to a table in that specific schema let's say foo, > > > then you will get privilege to INSERT/DELETE/SELECT > > > to foo.copy_errors. > > > If you are not a superuser, you are only allowed to do > > > INSERT/DELETE/SELECT on foo.copy_errors rows where USERID = > > > current_user::regrole::oid. > > > This is done via row level security. > > > > I don't think it works. If the user is dropped, the user's oid could > > be reused for a different user. > > > > You are right. > so I changed, now the schema owner will be the error table owner. > every error table tuple inserts, > I switch to schema owner, do the insert, then switch back to the > COPY_FROM operation user. > now everyone (except superuser) will need explicit grant to access the > error table. 
There are some compilation issues reported at [1] for the patch: [04:04:26.288] copyfromparse.c: In function ‘NextCopyFrom’: [04:04:26.288] copyfromparse.c:1126:25: error: ‘copy_errors_tupDesc’ may be used uninitialized in this function [-Werror=maybe-uninitialized] [04:04:26.288] 1126 | copy_errors_tup = heap_form_tuple(copy_errors_tupDesc, [04:04:26.288] | ^~~~ [04:04:26.288] 1127 | t_values, [04:04:26.288] | ~ [04:04:26.288] 1128 | t_isnull); [04:04:26.288] | ~ [04:04:26.288] copyfromparse.c:1160:4: error: ‘copy_errorsrel’ may be used uninitialized in this function [-Werror=maybe-uninitialized] [04:04:26.288] 1160 | table_close(copy_errorsrel, RowExclusiveLock); [04:04:26.288] | ^ [1] - https://cirrus-ci.com/task/4785221183209472 Regards, Vignesh
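For what it's worth, both warnings follow the usual -Wmaybe-uninitialized shape: a variable assigned only under one condition and then used under another condition the compiler cannot prove equivalent. A minimal illustration of the pattern and the conventional fix of initializing at the declaration (all names hypothetical, not the patch's actual code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * "slot" is set only when save_errors is true, then dereferenced
 * behind a NULL check.  Initializing it at the declaration both
 * silences -Wmaybe-uninitialized and makes the NULL check sound.
 */
static int
lookup(bool save_errors)
{
    int *slot = NULL;   /* initialized: no maybe-uninitialized warning */
    int value = 0;

    if (save_errors)
    {
        value = 42;
        slot = &value;
    }

    if (slot != NULL)
        return *slot;
    return -1;
}
```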
Re: Proposal to add page headers to SLRU pages
> On Jan 2, 2024, at 19:35, Aleksander Alekseev > wrote: > > Thanks for the updated patch. > > cfbot seems to have some complaints regarding compiler warnings and > also building the patch on Windows: > > http://cfbot.cputube.org/ Thanks for the information. Here is the updated patch. Regards, Yong slru_page_header_v3.patch
Re: speed up a logical replica setup
On Thu, Jan 4, 2024, at 3:05 AM, Amit Kapila wrote: > Won't it be a better user experience that after setting up the target > server as a logical replica (subscriber), it started to work > seamlessly without user intervention? If we have an option to control the replication slot removal (default is on), it seems a good UI. Even if the user decides to disable the replication slot removal, it should print a message saying that these replication slots can cause WAL retention. > > The initial version had an option to stop the subscriber. I decided to > > remove the option and stop the subscriber by default mainly because (1) it > > is > > an extra step to start the server (another point is that the WAL retention > > doesn't happen due to additional (synchronized?) replication slots on > > subscriber -- point 2). It was a conservative choice. If point 2 isn't an > > issue, imo point 1 is no big deal. > > > > By point 2, do you mean to have a check for "max replication slots"? > It so, the one possibility is to even increase that config, if the > required max_replication_slots is low. By point 2, I mean WAL retention (sentence inside parenthesis). -- Euler Taveira EDB https://www.enterprisedb.com/
Re: speed up a logical replica setup
On Thu, Jan 4, 2024, at 2:41 AM, Amit Kapila wrote: > So, you also seem to be saying that it is not required once > pg_subscriber has promoted it. So, why it should be optional to remove > physical_replication_slot? I think we must remove it from the primary > unless there is some other reason. My point is to *always* remove the primary_slot_name on primary. > My point is to not have an additional slot and keep a comment > indicating that we need an extra slot once we add pg_basebackup > support. Got it. > > 3. How about sync slots on the physical standby if present? Do we want > > to retain those as it is or do we need to remove those? We are > > actively working on the patch [1] for the same. > > > > > > I didn't read the current version of the referred patch but if the proposal > > is > > to synchronize logical replication slots iif you are using a physical > > replication, as soon as pg_subscriber finishes the execution, there won't be > > synchronization on these logical replication slots because there isn't a > > physical replication anymore. If the goal is a promotion, the current > > behavior > > is correct because the logical replica will retain WAL since it was > > converted. > > > > I don't understand what you mean by promotion in this context. If > users want to simply promote the standby then there is no need to do > additional things that this tool is doing. ENOCOFFEE. s/promotion/switchover/ > > However, if you are creating a logical replica, this WAL retention is not > > good > > and the customer should eventually remove these logical replication slots on > > the logical replica. > > > > I think asking users to manually remove such slots won't be a good > idea. We might want to either remove them by default or provide an > option to the user. Am I correct that in the majority of use cases these replication slots will be useless? If so, let's remove them by default and add an option to control this behavior (replication slot removal).
-- Euler Taveira EDB https://www.enterprisedb.com/
Re: POC: Extension for adding distributed tracing - pg_tracing
Hi, > This approach avoids the need to rewrite SQL or give special meaning to SQL > comments. SQLCommenter already has a good amount of support and is referenced in opentelemetry https://github.com/open-telemetry/opentelemetry-sqlcommenter So the goal was more to leverage the existing trace propagation systems as much as possible. > I used GUCs to set the `opentelemtery_trace_id` and `opentelemetry_span_id`. > These can be sent as protocol-level messages by the client driver if the > client driver has native trace integration enabled, in which case the user > doesn't need to see or care. And for other drivers, the application can use > `SET` or `SET LOCAL` to assign them. Emitting `SET` and `SET LOCAL` for every traced statement sounds more disruptive and expensive than relying on SQLCommenter. When not using `SET LOCAL`, the variables also need to be cleared. This will also introduce a lot of headaches as the `SET` themselves could be traced and generate spans when tracing utility statements is already tricky enough. > But IIRC the current BDR/PGD folks are now using an OpenTelemetry > sidecar process instead. I think they send it UDP packets to emit > metrics and events. I would be curious to hear more about this. I didn't want to be tied to a specific protocol (I've only seen gRPC and http support on latest opentelemetry collectors) and I fear that relying on Postgres sending traces would lower the chance of having this supported on managed Postgres solutions (like RDS and CloudSQL). > By queries you mean particular queries, not transactions? And since > they have an assigned ID it means that the query is already executing > and we want to enable the tracking in another session, right? I think that was the idea. The query identifier generated for a specific statement is stable so we could have a GUC variable with a list of query id and only queries matching the provided query ids would be sampled. 
This would make it easier to generate traces for a specific query within a session. On Tue, Jan 2, 2024 at 1:14 PM Aleksander Alekseev wrote: > > Hi, > > > Overall solution looks good for me except SQL Commenter - query > > instrumentation > > with SQL comments is often not possible on production systems. Instead > > the very often requested feature is to enable tracing for a given single > > query ID, > > or several (very limited number of) queries by IDs. It could be done by > > adding > > SQL function to add and remove query ID into a list (even array would do) > > stored in top tracing context. > > Not 100% sure if I follow. > > By queries you mean particular queries, not transactions? And since > they have an assigned ID it means that the query is already executing > and we want to enable the tracking in another session, right? If this > is the case I would argue that enabling/disabling tracing for an > _already_ running query (or transaction) would be way too complicated. > > I wouldn't mind enabling/disabling tracing for the current session > and/or a given session ID. In the latter case it can have an effect > only for the new transactions. This however could be implemented > separately from the proposed patchset. I suggest keeping the scope > small. > > -- > Best regards, > Aleksander Alekseev
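The per-query sampling check discussed above would be cheap; a minimal illustration of testing a statement's query ID against a user-provided list (names are hypothetical, not pg_tracing's actual API):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical sketch: trace a statement only when its computed query
 * ID appears in the list parsed from a GUC such as a comma-separated
 * "sampled query IDs" setting (invented name).
 */
static bool
query_is_sampled(uint64_t query_id, const uint64_t *sampled_ids, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (sampled_ids[i] == query_id)
            return true;
    return false;
}
```

Since the list is expected to be "a very limited number of queries", a linear scan per statement is likely fine; a sorted array with a binary search would be the obvious refinement.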
Re: the s_lock_stuck on perform_spin_delay
Robert Haas writes: > I'm not sure that the approach this patch takes is correct in detail, > but I kind of agree with you about the overall point. I mean, the idea > of the PANIC is to avoid having the system just sit there in a state > from which it will never recover ... but it can also have the effect > of killing a system that wasn't really dead. I'm not sure what the > best thing to do here is, but it's worth talking about, IMHO. I'm not a fan of adding overhead to such a performance-critical thing as spinlocks in order to catch coding errors that are easily detectable statically. IMV the correct usage of spinlocks is that they should only be held across *short, straight line* code segments. We should be making an effort to ban coding patterns like "return with spinlock still held", because they're just too prone to errors similar to this one. Note that trying to take another lock is far from the only bad thing that can happen if you're not very conservative about what code can execute with a spinlock held. regards, tom lane
Re: WIP Incremental JSON Parser
On Wed, Jan 3, 2024 at 6:36 PM Nico Williams wrote: > On Tue, Jan 02, 2024 at 10:14:16AM -0500, Robert Haas wrote: > > It seems like a pretty significant savings no matter what. Suppose the > > backup_manifest file is 2GB, and instead of creating a 2GB buffer, you > > create an 1MB buffer and feed the data to the parser in 1MB chunks. > > Well, that saves 2GB less 1MB, full stop. Now if we address the issue > > you raise here in some way, we can potentially save even more memory, > > which is great, but even if we don't, we still saved a bunch of memory > > that could not have been saved in any other way. > > You could also build a streaming incremental parser. That is, one that > outputs a path and a leaf value (where leaf values are scalar values, > `null`, `true`, `false`, numbers, and strings). Then if the caller is > doing something JSONPath-like then the caller can probably immediately > free almost all allocations and even terminate the parse early. I think our current parser is event-based rather than this ... but it seems like this could easily be built on top of it, if someone wanted to. -- Robert Haas EDB: http://www.enterprisedb.com
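A rough sketch of how "path + leaf value" emission could be layered on top of an event-based parser, as suggested above: the event consumer keeps a stack of object keys and joins it into a path on each scalar event. The events here are fed by hand rather than produced by a real JSON lexer, and all names are invented.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAXDEPTH 16

typedef struct
{
    const char *keys[MAXDEPTH]; /* stack of enclosing object keys */
    int depth;
    char last_path[256];        /* most recently emitted (path, value) */
    char last_value[64];
} PathTracker;

/* Event callbacks, as an event-based parser would invoke them. */
static void on_object_start(PathTracker *t) { (void) t; }
static void on_key(PathTracker *t, const char *key) { t->keys[t->depth++] = key; }
static void on_object_end(PathTracker *t) { if (t->depth > 0) t->depth--; }

static void
on_scalar(PathTracker *t, const char *value)
{
    /* Join the key stack into a dotted path and record the leaf. */
    t->last_path[0] = '\0';
    for (int i = 0; i < t->depth; i++)
    {
        if (i > 0)
            strcat(t->last_path, ".");
        strcat(t->last_path, t->keys[i]);
    }
    snprintf(t->last_value, sizeof(t->last_value), "%s", value);
    if (t->depth > 0)
        t->depth--;             /* the scalar consumes its own key */
}
```

For `{"a": {"b": 1}}` the event sequence object-start, key "a", object-start, key "b", scalar "1" would emit the pair ("a.b", "1"), which is the shape Nico describes for JSONPath-like callers.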
Re: index prefetching
Hi, Here's a somewhat reworked version of the patch. My initial goal was to see if it could adopt the StreamingRead API proposed in [1], but that turned out to be less straightforward than I hoped, for two reasons: (1) The StreamingRead API seems to be designed for pages, but the index code naturally works with TIDs/tuples. Yes, the callbacks can associate the blocks with custom data (in this case that'd be the TID), but it seemed a bit strange ... (2) The place adding requests to the StreamingRead queue is pretty far from the place actually reading the pages - for prefetching, the requests would be generated in nodeIndexscan, but the page reading happens somewhere deep in index_fetch_heap/heapam_index_fetch_tuple. Sure, the TIDs would come from a callback, so it's a bit as if the requests were generated in heapam_index_fetch_tuple - but it has no idea StreamingRead exists, so where would it get it. We might teach it about it, but what if there are multiple places calling index_fetch_heap()? Not all of which may be using StreamingRead (only indexscans would do that). Or if there are multiple index scans, there'd need to be separate StreamingRead queues, right? In any case, I felt a bit out of my depth here, and I chose not to do all this work without discussing the direction here. (Also, see the point about cursors and xs_heap_continue a bit later in this post.) I did however like the general StreamingRead API - how it splits the work between the API and the callback. The patch used to do everything, which meant it hardcoded a lot of the IOS-specific logic etc. I did plan to have some sort of "callback" for reading from the queue, but that didn't quite solve this issue - a lot of the stuff remained hard-coded. But the StreamingRead API made me realize that having a callback for the first phase (that adds requests to the queue) would fix that. So I did that - there's now one simple callback for index scans, and a bit more complex callback for index-only scans.
Thanks to this the hard-coded stuff mostly disappears, which is good. Perhaps a bigger change is that I decided to move this into a separate API on top of indexam.c. The original idea was to integrate this into index_getnext_tid/index_getnext_slot, so that all callers benefit from the prefetching automatically. Which would be nice, but it also meant it'd need to happen in the indexam.c code, which seemed dirty. This patch introduces an API similar to StreamingRead. It calls the indexam.c stuff, but does all the prefetching on top of it, not in it. If a place calling index_getnext_tid() wants to allow prefetching, it needs to switch to IndexPrefetchNext(). (There's no function that would replace index_getnext_slot, at the moment. Maybe there should be.) Note 1: The IndexPrefetch name is a bit misleading, because it's used even with prefetching disabled - all index reads from the index scan happen through it. Maybe it should be called IndexReader or something like that. Note 2: I left the code in indexam.c for now, but in principle it could (should) be moved to a different place. I think this layering makes sense, and it's probably much closer to what Andres meant when he said the prefetching should happen in the executor. Even if the patch ends up using StreamingRead in the future, I guess we'll want something like IndexPrefetch - it might use the StreamingRead internally, but it would still need to do some custom stuff to detect I/O patterns or something that does not quite fit into the StreamingRead. Now, let's talk about two (mostly unrelated) problems I ran into. Firstly, I realized there's a bit of a problem with cursors. The prefetching works like this:

1) reading TIDs from the index
2) stashing them into a queue in IndexPrefetch
3) doing prefetches for the new TIDs added to the queue
4) returning the TIDs to the caller, one by one

And all of this works ... unless the direction of the scan changes.
Which for cursors can happen if someone does FETCH BACKWARD or stuff like that. I'm not sure how difficult it'd be to make this work. I suppose we could simply discard the prefetched entries and do the right number of steps back for the index scan. But I haven't tried, and maybe it's more complex than I'm imagining. Also, if the cursor changes the direction a lot, it'd make the prefetching harmful. The patch simply disables prefetching for such queries, using the same logic that we do for parallelism. This may be over-zealous. FWIW this is one of the things that probably should remain outside of StreamingRead API - it seems pretty index-specific, and I'm not sure we'd even want to support these "backward" movements in the API. The other issue I'm aware of is handling xs_heap_continue. I believe it works fine for "false" but I need to take a look at non-MVCC snapshots (i.e. when xs_heap_continue=true). I haven't done any benchmarks with this reworked API - there's a couple more allocations etc. but it did not change in a
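For illustration, the queue part of the four-step flow described earlier (read TIDs, stash them, prefetch, hand them out one by one) can be sketched as a tiny ring buffer. All names are invented; the real IndexPrefetch is considerably more involved (distance heuristics, I/O pattern detection, etc.).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define QUEUE_SIZE 8            /* toy prefetch distance */

typedef uint64_t ToyTid;        /* stand-in for an ItemPointer */

typedef struct
{
    ToyTid items[QUEUE_SIZE];
    int head;                   /* next slot to fill */
    int tail;                   /* next item to hand to the caller */
    int nprefetched;            /* stand-in for PrefetchBuffer() calls */
} PrefetchQueue;

static bool
queue_full(PrefetchQueue *q)
{
    return q->head - q->tail == QUEUE_SIZE;
}

/* Step 2+3: stash a TID pulled from the index and "prefetch" its page. */
static void
queue_add(PrefetchQueue *q, ToyTid tid)
{
    assert(!queue_full(q));
    q->items[q->head % QUEUE_SIZE] = tid;
    q->head++;
    q->nprefetched++;           /* real code would prefetch the heap page here */
}

/* Step 4: hand TIDs back to the caller in the order they were read. */
static bool
queue_next(PrefetchQueue *q, ToyTid *tid)
{
    if (q->tail == q->head)
        return false;           /* empty: pull more TIDs from the index */
    *tid = q->items[q->tail % QUEUE_SIZE];
    q->tail++;
    return true;
}
```

The cursor problem above is visible even in this toy: a FETCH BACKWARD would need entries behind `tail`, which the queue has already discarded.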
Re: the s_lock_stuck on perform_spin_delay
Hi Matthias and Robert, Matthias van de Meent writes: > On Thu, 4 Jan 2024 at 08:09, Andy Fan wrote: >> >> My question is if someone doesn't obey the rule by mistake (everyone >> can make mistake), shall we PANIC on a production environment? IMO I >> think it can be a WARNING on a production environment and be a stuck >> when 'ifdef USE_ASSERT_CHECKING'. >> [...] >> I think a experienced engineer like Thomas can make this mistake and the >> patch was reviewed by 3 peoples, the bug is still there. It is not easy >> to say just don't do it. >> >> the attached code show the prototype in my mind. Any feedback is welcome. > > While I understand your point and could maybe agree with the change > itself (a crash isn't great), It's great that both of you agree that the crash is not great. > I don't think it's an appropriate fix > for the problem of holding a spinlock while waiting for a LwLock, as > spin.h specifically mentions the following (and you quoted the same): > > """ > Keep in mind the coding rule that spinlocks must not be held for more > than a few instructions. > """ Yes, I agree that any known [LW]LockAcquire after holding a spinlock should be fixed at the first chance rather than pandered to with my previous patch. My previous patch just takes care of the *unknown* cases (and I CCed Thomas in the hope that he knows about the bug). I also agree that the special case of [LW]LockAcquire should be detected more effectively, as you suggested below. So here comes v2, and its second patch implements that suggestion. > > I suspect that we'd be better off with stronger protections against > waiting for LwLocks while we hold any spin lock.
More specifically, > I'm thinking about something like tracking how many spin locks we > hold, and Assert()-ing that we don't hold any such locks when we start > to wait for an LwLock or run CHECK_FOR_INTERRUPTS-related code (with > potential manual contextual overrides in specific areas of code where > specific care has been taken to make it safe to hold spin locks while > doing those operations - although I consider their existence unlikely > I can't rule them out as I've yet to go through all lock-touching > code). This would probably work in a similar manner as > START_CRIT_SECTION/END_CRIT_SECTION. > > Kind regards, > > Matthias van de Meent > Neon (https://neon.tech) -- Best Regards Andy Fan >From 7d7fd0f0e9b13a290bfffaec0ad40773191155f2 Mon Sep 17 00:00:00 2001 From: "yizhi.fzh" Date: Thu, 4 Jan 2024 14:33:37 +0800 Subject: [PATCH v2 1/2] Don't call s_lock_stuck in production environment In the current implementation, if a spin lock is misused, the s_lock_stuck in perform_spin_delay can cause the entire system to PANIC. In order to balance fault tolerance and the ability to detect incorrect usage, we can use WARNING to replace s_lock_stuck in the production environment, but still use s_lock_stuck in builds with USE_ASSERT_CHECKING. --- src/backend/storage/lmgr/s_lock.c | 9 + 1 file changed, 9 insertions(+) diff --git a/src/backend/storage/lmgr/s_lock.c b/src/backend/storage/lmgr/s_lock.c index 327ac64f7c..9446605122 100644 --- a/src/backend/storage/lmgr/s_lock.c +++ b/src/backend/storage/lmgr/s_lock.c @@ -67,6 +67,7 @@ slock_t dummy_spinlock; static int spins_per_delay = DEFAULT_SPINS_PER_DELAY; +#ifdef USE_ASSERT_CHECKING /* * s_lock_stuck() - complain about a stuck spinlock */ @@ -85,6 +86,7 @@ s_lock_stuck(const char *file, int line, const char *func) func, file, line); #endif } +#endif /* * s_lock(lock) - platform-independent portion of waiting for a spinlock. 
@@ -132,7 +134,14 @@ perform_spin_delay(SpinDelayStatus *status) if (++(status->spins) >= spins_per_delay) { if (++(status->delays) > NUM_DELAYS) + { +#ifdef USE_ASSERT_CHECKING s_lock_stuck(status->file, status->line, status->func); +#else + if (status->delays % NUM_DELAYS == 0) +elog(WARNING, "perform spin lock on %s %d times", status->func, status->delays); +#endif + } if (status->cur_delay == 0) /* first time to delay? */ status->cur_delay = MIN_DELAY_USEC; -- 2.34.1 >From fae6241960a2c0df8df5d83001ce298228f4cbf2 Mon Sep 17 00:00:00 2001 From: "yizhi.fzh" Date: Thu, 4 Jan 2024 22:19:50 +0800 Subject: [PATCH v2 2/2] Disallow [LW]LockAcquire when holding a spin lock-file-mode Spinlocks are intended for *very* short-term locks, but sometimes it is misused, this commit guards no [LW]LockAcquire should be called when holding a spin lock. --- src/backend/storage/buffer/bufmgr.c | 1 + src/backend/storage/lmgr/lock.c | 2 ++ src/backend/storage/lmgr/lwlock.c | 2 ++ src/backend/utils/init/globals.c| 1 + src/include/storage/buf_internals.h | 1 + src/include/storage/spin.h | 24 ++-- 6 files changed, 29 insertions(+), 2 deletions(-) diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index 9f9d3f24ac..d2534ee9aa 100644 --- a/src/backend/storage/buffer/bufmgr.c +++
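The behavioral change in patch 0001 above is small but worth spelling out: with USE_ASSERT_CHECKING the old s_lock_stuck() path (PANIC/abort) is kept, while production builds only emit a WARNING once per NUM_DELAYS further delays and keep spinning. A simplified Python model of just that branch (not the actual C code):

```python
NUM_DELAYS = 1000  # same constant as in s_lock.c

def perform_spin_delay_tail(delays, assert_checking):
    """Models the patched branch of perform_spin_delay().

    'delays' is the counter after incrementing; returns the event the
    branch would emit, or None if it just keeps waiting."""
    if delays > NUM_DELAYS:
        if assert_checking:
            return 'PANIC'  # s_lock_stuck(), as before the patch
        if delays % NUM_DELAYS == 0:
            return 'WARNING: perform spin lock %d times' % delays
    return None             # below the threshold: keep waiting silently
```

So a production system keeps making (slow) progress and logs periodically, instead of taking the whole cluster down on a possibly recoverable stall.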
Re: Synchronizing slots from primary to standby
Hi, On Thu, Jan 04, 2024 at 10:27:31AM +0530, shveta malik wrote: > On Thu, Jan 4, 2024 at 9:18 AM shveta malik wrote: > > > > On Wed, Jan 3, 2024 at 6:33 PM Zhijie Hou (Fujitsu) > > wrote: > > > > > > On Tuesday, January 2, 2024 6:32 PM shveta malik > > > wrote: > > > > On Fri, Dec 29, 2023 at 10:25 AM Amit Kapila > > > > > > > > The topup patch has also changed app_name to > > > > {cluster_name}_slotsyncworker so that we do not confuse between > > > > walreceiver > > > > and slotsyncworker entry. > > > > > > > > Please note that there is no change in rest of the patches, changes are > > > > in > > > > additional 0004 patch alone. > > > > > > Attach the V56 patch set which supports ALTER SUBSCRIPTION SET (failover). > > > This is useful when user want to refresh the publication tables, they can > > > now alter the > > > failover option to false and then execute the refresh command. > > > > > > Best Regards, > > > Hou zj > > > > The patches no longer apply to HEAD due to a recent commit 007693f. I > > am working on rebasing and will post the new patches soon > > > > thanks > > Shveta > > Commit 007693f has changed 'conflicting' to 'conflict_reason', so > adjusted the code around that in the slotsync worker. > > Also removed function 'pg_get_slot_invalidation_cause' as now > conflict_reason tells the same. > > PFA rebased patches with above changes. > Thanks! Looking at 0004: 1 -libpqrcv_connect(const char *conninfo, bool logical, bool must_use_password, -const char *appname, char **err) +libpqrcv_connect(const char *conninfo, bool replication, bool logical, +bool must_use_password, const char *appname, char **err) What about adjusting the preceding comment a bit to describe what the new replication parameter is for? 2 + /* We can not have logical w/o replication */ what about replacing w/o by without? 3 === + if(!replication) + Assert(!logical); + + if (replication) { what about using "if () else" instead (to avoid unnecessary test)? 
Having said that, the patch seems a reasonable way to implement a non-replication connection in the slotsync worker. 4 === Looking closer, the only place where walrcv_connect() is called with replication set to false and logical set to false is in ReplSlotSyncWorkerMain(). That does make sense, but what do you think about creating dedicated libpqslotsyncwrkr_connect and slotsyncwrkr_connect (instead of using the libpqrcv_connect / walrcv_connect ones)? That way we could make use of slotsyncwrkr_connect() in ReplSlotSyncWorkerMain(), as I think it's confusing to use "rcv" functions while the process using them is not of backend type walreceiver. I'm not sure that's worth the extra complexity though, what do you think? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
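The "if () else" restructuring suggested in review point 3 amounts to the following control flow, sketched here in Python (the function name is made up; the 'database'/'true' values mirror libpq's documented replication connection keyword, which the real code passes in its keyword array):

```python
def replication_option(replication, logical):
    """Value of libpq's 'replication' connection keyword per mode,
    using the suggested if/else shape instead of two separate ifs."""
    if replication:
        # replication connection: logical decoding vs. physical walreceiver
        return 'database' if logical else 'true'
    else:
        # plain SQL connection (the slotsync worker case);
        # the keyword is simply omitted
        assert not logical, "cannot have logical without replication"
        return None
```

The assert only needs to run in the non-replication branch, which is the point of folding the two tests into one if/else.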
Re: the s_lock_stuck on perform_spin_delay
On Thu, Jan 4, 2024 at 2:09 AM Andy Fan wrote: > My question is if someone doesn't obey the rule by mistake (everyone > can make mistake), shall we PANIC on a production environment? IMO I > think it can be a WARNING on a production environment and be a stuck > when 'ifdef USE_ASSERT_CHECKING'. > > People may think spin lock may consume too much CPU, but it is not true > in the discussed scene since perform_spin_delay have pg_usleep in it, > and the MAX_DELAY_USEC is 1 second and MIN_DELAY_USEC is 0.001s. > > I notice this issue actually because of the patch "Cache relation > sizes?" from Thomas Munro [1], where the latest patch[2] still have the > following code. > + sr = smgr_alloc_sr(); <-- HERE a spin lock is hold > + > + /* Upgrade to exclusive lock so we can create a mapping. */ > + LWLockAcquire(mapping_lock, LW_EXCLUSIVE); <-- HERE a complex > operation is needed. it may take a long time. I'm not sure that the approach this patch takes is correct in detail, but I kind of agree with you about the overall point. I mean, the idea of the PANIC is to avoid having the system just sit there in a state from which it will never recover ... but it can also have the effect of killing a system that wasn't really dead. I'm not sure what the best thing to do here is, but it's worth talking about, IMHO. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Add a perl function in Cluster.pm to generate WAL
Hello Tom, 04.01.2024 02:39, Tom Lane wrote: Buildfarm member skink has failed 3 times in 035_standby_logical_decoding.pl in the last couple of days: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink=2024-01-03%2020%3A07%3A15 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink=2024-01-03%2017%3A09%3A27 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink=2024-01-01%2020%3A10%3A18 These all look like # poll_query_until timed out executing this query: # select (confl_active_logicalslot = 1) from pg_stat_database_conflicts where datname = 'testdb' # expecting this output: # t # last actual query output: # f although it's notable that two different tests are involved (vacuum vs. vacuum full). I am not real sure what is happening there, but I see that c161ab74f changed some details of how that test works, so I wonder if it's responsible for these failures. The timing isn't a perfect match, since this commit went in two weeks ago, but I don't see any more-recent commits that seem like plausible explanations. Reproduced here. As I can see in the failure logs you referenced, the first problem is: # Failed test 'inactiveslot slot invalidation is logged with vacuum on pg_authid' # at t/035_standby_logical_decoding.pl line 224. It reminded me of: https://www.postgresql.org/message-id/b2a1f7d0-7629-72c0-2da7-e9c4e336b010%40gmail.com It seems that it's not something new, and maybe that my analysis is still valid. If so, VACUUM FREEZE/autovacuum = off should fix the issue. Best regards, Alexander
Re: Adding facility for injection points (or probe points?) for more advanced tests
On Tue, Jan 2, 2024 at 3:36 PM Ashutosh Bapat wrote: > > I will look at 0002 next. One more comment on 0001 InjectionPointAttach() doesn't test whether the given function exists in the given library. Even if InjectionPointAttach() succeeds, INJECTION_POINT might throw error because the function doesn't exist. This can be seen as an unwanted behaviour. I think InjectionPointAttach() should test whether the function exists and possibly load it as well by adding it to the local cache. 0002 comments --- /dev/null +++ b/src/test/modules/test_injection_points/expected/test_injection_points.out When built without injection point support, this test fails. We should add an alternate output file for such a build so that the behaviour with and without injection point support is tested. Or set up things such that the test is not run under make check in that directory. I will prefer the first option. + +SELECT test_injection_points_run('TestInjectionError'); -- error +ERROR: error triggered for injection point TestInjectionError +-- Re-load and run again. What's getting Re-loaded here? \c will create a new connection and thus a new backend. Maybe the comment should say "test in a fresh backend" or something of that sort? + +SELECT test_injection_points_run('TestInjectionError'); -- error +ERROR: error triggered for injection point TestInjectionError +-- Remove one entry and check the other one. Looks confusing to me, we are testing the one removed as well. Am I missing something? +(1 row) + +-- All entries removed, nothing happens We aren't removing all entries TestInjectionLog2 is still there. Am I missing something? 0003 looks mostly OK. 0004 comments + +# after recovery, the server will not start, and log PANIC: could not locate a valid checkpoint record IIUC the comment describes the behaviour with 7863ee4def65 reverted. But the test after this comment is written for the behaviour with 7863ee4def65. That's confusing. Is the intent to describe both behaviours in the comment? 
+ + /* And sleep.. */ + ConditionVariablePrepareToSleep(_state->wait_point); + ConditionVariableSleep(_state->wait_point, test_injection_wait_event); + ConditionVariableCancelSleep(); According to the prologue of ConditionVariableSleep(), that function should be called in a loop checking for the desired condition. All the callers that I examined follow that pattern. I think we need to follow that pattern here as well. Below comment from ConditionVariableTimedSleep() makes me think that the caller of ConditionVariableSleep() can be woken up even if the condition variable was not signaled. That's why the while() loop around ConditionVariableSleep(). * If we're still in the wait list, then the latch must have been set * by something other than ConditionVariableSignal; though we don't * guarantee not to return spuriously, we'll avoid this obvious case. */. That's all I have for now. -- Best Wishes, Ashutosh Bapat
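The pattern recommended above, sleeping inside a loop that rechecks the wakeup condition because ConditionVariableSleep() may return spuriously, can be illustrated with a toy Python analogue, where threading.Condition stands in for the PostgreSQL condition variable (names here are invented for the sketch):

```python
import threading

class WaitPoint:
    """Sketch of: while (!condition) ConditionVariableSleep(...);"""

    def __init__(self):
        self.cv = threading.Condition()
        self.released = False

    def wait(self):
        with self.cv:
            # Recheck the condition after every wakeup; a plain
            # single wait would be wrong on a spurious return.
            while not self.released:
                self.cv.wait(timeout=1.0)

    def wakeup(self):
        with self.cv:
            self.released = True  # set the condition *before* signaling
            self.cv.notify_all()  # ConditionVariableBroadcast()
```

Setting the flag under the lock before notifying is what makes the recheck loop race-free.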
Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
Hello! > Correct, but there are changes being discussed where we would freeze > tuples during pruning as well [0], which would invalidate that > implementation detail. And, if I had to choose between improved > opportunistic freezing and improved R/CIC, I'd probably choose > improved freezing over R/CIC. As another option, we could extract a dedicated horizon value for opportunistic freezing, and use some flags in the R/CIC backend to keep it at the required value. Best regards, Michail.
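The "dedicated horizon" idea can be stated concretely: opportunistic freezing would clamp its cutoff to the oldest horizon any running R/CIC backend still advertises. A minimal sketch, using plain integers for clarity (real XID comparisons are circular, modulo 2^32, so TransactionIdPrecedes-style logic would be needed instead of min()):

```python
def effective_freeze_horizon(global_horizon, rcic_horizons):
    """Clamp the freeze cutoff to the oldest horizon still needed by
    any R/CIC backend; with none running, the global horizon applies.

    Illustrative only: names and the linear-integer simplification
    are assumptions, not PostgreSQL code."""
    return min([global_horizon] + list(rcic_horizons))
```

As long as an R/CIC backend keeps its advertised horizon current, pruning could freeze aggressively everywhere else without invalidating the R/CIC snapshot.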
Re: INFORMATION_SCHEMA note
(typo in the subject fixed) > In the following paragraph in information_schema: > > character encoding form > > >An encoding of some character repertoire. Most older character >repertoires only use one encoding form, and so there are no >separate names for them (e.g., LATIN1 is an >encoding form applicable to the LATIN1 >repertoire). But for example Unicode has the encoding forms >UTF8, UTF16, etc. (not >all supported by PostgreSQL). Encoding forms are not exposed >as an SQL object, but are visible in this view. > > This claims that the LATIN1 repertoire only uses one encoding form, > but actually LATIN1 can be encoded in another form: ISO-2022-JP-2 (a 7 > bit encoding. See RFC 1554 > (https://datatracker.ietf.org/doc/html/rfc1554) for more details). > > If we still want to list a use-one-encoding-form example, probably we > could use LATIN2 instead or others that are not supported by > ISO-2022-JP-2 (ISO-2022-JP-2 supports LATIN1 and LATIN7). > > Attached is the patch that does this. Any objection? -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
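The underlying point, that one character repertoire can have several encoding forms, is easy to demonstrate: U+00E9 (é) belongs to the LATIN1 repertoire, yet has a different byte representation in the single-byte LATIN1 form, in Unicode's UTF8 form, and in the 7-bit ISO-2022-JP-2 form from RFC 1554. A quick sketch using Python's codec names:

```python
ch = '\u00e9'  # é, part of the LATIN1 repertoire

latin1_form = ch.encode('iso-8859-1')     # single-byte LATIN1 encoding form
utf8_form = ch.encode('utf-8')            # two-byte UTF8 encoding form
iso2022_form = ch.encode('iso2022_jp_2')  # 7-bit ISO-2022-JP-2 (RFC 1554)
```

Three encodings, one abstract character, which is why LATIN1 is not a great example of a repertoire with "only one encoding form".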
Re: Adding facility for injection points (or probe points?) for more advanced tests
On Thu, Jan 4, 2024 at 5:23 AM Michael Paquier wrote: > > > 0003 and 0004 add tests to the test_injection_points module. Is the idea > > that we'd add any tests that required injection points here? I think it'd > > be better if we could move the tests closer to the logic they're testing, > > but perhaps that is difficult because you also need to define the callback > > functions somewhere. Hm... > > Yeah. Agreed that the final result should not have these tests in the > module test_injection_points. What I was thinking here is to move > 002_invalid_checkpoint_after_promote.pl to src/test/recovery/ and pull > the module with the callbacks with an EXTRA_INSTALL. 0003 and 0004 are using the extension in this module for some serious testing. The name of the extension test_injection_point indicates that it's for testing injection points and not for some serious use of injection callbacks it adds. Changes 0003 and 0004 suggest otherwise. I suggest we move test_injection_points from src/test/modules to contrib/ and rename it as "injection_points". The test files may still be named as test_injection_point. The TAP tests in 0003 and 0004 once moved to their appropriate places, will load injection_point extension and use it. That way predefined injection point callbacks will also be available for others to use. -- Best Wishes, Ashutosh Bapat
Re: speed up a logical replica setup
On Thu, Jan 4, 2024 at 4:34 PM Amit Kapila wrote: > > But what good is having the same publications as primary > > also on logical replica? > > > > The one use case that comes to my mind is to set up bi-directional > replication. The publishers want to subscribe to the new subscriber. Hmm. Looks like another user-controlled cleanup. -- Best Wishes, Ashutosh Bapat
Re: Random pg_upgrade test failure on drongo
Hello Amit, 03.01.2024 14:42, Amit Kapila wrote: So I started to think about other approach: to perform unlink as it's implemented now, but then wait until the DELETE_PENDING state is gone. There is a comment in the code which suggests we shouldn't wait indefinitely. See "However, we won't wait indefinitely for someone else to close the file, as the caller might be holding locks and blocking other backends." Yes, I saw it, but initially I thought that we have a transient condition there, so waiting in open() (instead of failing immediately) seemed like a good idea then... And the internal process is ... background writer (BgBufferSync()). So, I tried just adding bgwriter_lru_maxpages = 0 to postgresql.conf and got 20 x 10 tests passing. Thus, if we want just to get rid of the test failure, maybe it's enough to add this to the test's config... What about checkpoints? Can't it do the same while writing the buffers? As we deal here with pg_upgrade/pg_restore, it may not be very easy to get the desired effect, but I think it's not impossible in principle. More details below. What happens during the pg_upgrade execution is essentially:

1) CREATE DATABASE "postgres" WITH TEMPLATE = template0 OID = 5 ...; -- this command flushes file buffers as well
2) ALTER DATABASE postgres OWNER TO ...
3) COMMENT ON DATABASE "postgres" IS ...
4) -- For binary upgrade, preserve pg_largeobject and index relfilenodes
SELECT pg_catalog.binary_upgrade_set_next_index_relfilenode('2683'::pg_catalog.oid);
SELECT pg_catalog.binary_upgrade_set_next_heap_relfilenode('2613'::pg_catalog.oid);
TRUNCATE pg_catalog.pg_largeobject;
-- ^^^ here we can get the error "could not create file "base/5/2683": File exists"
...
We get the effect discussed when the background writer process decides to flush a file buffer for pg_largeobject during stage 1. (Thus, if a checkpoint somehow happened to occur during CREATE DATABASE, the result must be the same.)
And another important factor is shared_buffers = 1MB (set during the test). With the default setting of 128MB I couldn't see the failure. It can be reproduced easily (on old Windows versions) just by running pg_upgrade in a loop (I've got failures on iterations 22, 37, 17 (with the default cluster)). If an old cluster contains a dozen databases, this increases the failure probability significantly (with 10 additional databases I've got failures on iterations 4, 1, 6). Please see the reproducing script attached.

Best regards,
Alexander

@echo off
REM define PGBIN and PATH
set PGBIN=%cd%\tmp_install\usr\local\pgsql\bin
set PATH=%PGBIN%;%PATH%
setlocal enabledelayedexpansion
rmdir /S /Q tmpdb
initdb -D tmpdb >initdb.log 2>&1
echo shared_buffers = 1MB>>tmpdb\postgresql.conf
set /a i = 0
:LOOP
set /a i+=1
echo ITERATION %i%
rmdir /S /Q tmpdb_old
xcopy /e /i /q /r tmpdb tmpdb_old\
pg_ctl -D tmpdb_old -l server.log start
echo SELECT concat('CREATE DATABASE db', g) FROM generate_series(1, 10) g \gexec | psql -d postgres
pg_ctl -D tmpdb_old stop
rmdir /S /Q tmpdb_new
xcopy /e /i /q /r tmpdb tmpdb_new\
pg_upgrade --no-sync -r -b "%PGBIN%" -B "%PGBIN%" -d tmpdb_old -D tmpdb_new
if %ERRORLEVEL% NEQ 0 goto ERR
goto LOOP
:ERR
echo ERROR
pause
:EXIT
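The earlier idea of waiting out the DELETE_PENDING window instead of failing immediately boils down to a bounded retry around the open/create call. A language-neutral sketch in Python (the real fix would live in the Windows port code, where the relevant error codes and their mapping to errno differ; the function name and retry parameters are invented):

```python
import errno
import time

def open_with_retry(opener, retries=10, delay=0.001):
    """Retry a create/open briefly when it fails the way a Windows
    DELETE_PENDING file does (mapped here to EEXIST/EACCES).

    The retry count is bounded on purpose: the caller might be holding
    locks and blocking other backends, so we must not wait forever."""
    for attempt in range(retries):
        try:
            return opener()
        except OSError as e:
            transient = e.errno in (errno.EEXIST, errno.EACCES)
            if not transient or attempt == retries - 1:
                raise
            time.sleep(delay)  # give the deleter a moment to finish
```

This is only a sketch of the trade-off being discussed: a short, bounded wait converts the sporadic "File exists" failure into a retry, without the indefinite blocking the code comment warns about.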
Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
On Mon, 25 Dec 2023 at 15:12, Michail Nikolaev wrote: > > Hello! > > It seems like the idea of "old" snapshot is still a valid one. > > > Should this deal with any potential XID wraparound, too? > > As far as I understand in our case, we are not affected by this in any way. > Vacuum in our table is not possible because of locking, so, nothing > may be frozen (see below). > In the case of super long index building, transactional limits will > stop new connections using current > regular infrastructure because it is based on relation data (but not > actual xmin of backends). > > > How does this behave when the newly inserted tuple's xmin gets frozen? > > This would be allowed to happen during heap page pruning, afaik - no > > rules that I know of which are against that - but it would create > > issues where normal snapshot visibility rules would indicate it > > visible to both snapshots regardless of whether it actually was > > visible to the older snapshot when that snapshot was created... > > As I can see, heap_page_prune never freezes any tuples. > In the case of regular vacuum, it used this way: call heap_page_prune > and then call heap_prepare_freeze_tuple and then > heap_freeze_execute_prepared. Correct, but there are changes being discussed where we would freeze tuples during pruning as well [0], which would invalidate that implementation detail. And, if I had to choose between improved opportunistic freezing and improved R/CIC, I'd probably choose improved freezing over R/CIC. As an alternative, we _could_ keep track of concurrent index inserts using a dummy index (with the same predicate) which only holds the TIDs of the inserted tuples. We'd keep it as an empty index in phase 1, and every time we reset the visibility snapshot we now only need to scan that index to know what tuples were concurrently inserted. This should have a significantly lower IO overhead than repeated full index bulkdelete scans for the new index in the second table scan phase of R/CIC. 
However, in the worst case it could still require another O(tablesize) of storage. Kind regards, Matthias van de Meent Neon (https://neon.tech) [0] https://www.postgresql.org/message-id/caakru_a+g2oe6ahjcbibftnfiy2aib4e31x9qyj_qkjxzmz...@mail.gmail.com
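The dummy-index alternative sketched above can be modeled as an ordered log of concurrently inserted TIDs that is cheap to rescan whenever the visibility snapshot is reset, instead of bulk-scanning the whole new index. An illustrative Python sketch (all names invented; a real implementation would be an actual empty index sharing the predicate):

```python
import bisect

class InsertedTidLog:
    """Ordered log of TIDs inserted concurrently with R/CIC."""

    def __init__(self):
        self.tids = []  # kept sorted, like an index keyed on ctid

    def record(self, tid):
        # called for each concurrent insert (tid = (block, offset))
        bisect.insort(self.tids, tid)

    def since(self, tid):
        # everything inserted at or after a given TID boundary:
        # this small scan replaces a full-index bulkdelete pass
        return self.tids[bisect.bisect_left(self.tids, tid):]
```

The IO win is that only the concurrently inserted TIDs are revisited; the storage caveat from the message stands, since nothing bounds how many such inserts occur.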
Re: speed up a logical replica setup
On Thu, Jan 4, 2024 at 12:22 PM Shlok Kyal wrote: > > Hi, > I was testing the patch with following test cases: > > Test 1 : > - Create a 'primary' node > - Setup physical replica using pg_basebackup "./pg_basebackup –h > localhost –X stream –v –R –W –D ../standby " > - Insert data before and after pg_basebackup > - Run pg_subscriber and then insert some data to check logical > replication "./pg_subscriber –D ../standby -S “host=localhost > port=9000 dbname=postgres” -P “host=localhost port=9000 > dbname=postgres” -d postgres" > - Also check pg_publication, pg_subscriber and pg_replication_slots tables. > > Observation: > Data is not lost. Replication is happening correctly. Pg_subscriber is > working as expected. > > Test 2: > - Create a 'primary' node > - Use normal pg_basebackup but don’t set up Physical replication > "./pg_basebackup –h localhost –v –W –D ../standby" > - Insert data before and after pg_basebackup > - Run pg_subscriber > > Observation: > Pg_subscriber command is not completing and is stuck with following > log repeating: > LOG: waiting for WAL to become available at 0/3000168 > LOG: invalid record length at 0/3000150: expected at least 24, got 0 > I think probably the required WAL is not copied. Can you use the -X option to stream WAL as well and then test? But I feel in this case also, we should wait for some threshold time and then exit with failure, removing new objects created, if any. > Test 3: > - Create a 'primary' node > - Use normal pg_basebackup but don’t set up Physical replication > "./pg_basebackup –h localhost –v –W –D ../standby" > -Insert data before pg_basebackup but not after pg_basebackup > -Run pg_subscriber > > Observation: > Pg_subscriber command is not completing and is stuck with following > log repeating: > LOG: waiting for WAL to become available at 0/3000168 > LOG: invalid record length at 0/3000150: expected at least 24, got 0 > This is similar to the previous test and you can try the same option here as well. 
-- With Regards, Amit Kapila.
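The suggested behavior, "wait for some threshold time and then exit with failure, removing new objects created, if any", is essentially a bounded poll with cleanup on timeout. A hypothetical sketch (not pg_subscriber's code; names and defaults are assumptions):

```python
import time

def wait_for_recovery(check, timeout=5.0, interval=0.1, cleanup=None):
    """Poll check() until it succeeds or a deadline passes.

    On timeout, run the cleanup callback (e.g. drop the slots and
    subscriptions the tool created) and report failure, instead of
    looping forever on 'waiting for WAL to become available'."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if check():
            return True
        time.sleep(interval)
    if cleanup:
        cleanup()
    return False
```

With a deadline like this, the missing-WAL case in Tests 2 and 3 would end in a clean failure rather than an apparent hang.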
Re: speed up a logical replica setup
On Thu, Jan 4, 2024 at 12:30 PM Ashutosh Bapat wrote: > > On Wed, Jan 3, 2024 at 2:49 PM Amit Kapila wrote: > > > > c) Drop the replication slots d) Drop the > > > publications > > > > > > > I am not so sure about dropping publications because, unlike > > subscriptions which can start to pull the data, there is no harm with > > publications. Similar to publications there could be some user-defined > > functions or other other objects which may not be required once the > > standby replica is converted to subscriber. I guess we need to leave > > those to the user. > > > > IIUC, primary use of pg_subscriber utility is to start a logical > subscription from a physical base backup (to reduce initial sync time) > as against logical backup taken while creating a subscription. Hence I > am expecting that apart from this difference, the resultant logical > replica should look similar to the logical replica setup using a > logical subscription sync. Hence we should not leave any replication > objects around. UDFs (views, and other objects) may have some use on a > logical replica. We may replicate changes to UDF once DDL replication > is supported. But what good is having the same publications as primary > also on logical replica? > The one use case that comes to my mind is to set up bi-directional replication. The publishers want to subscribe to the new subscriber. -- With Regards, Amit Kapila.
Re: the s_lock_stuck on perform_spin_delay
On Thu, 4 Jan 2024 at 08:09, Andy Fan wrote: > > My question is if someone doesn't obey the rule by mistake (everyone > can make mistake), shall we PANIC on a production environment? IMO I > think it can be a WARNING on a production environment and be a stuck > when 'ifdef USE_ASSERT_CHECKING'. > [...] > I think a experienced engineer like Thomas can make this mistake and the > patch was reviewed by 3 peoples, the bug is still there. It is not easy > to say just don't do it. > > the attached code show the prototype in my mind. Any feedback is welcome. While I understand your point and could maybe agree with the change itself (a crash isn't great), I don't think it's an appropriate fix for the problem of holding a spinlock while waiting for a LwLock, as spin.h specifically mentions the following (and you quoted the same): """ Keep in mind the coding rule that spinlocks must not be held for more than a few instructions. """ I suspect that we'd be better off with stronger protections against waiting for LwLocks while we hold any spin lock. More specifically, I'm thinking about something like tracking how many spin locks we hold, and Assert()-ing that we don't hold any such locks when we start to wait for an LwLock or run CHECK_FOR_INTERRUPTS-related code (with potential manual contextual overrides in specific areas of code where specific care has been taken to make it safe to hold spin locks while doing those operations - although I consider their existence unlikely I can't rule them out as I've yet to go through all lock-touching code). This would probably work in a similar manner as START_CRIT_SECTION/END_CRIT_SECTION. Kind regards, Matthias van de Meent Neon (https://neon.tech)
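The proposal above, tracking how many spinlocks are held and Assert()-ing the count is zero before any operation that can sleep, could look roughly like this toy model (Python stand-in; in C this would be a per-backend counter bumped by the spinlock macros plus a check at the top of LWLockAcquire and CHECK_FOR_INTERRUPTS, with all names here invented):

```python
class LockTracker:
    """Sketch of the proposed spinlock-held bookkeeping."""

    def __init__(self):
        self.spins_held = 0

    def spin_acquire(self):
        self.spins_held += 1   # SpinLockAcquire would increment

    def spin_release(self):
        assert self.spins_held > 0
        self.spins_held -= 1   # SpinLockRelease would decrement

    def verify_no_spin_locks_held(self):
        # placed before LWLock waits / CHECK_FOR_INTERRUPTS,
        # analogous in spirit to START_/END_CRIT_SECTION bookkeeping
        assert self.spins_held == 0, "sleeping while holding a spinlock"
```

The misuse is then caught at the call site that is about to sleep, rather than minutes later as a stuck-spinlock PANIC in some unrelated backend.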
Re: pg_upgrade and logical replication
On Tue, 2 Jan 2024 at 15:58, Amit Kapila wrote: > > On Fri, Dec 29, 2023 at 2:26 PM vignesh C wrote: > > > > On Thu, 28 Dec 2023 at 15:59, Amit Kapila wrote: > > > > > > On Wed, Dec 13, 2023 at 12:09 PM vignesh C wrote: > > > > > > > > Thanks for the comments, the attached v25 version patch has the > > > > changes for the same. > > > > > > > > > > I have looked at it again and made some cosmetic changes like changing > > > some comments and a minor change in one of the error messages. See, if > > > the changes look okay to you. > > > > Thanks, the changes look good. > > > > Pushed. Thanks for pushing this patch, I have updated the commitfest entry to Committed for the same. Regards, Vignesh
Re: pg_upgrade and logical replication
On Wed, 3 Jan 2024 at 11:25, Amit Kapila wrote: > > On Wed, Jan 3, 2024 at 6:21 AM Michael Paquier wrote: > > > > On Tue, Jan 02, 2024 at 03:58:25PM +0530, Amit Kapila wrote: > > > On Fri, Dec 29, 2023 at 2:26 PM vignesh C wrote: > > >> Thanks, the changes look good. > > > > > > Pushed. > > > > Yeah! Thanks Amit and everybody involved here! Thanks also to Julien > > for raising the thread and the problem, to start with. > > > > I think the next possible step here is to document how to upgrade the > logical replication nodes as previously discussed in this thread [1]. > IIRC, there were a few issues with the steps mentioned but if we want > to document those we can start a separate thread for it as that > involves both publishers and subscribers. I have posted a patch for this at: https://www.postgresql.org/message-id/CALDaNm1_iDO6srWzntqTr0ZDVkk2whVhNKEWAvtgZBfSmuBeZQ%40mail.gmail.com Regards, Vignesh
Re: Synchronizing slots from primary to standby
On Wed, Jan 3, 2024 at 4:57 PM Bertrand Drouvot wrote: > > On Wed, Jan 03, 2024 at 04:20:03PM +0530, Amit Kapila wrote: > > On Fri, Dec 29, 2023 at 12:32 PM Amit Kapila > > wrote: > > > > > > On Fri, Dec 29, 2023 at 6:59 AM Masahiko Sawada > > > wrote: > > > > > > > > On Wed, Dec 27, 2023 at 7:43 PM Amit Kapila > > > > wrote: > > > > > > > > > > > > > > > > > 3) The slotsync worker uses primary_conninfo but also uses a new GUC > > > > > > parameter, say slot_sync_dbname, to specify the database to connect. > > > > > > The slot_sync_dbname overwrites the dbname if primary_conninfo also > > > > > > specifies it. If both don't have a dbname, raise an error. > > > > > > > > > > > > > > > > Would the users prefer to provide a value for a separate GUC instead > > > > > of changing primary_conninfo? It is possible that we can have some > > > > > users prefer to use one GUC and others prefer a separate GUC but we > > > > > should add a new GUC if we are sure that is what users would prefer. > > > > > Also, even if have to consider this option, I think we can easily > > > > > later add a new GUC to provide a dbname in addition to having the > > > > > provision of giving it in primary_conninfo. > > > > > > > > I think having two separate GUCs is more flexible for example when > > > > users want to change the dbname to connect. It makes sense that the > > > > slotsync worker wants to use the same connection string as the > > > > walreceiver uses. But I guess today most primary_conninfo settings > > > > that are set manually or are generated by tools such as pg_basebackup > > > > don't have dbname. If we require a dbname in primary_conninfo, many > > > > tools will need to be changed. Once the connection string is > > > > generated, it would be tricky to change the dbname in it, as Shveta > > > > mentioned. The users will have to carefully select the database to > > > > connect when taking a base backup. 
> > > > > > > > > > I see your point and agree that users need to be careful. I was trying > > > to compare it with other places like the conninfo used with a > > > subscription where no separate dbname needs to be provided. Now, here > > > the situation is not the same because the same conninfo is used for > > > different purposes (walreceiver doesn't require dbname (dbname is > > > ignored even if present) whereas slotsyncworker requires dbname). I > > > was just trying to see if we can avoid having a new GUC for this > > > purpose. Does anyone else have an opinion on this matter? > > > > > > > Bertrand, Dilip, and others involved in this thread or otherwise, see > > if you can share an opinion on the above point because it would be > > good to get some more opinions before we decide to add a new GUC (for > > dbname) for slotsync worker. > > > > I think that as long as enable_syncslot is off then there is no need to add > the > dbname in primary_conninfo (means there is no need to change an existing > primary_conninfo > for the ones that don't use the sync slot feature). > > So given that primary_conninfo does not necessary need to be changed (for > ones that > don't use the sync slot feature) and that adding a new GUC looks more a > one-way door > change to me, I'd vote to keep the patch as it is (we can still revisit this > later > on and add a new GUC if we feel the need based on user's feedback). > Okay, thanks for the feedback. Dilip also shares the same opinion, so let's wait and see if there is any strong argument to add this new GUC. -- With Regards, Amit Kapila.
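Whichever way this is resolved, the rule being discussed is simple to state: a separate GUC (slot_sync_dbname is only the proposed name) would override any dbname present in primary_conninfo, and if neither supplies one the slotsync worker must error out. An illustrative sketch with naive conninfo parsing (real conninfo parsing handles quoting and spaces):

```python
def resolve_slotsync_dbname(primary_conninfo, slot_sync_dbname=None):
    """Pick the database the slotsync worker should connect to."""
    # naive keyword=value parsing, good enough for the sketch
    opts = dict(kv.split('=', 1) for kv in primary_conninfo.split())
    dbname = slot_sync_dbname or opts.get('dbname')
    if dbname is None:
        raise ValueError('slot synchronization requires a dbname in '
                         'primary_conninfo or slot_sync_dbname')
    return dbname
```

Keeping only the primary_conninfo path (the patch as it stands) is the special case slot_sync_dbname=None of this rule, which is why the GUC can still be added later without breaking anything.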
Documentation to upgrade logical replication cluster
Hi,

We have documentation on how to upgrade "publisher" and "subscriber" at [1],
but currently we do not have any documentation on how to upgrade logical
replication clusters.

Here is a patch to document how to upgrade different logical replication
clusters:
a) Upgrade 2 node logical replication cluster
b) Upgrade cascaded logical replication cluster
c) Upgrade 2 node circular logical replication cluster.

Thoughts?

[1] - https://www.postgresql.org/docs/devel/pgupgrade.html

Regards,
Vignesh

From 9458a2c62a0702316d9ab339cd01dac0e088c52e Mon Sep 17 00:00:00 2001
From: Vignesh C
Date: Wed, 13 Dec 2023 14:11:58 +0530
Subject: [PATCH v1] Documentation for upgrading logical replication cluster.

Documentation for upgrading logical replication cluster.
---
 doc/src/sgml/ref/pgupgrade.sgml | 618 +++-
 1 file changed, 616 insertions(+), 2 deletions(-)

diff --git a/doc/src/sgml/ref/pgupgrade.sgml b/doc/src/sgml/ref/pgupgrade.sgml
index 87be1fb1c2..87f453d509 100644
--- a/doc/src/sgml/ref/pgupgrade.sgml
+++ b/doc/src/sgml/ref/pgupgrade.sgml
@@ -383,7 +383,7 @@ make prefix=/usr/local/pgsql.new install

-
+

    Prepare for publisher upgrades

@@ -456,7 +456,7 @@ make prefix=/usr/local/pgsql.new install

-
+

    Prepare for subscriber upgrades

@@ -506,6 +506,620 @@ make prefix=/usr/local/pgsql.new install

+
+Prepare for logical replication cluster upgrades
+
+
+ Migration of logical replication clusters can be done when all the members
+ of the old logical replication clusters are version 17.0 or later.
+
+
+
+ The logical replication restrictions apply to logical replication cluster
+ upgrades also. See for
+ the details of logical replication restrictions.
+
+
+ The prerequisites of publisher upgrade apply to logical replication
+ cluster upgrades also. See
+ for the details of publisher upgrade prerequisites.
+
+
+ The prerequisites of subscriber upgrade apply to logical replication
+ cluster upgrades also. See
+ for the details of subscriber upgrade prerequisites.
+
+
+
+ Upgrading logical replication cluster requires multiple steps to be
+ performed on various nodes. Because not all operations are
+ transactional, the user is advised to take backups. Backups can be taken
+ as described in .
+
+
+
+ The steps to upgrade logical replication clusters in various scenarios are
+ given below.
+
+
+
+ Steps to Upgrade 2 node logical replication cluster
+
+
+ Let's say publisher is in node1 and subscriber is
+ in node2.
+
+
+
+ Stop the publisher server in node1, e.g.:
+
+dba@node1:/opt/PostgreSQL/postgres//bin$ pg_ctl -D /opt/PostgreSQL/pub_data stop -l logfile
+
+
+
+ Disable all the subscriptions on node2 that are
+ subscribing the changes from node1 by using
+ ALTER SUBSCRIPTION ... DISABLE,
+ e.g.:
+
+node2=# ALTER SUBSCRIPTION sub1_node1_node2 DISABLE;
+ALTER SUBSCRIPTION
+node2=# ALTER SUBSCRIPTION sub2_node1_node2 DISABLE;
+ALTER SUBSCRIPTION
+
+
+
+ Upgrade the publisher node node1's server to the
+ required newer version, e.g.:
+
+dba@node1:/opt/PostgreSQL/postgres//bin$ pg_upgrade
+--old-datadir "/opt/PostgreSQL/postgres/17/pub_data"
+--new-datadir "/opt/PostgreSQL/postgres//pub_upgraded_data"
+--old-bindir "/opt/PostgreSQL/postgres/17/bin"
+--new-bindir "/opt/PostgreSQL/postgres//bin"
+
+
+
+ Start the upgraded publisher node node1's server, e.g.:
+
+dba@node1:/opt/PostgreSQL/postgres//bin$ pg_ctl -D /opt/PostgreSQL/pub_upgraded_data start -l logfile
+
+
+
+ Stop the subscriber server in node2, e.g.:
+
+dba@node2:/opt/PostgreSQL/postgres//bin$ pg_ctl -D /opt/PostgreSQL/sub_data stop -l logfile
+
+
+
+ Upgrade the subscriber node node2's server to
+ the required new version, e.g.:
+
+dba@node2:/opt/PostgreSQL/postgres//bin$ pg_upgrade
+--old-datadir "/opt/PostgreSQL/postgres/17/sub_data"
+--new-datadir "/opt/PostgreSQL/postgres//sub_upgraded_data"
+--old-bindir "/opt/PostgreSQL/postgres/17/bin"
+--new-bindir "/opt/PostgreSQL/postgres//bin"
+
+
+
+ Start the upgraded subscriber node node2's server,
+ e.g.:
+
+dba@node2:/opt/PostgreSQL/postgres//bin$ pg_ctl -D /opt/PostgreSQL/sub_upgraded_data start -l logfile
+
+
+
+ Create any tables that were created in the upgraded
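Condensed, the publisher-side portion of the two-node sequence above can be
sketched as a small dry-run script. All paths are taken from the patch, but the
new major version "18" is only an assumption (the patch elides the new version
number), and the `run` helper merely prints each command for review instead of
executing it:

```shell
#!/bin/sh
# Dry-run sketch of the two-node upgrade steps from the proposed docs.
# Paths mirror the patch; the "18" version directories are hypothetical.
set -eu

OLD_BIN=/opt/PostgreSQL/postgres/17/bin
NEW_BIN=/opt/PostgreSQL/postgres/18/bin

# Print each command instead of running it; swap the body for "$@" to execute.
run() { printf '+ %s\n' "$*"; }

# 1. Stop the old publisher on node1.
run "$OLD_BIN/pg_ctl" -D /opt/PostgreSQL/pub_data stop -l logfile

# 2. Disable the subscriptions on node2 (issued there via psql).
run psql -c "ALTER SUBSCRIPTION sub1_node1_node2 DISABLE;"
run psql -c "ALTER SUBSCRIPTION sub2_node1_node2 DISABLE;"

# 3. Upgrade the publisher's data directory with pg_upgrade.
run "$NEW_BIN/pg_upgrade" \
    --old-datadir /opt/PostgreSQL/postgres/17/pub_data \
    --new-datadir /opt/PostgreSQL/postgres/18/pub_upgraded_data \
    --old-bindir "$OLD_BIN" --new-bindir "$NEW_BIN"

# 4. Start the upgraded publisher.
run "$NEW_BIN/pg_ctl" -D /opt/PostgreSQL/pub_upgraded_data start -l logfile
```

The same stop/upgrade/start pattern then repeats on node2 for the subscriber,
as the steps above show.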
Re: Transaction timeout
> On 4 Jan 2024, at 07:14, Japin Li wrote:
>
> Isn't the timeout too short for testing? I see the timeouts for lock_timeout
> and statement_timeout are much bigger than transaction_timeout.

Makes sense. Done. I've also put some effort into fine-tuning the timeouts in
the tests for Nik's case. To keep a 100ms gap between the check, the false
positive we had, and the actual bug, I had to use transaction_timeout = 300ms.
Currently all tests take more than 1000ms! But I do not see a way to make
these tests both stable and short.

Best regards, Andrey Borodin.

v22-0001-Introduce-transaction_timeout.patch
Description: Binary data
v22-0002-Add-tests-for-transaction_timeout.patch
Description: Binary data
v22-0003-Try-to-enable-transaction_timeout-before-next-co.patch
Description: Binary data
v22-0004-fix-reschedule-timeout-for-each-commmand.patch
Description: Binary data
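For readers unfamiliar with the GUC under discussion, a session using the
300ms setting mentioned above would look roughly like this. This is a sketch
assuming the patch's behavior of terminating the session once a transaction
runs longer than the limit; the exact error text may differ:

```
postgres=# SET transaction_timeout = '300ms';
SET
postgres=# BEGIN;
BEGIN
postgres=*# SELECT pg_sleep(0.5);   -- sleeps past the 300ms budget
FATAL:  terminating connection due to transaction timeout
```

The tight 100ms margins between such a timeout firing and the test's checks
are what make the tests slow to run yet hard to shorten without flakiness.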