Re: verify predefined LWLocks have entries in wait_event_names.txt

2024-01-04 Thread Bertrand Drouvot
Hi,

On Fri, Jan 05, 2024 at 12:11:44AM -0600, Nathan Bossart wrote:
> On Wed, Jan 03, 2024 at 07:59:45AM +, Bertrand Drouvot wrote:
> > +1 to add a test and put it in a place that would produce failures at
> > build time. I think that having the test in the script that generates the
> > header file is more appropriate (as building the documentation looks less
> > usual to me when working on a patch).
> 
> Okay, I did that in v2.

Thanks!

> +# NB: Predefined locks (those declared in lwlocknames.txt) must be listed in
> +# the top section of locks and must be listed in the same order as in
> +# lwlocknames.txt.
> +#
>  
>  Section: ClassName - WaitEventLWLock
>  
> @@ -326,6 +330,12 @@ NotifyQueueTail  "Waiting to update limit on NOTIFY message st
>  WaitEventExtension	"Waiting to read or update custom wait events information for extensions."
>  WALSummarizer	"Waiting to read or update WAL summarization state."
>  
> +#
> +# Predefined LWLocks (those declared in lwlocknames.txt) must be listed in the
> +# section above and must be listed in the same order as in lwlocknames.txt.
> +# Other LWLocks must be listed in the section below.
> +#
> +

Another option could be to create a sub-section for the predefined LWLocks that
are part of lwlocknames.txt and then sort both lists (the one in the sub-section
and the one in lwlocknames.txt). That would avoid the "must be listed in the
same order" constraint. That said, I think the way it is done in the patch is
fine because if one does not follow the constraint then the build fails.
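
For illustration, a minimal standalone sketch of that alternative (the lock
names below are just placeholders, not the real contents of the two files):
sort both lists and compare them element by element, so declaration order no
longer matters:

#!/usr/bin/perl
use strict;
use warnings;

# Pretend these were parsed from the two files.
my @lwlocknames        = qw(ShmemIndexLock OidGenLock XidGenLock);    # lwlocknames.txt
my @wait_event_lwlocks = qw(OidGenLock XidGenLock ShmemIndexLock);    # wait_event_names.txt

# Sort both lists so that declaration order no longer matters.
my @a = sort @lwlocknames;
my @b = sort @wait_event_lwlocks;

die "different number of predefined LWLocks in the two files"
  unless scalar(@a) == scalar(@b);

for my $i (0 .. $#a)
{
	die "mismatch: $a[$i] (lwlocknames.txt) vs $b[$i] (wait_event_names.txt)"
	  unless $a[$i] eq $b[$i];
}
print "predefined LWLock lists match\n";

The same idea could be folded into generate-lwlocknames.pl by sorting
@wait_event_lwlocks and the names read from lwlocknames.txt before the
comparison loop.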

I did a few tests leading to:

CommitBDTTsSLRALock defined in lwlocknames.txt but missing from
wait_event_names.txt at ./generate-lwlocknames.pl line 107, <$lwlocknames> line 58.

OR

lists of predefined LWLocks in lwlocknames.txt and wait_event_names.txt do not
match at ./generate-lwlocknames.pl line 109, <$lwlocknames> line 46.

OR

CommitBDTTsSLRALock defined in wait_event_names.txt but missing from
lwlocknames.txt at ./generate-lwlocknames.pl line 126, <$lwlocknames> line 57.

So, that looks good to me except one remark regarding:

+   die "lists of predefined LWLocks in lwlocknames.txt and 
wait_event_names.txt do not match"
+ unless $wait_event_lwlocks[$i] eq $lockname;

What about printing $wait_event_lwlocks[$i] and $lockname in the error message?
Something like?

"
die "lists of predefined LWLocks in lwlocknames.txt and 
wait_event_names.txt do not match (comparing $lockname and 
$wait_event_lwlocks[$i])"
  unless $wait_event_lwlocks[$i] eq $lockname;
"

I think that would give more clues for debugging purpose.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Adding facility for injection points (or probe points?) for more advanced tests

2024-01-04 Thread Ashutosh Bapat
On Fri, Jan 5, 2024 at 12:49 PM Michael Paquier  wrote:
>
> I mean why?  We test a bunch of stuff in src/test/modules/, and this
> is not intended to be released to the outside world.
>
> Putting that in contrib/ has a lot of extra cost.  One is
> documentation and more complexity regarding versioning when it comes
> to upgrading it to a new version.  I don't think that it is a good
> idea to deal with this extra load of work for something that I'd aim
> to be used for having improved *test* coverage, and the build switch
> should stay.  Saying that, I'd be OK with renaming the module to
> injection_points,

Ok. Thanks.

> but I will fight hard about keeping that in
> src/test/modules/.  That's less maintenance headache to think about
> when having to deal with complex racy bugs.

For me, getting this feature into the code in a usable manner is more
important than its place in the code. I have no plans to fight over
it. :)

-- 
Best Wishes,
Ashutosh Bapat




Re: [17] CREATE SUBSCRIPTION ... SERVER

2024-01-04 Thread Ashutosh Bapat
On Fri, Jan 5, 2024 at 6:26 AM Jeff Davis  wrote:

>
> > 2. Can one use {FDW, user_mapping, foreign_server} combo other than
> > the built-in pg_connection_fdw?
>
> Yes, you can use any FDW for which you have USAGE privileges, which passes
> the validations, and which provides enough of the expected fields to form a
> connection string.
>
> There was some discussion on this point already. Initially, I
> implemented it with more catalog and grammar support, which improved
> error checking, but others objected that the grammar wasn't worth it
> and that it was too inflexible. See:
>
> https://www.postgresql.org/message-id/172273.1693403385%40sss.pgh.pa.us
> https://www.postgresql.org/message-id/CAExHW5unvpDv6yMSmqurHP7Du1PqoJFWVxeK-4YNm5EnoNJiSQ%40mail.gmail.com
>

Can you please provide an example of using postgres_fdw to create a
subscription with this patch? I think we should document it in
postgres_fdw and add a test for it.

-- 
Best Wishes,
Ashutosh Bapat




Re: Adding facility for injection points (or probe points?) for more advanced tests

2024-01-04 Thread Michael Paquier
On Fri, Jan 05, 2024 at 12:41:33PM +0530, Ashutosh Bapat wrote:
> Well, you have already shown that the SQL interface created for the
> test module is being used for testing a core feature. The tests for
> that should stay somewhere near the other tests for those features.
> Using an extension named "test_injection_point" that resides in a
> test module for testing core features doesn't look great. Hence the
> suggestion to move it to contrib.

I mean why?  We test a bunch of stuff in src/test/modules/, and this
is not intended to be released to the outside world.

Putting that in contrib/ has a lot of extra cost.  One is
documentation and more complexity regarding versioning when it comes
to upgrading it to a new version.  I don't think that it is a good
idea to deal with this extra load of work for something that I'd aim
to be used for having improved *test* coverage, and the build switch
should stay.  Saying that, I'd be OK with renaming the module to
injection_points, but I will fight hard about keeping that in
src/test/modules/.  That's less maintenance headache to think about
when having to deal with complex racy bugs.
--
Michael




Re: Adding facility for injection points (or probe points?) for more advanced tests

2024-01-04 Thread Ashutosh Bapat
On Fri, Jan 5, 2024 at 5:08 AM Michael Paquier  wrote:
>
> >> I suggest we move test_injection_points from src/test/modules to
> >> contrib/ and rename it as "injection_points". The test files may still
> >> be named as test_injection_point. The TAP tests in 0003 and 0004 once
> >> moved to their appropriate places, will load injection_point extension
> >> and use it. That way predefined injection point callbacks will also be
> >> available for others to use.
> >
> > Rather than defining a module somewhere that tests would need to load,
> > should we just put the common callbacks in the core server?  Unless there's
> > a strong reason to define them elsewhere, that could be a nice way to save
> > a step in the tests.
>
> Nah, having some pre-existing callbacks in the backend is
> against the original minimalistic design spirit.  These would also
> require an SQL interface, and the interface design also depends on the
> functions registering them when pushing down custom conditions.
> Pushing that down to extensions to do what they want will lead to less
> noise, particularly if you consider that we will most likely want to
> tweak the callback interfaces for backpatched bugs.  That's also why I
> think contrib/ is not a good idea, src/test/modules/ serving the
> actual testing purpose here.

Well, you have already shown that the SQL interface created for the
test module is being used for testing a core feature. The tests for
that should stay somewhere near the other tests for those features.
Using an extension named "test_injection_point" that resides in a
test module for testing core features doesn't look great. Hence the
suggestion to move it to contrib.

-- 
Best Wishes,
Ashutosh Bapat




Re: speed up a logical replica setup

2024-01-04 Thread Shlok Kyal
On Thu, 4 Jan 2024 at 16:46, Amit Kapila  wrote:
>
> On Thu, Jan 4, 2024 at 12:22 PM Shlok Kyal  wrote:
> >
> > Hi,
> > I was testing the patch with following test cases:
> >
> > Test 1:
> > - Create a 'primary' node
> > - Set up a physical replica using pg_basebackup:
> >   "./pg_basebackup -h localhost -X stream -v -R -W -D ../standby"
> > - Insert data before and after pg_basebackup
> > - Run pg_subscriber and then insert some data to check logical replication:
> >   "./pg_subscriber -D ../standby -S 'host=localhost port=9000 dbname=postgres'
> >    -P 'host=localhost port=9000 dbname=postgres' -d postgres"
> > - Also check the pg_publication, pg_subscription and pg_replication_slots tables.
> >
> > Observation:
> > Data is not lost. Replication is happening correctly. Pg_subscriber is
> > working as expected.
> >
> > Test 2:
> > - Create a 'primary' node
> > - Use a normal pg_basebackup but don't set up physical replication:
> >   "./pg_basebackup -h localhost -v -W -D ../standby"
> > - Insert data before and after pg_basebackup
> > - Run pg_subscriber
> >
> > Observation:
> > The pg_subscriber command does not complete and is stuck with the following
> > log repeating:
> > LOG: waiting for WAL to become available at 0/3000168
> > LOG: invalid record length at 0/3000150: expected at least 24, got 0
> >
>
> I think probably the required WAL is not copied. Can you use the -X
> option to stream WAL as well and then test? But I feel in this case
> also, we should wait for some threshold time and then exit with
> failure, removing new objects created, if any.

I have tested with the -X stream option in pg_basebackup as well. In this
case too, the pg_subscriber command gets stuck.
logs:
2024-01-05 11:49:34.436 IST [61948] LOG:  invalid resource manager ID
102 at 0/3000118
2024-01-05 11:49:34.436 IST [61948] LOG:  waiting for WAL to become
available at 0/3000130

>
> > Test 3:
> > - Create a 'primary' node
> > - Use a normal pg_basebackup but don't set up physical replication:
> >   "./pg_basebackup -h localhost -v -W -D ../standby"
> > - Insert data before pg_basebackup but not after
> > - Run pg_subscriber
> >
> > Observation:
> > The pg_subscriber command does not complete and is stuck with the following
> > log repeating:
> > LOG: waiting for WAL to become available at 0/3000168
> > LOG: invalid record length at 0/3000150: expected at least 24, got 0
> >
>
> This is similar to the previous test and you can try the same option
> here as well.
For this test as well, I tried with the -X stream option in pg_basebackup.
It gets stuck here too, with a similar log.

Will investigate the issue further.


Thanks and regards
Shlok Kyal




Re: verify predefined LWLocks have entries in wait_event_names.txt

2024-01-04 Thread Nathan Bossart
On Wed, Jan 03, 2024 at 11:34:25AM +0900, Michael Paquier wrote:
> On Tue, Jan 02, 2024 at 11:31:20AM -0600, Nathan Bossart wrote:
>> +# Find the location of lwlocknames.h.
>> +my $include_dir = $node->config_data('--includedir');
>> +my $lwlocknames_file = "$include_dir/server/storage/lwlocknames.h";
> 
> I am afraid that this is incorrect because an installation could
> decide to install server-side headers in a different path than
> $include/server/.  Using --includedir-server would be the correct
> answer, appending "storage/lwlocknames.h" to the path retrieved from
> pg_config.

Ah, good to know, thanks.
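
For reference, a minimal sketch of how a TAP test might locate the installed
header that way, using config_data('--includedir-server') (the v2 patch below
takes the other approach and does the check in generate-lwlocknames.pl
instead):

use strict;
use warnings;
use PostgreSQL::Test::Cluster;
use PostgreSQL::Test::Utils;
use Test::More;

my $node = PostgreSQL::Test::Cluster->new('main');

# --includedir-server points at the server-side header directory, which an
# installation may place somewhere other than $include/server/.
my $include_dir      = $node->config_data('--includedir-server');
my $lwlocknames_file = "$include_dir/storage/lwlocknames.h";

ok(-f $lwlocknames_file, "found $lwlocknames_file");
done_testing();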

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: verify predefined LWLocks have entries in wait_event_names.txt

2024-01-04 Thread Nathan Bossart
On Wed, Jan 03, 2024 at 07:59:45AM +, Bertrand Drouvot wrote:
> +1 to add a test and put it in a place that would produce failures at build
> time. I think that having the test in the script that generates the header
> file is more appropriate (as building the documentation looks less usual to
> me when working on a patch).

Okay, I did that in v2.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
From 587f6a2c108fc7496fcb3d5764ca06ac5fb21326 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Fri, 5 Jan 2024 00:07:40 -0600
Subject: [PATCH v2 1/1] verify lists of predefined LWLocks in lwlocknames.txt
 and wait_event_names.txt match

---
 src/backend/Makefile  |  2 +-
 src/backend/storage/lmgr/Makefile |  4 +-
 .../storage/lmgr/generate-lwlocknames.pl  | 63 +++
 .../utils/activity/wait_event_names.txt   | 10 +++
 src/include/storage/meson.build   |  2 +-
 5 files changed, 77 insertions(+), 4 deletions(-)

diff --git a/src/backend/Makefile b/src/backend/Makefile
index 816461aa17..7d2ea7d54a 100644
--- a/src/backend/Makefile
+++ b/src/backend/Makefile
@@ -130,7 +130,7 @@ $(top_builddir)/src/port/libpgport_srv.a: | submake-libpgport
 parser/gram.h: parser/gram.y
 	$(MAKE) -C parser gram.h
 
-storage/lmgr/lwlocknames.h: storage/lmgr/generate-lwlocknames.pl storage/lmgr/lwlocknames.txt
+storage/lmgr/lwlocknames.h: storage/lmgr/generate-lwlocknames.pl storage/lmgr/lwlocknames.txt utils/activity/wait_event_names.txt
 	$(MAKE) -C storage/lmgr lwlocknames.h lwlocknames.c
 
 utils/activity/wait_event_types.h: utils/activity/generate-wait_event_types.pl utils/activity/wait_event_names.txt
diff --git a/src/backend/storage/lmgr/Makefile b/src/backend/storage/lmgr/Makefile
index c48ba943c4..504480e847 100644
--- a/src/backend/storage/lmgr/Makefile
+++ b/src/backend/storage/lmgr/Makefile
@@ -39,8 +39,8 @@ s_lock_test: s_lock.c $(top_builddir)/src/common/libpgcommon.a $(top_builddir)/s
 lwlocknames.c: lwlocknames.h
 	touch $@
 
-lwlocknames.h: $(top_srcdir)/src/backend/storage/lmgr/lwlocknames.txt generate-lwlocknames.pl
-	$(PERL) $(srcdir)/generate-lwlocknames.pl $<
+lwlocknames.h: $(top_srcdir)/src/backend/storage/lmgr/lwlocknames.txt $(top_srcdir)/src/backend/utils/activity/wait_event_names.txt generate-lwlocknames.pl
+	$(PERL) $(srcdir)/generate-lwlocknames.pl $^
 
 check: s_lock_test
 	./s_lock_test
diff --git a/src/backend/storage/lmgr/generate-lwlocknames.pl b/src/backend/storage/lmgr/generate-lwlocknames.pl
index bef5e16e11..9afee7984b 100644
--- a/src/backend/storage/lmgr/generate-lwlocknames.pl
+++ b/src/backend/storage/lmgr/generate-lwlocknames.pl
@@ -15,6 +15,7 @@ my $continue = "\n";
 GetOptions('outdir:s' => \$output_path);
 
 open my $lwlocknames, '<', $ARGV[0] or die;
+open my $wait_event_names, '<', $ARGV[1] or die;
 
 # Include PID in suffix in case parallel make runs this multiple times.
 my $htmp = "$output_path/lwlocknames.h.tmp$$";
@@ -30,6 +31,59 @@ print $c $autogen, "\n";
 
 print $c "const char *const IndividualLWLockNames[] = {";
 
+#
+# First, record the predefined LWLocks listed in wait_event_names.txt.  We'll
+# cross-check those with the ones in lwlocknames.txt.
+#
+my @wait_event_lwlocks;
+my $found_lwlock_section = 0;
+my $recording_lwlocks = 0;
+
+while (<$wait_event_names>)
+{
+	chomp;
+
+	# Skip comments
+	next if /^#/;
+
+	# Skip empty lines
+	if (/^\s*$/)
+	{
+		#
+		# If we encounter an empty line while recording the LWLocks listed in
+		# wait_event_names.txt, we are done.
+		#
+		last if $recording_lwlocks;
+
+		#
+		# Begin recording the LWLocks listed in wait_event_names.txt following
+		# the empty line after the WaitEventLWLock section declaration.
+		#
+		$recording_lwlocks = 1 if $found_lwlock_section;
+		next;
+	}
+
+	#
+	# Prepare to record the LWLocks when we find the WaitEventLWLock section
+	# declaration.
+	#
+	if (/^Section: ClassName(.*)/)
+	{
+		my $section_name = $_;
+		$section_name =~ s/^.*- //;
+		$found_lwlock_section = 1 if $section_name eq "WaitEventLWLock";
+		next;
+	}
+
+	# Go to the next line if we are not yet recording LWLocks.
+	next if not $recording_lwlocks;
+
+	# Record the LWLock.
+	(my $waiteventname, my $waitevendocsentence) = split(/\t/, $_);
+	push(@wait_event_lwlocks, $waiteventname . "Lock");
+}
+
+my $i = 0;
 while (<$lwlocknames>)
 {
 	chomp;
@@ -50,6 +104,12 @@ while (<$lwlocknames>)
 	die "lwlocknames.txt not in order" if $lockidx < $lastlockidx;
 	die "lwlocknames.txt has duplicates" if $lockidx == $lastlockidx;
 
+	die $lockname . " defined in lwlocknames.txt but missing from wait_event_names.txt"
+	  unless $i < scalar(@wait_event_lwlocks);
+	die "lists of predefined LWLocks in lwlocknames.txt and wait_event_names.txt do not match"
+	  unless $wait_event_lwlocks[$i] eq $lockname;
+	$i++;
+
 	while ($lastlockidx < $lockidx - 1)
 	{
 		++$lastlockidx;
@@ -63,6 +123,9 @@ while (<$lwlocknames>)
 	print $h "#define $lockname ([$lockidx].lock)\n";
 }

Re: SET ROLE x NO RESET

2024-01-04 Thread Michał Kłeczek



> On 3 Jan 2024, at 18:22, Jelte Fennema-Nio  wrote:
> 
> 
>> In my case I have scripts that I want to execute with limited privileges
>> and make sure the scripts cannot escape the sandbox via RESET ROLE.
> 
> Depending on the desired workflow I think that could work for you too.
> Because it allows you to do this (and use -f script.sql instead of -c
> 'select ...):
> 
> ❯ psql "user=postgres _pq_.protocol_managed_params=role options='-c
> role=pg_read_all_data'" -c 'select current_user; set role postgres'
>   current_user
> ──
> pg_read_all_data
> (1 row)
> 
> ERROR:  42501: parameter can only be set at the protocol level "role"
> LOCATION:  set_config_with_handle, guc.c:3583
> Time: 0.667 ms

My scripts are actually Liquibase change logs.
I’ve extended Liquibase so that each change set is executed with limited 
privileges.

While doable with a protocol-level implementation, it would require support
from PgJDBC.

—
Michal





Re: Documentation to upgrade logical replication cluster

2024-01-04 Thread Peter Smith
On Fri, Jan 5, 2024 at 2:38 PM Hayato Kuroda (Fujitsu)
 wrote:
...
> 2.
> I'm not sure it should be listed as step 10. I felt that it should be a new
> section. At that time, other steps like "Prepare for {publisher|subscriber}
> upgrades" can be moved as well.
> Thoughts?

During my review, I also felt that step 10 is now so long that it is a
distraction from the other content on this page.

==
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Documentation to upgrade logical replication cluster

2024-01-04 Thread Peter Smith
Here are some review comments for patch v1-0001.

==
doc/src/sgml/ref/pgupgrade.sgml

1. GENERAL - blank lines

Most (but not all) of your procedure steps are preceded by blank lines
to make them more readable in the SGML. Add the missing blank lines
for the steps that didn't have them.

2. GENERAL - for e.g.:

All the "for e.g:" that precedes the code examples can just say
"e.g.:" like in other examples on this page.

~~~
3. GENERAL - reference from elsewhere

I was wondering if "Chapter 30. Logical Replication" should include a
section that references back to all this just to make it easier to
find.

~~~

4.
+
+ Migration of logical replication clusters can be done when all the members
+ of the old logical replication clusters are version 17.0 or later.
+

/can be done when/is possible only when/

~~~

5.
+
+ The prerequisites of publisher upgrade applies to logical Replication
+ cluster upgrades also. See 
+ for the details of publisher upgrade prerequisites.
+

/applies to/apply to/
/logical Replication/logical replication/

~~~

6.
+
+ The prerequisites of subscriber upgrade applies to logical Replication
+ cluster upgrades also. See 
+ for the details of subscriber upgrade prerequisites.
+
+   

/applies to/apply to/
/logical Replication/logical replication/

~~~

7.
+
+ The steps to upgrade logical replication clusters in various scenarios are
+ given below.
+

The 3 titles do not render very prominently, so it is too easy to get
lost scrolling up and down looking for the different scenarios. If the
title rendering can't be improved, at least a list of 3 links here
(like a TOC) would be helpful.

~~~

//
Steps to Upgrade 2 node logical replication cluster
//

8. GENERAL - server names

I noticed in this set of steps you called the servers 'pub_data',
'pub_upgraded_data', 'sub_data' and 'sub_upgraded_data'. While that is
easy to read, it is also different from all the subsequent procedures,
where the names are just 'data1', 'data2', 'data3', and 'data1_upgraded',
'data2_upgraded', 'data3_upgraded'.

I feel it may be better to use consistent naming for all the procedures.

~~~

9.
+ 
+  Steps to Upgrade 2 node logical replication cluster

SUGGESTION
Steps to upgrade a two-node logical replication cluster

~~~

10.
+
+   
+
+ 
+  Let's say publisher is in node1 and subscriber is
+  in node2.
+ 
+

10a.
This renders as Step 1. But IMO this should not be a "step" at all --
it's just a description of the scenario.

~

10b.
The subsequent steps refer to subscriptions 'sub1_node1_node2' and
'sub2_node1_node2'. IMO it would help with the example code if those
are named up front here too. e.g.

node2 has two subscriptions for changes from node1:
sub1_node1_node2
sub2_node1_node2

~~~

11.
+
+ 
+  Upgrade the publisher node node1's server to the
+  required newer version, for e.g.:

The wording repeating node/node1 seems complicated.

SUGGESTION
Upgrade the publisher node's server to the required newer version, e.g.:

~~~

12.
+
+ 
+  Start the upgraded publisher node
node1's server, for e.g.:

IMO better to use the similar wording used for the "Stop" step

SUGGESTION
Start the upgraded publisher server in node1, e.g.:

~~~

13.
+
+ 
+  Upgrade the subscriber node node2's server to
+  the required new version, for e.g.:

The wording repeating node/node2 seems complicated.

SUGGESTION
Upgrade the subscriber node's server to the required newer version, e.g.:

~~~

14.
+
+ 
+  Start the upgraded subscriber node node2's server,
+  for e.g.:

IMO better to use the similar wording used for the "Stop" step

SUGGESTION
Start the upgraded subscriber server in node2, e.g.:

~~~

15.
+
+ 
+  Create any tables that were created in the upgraded
publisher node1
+  server between step-5 and now, for e.g.:
+
+node2=# CREATE TABLE distributors (
+node2(# did  integer CONSTRAINT no_null NOT NULL,
+node2(# name varchar(40) NOT NULL
+node2(# );
+CREATE TABLE
+
+ 
+

15a.
Maybe it is better to have a link to step 5 instead of just hardwiring
"Step-5" in the text.

~

15b.
I didn't think it was needed to spread the CREATE TABLE across
multiple lines. It is just a dummy example anyway so IMO better to use
up less space.

~~~

16.
+
+ 
+  Refresh the publications using
+  ALTER
SUBSCRIPTION ... REFRESH PUBLICATION,

/Refresh the publications/Refresh the subscription's publications/

~~~

//
Steps to upgrade cascaded logical replication clusters
//

(these comments are similar to those in the previous procedure, but I
will give them all again)

17.
+
+ 
+  Steps to upgrade cascaded logical replication clusters
+   
+
+ 
+ 

Re: SQL:2011 application time

2024-01-04 Thread jian he
On Tue, Jan 2, 2024 at 9:59 AM Paul Jungwirth
 wrote:
>
> On 12/31/23 00:51, Paul Jungwirth wrote:
> > That's it for now.
>
> Here is another update. I fixed FOR PORTION OF on partitioned tables, in 
> particular when the attnums
> are different from the root partition.
>
> Rebased to cea89c93a1.
>

Hi.

+/*
+ * range_without_portion_internal - Sets outputs and outputn to the ranges
+ * remaining and their count (respectively) after subtracting r2 from r1.
+ * The array should never contain empty ranges.
+ * The outputs will be ordered. We expect that outputs is an array of
+ * RangeType pointers, already allocated with two slots.
+ */
+void
+range_without_portion_internal(TypeCacheEntry *typcache, RangeType *r1,
+   RangeType *r2, RangeType **outputs, int *outputn)
+{
+ int cmp_l1l2,
+ cmp_l1u2,
+ cmp_u1l2,
+ cmp_u1u2;
+ RangeBound lower1,
+ lower2;
+ RangeBound upper1,
+ upper2;
+ bool empty1,
+ empty2;
+
+ range_deserialize(typcache, r1, &lower1, &upper1, &empty1);
+ range_deserialize(typcache, r2, &lower2, &upper2, &empty2);
+
+ if (empty1)
+ {
+ /* if r1 is empty then r1 - r2 is empty, so return zero results */
+ *outputn = 0;
+ return;
+ }
+ else if (empty2)
+ {
+ /* r2 is empty so the result is just r1 (which we know is not empty) */
+ outputs[0] = r1;
+ *outputn = 1;
+ return;
+ }
+
+ /*
+ * Use the same logic as range_minus_internal,
+ * but support the split case
+ */
+ cmp_l1l2 = range_cmp_bounds(typcache, &lower1, &lower2);
+ cmp_l1u2 = range_cmp_bounds(typcache, &lower1, &upper2);
+ cmp_u1l2 = range_cmp_bounds(typcache, &upper1, &lower2);
+ cmp_u1u2 = range_cmp_bounds(typcache, &upper1, &upper2);
+
+ if (cmp_l1l2 < 0 && cmp_u1u2 > 0)
+ {
+ lower2.inclusive = !lower2.inclusive;
+ lower2.lower = false; /* it will become the upper bound */
+ outputs[0] = make_range(typcache, &lower1, &lower2, false, NULL);
+
+ upper2.inclusive = !upper2.inclusive;
+ upper2.lower = true; /* it will become the lower bound */
+ outputs[1] = make_range(typcache, &upper2, &upper1, false, NULL);
+
+ *outputn = 2;
+ }
+ else if (cmp_l1u2 > 0 || cmp_u1l2 < 0)
+ {
+ outputs[0] = r1;
+ *outputn = 1;
+ }
+ else if (cmp_l1l2 >= 0 && cmp_u1u2 <= 0)
+ {
+ *outputn = 0;
+ }
+ else if (cmp_l1l2 <= 0 && cmp_u1l2 >= 0 && cmp_u1u2 <= 0)
+ {
+ lower2.inclusive = !lower2.inclusive;
+ lower2.lower = false; /* it will become the upper bound */
+ outputs[0] = make_range(typcache, &lower1, &lower2, false, NULL);
+ *outputn = 1;
+ }
+ else if (cmp_l1l2 >= 0 && cmp_u1u2 >= 0 && cmp_l1u2 <= 0)
+ {
+ upper2.inclusive = !upper2.inclusive;
+ upper2.lower = true; /* it will become the lower bound */
+ outputs[0] = make_range(typcache, &upper2, &upper1, false, NULL);
+ *outputn = 1;
+ }
+ else
+ {
+ elog(ERROR, "unexpected case in range_without_portion");
+ }
+}

I am confused.
Say the condition "(cmp_l1l2 <= 0 && cmp_u1l2 >= 0 && cmp_u1u2 <= 0)" holds:
then the following code will only ever run PartA, never PartB?

`
else if (cmp_l1l2 >= 0 && cmp_u1u2 <= 0)
PartA
else if (cmp_l1l2 <= 0 && cmp_u1l2 >= 0 && cmp_u1u2 <= 0)
PartB
`

A minimal example:

#include <stdio.h>
#include <assert.h>

int
main(void)
{
	int cmp_l1l2;
	int cmp_u1u2;
	int cmp_u1l2;
	int cmp_l1u2;

	cmp_l1u2 = -1;
	cmp_l1l2 = 0;
	cmp_u1u2 = 0;
	cmp_u1l2 = 0;
	assert(cmp_u1l2 == 0);
	if (cmp_l1u2 > 0 || cmp_u1l2 < 0)
		printf("calling partA\n");
	else if (cmp_l1l2 >= 0 && cmp_u1u2 <= 0)
		printf("calling partB\n");
	else if (cmp_l1l2 <= 0 && cmp_u1l2 >= 0 && cmp_u1u2 <= 0)
		printf("calling partC\n");
}

I am confused by the name "range_without_portion"; I think
"range_not_overlap" would be better.

select numrange(1.1, 2.2) @- numrange(2.0, 3.0);
the result is not the same as
select numrange(2.0, 3.0) @- numrange(1.1, 2.2);

So is your categorization of oprkind as 'b' for the operator "@-" wrong?
select oprname,oprkind,oprcanhash,oprcanmerge,oprleft,oprright,oprresult,oprcode
from pg_operator
where oprname = '@-';

Also, per
select count(*), oprkind from pg_operator group by oprkind;
only about 5% are prefix operators.
Maybe we should design it as:
1. if both inputs are empty ranges, the result array is empty.
2. if both inputs are non-empty and do not overlap, put both of them
into the result array.
3. if one input is empty and the other one is not, put the non-empty
one into the result array.

After applying the patch, the catalog data does not seem correct to me.
SELECT  a1.amopfamily
        ,a1.amoplefttype::regtype
        ,a1.amoprighttype
        ,a1.amopstrategy
        ,amoppurpose
        ,amopsortfamily
        ,amopopr
        ,op.oprname
        ,am.amname
FROM    pg_amop as a1 join pg_operator op on op.oid = a1.amopopr
join    pg_am am on am.oid = a1.amopmethod
where   amoppurpose = 'p';
output:
 amopfamily | amoplefttype | amoprighttype | amopstrategy | amoppurpose | amopsortfamily | amopopr | oprname | amname
------------+--------------+---------------+--------------+-------------+----------------+---------+---------+--------
       2593 | box          |           603 |           31 | p           |              0 |     803 | #       | gist
       3919 | anyrange     |          3831 |

Re: speed up a logical replica setup

2024-01-04 Thread Amit Kapila
On Thu, Jan 4, 2024 at 9:27 PM Euler Taveira  wrote:
>
> On Thu, Jan 4, 2024, at 3:05 AM, Amit Kapila wrote:
>
> Won't it be a better user experience if, after setting up the target
> server as a logical replica (subscriber), it started to work
> seamlessly without user intervention?
>
>
> If we have an option to control the replication slot removal (default is on),
> it seems a good UI. Even if the user decides to disable the replication slot
> removal, it should print a message saying that these replication slots can
> cause WAL retention.
>

As pointed out in the previous response, I think we should not proceed
with such a risk of WAL retention and dependency on other nodes; we
should either give an ERROR (default) or remove the slots if the user
provides an option. If we do so, do you think by default we can keep
the server started, or should we let the user start it later? I think one
advantage of letting the user start it later is that she gets a chance
to adjust config parameters in postgresql.conf, and by default we won't
be using system resources.

-- 
With Regards,
Amit Kapila.




Re: PL/pgSQL: Incomplete item Allow handling of %TYPE arrays, e.g. tab.col%TYPE[]

2024-01-04 Thread Pavel Stehule
čt 4. 1. 2024 v 22:02 odesílatel Tom Lane  napsal:

> Pavel Stehule  writes:
> > Now, I think so this simple patch is ready for committers
>
> I pushed this with some editorialization -- mostly, rewriting the
> documentation and comments.  I found that the existing docs for %TYPE
> were not great.  There are two separate use-cases, one for referencing
> a table column and one for referencing a previously-declared variable,
> and the docs were about as clear as mud about explaining that.
>
> I also looked into the problem Pavel mentioned that it doesn't work
> for RECORD.  If you just write "record[]" you get an error message
> that at least indicates it's an unsupported case:
>
> regression=# do $$declare r record[]; begin end$$;
> ERROR:  variable "r" has pseudo-type record[]
> CONTEXT:  compilation of PL/pgSQL function "inline_code_block" near line 1
>
> Maybe we could improve on that, but it would be a lot of work and
> I'm not terribly excited about it.  However, %TYPE fails entirely
> for both "record" and named composite types, and the reason turns
> out to be just that plpgsql_parse_wordtype fails to handle the
> PLPGSQL_NSTYPE_REC case.  So that's easily fixed.
>
> I also wonder what the heck the last half of plpgsql_parse_wordtype
> is for at all.  It looks for a named type, which means you can do
>
> regression=# do $$declare x float8%type; begin end$$;
> DO
>
> but that's just stupid.  You could leave off the %TYPE and get
> the same result.  Moreover, it is inconsistent because
> plpgsql_parse_cwordtype has no equivalent behavior:
>
> regression=# do $$declare x pg_catalog.float8%type; begin end$$;
> ERROR:  syntax error at or near "%"
> LINE 1: do $$declare x pg_catalog.float8%type; begin end$$;
> ^
> CONTEXT:  invalid type name "pg_catalog.float8%type"
>
> It's also undocumented and untested (the code coverage report
> shows this part is never reached).  So I propose we remove it.
>
> That leads me to the attached proposed follow-on patch.
>
> Another thing we could think about, but I've not done it here,
> is to make plpgsql_parse_wordtype and friends throw error
> instead of just returning NULL when they don't find the name.
> Right now, if NULL is returned, we end up passing the whole
> string to parse_datatype, leading to unhelpful errors like
> the one shown above.  We could do better than that I think,
> perhaps like "argument of %TYPE is not a known variable".
>

+1

Regards

Pavel

>
> regards, tom lane
>
>


Re: speed up a logical replica setup

2024-01-04 Thread Amit Kapila
On Thu, Jan 4, 2024 at 9:18 PM Euler Taveira  wrote:
>
> On Thu, Jan 4, 2024, at 2:41 AM, Amit Kapila wrote:
>
>
> I think asking users to manually remove such slots won't be a good
> idea. We might want to either remove them by default or provide an
> option to the user.
>
>
> Am I correct that in the majority of the use cases these replication slots
> will be useless?
>

I am not so sure about it. Say, if some sync slots are present this
means the user wants this replica to be used later as a publisher.
Now, if the existing primary/publisher node is still alive then we
don't have these slots but if the user wants to switch over to this
new node as well then they may be required.

Is there a possibility that a cascading standby also has a slot on the
current physical replica being converted to a new subscriber?

> If so, let's remove them by default and add an option to control this
> behavior (replication slot removal).
>

The presence of slots on the physical replica indicates that other
nodes/clusters could be dependent on it, so I feel that by default we
should give an error, and if the user passes some option to remove the
slots then it is fine to remove them.

-- 
With Regards,
Amit Kapila.




Re: pg_upgrade test failure

2024-01-04 Thread vignesh C
On Sun, 29 Oct 2023 at 11:14, Hayato Kuroda (Fujitsu)
 wrote:
>
> Dear Andres,
>
> While tracking BF failures related to pg_upgrade, I found that the same
> failure has still been happening [1] - [4].
> According to the log, the output directory remained even after a
> successful upgrade [5].
> I analyzed this and attached a fix patch; below is my analysis... what do
> you think?
>

The same failure occurs randomly at [1] for a newly added test too:
pg_upgrade: warning: could not remove directory
"C:/tools/nmsys64/home/pgrunner/bf/root/HEAD/pgsql.build/testrun/pg_upgrade/004_subscription/data/t_004_subscription_new_sub_data/pgdata/pg_upgrade_output.d/20240104T215133.796/log":
Directory not empty
pg_upgrade: warning: could not remove directory
"C:/tools/nmsys64/home/pgrunner/bf/root/HEAD/pgsql.build/testrun/pg_upgrade/004_subscription/data/t_004_subscription_new_sub_data/pgdata/pg_upgrade_output.d/20240104T215133.796":
Directory not empty
pg_upgrade: warning: could not stat file
"C:/tools/nmsys64/home/pgrunner/bf/root/HEAD/pgsql.build/testrun/pg_upgrade/004_subscription/data/t_004_subscription_new_sub_data/pgdata/pg_upgrade_output.d/20240104T215133.796/log/pg_upgrade_internal.log":
No such file or directory
pg_upgrade: warning: could not remove directory
"C:/tools/nmsys64/home/pgrunner/bf/root/HEAD/pgsql.build/testrun/pg_upgrade/004_subscription/data/t_004_subscription_new_sub_data/pgdata/pg_upgrade_output.d/20240104T215133.796/log":
Directory not empty
pg_upgrade: warning: could not remove directory
"C:/tools/nmsys64/home/pgrunner/bf/root/HEAD/pgsql.build/testrun/pg_upgrade/004_subscription/data/t_004_subscription_new_sub_data/pgdata/pg_upgrade_output.d/20240104T215133.796":
Directory not empty

[1] - 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren=2024-01-04%2019%3A56%3A20

Regards,
Vignesh




Re: Synchronizing slots from primary to standby

2024-01-04 Thread Amit Kapila
On Fri, Jan 5, 2024 at 8:59 AM shveta malik  wrote:
>
> On Thu, Jan 4, 2024 at 7:24 PM Bertrand Drouvot
>  wrote:
> >
> > 4 ===
> >
> > Looking closer, the only place where walrcv_connect() is called with 
> > replication
> > set to false and logical set to false is in ReplSlotSyncWorkerMain().
> >
> > That does make sense, but what do you think about creating dedicated 
> > libpqslotsyncwrkr_connect
> > and slotsyncwrkr_connect (instead of using the libpqrcv_connect / 
> > walrcv_connect ones)?
> >
> > That way we could make use of slotsyncwrkr_connect() in 
> > ReplSlotSyncWorkerMain()
> > as I think it's confusing to use "rcv" functions while the process using 
> > them is
> > not of backend type walreceiver.
> >
> > I'm not sure that worth the extra complexity though, what do you think?
>
> I gave it a thought earlier, but I was not sure, even if I created a new
> function without "rcv" in it, where it should be placed, as the existing
> file name itself is libpq'walreceiver'.c. Shall we be creating a new file
> then? But it does not seem good to create a new setup (new file, function
> pointers and other stuff) around one function. And thus reusing the same
> function with a new 'replication' argument felt like a better choice than
> the other options. If in the future any other module tries to do the same,
> it can use the current walrcv_connect() with replication=false. If I make
> it specific to the slot-sync worker, then it will not be reusable by other
> modules (if needed).
>

I agree that the benefit of creating a new API is not very clear. How
about adjusting the description in the file header of
libpqwalreceiver.c? I think that, apart from walreceiver, it is now also
used by the logical replication workers and, with this patch, by the
slotsync worker as well.

-- 
With Regards,
Amit Kapila.




Re: Random pg_upgrade test failure on drongo

2024-01-04 Thread Amit Kapila
On Thu, Jan 4, 2024 at 5:30 PM Alexander Lakhin  wrote:
>
> 03.01.2024 14:42, Amit Kapila wrote:
> >
>
> >> And the internal process is ... background writer (BgBufferSync()).
> >>
> >> So, I tried just adding bgwriter_lru_maxpages = 0 to postgresql.conf and
> >> got 20 x 10 tests passing.
> >>
> >> Thus, it we want just to get rid of the test failure, maybe it's enough to
> >> add this to the test's config...
> >>
> > What about checkpoints? Can't it do the same while writing the buffers?
>
> As we deal here with pg_upgrade/pg_restore, it must not be very easy to get
> the desired effect, but I think it's not impossible in principle.
> More details below.
> What happens during the pg_upgrade execution is essentially:
> 1) CREATE DATABASE "postgres" WITH TEMPLATE = template0 OID = 5 ...;
> -- this command flushes file buffers as well
> 2) ALTER DATABASE postgres OWNER TO ...
> 3) COMMENT ON DATABASE "postgres" IS ...
> 4) -- For binary upgrade, preserve pg_largeobject and index relfilenodes
>  SELECT 
> pg_catalog.binary_upgrade_set_next_index_relfilenode('2683'::pg_catalog.oid);
>  SELECT 
> pg_catalog.binary_upgrade_set_next_heap_relfilenode('2613'::pg_catalog.oid);
>  TRUNCATE pg_catalog.pg_largeobject;
> --  ^^^ here we can get the error "could not create file "base/5/2683": File 
> exists"
> ...
>
> We get the effect discussed when the background writer process decides to
> flush a file buffer for pg_largeobject during stage 1.
> (Thus, if a checkpoint somehow happened to occur during CREATE DATABASE,
> the result must be the same.)
> And another important factor is shared_buffers = 1MB (set during the test).
> With the default setting of 128MB I couldn't see the failure.
>
> It can be reproduced easily (on old Windows versions) just by running
> pg_upgrade in a loop (I've got failures on iterations 22, 37, 17 (with the
> default cluster)).
> If an old cluster contains dozen of databases, this increases the failure
> probability significantly (with 10 additional databases I've got failures
> on iterations 4, 1, 6).
>

I don't have an old Windows environment to test but I agree with your
analysis and theory. The question is what should we do for these new
random BF failures? I think we should set bgwriter_lru_maxpages to 0
and checkpoint_timeout to 1hr for these new tests. Doing some invasive
fix as part of this doesn't sound reasonable because this is an
existing problem and there seems to be another patch by Thomas that
probably deals with the root cause of the existing problem [1] as
pointed out by you.

[1] - https://commitfest.postgresql.org/40/3951/
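
For concreteness, a rough sketch of how those settings could be applied in
such a TAP test (the node name and placement are hypothetical), assuming
PostgreSQL::Test::Cluster:

use strict;
use warnings;
use PostgreSQL::Test::Cluster;
use PostgreSQL::Test::Utils;
use Test::More;

my $old_node = PostgreSQL::Test::Cluster->new('old_node');
$old_node->init;

# Keep the background writer and timed checkpoints from flushing buffers of
# the old cluster behind pg_upgrade's back while the test runs.
$old_node->append_conf('postgresql.conf',
	"bgwriter_lru_maxpages = 0\n" . "checkpoint_timeout = 1h\n");

$old_node->start;
ok(-d $old_node->data_dir, 'old cluster configured and started');

# ... the rest of the upgrade test would run pg_upgrade against this node ...

$old_node->stop;
done_testing();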

-- 
With Regards,
Amit Kapila.




RE: Documentation to upgrade logical replication cluster

2024-01-04 Thread Hayato Kuroda (Fujitsu)
Dear Vignesh,

Thanks for making the patch! Below are my comments.

1.
Only two steps were given an id, but I think all the steps should have one.
See [1].

2.
I'm not sure it should be listed as step 10. I felt that it should be a new
section. At that time, other steps like "Prepare for {publisher|subscriber}
upgrades" can be moved as well.
Thoughts?

3.
```
+ The prerequisites of publisher upgrade applies to logical Replication
```

Replication -> replication

4.
```
+ 
+  Let's say publisher is in node1 and subscriber is
+  in node2.
+ 
```

I feel it would be friendlier if you added the name of the directory for each instance.

5.
You did not describe the initialization of the new node. Was that intentional?

6.
```
+ 
+  Disable all the subscriptions on node2 that are
+  subscribing the changes from node1 by using
+  ALTER 
SUBSCRIPTION ... DISABLE,
+  for e.g.:
+
+node2=# ALTER SUBSCRIPTION sub1_node1_node2 DISABLE;
+ALTER SUBSCRIPTION
+node2=# ALTER SUBSCRIPTION sub2_node1_node2 DISABLE;
+ALTER SUBSCRIPTION
+
+ 
```

Subscriptions are disabled after stopping the publisher, but that leads to
ERRORs on the publisher. I think it's better to swap these steps.

7.
```
+
+dba@node1:/opt/PostgreSQL/postgres//bin$ pg_ctl -D 
/opt/PostgreSQL/pub_data stop -l logfile
+
```

Hmm. I thought you did not have to show the current directory. You were in the
bin dir, but it is not our requirement, right? 

8.
```
+
+dba@node1:/opt/PostgreSQL/postgres//bin$ pg_upgrade
+--old-datadir "/opt/PostgreSQL/postgres/17/pub_data"
+--new-datadir 
"/opt/PostgreSQL/postgres//pub_upgraded_data"
+--old-bindir "/opt/PostgreSQL/postgres/17/bin"
+--new-bindir "/opt/PostgreSQL/postgres//bin"
+
```

For PG17, both old and new bindir look the same. Can we use 18 as new-bindir?

9.
```
+ 
+  Create any tables that were created in node2
+  between step-2 and now, for e.g.:
+
+node2=# CREATE TABLE distributors (
+node2(# did  integer CONSTRAINT no_null NOT NULL,
+node2(# name varchar(40) NOT NULL
+node2(# );
+CREATE TABLE
+
+ 
```

I think these SQL statements must be run on node1, because it has not been
running between step 2 and step 7.

10.
```
+
+ 
+  Enable all the subscriptions on node2 that are
+  subscribing the changes from node1 by using
+  ALTER 
SUBSCRIPTION ... ENABLE,
+  for e.g.:
+
+node2=# ALTER SUBSCRIPTION sub1_node1_node2 ENABLE;
+ALTER SUBSCRIPTION
+node2=# ALTER SUBSCRIPTION sub2_node1_node2 ENABLE;
+ALTER SUBSCRIPTION
+
+ 
+
+
+
+ 
+  Refresh the publications  using
+  ALTER 
SUBSCRIPTION ... REFRESH PUBLICATION,
+  for e.g.:
+
+node2=# ALTER SUBSCRIPTION sub1_node1_node2 REFRESH PUBLICATION;
+ALTER SUBSCRIPTION
+node2=# ALTER SUBSCRIPTION sub2_node1_node2 REFRESH PUBLICATION;
+ALTER SUBSCRIPTION
+
+ 
+
```

I was very confused about where they would really be done. If my above
comment is correct, should they be executed on node1 as well? Could you
please check all the notation again?

11.
```
+ 
+  Disable all the subscriptions on node1 that are
+  subscribing the changes from node2 by using
+  ALTER 
SUBSCRIPTION ... DISABLE,
+  for e.g.:
+
+node2=# ALTER SUBSCRIPTION sub1_node2_node1 DISABLE;
+ALTER SUBSCRIPTION
+node2=# ALTER SUBSCRIPTION sub2_node2_node1 DISABLE;
+ALTER SUBSCRIPTION
+
+ 
```

They should be run on node1, but are noted as node2.

12.
```
+ 
+  Enable all the subscriptions on node1 that are
+  subscribing the changes from node2 by using
+  ALTER 
SUBSCRIPTION ... ENABLE,
+  for e.g.:
+
+node2=# ALTER SUBSCRIPTION sub1_node2_node1 ENABLE;
+ALTER SUBSCRIPTION
+node2=# ALTER SUBSCRIPTION sub2_node2_node1 ENABLE;
+ALTER SUBSCRIPTION
+
+ 
```

You said that "enable all the subscription on node1", but SQLs are done on 
node2.

[1]: 
https://www.postgresql.org/message-id/tyapr01mb58667ae04d291924671e2051f5...@tyapr01mb5866.jpnprd01.prod.outlook.com


Best Regards,
Hayato Kuroda
FUJITSU LIMITED



Re: Synchronizing slots from primary to standby

2024-01-04 Thread shveta malik
On Thu, Jan 4, 2024 at 7:24 PM Bertrand Drouvot
 wrote:
>
> Hi,
>
> On Thu, Jan 04, 2024 at 10:27:31AM +0530, shveta malik wrote:
> > On Thu, Jan 4, 2024 at 9:18 AM shveta malik  wrote:
> > >
> > > On Wed, Jan 3, 2024 at 6:33 PM Zhijie Hou (Fujitsu)
> > >  wrote:
> > > >
> > > > On Tuesday, January 2, 2024 6:32 PM shveta malik 
> > > >  wrote:
> > > > > On Fri, Dec 29, 2023 at 10:25 AM Amit Kapila 
> > > > >
> > > > > The topup patch has also changed app_name to
> > > > > {cluster_name}_slotsyncworker so that we do not confuse between 
> > > > > walreceiver
> > > > > and slotsyncworker entry.
> > > > >
> > > > > Please note that there is no change in rest of the patches, changes 
> > > > > are in
> > > > > additional 0004 patch alone.
> > > >
> > > > Attach the V56 patch set which supports ALTER SUBSCRIPTION SET 
> > > > (failover).
> > > > This is useful when user want to refresh the publication tables, they 
> > > > can now alter the
> > > > failover option to false and then execute the refresh command.
> > > >
> > > > Best Regards,
> > > > Hou zj
> > >
> > > The patches no longer apply to HEAD due to a recent commit 007693f. I
> > > am working on rebasing and will post the new patches soon
> > >
> > > thanks
> > > Shveta
> >
> > Commit 007693f has changed 'conflicting' to 'conflict_reason', so
> > adjusted the code around that in the slotsync worker.
> >
> > Also removed function 'pg_get_slot_invalidation_cause' as now
> > conflict_reason tells the same.
> >
> > PFA rebased patches with above changes.
> >
>
> Thanks!
>
> Looking at 0004:
>
> 1 
>
> -libpqrcv_connect(const char *conninfo, bool logical, bool must_use_password,
> -                 const char *appname, char **err)
> +libpqrcv_connect(const char *conninfo, bool replication, bool logical,
> +                 bool must_use_password, const char *appname, char **err)
>
> What about adjusting the preceding comment a bit to describe what the new 
> replication
> parameter is for?
>
> 2 
>
> +   /* We can not have logical w/o replication */
>
> what about replacing w/o by without?
>
> 3 ===
>
> +   if(!replication)
> +   Assert(!logical);
> +
> +   if (replication)
> {
>
> what about using "if () else" instead (to avoid unnecessary test)?
>
>
> Having said that the patch seems a reasonable way to implement non-replication
> connection in slotsync worker.
>
> 4 ===
>
> Looking closer, the only place where walrcv_connect() is called with 
> replication
> set to false and logical set to false is in ReplSlotSyncWorkerMain().
>
> That does make sense, but what do you think about creating dedicated 
> libpqslotsyncwrkr_connect
> and slotsyncwrkr_connect (instead of using the libpqrcv_connect / 
> walrcv_connect ones)?
>
> That way we could make use of slotsyncwrkr_connect() in 
> ReplSlotSyncWorkerMain()
> as I think it's confusing to use "rcv" functions while the process using them 
> is
> not of backend type walreceiver.
>
> I'm not sure that worth the extra complexity though, what do you think?

I gave it a thought earlier, but I was not sure, even if I created a new
function without "rcv" in it, where it should be placed, as the existing
file name itself is libpq'walreceiver'.c. Shall we be creating a new file
then? But it does not seem good to create a new setup (new file, function
pointers and other stuff) around one function. And thus reusing the same
function with a new 'replication' argument felt like a better choice than
the other options. If in the future any other module tries to do the same,
it can use the current walrcv_connect() with replication=false. If I make
it specific to the slot-sync worker, then it will not be reusable by other
modules (if needed).


thanks
Shveta




Re: the s_lock_stuck on perform_spin_delay

2024-01-04 Thread Andy Fan


Hi,

Andres Freund  writes:

>
> On 2024-01-04 14:59:06 +0800, Andy Fan wrote:
>> My question is, if someone doesn't obey the rule by mistake (everyone
>> can make mistakes), should we PANIC in a production environment? IMO it
>> can be a WARNING in a production environment and be stuck under
>> 'ifdef USE_ASSERT_CHECKING'.
>>
>> [...]
>>
>> I notice this issue actually because of the patch "Cache relation
>> sizes?" from Thomas Munro [1], where the latest patch[2] still have the
>> following code.
>> +sr = smgr_alloc_sr();  <-- HERE a spinlock is held
>> +
>> +/* Upgrade to exclusive lock so we can create a mapping. */
>> +LWLockAcquire(mapping_lock, LW_EXCLUSIVE); <-- HERE a complex
>>   operation is needed; it may take a long time.
>>
>> Our internal testing system found more misuses on our own PG version.
>
>> I think an experienced engineer like Thomas can make this mistake, and even
>> though the patch was reviewed by 3 people, the bug is still there. It is not
>> easy to just say "don't do it".
>
> I don't follow this argument - just ignoring the problem,

I agree with you, but I feel you ignored my post at [1], where I
said that the known issue should be fixed at the first chance.

> which emitting a
> WARNING basically is, doesn't reduce the impact of the bug, it *increases* the
> impact, because now the system will not recover from the bug without explicit
> operator intervention.  During that time the system might be essentially
> unresponsive, because all backends end up contending for some spinlock, which
> makes investigating such issues very hard.

Actually they are doing pg_usleep most of the time.

Besides what Robert said, one more reason to question PANIC is that PANIC
can't always make the system recover faster, because: a) on most systems,
PANIC produces a core dump, which takes time and space; b) after the
restart, all the caches like relcache, plancache and fdcache need to be
rebuilt; c) customers need to handle the failure better, or else they will
be hurt *more often*. All of these cause slowness as well.

>
> I think we should add infrastructure to detect bugs like this during
> development,

Commit 2 in [1] does something like this. As for the details, I missed the
check for the memory allocation case as you suggested at [2], but checked
heavyweight locks as well; the others should be the same, IIUC.

> but not PANICing when this happens in production seems completely
> non-viable.
>

Not sure what *this* exactly means. If it means the bug in Thomas's
patch, I absolutely agree with you (since it is a known bug and it
should be fixed). If it means the general *unknown* case, it's what
we talked about above.

I also agree that an LLVM static checker would ideally be pretty good;
it just requires more knowledge and future maintenance
effort. I am willing to give it a try shortly.

[1] https://www.postgresql.org/message-id/871qaxp3ly.fsf%40163.com
[2]
https://www.postgresql.org/message-id/20240104225403.dgmbbfffmm3srpgq%40awork3.anarazel.de

-- 
Best Regards
Andy Fan





Re: the s_lock_stuck on perform_spin_delay

2024-01-04 Thread Andres Freund
Hi,

On 2024-01-04 17:03:18 -0600, Jim Nasby wrote:
> On 1/4/24 10:33 AM, Tom Lane wrote:
> 
> Robert Haas  writes:
> 
> On Thu, Jan 4, 2024 at 10:22 AM Tom Lane  wrote:
> 
> We should be making an effort to ban coding patterns like
> "return with spinlock still held", because they're just too prone
> to errors similar to this one.
> 
> I agree that we don't want to add overhead, and also about how
> spinlocks should be used, but I dispute "easily detectable
> statically." I mean, if you or I look at some code that uses a
> spinlock, we'll know whether the pattern that you mention is being
> followed or not, modulo differences of opinion in debatable cases. But
> you and I cannot be there to look at all the code all the time. If we
> had a static checking tool that was run as part of every build, or in
> the buildfarm, or by cfbot, or somewhere else that raised the alarm if
> this rule was violated, then we could claim to be effectively
> enforcing this rule.
> 
> I was indeed suggesting that maybe we could find a way to detect
> such things automatically.  While I've not been paying close
> attention, I recall there's been some discussions of using LLVM/clang
> infrastructure for customized static analysis, so maybe it'd be
> possible without an undue amount of effort.

I played around with this a while back. One reference with a link to a
playground to experiment with attributes:
https://www.postgresql.org/message-id/20200616233105.sm5bvodo6unigno7%40alap3.anarazel.de

Unfortunately clang's thread safety analysis doesn't handle conditionally
acquired locks, which made it far less promising than I initially thought.

I think there might be some other approaches, but they will all suffer from
not understanding "danger" encountered indirectly, via function calls doing
dangerous things. Which we would like to exclude, but I don't think that's
trivial either.


> FWIW, the lackey[1] tool in Valgrind is able to do some kinds of instruction
> counting, so it might be possible to measure how many instructions are
> actually being executed while holding a spinlock. Might be easier than code
> analysis.

I don't think that's particularly promising. Lackey is *slow*. And it requires
actually reaching problematic states. Consider e.g. the case that was reported
upthread, an lwlock acquired within a spinlock protected section - most of the
time that's not going to result in a lot of cycles, because the lwlock is
free.

Greetings,

Andres Freund




Re: add AVX2 support to simd.h

2024-01-04 Thread John Naylor
On Wed, Jan 3, 2024 at 10:29 PM Nathan Bossart  wrote:
> If the requirement is that normal builds use AVX2, then I fear we will be
> waiting a long time.  IIUC the current proposals (building multiple
> binaries or adding a configuration option that maps to compiler flags)
> would still be opt-in,

If and when we get one of those, I would consider that a "normal"
build. Since there are no concrete proposals yet, I'm still waiting
for you to justify imposing an immediate maintenance cost for zero
benefit.




Re: Improve WALRead() to suck data directly from WAL buffers when possible

2024-01-04 Thread Jeff Davis
On Wed, 2023-12-20 at 15:36 +0530, Bharath Rupireddy wrote:
> Thanks. Attaching remaining patches as v18 patch-set after commits
> c3a8e2a7cb16 and 766571be1659.

Comments:

I still think the right thing for this patch is to call
XLogReadFromBuffers() directly from the callers who need it, and not
change WALRead(). I am open to changing this later, but for now that
makes sense to me so that we can clearly identify which callers benefit
and why. I have brought this up a few times before[1][2], so there must
be some reason that I don't understand -- can you explain it?

The XLogReadFromBuffersResult is never used. I can see how it might be
useful for testing or asserts, but it's not used even in the test
module. I don't think we should clutter the API with that kind of thing
-- let's just return the nread.

I also do not like the terminology "partial hit" to be used in this
way. Perhaps "short read" or something about hitting the end of
readable WAL would be better?

I like how the callers of WALRead() are being more precise about the
bytes they are requesting.

You've added several spinlock acquisitions to the loop. Two explicitly,
and one implicitly in WaitXLogInsertionsToFinish(). These may allow you
to read slightly further, but introduce performance risk. Was this
discussed?

The callers are not checking for XLREADBUGS_UNINITIALIZED_WAL, so it
seems like there's a risk of getting partially-written data? And it's
not clear to me the check of the wal page headers is the right one
anyway.

It seems like all of this would be simpler if you checked first how far
you can safely read data, and then just loop and read that far. I'm not
sure that it's worth it to try to mix the validity checks with the
reading of the data.

Regards,
Jeff Davis

[1] 
https://www.postgresql.org/message-id/4132fe48f831ed6f73a9eb191af5fe475384969c.camel%40j-davis.com
[2]
https://www.postgresql.org/message-id/2ef04861c0f77e7ae78b703770cc2bbbac3d85e6.ca...@j-davis.com




Re: Build versionless .so for Android

2024-01-04 Thread Matthias Kuhn
Hi,

Attached a patch with a (hopefully) better wording of the comment.

I have unsuccessfully tried to find an official source for this policy.
So for reference some discussions about the topic:

-
https://stackoverflow.com/questions/11491065/linking-with-versioned-shared-library-in-android-ndk
-
https://stackoverflow.com/questions/18681401/how-can-i-remove-all-versioning-information-from-shared-object-files

Please let me know if I can help in any way

Matthias

On Tue, Jan 2, 2024 at 6:43 PM Robert Haas  wrote:

> On Sun, Dec 31, 2023 at 1:24 AM Michael Paquier 
> wrote:
> > FWIW, I have mixed feelings about patching the code to treat
> > android-linux as an exception while changing nothing in the
> > documentation to explain why this is required.  A custom patch may
> > serve you better here, and note that you did not even touch the meson
> > paths.  See libpq_c_args in src/interfaces/libpq/meson.build as one
> > example.  That's just my opinion, of course, still there are no
> > buildfarm members that would cover your patch.
>
> It's reasonable to want good comments -- the one in the patch (1)
> doesn't explain why this is required and (2) suggests that it is only
> needed when cross-compiling which seems surprising and (3) has a typo.
> But if it's true that this is needed, I tentatively think we might do
> better to take the patch than force people to carry it out of tree.
>
> --
> Robert Haas
> EDB: http://www.enterprisedb.com
>
diff --git a/doc/src/sgml/installation.sgml b/doc/src/sgml/installation.sgml
index 75dc81a..9c88558 100644
--- a/doc/src/sgml/installation.sgml
+++ b/doc/src/sgml/installation.sgml
@@ -1989,6 +1989,14 @@ build-postgresql:
   among developers is to use PROFILE for one-time flag
   adjustments, while COPT might be kept set all the time.
  
+
+ 
+  When building libpq as shared libraries, the resulting files are suffixed
+  with the version libpq.5.[version] and an unversioned
+  symlink libpq.so to the versioned file is created. An
+  exception to that is Android where library file names must end with
+  .so. Building for Android is only supported using make.
+ 
 
   
  
diff --git a/src/Makefile.shlib b/src/Makefile.shlib
index 92d893e..4801b7a 100644
--- a/src/Makefile.shlib
+++ b/src/Makefile.shlib
@@ -187,7 +187,13 @@ endif
 ifeq ($(PORTNAME), linux)
   LINK.shared		= $(COMPILER) -shared
   ifdef soname
-LINK.shared		+= -Wl,-soname,$(soname)
+ifeq (linux-android,$(findstring linux-android,$(host_os)))
+# Android does not support versioned soname
+shlib		= lib$(NAME)$(DLSUFFIX)
+LINK.shared		+= -Wl,-soname,lib$(NAME)$(DLSUFFIX)
+else
+LINK.shared		+= -Wl,-soname,$(soname)
+endif
   endif
   BUILD.exports		= ( echo '{ global:'; $(AWK) '/^[^\#]/ {printf "%s;\n",$$1}' $<; echo ' local: *; };' ) >$@
   exports_file		= $(SHLIB_EXPORTS:%.txt=%.list)


Re: Add a perl function in Cluster.pm to generate WAL

2024-01-04 Thread Michael Paquier
On Thu, Jan 04, 2024 at 04:00:01PM +0300, Alexander Lakhin wrote:
> Reproduced here.

Did you just make the run slow enough to show the failure with
valgrind?

> As I can see in the failure logs you referenced, the first problem is:
> #   Failed test 'inactiveslot slot invalidation is logged with vacuum on 
> pg_authid'
> #   at t/035_standby_logical_decoding.pl line 224.
> It reminded me of:
> https://www.postgresql.org/message-id/b2a1f7d0-7629-72c0-2da7-e9c4e336b010%40gmail.com
> 
> It seems that it's not something new, and maybe that my analysis is still
> valid. If so, VACUUM FREEZE/autovacuum = off should fix the issue.

Not sure about that yet.
--
Michael




Re: Improve rowcount estimate for UNNEST(column)

2024-01-04 Thread Tom Lane
Paul Jungwirth  writes:
> Here is a patch with an improved test. With the old "10" estimate we get a 
> Merge Join, but now that 
> the planner can see there are only ~4 elements per array, we get a Nested 
> Loop.

Pushed with minor editorialization.  I ended up not using the test
case, because I was afraid it wouldn't be all that stable, and
code coverage shows that we are exercising the added code path
even without a bespoke test case.

> On 11/29/23 20:35, jian he wrote:
>>> I did a minor change. change estimate_array_length return type to double

> I'm not sure I want to change estimate_array_length from returning
> ints to returning doubles, since it's called in many places.

But your patch forces every one of those places to be touched anyway,
as a consequence of adding the "root" argument.  I looked at the
callers and saw that every single one of them (in core anyway) ends up
using the result in a "double" rowcount calculation, so we're really
not buying anything by converting to integer and back again.  There's
also a question of whether the number from DECHIST could be big enough
to overflow an int.  Perhaps not given the current calculation method,
but since it's a float4 there's at least a theoretical risk.  Hence,
I adopted Jian's suggestion.

One other point is that examine_variable is capable of succeeding
on things that aren't Vars, so I thought the restriction to Vars
was inappropriate.

regards, tom lane




Re: Adding facility for injection points (or probe points?) for more advanced tests

2024-01-04 Thread Michael Paquier
On Thu, Jan 04, 2024 at 04:31:02PM -0600, Nathan Bossart wrote:
> On Thu, Jan 04, 2024 at 06:04:20PM +0530, Ashutosh Bapat wrote:
>> 0003 and 0004 are using the extension in this module for some serious
>> testing. The name of the extension test_injection_point indicates that
>> it's for testing injection points and not for some serious use of
>> injection callbacks it adds. Changes 0003 and 0004 suggest otherwise.
> 
> Yeah, I think test_injection_point should be reserved for testing the
> injection point machinery.

Sure.  FWIW, it makes sense to me to keep the SQL interface and the
callbacks in the module, per the reasons below.

>> I suggest we move test_injection_points from src/test/modules to
>> contrib/ and rename it as "injection_points". The test files may still
>> be named as test_injection_point. The TAP tests in 0003 and 0004 once
>> moved to their appropriate places, will load injection_point extension
>> and use it. That way predefined injection point callbacks will also be
>> available for others to use.
> 
> Rather than defining a module somewhere that tests would need to load,
> should we just put the common callbacks in the core server?  Unless there's
> a strong reason to define them elsewhere, that could be a nice way to save
> a step in the tests.

Nah, having some pre-existing callbacks existing in the backend is
against the original minimalistic design spirit.  These would also
require an SQL interface, and the interface design also depends on the
functions registering them when pushing down custom conditions.
Pushing that down to extensions to do what they want will lead to less
noise, particularly if you consider that we will most likely want to
tweak the callback interfaces for backpatched bugs.  That's also why I
think contrib/ is not a good idea, src/test/modules/ serving the
actual testing purpose here.
--
Michael




Re: Confine vacuum skip logic to lazy_scan_skip

2024-01-04 Thread Jim Nasby


  
  
On 1/4/24 2:23 PM, Andres Freund wrote:
> On 2024-01-02 12:36:18 -0500, Melanie Plageman wrote:
>> Subject: [PATCH v2 1/6] lazy_scan_skip remove unnecessary local var rel_pages
>> Subject: [PATCH v2 2/6] lazy_scan_skip remove unneeded local var
>>  nskippable_blocks
>
> I think these may lead to worse code - the compiler has to reload
> vacrel->rel_pages/next_unskippable_block for every loop iteration, because it
> can't guarantee that they're not changed within one of the external functions
> called in the loop body.

Admittedly I'm not up to speed on recent vacuum changes, but I have to
wonder if the concept of skipping should go away in the context of vector
IO? Instead of thinking about "we can skip this range of blocks", why not
maintain a list of "here's the next X number of blocks that we need to
vacuum"?

-- 
Jim Nasby, Data Architect, Austin TX
  





Re: the s_lock_stuck on perform_spin_delay

2024-01-04 Thread Andres Freund
Hi,

On 2024-01-04 14:59:06 +0800, Andy Fan wrote:
> My question is if someone doesn't obey the rule by mistake (everyone
> can make mistake), shall we PANIC on a production environment? IMO I
> think it can be a WARNING on a production environment and be a stuck
> when 'ifdef USE_ASSERT_CHECKING'.
>
> [...]
>
> I notice this issue actually because of the patch "Cache relation
> sizes?" from Thomas Munro [1], where the latest patch[2] still have the
> following code.
> + sr = smgr_alloc_sr();  <-- HERE a spin lock is hold
> +
> + /* Upgrade to exclusive lock so we can create a mapping. */
> + LWLockAcquire(mapping_lock, LW_EXCLUSIVE); <-- HERE a complex
>   operation is needed. it may take a long time.
>
> Our internal testing system found more misuses on our own PG version.

> I think a experienced engineer like Thomas can make this mistake and the
> patch was reviewed by 3 peoples, the bug is still there. It is not easy
> to say just don't do it.

I don't follow this argument - just ignoring the problem, which emitting a
WARNING basically is, doesn't reduce the impact of the bug, it *increases* the
impact, because now the system will not recover from the bug without explicit
operator intervention.  During that time the system might be essentially
unresponsive, because all backends end up contending for some spinlock, which
makes investigating such issues very hard.


I think we should add infrastructure to detect bugs like this during
development, but not PANICing when this happens in production seems completely
non-viable.

Greetings,

Andres Freund




Re: Emit fewer vacuum records by reaping removable tuples during pruning

2024-01-04 Thread Melanie Plageman
On Thu, Jan 4, 2024 at 12:31 PM Robert Haas  wrote:
>
> On Wed, Dec 27, 2023 at 11:27 AM Melanie Plageman
>  wrote:
> > Do you have specific concerns about its correctness? I understand it
> > is an area where we have to be sure we are correct. But, to be fair,
> > that is true of all the pruning and vacuuming code.
>
> I'm kind of concerned that 0002 might be a performance regression. It
> pushes more branches down into the heap-pruning code, which I think
> can sometimes be quite hot, for the sake of a case that rarely occurs
> in practice. I take your point about it improving things when there
> are no indexes, but what about when there are? And even if there are
> no adverse performance consequences, is it really worth complicating
> the logic at such a low level?

Regarding the additional code complexity, I think the extra call to
lazy_vacuum_heap_page() in lazy_scan_heap() actually represents a fair
amount of code complexity. It is a special case of page-level
processing that should be handled by heap_page_prune() and not
lazy_scan_heap().

lazy_scan_heap() is responsible for three main things -- loop through
the blocks in a relation and process each page (pruning, freezing,
etc), invoke index vacuuming, invoke functions to loop through
dead_items and vacuum pages. The logic to do the per-page processing
is spread across three places, though.

When a single page is being processed, page pruning happens in
heap_page_prune(). Freezing, dead items recording, and visibility
checks happen in lazy_scan_prune(). Visibility map updates and
freespace map updates happen back in lazy_scan_heap(). Except, if the
table has no indexes, in which case, lazy_scan_heap() also invokes
lazy_vacuum_heap_page() to set dead line pointers unused and do
another separate visibility check and VM update. I maintain that all
page-level processing should be done in the page-level processing
functions (like lazy_scan_prune()). And lazy_scan_heap() shouldn't be
directly responsible for special case page-level processing.

> Also, I find "pronto_reap" to be a poor choice of name. "pronto" is an
> informal word that seems to have no advantage over something like
> "immediate" or "now," and I don't think "reap" has a precise,
> universally-understood meaning. You could call this "mark_unused_now"
> or "immediately_mark_unused" or something and it would be far more
> self-documenting, IMHO.

Yes, I see how pronto is unnecessarily informal. If there are no cases
other than when the table has no indexes that we would consider
immediately marking LPs unused, then perhaps it is better to call it
"no_indexes" (per andres' suggestion)?

- Melanie




Re: the s_lock_stuck on perform_spin_delay

2024-01-04 Thread Jim Nasby


  
  
On 1/4/24 10:33 AM, Tom Lane wrote:
> Robert Haas  writes:
>> On Thu, Jan 4, 2024 at 10:22 AM Tom Lane  wrote:
>>> We should be making an effort to ban coding patterns like
>>> "return with spinlock still held", because they're just too prone
>>> to errors similar to this one.
>> I agree that we don't want to add overhead, and also about how
>> spinlocks should be used, but I dispute "easily detectable
>> statically." I mean, if you or I look at some code that uses a
>> spinlock, we'll know whether the pattern that you mention is being
>> followed or not, modulo differences of opinion in debatable cases. But
>> you and I cannot be there to look at all the code all the time. If we
>> had a static checking tool that was run as part of every build, or in
>> the buildfarm, or by cfbot, or somewhere else that raised the alarm if
>> this rule was violated, then we could claim to be effectively
>> enforcing this rule.
> I was indeed suggesting that maybe we could find a way to detect
> such things automatically.  While I've not been paying close
> attention, I recall there's been some discussions of using LLVM/clang
> infrastructure for customized static analysis, so maybe it'd be
> possible without an undue amount of effort.

FWIW, the lackey[1] tool in Valgrind is able to do some kinds of
instruction counting, so it might be possible to measure how many
instructions are actually being executed while holding a spinlock.
Might be easier than code analysis.

Another possibility might be using the CPU's timestamp counter.

1: https://valgrind.org/docs/manual/lk-manual.html

-- 
Jim Nasby, Data Architect, Austin TX
  





Re: the s_lock_stuck on perform_spin_delay

2024-01-04 Thread Andres Freund
Hi,

On 2024-01-04 12:04:07 -0500, Robert Haas wrote:
> On Thu, Jan 4, 2024 at 11:33 AM Tom Lane  wrote:
> > I believe it's (a).  No matter what the reason for a stuck spinlock
> > is, the only reliable method of getting out of the problem is to
> > blow things up and start over.  The patch proposed at the top of this
> > thread would leave the system unable to recover on its own, with the
> > only recourse being for the DBA to manually force a crash/restart ...
> > once she figured out that that was necessary, which might take a long
> > while if the only external evidence is an occasional WARNING that
> > might not even be making it to the postmaster log.  How's that better?
>
> It's a fair question. I think you're correct if we assume that
> everyone's following the coding rule ... at least assuming that the
> target system isn't too insanely slow, and I've seen some pretty
> crazily overloaded machines. But if the coding rule is not being
> followed, then "the only reliable method of getting out of the problem
> is to blow things up and start over" becomes a dubious conclusion.

If the coding rule isn't being followed, a crash restart is the least of ones
problems...  But that doesn't mean we shouldn't add infrastructure to make it
easier to detect violations of the spinlock rules - we've had lots of buglets
around this over the years ourselves, so we hardly can expect extension
authors to get this right. Particularly because we don't even document the
rules well, afair.

I think we should add cassert-only infrastructure tracking whether we
currently hold spinlocks, are in a signal handler and perhaps a few other
states. That'd allow us to add assertions like:

- no memory allocations when holding spinlocks or in signal handlers
- no lwlocks while holding spinlocks
- no lwlocks or spinlocks while in signal handlers
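
To make that concrete, here is a minimal sketch of what such cassert-only
tracking could look like, written with plain C assert() for illustration.
None of these names exist in PostgreSQL today; the real thing would hook
into s_lock.h, mcxt.c and lwlock.c rather than stand alone:

#include <assert.h>
#include <stdbool.h>

static int  nspinlocks_held = 0;
static bool in_signal_handler = false;

static inline void
track_spinlock_acquire(void)
{
    nspinlocks_held++;
}

static inline void
track_spinlock_release(void)
{
    assert(nspinlocks_held > 0);
    nspinlocks_held--;
}

/* would be called from allocation paths: no palloc() under a spinlock */
static inline void
assert_memory_allocation_allowed(void)
{
    assert(nspinlocks_held == 0 && !in_signal_handler);
}

/* would be called from LWLockAcquire(): no LWLocks under a spinlock */
static inline void
assert_lwlock_acquire_allowed(void)
{
    assert(nspinlocks_held == 0 && !in_signal_handler);
}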



> Also, I wonder if many or even all uses of spinlocks uses ought to be
> replaced with either LWLocks or atomics. An LWLock might be slightly
> slower when contention is low, but it scales better when contention is
> high, displays a wait event so that you can see that you have
> contention if you do, and doesn't PANIC the system if the contention
> gets too bad. And an atomic will just be faster, in the cases where
> it's adequate.

I tried to replace all - unfortunately the results were not great. The problem
isn't primarily the lack of spinning (although it might be worth adding that
to lwlocks) or the cost of error recovery, the problem is that a reader-writer
lock are inherently more expensive than simpler locks that don't have multiple
levels.

One example of such increased overhead is that on x86 an lwlock unlock has to
be an atomic operation (to maintain the lock count), whereas a spinlock
unlock can just be a write + compiler barrier. Unfortunately the added atomic
operation turns out to matter in some performance critical cases like the
insertpos_lck.

I think we ought to split lwlocks into reader/writer and simpler mutex. The
simpler mutex still will be slower than s_lock in some relevant cases,
e.g. due to the error recovery handling, but it'd be "local" overhead, rather
than something affecting scalability.



FWIW, these days spinlocks do report a wait event when in perform_spin_delay()
- albeit without detail which lock is being held.

Greetings,

Andres Freund




Re: UUID v7

2024-01-04 Thread Jelte Fennema-Nio
First of all, I'm a huge fan of UUID v7. So I'm very excited that this
is progressing. I'm definitely going to look closer at this patch
soon. Some tiny initial feedback:

(bikeshed) I'd prefer renaming `get_uuid_v7_time` to the shorter
`uuid_v7_time`, the `get_` prefix seems rarely used in Postgres
functions (e.g. `date_part` is not called `get_date_part`). Also it's
visually very similar to the gen_ prefix.

On Thu, 4 Jan 2024 at 19:20, Andrey M. Borodin  wrote:
> > On 3 Jan 2024, at 04:37, Przemysław Sztoch  wrote:
> > 1. Is it possible to add a function that returns the version of the 
> > generated uuid?
> > It will be very useful.
> > I don't know if it's possible, but I think there are bits in the UUID that 
> > inform about the version.
> What do you think if we have functions get_uuid_v7_ver(uuid) and 
> get_uuid_v7_var(uuid) to extract bit fields according to [0] ? Or, perhaps, 
> this should be one function with two return parameters?
> It's not in a patch yet, I'm just considering how this functionality should 
> look like.

I do agree that those functions would be useful, especially now that
we're introducing a function that errors when it's passed a UUID
that's not of version 7. With the version extraction function you
could return something else for other uuids if you have many and not
all of them are version 7.

I do think though that these functions should not have v7 in their
name, since they would apply to all uuids of all versions (so if also
removing the get_ prefix they would be called uuid_ver and uuid_var)
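
For reference, extracting those fields only needs the RFC 4122 bit layout;
a minimal C sketch over the raw 16 bytes (the function names are
illustrative, not the proposed SQL-level API):

#include <stdint.h>

static inline int
uuid_version_sketch(const uint8_t uuid[16])
{
    return (uuid[6] >> 4) & 0x0F;   /* 7 for UUIDv7 */
}

static inline int
uuid_variant_sketch(const uint8_t uuid[16])
{
    return (uuid[8] >> 6) & 0x03;   /* 0b10 for RFC 4122 UUIDs */
}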

> > 4. Sometimes you will need to generate a uuid for historical time. There 
> > should be an additional function gen_uuid_v7(timestamp).
> Done, please see patch attached. But I changed signature to 
> gen_uuid_v7(int8), to avoid messing with bytes from user who knows what they 
> want. Or do you think gen_uuid_v7(timestamp) would be more convenient?

I think timestamp would be quite useful. timestamp would encode the
time in the same way as gen_uuid_v7() would, but based on the given
time instead of the current time.
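
To show what "encode the time in the same way" amounts to, here is a
standalone sketch of filling in a UUIDv7 from a caller-supplied Unix
timestamp in milliseconds. It is illustrative only (the random bits a real
generator would add are left zeroed) and is not the patch's gen_uuid_v7():

#include <stdint.h>
#include <string.h>

static void
uuidv7_from_unix_ms(uint64_t unix_ms, uint8_t uuid[16])
{
    memset(uuid, 0, 16);

    /* first 48 bits: big-endian Unix epoch milliseconds */
    uuid[0] = (uint8_t) (unix_ms >> 40);
    uuid[1] = (uint8_t) (unix_ms >> 32);
    uuid[2] = (uint8_t) (unix_ms >> 24);
    uuid[3] = (uint8_t) (unix_ms >> 16);
    uuid[4] = (uint8_t) (unix_ms >> 8);
    uuid[5] = (uint8_t) unix_ms;

    uuid[6] = (uuid[6] & 0x0F) | 0x70;  /* version 7 */
    uuid[8] = (uuid[8] & 0x3F) | 0x80;  /* RFC 4122 variant */
}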




Re: Hide exposed impl detail of wchar.c

2024-01-04 Thread Nathan Bossart
On Mon, Nov 20, 2023 at 10:39:43PM -0600, Nathan Bossart wrote:
> Alright.  The next minor release isn't until February, so I'll let this one
> sit a little while longer in case anyone objects to back-patching something
> like this [0].
> 
> [0] https://postgr.es/m/attachment/152305/move_is_valid_ascii_v2.patch

Barring objections, I plan to commit this and back-patch it to v16 in the
next few days.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: doing also VM cache snapshot and restore with pg_prewarm, having more information of the VM inside PostgreSQL

2024-01-04 Thread Jim Nasby


  
  
On 1/3/24 5:57 PM, Cedric Villemain wrote:
> for 15 years pgfincore has been sitting quietly and being used in large
> setups to help in HA and resources management.
> It can perfectly stay as is, to be honest I was expecting to one day
> include a windows support and propose that to PostgreSQL, it appears
> getting support on linux and BSD is more than enough today.
> So I wonder if there are interest for having virtual memory snapshot and
> restore operations with, for example, pg_prewarm/autowarm ?

IMHO, to improve the user experience here we'd need something that
combined the abilities of all these extensions into a cohesive interface
that allowed users to simply say "please get this data into cache".
Simply moving pgfincore into core Postgres wouldn't satisfy that need.

So I think the real question is whether the community feels support for
better cache (buffercache + filesystem) management is a worthwhile
feature to add to Postgres.

Micromanaging cache contents for periodic jobs seems almost like a
mis-feature. While it's a useful tool to have in the toolbox, it's also a
non-trivial layer of complexity. IMHO not something we'd want to add.
Though, there might be smaller items that would make creating tools to do
that easier, such as some ability to see what blocks a backend is
accessing (perhaps via a hook).

On the surface, improving RTO via cache warming sounds interesting ...
but I'm not sure how useful it would be in reality. Users that care about
RTO would almost always have some form of hot-standby, and generally
those will already have a lot of data in cache. While they won't have
read-only data in cache, I have to wonder if the answer to that problem
is allowing writers to tell a replica what blocks are being read, so the
replica can keep them in cache. Also, most (all?) operations that require
a restart could be handled via a failover, so I'm not sure how much cache
management moves the needle there.

> Some usecases covered: snapshot/restore cache around cronjobs, around
> dumps, switchover, failover, on stop/start of postgres (think kernel
> upgrade with a cold restart), ...
>
> pgfincore also provides some nice information with mincore (on FreeBSD
> mincore is more interesting) or cachestat, again it can remain as an out
> of tree extension but I will be happy to add to commitfest if there are
> interest from the community.
>
> An example of cachestat output:
>
> postgres=# select * from vm_relation_cachestat('foo', range:=1024*32);
>  block_start | block_count | nr_pages | nr_cache | nr_dirty | nr_writeback | nr_evicted | nr_recently_evicted
> -------------+-------------+----------+----------+----------+--------------+------------+---------------------
>            0 |       32768 |    65536 |    62294 |        0 |            0 |       3242 |                3242
>        32768 |       32768 |    65536 |    39279 |        0 |            0 |      26257 |               26257
>        65536 |       32768 |    65536 |    22516 |        0 |            0 |      43020 |               43020
>        98304 |       32768 |    65536 |    24944 |        0 |            0 |      40592 |               40592
>       131072 |        1672 |     3344 |      487 |        0 |            0 |       2857 |                2857
>
> Comments?
>
> ---
> Cédric Villemain +33 (0)6 20 30 22 52
> https://Data-Bene.io
> PostgreSQL Expertise, Support, Training, R&D

-- 
Jim Nasby, Data Architect, Austin TX
  
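For readers unfamiliar with what pgfincore inspects at the OS level, here
is a minimal standalone Linux sketch built on mincore(2) that counts how
many pages of a file are resident in the page cache; it is not pgfincore's
code, just the underlying primitive:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
    if (argc != 2)
    {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }

    int         fd = open(argv[1], O_RDONLY);
    struct stat st;

    if (fd < 0 || fstat(fd, &st) < 0 || st.st_size == 0)
        return 1;

    long        pagesz = sysconf(_SC_PAGESIZE);
    size_t      npages = (st.st_size + pagesz - 1) / pagesz;
    void       *map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    unsigned char *vec = malloc(npages);

    if (map == MAP_FAILED || vec == NULL || mincore(map, st.st_size, vec) != 0)
        return 1;

    size_t      resident = 0;

    for (size_t i = 0; i < npages; i++)
        if (vec[i] & 1)
            resident++;

    printf("%zu of %zu pages resident in page cache\n", resident, npages);
    return 0;
}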





Re: Emit fewer vacuum records by reaping removable tuples during pruning

2024-01-04 Thread Melanie Plageman
Thanks for the review!

On Thu, Jan 4, 2024 at 3:03 PM Andres Freund  wrote:
>
> On 2023-11-17 18:12:08 -0500, Melanie Plageman wrote:
> >   Assert(ItemIdIsNormal(lp));
> >   htup = (HeapTupleHeader) PageGetItem(dp, lp);
> > @@ -715,7 +733,17 @@ heap_prune_chain(Buffer buffer, OffsetNumber 
> > rootoffnum,
> >* redirect the root to the correct chain member.
> >*/
> >   if (i >= nchain)
> > - heap_prune_record_dead(prstate, rootoffnum);
> > + {
> > + /*
> > +  * If the relation has no indexes, we can remove dead 
> > tuples
> > +  * during pruning instead of marking their line 
> > pointers dead. Set
> > +  * this tuple's line pointer LP_UNUSED.
> > +  */
> > + if (prstate->pronto_reap)
> > + heap_prune_record_unused(prstate, rootoffnum);
> > + else
> > + heap_prune_record_dead(prstate, rootoffnum);
> > + }
> >   else
> >   heap_prune_record_redirect(prstate, rootoffnum, 
> > chainitems[i]);
> >   }
> > @@ -726,9 +754,12 @@ heap_prune_chain(Buffer buffer, OffsetNumber 
> > rootoffnum,
> >* item.  This can happen if the loop in heap_page_prune 
> > caused us to
> >* visit the dead successor of a redirect item before 
> > visiting the
> >* redirect item.  We can clean up by setting the redirect 
> > item to
> > -  * DEAD state.
> > +  * DEAD state. If pronto_reap is true, we can set it 
> > LP_UNUSED now.
> >*/
> > - heap_prune_record_dead(prstate, rootoffnum);
> > + if (prstate->pronto_reap)
> > + heap_prune_record_unused(prstate, rootoffnum);
> > + else
> > + heap_prune_record_dead(prstate, rootoffnum);
> >   }
> >
> >   return ndeleted;
>
> There's three new calls to heap_prune_record_unused() and the logic got more
> nested. Is there a way to get to a nicer end result?

So, I could do another loop through the line pointers in
heap_page_prune() (after the loop calling heap_prune_chain()) and, if
pronto_reap is true, set dead line pointers LP_UNUSED. Then, when
constructing the WAL record, I would just not add the prstate.nowdead
that I saved from heap_prune_chain() to the prune WAL record.

This would eliminate the extra if statements from heap_prune_chain().
It will be more performant than sticking with the original (master)
call to lazy_vacuum_heap_page(). However, I'm not convinced that the
extra loop to set line pointers LP_DEAD -> LP_UNUSED is less confusing
than keeping the if pronto_reap test in heap_prune_chain().
heap_prune_chain() is where line pointers' new values are decided. It
seems weird to pick one new value for a line pointer in
heap_prune_chain() and then pick another, different new value in a
loop after heap_prune_chain(). I don't see any way to eliminate the if
pronto_reap tests without a separate loop setting LP_DEAD->LP_UNUSED,
though.
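
A rough sketch of that alternative, using the PruneState field names from
the patch (nowdead/nowunused etc.); this is not code from the patch, just
the shape of the extra pass described above:

if (prstate.pronto_reap)
{
    /* convert what would have been LP_DEAD items into LP_UNUSED */
    for (int i = 0; i < prstate.ndead; i++)
        prstate.nowunused[prstate.nunused++] = prstate.nowdead[i];

    /* and leave nowdead out of the prune WAL record */
    prstate.ndead = 0;
}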

> > From 608658f2cbc0acde55aac815c0fdb523ec24c452 Mon Sep 17 00:00:00 2001
> > From: Melanie Plageman 
> > Date: Mon, 13 Nov 2023 16:47:08 -0500
> > Subject: [PATCH v2 1/2] Indicate rel truncation unsafe in lazy_scan[no]prune
> >
> > Both lazy_scan_prune() and lazy_scan_noprune() must determine whether or
> > not there are tuples on the page making rel truncation unsafe.
> > LVRelState->nonempty_pages is updated to reflect this. Previously, both
> > functions set an output parameter or output parameter member, hastup, to
> > indicate that nonempty_pages should be updated to reflect the latest
> > non-removable page. There doesn't seem to be any reason to wait until
> > lazy_scan_[no]prune() returns to update nonempty_pages. Plenty of other
> > counters in the LVRelState are updated in lazy_scan_[no]prune().
> > This allows us to get rid of the output parameter hastup.
>
>
> > @@ -972,20 +970,21 @@ lazy_scan_heap(LVRelState *vacrel)
> >   continue;
> >   }
> >
> > - /* Collect LP_DEAD items in dead_items array, count 
> > tuples */
> > - if (lazy_scan_noprune(vacrel, buf, blkno, page, 
> > ,
> > + /*
> > +  * Collect LP_DEAD items in dead_items array, count 
> > tuples,
> > +  * determine if rel truncation is safe
> > +  */
> > + if (lazy_scan_noprune(vacrel, buf, blkno, page,
> > 
> > ))
> >   {
> >   Sizefreespace = 0;
> >
> >   /*
> >* Processed page successfully (without 
> > 

Re: Adding facility for injection points (or probe points?) for more advanced tests

2024-01-04 Thread Nathan Bossart
On Thu, Jan 04, 2024 at 06:04:20PM +0530, Ashutosh Bapat wrote:
> 0003 and 0004 are using the extension in this module for some serious
> testing. The name of the extension test_injection_point indicates that
> it's for testing injection points and not for some serious use of
> injection callbacks it adds. Changes 0003 and 0004 suggest otherwise.

Yeah, I think test_injection_point should be reserved for testing the
injection point machinery.

> I suggest we move test_injection_points from src/test/modules to
> contrib/ and rename it as "injection_points". The test files may still
> be named as test_injection_point. The TAP tests in 0003 and 0004 once
> moved to their appropriate places, will load injection_point extension
> and use it. That way predefined injection point callbacks will also be
> available for others to use.

Rather than defining a module somewhere that tests would need to load,
should we just put the common callbacks in the core server?  Unless there's
a strong reason to define them elsewhere, that could be a nice way to save
a step in the tests.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Adding facility for injection points (or probe points?) for more advanced tests

2024-01-04 Thread Nathan Bossart
On Thu, Jan 04, 2024 at 08:53:11AM +0900, Michael Paquier wrote:
> On Tue, Jan 02, 2024 at 11:14:56PM -0600, Nathan Bossart wrote:
>> I'm wondering how important it is to cache the callbacks locally.
>> load_external_function() won't reload an already-loaded library, so AFAICT
>> this is ultimately just saving a call to dlsym().
> 
> This keeps a copy to a callback under the same address space, and I
> guess that it would matter if the code where a callback is added gets
> very hot because this means less function pointers.  At the end I
> would keep the cache as the code to handle it is neither complex nor
> long, while being isolated in its own paths.

Fair enough.

>> 0003 and 0004 add tests to the test_injection_points module.  Is the idea
>> that we'd add any tests that required injection points here?  I think it'd
>> be better if we could move the tests closer to the logic they're testing,
>> but perhaps that is difficult because you also need to define the callback
>> functions somewhere.  Hm...
> 
> Yeah.  Agreed that the final result should not have these tests in the
> module test_injection_points.  What I was thinking here is to move
> 002_invalid_checkpoint_after_promote.pl to src/test/recovery/ and pull
> the module with the callbacks with an EXTRA_INSTALL.

+1

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Add new for_each macros for iterating over a List that do not require ListCell pointer

2024-01-04 Thread Nathan Bossart
Committed after some additional light edits.  Thanks for the patch!

On Wed, Jan 03, 2024 at 11:36:36PM +0100, Jelte Fennema-Nio wrote:
> On Wed, 3 Jan 2024 at 23:13, Tom Lane  wrote:
>> I like Nathan's wording.
> 
> To be clear, I don't want to block this patch on the wording of that
> single comment. So, if you feel Nathan's wording was better, I'm fine
> with that too. But let me respond to your arguments anyway:

I decided to keep the v8 wording, if for no other reason than I didn't see
the need for lots of detail about how it compiles.  IMHO even the vague
mention of loop unrolling is probably more than is really necessary.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: SET ROLE x NO RESET

2024-01-04 Thread Jim Nasby


  
  
On 1/3/24 5:47 PM, Nico Williams wrote:
> Though maybe `NO RESET` isn't really needed to build these, since after
> all one could use an unprivileged role and a SECURITY DEFINER function
> that does the `SET ROLE` following some user-defined authentication
> method, and so what if the client can RESET the role, since that brings
> it back to the otherwise unprivileged role.

An unprivileged role that has the ability to assume any other role ;p

Also, last I checked you can't do SET ROLE in a security definer
function.

> Who needs to RESET roles anyways?  Answer: connection pools, but not
> every connection is used via a pool.  This brings up something: attempts
> to reset a NO RESET session need to fail in such a way that a connection
> pool can detect this and disconnect, or else it needs to fail by
> terminating the connection altogether.

RESET ROLE is also useful for setting up a "superuser role" that DBAs
have access to via a NO INHERIT role. It allows them to do things
like...

SET ROLE su;
-- Do some superuserly thing
RESET ROLE;

Admittedly, the last step could be just to set their role back to
themselves, but RESET ROLE removes the risk of typos.

-- 
Jim Nasby, Data Architect, Austin TX
  





Re: PL/pgSQL: Incomplete item Allow handling of %TYPE arrays, e.g. tab.col%TYPE[]

2024-01-04 Thread Tom Lane
Pavel Stehule  writes:
> Now, I think so this simple patch is ready for committers

I pushed this with some editorialization -- mostly, rewriting the
documentation and comments.  I found that the existing docs for %TYPE
were not great.  There are two separate use-cases, one for referencing
a table column and one for referencing a previously-declared variable,
and the docs were about as clear as mud about explaining that.

I also looked into the problem Pavel mentioned that it doesn't work
for RECORD.  If you just write "record[]" you get an error message
that at least indicates it's an unsupported case:

regression=# do $$declare r record[]; begin end$$;
ERROR:  variable "r" has pseudo-type record[]
CONTEXT:  compilation of PL/pgSQL function "inline_code_block" near line 1

Maybe we could improve on that, but it would be a lot of work and
I'm not terribly excited about it.  However, %TYPE fails entirely
for both "record" and named composite types, and the reason turns
out to be just that plpgsql_parse_wordtype fails to handle the
PLPGSQL_NSTYPE_REC case.  So that's easily fixed.

I also wonder what the heck the last half of plpgsql_parse_wordtype
is for at all.  It looks for a named type, which means you can do

regression=# do $$declare x float8%type; begin end$$;
DO

but that's just stupid.  You could leave off the %TYPE and get
the same result.  Moreover, it is inconsistent because
plpgsql_parse_cwordtype has no equivalent behavior:

regression=# do $$declare x pg_catalog.float8%type; begin end$$;
ERROR:  syntax error at or near "%"
LINE 1: do $$declare x pg_catalog.float8%type; begin end$$;
^
CONTEXT:  invalid type name "pg_catalog.float8%type"

It's also undocumented and untested (the code coverage report
shows this part is never reached).  So I propose we remove it.

That leads me to the attached proposed follow-on patch.

Another thing we could think about, but I've not done it here,
is to make plpgsql_parse_wordtype and friends throw error
instead of just returning NULL when they don't find the name.
Right now, if NULL is returned, we end up passing the whole
string to parse_datatype, leading to unhelpful errors like
the one shown above.  We could do better than that I think,
perhaps like "argument of %TYPE is not a known variable".

regards, tom lane

diff --git a/src/pl/plpgsql/src/expected/plpgsql_record.out b/src/pl/plpgsql/src/expected/plpgsql_record.out
index afb922df29..36d65e4286 100644
--- a/src/pl/plpgsql/src/expected/plpgsql_record.out
+++ b/src/pl/plpgsql/src/expected/plpgsql_record.out
@@ -306,6 +306,32 @@ NOTICE:  r1 = (1,2)
 ERROR:  record "r1" has no field "nosuchfield"
 CONTEXT:  SQL expression "r1.nosuchfield"
 PL/pgSQL function inline_code_block line 9 at RAISE
+-- check that type record can be passed through %type
+do $$
+declare r1 record;
+r2 r1%type;
+begin
+  r2 := row(1,2);
+  raise notice 'r2 = %', r2;
+  r2 := row(3,4,5);
+  raise notice 'r2 = %', r2;
+end$$;
+NOTICE:  r2 = (1,2)
+NOTICE:  r2 = (3,4,5)
+-- arrays of record are not supported at the moment
+do $$
+declare r1 record[];
+begin
+end$$;
+ERROR:  variable "r1" has pseudo-type record[]
+CONTEXT:  compilation of PL/pgSQL function "inline_code_block" near line 2
+do $$
+declare r1 record;
+r2 r1%type[];
+begin
+end$$;
+ERROR:  variable "r2" has pseudo-type record[]
+CONTEXT:  compilation of PL/pgSQL function "inline_code_block" near line 3
 -- check repeated assignments to composite fields
 create table some_table (id int, data text);
 do $$
diff --git a/src/pl/plpgsql/src/pl_comp.c b/src/pl/plpgsql/src/pl_comp.c
index b745eaa3f8..c63719c796 100644
--- a/src/pl/plpgsql/src/pl_comp.c
+++ b/src/pl/plpgsql/src/pl_comp.c
@@ -1596,8 +1596,8 @@ plpgsql_parse_tripword(char *word1, char *word2, char *word3,
 
 
 /* --
- * plpgsql_parse_wordtype	The scanner found word%TYPE. word can be
- *a variable name or a basetype.
+ * plpgsql_parse_wordtype	The scanner found word%TYPE. word should be
+ *a pre-existing variable name.
  *
  * Returns datatype struct, or NULL if no match found for word.
  * --
@@ -1605,10 +1605,7 @@ plpgsql_parse_tripword(char *word1, char *word2, char *word3,
 PLpgSQL_type *
 plpgsql_parse_wordtype(char *ident)
 {
-	PLpgSQL_type *dtype;
 	PLpgSQL_nsitem *nse;
-	TypeName   *typeName;
-	HeapTuple	typeTup;
 
 	/*
 	 * Do a lookup in the current namespace stack
@@ -1623,39 +1620,13 @@ plpgsql_parse_wordtype(char *ident)
 		{
 			case PLPGSQL_NSTYPE_VAR:
 return ((PLpgSQL_var *) (plpgsql_Datums[nse->itemno]))->datatype;
-
-/* XXX perhaps allow REC/ROW here? */
-
+			case PLPGSQL_NSTYPE_REC:
+return ((PLpgSQL_rec *) (plpgsql_Datums[nse->itemno]))->datatype;
 			default:
 return NULL;
 		}
 	}
 
-	/*
-	 * Word wasn't found in the namespace stack. Try to find a data type with
-	 * that name, but ignore shell types and complex types.
-	 */
-	typeName = makeTypeName(ident);
-	typeTup = 

Re: Confine vacuum skip logic to lazy_scan_skip

2024-01-04 Thread Andres Freund
Hi,

On 2024-01-02 12:36:18 -0500, Melanie Plageman wrote:
> Subject: [PATCH v2 1/6] lazy_scan_skip remove unnecessary local var rel_pages
> Subject: [PATCH v2 2/6] lazy_scan_skip remove unneeded local var
>  nskippable_blocks

I think these may lead to worse code - the compiler has to reload
vacrel->rel_pages/next_unskippable_block for every loop iteration, because it
can't guarantee that they're not changed within one of the external functions
called in the loop body.

> Subject: [PATCH v2 3/6] Add lazy_scan_skip unskippable state
> 
> Future commits will remove all skipping logic from lazy_scan_heap() and
> confine it to lazy_scan_skip(). To make those commits more clear, first
> introduce the struct, VacSkipState, which will maintain the variables
> needed to skip ranges less than SKIP_PAGES_THRESHOLD.

Why not add this to LVRelState, possibly as a struct embedded within it?


> From 335faad5948b2bec3b83c2db809bb9161d373dcb Mon Sep 17 00:00:00 2001
> From: Melanie Plageman 
> Date: Sat, 30 Dec 2023 16:59:27 -0500
> Subject: [PATCH v2 4/6] Confine vacuum skip logic to lazy_scan_skip
> 
> In preparation for vacuum to use the streaming read interface (and eventually
> AIO), refactor vacuum's logic for skipping blocks such that it is entirely
> confined to lazy_scan_skip(). This turns lazy_scan_skip() and the VacSkipState
> it uses into an iterator which yields blocks to lazy_scan_heap(). Such a
> structure is conducive to an async interface.

And it's cleaner - I find the current code extremely hard to reason about.


> By always calling lazy_scan_skip() -- instead of only when we have reached the
> next unskippable block, we no longer need the skipping_current_range variable.
> lazy_scan_heap() no longer needs to manage the skipped range -- checking if we
> reached the end in order to then call lazy_scan_skip(). And lazy_scan_skip()
> can derive the visibility status of a block from whether or not we are in a
> skippable range -- that is, whether or not the next_block is equal to the next
> unskippable block.

I wonder if it should be renamed as part of this - the name is somewhat
confusing now (and perhaps before)? lazy_scan_get_next_block() or such?


> + while (true)
>   {
>   Buffer  buf;
>   Pagepage;
> - boolall_visible_according_to_vm;
>   LVPagePruneState prunestate;
>  
> - if (blkno == vacskip.next_unskippable_block)
> - {
> - /*
> -  * Can't skip this page safely.  Must scan the page.  
> But
> -  * determine the next skippable range after the page 
> first.
> -  */
> - all_visible_according_to_vm = 
> vacskip.next_unskippable_allvis;
> - lazy_scan_skip(vacrel, &vacskip, blkno + 1);
> -
> - Assert(vacskip.next_unskippable_block >= blkno + 1);
> - }
> - else
> - {
> - /* Last page always scanned (may need to set 
> nonempty_pages) */
> - Assert(blkno < rel_pages - 1);
> -
> - if (vacskip.skipping_current_range)
> - continue;
> + blkno = lazy_scan_skip(vacrel, &vacskip, blkno + 1,
> +        &all_visible_according_to_vm);
>  
> - /* Current range is too small to skip -- just scan the 
> page */
> - all_visible_according_to_vm = true;
> - }
> + if (blkno == InvalidBlockNumber)
> + break;
>  
>   vacrel->scanned_pages++;
>

I don't like that we still do determination about the next block outside of
lazy_scan_skip() and have duplicated exit conditions between lazy_scan_skip()
and lazy_scan_heap().

I'd probably change the interface to something like

while (lazy_scan_get_next_block(vacrel, &blkno))
{
...
}


> From b6603e35147c4bbe3337280222e6243524b0110e Mon Sep 17 00:00:00 2001
> From: Melanie Plageman 
> Date: Sun, 31 Dec 2023 09:47:18 -0500
> Subject: [PATCH v2 5/6] VacSkipState saves reference to LVRelState
> 
> The streaming read interface can only give pgsr_next callbacks access to
> two pieces of private data. As such, move a reference to the LVRelState
> into the VacSkipState.
> 
> This is a separate commit (as opposed to as part of the commit
> introducing VacSkipState) because it is required for using the streaming
> read interface but not a natural change on its own. VacSkipState is per
> block and the LVRelState is referenced for the whole relation vacuum.

I'd do it the other way round, i.e. either embed VacSkipState ino LVRelState
or point to it from VacSkipState.

LVRelState is already tied to the iteration state, so I don't think there's a
reason not to do so.

Greetings,

Andres Freund




Re: Emit fewer vacuum records by reaping removable tuples during pruning

2024-01-04 Thread Andres Freund
Hi,

On 2023-11-17 18:12:08 -0500, Melanie Plageman wrote:
> diff --git a/src/backend/access/heap/heapam.c 
> b/src/backend/access/heap/heapam.c
> index 14de8158d49..b578c32eeb6 100644
> --- a/src/backend/access/heap/heapam.c
> +++ b/src/backend/access/heap/heapam.c
> @@ -8803,8 +8803,13 @@ heap_xlog_prune(XLogReaderState *record)
>   nunused = (end - nowunused);
>   Assert(nunused >= 0);
>
> - /* Update all line pointers per the record, and repair 
> fragmentation */
> - heap_page_prune_execute(buffer,
> + /*
> +  * Update all line pointers per the record, and repair 
> fragmentation.
> +  * We always pass pronto_reap as true, because we don't know 
> whether
> +  * or not this option was used when pruning. This reduces the
> +  * validation done on replay in an assert build.
> +  */

Hm, that seems not great. Both because we loose validation and because it
seems to invite problems down the line, due to pronto_reap falsely being set
to true in heap_page_prune_execute().


> @@ -581,7 +589,17 @@ heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum,
>* function.)
>*/
>   if (ItemIdIsDead(lp))
> + {
> + /*
> +  * If the relation has no indexes, we can set dead line 
> pointers
> +  * LP_UNUSED now. We don't increment ndeleted here 
> since the LP
> +  * was already marked dead.
> +  */
> + if (prstate->pronto_reap)
> + heap_prune_record_unused(prstate, offnum);
> +
>   break;
> + }

I wasn't immediately sure whether this is reachable - but it is, e.g. after
on-access pruning (which currently doesn't yet use pronto reaping), after
pg_upgrade or dropping an index.


>   Assert(ItemIdIsNormal(lp));
>   htup = (HeapTupleHeader) PageGetItem(dp, lp);
> @@ -715,7 +733,17 @@ heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum,
>* redirect the root to the correct chain member.
>*/
>   if (i >= nchain)
> - heap_prune_record_dead(prstate, rootoffnum);
> + {
> + /*
> +  * If the relation has no indexes, we can remove dead 
> tuples
> +  * during pruning instead of marking their line 
> pointers dead. Set
> +  * this tuple's line pointer LP_UNUSED.
> +  */
> + if (prstate->pronto_reap)
> + heap_prune_record_unused(prstate, rootoffnum);
> + else
> + heap_prune_record_dead(prstate, rootoffnum);
> + }
>   else
>   heap_prune_record_redirect(prstate, rootoffnum, 
> chainitems[i]);
>   }
> @@ -726,9 +754,12 @@ heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum,
>* item.  This can happen if the loop in heap_page_prune caused 
> us to
>* visit the dead successor of a redirect item before visiting 
> the
>* redirect item.  We can clean up by setting the redirect item 
> to
> -  * DEAD state.
> +  * DEAD state. If pronto_reap is true, we can set it LP_UNUSED 
> now.
>*/
> - heap_prune_record_dead(prstate, rootoffnum);
> + if (prstate->pronto_reap)
> + heap_prune_record_unused(prstate, rootoffnum);
> + else
> + heap_prune_record_dead(prstate, rootoffnum);
>   }
>
>   return ndeleted;

There's three new calls to heap_prune_record_unused() and the logic got more
nested. Is there a way to get to a nicer end result?


> From 608658f2cbc0acde55aac815c0fdb523ec24c452 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman 
> Date: Mon, 13 Nov 2023 16:47:08 -0500
> Subject: [PATCH v2 1/2] Indicate rel truncation unsafe in lazy_scan[no]prune
>
> Both lazy_scan_prune() and lazy_scan_noprune() must determine whether or
> not there are tuples on the page making rel truncation unsafe.
> LVRelState->nonempty_pages is updated to reflect this. Previously, both
> functions set an output parameter or output parameter member, hastup, to
> indicate that nonempty_pages should be updated to reflect the latest
> non-removable page. There doesn't seem to be any reason to wait until
> lazy_scan_[no]prune() returns to update nonempty_pages. Plenty of other
> counters in the LVRelState are updated in lazy_scan_[no]prune().
> This allows us to get rid of the output parameter hastup.


> @@ -972,20 +970,21 @@ lazy_scan_heap(LVRelState *vacrel)
>   continue;
>   }
>
> - /* Collect LP_DEAD items in dead_items array, count 
> tuples 

Re: Emit fewer vacuum records by reaping removable tuples during pruning

2024-01-04 Thread Andres Freund
Hi,

On 2024-01-04 12:31:36 -0500, Robert Haas wrote:
> On Wed, Dec 27, 2023 at 11:27 AM Melanie Plageman
>  wrote:
> > Do you have specific concerns about its correctness? I understand it
> > is an area where we have to be sure we are correct. But, to be fair,
> > that is true of all the pruning and vacuuming code.
> 
> I'm kind of concerned that 0002 might be a performance regression. It
> pushes more branches down into the heap-pruning code, which I think
> can sometimes be quite hot, for the sake of a case that rarely occurs
> in practice.

I was wondering the same when talking about this with Melanie. But I guess
there are some scenarios that aren't unrealistic, consider e.g. bulk data
loading with COPY with an UPDATE to massage the data afterwards, before
creating the indexes.

Where I could see this becoming more interesting / powerful is if we were able
to do 'pronto reaping' not just from vacuum but also during on-access
pruning. For ETL workloads VACUUM might often not run between modifications of
the same page, but on-access pruning will. Being able to reclaim dead items at
that point seems like it could be pretty sizable improvement?


> I take your point about it improving things when there are no indexes, but
> what about when there are?

I suspect that branch prediction should take care of the additional
branches. heap_prune_chain() indeed can be very hot, but I think we're
primarily bottlenecked on data cache misses.

For a single VACUUM, whether we'd do pronto reaping would be constant -> very
well predictable. We could add an unlikely() to make sure the
branch-predictor-is-empty case optimizes for the much more common case of
having indexes.  Falsely assuming we'd not pronto reap wouldn't be
particularly bad, as the wins for the case are so much bigger.
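
Concretely, the hint would just wrap the new branch from the patch, e.g.
(illustrative, not committed code):

if (unlikely(prstate->pronto_reap))
    heap_prune_record_unused(prstate, rootoffnum);
else
    heap_prune_record_dead(prstate, rootoffnum);

where unlikely() is PostgreSQL's existing __builtin_expect() wrapper from
c.h, telling the compiler to lay out the indexed (non-pronto-reap) path as
the fall-through case.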

If we were to use pronto reaping for on-access pruning, it's perhaps a bit
less predictable, as pruning for pages of a relation with indexes could be
interleaved with pruning for relations without. But even there I suspect it'd
not be the primary bottleneck: We call heap_page_prune_chain() in a loop for
every tuple on a page, the branch predictor should quickly learn whether we're
using pronto reaping. Whereas we're not becoming less cache-miss heavy when
looking at subsequent tuples.


> And even if there are no adverse performance consequences, is it really
> worth complicating the logic at such a low level?

Yes, I think this is the main question here. It's not clear though that the
state after the patch is meaningfully more complicated? It removes nontrivial
code from lazy_scan_heap() and pays for that with a bit more complexity in
heap_prune_chain().


> Also, I find "pronto_reap" to be a poor choice of name. "pronto" is an
> informal word that seems to have no advantage over something like
> "immediate" or "now,"

I didn't like that either :)


> and I don't think "reap" has a precise, universally-understood meaning.

Less concerned about that.


> You could call this "mark_unused_now" or "immediately_mark_unused" or
> something and it would be far more self-documenting, IMHO.

How about 'no_indexes' or such?

Greetings,

Andres Freund




Re: ALTER COLUMN ... SET EXPRESSION to alter stored generated column's expression

2024-01-04 Thread Peter Eisentraut

On 25.12.23 13:10, Amul Sul wrote:

 > I think we can't support that (like alter type) since we need to
place
 > this new
 > pass before AT_PASS_OLD_INDEX & AT_PASS_OLD_CONSTR to re-add
indexes and
 > constraints for the validation.

Could we have AT_PASS_ADD_COL before AT_PASS_OLD_*?  So overall it
would be

...
AT_PASS_ALTER_TYPE,
AT_PASS_ADD_COL,         // moved
AT_PASS_SET_EXPRESSION,  // new
AT_PASS_OLD_INDEX,
AT_PASS_OLD_CONSTR,
AT_PASS_ADD_CONSTR,
...

This appears to not break any existing tests.


(Sorry, for the delay)

Agree. I did that change in 0001 patch.


I have committed this patch set.

A couple of notes:

You had included the moving of the AT_PASS_ADD_COL enum in the first 
patch.  This is not a good style.  Refactoring patches should not 
include semantic changes.  I have moved that change to the final patch that 
introduced the new feature.


I did not commit the 0002 patch that renamed some functions.  I think 
names like AlterColumn are too general, which makes this renaming 
possibly counterproductive.  I don't know a better name, maybe 
AlterTypeOrSimilar, but that's obviously silly.  I think leaving it at 
AlterType isn't too bad, since most of the code is indeed for ALTER TYPE 
support.  We can reconsider this if we have a better idea.


In RememberAllDependentForRebuilding(), I dropped some of the additional 
errors that you introduced for the AT_SetExpression cases.  These didn't 
seem useful.  For example, it is not possible for a generated column to 
depend on another generated column, so there is no point checking for 
it.  Also, there were no test cases that covered any of these 
situations.  If we do want any of these, we should have tests and 
documentation for them.


For the tests that examine the EXPLAIN plans, I had to add an ANALYZE 
after the SET EXPRESSION.  Otherwise, I got unstable test results, 
presumably because in some cases the analyze happened in the background.






Re: add AVX2 support to simd.h

2024-01-04 Thread Nathan Bossart
On Tue, Jan 02, 2024 at 10:11:23AM -0600, Nathan Bossart wrote:
> (In case it isn't clear, I'm volunteering to set up such a buildfarm
> machine.)

I set up "akepa" to run with -march=x86-64-v3.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Emit fewer vacuum records by reaping removable tuples during pruning

2024-01-04 Thread Robert Haas
On Wed, Dec 27, 2023 at 11:27 AM Melanie Plageman
 wrote:
> Do you have specific concerns about its correctness? I understand it
> is an area where we have to be sure we are correct. But, to be fair,
> that is true of all the pruning and vacuuming code.

I'm kind of concerned that 0002 might be a performance regression. It
pushes more branches down into the heap-pruning code, which I think
can sometimes be quite hot, for the sake of a case that rarely occurs
in practice. I take your point about it improving things when there
are no indexes, but what about when there are? And even if there are
no adverse performance consequences, is it really worth complicating
the logic at such a low level?

Also, I find "pronto_reap" to be a poor choice of name. "pronto" is an
informal word that seems to have no advantage over something like
"immediate" or "now," and I don't think "reap" has a precise,
universally-understood meaning. You could call this "mark_unused_now"
or "immediately_mark_unused" or something and it would be far more
self-documenting, IMHO.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: the s_lock_stuck on perform_spin_delay

2024-01-04 Thread Robert Haas
On Thu, Jan 4, 2024 at 11:33 AM Tom Lane  wrote:
> I believe it's (a).  No matter what the reason for a stuck spinlock
> is, the only reliable method of getting out of the problem is to
> blow things up and start over.  The patch proposed at the top of this
> thread would leave the system unable to recover on its own, with the
> only recourse being for the DBA to manually force a crash/restart ...
> once she figured out that that was necessary, which might take a long
> while if the only external evidence is an occasional WARNING that
> might not even be making it to the postmaster log.  How's that better?

It's a fair question. I think you're correct if we assume that
everyone's following the coding rule ... at least assuming that the
target system isn't too insanely slow, and I've seen some pretty
crazily overloaded machines. But if the coding rule is not being
followed, then "the only reliable method of getting out of the problem
is to blow things up and start over" becomes a dubious conclusion.

Also, I wonder if many or even all uses of spinlocks uses ought to be
replaced with either LWLocks or atomics. An LWLock might be slightly
slower when contention is low, but it scales better when contention is
high, displays a wait event so that you can see that you have
contention if you do, and doesn't PANIC the system if the contention
gets too bad. And an atomic will just be faster, in the cases where
it's adequate.

The trouble with trying to do a replacement is that some of the
spinlock-using code is ancient and quite hairy. info_lck in particular
looks like a hot mess -- it's used in complex ways and in performance
critical paths, with terrifying comments like this:

 * To read XLogCtl->LogwrtResult, you must hold either info_lck or
 * WALWriteLock.  To update it, you need to hold both locks.  The point of
 * this arrangement is that the value can be examined by code that already
 * holds WALWriteLock without needing to grab info_lck as well.  In addition
 * to the shared variable, each backend has a private copy of LogwrtResult,
 * which is updated when convenient.
 *
 * The request bookkeeping is simpler: there is a shared XLogCtl->LogwrtRqst
 * (protected by info_lck), but we don't need to cache any copies of it.
 *
 * info_lck is only held long enough to read/update the protected variables,
 * so it's a plain spinlock.  The other locks are held longer (potentially
 * over I/O operations), so we use LWLocks for them.  These locks are:

But info_lck was introduced in 1999 and this scheme was introduced in
2012, and a lot has changed since then. Whatever benchmarking was done
to validate this locking regime is probably obsolete at this point.
Back then, LWLocks were built on top of spinlocks, and were, I
believe, a lot slower than they are now. Plus CPU performance
characteristics have changed a lot. So who really knows if the way
we're doing things here makes any sense at all these days? But one
doesn't want to do a naive replacement and pessimize things, either.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: the s_lock_stuck on perform_spin_delay

2024-01-04 Thread Tom Lane
Robert Haas  writes:
> On Thu, Jan 4, 2024 at 10:22 AM Tom Lane  wrote:
>> We should be making an effort to ban coding patterns like
>> "return with spinlock still held", because they're just too prone
>> to errors similar to this one.

> I agree that we don't want to add overhead, and also about how
> spinlocks should be used, but I dispute "easily detectable
> statically." I mean, if you or I look at some code that uses a
> spinlock, we'll know whether the pattern that you mention is being
> followed or not, modulo differences of opinion in debatable cases. But
> you and I cannot be there to look at all the code all the time. If we
> had a static checking tool that was run as part of every build, or in
> the buildfarm, or by cfbot, or somewhere else that raised the alarm if
> this rule was violated, then we could claim to be effectively
> enforcing this rule.

I was indeed suggesting that maybe we could find a way to detect
such things automatically.  While I've not been paying close
attention, I recall there's been some discussions of using LLVM/clang
infrastructure for customized static analysis, so maybe it'd be
possible without an undue amount of effort.

> I think the question we should be asking here is what the purpose of
> the PANIC is. I can think of two possible purposes. It could be either
> (a) an attempt to prevent real-world harm by turning database hangs
> into database panics, so that at least the system will restart and get
> moving again instead of sitting there stuck for all eternity or (b) an
> attempt to punish people for writing bad code by turning coding rule
> violations into panics on production systems.

I believe it's (a).  No matter what the reason for a stuck spinlock
is, the only reliable method of getting out of the problem is to
blow things up and start over.  The patch proposed at the top of this
thread would leave the system unable to recover on its own, with the
only recourse being for the DBA to manually force a crash/restart ...
once she figured out that that was necessary, which might take a long
while if the only external evidence is an occasional WARNING that
might not even be making it to the postmaster log.  How's that better?

> ... (b3) if the PANIC does fire, it
> gives you basically zero help in figuring out where the actual problem
> is. The PostgreSQL code base is way too big for "ERROR: you screwed
> up" to be an acceptable diagnostic.

Ideally I agree with the latter, but that doesn't mean that doing
better is easy or even possible.  (The proposed patch certainly does
nothing to help diagnose such issues.)  As for the former point,
panicking here at least offers the chance of getting a stack trace,
which might help a developer find the problem.

regards, tom lane




Re: the s_lock_stuck on perform_spin_delay

2024-01-04 Thread Robert Haas
On Thu, Jan 4, 2024 at 10:22 AM Tom Lane  wrote:
> I'm not a fan of adding overhead to such a performance-critical
> thing as spinlocks in order to catch coding errors that are easily
> detectable statically.  IMV the correct usage of spinlocks is that
> they should only be held across *short, straight line* code segments.
> We should be making an effort to ban coding patterns like
> "return with spinlock still held", because they're just too prone
> to errors similar to this one.  Note that trying to take another
> lock is far from the only bad thing that can happen if you're
> not very conservative about what code can execute with a spinlock
> held.

I agree that we don't want to add overhead, and also about how
spinlocks should be used, but I dispute "easily detectable
statically." I mean, if you or I look at some code that uses a
spinlock, we'll know whether the pattern that you mention is being
followed or not, modulo differences of opinion in debatable cases. But
you and I cannot be there to look at all the code all the time. If we
had a static checking tool that was run as part of every build, or in
the buildfarm, or by cfbot, or somewhere else that raised the alarm if
this rule was violated, then we could claim to be effectively
enforcing this rule. But with 20-30 active committers and ~100 active
developers at any given point in time, any system that relies on every
relevant person knowing all the rules and remembering to enforce them
on every commit is bound to be less than 100% effective. Some people
won't know what the rule is, some people will decide that their
particular situation is Very Special, some people will just forget to
check for violations, and some people will check for violations but
miss something.

I think the question we should be asking here is what the purpose of
the PANIC is. I can think of two possible purposes. It could be either
(a) an attempt to prevent real-world harm by turning database hangs
into database panics, so that at least the system will restart and get
moving again instead of sitting there stuck for all eternity or (b) an
attempt to punish people for writing bad code by turning coding rule
violations into panics on production systems. If it's (a), that's
defensible, though we can still ask whether it does more harm than
good. If it's (b), that's not a good way of handling that problem,
because (b1) it affects production builds and not just development
builds, (b2) many coding rule violations are vanishingly unlikely to
trigger that PANIC in practice, and (b3) if the PANIC does fire, it
gives you basically zero help in figuring out where the actual problem
is. The PostgreSQL code base is way too big for "ERROR: you screwed
up" to be an acceptable diagnostic.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

2024-01-04 Thread vignesh C
On Thu, 28 Dec 2023 at 09:27, jian he  wrote:
>
> On Wed, Dec 20, 2023 at 8:27 PM Masahiko Sawada  wrote:
> >
> >
> > Why do we need to use SPI? I think we can form heap tuples and insert
> > them to the error table. Creating the error table also doesn't need to
> > use SPI.
> >
> Thanks for pointing it out. I figured out how to form heap tuples and
> insert them to the error table.
> but I don't know how to create the error table without using SPI.
> Please pointer it out.
>
> > >
> > > copy_errors one per schema.
> > > foo.copy_errors will be owned by the schema: foo owner.
> >
> > It seems that the error table is created when the SAVE_ERROR is used
> > for the first time. It probably blocks concurrent COPY FROM commands
> > with SAVE_ERROR option to different tables if the error table is not
> > created yet.
> >
> I don't know how to solve this problem Maybe we can document this.
> but it will block the COPY FROM immediately.
>
> > >
> > > if you can insert to a table in that specific schema let's say foo,
> > > then you will get privilege to INSERT/DELETE/SELECT
> > > to foo.copy_errors.
> > > If you are not a superuser, you are only allowed to do
> > > INSERT/DELETE/SELECT on foo.copy_errors rows where USERID =
> > > current_user::regrole::oid.
> > > This is done via row level security.
> >
> > I don't think it works. If the user is dropped, the user's oid could
> > be reused for a different user.
> >
>
> You are right.
> so I changed, now the schema owner will be the error table owner.
> every error table tuple inserts,
> I switch to schema owner, do the insert, then switch back to the
> COPY_FROM operation user.
> now everyone (except superuser) will need explicit grant to access the
> error table.

There are some compilation issues reported at [1] for the patch:
[04:04:26.288] copyfromparse.c: In function ‘NextCopyFrom’:
[04:04:26.288] copyfromparse.c:1126:25: error: ‘copy_errors_tupDesc’
may be used uninitialized in this function
[-Werror=maybe-uninitialized]
[04:04:26.288] 1126 | copy_errors_tup = heap_form_tuple(copy_errors_tupDesc,
[04:04:26.288] | ^~~~
[04:04:26.288] 1127 | t_values,
[04:04:26.288] | ~
[04:04:26.288] 1128 | t_isnull);
[04:04:26.288] | ~
[04:04:26.288] copyfromparse.c:1160:4: error: ‘copy_errorsrel’ may be
used uninitialized in this function [-Werror=maybe-uninitialized]
[04:04:26.288] 1160 | table_close(copy_errorsrel, RowExclusiveLock);
[04:04:26.288] | ^

[1] - https://cirrus-ci.com/task/4785221183209472
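
FWIW, the usual way to silence this class of warning is to give the
variables a dummy initialization and only use them when the corresponding
branch was really taken; a rough sketch (the names and the save_errors flag
are illustrative, not the patch's actual code):

	Relation	copy_errorsrel = NULL;	/* NULL means "not saving errors" */
	TupleDesc	copy_errors_tupDesc = NULL;

	if (save_errors)
	{
		copy_errorsrel = table_open(copy_errors_relid, RowExclusiveLock);
		copy_errors_tupDesc = RelationGetDescr(copy_errorsrel);
	}

	/* ... later, on the error-saving path ... */
	if (copy_errorsrel != NULL)
	{
		HeapTuple	copy_errors_tup = heap_form_tuple(copy_errors_tupDesc,
													  t_values, t_isnull);

		/* insert copy_errors_tup here, then: */
		table_close(copy_errorsrel, RowExclusiveLock);
	}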

Regards,
Vignesh




Re: Proposal to add page headers to SLRU pages

2024-01-04 Thread Li, Yong


> On Jan 2, 2024, at 19:35, Aleksander Alekseev  
> wrote:
>
> Thanks for the updated patch.
>
> cfbot seems to have some complaints regarding compiler warnings and
> also building the patch on Windows:
>
> http://cfbot.cputube.org/

Thanks for the information.  Here is the updated patch.

Regards,
Yong



slru_page_header_v3.patch
Description: slru_page_header_v3.patch


Re: speed up a logical replica setup

2024-01-04 Thread Euler Taveira
On Thu, Jan 4, 2024, at 3:05 AM, Amit Kapila wrote:
> Won't it be a better user experience that after setting up the target
> server as a logical replica (subscriber), it started to work
> seamlessly without user intervention?

If we have an option to control the replication slot removal (default is on),
it seems like a good UI. Even if the user decides to disable the replication slot
removal, it should print a message saying that these replication slots can
cause WAL retention.

> > The initial version had an option to stop the subscriber. I decided to
> > remove the option and stop the subscriber by default mainly because (1) it 
> > is
> > an extra step to start the server (another point is that the WAL retention
> > doesn't happen due to additional (synchronized?) replication slots on
> > subscriber -- point 2). It was a conservative choice. If point 2 isn't an
> > issue, imo point 1 is no big deal.
> >
> 
> By point 2, do you mean to have a check for "max replication slots"?
> It so, the one possibility is to even increase that config, if the
> required max_replication_slots is low.

By point 2, I mean WAL retention (sentence inside parenthesis).


--
Euler Taveira
EDB   https://www.enterprisedb.com/


Re: speed up a logical replica setup

2024-01-04 Thread Euler Taveira
On Thu, Jan 4, 2024, at 2:41 AM, Amit Kapila wrote:
> So, you also seem to be saying that it is not required once
> pg_subscriber has promoted it. So, why it should be optional to remove
> physical_replication_slot? I think we must remove it from the primary
> unless there is some other reason.

My point is to *always* remove the primary_slot_name on primary.

> My point is to not have an additional slot and keep a comment
> indicating that we need an extra slot once we add pg_basebackup
> support.

Got it.

> > 3. How about sync slots on the physical standby if present? Do we want
> > to retain those as it is or do we need to remove those? We are
> > actively working on the patch [1] for the same.
> >
> >
> > I didn't read the current version of the referred patch but if the proposal 
> > is
> > to synchronize logical replication slots iif you are using a physical
> > replication, as soon as pg_subscriber finishes the execution, there won't be
> > synchronization on these logical replication slots because there isn't a
> > physical replication anymore. If the goal is a promotion, the current 
> > behavior
> > is correct because the logical replica will retain WAL since it was 
> > converted.
> >
> 
> I don't understand what you mean by promotion in this context. If
> users want to simply promote the standby then there is no need to do
> additional things that this tool is doing.

ENOCOFFEE. s/promotion/switchover/

> > However, if you are creating a logical replica, this WAL retention is not 
> > good
> > and the customer should eventually remove these logical replication slots on
> > the logical replica.
> >
> 
> I think asking users to manually remove such slots won't be a good
> idea. We might want to either remove them by default or provide an
> option to the user.

Am I correct that in the majority of use cases these replication slots will be
useless? If so, let's remove them by default and add an option to control this
behavior (replication slot removal).


--
Euler Taveira
EDB   https://www.enterprisedb.com/


Re: POC: Extension for adding distributed tracing - pg_tracing

2024-01-04 Thread Anthonin Bonnefoy
Hi,

> This approach avoids the need to rewrite SQL or give special meaning to SQL 
> comments.

SQLCommenter already has a good amount of support and is referenced in
opentelemetry https://github.com/open-telemetry/opentelemetry-sqlcommenter
So the goal was more to leverage the existing trace propagation
systems as much as possible.

> I used GUCs to set the `opentelemtery_trace_id` and `opentelemetry_span_id`.
> These can be sent as protocol-level messages by the client driver if the
> client driver has native trace integration enabled, in which case the user
> doesn't need to see or care. And for other drivers, the application can use
> `SET` or `SET LOCAL` to assign them.

Emitting `SET` and `SET LOCAL` for every traced statement sounds more
disruptive and expensive than relying on SQLCommenter. When not using
`SET LOCAL`, the variables also need to be cleared.
This will also introduce a lot of headaches as the `SET` themselves
could be traced and generate spans when tracing utility statements is
already tricky enough.

> But IIRC the current BDR/PGD folks are now using an OpenTelemetry
> sidecar process instead. I think they send it UDP packets to emit
> metrics and events.

I would be curious to hear more about this. I didn't want to be tied
to a specific protocol (I've only seen gRPC and http support on latest
opentelemetry collectors) and I fear that relying on Postgres sending
traces would lower the chance of having this supported on managed
Postgres solutions (like RDS and CloudSQL).

> By queries you mean particular queries, not transactions? And since
> they have an assigned ID it means that the query is already executing
> and we want to enable the tracking in another session, right?

I think that was the idea. The query identifier generated for a
specific statement is stable, so we could have a GUC variable with a
list of query ids, and only queries matching the provided query ids
would be sampled. This would make it easier to generate traces for a
specific query within a session.
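
To sketch what that could look like on the extension side (purely
illustrative, assuming a GUC holding a comma-separated list of query ids;
this is not what the current patch does):

/* Hypothetical GUC: comma-separated list of query ids to always sample. */
static char *pg_tracing_trace_query_ids = NULL;

static bool
query_id_is_traced(uint64 query_id)
{
	char	   *rawstring;
	List	   *elems;
	ListCell   *lc;
	bool		found = false;

	if (pg_tracing_trace_query_ids == NULL ||
		pg_tracing_trace_query_ids[0] == '\0')
		return false;

	/* SplitIdentifierString scribbles on its input, so work on a copy */
	rawstring = pstrdup(pg_tracing_trace_query_ids);
	if (!SplitIdentifierString(rawstring, ',', &elems))
	{
		pfree(rawstring);
		return false;			/* malformed list, ignore it */
	}

	foreach(lc, elems)
	{
		if ((uint64) strtoull((char *) lfirst(lc), NULL, 10) == query_id)
		{
			found = true;
			break;
		}
	}

	list_free(elems);
	pfree(rawstring);
	return found;
}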


On Tue, Jan 2, 2024 at 1:14 PM Aleksander Alekseev
 wrote:
>
> Hi,
>
> > Overall solution looks good for me except SQL Commenter - query 
> > instrumentation
> > with SQL comments is often not possible on production systems. Instead
> > the very often requested feature is to enable tracing for a given single 
> > query ID,
> > or several (very limited number of) queries by IDs. It could be done by 
> > adding
> > SQL function to add and remove query ID into a list (even array would do)
> > stored in top tracing context.
>
> Not 100% sure if I follow.
>
> By queries you mean particular queries, not transactions? And since
> they have an assigned ID it means that the query is already executing
> and we want to enable the tracking in another session, right? If this
> is the case I would argue that enabling/disabling tracing for an
> _already_ running query (or transaction) would be way too complicated.
>
> I wouldn't mind enabling/disabling tracing for the current session
> and/or a given session ID. In the latter case it can have an effect
> only for the new transactions. This however could be implemented
> separately from the proposed patchset. I suggest keeping the scope
> small.
>
> --
> Best regards,
> Aleksander Alekseev




Re: the s_lock_stuck on perform_spin_delay

2024-01-04 Thread Tom Lane
Robert Haas  writes:
> I'm not sure that the approach this patch takes is correct in detail,
> but I kind of agree with you about the overall point. I mean, the idea
> of the PANIC is to avoid having the system just sit there in a state
> from which it will never recover ... but it can also have the effect
> of killing a system that wasn't really dead. I'm not sure what the
> best thing to do here is, but it's worth talking about, IMHO.

I'm not a fan of adding overhead to such a performance-critical
thing as spinlocks in order to catch coding errors that are easily
detectable statically.  IMV the correct usage of spinlocks is that
they should only be held across *short, straight line* code segments.
We should be making an effort to ban coding patterns like
"return with spinlock still held", because they're just too prone
to errors similar to this one.  Note that trying to take another
lock is far from the only bad thing that can happen if you're
not very conservative about what code can execute with a spinlock
held.

regards, tom lane




Re: WIP Incremental JSON Parser

2024-01-04 Thread Robert Haas
On Wed, Jan 3, 2024 at 6:36 PM Nico Williams  wrote:
> On Tue, Jan 02, 2024 at 10:14:16AM -0500, Robert Haas wrote:
> > It seems like a pretty significant savings no matter what. Suppose the
> > backup_manifest file is 2GB, and instead of creating a 2GB buffer, you
> > create an 1MB buffer and feed the data to the parser in 1MB chunks.
> > Well, that saves 2GB less 1MB, full stop. Now if we address the issue
> > you raise here in some way, we can potentially save even more memory,
> > which is great, but even if we don't, we still saved a bunch of memory
> > that could not have been saved in any other way.
>
> You could also build a streaming incremental parser.  That is, one that
> outputs a path and a leaf value (where leaf values are scalar values,
> `null`, `true`, `false`, numbers, and strings).  Then if the caller is
> doing something JSONPath-like then the caller can probably immediately
> free almost all allocations and even terminate the parse early.

I think our current parser is event-based rather than this ... but it
seems like this could easily be built on top of it, if someone wanted
to.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: index prefetching

2024-01-04 Thread Tomas Vondra
Hi,

Here's a somewhat reworked version of the patch. My initial goal was to
see if it could adopt the StreamingRead API proposed in [1], but that
turned out to be less straightforward than I hoped, for two reasons:

(1) The StreamingRead API seems to be designed for pages, but the index
code naturally works with TIDs/tuples. Yes, the callbacks can associate
the blocks with custom data (in this case that'd be the TID), but it
seemed a bit strange ...

(2) The place adding requests to the StreamingRead queue is pretty far
from the place actually reading the pages - for prefetching, the
requests would be generated in nodeIndexscan, but the page reading
happens somewhere deep in index_fetch_heap/heapam_index_fetch_tuple.
Sure, the TIDs would come from a callback, so it's a bit as if the
requests were generated in heapam_index_fetch_tuple - but it has no idea
StreamingRead exists, so where would it get it.

We might teach it about it, but what if there are multiple places
calling index_fetch_heap()? Not all of them may be using StreamingRead
(only indexscans would do that). Or if there are multiple index scans,
there'd need to be separate StreamingRead queues, right?

In any case, I felt a bit out of my depth here, and I chose not to do
all this work without discussing the direction here. (Also, see the
point about cursors and xs_heap_continue a bit later in this post.)


I did however like the general StreamingRead API - how it splits the
work between the API and the callback. The patch used to do everything,
which meant it hardcoded a lot of the IOS-specific logic etc. I did plan
to have some sort of "callback" for reading from the queue, but that
didn't quite solve this issue - a lot of the stuff remained hard-coded.
But the StreamingRead API made me realize that having a callback for the
first phase (that adds requests to the queue) would fix that.

So I did that - there's now one simple callback in for index scans, and
a bit more complex callback for index-only scans. Thanks to this the
hard-coded stuff mostly disappears, which is good.

Perhaps a bigger change is that I decided to move this into a separate
API on top of indexam.c. The original idea was to integrate this into
index_getnext_tid/index_getnext_slot, so that all callers benefit from
the prefetching automatically. Which would be nice, but it also meant
it'd need to happen in the indexam.c code, which seemed dirty.

This patch introduces an API similar to StreamingRead. It calls the
indexam.c stuff, but does all the prefetching on top of it, not in it.
If a place calling index_getnext_tid() wants to allow prefetching, it
needs to switch to IndexPrefetchNext(). (There's no function that would
replace index_getnext_slot, at the moment. Maybe there should be.)

Note 1: The IndexPrefetch name is a bit misleading, because it's used
even with prefetching disabled - all index reads from the index scan
happen through it. Maybe it should be called IndexReader or something
like that.

Note 2: I left the code in indexam.c for now, but in principle it could
(should) be moved to a different place.

I think this layering makes sense, and it's probably much closer to what
Andres meant when he said the prefetching should happen in the executor.
Even if the patch ends up using StreamingRead in the future, I guess
we'll want something like IndexPrefetch - it might use the StreamingRead
internally, but it would still need to do some custom stuff to detect
I/O patterns or something that does not quite fit into the StreamingRead.


Now, let's talk about two (mostly unrelated) problems I ran into.

Firstly, I realized there's a bit of a problem with cursors. The
prefetching works like this:

1) reading TIDs from the index
2) stashing them into a queue in IndexPrefetch
3) doing prefetches for the new TIDs added to the queue
4) returning the TIDs to the caller, one by one
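
In other words, something like the deliberately simplified sketch below
(this is not the patch's actual IndexPrefetch API, just the general shape of
the four steps, ignoring ramp-up heuristics, index-only scans etc.):

#define PREFETCH_QUEUE_SIZE 32

typedef struct PrefetchQueue
{
	ItemPointerData tids[PREFETCH_QUEUE_SIZE];
	uint64		head;			/* next TID to hand to the caller */
	uint64		tail;			/* next free slot */
} PrefetchQueue;

static ItemPointer
prefetch_next(IndexScanDesc scan, Relation heapRel, PrefetchQueue *q)
{
	/* (1)+(2)+(3): top up the queue, prefetching each new heap block */
	while ((q->tail - q->head) < PREFETCH_QUEUE_SIZE)
	{
		ItemPointer tid = index_getnext_tid(scan, ForwardScanDirection);

		if (tid == NULL)
			break;

		q->tids[q->tail++ % PREFETCH_QUEUE_SIZE] = *tid;
		PrefetchBuffer(heapRel, MAIN_FORKNUM, ItemPointerGetBlockNumber(tid));
	}

	/* (4): return the oldest queued TID to the caller */
	if (q->head == q->tail)
		return NULL;
	return &q->tids[q->head++ % PREFETCH_QUEUE_SIZE];
}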

And all of this works ... unless the direction of the scan changes.
Which for cursors can happen if someone does FETCH BACKWARD or stuff
like that. I'm not sure how difficult it'd be to make this work. I
suppose we could simply discard the prefetched entries and do the right
number of steps back for the index scan. But I haven't tried, and maybe
it's more complex than I'm imagining. Also, if the cursor changes the
direction a lot, it'd make the prefetching harmful.

The patch simply disables prefetching for such queries, using the same
logic that we do for parallelism. This may be over-zealous.

FWIW this is one of the things that probably should remain outside of
StreamingRead API - it seems pretty index-specific, and I'm not sure
we'd even want to support these "backward" movements in the API.


The other issue I'm aware of is handling xs_heap_continue. I believe it
works fine for "false" but I need to take a look at non-MVCC snapshots
(i.e. when xs_heap_continue=true).


I haven't done any benchmarks with this reworked API - there's a couple
more allocations etc. but it did not change in a 

Re: the s_lock_stuck on perform_spin_delay

2024-01-04 Thread Andy Fan

Hi Matthias and Robert,

Matthias van de Meent  writes:

> On Thu, 4 Jan 2024 at 08:09, Andy Fan  wrote:
>>
>> My question is if someone doesn't obey the rule by mistake (everyone
>> can make mistake), shall we PANIC on a production environment? IMO I
>> think it can be a WARNING on a production environment and be a stuck
>> when 'ifdef USE_ASSERT_CHECKING'.
>> [...]
>> I think a experienced engineer like Thomas can make this mistake and the
>> patch was reviewed by 3 peoples, the bug is still there. It is not easy
>> to say just don't do it.
>>
>> the attached code show the prototype in my mind. Any feedback is welcome.
>
> While I understand your point and could maybe agree with the change
> itself (a crash isn't great),

It's great that both of you agree that the crash is not great. 

> I don't think it's an appropriate fix
> for the problem of holding a spinlock while waiting for a LwLock, as
> spin.h specifically mentions the following (and you quoted the same):
>
> """
> Keep in mind the coding rule that spinlocks must not be held for more
> than a few instructions.
> """

Yes, I agree that the known cases of [LW]LockAcquire after holding a
spinlock should be fixed at the first opportunity rather than pandered
to by my previous patch. My previous patch just takes care of the
*unknown* cases (and I cc'ed Thomas in the hope that he knows about the
bug). I also agree that the special case of [LW]LockAcquire should be
detected more effectively, as you suggested below. So here comes v2,
and commit 2 is for this suggestion.

>
> I suspect that we'd be better off with stronger protections against
> waiting for LwLocks while we hold any spin lock. More specifically,
> I'm thinking about something like tracking how many spin locks we
> hold, and Assert()-ing that we don't hold any such locks when we start
> to wait for an LwLock or run CHECK_FOR_INTERRUPTS-related code (with
> potential manual contextual overrides in specific areas of code where
> specific care has been taken to make it safe to hold spin locks while
> doing those operations - although I consider their existence unlikely
> I can't rule them out as I've yet to go through all lock-touching
> code). This would probably work in a similar manner as
> START_CRIT_SECTION/END_CRIT_SECTION.
>
> Kind regards,
>
> Matthias van de Meent
> Neon (https://neon.tech)


-- 
Best Regards
Andy Fan

>From 7d7fd0f0e9b13a290bfffaec0ad40773191155f2 Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" 
Date: Thu, 4 Jan 2024 14:33:37 +0800
Subject: [PATCH v2 1/2] Don't call s_lock_stuck in production environment

In the current implementation, if a spin lock is misused, the
s_lock_stuck in perform_spin_delay can cause the entire system to
PANIC. In order to balance fault tolerance and the ability to detect
incorrect usage, we can use WARNING to replace s_lock_stuck in the
production environment, but still use s_lock_stuck in builds with
USE_ASSERT_CHECKING.
---
 src/backend/storage/lmgr/s_lock.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/src/backend/storage/lmgr/s_lock.c b/src/backend/storage/lmgr/s_lock.c
index 327ac64f7c..9446605122 100644
--- a/src/backend/storage/lmgr/s_lock.c
+++ b/src/backend/storage/lmgr/s_lock.c
@@ -67,6 +67,7 @@ slock_t		dummy_spinlock;
 static int	spins_per_delay = DEFAULT_SPINS_PER_DELAY;
 
 
+#ifdef USE_ASSERT_CHECKING
 /*
  * s_lock_stuck() - complain about a stuck spinlock
  */
@@ -85,6 +86,7 @@ s_lock_stuck(const char *file, int line, const char *func)
 		 func, file, line);
 #endif
 }
+#endif
 
 /*
  * s_lock(lock) - platform-independent portion of waiting for a spinlock.
@@ -132,7 +134,14 @@ perform_spin_delay(SpinDelayStatus *status)
 	if (++(status->spins) >= spins_per_delay)
 	{
 		if (++(status->delays) > NUM_DELAYS)
+		{
+#ifdef USE_ASSERT_CHECKING
 			s_lock_stuck(status->file, status->line, status->func);
+#else
+			if (status->delays % NUM_DELAYS == 0)
+elog(WARNING, "perform spin lock on %s %d times", status->func, status->delays);
+#endif
+		}
 
 		if (status->cur_delay == 0) /* first time to delay? */
 			status->cur_delay = MIN_DELAY_USEC;
-- 
2.34.1

>From fae6241960a2c0df8df5d83001ce298228f4cbf2 Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" 
Date: Thu, 4 Jan 2024 22:19:50 +0800
Subject: [PATCH v2 2/2] Disallow [LW]LockAcquire when holding a spin
 lock-file-mode

Spinlocks are intended for *very* short-term locks, but sometimes it is
misused, this commit guards no [LW]LockAcquire should be called when
holding a spin lock.
---
 src/backend/storage/buffer/bufmgr.c |  1 +
 src/backend/storage/lmgr/lock.c |  2 ++
 src/backend/storage/lmgr/lwlock.c   |  2 ++
 src/backend/utils/init/globals.c|  1 +
 src/include/storage/buf_internals.h |  1 +
 src/include/storage/spin.h  | 24 ++--
 6 files changed, 29 insertions(+), 2 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 9f9d3f24ac..d2534ee9aa 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ 

Re: Synchronizing slots from primary to standby

2024-01-04 Thread Bertrand Drouvot
Hi,

On Thu, Jan 04, 2024 at 10:27:31AM +0530, shveta malik wrote:
> On Thu, Jan 4, 2024 at 9:18 AM shveta malik  wrote:
> >
> > On Wed, Jan 3, 2024 at 6:33 PM Zhijie Hou (Fujitsu)
> >  wrote:
> > >
> > > On Tuesday, January 2, 2024 6:32 PM shveta malik  
> > > wrote:
> > > > On Fri, Dec 29, 2023 at 10:25 AM Amit Kapila 
> > > >
> > > > The topup patch has also changed app_name to
> > > > {cluster_name}_slotsyncworker so that we do not confuse between 
> > > > walreceiver
> > > > and slotsyncworker entry.
> > > >
> > > > Please note that there is no change in rest of the patches, changes are 
> > > > in
> > > > additional 0004 patch alone.
> > >
> > > Attach the V56 patch set which supports ALTER SUBSCRIPTION SET (failover).
> > > This is useful when user want to refresh the publication tables, they can 
> > > now alter the
> > > failover option to false and then execute the refresh command.
> > >
> > > Best Regards,
> > > Hou zj
> >
> > The patches no longer apply to HEAD due to a recent commit 007693f. I
> > am working on rebasing and will post the new patches soon
> >
> > thanks
> > Shveta
> 
> Commit 007693f has changed 'conflicting' to 'conflict_reason', so
> adjusted the code around that in the slotsync worker.
> 
> Also removed function 'pg_get_slot_invalidation_cause' as now
> conflict_reason tells the same.
> 
> PFA rebased patches with above changes.
> 

Thanks!

Looking at 0004:

1 ===

-libpqrcv_connect(const char *conninfo, bool logical, bool must_use_password,
-const char *appname, char **err)
+libpqrcv_connect(const char *conninfo, bool replication, bool logical,
+bool must_use_password, const char *appname, 
char **err)

What about adjusting the preceding comment a bit to describe what the new 
replication
parameter is for?

2 ===

+   /* We can not have logical w/o replication */

what about replacing w/o by without?

3 ===

+   if(!replication)
+   Assert(!logical);
+
+   if (replication)
{

what about using "if () else" instead (to avoid unnecessary test)?
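
i.e. something along these lines (just a sketch of the shape, with keys/vals
standing for the existing connection-parameter arrays in libpqrcv_connect()):

	if (replication)
	{
		/* existing code: add the replication=... connection keyword */
		keys[i] = "replication";
		vals[i] = logical ? "database" : "true";
		i++;
	}
	else
	{
		/* a non-replication connection can never be a logical one */
		Assert(!logical);
	}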


Having said that, the patch seems a reasonable way to implement a
non-replication connection in the slotsync worker.

4 ===

Looking closer, the only place where walrcv_connect() is called with replication
set to false and logical set to false is in ReplSlotSyncWorkerMain().

That does make sense, but what do you think about creating dedicated 
libpqslotsyncwrkr_connect
and slotsyncwrkr_connect (instead of using the libpqrcv_connect / 
walrcv_connect ones)?

That way we could make use of slotsyncwrkr_connect() in ReplSlotSyncWorkerMain()
as I think it's confusing to use "rcv" functions while the process using them is
not of backend type walreceiver.

I'm not sure that's worth the extra complexity though, what do you think?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: the s_lock_stuck on perform_spin_delay

2024-01-04 Thread Robert Haas
On Thu, Jan 4, 2024 at 2:09 AM Andy Fan  wrote:
> My question is if someone doesn't obey the rule by mistake (everyone
> can make mistake), shall we PANIC on a production environment? IMO I
> think it can be a WARNING on a production environment and be a stuck
> when 'ifdef USE_ASSERT_CHECKING'.
>
> People may think spin lock may consume too much CPU, but it is not true
> in the discussed scene since perform_spin_delay have pg_usleep in it,
> and the MAX_DELAY_USEC is 1 second and MIN_DELAY_USEC is 0.001s.
>
> I notice this issue actually because of the patch "Cache relation
> sizes?" from Thomas Munro [1], where the latest patch[2] still have the
> following code.
> +   sr = smgr_alloc_sr();  <-- HERE a spin lock is hold
> +
> +   /* Upgrade to exclusive lock so we can create a mapping. */
> +   LWLockAcquire(mapping_lock, LW_EXCLUSIVE); <-- HERE a complex
>   operation is needed. it may take a long time.

I'm not sure that the approach this patch takes is correct in detail,
but I kind of agree with you about the overall point. I mean, the idea
of the PANIC is to avoid having the system just sit there in a state
from which it will never recover ... but it can also have the effect
of killing a system that wasn't really dead. I'm not sure what the
best thing to do here is, but it's worth talking about, IMHO.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Add a perl function in Cluster.pm to generate WAL

2024-01-04 Thread Alexander Lakhin

Hello Tom,

04.01.2024 02:39, Tom Lane wrote:

Buildfarm member skink has failed 3 times in
035_standby_logical_decoding.pl in the last couple of days:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2024-01-03%2020%3A07%3A15

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2024-01-03%2017%3A09%3A27

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2024-01-01%2020%3A10%3A18

These all look like

# poll_query_until timed out executing this query:
# select (confl_active_logicalslot = 1) from pg_stat_database_conflicts where 
datname = 'testdb'
# expecting this output:
# t
# last actual query output:
# f

although it's notable that two different tests are involved
(vacuum vs. vacuum full).

I am not real sure what is happening there, but I see that c161ab74f
changed some details of how that test works, so I wonder if it's
responsible for these failures.  The timing isn't a perfect match,
since this commit went in two weeks ago, but I don't see any
more-recent commits that seem like plausible explanations.


Reproduced here.
As I can see in the failure logs you referenced, the first problem is:
#   Failed test 'inactiveslot slot invalidation is logged with vacuum on 
pg_authid'
#   at t/035_standby_logical_decoding.pl line 224.

It reminded me of:
https://www.postgresql.org/message-id/b2a1f7d0-7629-72c0-2da7-e9c4e336b010%40gmail.com

It seems that it's not something new, and maybe my analysis is still
valid. If so, VACUUM FREEZE/autovacuum = off should fix the issue.

Best regards,
Alexander




Re: Adding facility for injection points (or probe points?) for more advanced tests

2024-01-04 Thread Ashutosh Bapat
On Tue, Jan 2, 2024 at 3:36 PM Ashutosh Bapat
 wrote:

>
> I will look at 0002 next.

One more comment on 0001
InjectionPointAttach() doesn't test whether the given function exists
in the given library. Even if InjectionPointAttach() succeeds,
INJECTION_POINT might throw error because the function doesn't exist.
This can be seen as an unwanted behaviour. I think
InjectionPointAttach() should test whether the function exists and
possibly load it as well by adding it to the local cache.
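
One cheap way to do that (a sketch using the existing dfmgr API, assuming the
library string can be fed to load_external_function() directly;
injection_point_cache_add() is just a stand-in for whatever local-cache
helper the patch ends up providing):

	/*
	 * In InjectionPointAttach(), resolve the callback up front so a bogus
	 * library/function pair fails here rather than at the first
	 * INJECTION_POINT().
	 */
	void	   *callback;

	callback = load_external_function(library, function,
									  false /* signalNotFound */ , NULL);
	if (callback == NULL)
		ereport(ERROR,
				(errcode(ERRCODE_UNDEFINED_FUNCTION),
				 errmsg("could not find function \"%s\" in library \"%s\"",
						function, library)));

	injection_point_cache_add(name, callback);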

0002 comments
--- /dev/null
+++ b/src/test/modules/test_injection_points/expected/test_injection_points.out

When built without injection point support, this test fails. We should
add an alternate output file for such a build so that the behaviour
with and without injection point support is tested. Or set up things
such that the test is not run under make check in that directory. I
would prefer the first option.

+
+SELECT test_injection_points_run('TestInjectionError'); -- error
+ERROR: error triggered for injection point TestInjectionError
+-- Re-load and run again.

What's getting Re-loaded here? \c will create a new connection and
thus a new backend. Maybe the comment should say "test in a fresh
backend" or something of that sort?

+
+SELECT test_injection_points_run('TestInjectionError'); -- error
+ERROR: error triggered for injection point TestInjectionError
+-- Remove one entry and check the other one.

Looks confusing to me, we are testing the one removed as well. Am I
missing something?

+(1 row)
+
+-- All entries removed, nothing happens

We aren't removing all entries; TestInjectionLog2 is still there. Am I
missing something?

0003 looks mostly OK.

0004 comments

+
+# after recovery, the server will not start, and log PANIC: could not
locate a valid checkpoint record

IIUC the comment describes the behaviour with 7863ee4def65 reverted.
But the test after this comment is written for the behaviour with
7863ee4def65. That's confusing. Is the intent to describe both
behaviours in the comment?

+
+ /* And sleep.. */
+ ConditionVariablePrepareToSleep(_state->wait_point);
+ ConditionVariableSleep(_state->wait_point, test_injection_wait_event);
+ ConditionVariableCancelSleep();

According to the prologue of ConditionVariableSleep(), that function
should be called in a loop checking for the desired condition. All the
callers that I examined follow that pattern. I think we need to follow
that pattern here as well.

Below comment from ConditionVariableTimedSleep() makes me think that
the caller of ConditionVariableSleep() can be woken up even if the
condition variable was not signaled. That's why the while() loop
around ConditionVariableSleep().

* If we're still in the wait list, then the latch must have been set
* by something other than ConditionVariableSignal; though we don't
* guarantee not to return spuriously, we'll avoid this obvious case.
*/.
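
i.e. something like the sketch below, where inj_state matches the quoted
_state and wait_done stands for whatever shared flag the wakeup side sets
(the name is made up here):

	/* Canonical pattern: re-check the wait condition around the sleep. */
	ConditionVariablePrepareToSleep(&inj_state->wait_point);
	while (!inj_state->wait_done)
		ConditionVariableSleep(&inj_state->wait_point,
							   test_injection_wait_event);
	ConditionVariableCancelSleep();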

That's all I have for now.

-- 
Best Wishes,
Ashutosh Bapat




Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

2024-01-04 Thread Michail Nikolaev
Hello!

> Correct, but there are changes being discussed where we would freeze
> tuples during pruning as well [0], which would invalidate that
> implementation detail. And, if I had to choose between improved
> opportunistic freezing and improved R/CIC, I'd probably choose
> improved freezing over R/CIC.

As another option, we could extract a dedicated horizon value for
opportunistic freezing, and use some flags in the R/CIC backend to keep
it at the required value.

Best regards,
Michail.




Re: INFORMATION_SCHEMA note

2024-01-04 Thread Tatsuo Ishii
(typo in the subject fixed)

> In the following paragraph in information_schema:
> 
>  character encoding form
>  
>   
>An encoding of some character repertoire.  Most older character
>repertoires only use one encoding form, and so there are no
>separate names for them (e.g., LATIN1 is an
>encoding form applicable to the LATIN1
>repertoire).  But for example Unicode has the encoding forms
>UTF8, UTF16, etc. (not
>all supported by PostgreSQL).  Encoding forms are not exposed
>as an SQL object, but are visible in this view.
> 
> This claims that the LATIN1 repertoire only uses one encoding form,
> but actually LATIN1 can be encoded in another form: ISO-2022-JP-2 (a 7
> bit encoding. See RFC 1554
> (https://datatracker.ietf.org/doc/html/rfc1554) for more details).
> 
> If we still want to list a use-one-encoding-form example, probably we
> could use LATIN2 instead or others that are not supported by
> ISO-2022-JP-2 (ISO-2022-JP-2 supports LATIN1 and LATIN7).
> 
> Attached is the patch that does this.

Any objection?
--
Tatsuo Ishii
SRA OSS LLC
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp




Re: Adding facility for injection points (or probe points?) for more advanced tests

2024-01-04 Thread Ashutosh Bapat
On Thu, Jan 4, 2024 at 5:23 AM Michael Paquier  wrote:
>
> > 0003 and 0004 add tests to the test_injection_points module.  Is the idea
> > that we'd add any tests that required injection points here?  I think it'd
> > be better if we could move the tests closer to the logic they're testing,
> > but perhaps that is difficult because you also need to define the callback
> > functions somewhere.  Hm...
>
> Yeah.  Agreed that the final result should not have these tests in the
> module test_injection_points.  What I was thinking here is to move
> 002_invalid_checkpoint_after_promote.pl to src/test/recovery/ and pull
> the module with the callbacks with an EXTRA_INSTALL.

0003 and 0004 are using the extension in this module for some serious
testing. The name of the extension test_injection_point indicates that
it's for testing injection points and not for some serious use of
injection callbacks it adds. Changes 0003 and 0004 suggest otherwise.
I suggest we move test_injection_points from src/test/modules to
contrib/ and rename it as "injection_points". The test files may still
be named as test_injection_point. The TAP tests in 0003 and 0004 once
moved to their appropriate places, will load injection_point extension
and use it. That way predefined injection point callbacks will also be
available for others to use.

-- 
Best Wishes,
Ashutosh Bapat




Re: speed up a logical replica setup

2024-01-04 Thread Ashutosh Bapat
On Thu, Jan 4, 2024 at 4:34 PM Amit Kapila  wrote:
> > But what good is having the same publications as primary
> > also on logical replica?
> >
>
> The one use case that comes to my mind is to set up bi-directional
> replication. The publishers want to subscribe to the new subscriber.

Hmm. Looks like another user controlled cleanup.

-- 
Best Wishes,
Ashutosh Bapat




Re: Random pg_upgrade test failure on drongo

2024-01-04 Thread Alexander Lakhin

Hello Amit,

03.01.2024 14:42, Amit Kapila wrote:



So I started to think about other approach: to perform unlink as it's
implemented now, but then wait until the DELETE_PENDING state is gone.


There is a comment in the code which suggests we shouldn't wait
indefinitely. See "However, we won't wait indefinitely for someone
else to close the file, as the caller might be holding locks and
blocking other backends."


Yes, I saw it, but initially I thought that we have a transient condition
there, so waiting in open() (instead of failing immediately) seemed like a
good idea then...


And the internal process is ... background writer (BgBufferSync()).

So, I tried just adding bgwriter_lru_maxpages = 0 to postgresql.conf and
got 20 x 10 tests passing.

Thus, it we want just to get rid of the test failure, maybe it's enough to
add this to the test's config...


What about checkpoints? Can't it do the same while writing the buffers?


As we deal here with pg_upgrade/pg_restore, it must not be very easy to get
the desired effect, but I think it's not impossible in principle.
More details below.
What happens during the pg_upgrade execution is essentially:
1) CREATE DATABASE "postgres" WITH TEMPLATE = template0 OID = 5 ...;
-- this command flushes file buffers as well
2) ALTER DATABASE postgres OWNER TO ...
3) COMMENT ON DATABASE "postgres" IS ...
4) -- For binary upgrade, preserve pg_largeobject and index relfilenodes
    SELECT 
pg_catalog.binary_upgrade_set_next_index_relfilenode('2683'::pg_catalog.oid);
    SELECT 
pg_catalog.binary_upgrade_set_next_heap_relfilenode('2613'::pg_catalog.oid);
    TRUNCATE pg_catalog.pg_largeobject;
--  ^^^ here we can get the error "could not create file "base/5/2683": File 
exists"
...

We get the effect discussed when the background writer process decides to
flush a file buffer for pg_largeobject during stage 1.
(Thus, if a checkpoint somehow happened to occur during CREATE DATABASE,
the result must be the same.)
And another important factor is shared_buffers = 1MB (set during the test).
With the default setting of 128MB I couldn't see the failure.

It can be reproduced easily (on old Windows versions) just by running
pg_upgrade in a loop (I've got failures on iterations 22, 37, 17 (with the
default cluster)).
If an old cluster contains a dozen databases, this increases the failure
probability significantly (with 10 additional databases I've got failures
on iterations 4, 1, 6).

Please see the reproducing script attached.

Best regards,
Alexander

@echo off
REM define PGBIN and PATH
set PGBIN=%cd%\tmp_install\usr\local\pgsql\bin
set PATH=%PGBIN%;%PATH%

setlocal enabledelayedexpansion

rmdir /S /Q tmpdb
initdb -D tmpdb >initdb.log 2>&1
echo shared_buffers = 1MB>>tmpdb\postgresql.conf

set /a i = 0
:LOOP
set /a i+=1
echo ITERATION %i%

rmdir /S /Q tmpdb_old
xcopy /e /i /q /r tmpdb tmpdb_old\

pg_ctl -D tmpdb_old -l server.log start
echo SELECT concat('CREATE DATABASE db', g) FROM generate_series(1, 10) g 
\gexec | psql -d postgres
pg_ctl -D tmpdb_old stop

rmdir /S /Q tmpdb_new
xcopy /e /i /q /r tmpdb tmpdb_new\

pg_upgrade --no-sync -r -b "%PGBIN%" -B "%PGBIN%" -d tmpdb_old -D tmpdb_new
if %ERRORLEVEL% NEQ 0 goto ERR
goto LOOP

:ERR
echo ERROR
pause

:EXIT


Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

2024-01-04 Thread Matthias van de Meent
On Mon, 25 Dec 2023 at 15:12, Michail Nikolaev
 wrote:
>
> Hello!
>
> It seems like the idea of "old" snapshot is still a valid one.
>
> > Should this deal with any potential XID wraparound, too?
>
> As far as I understand in our case, we are not affected by this in any way.
> Vacuum in our table is not possible because of locking, so, nothing
> may be frozen (see below).
> In the case of super long index building, transactional limits will
> stop new connections using current
> regular infrastructure because it is based on relation data (but not
> actual xmin of backends).
>
> > How does this behave when the newly inserted tuple's xmin gets frozen?
> > This would be allowed to happen during heap page pruning, afaik - no
> > rules that I know of which are against that - but it would create
> > issues where normal snapshot visibility rules would indicate it
> > visible to both snapshots regardless of whether it actually was
> > visible to the older snapshot when that snapshot was created...
>
> As I can see, heap_page_prune never freezes any tuples.
> In the case of regular vacuum, it used this way: call heap_page_prune
> and then call heap_prepare_freeze_tuple and then
> heap_freeze_execute_prepared.

Correct, but there are changes being discussed where we would freeze
tuples during pruning as well [0], which would invalidate that
implementation detail. And, if I had to choose between improved
opportunistic freezing and improved R/CIC, I'd probably choose
improved freezing over R/CIC.

As an alternative, we _could_ keep track of concurrent index inserts
using a dummy index (with the same predicate) which only holds the
TIDs of the inserted tuples. We'd keep it as an empty index in phase
1, and every time we reset the visibility snapshot we now only need to
scan that index to know what tuples were concurrently inserted. This
should have a significantly lower IO overhead than repeated full index
bulkdelete scans for the new index in the second table scan phase of
R/CIC. However, in a worst case it could still require another
O(tablesize) of storage.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

[0] 
https://www.postgresql.org/message-id/caakru_a+g2oe6ahjcbibftnfiy2aib4e31x9qyj_qkjxzmz...@mail.gmail.com




Re: speed up a logical replica setup

2024-01-04 Thread Amit Kapila
On Thu, Jan 4, 2024 at 12:22 PM Shlok Kyal  wrote:
>
> Hi,
> I was testing the patch with following test cases:
>
> Test 1 :
> - Create a 'primary' node
> - Setup physical replica using pg_basebackup  "./pg_basebackup –h
> localhost –X stream –v –R –W –D ../standby "
> - Insert data before and after pg_basebackup
> - Run pg_subscriber and then insert some data to check logical
> replication "./pg_subscriber –D ../standby -S “host=localhost
> port=9000 dbname=postgres” -P “host=localhost port=9000
> dbname=postgres” -d postgres"
> - Also check pg_publication, pg_subscriber and pg_replication_slots tables.
>
> Observation:
> Data is not lost. Replication is happening correctly. Pg_subscriber is
> working as expected.
>
> Test 2:
> - Create a 'primary' node
> - Use normal pg_basebackup but don’t set up Physical replication
> "./pg_basebackup –h localhost –v –W –D ../standby"
> - Insert data before and after pg_basebackup
> - Run pg_subscriber
>
> Observation:
> Pg_subscriber command is not completing and is stuck with following
> log repeating:
> LOG: waiting for WAL to become available at 0/3000168
> LOG: invalid record length at 0/3000150: expected at least 24, got 0
>

I think probably the required WAL is not copied. Can you use the -X
option to stream WAL as well and then test? But I feel in this case
also, we should wait for some threshold time and then exit with
failure, removing new objects created, if any.

> Test 3:
> - Create a 'primary' node
> - Use normal pg_basebackup but don’t set up Physical replication
> "./pg_basebackup –h localhost –v –W –D ../standby"
> -Insert data before pg_basebackup but not after pg_basebackup
> -Run pg_subscriber
>
> Observation:
> Pg_subscriber command is not completing and is stuck with following
> log repeating:
> LOG: waiting for WAL to become available at 0/3000168
> LOG: invalid record length at 0/3000150: expected at least 24, got 0
>

This is similar to the previous test and you can try the same option
here as well.

-- 
With Regards,
Amit Kapila.




Re: speed up a logical replica setup

2024-01-04 Thread Amit Kapila
On Thu, Jan 4, 2024 at 12:30 PM Ashutosh Bapat
 wrote:
>
> On Wed, Jan 3, 2024 at 2:49 PM Amit Kapila  wrote:
> >
> >  c) Drop the replication slots d) Drop the
> > > publications
> > >
> >
> > I am not so sure about dropping publications because, unlike
> > subscriptions which can start to pull the data, there is no harm with
> > publications. Similar to publications there could be some user-defined
> > functions or other other objects which may not be required once the
> > standby replica is converted to subscriber. I guess we need to leave
> > those to the user.
> >
>
> IIUC, primary use of pg_subscriber utility is to start a logical
> subscription from a physical base backup (to reduce initial sync time)
> as against logical backup taken while creating a subscription. Hence I
> am expecting that apart from this difference, the resultant logical
> replica should look similar to the logical replica setup using a
> logical subscription sync. Hence we should not leave any replication
> objects around. UDFs (views, and other objects) may have some use on a
> logical replica. We may replicate changes to UDF once DDL replication
> is supported. But what good is having the same publications as primary
> also on logical replica?
>

The one use case that comes to my mind is to set up bi-directional
replication. The publishers want to subscribe to the new subscriber.

-- 
With Regards,
Amit Kapila.




Re: the s_lock_stuck on perform_spin_delay

2024-01-04 Thread Matthias van de Meent
On Thu, 4 Jan 2024 at 08:09, Andy Fan  wrote:
>
> My question is if someone doesn't obey the rule by mistake (everyone
> can make mistake), shall we PANIC on a production environment? IMO I
> think it can be a WARNING on a production environment and be a stuck
> when 'ifdef USE_ASSERT_CHECKING'.
> [...]
> I think a experienced engineer like Thomas can make this mistake and the
> patch was reviewed by 3 peoples, the bug is still there. It is not easy
> to say just don't do it.
>
> the attached code show the prototype in my mind. Any feedback is welcome.

While I understand your point and could maybe agree with the change
itself (a crash isn't great), I don't think it's an appropriate fix
for the problem of holding a spinlock while waiting for a LwLock, as
spin.h specifically mentions the following (and you quoted the same):

"""
Keep in mind the coding rule that spinlocks must not be held for more
than a few instructions.
"""

I suspect that we'd be better off with stronger protections against
waiting for LwLocks while we hold any spin lock. More specifically,
I'm thinking about something like tracking how many spin locks we
hold, and Assert()-ing that we don't hold any such locks when we start
to wait for an LwLock or run CHECK_FOR_INTERRUPTS-related code (with
potential manual contextual overrides in specific areas of code where
specific care has been taken to make it safe to hold spin locks while
doing those operations - although I consider their existence unlikely
I can't rule them out as I've yet to go through all lock-touching
code). This would probably work in a similar manner as
START_CRIT_SECTION/END_CRIT_SECTION.
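
For what it's worth, a minimal sketch of that idea could look like the
following (the names are invented, and this glosses over error cleanup and
where the counter should live):

/* Hypothetical instrumentation, in the spirit of START/END_CRIT_SECTION. */
extern PGDLLIMPORT int SpinLockHeldCount;

#define RECORD_SPIN_LOCK_ACQUIRE()	(SpinLockHeldCount++)
#define RECORD_SPIN_LOCK_RELEASE() \
	do { \
		Assert(SpinLockHeldCount > 0); \
		SpinLockHeldCount--; \
	} while (0)

/*
 * SpinLockAcquire()/SpinLockRelease() would invoke the two macros above,
 * while LWLockAcquire(), ConditionVariableSleep(), CHECK_FOR_INTERRUPTS()
 * and friends would start with:
 */
#define VERIFY_NO_SPIN_LOCKS_HELD()	Assert(SpinLockHeldCount == 0)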

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: pg_upgrade and logical replication

2024-01-04 Thread vignesh C
On Tue, 2 Jan 2024 at 15:58, Amit Kapila  wrote:
>
> On Fri, Dec 29, 2023 at 2:26 PM vignesh C  wrote:
> >
> > On Thu, 28 Dec 2023 at 15:59, Amit Kapila  wrote:
> > >
> > > On Wed, Dec 13, 2023 at 12:09 PM vignesh C  wrote:
> > > >
> > > > Thanks for the comments, the attached v25 version patch has the
> > > > changes for the same.
> > > >
> > >
> > > I have looked at it again and made some cosmetic changes like changing
> > > some comments and a minor change in one of the error messages. See, if
> > > the changes look okay to you.
> >
> > Thanks, the changes look good.
> >
>
> Pushed.

Thanks for pushing this patch, I have updated the commitfest entry to
Committed for the same.

Regards,
Vignesh




Re: pg_upgrade and logical replication

2024-01-04 Thread vignesh C
On Wed, 3 Jan 2024 at 11:25, Amit Kapila  wrote:
>
> On Wed, Jan 3, 2024 at 6:21 AM Michael Paquier  wrote:
> >
> > On Tue, Jan 02, 2024 at 03:58:25PM +0530, Amit Kapila wrote:
> > > On Fri, Dec 29, 2023 at 2:26 PM vignesh C  wrote:
> > >> Thanks, the changes look good.
> > >
> > > Pushed.
> >
> > Yeah!  Thanks Amit and everybody involved here!  Thanks also to Julien
> > for raising the thread and the problem, to start with.
> >
>
> I think the next possible step here is to document how to upgrade the
> logical replication nodes as previously discussed in this thread [1].
> IIRC, there were a few issues with the steps mentioned but if we want
> to document those we can start a separate thread for it as that
> involves both publishers and subscribers.

I have posted a patch for this at:
https://www.postgresql.org/message-id/CALDaNm1_iDO6srWzntqTr0ZDVkk2whVhNKEWAvtgZBfSmuBeZQ%40mail.gmail.com

Regards,
Vignesh




Re: Synchronizing slots from primary to standby

2024-01-04 Thread Amit Kapila
On Wed, Jan 3, 2024 at 4:57 PM Bertrand Drouvot
 wrote:
>
> On Wed, Jan 03, 2024 at 04:20:03PM +0530, Amit Kapila wrote:
> > On Fri, Dec 29, 2023 at 12:32 PM Amit Kapila  
> > wrote:
> > >
> > > On Fri, Dec 29, 2023 at 6:59 AM Masahiko Sawada  
> > > wrote:
> > > >
> > > > On Wed, Dec 27, 2023 at 7:43 PM Amit Kapila  
> > > > wrote:
> > > > >
> > > > > >
> > > > > > 3) The slotsync worker uses primary_conninfo but also uses a new GUC
> > > > > > parameter, say slot_sync_dbname, to specify the database to connect.
> > > > > > The slot_sync_dbname overwrites the dbname if primary_conninfo also
> > > > > > specifies it. If both don't have a dbname, raise an error.
> > > > > >
> > > > >
> > > > > Would the users prefer to provide a value for a separate GUC instead
> > > > > of changing primary_conninfo? It is possible that we can have some
> > > > > users prefer to use one GUC and others prefer a separate GUC but we
> > > > > should add a new GUC if we are sure that is what users would prefer.
> > > > > Also, even if have to consider this option, I think we can easily
> > > > > later add a new GUC to provide a dbname in addition to having the
> > > > > provision of giving it in primary_conninfo.
> > > >
> > > > I think having two separate GUCs is more flexible for example when
> > > > users want to change the dbname to connect. It makes sense that the
> > > > slotsync worker wants to use the same connection string as the
> > > > walreceiver uses. But I guess today most primary_conninfo settings
> > > > that are set manually or are generated by tools such as pg_basebackup
> > > > don't have dbname. If we require a dbname in primary_conninfo, many
> > > > tools will need to be changed. Once the connection string is
> > > > generated, it would be tricky to change the dbname in it, as Shveta
> > > > mentioned. The users will have to carefully select the database to
> > > > connect when taking a base backup.
> > > >
> > >
> > > I see your point and agree that users need to be careful. I was trying
> > > to compare it with other places like the conninfo used with a
> > > subscription where no separate dbname needs to be provided. Now, here
> > > the situation is not the same because the same conninfo is used for
> > > different purposes (walreceiver doesn't require dbname (dbname is
> > > ignored even if present) whereas slotsyncworker requires dbname). I
> > > was just trying to see if we can avoid having a new GUC for this
> > > purpose. Does anyone else have an opinion on this matter?
> > >
> >
> > Bertrand, Dilip, and others involved in this thread or otherwise, see
> > if you can share an opinion on the above point because it would be
> > good to get some more opinions before we decide to add a new GUC (for
> > dbname) for slotsync worker.
> >
>
> I think that as long as enable_syncslot is off then there is no need to add 
> the
> dbname in primary_conninfo (means there is no need to change an existing 
> primary_conninfo
> for the ones that don't use the sync slot feature).
>
> So given that primary_conninfo does not necessary need to be changed (for 
> ones that
> don't use the sync slot feature) and that adding a new GUC looks more a 
> one-way door
> change to me, I'd vote to keep the patch as it is (we can still revisit this 
> later
> on and add a new GUC if we feel the need based on user's feedback).
>

Okay, thanks for the feedback. Dilip also shares the same opinion, so
let's wait and see if there is any strong argument to add this new
GUC.

-- 
With Regards,
Amit Kapila.




Documentation to upgrade logical replication cluster

2024-01-04 Thread vignesh C
Hi,

We have documentation on how to upgrade "publisher" and "subscriber"
at [1], but currently we do not have any documentation on how to
upgrade logical replication clusters.
Here is a patch to document how to upgrade different logical
replication clusters: a) Upgrade 2 node logical replication cluster b)
Upgrade cascaded logical replication cluster c) Upgrade 2 node
circular logical replication cluster.
Thoughts?

[1] - https://www.postgresql.org/docs/devel/pgupgrade.html

Regards,
Vignesh
From 9458a2c62a0702316d9ab339cd01dac0e088c52e Mon Sep 17 00:00:00 2001
From: Vignesh C 
Date: Wed, 13 Dec 2023 14:11:58 +0530
Subject: [PATCH v1] Documentation for upgrading logical replication cluster.

Documentation for upgrading logical replication cluster.
---
 doc/src/sgml/ref/pgupgrade.sgml | 618 +++-
 1 file changed, 616 insertions(+), 2 deletions(-)

diff --git a/doc/src/sgml/ref/pgupgrade.sgml b/doc/src/sgml/ref/pgupgrade.sgml
index 87be1fb1c2..87f453d509 100644
--- a/doc/src/sgml/ref/pgupgrade.sgml
+++ b/doc/src/sgml/ref/pgupgrade.sgml
@@ -383,7 +383,7 @@ make prefix=/usr/local/pgsql.new install
 

 
-   
+   
 Prepare for publisher upgrades
 
 
@@ -456,7 +456,7 @@ make prefix=/usr/local/pgsql.new install
 

 
-   
+   
 Prepare for subscriber upgrades
 
 
@@ -506,6 +506,620 @@ make prefix=/usr/local/pgsql.new install
 

 
+   
+Prepare for logical replication cluster upgrades
+
+
+ Migration of logical replication clusters can be done when all the members
+ of the old logical replication clusters are version 17.0 or later.
+
+
+   
+
+ The logical replication restrictions also apply to logical replication
+ cluster upgrades. See  for
+ the details of logical replication restrictions.
+
+
+ The prerequisites for a publisher upgrade also apply to logical replication
+ cluster upgrades. See 
+ for the details of publisher upgrade prerequisites.
+
+
+ The prerequisites for a subscriber upgrade also apply to logical replication
+ cluster upgrades. See 
+ for the details of subscriber upgrade prerequisites.
+
+   
+
+
+ 
+  Upgrading a logical replication cluster requires multiple steps to be
+  performed on various nodes. Because not all operations are
+  transactional, the user is advised to take backups. Backups can be taken
+  as described in .
+ 
+
+
+
+ The steps to upgrade logical replication clusters in various scenarios are
+ given below.
+
+
+
+ 
+  Steps to upgrade a two-node logical replication cluster
+
+   
+
+ 
+  Let's say the publisher is in node1 and the subscriber is
+  in node2.
+ 
+
+
+
+ 
+  Stop the publisher server in node1, for example:
+
+
+dba@node1:/opt/PostgreSQL/postgres//bin$ pg_ctl -D /opt/PostgreSQL/pub_data stop -l logfile
+
+
+ 
+
+
+
+ 
+  Disable all the subscriptions on node2 that are
+  subscribing to the changes from node1 by using
+  ALTER SUBSCRIPTION ... DISABLE,
+  for example:
+
+node2=# ALTER SUBSCRIPTION sub1_node1_node2 DISABLE;
+ALTER SUBSCRIPTION
+node2=# ALTER SUBSCRIPTION sub2_node1_node2 DISABLE;
+ALTER SUBSCRIPTION
+
+ 
+
+
+
+ 
+  Upgrade the publisher node node1's server to the
+  required newer version, for example:
+
+dba@node1:/opt/PostgreSQL/postgres//bin$ pg_upgrade
+--old-datadir "/opt/PostgreSQL/postgres/17/pub_data"
+--new-datadir "/opt/PostgreSQL/postgres//pub_upgraded_data"
+--old-bindir "/opt/PostgreSQL/postgres/17/bin"
+--new-bindir "/opt/PostgreSQL/postgres//bin"
+
+ 
+
+
+
+ 
+  Start the upgraded publisher node node1's server, for example:
+
+dba@node1:/opt/PostgreSQL/postgres//bin$ pg_ctl -D /opt/PostgreSQL/pub_upgraded_data start -l logfile
+
+ 
+
+
+
+ 
+  Stop the subscriber server in node2, for example:
+
+dba@node2:/opt/PostgreSQL/postgres//bin$ pg_ctl -D /opt/PostgreSQL/sub_data stop -l logfile
+
+ 
+
+
+
+ 
+  Upgrade the subscriber node node2's server to
+  the required new version, for example:
+
+dba@node2:/opt/PostgreSQL/postgres//bin$ pg_upgrade
+--old-datadir "/opt/PostgreSQL/postgres/17/sub_data"
+--new-datadir "/opt/PostgreSQL/postgres//sub_upgraded_data"
+--old-bindir "/opt/PostgreSQL/postgres/17/bin"
+--new-bindir "/opt/PostgreSQL/postgres//bin"
+
+ 
+
+
+
+ 
+  Start the upgraded subscriber node node2's server,
+  for example:
+
+dba@node2:/opt/PostgreSQL/postgres//bin$ pg_ctl -D /opt/PostgreSQL/sub_upgraded_data start -l logfile
+
+ 
+
+
+
+ 
+  Create any tables that were created in the upgraded 
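
As a hedged illustration of how this step typically continues (the table
definition is hypothetical; the subscription names reuse the ones from the
earlier examples), creating the missing tables on node2 and then re-enabling
and refreshing the subscriptions could look like:

    node2=# CREATE TABLE distributors (did integer PRIMARY KEY, name varchar(40));
    CREATE TABLE
    node2=# ALTER SUBSCRIPTION sub1_node1_node2 ENABLE;
    ALTER SUBSCRIPTION
    node2=# ALTER SUBSCRIPTION sub2_node1_node2 ENABLE;
    ALTER SUBSCRIPTION
    node2=# ALTER SUBSCRIPTION sub1_node1_node2 REFRESH PUBLICATION;
    ALTER SUBSCRIPTION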

Re: Transaction timeout

2024-01-04 Thread Andrey M. Borodin


> On 4 Jan 2024, at 07:14, Japin Li  wrote:
> 
> Is the timeout too short for testing? I see the timeouts for lock_timeout
> and statement_timeout are much bigger than transaction_timeout.

Makes sense. Done. I've also put some effort into fine-tuning the timeouts in
the tests for Nik's case. To keep a 100ms gap between the check, the false
positive, and the actual bug we had, I had to use transaction_timeout = 300ms.
Currently all the tests take more than 1000ms!
But I do not see a way to make these tests both stable and short.
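
To make the trade-off concrete, here is a minimal sketch of the kind of
behaviour these tests exercise (transaction_timeout is the GUC this patch
introduces; the sleep duration is just an illustrative value, not taken from
the test files):

    SET transaction_timeout = '300ms';
    BEGIN;
    SELECT pg_sleep(0.4);  -- expected to trip the 300ms transaction timeout
    COMMIT;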


Best regards, Andrey Borodin.


v22-0001-Introduce-transaction_timeout.patch
Description: Binary data


v22-0002-Add-tests-for-transaction_timeout.patch
Description: Binary data


v22-0003-Try-to-enable-transaction_timeout-before-next-co.patch
Description: Binary data


v22-0004-fix-reschedule-timeout-for-each-commmand.patch
Description: Binary data