Re: automating RangeTblEntry node support
On 18.02.24 00:06, Matthias van de Meent wrote:
> I'm not sure that the cleanup which is done when changing an RTE's rtekind is complete enough for this purpose. Things like inline_cte_walker change node->rtekind, which could leave residual junk data in fields that are currently dropped during serialization (as the rtekind specifically ignores those fields), but which would add overhead when the default omission is expected to handle these fields, as they could then contain junk. It looks like there is some care about zeroing now-unused fields, but I haven't checked that it covers all cases and fields to the extent required, so that removing this specialized serializer would have zero impact on size once the default omission patch is committed. An additional patch with a single function that clears junk fields from RTEs that changed kind would be appreciated: this is often hand-coded at the locations where the kind changes, but that's more sensitive to programmer error.

Yes, interesting idea. Or maybe an assert-like function that checks an existing structure for consistency. Or maybe both. I'll try this out.

In the meantime, if there are no remaining concerns, I propose to commit the first two patches:
- Remove custom Constraint node read/write implementations
- Remove custom _jumbleRangeTblEntry()
pg_restore option --clean
Hi, The --clean option of pg_restore drops each database object before it is recreated during the restore. However, dependencies such as foreign keys or views can prevent the object from being dropped. Is there a way to add the CASCADE option to force the deletion? Thanks for helping. Fabrice
Re: Switching XLog source from archive to streaming when primary available
On Tue, Feb 20, 2024 at 11:54 AM Japin Li wrote:
>
> On Tue, 20 Feb 2024 at 13:40, Bharath Rupireddy wrote:
> > On Mon, Feb 19, 2024 at 8:25 PM Japin Li wrote:
> >> [2]
> >> +# Ensure checkpoint doesn't come in our way
> >> +$primary->append_conf('postgresql.conf', qq(
> >> +min_wal_size = 2MB
> >> +max_wal_size = 1GB
> >> +checkpoint_timeout = 1h
> >> + autovacuum = off
> >> +));
> >>
> >> Keeping the same indentation might be better.
> >
> > The autovacuum line looks mis-indented in the patch file. However, I
> > now ran src/tools/pgindent/perltidyrc
> > src/test/recovery/t/041_wal_source_switch.pl on it.
>
> Thanks for updating the patch. It seems it still has the wrong indent.

Thanks. perltidyrc didn't complain about anything on v19. However, I kept the alignment the same as other TAP tests for multi-line append_conf. If that's not correct, I'll leave it to the committer to decide. PSA v20 patch.

-- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com

From beea9f0b8bbc76bb48dd1a5d64a5b52bafd09e6f Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy
Date: Tue, 20 Feb 2024 07:31:29 +
Subject: [PATCH v20] Allow standby to switch WAL source from archive to streaming

A standby typically switches to streaming replication (getting WAL from the primary) only when receiving WAL from the archive finishes (no more WAL left there) or fails for any reason. Reading WAL from the archive may not always be as efficient and fast as reading from the primary. This can be due to differences in disk types, IO costs, network latencies etc. All of these can impact the recovery performance on the standby and increase the replication lag on the primary. In addition, the primary keeps accumulating WAL needed for the standby while the standby reads WAL from the archive, because the standby's replication slot stays inactive. To avoid these problems, one can use this parameter to make the standby switch to streaming mode sooner.
This feature adds a new GUC that specifies the amount of time after which the standby attempts to switch the WAL source from WAL archive to streaming replication (getting WAL from the primary). However, the standby exhausts all the WAL present in pg_wal before switching. If the standby fails to switch to streaming mode, it falls back to archive mode.

Author: Bharath Rupireddy
Reviewed-by: Cary Huang, Nathan Bossart
Reviewed-by: Kyotaro Horiguchi, SATYANARAYANA NARLAPURAM
Discussion: https://www.postgresql.org/message-id/CAHg+QDdLmfpS0n0U3U+e+dw7X7jjEOsJJ0aLEsrtxs-tUyf5Ag@mail.gmail.com
---
 doc/src/sgml/config.sgml                      |  48
 doc/src/sgml/high-availability.sgml           |  15 ++-
 src/backend/access/transam/xlogrecovery.c     | 115 --
 src/backend/utils/misc/guc_tables.c           |  12 ++
 src/backend/utils/misc/postgresql.conf.sample |   4 +
 src/include/access/xlogrecovery.h             |   1 +
 src/test/recovery/meson.build                 |   1 +
 src/test/recovery/t/041_wal_source_switch.pl  | 110 +
 8 files changed, 287 insertions(+), 19 deletions(-)
 create mode 100644 src/test/recovery/t/041_wal_source_switch.pl

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index ffd711b7f2..2fbf1ad6e1 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4872,6 +4872,54 @@ ANY num_sync (
+ streaming_replication_retry_interval (integer)
+
+ streaming_replication_retry_interval configuration parameter
+
+
+
+
+Specifies the amount of time after which the standby attempts to switch the WAL
+source from archive to streaming replication (i.e., getting WAL from the
+primary). However, the standby exhausts all the WAL present in the
+pg_wal directory before switching. If the standby
+fails to switch to streaming mode, it falls back to archive mode. If this
+parameter value is specified without units, it is taken as
+milliseconds. The default is 5min. With a lower value
+for this parameter, the standby makes frequent WAL source switch
+attempts. To avoid this, it is recommended to set a reasonable value.
+A setting of 0 disables the feature.
When disabled,
+the standby typically switches to streaming mode only after receiving WAL
+from the archive finishes (i.e., no more WAL left there) or fails for any
+reason. This parameter can only be set in the
+postgresql.conf file or on the server command
+line.
+
+
+ The standby may not always attempt to switch the source from WAL archive to
+ streaming replication at exact
+ streaming_replication_retry_interval intervals. For
+ example, if the parameter is set to 1min and
+ fetching a WAL file from the archive takes about 2min,
+ then the source switch attempt happens for the next WAL file after the
+ current WAL file fetched from the archive is fully applied.
+
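For anyone wanting to try the patch, the feature is driven entirely by the new parameter; a postgresql.conf fragment might look like this (the parameter exists only with this patch applied, and the name and default come from the patch, not from released PostgreSQL):

```
# Try switching from archive to streaming every minute instead of the
# default 5min; 0 disables the early-switch behavior entirely.
streaming_replication_retry_interval = '1min'
```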
Re: confirmed flush lsn seems to be move backward in certain error cases
On Sat, 17 Feb 2024 at 12:03, Amit Kapila wrote: > > On Fri, Feb 16, 2024 at 5:53 PM vignesh C wrote: > > > > > > After the insert operation is replicated to the subscriber, the > > subscriber will set the lsn value sent by the publisher in the > > replication origin (in my case it was 0/1510978). publisher will then > > send keepalive messages with the current WAL position in the publisher > > (in my case it was 0/15109B0), but subscriber will simply send this > > position as the flush_lsn to the publisher as there are no ongoing > > transactions. Then since the publisher is started, it will identify > > that publication does not exist and stop the walsender/apply worker > > process. When the apply worker is restarted, we will get the > > remote_lsn(in my case it was 0/1510978) of the origin and set it to > > origin_startpos. We will start the apply worker with this > > origin_startpos (origin's remote_lsn). This position will be sent as > > feedback to the walsender process from the below stack: > > run_apply_worker->start_apply->LogicalRepApplyLoop->send_feedback. > > It will use the following send_feedback function call of > > LogicalRepApplyLoop function as in below code here as nothing is > > received from walsender: > > LogicalRepApplyLoop function > > ... > > len = walrcv_receive(LogRepWorkerWalRcvConn, &buf, &fd); > > if (len != 0) > > { > >/* Loop to process all available data (without blocking). */ > >for (;;) > >{ > > CHECK_FOR_INTERRUPTS(); > > ... > >} > > } > > > > /* confirm all writes so far */ > > send_feedback(last_received, false, false); > > ... > > > > In send_feedback, we will set flushpos to replication origin's > > remote_lsn and send it to the walsender process. Walsender process > > will receive this information and set confirmed_flush in: > > ProcessStandbyReplyMessage->LogicalConfirmReceivedLocation > > > > Then immediately we are trying to stop the publisher instance, > > shutdown checkpoint process will be triggered. 
In this case:
> > confirmed_flush = 0/1510978 will be lesser than
> > last_saved_confirmed_flush = 0/15109B0 which will result in Assertion
> > failure.
> >
> > This issue is happening because we allow setting the confirmed_flush
> > to a backward position.
>
> I see your point.
>
> > There are a couple of ways to fix this:
> > a) One way is not to update the confirm_flush if the lsn sent is an
> > older value, like in Confirm_flush_dont_allow_backward.patch
> >
> > @@ -1839,7 +1839,8 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn)
> >   SpinLockAcquire(&MyReplicationSlot->mutex);
> > - MyReplicationSlot->data.confirmed_flush = lsn;
> + if (lsn > MyReplicationSlot->data.confirmed_flush)
> +  MyReplicationSlot->data.confirmed_flush = lsn;
>
>  /* if we're past the location required for bumping xmin, do so */
>  if (MyReplicationSlot->candidate_xmin_lsn != InvalidXLogRecPtr &&
> @@ -1904,7 +1905,8 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn)
>  else
>  {
>   SpinLockAcquire(&MyReplicationSlot->mutex);
> -  MyReplicationSlot->data.confirmed_flush = lsn;
> +  if (lsn > MyReplicationSlot->data.confirmed_flush)
> +   MyReplicationSlot->data.confirmed_flush = lsn;
>
> BTW, from which code path does it update the prior value of
> confirmed_flush?

The confirmed_flush is getting set in the else condition for this scenario.

> If it is through the else check, then can we see if
> it may change the confirm_flush to the prior position via the first
> code path? I am asking because in the first code path, we can even
> flush the re-treated value of confirm_flush LSN.

I was not able to find any scenario that sets a prior position with the first code path. I tried various scenarios like adding a delay in the walsender, adding a delay in the apply worker, restarting the instances, and various DML operations. It was always setting it either to the same value as before or to a greater value.

Regards,
Vignesh
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Mon, Feb 19, 2024 at 7:47 PM John Naylor wrote: > > On Mon, Feb 19, 2024 at 9:02 AM Masahiko Sawada wrote: > > > > I think that vacuum and tidbitmap (and future users) would end up > > having the same max block size calculation. And it seems slightly odd > > layering to me that max-block-size-specified context is created on > > vacuum (or tidbitmap) layer, a varlen-value radix tree is created by > > tidstore layer, and the passed context is used for leaves (if > > varlen-value is used) on radix tree layer. > > That sounds slightly more complicated than I was thinking of, but we > could actually be talking about the same thing: I'm drawing a > distinction between "used = must be detected / #ifdef'd" and "used = > actually happens to call allocation". I meant that the passed context > would _always_ be used for leaves, regardless of varlen or not. So > with fixed-length values short enough to live in child pointer slots, > that context would still be used for iteration etc. > > > Another idea is to create a > > max-block-size-specified context on the tidstore layer. That is, > > vacuum and tidbitmap pass a work_mem and a flag indicating whether the > > tidstore can use the bump context, and tidstore creates a (aset of > > bump) memory context with the calculated max block size and passes it > > to the radix tree. > > That might be a better abstraction since both uses have some memory limit. I've drafted this idea, and fixed a bug in tidstore.c. Here is the summary of updates from v62: - removed v62-0007 patch as we discussed - squashed v62-0006 and v62-0008 patches into 0003 patch - v63-0008 patch fixes a bug in tidstore. - v63-0009 patch is a draft idea of cleanup memory context handling. > > > As for using the bump memory context, I feel that we need to store > > iterator struct in aset context at least as it can be individually > > freed and re-created. Or it might not be necessary to allocate the > > iterator struct in the same context as radix tree. 
> > Okay, that's one thing I was concerned about. Since we don't actually > have a bump context yet, it seems simple to assume aset for non-nodes, > and if we do get it, we can adjust slightly. Anyway, this seems like a > good thing to try to clean up, but it's also not a show-stopper. > > On that note: I will be going on honeymoon shortly, and then to PGConf > India, so I will have sporadic connectivity for the next 10 days and > won't be doing any hacking during that time. Thank you for letting us know. Enjoy yourself! Regards -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com v63-ART.tar.gz Description: GNU Zip compressed data
Re: Introduce XID age and inactive timeout based replication slot invalidation
On Fri, Feb 9, 2024 at 1:12 PM Bertrand Drouvot wrote:
>
> I think "conflict" is an important topic and does contain several reasons. The
> slot "first" conflicts and then that leads to slot "invalidation".
>
> > They both are the same internally, so why
> > confuse the users?
>
> I don't think that would confuse the users; I do think that would be easier to
> check for conflicting slots.

I've added a separate column for invalidation reasons for now. I'll see what others think on this as time goes by.

> I did not look closely at the code, just played a bit with the patch and was able
> to produce something like:
>
> postgres=# select slot_name,slot_type,active,active_pid,wal_status,invalidation_reason from pg_replication_slots;
>  slot_name   | slot_type | active | active_pid | wal_status | invalidation_reason
> -------------+-----------+--------+------------+------------+---------------------
>  rep1        | physical  | f      |            | reserved   |
>  master_slot | physical  | t      | 1482441    | unreserved | wal_removed
> (2 rows)
>
> does that make sense to have an "active/working" slot "invalidated"?

Thanks. Can you please provide the steps to generate this error? Are you setting max_slot_wal_keep_size on the primary to generate "wal_removed"? Attached is the v5 patch set after rebasing and addressing review comments. Please review it further. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com

v5-0001-Track-invalidation_reason-in-pg_replication_slots.patch Description: Binary data
v5-0002-Add-XID-based-replication-slot-invalidation.patch Description: Binary data
v5-0003-Track-inactive-replication-slot-information.patch Description: Binary data
v5-0004-Add-inactive_timeout-based-replication-slot-inval.patch Description: Binary data
Re: Patch: Add parse_type Function
On Tue, Feb 20, 2024 at 11:06 AM David E. Wheeler wrote:
>
> LOL, I missed that, just wrote it myself in the last hour. :-) v6 attached.

+SELECT to_regtypemod('interval nonesuch'); -- grammar error expected
+ERROR: syntax error at or near "nonesuch"
+LINE 1: SELECT to_regtypemod('interval nonesuch');
+ ^
+CONTEXT: invalid type name "interval nonesuch"
+SELECT to_regtypemod('year(4)'); -- grammar error expected
+ to_regtypemod
+---
+
+(1 row)
+

The second hint `-- grammar error expected` seems to contradict the results?
Re: Switching XLog source from archive to streaming when primary available
On Tue, 20 Feb 2024 at 13:40, Bharath Rupireddy wrote:
> On Mon, Feb 19, 2024 at 8:25 PM Japin Li wrote:
>> [2]
>> +# Ensure checkpoint doesn't come in our way
>> +$primary->append_conf('postgresql.conf', qq(
>> +min_wal_size = 2MB
>> +max_wal_size = 1GB
>> +checkpoint_timeout = 1h
>> + autovacuum = off
>> +));
>>
>> Keeping the same indentation might be better.
>
> The autovacuum line looks mis-indented in the patch file. However, I
> now ran src/tools/pgindent/perltidyrc
> src/test/recovery/t/041_wal_source_switch.pl on it.

Thanks for updating the patch. It seems it still has the wrong indent.

diff --git a/src/test/recovery/t/041_wal_source_switch.pl b/src/test/recovery/t/041_wal_source_switch.pl
index 082680bf4a..b5eddba1d5 100644
--- a/src/test/recovery/t/041_wal_source_switch.pl
+++ b/src/test/recovery/t/041_wal_source_switch.pl
@@ -18,9 +18,9 @@ $primary->init(
 # Ensure checkpoint doesn't come in our way
 $primary->append_conf(
 	'postgresql.conf', qq(
-min_wal_size = 2MB
-max_wal_size = 1GB
-checkpoint_timeout = 1h
+	min_wal_size = 2MB
+	max_wal_size = 1GB
+	checkpoint_timeout = 1h
 	autovacuum = off
 ));
 $primary->start;
@@ -85,7 +85,7 @@ my $offset = -s $standby->logfile;
 my $apply_delay = $retry_interval * 5;
 $standby->append_conf(
 	'postgresql.conf', qq(
-recovery_min_apply_delay = '${apply_delay}ms'
+	recovery_min_apply_delay = '${apply_delay}ms'
 ));
 $standby->start;
Re: Shared detoast Datum proposal
Hi, I did another round of self-review. Comments, variable names, and the order of function definitions are improved so that it can be read as smoothly as possible, so v6 is attached. -- Best Regards Andy Fan

>From f2e7772228e8a18027b9c29f10caba9c6570d934 Mon Sep 17 00:00:00 2001
From: "yizhi.fzh"
Date: Tue, 20 Feb 2024 11:11:53 +0800
Subject: [PATCH v6 1/2] Introduce a Bitset data struct.

While Bitmapset is designed for a variable length of bits, Bitset is designed for a fixed length of bits; the fixed length must be specified at the bitset_init stage and stays unchanged for the whole lifespan. Because of this, some operations on a Bitset are simpler than on a Bitmapset. The bitset_clear function unsets all the bits but keeps the allocated memory; this capability is impossible for Bitmapset for some solid reasons, and this is the main reason to add this data struct.

[1] https://postgr.es/m/CAApHDvpdp9LyAoMXvS7iCX-t3VonQM3fTWCmhconEvORrQ%2BZYA%40mail.gmail.com
[2] https://postgr.es/m/875xzqxbv5.fsf%40163.com
---
 src/backend/nodes/bitmapset.c                 | 200 +-
 src/backend/nodes/outfuncs.c                  |  51 +
 src/include/nodes/bitmapset.h                 |  28 +++
 src/include/nodes/nodes.h                     |   4 +
 src/test/modules/test_misc/Makefile           |  11 +
 src/test/modules/test_misc/README             |   4 +-
 .../test_misc/expected/test_bitset.out        |   7 +
 src/test/modules/test_misc/meson.build        |  17 ++
 .../modules/test_misc/sql/test_bitset.sql     |   3 +
 src/test/modules/test_misc/test_misc--1.0.sql |   5 +
 src/test/modules/test_misc/test_misc.c        | 118 +++
 src/test/modules/test_misc/test_misc.control  |   4 +
 src/tools/pgindent/typedefs.list              |   1 +
 13 files changed, 441 insertions(+), 12 deletions(-)
 create mode 100644 src/test/modules/test_misc/expected/test_bitset.out
 create mode 100644 src/test/modules/test_misc/sql/test_bitset.sql
 create mode 100644 src/test/modules/test_misc/test_misc--1.0.sql
 create mode 100644 src/test/modules/test_misc/test_misc.c
 create mode 100644 src/test/modules/test_misc/test_misc.control

diff --git a/src/backend/nodes/bitmapset.c
b/src/backend/nodes/bitmapset.c
index 65805d4527..40cfea2308 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -1315,23 +1315,18 @@ bms_join(Bitmapset *a, Bitmapset *b)
  * It makes no difference in simple loop usage, but complex iteration logic
  * might need such an ability.
  */
-int
-bms_next_member(const Bitmapset *a, int prevbit)
+
+static int
+bms_next_member_internal(int nwords, const bitmapword *words, int prevbit)
 {
-	int			nwords;
 	int			wordnum;
 	bitmapword	mask;
-	Assert(bms_is_valid_set(a));
-
-	if (a == NULL)
-		return -2;
-	nwords = a->nwords;
 	prevbit++;
 	mask = (~(bitmapword) 0) << BITNUM(prevbit);
 	for (wordnum = WORDNUM(prevbit); wordnum < nwords; wordnum++)
 	{
-		bitmapword	w = a->words[wordnum];
+		bitmapword	w = words[wordnum];
 		/* ignore bits before prevbit */
 		w &= mask;
@@ -1351,6 +1346,19 @@ bms_next_member(const Bitmapset *a, int prevbit)
 	return -2;
 }
+int
+bms_next_member(const Bitmapset *a, int prevbit)
+{
+	Assert(a == NULL || IsA(a, Bitmapset));
+
+	Assert(bms_is_valid_set(a));
+
+	if (a == NULL)
+		return -2;
+
+	return bms_next_member_internal(a->nwords, a->words, prevbit);
+}
+
 /*
  * bms_prev_member - find prev member of a set
 *
@@ -1458,3 +1466,177 @@ bitmap_match(const void *key1, const void *key2, Size keysize)
 	return !bms_equal(*((const Bitmapset *const *) key1),
 					  *((const Bitmapset *const *) key2));
 }
+
+/*
+ * bitset_init - create a Bitset. the set will be round up to nwords;
+ */
+Bitset *
+bitset_init(size_t size)
+{
+	int			nword = (size + BITS_PER_BITMAPWORD - 1) / BITS_PER_BITMAPWORD;
+	Bitset	   *result;
+
+	if (size == 0)
+		return NULL;
+
+	result = (Bitset *) palloc0(sizeof(Bitset) + nword * sizeof(bitmapword));
+	result->nwords = nword;
+
+	return result;
+}
+
+/*
+ * bitset_clear - clear the bits only, but the memory is still there.
+ */
+void
+bitset_clear(Bitset *a)
+{
+	if (a != NULL)
+		memset(a->words, 0, sizeof(bitmapword) * a->nwords);
+}
+
+void
+bitset_free(Bitset *a)
+{
+	if (a != NULL)
+		pfree(a);
+}
+
+bool
+bitset_is_empty(Bitset *a)
+{
+	int			i;
+
+	if (a == NULL)
+		return true;
+
+	for (i = 0; i < a->nwords; i++)
+	{
+		bitmapword	w = a->words[i];
+
+		if (w != 0)
+			return false;
+	}
+
+	return true;
+}
+
+Bitset *
+bitset_copy(Bitset *a)
+{
+	Bitset	   *result;
+
+	if (a == NULL)
+		return NULL;
+
+	result = bitset_init(a->nwords * BITS_PER_BITMAPWORD);
+
+	memcpy(result->words, a->words, sizeof(bitmapword) * a->nwords);
+	return result;
+}
+
+void
+bitset_add_member(Bitset *a, int x)
+{
+	int			wordnum,
+				bitnum;
+
+	Assert(x >= 0);
+
+	wordnum = WORDNUM(x);
+	bitnum = BITNUM(x);
+
+	Assert(wordnum < a->nwords);
+
+	a->words[wordnum] |= ((bitmapword) 1 << bitnum);
+}
+
+void
+bitset_del_member(Bitset *a, int x)
+{
+	int			wo
Re: Improve WALRead() to suck data directly from WAL buffers when possible
On Sat, Feb 17, 2024 at 10:27 AM Bharath Rupireddy wrote: > > On Fri, Feb 16, 2024 at 11:01 PM Jeff Davis wrote: > > > > > Here, I'm with v23 patch set: > > > > Thank you, I'll look at these. > > Thanks. Here's the v24 patch set after rebasing. Ran pgperltidy on the new TAP test file added. Please see the attached v25 patch set. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com v25-0001-Add-check-in-WALReadFromBuffers-against-requeste.patch Description: Binary data v25-0002-Add-test-module-for-verifying-read-from-WAL-buff.patch Description: Binary data v25-0003-Use-WALReadFromBuffers-in-more-places.patch Description: Binary data v25-0004-Demonstrate-reading-unflushed-WAL-directly-from-.patch Description: Binary data
Re: Add test module for verifying backtrace functionality
On Tue, Feb 13, 2024 at 2:11 AM Bharath Rupireddy wrote: > > Hi, > > Postgres has a good amount of code for dealing with backtraces - two > GUCs backtrace_functions and backtrace_on_internal_error, > errbacktrace; all of which use core function set_backtrace from > elog.c. I've not seen this code being tested at all, see code coverage > report - > https://coverage.postgresql.org/src/backend/utils/error/elog.c.gcov.html. > > I think adding a simple test module (containing no .c files) with only > TAP tests will help cover this code. I ended up having it as a > separate module under src/test/modules/test_backtrace as I was not > able to find an existing TAP file in src/test to add these tests. I'm > able to verify the backtrace related code with the attached patch > consistently. The TAP tests rely on the fact that the server emits > text "BACKTRACE: " to server logs before logging the backtrace, and > the backtrace contains the function name in which the error occurs. > I've turned off query statement logging (set log_statement = none, > log_min_error_statement = fatal) so that the tests get to see the > functions only in the backtrace. Although the CF bot is happy with the > attached patch > https://github.com/BRupireddy2/postgres/tree/add_test_module_for_bcktrace_functionality_v1, > there might be some more flakiness to it. > > Thoughts? Ran pgperltidy on the new TAP test file added. Please see the attached v2 patch. 
-- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com

From 2cfe9c87dea980efa2f319e9c2e873e83e561b9e Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy
Date: Tue, 20 Feb 2024 05:48:41 +
Subject: [PATCH v2] Add test module for backtrace functionality
---
 src/test/modules/Makefile                     |  1 +
 src/test/modules/meson.build                  |  1 +
 src/test/modules/test_backtrace/.gitignore    |  4 +
 src/test/modules/test_backtrace/Makefile      | 14 +++
 src/test/modules/test_backtrace/README        |  4 +
 src/test/modules/test_backtrace/meson.build   | 12 +++
 .../modules/test_backtrace/t/001_backtrace.pl | 98 +++
 7 files changed, 134 insertions(+)
 create mode 100644 src/test/modules/test_backtrace/.gitignore
 create mode 100644 src/test/modules/test_backtrace/Makefile
 create mode 100644 src/test/modules/test_backtrace/README
 create mode 100644 src/test/modules/test_backtrace/meson.build
 create mode 100644 src/test/modules/test_backtrace/t/001_backtrace.pl

diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 89aa41b5e3..3967311048 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -13,6 +13,7 @@ SUBDIRS = \
 	libpq_pipeline \
 	plsample \
 	spgist_name_ops \
+	test_backtrace \
 	test_bloomfilter \
 	test_copy_callbacks \
 	test_custom_rmgrs \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 8fbe742d38..3b21b389af 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -12,6 +12,7 @@ subdir('libpq_pipeline')
 subdir('plsample')
 subdir('spgist_name_ops')
 subdir('ssl_passphrase_callback')
+subdir('test_backtrace')
 subdir('test_bloomfilter')
 subdir('test_copy_callbacks')
 subdir('test_custom_rmgrs')
diff --git a/src/test/modules/test_backtrace/.gitignore b/src/test/modules/test_backtrace/.gitignore
new file mode 100644
index 00..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_backtrace/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_backtrace/Makefile b/src/test/modules/test_backtrace/Makefile
new file mode 100644
index 00..b908864e89
--- /dev/null
+++ b/src/test/modules/test_backtrace/Makefile
@@ -0,0 +1,14 @@
+# src/test/modules/test_backtrace/Makefile
+
+TAP_TESTS = 1
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_backtrace
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_backtrace/README b/src/test/modules/test_backtrace/README
new file mode 100644
index 00..ae1366bd58
--- /dev/null
+++ b/src/test/modules/test_backtrace/README
@@ -0,0 +1,4 @@
+This directory doesn't actually contain any extension module.
+
+Instead it is a home for backtrace tests that we want to run using TAP
+infrastructure.
diff --git a/src/test/modules/test_backtrace/meson.build b/src/test/modules/test_backtrace/meson.build
new file mode 100644
index 00..4adffcc914
--- /dev/null
+++ b/src/test/modules/test_backtrace/meson.build
@@ -0,0 +1,12 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+tests += {
+  'name': 'test_backtrace',
+  'sd': meson.current_source_dir(),
+  'bd': meson.current_build_dir(),
+  'tap': {
+    'tests': [
+      't/001_backtrace.pl',
+    ],
+  },
Re: partitioning and identity column
20.02.2024 07:57, Ashutosh Bapat wrote: Could you please name functions, which you suspect, for me to recheck them? Perhaps we should consider fixing all of such functions, in light of b0f7dd915 and d57b7cc33... Looks like the second commit has fixed all other places I knew except Identity related functions. So worth fixing identity related functions too. I see dropconstraint_internal() has two calls to check_stack_depth() back to back. The second one is not needed? Yeah, that's funny. It looks like such a double protection emerged because Alvaro protected the function (in b0f7dd915), which was waiting for adding check_stack_depth() in the other thread (resulted in d57b7cc33). Thank you for spending time on this! Best regards, Alexander
Re: logical decoding and replication of sequences, take 2
On Thu, Dec 21, 2023 at 6:47 PM Tomas Vondra wrote:
>
> On 12/19/23 13:54, Christophe Pettus wrote:
> > Hi,
> >
> > I wanted to hop in here on one particular issue:
> >
> >> On Dec 12, 2023, at 02:01, Tomas Vondra
> >> wrote:
> >> - desirability of the feature: Random IDs (UUIDs etc.) are likely a much
> >> better solution for distributed (esp. active-active) systems. But there
> >> are important use cases that are likely to keep using regular sequences
> >> (online upgrades of single-node instances, existing systems, ...).
> >
> > +1.
> >
> > Right now, the lack of sequence replication is a rather large
> > foot-gun on logical replication upgrades. Copying the sequences
> > over during the cutover period is doable, of course, but:
> >
> > (a) There's no out-of-the-box tooling that does it, so everyone has
> > to write some scripts just for that one function.
> >
> > (b) It's one more thing that extends the cutover window.
>
> I agree it's an annoying gap for this use case. But if this is the only
> use case, maybe a better solution would be to provide such tooling
> instead of adding it to the logical decoding?
>
> It might seem a bit strange if most data is copied by replication
> directly, while sequences need special handling, ofc.

One difference between the logical replication of tables and sequences is that with synchronous_commit (and synchronous_standby_names) we can guarantee that after failover a transaction's data is either replicated or not, whereas for sequences we can't guarantee that because of their non-transactional nature. Say there are two transactions, T1 and T2: it is possible that T1's entire table data and sequence data are committed and replicated, but only T2's sequence data is replicated. So, after failover to the logical subscriber in such a case, if one routes T2 again to the new node (as it was not successful previously), then it would needlessly perform the sequence changes again.
I don't know how much that matters, but that would probably be the difference between the replication of tables and sequences. I agree with your point above that for upgrades, some tool like pg_copysequence, where we can provide a way to copy sequence data to subscribers from the publisher, would suffice the need. -- With Regards, Amit Kapila.
Re: Switching XLog source from archive to streaming when primary available
On Mon, Feb 19, 2024 at 8:25 PM Japin Li wrote: > > > Strengthened tests a bit by using recovery_min_apply_delay to mimic > > standby spending some time fetching from archive. PSA v18 patch. > > Here are some minor comments: Thanks for taking a look at it. > [1] > +primary). However, the standby exhausts all the WAL present in pg_wal > > s|pg_wal|pg_wal|g Done. > [2] > +# Ensure checkpoint doesn't come in our way > +$primary->append_conf('postgresql.conf', qq( > +min_wal_size = 2MB > +max_wal_size = 1GB > +checkpoint_timeout = 1h > + autovacuum = off > +)); > > Keeping the same indentation might be better. The autovacuum line looks mis-indented in the patch file. However, I now ran src/tools/pgindent/perltidyrc src/test/recovery/t/041_wal_source_switch.pl on it. Please see the attached v19 patch. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com v19-0001-Allow-standby-to-switch-WAL-source-from-archive-.patch Description: Binary data
Re: JIT compilation per plan node
On Tue, 20 Feb 2024 at 18:31, Tom Lane wrote: > FWIW, I seriously doubt that an extra walk of the plan tree is even > measurable compared to the number of cycles JIT compilation will > expend if it's called. So I don't buy your argument here. > We would be better off to do this in a way that's clean and doesn't > add overhead for non-JIT-enabled builds. The extra walk of the tree would need to be done for every plan, not just the ones where we do JIT. I'd rather find a way to not add this extra plan tree walk, especially since the vast majority of cases on an average instance won't be doing any JIT. David
Re: Fix race condition in InvalidatePossiblyObsoleteSlot()
On Tue, Feb 20, 2024 at 08:51:17AM +0900, Michael Paquier wrote: > Prefixing these with "initial_" is fine, IMO. That shows the > intention that these come from the slot's data before doing the > termination. So I'm OK with what's been proposed in v3. I was looking at that a second time, and just concluded that this is OK, so I've applied that down to 16, wordsmithing a bit the comments. > Both are the same to me, so I have no extra opinion to offer. ;) I've kept this one as-is, though. -- Michael signature.asc Description: PGP signature
Re: JIT compilation per plan node
David Rowley writes: > On Tue, 20 Feb 2024 at 05:26, Tomas Vondra > wrote: >> Wouldn't it be simpler to just build the plan as we do now, and then >> have an expression_tree_walker that walks the complete plan top-down, >> inspects the nodes, enables JIT where appropriate and so on? That can >> have arbitrary context, no problem with that. > Why walk the entire plan tree again to do something we could do when > building it in the first place? FWIW, I seriously doubt that an extra walk of the plan tree is even measurable compared to the number of cycles JIT compilation will expend if it's called. So I don't buy your argument here. We would be better off to do this in a way that's clean and doesn't add overhead for non-JIT-enabled builds. regards, tom lane
Re: JIT compilation per plan node
On Tue, 20 Feb 2024 at 05:26, Tomas Vondra wrote: > I doubt CreatePlanContext is a great way to achieve this. For one, it > breaks the long-standing custom that PlannerInfo is the first parameter, > usually followed by RelOptInfo, etc. CreatePlanContext is added to some > functions (but not all), which makes it ... unpredictable. I suggested this to Melih as a way to do this based on what Andy wrote in [1]. I agree with Andy that it's not great to add est_calls to every function in createplan.c. I feel that CreatePlanContext is a way to take the hit of adding a parameter once and hopefully never having to do it again to this degree. I wondered if PlannerInfo could be a field in CreatePlanContext. > FWIW it's not clear to me if/how this solves the problem with early > create_plan() calls for subplans. Or is it still the same? > > Wouldn't it be simpler to just build the plan as we do now, and then > have an expression_tree_walker that walks the complete plan top-down, > inspects the nodes, enables JIT where appropriate and so on? That can > have arbitrary context, no problem with that. Why walk the entire plan tree again to do something we could do when building it in the first place? Recursively walking trees isn't great from a performance point of view. It would be nice to avoid this if we can find some other way to handle subplans. I do have a few other reasons up my sleeve that subplan creation should be delayed until later, so maybe we should fix that to unblock those issues. > Considering we decide JIT pretty late anyway (long after costing and > other stuff that might affect the plan selection), the result should be > exactly the same, without the extensive createplan.c disruption ... 
> > (usual caveat: I haven't tried, maybe there's something that means this > can't work) It's not like we can look at the top-node's cost as a pre-check to skip the recursive step for cheap plans as it's perfectly valid that a node closer to the root of the plan tree have a lower total cost than that node's subnodes. e.g LIMIT 1. > > - The number of estimated calls are especially useful where a node is > > expected to be rescanned, such as Gather. Gather Merge, Memoize and Nested > > Loop. Knowing the estimated number of calls for a node allows us to rely on > > total cost multiplied by the estimated calls instead of only total cost for > > the node. > > Not sure I follow. Why would be these nodes (Gather, Gather Merge, ...) > more likely to be rescanned compared to other nodes? I think Melih is listing nodes that can change the est_calls. Any node can be rescanned, but only a subset of nodes can adjust the number of times they call their subnode vs how many times they themselves are called. > > - For each node, the planner considers if the node should be JITed. If the > > cost of the node * the number of estimated calls is greater than > > jit_above_cost, it's decided to be JIT compiled. Note that this changes > > the meaning of jit_above_cost, it's now a threshold for a single plan node > > and not the whole query. Additionally, this change in JIT consideration is > > only for JIT compilations. Inlining and optimizations continue to be for > > the whole query and based on the overall cost of the query. > > It's not clear to me why JIT compilation is decided for each node, while > the inlining/optimization is decided for the plan as a whole. I'm not > familiar with the JIT stuff, so maybe it's obvious to others ... This is a problem with LLVM, IIRC. The problem is it's a decision that has to be made for an entire compilation unit and it can't be decided at the expression level. 
This is pretty annoying as it's pretty hard to decide the best logic to use to enable optimisations and inlining :-( I think the best thing we could discuss right now is, is this the best way to fix the JIT costing problem. In [2] I did link to a complaint about the JIT costings. See [3]. The OP there wanted to keep the plan private, but I did get to see it and described the problem on the list. Also, I don't happen to think the decision about JITting per plan node is perfect as the node's total costs can be high for reasons other than the cost of evaluating expressions. Also, the number of times a given expression is evaluated can vary quite a bit based on when the expression is evaluated. For example, a foreign table scan that does most of the filtering remotely, but has a non-shippable expr that needs to be evaluated locally. The foreign scan might be very expensive, especially if lots of filtering is done by a Seq Scan and not many rows might make it back to the local server to benefit from JITting the non-shippable expression. A counter-example is the join condition of a non-parameterized nested loop. Those get evaluated n_outer_rows * n_inner_rows times. I think the savings JIT gives us on evaluation of expressions is going to be more closely tied to the number of times an expression is evaluated than the total
Re: Memory consumed by paths during partitionwise join planning
On Tue, Feb 20, 2024 at 8:19 AM Andrei Lepikhov wrote: > > On 19/2/2024 19:25, Ashutosh Bapat wrote: > > On Fri, Feb 16, 2024 at 8:42 AM Andrei Lepikhov > > wrote: > >> Live example: right now, I am working on the code like MSSQL has - a > >> combination of NestLoop and HashJoin paths and switching between them in > >> real-time. It requires both paths in the path list at the moment when > >> extensions are coming. Even if one of them isn't referenced from the > >> upper pathlist, it may still be helpful for the extension. > > > > There is no guarantee that every path presented to add_path will be > > preserved. Suboptimal paths are freed as and when add_path discovers > > that they are suboptimal. So I don't think an extension can rely on > > existence of a path. But having a refcount makes it easy to preserve > > the required paths by referencing them. > I don't insist, just provide my use case. It would be ideal if you would > provide some external routines for extensions that allow for sticking > the path in pathlist even when it has terrible cost estimation. With refcounts you can reference it and store it somewhere other than pathlist. The path won't be lost until it is dereferenced. RelOptInfo::Pathlist is for optimal paths. > > > >> > >> > > IIUC, you are suggesting that instead of planning each > > partition/partitionwise join, we only create paths with the strategies > > which were found to be optimal with previous partitions. That's a good > > heuristic but it won't work if partition properties - statistics, > > indexes etc. differ between groups of partitions. > Sure, but the "Symmetry" strategy assumes that on the scope of a > thousand partitions, especially with parallel append involved, it > doesn't cause sensible performance degradation if we find a bit > suboptimal path in a small subset of partitions. Does it make sense? 
> As I see, when people use 10-100 partitions for the table, they usually > strive to keep indexes symmetrical for all partitions. > I agree that we need something like that. In order to do that, we need machinery to prove that all partitions have similar properties. Once that is proved, we can skip creating paths for similar partitions. But that's out of scope of this work and complements it. -- Best Wishes, Ashutosh Bapat
Re: logical decoding and replication of sequences, take 2
On Fri, Feb 16, 2024 at 1:57 AM Tomas Vondra wrote: > For me, the part that I feel most uneasy about is the decoding while the > snapshot is still being built (and can flip to consistent snapshot > between the relfilenode creation and sequence change, confusing the > logic that decides which changes are transactional). > > It seems "a bit weird" that we either keep the "simple" logic that may > end up with incorrect "non-transactional" result, but happens to then > work fine because we immediately discard the change. > > But it still feels better than the alternative, which requires us to > start decoding stuff (relfilenode creation) before building a proper > snapshot is consistent, which we didn't do before - or at least not in > this particular way. While I don't have a practical example where it > would cause trouble now, I have a nagging feeling it might easily cause > trouble in the future by making some new features harder to implement. I don't understand the issues here well enough to comment. Is there a good write-up someplace I can read to understand the design here? Is the rule that changes are transactional if and only if the current transaction has assigned a new relfilenode to the sequence? Why does the logic get confused if the state of the snapshot changes? My naive reaction is that it kinda sounds like you're relying on two different mistakes cancelling each other out, and that might be a bad idea, because maybe there's some situation where they don't. But I don't understand the issue well enough to have an educated opinion at this point. -- Robert Haas EDB: http://www.enterprisedb.com
Re: partitioning and identity column
On Mon, Feb 19, 2024 at 8:30 PM Alexander Lakhin wrote: > > Hello Ashutosh, > > 19.02.2024 15:17, Ashutosh Bapat wrote: > > > >> Functions ATExecAddIdentity() and ATExecDropIdentity() are recursive too, > >> so I think they can be exploited as well. > > not just Identity related functions, but many other functions in > > tablecmds.c have that problem as I mentioned earlier. > > > > Could you please name functions, which you suspect, for me to recheck them? > Perhaps we should consider fixing all of such functions, in light of > b0f7dd915 and d57b7cc33... Looks like the second commit has fixed all other places I knew except Identity related functions. So worth fixing identity related functions too. I see dropconstraint_internal() has two calls to check_stack_depth() back to back. The second one is not needed? -- Best Wishes, Ashutosh Bapat
Re: Have pg_basebackup write "dbname" in "primary_conninfo"?
On Tue, Feb 20, 2024 at 5:04 AM Ian Lawrence Barwick wrote: > > > With the addition of "pg_sync_replication_slots()", there is now a use-case > for > including "dbname" in "primary_conninfo" and the docs have changed from > stating [1]: > > Do not specify a database name in the primary_conninfo string. > > to [2]: > > For replication slot synchronization (see Section 48.2.3), it is also > necessary to specify a valid dbname in the primary_conninfo string. This > will > only be used for slot synchronization. It is ignored for streaming. > > [1] > https://www.postgresql.org/docs/16/runtime-config-replication.html#RUNTIME-CONFIG-REPLICATION-STANDBY > [2] > https://www.postgresql.org/docs/devel/runtime-config-replication.html#RUNTIME-CONFIG-REPLICATION-STANDBY > > However, when setting up a standby (with the intent of creating a logical > standby) with pg_basebackup, providing the -R/-write-recovery-conf option > results in a "primary_conninfo" string being generated without a "dbname" > parameter, which requires a manual change should one intend to use > "pg_sync_replication_slots()". > > I can't see any reason for continuing to omit "dbname", so suggest it should > only continue to be omitted for 16 and earlier; see attached patch. > > Note that this does mean that if the conninfo string provided to pg_basebackup > does not contain "dbname", the generated "primary_conninfo" GUC will default > to > "dbname=replication". 
> When I tried, it defaulted to user name: Default case connection string: primary_conninfo = 'user=KapilaAm passfile=''C:\Users\kapilaam\AppData\Roaming/postgresql/pgpass.conf'' channel_binding=disable dbname=KapilaAm port=5432 sslmode=disable sslcompression=0 sslcertmode=disable sslsni=1 ssl_min_protocol_version=TLSv1.2 gssencmode=disable krbsrvname=postgres gssdelegation=0 target_session_attrs=any load_balance_hosts=disable' When I specified -d "dbname = postgres" during backup: primary_conninfo = 'user=KapilaAm passfile=''C:\Users\kapilaam\AppData\Roaming/postgresql/pgpass.conf'' channel_binding=disable dbname=KapilaAm port=5432 sslmode=disable sslcompression=0 sslcertmode=disable sslsni=1 ssl_min_protocol_version=TLSv1.2 gssencmode=disable krbsrvname=postgres gssdelegation=0 target_session_attrs=any load_balance_hosts=disable' I think it makes sense to include dbname in the connection string programmatically (with some option) for slot sync functionality but it is not clear to me if there is any impact of the same when the standby is not set up to sync up logical slots. I haven't checked but if there is a way to distinguish the case where we write dbname only when specified by the user then that would be great but this is worth considering even without that. -- With Regards, Amit Kapila.
Re: RFC: Logging plan of the running query
On Fri, Feb 16, 2024 at 12:29 AM Andres Freund wrote: > If we went with something like tht approach, I think we'd have to do something > like redirecting node->ExecProcNode to a wrapper, presumably from within a > CFI. That wrapper could then implement the explain support, without slowing > down the normal execution path. That's an annoying complication; maybe there's some better way to handle this. But I think we need to do something different than what the patch does currently because... > > It's really hard for me to accept that the heavyweight lock problem > > for which the patch contains a workaround is the only one that exists. > > I can't see any reason why that should be true. > > I suspect you're right. ...I think the current approach is just plain dead, because of this issue. We can't take an approach that creates an unbounded number of unclear reentrancy issues and then insert hacks one by one to cure them (or hack around them, more to the point) as they're discovered. The premise has to be that we only allow logging the query plan at points where we know it's safe, rather than, as at present, allowing it in places that are unsafe and then trying to compensate with code elsewhere. That's not likely to ever be as stable as we want PostgreSQL to be. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Optimize planner memory consumption for huge arrays
On 20/2/2024 04:51, Tom Lane wrote: Tomas Vondra writes: On 2/19/24 16:45, Tom Lane wrote: Tomas Vondra writes: For example, I don't think we expect selectivity functions to allocate long-lived objects, right? So maybe we could run them in a dedicated memory context, and reset it aggressively (after each call). Here's a quick and probably-incomplete implementation of that idea. I've not tried to study its effects on memory consumption, just made sure it passes check-world. Thanks for the sketch. The trick with the planner_tmp_cxt_depth especially looks interesting. I think we should design small memory contexts - one per scalable direction of memory utilization, like selectivity or partitioning (appending ?). My coding experience shows that short-lived GEQO memory context forces people to learn on Postgres internals more precisely :). -- regards, Andrei Lepikhov Postgres Professional
Re: WIP Incremental JSON Parser
On 2024-01-26 Fr 12:15, Andrew Dunstan wrote: On 2024-01-24 We 13:08, Robert Haas wrote: Maybe you should adjust your patch to dump the manifests into the log file with note(). Then when cfbot runs on it you can see exactly what the raw file looks like. Although I wonder if it's possible that the manifest itself is OK, but somehow it gets garbled when uploaded to the server, either because the client code that sends it or the server code that receives it does something that isn't safe in 32-bit mode. If we hypothesize an off-by-one error or a buffer overrun, that could possibly explain how one field got garbled while the rest of the file is OK. Yeah, I thought earlier today I was on the track of an off by one error, but I was apparently mistaken, so here's the same patch set with an extra patch that logs a bunch of stuff, and might let us see what's upsetting the cfbot. Well, that didn't help a lot, but meanwhile the CFBot seems to have decided in the last few days that it's now happy, so full steam ahead! ;-) cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: POC, WIP: OR-clause support for indexes
On 20/2/2024 11:03, jian he wrote: Neither the code comments nor the commit message really explain the design idea here. That's unfortunate, principally because it makes review difficult. I'm very skeptical about the idea of using JumbleExpr for any part of this. It seems fairly expensive, and it might produce false matches. If expensive is OK, then why not just use equal()? If it's not, then this probably isn't really OK either. But in any case there should be comments explaining why this strategy was chosen. The above message (https://postgr.es/m/CA%2BTgmoZCgP6FrBQEusn4yaWm02XU8OPeoEMk91q7PRBgwaAkFw%40mail.gmail.com) seems still not answered. How can we evaluate whether JumbleExpr is expensive or not? I used this naive script to test, but didn't find a big difference when enable_or_transformation is ON or OFF. First, I am open to discussion here. But IMO, equal() operation is quite expensive by itself. We should use the hash table approach to avoid quadratic behaviour when looking for similar clauses in the 'OR' list. Moreover, we use equal() in many places: selectivity estimations, proving of fitting the index, predtest, etc. So, by reducing the clause list, we eliminate many calls of the equal() routine, too. `leftop operator rightop` the operator can also be volatile. Do we need to check (op_volatile(opno) == PROVOLATILE_VOLATILE) within transformBoolExprOr? As usual, could you provide a test case to discuss it more objectively? -- regards, Andrei Lepikhov Postgres Professional
Re: speed up a logical replica setup
Hi, On Tue, 20 Feb 2024 at 06:59, Euler Taveira wrote: > > On Mon, Feb 19, 2024, at 7:22 AM, Shlok Kyal wrote: > > I have reviewed the v21 patch. And found an issue. > > Initially I started the standby server with a new postgresql.conf file > (not the default postgresql.conf that is present in the instance). > pg_ctl -D ../standby start -o "-c config_file=/new_path/postgresql.conf" > > And I have made 'max_replication_slots = 1' in new postgresql.conf and > made 'max_replication_slots = 0' in the default postgresql.conf file. > Now when we run pg_createsubscriber on standby we get error: > pg_createsubscriber: error: could not set replication progress for the > subscription "pg_createsubscriber_5_242843": ERROR: cannot query or > manipulate replication origin when max_replication_slots = 0 > > > That's by design. See [1]. The max_replication_slots parameter is used as the > maximum number of subscriptions on the server. > > NOTICE: dropped replication slot "pg_createsubscriber_5_242843" on publisher > pg_createsubscriber: error: could not drop publication > "pg_createsubscriber_5" on database "postgres": ERROR: publication > "pg_createsubscriber_5" does not exist > pg_createsubscriber: error: could not drop replication slot > "pg_createsubscriber_5_242843" on database "postgres": ERROR: > replication slot "pg_createsubscriber_5_242843" does not exist > > > That's a bug and should be fixed. > > [1] > https://www.postgresql.org/docs/current/runtime-config-replication.html#RUNTIME-CONFIG-REPLICATION-SUBSCRIBER > I think you misunderstood the issue I reported. 
My main concern is that the standby server is using different 'postgresql.conf' file (the default file) after : + /* Start subscriber and wait until accepting connections */ + pg_log_info("starting the subscriber"); + if (!dry_run) + start_standby_server(pg_ctl_path, opt.subscriber_dir, server_start_log); But we initially started the standby server (before running the pg_createsubscriber) with a new postgresql.conf file (different from the default file. for this example created inside the 'new_path' folder). pg_ctl -D ../standby -l standby.log start -o "-c config_file=../new_path/postgresql.conf" So, when we run the pg_createsubscriber, all the initial checks will be run on the standby server started using the new postgresql.conf file. But during pg_createsubscriber, it will restart the standby server using the default postgresql.conf file. And this might create some issues. Thanks and Regards, Shlok Kyal
Re: Optimize planner memory consumption for huge arrays
On 19/2/2024 20:47, Tomas Vondra wrote: On 9/8/23 07:11, Lepikhov Andrei wrote: Just for comparison, without partitioning:

elems      1      1E1      1E2     1E3      1E4
master:    12kB   14kB     37kB    266kB    2.5MB
patched:   12kB   11.5kB   13kB    24kB     141kB

These improvements look pretty nice, considering how simple the patch seems to be. I can't even imagine how much memory we'd need with even more partitions (say, 1000) if 100 partitions means 274MB. BTW when releasing memory in scalararraysel, wouldn't it be good to also free the elem_values/elem_nulls? I haven't tried and maybe it's not that significant amount. Agree. Added into the next version of the patch. Moreover, I see a slight planning speedup. Looking into the reason for that, I discovered that it is because sometimes the planner utilizes the same memory piece for the next array element. It finds this piece more quickly than before that optimization. -- regards, Andrei Lepikhov Postgres Professional

From 2e89dc8b743953068174c777d7a014e1ea71f659 Mon Sep 17 00:00:00 2001
From: "Andrey V. Lepikhov"
Date: Tue, 20 Feb 2024 11:05:51 +0700
Subject: [PATCH] Utilize memory in scalararraysel in a more optimal manner.

When estimating selectivity of an array operation, free the memory right after working out a single element. This reduces peak memory consumption on massive arrays.
---
 src/backend/utils/adt/selfuncs.c | 26 ++++++++++++++++----------
 1 file changed, 16 insertions(+), 10 deletions(-)

diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index cea777e9d4..e5bad75ec1 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -281,6 +281,7 @@ eqsel_internal(PG_FUNCTION_ARGS, bool negate)
 		selec = var_eq_non_const(&vardata, operator, collation, other,
 								 varonleft, negate);

+	pfree(other);
 	ReleaseVariableStats(vardata);

 	return selec;
@@ -1961,15 +1962,15 @@ scalararraysel(PlannerInfo *root,
 		{
 			List	   *args;
 			Selectivity s2;
-
-			args = list_make2(leftop,
-							  makeConst(nominal_element_type,
-										-1,
-										nominal_element_collation,
-										elmlen,
-										elem_values[i],
-										elem_nulls[i],
-										elmbyval));
+			Const	   *c = makeConst(nominal_element_type,
+									  -1,
+									  nominal_element_collation,
+									  elmlen,
+									  elem_values[i],
+									  elem_nulls[i],
+									  elmbyval);
+
+			args = list_make2(leftop, c);
 			if (is_join_clause)
 				s2 = DatumGetFloat8(FunctionCall5Coll(&oprselproc,
 													  clause->inputcollid,
@@ -1985,7 +1986,8 @@ scalararraysel(PlannerInfo *root,
 													  ObjectIdGetDatum(operator),
 													  PointerGetDatum(args),
 													  Int32GetDatum(varRelid)));
-
+			list_free(args);
+			pfree(c);
 			if (useOr)
 			{
 				s1 = s1 + s2 - s1 * s2;
@@ -2004,6 +2006,9 @@ scalararraysel(PlannerInfo *root,
 		if ((useOr ? isEquality : isInequality) &&
 			s1disjoint >= 0.0 && s1disjoint <= 1.0)
 			s1 = s1disjoint;
+
+		pfree(elem_values);
+		pfree(elem_nulls);
 	}
 	else if (rightop && IsA(rightop, ArrayExpr) &&
 			 !((ArrayExpr *) rightop)->multidims)
@@ -2052,6 +2057,7 @@ scalararraysel(PlannerInfo *root,
 													  ObjectIdGetDat
Re: POC, WIP: OR-clause support for indexes
On Mon, Feb 19, 2024 at 4:35 PM Andrei Lepikhov wrote: > > In attachment - v17 for both patches. As I see it, the only general > explanation of the idea is not addressed. I'm not sure how deeply we > should explain it. > On Tue, Nov 28, 2023 at 5:04 AM Robert Haas wrote: > > On Mon, Nov 27, 2023 at 3:02 AM Andrei Lepikhov > wrote: > > On 25/11/2023 08:23, Alexander Korotkov wrote: > > > I think patch certainly gets better in this aspect. One thing I can't > > > understand is why do we use home-grown code for resolving > > > hash-collisions. You can just define custom hash and match functions > > > in HASHCTL. Even if we need to avoid repeated JumbleExpr calls, we > > > still can save pre-calculated hash value into hash entry and use > > > custom hash and match. This doesn't imply us to write our own > > > collision-resolving code. > > > > Thanks, it was an insightful suggestion. > > I implemented it, and the code has become shorter (see attachment). > > Neither the code comments nor the commit message really explain the > design idea here. That's unfortunate, principally because it makes > review difficult. > > I'm very skeptical about the idea of using JumbleExpr for any part of > this. It seems fairly expensive, and it might produce false matches. > If expensive is OK, then why not just use equal()? If it's not, then > this probably isn't really OK either. But in any case there should be > comments explaining why this strategy was chosen. The above message (https://postgr.es/m/CA%2BTgmoZCgP6FrBQEusn4yaWm02XU8OPeoEMk91q7PRBgwaAkFw%40mail.gmail.com) seems still not answered. How can we evaluate whether JumbleExpr is expensive or not? I used this naive script to test, but didn't find a big difference when enable_or_transformation is ON or OFF. 
`
create table test_1_100 as (
    select (random()*1000)::int x, (random()*1000) y
    from generate_series(1, 1_000_000) i);

explain(costs off, analyze)
select * from test
where x = 1 or x + 2 = 3 or x + 3 = 4 or x + 4 = 5 or x + 5 = 6
   or x + 6 = 7 or x + 7 = 8 or x + 8 = 9 or x + 9 = 10 or x + 10 = 11
   or x + 11 = 12 or x + 12 = 13 or x + 13 = 14 or x + 14 = 15 or x + 15 = 16
   or x + 16 = 17 or x + 17 = 18 or x + 18 = 19 or x + 19 = 20 or x + 20 = 21
   or x + 21 = 22 or x + 22 = 23 or x + 23 = 24 or x + 24 = 25 or x + 25 = 26
   or x + 26 = 27 or x + 27 = 28 or x + 28 = 29 or x + 29 = 30 or x + 30 = 31
\watch i=0.1 c=10
`

`leftop operator rightop` the operator can also be volatile. Do we need to check (op_volatile(opno) == PROVOLATILE_VOLATILE) within transformBoolExprOr?
Re: Synchronizing slots from primary to standby
On Tue, Feb 20, 2024 at 8:25 AM Masahiko Sawada wrote: > > Some comments not related to the patch but to the existing code: > > --- > It might have already been discussed but is the > src/backend/replication/logical the right place for the slotsync.c? If > it's independent of logical decoding/replication, is under > src/backend/replication could be more appropriate? > This point has not been discussed, so thanks for raising it. I think the reasoning behind keeping it in logical is that this file contains the code for logical slot syncing and a worker doing that. As it is mostly about logical slot syncing so there is an argument to keep it under logical. In the future, we may need to extend this functionality to have a per-db slot sync worker as well in which case it will probably be again somewhat related to logical slots. Having said that, there is an argument to keep it under replication as well because the functionality it provides is for physical standbys. -- With Regards, Amit Kapila.
Re: Synchronizing slots from primary to standby
On Tue, Feb 20, 2024 at 8:25 AM Masahiko Sawada wrote: > > > I've reviewed the v91 patch. Here are random comments: Thanks for the comments. > --- > /* > * Checks the remote server info. > * > - * We ensure that the 'primary_slot_name' exists on the remote server and the > - * remote server is not a standby node. > + * Check whether we are a cascading standby. For non-cascading standbys, it > + * also ensures that the 'primary_slot_name' exists on the remote server. > */ > > IIUC what the validate_remote_info() does doesn't not change by this > patch, so the previous comment seems to be clearer to me. > > --- > if (remote_in_recovery) > + { > + /* > +* If we are a cascading standby, no need to check further for > +* 'primary_slot_name'. > +*/ > ereport(ERROR, > errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > errmsg("cannot synchronize replication slots from a > standby server")); > + } > + else > + { > + boolprimary_slot_valid; > > - primary_slot_valid = DatumGetBool(slot_getattr(tupslot, 2, &isnull)); > - Assert(!isnull); > + primary_slot_valid = DatumGetBool(slot_getattr(tupslot, 2, &isnull)); > + Assert(!isnull); > > - if (!primary_slot_valid) > - ereport(ERROR, > - errcode(ERRCODE_INVALID_PARAMETER_VALUE), > - errmsg("bad configuration for slot synchronization"), > - /* translator: second %s is a GUC variable name */ > - errdetail("The replication slot \"%s\" specified by > \"%s\" does not exist on the primary server.", > - PrimarySlotName, "primary_slot_name")); > + if (!primary_slot_valid) > + ereport(ERROR, > + errcode(ERRCODE_INVALID_PARAMETER_VALUE), > + errmsg("bad configuration for slot synchronization"), > + /* translator: second %s is a GUC variable name */ > + errdetail("The replication slot \"%s\" specified > by \"%s\" does not exist on the primary server.", > + PrimarySlotName, "primary_slot_name")); > + } > > I think it's a refactoring rather than changes required by the > slotsync worker. 
We can do that in a separate patch but why do we need > this change in the first place? In v90, this refactoring was made due to the fact that validate_remote_info() was supposed to behave differently for SQL function and slot-sync worker. SQL-function was supposed to ERROR out while the worker was supposed to LOG and become no-op. And thus the change was needed. In v91, we made this functionality same i.e. both sql function and worker will error out but missed to remove this refactoring. Thanks for catching this, I will revert it in the next version. To match the refactoring, I made the comment change too, will revert that as well. > --- > +ValidateSlotSyncParams(ERROR); > + > /* Load the libpq-specific functions */ > load_file("libpqwalreceiver", false); > > -ValidateSlotSyncParams(); > +(void) CheckDbnameInConninfo(); > > Is there any reason why we move where to check the parameters? Earlier DBname verification was done inside ValidateSlotSyncParams() and thus it was needed to load 'libpqwalreceiver' before we call this function. Now we have moved dbname verification in a separate call and thus the above order got changed. ValidateSlotSyncParams() is a common function used by SQL function and worker. Earlier slot sync worker was checking all the GUCs after starting up and was exiting each time any GUC was invalid. It was suggested that it would be better to check the GUCs before starting the slot sync worker in the postmaster itself, making the ValidateSlotSyncParams() move to postmaster (see SlotSyncWorkerAllowed). But it was not a good idea to load libpq in postmaster and thus we moved libpq related verification out of ValidateSlotSyncParams(). This resulted in the above change. I hope it answers your query. thanks Shveta
Re: POC, WIP: OR-clause support for indexes
On 19/2/2024 19:53, Ranier Vilela wrote: v17-0002 1) move the vars *arrayconst and *dest, to after if, to avoid makeNode (palloc). + Const *arrayconst; + ScalarArrayOpExpr *dest; + + pd = (PredicatesData *) lfirst(lc); + if (pd->elems == NIL) + /* The index doesn't participate in this operation */ + continue; + arrayconst = lsecond_node(Const, saop->args); + dest = makeNode(ScalarArrayOpExpr); Thanks for the review! I'm not sure I understand you clearly. Does the patch in attachment fix the issue you raised? -- regards, Andrei Lepikhov Postgres Professional diff --git a/src/backend/optimizer/path/indxpath.c b/src/backend/optimizer/path/indxpath.c index 56b04541db..1545876e2f 100644 --- a/src/backend/optimizer/path/indxpath.c +++ b/src/backend/optimizer/path/indxpath.c @@ -1284,7 +1284,7 @@ build_paths_for_SAOP(PlannerInfo *root, RelOptInfo *rel, RestrictInfo *rinfo, ArrayType *arrayval = NULL; ArrayExpr *arr = NULL; Const *arrayconst = lsecond_node(Const, saop->args); - ScalarArrayOpExpr *dest = makeNode(ScalarArrayOpExpr); + ScalarArrayOpExpr dest; pd = (PredicatesData *) lfirst(lc); if (pd->elems == NIL) @@ -1302,10 +1302,10 @@ build_paths_for_SAOP(PlannerInfo *root, RelOptInfo *rel, RestrictInfo *rinfo, arr->multidims = false; /* Compose new SAOP, partially covering the source one */ - memcpy(dest, saop, sizeof(ScalarArrayOpExpr)); - dest->args = list_make2(linitial(saop->args), arr); + memcpy(&dest, saop, sizeof(ScalarArrayOpExpr)); + dest.args = list_make2(linitial(saop->args), arr); - clause = (Expr *) estimate_expression_value(root, (Node *) dest); + clause = (Expr *) estimate_expression_value(root, (Node *) &dest); /* * Create new RestrictInfo. It maybe more heavy than just copy node,
Re: Patch: Add parse_type Function
On Feb 19, 2024, at 21:58, Erik Wienhold wrote: > See the patch I wrote for my benchmarks. But it's pretty easy anyway to > cut down parse_type() ;) LOL, I missed that, just wrote it myself in the last hour. :-) v6 attached. > But you don't actually need reformat_type() in pgTAP. You can just get > the type OID and modifier of the want_type and have_type and compare > those. Then use format_type() for the error message. Looks a bit > cleaner to me than doing the string comparison. Fair. > On second thought, I guess comparing the reformatted type names is > necessary in order to have a uniform API on older Postgres releases > where pgTAP has to provide its own to_regtypmod() based on typmodin > functions. Maybe. Worth playing with. >> For the latter, it could easily be an example in the docs. > > Can be mentioned right under format_type(). Well I included it in the to_regtypemod docs here, but could do either. Best, David v6-0001-Add-to_regtypemod-SQL-function.patch Description: Binary data
Re: Patch: Add parse_type Function
On 2024-02-19 23:59 +0100, David E. Wheeler wrote: > On Feb 19, 2024, at 15:47, Tom Lane wrote: > > >> 1. Add a to_regtypmod() for those who just want the typemod. > > > > Seems like there's a good case for doing that. > > I’ll work on that. See the patch I wrote for my benchmarks. But it's pretty easy anyway to cut down parse_type() ;) > > I'm less thrilled about that, mainly because I can't think of a good > > name for it ("format_type_string" is certainly not that). Is the > > use-case for this functionality really strong enough that we need to > > provide it as a single function rather than something assembled out > > of spare parts? > > Not essential for pgTAP, no, as we can just use the parts. It was the > typmod bit that was missing. But you don't actually need reformat_type() in pgTAP. You can just get the type OID and modifier of the want_type and have_type and compare those. Then use format_type() for the error message. Looks a bit cleaner to me than doing the string comparison. On second thought, I guess comparing the reformatted type names is necessary in order to have a uniform API on older Postgres releases where pgTAP has to provide its own to_regtypmod() based on typmodin functions. > On Feb 19, 2024, at 17:13, Tom Lane wrote: > > > After some time ruminating, a couple of possibilities occurred to > > me: reformat_type(text) canonical_type_name(text) > > I was just thinking about hitting the thesaurus entry for “canonical”, > but I quite like reformat_type. It’s super clear and draws the > parallel to format_type() more clearly. Will likely steal the name for > pgTAP if we don’t add it to core. :-) > > > I'm still unconvinced about that, though. > > I guess the questions are: > > * What are the other use cases for it? Can't think of any right now other than this are-type-names-the-same check. Perhaps also for pretty-printing user-provided type names where we don't take the type info from e.g. pg_attribute. > * How obvious is it how to do it? 
> > For the latter, it could easily be an example in the docs. Can be mentioned right under format_type(). -- Erik
Re: Synchronizing slots from primary to standby
On Mon, Feb 19, 2024 at 9:59 PM shveta malik wrote: > > On Mon, Feb 19, 2024 at 5:32 PM Amit Kapila wrote: > > > > Few comments on 0001 > > Thanks for the feedback. > > > > > 1. I think it is better to error out when the valid GUC or option is > > not set in ensure_valid_slotsync_params() and > > ensure_valid_remote_info() instead of waiting. And we shouldn't start > > the worker in the first place if not all required GUCs are set. This > > will additionally simplify the code a bit. > > Sure, removed 'ensure' functions. Moved the validation checks to the > postmaster before starting the slot sync worker. > > > 2. > > +typedef struct SlotSyncWorkerCtxStruct > > { > > - /* prevents concurrent slot syncs to avoid slot overwrites */ > > + pid_t pid; > > + bool stopSignaled; > > bool syncing; > > + time_t last_start_time; > > slock_t mutex; > > -} SlotSyncCtxStruct; > > +} SlotSyncWorkerCtxStruct; > > > > I think we don't need to change the name of this struct as this can be > > used both by the worker and the backend. We can probably add the > > comment to indicate that all the fields except 'syncing' are used by > > slotsync worker. > > Modified. > > > 3. Similar to above, function names like SlotSyncShmemInit() shouldn't > > be changed to SlotSyncWorkerShmemInit(). > > Modified. > > > 4. > > +ReplSlotSyncWorkerMain(int argc, char *argv[]) > > { > > ... > > + on_shmem_exit(slotsync_worker_onexit, (Datum) 0); > > ... > > + before_shmem_exit(slotsync_failure_callback, PointerGetDatum(wrconn)); > > ... > > } > > > > Do we need two separate callbacks? Can't we have just one (say > > slotsync_failure_callback) that cleans additional things in case of > > slotsync worker? And, if we need both those callbacks then please add > > some comments for both and why one is before_shmem_exit() and the > > other is on_shmem_exit(). > > I think we can merge these now. 
Earlier 'on_shmem_exit' was needed to > avoid race-condition between startup and slot sync worker process to > drop 'i' slots on promotion. Now we do not have any such scenario. > But I need some time to analyze it well. Will do it in the next > version. > > > In addition to the above, I have made a few changes in the comments > > and code (cosmetic code changes). Please include those in the next > > version if you find them okay. You need to rename .txt file to remove > > .txt and apply atop v90-0001*. > > Sure, included these. > > Please find the patch001 attached. I've reviewed the v91 patch. Here are some random comments: --- /* * Checks the remote server info. * - * We ensure that the 'primary_slot_name' exists on the remote server and the - * remote server is not a standby node. + * Check whether we are a cascading standby. For non-cascading standbys, it + * also ensures that the 'primary_slot_name' exists on the remote server. */ IIUC, what validate_remote_info() does isn't changed by this patch, so the previous comment seems clearer to me. --- if (remote_in_recovery) + { + /* +* If we are a cascading standby, no need to check further for +* 'primary_slot_name'.
+*/ ereport(ERROR, errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("cannot synchronize replication slots from a standby server")); + } + else + { + boolprimary_slot_valid; - primary_slot_valid = DatumGetBool(slot_getattr(tupslot, 2, &isnull)); - Assert(!isnull); + primary_slot_valid = DatumGetBool(slot_getattr(tupslot, 2, &isnull)); + Assert(!isnull); - if (!primary_slot_valid) - ereport(ERROR, - errcode(ERRCODE_INVALID_PARAMETER_VALUE), - errmsg("bad configuration for slot synchronization"), - /* translator: second %s is a GUC variable name */ - errdetail("The replication slot \"%s\" specified by \"%s\" does not exist on the primary server.", - PrimarySlotName, "primary_slot_name")); + if (!primary_slot_valid) + ereport(ERROR, + errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("bad configuration for slot synchronization"), + /* translator: second %s is a GUC variable name */ + errdetail("The replication slot \"%s\" specified by \"%s\" does not exist on the primary server.", + PrimarySlotName, "primary_slot_name")); + } I think it's a refactoring rather than changes required by the slotsync worker. We can do that in a separate patch but why do we need this change in the first place? --- +ValidateSlotSyncParams(ERROR); + /* Load the libpq-specific functions */ load_file("libpqwalreceiver", false); -ValidateSlotSyncParams(); +(void) CheckDbnameInConninfo(); Is there any reason why we move where to check the parameters? Some comments not related to the patch but to the existing code: --- It might have alre
Re: Memory consumed by paths during partitionwise join planning
On 19/2/2024 19:25, Ashutosh Bapat wrote: On Fri, Feb 16, 2024 at 8:42 AM Andrei Lepikhov wrote: Live example: right now, I am working on code like MSSQL has - a combination of NestLoop and HashJoin paths and switching between them in real time. It requires both paths to be in the path list at the moment the extension gets involved. Even if one of them isn't referenced from the upper pathlist, it may still be helpful for the extension. There is no guarantee that every path presented to add_path will be preserved. Suboptimal paths are freed as and when add_path discovers that they are suboptimal. So I don't think an extension can rely on the existence of a path. But having a refcount makes it easy to preserve the required paths by referencing them. I don't insist; I'm just describing my use case. It would be ideal if you could provide some external routines for extensions that allow keeping a path in the pathlist even when it has a terrible cost estimate. About partitioning: as I discovered while looking into planning issues connected to partitions, the painful problem is the rule by which we try the full range of possible paths for each partition. With indexes, this quickly increases the optimization work. IMO, a 'symmetrical' approach could help here: restrict the scope of possible paths for upcoming partitions based on which paths were filtered out for previously planned partitions. filter or free? Filter. I meant that Postgres tries to apply IndexScan, BitmapScan, IndexOnlyScan, and other strategies across the partition indexes, and the optimizer spends a lot of time doing that. So why not introduce a symmetrical strategy and exclude some scan types from the search based on the path-building experience with previous partitions of the same table: if you have dozens of partitions, is it beneficial for the system to find a slightly more optimal IndexScan on one partition while having SeqScans on the 999 others?
IIUC, you are suggesting that instead of planning each partition/partitionwise join, we only create paths with the strategies which were found to be optimal with previous partitions. That's a good heuristic, but it won't work if partition properties - statistics, indexes, etc. - differ between groups of partitions. Sure, but the "symmetry" strategy assumes that at the scale of a thousand partitions, especially with a parallel append involved, finding a slightly suboptimal path for a small subset of partitions doesn't cause noticeable performance degradation. Does that make sense? As I see it, when people use 10-100 partitions for a table, they usually strive to keep indexes symmetrical across all partitions. -- regards, Andrei Lepikhov Postgres Professional
Support boolcol IS [NOT] UNKNOWN in partition pruning
While working on 4c2369ac5, I noticed there's close to as much code to disallow BooleanTests in the form of "IS UNKNOWN" and "IS NOT UNKNOWN" in partition pruning as it would take to allow pruning to work for these. The attached makes it work. David From b9f3ff909652c96f1f0dced9e1165ffa8c93c7f1 Mon Sep 17 00:00:00 2001 From: David Rowley Date: Tue, 20 Feb 2024 14:56:49 +1300 Subject: [PATCH v1] Support partition pruning on boolcol IS [NOT] UNKNOWN While working on 4c2369ac5, I noticed we went out of our way not to support clauses on boolean partitioned tables in the form of "IS UNKNOWN" and "IS NOT UNKNOWN". It's almost as much code to disallow this as it is to allow it, so let's allow it. --- src/backend/partitioning/partprune.c | 68 +++ src/test/regress/expected/partition_prune.out | 56 --- src/test/regress/sql/partition_prune.sql | 7 ++ 3 files changed, 91 insertions(+), 40 deletions(-) diff --git a/src/backend/partitioning/partprune.c b/src/backend/partitioning/partprune.c index ccb9a579b3..5382c24fb9 100644 --- a/src/backend/partitioning/partprune.c +++ b/src/backend/partitioning/partprune.c @@ -200,7 +200,7 @@ static PartClauseMatchStatus match_boolean_partition_clause(Oid partopfamily, Expr *clause, Expr *partkey, Expr **outconst, - bool *noteq); + bool *notclause); static void partkey_datum_from_expr(PartitionPruneContext *context, Expr *expr, int stateidx, Datum *value, bool *isnull); @@ -1798,13 +1798,14 @@ match_clause_to_partition_key(GeneratePruningStepsContext *context, Oid partopfamily = part_scheme->partopfamily[partkeyidx], partcoll = part_scheme->partcollation[partkeyidx]; Expr *expr; - boolnoteq; + boolnotclause; /* * Recognize specially shaped clauses that match a Boolean partition key. 
*/ boolmatchstatus = match_boolean_partition_clause(partopfamily, clause, - partkey, &expr, ¬eq); + partkey, &expr, + ¬clause); if (boolmatchstatus == PARTCLAUSE_MATCH_CLAUSE) { @@ -1818,7 +1819,7 @@ match_clause_to_partition_key(GeneratePruningStepsContext *context, * punt it off to gen_partprune_steps_internal() to generate pruning * steps. */ - if (noteq) + if (notclause) { List *new_clauses; List *or_clause; @@ -1836,9 +1837,8 @@ match_clause_to_partition_key(GeneratePruningStepsContext *context, else { /* -* We only expect match_boolean_partition_clause to match for -* IS_NOT_TRUE and IS_NOT_FALSE. IS_NOT_UNKNOWN is not -* supported. +* We only expect match_boolean_partition_clause to return +* PARTCLAUSE_MATCH_CLAUSE for IS_NOT_TRUE and IS_NOT_FALSE. */ Assert(false); } @@ -1876,6 +1876,15 @@ match_clause_to_partition_key(GeneratePruningStepsContext *context, return PARTCLAUSE_MATCH_CLAUSE; } + else if (boolmatchstatus == PARTCLAUSE_MATCH_NULLNESS) + { + /* +* Handle IS UNKNOWN and IS NOT UNKNOWN. These just logically +* translate to IS NULL and IS NOT NULL. +*/ + *clause_is_not_null = notclause; + return PARTCLAUSE_MATCH_NULLNESS; + } else if (IsA(clause, OpExpr) && list_length(((OpExpr *) clause)->args) == 2) { @@ -3650,21 +3659,21 @@ perform_pruning_combine_step(PartitionPruneConte
Re: Thoughts about NUM_BUFFER_PARTITIONS
Hi Heikki Linnakangas, Searching the git log, I found this commit: https://github.com/postgres/postgres/commit/3acc10c997f916f6a741d0b4876126b7b08e3892 , but I don't see an email discussing it. As the commit log tells us, we don't know exactly how large a value is optimal, and I believe it's more flexible to make it a parameter. Thank you very much, tomas.vondra, for explaining the relationship; I see that MAX_SIMUL_LWLOCKS was just doubled in this commit - is there a more appropriate ratio between them? ``` commit 3acc10c997f916f6a741d0b4876126b7b08e3892 Author: Robert Haas Date: Thu Oct 2 13:58:50 2014 -0400 Increase the number of buffer mapping partitions to 128. Testing by Amit Kapila, Andres Freund, and myself, with and without other patches that also aim to improve scalability, seems to indicate that this change is a significant win over the current value and over smaller values such as 64. It's not clear how high we can push this value before it starts to have negative side-effects elsewhere, but going this far looks OK. ``` wenhui qiu wrote on Tue, 20 Feb 2024 at 09:36: > Hi Japlin Li >Thank you for such important information ! Got it > > Japin Li wrote on Mon, 19 Feb 2024 at 10:26: > >> >> On Mon, 19 Feb 2024 at 00:56, Tomas Vondra >> wrote: >> > On 2/18/24 03:30, Li Japin wrote: >> >> >> >> I find it seems need to change MAX_SIMUL_LWLOCKS if we enlarge the >> NUM_BUFFER_PARTITIONS, >> >> I didn’t find any comments to describe the relation between >> MAX_SIMUL_LWLOCKS and >> >> NUM_BUFFER_PARTITIONS, am I missing someghing? >> > >> > IMHO the relationship is pretty simple - MAX_SIMUL_LWLOCKS needs to be >> > higher than NUM_BUFFER_PARTITIONS, so that the backend can acquire all >> > the partition locks if needed. >> > >> >> Thanks for the explanation! Got it. >> >> > There's other places that acquire a bunch of locks, and all of them need >> > to be careful not to exceed MAX_SIMUL_LWLOCKS. For example gist has >> > GIST_MAX_SPLIT_PAGES. >> > >> >
Re: Thoughts about NUM_BUFFER_PARTITIONS
Hi Japlin Li, Thank you for such important information! Got it. Japin Li wrote on Mon, 19 Feb 2024 at 10:26: > > On Mon, 19 Feb 2024 at 00:56, Tomas Vondra > wrote: > > On 2/18/24 03:30, Li Japin wrote: > >> > >> I find it seems need to change MAX_SIMUL_LWLOCKS if we enlarge the > NUM_BUFFER_PARTITIONS, > >> I didn’t find any comments to describe the relation between > MAX_SIMUL_LWLOCKS and > >> NUM_BUFFER_PARTITIONS, am I missing someghing? > > > > IMHO the relationship is pretty simple - MAX_SIMUL_LWLOCKS needs to be > > higher than NUM_BUFFER_PARTITIONS, so that the backend can acquire all > > the partition locks if needed. > > > > Thanks for the explanation! Got it. > > > There's other places that acquire a bunch of locks, and all of them need > > to be careful not to exceed MAX_SIMUL_LWLOCKS. For example gist has > > GIST_MAX_SPLIT_PAGES. > > > > > > regards >
Re: speed up a logical replica setup
On Mon, Feb 19, 2024, at 7:22 AM, Shlok Kyal wrote: > I have reviewed the v21 patch. And found an issue. > > Initially I started the standby server with a new postgresql.conf file > (not the default postgresql.conf that is present in the instance). > pg_ctl -D ../standby start -o "-c config_file=/new_path/postgresql.conf" > > And I have made 'max_replication_slots = 1' in new postgresql.conf and > made 'max_replication_slots = 0' in the default postgresql.conf file. > Now when we run pg_createsubscriber on standby we get error: > pg_createsubscriber: error: could not set replication progress for the > subscription "pg_createsubscriber_5_242843": ERROR: cannot query or > manipulate replication origin when max_replication_slots = 0 That's by design. See [1]. The max_replication_slots parameter is used as the maximum number of subscriptions on the server. > NOTICE: dropped replication slot "pg_createsubscriber_5_242843" on publisher > pg_createsubscriber: error: could not drop publication > "pg_createsubscriber_5" on database "postgres": ERROR: publication > "pg_createsubscriber_5" does not exist > pg_createsubscriber: error: could not drop replication slot > "pg_createsubscriber_5_242843" on database "postgres": ERROR: > replication slot "pg_createsubscriber_5_242843" does not exist That's a bug and should be fixed. [1] https://www.postgresql.org/docs/current/runtime-config-replication.html#RUNTIME-CONFIG-REPLICATION-SUBSCRIBER -- Euler Taveira EDB https://www.enterprisedb.com/
Re: serial not accepted as datatype in ALTER TABLE ... ALTER COLUMN
Ashutosh Bapat writes: > On Sun, Feb 18, 2024 at 1:59 PM Andy Fan wrote: >> >> >> I tried your idea with the attatchment, it is still in a drafted state >> but it can be used as a prove-of-concept and for better following >> communicating. Just one point needs to metion is serial implies >> "default value" + "not null" constaint. So when we modify a column into >> serial, we need to modify the 'NULL value' and only to the default value >> at the RewriteTable stage. >> > > I am surprised that this requires changes in ReWrite. I thought adding > NOT NULL constraint and default value commands would be done by > transformColumnDefinition(). But I haven't looked at the patch close > enough. Hmm, I think this depends on how to handle the NULL values before the RewriteTable stage. Consider an example like this: \pset null (null) create table t(a int); insert into t select 1; insert into t select; postgres=# select * from t; a 1 (null) (2 rows) Since the serial type implies "not null" + "default value", shall we raise an error or fill the value with the "default" value? The patch chose the latter way, which needs changes at the RewriteTable stage, but now I think raising an error directly is an option as well. -- Best Regards Andy Fan
Re: Have pg_basebackup write "dbname" in "primary_conninfo"?
On Tue, 20 Feb 2024 at 00:34, Ian Lawrence Barwick wrote: > With the addition of "pg_sync_replication_slots()", there is now a use-case > for > including "dbname" in "primary_conninfo" and the docs have changed from > stating [1]: > > Do not specify a database name in the primary_conninfo string. > > to [2]: > > For replication slot synchronization (see Section 48.2.3), it is also > necessary to specify a valid dbname in the primary_conninfo string. This > will > only be used for slot synchronization. It is ignored for streaming. > > [1] > https://www.postgresql.org/docs/16/runtime-config-replication.html#RUNTIME-CONFIG-REPLICATION-STANDBY > [2] > https://www.postgresql.org/docs/devel/runtime-config-replication.html#RUNTIME-CONFIG-REPLICATION-STANDBY Sounds like that documentation should be updated in the same way as was done for pg_basebackup/pg_receivewal in commit cca97ce6a665. When considering middleware/proxies having dbname in there can be useful even for older PG versions. > I can't see any reason for continuing to omit "dbname", so suggest it should > only continue to be omitted for 16 and earlier; see attached patch. Yeah, that seems like a good change. Though, I'm wondering if the version check is actually necessary.
Re: Fix race condition in InvalidatePossiblyObsoleteSlot()
On Mon, Feb 19, 2024 at 09:49:24AM +, Bertrand Drouvot wrote: > On Mon, Feb 19, 2024 at 01:45:16PM +0530, Bharath Rupireddy wrote: >> Prefix 'initial_' makes the variable names a bit longer, I think we >> can just use effective_xmin, catalog_effective_xmin and restart_lsn, >> the code updating then only when if (!terminated) tells one that they >> aren't updated every time. > > I'm not sure about that. I prefer to see meaningfull names instead of having > to read the code where they are used. Prefixing these with "initial_" is fine, IMO. That shows the intention that these come from the slot's data before doing the termination. So I'm OK with what's been proposed in v3. >> This needs a bit more info as to why and how effective_xmin, >> catalog_effective_xmin and restart_lsn can move ahead after signaling >> a backend and before the signalled backend reports back. > > I'm not sure of the added value of such extra details in this comment and if > the comment would be easy to maintain. I've the feeling that knowing it's > possible > is enough here. Happy to hear what others think about it too. Documenting that the risk exists rather than all the possible reasons why this could happen is surely more maintainable. In short, I'm OK with what the patch does, just telling that it is possible. >> +Assert(!(conflict_prev != RS_INVAL_NONE && terminated && >> + conflict_prev != conflict)); >> >> It took a while for me to understand the above condition, can we >> simplify it like below using De Morgan's laws for better readability? >> >> Assert((conflict_prev == RS_INVAL_NONE) || !terminated || >> (conflict_prev == conflict)); > > I don't have a strong opinon on this, looks like a matter of taste. Both are the same to me, so I have no extra opinion to offer. ;) -- Michael signature.asc Description: PGP signature
Have pg_basebackup write "dbname" in "primary_conninfo"?
Hi With the addition of "pg_sync_replication_slots()", there is now a use-case for including "dbname" in "primary_conninfo" and the docs have changed from stating [1]: Do not specify a database name in the primary_conninfo string. to [2]: For replication slot synchronization (see Section 48.2.3), it is also necessary to specify a valid dbname in the primary_conninfo string. This will only be used for slot synchronization. It is ignored for streaming. [1] https://www.postgresql.org/docs/16/runtime-config-replication.html#RUNTIME-CONFIG-REPLICATION-STANDBY [2] https://www.postgresql.org/docs/devel/runtime-config-replication.html#RUNTIME-CONFIG-REPLICATION-STANDBY However, when setting up a standby (with the intent of creating a logical standby) with pg_basebackup, providing the -R/--write-recovery-conf option results in a "primary_conninfo" string being generated without a "dbname" parameter, which requires a manual change should one intend to use "pg_sync_replication_slots()". I can't see any reason for continuing to omit "dbname", so I suggest it should only continue to be omitted for 16 and earlier; see attached patch. Note that this does mean that if the conninfo string provided to pg_basebackup does not contain "dbname", the generated "primary_conninfo" GUC will default to "dbname=replication". It would be easy enough to suppress this, but AFAICS there's no way to tell if it was explicitly supplied by the user, in which case it would be surprising if it were omitted from the final "primary_conninfo" string. The only other place where GenerateRecoveryConfig() is called is pg_rewind; I can't see any adverse effects from the proposed change. But it's perfectly possible there's something blindingly obvious I'm overlooking.
Regards Ian Barwick diff --git a/src/fe_utils/recovery_gen.c b/src/fe_utils/recovery_gen.c index 2585f11939..de463f3956 100644 --- a/src/fe_utils/recovery_gen.c +++ b/src/fe_utils/recovery_gen.c @@ -49,12 +49,16 @@ GenerateRecoveryConfig(PGconn *pgconn, char *replication_slot) { /* Omit empty settings and those libpqwalreceiver overrides. */ if (strcmp(opt->keyword, "replication") == 0 || - strcmp(opt->keyword, "dbname") == 0 || strcmp(opt->keyword, "fallback_application_name") == 0 || (opt->val == NULL) || (opt->val != NULL && opt->val[0] == '\0')) continue; + /* Omit "dbname" in PostgreSQL 16 and earlier. */ + if (strcmp(opt->keyword, "dbname") == 0 && + PQserverVersion(pgconn) < MINIMUM_VERSION_FOR_DBNAME_PARAM) + continue; + /* Separate key-value pairs with spaces */ if (conninfo_buf.len != 0) appendPQExpBufferChar(&conninfo_buf, ' '); diff --git a/src/include/fe_utils/recovery_gen.h b/src/include/fe_utils/recovery_gen.h index ca2c4800d0..740d90c9d8 100644 --- a/src/include/fe_utils/recovery_gen.h +++ b/src/include/fe_utils/recovery_gen.h @@ -20,6 +20,12 @@ */ #define MINIMUM_VERSION_FOR_RECOVERY_GUC 12 +/* + * from version 17, there is a use-case for adding the dbname parameter + * to the generated "primary_conninfo" value + */ +#define MINIMUM_VERSION_FOR_DBNAME_PARAM 17 + extern PQExpBuffer GenerateRecoveryConfig(PGconn *pgconn, char *replication_slot); extern void WriteRecoveryConfig(PGconn *pgconn, char *target_dir,
Re: speed up a logical replica setup
On Mon, Feb 19, 2024, at 6:47 AM, Peter Eisentraut wrote: > Some review of the v21 patch: Thanks for checking. > - commit message > > Mention pg_createsubscriber in the commit message title. That's the > most important thing that someone doing git log searches in the future > will be looking for. Right. Done. > - doc/src/sgml/ref/allfiles.sgml > > Move the new entry to alphabetical order. Done. > > - doc/src/sgml/ref/pg_createsubscriber.sgml > > + > + The pg_createsubscriber should be run at > the target > + server. The source server (known as publisher server) should accept > logical > + replication connections from the target server (known as subscriber > server). > + The target server should accept local logical replication connection. > + > > "should" -> "must" ? Done. > + > + Options > > Sort options alphabetically. Done. > It would be good to indicate somewhere which options are mandatory. I'll add this information in the option description. AFAICT the synopsis kind of indicates it. > + > + Examples > > I suggest including a pg_basebackup call into this example, so it's > easier for readers to get the context of how this is supposed to be > used. You can add that pg_basebackup in this example is just an example > and that other base backups can also be used. We can certainly add it, but isn't creating a standby out of scope here? I will make sure to include references to pg_basebackup and the "Setting up a Standby Server" section. > - doc/src/sgml/reference.sgml > > Move the new entry to alphabetical order. Done. > - src/bin/pg_basebackup/Makefile > > Move the new sections to alphabetical order. Done. > - src/bin/pg_basebackup/meson.build > > Move the new sections to alphabetical order. Done. > > - src/bin/pg_basebackup/pg_createsubscriber.c > > +typedef struct CreateSubscriberOptions > +typedef struct LogicalRepInfo > > I think these kinds of local-use struct don't need to be typedef'ed. > (Then you also don't need to update typdefs.list.) Done.
> +static void > +usage(void) > > Sort the options alphabetically. Are you referring to s/options/functions/? > +static char * > +get_exec_path(const char *argv0, const char *progname) > > Can this not use find_my_exec() and find_other_exec()? It is indeed using it. I created this function because it needs to run the same code path twice (pg_ctl and pg_resetwal). > +int > +main(int argc, char **argv) > > Sort the options alphabetically (long_options struct, getopt_long() > argument, switch cases). Done. > - .../t/040_pg_createsubscriber.pl > - .../t/041_pg_createsubscriber_standby.pl > > These two files could be combined into one. Done. > +# Force it to initialize a new cluster instead of copying a > +# previously initdb'd cluster. > > Explain why? Ok. It needs a new cluster because it will have a different system identifier so we can make sure the target cluster is a copy of the source server. > +$node_s->append_conf( > + 'postgresql.conf', qq[ > +log_min_messages = debug2 > > Is this setting necessary for the test? No. It is here as a debugging aid. Better to include it in a separate patch. There are a few messages that I don't intend to include in the final patch. All of these modifications will be included in the next patch. I'm finishing to integrate patches proposed by Hayato [1] and some additional fixes and refactors. [1] https://www.postgresql.org/message-id/TYCPR01MB12077A8421685E5515DE408EEF5512%40TYCPR01MB12077.jpnprd01.prod.outlook.com -- Euler Taveira EDB https://www.enterprisedb.com/
Re: Injection points: some tools to wait and wake
On Mon, Feb 19, 2024 at 02:28:04PM +, Bertrand Drouvot wrote: > +CREATE FUNCTION injection_points_wake() > > what about injection_points_wakeup() instead? Sure. > +/* Shared state information for injection points. */ > +typedef struct InjectionPointSharedState > +{ > + /* protects accesses to wait_counts */ > + slock_t lock; > + > + /* Counter advancing when injection_points_wake() is called */ > + int wait_counts; > > If slock_t protects "only" one counter, then what about using pg_atomic_uint64 > or pg_atomic_uint32 instead? And btw do we need wait_counts at all? (see > comment > number 4) We could, indeed, even if we use more than one counter. Still, I would be tempted to keep it in case more data is added to this structure as that would make for less stuff to do while being normally non-critical. This sentence may sound pedantic, though.. > + * SQL function for waking a condition variable > > waking up instead? Okay. > I'm wondering if this loop and wait_counts are needed, why not doing something > like (and completely get rid of wait_counts)? > > " > ConditionVariablePrepareToSleep(&inj_state->wait_point); > ConditionVariableSleep(&inj_state->wait_point, injection_wait_event); > ConditionVariableCancelSleep(); > " > > It's true that the comment above ConditionVariableSleep() mentions that: Perhaps not, but it encourages good practices around the use of condition variables, and we need to track all that in shared memory anyway. Ashutosh has argued in favor of the approach taken by the patch in the original thread when I've sent a version doing exactly what you are saying now to not track a state in shmem. -- Michael signature.asc Description: PGP signature
Re: Injection points: some tools to wait and wake
On Mon, Feb 19, 2024 at 11:54:20AM +0300, Andrey M. Borodin wrote: > 1. injection_points_wake() will wake all of waiters. But it's not > suitable for complex tests. I think there must be a way to wake only > specific waiter by injection point name. I don't disagree with that, but I don't have a strong argument for implementing that until there is an explicit need for it in core. It is also possible to plug in your own module, outside of core, if you are looking for something more specific. The backend APIs allow that. > 2. Alexander Korotkov's stopevents could be used in isolation > tests. This kind of tests is perfect for describing complex race > conditions. (as a side note, I'd be happy if we could have > primary\standby in isolation tests too) This requires plugging is more into src/test/isolation/, with multiple connection strings. This has been suggested in the past. > 3. Can we have some Perl function for this? > +# Wait until the checkpointer is in the middle of the restart point > +# processing, relying on the custom wait event generated in the > +# wait callback used in the injection point previously attached. > +ok( $node_standby->poll_query_until( > + 'postgres', > + qq[SELECT count(*) FROM pg_stat_activity > + WHERE backend_type = 'checkpointer' AND wait_event = > 'injection_wait' ;], > + '1'), > + 'checkpointer is waiting in restart point' > +) or die "Timed out while waiting for checkpointer to run restart point"; > > Perhaps something like > $node->do_a_query_and_wait_for_injection_point_observed(sql,injection_point_name); I don't see why not. But I'm not sure how much I need to plug in into the main modules yet. > 4. Maybe I missed it, but I'd like to see a guideline on how to name > injection points. I don't think we have any of that, or even that we need one. In short, I'm OK to be more consistent with the choice of ginbtree.c and give priority to it and make it more official in the docs. > 5. 
In many cases we need to have injection point under critical > section. I propose to have a "prepared injection point". See [0] for > example in v2-0003-Test-multixact-CV-sleep.patch > + INJECTION_POINT_PREPARE("GetNewMultiXactId-done"); > + > START_CRIT_SECTION(); > > + INJECTION_POINT_RUN_PREPARED(); I don't see how that's different from a wait/wake logic? The only thing you've changed is to stop a wait when a point is detached and you want to make the stop conditional. Plugging in a condition variable is more flexible than a hardcoded sleep in terms of wait, while being more responsive. > 7. This is extremely distant, but some DBMSs allow to enable > injection points by placing files on the filesystem. That would > allow to test something during recovery when no SQL interface is > present. Yeah, I could see why you'd want to do something like that. If a use case pops up, that can surely be discussed. -- Michael signature.asc Description: PGP signature
Re: 035_standby_logical_decoding unbounded hang
On Fri, Feb 16, 2024 at 06:37:38AM +, Bertrand Drouvot wrote: > On Thu, Feb 15, 2024 at 12:48:16PM -0800, Noah Misch wrote: > > On Wed, Feb 14, 2024 at 03:31:16PM +, Bertrand Drouvot wrote: > > > What about creating a sub, say wait_for_restart_lsn_calculation() in > > > Cluster.pm > > > and then make use of it in create_logical_slot_on_standby() and above? > > > (something > > > like wait_for_restart_lsn_calculation-v1.patch attached). > > > > Waiting for restart_lsn is just a prerequisite for calling > > pg_log_standby_snapshot(), so I wouldn't separate those two. > > Yeah, makes sense. > > > If we're > > extracting a sub, I would move the pg_log_standby_snapshot() call into the > > sub > > and make the API like one of these: > > > > $standby->wait_for_subscription_starting_point($primary, $slot_name); > > $primary->log_standby_snapshot($standby, $slot_name); > > > > Would you like to finish the patch in such a way? > > Sure, please find attached standby-slot-test-2-race-v2.patch doing so. I used > log_standby_snapshot() as it seems more generic to me. Thanks. Pushed at commit 0e16281.
Re: Avoid switching between system-user and system-username in the doc
On Mon, Feb 19, 2024 at 06:00:11PM +0100, Jelte Fennema-Nio wrote:
> On Mon, 19 Feb 2024 at 09:52, Bertrand Drouvot wrote:
>> Please find attached a tiny patch to clean that up.
>
> LGTM

Looks like a mistake from me in efb6f4a4f9b6, will fix and backpatch for consistency.
--
Michael
Re: Streaming read-ready sequential scan code
On Mon, Jan 29, 2024 at 4:17 PM Melanie Plageman wrote: > > There is an outstanding question about where to allocate the > PgStreamingRead object for sequential scans I've written three alternative implementations of the actual streaming read user for sequential scan which handle the question of where to allocate the streaming read object and how to handle changing scan direction in different ways. Option A) https://github.com/melanieplageman/postgres/tree/seqscan_pgsr_initscan_allocation - Allocates the streaming read object in initscan(). Since we do not know the scan direction at this time, if the scan ends up not being a forwards scan, the streaming read object must later be freed -- so this will sometimes allocate a streaming read object it never uses. - Only supports ForwardScanDirection and once the scan direction changes, streaming is never supported again -- even if we return to ForwardScanDirection - Must maintain a "fallback" codepath that does not use the streaming read API Option B) https://github.com/melanieplageman/postgres/tree/seqscan_pgsr_heapgettup_alloc_forward_only - Allocates the streaming read object in heapgettup[_pagemode]() when it has not been previously allocated. To do this it has to record and switch into a different memory context than the per-tuple context. It only allocates the streaming read object if it is a forwards scan. It frees the streaming read object if the scan direction is later changed. - Only supports ForwardScanDirection and once the scan direction changes, streaming is never supported again -- even if we return to ForwardScanDirection - Must maintain a "fallback" codepath that does not use the streaming read API Option C) https://github.com/melanieplageman/postgres/tree/seqscan_pgsr_all_dir_stream - Allocates the streaming read object in heapgettup[_pagemode]() when it has not been previously allocated. To do this it has to record and switch into a different memory context than the per-tuple context. 
- All scan directions support streaming. To do this, the scan direction has to be tracked and we must check if the direction has changed on every heapgettup[_pagemode]() invocation to avoid returning wrong results.
- No "fallback" codepath as all heap sequential scans will use the streaming read API

As you can see, each option has pros and cons. I'm interested in what others think about which we should choose.

- Melanie
Re: Possible to trigger autovacuum?
On Mon, Feb 19, 2024 at 03:15:29PM -0600, Chris Cleveland wrote:
> Is it possible to launch an autovacuum from within an extension?
>
> I'm developing an index access method. After the index gets built it needs
> some cleanup and optimization. I'd prefer to do this in the
> amvacuumcleanup() method so it can happen periodically and asynchronously.
>
> I could fire up a background worker to do the job, but it would be a lot
> simpler to call please_launch_autovacuum_right_now();

The autovacuum launcher can be stopped in its nap with signals, like a
SIGHUP. So you could rely on that to force a job to happen on a given
database based on the timing you're aiming for.
--
Michael
Re: Patch: Add parse_type Function
On Feb 19, 2024, at 15:47, Tom Lane wrote: >> 1. Add a to_regtypmod() for those who just want the typemod. > > Seems like there's a good case for doing that. I’ll work on that. > I'm less thrilled about that, mainly because I can't think of > a good name for it ("format_type_string" is certainly not that). > Is the use-case for this functionality really strong enough that > we need to provide it as a single function rather than something > assembled out of spare parts? Not essential for pgTAP, no, as we can just use the parts. It was the typmod bit that was missing. On Feb 19, 2024, at 17:13, Tom Lane wrote: > After some time ruminating, a couple of possibilities occurred to > me: > reformat_type(text) > canonical_type_name(text) I was just thinking about hitting the thesaurus entry for “canonical”, but I quite like reformat_type. It’s super clear and draws the parallel to format_type() more clearly. Will likely steal the name for pgTAP if we don’t add it to core. :-) > I'm still unconvinced about that, though. I guess the questions are: * What are the other use cases for it? * How obvious is it how to do it? For the latter, it could easily be an example in the docs. Best, David
Re: Add last_commit_lsn to pg_stat_database
On Mon, Feb 19, 2024 at 10:26:43AM +0100, Tomas Vondra wrote: > On 2/19/24 07:57, Michael Paquier wrote: > > On Sun, Feb 18, 2024 at 02:28:06AM +0100, Tomas Vondra wrote: >>> 1) Do we really need to modify RecordTransactionCommitPrepared and >>> XactLogCommitRecord to return the LSN of the commit record? Can't we >>> just use XactLastRecEnd? It's good enough for advancing >>> replorigin_session_origin_lsn, why wouldn't it be good enough for this? >> >> Or XactLastCommitEnd? > > But that's not set in RecordTransactionCommitPrepared (or twophase.c in > general), and the patch seems to need that. Hmm. Okay. > > The changes in AtEOXact_PgStat() are not really > > attractive for what's a static variable in all the backends. > > I'm sorry, I'm not sure what "changes not being attractive" means :-( It just means that I am not much a fan of the signature changes done for RecordTransactionCommit() and AtEOXact_PgStat_Database(), adding a dependency to a specify LSN. Your suggestion to switching to XactLastRecEnd should avoid that. > - stats_fetch_consistency=cache: always the same min/max value > > - stats_fetch_consistency=none: min != max > > Which would suggest you're right and this should be VOLATILE and not > STABLE. But then again - the existing pg_stat_get_db_* functions are all > marked as STABLE, so (a) is that correct and (b) why should this > function be marked different? This can look like an oversight of 5891c7a8ed8f to me. I've skimmed through the threads related to this commit and messages around [1] explain why this GUC exists and why we have both behaviors, but I'm not seeing a discussion about the volatibility of these functions. The current situation is not bad either, the default makes these functions stable, and I'd like to assume that users of this GUC know what they do. Perhaps Andres or Horiguchi-san can comment on that. https://www.postgresql.org/message-id/382908.1616343...@sss.pgh.pa.us -- Michael signature.asc Description: PGP signature
Re: Patch: Add parse_type Function
I wrote:
> I'm less thrilled about that, mainly because I can't think of
> a good name for it ("format_type_string" is certainly not that).

After some time ruminating, a couple of possibilities occurred to me:

    reformat_type(text)
    canonical_type_name(text)

> Is the use-case for this functionality really strong enough that
> we need to provide it as a single function rather than something
> assembled out of spare parts?

I'm still unconvinced about that, though.

			regards, tom lane
Re: Optimize planner memory consumption for huge arrays
Tomas Vondra writes: > On 2/19/24 16:45, Tom Lane wrote: >> Tomas Vondra writes: >>> For example, I don't think we expect selectivity functions to allocate >>> long-lived objects, right? So maybe we could run them in a dedicated >>> memory context, and reset it aggressively (after each call). >> That could eliminate a whole lot of potential leaks. I'm not sure >> though how much it moves the needle in terms of overall planner >> memory consumption. > I'm not sure about that either, maybe not much - for example it would > not help with the two other memory usage patches (which are related to > SpecialJoinInfo and RestrictInfo, outside selectivity functions). > It was an ad hoc thought, inspired by the issue at hand. Maybe it would > be possible to find similar "boundaries" in other parts of the planner. Here's a quick and probably-incomplete implementation of that idea. I've not tried to study its effects on memory consumption, just made sure it passes check-world. The main hazard here is that something invoked inside clause selectivity might try to cache a data structure for later use. However, there are already places that do that kind of thing, and they already explicitly switch into the planner_cxt, because otherwise they fail under GEQO. (If we do find places that need fixing for this, they were probably busted under GEQO already.) Perhaps it's worth updating the comments at those places, but I didn't bother in this first cut. 
regards, tom lane diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c index c949dc1866..669acd7315 100644 --- a/src/backend/optimizer/path/clausesel.c +++ b/src/backend/optimizer/path/clausesel.c @@ -24,6 +24,7 @@ #include "statistics/statistics.h" #include "utils/fmgroids.h" #include "utils/lsyscache.h" +#include "utils/memutils.h" #include "utils/selfuncs.h" /* @@ -693,6 +694,7 @@ clause_selectivity_ext(PlannerInfo *root, Selectivity s1 = 0.5; /* default for any unhandled clause type */ RestrictInfo *rinfo = NULL; bool cacheable = false; + MemoryContext saved_cxt; if (clause == NULL) /* can this still happen? */ return s1; @@ -756,6 +758,21 @@ clause_selectivity_ext(PlannerInfo *root, clause = (Node *) rinfo->clause; } + /* + * Run all the remaining work in the short-lived planner_tmp_cxt, which + * we'll reset at the bottom. This allows selectivity-related code to not + * be too concerned about leaking memory. + */ + saved_cxt = MemoryContextSwitchTo(root->planner_tmp_cxt); + + /* + * This function can be called recursively, in which case we'd better not + * reset planner_tmp_cxt until we exit the topmost level. Use of + * planner_tmp_cxt_depth also makes it safe for other places to use and + * reset planner_tmp_cxt in the same fashion. 
+ */ + root->planner_tmp_cxt_depth++; + if (IsA(clause, Var)) { Var *var = (Var *) clause; @@ -967,6 +984,12 @@ clause_selectivity_ext(PlannerInfo *root, rinfo->outer_selec = s1; } + /* Exit and clean up the planner_tmp_cxt */ + MemoryContextSwitchTo(saved_cxt); + Assert(root->planner_tmp_cxt_depth > 0); + if (--root->planner_tmp_cxt_depth == 0) + MemoryContextReset(root->planner_tmp_cxt); + #ifdef SELECTIVITY_DEBUG elog(DEBUG4, "clause_selectivity: s1 %f", s1); #endif /* SELECTIVITY_DEBUG */ diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c index be4e182869..44b9555dbf 100644 --- a/src/backend/optimizer/plan/planner.c +++ b/src/backend/optimizer/plan/planner.c @@ -643,6 +643,17 @@ subquery_planner(PlannerGlobal *glob, Query *parse, root->plan_params = NIL; root->outer_params = NULL; root->planner_cxt = CurrentMemoryContext; + /* a subquery can share the main query's planner_tmp_cxt */ + if (parent_root) + { + root->planner_tmp_cxt = parent_root->planner_tmp_cxt; + Assert(parent_root->planner_tmp_cxt_depth == 0); + } + else + root->planner_tmp_cxt = AllocSetContextCreate(root->planner_cxt, + "Planner temp context", + ALLOCSET_DEFAULT_SIZES); + root->planner_tmp_cxt_depth = 0; root->init_plans = NIL; root->cte_plan_ids = NIL; root->multiexpr_params = NIL; @@ -6514,6 +6525,10 @@ plan_cluster_use_sort(Oid tableOid, Oid indexOid) root->glob = glob; root->query_level = 1; root->planner_cxt = CurrentMemoryContext; + root->planner_tmp_cxt = AllocSetContextCreate(root->planner_cxt, + "Planner temp context", + ALLOCSET_DEFAULT_SIZES); + root->planner_tmp_cxt_depth = 0; root->wt_param_id = -1; root->join_domains = list_make1(makeNode(JoinDomain)); @@ -6552,7 +6567,10 @@ plan_cluster_use_sort(Oid tableOid, Oid indexOid) * trust the index contents but use seqscan-and-sort. */ if (lc == NULL)/* not in the list? 
*/ + { + MemoryContextDelete(root->planner_tmp_cxt); return true; /* use sort */ + } /* * Rather than doing all the pushups that would be needed to use @@ -658
Possible to trigger autovacuum?
Is it possible to launch an autovacuum from within an extension? I'm developing an index access method. After the index gets built it needs some cleanup and optimization. I'd prefer to do this in the amvacuumcleanup() method so it can happen periodically and asynchronously. I could fire up a background worker to do the job, but it would be a lot simpler to call please_launch_autovacuum_right_now(); -- Chris Cleveland 312-339-2677 mobile
Re: Patch: Add parse_type Function
"David E. Wheeler" writes: > The only way I can think of to avoid that is to: > 1. Add a to_regtypmod() for those who just want the typemod. Seems like there's a good case for doing that. > 2. Add a format_type_string() function that returns a string, the equivalent > of this function but in C: > CREATE FUNCTION format_type_string(text) RETURNS TEXT AS $$ >SELECT format_type(to_regtype($1), to_regtypmod($1)) > $$; I'm less thrilled about that, mainly because I can't think of a good name for it ("format_type_string" is certainly not that). Is the use-case for this functionality really strong enough that we need to provide it as a single function rather than something assembled out of spare parts? regards, tom lane
Re: Why is pq_begintypsend so slow?
Hi, On 2024-02-19 10:02:52 +0900, Sutou Kouhei wrote: > In <20240218200906.zvihkrs46yl2j...@awork3.anarazel.de> > "Re: Why is pq_begintypsend so slow?" on Sun, 18 Feb 2024 12:09:06 -0800, > Andres Freund wrote: > > >> [1] > >> https://www.postgresql.org/message-id/flat/20240215.153421.96888103784986803.kou%40clear-code.com#34df359e6d255795d16814ce138cc995 > > >> Do we need one FunctionCallInfoBaseData for each attribute? > >> How about sharing one FunctionCallInfoBaseData by all > >> attributes like [1]? > > > > That makes no sense to me. You're throwing away most of the possible gains > > by > > having to update the FunctionCallInfo fields on every call. You're saving > > neglegible amounts of memory at a substantial runtime cost. > > The number of updated fields of your approach and [1] are > same: > > Your approach: 6 (I think that "fcinfo->isnull = false" is > needed though.) > > + fcinfo->args[0].value = CStringGetDatum(string); > + fcinfo->args[0].isnull = false; > + fcinfo->args[1].value = > ObjectIdGetDatum(attr->typioparam); > + fcinfo->args[1].isnull = false; > + fcinfo->args[2].value = > Int32GetDatum(attr->typmod); > + fcinfo->args[2].isnull = false; > > [1]: 6 (including "fcinfo->isnull = false") > > + fcinfo->flinfo = flinfo; > + fcinfo->context = escontext; > + fcinfo->isnull = false; > + fcinfo->args[0].value = CStringGetDatum(str); > + fcinfo->args[1].value = ObjectIdGetDatum(typioparam); > + fcinfo->args[2].value = Int32GetDatum(typmod); If you want to do so you can elide the isnull assignments in my approach just as well as yours. Assigning not just the value but also flinfo and context is overhead. But you can't elide assigning flinfo and context, which is why reusing one FunctionCallInfo isn't going to win I don't think you necessarily need to assign fcinfo->isnull on every call, these functions aren't allowed to return NULL IIRC. And if they do we'd error out, so it could only happen once. 
> I don't have a strong opinion how to implement this > optimization as I said above. It seems that you like your > approach. So I withdraw [1]. Could you complete this > optimization? Can we proceed making COPY format extensible > without this optimization? It seems that it'll take a little > time to complete this optimization because your patch is > still WIP. And it seems that you can work on it after making > COPY format extensible. I don't think optimizing this aspect needs to block making copy extensible. I don't know how much time/energy I'll have to focus on this in the near term. I really just reposted this because the earlier patches were relevant for the discussion in another thread. If you want to pick the COPY part up, feel free. Greetings, Andres Freund
Re: PGC_SIGHUP shared_buffers?
Hi, On 2024-02-19 13:54:01 -0500, Joe Conway wrote: > On 2/19/24 13:13, Andres Freund wrote: > > On 2024-02-19 09:19:16 -0500, Joe Conway wrote: > > > Couldn't we scale the rounding, e.g. allow small allocations as we do now, > > > but above some number always round? E.g. maybe >= 2GB round to the nearest > > > 256MB, >= 4GB round to the nearest 512MB, >= 8GB round to the nearest 1GB, > > > etc? > > > > That'd make the translation considerably more expensive. Which is important, > > given how common an operation this is. > > > Perhaps it is not practical, doesn't help, or maybe I misunderstand, but my > intent was that the rounding be done/enforced when setting the GUC value > which surely cannot be that often. It'd be used for something like WhereIsTheChunkOfBuffers[buffer/CHUNK_SIZE]+(buffer%CHUNK_SIZE)*BLCKSZ; If CHUNK_SIZE isn't a compile time constant this gets a good bit more expensive. A lot more, if implemented naively (i.e as actual modulo/division operations, instead of translating to shifts and masks). Greetings, Andres Freund
Re: Proposal: Adjacent B-Tree index
On Mon, 19 Feb 2024 at 18:48, Dilshod Urazov wrote:
>
> - Motivation
>
> A regular B-tree index provides efficient mapping of key values to tuples
> within a table. However, if you have two tables connected in some way, a
> regular B-tree index may not be efficient enough. In this case, you would
> need to create an index for each table. The purpose will become clearer if we
> consider a simple example which is the main use-case as I see it.

I'm not sure why two indexes are not sufficient here. PostgreSQL can already do merge joins, which would have the same effect of hitting the same location in the index at the same time between all tables, without the additional overhead of having to scan two tables' worth of indexes in VACUUM.

> During the vacuum of A an index tuple pointing to a dead tuple in A should be
> cleaned as well as all index tuples for the same key.

This is definitely not correct. If I have this schema

table test (id int primary key, b text unique)
table test_ref(test_id int references test(id))

and if an index would contain entries for both test and test_ref, it can't just remove all test_ref entries because an index entry with the same id was removed: the entry could've been removed because (e.g.) test's b column was updated, thus inserting a new index entry for the new HOT-chain's TID.

> would suffice for this new semantics.

With the provided explanation I don't think this is a great idea.

Kind regards,

Matthias van de Meent.
Re: PGC_SIGHUP shared_buffers?
On 2/19/24 13:13, Andres Freund wrote: On 2024-02-19 09:19:16 -0500, Joe Conway wrote: Couldn't we scale the rounding, e.g. allow small allocations as we do now, but above some number always round? E.g. maybe >= 2GB round to the nearest 256MB, >= 4GB round to the nearest 512MB, >= 8GB round to the nearest 1GB, etc? That'd make the translation considerably more expensive. Which is important, given how common an operation this is. Perhaps it is not practical, doesn't help, or maybe I misunderstand, but my intent was that the rounding be done/enforced when setting the GUC value which surely cannot be that often. -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: PGC_SIGHUP shared_buffers?
Hi, On 2024-02-19 09:19:16 -0500, Joe Conway wrote: > On 2/18/24 15:35, Andres Freund wrote: > > On 2024-02-18 17:06:09 +0530, Robert Haas wrote: > > > How many people set shared_buffers to something that's not a whole > > > number of GB these days? > > > > I'd say the vast majority of postgres instances in production run with less > > than 1GB of s_b. Just because numbers wise the majority of instances are > > running on small VMs and/or many PG instances are running on one larger > > machine. There are a lot of instances where the total available memory is > > less than 2GB. > > > > > I mean I bet it happens, but in practice if you rounded to the nearest GB, > > > or even the nearest 2GB, I bet almost nobody would really care. I think > > > it's > > > fine to be opinionated here and hold the line at a relatively large > > > granule, > > > even though in theory people could want something else. > > > > I don't believe that at all unfortunately. > > Couldn't we scale the rounding, e.g. allow small allocations as we do now, > but above some number always round? E.g. maybe >= 2GB round to the nearest > 256MB, >= 4GB round to the nearest 512MB, >= 8GB round to the nearest 1GB, > etc? That'd make the translation considerably more expensive. Which is important, given how common an operation this is. Greetings, Andres Freund
Proposal: Adjacent B-Tree index
- Motivation A regular B-tree index provides efficient mapping of key values to tuples within a table. However, if you have two tables connected in some way, a regular B-tree index may not be efficient enough. In this case, you would need to create an index for each table. The purpose will become clearer if we consider a simple example which is the main use-case as I see it. - Example We need to store a graph. So we create a table for nodes CREATE TABLE Nodes ( id SERIAL PRIMARY KEY, label VARCHAR(255) ); and a table for edges CREATE TABLE Edges ( label VARCHAR(255), source INTEGER REFERENCES Nodes(id), target INTEGER REFERENCES Nodes(id) ); In order to efficiently traverse a graph we would have an index for Nodes.id which created automatically in this case and an index for Edges.source. - Tweaked B-Tree We could optimize cases like the former by modifying PostgreSQL btree index to allow it to index 2 tables simultaneously. Semantically it would be UNIQUE index for attribute x of table A and an index for attribute y in table B. In the non-deduplicated case index tuple pointing to a tuple in A should be marked by a flag. In the deduplicated case first TID in an array can be interpreted as a link to A. During the vacuum of A an index tuple pointing to a dead tuple in A should be cleaned as well as all index tuples for the same key. We can reach this by clearing all index tuples after the dead one until we come to index tuple marked by a flag with different key. Or we can enforce deduplication in such index. In the example with a graph such index would provide PRIMARY KEY for Nodes and the index for Edges.Source. The query SELECT * FROM Nodes WHERE id = X; will use this index and take into account only a TID in Nodes (so this would be marked index tuple or first TID in a posting list). The query SELECT * FROM Edges WHERE source = X; conversely will ignore links to Nodes. 
-- Syntax

I believe that

CREATE TABLE Nodes (
    id SERIAL PRIMARY KEY ADJACENT,
    label VARCHAR(255)
);

CREATE TABLE Edges (
    label VARCHAR(255),
    source INTEGER REFERENCES ADJACENT Nodes(id),
    target INTEGER REFERENCES Nodes(id)
);

would suffice for this new semantics.

--
Dilshod Urazov
Re: Optimize planner memory consumption for huge arrays
On 2/19/24 16:45, Tom Lane wrote: > Tomas Vondra writes: >> Considering there are now multiple patches improving memory usage during >> planning with partitions, perhaps it's time to take a step back and >> think about how we manage (or rather not manage) memory during query >> planning, and see if we could improve that instead of an infinite >> sequence of ad hoc patches? > > +1, I've been getting an itchy feeling about that too. I don't have > any concrete proposals ATM, but I quite like your idea here: > >> For example, I don't think we expect selectivity functions to allocate >> long-lived objects, right? So maybe we could run them in a dedicated >> memory context, and reset it aggressively (after each call). > > That could eliminate a whole lot of potential leaks. I'm not sure > though how much it moves the needle in terms of overall planner > memory consumption. I'm not sure about that either, maybe not much - for example it would not help with the two other memory usage patches (which are related to SpecialJoinInfo and RestrictInfo, outside selectivity functions). It was an ad hoc thought, inspired by the issue at hand. Maybe it would be possible to find similar "boundaries" in other parts of the planner. I keep thinking about how compilers/optimizers typically have separate optimizations passes, maybe that's something we might leverage ... > I've always supposed that the big problem was data structures > associated with rejected Paths, but I might be wrong. Is there some > simple way we could get a handle on where the most memory goes while > planning? > I suspect this might have changed thanks to partitioning - it's not a coincidence most of the recent memory usage improvements address cases with many partitions. 
As for how to analyze the memory usage - maybe there's a simpler way, but what I did recently was adding simple instrumentation into memory contexts, recording pointer/size/backtrace for each request, and then a script that aggregated that into a "currently allocated" report with information about "source" of the allocation. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Avoid switching between system-user and system-username in the doc
On Mon, 19 Feb 2024 at 09:52, Bertrand Drouvot wrote: > Please find attached a tiny patch to clean that up. LGTM
Re: Patch: Add parse_type Function
On Feb 18, 2024, at 15:55, Erik Wienhold wrote:
>> The overhead of parse_type_and_format can be related to higher planning
>> time. PL/pgSQL can assign composite without usage FROM clause.
>
> Thanks, didn't know that this makes a difference. In that case both
> variants are on par.

Presumably this is a side-effect of the use of a RECORD here, which requires a FROM clause in pure SQL, yes? The only way I can think of to avoid that is to:

1. Add a to_regtypmod() for those who just want the typmod.

2. Add a format_type_string() function that returns a string, the equivalent of this function but in C:

CREATE FUNCTION format_type_string(text) RETURNS TEXT AS $$
    SELECT format_type(to_regtype($1), to_regtypmod($1))
$$ LANGUAGE SQL;

We could also keep the record-returning function for users who want both the regtype and the regtypmod at once, but with the other two I consider it less essential.

Thoughts?

Best,

David
Re: JIT compilation per plan node
Hi Melih, On 1/2/24 20:50, Melih Mutlu wrote: > Hi hackers, > > After discussing this with David offlist, I decided to reinitiate this > discussion that has already been raised and discussed several times in the > past. [1] [2] > > Currently, if JIT is enabled, the decision for JIT compilation is purely > tied to the total cost of the query. The number of expressions to be JIT > compiled is not taken into consideration, however the time spent JITing > also depends on that number. This may cause the cost of JITing to become > too much that it hurts rather than improves anything. > An example case would be that you have many partitions and run a query that > touches one, or only a few, of those partitions. If there is no partition > pruning done in planning time, all 1000 partitions are JIT compiled while > most will not even be executed. > > Proposed patch (based on the patch from [1]) simply changes consideration > of JIT from plan level to per-plan-node level. Instead of depending on the > total cost, we decide whether to perform JIT on a node or not by > considering only that node's cost. This allows us to only JIT compile plan > nodes with high costs. 
> > Here is a small test case to see the issue and the benefit of the patch: > > CREATE TABLE listp(a int, b int) PARTITION BY LIST(a); > SELECT 'CREATE TABLE listp'|| x || ' PARTITION OF listp FOR VALUES IN > ('||x||');' FROM generate_Series(1,1000) x; \gexec > INSERT INTO listp SELECT 1,x FROM generate_series(1,1000) x; > > EXPLAIN (VERBOSE, ANALYZE) SELECT COUNT(*) FROM listp WHERE b < 0; > > master jit=off: > Planning Time: 25.113 ms > Execution Time: 315.896 ms > > master jit=on: > Planning Time: 24.664 ms > JIT: >Functions: 9008 >Options: Inlining false, Optimization false, Expressions true, Deforming > true >Timing: Generation 290.705 ms (Deform 108.274 ms), Inlining 0.000 ms, > Optimization 165.991 ms, Emission 3232.775 ms, Total 3689.472 ms > Execution Time: 1612.817 ms > > patch jit=on: > Planning Time: 24.055 ms > JIT: >Functions: 17 >Options: Inlining false, Optimization false, Expressions true, Deforming > true >Timing: Generation 1.463 ms (Deform 0.232 ms), Inlining 0.000 ms, > Optimization 0.766 ms, Emission 11.609 ms, Total 13.837 ms > Execution Time: 299.721 ms > Thanks for the updated / refreshed patch. I think one of the main challenges this patch faces is that there's a couple old threads with previous attempts, and the current thread simply builds on top of them, without explaining stuff fully. But people either don't realize that, or don't have time to read old threads just in case, so can't follow some of the decisions :-( I think it'd be good to maybe try to explain some of the problems and solutions more thoroughly, or at least point to the relevant places in the old threads ... > > A bit more on what this patch does: > - It introduces a new context to keep track of the number of estimated > calls and if JIT is decided for each node that the context applies. AFAIK this is an attempt to deal with passing the necessary information while constructing the plan, which David originally tried [1] doing by passing est_calls during create_plan ... 
I doubt CreatePlanContext is a great way to achieve this. For one, it breaks the long-standing custom that PlannerInfo is the first parameter, usually followed by RelOptInfo, etc. CreatePlanContext is added to some functions (but not all), which makes it ... unpredictable. FWIW it's not clear to me if/how this solves the problem with early create_plan() calls for subplans. Or is it still the same?

Wouldn't it be simpler to just build the plan as we do now, and then have an expression_tree_walker that walks the complete plan top-down, inspects the nodes, enables JIT where appropriate and so on? That can have arbitrary context, no problem with that. Considering we decide JIT pretty late anyway (long after costing and other stuff that might affect the plan selection), the result should be exactly the same, without the extensive createplan.c disruption ... (usual caveat: I haven't tried, maybe there's something that means this can't work)

> - The number of estimated calls are especially useful where a node is
> expected to be rescanned, such as Gather, Gather Merge, Memoize and Nested
> Loop. Knowing the estimated number of calls for a node allows us to rely on
> total cost multiplied by the estimated calls instead of only total cost for
> the node.

Not sure I follow. Why would these nodes (Gather, Gather Merge, ...) be more likely to be rescanned compared to other nodes?

> - For each node, the planner considers if the node should be JITed. If the
> cost of the node * the number of estimated calls is greater than
> jit_above_cost, it's decided to be JIT compiled. Note that this changes
> the meaning of jit_above_cost, it's now a threshold for a single plan node
> and not the whole query. Additionally, this change in JIT consideration is
Re: Optimize planner memory consumption for huge arrays
Tomas Vondra writes: > Considering there are now multiple patches improving memory usage during > planning with partitions, perhaps it's time to take a step back and > think about how we manage (or rather not manage) memory during query > planning, and see if we could improve that instead of an infinite > sequence of ad hoc patches? +1, I've been getting an itchy feeling about that too. I don't have any concrete proposals ATM, but I quite like your idea here: > For example, I don't think we expect selectivity functions to allocate > long-lived objects, right? So maybe we could run them in a dedicated > memory context, and reset it aggressively (after each call). That could eliminate a whole lot of potential leaks. I'm not sure though how much it moves the needle in terms of overall planner memory consumption. I've always supposed that the big problem was data structures associated with rejected Paths, but I might be wrong. Is there some simple way we could get a handle on where the most memory goes while planning? regards, tom lane
Re: numeric_big in make check?
Dean Rasheed writes:
> On 19 Feb 2024, at 12:48, Tom Lane wrote:
>> Or we could just flush it. It's never detected a bug, and I think
>> you'd find that it adds zero code coverage (or if not, we could
>> fix that in a far more surgical and less expensive manner).

> Off the top of my head, I can't say to what extent that's true, but it
> wouldn't surprise me if at least some of the tests added in the last 4
> commits to touch that file aren't covered by tests elsewhere. Indeed
> that certainly looks like the case for 18a02ad2a5. I'm sure those
> tests could be pared down though.

I thought I'd try to acquire some actual facts here, so I compared the code coverage shown by "make check" as of HEAD, versus "make check" after adding numeric_big to parallel_schedule. I saw the following lines of numeric.c as being covered in the second run and not the first:

numeric():
  1285          || !NUMERIC_IS_SHORT(num)))
  1293      new->choice.n_long.n_sign_dscale = NUMERIC_SIGN(new) |
  1294          ((uint16) dscale & NUMERIC_DSCALE_MASK);

div_var_fast():
  9185              idivisor = idivisor * NBASE + var2->digits[1];
  9186              idivisor_weight--;

sqrt_var():
  10073     res_ndigits = res_weight + 1 - (-rscale - 1) / DEC_DIGITS;

Pretty poor return on investment for the runtime consumed. I don't object to adding something to numeric.sql that's targeted at adding coverage for these (or anyplace else that's not covered), but let's not just throw cycles at the problem.

Oddly, there were a few lines in numeric_poly_combine and int8_avg_combine that were hit in the first run and not the second. Apparently our tests of parallel aggregation aren't as reproducible as one could wish.

regards, tom lane
Re: Add trim_trailing_whitespace to editorconfig file
On Fri, 16 Feb 2024 at 11:45, Peter Eisentraut wrote:
> I have committed that one.

Thanks :)

> v3-0002-Require-final-newline-in-.po-files.patch
>
> The .po files are imported from elsewhere, so I'm not sure this is going
> to have the desired effect. Perhaps it's worth cleaning up, but it
> would require more steps.

Okay, yeah, that would need to be changed at the source then. Removed this change from the newly attached patchset, as well as updating .editorconfig to have "insert_final_newline = unset" for .po files.

> v3-0003-Bring-editorconfig-in-line-with-gitattributes.patch
>
> I question whether we need to add rules to .editorconfig about files
> that are generated or imported from elsewhere, since those are not meant
> to be edited.

I agree that it's not strictly necessary to have .editorconfig match .gitattributes for files that are not meant to be edited by hand. But I don't really see a huge downside either, apart from having a few extra lines in .editorconfig. And adding these lines does have a few benefits:
1. It makes it easy to ensure that .editorconfig and .gitattributes stay in sync.
2. If someone opens a file that they are not supposed to edit by hand and then saves it, no changes are made, as opposed to suddenly introducing some whitespace changes.

Attached is a new patchset with the first commit split into three separate commits, which configure:
1. Files meant to be edited by hand
2. Output test files (maybe edited by hand)
3. Imported/autogenerated files

The first one is definitely the most useful to me personally.

From 419d2d0e45d40a4cddb942fbc63c3c54f4dd479d Mon Sep 17 00:00:00 2001 From: Jelte Fennema-Nio Date: Mon, 19 Feb 2024 16:03:31 +0100 Subject: [PATCH v4 2/5] Include test output files in .editorconfig

Most of the time these will be copy-pasted from a test run, but if there's a trivial change someone might want to make the change by hand. By adding them to .editorconfig, we make sure that doing so won't mess up the formatting.
--- .editorconfig | 9 + 1 file changed, 9 insertions(+) diff --git a/.editorconfig b/.editorconfig index 0e7c2048f9..92a16bd5de 100644 --- a/.editorconfig +++ b/.editorconfig @@ -27,3 +27,12 @@ trim_trailing_whitespace = false [src/backend/catalog/sql_features.txt] trim_trailing_whitespace = false + +# Test output files that contain extra whitespace +[*.out] +trim_trailing_whitespace = false +insert_final_newline = unset + +[src/interfaces/ecpg/test/expected/*] +trim_trailing_whitespace = false +insert_final_newline = unset -- 2.34.1 From b05173b619c45d9ad859da50bfa78d1fc626fe3d Mon Sep 17 00:00:00 2001 From: Jelte Fennema-Nio Date: Wed, 14 Feb 2024 17:19:42 +0100 Subject: [PATCH v4 1/5] Handle blank-at-eof/blank-at-eol in .editorconfig too For many files in the repo our .gitattributes file is configured to complain about trailing whitespace at the end of a line, as well as not including a final newline at the end of the file. This updates our .editorconfig file, to make editors and IDEs fix these issues automatically on save in the same way for files that are intended to be edited by hand. 
--- .editorconfig | 15 +++ 1 file changed, 15 insertions(+) diff --git a/.editorconfig b/.editorconfig index d69a3d1dc4..0e7c2048f9 100644 --- a/.editorconfig +++ b/.editorconfig @@ -1,5 +1,9 @@ root = true +[*] +trim_trailing_whitespace = true +insert_final_newline = true + [*.{c,h,l,y,pl,pm}] indent_style = tab indent_size = tab @@ -12,3 +16,14 @@ indent_size = 1 [*.xsl] indent_style = space indent_size = 2 + +# Certain data files that contain special whitespace, and other special cases +[*.data] +trim_trailing_whitespace = false +insert_final_newline = unset + +[contrib/pgcrypto/sql/pgp-armor.sql] +trim_trailing_whitespace = false + +[src/backend/catalog/sql_features.txt] +trim_trailing_whitespace = false base-commit: e1b7fde418f2c0ba4ab0d9fbfa801ef62d96397b -- 2.34.1 From 97fe71d7769a3e4191adedbfe4092afec696ea7b Mon Sep 17 00:00:00 2001 From: Jelte Fennema-Nio Date: Thu, 15 Feb 2024 10:23:19 +0100 Subject: [PATCH v4 4/5] Add note about keeping .editorconfig and .gitattributes in sync Now that they are in sync, hopefully this reminder makes sure we keep them that way. Automation to detect them being out of sync seems excessive. --- .editorconfig | 1 + .gitattributes | 1 + 2 files changed, 2 insertions(+) diff --git a/.editorconfig b/.editorconfig index c742e0f844..3f5c4e0ef8 100644 --- a/.editorconfig +++ b/.editorconfig @@ -1,3 +1,4 @@ +# IMPORTANT: When updating this file, also update .gitattributes to match. root = true [*] diff --git a/.gitattributes b/.gitattributes index e9ff4a56bd..7923fc3387 100644 --- a/.gitattributes +++ b/.gitattributes @@ -1,3 +1,4 @@ +# IMPORTANT: When updating this file, also update .editorconfig to match. * whitespace=space-before-tab,trailing-space *.[chly] whitespace=space-before-tab,trailing-space,indent-with-non-tab,tabwidth=4 *.pl whitespace=space-before-tab,tra
Use streaming read API in ANALYZE
Hi,

I worked on using the currently proposed streaming read API [1] in ANALYZE. The patch is attached. 0001 is the not-yet-merged streaming read API code changes that can be applied to master, 0002 is the actual code.

The blocks to analyze are obtained by using the streaming read API now.
- Since the streaming read API is already doing prefetching, I removed the #ifdef USE_PREFETCH code from acquire_sample_rows().
- Changed 'while (BlockSampler_HasMore(&bs))' to 'while (nblocks)' because the prefetch mechanism in the streaming read API will advance 'bs' before returning buffers.
- Removed BlockNumber and BufferAccessStrategy from the declaration of scan_analyze_next_block(), passing pgsr (PgStreamingRead) instead of them.

I counted syscalls of analyzing a ~5GB table. It can be seen that the patched version did ~1300 fewer read calls.

Patched:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 39.67    0.012128           0     29809           pwrite64
 36.96    0.011299           0     28594           pread64
 23.24    0.007104           0     27611           fadvise64

Master (21a71648d3):

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 38.94    0.016457           0     29816           pwrite64
 36.79    0.015549           0     29850           pread64
 23.91    0.010106           0     29848           fadvise64

Any kind of feedback would be appreciated.
[1]: https://www.postgresql.org/message-id/CA%2BhUKGJkOiOCa%2Bmag4BF%2BzHo7qo%3Do9CFheB8%3Dg6uT5TUm2gkvA%40mail.gmail.com -- Regards, Nazir Bilal Yavuz Microsoft From 509e55997c084f831fcbcb46cabe79d4f97aa93e Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Sat, 22 Jul 2023 17:31:54 +1200 Subject: [PATCH v1 1/2] Streaming read API changes that are not committed to master yet Discussion: https://www.postgresql.org/message-id/CA%2BhUKGJkOiOCa%2Bmag4BF%2BzHo7qo%3Do9CFheB8%3Dg6uT5TUm2gkvA%40mail.gmail.com --- src/include/storage/bufmgr.h | 22 + src/include/storage/streaming_read.h | 52 ++ src/backend/storage/Makefile | 2 +- src/backend/storage/aio/Makefile | 14 + src/backend/storage/aio/meson.build | 5 + src/backend/storage/aio/streaming_read.c | 472 ++ src/backend/storage/buffer/bufmgr.c | 592 +++ src/backend/storage/buffer/localbuf.c| 14 +- src/backend/storage/meson.build | 1 + src/tools/pgindent/typedefs.list | 2 + 10 files changed, 953 insertions(+), 223 deletions(-) create mode 100644 src/include/storage/streaming_read.h create mode 100644 src/backend/storage/aio/Makefile create mode 100644 src/backend/storage/aio/meson.build create mode 100644 src/backend/storage/aio/streaming_read.c diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h index d51d46d3353..a38f1acb37a 100644 --- a/src/include/storage/bufmgr.h +++ b/src/include/storage/bufmgr.h @@ -14,6 +14,7 @@ #ifndef BUFMGR_H #define BUFMGR_H +#include "port/pg_iovec.h" #include "storage/block.h" #include "storage/buf.h" #include "storage/bufpage.h" @@ -158,6 +159,11 @@ extern PGDLLIMPORT int32 *LocalRefCount; #define BUFFER_LOCK_SHARE 1 #define BUFFER_LOCK_EXCLUSIVE 2 +/* + * Maximum number of buffers for multi-buffer I/O functions. This is set to + * allow 128kB transfers, unless BLCKSZ and IOV_MAX imply a a smaller maximum. 
+ */ +#define MAX_BUFFERS_PER_TRANSFER Min(PG_IOV_MAX, (128 * 1024) / BLCKSZ) /* * prototypes for functions in bufmgr.c @@ -177,6 +183,18 @@ extern Buffer ReadBufferWithoutRelcache(RelFileLocator rlocator, ForkNumber forkNum, BlockNumber blockNum, ReadBufferMode mode, BufferAccessStrategy strategy, bool permanent); +extern Buffer PrepareReadBuffer(BufferManagerRelation bmr, +ForkNumber forkNum, +BlockNumber blockNum, +BufferAccessStrategy strategy, +bool *foundPtr); +extern void CompleteReadBuffers(BufferManagerRelation bmr, +Buffer *buffers, +ForkNumber forknum, +BlockNumber blocknum, +int nblocks, +bool zero_on_error, +BufferAccessStrategy strategy); extern void ReleaseBuffer(Buffer buffer); extern void UnlockReleaseBuffer(Buffer buffer); extern bool BufferIsExclusiveLocked(Buffer buffer); @@ -247,9 +265,13 @@ extern void LockBufferForCleanup(Buffer buffer); extern bool ConditionalLockBufferForCleanup(Buffer buffer); extern bool IsBufferCleanupOK(Buffer buffer); extern bool HoldingBufferPinThatDelaysRecovery(void); +extern void ZeroBuffer(Buffer buffer, ReadBufferMode mode); extern bool BgBufferSync(struct WritebackContext *wb_context); +extern void LimitAdditionalPins(uint32 *additional_pins); +extern void LimitAdditionalLocalPins(uint32 *additional_pins); + /* in buf_init.c */ extern void InitBufferPool(void); extern Size
Re: partitioning and identity column
Hello Ashutosh,

19.02.2024 15:17, Ashutosh Bapat wrote:
>> Functions ATExecAddIdentity() and ATExecDropIdentity() are recursive too, so I think they can be exploited as well.
> not just identity-related functions; many other functions in tablecmds.c have that problem, as I mentioned earlier.

Could you please name the functions you suspect, so that I can recheck them? Perhaps we should consider fixing all such functions, in light of b0f7dd915 and d57b7cc33...

Best regards, Alexander
Re: Switching XLog source from archive to streaming when primary available
On Mon, 19 Feb 2024 at 18:36, Bharath Rupireddy wrote: > On Wed, Jan 31, 2024 at 6:30 PM Bharath Rupireddy > wrote: >> >> Needed a rebase due to commit 776621a (conflict in >> src/test/recovery/meson.build for new TAP test file added). Please >> find the attached v17 patch. > > Strengthened tests a bit by using recovery_min_apply_delay to mimic > standby spending some time fetching from archive. PSA v18 patch. Here are some minor comments: [1] +primary). However, the standby exhausts all the WAL present in pg_wal s|pg_wal|pg_wal|g [2] +# Ensure checkpoint doesn't come in our way +$primary->append_conf('postgresql.conf', qq( +min_wal_size = 2MB +max_wal_size = 1GB +checkpoint_timeout = 1h + autovacuum = off +)); Keeping the same indentation might be better.
Re: Injection points: some tools to wait and wake
Hi, On Mon, Feb 19, 2024 at 04:51:45PM +0900, Michael Paquier wrote: > On Mon, Feb 19, 2024 at 03:01:40PM +0900, Michael Paquier wrote: > > 0002 is a polished version of the TAP test that makes use of this > > facility, providing coverage for the bug fixed by 7863ee4def65 > > (reverting this commit causes the test to fail), where a restart point > > runs across a promotion request. The trick is to stop the > > checkpointer in the middle of a restart point and issue a promotion > > in-between. > > The CF bot has been screaming at this one on Windows because the > process started with IPC::Run::start was not properly finished, so > attached is an updated version to bring that back to green. Thanks for the patch, that's a very cool feature! Random comments: 1 === +CREATE FUNCTION injection_points_wake() what about injection_points_wakeup() instead? 2 === +/* Shared state information for injection points. */ +typedef struct InjectionPointSharedState +{ + /* protects accesses to wait_counts */ + slock_t lock; + + /* Counter advancing when injection_points_wake() is called */ + int wait_counts; If slock_t protects "only" one counter, then what about using pg_atomic_uint64 or pg_atomic_uint32 instead? And btw do we need wait_counts at all? (see comment number 4) 3 === + * SQL function for waking a condition variable waking up instead? 4 === + for (;;) + { + int new_wait_counts; + + SpinLockAcquire(&inj_state->lock); + new_wait_counts = inj_state->wait_counts; + SpinLockRelease(&inj_state->lock); + + if (old_wait_counts != new_wait_counts) + break; + ConditionVariableSleep(&inj_state->wait_point, injection_wait_event); + } I'm wondering if this loop and wait_counts are needed, why not doing something like (and completely get rid of wait_counts)? 
" ConditionVariablePrepareToSleep(&inj_state->wait_point); ConditionVariableSleep(&inj_state->wait_point, injection_wait_event); ConditionVariableCancelSleep(); " It's true that the comment above ConditionVariableSleep() mentions that: " * This should be called in a predicate loop that tests for a specific exit * condition and otherwise sleeps " but is it needed in our particular case here? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: PGC_SIGHUP shared_buffers?
On 2/18/24 15:35, Andres Freund wrote: On 2024-02-18 17:06:09 +0530, Robert Haas wrote: How many people set shared_buffers to something that's not a whole number of GB these days? I'd say the vast majority of postgres instances in production run with less than 1GB of s_b. Just because numbers wise the majority of instances are running on small VMs and/or many PG instances are running on one larger machine. There are a lot of instances where the total available memory is less than 2GB. I mean I bet it happens, but in practice if you rounded to the nearest GB, or even the nearest 2GB, I bet almost nobody would really care. I think it's fine to be opinionated here and hold the line at a relatively large granule, even though in theory people could want something else. I don't believe that at all unfortunately. Couldn't we scale the rounding, e.g. allow small allocations as we do now, but above some number always round? E.g. maybe >= 2GB round to the nearest 256MB, >= 4GB round to the nearest 512MB, >= 8GB round to the nearest 1GB, etc? -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Speeding up COPY TO for uuids and arrays
On Sat, 2024-02-17 at 12:24 -0800, Andres Freund wrote:
> On 2024-02-17 17:48:23 +0100, Laurenz Albe wrote:
> > - Patch 0001 speeds up pq_begintypsend with a custom macro.
> > That brought the binary copy down to 3.7 seconds, which is a
> > speed gain of 15%.
>
> Nice win, but I think we can do a bit better than this. Certainly it shouldn't
> be a macro, given the multiple evaluation risks. I don't think we actually
> need to zero those bytes, in fact that seems more likely to hide bugs than
> prevent them.
>
> I have an old patch series improving performance in this area.

The multiple evaluation danger could be worked around with an automatic variable, or the macro could just carry a warning (like appendStringInfoCharMacro). But considering the thread in [1], I guess I'll retract that patch; I'm sure the more invasive improvement you have in mind will do better.

> > - Patch 0002 speeds up uuid_out by avoiding the overhead of
> > a StringInfo. This brings text mode COPY to 19.4 seconds,
> > which is a speed gain of 21%.
>
> Nice!
>
> I wonder if we should move the core part for converting to hex to numutils.c,
> we already have code for the inverse. There does seem to be further
> optimization potential in the conversion, and that seems better done somewhere
> central rather than in one type's output function. OTOH, it might not be worth
> it, given the need to add the dashes.

I thought about it, but then decided not to do that. Calling a function that converts the bytes to hex and then adding the hyphens will slow down processing, and I think the code savings would be minimal at best.

> > - Patch 0003 speeds up array_out a bit by avoiding some zero-byte
> > writes. The measured speed gain is under 2%.
>
> Makes sense.

Thanks for your interest and approval. The attached patches are the renamed, but unmodified, patches 2 and 3 from before.
Yours, Laurenz Albe [1]: https://postgr.es/m/CAMkU%3D1whbRDUwa4eayD9%2B59K-coxO9senDkPRbTn3cg0pUz4AQ%40mail.gmail.com From de87d249e3f55c38062e563821af9fa32d15e341 Mon Sep 17 00:00:00 2001 From: Laurenz Albe Date: Sat, 17 Feb 2024 17:23:39 +0100 Subject: [PATCH v1 2/3] Speed up uuid_out Since the size of the string representation of an uuid is fixed, there is no benefit in using a StringInfo. Avoiding the overhead makes the function substantially faster. --- src/backend/utils/adt/uuid.c | 14 -- 1 file changed, 8 insertions(+), 6 deletions(-) diff --git a/src/backend/utils/adt/uuid.c b/src/backend/utils/adt/uuid.c index 73dfd711c7..b48bbf01e1 100644 --- a/src/backend/utils/adt/uuid.c +++ b/src/backend/utils/adt/uuid.c @@ -53,10 +53,11 @@ uuid_out(PG_FUNCTION_ARGS) { pg_uuid_t *uuid = PG_GETARG_UUID_P(0); static const char hex_chars[] = "0123456789abcdef"; - StringInfoData buf; + char *buf, *p; int i; - initStringInfo(&buf); + buf = palloc(2 * UUID_LEN + 5); + p = buf; for (i = 0; i < UUID_LEN; i++) { int hi; @@ -68,16 +69,17 @@ uuid_out(PG_FUNCTION_ARGS) * ("-"). Therefore, add the hyphens at the appropriate places here. */ if (i == 4 || i == 6 || i == 8 || i == 10) - appendStringInfoChar(&buf, '-'); + *p++ = '-'; hi = uuid->data[i] >> 4; lo = uuid->data[i] & 0x0F; - appendStringInfoChar(&buf, hex_chars[hi]); - appendStringInfoChar(&buf, hex_chars[lo]); + *p++ = hex_chars[hi]; + *p++ = hex_chars[lo]; } + *p = '\0'; - PG_RETURN_CSTRING(buf.data); + PG_RETURN_CSTRING(buf); } /* -- 2.43.2 From e8346ea88785a763d2bd3f99800ae928b7469f64 Mon Sep 17 00:00:00 2001 From: Laurenz Albe Date: Sat, 17 Feb 2024 17:24:19 +0100 Subject: [PATCH v1 3/3] Small speedup for array_out Avoid writing zero bytes where it is not necessary. This offers only a small, but measurable speed gain for larger arrays. 
--- src/backend/utils/adt/arrayfuncs.c | 9 + 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/src/backend/utils/adt/arrayfuncs.c b/src/backend/utils/adt/arrayfuncs.c index f3fee54e37..306c5062f7 100644 --- a/src/backend/utils/adt/arrayfuncs.c +++ b/src/backend/utils/adt/arrayfuncs.c @@ -1200,7 +1200,7 @@ array_out(PG_FUNCTION_ARGS) p = retval; #define APPENDSTR(str) (strcpy(p, (str)), p += strlen(p)) -#define APPENDCHAR(ch) (*p++ = (ch), *p = '\0') +#define APPENDCHAR(ch) (*p++ = (ch)) if (needdims) APPENDSTR(dims_str); @@ -1222,10 +1222,9 @@ array_out(PG_FUNCTION_ARGS) char ch = *tmp; if (ch == '"' || ch == '\\') - *p++ = '\\'; -*p++ = ch; + APPENDCHAR('\\'); +APPENDCHAR(ch); } - *p = '\0'; APPENDCHAR('"'); } else @@ -1248,6 +1247,8 @@ array_out(PG_FUNCTION_ARGS) j = i; } while (j != -1); + *p = '\0'; + #undef APPENDSTR #undef APPENDCHAR -- 2.43.2
Re: numeric_big in make check?
> > On 19 Feb 2024, at 12:48, Tom Lane wrote: > > > > Or we could just flush it. It's never detected a bug, and I think > > you'd find that it adds zero code coverage (or if not, we could > > fix that in a far more surgical and less expensive manner). > Off the top of my head, I can't say to what extent that's true, but it wouldn't surprise me if at least some of the tests added in the last 4 commits to touch that file aren't covered by tests elsewhere. Indeed that certainly looks like the case for 18a02ad2a5. I'm sure those tests could be pared down though. Regards, Dean
Re: Optimize planner memory consumption for huge arrays
On 9/8/23 07:11, Lepikhov Andrei wrote:
> On Wed, Sep 6, 2023, at 8:09 PM, Ashutosh Bapat wrote:
>> Hi Lepikhov,
>>
>> Thanks for using my patch and I am glad that you found it useful.
>>
>> On Mon, Sep 4, 2023 at 10:56 AM Lepikhov Andrei wrote:
>>>
>>> Hi, hackers,
>>>
>>> Looking at the planner behaviour with the memory consumption patch [1], I
>>> figured out that arrays increase memory consumption by the optimizer
>>> significantly. See init.sql in attachment.
>>> The point here is that the planner does small memory allocations for each
>>> element during estimation. As a result, it looks like the planner consumes
>>> about 250 bytes for each integer element.
>>
>> I guess the numbers you mentioned in init.sql are total memory used by
>> the planner (as reported by the patch in the thread) when planning
>> that query and not memory consumed by Const nodes themselves. Am I
>> right? I think the measurements need to be explained better and also
>> the realistic scenario you are trying to optimize.
>
> Yes, it is the total memory consumed by the planner - I used the numbers
> generated by your patch [1]. I had been increasing the number of elements in
> the array to exclude the memory consumed by the planner for other purposes.
> As you can see, the array with 1 element consumes 12kB of memory, 1E4
> elements - 2.6 MB. All of that memory increment is related only to the
> enlargement of this array. (2600-12)/10 = 260 bytes. So, I make a conclusion:
> each 4-byte element produces a consumption of 260 bytes of memory.
> I obtained this scenario from a user complaint - they had strict
> restrictions on memory usage and were stuck in this unusual memory usage case.
>
>> I guess the reason you think that partitioning will increase the
>> memory consumed is because each partition will have the clause
>> translated for it. Selectivity estimation for each partition will
>> create those many Const nodes and hence consume memory. Am I right?
>
> Yes.
>>> Can you please measure the memory consumed with and without your
>>> patch.
>
> Done. See test case and results in 'init_parts.sql' in attachment. Short
> summary below. I varied the number of elements from 1 to 1E4 and partitions
> from 1 to 100. As you can see, partitioning adds a lot of memory consumption
> by itself. But we see an effect from the patch also.
>
> master:
> elems       1     1E1     1E2     1E3     1E4
> parts
> 1        28kB    50kB   0.3MB   2.5MB    25MB
> 10       45kB   143kB   0.6MB   4.8MB    47MB
> 100     208kB   125kB   3.3MB    27MB   274MB
>
> patched:
> elems       1     1E1     1E2     1E3     1E4
> parts
> 1        28kB    48kB  0.25MB   2.2MB  22.8MB
> 10       44kB   100kB   313kB   2.4MB  23.7MB
> 100     208kB   101kB   0.9MB   3.7MB  32.4MB
>
> Just for comparison, without partitioning:
> elems       1     1E1     1E2     1E3     1E4
> master:  12kB    14kB    37kB   266kB   2.5MB
> patched: 12kB  11.5kB    13kB    24kB   141kB

These improvements look pretty nice, considering how simple the patch seems to be. I can't even imagine how much memory we'd need with even more partitions (say, 1000) if 100 partitions means 274MB.

BTW when releasing memory in scalararraysel, wouldn't it be good to also free the elem_values/elem_nulls? I haven't tried, and maybe it's not that significant an amount.

Considering there are now multiple patches improving memory usage during planning with partitions, perhaps it's time to take a step back and think about how we manage (or rather don't manage) memory during query planning, and see if we could improve that instead of an infinite sequence of ad hoc patches?

Our traditional attitude is to not manage memory, and rely on the memory context to not be very long-lived. And that used to be fine, but partitioning clearly changed the equation, increasing the amount of allocated memory etc. I don't think we want to stop relying on memory contexts for planning in general - memory contexts are obviously very convenient etc. But maybe we could identify "stages" in the planning and release the memory more aggressively in those?
For example, I don't think we expect selectivity functions to allocate long-lived objects, right? So maybe we could run them in a dedicated memory context, and reset it aggressively (after each call). Ofc, I'm not suggesting this patch should be responsible for doing this. >>> It is maybe not a problem most of the time. However, in the case of >>> partitions, memory consumption multiplies by each partition. Such a corner >>> case looks weird, but the fix is simple. So, why not? >> >> With vectorized operations becoming a norm these days, it's possible >> to have thousands of element in array of an ANY or IN clause. Also >> will be common to have thousands of partitions. But I think what we >> need to d
Re: Reducing memory consumed by RestrictInfo list translations in partitionwise join planning
On Mon, Feb 19, 2024 at 4:35 AM Tomas Vondra wrote:
>
> Hi,
>
> After taking a look at the patch optimizing SpecialJoinInfo allocations,
> I decided to take a quick look at this one too. I don't have any
> specific comments on the code yet, but it seems quite a bit more complex
> than the other patch ... it's introducing a HTAB into the optimizer,
> surely that has costs too.

Thanks for looking into this too.

>
> I started by doing the same test as with the other patch, comparing
> master to the two patches (applied independently and both). And I got
> about this (in MB):
>
> tables   master   sjinfo    rinfo     both
> ------------------------------------------
> 2            37       36       34       33
> 3           138      129      122      113
> 4           421      376      363      318
> 5          1495     1254     1172      931
>
> Unlike the SpecialJoinInfo patch, I haven't seen any reduction in
> planning time for this one.

Yeah. That agreed with my observation as well.

>
> The reduction in allocated memory is nice. I wonder what's allocating
> the remaining memory, and what we'd have to do to reduce it further.

Please see the reply on the SpecialJoinInfo thread. Other than the current patches, we require a memory-efficient Relids implementation. I have shared some ideas in the slides I shared in the other thread, but haven't spent time experimenting with any ideas there.

>
> However, this is a somewhat extreme example - it's joining 5 tables,
> each with 1000 partitions, using a partition-wise join. It reduces the
> amount of memory, but the planning time is still quite high (and
> essentially the same as without the patch). So it's not like it'd make
> them significantly more practical ... do we have any other ideas/plans
> how to improve that?

Yuya has been working on reducing planning time [1]. He has some significant improvements in that area based on my experiments. But those patches are complex and still WIP.

>
> AFAIK we don't expect this to improve "regular" cases with modest number
> of tables / partitions, etc. But could it make them slower in some case?
> AFAIR, my experiments did not show any degradation in regular cases with modest number of tables/partitions. The variation in planning time was with the usual planning time variations. [1] https://www.postgresql.org/message-id/flat/CAExHW5uVZ3E5RT9cXHaxQ_DEK7tasaMN%3DD6rPHcao5gcXanY5w%40mail.gmail.com#112b3e104e0f9e39eb007abe075aae20 -- Best Wishes, Ashutosh Bapat
Re: Experiments with Postgres and SSL
I've been asked to take a look at this thread and review some patches, and the subject looks interesting enough, so here I am. On Thu, 19 Jan 2023 at 04:16, Greg Stark wrote: > I had a conversation a while back with Heikki where he expressed that > it was annoying that we negotiate SSL/TLS the way we do since it > introduces an extra round trip. Aside from the performance > optimization I think accepting standard TLS connections would open the > door to a number of other opportunities that would be worth it on > their own. I agree that this would be very nice. > Other things it would open the door to in order from least > controversial to most > > * Hiding Postgres behind a standard SSL proxy terminating SSL without > implementing the Postgres protocol. I think there is also the option "hiding Postgres behind a standard SNI-based SSL router that does not terminate SSL", as that's arguably a more secure way to deploy any SSL service than SSL-terminating proxies. > * "Service Mesh" type tools that hide multiple services behind a > single host/port ("Service Mesh" is just a new buzzword for "proxy"). People proxying PostgreSQL seems fine, and enabling better proxying seems reasonable. > * Browser-based protocol implementations using websockets for things > like pgadmin or other tools to connect directly to postgres using > Postgres wire protocol but using native SSL implementations. > > * Postgres could even implement an HTTP based version of its protocol > and enable things like queries or browser based tools using straight > up HTTP requests so they don't need to use websockets. > > * Postgres could implement other protocols to serve up data like > status queries or monitoring metrics, using HTTP based standard > protocols instead of using our own protocol. I don't think we should be trying to serve anything HTTP-like, even with a ten-foot pole, on a port that we serve the PostgreSQL wire protocol on. 
If someone wants to multiplex the PostgreSQL wire protocol on the same port that serves HTTPS traffic, they're welcome to do so with their own proxy, but I'd rather we keep the PostgreSQL server's socket handling fundamentally incapable of servicing protocols primarily used in web browsers on the same socket that handles normal psql data connections. PostgreSQL may have its own host-based authentication with HBA, but I'd rather not have to depend on it to filter incoming connections between valid psql connections and people trying to grab the latest monitoring statistics at some HTTP endpoint - I'd rather use my trusty firewall that can already limit access to specific ports very efficiently without causing undue load on the database server. Matthias van de Meent Neon (https://neon.tech)
Re: serial not accepted as datatype in ALTER TABLE ... ALTER COLUMN
On Sun, Feb 18, 2024 at 1:59 PM Andy Fan wrote: > > > Hi Ashutosh, > > > data_type is described on that page as "Data type of the new column, > > or new data type for an existing column." but CREATE TABLE > > documentation [2] redirects data_type to [3], which mentions serial. > > The impression created by the documentation is that the second statement > > above is a valid statement and should not throw an error; instead it should > > change the data type of the column (and create the required sequence). > > I didn't find a reason not to support it; if there is one, I > think it is better to have some explanation in the documentation. > > > In code ATPrepAlterColumnType() calls typenameTypeIdAndMod(), whereas > > transformColumnDefinition() (called for ALTER TABLE ... ADD COLUMN and > > CREATE TABLE) handles "serial" data type separately. Looks like we are > > missing a call to transformColumnDefinition() in > > transformAlterTableStmt() under case AT_AlterColumnType. > > I tried your idea with the attachment; it is still in a draft state > but it can be used as a proof of concept and to aid further > discussion. Just one point to mention: serial implies a > "default value" + "not null" constraint. So when we modify a column to > serial, we need to replace any existing NULL values with the default value > at the table-rewrite stage. > I am surprised that this requires changes in ReWrite. I thought adding the NOT NULL constraint and default value commands would be done by transformColumnDefinition(). But I haven't looked at the patch closely enough. -- Best Wishes, Ashutosh Bapat
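For readers following along: "serial" is not a real type but shorthand that transformColumnDefinition() expands into an integer column, an owned sequence, a nextval() default, and a NOT NULL constraint. Below is a minimal Python sketch of the statements an ALTER COLUMN ... TYPE serial would effectively have to be rewritten into. The expansion follows PostgreSQL's documented serial behaviour, but the function and the exact statement ordering are illustrative, not the patch's actual code:

```python
def expand_serial(table: str, column: str) -> list[str]:
    """Return the SQL statements that 'serial' is shorthand for, per the
    PostgreSQL documentation on serial types.  Sequence naming follows
    the usual <table>_<column>_seq convention."""
    seq = f"{table}_{column}_seq"
    return [
        f"CREATE SEQUENCE {seq}",
        f"ALTER TABLE {table} ALTER COLUMN {column} TYPE integer",
        f"ALTER TABLE {table} ALTER COLUMN {column} SET NOT NULL",
        f"ALTER TABLE {table} ALTER COLUMN {column} "
        f"SET DEFAULT nextval('{seq}')",
        f"ALTER SEQUENCE {seq} OWNED BY {table}.{column}",
    ]

stmts = expand_serial("t", "id")
```

The point made in the thread — that serial implies the default and the NOT NULL constraint, not merely a type change — corresponds to everything after the plain TYPE statement above.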
Re: Memory consumed by child SpecialJoinInfo in partitionwise join planning
On Sun, Feb 18, 2024 at 10:55 PM Tomas Vondra wrote: > > Hi, > > I took a quick look at this patch today. I certainly agree with the > intent to reduce the amount of memory during planning, assuming it's not > overly disruptive. And I think this patch is fairly localized and looks > sensible. Thanks for looking at the patch-set. > > That being said I'm not a big fan of using a local variable on stack and > filling it. I'd probably go with the usual palloc/pfree, because that > makes it much easier to use - the callers would not be responsible for > allocating the SpecialJoinInfo struct. Sure, it's a little bit of > overhead, but with the AllocSet caching I doubt it's measurable. You are suggesting that instead of declaring a local variable of type SpecialJoinInfo, build_child_join_sjinfo() should palloc() and return SpecialJoinInfo which will be freed by free_child_sjinfo() (free_child_sjinfo_members in the patch). I am fine with that. > > I did put this through check-world on amd64/arm64, with valgrind, > without any issue. I also tried the scripts shared by Ashutosh in his > initial message (with some minor fixes, adding MEMORY to explain etc). > > The results with the 20240130 patches are like this: >
> tables   master   patched
> --------------------------
>      2     40.8      39.9
>      3    151.7     142.6
>      4    464.0     418.5
>      5   1663.9    1419.5
>
> That's certainly a nice improvement, and it even reduces the amount of > time for planning (the 5-table join goes from 18s to 17s on my laptop). > That's nice, although 17 seconds for planning is not ... great. > > That being said, the amount of remaining memory needed by planning is > still pretty high - we save ~240MB for a join of 5 tables, but we still > need ~1.4GB. Yes, this is a bit of an extreme example, and it probably is not > very common to join 5 tables with 1000 partitions each ... Yes. > Do we know what are the other places consuming the 1.4GB of memory? Please find the breakdown at [1]. 
Investigating further, I found that memory consumed by Bitmapsets is not explicitly visible through that breakdown. But Bitmapsets consume a lot of memory in this case. I shared Google Slides with raw numbers in [2]. The slides can be found at https://docs.google.com/presentation/d/1S9BiAADhX-Fv9tDbx5R5Izq4blAofhZMhHcO1c-wzfI/edit#slide=id.p. According to that measurement, Bitmapset objects created while planning a 5-way join between partitioned tables with 1000 partitions each consumed 769.795 MiB as compared to 19.728 KiB consumed by Bitmapsets when planning a 5-way non-partitioned join. See slide 3 for detailed numbers. > Considering my recent thread about scalability, where malloc() turned > out to be one of the main culprits, I wonder if maybe there's a lot to > gain by reducing the memory usage ... Our attitude to memory usage is > that it doesn't really matter if we keep it allocated for a bit, because > we'll free it shortly. And that may be true for "modest" memory usage, > but with 1.4GB that doesn't seem great, and the malloc overhead can be > pretty bad. > That probably explains why we see some modest speedups because of my memory saving patches. Thanks for sharing it. [1] https://www.postgresql.org/message-id/caexhw5stmouobe55pmt83r8uxvfcph+pvo5dnpdrvcsbgxe...@mail.gmail.com [2] https://www.postgresql.org/message-id/CAExHW5t0cPNjJRPRtt%3D%2Bc5SiTeBPHvx%3DSd2n8EO%2B7XdVuE8_YQ%40mail.gmail.com -- Best Wishes, Ashutosh Bapat
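The shape of the saving discussed in this subthread — freeing each transient child SpecialJoinInfo as soon as its join path is built, instead of letting thousands of them survive to the end of planning — can be illustrated with a self-contained Python sketch. The struct, its member sizes, and the partition counts below are invented stand-ins; PostgreSQL's palloc/pfree behaviour is only being mimicked, not reproduced:

```python
import tracemalloc

class ChildSJInfo:
    """Stand-in for a per-child SpecialJoinInfo: a small struct plus
    separately allocated member lists (relid sets and the like)."""
    def __init__(self, nrels: int):
        self.left_relids = list(range(nrels))
        self.right_relids = list(range(nrels))

def plan_children(nchildren: int, keep_all: bool) -> int:
    """Return peak traced memory while 'planning' nchildren child joins.
    keep_all=True mimics keeping every per-child struct until planning
    ends; keep_all=False mimics build-use-free, as the patch proposes."""
    tracemalloc.start()
    kept = []
    for _ in range(nchildren):
        sjinfo = ChildSJInfo(nrels=64)
        if keep_all:
            kept.append(sjinfo)
        # else: sjinfo is dropped at rebinding, i.e. freed immediately
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak

peak_keep = plan_children(1000, keep_all=True)
peak_free = plan_children(1000, keep_all=False)
assert peak_free < peak_keep  # freeing per child keeps the peak low
```

With 1000 children the retained variant peaks orders of magnitude higher, which is the same qualitative effect as the MiB-vs-KiB Bitmapset numbers above, though the absolute numbers here mean nothing.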
Re: Synchronizing slots from primary to standby
On Mon, Feb 19, 2024 at 5:32 PM Amit Kapila wrote: > > Few comments on 0001 Thanks for the feedback. > > 1. I think it is better to error out when the valid GUC or option is > not set in ensure_valid_slotsync_params() and > ensure_valid_remote_info() instead of waiting. And we shouldn't start > the worker in the first place if not all required GUCs are set. This > will additionally simplify the code a bit. Sure, removed 'ensure' functions. Moved the validation checks to the postmaster before starting the slot sync worker. > 2. > +typedef struct SlotSyncWorkerCtxStruct > { > - /* prevents concurrent slot syncs to avoid slot overwrites */ > + pid_t pid; > + bool stopSignaled; > bool syncing; > + time_t last_start_time; > slock_t mutex; > -} SlotSyncCtxStruct; > +} SlotSyncWorkerCtxStruct; > > I think we don't need to change the name of this struct as this can be > used both by the worker and the backend. We can probably add the > comment to indicate that all the fields except 'syncing' are used by > slotsync worker. Modified. > 3. Similar to above, function names like SlotSyncShmemInit() shouldn't > be changed to SlotSyncWorkerShmemInit(). Modified. > 4. > +ReplSlotSyncWorkerMain(int argc, char *argv[]) > { > ... > + on_shmem_exit(slotsync_worker_onexit, (Datum) 0); > ... > + before_shmem_exit(slotsync_failure_callback, PointerGetDatum(wrconn)); > ... > } > > Do we need two separate callbacks? Can't we have just one (say > slotsync_failure_callback) that cleans additional things in case of > slotsync worker? And, if we need both those callbacks then please add > some comments for both and why one is before_shmem_exit() and the > other is on_shmem_exit(). I think we can merge these now. Earlier 'on_shmem_exit' was needed to avoid race-condition between startup and slot sync worker process to drop 'i' slots on promotion. Now we do not have any such scenario. But I need some time to analyze it well. Will do it in the next version. 
> In addition to the above, I have made a few changes in the comments > and code (cosmetic code changes). Please include those in the next > version if you find them okay. You need to rename .txt file to remove > .txt and apply atop v90-0001*. Sure, included these. Please find the patch001 attached. I will rebase the rest of the patches and post them tomorrow. thanks Shveta v91-0001-Add-a-new-slotsync-worker.patch Description: Binary data
Re: POC, WIP: OR-clause support for indexes
On Mon, Feb 19, 2024 at 05:35, Andrei Lepikhov <a.lepik...@postgrespro.ru> wrote: > On 16/2/2024 19:54, jian he wrote: > > After setting these parameters, overall enable_or_transformation ON > > performs better. > > sorry for the noise. > Don't worry, at least we know a weak point of partial paths estimation. > > so now I didn't find any corner case where enable_or_transformation is > > ON performs worse than when it's OFF. > > > > +typedef struct OrClauseGroupEntry > > +{ > > + OrClauseGroupKey key; > > + > > + Node *node; > > + List *consts; > > + Oid scalar_type; > > + List *exprs; > > +} OrClauseGroupEntry; > > > > I found that the field `scalar_type` was never used. > Thanks, fixed. > Not that it will make a big difference, but it would be good to avoid, I think. v17-0002 1) Move the vars *arrayconst and *dest to after the if, to avoid makeNode() (palloc): + Const *arrayconst; + ScalarArrayOpExpr *dest; + + pd = (PredicatesData *) lfirst(lc); + if (pd->elems == NIL) + /* The index doesn't participate in this operation */ + continue; + arrayconst = lsecond_node(Const, saop->args); + dest = makeNode(ScalarArrayOpExpr); best regards, Ranier Vilela
Re: Memory consumed by paths during partitionwise join planning
On Fri, Feb 16, 2024 at 8:42 AM Andrei Lepikhov wrote: > Live example: right now, I am working on the code like MSSQL has - a > combination of NestLoop and HashJoin paths and switching between them in > real-time. It requires both paths in the path list at the moment when > extensions are coming. Even if one of them isn't referenced from the > upper pathlist, it may still be helpful for the extension. There is no guarantee that every path presented to add_path will be preserved. Suboptimal paths are freed as and when add_path discovers that they are suboptimal. So I don't think an extension can rely on the existence of a path. But having a refcount makes it easy to preserve the required paths by referencing them. > > >> About partitioning. As I discovered planning issues connected to > >> partitions, the painful problem is a rule, according to which we are > >> trying to use all nomenclature of possible paths for each partition. > >> With indexes, it quickly increases optimization work. IMO, this can help > >> a 'symmetrical' approach, which could restrict the scope of possible > >> pathways for upcoming partitions if we filter some paths in a set of > >> previously planned partitions. > > > > filter or free? > Filter. > I meant that Postgres tries to apply IndexScan, BitmapScan, > IndexOnlyScan, and other strategies, iterating over the partition > indexes. The optimizer spends a lot of time doing that. So, why not > introduce a symmetrical strategy and exclude from the search some > types of scan, based on the path-building experience of previous > partitions of the same table: if you have dozens of partitions, is it > beneficial for the system to find a slightly more optimal IndexScan on one > partition while having SeqScans on 999 others? IIUC, you are suggesting that instead of planning each partition/partitionwise join, we only create paths with the strategies which were found to be optimal with previous partitions. 
That's a good heuristic but it won't work if partition properties - statistics, indexes etc. differ between groups of partitions. -- Best Wishes, Ashutosh Bapat
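The proposed heuristic, and the caveat about partitions whose properties differ, can be sketched in a few lines of Python. The strategy names and cost numbers are invented for illustration; the real planner's path generation is of course far more involved:

```python
STRATEGIES = ["SeqScan", "IndexScan", "BitmapScan", "IndexOnlyScan"]

def cheapest(costs: dict) -> str:
    return min(costs, key=costs.get)

def plan_partitions(per_partition_costs: list, symmetric: bool) -> list:
    """Pick a scan strategy per partition.  With symmetric=True,
    partitions after the first only consider strategies that already
    won on an earlier partition, skipping the rest of the search."""
    allowed = set(STRATEGIES)
    chosen = []
    for costs in per_partition_costs:
        if symmetric and chosen:
            candidates = {s: c for s, c in costs.items() if s in allowed}
        else:
            candidates = costs
        winner = cheapest(candidates)
        if symmetric:
            allowed = {winner}  # restrict the search for later partitions
        chosen.append(winner)
    return chosen

# 1000 similar partitions: SeqScan wins everywhere, and the symmetric
# search considers only SeqScan after the first partition.
uniform = [{"SeqScan": 10, "IndexScan": 50, "BitmapScan": 40,
            "IndexOnlyScan": 45}] * 1000
assert plan_partitions(uniform, symmetric=True) == ["SeqScan"] * 1000

# But if one partition has very different statistics (say, an index
# that would win there), the heuristic misses it -- the caveat above.
skewed = uniform[:1] + [{"SeqScan": 99, "IndexScan": 1}] + uniform[2:]
full = plan_partitions(skewed, symmetric=False)
fast = plan_partitions(skewed, symmetric=True)
assert full[1] == "IndexScan" and fast[1] == "SeqScan"
```

The two assertions at the end capture both sides of the exchange: a large planning-effort saving when partitions are alike, and a potentially suboptimal plan when one partition's statistics or indexes differ.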
Re: partitioning and identity column
On Thu, Feb 15, 2024 at 11:30 PM Alexander Lakhin wrote: > > Hello Ashutosh, > > 24.01.2024 09:34, Ashutosh Bapat wrote: > > > >>> There's another thing I found. The file isn't using > >>> check_stack_depth() in the function which traverse inheritance > >>> hierarchies. This isn't just a problem of the identity related > >>> function but most of the functions in that file. Do you think it's > >>> worth fixing it? > >> I suppose the number of inheritance levels is usually not a problem for > >> stack depth? > >> > > Practically it should not. I would rethink the application design if > > it requires so many inheritance or partition levels. But functions in > > optimizer like try_partitionwise_join() and set_append_rel_size() call > > > > /* Guard against stack overflow due to overly deep inheritance tree. */ > > check_stack_depth(); > > > > I am fine if we want to skip this. > > I've managed to reach stack overflow inside ATExecSetIdentity() with > the following script: > (echo "CREATE TABLE tp0 (a int PRIMARY KEY, > b int GENERATED ALWAYS AS IDENTITY) PARTITION BY RANGE (a);"; > for ((i=1;i<=8;i++)); do >echo "CREATE TABLE tp$i PARTITION OF tp$(( $i - 1 )) > FOR VALUES FROM ($i) TO (100) PARTITION BY RANGE (a);"; > done; > echo "ALTER TABLE tp0 ALTER COLUMN b SET GENERATED BY DEFAULT;") | psql > >psql.log > > (with max_locks_per_transaction = 400 in the config) > > It runs about 15 minutes for me and ends with: > Program terminated with signal SIGSEGV, Segmentation fault. 
> > #0 0x55a8ced20de9 in LWLockAcquire (lock=0x7faec200b900, > mode=mode@entry=LW_EXCLUSIVE) at lwlock.c:1169 > 1169{ > (gdb) bt > #0 0x55a8ced20de9 in LWLockAcquire (lock=0x7faec200b900, > mode=mode@entry=LW_EXCLUSIVE) at lwlock.c:1169 > #1 0x55a8cea0342d in WALInsertLockAcquire () at xlog.c:1389 > #2 XLogInsertRecord (rdata=0x55a8cf1ccee8 , > fpw_lsn=fpw_lsn@entry=1261347512, flags=0 '\000', > num_fpi=num_fpi@entry=0, topxid_included=false) at xlog.c:817 > #3 0x55a8cea1396e in XLogInsert (rmid=rmid@entry=11 '\v', > info=) at xloginsert.c:524 > #4 0x55a8ce9c1541 in _bt_insertonpg (rel=0x7faeb8478c98, > heaprel=0x7faecf63d378, > itup_key=itup_key@entry=0x55a8d5064678, buf=3210, cbuf=cbuf@entry=0, > stack=stack@entry=0x55a8d1063d08, > itup=0x55a8d5064658, itemsz=16, > newitemoff=, postingoff=0, split_only_page= out>) at nbtinsert.c:1389 > #5 0x55a8ce9bf9a7 in _bt_doinsert (rel=, > rel@entry=0x7faeb8478c98, itup=, > itup@entry=0x55a8d5064658, checkUnique=, > checkUnique@entry=UNIQUE_CHECK_YES, indexUnchanged=, > heapRel=, heapRel@entry=0x7faecf63d378) at nbtinsert.c:260 > #6 0x55a8ce9c92ad in btinsert (rel=0x7faeb8478c98, values= out>, isnull=, > ht_ctid=0x55a8d50643cc, heapRel=0x7faecf63d378, checkUnique=UNIQUE_CHECK_YES, > indexUnchanged=, > indexInfo=) at nbtree.c:205 > #7 0x55a8cea41391 in CatalogIndexInsert > (indstate=indstate@entry=0x55a8d0fc03e8, heapTuple=, > heapTuple@entry=0x55a8d50643c8, updateIndexes=) at > indexing.c:170 > #8 0x55a8cea4172c in CatalogTupleUpdate > (heapRel=heapRel@entry=0x7faecf63d378, otid=0x55a8d50643cc, > tup=tup@entry=0x55a8d50643c8) at indexing.c:324 > #9 0x55a8ceb18173 in ATExecSetIdentity (rel=0x7faeab1288a8, > colName=colName@entry=0x55a8d0fbc2b8 "b", > def=def@entry=0x55a8d1063918, lockmode=lockmode@entry=8, recurse=true, > recursing=) at tablecmds.c:8307 > #10 0x55a8ceb18251 in ATExecSetIdentity (rel=0x7faeab127f28, > colName=colName@entry=0x55a8d0fbc2b8 "b", > def=def@entry=0x55a8d1063918, lockmode=lockmode@entry=8, 
recurse=true, > recursing=) at tablecmds.c:8337 > #11 0x55a8ceb18251 in ATExecSetIdentity (rel=0x7faeab1275a8, > colName=colName@entry=0x55a8d0fbc2b8 "b", > def=def@entry=0x55a8d1063918, lockmode=lockmode@entry=8, recurse=true, > recursing=) at tablecmds.c:8337 > #12 0x55a8ceb18251 in ATExecSetIdentity (rel=0x7faeab126c28, > colName=colName@entry=0x55a8d0fbc2b8 "b", > def=def@entry=0x55a8d1063918, lockmode=lockmode@entry=8, recurse=true, > recursing=) at tablecmds.c:8337 > ... > > Functions ATExecAddIdentity() and ATExecDropIdentity() are recursive too, > so I think they can be exploited as well. Not just identity-related functions; many other functions in tablecmds.c have that problem, as I mentioned earlier. -- Best Wishes, Ashutosh Bapat
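For reference, the guard being discussed — PostgreSQL's check_stack_depth(), which raises a clean "stack depth limit exceeded" error at the top of a recursive routine instead of letting the process overflow its stack and segfault — can be mimicked in a short sketch. The depth limit, error class, and tree walker below are illustrative stand-ins, not the tablecmds.c code:

```python
MAX_DEPTH = 200  # illustrative stand-in for the max_stack_depth budget

class StackDepthError(RuntimeError):
    pass

def check_stack_depth(depth: int) -> None:
    """Fail cleanly instead of overflowing, as PostgreSQL's
    check_stack_depth() does at the top of recursive routines."""
    if depth > MAX_DEPTH:
        raise StackDepthError("stack depth limit exceeded")

def set_identity_recurse(tree: dict, rel: str, depth: int = 0) -> int:
    """Walk a partition hierarchy the way ATExecSetIdentity() recurses
    into children; returns the number of relations visited."""
    check_stack_depth(depth)
    visited = 1
    for child in tree.get(rel, []):
        visited += set_identity_recurse(tree, child, depth + 1)
    return visited

# A linear chain of partitions, like the tp0 -> tp1 -> ... script above.
deep_chain = {f"tp{i}": [f"tp{i+1}"] for i in range(500)}
try:
    set_identity_recurse(deep_chain, "tp0")
    raise AssertionError("expected the guard to fire")
except StackDepthError:
    pass  # a clean error instead of a SIGSEGV
```

The missing piece in the ATExecSetIdentity() case is exactly the one-line check at the top of the recursive function.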
Re: Synchronizing slots from primary to standby
On Mon, Feb 19, 2024 at 9:46 AM shveta malik wrote: > > Okay I see. Thanks for pointing it out. Here are the patches > addressing your comments. Changes are in patch001, rest are rebased. > Few comments on 0001 1. I think it is better to error out when the valid GUC or option is not set in ensure_valid_slotsync_params() and ensure_valid_remote_info() instead of waiting. And we shouldn't start the worker in the first place if not all required GUCs are set. This will additionally simplify the code a bit. 2. +typedef struct SlotSyncWorkerCtxStruct { - /* prevents concurrent slot syncs to avoid slot overwrites */ + pid_t pid; + bool stopSignaled; bool syncing; + time_t last_start_time; slock_t mutex; -} SlotSyncCtxStruct; +} SlotSyncWorkerCtxStruct; I think we don't need to change the name of this struct as this can be used both by the worker and the backend. We can probably add the comment to indicate that all the fields except 'syncing' are used by slotsync worker. 3. Similar to above, function names like SlotSyncShmemInit() shouldn't be changed to SlotSyncWorkerShmemInit(). 4. +ReplSlotSyncWorkerMain(int argc, char *argv[]) { ... + on_shmem_exit(slotsync_worker_onexit, (Datum) 0); ... + before_shmem_exit(slotsync_failure_callback, PointerGetDatum(wrconn)); ... } Do we need two separate callbacks? Can't we have just one (say slotsync_failure_callback) that cleans additional things in case of slotsync worker? And, if we need both those callbacks then please add some comments for both and why one is before_shmem_exit() and the other is on_shmem_exit(). In addition to the above, I have made a few changes in the comments and code (cosmetic code changes). Please include those in the next version if you find them okay. You need to rename .txt file to remove .txt and apply atop v90-0001*. -- With Regards, Amit Kapila. 
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c index 3080d4aa53..790799c164 100644 --- a/src/backend/postmaster/postmaster.c +++ b/src/backend/postmaster/postmaster.c @@ -463,9 +463,9 @@ static void InitPostmasterDeathWatchHandle(void); /* * Slot sync worker allowed to start up? * - * If we are on a hot standby, slot sync parameter is enabled, and it is - * the first time of worker's launch, or enough time has passed since the - * worker was launched last, then it is allowed to be started. + * We allow to start the slot sync worker when we are on a hot standby, + * sync_replication_slots is enabled, and it is the first time of worker's + * launch, or enough time has passed since the worker was launched last. */ #define SlotSyncWorkerAllowed()\ (sync_replication_slots && pmState == PM_HOT_STANDBY && \ diff --git a/src/backend/replication/logical/slotsync.c b/src/backend/replication/logical/slotsync.c index 40ab87bce1..0e0f3cdf76 100644 --- a/src/backend/replication/logical/slotsync.c +++ b/src/backend/replication/logical/slotsync.c @@ -1052,9 +1052,9 @@ ensure_valid_slotsync_params(void) /* * Re-read the config file. * - * If any of the slot sync GUCs have changed, exit the worker and - * let it get restarted by the postmaster. The worker to be exited for - * restart purpose only if the caller passed restart as true. + * Exit if any of the slot sync GUCs have changed. The postmaster will + * restart it. The worker to be exited for restart purpose only if the + * caller passed restart as true. */ static void slotsync_reread_config(bool restart) @@ -1295,9 +1295,9 @@ ReplSlotSyncWorkerMain(int argc, char *argv[]) dbname = walrcv_get_dbname_from_conninfo(PrimaryConnInfo); /* -* Connect to the database specified by user in primary_conninfo. We need -* a database connection for walrcv_exec to work. Please see comments atop -* libpqrcv_exec. +* Connect to the database specified by the user in primary_conninfo. 
We +* need a database connection for walrcv_exec to work which we use to fetch +* slot information from the remote node. See comments atop libpqrcv_exec. */ InitPostgres(dbname, InvalidOid, NULL, InvalidOid, 0, NULL); @@ -1310,12 +1310,11 @@ ReplSlotSyncWorkerMain(int argc, char *argv[]) appendStringInfo(&app_name, "%s", "slotsyncworker"); /* -* Establish the connection to the primary server for slots +* Establish the connection to the primary server for slot * synchronization. */ wrconn = walrcv_connect(PrimaryConnInfo, false, false, false, - app_name.data, - &err); + app_name.data, &err); pfree(app_name.data); if (!wrconn) @@ -1332,7 +1331,7 @@ ReplSlotSyncWorkerMain(int argc, char *argv[]) */ ensure_valid_remote_info(wrconn); - /* Main
Re: numeric_big in make check?
Daniel Gustafsson writes: > numeric_big has been left out of parallel_schedule, requiring EXTRA_TESTS to > run it, since it went in back in 1999 (AFAICT it was even the reason EXTRA_TESTS > was invented). The original commit states that it's huge, and it probably > was. > Today it runs faster than many tests we have in parallel_schedule, even on > slow > hardware like my ~5 year old laptop. Repeated runs in CI at various parallel > groups place it at 50% of the runtime of triggers.sql and 25% of brin.sql. > To make sure it's executed and doesn't silently break, is it time to add this to > the regular make check? Or we could just flush it. It's never detected a bug, and I think you'd find that it adds zero code coverage (or if not, we could fix that in a far more surgical and less expensive manner). regards, tom lane
Re: Add an option to skip loading missing publication to avoid logical replication failure
On Mon, 19 Feb 2024 at 12:48, vignesh C wrote: > > Hi, > > Currently ALTER SUBSCRIPTION ... SET PUBLICATION will break the > logical replication in certain cases. This can happen as the apply > worker will get restarted after SET PUBLICATION, the apply worker will > use the existing slot and replication origin corresponding to the > subscription. Now, it is possible that before restart the origin has > not been updated and the WAL start location points to a location prior > to where PUBLICATION pub exists which can lead to such an error. Once > this error occurs, apply worker will never be able to proceed and will > always return the same error. > > There was discussion on this and Amit had posted a patch to handle > this at [2]. Amit's patch does continue using a historic snapshot but > ignores publications that are not found for the purpose of computing > RelSyncEntry attributes. We won't mark such an entry as valid till all > the publications are loaded without anything missing. This means we > won't publish operations on tables corresponding to that publication > till we found such a publication and that seems okay. > I have added an option skip_not_exist_publication to enable this > operation only when skip_not_exist_publication is specified as true. > There is no change in default behavior when skip_not_exist_publication > is specified as false. I have updated the patch to now include changes for pg_dump, added few tests, describe changes and added documentation changes. The attached v2 version patch has the changes for the same. Regards, Vignesh From c0f97fc671db390239ca5cb4c224cb3d80a0e22e Mon Sep 17 00:00:00 2001 From: Vignesh C Date: Mon, 19 Feb 2024 10:20:02 +0530 Subject: [PATCH v2 1/2] Skip loading the publication if the publication does not exist. Skip loading the publication if the publication does not exist. 
--- src/backend/replication/pgoutput/pgoutput.c | 28 +++-- 1 file changed, 21 insertions(+), 7 deletions(-) diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c index 998f92d671..f7b6d0384d 100644 --- a/src/backend/replication/pgoutput/pgoutput.c +++ b/src/backend/replication/pgoutput/pgoutput.c @@ -82,7 +82,7 @@ static void pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx, static bool publications_valid; -static List *LoadPublications(List *pubnames); +static List *LoadPublications(List *pubnames, bool *skipped); static void publication_invalidation_cb(Datum arg, int cacheid, uint32 hashvalue); static void send_relation_and_attrs(Relation relation, TransactionId xid, @@ -1703,9 +1703,13 @@ pgoutput_shutdown(LogicalDecodingContext *ctx) /* * Load publications from the list of publication names. + * + * Here, we just skip the publications that don't exist yet. 'skipped' + * will be true if we find any publication from the given list that doesn't + * exist. 
*/ static List * -LoadPublications(List *pubnames) +LoadPublications(List *pubnames, bool *skipped) { List *result = NIL; ListCell *lc; @@ -1713,9 +1717,12 @@ LoadPublications(List *pubnames) foreach(lc, pubnames) { char *pubname = (char *) lfirst(lc); - Publication *pub = GetPublicationByName(pubname, false); + Publication *pub = GetPublicationByName(pubname, true); - result = lappend(result, pub); + if (pub) + result = lappend(result, pub); + else + *skipped = true; } return result; @@ -1994,7 +2001,7 @@ get_rel_sync_entry(PGOutputData *data, Relation relation) } /* Validate the entry */ - if (!entry->replicate_valid) + if (!entry->replicate_valid || !publications_valid) { Oid schemaId = get_rel_namespace(relid); List *pubids = GetRelationPublications(relid); @@ -2011,6 +2018,7 @@ get_rel_sync_entry(PGOutputData *data, Relation relation) bool am_partition = get_rel_relispartition(relid); char relkind = get_rel_relkind(relid); List *rel_publications = NIL; + bool skipped_pub = false; /* Reload publications if needed before use. */ if (!publications_valid) @@ -2021,9 +2029,15 @@ get_rel_sync_entry(PGOutputData *data, Relation relation) list_free_deep(data->publications); data->publications = NIL; } - data->publications = LoadPublications(data->publication_names); + data->publications = LoadPublications(data->publication_names, &skipped_pub); MemoryContextSwitchTo(oldctx); - publications_valid = true; + + /* + * We don't consider the publications to be valid till we have + * information of all the publications. + */ + if (!skipped_pub) +publications_valid = true; } /* -- 2.34.1 From e9ff5304b2a3a0fccfb9084ff076a0465f28edbe Mon Sep 17 00:00:00 2001 From: Vignesh C Date: Mon, 19 Feb 2024 10:23:05 +0530 Subject: [PATCH v2 2/2] Added an option skip_not_exist_publication which will skip loading the publication, if the publication does not exist. Added an option skip_not_exist_publication which will skip
Re: table inheritance versus column compression and storage settings
On Fri, Feb 16, 2024 at 11:54 PM Tom Lane wrote: > > I wrote: > > I find it surprising that the committed patch does not touch > > pg_dump. Is it really true that pg_dump dumps situations with > > differing compression/storage settings accurately already? > > It's worse than I thought. Run "make installcheck" with > today's HEAD, then: > > $ pg_dump -Fc regression >r.dump > $ createdb r2 > $ pg_restore -d r2 r.dump > pg_restore: error: could not execute query: ERROR: column "a" inherits > conflicting storage methods > HINT: To resolve the conflict, specify a storage method explicitly. > Command was: CREATE TABLE public.stchild4 ( > a text > ) > INHERITS (public.stparent1, public.stparent2); > ALTER TABLE ONLY public.stchild4 ALTER COLUMN a SET STORAGE MAIN; > > > pg_restore: error: could not execute query: ERROR: relation > "public.stchild4" does not exist > Command was: ALTER TABLE public.stchild4 OWNER TO postgres; > > pg_restore: error: could not execute query: ERROR: relation > "public.stchild4" does not exist > Command was: COPY public.stchild4 (a) FROM stdin; > pg_restore: warning: errors ignored on restore: 3 Thanks for the test. Let's call this Problem1. I expected src/bin/pg_upgrade/t/002_pg_upgrade.pl to fail in this case since it will execute similar steps as you did. And it actually does, except that it uses binary-upgrade mode. In that mode, INHERITed tables are dumped in a different manner -- For binary upgrade, set up inheritance this way. ALTER TABLE ONLY "public"."stchild4" INHERIT "public"."stparent1"; ALTER TABLE ONLY "public"."stchild4" INHERIT "public"."stparent2"; ... snip ... ALTER TABLE ONLY "public"."stchild4" ALTER COLUMN "a" SET STORAGE MAIN; that does not lead to the conflict and pg_upgrade does not fail. 
> > > What I'd intended to compare was the results of the query added to the > regression tests: > > regression=# SELECT attrelid::regclass, attname, attstorage FROM pg_attribute > WHERE (attrelid::regclass::name like 'stparent%' > OR attrelid::regclass::name like 'stchild%') > and attname = 'a' > ORDER BY 1, 2; > attrelid | attname | attstorage > ---+-+ > stparent1 | a | p > stparent2 | a | x > stchild1 | a | p > stchild3 | a | m > stchild4 | a | m > stchild5 | a | x > stchild6 | a | m > (7 rows) > > r2=# SELECT attrelid::regclass, attname, attstorage FROM pg_attribute > WHERE (attrelid::regclass::name like 'stparent%' > OR attrelid::regclass::name like 'stchild%') > and attname = 'a' > ORDER BY 1, 2; > attrelid | attname | attstorage > ---+-+ > stparent1 | a | p > stchild1 | a | p > stchild3 | a | m > stparent2 | a | x > stchild5 | a | p > stchild6 | a | m > (6 rows) > > So not only does stchild4 fail to restore altogether, but stchild5 > ends with the wrong attstorage. With binary-upgrade dump and restore stchild5 gets the correct storage value. Looks like we need a test which pg_dump s regression database and restores it without going through pg_upgrade. I think the fix is easy one. Dump the STORAGE and COMPRESSION clauses with CREATE TABLE for local attributes. Those for inherited attributes will be dumped separately. But that will not fix an existing problem described below. Let's call it Problem2. 
With HEAD at commit 57f59396bb51953bb7b957780c7f1b7f67602125 (almost a month back) $ createdb regression $ psql -d regression #create table par1 (a text storage plain); #create table par2 (a text storage plain); #create table chld (a text) inherits (par1, par2); NOTICE: merging multiple inherited definitions of column "a" NOTICE: merging column "a" with inherited definition -- parent storages conflict after child creation #alter table par1 alter column a set storage extended; #SELECT attrelid::regclass, attname, attstorage FROM pg_attribute WHERE (attrelid::regclass::name like 'par%' OR attrelid::regclass::name like 'chld%') and attname = 'a' ORDER BY 1, 2; attrelid | attname | attstorage --+-+ par1 | a | x par2 | a | p chld | a | x (3 rows) $ createdb r2 $ pg_dump -Fc regression > /tmp/r.dump $ pg_restore -d r2 /tmp/r.dump pg_restore: error: could not execute query: ERROR: inherited column "a" has a storage parameter conflict DETAIL: EXTENDED versus PLAIN Command was: CREATE TABLE public.chld ( a text ) INHERITS (public.par1, public.par2); pg_restore: error: could not execute query: ERROR: relation "public.chld" does not exist Command was: ALTER TABLE public.chld OWNER TO ashutosh; pg_restore: error: could not execute query: ERROR: relation "public.chld" does not exist Command was: COPY public.chld (a) FROM stdin; pg_restore: warning: errors ignored on restore: 3 Fixing this requires that we dump ALTER TABLE ... ALTER COLUMN SET STORAGE and COMPRESSION commands after all the tables (at least children) have been created.
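The fix described above amounts to a two-phase dump ordering: emit every CREATE TABLE first, and defer all ALTER TABLE ... SET STORAGE (and SET COMPRESSION) commands until the whole inheritance tree exists, so a child's CREATE can no longer trip over a storage conflict introduced on its parents after the child was created. A hedged Python sketch of that ordering follows; the table representation and the emitted SQL are illustrative, not pg_dump's actual code:

```python
def dump_tables(tables: list) -> list:
    """Emit CREATE TABLE statements first, then all SET STORAGE
    commands, so a child's CREATE is never evaluated against storage
    settings its parents acquired only later."""
    creates, alters = [], []
    for t in tables:
        inh = f" INHERITS ({', '.join(t['inherits'])})" if t["inherits"] else ""
        creates.append(f"CREATE TABLE {t['name']} (a text){inh};")
        for col, storage in t.get("storage", {}).items():
            alters.append(
                f"ALTER TABLE {t['name']} ALTER COLUMN {col} "
                f"SET STORAGE {storage};")
    return creates + alters

# The par1/par2/chld reproducer from above: chld is created while both
# parents still have default storage, and the conflicting settings are
# applied only afterwards -- mirroring the order that worked originally.
schema = [
    {"name": "par1", "inherits": [], "storage": {"a": "EXTENDED"}},
    {"name": "par2", "inherits": [], "storage": {"a": "PLAIN"}},
    {"name": "chld", "inherits": ["par1", "par2"], "storage": {}},
]
stmts = dump_tables(schema)
last_create = max(i for i, s in enumerate(stmts) if s.startswith("CREATE"))
first_alter = min(i for i, s in enumerate(stmts) if s.startswith("ALTER"))
assert last_create < first_alter  # every CREATE precedes every ALTER
```

This mirrors what binary-upgrade mode already achieves by emitting INHERIT and SET STORAGE as separate ALTER commands after table creation.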
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Mon, Feb 19, 2024 at 9:02 AM Masahiko Sawada wrote: > > I think that vacuum and tidbitmap (and future users) would end up > having the same max block size calculation. And it seems slightly odd > layering to me that max-block-size-specified context is created on > vacuum (or tidbitmap) layer, a varlen-value radix tree is created by > tidstore layer, and the passed context is used for leaves (if > varlen-value is used) on radix tree layer. That sounds slightly more complicated than I was thinking of, but we could actually be talking about the same thing: I'm drawing a distinction between "used = must be detected / #ifdef'd" and "used = actually happens to call allocation". I meant that the passed context would _always_ be used for leaves, regardless of varlen or not. So with fixed-length values short enough to live in child pointer slots, that context would still be used for iteration etc. > Another idea is to create a > max-block-size-specified context on the tidstore layer. That is, > vacuum and tidbitmap pass a work_mem and a flag indicating whether the > tidstore can use the bump context, and tidstore creates a (aset of > bump) memory context with the calculated max block size and passes it > to the radix tree. That might be a better abstraction since both uses have some memory limit. > As for using the bump memory context, I feel that we need to store > iterator struct in aset context at least as it can be individually > freed and re-created. Or it might not be necessary to allocate the > iterator struct in the same context as radix tree. Okay, that's one thing I was concerned about. Since we don't actually have a bump context yet, it seems simple to assume aset for non-nodes, and if we do get it, we can adjust slightly. Anyway, this seems like a good thing to try to clean up, but it's also not a show-stopper. 
On that note: I will be going on honeymoon shortly, and then to PGConf India, so I will have sporadic connectivity for the next 10 days and won't be doing any hacking during that time. Andres, did you want to take a look at the radix tree patch 0003? Aside from the above possible cleanup, most of it should be stable.
Re: Switching XLog source from archive to streaming when primary available
On Wed, Jan 31, 2024 at 6:30 PM Bharath Rupireddy wrote:
>
> Needed a rebase due to commit 776621a (conflict in
> src/test/recovery/meson.build for new TAP test file added). Please
> find the attached v17 patch.

Strengthened the tests a bit by using recovery_min_apply_delay to mimic the standby spending some time fetching from the archive. PSA v18 patch.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

v18-0001-Allow-standby-to-switch-WAL-source-from-archive-.patch
Description: Binary data
Re: speed up a logical replica setup
Hi,

I have reviewed the v21 patch and found an issue.

Initially I started the standby server with a new postgresql.conf file (not the default postgresql.conf present in the instance):

pg_ctl -D ../standby start -o "-c config_file=/new_path/postgresql.conf"

I set 'max_replication_slots = 1' in the new postgresql.conf and 'max_replication_slots = 0' in the default postgresql.conf file. Now when we run pg_createsubscriber on the standby we get an error:

pg_createsubscriber: error: could not set replication progress for the subscription "pg_createsubscriber_5_242843": ERROR:  cannot query or manipulate replication origin when max_replication_slots = 0
NOTICE:  dropped replication slot "pg_createsubscriber_5_242843" on publisher
pg_createsubscriber: error: could not drop publication "pg_createsubscriber_5" on database "postgres": ERROR:  publication "pg_createsubscriber_5" does not exist
pg_createsubscriber: error: could not drop replication slot "pg_createsubscriber_5_242843" on database "postgres": ERROR:  replication slot "pg_createsubscriber_5_242843" does not exist

I observed that when we run the pg_createsubscriber command, it stops the standby instance (running with the non-default configuration) and restarts it, now with the default postgresql.conf, where max_replication_slots = 0, so pg_createsubscriber fails with the error given above. I have added the script file with which we can reproduce this issue. Similar issues can happen with other configuration settings such as port, etc.

The possible solutions would be:
1) Allow running pg_createsubscriber only if the standby is initially stopped. I observed that pg_logical_createsubscriber also uses this approach.
2) Read the GUCs via the SHOW command and restore them when the server restarts.

I would prefer the 1st solution.
Thanks and Regards,
Shlok Kyal

sudo pkill -9 postgres
rm -rf ../primary ../standby ../new_path
rm -rf primary.log standby.log
./initdb -D ../primary
cat << EOF >> ../primary/postgresql.conf
wal_level = 'logical'
EOF
./pg_ctl -D ../primary -l primary.log start
./psql -d postgres -c "CREATE table t1 (c1 int);"
./psql -d postgres -c "Insert into t1 values(1);"
./psql -d postgres -c "Insert into t1 values(2);"
./psql -d postgres -c "INSERT into t1 values(3);"
./psql -d postgres -c "INSERT into t1 values(4);"
./pg_basebackup -h localhost -X stream -v -W -R -D ../standby/

# postgresql.conf file in new location
mkdir ../new_path
cat << EOF >> ../new_path/postgresql.conf
max_replication_slots = 1
port = 9000
EOF

# postgresql.conf file in default location
cat << EOF >> ../standby/postgresql.conf
max_replication_slots = 0
port = 9000
EOF

./pg_ctl -D ../standby -l standby.log start -o "-c config_file=../new_path/postgresql.conf"
./pg_createsubscriber -D ../standby -S 'host=localhost port=9000 dbname=postgres' -P 'host=localhost port=5432 dbname=postgres' -d postgres -r
Re: Transaction timeout
On Mon, 19 Feb 2024 at 17:14, Andrey M. Borodin wrote:
>> On 18 Feb 2024, at 22:16, Andrey M. Borodin wrote:
>>
>> But it seems a little strange that session 3 did not fail at all
>
> It was only coincidence. Any test that verifies FATALing out in 100ms can
> fail, see new failure here [0].
>
> In a nearby thread Michael is proposing injection points that can wait and
> be awakened. So I propose the following course of action:
> 1. Remove all tests that involve a pg_stat_activity check of a FATALed session
> (any permutation with the checker_sleep step)
> 2. Add idle_in_transaction_session_timeout, statement_timeout and
> transaction_timeout tests when the injection points feature gets committed.

+1

> Alexander, what do you think?
Re: Fix race condition in InvalidatePossiblyObsoleteSlot()
Hi,

On Mon, Feb 19, 2024 at 01:45:16PM +0530, Bharath Rupireddy wrote:
> On Mon, Feb 19, 2024 at 11:44 AM Michael Paquier wrote:
> >
> > > Yeah, comments added in v3.
> >
> > The contents look rather OK, I may do some word-smithing for both.
>
> Here are some comments on v3:

Thanks for looking at it!

> 1.
> +XLogRecPtr	initial_effective_xmin = InvalidXLogRecPtr;
> +XLogRecPtr	initial_catalog_effective_xmin = InvalidXLogRecPtr;
> +XLogRecPtr	initial_restart_lsn = InvalidXLogRecPtr;
>
> Prefix 'initial_' makes the variable names a bit longer, I think we
> can just use effective_xmin, catalog_effective_xmin and restart_lsn;
> the code updating them only when if (!terminated) tells one that they
> aren't updated every time.

I'm not sure about that. I prefer to see meaningful names instead of having to read the code where they are used.

> 2.
> +	/*
> +	 * We'll release the slot's mutex soon, so it's possible that
> +	 * those values change since the process holding the slot has been
> +	 * terminated (if any), so record them here to ensure we would
> +	 * report the slot as obsolete correctly.
> +	 */
>
> This needs a bit more info as to why and how effective_xmin,
> catalog_effective_xmin and restart_lsn can move ahead after signaling
> a backend and before the signalled backend reports back.

I'm not sure of the added value of such extra details in this comment, or whether the comment would be easy to maintain. I have the feeling that knowing it's possible is enough here. Happy to hear what others think about it too.

> 3.
> +	/*
> +	 * Assert that the conflict cause recorded previously before we
> +	 * terminate the process did not change now for any reason.
> +	 */
> +	Assert(!(conflict_prev != RS_INVAL_NONE && terminated &&
> +			 conflict_prev != conflict));
>
> It took a while for me to understand the above condition, can we
> simplify it like below using De Morgan's laws for better readability?
> > Assert((conflict_prev == RS_INVAL_NONE) || !terminated ||
> >        (conflict_prev == conflict));

I don't have a strong opinion on this; it looks like a matter of taste.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Re: speed up a logical replica setup
Some review of the v21 patch:

- commit message

Mention pg_createsubscriber in the commit message title. That's the most important thing that someone doing git log searches in the future will be looking for.

- doc/src/sgml/ref/allfiles.sgml

Move the new entry to alphabetical order.

- doc/src/sgml/ref/pg_createsubscriber.sgml

+   The pg_createsubscriber should be run at the target
+   server. The source server (known as publisher server) should accept logical
+   replication connections from the target server (known as subscriber server).
+   The target server should accept local logical replication connection.

"should" -> "must" ?

+  Options

Sort the options alphabetically. It would be good to indicate somewhere which options are mandatory.

+  Examples

I suggest including a pg_basebackup call into this example, so it's easier for readers to get the context of how this is supposed to be used. You can add that pg_basebackup in this example is just an example and that other base backups can also be used.

- doc/src/sgml/reference.sgml

Move the new entry to alphabetical order.

- src/bin/pg_basebackup/Makefile

Move the new sections to alphabetical order.

- src/bin/pg_basebackup/meson.build

Move the new sections to alphabetical order.

- src/bin/pg_basebackup/pg_createsubscriber.c

+typedef struct CreateSubscriberOptions
+typedef struct LogicalRepInfo

I think these kinds of local-use structs don't need to be typedef'ed. (Then you also don't need to update typedefs.list.)

+static void
+usage(void)

Sort the options alphabetically.

+static char *
+get_exec_path(const char *argv0, const char *progname)

Can this not use find_my_exec() and find_other_exec()?

+int
+main(int argc, char **argv)

Sort the options alphabetically (long_options struct, getopt_long() argument, switch cases).

- .../t/040_pg_createsubscriber.pl
- .../t/041_pg_createsubscriber_standby.pl

These two files could be combined into one.

+# Force it to initialize a new cluster instead of copying a
+# previously initdb'd cluster.

Explain why?

+$node_s->append_conf(
+	'postgresql.conf', qq[
+log_min_messages = debug2

Is this setting necessary for the test?
numeric_big in make check?
numeric_big has been left out of parallel_schedule, requiring EXTRA_TESTS to run it, since it went in back in 1999 (AFAICT it was even the reason EXTRA_TESTS was invented). The original commit states that it's huge, and it probably was. Today it runs faster than many tests we have in parallel_schedule, even on slow hardware like my ~5 year old laptop. Repeated runs in CI at various parallel groups place it at 50% of the runtime of triggers.sql and 25% of brin.sql.

To make sure it's executed and doesn't silently break, is it time to add it to the regular make check?

--
Daniel Gustafsson
Re: Add last_commit_lsn to pg_stat_database
On 2/19/24 07:57, Michael Paquier wrote:
> On Sun, Feb 18, 2024 at 02:28:06AM +0100, Tomas Vondra wrote:
>> Thanks for the updated patch. I don't have a clear opinion on the
>> feature and whether this is the way to implement it, but I have two
>> simple questions.
>
> Some users I know of would be really happy to be able to get this
> information for each database in a single view, so that feels natural
> to plug the information into pg_stat_database.

OK

>> 1) Do we really need to modify RecordTransactionCommitPrepared and
>> XactLogCommitRecord to return the LSN of the commit record? Can't we
>> just use XactLastRecEnd? It's good enough for advancing
>> replorigin_session_origin_lsn, why wouldn't it be good enough for this?
>
> Or XactLastCommitEnd?

But that's not set in RecordTransactionCommitPrepared (or twophase.c in general), and the patch seems to need that.

> The changes in AtEOXact_PgStat() are not really
> attractive for what's a static variable in all the backends.

I'm sorry, I'm not sure what "changes not being attractive" means :-(

>> 2) Earlier in this thread it was claimed the function returning the
>> last_commit_lsn is STABLE, but I have to admit it's not clear to me why
>> that's the case :-( All the pg_stat_get_db_* functions are marked as
>> stable, so I guess it's thanks to the pgstat system ...
>
> The consistency of the shared stats data depends on
> stats_fetch_consistency. The default of "cache" makes sure that the
> values don't change across a scan, until the end of a transaction.
>
> I have not paid much attention about that until now, but note that it
> would not be the case of "none" where the data is retrieved each time
> it is requested. So it seems to me that these functions should be
> actually volatile, not stable, because they could deliver different
> values across the same scan as an effect of the design behind
> pgstat_fetch_entry() and a non-default stats_fetch_consistency. I may
> be missing something, of course.

Right.
If I do this:

create or replace function get_db_lsn(oid) returns pg_lsn as $$
declare
  v_lsn pg_lsn;
begin
  select last_commit_lsn into v_lsn from pg_stat_database where datid = $1;
  return v_lsn;
end;
$$ language plpgsql;

select min(l), max(l)
  from (select i, get_db_lsn(16384) AS l
          from generate_series(1,10) s(i)) foo;

and run this concurrently with a pgbench on the same database (with OID 16384), I get:

- stats_fetch_consistency=cache: always the same min/max value
- stats_fetch_consistency=none: min != max

Which would suggest you're right and this should be VOLATILE and not STABLE. But then again, the existing pg_stat_get_db_* functions are all marked as STABLE, so (a) is that correct and (b) why should this function be marked differently?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: Transaction timeout
> On 18 Feb 2024, at 22:16, Andrey M. Borodin wrote:
>
> But it seems a little strange that session 3 did not fail at all

It was only coincidence. Any test that verifies FATALing out in 100ms can fail, see new failure here [0].

In a nearby thread Michael is proposing injection points that can wait and be awakened. So I propose the following course of action:
1. Remove all tests that involve a pg_stat_activity check of a FATALed session (any permutation with the checker_sleep step)
2. Add idle_in_transaction_session_timeout, statement_timeout and transaction_timeout tests when the injection points feature gets committed.

Alexander, what do you think?

Best regards, Andrey Borodin.

[0] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grassquit&dt=2024-02-18%2022%3A23%3A45
Re: System username in pg_stat_activity
Hi,

On Fri, Feb 16, 2024 at 09:41:41PM +0100, Magnus Hagander wrote:
> On Fri, Feb 16, 2024 at 8:55 PM Andres Freund wrote:
> >
> > Hi,
> >
> > On 2024-01-12 17:16:53 +0100, Magnus Hagander wrote:
> > > On Thu, Jan 11, 2024 at 5:55 PM Bertrand Drouvot wrote:
> > > > On Thu, Jan 11, 2024 at 02:24:58PM +0100, Magnus Hagander wrote:
> > > > > On Wed, Jan 10, 2024 at 3:12 PM Bertrand Drouvot wrote:
> > > > > >
> > > > > > If we go the 2 fields way, then what about auth_identity and
> > > > > > auth_method then?
> > > > >
> > > > > Here is an updated patch based on this idea.
> > > >
> > > > Thanks!
> > > >
> > > > +      auth_method text
> > > > +
> > > > +      The authentication method used for authenticating the
> > > > +      connection, or NULL for background processes.
> > > >
> > > > I'm wondering if it would make sense to populate it for parallel
> > > > workers too. I think it's doable thanks to d951052, but I'm not sure
> > > > it's worth it (one could join based on the leader_pid though). OTOH
> > > > that would be consistent with how the SYSTEM_USER behaves with
> > > > parallel workers (it's populated).
> > >
> > > I guess one could conceptually argue that "authentication happens in
> > > the leader". But we do populate it with the other user records, and
> > > it'd be weird if this one was excluded.
> > >
> > > The tricky thing is that pgstat_bestart() is called long before we
> > > deserialize the data. But from what I can tell it should be safe to
> > > change it per the attached? That should be AFAICT an extremely short
> > > window of time longer before we report it, not enough to matter.
> >
> > I don't like that one bit. The whole subsystem initialization dance
> > already is quite complicated, particularly for pgstat; we shouldn't
> > make it more complicated. Especially not when the initialization is
> > moved quite a bit away from all the other calls.
> >
> > Besides just that, I also don't think delaying visibility of the worker
> > in pg_stat_activity until parallel worker initialization has completed
> > is good, that's not all cheap work.
> >
> > Maybe I am missing something, but why aren't we just getting the value
> > from the leader's entry, instead of copying it?

Good point!

> The answer to that would be "because I didn't think of it" :)

I'm in the same boat ;-)

> Were you thinking of something like the attached?

Doing it that way looks good to me.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com