Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-26 Thread Bertrand Drouvot
Hi,

On Wed, Mar 27, 2024 at 10:08:33AM +0530, Bharath Rupireddy wrote:
> On Tue, Mar 26, 2024 at 11:22 PM Bertrand Drouvot
>  wrote:
> >
> > -   if (!(RecoveryInProgress() && slot->data.synced))
> > +   if (!(InRecovery && slot->data.synced))
> > slot->inactive_since = GetCurrentTimestamp();
> > else
> > slot->inactive_since = 0;
> >
> > Not related to this change but more the way RestoreSlotFromDisk() behaves here:
> >
> > For a sync slot on standby it will be set to zero and then later will be
> > synchronized with the one coming from the primary. I think that's fine to have
> > it at zero for this window of time.
> 
> Right.
> 
> > Now, if the standby is down and one sets sync_replication_slots to off,
> > then inactive_since will be set to zero on the standby at startup and not
> > synchronized (unless one triggers a manual sync). I also think that's fine but
> > it might be worth documenting this behavior (that after a standby startup
> > inactive_since is zero until the next sync...).
> 
> Isn't this behaviour applicable for other slot parameters that the
> slot syncs from the remote slot on the primary?

No, they are persisted on disk. If not, we'd not know where to resume the
decoding from on the standby in case the primary is down and/or sync is off.
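
For example (just a sketch), those persisted parameters are the ones visible in
pg_replication_slots:

SELECT slot_name, restart_lsn, confirmed_flush_lsn, catalog_xmin
FROM pg_replication_slots;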

> I've added the following note in the comments when we update
> inactive_since in RestoreSlotFromDisk.
> 
>  * Note that for synced slots after the standby starts up (i.e. after
>  * the slots are loaded from the disk), the inactive_since will remain
>  * zero until the next slot sync cycle.
>  */
> if (!(InRecovery && slot->data.synced))
> slot->inactive_since = GetCurrentTimestamp();
> else
> slot->inactive_since = 0;

I think we should add some words in the doc too, and also explain what
inactive_since means on the standby (as suggested by Shveta in [1]).

[1]: 
https://www.postgresql.org/message-id/CAJpy0uDkTW%2Bt1k3oPkaipFBzZePfFNB5DmiA%3D%3DpxRGcAdpF%3DPg%40mail.gmail.com
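
For instance, the doc words could be illustrated with something like this run on
the standby (just a sketch; synced and inactive_since are the columns this patch
series is about):

SELECT slot_name, synced, inactive_since
FROM pg_replication_slots
WHERE synced;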

> > 7 ===
> >
> > +# Capture and validate inactive_since of a given slot.
> > +sub capture_and_validate_slot_inactive_since
> > +{
> > +   my ($node, $slot_name, $slot_creation_time) = @_;
> > +   my $name = $node->name;
> >
> > We now have capture_and_validate_slot_inactive_since at 2 places:
> > 040_standby_failover_slots_sync.pl and 019_replslot_limit.pl.
> >
> > Worth creating a sub in Cluster.pm?
> 
> I'd second that thought for now. We might have to debate first if it's
> useful for all the nodes even without replication, and if yes, the
> naming stuff and all that. Historically, we've had such duplicated
> functions until recently, for instance advance_wal and log_contains. We
> moved them over to a common perl library Cluster.pm very recently. I'm
> sure we can come back later to move it to Cluster.pm.

I thought that this would be the right time to avoid introducing duplicated code.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-26 Thread Bharath Rupireddy
On Wed, Mar 27, 2024 at 10:08 AM Bharath Rupireddy
 wrote:
>
> Please find the attached v25-0001 (made this 0001 patch now as
> inactive_since patch is committed) patch with the above changes.

Fixed an issue in synchronize_slots where DatumGetLSN was being used in
place of DatumGetTimestampTz. Found this via the CF bot [1], not on
my dev system.

Please find the attached v26 patch.


[1]
[05:14:39.281] #7  DatumGetLSN (X=) at
../src/include/utils/pg_lsn.h:24
[05:14:39.281] No locals.
[05:14:39.281] #8  synchronize_slots (wrconn=wrconn@entry=0x583cd170)
at ../src/backend/replication/logical/slotsync.c:757
[05:14:39.281] isnull = false
[05:14:39.281] remote_slot = 0x583ce1a8
[05:14:39.281] d = 
[05:14:39.281] col = 10
[05:14:39.281] slotRow = {25, 25, 3220, 3220, 28, 16, 16, 25, 25, 1184}
[05:14:39.281] res = 0x583cd1b8
[05:14:39.281] tupslot = 0x583ce11c
[05:14:39.281] remote_slot_list = 0x0
[05:14:39.281] some_slot_updated = false
[05:14:39.281] started_tx = false
[05:14:39.281] query = 0x57692bc4 "SELECT slot_name, plugin,
confirmed_flush_lsn, restart_lsn, catalog_xmin, two_phase, failover,
database, invalidation_reason, inactive_since FROM
pg_catalog.pg_replication_slots WHERE failover and NOT"...
[05:14:39.281] __func__ = "synchronize_slots"
[05:14:39.281] #9  0x56ff9d1e in SyncReplicationSlots
(wrconn=0x583cd170) at
../src/backend/replication/logical/slotsync.c:1504

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From 8c5f7d0064e0e01a2d458872ecc1d8e682ddc033 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Wed, 27 Mar 2024 05:29:18 +
Subject: [PATCH v26] Maintain inactive_since for synced slots correctly.

The slot's inactive_since isn't currently maintained for
synced slots on the standby. The commit a11f330b55 prevents
updating inactive_since with RecoveryInProgress() check in
RestoreSlotFromDisk(). But, the issue is that
RecoveryInProgress() always returns true in
RestoreSlotFromDisk() as 'xlogctl->SharedRecoveryState' is
always 'RECOVERY_STATE_CRASH' at that time. The impact of this
is that, on a promoted standby, inactive_since is always NULL for
all synced slots even after a server restart.

The above issue led us to question why we can't just update
inactive_since for synced slots on the standby with the value
received from the remote slot on the primary. This is consistent
with any other slot parameter, i.e., all of them are synced from
the primary.

This commit does two things:
1) Updates inactive_since for sync slots with the value
received from the primary's slot.

2) Ensures the value is set to current timestamp during the
shutdown of slot sync machinery to help correctly interpret the
time if the standby gets promoted without a restart.

Author: Bharath Rupireddy
Reviewed-by: Bertrand Drouvot, Amit Kapila, Shveta Malik
Discussion: https://www.postgresql.org/message-id/CALj2ACWLctoiH-pSjWnEpR54q4DED6rw_BRJm5pCx86_Y01MoQ%40mail.gmail.com
---
 doc/src/sgml/system-views.sgml|  4 ++
 src/backend/replication/logical/slotsync.c| 61 -
 src/backend/replication/slot.c| 41 
 .../t/040_standby_failover_slots_sync.pl  | 66 +++
 4 files changed, 157 insertions(+), 15 deletions(-)

diff --git a/doc/src/sgml/system-views.sgml b/doc/src/sgml/system-views.sgml
index 3c8dca8ca3..7713f168e7 100644
--- a/doc/src/sgml/system-views.sgml
+++ b/doc/src/sgml/system-views.sgml
@@ -2530,6 +2530,10 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx
   
 The time since the slot has become inactive.
 NULL if the slot is currently being used.
+Note that the slots that are being synced from a primary server
+(whose synced field is true), will get the
+inactive_since value from the corresponding
+remote slot on the primary.
   
  
 
diff --git a/src/backend/replication/logical/slotsync.c b/src/backend/replication/logical/slotsync.c
index 30480960c5..9c95a4b062 100644
--- a/src/backend/replication/logical/slotsync.c
+++ b/src/backend/replication/logical/slotsync.c
@@ -137,9 +137,12 @@ typedef struct RemoteSlot
 
 	/* RS_INVAL_NONE if valid, or the reason of invalidation */
 	ReplicationSlotInvalidationCause invalidated;
+
+	TimestampTz inactive_since; /* in seconds */
 } RemoteSlot;
 
 static void slotsync_failure_callback(int code, Datum arg);
+static void update_synced_slots_inactive_since(void);
 
 /*
  * If necessary, update the local synced slot's metadata based on the data
@@ -167,6 +170,7 @@ update_local_synced_slot(RemoteSlot *remote_slot, Oid remote_dbid)
 		remote_slot->two_phase == slot->data.two_phase &&
 		remote_slot->failover == slot->data.failover &&
 		remote_slot->confirmed_lsn == slot->data.confirmed_flush &&
+		remote_slot->inactive_since == slot->inactive_since &&
 		

Re: Why is parula failing?

2024-03-26 Thread Tom Lane
David Rowley  writes:
> Unfortunately, REL_16_STABLE does not have the additional debugging,
> so don't get to know what reltuples was set to.

Let's wait a bit to see if it fails in HEAD ... but if not, would
it be reasonable to back-patch the additional debugging output?

regards, tom lane




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-26 Thread shveta malik
On Tue, Mar 26, 2024 at 6:05 PM Bertrand Drouvot
 wrote:
>
>
> > We can think on that later if we really need another
> > field which give us sync time.
>
> I think that calling GetCurrentTimestamp() so frequently could be too costly,
> so I'm not sure we should.

Agreed.

> > In my second approach, I have tried to
> > avoid updating inactive_since for synced slots during sync process. We
> > update that field during creation of synced slot so that
> > inactive_since reflects correct info even for synced slots (rather
> > than copying from primary).
>
> Yeah, and I think we could create a dedicated field with this information
> if we feel the need.

Okay.

thanks
Shveta




Re: Exposing the lock manager's WaitForLockers() to SQL

2024-03-26 Thread Will Mortensen
Rebased, fixed a couple typos, and reordered the isolation tests to
put the most elaborate pair last.


v11-0001-Refactor-GetLockConflicts-into-more-general-GetL.patch
Description: Binary data


v11-0002-Allow-specifying-single-lockmode-in-WaitForLocke.patch
Description: Binary data


v11-0003-Add-pg_wait_for_lockers-function.patch
Description: Binary data


Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-26 Thread Bharath Rupireddy
On Wed, Mar 27, 2024 at 10:24 AM shveta malik  wrote:
>
> On Wed, Mar 27, 2024 at 10:22 AM Amit Kapila  wrote:
> >
> > On Wed, Mar 27, 2024 at 10:08 AM Bharath Rupireddy
> >  wrote:
> > >
> > > On Tue, Mar 26, 2024 at 11:22 PM Bertrand Drouvot
> > >  wrote:
> > >
> > > > 3)
> > > > update_synced_slots_inactive_time():
> > > >
> > > > This assert is removed, is it intentional?
> > > > Assert(s->active_pid == 0);
> > >
> > > Yes, the slot can get acquired in the corner case when someone runs
> > > pg_sync_replication_slots concurrently at this time. I'm referring to
> > > the issue reported upthread. We don't prevent one running
> > > pg_sync_replication_slots in promotion/ShutDownSlotSync phase right?
> > > Maybe we should prevent that otherwise some of the slots are synced
> > > and the standby gets promoted while others are yet-to-be-synced.
> > >
> >
> > We should do something about it but that shouldn't be done in this
> > patch. We can handle it separately and then add such an assert.
>
> Agreed. Once this patch is concluded, I can fix the slot sync shutdown
> issue and will also add this 'assert' back.

Agreed. Thanks.

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-26 Thread shveta malik
On Wed, Mar 27, 2024 at 10:22 AM Amit Kapila  wrote:
>
> On Wed, Mar 27, 2024 at 10:08 AM Bharath Rupireddy
>  wrote:
> >
> > On Tue, Mar 26, 2024 at 11:22 PM Bertrand Drouvot
> >  wrote:
> >
> > > 3)
> > > update_synced_slots_inactive_time():
> > >
> > > This assert is removed, is it intentional?
> > > Assert(s->active_pid == 0);
> >
> > Yes, the slot can get acquired in the corner case when someone runs
> > pg_sync_replication_slots concurrently at this time. I'm referring to
> > the issue reported upthread. We don't prevent one running
> > pg_sync_replication_slots in promotion/ShutDownSlotSync phase right?
> > Maybe we should prevent that otherwise some of the slots are synced
> > and the standby gets promoted while others are yet-to-be-synced.
> >
>
> We should do something about it but that shouldn't be done in this
> patch. We can handle it separately and then add such an assert.

Agreed. Once this patch is concluded, I can fix the slot sync shutdown
issue and will also add this 'assert' back.

thanks
Shveta




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-26 Thread Amit Kapila
On Wed, Mar 27, 2024 at 10:08 AM Bharath Rupireddy
 wrote:
>
> On Tue, Mar 26, 2024 at 11:22 PM Bertrand Drouvot
>  wrote:
>
> > 3)
> > update_synced_slots_inactive_time():
> >
> > This assert is removed, is it intentional?
> > Assert(s->active_pid == 0);
>
> Yes, the slot can get acquired in the corner case when someone runs
> pg_sync_replication_slots concurrently at this time. I'm referring to
> the issue reported upthread. We don't prevent one running
> pg_sync_replication_slots in promotion/ShutDownSlotSync phase right?
> Maybe we should prevent that otherwise some of the slots are synced
> and the standby gets promoted while others are yet-to-be-synced.
>

We should do something about it but that shouldn't be done in this
patch. We can handle it separately and then add such an assert.

-- 
With Regards,
Amit Kapila.




Re: pgsql: Track last_inactive_time in pg_replication_slots.

2024-03-26 Thread Amit Kapila
On Tue, Mar 26, 2024 at 9:10 PM Alvaro Herrera  wrote:
>
> On 2024-Mar-26, Nathan Bossart wrote:
>
> > FWIW I'd really prefer to have something like max_slot_xid_age for this.  A
> > time-based parameter would likely help with most cases, but transaction ID
> > usage will vary widely from server to server, so it'd be nice to have
> > something to protect against wraparound more directly.
>
> Yeah, I tend to agree that an XID-based limit makes more sense than a
> time-based one.
>

So, do we want the time-based parameter or just max_slot_xid_age,
considering both will be GUCs? Yes, the xid_age-based parameter
sounds directly helpful for transaction ID wraparound or dead row
cleanup. OTOH, having a lot of inactive slots (which is possible in
use cases where a tool creates a lot of slots and forgets to remove
them, or the tool exits without cleaning up slots, say due to a server
shutdown) also prohibits removing dead space, which is not nice either?
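
Just as an illustration (a sketch only, using the inactive_since column added by
this work), such slots could be spotted with something like:

SELECT slot_name, slot_type, active, inactive_since,
       age(xmin) AS xmin_age,
       age(catalog_xmin) AS catalog_xmin_age
FROM pg_replication_slots
WHERE NOT active;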

The one example that comes to mind is pg_createsubscriber
(committed for PG17), which creates one slot per database to convert a
standby to a subscriber; now say it exits due to a power shutdown, then
there could be a lot of dangling slots on the primary server. Also,
say there is some bug in such a tool that doesn't allow proper cleanup
of slots; the same problem can happen. Yeah, this would be a problem
of the tool, but I think there is no harm in giving a way to avoid
problems at the server end due to such slots.

-- 
With Regards,
Amit Kapila.




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-26 Thread Bharath Rupireddy
On Tue, Mar 26, 2024 at 11:22 PM Bertrand Drouvot
 wrote:
>
> > I'm attaching v24 patches. It implements the above idea proposed
> > upthread for synced slots.
>
>  v24-0002
>
> 1 ===
>
> This commit does two things:
> 1) Updates inactive_since for sync slots with the value
> received from the primary's slot.
>
> Tested it and it does that.

Thanks. I've added a test case for this.

> 2 ===
>
> 2) Ensures the value is set to current timestamp during the
> shutdown of slot sync machinery to help correctly interpret the
> time if the standby gets promoted without a restart.
>
> Tested it and it does that.

Thanks. I've added a test case for this.

> 3 ===
>
> +/*
> + * Reset the synced slots info such as inactive_since after shutting
> + * down the slot sync machinery.
> + */
> +static void
> +update_synced_slots_inactive_time(void)
>
> Looks like the comment "reset" is not matching the name of the function and
> what it does.

Changed. I've also changed the function name to
update_synced_slots_inactive_since to be precise on what it exactly
does.

> 4 ===
>
> +   /*
> +* We get the current time beforehand and only once 
> to avoid
> +* system calls overhead while holding the lock.
> +*/
> +   if (now == 0)
> +   now = GetCurrentTimestamp();
>
> Also +1 of having GetCurrentTimestamp() just called one time within the loop.

Right.

> 5 ===
>
> -   if (!(RecoveryInProgress() && slot->data.synced))
> +   if (!(InRecovery && slot->data.synced))
> slot->inactive_since = GetCurrentTimestamp();
> else
> slot->inactive_since = 0;
>
> Not related to this change but more the way RestoreSlotFromDisk() behaves here:
>
> For a sync slot on standby it will be set to zero and then later will be
> synchronized with the one coming from the primary. I think that's fine to have
> it at zero for this window of time.

Right.

> Now, if the standby is down and one sets sync_replication_slots to off,
> then inactive_since will be set to zero on the standby at startup and not
> synchronized (unless one triggers a manual sync). I also think that's fine but
> it might be worth documenting this behavior (that after a standby startup
> inactive_since is zero until the next sync...).

Isn't this behaviour applicable for other slot parameters that the
slot syncs from the remote slot on the primary?

I've added the following note in the comments when we update
inactive_since in RestoreSlotFromDisk.

 * Note that for synced slots after the standby starts up (i.e. after
 * the slots are loaded from the disk), the inactive_since will remain
 * zero until the next slot sync cycle.
 */
if (!(InRecovery && slot->data.synced))
slot->inactive_since = GetCurrentTimestamp();
else
slot->inactive_since = 0;

> 6 ===
>
> +   print "HI  $slot_name $name $inactive_since $slot_creation_time\n";
>
> garbage?

Removed.

> 7 ===
>
> +# Capture and validate inactive_since of a given slot.
> +sub capture_and_validate_slot_inactive_since
> +{
> +   my ($node, $slot_name, $slot_creation_time) = @_;
> +   my $name = $node->name;
>
> We now have capture_and_validate_slot_inactive_since at 2 places:
> 040_standby_failover_slots_sync.pl and 019_replslot_limit.pl.
>
> Worth creating a sub in Cluster.pm?

I'd second that thought for now. We might have to debate first if it's
useful for all the nodes even without replication, and if yes, the
naming stuff and all that. Historically, we've had such duplicated
functions until recently, for instance advance_wal and log_contains. We
moved them over to a common perl library Cluster.pm very recently. I'm
sure we can come back later to move it to Cluster.pm.

On Wed, Mar 27, 2024 at 9:02 AM shveta malik  wrote:
>
> 1)
> slot.c:
> + * data from the remote slot. We use InRecovery flag instead of
> + * RecoveryInProgress() as it always returns true even for normal
> + * server startup.
>
> a) Not clear what 'it' refers to. Better to use 'the latter'
> b) Is it better to mention the primary here:
>  'as the latter always returns true even on the primary server during startup'.

Modified.

> 2)
> update_local_synced_slot():
>
> - strcmp(remote_slot->plugin, NameStr(slot->data.plugin)) == 0)
> + strcmp(remote_slot->plugin, NameStr(slot->data.plugin)) == 0 &&
> + remote_slot->inactive_since == slot->inactive_since)
>
> When this code was written initially, the intent was to do strcmp at
> the end (only if absolutely needed). It will be good if we maintain
> the same and add new checks before strcmp.

Done.

> 3)
> update_synced_slots_inactive_time():
>
> This assert is removed, is it intentional?
> Assert(s->active_pid == 0);

Yes, the slot can get acquired in the corner case when someone runs
pg_sync_replication_slots concurrently at this time. I'm referring to
the issue reported upthread. We don't prevent one running
pg_sync_replication_slots in promotion/ShutDownSlotSync phase right?
Maybe we should prevent that otherwise some of the slots are synced
and the standby gets promoted while others are yet-to-be-synced.

Re: remaining sql/json patches

2024-03-26 Thread Amit Langote
On Wed, Mar 27, 2024 at 12:42 PM jian he  wrote:
> hi.
> I don't fully understand all the code in json_table patch.
> maybe we can split it into several patches,

I'm working on exactly that atm.

> like:
> * no nested json_table_column.
> * nested json_table_column, with PLAN DEFAULT
> * nested json_table_column, with PLAN ( json_table_plan )

Yes, I think it will end up something like this.  I'll try to post the
breakdown tomorrow.

-- 
Thanks, Amit Langote




Re: remaining sql/json patches

2024-03-26 Thread jian he
On Tue, Mar 26, 2024 at 6:16 PM jian he  wrote:
>
> On Fri, Mar 22, 2024 at 12:08 AM Amit Langote  wrote:
> >
> > On Wed, Mar 20, 2024 at 9:53 PM Amit Langote  
> > wrote:
> > > I'll push 0001 tomorrow.
> >
> > Pushed that one.  Here's the remaining JSON_TABLE() patch.
> >

hi.
I don't fully understand all the code in json_table patch.
maybe we can split it into several patches, like:
* no nested json_table_column.
* nested json_table_column, with PLAN DEFAULT
* nested json_table_column, with PLAN ( json_table_plan )

I can understand the "no nested json_table_column" part,
which seems to be how Oracle [1] implemented it.
I think we can make the "no nested json_table_column" part into v17.
I am not sure about the other complex parts;
the lack of comments makes them kind of hard to fully understand.

[1] 
https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/img_text/json_table.html
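
For reference, the "no nested json_table_column" shape is just a single COLUMNS
level, e.g. (a sketch, in the same JSON_TABLE syntax as the demo below):

SELECT jt.*
FROM JSON_TABLE('[{"a": 1}, {"a": 2}]'::jsonb, '$[*]'
                COLUMNS (a int PATH '$.a')) jt;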



+/* Reset context item of a scan, execute JSON path and reset a scan */
+static void
+JsonTableResetContextItem(JsonTableScanState *scan, Datum item)
+{
+ MemoryContext oldcxt;
+ JsonPathExecResult res;
+ Jsonb   *js = (Jsonb *) DatumGetJsonbP(item);
+
+ JsonValueListClear(&scan->found);
+
+ MemoryContextResetOnly(scan->mcxt);
+
+ oldcxt = MemoryContextSwitchTo(scan->mcxt);
+
+ res = executeJsonPath(scan->path, scan->args,
+  GetJsonPathVar, CountJsonPathVars,
+  js, scan->errorOnError, &scan->found,
+  false /* FIXME */ );
+
+ MemoryContextSwitchTo(oldcxt);
+
+ if (jperIsError(res))
+ {
+ Assert(!scan->errorOnError);
+ JsonValueListClear(&scan->found); /* EMPTY ON ERROR case */
+ }
+
+ JsonTableRescan(scan);
+}

"FIXME".
Setting the last argument in executeJsonPath to true also works as expected.
Also, there is no test related to the "FIXME";
I am not 100% sure about the "FIXME".

See the demo below (after setting executeJsonPath's "useTz" argument to true).

create table ss(js jsonb);
INSERT into ss select '{"a": "2018-02-21 12:34:56 +10"}';
INSERT into ss select '{"b": "2018-02-21 12:34:56 "}';
PREPARE q2 as SELECT jt.*  FROM ss, JSON_TABLE(js, '$.a.datetime()'
COLUMNS ("int7" timestamptz PATH '$')) jt;
PREPARE qb as SELECT jt.*  FROM ss, JSON_TABLE(js, '$.b.datetime()'
COLUMNS ("tstz" timestamptz PATH '$')) jt;
PREPARE q3 as SELECT jt.*  FROM ss, JSON_TABLE(js, '$.a.datetime()'
COLUMNS ("ts" timestamp PATH '$')) jt;

begin;
set time zone +10;
EXECUTE q2;
set time zone -10;
EXECUTE q2;
rollback;

begin;
set time zone +10;
SELECT JSON_VALUE(js, '$.a' returning timestamptz) from ss;
set time zone -10;
SELECT JSON_VALUE(js, '$.a' returning timestamptz) from ss;
rollback;
-
begin;
set time zone +10;
EXECUTE qb;
set time zone -10;
EXECUTE qb;
rollback;

begin;
set time zone +10;
SELECT JSON_VALUE(js, '$.b' returning timestamptz) from ss;
set time zone -10;
SELECT JSON_VALUE(js, '$.b' returning timestamptz) from ss;
rollback;
-
begin;
set time zone +10;
EXECUTE q3;
set time zone -10;
EXECUTE q3;
rollback;

begin;
set time zone +10;
SELECT JSON_VALUE(js, '$.b' returning timestamp) from ss;
set time zone -10;
SELECT JSON_VALUE(js, '$.b' returning timestamp) from ss;
rollback;




Re: Can't find not null constraint, but \d+ shows that

2024-03-26 Thread Tender Wang
Alvaro Herrera wrote on Tue, Mar 26, 2024 at 23:25:

> On 2024-Mar-26, Tender Wang wrote:
>
> > postgres=# CREATE TABLE t1(c0 int, c1 int);
> > postgres=# ALTER TABLE  t1 ADD CONSTRAINT Q PRIMARY KEY(c0, c1);
> > postgres=# ALTER TABLE  t1 DROP c1;
> >
> > postgres=# ALTER TABLE  t1  ALTER c0 DROP NOT NULL;
> > ERROR:  could not find not-null constraint on column "c0", relation "t1"
>
> Ooh, hah, what happens here is that we drop the PK constraint
> indirectly, so we only go via doDeletion rather than the tablecmds.c
> code, so we don't check the attnotnull flags that the PK was protecting.
>

Yeah, indeed, as you said.

> The attached patch is my workaround solution.  Look forward your apply.
>
> Yeah, this is not a very good approach -- I think you're just guessing
> that the column is marked NOT NULL because a PK was dropped in the
> past -- but really what this catalog state is, is corrupted contents
> because the PK drop was mishandled.  At least in theory there are other
> ways to drop a constraint other than dropping one of its columns (for
> example, maybe this could happen if you drop a collation that the PK
> depends on).  The right approach is to ensure that the PK drop always
> does the dance that ATExecDropConstraint does.  A good fix probably just
> moves some code from dropconstraint_internal to RemoveConstraintById.
>

Agreed, that looks better.  But it will not work to simply move some code
from dropconstraint_internal to RemoveConstraintById. I have tried this
fix before the 0001 patch, but it failed.

For example:
create table skip_wal_skip_rewrite_index (c varchar(10) primary key);
alter table skip_wal_skip_rewrite_index alter c type varchar(20);
ERROR:  primary key column "c" is not marked NOT NULL

index_check_primary_key() in index.c has the below comment:

"We check for a pre-existing primary key, and that all columns of the index
are simple column references (not expressions), and that all those columns
are marked NOT NULL.  If not, fail."

So in the above example, RemoveConstraintById() can't reset attnotnull. We
could pass some information to RemoveConstraintById(), like a bool var, to
indicate whether attnotnull should be reset or not.



--
Tender Wang
OpenPie:  https://en.openpie.com/


Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-26 Thread shveta malik
On Tue, Mar 26, 2024 at 9:59 PM Bharath Rupireddy
 wrote:
>
> On Tue, Mar 26, 2024 at 4:35 PM Bharath Rupireddy
>  wrote:
> >
> > If we just sync inactive_since value for synced slots while in
> > recovery from the primary, so be it. Why do we need to update it to
> > the current time when the slot is being created? We don't expose slot
> > creation time, no? Aren't we fine if we just sync the value from
> > primary and document that fact? After the promotion, we can reset it
> > to the current time so that it gets its own time.
>
> I'm attaching v24 patches. It implements the above idea proposed
> upthread for synced slots. I've now separated
> s/last_inactive_time/inactive_since and synced slots behaviour. Please
> have a look.

Thanks for the patches. Few trivial comments for v24-002:

1)
slot.c:
+ * data from the remote slot. We use InRecovery flag instead of
+ * RecoveryInProgress() as it always returns true even for normal
+ * server startup.

a) Not clear what 'it' refers to. Better to use 'the latter'
b) Is it better to mention the primary here:
 'as the latter always returns true even on the primary server during startup'.


2)
update_local_synced_slot():

- strcmp(remote_slot->plugin, NameStr(slot->data.plugin)) == 0)
+ strcmp(remote_slot->plugin, NameStr(slot->data.plugin)) == 0 &&
+ remote_slot->inactive_since == slot->inactive_since)

When this code was written initially, the intent was to do strcmp at
the end (only if absolutely needed). It will be good if we maintain
the same and add new checks before strcmp.

3)
update_synced_slots_inactive_time():

This assert is removed, is it intentional?
Assert(s->active_pid == 0);


4)
040_standby_failover_slots_sync.pl:

+# Capture the inactive_since of the slot from the standby the logical failover
+# slots are synced/created on the standby.

The comment is unclear, something seems missing.

thanks
Shveta




Re: Properly pathify the union planner

2024-03-26 Thread Richard Guo
On Wed, Mar 27, 2024 at 6:23 AM David Rowley  wrote:

> Because this field is set, it plans the CTE thinking it's a UNION
> child and breaks when it can't find a SortGroupClause for the CTE's
> target list item.


Right.  The problem here is that we mistakenly think that the CTE query
is a subquery for the set operation and thus store the SetOperationStmt
in its qp_extra.  Currently the code for the check is:

   /*
* Check if we're a subquery for a set operation.  If we are, store
* the SetOperationStmt in qp_extra.
*/
   if (root->parent_root != NULL &&
   root->parent_root->parse->setOperations != NULL &&
   IsA(root->parent_root->parse->setOperations, SetOperationStmt))
   qp_extra.setop =
   (SetOperationStmt *) root->parent_root->parse->setOperations;
   else
   qp_extra.setop = NULL;

This check cannot tell if the subquery is for a set operation or a CTE,
because its parent might have setOperations set in both cases.  Hmm, is
there any way to differentiate between the two?
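
For reference, the failing query from Alexander's report (where the subquery
being planned is the CTE rather than a set-operation branch) is:

SELECT count(*) FROM (
   WITH q1(x) AS (SELECT 1)
   SELECT FROM q1 UNION SELECT FROM q1
) qu;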

Thanks
Richard


Re: recovery modules

2024-03-26 Thread Nathan Bossart
another rebase for cfbot

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
From 0ccdbb0a010830380e6ff4b7a052198d50f0680f Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 15 Feb 2023 14:28:53 -0800
Subject: [PATCH v20 1/5] introduce routine for checking mutually exclusive
 string GUCs

---
 src/backend/postmaster/pgarch.c |  8 +++-
 src/backend/utils/misc/guc.c| 22 ++
 src/include/utils/guc.h |  3 +++
 3 files changed, 28 insertions(+), 5 deletions(-)

diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index c266904b57..55e9a1796b 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -821,11 +821,9 @@ LoadArchiveLibrary(void)
 {
 	ArchiveModuleInit archive_init;
 
-	if (XLogArchiveLibrary[0] != '\0' && XLogArchiveCommand[0] != '\0')
-		ereport(ERROR,
-(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
- errmsg("both archive_command and archive_library set"),
- errdetail("Only one of archive_command, archive_library may be set.")));
+	(void) CheckMutuallyExclusiveStringGUCs(XLogArchiveLibrary, "archive_library",
+			XLogArchiveCommand, "archive_command",
+			ERROR);
 
 	/*
 	 * If shell archiving is enabled, use our special initialization function.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 391866145e..ac507008ce 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2659,6 +2659,28 @@ ReportGUCOption(struct config_generic *record)
 	pfree(val);
 }
 
+/*
+ * If both parameters are set, emits a log message at 'elevel' and returns
+ * false.  Otherwise, returns true.
+ */
+bool
+CheckMutuallyExclusiveStringGUCs(const char *p1val, const char *p1name,
+ const char *p2val, const char *p2name,
+ int elevel)
+{
+	if (p1val[0] != '\0' && p2val[0] != '\0')
+	{
+		ereport(elevel,
+(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("cannot set both %s and %s", p1name, p2name),
+ errdetail("Only one of %s or %s may be set.",
+		   p1name, p2name)));
+		return false;
+	}
+
+	return true;
+}
+
 /*
  * Convert a value from one of the human-friendly units ("kB", "min" etc.)
  * to the given base unit.  'value' and 'unit' are the input value and unit
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index 3712aba09b..3088ada610 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -376,6 +376,9 @@ extern void RestrictSearchPath(void);
 extern void AtEOXact_GUC(bool isCommit, int nestLevel);
 extern void BeginReportingGUCOptions(void);
 extern void ReportChangedGUCOptions(void);
+extern bool CheckMutuallyExclusiveStringGUCs(const char *p1val, const char *p1name,
+			 const char *p2val, const char *p2name,
+			 int elevel);
 extern void ParseLongOption(const char *string, char **name, char **value);
 extern const char *get_config_unit_name(int flags);
 extern bool parse_int(const char *value, int *result, int flags,
-- 
2.25.1

From 9d36df03e0eece1a08b7e778e6d4c82b2228582a Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 15 Feb 2023 10:36:00 -0800
Subject: [PATCH v20 2/5] refactor code for restoring via shell

---
 src/backend/Makefile  |   2 +-
 src/backend/access/transam/timeline.c |  12 +-
 src/backend/access/transam/xlog.c |  50 -
 src/backend/access/transam/xlogarchive.c  | 167 ---
 src/backend/access/transam/xlogrecovery.c |   3 +-
 src/backend/meson.build   |   1 +
 src/backend/postmaster/startup.c  |  16 +-
 src/backend/restore/Makefile  |  18 ++
 src/backend/restore/meson.build   |   5 +
 src/backend/restore/shell_restore.c   | 245 ++
 src/include/Makefile  |   2 +-
 src/include/access/xlogarchive.h  |   9 +-
 src/include/meson.build   |   1 +
 src/include/postmaster/startup.h  |   1 +
 src/include/restore/shell_restore.h   |  26 +++
 src/tools/pgindent/typedefs.list  |   1 +
 16 files changed, 407 insertions(+), 152 deletions(-)
 create mode 100644 src/backend/restore/Makefile
 create mode 100644 src/backend/restore/meson.build
 create mode 100644 src/backend/restore/shell_restore.c
 create mode 100644 src/include/restore/shell_restore.h

diff --git a/src/backend/Makefile b/src/backend/Makefile
index 6700aec039..590b5002c0 100644
--- a/src/backend/Makefile
+++ b/src/backend/Makefile
@@ -19,7 +19,7 @@ include $(top_builddir)/src/Makefile.global
 SUBDIRS = access archive backup bootstrap catalog parser commands executor \
 	foreign lib libpq \
 	main nodes optimizer partitioning port postmaster \
-	regex replication rewrite \
+	regex replication restore rewrite \
 	statistics storage tcop tsearch utils $(top_builddir)/src/timezone \
 	jit
 
diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c
index 

Re: [PATCH] LockAcquireExtended improvement

2024-03-26 Thread Will Mortensen
On Thu, Mar 14, 2024 at 1:15 PM Robert Haas  wrote:
> Seeing no further discussion, I have committed my version of this
> patch, with your test case.

This comment on ProcSleep() seems to have the values of dontWait
backward (double negatives are tricky):

* Result: PROC_WAIT_STATUS_OK if we acquired the lock,
PROC_WAIT_STATUS_ERROR
* if not (if dontWait = true, this is a deadlock; if dontWait = false, we
* would have had to wait).

Also there's a minor typo in a comment in LockAcquireExtended():

* Check the proclock entry status. If dontWait = true, this is an
* expected case; otherwise, it will open happen if something in the
* ipc communication doesn't work correctly.

"open" should be "only".




Re: Add new error_action COPY ON_ERROR "log"

2024-03-26 Thread Masahiko Sawada
On Tue, Mar 26, 2024 at 6:36 PM Bharath Rupireddy
 wrote:
>
> On Tue, Mar 26, 2024 at 1:46 PM Masahiko Sawada  wrote:
> >
> > Thank you for updating the patch! Looks good to me.
> >
> > Please find the attached patch. I've made some changes for the
> > documentation and the commit message. I'll push it, barring any
> > objections.
>
> Thanks. v12 patch LGTM.
>

While testing the new option, I realized that the tab-completion
completes DEFAULT as a value, but it doesn't work without single-quotes:

postgres(1:2179134)=# copy test from '/tmp/dump.data' with
(log_verbosity default );
ERROR:  syntax error at or near "default"
LINE 1: ...py test from '/tmp/dump.data' with (log_verbosity default );
 ^
postgres(1:2179134)=# copy test from '/tmp/dump.data' with
(log_verbosity 'default' );
COPY 0

Whereas VERBOSE works even without single-quotes:

postgres(1:2179134)=# copy test from '/tmp/dump.data' with
(log_verbosity verbose );
COPY 0

postgres(1:2179134)=# copy test from '/tmp/dump.data' with
(log_verbosity 'verbose' );
COPY 0

Which could confuse users. This is because DEFAULT is a reserved
keyword and the COPY option doesn't accept it as an option value.

I think that there are two options to handle it:

1. change COPY grammar to accept DEFAULT as an option value.
2. change tab-completion to complete 'DEFAULT' instead of DEFAULT,
and update the doc too.

As for the documentation, we can add single-quotes as follows:

 ENCODING 'encoding_name'
+LOG_VERBOSITY [ 'mode' ]

I thought option (2) was better, but there seems to be no precedent for
completing a single-quoted string other than file names. So
option (1) could be clearer.

What do you think?

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Why is parula failing?

2024-03-26 Thread David Rowley
On Tue, 26 Mar 2024 at 21:03, Tharakan, Robins  wrote:
>
> > David Rowley  writes:
> > It would be good to have log_autovacuum_min_duration = 0 on this machine for a while.
>
> - Have set log_autovacuum_min_duration=0 on parula and a test run came out okay.
> - Also added REL_16_STABLE to the branches being tested (in case it matters here).

Thanks for doing that.

I see PG_16_STABLE has failed twice already with the same issue.  I
don't see any autovacuum / autoanalyze in the log, so I guess that
rules out auto vacuum activity causing this.

Unfortunately, REL_16_STABLE does not have the additional debugging,
so don't get to know what reltuples was set to.

David




Re: add AVX2 support to simd.h

2024-03-26 Thread Tom Lane
Nathan Bossart  writes:
> On Tue, Mar 26, 2024 at 06:55:54PM -0500, Nathan Bossart wrote:
>> On Tue, Mar 26, 2024 at 07:28:24PM -0400, Tom Lane wrote:
>>> A significant fraction of the buildfarm is issuing warnings about
>>> this.

> Done.  I'll keep an eye on the farm.

Thanks.

> I just did the minimal fix for now, i.e., I moved the new label into the
> SIMD section of the function.  I think it would be better stylistically to
> move the one-by-one logic to an inline helper function, but I didn't do
> that just in case it might negatively impact performance.  I'll look into
> this and will follow up with another patch if it looks good.

Sounds like a plan.

regards, tom lane




Re: add AVX2 support to simd.h

2024-03-26 Thread Nathan Bossart
On Tue, Mar 26, 2024 at 06:55:54PM -0500, Nathan Bossart wrote:
> On Tue, Mar 26, 2024 at 07:28:24PM -0400, Tom Lane wrote:
>> A significant fraction of the buildfarm is issuing warnings about
>> this.
> 
> Thanks for the heads-up.  Will fix.

Done.  I'll keep an eye on the farm.

I just did the minimal fix for now, i.e., I moved the new label into the
SIMD section of the function.  I think it would be better stylistically to
move the one-by-one logic to an inline helper function, but I didn't do
that just in case it might negatively impact performance.  I'll look into
this and will follow up with another patch if it looks good.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Possibility to disable `ALTER SYSTEM`

2024-03-26 Thread Andrew Dunstan



> On Mar 27, 2024, at 3:53 AM, Tom Lane  wrote:
> 
> Bruce Momjian  writes:
>> I am thinking "enable_alter_system_command" is probably good because we
>> already use "enable" so why not reuse that idea, and I think "command"
>> is needed because we need to clarify we are talking about the command,
>> and not generic altering of the system.  We could use
>> "enable_sql_alter_system" if people want something shorter.
> 
> Robert already mentioned why not use "enable_": up to now that prefix
> has only been applied to planner plan-type-enabling GUCs.  I'd be okay
> with "allow_alter_system_command", although I find it unnecessarily
> verbose.

Agree. I don’t think “_command” adds much clarity.

Cheers

Andrew




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-26 Thread John Naylor
On Mon, Mar 25, 2024 at 8:07 PM Masahiko Sawada  wrote:
>
> On Mon, Mar 25, 2024 at 3:25 PM John Naylor  wrote:
> >
> > On Fri, Mar 22, 2024 at 12:20 PM Masahiko Sawada  
> > wrote:

> > - * remaining LP_DEAD line pointers on the page in the dead_items
> > - * array. These dead items include those pruned by lazy_scan_prune()
> > - * as well we line pointers previously marked LP_DEAD.
> > + * remaining LP_DEAD line pointers on the page in the dead_items.
> > + * These dead items include those pruned by lazy_scan_prune() as well
> > + * we line pointers previously marked LP_DEAD.
> >
> > Here maybe "into dead_items".

- * remaining LP_DEAD line pointers on the page in the dead_items.
+ * remaining LP_DEAD line pointers on the page into the dead_items.

Let me explain. It used to be "in the dead_items array." It is not an
array anymore, so it was changed to "in the dead_items". dead_items is
a variable name, and names don't take "the". "into dead_items" seems
most natural to me, but there are other possible phrasings.

> > > > Did you try it with 1MB m_w_m?
> > >
> > > I've incorporated the above comments and test results look good to me.
> >
> > Could you be more specific about what the test was?
> > Does it work with 1MB m_w_m?
>
> If m_w_m is 1MB, both the initial and maximum segment sizes are 256kB.
>
> FYI other test cases I tested were:
>
> * m_w_m = 2199023254528 (maximum value)
> initial: 1MB
> max: 128GB
>
> * m_w_m = 64MB (default)
> initial: 1MB
> max: 8MB

If the test was a vacuum, how big a table was needed to hit 128GB?

> > The existing comment slipped past my radar, but max_bytes is not a
> > limit, it's a hint. Come to think of it, it never was a limit in the
> > normal sense, but in earlier patches it was the criteria for reporting
> > "I'm full" when asked.
>
> Updated the comment.

+ * max_bytes is not a limit; it's used to choose the memory block sizes of
+ * a memory context for TID storage in order for the total memory consumption
+ * not to be overshot a lot. The caller can use the max_bytes as the criteria
+ * for reporting whether it's full or not.

This is good information. I suggest this edit:

"max_bytes" is not an internally-enforced limit; it is used only as a
hint to cap the memory block size of the memory context for TID
storage. This reduces space wastage due to over-allocation. If the
caller wants to monitor memory usage, it must compare its limit with
the value reported by TidStoreMemoryUsage().

Other comments:

v79-0002 looks good to me.

v79-0003:

"With this commit, when creating a shared TidStore, a dedicated DSA
area is created for TID storage instead of using the provided DSA
area."

This is very subtle, but "the provided..." implies there still is one.
-> "a provided..."

+ * Similar to TidStoreCreateLocal() but create a shared TidStore on a
+ * DSA area. The TID storage will live in the DSA area, and a memory
+ * context rt_context will have only meta data of the radix tree.

-> "the memory context"

I think you can go ahead and commit 0002 and 0003/4.

v79-0005:

- bypass = (vacrel->lpdead_item_pages < threshold &&
-   vacrel->lpdead_items < MAXDEADITEMS(32L * 1024L * 1024L));
+ bypass = (vacrel->lpdead_item_pages < threshold) &&
+ TidStoreMemoryUsage(vacrel->dead_items) < (32L * 1024L * 1024L);

The parentheses look strange, and the first line shouldn't change
without a good reason.

- /* Set dead_items space */
- dead_items = (VacDeadItems *) shm_toc_lookup(toc,
- PARALLEL_VACUUM_KEY_DEAD_ITEMS,
- false);
+ /* Set dead items */
+ dead_items = TidStoreAttach(shared->dead_items_dsa_handle,
+ shared->dead_items_handle);

I feel ambivalent about this comment change. The original is not very
descriptive to begin with. If we need to change at all, maybe "find
dead_items in shared memory"?

v79-0005: As I said earlier, Dilip Kumar reviewed an earlier version.

v79-0006:

vac_work_mem should also go back to being an int.




Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-03-26 Thread Alexander Korotkov
On Sun, Mar 24, 2024 at 4:39 AM Bharath Rupireddy
 wrote:
> I share the same concern as yours and had proposed something upthread
> [1]. The idea is something like how each query takes a snapshot at the
> beginning of txn/query (depending on isolation level), the same way
> the standby can wait for the primary's current LSN as of the moment
> (at the time of taking snapshot). And, primary keeps sending its
> current LSN as part of regular WAL to standbys so that the standbys
> doesn't have to make connections to the primary to know its current
> LSN every time. Perhps, this may not even fully guarantee (considered
> to be achieving) the read-after-write consistency on standbys unless
> there's a way for the application to tell the wait LSN.

Oh, no.  Please check [1].  The idea is to wait for a particular
transaction to become visible.  The one who made a change on the primary
brings the lsn value from there to the replica.  For instance, an
application makes a change on the primary and then wants to run some
report on the replica.  And the report should be guaranteed to contain the
change just made.  So, the application queries the LSN from the primary
after making a write transaction, then calls pg_wait_lsn() on the
replica before running the report.

This is quite simple server functionality, which could be used at
application-level, ORM-level or pooler-level.  And it unlocks the way
forward for in-protocol implementation as proposed by Peter
Eisentraut.

Links.
1. 
https://www.postgresql.org/message-id/CAPpHfdtny81end69PzEdRsROKnsybsj%3DOs8DUM-6HeKGKnCuQQ%40mail.gmail.com

--
Regards,
Alexander Korotkov




Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-03-26 Thread Alexander Korotkov
On Fri, Mar 22, 2024 at 12:58 AM Peter Eisentraut  wrote:
> On 17.03.24 15:09, Alexander Korotkov wrote:
> > My current attempt was to commit minimal implementation as less
> > invasive as possible.  A new clause for BEGIN doesn't require
> > additional keywords and doesn't introduce additional statements.  But
> > yes, this is still a new qual.  And, yes, Amit you're right that even
> > if I had committed that, there was still a high risk of further
> > debates and revert.
>
> I had written in [0] about my questions related to using this with
> connection poolers.  I don't think this was addressed at all.  I haven't
> seen any discussion about how to make this kind of facility usable in a
> full system.  You have to manually query and send LSNs; that seems
> pretty cumbersome.  Sure, this is part of something that could be
> useful, but how would an actual user with actual application code get to
> use this?

The current usage pattern of this functionality is the following.

1. Do the write transaction on primary
2. Query pg_current_wal_insert_lsn() on primary
3. Call pg_wait_lsn() on the replica with the value obtained in the previous step
4. Do the read transaction of replica
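
For example, a minimal sketch of the pattern (the table here is hypothetical;
pg_wait_lsn() is the procedure proposed in this thread, and the exact type/unit
of its timeout argument is still being discussed):

-- on the primary
INSERT INTO t VALUES (42);
SELECT pg_current_wal_insert_lsn();   -- say this returns '0/3002AE8'

-- on the replica
CALL pg_wait_lsn('0/3002AE8', 1);     -- wait until that LSN has been replayed
SELECT * FROM t;                      -- the insert above is now visible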

This usage pattern could be implemented either on the application
level, or on the pooler level.  For application level, it would
require a somewhat advanced level of database-aware programming, but
this is still a valid usage.  Regarding poolers, if some poolers
manage to automatically distinguish reading and writing queries,
dealing with LSNs wouldn't be too complex for them.

Having this functionality on protocol level would be ideal, but let's
do this step-by-step.  The built-in procedure isn't very invasive, but
that could give us some adoption and open the way forward.

--
Regards,
Alexander Korotkov




Re: pg_stat_statements and "IN" conditions

2024-03-26 Thread Yasuo Honda
Thanks for the useful info.

Ruby on Rails uses bigint as a default data type for the primary key
and prepared statements have been enabled by default for PostgreSQL.
I'm looking forward to these current patches being merged as a first
step and future versions of pg_stat_statements will support
normalizing bigint and prepared statements.

On Wed, Mar 27, 2024 at 6:00 AM Dmitry Dolgov <9erthali...@gmail.com> wrote:

> It's a similar case: the column is defined as bigint, thus PostgreSQL
> has to wrap every constant expression in a function expression that
> converts its type to bigint. The current patch version doesn't try to
> reduce a FuncExpr into Const (even if the wrapped value is a Const),
> thus this array is not getting merged. If you replace bigint with an
> int, no type conversion would be required and merging logic will kick
> in.
>
> Again, the original version of the patch was able to handle this case,
> but it was stripped away to make the patch smaller in hope of moving
> forward. Anyway, thanks for reminding about how annoying the current
> handling of constant arrays can look like in practice!




Re: add AVX2 support to simd.h

2024-03-26 Thread Nathan Bossart
On Tue, Mar 26, 2024 at 07:28:24PM -0400, Tom Lane wrote:
> Nathan Bossart  writes:
>> I've committed v9, and I've marked the commitfest entry as "Committed,"
>> although we may want to revisit AVX2, etc. in the future.
> 
> A significant fraction of the buildfarm is issuing warnings about
> this.

Thanks for the heads-up.  Will fix.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-03-26 Thread Alexander Korotkov
On Fri, Mar 22, 2024 at 12:50 AM Peter Eisentraut  wrote:
> On 19.03.24 18:38, Kartyshov Ivan wrote:
> >   CALL pg_wait_lsn('0/3002AE8', 1);
> >   BEGIN;
> >   SELECT * FROM tbl; // read fresh insertions
> >   COMMIT;
>
> I'm not endorsing this or any other approach, but I think the timeout
> parameter should be of type interval, not an integer with a unit that is
> hidden in the documentation.

I'm not sure a timeout needs to deal with complexity of our interval
datatype.  At the same time, the integer number of milliseconds looks
a bit weird.  Could the float8 number of seconds be an option?

--
Regards,
Alexander Korotkov




Re: add AVX2 support to simd.h

2024-03-26 Thread Tom Lane
Nathan Bossart  writes:
> I've committed v9, and I've marked the commitfest entry as "Committed,"
> although we may want to revisit AVX2, etc. in the future.

A significant fraction of the buildfarm is issuing warnings about
this.

 adder     | 2024-03-26 21:04:33 | ../pgsql/src/include/port/pg_lfind.h:199:1: warning: label 'one_by_one' defined but not used [-Wunused-label]
 buri      | 2024-03-26 21:16:09 | ../../src/include/port/pg_lfind.h:199:1: warning: label 'one_by_one' defined but not used [-Wunused-label]
 cavefish  | 2024-03-26 22:53:23 | ../../src/include/port/pg_lfind.h:199:1: warning: label 'one_by_one' defined but not used [-Wunused-label]
 cisticola | 2024-03-26 22:20:07 | ../../../../src/include/port/pg_lfind.h:199:1: warning: label 'one_by_one' defined but not used [-Wunused-label]
 lancehead | 2024-03-26 21:48:17 | ../../src/include/port/pg_lfind.h:199:1: warning: unused label 'one_by_one' [-Wunused-label]
 nicator   | 2024-03-26 21:08:14 | ../../src/include/port/pg_lfind.h:199:1: warning: label 'one_by_one' defined but not used [-Wunused-label]
 nuthatch  | 2024-03-26 22:00:04 | ../../src/include/port/pg_lfind.h:199:1: warning: label 'one_by_one' defined but not used [-Wunused-label]
 rinkhals  | 2024-03-26 19:51:32 | ../../src/include/port/pg_lfind.h:199:1: warning: unused label 'one_by_one' [-Wunused-label]
 siskin    | 2024-03-26 19:59:29 | ../../src/include/port/pg_lfind.h:199:1: warning: label 'one_by_one' defined but not used [-Wunused-label]

regards, tom lane




Re: ALTER TABLE SET ACCESS METHOD on partitioned tables

2024-03-26 Thread Justin Pryzby
On Thu, Mar 21, 2024 at 01:07:01PM +0100, Alvaro Herrera wrote:
> Given that Michaël is temporarily gone, I propose to push the attached
> tomorrow.

Thanks.

On Tue, Mar 26, 2024 at 12:05:47PM +0100, Alvaro Herrera wrote:
> On 2024-Mar-26, Alexander Lakhin wrote:
> 
> > Hello Alvaro,
> > 
> > 21.03.2024 15:07, Alvaro Herrera wrote:
> > > Given that Michaël is temporarily gone, I propose to push the attached
> > > tomorrow.
> > 
> > Please look at a new anomaly introduced with 374c7a229.
> > Starting from that commit, the following erroneous query:
> > CREATE FOREIGN TABLE fp PARTITION OF pg_am DEFAULT SERVER x;
> > 
> > triggers an assertion failure:
> > TRAP: failed Assert("relation->rd_rel->relam == InvalidOid"), File: 
> > "relcache.c", Line: 1219, PID: 3706301
> 
> Hmm, yeah, we're setting relam for relations that shouldn't have it.
> I propose the attached.

Looks right.  That's how I originally wrote it, except for the
"stmt->accessMethod != NULL" case.

I preferred my way - the grammar should refuse to set stmt->accessMethod
for inappropriate relkinds.  And you could assert that.

I also preferred to set "accessMethodId = InvalidOid" once, rather than
twice.

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 8a02c5b05b6..050be89728f 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -962,18 +962,21 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
 	 * case of a partitioned table) the parent's, if it has one.
 	 */
 	if (stmt->accessMethod != NULL)
-		accessMethodId = get_table_am_oid(stmt->accessMethod, false);
-	else if (stmt->partbound)
 	{
-		Assert(list_length(inheritOids) == 1);
-		accessMethodId = get_rel_relam(linitial_oid(inheritOids));
+		Assert(RELKIND_HAS_TABLE_AM(relkind) || relkind == RELKIND_PARTITIONED_TABLE);
+		accessMethodId = get_table_am_oid(stmt->accessMethod, false);
 	}
-	else
-		accessMethodId = InvalidOid;
+	else if (RELKIND_HAS_TABLE_AM(relkind) || relkind == RELKIND_PARTITIONED_TABLE)
+	{
+		if (stmt->partbound)
+		{
+			Assert(list_length(inheritOids) == 1);
+			accessMethodId = get_rel_relam(linitial_oid(inheritOids));
+		}
 
-	/* still nothing? use the default */
-	if (RELKIND_HAS_TABLE_AM(relkind) && !OidIsValid(accessMethodId))
-		accessMethodId = get_table_am_oid(default_table_access_method, false);
+		if (RELKIND_HAS_TABLE_AM(relkind) && !OidIsValid(accessMethodId))
+			accessMethodId = get_table_am_oid(default_table_access_method, false);
+	}
 
 	/*
 	 * Create the relation.  Inherited defaults and constraints are passed in
 	 */



Re: Properly pathify the union planner

2024-03-26 Thread David Rowley
On Wed, 27 Mar 2024 at 06:00, Alexander Lakhin  wrote:
> SELECT count(*) FROM (
>WITH q1(x) AS (SELECT 1)
>SELECT FROM q1 UNION SELECT FROM q1
> ) qu;
>
> TRAP: failed Assert("lg != NULL"), File: "planner.c", Line: 7941, PID: 1133017

Thanks for finding that.

There's something weird going on with the UNION child subquery's
setOperations field. As far as I understand, and from reading the
existing comments, this should only be set for the top-level union.

Because this field is set, it plans the CTE thinking it's a UNION
child and breaks when it can't find a SortGroupClause for the CTE's
target list item.

I'll keep digging. As far as I see the setOperations field is only set
in transformSetOperationStmt(). I'm guessing we must be doing a
copyObject() somewhere and accidentally picking up the parent's
setOperations.

David




Re: WIP Incremental JSON Parser

2024-03-26 Thread Jacob Champion
On Mon, Mar 25, 2024 at 3:14 PM Jacob Champion
 wrote:
> - add Assert calls in impossible error cases [2]

To expand on this one, I think these parts of the code (marked with
`<- here`) are impossible to reach:

> switch (top)
> {
> case JSON_TOKEN_STRING:
> if (next_prediction(pstack) == JSON_TOKEN_COLON)
> ctx = JSON_PARSE_STRING;
> else
> ctx = JSON_PARSE_VALUE;<- here
> break;
> case JSON_TOKEN_NUMBER:<- here
> case JSON_TOKEN_TRUE:  <- here
> case JSON_TOKEN_FALSE: <- here
> case JSON_TOKEN_NULL:  <- here
> case JSON_TOKEN_ARRAY_START:   <- here
> case JSON_TOKEN_OBJECT_START:  <- here
> ctx = JSON_PARSE_VALUE;
> break;
> case JSON_TOKEN_ARRAY_END: <- here
> ctx = JSON_PARSE_ARRAY_NEXT;
> break;
> case JSON_TOKEN_OBJECT_END:<- here
> ctx = JSON_PARSE_OBJECT_NEXT;
> break;
> case JSON_TOKEN_COMMA: <- here
> if (next_prediction(pstack) == JSON_TOKEN_STRING)
> ctx = JSON_PARSE_OBJECT_NEXT;
> else
> ctx = JSON_PARSE_ARRAY_NEXT;
> break;

Since none of these cases are non-terminals, the only way to get to
this part of the code is if (top != tok). But inspecting the
productions and transitions that can put these tokens on the stack, it
looks like the only way for them to be at the top of the stack to
begin with is if (tok == top). (Otherwise, we would have chosen a
different production, or else errored out on a non-terminal.)

This case is possible...

> case JSON_TOKEN_STRING:
> if (next_prediction(pstack) == JSON_TOKEN_COLON)
> ctx = JSON_PARSE_STRING;

...if we're in the middle of JSON_PROD_[MORE_]KEY_PAIRS. But the
corresponding else case is not, because if we're in the middle of a
_KEY_PAIRS production, the next_prediction() _must_ be
JSON_TOKEN_COLON.

Do you agree, or am I missing a way to get there?

Thanks,
--Jacob




Re: speed up a logical replica setup

2024-03-26 Thread Tomas Vondra
On 3/26/24 21:17, Euler Taveira wrote:
> On Tue, Mar 26, 2024, at 4:12 PM, Tomas Vondra wrote:
>> Perhaps I'm missing something, but why is NUM_CONN_ATTEMPTS even needed?
>> Why isn't recovery_timeout enough to decide if wait_for_end_recovery()
>> waited long enough?
> 
> It was an attempt to decoupled a connection failure (that keeps streaming the
> WAL) from recovery timeout. The NUM_CONN_ATTEMPTS guarantees that if the 
> primary
> is gone during the standby recovery process, there is a way to bail out. The
> recovery-timeout is 0 (infinite) by default so you have an infinite wait 
> without
> this check. The idea behind this implementation is to avoid exiting in this
> critical code path. If it times out here you might have to rebuild the standby
> and start again.

- This seems like something that should definitely be documented in the
comment before wait_for_end_recovery(). At the moment it only talks
about timeout, and nothing about NUM_CONN_ATTEMPTS.

- The NUM_CONN_ATTEMPTS name seems rather misleading, considering it
does not really count connection attempts, but number of times we have
not seen 1 in pg_catalog.pg_stat_wal_receiver.

- Not sure I follow the logic - it tries to avoid exiting by setting
infinite timeout, but it still exits based on NUM_CONN_ATTEMPTS. Isn't
that somewhat contradictory?

- Isn't the NUM_CONN_ATTEMPTS actually making it more fragile, i.e. more
likely to exit? For example, what if there's a short networking hiccup,
so that the standby can't connect to the primary.

- It seems a bit strange that even with the recovery timeout set, the
limit of 10 "connection attempts" effectively establishes a separate
hard-coded limit of ~10 seconds. It's surprising if I set the recovery
limit to 1 minute and it just dies after 10 seconds.

> Amit suggested [1] that we use a value for recovery-timeout, but
> how high is a good value? I've already seen some long recovery processes using
> the pglogical equivalent that timed out after hundreds of minutes. Maybe I'm too
> worried about a small percentage of cases and we should use 1h as default, for
> example. It would reduce the complexity since the recovery process lacks some
> progress indicators (LSN is not sufficient in this case and there isn't a
> function to provide the current state -- stop applying WAL, reach target, new
> timeline, etc).
> 
> If we remove the pg_stat_wal_receiver check, we should avoid infinite recovery
> by default otherwise we will have some reports saying the tool is hanging when
> in reality the primary has gone and WAL should be streamed.
> 

I don't think there's a default timeout value that would work for
everyone. Either it's going to be too short for some cases, or it'll
take too long for some other cases.

I think there are two obvious default values for the timeout - infinity,
and 60 seconds, which is the default we use for other CLI tools (like
pg_ctl and so on). Considering the negative impact of exiting, I'd say
it's better to default to infinity. It's always possible to Ctrl-C or
terminate the process in some other way, if needed.

As for people complaining about infinite recovery - perhaps it'd be
sufficient to mention this in the messages printed by the tool, to make
it clearer. Or maybe even print something in the loop, because right now
it's entirely silent so it's easy to believe it's stuck. Perhaps not on
every loop, but at least in verbose mode it should print something.
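
Even something very simple would do; a sketch (not using the actual
variable names from the patch):

    int         loops = 0;

    for (;;)
    {
        PGresult   *res = PQexec(conn,
                                 "SELECT pg_catalog.pg_is_in_recovery()");

        if (PQresultStatus(res) == PGRES_TUPLES_OK &&
            strcmp(PQgetvalue(res, 0, 0), "f") == 0)
        {
            PQclear(res);
            break;              /* recovery has ended */
        }
        PQclear(res);

        /* in verbose mode, say something every ~30 seconds */
        if (++loops % 30 == 0)
            pg_log_info("still waiting for the target server to finish recovery");

        pg_usleep(1000 * 1000L);        /* 1 second */
    }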

>> IMHO the test should simply pass PG_TEST_DEFAULT_TIMEOUT when calling
>> pg_createsubscriber, and that should do the trick.
> 
> That's a good idea. Tests are not exercising the recovery-timeout option.
> 
>> Increasing PG_TEST_DEFAULT_TIMEOUT is what buildfarm animals doing
>> things like ubsan/valgrind already use to deal with exactly this kind of
>> timeout problem.
>>
>> Or is there a deeper problem with deciding if the system is in recovery?
> 
> As I said with some recovery progress indicators it would be easier to make 
> some
> decisions like wait a few seconds because the WAL has already been applied and
> it is creating a new timeline. The recovery timeout decision is a shot in the
> dark because we might be aborting pg_createsubscriber when the target server 
> is
> about to set RECOVERY_STATE_DONE.
> 

Isn't it enough to check data in pg_stat_replication on the primary?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: pg_stat_statements and "IN" conditions

2024-03-26 Thread Dmitry Dolgov
> On Tue, Mar 26, 2024 at 04:21:46PM +0900, Yasuo Honda wrote:
> Yes. The script uses prepared statements because Ruby on Rails enables
> prepared statements by default for PostgreSQL databases.
>
> Then I tested this branch
> https://github.com/yahonda/postgres/tree/pg_stat_statements without
> using prepared statements as follows and all of them do not normalize
> in clause values.
>
> - Disabled prepared statements by setting `prepared_statements: false`
> https://gist.github.com/yahonda/2c2d6ac7a955886a305750eecfd07c5e
>
> - Use ruby-pg
> https://gist.github.com/yahonda/2f0efb11ae888d8f6b27a07e0b833fdf
>
> - Use psql
> https://gist.github.com/yahonda/c830379b33d66a743aef159aa03d7e49
>
> I do not know why even if I use psql, the query column at
> pg_stat_sql_statement shows it is like a prepared statement "IN ($1,
> $2)".

It's a similar case: the column is defined as bigint, thus PostgreSQL
has to wrap every constant expression in a function expression that
converts its type to bigint. The current patch version doesn't try to
reduce a FuncExpr into a Const (even if the wrapped value is a Const),
thus this array is not getting merged. If you replace bigint with int,
no type conversion would be required and the merging logic will kick
in.
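
To illustrate, the kind of check needed is roughly this (a sketch, not
the exact code that was in the original version of the patch):

    /*
     * Treat a plain type-coercion FuncExpr over a single Const as if it
     * were a Const for the purpose of merging the array elements.
     */
    static bool
    const_like(Node *node)
    {
        if (IsA(node, Const))
            return true;

        if (IsA(node, FuncExpr))
        {
            FuncExpr   *fe = (FuncExpr *) node;

            return fe->funcformat != COERCE_EXPLICIT_CALL &&
                list_length(fe->args) == 1 &&
                IsA(linitial(fe->args), Const);
        }

        return false;
    }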

Again, the original version of the patch was able to handle this case,
but it was stripped away to make the patch smaller in the hope of moving
forward. Anyway, thanks for the reminder of how annoying the current
handling of constant arrays can look in practice!




Re: speed up a logical replica setup

2024-03-26 Thread Euler Taveira
On Tue, Mar 26, 2024, at 4:12 PM, Tomas Vondra wrote:
> Perhaps I'm missing something, but why is NUM_CONN_ATTEMPTS even needed?
> Why isn't recovery_timeout enough to decide if wait_for_end_recovery()
> waited long enough?

It was an attempt to decouple a connection failure (that keeps streaming the
WAL) from recovery timeout. The NUM_CONN_ATTEMPTS guarantees that if the primary
is gone during the standby recovery process, there is a way to bail out. The
recovery-timeout is 0 (infinite) by default so you have an infinite wait without
this check. The idea behind this implementation is to avoid exiting in this
critical code path. If it times out here you might have to rebuild the standby
and start again. Amit suggested [1] that we use a value for recovery-timeout, but
how high is a good value? I've already seen some long recovery processes using the
pglogical equivalent that timed out after hundreds of minutes. Maybe I'm too
worried about a small percentage of cases and we should use 1h as default, for
example. It would reduce the complexity since the recovery process lacks some
progress indicators (LSN is not sufficient in this case and there isn't a
function to provide the current state -- stop applying WAL, reach target, new
timeline, etc).

If we remove the pg_stat_wal_receiver check, we should avoid infinite recovery
by default otherwise we will have some reports saying the tool is hanging when
in reality the primary has gone and WAL should be streamed.

> IMHO the test should simply pass PG_TEST_DEFAULT_TIMEOUT when calling
> pg_createsubscriber, and that should do the trick.

That's a good idea. Tests are not exercising the recovery-timeout option.

> Increasing PG_TEST_DEFAULT_TIMEOUT is what buildfarm animals doing
> things like ubsan/valgrind already use to deal with exactly this kind of
> timeout problem.
> 
> Or is there a deeper problem with deciding if the system is in recovery?

As I said with some recovery progress indicators it would be easier to make some
decisions like wait a few seconds because the WAL has already been applied and
it is creating a new timeline. The recovery timeout decision is a shot in the
dark because we might be aborting pg_createsubscriber when the target server is
about to set RECOVERY_STATE_DONE.


[1] 
https://www.postgresql.org/message-id/CAA4eK1JRgnhv_ySzuFjN7UaX9qxz5Hqcwew7Fk%3D%2BAbJbu0Kd9w%40mail.gmail.com


--
Euler Taveira
EDB   https://www.enterprisedb.com/


Re: Remove some redundant set_cheapest() calls

2024-03-26 Thread Tom Lane
Richard Guo  writes:
> I happened to notice that the set_cheapest() calls in functions
> set_namedtuplestore_pathlist() and set_result_pathlist() are redundant,
> as we've centralized the set_cheapest() calls in set_rel_pathlist().

> Attached is a trivial patch to remove these calls.

Agreed, and pushed.

> BTW, I suspect that the set_cheapest() call in set_dummy_rel_pathlist()
> is also redundant.  The comment there says "This is redundant when we're
> called from set_rel_size(), but not when called from elsewhere".  I
> doubt it.  The other places where it is called are set_append_rel_size()
> and set_subquery_pathlist(), both being called from set_rel_size().  So
> set_cheapest() would ultimately be called in set_rel_pathlist().

I'm less convinced about changing this.  I'd rather keep it consistent
with mark_dummy_rel.

regards, tom lane




Re: New Table Access Methods for Multi and Single Inserts

2024-03-26 Thread Bharath Rupireddy
On Tue, Mar 26, 2024 at 9:07 PM Jeff Davis  wrote:
>
> On Tue, 2024-03-26 at 01:28 +0530, Bharath Rupireddy wrote:
> > I'm thinking
> > of dropping the COPY FROM patch using the new multi insert API for
> > the
> > following reasons: ...
>
> I agree with all of this. We do want COPY ... FROM support, but there
> are some details to work out and we don't want to make a big code
> change at this point in the cycle.

Right.

> > Please see the attached v14 patches.
>
> * No need for a 'kind' field in TableModifyState. The state should be
> aware of the kinds of changes that it has received and that may need to
> be flushed later -- for now, only inserts, but possibly updates/deletes
> in the future.

Removed the 'kind' field in favor of lazy initialization of the required
AM-specific modify (insert, in this case) state. Since we don't have
'kind', I chose the callback approach to clean up the modify-specific
state at the end.

> * If the AM doesn't support the bulk methods, fall back to retail
> inserts instead of throwing an error.

For instance, CREATE MATERIALIZED VIEW foo_mv USING bar_tam AS SELECT *
FROM foo; doesn't work if bar_tam doesn't have table_tuple_insert
implemented.

Similarly, with this new API, the onus lies on table AM implementers to
provide an implementation for these new callbacks even if they just do
single inserts. But I do agree that we must catch this earlier, during
parse analysis itself, so I've added assertions in GetTableAmRoutine().

> * It seems like this API will eventually replace table_multi_insert and
> table_finish_bulk_insert completely. Do those APIs have any advantage
> remaining over the new one proposed here?

table_multi_insert needs to be there for sure as COPY ... FROM uses
it. Not sure if we need to remove the optional callback
table_finish_bulk_insert though. Heap AM doesn't implement one, but
some other AM might. Having said that, with this new API, whatever logic
used to live in table_finish_bulk_insert will have to be moved by table
AM implementers to table_modify_end.

FWIW, I can try writing a test table AM that uses this new AM but just
does single inserts, IOW, equivalent to table_tuple_insert().
Thoughts?

> * Right now I don't any important use of the flush method. It seems
> that could be accomplished in the finish method, and flush could just
> be an internal detail when the memory is exhausted. If we find a use
> for it later, we can always add it, but right now it seems unnecessary.

Firstly, we are not storing CommandId and options in TableModifyState,
because we expect CommandId to be changing (per Andres' comment).
Secondly, we don't want to pass just the CommandId and options to
table_modify_end(). Thirdly, one just has to call
table_modify_buffer_flush before table_modify_end. Do you have any
other thoughts here?
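
To illustrate the intended call sequence from the caller's side (the
signatures here are approximations, see the patches for the real ones):

    TableModifyState *mstate = table_modify_begin(rel, cid, options);

    while (table_scan_getnextslot(scan, ForwardScanDirection, srcslot))
        table_modify_buffer_insert(mstate, srcslot);    /* buffers, may flush */

    table_modify_buffer_flush(mstate);  /* flush whatever is still buffered */
    table_modify_end(mstate);           /* AM-specific cleanup via the callback */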

> * We need to be careful about cases where the command can be successful
> but the writes are not flushed. I don't think that's a problem with the
> current patch, but we will need to do something here when we expand to
> INSERT INTO ... SELECT.

You mean, writes are not flushed to the disk? Can you please elaborate
why it's different for INSERT INTO ... SELECT and not others? Can't
the new flush AM be helpful here to implement any flush related
things?

Please find the attached v15 patches with the above review comments addressed.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


v15-0001-Introduce-new-table-modify-access-methods.patch
Description: Binary data


v15-0002-Optimize-CREATE-TABLE-AS-with-multi-inserts.patch
Description: Binary data


v15-0003-Optimize-REFRESH-MATERIALIZED-VIEW-with-multi-in.patch
Description: Binary data


Re: Slow GRANT ROLE on PostgreSQL 16 with thousands of ROLEs

2024-03-26 Thread Nathan Bossart
On Tue, Mar 26, 2024 at 03:08:00PM -0400, Tom Lane wrote:
> My one remaining suggestion is that this comment isn't very precise
> about what's happening:
> 
>  * If there is a previously-created Bloom filter, use it to determine
>  * whether the role is missing from the list.  Otherwise, do an ordinary
>  * linear search through the existing role list.
> 
> Maybe more like
> 
>  * If there is a previously-created Bloom filter, use it to try to
>  * determine whether the role is missing from the list.  If it
>  * says yes, that's a hard fact and we can go ahead and add the
>  * role.  If it says no, that's only probabilistic and we'd better
>  * search the list.  Without a filter, we must always do an ordinary
>  * linear search through the existing list.
> 
> LGTM other than that nit.

Committed with that change.  Thanks for the guidance on this one.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: archive modules loose ends

2024-03-26 Thread Nathan Bossart
On Mon, Jan 15, 2024 at 08:50:25AM -0600, Nathan Bossart wrote:
> Thanks for reviewing.  I've marked this as ready-for-committer, and I'm
> hoping to commit it in the near future.

This one probably ought to go into v17, but I wanted to do one last call
for feedback prior to committing.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: speed up a logical replica setup

2024-03-26 Thread Tomas Vondra
On 3/26/24 03:53, Euler Taveira wrote:
> On Mon, Mar 25, 2024, at 1:06 PM, Hayato Kuroda (Fujitsu) wrote:
>> ## Analysis for failure 1
>>
>> The failure is caused by a time lag between when the walreceiver finishes and
>> when pg_is_in_recovery() starts returning false.
>>
>> According to the output [1], it seems that the tool failed at 
>> wait_for_end_recovery()
>> with the message "standby server disconnected from the primary". Also, lines
>> "redo done at..." and "terminating walreceiver process due to administrator 
>> command"
>> meant that walreceiver was requested to shut down by XLogShutdownWalRcv().
>>
>> According to the source, we confirm that walreceiver is shut down in
>> StartupXLOG()->FinishWalRecovery()->XLogShutdownWalRcv(). Also, 
>> SharedRecoveryState
>> is changed to RECOVERY_STATE_DONE (this means pg_is_in_recovery() returns
>> false)
>> at the latter part of StartupXLOG().
>>
>> So, if there is a delay between FinishWalRecovery() and changing the state,
>> the check in wait_for_end_recovery() would fail during that window. Since we
>> allow missing the walreceiver 10 times and it is checked once per second, the
>> failure occurs if the time lag is longer than 10 seconds.
>>
>> I do not have a good way to fix it. One approach is to make
>> NUM_CONN_ATTEMPTS larger,
>> but it's not a fundamental solution.
> 
> I was expecting that slow hosts might have issues in wait_for_end_recovery().
> As you said, there are a lot of steps between FinishWalRecovery() (where the
> walreceiver is shut down -- XLogShutdownWalRcv) and when SharedRecoveryState is
> set to
> RECOVERY_STATE_DONE. If this window takes longer than NUM_CONN_ATTEMPTS *
> WAIT_INTERVAL (10 seconds), it aborts the execution. That's a bad decision
> because it already finished the promotion and it is just doing the final
> preparation for the host to become a primary.
> 
> /*   
>  * If it is still in recovery, make sure the target server is
>  * connected to the primary so it can receive the required WAL to
>  * finish the recovery process. If it is disconnected try
>  * NUM_CONN_ATTEMPTS in a row and bail out if not succeed.
>  */
> res = PQexec(conn,
>  "SELECT 1 FROM pg_catalog.pg_stat_wal_receiver");
> if (PQntuples(res) == 0)
> {
> if (++count > NUM_CONN_ATTEMPTS)
> {
> stop_standby_server(subscriber_dir);
> pg_log_error("standby server disconnected from the primary");
> break;
> }
> }
> else
> count = 0;  /* reset counter if it connects again */
> 
> This code was added to defend against the death/crash of the target server.
> There
> are at least 3 options:
> 
> (1) increase NUM_CONN_ATTEMPTS * WAIT_INTERVAL seconds. We discussed this 
> constant
> and I decided to use 10 seconds because even in some slow hosts, this time
> wasn't reached during my tests. It seems I forgot to test the combination of 
> slow
> host, asserts enabled, and ubsan. I didn't notice that pg_promote() uses 60
> seconds as default wait. Maybe that's a reasonable value. I checked the
> 004_timeline_switch test and the last run took: 39.2s (serinus), 33.1s
> (culicidae), 18.31s (calliphoridae) and 27.52s (olingo).
> 
> (2) check if the primary is not running when walreceiver is not available on
> the target server. Increase the connection attempts iff the primary is not
> running.
> Hence, the described case doesn't cause an increment on the count variable.
> 
> (3) set the recovery_timeout default to != 0 and remove the pg_stat_wal_receiver
> check that protects against the death/crash of the target server. I explained in
> a previous message that a timeout may occur in cases where WAL replay to reach a
> consistent state takes more than recovery-timeout seconds.
> 
> Option (1) is the easiest fix; however, we can have the same issue again if a
> slow host decides to be even slower, and then we have to adjust this value
> again.
> Option (2) interprets the walreceiver's absence as the end of recovery; if the
> primary server is running, that can indicate the target server is on the verge
> of finishing recovery. Option (3) is not as resilient as the other
> options.
> 
> The first patch implements a combination of (1) and (2).
> 
>> ## Analysis for failure 2
>>
>> According to [2], the physical replication slot which is specified as 
>> primary_slot_name
>> was not used by the walsender process. At that time walsender has not 
>> existed.
>>
>> ```
>> ...
>> pg_createsubscriber: publisher: current wal senders: 0
>> pg_createsubscriber: command is: SELECT 1 FROM 
>> pg_catalog.pg_replication_slots WHERE active AND slot_name = 'physical_slot'
>> pg_createsubscriber: error: could not obtain replication slot information: 
>> got 0 rows, expected 1 row
>> ...
>> ```
>>
>> Currently standby must be stopped before the command and current code does 
>> not
>> block the flow to ensure the 

Re: add AVX2 support to simd.h

2024-03-26 Thread Nathan Bossart
I've committed v9, and I've marked the commitfest entry as "Committed,"
although we may want to revisit AVX2, etc. in the future.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Slow GRANT ROLE on PostgreSQL 16 with thousands of ROLEs

2024-03-26 Thread Tom Lane
Nathan Bossart  writes:
> On Tue, Mar 26, 2024 at 02:16:03PM -0400, Tom Lane wrote:
>> ... I'm not sold on your "ROLES_LIST_BLOOM_THRESHOLD * 10"
>> value.  Maybe it doesn't matter though.

> Yeah, I wasn't sure how much to worry about this.  I figured that we might
> as well set it to a reasonable estimate based on the description of the
> parameter.  This description claims that the filter should work well if
> this is off by a factor of 5 or more, and 50x the threshold sounded like it
> ought to be good enough for anyone, so that's how I landed on 10x.  But as
> you point out, this value will be disregarded altogether, and it will
> continue to be ignored unless the filter implementation changes, which
> seems unlikely.  If you have a different value in mind that you would
> rather use, I'm fine with changing it.

No, I have no better idea.  As you say, we should try to provide some
semi-reasonable number in case bloom_create ever starts paying
attention, and this one seems fine.

>> I do not like, even a little bit, your use of a static variable to
>> hold the bloom filter pointer.

> Ah, yes, that's no good.  I fixed this in the new version.

My one remaining suggestion is that this comment isn't very precise
about what's happening:

 * If there is a previously-created Bloom filter, use it to determine
 * whether the role is missing from the list.  Otherwise, do an ordinary
 * linear search through the existing role list.

Maybe more like

 * If there is a previously-created Bloom filter, use it to try to
 * determine whether the role is missing from the list.  If it
 * says yes, that's a hard fact and we can go ahead and add the
 * role.  If it says no, that's only probabilistic and we'd better
 * search the list.  Without a filter, we must always do an ordinary
 * linear search through the existing list.

LGTM other than that nit.

regards, tom lane




Re: Slow GRANT ROLE on PostgreSQL 16 with thousands of ROLEs

2024-03-26 Thread Nathan Bossart
On Tue, Mar 26, 2024 at 02:16:03PM -0400, Tom Lane wrote:
> I did a little experimentation using the attached quick-hack C
> function, and came to the conclusion that setting up the bloom filter
> costs more or less as much as inserting 1000 or so OIDs the dumb way.
> So we definitely want a threshold that's not much less than that.

Thanks for doing this.

> So I'm now content with choosing a threshold of 1000 or 1024 or so.

Cool.

> As for the bloom filter size, I see that bloom_create does
> 
>   bitset_bytes = Min(bloom_work_mem * UINT64CONST(1024), total_elems * 2);
>   bitset_bytes = Max(1024 * 1024, bitset_bytes);
> 
> which means that any total_elems input less than 512K is disregarded
> altogether.  So I'm not sold on your "ROLES_LIST_BLOOM_THRESHOLD * 10"
> value.  Maybe it doesn't matter though.

Yeah, I wasn't sure how much to worry about this.  I figured that we might
as well set it to a reasonable estimate based on the description of the
parameter.  This description claims that the filter should work well if
this is off by a factor of 5 or more, and 50x the threshold sounded like it
ought to be good enough for anyone, so that's how I landed on 10x.  But as
you point out, this value will be disregarded altogether, and it will
continue to be ignored unless the filter implementation changes, which
seems unlikely.  If you have a different value in mind that you would
rather use, I'm fine with changing it.

> I do not like, even a little bit, your use of a static variable to
> hold the bloom filter pointer.  That code will misbehave horribly
> if we throw an error partway through the role-accumulation loop;
> the next call will try to carry on using the old filter, which would
> be wrong even if it still existed which it likely won't.  It's not
> that much worse notationally to keep it as a local variable, as I
> did in the attached.

Ah, yes, that's no good.  I fixed this in the new version.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From e5618bb6f1d8b21d527bfda5b49f75d12bf31e9e Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Mon, 25 Mar 2024 23:17:08 -0500
Subject: [PATCH v4 1/1] Optimize roles_is_member_of() with a Bloom filter.

When the list of roles gathered by roles_is_member_of() grows very
large, a Bloom filter is created to help avoid some linear searches
through the list.  The threshold for creating the Bloom filter is
set arbitrarily high and may require future adjustment.

Suggested-by: Tom Lane
Reviewed-by: Tom Lane
Discussion: https://postgr.es/m/CAGvXd3OSMbJQwOSc-Tq-Ro1CAz%3DvggErdSG7pv2s6vmmTOLJSg%40mail.gmail.com
---
 src/backend/utils/adt/acl.c | 65 +++--
 1 file changed, 62 insertions(+), 3 deletions(-)

diff --git a/src/backend/utils/adt/acl.c b/src/backend/utils/adt/acl.c
index cf5d08576a..39197613c2 100644
--- a/src/backend/utils/adt/acl.c
+++ b/src/backend/utils/adt/acl.c
@@ -36,6 +36,7 @@
 #include "common/hashfn.h"
 #include "foreign/foreign.h"
 #include "funcapi.h"
+#include "lib/bloomfilter.h"
 #include "lib/qunique.h"
 #include "miscadmin.h"
 #include "utils/acl.h"
@@ -78,6 +79,13 @@ static Oid	cached_role[] = {InvalidOid, InvalidOid, InvalidOid};
 static List *cached_roles[] = {NIL, NIL, NIL};
 static uint32 cached_db_hash;
 
+/*
+ * If the list of roles gathered by roles_is_member_of() grows larger than the
+ * below threshold, a Bloom filter is created to speed up list membership
+ * checks.  This threshold is set arbitrarily high to avoid the overhead of
+ * creating the Bloom filter until it seems likely to provide a net benefit.
+ */
+#define ROLES_LIST_BLOOM_THRESHOLD 1024
 
 static const char *getid(const char *s, char *n, Node *escontext);
 static void putid(char *p, const char *s);
@@ -4918,6 +4926,50 @@ RoleMembershipCacheCallback(Datum arg, int cacheid, uint32 hashvalue)
 	cached_role[ROLERECURSE_SETROLE] = InvalidOid;
 }
 
+/*
+ * A helper function for roles_is_member_of() that provides an optimized
+ * implementation of list_append_unique_oid() via a Bloom filter.  The caller
+ * (i.e., roles_is_member_of()) is responsible for freeing bf once it is done
+ * using this function.
+ */
+static inline List *
+roles_list_append(List *roles_list, bloom_filter **bf, Oid role)
+{
+	unsigned char *roleptr = (unsigned char *) &role;
+
+	/*
+	 * If there is a previously-created Bloom filter, use it to determine
+	 * whether the role is missing from the list.  Otherwise, do an ordinary
+	 * linear search through the existing role list.
+	 */
+	if ((*bf && bloom_lacks_element(*bf, roleptr, sizeof(Oid))) ||
+		!list_member_oid(roles_list, role))
+	{
+		/*
+		 * If the list is large, we take on the overhead of creating and
+		 * populating a Bloom filter to speed up future calls to this
+		 * function.
+		 */
+		if (*bf == NULL &&
+			list_length(roles_list) > ROLES_LIST_BLOOM_THRESHOLD)
+		{
+			*bf = bloom_create(ROLES_LIST_BLOOM_THRESHOLD * 10, work_mem, 0);
+			foreach_oid(roleid, roles_list)
+

Re: Slow GRANT ROLE on PostgreSQL 16 with thousands of ROLEs

2024-03-26 Thread Tom Lane
Nathan Bossart  writes:
> I spent some time trying to get some ballpark figures but have thus far
> been unsuccessful.  Even if I was able to get good numbers, I'm not sure
> how much they'd help us, as we'll still need to decide how much overhead we
> are willing to take in comparison to the linear search.  I don't think
> ~1000 is an unreasonable starting point, as it seems generally more likely
> that you will have many more roles to process at that point than if the
> threshold was, say, 100.  And if the threshold is too high (e.g., 10,000),
> this optimization will only kick in for the most extreme cases, so we'd
> likely be leaving a lot on the table.  But, I will be the first to admit
> that my reasoning here is pretty unscientific, and I'm open to suggestions
> for how to make it less so.

I did a little experimentation using the attached quick-hack C
function, and came to the conclusion that setting up the bloom filter
costs more or less as much as inserting 1000 or so OIDs the dumb way.
So we definitely want a threshold that's not much less than that.
For example, with ROLES_LIST_BLOOM_THRESHOLD = 100 I saw:

regression=# select drive_bloom(100, 10, 10);
 drive_bloom 
-
 
(1 row)

Time: 319.931 ms
regression=# select drive_bloom(101, 10, 10);
 drive_bloom 
-
 
(1 row)

Time: 319.385 ms
regression=# select drive_bloom(102, 10, 10);
 drive_bloom 
-
 
(1 row)

Time: 9904.786 ms (00:09.905)

That's a pretty big jump in context.  With the threshold set to 1024,

regression=# select drive_bloom(1024, 10, 10);
 drive_bloom 
-
 
(1 row)

Time: 14597.510 ms (00:14.598)
regression=# select drive_bloom(1025, 10, 10);
 drive_bloom 
-
 
(1 row)

Time: 14589.197 ms (00:14.589)
regression=# select drive_bloom(1026, 10, 10);
 drive_bloom 
-
 
(1 row)

Time: 25947.000 ms (00:25.947)
regression=# select drive_bloom(1027, 10, 10);
 drive_bloom 
-
 
(1 row)

Time: 25399.718 ms (00:25.400)
regression=# select drive_bloom(2048, 10, 10);
 drive_bloom 
-
 
(1 row)

Time: 33809.536 ms (00:33.810)

So I'm now content with choosing a threshold of 1000 or 1024 or so.

As for the bloom filter size, I see that bloom_create does

bitset_bytes = Min(bloom_work_mem * UINT64CONST(1024), total_elems * 2);
bitset_bytes = Max(1024 * 1024, bitset_bytes);

which means that any total_elems input less than 512K is disregarded
altogether.  So I'm not sold on your "ROLES_LIST_BLOOM_THRESHOLD * 10"
value.  Maybe it doesn't matter though.

I do not like, even a little bit, your use of a static variable to
hold the bloom filter pointer.  That code will misbehave horribly
if we throw an error partway through the role-accumulation loop;
the next call will try to carry on using the old filter, which would
be wrong even if it still existed which it likely won't.  It's not
that much worse notationally to keep it as a local variable, as I
did in the attached.

regards, tom lane

/*

create function drive_bloom(num_oids int, dup_freq int, count int) returns void
strict volatile language c as '/path/to/drive_bloom.so';

\timing on

select drive_bloom(100, 0, 10);

 */

#include "postgres.h"

#include "fmgr.h"
#include "lib/bloomfilter.h"
#include "miscadmin.h"
#include "tcop/tcopprot.h"
#include "utils/builtins.h"
#include "utils/memutils.h"

PG_MODULE_MAGIC;

#define ROLES_LIST_BLOOM_THRESHOLD 1024
#define ROLES_LIST_BLOOM_SIZE (1024 * 1024)


static inline List *
roles_list_append(List *roles_list, Oid role, bloom_filter **roles_list_bf)
{
	unsigned char *roleptr = (unsigned char *) &role;

	/*
	 * If there is a previously-created Bloom filter, use it to determine
	 * whether the role is missing from the list.  Otherwise, do an ordinary
	 * linear search through the existing role list.
	 */
	if ((*roles_list_bf &&
		 bloom_lacks_element(*roles_list_bf, roleptr, sizeof(Oid))) ||
		!list_member_oid(roles_list, role))
	{
		/*
		 * If the list is large, we take on the overhead of creating and
		 * populating a Bloom filter to speed up future calls to this
		 * function.
		 */
		if (!*roles_list_bf &&
			list_length(roles_list) > ROLES_LIST_BLOOM_THRESHOLD)
		{
			*roles_list_bf = bloom_create(ROLES_LIST_BLOOM_SIZE,
		 work_mem, 0);
			foreach_oid(roleid, roles_list)
bloom_add_element(*roles_list_bf,
  (unsigned char *) &roleid,
  sizeof(Oid));
		}

		/*
		 * Finally, add the role to the list and the Bloom filter, if it
		 * exists.
		 */
		roles_list = lappend_oid(roles_list, role);
		if (*roles_list_bf)
			bloom_add_element(*roles_list_bf, roleptr, sizeof(Oid));
	}

	return roles_list;
}

/*
 * drive_bloom(num_oids int, dup_freq int, count int) returns void
 *
 * num_oids: number of OIDs to de-duplicate
 * dup_freq: if > 0, every dup_freq'th OID is duplicated
 * count: overall repetition count; choose large enough to get reliable timing
 */

Re: pg_upgrade --copy-file-range

2024-03-26 Thread Tomas Vondra
On 3/25/24 15:31, Robert Haas wrote:
> On Sat, Mar 23, 2024 at 9:37 AM Tomas Vondra
>  wrote:
>> OK, that makes sense. Here's a patch that should work like this - in
>> copy_file we check if we need to calculate checksums, and either use the
>> requested copy method, or fall back to the block-by-block copy.
> 
> +Use efficient file cloning (also known as reflinks on
> +some systems) instead of copying files to the new cluster.  This can
> 
> new cluster -> output directory
> 

Ooops, forgot to fix this. Will do in next version.

> I think your version kind of messes up the debug logging. In my
> version, every call to copy_file() would emit either "would copy
> \"%s\" to \"%s\" using strategy %s" and "copying \"%s\" to \"%s\"
> using strategy %s". In your version, the dry_run mode emits a string
> similar to the former, but creates separate translatable strings for
> each copy method instead of using the same one with a different value
> of %s. In non-dry-run mode, I think your version loses the debug
> logging altogether.
> 

Yeah. Sorry for not being careful enough about that, I was focusing on
the actual copy logic and forgot about this.

The patch I shared a couple of minutes ago should fix this, effectively
restoring the original debug behavior. I liked the approach of calling
strategy_implementation a bit more; I wonder if it'd be better to go
back to that for the "accelerated" copy methods, somehow.
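
For the record, the debug output I'm aiming at is a single translatable
message with the strategy as a parameter, roughly like this (a sketch,
the variable names are made up):

    /* strategy_name is "clone", "copy_file_range" or "copy" */
    if (dry_run)
        pg_log_debug("would copy \"%s\" to \"%s\" using strategy %s",
                     src, dst, strategy_name);
    else
    {
        pg_log_debug("copying \"%s\" to \"%s\" using strategy %s",
                     src, dst, strategy_name);
        /* ... perform the copy using the selected method ... */
    }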

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: pg_combinebackup --copy-file-range

2024-03-26 Thread Tomas Vondra


On 3/26/24 15:09, Jakub Wartak wrote:
> On Sat, Mar 23, 2024 at 6:57 PM Tomas Vondra
>  wrote:
> 
>> On 3/23/24 14:47, Tomas Vondra wrote:
>>> On 3/23/24 13:38, Robert Haas wrote:
 On Fri, Mar 22, 2024 at 8:26 PM Thomas Munro  
 wrote:
> [..]
>>> Yeah, that's in write_reconstructed_file() and the patch does not touch
>>> that at all. I agree it would be nice to use copy_file_range() in this
>>> part too, and it doesn't seem it'd be that hard to do, I think.
>>>
>>> It seems we'd just need a "fork" that either calls pread/pwrite or
>>> copy_file_range, depending on checksums and what was requested.
>>>
>>
>> Here's a patch to use copy_file_range in write_reconstructed_file too,
>> when requested/possible. One thing that I'm not sure about is whether to
>> do pg_fatal() if --copy-file-range but the platform does not support it.
> [..]
> 
> Hi Tomas, so I gave a go to the below patches today:
> - v20240323-2-0001-pg_combinebackup-allow-using-clone-copy_.patch
> - v20240323-2-0002-write_reconstructed_file.patch
> 
> My assessment:
> 
> v20240323-2-0001-pg_combinebackup-allow-using-clone-copy_.patch -
> looks like more or less good to go

There's some issues with the --dry-run, pointed out by Robert. Should be
fixed in the attached version.

> v20240323-2-0002-write_reconstructed_file.patch - needs work and
> without that clone/copy_file_range() good effects are unlikely
> 
> Given Debian 12, ~100GB DB, (pgbench -i -s 7000 , and some additional
> tables with GiST and GIN indexes , just to see more WAL record types)
> and with backups sizes in MB like that:
> 
> 106831  full
> 2823incr.1 # captured after some time with pgbench -R 100
> 165 incr.2 # captured after some time with pgbench -R 100
> 
> Test cmd: rm -rf outtest; sync; sync ; sync; echo 3 | sudo tee
> /proc/sys/vm/drop_caches ; time /usr/pgsql17/bin/pg_combinebackup -o
> outtest full incr.1 incr.2
> 
> Test results of various copies on small I/O constrained XFS device:
> normal copy: 31m47.407s
> --clone copy: error: file cloning not supported on this platform (it's
> due to the #ifdef requiring COPY_FILE_RANGE to be available)
> --copy-file-range: aborted, as it was taking too long , I was
> expecting it to accelerate, but it did not... obviously this is the
> transparent failover in case of calculating checksums...
> --manifest-checksums=NONE --copy-file-range: BUG, it keeps on appending
> to just one file e.g. outtest/base/5/16427.29 with 200GB+ ?? and ended
> up with ENOSPC [more on this later]

That's really strange.

> --manifest-checksums=NONE --copy-file-range without v20240323-2-0002: 
> 27m23.887s
> --manifest-checksums=NONE --copy-file-range with v20240323-2-0002 and
> loop-fix: 5m1.986s but it creates corruption as it stands
> 

Thanks. I plan to do more similar tests, once my machines get done with
some other stuff.

> Issues:
> 
> 1. https://cirrus-ci.com/task/5937513653600256?logs=mingw_cross_warning#L327
> complains about win32/mingw:
> 
> [15:47:27.184] In file included from copy_file.c:22:
> [15:47:27.184] copy_file.c: In function ‘copy_file’:
> [15:47:27.184] ../../../src/include/common/logging.h:134:6: error:
> this statement may fall through [-Werror=implicit-fallthrough=]
> [15:47:27.184]   134 |   if (unlikely(__pg_log_level <= PG_LOG_DEBUG)) \
> [15:47:27.184]   |  ^
> [15:47:27.184] copy_file.c:96:5: note: in expansion of macro ‘pg_log_debug’
> [15:47:27.184]96 | pg_log_debug("would copy \"%s\" to \"%s\"
> (copy_file_range)",
> [15:47:27.184]   | ^~~~
> [15:47:27.184] copy_file.c:99:4: note: here
> [15:47:27.184]99 |case COPY_MODE_COPYFILE:
> [15:47:27.184]   |^~~~
> [15:47:27.184] cc1: all warnings being treated as errors
> 

Yup, missing break.
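
I.e. roughly this (enum/variable names approximate):

    case COPY_MODE_COPY_FILE_RANGE:
        pg_log_debug("would copy \"%s\" to \"%s\" (copy_file_range)",
                     src, dst);
        break;          /* <- this was missing */
    case COPY_MODE_COPYFILE:
        ...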

> 2. I do not know what the consensus is between --clone and
> --copy-file-range, but if we have #ifdef FICLONE clone_works() #elif
> HAVE_COPY_FILE_RANGE copy_file_range_only_works() then we should also
> apply the same logic to --help so that --clone is not visible
> there (for consistency?). Also, the "error: file cloning not supported
> on this platform" message is technically incorrect: Linux does support
> ioctl(FICLONE) and copy_file_range(), but we are just not choosing one
> over the other (so technically it is "available"). Nitpicking, I know.
> 

That's a good question, I'm not sure. But whatever we do, we should do
the same thing in pg_upgrade. Maybe there's some sort of precedent?

> 3. [v20240323-2-0002-write_reconstructed_file.patch]: The mentioned
> ENOSPACE spiral-of-death-bug symptoms are like that:
> 
> strace:
> copy_file_range(8, [697671680], 9, NULL, 8192, 0) = 8192
> copy_file_range(8, [697679872], 9, NULL, 8192, 0) = 8192
> copy_file_range(8, [697688064], 9, NULL, 8192, 0) = 8192
> copy_file_range(8, [697696256], 9, NULL, 8192, 0) = 8192
> copy_file_range(8, [697704448], 9, NULL, 8192, 0) = 8192
> copy_file_range(8, [697712640], 9, NULL, 8192, 0) = 8192
> copy_file_range(8, [697720832], 9, NULL, 8192, 0) = 8192
> 

Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-26 Thread Bertrand Drouvot
Hi,

On Tue, Mar 26, 2024 at 09:59:23PM +0530, Bharath Rupireddy wrote:
> On Tue, Mar 26, 2024 at 4:35 PM Bharath Rupireddy
>  wrote:
> >
> > If we just sync inactive_since value for synced slots while in
> > recovery from the primary, so be it. Why do we need to update it to
> > the current time when the slot is being created? We don't expose slot
> > creation time, no? Aren't we fine if we just sync the value from
> > primary and document that fact? After the promotion, we can reset it
> > to the current time so that it gets its own time.
> 
> I'm attaching v24 patches. It implements the above idea proposed
> upthread for synced slots. I've now separated
> s/last_inactive_time/inactive_since and synced slots behaviour. Please
> have a look.

Thanks!

 v24-0001

It's now pure mechanical changes and it looks good to me.

 v24-0002

1 ===

This commit does two things:
1) Updates inactive_since for sync slots with the value
received from the primary's slot.

Tested it and it does that.

2 ===

2) Ensures the value is set to current timestamp during the
shutdown of slot sync machinery to help correctly interpret the
time if the standby gets promoted without a restart.

Tested it and it does that.

3 ===

+/*
+ * Reset the synced slots info such as inactive_since after shutting
+ * down the slot sync machinery.
+ */
+static void
+update_synced_slots_inactive_time(void)

Looks like the word "reset" in the comment does not match the name of the
function and what it does.

4 ===

+   /*
+* We get the current time beforehand and only once to avoid
+* system calls overhead while holding the lock.
+*/
+   if (now == 0)
+   now = GetCurrentTimestamp();

Also, +1 for having GetCurrentTimestamp() called just once within the loop.

5 ===

-   if (!(RecoveryInProgress() && slot->data.synced))
+   if (!(InRecovery && slot->data.synced))
slot->inactive_since = GetCurrentTimestamp();
else
slot->inactive_since = 0;

Not related to this change but more the way RestoreSlotFromDisk() behaves here:

For a sync slot on the standby it will be set to zero and then later will be
synchronized with the one coming from the primary. I think it's fine to have
it at zero for this window of time.

Now, if the standby is down and one sets sync_replication_slots to off,
then inactive_since will be set to zero on the standby at startup and not
synchronized (unless one triggers a manual sync). I also think that's fine, but
it might be worth documenting this behavior (that after a standby startup
inactive_since is zero until the next sync...).

6 ===

+   print "HI  $slot_name $name $inactive_since $slot_creation_time\n";

garbage?

7 ===

+# Capture and validate inactive_since of a given slot.
+sub capture_and_validate_slot_inactive_since
+{
+   my ($node, $slot_name, $slot_creation_time) = @_;
+   my $name = $node->name;

We now have capture_and_validate_slot_inactive_since in 2 places:
040_standby_failover_slots_sync.pl and 019_replslot_limit.pl.

Worth creating a sub in Cluster.pm?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: UUID v7

2024-03-26 Thread Andrey M. Borodin
Sorry for this long reply. I was looking at refactoring around
pg_strong_random() and could not decide what to do. Finally, I decided to post
at least something.

> On 22 Mar 2024, at 19:15, Peter Eisentraut  wrote:
> 
> I have been studying the uuidv() function.
> 
> I find this code extremely hard to follow.
> 
> We don't need to copy all that documentation from the RFC 4122bis document.  
> People can read that themselves.  What I would like to see is easy to find 
> information what from there we are implementing.  Like,
> 
> - UUID version 7
> - fixed-length dedicated counter
> - counter is 18 bits
> - 4 bits are initialized as zero

I've removed table taken from RFC.

> That's more or less all I would need to know what is going on.
> 
> That said, I don't understand why you say it's an 18 bit counter, when you 
> overwrite 6 bits with variant and version.  Then it's just a 12 bit counter?  
> Which is the size of the rand_a field, so that kind of makes sense.  But 12 
> bits is the recommended minimum, and (in this patch) we don't use 
> sub-millisecond timestamp precision, so we should probably use more than the 
> minimum?


No, we use 4 bits in data[6], 8 bits in data[7], and 6 bits in data[8]. That's
18 bits total. Essentially, we use both partial bytes and the whole byte in between.
There was a bug - we used 1 extra byte of random numbers that was not
necessary; I think that's what led you to think that we use a 12-bit counter.
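
To make the layout concrete, this is how an 18-bit counter value maps
onto those bytes (illustration only; the version and variant bits are
set separately):

    uint32      c = sequence_counter;   /* 18 significant bits */

    uuid->data[6] = (uuid->data[6] & 0xf0) | ((c >> 14) & 0x0f);   /* 4 bits */
    uuid->data[7] = (c >> 6) & 0xff;                               /* 8 bits */
    uuid->data[8] = (uuid->data[8] & 0xc0) | (c & 0x3f);           /* 6 bits */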

> Also, you are initializing 4 bits (I think?) to zero to guard against counter 
> rollovers (so it's really just an 8 bit counter?).  But nothing checks 
> against such rollovers, so I don't understand the use of that.

No, there's only one guard rollover bit.
Here: uuid->data[6] = (uuid->data[6] & 0xf7);
Bits that are called "guard bits" do not guard anything, they just ensure 
counter capacity when it is initialized.
Rollover is carried into the time tick here:
++sequence_counter;
if (sequence_counter > 0x3FFFF)
{
/* We only have 18-bit counter */
sequence_counter = 0;
previous_timestamp++;
}

I think we might use 10 bits of microseconds and have 8 bits of counter.
The effect of the counter won't change much. But I'm not sure if this is allowed
per the RFC.
If the time source is coarse-grained, it still acts like a random initializer. And
when it is precise, time is a "natural" source of entropy.


> The code could be organized better.  In the not-increment_counter case, you 
> could use two separate pg_strong_random calls: One to initialize rand_b, 
> starting at &uuid->data[8], and one to initialize the counter. Then the 
> former could be shared between the two branches, and the code to assign the 
> sequence_counter to the uuid fields could also be shared.

A call to pg_strong_random() is very expensive in builds without ssl (and even
with ssl too). If we could amortize random numbers in small buffers, that
would save a lot of time (see v8-0002-Buffer-random-numbers.patch upthread).
Or, perhaps, we can ignore the cost of two pg_strong_random() calls.

> 
> I would also prefer if the normal case (not-increment_counter) were the first 
> if branch.

Done.

> Some other notes on your patch:
> 
> - Your rebase duplicated the documentation of uuid_extract_timestamp and 
> uuid_extract_version.
> 
> - PostgreSQL code uses uint64 etc. instead of uint64_t etc.
> 
> - It seems the added includes
> 
> #include "access/xlog.h"
> #include "utils/builtins.h"
> #include "utils/datetime.h"
> 
> are not needed.
> 
> - The static variables sequence_counter and previous_timestamp could be kept 
> inside the uuidv7() function.

Fixed.

Thanks!


Best regards, Andrey Borodin.


v24-0001-Implement-UUID-v7.patch
Description: Binary data


Re: Possibility to disable `ALTER SYSTEM`

2024-03-26 Thread Tom Lane
Bruce Momjian  writes:
> I am thinking "enable_alter_system_command" is probably good because we
> already use "enable" so why not reuse that idea, and I think "command"
> is needed because we need to clarify we are talking about the command,
> and not generic altering of the system.  We could use
> "enable_sql_alter_system" if people want something shorter.

Robert already mentioned why not use "enable_": up to now that prefix
has only been applied to planner plan-type-enabling GUCs.  I'd be okay
with "allow_alter_system_command", although I find it unnecessarily
verbose.

> Will people think this allows non-root users to use ALTER SYSTEM if
> enabled?

They'll soon find out differently, so I'm not concerned about that.

regards, tom lane




Re: Propagate pathkeys from CTEs up to the outer query

2024-03-26 Thread Tom Lane
Richard Guo  writes:
> I agree with your points.  Previously I was thinking that CTEs were the
> only scenario where we needed to remember the best path and only
> required the best path's pathkeys.  However, considering potential
> future use cases as you mentioned, I concur that having a per-subplan
> list of paths would be more future-proof.  Please see attached v4 patch.

Hm, well, you didn't actually fill in the paths for the other
subqueries.  I agree that it's not worth doing so in
SS_make_initplan_from_plan, but a comment explaining that decision
seems in order.  Also, there's nothing stopping us from saving the
path for subplans made in build_subplan, except adding a parameter
to pass them down.  So I did that, and made a couple other cosmetic
changes, and pushed it.

One thing I noticed while testing is that while your regression-test
query successfully propagates the pathkeys:

regression=# explain with x as materialized (select unique1 from tenk1 b order 
by unique1)
select count(*) from tenk1 a
  where unique1 in (select * from x);
 QUERY PLAN 


 Aggregate  (cost=915.55..915.56 rows=1 width=8)
   CTE x
 ->  Index Only Scan using tenk1_unique1 on tenk1 b  (cost=0.29..270.29 
rows=1 width=4)
   ->  Merge Semi Join  (cost=0.31..620.26 rows=1 width=0)
 Merge Cond: (a.unique1 = x.unique1)
 ->  Index Only Scan using tenk1_unique1 on tenk1 a  (cost=0.29..270.29 
rows=1 width=4)
 ->  CTE Scan on x  (cost=0.00..200.00 rows=1 width=4)
(7 rows)

this variant doesn't:

regression=# explain with x as materialized (select unique1 from tenk1 b)
select count(*) from tenk1 a
  where unique1 in (select * from x);
 QUERY PLAN 


 Aggregate  (cost=1028.07..1028.08 rows=1 width=8)
   CTE x
 ->  Index Only Scan using tenk1_unique1 on tenk1 b  (cost=0.29..270.29 
rows=1 width=4)
   ->  Hash Semi Join  (cost=325.28..732.78 rows=1 width=0)
 Hash Cond: (a.unique1 = x.unique1)
 ->  Index Only Scan using tenk1_unique1 on tenk1 a  (cost=0.29..270.29 
rows=1 width=4)
 ->  Hash  (cost=200.00..200.00 rows=1 width=4)
   ->  CTE Scan on x  (cost=0.00..200.00 rows=1 width=4)
(8 rows)

That's not the fault of anything we did here; the IndexOnlyScan path
in the subquery is in fact not marked with any pathkeys, even though
clearly its result is sorted.  I believe that's an intentional
decision from way way back, that pathkeys only correspond to orderings
that are of interest in the current query level.  "select unique1 from
tenk1 b order by unique1" has an interest in ordering by unique1,
but "select unique1 from tenk1 b" does not, so it's choosing that
path strictly according to cost.  Not generating pathkeys in such a
query saves a few cycles and ensures that we won't improperly prefer
a path on the basis of pathkeys if it hasn't got a cost advantage.
So I'm quite hesitant to muck with that old decision, especially in
the waning days of a development cycle, but the results do feel a
little strange here.

regards, tom lane




Re: Properly pathify the union planner

2024-03-26 Thread Alexander Lakhin

Hello David,

25.03.2024 04:43, David Rowley wrote:

I didn't see that as a reason not to push this patch as this occurs
both with and without this change, so I've now pushed this patch.


Please look at a new assertion failure that is triggered by the following
query:
SELECT count(*) FROM (
  WITH q1(x) AS (SELECT 1)
  SELECT FROM q1 UNION SELECT FROM q1
) qu;

TRAP: failed Assert("lg != NULL"), File: "planner.c", Line: 7941, PID: 1133017

Best regards,
Alexander




Re: Slow GRANT ROLE on PostgreSQL 16 with thousands of ROLEs

2024-03-26 Thread Nathan Bossart
Here is a new version of the patch that I feel is in decent shape.

On Mon, Mar 25, 2024 at 10:16:47AM -0500, Nathan Bossart wrote:
> On Mon, Mar 25, 2024 at 11:08:39AM -0400, Tom Lane wrote:
>> * The magic constants (crossover list length and bloom filter size)
>> need some testing to see if there are better values.  They should
>> probably be made into named #defines, too.  I suspect, with little
>> proof, that the bloom filter size isn't particularly critical --- but
>> I know we pulled the crossover of 1000 out of thin air, and I have
>> no certainty that it's even within an order of magnitude of being a
>> good choice.
> 
> I'll try to construct a couple of tests to see if we can determine a proper
> order of magnitude.

I spent some time trying to get some ballpark figures but have thus far
been unsuccessful.  Even if I was able to get good numbers, I'm not sure
how much they'd help us, as we'll still need to decide how much overhead we
are willing to take in comparison to the linear search.  I don't think
~1000 is an unreasonable starting point, as it seems generally more likely
that you will have many more roles to process at that point than if the
threshold was, say, 100.  And if the threshold is too high (e.g., 10,000),
this optimization will only kick in for the most extreme cases, so we'd
likely be leaving a lot on the table.  But, I will be the first to admit
that my reasoning here is pretty unscientific, and I'm open to suggestions
for how to make it less so.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 95fe6e811990895acb9f79544f502408ac472203 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Mon, 25 Mar 2024 23:17:08 -0500
Subject: [PATCH v3 1/1] Optimize roles_is_member_of() with a Bloom filter.

When the list of roles gathered by roles_is_member_of() grows very
large, a Bloom filter is created to help avoid some linear searches
through the list.  The threshold for creating the Bloom filter is
set arbitrarily high and may require future adjustment.

Suggested-by: Tom Lane
Reviewed-by: Tom Lane
Discussion: https://postgr.es/m/CAGvXd3OSMbJQwOSc-Tq-Ro1CAz%3DvggErdSG7pv2s6vmmTOLJSg%40mail.gmail.com
---
 src/backend/utils/adt/acl.c | 71 +++--
 1 file changed, 68 insertions(+), 3 deletions(-)

diff --git a/src/backend/utils/adt/acl.c b/src/backend/utils/adt/acl.c
index cf5d08576a..7c4b642ebd 100644
--- a/src/backend/utils/adt/acl.c
+++ b/src/backend/utils/adt/acl.c
@@ -36,6 +36,7 @@
 #include "common/hashfn.h"
 #include "foreign/foreign.h"
 #include "funcapi.h"
+#include "lib/bloomfilter.h"
 #include "lib/qunique.h"
 #include "miscadmin.h"
 #include "utils/acl.h"
@@ -78,6 +79,17 @@ static Oid	cached_role[] = {InvalidOid, InvalidOid, InvalidOid};
 static List *cached_roles[] = {NIL, NIL, NIL};
 static uint32 cached_db_hash;
 
+/*
+ * If the list of roles gathered by roles_is_member_of() grows larger than the
+ * below threshold, a Bloom filter is created to speed up list membership
+ * checks.  This threshold is set arbitrarily high to avoid the overhead of
+ * creating the Bloom filter until it seems likely to provide a net benefit.
+ *
+ * XXX: The current threshold of 1024 is little more than a wild guess and may
+ * need to be adjusted in the future.
+ */
+#define ROLES_LIST_BLOOM_THRESHOLD 1024
+static bloom_filter *roles_list_bf = NULL;
 
 static const char *getid(const char *s, char *n, Node *escontext);
 static void putid(char *p, const char *s);
@@ -4918,6 +4930,54 @@ RoleMembershipCacheCallback(Datum arg, int cacheid, uint32 hashvalue)
 	cached_role[ROLERECURSE_SETROLE] = InvalidOid;
 }
 
+/*
+ * A helper function for roles_is_member_of() that provides an optimized
+ * implementation of list_append_unique_oid() via a Bloom filter.  The caller
+ * (i.e., roles_is_member_of()) is responsible for freeing roles_list_bf once
+ * it is done using this function.
+ */
+static inline List *
+roles_list_append(List *roles_list, Oid role)
+{
+	unsigned char *roleptr = (unsigned char *) &role;
+
+	/*
+	 * If there is a previously-created Bloom filter, use it to determine
+	 * whether the role is missing from the list.  Otherwise, do an ordinary
+	 * linear search through the existing role list.
+	 */
+	if ((roles_list_bf &&
+		 bloom_lacks_element(roles_list_bf, roleptr, sizeof(Oid))) ||
+		!list_member_oid(roles_list, role))
+	{
+		/*
+		 * If the list is large, we take on the overhead of creating and
+		 * populating a Bloom filter to speed up future calls to this
+		 * function.
+		 */
+		if (!roles_list_bf &&
+			list_length(roles_list) > ROLES_LIST_BLOOM_THRESHOLD)
+		{
+			roles_list_bf = bloom_create(ROLES_LIST_BLOOM_THRESHOLD * 10,
+		 work_mem, 0);
+			foreach_oid(roleid, roles_list)
+bloom_add_element(roles_list_bf,
+  (unsigned char *) &roleid,
+  sizeof(Oid));
+		}
+
+		/*
+		 * Finally, add the role to the list and the Bloom filter, if it
+		 * exists.
+		 */
+		roles_list = lappend_oid(roles_list, role);
+		if 

Re: Possibility to disable `ALTER SYSTEM`

2024-03-26 Thread Bruce Momjian
On Tue, Mar 26, 2024 at 10:23:51AM -0400, Tom Lane wrote:
> Robert Haas  writes:
> > On Mon, Mar 25, 2024 at 5:04 PM Bruce Momjian  wrote:
> >> To me, externally_managed_configuration is promising a lot more than it
> >> delivers because there is still a lot of ocnfiguration it doesn't
> >> control.  I am also confused why the purpose of the feature, external
> >> management of configuation, is part of the variable name.  We usually
> >> name parameters for what they control.
> 
> > I actually agree with this. I wasn't going to quibble with it because
> > other people seemed to like it. But I think something like
> > allow_alter_system would be better, as it would describe the exact
> > thing that the parameter does, rather than how we think the parameter
> > ought to be used.
> 
> +1.  The overpromise-and-underdeliver aspect of the currently proposed
> name is a lot of the reason I've been unhappy and kept pushing for it
> to lock things down more.  "allow_alter_system" is a lot more
> straightforward about exactly what it does, and if that is all we want
> it to do, then a name like that is good.

I am thinking "enable_alter_system_command" is probably good because we
already use "enable" so why not reuse that idea, and I think "command"
is needed because we need to clarify we are talking about the command,
and not generic altering of the system.  We could use
"enable_sql_alter_system" if people want something shorter.

Will people think this allows non-root users to use ALTER SYSTEM if
enabled?

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: pgsql: Allow using syncfs() in frontend utilities.

2024-03-26 Thread Nathan Bossart
On Tue, Mar 26, 2024 at 10:11:31AM -0500, Nathan Bossart wrote:
> On Tue, Mar 26, 2024 at 11:18:57AM +0100, Peter Eisentraut wrote:
>> On 22.03.24 17:52, Robert Haas wrote:
>>> I'd like to complain about this commit's addition of a new appendix.
>> 
>> I already complained about that at 
>> 
>> and some follow-up was announced but didn't happen.  It was on my list to
>> look into cleaning up during beta.
> 
> Sorry about this, I lost track of it.

Here's a first attempt at a patch based on Robert's suggestion from
upthread.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From c51703257821079016456b152e4a9f41169dad96 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Tue, 26 Mar 2024 11:26:56 -0500
Subject: [PATCH v1 1/1] Adjust documentation for syncfs().

Commit 8c16ad3b43 created a new appendix for syncfs(), which is
excessive for the small amount of content it contains.  This commit
moves the description of the caveats to be aware of when using
syncfs() back to the documentation for recovery_init_sync_method.
The documentation for the other utilities with syncfs() support now
directs readers to recovery_init_sync_method for information about
these caveats.

Reported-by: Peter Eisentraut
Suggested-by: Robert Haas
Discussion: https://postgr.es/m/42804669-7063-1320-ed37-3226d5f1067d%40eisentraut.org
Discussion: https://postgr.es/m/CA%2BTgmobUiqKr%2BZMCLc5Qap-sXBnjfGUU%2BZBmzYEjUuWyjsGr1g%40mail.gmail.com
---
 doc/src/sgml/config.sgml   | 12 ++---
 doc/src/sgml/filelist.sgml |  1 -
 doc/src/sgml/postgres.sgml |  1 -
 doc/src/sgml/ref/initdb.sgml   |  4 +--
 doc/src/sgml/ref/pg_basebackup.sgml|  4 +--
 doc/src/sgml/ref/pg_checksums.sgml |  4 +--
 doc/src/sgml/ref/pg_combinebackup.sgml |  4 +--
 doc/src/sgml/ref/pg_dump.sgml  |  5 ++--
 doc/src/sgml/ref/pg_rewind.sgml|  4 +--
 doc/src/sgml/ref/pgupgrade.sgml|  4 +--
 doc/src/sgml/syncfs.sgml   | 36 --
 11 files changed, 24 insertions(+), 55 deletions(-)
 delete mode 100644 doc/src/sgml/syncfs.sgml

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 65a6e6c408..5468637e2e 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -10870,9 +10870,15 @@ dynamic_library_path = 'C:\tools\postgresql;H:\my_project\lib;$libdir'
 On Linux, syncfs may be used instead, to ask the
 operating system to synchronize the file systems that contain the
 data directory, the WAL files and each tablespace (but not any other
-file systems that may be reachable through symbolic links).  See
- for more information about using
-syncfs().
+file systems that may be reachable through symbolic links).  This may
+be a lot faster than the fsync setting, because it
+doesn't need to open each file one by one.  On the other hand, it may
+be slower if a file system is shared by other applications that
+modify a lot of files, since those files will also be written to disk.
+Furthermore, on versions of Linux before 5.8, I/O errors encountered
+while writing data to disk may not be reported to
+PostgreSQL, and relevant error messages may
+appear only in kernel logs.


 This parameter can only be set in the
diff --git a/doc/src/sgml/filelist.sgml b/doc/src/sgml/filelist.sgml
index e0dca81cb2..c92a16c7ac 100644
--- a/doc/src/sgml/filelist.sgml
+++ b/doc/src/sgml/filelist.sgml
@@ -182,7 +182,6 @@
 
 
 
-
 
 
 
diff --git a/doc/src/sgml/postgres.sgml b/doc/src/sgml/postgres.sgml
index 2c107199d3..381af69be2 100644
--- a/doc/src/sgml/postgres.sgml
+++ b/doc/src/sgml/postgres.sgml
@@ -289,7 +289,6 @@ break is not needed in a wider output rendering.
   
   
   
-  
   
 
  
diff --git a/doc/src/sgml/ref/initdb.sgml b/doc/src/sgml/ref/initdb.sgml
index 377c3cb20a..dc9011b40e 100644
--- a/doc/src/sgml/ref/initdb.sgml
+++ b/doc/src/sgml/ref/initdb.sgml
@@ -394,8 +394,8 @@ PostgreSQL documentation
 On Linux, syncfs may be used instead to ask the
 operating system to synchronize the whole file systems that contain the
 data directory, the WAL files, and each tablespace.  See
- for more information about using
-syncfs().
+ for information about
+the caveats to be aware of when using syncfs.


 This option has no effect when --no-sync is used.
diff --git a/doc/src/sgml/ref/pg_basebackup.sgml b/doc/src/sgml/ref/pg_basebackup.sgml
index 208530f393..82d0c8e008 100644
--- a/doc/src/sgml/ref/pg_basebackup.sgml
+++ b/doc/src/sgml/ref/pg_basebackup.sgml
@@ -642,8 +642,8 @@ PostgreSQL documentation
 backup directory.  When the plain format is used,
 pg_basebackup will also synchronize the file systems
 that contain the 

Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-26 Thread Bharath Rupireddy
On Tue, Mar 26, 2024 at 4:35 PM Bharath Rupireddy
 wrote:
>
> If we just sync inactive_since value for synced slots while in
> recovery from the primary, so be it. Why do we need to update it to
> the current time when the slot is being created? We don't expose slot
> creation time, no? Aren't we fine if we just sync the value from
> primary and document that fact? After the promotion, we can reset it
> to the current time so that it gets its own time.

I'm attaching v24 patches. It implements the above idea proposed
upthread for synced slots. I've now separated
s/last_inactive_time/inactive_since and synced slots behaviour. Please
have a look.

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


v24-0001-Use-less-confusing-name-for-slot-s-last_inactive.patch
Description: Binary data


v24-0002-Maintain-inactive_since-for-synced-slots-correct.patch
Description: Binary data


v24-0003-Allow-setting-inactive_timeout-for-replication-s.patch
Description: Binary data


v24-0004-Add-inactive_timeout-based-replication-slot-inva.patch
Description: Binary data


Re: pgsql: Track last_inactive_time in pg_replication_slots.

2024-03-26 Thread Bertrand Drouvot
Hi,

On Tue, Mar 26, 2024 at 04:39:55PM +0100, Alvaro Herrera wrote:
> On 2024-Mar-26, Nathan Bossart wrote:
> > I don't object to a
> > time-based setting as well, but that won't always work as well for this
> > particular use-case, especially if we are relying on users to set a
> > slot-level parameter.
> 
> I think slot-level parameters are mostly useless, because it takes just
> one slot where you forget to set it for disaster to strike.

I think that's a fair point. So maybe we should focus on having a GUC first and
later rethink having (or not) a slot-based one (in addition to the GUC).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: pgsql: Track last_inactive_time in pg_replication_slots.

2024-03-26 Thread Alvaro Herrera
On 2024-Mar-26, Nathan Bossart wrote:

> FWIW I'd really prefer to have something like max_slot_xid_age for this.  A
> time-based parameter would likely help with most cases, but transaction ID
> usage will vary widely from server to server, so it'd be nice to have
> something to protect against wraparound more directly.

Yeah, I tend to agree that an XID-based limit makes more sense than a
time-based one.

> I don't object to a
> time-based setting as well, but that won't always work as well for this
> particular use-case, especially if we are relying on users to set a
> slot-level parameter.

I think slot-level parameters are mostly useless, because it takes just
one slot where you forget to set it for disaster to strike.

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/




Re: New Table Access Methods for Multi and Single Inserts

2024-03-26 Thread Jeff Davis
On Tue, 2024-03-26 at 01:28 +0530, Bharath Rupireddy wrote:
> I'm thinking
> of dropping the COPY FROM patch using the new multi insert API for
> the
> following reasons: ...

I agree with all of this. We do want COPY ... FROM support, but there
are some details to work out and we don't want to make a big code
change at this point in the cycle.

> The new APIs are more extensible, memory management is taken care of
> by AM, and with TableModifyState as the structure name and more
> meaningful API names. The callback for triggers/indexes etc. aren't
> taken care of as I'm now only focusing on CTAS, CMV, RMV
> optimizations.
> 
> Please see the attached v14 patches.

* No need for a 'kind' field in TableModifyState. The state should be
aware of the kinds of changes that it has received and that may need to
be flushed later -- for now, only inserts, but possibly updates/deletes
in the future.

* If the AM doesn't support the bulk methods, fall back to retail
inserts instead of throwing an error (a rough sketch of this is below, after
these notes).

* It seems like this API will eventually replace table_multi_insert and
table_finish_bulk_insert completely. Do those APIs have any advantage
remaining over the new one proposed here?

* Right now I don't see any important use of the flush method. It seems
that could be accomplished in the finish method, and flush could just
be an internal detail when the memory is exhausted. If we find a use
for it later, we can always add it, but right now it seems unnecessary.

* We need to be careful about cases where the command can be successful
but the writes are not flushed. I don't think that's a problem with the
current patch, but we will need to do something here when we expand to
INSERT INTO ... SELECT.
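To make the retail-insert fallback above concrete, here is a minimal sketch;
TableModifyState's fields and the callback name are assumptions loosely based
on this description, not the actual v14 code:

/* assumes access/tableam.h and access/xact.h */
static void
table_modify_buffer_insert(TableModifyState *state, TupleTableSlot *slot)
{
	const TableAmRoutine *am = state->rel->rd_tableam;

	if (am->tuple_modify_buffer_insert != NULL)
	{
		/* AM provides a bulk path; let it buffer the tuple. */
		am->tuple_modify_buffer_insert(state, slot);
	}
	else
	{
		/* No bulk support: fall back to a plain retail insert, no error. */
		table_tuple_insert(state->rel, slot,
						   GetCurrentCommandId(true), 0, NULL);
	}
}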

Andres, is this patch overall closer to what you had in mind in the
email here:

https://www.postgresql.org/message-id/20230603223824.o7iyochli2dww...@alap3.anarazel.de

?

Regards,
Jeff Davis





Re: Can't find not null constraint, but \d+ shows that

2024-03-26 Thread Alvaro Herrera
On 2024-Mar-26, Tender Wang wrote:

> postgres=# CREATE TABLE t1(c0 int, c1 int);
> postgres=# ALTER TABLE  t1 ADD CONSTRAINT Q PRIMARY KEY(c0, c1);
> postgres=# ALTER TABLE  t1 DROP c1;
> 
> postgres=# ALTER TABLE  t1  ALTER c0 DROP NOT NULL;
> ERROR:  could not find not-null constraint on column "c0", relation "t1"

Ooh, hah, what happens here is that we drop the PK constraint
indirectly, so we only go via doDeletion rather than the tablecmds.c
code, so we don't check the attnotnull flags that the PK was protecting.

> The attached patch is my workaround solution.  Look forward your apply.

Yeah, this is not a very good approach -- I think you're just guessing
that the column is marked NOT NULL because a PK was dropped in the
past -- but really what this catalog state is, is corrupted contents
because the PK drop was mishandled.  At least in theory there are other
ways to drop a constraint other than dropping one of its columns (for
example, maybe this could happen if you drop a collation that the PK
depends on).  The right approach is to ensure that the PK drop always
does the dance that ATExecDropConstraint does.  A good fix probably just
moves some code from dropconstraint_internal to RemoveConstraintById.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
 really, I see PHP as like a strange amalgamation of C, Perl, Shell
 inflex: you know that "amalgam" means "mixture with mercury",
   more or less, right?
 i.e., "deadly poison"




Re: pgsql: Allow using syncfs() in frontend utilities.

2024-03-26 Thread Nathan Bossart
On Tue, Mar 26, 2024 at 11:18:57AM +0100, Peter Eisentraut wrote:
> On 22.03.24 17:52, Robert Haas wrote:
>> I'd like to complain about this commit's addition of a new appendix.
> 
> I already complained about that at 
> 
> and some follow-up was announced but didn't happen.  It was on my list to
> look into cleaning up during beta.

Sorry about this, I lost track of it.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: pgsql: Track last_inactive_time in pg_replication_slots.

2024-03-26 Thread Nathan Bossart
On Tue, Mar 26, 2024 at 03:44:29PM +0530, Amit Kapila wrote:
> On Tue, Mar 26, 2024 at 2:11 PM Alvaro Herrera  
> wrote:
>> Well, I think a GUC is good to have regardless of the slot parameter,
>> because the GUC can be used as an instance-wide protection against going
>> out of disk space because of broken replication.  However, now that I
>> think about it, I'm not really sure about invalidating a slot based on
>> time rather on disk space, for which we already have a parameter; what's
>> your rationale for that?  The passage of time is not a very good
>> measure, really, because the amount of WAL being protected has wildly
>> varying production rate across time.
> 
> The inactive slot not only blocks WAL from being removed but prevents
> the vacuum from proceeding. Also, there is a risk of transaction Id
> wraparound. See email [1] for more context.

FWIW I'd really prefer to have something like max_slot_xid_age for this.  A
time-based parameter would likely help with most cases, but transaction ID
usage will vary widely from server to server, so it'd be nice to have
something to protect against wraparound more directly.  I don't object to a
time-based setting as well, but that won't always work as well for this
particular use-case, especially if we are relying on users to set a
slot-level parameter.
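To sketch what I mean (the GUC and helper below are hypothetical, nothing like
this exists in the tree today), the check could boil down to a simple xid-age
comparison against the slot's xmin:

static int	max_slot_xid_age = 0;	/* hypothetical GUC, 0 disables it */

static bool
slot_xid_age_exceeded(ReplicationSlot *slot, TransactionId next_xid)
{
	TransactionId xmin = slot->data.xmin;

	if (max_slot_xid_age <= 0 || !TransactionIdIsNormal(xmin))
		return false;

	/* same modulo-2^32 arithmetic that age(xid) uses */
	return (int32) (next_xid - xmin) > max_slot_xid_age;
}

A caller (say, in the checkpointer) could pass ReadNextTransactionId() and
invalidate the slot when this returns true; catalog_xmin would presumably want
the same treatment.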

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: pgsql: Allow using syncfs() in frontend utilities.

2024-03-26 Thread Nathan Bossart
On Fri, Mar 22, 2024 at 12:52:15PM -0400, Robert Haas wrote:
> I'd like to complain about this commit's addition of a new appendix. I
> do understand the temptation to document caveats like this centrally
> instead of in multiple places, but as I've been complaining about over
> in the "documentation structure" thread, our top-level documentation
> index is too big, and I feel strongly that we need to de-clutter it
> rather than cluttering it further. This added a new chapter which is
> just 5 sentences long. I understand that this was done because the
> same issue applies to a bunch of different utilities and we didn't
> want to duplicate this text in all of those places, but I feel like
> this approach just doesn't scale. If we did this in every place where
> we have this much text that we want to avoid duplicating, we'd soon
> have hundreds of appendixes.

Sorry I missed this.  I explored a couple of options last year but the
discussion trailed off [0].

> What I would suggest we do instead is pick one of the places where
> this comes up and document it there, perhaps the
> recovery_init_sync_method GUC. And then make the documentation for the
> other say something like, you know those issues we documented for
> recovery_init_sync_method? Well they also apply to this.

WFM.  I'll put together a patch.

[0] https://postgr.es/m/20231009204823.GA659480%40nathanxps13

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




SQL function which allows to distinguish a server being in point in time recovery mode and an ordinary replica

2024-03-26 Thread m . litsarev

Hi,

At present, the existing pg_is_in_recovery() function is not enough to
distinguish a server in point-in-time recovery (PITR) mode from an
ordinary replica, because it returns true in both cases.

That is why the pg_is_standby_requested() function introduced in the
attached patch might help.
It reports whether a standby.signal file was found in the data directory
by the startup process.

Instructions for reproducing the possible use case are also attached.

Hope it will be useful.

Respectfully,

Mikhail Litsarev
Postgres Professional: https://postgrespro.com

Steps to reproduce use case:

1.  Log in as a postgres user and setup env variables (PGDATA, PGLOG, PGPORT 
etc.), also 
mkdir $PGDATA/../archive_dir

2.  Create data directory
initdb -k

3.  Modify postgresql.conf
wal_level = replica
archive_mode = on
archive_command = 'test ! -f /home/postgres/data/pgpro/archive_dir/%f && cp 
%p /home/postgres/data/pgpro/archive_dir/%f'

4.  Start server
pg_ctl start -l $PGLOG/logfile

5.  Create example database and run psql
createdb db_pitr_01
psql -d db_pitr_01

6.  Create table (just to modify database)
CREATE TABLE t8x1_float4 AS SELECT random()::float4 as a1 FROM 
generate_series(1, 8);

7.  Switch to a new WAL file
SELECT pg_switch_wal();

8.  Make a backup
pg_basebackup -Ft -X none -D - | gzip > 
/home/postgres/data/pgpro/db_file_backup.tar.gz

9.  Update table 
INSERT INTO t8x1_float4 values  (1.);
INSERT INTO t8x1_float4 values  (2.);
INSERT INTO t8x1_float4 values  (3.);

10. Switch to a new WAL file
SELECT pg_switch_wal();

11. Stop database and remove data directory
pg_ctl stop
rm -r /home/postgres/data/pgpro/data-debug/*

12. Restore data directory
tar xvfz db_file_backup.tar.gz -C /home/postgres/data/pgpro/data-debug/

13. Update the following parameters in postgresql.conf
recovery_target_time = '2024-03-26 10:26:30 GMT'
restore_command = 'cp /home/postgres/data/pgpro/archive_dir/%f %p'

14.
   touch recovery.signal

15. 
pg_ctl start  -l $PGLOG/logfile

16. Start database and see it is in Point In Time Recovery mode
psql -d db_pitr_01

db_pitr_01=# SELECT pg_is_in_recovery();
 pg_is_in_recovery 
---
 t
(1 row)

db_pitr_01=# SELECT pg_is_standby_requested();
 pg_is_standby_requested 
-
 f
(1 row)

Now we can distinguish a server recovering in point-in-time recovery mode from
one in standby mode.
pg_is_standby_requested() returns true for a replica.


From 563431ffb53d0b598f33d3378b7ba40338020ca6 Mon Sep 17 00:00:00 2001
From: Mikhail Litsarev 
Date: Tue, 20 Feb 2024 20:05:37 +0300
Subject: [PATCH 1/2] Introduce pg_is_standby_requested().

Introduce pg_is_standby_requested() function to distinguish
a replica from a regular instance in point in time recovery mode.
---
 src/backend/access/transam/xlogfuncs.c| 14 ++
 src/backend/access/transam/xlogrecovery.c | 33 +++
 src/include/access/xlogrecovery.h |  1 +
 src/include/catalog/pg_proc.dat   |  4 +++
 4 files changed, 52 insertions(+)

diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c
index 45452d937c7..3170c4ef343 100644
--- a/src/backend/access/transam/xlogfuncs.c
+++ b/src/backend/access/transam/xlogfuncs.c
@@ -753,3 +753,17 @@ pg_promote(PG_FUNCTION_ARGS)
 		   wait_seconds)));
 	PG_RETURN_BOOL(false);
 }
+
+/*
+ * Returns bool with a current value of StandbyModeIsRequested flag
+ * to distinguish a replica from a regular instance in a
+ * Point In Time Recovery (PITR) mode.
+ *
+ * Returns	true	if standby.signal file is found at startup process
+ * 			false	otherwise
+ */
+Datum
+pg_is_standby_requested(PG_FUNCTION_ARGS)
+{
+	PG_RETURN_BOOL(StandbyModeIsRequested());
+}
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 6318878356a..cbca4c5055d 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -321,6 +321,12 @@ typedef struct XLogRecoveryCtlData
 	 */
 	bool		SharedPromoteIsTriggered;
 
+	/*
+	 * SharedStandbyModeRequested indicates if we're in a standby mode at
+	 * start, while recovery mode is on. Protected by info_lck.
+	 */
+	bool		SharedStandbyModeRequested;
+
 	/*
 	 * recoveryWakeupLatch is used to wake up the startup process to continue
 	 * WAL replay, if it is waiting for WAL to arrive or promotion to be
@@ -1070,7 +1076,16 @@ readRecoverySignalFile(void)
 		ArchiveRecoveryRequested = true;
 	}
 	else
+	{
+		/*
+		 * There is no need to use Spinlock here because only the startup
+		 * process modifies the SharedStandbyModeRequested variable here and
+		 * no other processes are reading it at that time.
+		 */
+		XLogRecoveryCtl->SharedStandbyModeRequested = StandbyModeRequested;
 		return;
+	}
+	XLogRecoveryCtl->SharedStandbyModeRequested = StandbyModeRequested;
 
 	/*
 	 * We 

Re: make dist using git archive

2024-03-26 Thread Tristan Partin

On Tue Mar 26, 2024 at 2:56 AM CDT, Andres Freund wrote:

Hi,

On 2024-03-26 08:36:58 +0100, Peter Eisentraut wrote:
> On 26.03.24 01:23, Andres Freund wrote:
> > On 2024-03-25 06:44:33 +0100, Peter Eisentraut wrote:
> > > Done and committed.
> > 
> > This triggered a new warning for me:
> > 
> > ../../../../../home/andres/src/postgresql/meson.build:3422: WARNING: Project targets '>=0.54' but uses feature introduced in '0.55.0': Passing executable/found program object to script parameter of add_dist_script.
> 
> Hmm, I don't see that.  Is there another version dependency that controls

> when you see version dependency warnings? ;-)

Sometimes an incompatibility is later noticed and a warning is introduced at
that point.

> We could trivially remove this particular line, or perhaps put a
> 
> if meson.version().version_compare('>=0.55')
> 
> around it.  (But would that still warn?)


It shouldn't, no. As long as the code is actually executed within the check,
it avoids the warning. However if you just set a variable inside the version
gated block and then later use the variable outside that, it will
warn. Probably hard to avoid...


The following change also makes the warning go away, but the version 
comparison seems better to me due to how we choose not to use machine 
files for overriding programs[0]. :(


- meson.add_dist_script(perl, ...)
+ meson.add_dist_script('perl', ...)

Aside, but I think since we dropped AIX, we can bump the required Meson 
version. My last analysis of the situation told me that the AIX 
buildfarm animals were the only machines which didn't have a Python 
version capable of running a newer version. I would need to look at the 
situation again though.


[0]: If someone wants to make a plea here: 
https://github.com/mesonbuild/meson/pull/12623


--
Tristan Partin
Neon (https://neon.tech)




Re: Possibility to disable `ALTER SYSTEM`

2024-03-26 Thread Tom Lane
Robert Haas  writes:
> On Mon, Mar 25, 2024 at 5:04 PM Bruce Momjian  wrote:
>> To me, externally_managed_configuration is promising a lot more than it
>> delivers because there is still a lot of configuration it doesn't
>> control.  I am also confused why the purpose of the feature, external
>> management of configuration, is part of the variable name.  We usually
>> name parameters for what they control.

> I actually agree with this. I wasn't going to quibble with it because
> other people seemed to like it. But I think something like
> allow_alter_system would be better, as it would describe the exact
> thing that the parameter does, rather than how we think the parameter
> ought to be used.

+1.  The overpromise-and-underdeliver aspect of the currently proposed
name is a lot of the reason I've been unhappy and kept pushing for it
to lock things down more.  "allow_alter_system" is a lot more
straightforward about exactly what it does, and if that is all we want
it to do, then a name like that is good.

regards, tom lane




Re: pg_upgrade --copy-file-range

2024-03-26 Thread Jakub Wartak
On Sat, Mar 23, 2024 at 6:57 PM Tomas Vondra
 wrote:

> On 3/23/24 14:47, Tomas Vondra wrote:
> > On 3/23/24 13:38, Robert Haas wrote:
> >> On Fri, Mar 22, 2024 at 8:26 PM Thomas Munro  
> >> wrote:
[..]
> > Yeah, that's in write_reconstructed_file() and the patch does not touch
> > that at all. I agree it would be nice to use copy_file_range() in this
> > part too, and it doesn't seem it'd be that hard to do, I think.
> >
> > It seems we'd just need a "fork" that either calls pread/pwrite or
> > copy_file_range, depending on checksums and what was requested.
> >
>
> Here's a patch to use copy_file_range in write_reconstructed_file too,
> when requested/possible. One thing that I'm not sure about is whether to
> do pg_fatal() if --copy-file-range but the platform does not support it.
[..]

Hi Tomas, so I gave a go to the below patches today:
- v20240323-2-0001-pg_combinebackup-allow-using-clone-copy_.patch
- v20240323-2-0002-write_reconstructed_file.patch

My assessment:

v20240323-2-0001-pg_combinebackup-allow-using-clone-copy_.patch -
looks like more or less good to go
v20240323-2-0002-write_reconstructed_file.patch - needs work and
without that clone/copy_file_range() good effects are unlikely

Given Debian 12, ~100GB DB, (pgbench -i -s 7000 , and some additional
tables with GiST and GIN indexes , just to see more WAL record types)
and with backups sizes in MB like that:

106831  full
2823incr.1 # captured after some time with pgbench -R 100
165 incr.2 # captured after some time with pgbench -R 100

Test cmd: rm -rf outtest; sync; sync ; sync; echo 3 | sudo tee
/proc/sys/vm/drop_caches ; time /usr/pgsql17/bin/pg_combinebackup -o
outtest full incr.1 incr.2

Test results of various copies on small I/O constrained XFS device:
normal copy: 31m47.407s
--clone copy: error: file cloning not supported on this platform (it's
due #ifdef of having COPY_FILE_RANGE available)
--copy-file-range: aborted, as it was taking too long; I was
expecting it to accelerate, but it did not... obviously this is the
transparent fallback to normal copying when checksums have to be calculated...
--manifest-checksums=NONE --copy-file-range: BUG, it keeps appending
to just one file, e.g. outtest/base/5/16427.29, which grew to 200GB+ and ended
up with ENOSPC [more on this later]
--manifest-checksums=NONE --copy-file-range without v20240323-2-0002: 27m23.887s
--manifest-checksums=NONE --copy-file-range with v20240323-2-0002 and
loop-fix: 5m1.986s but it creates corruption as it stands

Issues:

1. https://cirrus-ci.com/task/5937513653600256?logs=mingw_cross_warning#L327
complains about win32/mingw:

[15:47:27.184] In file included from copy_file.c:22:
[15:47:27.184] copy_file.c: In function ‘copy_file’:
[15:47:27.184] ../../../src/include/common/logging.h:134:6: error:
this statement may fall through [-Werror=implicit-fallthrough=]
[15:47:27.184]   134 |   if (unlikely(__pg_log_level <= PG_LOG_DEBUG)) \
[15:47:27.184]   |  ^
[15:47:27.184] copy_file.c:96:5: note: in expansion of macro ‘pg_log_debug’
[15:47:27.184]96 | pg_log_debug("would copy \"%s\" to \"%s\"
(copy_file_range)",
[15:47:27.184]   | ^~~~
[15:47:27.184] copy_file.c:99:4: note: here
[15:47:27.184]99 |case COPY_MODE_COPYFILE:
[15:47:27.184]   |^~~~
[15:47:27.184] cc1: all warnings being treated as errors

2. I do not know what's the consensus between --clone and
--copy-file-range , but if we have #ifdef FICLONE clone_works() #elif
HAVE_COPY_FILE_RANGE copy_file_range_only_works() then we should also
apply the same logic to the --help so that --clone is not visible
there (for consistency?). Also the "error: file cloning not supported
on this platform " is technically incorrect, Linux does support
ioctl(FICLONE) and copy_file_range(), but we are just not choosing one
over another (so technically it is "available"). Nitpicking I know.
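Something along these lines in pg_combinebackup's usage() would keep --help
consistent with that selection logic (the help strings here are placeholders,
not the tool's actual wording):

#if defined(FICLONE)
	printf(_("      --clone                clone (reflink) files instead of copying\n"));
#elif defined(HAVE_COPY_FILE_RANGE)
	printf(_("      --copy-file-range      copy files using copy_file_range()\n"));
#endif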

3. [v20240323-2-0002-write_reconstructed_file.patch]: The mentioned
ENOSPC spiral-of-death bug symptoms look like this:

strace:
copy_file_range(8, [697671680], 9, NULL, 8192, 0) = 8192
copy_file_range(8, [697679872], 9, NULL, 8192, 0) = 8192
copy_file_range(8, [697688064], 9, NULL, 8192, 0) = 8192
copy_file_range(8, [697696256], 9, NULL, 8192, 0) = 8192
copy_file_range(8, [697704448], 9, NULL, 8192, 0) = 8192
copy_file_range(8, [697712640], 9, NULL, 8192, 0) = 8192
copy_file_range(8, [697720832], 9, NULL, 8192, 0) = 8192
copy_file_range(8, [697729024], 9, NULL, 8192, 0) = 8192
copy_file_range(8, [697737216], 9, NULL, 8192, 0) = 8192
copy_file_range(8, [697745408], 9, NULL, 8192, 0) = 8192
copy_file_range(8, [697753600], 9, NULL, 8192, 0) = 8192
copy_file_range(8, [697761792], 9, NULL, 8192, 0) = 8192
copy_file_range(8, [697769984], 9, NULL, 8192, 0) = 8192

Notice that dest_off_t (poutoff) is NULL.
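For reference, a minimal sketch (not the actual patch code; the fds, starting
offset and error handling are placeholders) of an explicit-offset loop that
does not rely on the descriptors' file positions and advances by however many
bytes were actually copied:

	off_t	in_off = block_offset;
	off_t	out_off = block_offset;
	size_t	remaining = BLCKSZ;

	while (remaining > 0)
	{
		ssize_t		nbytes;

		nbytes = copy_file_range(src_fd, &in_off, dst_fd, &out_off,
								 remaining, 0);
		if (nbytes < 0)
			pg_fatal("error while copying file range: %m");
		if (nbytes == 0)
			pg_fatal("unexpected end of file while copying file range");

		/* the kernel advanced in_off/out_off for us; just track the length */
		remaining -= nbytes;
	}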

(gdb) where
#0  0x7f2cd56f6733 in copy_file_range (infd=8,
pinoff=pinoff@entry=0x7f2cd53f54e8, outfd=outfd@entry=9,
poutoff=poutoff@entry=0x0,
length=length@entry=8192, flags=flags@entry=0) at

Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-03-26 Thread Kartyshov Ivan

Thank you for your interest in the patch.
I understand your questions, but I fully support Alexander Korotkov's idea
to commit the minimal required functionality first and then keep working on
other improvements.

On 2024-03-24 05:39, Bharath Rupireddy wrote:
On Fri, Mar 22, 2024 at 4:28 AM Peter Eisentraut  
wrote:


I had written in [0] about my questions related to using this with
connection poolers.  I don't think this was addressed at all.  I haven't
seen any discussion about how to make this kind of facility usable in a
full system.  You have to manually query and send LSNs; that seems
pretty cumbersome.  Sure, this is part of something that could be
useful, but how would an actual user with actual application code get to
use this?

[0]:
https://www.postgresql.org/message-id/8b5b172f-0ae7-d644-8358-e2851dded43b%40enterprisedb.com




But I wonder how a client is going to get the LSN.  How would all of
this be used by a client?  I can think of a scenario where you have
an application that issues a bunch of SQL commands and you have some
kind of pooler in the middle that redirects those commands to
different hosts, and what you really want is to have it transparently
behave as if it's just a single host.  Do we want to inject a bunch
of "SELECT pg_get_lsn()", "SELECT pg_wait_lsn()" calls into that?


As I understand your question, the application makes DML on the primary
server, gets the LSN of the changes and sends a bunch of read-only SQL
commands to the pooler. We can get transparent behaviour using
synchronous_commit, but it is very slow.


I'm tempted to think this could be a protocol-layer facility.  Every
query automatically returns the current LSN, and every query can also
send along an LSN to wait for, and the client library would just keep
track of the LSN for (what it thinks of as) the connection.  So you
get some automatic serialization without having to modify your client 
code.


Thank you, it is a good question for future versions.
Could you elaborate on what you mean by a protocol-layer facility? Maybe we
can use something like hot_standby_feedback.


I share the same concern as yours and had proposed something upthread
[1]. The idea is something like how each query takes a snapshot at the
beginning of txn/query (depending on isolation level), the same way
the standby can wait for the primary's current LSN as of the moment
(at the time of taking snapshot). And, primary keeps sending its
current LSN as part of regular WAL to standbys so that the standbys
don't have to make connections to the primary to know its current
LSN every time. Perhaps this may not even fully guarantee the intended
read-after-write consistency on standbys unless
there's a way for the application to tell the wait LSN.

Thoughts?

[1] 
https://www.postgresql.org/message-id/CALj2ACUfS7LH1PaWmSZ5KwH4BpQxO9izeMw4qC3a1DAwi6nfbQ%40mail.gmail.com




+1 to have support for implicit txns. A strawman solution I can think
of is to let primary send its current insert LSN to the standby every
time it sends a bunch of WAL, and the standby waits for that LSN to be
replayed on it at the start of every implicit txn automatically.


And how will the standby get the LSN to wait for? All solutions I can think of
are very invasive and poorly scalable.

For example, every DML statement could send back an LSN if it succeeds, and the
application could use it to wait for the actual changes.



The new BEGIN syntax requires application code changes. This led me to
think how one can achieve read-after-write consistency today in a
primary - standby set up. All the logic of this patch, that is, waiting
for the standby to pass a given primary LSN needs to be done in the
application code (or in proxy or in load balancer?). I believe there
might be someone doing this already, it's good to hear from them.


You may use synchronous_commit mode, but it is slow. So my implementation
doesn't make the primary wait for all standbys to send their feedback.

--
Ivan Kartyshov
Postgres Professional: www.postgrespro.com

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 8ecc02f2b9..8042201bab 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -28086,10 +28086,55 @@ postgres=# SELECT '0/0'::pg_lsn + pd.segment_number * ps.setting::int + :offset
 extension.

   
+
+  
+   
+
+ pg_wait_lsn
+
+pg_wait_lsn (trg_lsn pg_lsn, delay int8 DEFAULT 0)
+   
+   
+Returns ERROR if target LSN was not replayed on standby.
+Parameter delay sets seconds, the time to wait to the LSN.
+   
+  
  
 

 
+   
+pg_wait_lsn waits till wait_lsn
+to be replayed on standby to achieve read-your-writes-consistency, while
+using async replica for reads and primary for writes. In that case lsn of
+last modification is stored inside application.
+Note: pg_wait_lsn(trg_lsn pg_lsn, delay int8 DEFAULT 0)
+   
+
+   
+You can use pg_wait_lsn to wait the pg_lsn
+value. For example:
+   

Re: Parallel Aggregates for string_agg and array_agg

2024-03-26 Thread Alexander Lakhin

Hello David,

23.01.2023 07:37, David Rowley wrote:

I've now pushed this.



I've discovered that the test query:
-- Ensure parallel aggregation is actually being used.
explain (costs off) select * from v_pagg_test order by y;

added by 16fd03e95 fails sometimes. For instance:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=urutu=2024-03-19%2021%3A04%3A05

--- /home/bf/bf-build/urutu/HEAD/pgsql/src/test/regress/expected/aggregates.out 
2024-02-24 06:42:47.039073180 +
+++ 
/home/bf/bf-build/urutu/HEAD/pgsql.build/src/test/regress/results/aggregates.out
 2024-03-19 22:24:18.155876135 +
@@ -1993,14 +1993,16 @@
  Sort Key: pagg_test.y, (((unnest(regexp_split_to_array((string_agg((pagg_test.x)::text, ','::text)), 
','::text::integer)

  ->  Result
    ->  ProjectSet
- ->  Finalize HashAggregate
+ ->  Finalize GroupAggregate
...

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=phycodurus=2024-02-28%2007%3A38%3A27

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=avocet=2024-02-08%2008%3A47%3A45

I suspect that these failures are caused by autovacuum.

I could reproduce this plan change when running multiple tests in
parallel, and also with the attached test patch applied (several sleeps are
needed for autovacuum/relation_needs_vacanalyze() to get a non-zero
mod_since_analyze value from pgstat):
TEMP_CONFIG=/tmp/temp.config TESTS="test_setup create_index create_aggregate 
aggregates" make -s check-tests

where /tmp/temp.config is:
autovacuum_naptime = 1
log_autovacuum_min_duration = 0

log_min_messages = INFO
log_min_error_statement = log
log_statement = 'all'

With EXPLAIN (VERBOSE), I see a slight change of the Seq Scan cost/rows
estimate:
... ->  Parallel Seq Scan on public.pagg_test  (cost=0.00..48.99 rows=2599 
width=8)
vs
.. ->  Parallel Seq Scan on public.pagg_test  (cost=0.00..48.00 rows=2500 
width=8)
(after automatic analyze of table "regression.public.pagg_test")

Best regards,
Alexanderdiff --git a/src/test/regress/expected/aggregates.out b/src/test/regress/expected/aggregates.out
index d54a255e58..3de98c916d 100644
--- a/src/test/regress/expected/aggregates.out
+++ b/src/test/regress/expected/aggregates.out
@@ -1931,6 +1931,12 @@ select string_agg(v, decode('ee', 'hex')) from bytea_test_table;
 drop table bytea_test_table;
 -- Test parallel string_agg and array_agg
 create table pagg_test (x int, y int);
+select pg_sleep(3);
+ pg_sleep 
+--
+ 
+(1 row)
+
 insert into pagg_test
 select (case x % 4 when 1 then null else x end), x % 10
 from generate_series(1,5000) x;
@@ -1967,6 +1973,12 @@ from (
 	) a1
 ) a2
 group by y;
+select pg_sleep(3);
+ pg_sleep 
+--
+ 
+(1 row)
+
 -- Ensure results are correct.
 select * from v_pagg_test order by y;
  y | tmin | tmax | tndistinct | bmin | bmax | bndistinct | amin | amax | andistinct | aamin | aamax | aandistinct 
@@ -1983,6 +1995,12 @@ select * from v_pagg_test order by y;
  9 |   19 | 4999 |250 | 1019 | 999  |250 |   19 | 4999 |250 |19 |  4999 | 250
 (10 rows)
 
+select pg_sleep(3);
+ pg_sleep 
+--
+ 
+(1 row)
+
 -- Ensure parallel aggregation is actually being used.
 explain (costs off) select * from v_pagg_test order by y;
   QUERY PLAN  
diff --git a/src/test/regress/sql/aggregates.sql b/src/test/regress/sql/aggregates.sql
index 441e01d150..25dd90ec65 100644
--- a/src/test/regress/sql/aggregates.sql
+++ b/src/test/regress/sql/aggregates.sql
@@ -750,6 +750,7 @@ drop table bytea_test_table;
 
 -- Test parallel string_agg and array_agg
 create table pagg_test (x int, y int);
+select pg_sleep(3);
 insert into pagg_test
 select (case x % 4 when 1 then null else x end), x % 10
 from generate_series(1,5000) x;
@@ -789,9 +790,11 @@ from (
 ) a2
 group by y;
 
+select pg_sleep(3);
 -- Ensure results are correct.
 select * from v_pagg_test order by y;
 
+select pg_sleep(3);
 -- Ensure parallel aggregation is actually being used.
 explain (costs off) select * from v_pagg_test order by y;
 


Re: Possibility to disable `ALTER SYSTEM`

2024-03-26 Thread Robert Haas
On Tue, Mar 26, 2024 at 8:55 AM Abhijit Menon-Sen  wrote:
> Yes, "externally_managed_configuration" raises far more questions than
> it answers. "enable_alter_system" is clearer in terms of what to expect
> when you set it. "enable_alter_system_command" is rather long, but even
> better in that it is specific enough to not promise anything about not
> allowing superusers to change the configuration some other way.

It was previously suggested that we shouldn't start the GUC name with
"enable," since those are all planner GUCs currently. It's sort of a
silly precedent, but we have it, so that's why I proposed "allow"
instead.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Possibility to disable `ALTER SYSTEM`

2024-03-26 Thread Abhijit Menon-Sen
At 2024-03-26 08:11:33 -0400, robertmh...@gmail.com wrote:
>
> On Mon, Mar 25, 2024 at 5:04 PM Bruce Momjian  wrote:
> > > > Isn't "configuration" too generic a term for disabling ALTER SYSTEM?
> > >
> > > maybe "externally_managed_auto_config"
> >
> > How many people associate "auto" with ALTER SYSTEM?  I assume not many.
> >
> > To me, externally_managed_configuration is promising a lot more than it
> > delivers because there is still a lot of configuration it doesn't
> > control.  I am also confused why the purpose of the feature, external
> > management of configuration, is part of the variable name.  We usually
> > name parameters for what they control.
> 
> I actually agree with this. I wasn't going to quibble with it because
> other people seemed to like it. But I think something like
> allow_alter_system would be better, as it would describe the exact
> thing that the parameter does, rather than how we think the parameter
> ought to be used.

Yes, "externally_managed_configuration" raises far more questions than
it answers. "enable_alter_system" is clearer in terms of what to expect
when you set it. "enable_alter_system_command" is rather long, but even
better in that it is specific enough to not promise anything about not
allowing superusers to change the configuration some other way.

-- Abhijit (as someone who could find a use for this feature)




Re: Streaming I/O, vectored I/O (WIP)

2024-03-26 Thread Heikki Linnakangas

On 24/03/2024 15:02, Thomas Munro wrote:

On Wed, Mar 20, 2024 at 4:04 AM Heikki Linnakangas  wrote:

Maybe 'pg_streaming_read_next_buffer' or just 'pg_streaming_read_next',
for a shorter name.


Hmm.  The idea of 'buffer' appearing in a couple of names is that
there are conceptually other kinds of I/O that we might want to
stream, like raw files or buffers other than the buffer pool, maybe
even sockets, so this would be part of a family of similar interfaces.
I think it needs to be clear that this variant gives you buffers.  I'm
OK with removing "get" but I guess it would be better to keep the
words in the same order across the three functions?  What about these?

streaming_read_buffer_begin();
streaming_read_buffer_next();
streaming_read_buffer_end();

Tried like that in this version.  Other ideas would be to make
"stream" the main noun, buffered_read_stream_begin() or something.
Ideas welcome.


Works for me, although "streaming_read_buffer" is a pretty long prefix. 
The flags like "STREAMING_READ_MAINTENANCE" probably ought to be 
"STREAMING_READ_BUFFER_MAINTENANCE" as well.


Maybe "buffer_stream_next()"?


Here are some other changes:

* I'm fairly happy with the ABC adaptive distance algorithm so far, I
think, but I spent more time tidying up the way it is implemented.  I
didn't like the way each 'range' had buffer[MAX_BUFFERS_PER_TRANSFER],
so I created a new dense array stream->buffers that behaved as a
second circular queue.

* The above also made it trivial for MAX_BUFFERS_PER_TRANSFER to
become the GUC that it always wanted to be: buffer_io_size defaulting
to 128kB.  Seems like a reasonable thing to have?  Could also
influence things like bulk write?  (The main problem I have with the
GUC currently is choosing a category, async resources is wrong)

* By analogy, it started to look a bit funny that each range had room
for a ReadBuffersOperation, and we had enough ranges for
max_pinned_buffers * 1 block range.  So I booted that out to another
dense array, of size max_ios.

* At the same time, Bilal and Andres had been complaining privately
about 'range' management overheads showing up in perf and creating a
regression against master on fully cached scans that do nothing else
(eg pg_prewarm, where we lookup, pin, unpin every page and do no I/O
and no CPU work with the page, a somewhat extreme case but a
reasonable way to isolate the management costs); having made the above
change, it suddenly seemed obvious that I should make the buffers
array the 'main' circular queue, pointing off to another place for
information required for dealing with misses.  In this version, there
are no more range objects.  This feels better and occupies and touches
less memory.  See pictures below.


+1 for all that. Much better!


* Various indexes and sizes that couldn't quite fit in uint8_t but
couldn't possibly exceed a few thousand because they are bounded by
numbers deriving from range-limited GUCs are now int16_t (while I was
looking for low hanging opportunities to reduce memory usage...)


Is int16 enough though? It seems so, because:

max_pinned_buffers = Max(max_ios * 4, buffer_io_size);

and max_ios is constrained by the GUC's maximum MAX_IO_CONCURRENCY, and 
buffer_io_size is constrained by MAX_BUFFER_IO_SIZE == PG_IOV_MAX == 32.


If someone changes those constants though, int16 might overflow and fail 
in weird ways. I'd suggest being more careful here and explicitly clamp 
max_pinned_buffers at PG_INT16_MAX or have a static assertion or 
something. (I think it needs to be somewhat less than PG_INT16_MAX, 
because of the extra "overflow buffers" stuff and some other places 
where you do arithmetic.)
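A sketch of the kind of guard meant here (the exact bound and its placement
are assumptions, not something from the patch):

	max_pinned_buffers = Max(max_ios * 4, buffer_io_size);

	/* leave headroom for the extra "overflow buffers" and index arithmetic */
	max_pinned_buffers = Min(max_pinned_buffers, PG_INT16_MAX - PG_IOV_MAX);

	/* or catch it at compile time if someone raises the constants later */
	StaticAssertStmt(MAX_IO_CONCURRENCY * 4 + PG_IOV_MAX < PG_INT16_MAX,
					 "int16 indexes in streaming reads could overflow");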



/*
 * We gave a contiguous range of buffer space to StartReadBuffers(), but
 * we want it to wrap around at max_pinned_buffers.  Move values that
 * overflowed into the extra space.  At the same time, put -1 in the I/O
 * slots for the rest of the buffers to indicate no I/O.  They are 
covered
 * by the head buffer's I/O, if there is one.  We avoid a % operator.
 */
overflow = (stream->next_buffer_index + nblocks) - 
stream->max_pinned_buffers;
if (overflow > 0)
{
memmove(&stream->buffers[0],
&stream->buffers[stream->max_pinned_buffers],
sizeof(stream->buffers[0]) * overflow);
for (int i = 0; i < overflow; ++i)
stream->buffer_io_indexes[i] = -1;
for (int i = 1; i < nblocks - overflow; ++i)
stream->buffer_io_indexes[stream->next_buffer_index + 
i] = -1;
}
else
{
for (int i = 1; i < nblocks; ++i)
stream->buffer_io_indexes[stream->next_buffer_index + 
i] = -1;
}


Instead of clearing buffer_io_indexes here, it might be cheaper/simpler 
to initialize the array to -1 in streaming_read_buffer_begin(), and 
reset 

Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-26 Thread Bertrand Drouvot
Hi,

On Tue, Mar 26, 2024 at 04:17:53PM +0530, shveta malik wrote:
> On Tue, Mar 26, 2024 at 3:50 PM Bertrand Drouvot
>  wrote:
> >
> > Hi,
> >
> > > I think there may have been some misunderstanding here.
> >
> > Indeed ;-)
> >
> > > But now if I
> > > rethink this, I am fine with 'inactive_since' getting synced from
> > > primary to standby. But if we do that, we need to add docs stating
> > > "inactive_since" represents primary's inactivity and not standby's
> > > slots inactivity for synced slots.
> >
> > Yeah sure.
> >
> > > The reason for this clarification
> > > is that the synced slot might be generated much later, yet
> > > 'inactive_since' is synced from the primary, potentially indicating a
> > > time considerably earlier than when the synced slot was actually
> > > created.
> >
> > Right.
> >
> > > Another approach could be that "inactive_since" for synced slot
> > > actually gives its own inactivity data rather than giving primary's
> > > slot data. We update inactive_since on standby only at 3 occasions:
> > > 1) at the time of creation of the synced slot.
> > > 2) during standby restart.
> > > 3) during promotion of standby.
> > >
> > > I have attached a sample patch for this idea as.txt file.
> >
> > Thanks!
> >
> > > I am fine with any of these approaches.  One gives data synced from
> > > primary for synced slots, while another gives actual inactivity data
> > > of synced slots.
> >
> > What about another approach?: inactive_since gives data synced from primary 
> > for
> > synced slots and another dedicated field (could be added later...) could
> > represent what you suggest as the other option.
> 
> Yes, okay with me. I think there is some confusion here as well. In my
> second approach above, I have not suggested anything related to
> sync-worker.

Yeah, no confusion, understood that way.

> We can think on that later if we really need another
> field which give us sync time.

I think that calling GetCurrentTimestamp() so frequently could be too costly, so
I'm not sure we should.

> In my second approach, I have tried to
> avoid updating inactive_since for synced slots during sync process. We
> update that field during creation of synced slot so that
> inactive_since reflects correct info even for synced slots (rather
> than copying from primary). 

Yeah, and I think we could create a dedicated field with this information
if we feel the need.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-26 Thread Bertrand Drouvot
Hi,

On Tue, Mar 26, 2024 at 04:49:18PM +0530, shveta malik wrote:
> On Tue, Mar 26, 2024 at 4:35 PM Bharath Rupireddy
>  wrote:
> >
> > On Tue, Mar 26, 2024 at 4:18 PM shveta malik  wrote:
> > >
> > > > What about another approach?: inactive_since gives data synced from 
> > > > primary for
> > > > synced slots and another dedicated field (could be added later...) could
> > > > represent what you suggest as the other option.
> > >
> > > Yes, okay with me. I think there is some confusion here as well. In my
> > > second approach above, I have not suggested anything related to
> > > sync-worker. We can think on that later if we really need another
> > > field which give us sync time.  In my second approach, I have tried to
> > > avoid updating inactive_since for synced slots during sync process. We
> > > update that field during creation of synced slot so that
> > > inactive_since reflects correct info even for synced slots (rather
> > > than copying from primary). Please have a look at my patch and let me
> > > know your thoughts. I am fine with copying it from primary as well and
> > > documenting this behaviour.
> >
> > I took a look at your patch.
> >
> > --- a/src/backend/replication/logical/slotsync.c
> > +++ b/src/backend/replication/logical/slotsync.c
> > @@ -628,6 +628,7 @@ synchronize_one_slot(RemoteSlot *remote_slot, Oid
> > remote_dbid)
> >  SpinLockAcquire(&slot->mutex);
> >  slot->effective_catalog_xmin = xmin_horizon;
> >  slot->data.catalog_xmin = xmin_horizon;
> > +slot->inactive_since = GetCurrentTimestamp();
> >  SpinLockRelease(&slot->mutex);
> >
> > If we just sync inactive_since value for synced slots while in
> > recovery from the primary, so be it. Why do we need to update it to
> > the current time when the slot is being created?
> 
> If we update inactive_since at the synced slot's creation or during
> restart (skipping setting it during sync), then this time reflects the
> actual 'inactive_since' for that particular synced slot.  Isn't that
> clear info for the user and in alignment with what the name
> 'inactive_since' actually suggests?
> 
> > We don't expose slot
> > creation time, no?
> 
> No, we don't. But for synced slot, that is the time since that slot is
> inactive  (unless promoted), so we are exposing inactive_since and not
> creation time.
> 
> >Aren't we fine if we just sync the value from
> > primary and document that fact? After the promotion, we can reset it
> > to the current time so that it gets its own time. Do you see any
> > issues with it?
> 
> Yes, we can do that. But curious to know, do we see any additional
> benefit of reflecting primary's inactive_since at standby which I
> might be missing?

If the primary goes down, one could use the value on the standby
to get the value coming from the primary. I think that could be useful info to
have.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: REVOKE FROM warning on grantor

2024-03-26 Thread Robert Haas
On Tue, Mar 26, 2024 at 5:22 AM Étienne BERSAC
 wrote:
> > ldap2pg really ought to issue REVOKE x FROM y GRANTED BY z.
>
> Thanks for this. I missed this notation and it is exactly what I need.
>
> You could consider this subject as closed. Thanks for your time and
> explanation.

No problem. Glad it helped!

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Possibility to disable `ALTER SYSTEM`

2024-03-26 Thread Daniel Gustafsson
> On 26 Mar 2024, at 13:11, Robert Haas  wrote:
> On Mon, Mar 25, 2024 at 5:04 PM Bruce Momjian  wrote:

>> To me, externally_managed_configuration is promising a lot more than it
>> delivers because there is still a lot of configuration it doesn't
>> control.  I am also confused why the purpose of the feature, external
>> management of configuration, is part of the variable name.  We usually
>> name parameters for what they control.
> 
> I actually agree with this. I wasn't going to quibble with it because
> other people seemed to like it. But I think something like
> allow_alter_system would be better, as it would describe the exact
> thing that the parameter does, rather than how we think the parameter
> ought to be used.

+Many.  Either allow_alter_system or enable_alter_system_command is IMO
preferable, not least because someone might use this without using any
external configuration tool, making the name even more misleading.

--
Daniel Gustafsson





Re: Possibility to disable `ALTER SYSTEM`

2024-03-26 Thread Robert Haas
On Mon, Mar 25, 2024 at 5:04 PM Bruce Momjian  wrote:
> > > Isn't "configuration" too generic a term for disabling ALTER SYSTEM?
> >
> > maybe "externally_managed_auto_config"
>
> How many people associate "auto" with ALTER SYSTEM?  I assume not many.
>
> To me, externally_managed_configuration is promising a lot more than it
> delivers because there is still a lot of configuration it doesn't
> control.  I am also confused why the purpose of the feature, external
> management of configuration, is part of the variable name.  We usually
> name parameters for what they control.

I actually agree with this. I wasn't going to quibble with it because
other people seemed to like it. But I think something like
allow_alter_system would be better, as it would describe the exact
thing that the parameter does, rather than how we think the parameter
ought to be used.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Can't find not null constraint, but \d+ shows that

2024-03-26 Thread Tender Wang
Hi Alvaro,

I ran into an issue related to the catalog not-null commit on HEAD.

postgres=# CREATE TABLE t1(c0 int, c1 int);
CREATE TABLE
postgres=# ALTER TABLE  t1 ADD CONSTRAINT Q PRIMARY KEY(c0, c1);
ALTER TABLE
postgres=# \d+ t1
   Table "public.t1"
 Column |  Type   | Collation | Nullable | Default | Storage | Compression
| Stats target | Description
+-+---+--+-+-+-+--+-
 c0 | integer |   | not null | | plain   |
|  |
 c1 | integer |   | not null | | plain   |
|  |
Indexes:
"q" PRIMARY KEY, btree (c0, c1)
Access method: heap

postgres=# ALTER TABLE  t1 DROP c1;
ALTER TABLE
postgres=# \d+ t1
   Table "public.t1"
 Column |  Type   | Collation | Nullable | Default | Storage | Compression
| Stats target | Description
+-+---+--+-+-+-+--+-
 c0 | integer |   | not null | | plain   |
|  |
Access method: heap

postgres=# ALTER TABLE  t1  ALTER c0 DROP NOT NULL;
ERROR:  could not find not-null constraint on column "c0", relation "t1"
postgres=# insert into t1 values (NULL);
ERROR:  null value in column "c0" of relation "t1" violates not-null
constraint
DETAIL:  Failing row contains (null).

I couldn't reproduce the above issue on older versions (12.12 ... 16.1);
to be more precise, it appears since commit b0e96f3119.

Without b0e96f3119, when we drop a not-null constraint, we just update
pg_attribute.attnotnull to false in ATExecDropNotNull().  But now we first
check whether pg_constraint has the corresponding tuple.  If attnotnull is
true but pg_constraint doesn't have that tuple, the above error is reported.

This will be confusing for users, because \d shows that column c0 is not
null and we can't insert a NULL value, but it
reports an error when users drop the NOT NULL constraint.

The attached patch is my workaround solution.  Looking forward to your review.

-- 
Tender Wang
OpenPie:  https://en.openpie.com/


0001-Fix-attnotnull-not-correct-reset-issue.patch
Description: Binary data


Re: Refactoring of pg_resetwal/t/001_basic.pl

2024-03-26 Thread Svetlana Derevyanko

Peter Eisentraut wrote on 2024-03-25 17:10:


But MXOFF_SIZE doesn't exist anywhere else.  The actual formula uses
sizeof(MultiXactOffset), which isn't obvious from your patch.  So this 
just moves the magic constants around by one level.


I think if we're going to add more symbols, then it has to be done 
consistently in the source code, the documentation, and the tests, not 
just one of them.




Hello!
Thank you for your reply.

Attached is the updated version of the patch for the pg_resetwal test. I added
definitions for the MXOFF_SIZE and MXID_SIZE constants in multixact.c (and
replaced uses of sizeof(MultiXactId) and sizeof(MultiXactOffset)
accordingly). I also changed the multipliers for pg_xact/members/offsets to
CLOG_XACTS_PER_PAGE/MULTIXACT_MEMBERS_PER_PAGE/MULTIXACT_OFFSETS_PER_PAGE
both in src/bin/pg_resetwal/t/001_basic.pl and in the docs, since it seems to
me that this makes things clearer.


What do you think?


Best regards,
Svetlana Derevyanko.
From a1bbaddbc4e97e95ed15b51e48386be5fb4b Mon Sep 17 00:00:00 2001
From: Svetlana Derevyanko 
Date: Tue, 26 Mar 2024 13:09:43 +0300
Subject: [PATCH v2] Refactor pg_resetwal/t/001_basic.pl

Use constants instead of magic numbers.
Added MXID_SIZE and MXOFF_SIZE constants.
Changed descriptions for pg_resetwal options in docs.

---
 doc/src/sgml/ref/pg_resetwal.sgml  |  8 
 src/backend/access/transam/multixact.c | 15 +--
 src/bin/pg_resetwal/t/001_basic.pl | 18 +++---
 3 files changed, 28 insertions(+), 13 deletions(-)

diff --git a/doc/src/sgml/ref/pg_resetwal.sgml b/doc/src/sgml/ref/pg_resetwal.sgml
index cf9c7e70f2..d8de5e2e29 100644
--- a/doc/src/sgml/ref/pg_resetwal.sgml
+++ b/doc/src/sgml/ref/pg_resetwal.sgml
@@ -274,7 +274,7 @@ PostgreSQL documentation
   names are in hexadecimal, so the easiest way to do this is to specify
   the option value in hexadecimal and append four zeroes.
  
- 
+ 
 

 
@@ -309,7 +309,7 @@ PostgreSQL documentation
   The file names are in hexadecimal.  There is no simple recipe such as
   the ones for other options of appending zeroes.
  
- 
+ 
 

 
@@ -358,7 +358,7 @@ PostgreSQL documentation
   in pg_xact, -u 0x70 will work (five
   trailing zeroes provide the proper multiplier).
  
- 
+ 
 

 
@@ -380,7 +380,7 @@ PostgreSQL documentation
   in pg_xact, -x 0x120 will work (five
   trailing zeroes provide the proper multiplier).
  
- 
+ 
 

   
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 83b578dced..6bda000120 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -104,8 +104,11 @@
  * MultiXactOffsetPagePrecedes).
  */
 
+#define MXOFF_SIZE sizeof(MultiXactOffset)
+#define MXID_SIZE sizeof(MultiXactId)
+
 /* We need four bytes per offset */
-#define MULTIXACT_OFFSETS_PER_PAGE (BLCKSZ / sizeof(MultiXactOffset))
+#define MULTIXACT_OFFSETS_PER_PAGE (BLCKSZ / MXOFF_SIZE)
 
 #define MultiXactIdToOffsetPage(xid) \
 	((xid) / (MultiXactOffset) MULTIXACT_OFFSETS_PER_PAGE)
@@ -1765,7 +1768,7 @@ AtPrepare_MultiXact(void)
 
 	if (MultiXactIdIsValid(myOldestMember))
 		RegisterTwoPhaseRecord(TWOPHASE_RM_MULTIXACT_ID, 0,
-			   , sizeof(MultiXactId));
+			   , MXID_SIZE);
 }
 
 /*
@@ -1832,7 +1835,7 @@ multixact_twophase_recover(TransactionId xid, uint16 info,
 	 * Get the oldest member XID from the state file record, and set it in the
 	 * OldestMemberMXactId slot reserved for this prepared transaction.
 	 */
-	Assert(len == sizeof(MultiXactId));
+	Assert(len == MXID_SIZE);
 	oldestMember = *((MultiXactId *) recdata);
 
 	OldestMemberMXactId[dummyProcNumber] = oldestMember;
@@ -1848,7 +1851,7 @@ multixact_twophase_postcommit(TransactionId xid, uint16 info,
 {
 	ProcNumber	dummyProcNumber = TwoPhaseGetDummyProcNumber(xid, true);
 
-	Assert(len == sizeof(MultiXactId));
+	Assert(len == MXID_SIZE);
 
 	OldestMemberMXactId[dummyProcNumber] = InvalidMultiXactId;
 }
@@ -1877,7 +1880,7 @@ MultiXactShmemSize(void)
 	/* We need 2*MaxOldestSlot perBackendXactIds[] entries */
 #define SHARED_MULTIXACT_STATE_SIZE \
 	add_size(offsetof(MultiXactStateData, perBackendXactIds), \
-			 mul_size(sizeof(MultiXactId) * 2, MaxOldestSlot))
+			 mul_size(MXID_SIZE * 2, MaxOldestSlot))
 
 	size = SHARED_MULTIXACT_STATE_SIZE;
 	size = add_size(size, SimpleLruShmemSize(multixact_offset_buffers, 0));
@@ -2146,7 +2149,7 @@ TrimMultiXact(void)
 		offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
 		offptr += entryno;
 
-		MemSet(offptr, 0, BLCKSZ - (entryno * sizeof(MultiXactOffset)));
+		MemSet(offptr, 0, BLCKSZ - (entryno * MXOFF_SIZE));
 
 		MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
 		LWLockRelease(lock);
diff --git a/src/bin/pg_resetwal/t/001_basic.pl b/src/bin/pg_resetwal/t/001_basic.pl
index 9829e48106..19fb6dc6a2 100644
--- a/src/bin/pg_resetwal/t/001_basic.pl

Re: Use streaming read API in ANALYZE

2024-03-26 Thread Nazir Bilal Yavuz
Hi,

On Wed, 28 Feb 2024 at 14:42, Nazir Bilal Yavuz  wrote:
>
>
> The new version of the streaming read API [1] is posted. I updated the
> streaming read API changes patch (0001), using the streaming read API
> in ANALYZE patch (0002) remains the same. This should make it easier
> to review as it can be applied on top of master
>
>

The new version of the streaming read API is posted [1]. I rebased the
patch on top of master and v9 of the streaming read API.

There is a minimal change in the 'using the streaming read API in ANALYZE'
patch (0002): I changed STREAMING_READ_FULL to STREAMING_READ_MAINTENANCE
to copy exactly the same behavior as before. Also, some benchmarking
results:

I created a 22 GB table and set the size of shared buffers to 30 GB; the
rest is default.

╔═══════════════════════════╦═════════════════════╦════════════╗
║                           ║  Avg Timings in ms  ║            ║
╠═══════════════════════════╬══════════╦══════════╬════════════╣
║                           ║  master  ║  patched ║ percentage ║
╠═══════════════════════════╬══════════╬══════════╬════════════╣
║ Both OS cache and         ║          ║          ║            ║
║ shared buffers are clear  ║ 513.9247 ║ 463.1019 ║    9.9%    ║
╠═══════════════════════════╬══════════╬══════════╬════════════╣
║ OS cache is loaded but    ║          ║          ║            ║
║ shared buffers are clear  ║ 423.1097 ║ 354.3277 ║   16.3%    ║
╠═══════════════════════════╬══════════╬══════════╬════════════╣
║ Shared buffers are loaded ║          ║          ║            ║
║                           ║  89.2846 ║  84.6952 ║    5.1%    ║
╚═══════════════════════════╩══════════╩══════════╩════════════╝

Any kind of feedback would be appreciated.

[1]:
https://www.postgresql.org/message-id/CA%2BhUKGL-ONQnnnp-SONCFfLJzqcpAheuzZ%2B-yTrD9WBM-GmAcg%40mail.gmail.com

-- 
Regards,
Nazir Bilal Yavuz
Microsoft
From 7278c354d80e4d1f21c6fa0d810a723789f2d722 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Mon, 26 Feb 2024 23:48:31 +1300
Subject: [PATCH v3 1/2] Streaming read API changes that are not committed to
 master yet

Discussion: https://www.postgresql.org/message-id/CA%2BhUKGJkOiOCa%2Bmag4BF%2BzHo7qo%3Do9CFheB8%3Dg6uT5TUm2gkvA%40mail.gmail.com
---
 src/include/storage/bufmgr.h  |  45 +-
 src/include/storage/streaming_read.h  |  50 ++
 src/backend/storage/Makefile  |   2 +-
 src/backend/storage/aio/Makefile  |  14 +
 src/backend/storage/aio/meson.build   |   5 +
 src/backend/storage/aio/streaming_read.c  | 678 +
 src/backend/storage/buffer/bufmgr.c   | 706 --
 src/backend/storage/buffer/localbuf.c |  14 +-
 src/backend/storage/meson.build   |   1 +
 src/backend/utils/misc/guc_tables.c   |  14 +
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 doc/src/sgml/config.sgml  |  14 +
 src/tools/pgindent/typedefs.list  |   2 +
 13 files changed, 1309 insertions(+), 237 deletions(-)
 create mode 100644 src/include/storage/streaming_read.h
 create mode 100644 src/backend/storage/aio/Makefile
 create mode 100644 src/backend/storage/aio/meson.build
 create mode 100644 src/backend/storage/aio/streaming_read.c

diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index d51d46d3353..1cc198bde21 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -14,6 +14,7 @@
 #ifndef BUFMGR_H
 #define BUFMGR_H
 
+#include "port/pg_iovec.h"
 #include "storage/block.h"
 #include "storage/buf.h"
 #include "storage/bufpage.h"
@@ -133,6 +134,10 @@ extern PGDLLIMPORT bool track_io_timing;
 extern PGDLLIMPORT int effective_io_concurrency;
 extern PGDLLIMPORT int maintenance_io_concurrency;
 
+#define MAX_BUFFER_IO_SIZE PG_IOV_MAX
+#define DEFAULT_BUFFER_IO_SIZE Min(MAX_BUFFER_IO_SIZE, (128 * 1024) / BLCKSZ)
+extern PGDLLIMPORT int buffer_io_size;
+
 extern PGDLLIMPORT int checkpoint_flush_after;
 extern PGDLLIMPORT int backend_flush_after;
 extern PGDLLIMPORT int bgwriter_flush_after;
@@ -158,7 +163,6 @@ extern PGDLLIMPORT int32 *LocalRefCount;
 #define BUFFER_LOCK_SHARE		1
 #define BUFFER_LOCK_EXCLUSIVE	2
 
-
 /*
  * prototypes for functions in bufmgr.c
  */
@@ -177,6 +181,42 @@ extern Buffer ReadBufferWithoutRelcache(RelFileLocator rlocator,
 		ForkNumber forkNum, BlockNumber blockNum,
 		ReadBufferMode mode, BufferAccessStrategy strategy,
 		bool permanent);
+
+#define READ_BUFFERS_ZERO_ON_ERROR 0x01
+#define READ_BUFFERS_ISSUE_ADVICE 0x02
+
+/*
+ * Private state used by StartReadBuffers() and WaitReadBuffers().  Declared
+ * in public header only to allow inclusion in other structs, but contents
+ * should not be accessed.
+ */
+struct ReadBuffersOperation
+{
+	/* Parameters passed in to StartReadBuffers(). */
+	BufferManagerRelation bmr;
+	Buffer	   *buffers;
+	ForkNumber	forknum;
+	BlockNumber blocknum;
+	int16		nblocks;
+	

Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-26 Thread shveta malik
On Tue, Mar 26, 2024 at 4:35 PM Bharath Rupireddy
 wrote:
>
> On Tue, Mar 26, 2024 at 4:18 PM shveta malik  wrote:
> >
> > > What about another approach?: inactive_since gives data synced from 
> > > primary for
> > > synced slots and another dedicated field (could be added later...) could
> > > represent what you suggest as the other option.
> >
> > Yes, okay with me. I think there is some confusion here as well. In my
> > second approach above, I have not suggested anything related to
> > sync-worker. We can think on that later if we really need another
> > field which give us sync time.  In my second approach, I have tried to
> > avoid updating inactive_since for synced slots during sync process. We
> > update that field during creation of synced slot so that
> > inactive_since reflects correct info even for synced slots (rather
> > than copying from primary). Please have a look at my patch and let me
> > know your thoughts. I am fine with copying it from primary as well and
> > documenting this behaviour.
>
> I took a look at your patch.
>
> --- a/src/backend/replication/logical/slotsync.c
> +++ b/src/backend/replication/logical/slotsync.c
> @@ -628,6 +628,7 @@ synchronize_one_slot(RemoteSlot *remote_slot, Oid
> remote_dbid)
>  SpinLockAcquire(&slot->mutex);
>  slot->effective_catalog_xmin = xmin_horizon;
>  slot->data.catalog_xmin = xmin_horizon;
> +slot->inactive_since = GetCurrentTimestamp();
>  SpinLockRelease(&slot->mutex);
>
> If we just sync inactive_since value for synced slots while in
> recovery from the primary, so be it. Why do we need to update it to
> the current time when the slot is being created?

If we update inactive_since  at synced slot's creation or during
restart (skipping setting it during sync), then this time reflects
actual 'inactive_since' for that particular synced slot.  Isn't that a
clear info for the user and in alignment of what the name
'inactive_since' actually suggests?

> We don't expose slot
> creation time, no?

No, we don't. But for synced slot, that is the time since that slot is
inactive  (unless promoted), so we are exposing inactive_since and not
creation time.

>Aren't we fine if we just sync the value from
> primary and document that fact? After the promotion, we can reset it
> to the current time so that it gets its own time. Do you see any
> issues with it?

Yes, we can do that. But curious to know, do we see any additional
benefit of reflecting primary's inactive_since at standby which I
might be missing?

thanks
Shveta




Re: ALTER TABLE SET ACCESS METHOD on partitioned tables

2024-03-26 Thread Alvaro Herrera
On 2024-Mar-26, Alexander Lakhin wrote:

> Hello Alvaro,
> 
> 21.03.2024 15:07, Alvaro Herrera wrote:
> > Given that Michaël is temporarily gone, I propose to push the attached
> > tomorrow.
> 
> Please look at a new anomaly introduced with 374c7a229.
> Starting from that commit, the following erroneous query:
> CREATE FOREIGN TABLE fp PARTITION OF pg_am DEFAULT SERVER x;
> 
> triggers an assertion failure:
> TRAP: failed Assert("relation->rd_rel->relam == InvalidOid"), File: 
> "relcache.c", Line: 1219, PID: 3706301

Hmm, yeah, we're setting relam for relations that shouldn't have it.
I propose the attached.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
>From 8b68591cea43fdc50fe53f11b077fcd42d501946 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Tue, 26 Mar 2024 11:58:53 +0100
Subject: [PATCH] relam fixes

---
 src/backend/commands/tablecmds.c | 22 ++
 1 file changed, 14 insertions(+), 8 deletions(-)

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 635d54645d..315d355238 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -714,7 +714,7 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
 	Oid			ofTypeId;
 	ObjectAddress address;
 	LOCKMODE	parentLockmode;
-	Oid			accessMethodId = InvalidOid;
+	Oid			accessMethodId;
 
 	/*
 	 * Truncate relname to appropriate length (probably a waste of time, as
@@ -958,15 +958,21 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
 	}
 
 	/*
-	 * Select access method to use: an explicitly indicated one, or (in the
-	 * case of a partitioned table) the parent's, if it has one.
+	 * For relations with table AM and partitioned tables, select access method
+	 * to use: an explicitly indicated one, or (in the case of a partitioned
+	 * table) the parent's, if it has one.
 	 */
-	if (stmt->accessMethod != NULL)
-		accessMethodId = get_table_am_oid(stmt->accessMethod, false);
-	else if (stmt->partbound)
+	if (RELKIND_HAS_TABLE_AM(relkind) || relkind == RELKIND_PARTITIONED_TABLE)
 	{
-		Assert(list_length(inheritOids) == 1);
-		accessMethodId = get_rel_relam(linitial_oid(inheritOids));
+		if (stmt->accessMethod != NULL)
+			accessMethodId = get_table_am_oid(stmt->accessMethod, false);
+		else if (stmt->partbound)
+		{
+			Assert(list_length(inheritOids) == 1);
+			accessMethodId = get_rel_relam(linitial_oid(inheritOids));
+		}
+		else
+			accessMethodId = InvalidOid;
 	}
 	else
 		accessMethodId = InvalidOid;
-- 
2.39.2



Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-26 Thread Bharath Rupireddy
On Tue, Mar 26, 2024 at 4:18 PM shveta malik  wrote:
>
> > What about another approach?: inactive_since gives data synced from primary 
> > for
> > synced slots and another dedicated field (could be added later...) could
> > represent what you suggest as the other option.
>
> Yes, okay with me. I think there is some confusion here as well. In my
> second approach above, I have not suggested anything related to
> sync-worker. We can think on that later if we really need another
> field which give us sync time.  In my second approach, I have tried to
> avoid updating inactive_since for synced slots during sync process. We
> update that field during creation of synced slot so that
> inactive_since reflects correct info even for synced slots (rather
> than copying from primary). Please have a look at my patch and let me
> know your thoughts. I am fine with copying it from primary as well and
> documenting this behaviour.

I took a look at your patch.

--- a/src/backend/replication/logical/slotsync.c
+++ b/src/backend/replication/logical/slotsync.c
@@ -628,6 +628,7 @@ synchronize_one_slot(RemoteSlot *remote_slot, Oid
remote_dbid)
 SpinLockAcquire(&slot->mutex);
 slot->effective_catalog_xmin = xmin_horizon;
 slot->data.catalog_xmin = xmin_horizon;
+slot->inactive_since = GetCurrentTimestamp();
 SpinLockRelease(&slot->mutex);

If we just sync inactive_since value for synced slots while in
recovery from the primary, so be it. Why do we need to update it to
the current time when the slot is being created? We don't expose slot
creation time, no? Aren't we fine if we just sync the value from
primary and document that fact? After the promotion, we can reset it
to the current time so that it gets its own time. Do you see any
issues with it?

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Remove some redundant set_cheapest() calls

2024-03-26 Thread Richard Guo
I happened to notice that the set_cheapest() calls in functions
set_namedtuplestore_pathlist() and set_result_pathlist() are redundant,
as we've centralized the set_cheapest() calls in set_rel_pathlist().

Attached is a trivial patch to remove these calls.

BTW, I suspect that the set_cheapest() call in set_dummy_rel_pathlist()
is also redundant.  The comment there says "This is redundant when we're
called from set_rel_size(), but not when called from elsewhere".  I
doubt it.  The other places where it is called are set_append_rel_size()
and set_subquery_pathlist(), both being called from set_rel_size().  So
set_cheapest() would ultimately be called in set_rel_pathlist().

Thoughts?

Thanks
Richard


v1-0001-Remove-some-redundant-set_cheapest-calls.patch
Description: Binary data


Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-26 Thread shveta malik
On Tue, Mar 26, 2024 at 3:50 PM Bertrand Drouvot
 wrote:
>
> Hi,
>
> > I think there may have been some misunderstanding here.
>
> Indeed ;-)
>
> > But now if I
> > rethink this, I am fine with 'inactive_since' getting synced from
> > primary to standby. But if we do that, we need to add docs stating
> > "inactive_since" represents primary's inactivity and not standby's
> > slots inactivity for synced slots.
>
> Yeah sure.
>
> > The reason for this clarification
> > is that the synced slot might be generated much later, yet
> > 'inactive_since' is synced from the primary, potentially indicating a
> > time considerably earlier than when the synced slot was actually
> > created.
>
> Right.
>
> > Another approach could be that "inactive_since" for synced slot
> > actually gives its own inactivity data rather than giving primary's
> > slot data. We update inactive_since on standby only at 3 occasions:
> > 1) at the time of creation of the synced slot.
> > 2) during standby restart.
> > 3) during promotion of standby.
> >
> > I have attached a sample patch for this idea as.txt file.
>
> Thanks!
>
> > I am fine with any of these approaches.  One gives data synced from
> > primary for synced slots, while another gives actual inactivity data
> > of synced slots.
>
> What about another approach?: inactive_since gives data synced from primary 
> for
> synced slots and another dedicated field (could be added later...) could
> represent what you suggest as the other option.

Yes, okay with me. I think there is some confusion here as well. In my
second approach above, I have not suggested anything related to
sync-worker. We can think on that later if we really need another
field which give us sync time.  In my second approach, I have tried to
avoid updating inactive_since for synced slots during sync process. We
update that field during creation of synced slot so that
inactive_since reflects correct info even for synced slots (rather
than copying from primary). Please have a look at my patch and let me
know your thoughts. I am fine with copying it from primary as well and
documenting this behaviour.

> Another cons of updating inactive_since at the current time during each slot
> sync cycle is that calling GetCurrentTimestamp() very frequently
> (during each sync cycle of very active slots) could be too costly.

Right.

thanks
Shveta




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-26 Thread Ajin Cherian
On Tue, Mar 26, 2024 at 7:57 PM Bharath Rupireddy <
bharath.rupireddyforpostg...@gmail.com> wrote:

> Please see the attached v23 patches. I've addressed all the review
> comments received so far from Amit and Shveta.
>
>
In patch 0003:
+ SpinLockAcquire(&slot->mutex);
+ }
+
+ Assert(LWLockHeldByMeInMode(ReplicationSlotControlLock, LW_SHARED));
+
+ if (slot->inactive_since > 0 &&
+ slot->data.inactive_timeout > 0)
+ {
+ TimestampTz now;
+
+ /* inactive_since is only tracked for inactive slots */
+ Assert(slot->active_pid == 0);
+
+ now = GetCurrentTimestamp();
+ if (TimestampDifferenceExceeds(slot->inactive_since, now,
+   slot->data.inactive_timeout * 1000))
+ inavidation_cause = RS_INVAL_INACTIVE_TIMEOUT;
+ }
+
+ if (need_locks)
+ {
+ SpinLockRelease(&slot->mutex);

Here, GetCurrentTimestamp() is still called with SpinLock held. Maybe do
this prior to acquiring the spinlock.
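
Something like the following, i.e. take the timestamp before acquiring the
mutex (just a rough sketch based on the quoted hunk, with the need_locks
handling and the rest of the function omitted):

	/* take the timestamp while no spinlock is held */
	TimestampTz now = GetCurrentTimestamp();

	SpinLockAcquire(&slot->mutex);

	if (slot->inactive_since > 0 &&
		slot->data.inactive_timeout > 0 &&
		TimestampDifferenceExceeds(slot->inactive_since, now,
								   slot->data.inactive_timeout * 1000))
		invalidation_cause = RS_INVAL_INACTIVE_TIMEOUT;

	SpinLockRelease(&slot->mutex);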

regards,
Ajin Cherian
Fujitsu Australia


Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-26 Thread Amit Kapila
On Tue, Mar 26, 2024 at 3:12 PM Bertrand Drouvot
 wrote:
>
> On Tue, Mar 26, 2024 at 02:27:17PM +0530, Bharath Rupireddy wrote:
> > Please use the v22 patch set.
>
> Thanks!
>
> 1 ===
>
> +reset_synced_slots_info(void)
>
> I'm not sure "reset" is the right word, what about 
> slot_sync_shutdown_update()?
>

*shutdown_update() sounds generic. How about
update_synced_slots_inactive_time()? I think it is a bit longer but
conveys the meaning.

-- 
With Regards,
Amit Kapila.




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-26 Thread Bertrand Drouvot
Hi,

On Tue, Mar 26, 2024 at 03:17:36PM +0530, shveta malik wrote:
> On Tue, Mar 26, 2024 at 1:54 PM Bertrand Drouvot
>  wrote:
> >
> > Hi,
> >
> > On Tue, Mar 26, 2024 at 01:37:21PM +0530, Amit Kapila wrote:
> > > On Tue, Mar 26, 2024 at 1:15 PM Bertrand Drouvot
> > >  wrote:
> > > >
> > > > 2 ===
> > > >
> > > > It looks like inactive_since is set to the current timestamp on the 
> > > > standby
> > > > each time the sync worker does a cycle:
> > > >
> > > > primary:
> > > >
> > > > postgres=# select slot_name,inactive_since from pg_replication_slots 
> > > > where failover = 't';
> > > >   slot_name  |inactive_since
> > > > -+---
> > > >  lsub27_slot | 2024-03-26 07:39:19.745517+00
> > > >  lsub28_slot | 2024-03-26 07:40:24.953826+00
> > > >
> > > > standby:
> > > >
> > > > postgres=# select slot_name,inactive_since from pg_replication_slots 
> > > > where failover = 't';
> > > >   slot_name  |inactive_since
> > > > -+---
> > > >  lsub27_slot | 2024-03-26 07:43:56.387324+00
> > > >  lsub28_slot | 2024-03-26 07:43:56.387338+00
> > > >
> > > > I don't think that should be the case.
> > > >
> > >
> > > But why? This is exactly what we discussed in another thread where we
> > > agreed to update inactive_since even for sync slots.
> >
> > Hum, I thought we agreed to "sync" it and to "update it to current time"
> > only at promotion time.
> 
> I think there may have been some misunderstanding here.

Indeed ;-)

> But now if I
> rethink this, I am fine with 'inactive_since' getting synced from
> primary to standby. But if we do that, we need to add docs stating
> "inactive_since" represents primary's inactivity and not standby's
> slots inactivity for synced slots.

Yeah sure.

> The reason for this clarification
> is that the synced slot might be generated much later, yet
> 'inactive_since' is synced from the primary, potentially indicating a
> time considerably earlier than when the synced slot was actually
> created.

Right.

> Another approach could be that "inactive_since" for synced slot
> actually gives its own inactivity data rather than giving primary's
> slot data. We update inactive_since on standby only at 3 occasions:
> 1) at the time of creation of the synced slot.
> 2) during standby restart.
> 3) during promotion of standby.
> 
> I have attached a sample patch for this idea as.txt file.

Thanks!

> I am fine with any of these approaches.  One gives data synced from
> primary for synced slots, while another gives actual inactivity data
> of synced slots.

What about another approach?: inactive_since gives data synced from primary for
synced slots and another dedicated field (could be added later...) could
represent what you suggest as the other option.

Another cons of updating inactive_since at the current time during each slot
sync cycle is that calling GetCurrentTimestamp() very frequently
(during each sync cycle of very active slots) could be too costly.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: pgsql: Allow using syncfs() in frontend utilities.

2024-03-26 Thread Peter Eisentraut

On 22.03.24 17:52, Robert Haas wrote:

On Wed, Sep 6, 2023 at 7:28 PM Nathan Bossart  wrote:

Allow using syncfs() in frontend utilities.

This commit allows specifying a --sync-method in several frontend
utilities that must synchronize many files to disk (initdb,
pg_basebackup, pg_checksums, pg_dump, pg_rewind, and pg_upgrade).
On Linux, users can specify "syncfs" to synchronize the relevant
file systems instead of calling fsync() for every single file.  In
many cases, using syncfs() is much faster.

As with recovery_init_sync_method, this new option comes with some
caveats.  The descriptions of these caveats have been moved to a
new appendix section in the documentation.


Hi,

I'd like to complain about this commit's addition of a new appendix.


I already complained about that at 
 
and some follow-up was announced but didn't happen.  It was on my list 
to look into cleaning up during beta.






Re: remaining sql/json patches

2024-03-26 Thread jian he
On Fri, Mar 22, 2024 at 12:08 AM Amit Langote  wrote:
>
> On Wed, Mar 20, 2024 at 9:53 PM Amit Langote  wrote:
> > I'll push 0001 tomorrow.
>
> Pushed that one.  Here's the remaining JSON_TABLE() patch.
>

Hi. Minor issues I found in the json_table patch.

+ if (!IsA($5, A_Const) ||
+ castNode(A_Const, $5)->val.node.type != T_String)
+ ereport(ERROR,
+ errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("only string constants are supported in JSON_TABLE"
+   " path specification"),
+ parser_errposition(@5));
as mentioned upthread, this error message should be on one line.
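I.e. keep the format string unsplit so it stays greppable, roughly:

ereport(ERROR,
		errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
		errmsg("only string constants are supported in JSON_TABLE path specification"),
		parser_errposition(@5));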


+const TableFuncRoutine JsonbTableRoutine =
+{
+ JsonTableInitOpaque,
+ JsonTableSetDocument,
+ NULL,
+ NULL,
+ NULL,
+ JsonTableFetchRow,
+ JsonTableGetValue,
+ JsonTableDestroyOpaque
+};
should be:

const TableFuncRoutine JsonbTableRoutine =
{
.InitOpaque = JsonTableInitOpaque,
.SetDocument = JsonTableSetDocument,
.SetNamespace = NULL,
.SetRowFilter = NULL,
.SetColumnFilter = NULL,
.FetchRow = JsonTableFetchRow,
.GetValue = JsonTableGetValue,
.DestroyOpaque = JsonTableDestroyOpaque
};

+/*
+ * JsonTablePathSpec
+ * untransformed specification of JSON path expression with an optional
+ * name
+ */
+typedef struct JsonTablePathSpec
+{
+ NodeTag type;
+
+ Node   *string;
+ char   *name;
+ int name_location;
+ int location; /* location of 'string' */
+} JsonTablePathSpec;
the comment still does not explain the distinction between "location"
and "name_location"?
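
FWIW, something along these lines would make the distinction explicit (the
field comments below are my guess at the intended semantics, not the patch's
wording):

typedef struct JsonTablePathSpec
{
	NodeTag		type;

	Node	   *string;			/* path specification, as a string constant */
	char	   *name;			/* path name given with AS, if any */
	int			name_location;	/* token location of the path name, or -1 */
	int			location;		/* token location of the path string */
} JsonTablePathSpec;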


JsonTablePathSpec needs to be added to typedefs.list.
JsonPathSpec should be removed from typedefs.list.


+/*
+ * JsonTablePlanType -
+ * flags for JSON_TABLE plan node types representation
+ */
+typedef enum JsonTablePlanType
+{
+ JSTP_DEFAULT,
+ JSTP_SIMPLE,
+ JSTP_JOINED,
+} JsonTablePlanType;
+
+/*
+ * JsonTablePlanJoinType -
+ * JSON_TABLE join types for JSTP_JOINED plans
+ */
+typedef enum JsonTablePlanJoinType
+{
+ JSTP_JOIN_INNER,
+ JSTP_JOIN_OUTER,
+ JSTP_JOIN_CROSS,
+ JSTP_JOIN_UNION,
+} JsonTablePlanJoinType;
I can guess the enum value meaning of JsonTablePlanJoinType,
but I can't guess the meaning of "JSTP_SIMPLE" or "JSTP_JOINED".
adding some comments in JsonTablePlanType would make it more clear.

I think I can understand JsonTableScanNextRow.
but i don't understand JsonTablePlanNextRow.
maybe we can add some comments on JsonTableJoinState.


+-- unspecified plan (outer, union)
+select
+ jt.*
+from
+ jsonb_table_test jtt,
+ json_table (
+ jtt.js,'strict $[*]' as p
+ columns (
+ n for ordinality,
+ a int path 'lax $.a' default -1 on empty,
+ nested path 'strict $.b[*]' as pb columns ( b int path '$' ),
+ nested path 'strict $.c[*]' as pc columns ( c int path '$' )
+ )
+ ) jt;
+ n | a  | b | c
+---++---+
+ 1 |  1 |   |
+ 2 |  2 | 1 |
+ 2 |  2 | 2 |
+ 2 |  2 | 3 |
+ 2 |  2 |   | 10
+ 2 |  2 |   |
+ 2 |  2 |   | 20
+ 3 |  3 | 1 |
+ 3 |  3 | 2 |
+ 4 | -1 | 1 |
+ 4 | -1 | 2 |
+(11 rows)
+
+-- default plan (outer, union)
+select
+ jt.*
+from
+ jsonb_table_test jtt,
+ json_table (
+ jtt.js,'strict $[*]' as p
+ columns (
+ n for ordinality,
+ a int path 'lax $.a' default -1 on empty,
+ nested path 'strict $.b[*]' as pb columns ( b int path '$' ),
+ nested path 'strict $.c[*]' as pc columns ( c int path '$' )
+ )
+ plan default (outer, union)
+ ) jt;
+ n | a  | b | c
+---++---+
+ 1 |  1 |   |
+ 2 |  2 | 1 | 10
+ 2 |  2 | 1 |
+ 2 |  2 | 1 | 20
+ 2 |  2 | 2 | 10
+ 2 |  2 | 2 |
+ 2 |  2 | 2 | 20
+ 2 |  2 | 3 | 10
+ 2 |  2 | 3 |
+ 2 |  2 | 3 | 20
+ 3 |  3 |   |
+ 4 | -1 |   |
+(12 rows)
these two query results should be the same, if i understand it correctly.




Re: Improve readability by using designated initializers when possible

2024-03-26 Thread Peter Eisentraut

On 25.03.24 06:00, jian he wrote:

looking through v4 again.
v4 looks good to me.


Thanks, I have committed this.





Re: pgsql: Track last_inactive_time in pg_replication_slots.

2024-03-26 Thread Amit Kapila
On Tue, Mar 26, 2024 at 2:11 PM Alvaro Herrera  wrote:
>
> On 2024-Mar-26, Amit Kapila wrote:
>
> > On Tue, Mar 26, 2024 at 1:09 PM Alvaro Herrera  
> > wrote:
> > > On 2024-Mar-26, Amit Kapila wrote:
> > > > I would also like to solicit your opinion on the other slot-level
> > > > parameter we are planning to introduce.  This new slot-level parameter
> > > > will be named as inactive_timeout.
> > >
> > > Maybe inactivity_timeout?
> > >
> > > > This will indicate that once the slot is inactive for the
> > > > inactive_timeout period, we will invalidate the slot. We are also
> > > > discussing to have this parameter (inactive_timeout) as GUC [1]. We
> > > > can have this new parameter both at the slot level and as well as a
> > > > GUC, or just one of those.
> > >
> > > replication_slot_inactivity_timeout?
> >
> > So, it seems you are okay to have this parameter both at slot level
> > and as a GUC.
>
> Well, I think a GUC is good to have regardless of the slot parameter,
> because the GUC can be used as an instance-wide protection against going
> out of disk space because of broken replication.  However, now that I
> think about it, I'm not really sure about invalidating a slot based on
> time rather than on disk space, for which we already have a parameter; what's
> your rationale for that?  The passage of time is not a very good
> measure, really, because the amount of WAL being protected has wildly
> varying production rate across time.
>

The inactive slot not only blocks WAL from being removed but prevents
the vacuum from proceeding. Also, there is a risk of transaction Id
wraparound. See email [1] for more context.

> I can only see a timeout being useful as a parameter if its default
> value is not the special disable value; say, the default timeout is 3
> days (to be more precise -- the period from Friday to Monday, that is,
> between DBA leaving the office one week until discovering a problem when
> he returns early next week).  This way we have a built-in mechanism that
> invalidates slots regardless of how big the WAL partition is.
>

We can have a default value for this parameter but it has the
potential to break the replication, so not sure what could be a good
default value.

>
> I'm less sure about the slot parameter; in what situation do you need to
> extend the life of one individual slot further than the life of all the
> other slots?

I was thinking of an idle slot scenario where a slot from one
particular subscriber (or output plugin) is inactive due to some
maintenance activity. But it should be okay to have a GUC for this for
now.

[1] - 
https://www.postgresql.org/message-id/20240325195443.GA2923888%40nathanxps13

-- 
With Regards,
Amit Kapila.




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-26 Thread shveta malik
On Tue, Mar 26, 2024 at 1:54 PM Bertrand Drouvot
 wrote:
>
> Hi,
>
> On Tue, Mar 26, 2024 at 01:37:21PM +0530, Amit Kapila wrote:
> > On Tue, Mar 26, 2024 at 1:15 PM Bertrand Drouvot
> >  wrote:
> > >
> > > 2 ===
> > >
> > > It looks like inactive_since is set to the current timestamp on the 
> > > standby
> > > each time the sync worker does a cycle:
> > >
> > > primary:
> > >
> > > postgres=# select slot_name,inactive_since from pg_replication_slots 
> > > where failover = 't';
> > >   slot_name  |inactive_since
> > > -+---
> > >  lsub27_slot | 2024-03-26 07:39:19.745517+00
> > >  lsub28_slot | 2024-03-26 07:40:24.953826+00
> > >
> > > standby:
> > >
> > > postgres=# select slot_name,inactive_since from pg_replication_slots 
> > > where failover = 't';
> > >   slot_name  |inactive_since
> > > -+---
> > >  lsub27_slot | 2024-03-26 07:43:56.387324+00
> > >  lsub28_slot | 2024-03-26 07:43:56.387338+00
> > >
> > > I don't think that should be the case.
> > >
> >
> > But why? This is exactly what we discussed in another thread where we
> > agreed to update inactive_since even for sync slots.
>
> Hum, I thought we agreed to "sync" it and to "update it to current time"
> only at promotion time.

I think there may have been some misunderstanding here. But now if I
rethink this, I am fine with 'inactive_since' getting synced from
primary to standby. But if we do that, we need to add docs stating
"inactive_since" represents primary's inactivity and not standby's
slots inactivity for synced slots. The reason for this clarification
is that the synced slot might be generated much later, yet
'inactive_since' is synced from the primary, potentially indicating a
time considerably earlier than when the synced slot was actually
created.

Another approach could be that "inactive_since" for synced slot
actually gives its own inactivity data rather than giving primary's
slot data. We update inactive_since on standby only at 3 occasions:
1) at the time of creation of the synced slot.
2) during standby restart.
3) during promotion of standby.

I have attached a sample patch for this idea as.txt file.

I am fine with any of these approaches.  One gives data synced from
primary for synced slots, while another gives actual inactivity data
of synced slots.

thanks
Shveta
From 7dcd0e95299263187eb1f03812f8321b2612ee5c Mon Sep 17 00:00:00 2001
From: Shveta Malik 
Date: Tue, 26 Mar 2024 14:42:25 +0530
Subject: [PATCH v1] inactive_since for synced slots.

inactive_since is updated for synced slots:
1) at the time of creation of slot.
2) during server restart.
3) during promotion.
---
 src/backend/replication/logical/slotsync.c |  1 +
 src/backend/replication/slot.c | 15 ---
 2 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/src/backend/replication/logical/slotsync.c 
b/src/backend/replication/logical/slotsync.c
index bbf9a2c485..6114895dca 100644
--- a/src/backend/replication/logical/slotsync.c
+++ b/src/backend/replication/logical/slotsync.c
@@ -628,6 +628,7 @@ synchronize_one_slot(RemoteSlot *remote_slot, Oid 
remote_dbid)
SpinLockAcquire(&slot->mutex);
slot->effective_catalog_xmin = xmin_horizon;
slot->data.catalog_xmin = xmin_horizon;
+   slot->inactive_since = GetCurrentTimestamp();
SpinLockRelease(&slot->mutex);
ReplicationSlotsComputeRequiredXmin(true);
LWLockRelease(ProcArrayLock);
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index d0a2f440ef..f2a57a14ec 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -628,7 +628,10 @@ retry:
 * now.
 */
SpinLockAcquire(&s->mutex);
-   s->inactive_since = 0;
+
+   if (!(RecoveryInProgress() && s->data.synced))
+   s->inactive_since = 0;
+
SpinLockRelease(&s->mutex);
 
if (am_walsender)
@@ -704,14 +707,20 @@ ReplicationSlotRelease(void)
 */
SpinLockAcquire(&slot->mutex);
slot->active_pid = 0;
-   slot->inactive_since = now;
+
+   if (!(RecoveryInProgress() && slot->data.synced))
+   slot->inactive_since = now;
+
SpinLockRelease(&slot->mutex);
ConditionVariableBroadcast(&slot->active_cv);
}
else
{
SpinLockAcquire(&slot->mutex);
-   slot->inactive_since = now;
+
+   if (!(RecoveryInProgress() && slot->data.synced))
+   slot->inactive_since = now;
+
SpinLockRelease(&slot->mutex);
}
 
-- 
2.34.1



Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-26 Thread Bertrand Drouvot
Hi,

On Tue, Mar 26, 2024 at 02:27:17PM +0530, Bharath Rupireddy wrote:
> Please use the v22 patch set.

Thanks!

1 ===

+reset_synced_slots_info(void)

I'm not sure "reset" is the right word, what about slot_sync_shutdown_update()?

2 ===

+   for (int i = 0; i < max_replication_slots; i++)
+   {
+   ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];
+
+   /* Check if it is a synchronized slot */
+   if (s->in_use && s->data.synced)
+   {
+   TimestampTz now;
+
+   Assert(SlotIsLogical(s));
+   Assert(s->active_pid == 0);
+
+   /*
+* Set the time since the slot has become inactive 
after shutting
+* down slot sync machinery. This helps correctly 
interpret the
+* time if the standby gets promoted without a restart. 
We get the
+* current time beforehand to avoid a system call while 
holding
+* the lock.
+*/
+   now = GetCurrentTimestamp();

What about moving "now = GetCurrentTimestamp()" outside of the for loop? (it
would be less costly and probably good enough).
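
I.e. something along these lines (a rough sketch; the loop body is abridged
from the quoted hunk, and the assignment under the spinlock is assumed from
the surrounding discussion):

	TimestampTz now;

	/* Get the current time once, before walking the slots. */
	now = GetCurrentTimestamp();

	for (int i = 0; i < max_replication_slots; i++)
	{
		ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];

		/* Check if it is a synchronized slot */
		if (s->in_use && s->data.synced)
		{
			Assert(SlotIsLogical(s));
			Assert(s->active_pid == 0);

			SpinLockAcquire(&s->mutex);
			s->inactive_since = now;
			SpinLockRelease(&s->mutex);
		}
	}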

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Add new error_action COPY ON_ERROR "log"

2024-03-26 Thread Bharath Rupireddy
On Tue, Mar 26, 2024 at 1:46 PM Masahiko Sawada  wrote:
>
> Thank you for updating the patch! Looks good to me.
>
> Please find the attached patch. I've made some changes for the
> documentation and the commit message. I'll push it, barring any
> objections.

Thanks. v12 patch LGTM.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: REVOKE FROM warning on grantor

2024-03-26 Thread Étienne BERSAC
Hi,

> ldap2pg really ought to issue REVOKE x FROM y GRANTED BY z. 

Thanks for this. I missed this notation and it is exactly what I need.

You could consider this subject as closed. Thanks for your time and
explanation.

Regards,
Étienne




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-26 Thread shveta malik
On Tue, Mar 26, 2024 at 2:27 PM Bharath Rupireddy
 wrote:
>
> >
> > 1)
> > Commti msg:
> >
> > ensures the value is set to current timestamp during the
> > shutdown to help correctly interpret the time if the standby gets
> > promoted without a restart.
> >
> > shutdown --> shutdown of slot sync worker   (as it was not clear if it
> > is instance shutdown or something else)
>
> Changed it to "shutdown of slot sync machinery" to be consistent with
> the comments.

Thanks for addressing the comments. Just to give more clarity here (so
that you can take an informed decision), I am not sure we actually shut
down the slot-sync machinery; we only shut down the slot sync worker.
The slot-sync machinery can still be used via the
'pg_sync_replication_slots' SQL function. I can easily reproduce a
scenario where the SQL function and reset_synced_slots_info() run
in parallel and the latter hits 'Assert(s->active_pid == 0)' because
the concurrent SQL sync function is active on that slot.

thanks
Shveta




Re: pgsql: Track last_inactive_time in pg_replication_slots.

2024-03-26 Thread Bertrand Drouvot
Hi,

On Tue, Mar 26, 2024 at 01:45:23PM +0530, Amit Kapila wrote:
> On Tue, Mar 26, 2024 at 1:09 PM Alvaro Herrera  
> wrote:
> >
> > On 2024-Mar-26, Amit Kapila wrote:
> >
> > > We have a consensus on inactive_since, so I'll make that change.
> >
> > Sounds reasonable.  So this is a timestamptz if the slot is inactive,
> > NULL if active, right?
> >
> 
> Yes.
> 
> >  What value is it going to have for sync slots?
> >
> 
> The behavior will be the same for non-sync slots. In each sync cycle,
> we acquire/release the sync slots. So at the time of release,
> inactive_since will be updated. See email [1].

I don't think we should set inactive_since to the current time at each sync 
cycle,
see [1] as to why. What do you think?

[1]: 
https://www.postgresql.org/message-id/ZgKGIDC5lttWTdJH%40ip-10-97-1-34.eu-west-3.compute.internal

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Catalog domain not-null constraints

2024-03-26 Thread Dean Rasheed
On Tue, 26 Mar 2024 at 07:30, Alvaro Herrera  wrote:
>
> On 2024-Mar-25, Dean Rasheed wrote:
>
> > Also (not this patch's fault), psql doesn't seem to offer a way to
> > display domain constraint names -- something you need to know to drop
> > or alter them. Perhaps \dD+ could be made to do that?
>
> Ooh, I remember we had offered a patch for \d++ to display these
> constraint names for tables, but didn't get around to gather consensus
> for it.  We did gather consensus on *not* wanting \d+ to display them,
> but we need *something*.  I suppose we should do something symmetrical
> for tables and domains.  How about \dD++ and \dt++?
>

Personally, I quite like the fact that \d+ displays NOT NULL
constraints, because it puts them on an equal footing with CHECK
constraints. However, I can appreciate that it will significantly
increase the length of the output in some cases.

With \dD it's not so nice because of the way it puts all the details
on one line. The obvious output might look something like this:

\dD
   List of domains
 Schema | Name |  Type   | Collation | Nullable | Default |  Check
+--+-+---+--+-+---
 public | d1   | integer |   | NOT NULL | | CHECK (VALUE > 0)

\dD+
List of domains
 Schema | Name |  Type   | Collation | Nullable
| Default |Check  | Access privileges
| Description
+--+-+---+-+-+---+---+-
 public | d1   | integer |   | CONSTRAINT d1_not_null NOT NULL
| | CONSTRAINT d1_check CHECK (VALUE > 0) |
|

So you'd need quite a wide window to easily view it (or use \x). I
suppose the width could be reduced by dropping the word "CONSTRAINT"
in the \dD+ case, but it's probably still going to be wider than the
average window.

Regards,
Dean




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-26 Thread Bharath Rupireddy
On Tue, Mar 26, 2024 at 11:26 AM Amit Kapila  wrote:
>
> Review comments on v18_0002 and v18_0005
> ===
>
> 1.
> We have decided to update inactive_since for temporary slots. So,
> unless there is some reason, we should allow inactive_timeout to also
> be set for temporary slots.

WFM. A temporary slot that stays inactive for a long time before the
server is ever shut down can utilize this inactive_timeout based
invalidation mechanism. And I'd also vote for being consistent between
temporary and synced slots.

>  L.last_inactive_time,
> +L.inactive_timeout,
>
> Shall we keep inactive_timeout before
> last_inactive_time/inactive_since? I don't have any strong reason to
> propose that way apart from that the former is provided by the user.

Done.

> + if (InvalidateReplicationSlotForInactiveTimeout(slot, false, true, true))
> + invalidated = true;
>
> I don't think we should try to invalidate the slots in
> pg_get_replication_slots. This function's purpose is to get the
> current information on slots and has no intention to perform any work
> for slots. Any error due to invalidation won't be what the user would
> be expecting here.

Agree. Removed.

> 4.
> +static bool
> +InvalidateSlotForInactiveTimeout(ReplicationSlot *slot,
> + bool need_control_lock,
> + bool need_mutex)
> {
> ...
> ...
> + if (need_control_lock)
> + LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
> +
> + Assert(LWLockHeldByMeInMode(ReplicationSlotControlLock, LW_SHARED));
> +
> + /*
> + * Check if the slot needs to be invalidated due to inactive_timeout. We
> + * do this with the spinlock held to avoid race conditions -- for example
> + * the restart_lsn could move forward, or the slot could be dropped.
> + */
> + if (need_mutex)
> + SpinLockAcquire(&slot->mutex);
> ...
>
> I find this combination of parameters a bit strange. Because, say if
> need_mutex is false and need_control_lock is true then that means this
> function will acquire LWlock after acquiring spinlock which is
> unacceptable. Now, this may not happen in practice as the callers
> won't pass such a combination but still, this functionality should be
> improved.

Right. Either we need both locks or neither. So, I changed it to use just
one bool, need_locks; when it is set, both the control lock and the
spinlock are acquired and released.
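
For reference, a minimal sketch of that single-flag shape (function and
parameter names taken from the hunks quoted above; the actual timeout check
is elided):

static bool
InvalidateSlotForInactiveTimeout(ReplicationSlot *slot, bool need_locks)
{
	bool		invalidated = false;

	if (need_locks)
	{
		LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
		SpinLockAcquire(&slot->mutex);
	}

	Assert(LWLockHeldByMeInMode(ReplicationSlotControlLock, LW_SHARED));

	/* ... inactive_timeout check happens here, with the spinlock held ... */

	if (need_locks)
	{
		SpinLockRelease(&slot->mutex);
		LWLockRelease(ReplicationSlotControlLock);
	}

	return invalidated;
}

That way the LWLock is always taken before the spinlock, which avoids the
ordering problem pointed out above.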

On Mon, Mar 25, 2024 at 10:33 AM shveta malik  wrote:
>
> patch 002:
>
> 2)
> slotsync.c:
>
>   ReplicationSlotCreate(remote_slot->name, true, RS_TEMPORARY,
> remote_slot->two_phase,
> remote_slot->failover,
> -   true);
> +   true, 0);
>
> + slot->data.inactive_timeout = remote_slot->inactive_timeout;
>
> Is there a reason we are not passing 'remote_slot->inactive_timeout'
> to ReplicationSlotCreate() directly?

The slot there gets created as temporary, for which we previously did not
support setting inactive_timeout. But the latest v22 patch supports it, so
we now pass remote_slot->inactive_timeout directly.

> 3)
> slotfuncs.c
> pg_create_logical_replication_slot():
> + int inactive_timeout = PG_GETARG_INT32(5);
>
> Can we mention here that timeout is in seconds either in comment or
> rename variable to inactive_timeout_secs?
>
> Please do this for create_physical_replication_slot(),
> create_logical_replication_slot(),
> pg_create_physical_replication_slot() as well.

Added /* in seconds */ next to the variable declaration.

> -
> 4)
> + int inactive_timeout; /* The amount of time in seconds the slot
> + * is allowed to be inactive. */
>  } LogicalSlotInfo;
>
>  Do we need to mention "before getting invalidated" like in other places
> (in the last patch)?

Done.

>  5)
> Same at these two places. "before getting invalidated" to be added in
> the last patch, otherwise the info is incomplete.
>
> +
> + /* The amount of time in seconds the slot is allowed to be inactive */
> + int inactive_timeout;
>  } ReplicationSlotPersistentData;
>
>
> + * inactive_timeout: The amount of time in seconds the slot is allowed to be
> + * inactive.
>   */
>  void
>  ReplicationSlotCreate(const char *name, bool db_specific,
>  Same here. "before getting invalidated" ?

Done.

On Tue, Mar 26, 2024 at 12:04 PM shveta malik  wrote:
>
> > Please find the attached v21 patch implementing the above idea. It
> > also has changes for renaming last_inactive_time to inactive_since.
>
> Thanks for the patch. I have tested this patch alone, and it does what
> it says. One additional thing which I noticed is that now it sets
> inactive_since for temp slots as well, but that idea looks fine to me.

Right. Let's be consistent by treating all slots the same.

> I could not test 'invalidation on promotion bug' with this change, as
> that needed rebasing of the rest of the patches.

Please use the v22 patch set.

> Few trivial things:
>
> 1)
> Commti msg:
>
> ensures the value is set to current timestamp during the
> shutdown to help correctly interpret the time if the standby gets
> promoted without a restart.
>
> shutdown --> 

Re: pgsql: Track last_inactive_time in pg_replication_slots.

2024-03-26 Thread Alvaro Herrera
On 2024-Mar-26, Amit Kapila wrote:

> On Tue, Mar 26, 2024 at 1:09 PM Alvaro Herrera  
> wrote:
> > On 2024-Mar-26, Amit Kapila wrote:
> > > I would also like to solicit your opinion on the other slot-level
> > > parameter we are planning to introduce.  This new slot-level parameter
> > > will be named as inactive_timeout.
> >
> > Maybe inactivity_timeout?
> >
> > > This will indicate that once the slot is inactive for the
> > > inactive_timeout period, we will invalidate the slot. We are also
> > > discussing to have this parameter (inactive_timeout) as GUC [1]. We
> > > can have this new parameter both at the slot level and as well as a
> > > GUC, or just one of those.
> >
> > replication_slot_inactivity_timeout?
> 
> So, it seems you are okay to have this parameter both at slot level
> and as a GUC.

Well, I think a GUC is good to have regardless of the slot parameter,
because the GUC can be used as an instance-wide protection against going
out of disk space because of broken replication.  However, now that I
think about it, I'm not really sure about invalidating a slot based on
time rather than on disk space, for which we already have a parameter; what's
your rationale for that?  The passage of time is not a very good
measure, really, because the amount of WAL being protected has wildly
varying production rate across time.

I can only see a timeout being useful as a parameter if its default
value is not the special disable value; say, the default timeout is 3
days (to be more precise -- the period from Friday to Monday, that is,
between DBA leaving the office one week until discovering a problem when
he returns early next week).  This way we have a built-in mechanism that
invalidates slots regardless of how big the WAL partition is.


I'm less sure about the slot parameter; in what situation do you need to
extend the life of one individual slot further than the life of all the
other slots?  (Of course, it makes no sense to set the per-slot param to
a shorter period than the GUC: invalidating one slot ahead of the others
is completely pointless.)

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
Bob [Floyd] used to say that he was planning to get a Ph.D. by the "green
stamp method," namely by saving envelopes addressed to him as 'Dr. Floyd'.
After collecting 500 such letters, he mused, a university somewhere in
Arizona would probably grant him a degree.  (Don Knuth)



