Re: Functions to return random numbers in a given range

2024-03-25 Thread Dean Rasheed
On Tue, 27 Feb 2024 at 17:33, Dean Rasheed  wrote:
>
> On Sat, 24 Feb 2024 at 17:10, Tomas Vondra wrote:
> >
> > I did a quick review and a little bit of testing on the patch today. I
> > think it's a good/useful idea, and I think the code is ready to go (the
> > code is certainly much cleaner than anything I'd written ...).
>

Based on the reviews so far, I think this is ready for commit, so
unless anyone objects, I will do so in a day or so.

As a quick summary, this adds a new file:

src/backend/utils/adt/pseudorandomfuncs.c

which contains SQL-callable functions that access a single shared
pseudorandom number generator, whose state is private to that file.
Currently the functions are:

  random() returns double precision [moved from float.c]
  random(min integer, max integer) returns integer  [new]
  random(min bigint, max bigint) returns bigint [new]
  random(min numeric, max numeric) returns numeric  [new]
  random_normal() returns double precision  [moved from float.c]
  setseed(seed double precision) returns void   [moved from float.c]

It's possible that functions to return other random distributions or
other datatypes might get added in the future, but I have no plans to
do so at the moment.
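
For readers skimming the thread, here is a quick illustration of the new
range variants (a sketch only; the values shown in the comments are just
examples, since the output is pseudorandom):

  -- integer and bigint variants return a value in [min, max]
  SELECT random(1, 6) AS die_roll;             -- e.g. 4
  SELECT random(1::bigint, 1000000000000);     -- e.g. 736421998345

  -- numeric variant
  SELECT random(0.00, 99.99);                  -- e.g. 42.17

  -- the shared PRNG can still be seeded as before
  SELECT setseed(0.5);
  SELECT random(1, 100);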

Regards,
Dean




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-25 Thread shveta malik
On Tue, Mar 26, 2024 at 11:08 AM Bharath Rupireddy
 wrote:
>
> On Tue, Mar 26, 2024 at 9:30 AM shveta malik  wrote:
> >
> > On Mon, Mar 25, 2024 at 12:43 PM shveta malik  
> > wrote:
> > >
> > > I have one concern, for synced slots on standby, how do we disallow
> > > invalidation due to inactive-timeout immediately after promotion?
> > >
> > > For synced slots, last_inactive_time and inactive_timeout are both
> > > set. Let's say I bring down primary for promotion of standby and then
> > > promote standby, there are chances that it may end up invalidating
> > > synced slots (considering standby is not brought down during promotion
> > > and thus inactive_timeout may already be past 'last_inactive_time').
> > >
> >
> > On standby, if we decide to maintain valid last_inactive_time for
> > synced slots, then invalidation is correctly restricted in
> > InvalidateSlotForInactiveTimeout() for synced slots using the check:
> >
> > if (RecoveryInProgress() && slot->data.synced)
> > return false;
> >
> > But immediately after promotion, we can not rely on the above check
> > and thus possibility of synced slots invalidation is there. To
> > maintain consistent behavior regarding the setting of
> > last_inactive_time for synced slots, similar to user slots, one
> > potential solution to prevent this invalidation issue is to update the
> > last_inactive_time of all synced slots within the ShutDownSlotSync()
> > function during FinishWalRecovery(). This approach ensures that
> > promotion doesn't immediately invalidate slots, and henceforth, we
> > possess a correct last_inactive_time as a basis for invalidation going
> > forward. This will be equivalent to updating last_inactive_time during
> > restart (but without actual restart during promotion).
> > The plus point of maintaining last_inactive_time for synced slots
> > could be, this can provide data to the user on when last time the sync
> > was attempted on that particular slot by background slot sync worker
> > or SQL function. Thoughts?
>
> Please find the attached v21 patch implementing the above idea. It
> also has changes for renaming last_inactive_time to inactive_since.
>

Thanks for the patch. I have tested this patch alone, and it does what
it says. One additional thing which I noticed is that now it sets
inactive_since for temp slots as well, but that idea looks fine to me.

I could not test 'invalidation on promotion bug' with this change, as
that needed rebasing of the rest of the patches.

Few trivial things:

1)
Commit msg:

ensures the value is set to current timestamp during the
shutdown to help correctly interpret the time if the standby gets
promoted without a restart.

shutdown --> shutdown of slot sync worker   (as it was not clear if it
is instance shutdown or something else)

2)
'The time since the slot has became inactive'.

has became-->has become
or just became

Please check it in all the files. There are multiple places.

thanks
Shveta




Re: Propagate pathkeys from CTEs up to the outer query

2024-03-25 Thread Richard Guo
On Tue, Mar 26, 2024 at 1:39 AM Tom Lane  wrote:

> I got around to looking at this finally.  I was a bit surprised by
> your choice of data structure.  You made a per-CTE-item cte_paths
> list paralleling cte_plan_ids, but what I had had in mind was a
> per-subplan list of paths paralleling glob->subplans and subroots.
> This would mean that the code for ordinary SubqueryScans would
> also need to fill in that list, but surely that's a trivial cost
> compared to everything else we do to prepare a subplan.  I don't
> think that we have any immediate need to remember that info for
> an ordinary SubqueryScan, but it seems plausible that we will
> in future.  Also, I'm not sure that a Path is fully interpretable
> without the associated PlannerInfo (subroot), so keeping it
> beside the list of subroots seems more future-proof than dissociating
> it from that.  This approach would also be more amenable to postponing
> creation of the subplans, as we speculated about earlier.  (I have
> no near-term desire to actually do that, but maybe someday it will
> happen.)


I agree with your points.  Previously I was thinking that CTEs were the
only scenario where we needed to remember the best path and only
required the best path's pathkeys.  However, considering potential
future use cases as you mentioned, I concur that having a per-subplan
list of paths would be more future-proof.  Please see attached v4 patch.

Thanks
Richard


v4-0001-Propagate-pathkeys-from-CTEs-up-to-the-outer-query.patch
Description: Binary data


Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-25 Thread shveta malik
On Tue, Mar 26, 2024 at 11:36 AM Bertrand Drouvot
 wrote:
> >
> > The issue that I can see with your proposal is: what if one synced the slots
> > manually (with pg_sync_replication_slots()) but does not use the sync 
> > worker?
> > Then I think ShutDownSlotSync() is not going to help in that case.
>
> It looks like ShutDownSlotSync() is always called (even if 
> sync_replication_slots = off),
> so that sounds ok to me (I should have checked the code, I was under the 
> impression
> ShutDownSlotSync() was not called if sync_replication_slots = off).

Right, it is called irrespective of sync_replication_slots.

thanks
Shveta




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-25 Thread Bertrand Drouvot
Hi,

On Tue, Mar 26, 2024 at 05:55:11AM +, Bertrand Drouvot wrote:
> Hi,
> 
> On Tue, Mar 26, 2024 at 09:30:32AM +0530, shveta malik wrote:
> > On Mon, Mar 25, 2024 at 12:43 PM shveta malik  
> > wrote:
> > >
> > > I have one concern, for synced slots on standby, how do we disallow
> > > invalidation due to inactive-timeout immediately after promotion?
> > >
> > > For synced slots, last_inactive_time and inactive_timeout are both
> > > set. Let's say I bring down primary for promotion of standby and then
> > > promote standby, there are chances that it may end up invalidating
> > > synced slots (considering standby is not brought down during promotion
> > > and thus inactive_timeout may already be past 'last_inactive_time').
> > >
> > 
> > On standby, if we decide to maintain valid last_inactive_time for
> > synced slots, then invalidation is correctly restricted in
> > InvalidateSlotForInactiveTimeout() for synced slots using the check:
> > 
> > if (RecoveryInProgress() && slot->data.synced)
> > return false;
> 
> Right.
> 
> > But immediately after promotion, we can not rely on the above check
> > and thus possibility of synced slots invalidation is there. To
> > maintain consistent behavior regarding the setting of
> > last_inactive_time for synced slots, similar to user slots, one
> > potential solution to prevent this invalidation issue is to update the
> > last_inactive_time of all synced slots within the ShutDownSlotSync()
> > function during FinishWalRecovery(). This approach ensures that
> > promotion doesn't immediately invalidate slots, and henceforth, we
> > possess a correct last_inactive_time as a basis for invalidation going
> > forward. This will be equivalent to updating last_inactive_time during
> > restart (but without actual restart during promotion).
> > The plus point of maintaining last_inactive_time for synced slots
> > could be, this can provide data to the user on when last time the sync
> > was attempted on that particular slot by background slot sync worker
> > or SQL function. Thoughts?
> 
> Yeah, another plus point is that if the primary is down then one could look
> at the synced "active_since" on the standby to get an idea of it (depends of 
> the
> last sync though).
> 
> The issue that I can see with your proposal is: what if one synced the slots
> manually (with pg_sync_replication_slots()) but does not use the sync worker?
> Then I think ShutDownSlotSync() is not going to help in that case.

It looks like ShutDownSlotSync() is always called (even if 
sync_replication_slots = off),
so that sounds ok to me (I should have checked the code, I was under the 
impression
ShutDownSlotSync() was not called if sync_replication_slots = off).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Add new error_action COPY ON_ERROR "log"

2024-03-25 Thread Bharath Rupireddy
On Tue, Mar 26, 2024 at 9:56 AM Masahiko Sawada  wrote:
>
> > > errmsg("data type incompatibility at line %llu for column %s: \"%s\"",
>
> > > I guess it would be better to make the log message clearer to convey
> > > what we did for the malformed row. For example, how about something
> > > like "skipping row due to data type incompatibility at line %llu for
> > > column %s: \"s\""?
> >
> > The summary message which gets printed at the end says that "NOTICE:
> > 6 rows were skipped due to data type incompatibility". Isn't this
> > enough? If someone is using ON_ERROR 'ignore', it's quite natural that
> > such rows get skipped softly and the summary message can help them,
> > no?
>
> I think that in the main log message we should mention what happened
> (or is happening) or what we did (or are doing). If the message "data
> type incompatibility ..." was in the DETAIL message with the main
> message saying something like "skipping row at line %llu for column
> %s: ...", it would make sense to me. But the current message seems not
> to be clear to me and consistent with other NOTICE messages. Also, the
> last summary line would not be written if the user cancelled, and
> someone other than person who used ON_ERROR 'ignore' might check the
> server logs later.

Agree. I changed the NOTICE message to what you've suggested. Thanks.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From da8c02dace865ea9b02f19968056f25069d8aa91 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Tue, 26 Mar 2024 05:51:55 +
Subject: [PATCH v11] Add detailed info when COPY skips soft errors

This commit emits individual info like the line number and column name
when COPY skips soft errors, because the summary containing the total
number of rows skipped isn't enough for users to know exactly which
rows in the input data were malformed.

This commit also adds a new option LOG_VERBOSITY to control the
verbosity of logged messages when the COPY command skips soft errors.
This option, if required, can also be extended to control other
COPY-related log messages. A value of 'verbose' can be used to emit
more informative messages by the command, while the value of
'default' (which is the default) can be used to not log any
additional messages. More values such as 'terse', 'row_details',
etc. can be added to the LOG_VERBOSITY option as needed.
To see the individual info added by this commit when COPY skips
soft errors, one needs to set LOG_VERBOSITY to 'verbose'.

Author: Bharath Rupireddy
Reviewed-by: Michael Paquier, Masahiko Sawada
Reviewed-by: Atsushi Torikoshi
Discussion: https://www.postgresql.org/message-id/CALj2ACUk700cYhx1ATRQyRw-fBM%2BaRo6auRAitKGff7XNmYfqQ%40mail.gmail.com
---
 doc/src/sgml/ref/copy.sgml   | 26 --
 src/backend/commands/copy.c  | 38 ++
 src/backend/commands/copyfrom.c  | 10 +++
 src/backend/commands/copyfromparse.c | 35 
 src/bin/psql/tab-complete.c  |  6 +++-
 src/include/commands/copy.h  | 11 
 src/test/regress/expected/copy2.out  | 41 +++-
 src/test/regress/sql/copy2.sql   | 24 +++-
 src/tools/pgindent/typedefs.list |  1 +
 9 files changed, 181 insertions(+), 11 deletions(-)

diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml
index 6c83e30ed0..ecbbf5f94a 100644
--- a/doc/src/sgml/ref/copy.sgml
+++ b/doc/src/sgml/ref/copy.sgml
@@ -45,6 +45,7 @@ COPY { table_name [ ( column_name [, ...] ) | * }
 ON_ERROR 'error_action'
 ENCODING 'encoding_name'
+LOG_VERBOSITY [ mode ]
 
  
 
@@ -400,8 +401,12 @@ COPY { table_name [ ( FORMAT is text or csv.
  
  
-  A NOTICE message containing the ignored row count is emitted at the end
-  of the COPY FROM if at least one row was discarded.
+  A NOTICE message containing the ignored row count is
+  emitted at the end of the COPY FROM if at least one
+  row was discarded. When LOG_VERBOSITY option is set to
+  verbose, a NOTICE message
+  containing the line of the input file and the column name whose input
+  conversion has failed is emitted for each discarded row.
  
 

@@ -418,6 +423,23 @@ COPY { table_name [ ( 

 
+   
+LOG_VERBOSITY
+
+ 
+  Sets the verbosity of some of the messages logged by a
+  COPY command.
+  A mode value of
+  verbose can be used to emit more informative messages.
+  default will not log any additional messages.
+ 
+ 
+  This is currently used in COPY FROM command when
+  ON_ERROR is set to ignore.
+  
+
+   
+

 WHERE
 
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 28cf8b040a..67d5c3f7d0 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -422,6 +422,36 @@ defGetCopyOnErrorChoice(DefElem *def, ParseState *pstate, bool is_f

Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-25 Thread Amit Kapila
On Sun, Mar 24, 2024 at 3:05 PM Bharath Rupireddy
 wrote:
>
> I've attached the v18 patch set here. I've also addressed earlier
> review comments from Amit, Ajin Cherian. Note that I've added new
> invalidation mechanism tests in a separate TAP test file just because
> I don't want to clutter or bloat any of the existing files and spread
> tests for physical slots and logical slots into separate existing TAP
> files.
>

Review comments on v18_0002 and v18_0005
===
1.
 ReplicationSlotCreate(const char *name, bool db_specific,
ReplicationSlotPersistency persistency,
-   bool two_phase, bool failover, bool synced)
+   bool two_phase, bool failover, bool synced,
+   int inactive_timeout)
 {
  ReplicationSlot *slot = NULL;
  int i;
@@ -345,6 +348,18 @@ ReplicationSlotCreate(const char *name, bool db_specific,
  errmsg("cannot enable failover for a temporary replication slot"));
  }

+ if (inactive_timeout > 0)
+ {
+ /*
+ * Do not allow users to set inactive_timeout for temporary slots,
+ * because temporary slots will not be saved to the disk.
+ */
+ if (persistency == RS_TEMPORARY)
+ ereport(ERROR,
+ errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot set inactive_timeout for a temporary replication slot"));
+ }

We have decided to update inactive_since for temporary slots. So,
unless there is some reason, we should allow inactive_timeout to also
be set for temporary slots.

2.
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1024,6 +1024,7 @@ CREATE VIEW pg_replication_slots AS
 L.safe_wal_size,
 L.two_phase,
 L.last_inactive_time,
+L.inactive_timeout,

Shall we keep inactive_timeout before
last_inactive_time/inactive_since? I don't have any strong reason to
propose that way apart from that the former is provided by the user.

3.
@@ -287,6 +288,13 @@ pg_get_replication_slots(PG_FUNCTION_ARGS)
  slot_contents = *slot;
  SpinLockRelease(&slot->mutex);

+ /*
+ * Here's an opportunity to invalidate inactive replication slots
+ * based on timeout, so let's do it.
+ */
+ if (InvalidateReplicationSlotForInactiveTimeout(slot, false, true, true))
+ invalidated = true;

I don't think we should try to invalidate the slots in
pg_get_replication_slots. This function's purpose is to get the
current information on slots and has no intention to perform any work
for slots. Any error due to invalidation won't be what the user would
be expecting here.

4.
+static bool
+InvalidateSlotForInactiveTimeout(ReplicationSlot *slot,
+ bool need_control_lock,
+ bool need_mutex)
{
...
...
+ if (need_control_lock)
+ LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+
+ Assert(LWLockHeldByMeInMode(ReplicationSlotControlLock, LW_SHARED));
+
+ /*
+ * Check if the slot needs to be invalidated due to inactive_timeout. We
+ * do this with the spinlock held to avoid race conditions -- for example
+ * the restart_lsn could move forward, or the slot could be dropped.
+ */
+ if (need_mutex)
+ SpinLockAcquire(&slot->mutex);
...

I find this combination of parameters a bit strange. Because, say if
need_mutex is false and need_control_lock is true then that means this
function will acquire LWlock after acquiring spinlock which is
unacceptable. Now, this may not happen in practice as the callers
won't pass such a combination but still, this functionality should be
improved.

-- 
With Regards,
Amit Kapila.




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-25 Thread Bertrand Drouvot
Hi,

On Tue, Mar 26, 2024 at 09:30:32AM +0530, shveta malik wrote:
> On Mon, Mar 25, 2024 at 12:43 PM shveta malik  wrote:
> >
> > I have one concern, for synced slots on standby, how do we disallow
> > invalidation due to inactive-timeout immediately after promotion?
> >
> > For synced slots, last_inactive_time and inactive_timeout are both
> > set. Let's say I bring down primary for promotion of standby and then
> > promote standby, there are chances that it may end up invalidating
> > synced slots (considering standby is not brought down during promotion
> > and thus inactive_timeout may already be past 'last_inactive_time').
> >
> 
> On standby, if we decide to maintain valid last_inactive_time for
> synced slots, then invalidation is correctly restricted in
> InvalidateSlotForInactiveTimeout() for synced slots using the check:
> 
> if (RecoveryInProgress() && slot->data.synced)
> return false;

Right.

> But immediately after promotion, we can not rely on the above check
> and thus possibility of synced slots invalidation is there. To
> maintain consistent behavior regarding the setting of
> last_inactive_time for synced slots, similar to user slots, one
> potential solution to prevent this invalidation issue is to update the
> last_inactive_time of all synced slots within the ShutDownSlotSync()
> function during FinishWalRecovery(). This approach ensures that
> promotion doesn't immediately invalidate slots, and henceforth, we
> possess a correct last_inactive_time as a basis for invalidation going
> forward. This will be equivalent to updating last_inactive_time during
> restart (but without actual restart during promotion).
> The plus point of maintaining last_inactive_time for synced slots
> could be, this can provide data to the user on when last time the sync
> was attempted on that particular slot by background slot sync worker
> or SQL function. Thoughts?

Yeah, another plus point is that if the primary is down then one could look
at the synced "active_since" on the standby to get an idea of it (depends of the
last sync though).

The issue that I can see with your proposal is: what if one synced the slots
manually (with pg_sync_replication_slots()) but does not use the sync worker?
Then I think ShutDownSlotSync() is not going to help in that case.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: pgsql: Track last_inactive_time in pg_replication_slots.

2024-03-25 Thread Bharath Rupireddy
On Tue, Mar 26, 2024 at 9:38 AM shveta malik  wrote:
>
> On Mon, Mar 25, 2024 at 9:54 PM Bertrand Drouvot
>  wrote:
> >
> > Hi,
> >
> > On Mon, Mar 25, 2024 at 07:32:11PM +0530, Amit Kapila wrote:
> > > On Mon, Mar 25, 2024 at 6:57 PM Robert Haas  wrote:
> > > > And I'm suspicious that having an exception for slots being synced is
> > > > a bad idea. That makes too much of a judgement about how the user will
> > > > use this field. It's usually better to just expose the data, and if
> > > > the user needs help to make sense of that data, then give them that
> > > > help separately.
> > >
> > > The reason we didn't set this for sync slots is that they won't be
> > > usable (one can't use them to decode WAL) unless standby is promoted
> > > [2]. But I see your point as well. So, I have copied the others
> > > involved in this discussion to see what they think.
> >
> > Yeah I also see Robert's point. If we also sync the "last inactive time" 
> > field then
> > we would need to take care of the corner case mentioned by Shveta in [1] 
> > during
> > promotion.
>
> I have suggested one potential solution for that in [1]. Please have a look.
>
> [1]: 
> https://www.postgresql.org/message-id/CAJpy0uB-yE%2BRiw7JQ4hW0%2BigJxvPc%2Brq%2B9c7WyTa1Jz7%2B2gAiA%40mail.gmail.com

I posted the v21 patch implementing the above idea in the other thread
- 
https://www.postgresql.org/message-id/CALj2ACXRFx9g7A9RFJZF7eBe%3Dzxk7%3DapMRFuCgJJKYB7O%3Dvgwg%40mail.gmail.com.
For ease, I'm also attaching the patch in here.

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


v21-0001-Fix-review-comments-for-slot-s-last_inactive_tim.patch
Description: Binary data


Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-25 Thread Bharath Rupireddy
On Tue, Mar 26, 2024 at 9:30 AM shveta malik  wrote:
>
> On Mon, Mar 25, 2024 at 12:43 PM shveta malik  wrote:
> >
> > I have one concern, for synced slots on standby, how do we disallow
> > invalidation due to inactive-timeout immediately after promotion?
> >
> > For synced slots, last_inactive_time and inactive_timeout are both
> > set. Let's say I bring down primary for promotion of standby and then
> > promote standby, there are chances that it may end up invalidating
> > synced slots (considering standby is not brought down during promotion
> > and thus inactive_timeout may already be past 'last_inactive_time').
> >
>
> On standby, if we decide to maintain valid last_inactive_time for
> synced slots, then invalidation is correctly restricted in
> InvalidateSlotForInactiveTimeout() for synced slots using the check:
>
> if (RecoveryInProgress() && slot->data.synced)
> return false;
>
> But immediately after promotion, we can not rely on the above check
> and thus possibility of synced slots invalidation is there. To
> maintain consistent behavior regarding the setting of
> last_inactive_time for synced slots, similar to user slots, one
> potential solution to prevent this invalidation issue is to update the
> last_inactive_time of all synced slots within the ShutDownSlotSync()
> function during FinishWalRecovery(). This approach ensures that
> promotion doesn't immediately invalidate slots, and henceforth, we
> possess a correct last_inactive_time as a basis for invalidation going
> forward. This will be equivalent to updating last_inactive_time during
> restart (but without actual restart during promotion).
> The plus point of maintaining last_inactive_time for synced slots
> could be, this can provide data to the user on when last time the sync
> was attempted on that particular slot by background slot sync worker
> or SQL function. Thoughts?

Please find the attached v21 patch implementing the above idea. It
also has changes for renaming last_inactive_time to inactive_since.

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


v21-0001-Fix-review-comments-for-slot-s-last_inactive_tim.patch
Description: Binary data


Re: altering a column's collation leaves an invalid foreign key

2024-03-25 Thread jian he
On Mon, Mar 25, 2024 at 2:47 PM Paul Jungwirth
 wrote:
>
> On 3/23/24 10:04, Paul Jungwirth wrote:
> > Perhaps if the previous collation was nondeterministic we should force a 
> > re-check.
>
> Here is a patch implementing this. It was a bit more fuss than I expected, so 
> maybe someone has a
> better way.
>

+ /* test follows the one in ri_FetchConstraintInfo() */
+ if (ARR_NDIM(arr) != 1 ||
+ ARR_HASNULL(arr) ||
+ ARR_ELEMTYPE(arr) != INT2OID)
+ elog(ERROR, "conkey is not a 1-D smallint array");
+ attarr = (AttrNumber *) ARR_DATA_PTR(arr);
+
+ /* stash a List of the collation Oids in our Constraint node */
+ for (i = 0; i < numkeys; i++)
+ con->old_collations = lappend_oid(con->old_collations,
+  list_nth_oid(changedCollationOids, attarr[i] - 1));

I don't understand the "ri_FetchConstraintInfo" comment.


+static void
+RememberCollationForRebuilding(AttrNumber attnum, AlteredTableInfo *tab)
+{
+ Oid typid;
+ int32 typmod;
+ Oid collid;
+ ListCell   *lc;
+
+ /* Fill in the list with InvalidOid if this is our first visit */
+ if (tab->changedCollationOids == NIL)
+ {
+ int len = RelationGetNumberOfAttributes(tab->rel);
+ int i;
+
+ for (i = 0; i < len; i++)
+ tab->changedCollationOids = lappend_oid(tab->changedCollationOids,
+ InvalidOid);
+ }
+
+ get_atttypetypmodcoll(RelationGetRelid(tab->rel), attnum,
+  &typid, &typmod, &collid);
+
+ lc = list_nth_cell(tab->changedCollationOids, attnum - 1);
+ lfirst_oid(lc) = collid;
+}

do we need to check if `collid` is a valid collation?
like:

if (!OidIsValid(collid))
{
lc = list_nth_cell(tab->changedCollationOids, attnum - 1);
lfirst_oid(lc) = collid;
}




Re: Recent 027_streaming_regress.pl hangs

2024-03-25 Thread Tom Lane
Andres Freund  writes:
> On 2024-03-26 00:00:38 -0400, Tom Lane wrote:
>> Are you sure it's not just that the total time to run the core
>> regression tests has grown to a bit more than what the test timeout
>> allows for?

> You're right, that could be it - in a way at least, the issue is replay not
> catching up within 180s, so it'd have to be the data volume growing, I think.
> But it doesn't look like the regression volume meaningfully grew around that
> time?

No, but my impression is that the failure rate has been getting slowly
worse for awhile now.

> I guess I'll try to write a buildfarm database query to extract how long that
> phase of the test took from all runs on my menagerie, not just the failing
> one, and see if there's a visible trend.

+1

regards, tom lane




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-25 Thread Amit Kapila
On Tue, Mar 26, 2024 at 1:24 AM Nathan Bossart  wrote:
>
>
> On Sun, Mar 24, 2024 at 03:05:44PM +0530, Bharath Rupireddy wrote:
> > This commit particularly lets one specify the inactive_timeout for
> > a slot via SQL functions pg_create_physical_replication_slot and
> > pg_create_logical_replication_slot.
>
> Off-list, Bharath brought to my attention that the current proposal was to
> set the timeout at the slot level.  While I think that is an entirely
> reasonable thing to support, the main use-case I have in mind for this
> feature is for an administrator that wants to prevent inactive slots from
> causing problems (e.g., transaction ID wraparound) on a server or a number
> of servers.  For that use-case, I think a GUC would be much more
> convenient.  Perhaps there could be a default inactive slot timeout GUC
> that would be used in the absence of a slot-level setting.  Thoughts?
>

Yeah, that is a valid point. One of the reasons for keeping it at slot
level was to allow different subscribers/output plugins to have a
different setting for inactive_timeout for their respective slots based
on their usage. Now, having it as a GUC also has some valid use cases
as pointed out by you but I am not sure having both at slot level and
at GUC level is required. I was a bit inclined to have it at slot
level for now and then based on some field usage report we can later
add GUC as well.
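
For context, if the GUC route were taken, the entry could look roughly
like the sketch below (the GUC name, unit, group and default are
assumptions for illustration, not anything agreed here; the int variable
would be declared elsewhere):

  /* hypothetical entry in src/backend/utils/misc/guc_tables.c */
  {
      {"replication_slot_inactive_timeout", PGC_SIGHUP, REPLICATION_SENDING,
          gettext_noop("Sets the default time a replication slot may remain "
                       "inactive before it is invalidated."),
          gettext_noop("A value of 0 disables the timeout; a slot-level "
                       "setting, if any, would override this default."),
          GUC_UNIT_S
      },
      &replication_slot_inactive_timeout,
      0, 0, INT_MAX,
      NULL, NULL, NULL
  },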

-- 
With Regards,
Amit Kapila.




Re: Improve eviction algorithm in ReorderBuffer

2024-03-25 Thread Masahiko Sawada
On Thu, Mar 14, 2024 at 12:02 PM Masahiko Sawada  wrote:
>
>
> I've attached new version patches.

Since the previous patch conflicts with the current HEAD, I've
attached the rebased patches.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


v10-0001-Make-binaryheap-enlargeable.patch
Description: Binary data


v10-0003-Improve-eviction-algorithm-in-Reorderbuffer-usin.patch
Description: Binary data


v10-0002-Add-functions-to-binaryheap-for-efficient-key-re.patch
Description: Binary data


Re: Recent 027_streaming_regress.pl hangs

2024-03-25 Thread Andres Freund
Hi,

On 2024-03-26 00:00:38 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > I think there must be some actual regression involved. The frequency of
> > failures on HEAD vs failures on 16 - both of which run the tests 
> > concurrently
> > via meson - is just vastly different.
>
> Are you sure it's not just that the total time to run the core
> regression tests has grown to a bit more than what the test timeout
> allows for?

You're right, that could be it - in a way at least, the issue is replay not
catching up within 180s, so it'd have to be the data volume growing, I think.

But it doesn't look like the regression volume meaningfully grew around that
time?

I guess I'll try to write a buildfarm database query to extract how long that
phase of the test took from all runs on my menagerie, not just the failing
one, and see if there's a visible trend.

Greetings,

Andres Freund




RE: speed up a logical replica setup

2024-03-25 Thread Hayato Kuroda (Fujitsu)
Dear Amit, Euler,

> 
> This only drops the publications created by this tool, not the
> pre-existing ones that we discussed in the link provided.

Another concern here is the case where the primary itself subscribes to
changes from other nodes. After the conversion, the new subscriber also
tries to connect to that other publisher, which may lead to conflicts.
This happens because both the launcher and the workers start after
recovery finishes. So, based on Ashutosh's point, should we remove such
replication objects?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



Re: Add new error_action COPY ON_ERROR "log"

2024-03-25 Thread Masahiko Sawada
On Tue, Mar 26, 2024 at 12:23 PM Bharath Rupireddy
 wrote:
>
> On Tue, Mar 26, 2024 at 7:16 AM Masahiko Sawada  wrote:
> >
> > > Please see the attached v9 patch set.
> >
> > Thank you for updating the patch! The patch mostly looks good to me.
> > Here are some minor comments:
>
> Thanks for looking into this.
>
> > ---
> >  /* non-export function prototypes */
> > -static char *limit_printout_length(const char *str);
> > -
> > static void ClosePipeFromProgram(CopyFromState cstate);
> >
> > Now that we have only one function we should replace "prototypes" with
> > "prototype".
>
> Well no. We might add a few more (never know). A quick look around the
> GUCs under /* GUCs */ tells me that plural form there is being used
even when just one GUC is defined (xlogprefetcher.c for instance).

Understood.

>
> > ---
> > +ereport(NOTICE,
> > +
> > errmsg("data type incompatibility at line %llu for column %s: \"%s\"",
> > +
> >  (unsigned long long) cstate->cur_lineno,
> > +
> >  cstate->cur_attname,
> > +
> >  attval));
> >
> > I guess it would be better to make the log message clearer to convey
> > what we did for the malformed row. For example, how about something
> > like "skipping row due to data type incompatibility at line %llu for
> > column %s: \"s\""?
>
> The summary message which gets printed at the end says that "NOTICE:
> 6 rows were skipped due to data type incompatibility". Isn't this
> enough? If someone is using ON_ERROR 'ignore', it's quite natural that
> such rows get skipped softly and the summary message can help them,
> no?

I think that in the main log message we should mention what happened
(or is happening) or what we did (or are doing). If the message "data
type incompatibility ..." was in the DETAIL message with the main
message saying something like "skipping row at line %llu for column
%s: ...", it would make sense to me. But the current message seems not
to be clear to me and consistent with other NOTICE messages. Also, the
last summary line would not be written if the user cancelled, and
someone other than person who used ON_ERROR 'ignore' might check the
server logs later.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: pgsql: Track last_inactive_time in pg_replication_slots.

2024-03-25 Thread shveta malik
On Tue, Mar 26, 2024 at 1:50 AM Bharath Rupireddy
 wrote:
>
> On Tue, Mar 26, 2024 at 1:30 AM Nathan Bossart  
> wrote:
> >
> > On Mon, Mar 25, 2024 at 04:49:12PM +, Bertrand Drouvot wrote:
> > > On Mon, Mar 25, 2024 at 12:25:37PM -0400, Robert Haas wrote:
> > >> In the same vein, I think deactivated_at or inactive_since might be
> > >> good names to consider. I think they get at the same thing as
> > >> released_time, but they avoid introducing a completely new word
> > >> (release, as opposed to active/inactive).
> > >
> > > Yeah, I'd vote for inactive_since then.
> >
> > Having only skimmed some of the related discussions, I'm inclined to agree
> > that inactive_since provides the clearest description for the column.
>
> I think we all have some agreement on inactive_since. So, I'm
> attaching the patch for that change.

pg_proc.dat needs to be changed to refer to 'inactive_since' instead
of 'last_inactive_time' in the attached patch.

thanks
Shveta




Re: pgsql: Track last_inactive_time in pg_replication_slots.

2024-03-25 Thread shveta malik
On Mon, Mar 25, 2024 at 9:54 PM Bertrand Drouvot
 wrote:
>
> Hi,
>
> On Mon, Mar 25, 2024 at 07:32:11PM +0530, Amit Kapila wrote:
> > On Mon, Mar 25, 2024 at 6:57 PM Robert Haas  wrote:
> > > And I'm suspicious that having an exception for slots being synced is
> > > a bad idea. That makes too much of a judgement about how the user will
> > > use this field. It's usually better to just expose the data, and if
> > > the user needs help to make sense of that data, then give them that
> > > help separately.
> >
> > The reason we didn't set this for sync slots is that they won't be
> > usable (one can't use them to decode WAL) unless standby is promoted
> > [2]. But I see your point as well. So, I have copied the others
> > involved in this discussion to see what they think.
>
> Yeah I also see Robert's point. If we also sync the "last inactive time" 
> field then
> we would need to take care of the corner case mentioned by Shveta in [1] 
> during
> promotion.

I have suggested one potential solution for that in [1]. Please have a look.

[1]: 
https://www.postgresql.org/message-id/CAJpy0uB-yE%2BRiw7JQ4hW0%2BigJxvPc%2Brq%2B9c7WyTa1Jz7%2B2gAiA%40mail.gmail.com

thanks
Shveta




Re: speed up a logical replica setup

2024-03-25 Thread Amit Kapila
On Tue, Mar 26, 2024 at 8:27 AM Euler Taveira  wrote:
>
> On Mon, Mar 25, 2024, at 11:33 PM, Amit Kapila wrote:
>
> On Mon, Mar 25, 2024 at 5:25 PM Peter Eisentraut  wrote:
> >
> > I have committed your version v33.  I did another pass over the
> > identifier and literal quoting.  I added quoting for replication slot
> > names, for example, even though they can only contain a restricted set
> > of characters, but it felt better to be defensive there.
> >
> > I'm happy to entertain follow-up patches on some of the details like
> > option naming that were still being discussed.  I just wanted to get the
> > main functionality in in good time.  We can fine-tune the rest over the
> > next few weeks.
> >
>
> I was looking at prior discussions on this topic to see if there are
> any other open design points apart from this and noticed that the
> points raised/discussed in the email [1] are also not addressed. IIRC,
> the key point we discussed was that after promotion, the existing
> replication objects should be removed (either optionally or always),
> otherwise, it can lead to a new subscriber not being able to restart
> or getting some unwarranted data.
>
>
> See setup_subscriber.
>
> /*
>  * Since the publication was created before the consistent LSN, it is
>  * available on the subscriber when the physical replica is promoted.
>  * Remove publications from the subscriber because it has no use.
>  */
> drop_publication(conn, &dbinfo[I]);
>

This only drops the publications created by this tool, not the
pre-existing ones that we discussed in the link provided.

-- 
With Regards,
Amit Kapila.




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-25 Thread shveta malik
On Mon, Mar 25, 2024 at 12:43 PM shveta malik  wrote:
>
> I have one concern, for synced slots on standby, how do we disallow
> invalidation due to inactive-timeout immediately after promotion?
>
> For synced slots, last_inactive_time and inactive_timeout are both
> set. Let's say I bring down primary for promotion of standby and then
> promote standby, there are chances that it may end up invalidating
> synced slots (considering standby is not brought down during promotion
> and thus inactive_timeout may already be past 'last_inactive_time').
>

On standby, if we decide to maintain valid last_inactive_time for
synced slots, then invalidation is correctly restricted in
InvalidateSlotForInactiveTimeout() for synced slots using the check:

if (RecoveryInProgress() && slot->data.synced)
return false;

But immediately after promotion, we can not rely on the above check
and thus possibility of synced slots invalidation is there. To
maintain consistent behavior regarding the setting of
last_inactive_time for synced slots, similar to user slots, one
potential solution to prevent this invalidation issue is to update the
last_inactive_time of all synced slots within the ShutDownSlotSync()
function during FinishWalRecovery(). This approach ensures that
promotion doesn't immediately invalidate slots, and henceforth, we
possess a correct last_inactive_time as a basis for invalidation going
forward. This will be equivalent to updating last_inactive_time during
restart (but without actual restart during promotion).
The plus point of maintaining last_inactive_time for synced slots
could be, this can provide data to the user on when last time the sync
was attempted on that particular slot by background slot sync worker
or SQL function. Thoughts?
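
To make the idea concrete, here is a minimal sketch of what such an
update could look like (not the actual patch; the function name is made
up, and the field is shown under its proposed new name inactive_since):

  #include "postgres.h"

  #include "replication/slot.h"
  #include "storage/lwlock.h"
  #include "storage/spin.h"
  #include "utils/timestamp.h"

  /*
   * Sketch: stamp every synced slot with the current time when the slot
   * sync machinery shuts down (e.g. from ShutDownSlotSync() during
   * FinishWalRecovery()), so that promotion does not immediately trip
   * the inactive_timeout invalidation check.
   */
  static void
  update_synced_slots_inactive_since(void)
  {
      TimestampTz now = GetCurrentTimestamp();

      LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);

      for (int i = 0; i < max_replication_slots; i++)
      {
          ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];

          /* only slots that are in use and were synced from the primary */
          if (s->in_use && s->data.synced)
          {
              SpinLockAcquire(&s->mutex);
              s->inactive_since = now;
              SpinLockRelease(&s->mutex);
          }
      }

      LWLockRelease(ReplicationSlotControlLock);
  }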

thanks
Shveta




Re: Recent 027_streaming_regress.pl hangs

2024-03-25 Thread Tom Lane
Andres Freund  writes:
> I think there must be some actual regression involved. The frequency of
> failures on HEAD vs failures on 16 - both of which run the tests concurrently
> via meson - is just vastly different.

Are you sure it's not just that the total time to run the core
regression tests has grown to a bit more than what the test timeout
allows for?

regards, tom lane




Re: Recent 027_streaming_regress.pl hangs

2024-03-25 Thread Andres Freund
Hi,

On 2024-03-20 17:41:45 -0700, Andres Freund wrote:
> On 2024-03-14 16:56:39 -0400, Tom Lane wrote:
> > Also, this is probably not
> > helping anything:
> >
> >'extra_config' => {
> >   ...
> >   'fsync = on'
>
> At some point we had practically no test coverage of fsync, so I made my
> animals use fsync. I think we still have little coverage.  I probably could
> reduce the number of animals using it though.

I think there must be some actual regression involved. The frequency of
failures on HEAD vs failures on 16 - both of which run the tests concurrently
via meson - is just vastly different.  I'd expect the absolute number of
failures in 027_stream_regress.pl to differ between branches due to fewer runs
on 16, but there's no explanation for the difference in percentage of
failures. My menagerie had only a single recoveryCheck failure on !HEAD in the
last 30 days, but in the vicinity of 100 on HEAD
https://buildfarm.postgresql.org/cgi-bin/show_failures.pl?max_days=30&stage=recoveryCheck&filter=Submit


If anything the load when testing back branch changes is higher, because
commonly back-branch builds are happening on all branches, so I don't think
that can be the explanation either.

From what I can tell the pattern changed on 2024-02-16 19:39:02 - there was a
rash of recoveryCheck failures in the days before that too, but not
027_stream_regress.pl in that way.


It certainly seems suspicious that one commit before the first observed failure
is
2024-02-16 11:09:11 -0800 [73f0a132660] Pass correct count to WALRead().

Of course the failure rate is low enough that it could have been a day or two
before that, too.

Greetings,

Andres Freund




Re: Teach predtest about IS [NOT] proofs

2024-03-25 Thread Tom Lane
I wrote:
> I went ahead and committed 0001 after one more round of review
> 
> statements; my bad).  I also added the changes in test_predtest.c from
> 0002.  I attach a rebased version of 0002, as well as 0003 which isn't
> changed, mainly to keep the cfbot happy.

[ squint.. ]  Apparently I managed to hit ^K right before sending this
email.  The missing line was meant to be more or less

> which found a couple of missing "break"

Not too important, but perhaps future readers of the archives will
be confused.

regards, tom lane




Re: Sync scan & regression tests

2024-03-25 Thread Andres Freund
Hi,

On 2024-03-24 11:28:12 -0400, Tom Lane wrote:
> Heikki Linnakangas  writes:
> > On 19/09/2023 01:57, Andres Freund wrote:
> >> On 2023-09-18 13:49:24 +0300, Heikki Linnakangas wrote:
> >>> d) Copy fewer rows to the table in the test. If we copy only 6 rows, for
> >>> example, the table will have only two pages, regardless of shared_buffers.
> >>>
> >>> I'm leaning towards d). The whole test is a little fragile, it will also
> >>> fail with a non-default block size, for example. But c) seems like a 
> >>> simple
> >>> fix and wouldn't look too out of place in the test.
>
> >> Hm, what do you mean with the last sentence? Oh, is the test you're
> >> referencing the relation-extension logic?
>
> > Sorry, I said "c) seems like a simple fix ...", but I meant "d) seems
> > like a simple fix ..."
> > I meant the attached.
>
> This thread stalled out months ago, but chipmunk is still failing in
> HEAD and v16.  Can we please have a fix?  I'm good with Heikki's
> adjustment to the pg_visibility test case.

I pushed Heikki's adjustment. Thanks for the "fix" and the reminder.

Greetings,

Andres Freund




Re: Add new error_action COPY ON_ERROR "log"

2024-03-25 Thread Bharath Rupireddy
On Tue, Mar 26, 2024 at 7:16 AM Masahiko Sawada  wrote:
>
> > Please see the attached v9 patch set.
>
> Thank you for updating the patch! The patch mostly looks good to me.
> Here are some minor comments:

Thanks for looking into this.

> ---
>  /* non-export function prototypes */
> -static char *limit_printout_length(const char *str);
> -
> static void ClosePipeFromProgram(CopyFromState cstate);
>
> Now that we have only one function we should replace "prototypes" with
> "prototype".

Well no. We might add a few more (never know). A quick look around the
GUCs under /* GUCs */ tells me that plural form there is being used
even when just one GUC is defined (xlogprefetcher.c for instance).

> ---
> +ereport(NOTICE,
> +
> errmsg("data type incompatibility at line %llu for column %s: \"%s\"",
> +
>  (unsigned long long) cstate->cur_lineno,
> +
>  cstate->cur_attname,
> +
>  attval));
>
> I guess it would be better to make the log message clearer to convey
> what we did for the malformed row. For example, how about something
> like "skipping row due to data type incompatibility at line %llu for
> column %s: \"s\""?

The summary message which gets printed at the end says that "NOTICE:
6 rows were skipped due to data type incompatibility". Isn't this
enough? If someone is using ON_ERROR 'ignore', it's quite natural that
such rows get skipped softly and the summary message can help them,
no?

> ---
>  extern void CopyFromErrorCallback(void *arg);
> +extern char *limit_printout_length(const char *str);
>
> I don't disagree with exposing the limit_printout_length() function
> but I think it's better to rename it for consistency with other
> exposed COPY command functions. Only this function is snake-case. How
> about CopyLimitPrintoutLength() or similar?

WFM. Although its implementation is not related to COPY code, COPY is
the sole user of it right now, so I'm fine with it. Done that.

> FWIW I'm going to merge two patches before the push.

Done that.

Please see the attached v10 patch.
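
To make the behaviour above concrete, here is a minimal way to exercise
it (option syntax per the v10 patch; the sample table, the input and the
exact per-row NOTICE wording are illustrative only):

  CREATE TABLE t (a int, b int);

  -- ON_ERROR ignore skips rows that fail input conversion and reports a
  -- summary NOTICE; with LOG_VERBOSITY verbose a NOTICE is also emitted
  -- for each skipped row.
  COPY t FROM STDIN WITH (FORMAT csv, ON_ERROR ignore, LOG_VERBOSITY verbose);
  1,2
  a,4
  5,b
  \.
  -- NOTICE:  ... line 2 ... column a ...   (per-row detail, verbose only)
  -- NOTICE:  ... line 3 ... column b ...
  -- NOTICE:  2 rows were skipped due to data type incompatibility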

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


v10-0001-Add-detailed-info-when-COPY-skips-soft-errors.patch
Description: Binary data


Re: speed up a logical replica setup

2024-03-25 Thread Euler Taveira
On Mon, Mar 25, 2024, at 11:33 PM, Amit Kapila wrote:
> On Mon, Mar 25, 2024 at 5:25 PM Peter Eisentraut  wrote:
> >
> > I have committed your version v33.  I did another pass over the
> > identifier and literal quoting.  I added quoting for replication slot
> > names, for example, even though they can only contain a restricted set
> > of characters, but it felt better to be defensive there.
> >
> > I'm happy to entertain follow-up patches on some of the details like
> > option naming that were still being discussed.  I just wanted to get the
> > main functionality in in good time.  We can fine-tune the rest over the
> > next few weeks.
> >
> 
> I was looking at prior discussions on this topic to see if there are
> any other open design points apart from this and noticed that the
> points raised/discussed in the email [1] are also not addressed. IIRC,
> the key point we discussed was that after promotion, the existing
> replication objects should be removed (either optionally or always),
> otherwise, it can lead to a new subscriber not being able to restart
> or getting some unwarranted data.

See setup_subscriber.

/*
 * Since the publication was created before the consistent LSN, it is
 * available on the subscriber when the physical replica is promoted.
 * Remove publications from the subscriber because it has no use.
 */
drop_publication(conn, &dbinfo[i]);


--
Euler Taveira
EDB   https://www.enterprisedb.com/


Re: speed up a logical replica setup

2024-03-25 Thread Euler Taveira
On Mon, Mar 25, 2024, at 1:06 PM, Hayato Kuroda (Fujitsu) wrote:
> ## Analysis for failure 1
> 
> The failure caused by a time lag between walreceiver finishes and 
> pg_is_in_recovery()
> returns true.
> 
> According to the output [1], it seems that the tool failed at 
> wait_for_end_recovery()
> with the message "standby server disconnected from the primary". Also, lines
> "redo done at..." and "terminating walreceiver process due to administrator 
> command"
> meant that walreceiver was requested to shut down by XLogShutdownWalRcv().
> 
> According to the source, we confirm that walreceiver is shut down in
> StartupXLOG()->FinishWalRecovery()->XLogShutdownWalRcv(). Also, 
> SharedRecoveryState
> is changed to RECOVERY_STATE_DONE (this meant the pg_is_in_recovery() return 
> true)
> at the latter part of StartupXLOG().
> 
> So, if there is a delay between FinishWalRecovery() and change the state, the 
> check
> in wait_for_end_recovery() would be failed during the time. Since we allow to 
> miss
> the walreceiver 10 times and it is checked once per second, the failure 
> occurs if
> the time lag is longer than 10 seconds.
> 
> I do not have a good way to fix it. One approach is make NUM_CONN_ATTEMPTS 
> larger,
> but it's not a fundamental solution.

I was expecting that slow hosts might have issues in wait_for_end_recovery().
As you said it took a lot of steps between FinishWalRecovery() (where
walreceiver is shutdown -- XLogShutdownWalRcv) and SharedRecoveryState is set to
RECOVERY_STATE_DONE. If this window takes longer than NUM_CONN_ATTEMPTS *
WAIT_INTERVAL (10 seconds), it aborts the execution. That's a bad decision
because it already finished the promotion and it is just doing the final
preparation for the host to become a primary.

/*   
 * If it is still in recovery, make sure the target server is
 * connected to the primary so it can receive the required WAL to
 * finish the recovery process. If it is disconnected try
 * NUM_CONN_ATTEMPTS in a row and bail out if not succeed.
 */
res = PQexec(conn,
 "SELECT 1 FROM pg_catalog.pg_stat_wal_receiver");
if (PQntuples(res) == 0)
{
if (++count > NUM_CONN_ATTEMPTS)
{
stop_standby_server(subscriber_dir);
pg_log_error("standby server disconnected from the primary");
break;
}
}
else
count = 0;  /* reset counter if it connects again */

This code was add to defend against the death/crash of the target server. There
are at least 3 options:

(1) increase NUM_CONN_ATTEMPTS * WAIT_INTERVAL seconds. We discussed this 
constant
and I decided to use 10 seconds because even in some slow hosts, this time
wasn't reached during my tests. It seems I forgot to test the combination of 
slow
host, asserts enabled, and ubsan. I didn't notice that pg_promote() uses 60
seconds as default wait. Maybe that's a reasonable value. I checked the
004_timeline_switch test and the last run took: 39.2s (serinus), 33.1s
(culicidae), 18.31s (calliphoridae) and 27.52s (olingo).

(2) check if the primary is not running when walreceiver is not available on the
target server. Increase the connection attempts only if the primary is not
running (see the sketch after the option list).
Hence, the described case doesn't cause an increment on the count variable.

(3) set recovery_timeout default to != 0 and remove pg_stat_wal_receiver check
protection against the death/crash target server. I explained in a previous
message that timeout may occur in cases that WAL replay to reach consistent
state takes more than recovery-timeout seconds.

Option (1) is the easiest fix, however, we can have the same issue again if a
slow host decides to be even slower, hence, we have to adjust this value again.
Option (2) interprets the walreceiver's absence as the end of recovery; if the
primary server is still running, it indicates that the target server is close
to finishing recovery. Option (3) is not as resilient as the other options.

The first patch implements a combination of (1) and (2).
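
A rough sketch of option (2), referenced above (not the actual patch; the
helper name and the primary_conninfo variable are assumptions):

  static bool
  primary_is_running(const char *primary_conninfo)
  {
      PGconn     *pconn = PQconnectdb(primary_conninfo);
      bool        running = (PQstatus(pconn) == CONNECTION_OK);

      PQfinish(pconn);
      return running;
  }

      /* ... inside the wait_for_end_recovery() loop ... */
      res = PQexec(conn,
                   "SELECT 1 FROM pg_catalog.pg_stat_wal_receiver");
      if (PQntuples(res) == 0)
      {
          /* count the miss only if the primary appears to be down too */
          if (!primary_is_running(primary_conninfo) &&
              ++count > NUM_CONN_ATTEMPTS)
          {
              stop_standby_server(subscriber_dir);
              pg_log_error("standby server disconnected from the primary");
              break;
          }
      }
      else
          count = 0;          /* reset counter if it connects again */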

> ## Analysis for failure 2
> 
> According to [2], the physical replication slot which is specified as 
> primary_slot_name
> was not used by the walsender process. At that time walsender has not existed.
> 
> ```
> ...
> pg_createsubscriber: publisher: current wal senders: 0
> pg_createsubscriber: command is: SELECT 1 FROM 
> pg_catalog.pg_replication_slots WHERE active AND slot_name = 'physical_slot'
> pg_createsubscriber: error: could not obtain replication slot information: 
> got 0 rows, expected 1 row
> ...
> ```
> 
> Currently standby must be stopped before the command and current code does not
> block the flow to ensure the replication is started. So there is a possibility
> that the checking is run before walsender is launched.
> 
> One possible approach is to wait until the replication starts. Alternative 
> one is
> to ease the condition.


Re: RFC: Logging plan of the running query

2024-03-25 Thread Andres Freund
Hi,

On 2024-03-13 15:33:02 -0400, Robert Haas wrote:
> But also ... having to wrap the entire plan tree like this seems
> pretty awful. I don't really like the idea of a large-scan plan
> modification like this in the middle of the query.

It's not great. But I also don't really see an alternative with this approach.

I guess we could invent a new CFI version that gets the current PlanState and
use that in all of src/backend/executor/node* and pass the PlanState to that -
but then we could just as well just directly process the interrupt there.


> I also wonder whether it interacts properly with JIT.

I don't think there's a problem unless somebody invests a lot of time in
JITing much more of the query. Which will require a lot more work, basically
redesigning the executor...



> Andres, did you have some clever idea for this feature that would
> avoid the need to do this?

No. I think it's acceptable though.

However it might be worth inventing an executor tree walker in a preliminary
step. We have already quite a few switches over all plan nodes, which we could
largely replace with a helper.

Greetings,

Andres Freund




Re: speed up a logical replica setup

2024-03-25 Thread Amit Kapila
On Mon, Mar 25, 2024 at 5:25 PM Peter Eisentraut  wrote:
>
> I have committed your version v33.  I did another pass over the
> identifier and literal quoting.  I added quoting for replication slot
> names, for example, even though they can only contain a restricted set
> of characters, but it felt better to be defensive there.
>
> I'm happy to entertain follow-up patches on some of the details like
> option naming that were still being discussed.  I just wanted to get the
> main functionality in in good time.  We can fine-tune the rest over the
> next few weeks.
>

I was looking at prior discussions on this topic to see if there are
any other open design points apart from this and noticed that the
points raised/discussed in the email [1] are also not addressed. IIRC,
the key point we discussed was that after promotion, the existing
replication objects should be removed (either optionally or always),
otherwise, it can lead to a new subscriber not being able to restart
or getting some unwarranted data.

[1] - 
https://www.postgresql.org/message-id/CAExHW5t4ew7ZrgcDdTv7YmuG7LVQT1ZaEny_EvtngHtEBNyjcQ%40mail.gmail.com

-- 
With Regards,
Amit Kapila.




Re: Add new error_action COPY ON_ERROR "log"

2024-03-25 Thread Masahiko Sawada
On Mon, Mar 25, 2024 at 8:21 PM Bharath Rupireddy
 wrote:
>
> On Mon, Mar 25, 2024 at 10:42 AM Masahiko Sawada  
> wrote:
> >
> > The current approach, eliminating the duplicated information in
> > CONTEXT, seems good to me.
>
> Thanks for looking into it.
>
> > One question about the latest (v8) patch:
> >
> > +   else
> > +   ereport(NOTICE,
> > +   errmsg("data type incompatibility at
> > line %llu for column %s: null input",
> > +  (unsigned long long) 
> > cstate->cur_lineno,
> > +  cstate->cur_attname));
> > +
> >
> > How can we reach this path? It seems we don't cover this path by the tests.
>
> Tests don't cover that part, but it can be hit with something like
> [1]. I've added a test for this.
>
> Note the use of domain to provide an indirect way of providing null
> constraint check. Otherwise, COPY FROM fails early in
> CopyFrom->ExecConstraints if the NOT NULL constraint is directly
> provided next to the column in the table [2].
>
> Please see the attached v9 patch set.
>

Thank you for updating the patch! The patch mostly looks good to me.
Here are some minor comments:

---
 /* non-export function prototypes */
-static char *limit_printout_length(const char *str);
-
static void ClosePipeFromProgram(CopyFromState cstate);

Now that we have only one function we should replace "prototypes" with
"prototype".

---
+ereport(NOTICE,
+
errmsg("data type incompatibility at line %llu for column %s: \"%s\"",
+
 (unsigned long long) cstate->cur_lineno,
+
 cstate->cur_attname,
+
 attval));

I guess it would be better to make the log message clearer to convey
what we did for the malformed row. For example, how about something
like "skipping row due to data type incompatibility at line %llu for
column %s: \"s\""?

---
 extern void CopyFromErrorCallback(void *arg);
+extern char *limit_printout_length(const char *str);

I don't disagree with exposing the limit_printout_length() function
but I think it's better to rename it for consistency with other
exposed COPY command functions. Only this function is snake-case. How
about CopyLimitPrintoutLength() or similar?

FWIW I'm going to merge two patches before the push.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: speed up a logical replica setup

2024-03-25 Thread vignesh C
On Mon, 25 Mar 2024 at 21:36, Hayato Kuroda (Fujitsu)
 wrote:
>
> Dear Bharath, Peter,
>
> > Looks like BF animals aren't happy, please check -
> > > https://buildfarm.postgresql.org/cgi-bin/show_failures.pl.
> >
> > Looks like sanitizer failures.  There were a few messages about that
> > recently, but those were all just about freeing memory after use, which
> > we don't necessarily require for client programs.  So maybe something else.
>
> It seems that there are several time of failures, [1] and [2].
>
> ## Analysis for failure 1
>
> The failure is caused by a time lag between when the walreceiver finishes and
> when pg_is_in_recovery() starts returning false.
>
> According to the output [1], it seems that the tool failed at 
> wait_for_end_recovery()
> with the message "standby server disconnected from the primary". Also, lines
> "redo done at..." and "terminating walreceiver process due to administrator 
> command"
> meant that walreceiver was requested to shut down by XLogShutdownWalRcv().
>
> According to the source, we can confirm that the walreceiver is shut down in
> StartupXLOG()->FinishWalRecovery()->XLogShutdownWalRcv(). Also,
> SharedRecoveryState is changed to RECOVERY_STATE_DONE (which is when
> pg_is_in_recovery() starts returning false) at the latter part of StartupXLOG().
>
> So, if there is a delay between FinishWalRecovery() and changing the state, the
> check in wait_for_end_recovery() would fail during that window. Since we allow
> the walreceiver to be missing 10 times and it is checked once per second, the
> failure occurs if the time lag is longer than 10 seconds.
>
> I do not have a good way to fix it. One approach is to make NUM_CONN_ATTEMPTS
> larger, but it's not a fundamental solution.

I agree with your analysis, another way to fix could be to remove the
following check as increasing the count might still have the race
condition issue:
/*
* If it is still in recovery, make sure the target server is
* connected to the primary so it can receive the required WAL to
* finish the recovery process. If it is disconnected try
* NUM_CONN_ATTEMPTS in a row and bail out if not succeed.
*/
res = PQexec(conn,
"SELECT 1 FROM pg_catalog.pg_stat_wal_receiver");

I'm not sure whether we should worry about the condition where
recovery is not done and pg_stat_wal_receiver is exited as we have the
following sanity check in check_subscriber before we wait for recovery
to be finished:
/* The target server must be a standby */
if (!server_is_in_recovery(conn))
{
pg_log_error("target server must be a standby");
disconnect_database(conn, true);
}
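
To make the failure mode described above concrete, here is a rough sketch of
the wait-loop shape (NOT the actual pg_createsubscriber code):
server_is_in_recovery(), the pg_stat_wal_receiver query, and
NUM_CONN_ATTEMPTS = 10 are taken from the thread; everything else is
illustrative.

#include <stdbool.h>
#include <stdlib.h>
#include <unistd.h>
#include "libpq-fe.h"

#define NUM_CONN_ATTEMPTS 10		/* per the thread: 10 misses allowed */

extern bool server_is_in_recovery(PGconn *conn);	/* from the tool, not libpq */

/*
 * Illustrative only: if recovery takes more than NUM_CONN_ATTEMPTS seconds to
 * be marked as finished after the walreceiver exits, the miss counter reaches
 * the limit and we bail out -- the reported failure.
 */
static void
wait_for_end_recovery_sketch(PGconn *conn)
{
	int			missed = 0;

	for (;;)
	{
		PGresult   *res;

		if (!server_is_in_recovery(conn))
			return;				/* recovery finished */

		/* still in recovery: is the walreceiver connected to the primary? */
		res = PQexec(conn, "SELECT 1 FROM pg_catalog.pg_stat_wal_receiver");
		if (PQntuples(res) == 0)
			missed++;			/* walreceiver already gone this round */
		else
			missed = 0;
		PQclear(res);

		if (missed >= NUM_CONN_ATTEMPTS)
			exit(1);			/* "standby server disconnected from the primary" */

		sleep(1);				/* the condition is checked once per second */
	}
}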

Regards,
Vignesh




Re: pgsql: Track last_inactive_time in pg_replication_slots.

2024-03-25 Thread Amit Kapila
On Mon, Mar 25, 2024 at 9:55 PM Robert Haas  wrote:
>
> On Mon, Mar 25, 2024 at 12:12 PM Bertrand Drouvot
>  wrote:
>
> > Would "released_time" sounds better? (at the end this is exactly what it 
> > does
> > represent unless for the case where it is restored from disk for which the 
> > meaning
> > would still makes sense to me though). It seems to me that released_time 
> > does not
> > lead to any expectation then removing any confusion.
>
> Yeah, that's not bad. I mean, I don't agree that released_time doesn't
> lead to any expectation, but what it leads me to expect is that you're
> going to tell me the time at which the slot was released. So if it's
> currently active, then I see NULL, because it's not released; but if
> it's inactive, then I see the time at which it became so.
>
> In the same vein, I think deactivated_at or inactive_since might be
> good names to consider. I think they get at the same thing as
> released_time, but they avoid introducing a completely new word
> (release, as opposed to active/inactive).
>

We have a consensus on inactive_since, so I'll make that change. I
would also like to solicit your opinion on the other slot-level
parameter we are planning to introduce.  This new slot-level parameter
will be named inactive_timeout. It will indicate that once the slot
has been inactive for the inactive_timeout period, we will invalidate
the slot. We are also discussing having this parameter
(inactive_timeout) as a GUC [1]. We can have this new parameter both
at the slot level and as a GUC, or just one of those.

[1] - 
https://www.postgresql.org/message-id/20240325195443.GA2923888%40nathanxps13

-- 
With Regards,
Amit Kapila.




Re: Combine Prune and Freeze records emitted by vacuum

2024-03-25 Thread Melanie Plageman
Thanks for committing the new WAL format!

On Mon, Mar 25, 2024 at 3:33 PM Heikki Linnakangas  wrote:
>
> On 24/03/2024 18:32, Melanie Plageman wrote:
> > On Thu, Mar 21, 2024 at 9:28 AM Heikki Linnakangas  wrote:
> >>
> >> In heap_page_prune_and_freeze(), we now do some extra work on each live
> >> tuple, to set the all_visible_except_removable correctly. And also to
> >> update live_tuples, recently_dead_tuples and hastup. When we're not
> >> freezing, that's a waste of cycles, the caller doesn't care. I hope it's
> >> enough that it doesn't matter, but is it?
> >
> > Last year on an early version of the patch set I did some pgbench
> > tpcb-like benchmarks -- since there is a lot of on-access pruning in
> > that workload -- and I don't remember it being a showstopper. The code
> > has changed a fair bit since then. However, I think it might be safer
> > to pass a flag "by_vacuum" to heap_page_prune_and_freeze() and skip
> > the rest of the loop after heap_prune_satisfies_vacuum() when
> > on-access pruning invokes it. I had avoided that because it felt ugly
> > and error-prone, however it addresses a few other of your points as
> > well.
>
> Ok. I'm not a fan of the name 'by_vacuum' though. It'd be nice if the
> argument described what it does, rather than who it's for. For example,
> 'need_all_visible'. If set to true, the function determines
> 'all_visible', otherwise it does not.

I like that way of putting it -- describing what it does instead of
who it is for. However, we now have PruneReason as an argument to
heap_page_prune(), which would be usable for this purpose (for
skipping the rest of the first loop). It is not descriptive of how we
would use it in this scenario, though.

> I started to look closer at the loops in heap_prune_chain() and how they
> update all the various flags and counters. There's a lot going on there.
> We have:
>
> - live_tuples counter
> - recently_dead_tuples counter
> - all_visible[_except_removable]
> - all_frozen
> - visibility_cutoff_xid
> - hastup
> - prstate.frozen array
> - nnewlpdead
> - deadoffsets array
>
> And that doesn't even include all the local variables and the final
> dead/redirected arrays.

Yes, there are a lot of things happening. In an early version, I had
hoped for the first loop to be just getting the visibility information
and then to do most of the other stuff as we went in
heap_prune_chain() as you mention below. I couldn't quite get a
version of that working that looked nice. I agree that the whole thing
feels a bit brittle and error-prone. It's hard to be objective after
fiddling with something over the course of a year. I'm trying to take
a step back now and rethink it.

> Some of those are set in the first loop that initializes 'htsv' for each
> tuple on the page. Others are updated in heap_prune_chain(). Some are
> updated in both. It's hard to follow which are set where.

Yep.

> I think recently_dead_tuples is updated incorrectly, for tuples that are
> part of a completely dead HOT chain. For example, imagine a hot chain
> with two tuples: RECENTLY_DEAD -> DEAD. heap_prune_chain() would follow
> the chain, see the DEAD tuple at the end of the chain, and mark both
> tuples for pruning. However, we already updated 'recently_dead_tuples'
> in the first loop, which is wrong if we remove the tuple.
>
> Maybe that's the only bug like this, but I'm a little scared. Is there
> something we could do to make this simpler? Maybe move all the new work
> that we added to the first loop, into heap_prune_chain() ? Maybe
> introduce a few more helper heap_prune_record_*() functions, to update
> the flags and counters also for live and insert/delete-in-progress
> tuples and for dead line pointers? Something like
> heap_prune_record_live() and heap_prune_record_lp_dead().

I had discarded previous attempts to get everything done in
heap_prune_chain() because it visits the line pointers out of order, so
it was hard to make sure I was doing the right thing and that each of
them was considered once and only once.
I hadn't thought of the approach you suggested with record_live() --
that might help. I will work on this tomorrow. I had hoped to get
something out today, but I am still in the middle of rebasing the back
20 patches from your v5 over current master and then adding in the
suggestions that I made in the various diffs on the thread.
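
To make that concrete, here is a rough sketch of what a
heap_prune_record_live()-style helper could look like; the struct and field
names are illustrative only, not the actual PruneState in the patch.

#include "postgres.h"
#include "access/htup_details.h"
#include "access/transam.h"

/* Illustrative stand-in for the pruning state; field names are made up. */
typedef struct PruneStateSketch
{
	int			live_tuples;
	bool		hastup;
	bool		all_visible;
	TransactionId visibility_cutoff_xid;
} PruneStateSketch;

/*
 * All bookkeeping for a live tuple happens in one place instead of being
 * split between the first loop and heap_prune_chain().
 */
static void
heap_prune_record_live(PruneStateSketch *prstate, HeapTupleHeader htup)
{
	TransactionId xmin = HeapTupleHeaderGetXmin(htup);

	prstate->live_tuples++;
	prstate->hastup = true;

	/* track the newest xmin among live tuples, for the VM conflict horizon */
	if (prstate->all_visible &&
		TransactionIdIsNormal(xmin) &&
		TransactionIdFollows(xmin, prstate->visibility_cutoff_xid))
		prstate->visibility_cutoff_xid = xmin;
}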

> > Note that I still don't think we have a resolution on what to
> > correctly update new_relfrozenxid and new_relminmxid to at the end
> > when presult->nfrozen == 0 and presult->all_frozen is true.
> >
> >  if (presult->nfrozen > 0)
> >  {
> >  presult->new_relfrozenxid = pagefrz->FreezePageRelfrozenXid;
> >  presult->new_relminmxid = pagefrz->FreezePageRelminMxid;
> >  }
> >  else
> >  {
> >  presult->new_relfrozenxid = pagefrz->NoFreezePageRelfrozenXid;
> >  presult->new_relminmxid = pagefrz->NoFreezePage

Re: make dist using git archive

2024-03-25 Thread Andres Freund
Hi,

On 2024-03-25 06:44:33 +0100, Peter Eisentraut wrote:
> Done and committed.

This triggered a new warning for me:

../../../../../home/andres/src/postgresql/meson.build:3422: WARNING: Project 
targets '>=0.54' but uses feature introduced in '0.55.0': Passing 
executable/found program object to script parameter of add_dist_script.

Greetings,

Andres




Re: SQL:2011 application time

2024-03-25 Thread jian he
On Sun, Mar 24, 2024 at 1:42 AM Paul Jungwirth
 wrote:
>
> v33 attached with minor changes.
>
> Okay, added those tests too. Thanks!
>
> Rebased to 697f8d266c.
>


hi.
minor issues I found in v33-0003.
there are 29 occurrences matching {check_amproc_signature.*false} and
only one {check_amproc_signature(procform->amproc, opcintype, true)}.
is this refactoring really worth it?

Do we also need to refactor gistadjustmembers?


+  
+   intersect
+   computes intersection with FOR PORTION OF
+bounds
+   13
+  
+  
+   without_portion
+   computes remaining duration(s) outside
+   FOR PORTION OF bounds
+   14
+  
these entries need to add "(optional)".


+
+Datum
+my_range_intersect(PG_FUNCTION_ARGS)
+{
+RangeType  *r1 = PG_GETARG_RANGE_P(0);
+RangeType  *r2 = PG_GETARG_RANGE_P(1);
+TypeCacheEntry *typcache;
+
+/* Different types should be prevented by ANYRANGE matching rules */
+if (RangeTypeGetOid(r1) != RangeTypeGetOid(r2))
   elog(ERROR, "range types do not match");
+
+typcache = range_get_typcache(fcinfo, RangeTypeGetOid(r1));
+
+PG_RETURN_RANGE_P(range_intersect_internal(typcache, r1, r2));
+}
+
the elog(ERROR) indentation is wrong?


+/*
+ * range_without_portion_internal - Sets outputs and outputn to the ranges
+ * remaining and their count (respectively) after subtracting r2 from r1.
+ * The array should never contain empty ranges.
+ * The outputs will be ordered. We expect that outputs is an array of
+ * RangeType pointers, already allocated with two slots.
+ */
+void
+range_without_portion_internal(TypeCacheEntry *typcache, RangeType *r1,
+   RangeType *r2, RangeType **outputs, int *outputn)
do the comments need to be refactored?
there is nothing related to "slot"?
I'm not sure the "array" description is right.
(my understanding is that it subtracts r2 from r1 and saves the result to
RangeType **outputs.)


select proisstrict, proname from pg_proc where proname =
'range_without_portion';
range_without_portion is strict.
but
select range_without_portion(NULL::int4range, int4range(11, 20,'[]'));
returns zero rows.
Is this the expected behavior?


0003 seems simple enough,
but it's more related to "for portion of".
I'm not sure we can push 0003 into v17.




Re: WIP Incremental JSON Parser

2024-03-25 Thread Jacob Champion
On Mon, Mar 25, 2024 at 4:24 PM Andrew Dunstan  wrote:
> OK, so we invent a new error code and have the parser  return that if the 
> stack depth gets too big?

Yeah, that seems reasonable. I'd potentially be able to build on that
via OAuth for next cycle, too, since that client needs to limit its
memory usage.
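
For illustration, the depth check could look roughly like this;
JSON_NESTING_TOO_DEEP, the JsonParserStack struct, and the cap are all made
up for the sketch and are not existing jsonapi.h symbols.

#include "postgres.h"
#include "common/jsonapi.h"

#define JS_MAX_NESTING	6400	/* arbitrary cap, for the sketch only */
#define JSON_NESTING_TOO_DEEP	((JsonParseErrorType) 100)	/* hypothetical code */

/* Hypothetical prediction stack; the real structure may differ. */
typedef struct JsonParserStack
{
	int			depth;
	char	   *prediction;		/* preallocated to JS_MAX_NESTING entries */
} JsonParserStack;

/* Return an error code instead of growing the stack without bound. */
static JsonParseErrorType
push_prediction(JsonParserStack *pstack, char production)
{
	if (pstack->depth >= JS_MAX_NESTING)
		return JSON_NESTING_TOO_DEEP;

	pstack->prediction[pstack->depth++] = production;
	return JSON_SUCCESS;
}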

--Jacob




Re: WIP Incremental JSON Parser

2024-03-25 Thread Andrew Dunstan
On Mon, Mar 25, 2024 at 7:12 PM Jacob Champion <
jacob.champ...@enterprisedb.com> wrote:

> On Mon, Mar 25, 2024 at 4:02 PM Andrew Dunstan 
> wrote:
> > Well, what's the alternative? The current parser doesn't check stack
> depth in frontend code. Presumably it too will eventually just run out of
> memory, possibly rather sooner as the stack frames could  be more expensive
> than the incremental parser stack extensions.
>
> Stack size should be pretty limited, at least on the platforms I'm
> familiar with. So yeah, the recursive descent will segfault pretty
> quickly, but it won't repalloc() an unbounded amount of heap space.
> The alternative would just be to go back to a hardcoded limit in the
> short term, I think.
>
>
>
OK, so we invent a new error code and have the parser  return that if the
stack depth gets too big?

cheers

andrew


Re: WIP Incremental JSON Parser

2024-03-25 Thread Jacob Champion
On Mon, Mar 25, 2024 at 4:12 PM Jacob Champion
 wrote:
> Stack size should be pretty limited, at least on the platforms I'm
> familiar with. So yeah, the recursive descent will segfault pretty
> quickly, but it won't repalloc() an unbounded amount of heap space.
> The alternative would just be to go back to a hardcoded limit in the
> short term, I think.

And I should mention that there are other ways to consume a bunch of
memory, but I think they're bounded by the size of the JSON file.
Looks like the repalloc()s amplify the JSON size by a factor of ~20
(JS_MAX_PROD_LEN + sizeof(char*) + sizeof(bool)). That may or may not
be enough to be concerned about in the end, since I think it's still
linear, but I wanted to make sure it was known.

--Jacob




Re: session username in default psql prompt?

2024-03-25 Thread Andrew Dunstan
On Mon, Mar 25, 2024 at 9:14 AM Jelte Fennema-Nio 
wrote:

> On Mon, 25 Mar 2024 at 14:06, Robert Haas  wrote:
> > On Mon, Mar 25, 2024 at 4:30 AM Jelte Fennema-Nio 
> wrote:
> > > That problem seems easy to address by adding a newline into the
> > > default prompt.
> >
> > Ugh. Please, no!
>
> I guess it's partially a matter of taste, but personally I'm never
> going back to a single line prompt. It's so nice for zoomed-in demos
> that your SQL queries don't get broken up.
>


Very much a matter of taste. I knew when I saw your suggestion there would
be some kickback. If horizontal space is at a premium, vertical space is
doubly so, I suspect.

cheers

andrew


Re: WIP Incremental JSON Parser

2024-03-25 Thread Jacob Champion
On Mon, Mar 25, 2024 at 4:02 PM Andrew Dunstan  wrote:
> Well, what's the alternative? The current parser doesn't check stack depth in 
> frontend code. Presumably it too will eventually just run out of memory, 
> possibly rather sooner as the stack frames could  be more expensive than the 
> incremental parser stack extensions.

Stack size should be pretty limited, at least on the platforms I'm
familiar with. So yeah, the recursive descent will segfault pretty
quickly, but it won't repalloc() an unbounded amount of heap space.
The alternative would just be to go back to a hardcoded limit in the
short term, I think.

--Jacob




Re: WIP Incremental JSON Parser

2024-03-25 Thread Andrew Dunstan
On Mon, Mar 25, 2024 at 6:15 PM Jacob Champion <
jacob.champ...@enterprisedb.com> wrote:

> On Wed, Mar 20, 2024 at 11:56 PM Andrew Dunstan 
> wrote:
> > Thanks, included that and attended to the other issues we discussed. I
> think this is pretty close now.
>
> Okay, looking over the thread, there are the following open items:
> - extend the incremental test in order to exercise the semantic callbacks
> [1]
>


Yeah, I'm on a super long plane trip later this week, so I might get it
done then :-)


> - add Assert calls in impossible error cases [2]
>

ok, will do


> - error out if the non-incremental lex doesn't consume the entire token [2]
>

ok, will do


> - double-check that out of memory is an appropriate failure mode for
> the frontend [3]
>


Well, what's the alternative? The current parser doesn't check stack depth
in frontend code. Presumably it too will eventually just run out of memory,
possibly rather sooner as the stack frames could  be more expensive than
the incremental parser stack extensions.



>
> Just as a general style nit:
>
> > +   if (lex->incremental)
> > +   {
> > +   lex->input = lex->token_terminator = lex->line_start = json;
> > +   lex->input_length = len;
> > +   lex->inc_state->is_last_chunk = is_last;
> > +   }
> > +   else
> > +   return JSON_INVALID_LEXER_TYPE;
>
> I think flipping this around would probably make it more readable;
> something like:
>
> if (!lex->incremental)
> return JSON_INVALID_LEXER_TYPE;
>
> lex->input = ...
>
>
>
Noted. will do, Thanks.

cheers

andrew



> [1]
> https://www.postgresql.org/message-id/CAOYmi%2BnHV55Uhz%2Bo-HKq0GNiWn2L5gMcuyRQEz_fqpGY%3DpFxKA%40mail.gmail.com
> [2]
> https://www.postgresql.org/message-id/CAD5tBcLi2ffZkktV2qrsKSBykE-N8CiYgrfbv0vZ-F7%3DxLFeqw%40mail.gmail.com
> [3]
> https://www.postgresql.org/message-id/CAOYmi%2BnY%3DrF6dJCzaOuA3d-3FbwXCcecOs_S1NutexFA3dRXAw%40mail.gmail.com
>


Re: WIP Incremental JSON Parser

2024-03-25 Thread Jacob Champion
On Wed, Mar 20, 2024 at 11:56 PM Andrew Dunstan  wrote:
> Thanks, included that and attended to the other issues we discussed. I think 
> this is pretty close now.

Okay, looking over the thread, there are the following open items:
- extend the incremental test in order to exercise the semantic callbacks [1]
- add Assert calls in impossible error cases [2]
- error out if the non-incremental lex doesn't consume the entire token [2]
- double-check that out of memory is an appropriate failure mode for
the frontend [3]

Just as a general style nit:

> +   if (lex->incremental)
> +   {
> +   lex->input = lex->token_terminator = lex->line_start = json;
> +   lex->input_length = len;
> +   lex->inc_state->is_last_chunk = is_last;
> +   }
> +   else
> +   return JSON_INVALID_LEXER_TYPE;

I think flipping this around would probably make it more readable;
something like:

if (!lex->incremental)
return JSON_INVALID_LEXER_TYPE;

lex->input = ...

Thanks,
--Jacob

[1] 
https://www.postgresql.org/message-id/CAOYmi%2BnHV55Uhz%2Bo-HKq0GNiWn2L5gMcuyRQEz_fqpGY%3DpFxKA%40mail.gmail.com
[2] 
https://www.postgresql.org/message-id/CAD5tBcLi2ffZkktV2qrsKSBykE-N8CiYgrfbv0vZ-F7%3DxLFeqw%40mail.gmail.com
[3] 
https://www.postgresql.org/message-id/CAOYmi%2BnY%3DrF6dJCzaOuA3d-3FbwXCcecOs_S1NutexFA3dRXAw%40mail.gmail.com




Re: Teach predtest about IS [NOT] proofs

2024-03-25 Thread Tom Lane
James Coleman  writes:
> [ v6 patchset ]

I went ahead and committed 0001 after one more round of review

statements; my bad).  I also added the changes in test_predtest.c from
0002.  I attach a rebased version of 0002, as well as 0003 which isn't
changed, mainly to keep the cfbot happy.

I'm still not happy with what you did in predicate_refuted_by_recurse:
it feels wrong and rather expensively so.  There has to be a better
way.  Maybe strong vs. weak isn't quite the right formulation for
refutation tests?

regards, tom lane

diff --git a/src/backend/optimizer/util/predtest.c b/src/backend/optimizer/util/predtest.c
index 6e3b376f3d..5bb5bb4f0e 100644
--- a/src/backend/optimizer/util/predtest.c
+++ b/src/backend/optimizer/util/predtest.c
@@ -99,6 +99,8 @@ static bool predicate_implied_by_simple_clause(Expr *predicate, Node *clause,
 			   bool weak);
 static bool predicate_refuted_by_simple_clause(Expr *predicate, Node *clause,
 			   bool weak);
+static bool predicate_implied_not_null_by_clause(Expr *predicate, Node *clause,
+ bool weak);
 static Node *extract_not_arg(Node *clause);
 static Node *extract_strong_not_arg(Node *clause);
 static bool clause_is_strict_for(Node *clause, Node *subexpr, bool allow_false);
@@ -198,6 +200,11 @@ predicate_implied_by(List *predicate_list, List *clause_list,
  * (i.e., B must yield false or NULL).  We use this to detect mutually
  * contradictory WHERE clauses.
  *
+ * A notable difference between implication and refutation proofs is that
+ * strong/weak refutations don't vary the input of A (both must be true) but
+ * vary the allowed outcomes of B (false vs. non-truth), while for implications
+ * we vary both A (truth vs. non-falsity) and B (truth vs. non-falsity).
+ *
  * Weak refutation can be proven in some cases where strong refutation doesn't
  * hold, so it's useful to use it when possible.  We don't currently have
  * support for disproving one CHECK constraint based on another one, nor for
@@ -740,6 +747,16 @@ predicate_refuted_by_recurse(Node *clause, Node *predicate,
 			 !weak))
 return true;
 
+			/*
+			 * Because weak refutation expands the allowed outcomes for B
+			 * from "false" to "false or null", we can additionally prove
+			 * weak refutation in the case that strong refutation is proven.
+			 */
+			if (weak && not_arg &&
+predicate_implied_by_recurse(predicate, not_arg,
+			 true))
+return true;
+
 			switch (pclass)
 			{
 case CLASS_AND:
@@ -1137,21 +1154,27 @@ predicate_implied_by_simple_clause(Expr *predicate, Node *clause,
 
 	Assert(list_length(op->args) == 2);
 	rightop = lsecond(op->args);
-	/* We might never see null Consts here, but better check */
-	if (rightop && IsA(rightop, Const) &&
-		!((Const *) rightop)->constisnull)
+	if (rightop && IsA(rightop, Const))
 	{
+		Const	*constexpr = (Const *) rightop;
 		Node	   *leftop = linitial(op->args);
 
-		if (DatumGetBool(((Const *) rightop)->constvalue))
+		/*
+		 * We might never see a null Const here, but better
+		 * check anyway.
+		 */
+		if (constexpr->constisnull)
+			return false;
+
+		if (DatumGetBool(constexpr->constvalue))
 		{
-			/* X = true implies X */
+			/* x = true implies x */
 			if (equal(predicate, leftop))
 return true;
 		}
 		else
 		{
-			/* X = false implies NOT X */
+			/* x = false implies NOT x */
 			if (is_notclause(predicate) &&
 equal(get_notclausearg(predicate), leftop))
 return true;
@@ -1160,6 +1183,97 @@ predicate_implied_by_simple_clause(Expr *predicate, Node *clause,
 }
 			}
 			break;
+		case T_NullTest:
+			{
+NullTest *clausentest = (NullTest *) clause;
+
+/*
+ * row IS NOT NULL does not act in the simple way we have in
+ * mind
+ */
+if (clausentest->argisrow)
+	return false;
+
+switch (clausentest->nulltesttype)
+{
+	case IS_NULL:
+		/*
+		 * A clause in the form "foo IS NULL" implies a
+		 * predicate "NOT foo" that is strict for "foo", but
+		 * only weakly since "foo" being null will result in
+		 * the clause evaluating to true while the predicate
+		 * will evaluate to null.
+		 */
+		if (weak && is_notclause(predicate) &&
+			clause_is_strict_for((Node *) get_notclausearg(predicate), (Node *) clausentest->arg, true))
+			return true;
+
+		break;
+	case IS_NOT_NULL:
+		break;
+}
+			}
+			break;
+		case T_BooleanTest:
+			{
+BooleanTest	*clausebtest = (BooleanTest *) clause;
+
+switch (clausebtest->booltesttype)
+{
+	case IS_TRUE:
+		/* x IS TRUE implies x */
+		if (equal(predicate, clausebtest->arg))
+			return true;
+		break;
+	case IS_FALSE:
+		/* x IS FALSE implies NOT x */
+		if (is_notclause(predicate) &&
+			equal(get_notclausearg(predicate), clausebtest->arg))
+			ret

Re: Add bump memory context type and use it for tuplesorts

2024-03-25 Thread Tom Lane
David Rowley  writes:
> On Tue, 26 Mar 2024 at 03:53, Tom Lane  wrote:
>> Could we move the knowledge of exactly which context type it is out
>> of the per-chunk header and keep it in the block header?

> I wasn't 100% clear on your opinion about using 010 vs expanding the
> bit-space. Based on the following it sounded like you were not
> outright rejecting the idea of consuming the 010 pattern.

What I said earlier was that 010 was the least bad choice if we
fail to do any expansibility work; but I'm not happy with failing
to do that.

Basically, I'm not happy with consuming the last reasonably-available
pattern for a memory context type that has little claim to being the
Last Context Type We Will Ever Want.  Rather than making a further
dent in our ability to detect corrupted chunks, we should do something
towards restoring the expansibility that existed in the original
design.  Then we can add bump contexts and whatever else we want.

regards, tom lane




Re: add AVX2 support to simd.h

2024-03-25 Thread Nathan Bossart
Here is what I have staged for commit.  One notable difference in this
version of the patch is that I've changed

+   if (nelem <= nelem_per_iteration)
+   goto one_by_one;

to

+   if (nelem < nelem_per_iteration)
+   goto one_by_one;

I realized that there's no reason to jump to the one-by-one linear search
code when nelem == nelem_per_iteration, as the worst thing that will happen
is that we'll process all the elements twice if the value isn't present in
the array.  My benchmark that I've been using also shows a significant
speedup for this case with this change (on the order of 75%), which I
imagine might be due to a combination of branch prediction, caching, fewer
instructions, etc.
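
For reference, a minimal usage sketch of the interface this optimizes (the
caller name is made up; the pg_lfind32() signature is as in the patch below):

#include "postgres.h"
#include "port/pg_lfind.h"

/*
 * Returns true if 'xid' appears anywhere in the first 'nrunning' elements
 * of the array.  With this patch the search runs in blocks of four SIMD
 * registers whenever nrunning >= nelem_per_iteration, and only falls back
 * to the one-by-one loop for smaller arrays.
 */
static bool
xid_is_running(uint32 xid, uint32 *running_xids, uint32 nrunning)
{
	return pg_lfind32(xid, running_xids, nrunning);
}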

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 1dd970248efd3c5ae1736c0dd1d61fbabbb6c101 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Mon, 25 Mar 2024 16:21:45 -0500
Subject: [PATCH v9 1/1] Micro-optimize pg_lfind32().

This commit improves the performance of pg_lfind32() in many cases
by modifying it to process the remaining "tail" of elements with
SIMD instructions instead of processing them one-by-one.  Since the
SIMD code processes a large block of elements, this means that we
will process a subset of elements more than once, but that won't
affect the correctness of the result, and testing has shown that
this helps more cases than it regresses.  With this change, the
standard one-by-one linear search code is only used for small
arrays and for platforms without SIMD support.

Furthermore, this commit restructures pg_lfind32() to minimize
branching, which should also improve performance.

Suggested-by: John Naylor
Reviewed-by: John Naylor
Discussion: https://postgr.es/m/20231129171526.GA857928%40nathanxps13
---
 src/include/port/pg_lfind.h | 114 
 1 file changed, 76 insertions(+), 38 deletions(-)

diff --git a/src/include/port/pg_lfind.h b/src/include/port/pg_lfind.h
index b8dfa66eef..dbc3e9fc6a 100644
--- a/src/include/port/pg_lfind.h
+++ b/src/include/port/pg_lfind.h
@@ -80,6 +80,51 @@ pg_lfind8_le(uint8 key, uint8 *base, uint32 nelem)
 	return false;
 }
 
+#ifndef USE_NO_SIMD
+/*
+ * pg_lfind32_simd_helper
+ *
+ * Searches one 4-register-block of integers.  The caller is responsible for
+ * ensuring that there are at least 4-registers-worth of integers remaining.
+ */
+static inline bool
+pg_lfind32_simd_helper(const Vector32 keys, uint32 *base)
+{
+	const uint32 nelem_per_vector = sizeof(Vector32) / sizeof(uint32);
+	Vector32	vals1,
+vals2,
+vals3,
+vals4,
+result1,
+result2,
+result3,
+result4,
+tmp1,
+tmp2,
+result;
+
+	/* load the next block into 4 registers */
+	vector32_load(&vals1, base);
+	vector32_load(&vals2, &base[nelem_per_vector]);
+	vector32_load(&vals3, &base[nelem_per_vector * 2]);
+	vector32_load(&vals4, &base[nelem_per_vector * 3]);
+
+	/* compare each value to the key */
+	result1 = vector32_eq(keys, vals1);
+	result2 = vector32_eq(keys, vals2);
+	result3 = vector32_eq(keys, vals3);
+	result4 = vector32_eq(keys, vals4);
+
+	/* combine the results into a single variable */
+	tmp1 = vector32_or(result1, result2);
+	tmp2 = vector32_or(result3, result4);
+	result = vector32_or(tmp1, tmp2);
+
+	/* return whether there was a match */
+	return vector32_is_highbit_set(result);
+}
+#endif			/* ! USE_NO_SIMD */
+
 /*
  * pg_lfind32
  *
@@ -95,8 +140,7 @@ pg_lfind32(uint32 key, uint32 *base, uint32 nelem)
 
 	/*
 	 * For better instruction-level parallelism, each loop iteration operates
-	 * on a block of four registers.  Testing for SSE2 has showed this is ~40%
-	 * faster than using a block of two registers.
+	 * on a block of four registers.
 	 */
 	const Vector32 keys = vector32_broadcast(key);	/* load copies of key */
 	const uint32 nelem_per_vector = sizeof(Vector32) / sizeof(uint32);
@@ -109,9 +153,9 @@ pg_lfind32(uint32 key, uint32 *base, uint32 nelem)
 	bool		assert_result = false;
 
 	/* pre-compute the result for assert checking */
-	for (i = 0; i < nelem; i++)
+	for (int j = 0; j < nelem; j++)
 	{
-		if (key == base[i])
+		if (key == base[j])
 		{
 			assert_result = true;
 			break;
@@ -119,47 +163,41 @@ pg_lfind32(uint32 key, uint32 *base, uint32 nelem)
 	}
 #endif
 
-	for (i = 0; i < tail_idx; i += nelem_per_iteration)
+	/*
+	 * If there aren't enough elements for the SIMD code, jump to the standard
+	 * one-by-one linear search code.
+	 */
+	if (nelem < nelem_per_iteration)
+		goto one_by_one;
+
+	/*
+	 * Process as many elements as possible with a block of 4 registers.
+	 */
+	do
 	{
-		Vector32	vals1,
-	vals2,
-	vals3,
-	vals4,
-	result1,
-	result2,
-	result3,
-	result4,
-	tmp1,
-	tmp2,
-	result;
-
-		/* load the next block into 4 registers */
-		vector32_load(&vals1, &base[i]);
-		vector32_load(&vals2, &base[i + nelem_per_vector]);
-		vector32_load(&vals3, &base[i + nelem_per_vector * 2]);
-		vector32_load(&vals4

Re: Parallel Bitmap Heap Scan reports per-worker stats in EXPLAIN ANALYZE

2024-03-25 Thread Melanie Plageman
On Mon, Mar 25, 2024 at 2:29 AM Donghang Lin  wrote:
>
>
> > On Sat, Feb 17, 2024 at 2:31 PM Tomas Vondra 
> >  wrote:
> > 2) Leader vs. worker counters
> >
> > It seems to me this does nothing to add the per-worker values from "Heap
> > Blocks" into the leader, which means we get stuff like this:
> >
> > Heap Blocks: exact=102 lossy=10995
> > Worker 0:  actual time=50.559..209.773 rows=215253 loops=1
> >Heap Blocks: exact=207 lossy=19354
> > Worker 1:  actual time=50.543..211.387 rows=162934 loops=1
> >Heap Blocks: exact=161 lossy=14636
> >
> > I think this is wrong / confusing, and inconsistent with what we do for
> > other nodes. It's also inconsistent with how we deal e.g. with BUFFERS,
> > where we *do* add the values to the leader:
> >
> > Heap Blocks: exact=125 lossy=10789
> > Buffers: shared hit=11 read=45420
> > Worker 0:  actual time=51.419..221.904 rows=150437 loops=1
> >   Heap Blocks: exact=136 lossy=13541
> >   Buffers: shared hit=4 read=13541
> > Worker 1:  actual time=56.610..222.469 rows=229738 loops=1
> >   Heap Blocks: exact=209 lossy=20655
> >   Buffers: shared hit=4 read=20655
> >
> > Here it's not entirely obvious, because leader participates in the
> > execution, but once we disable leader participation, it's clearer:
> >
> > Buffers: shared hit=7 read=45421
> > Worker 0:  actual time=28.540..247.683 rows=309112 loops=1
> >   Heap Blocks: exact=282 lossy=27806
> >   Buffers: shared hit=4 read=28241
> > Worker 1:  actual time=24.290..251.993 rows=190815 loops=1
> >   Heap Blocks: exact=188 lossy=17179
> >   Buffers: shared hit=3 read=17180
> >
> > Not only is "Buffers" clearly a sum of per-worker stats, but the "Heap
> > Blocks" simply disappeared because the leader does nothing and we don't
> > print zeros.
>
> Heap Blocks is specific to Bitmap Heap Scan. It seems that node-specific stats
> do not aggregate workers' stats into the leader for some existing nodes. For
> example, the Memoize node for Hits, Misses, etc.
>
>->  Nested Loop (actual rows=17 loops=3)
>  ->  Parallel Seq Scan on t (actual rows=3 loops=3)
>  ->  Memoize (actual rows=5 loops=10)
>Cache Key: t.j
>Cache Mode: logical
>Hits: 32991  Misses: 5  Evictions: 0  Overflows: 0  Memory 
> Usage: 2kB
>Worker 0:  Hits: 33551  Misses: 5  Evictions: 0  Overflows: 0  
> Memory Usage: 2kB
>Worker 1:  Hits: 33443  Misses: 5  Evictions: 0  Overflows: 0  
> Memory Usage: 2kB
>->  Index Scan using uj on u (actual rows=5 loops=15)
>  Index Cond: (j = t.j)
>
> Sort, HashAggregate also do the same stuff.
>
> > 3) I'm not sure dealing with various EXPLAIN flags may not be entirely
> > correct. Consider this:
> >
> > EXPLAIN (ANALYZE):
> >
> >->  Parallel Bitmap Heap Scan on t  (...)
> >  Recheck Cond: (a < 5000)
> >  Rows Removed by Index Recheck: 246882
> >  Worker 0:  Heap Blocks: exact=168 lossy=15648
> >  Worker 1:  Heap Blocks: exact=302 lossy=29337
> >
> > EXPLAIN (ANALYZE, VERBOSE):
> >
> >->  Parallel Bitmap Heap Scan on public.t  (...)
> >  Recheck Cond: (t.a < 5000)
> >  Rows Removed by Index Recheck: 246882
> >  Worker 0:  actual time=35.067..300.882 rows=282108 loops=1
> >Heap Blocks: exact=257 lossy=25358
> >  Worker 1:  actual time=32.827..302.224 rows=217819 loops=1
> >Heap Blocks: exact=213 lossy=19627
> >
> > EXPLAIN (ANALYZE, BUFFERS):
> >
> >->  Parallel Bitmap Heap Scan on t  (...)
> >  Recheck Cond: (a < 5000)
> >  Rows Removed by Index Recheck: 246882
> >  Buffers: shared hit=7 read=45421
> >  Worker 0:  Heap Blocks: exact=236 lossy=21870
> >  Worker 1:  Heap Blocks: exact=234 lossy=23115
> >
> > EXPLAIN (ANALYZE, VERBOSE, BUFFERS):
> >
> >->  Parallel Bitmap Heap Scan on public.t  (...)
> >  Recheck Cond: (t.a < 5000)
> >  Rows Removed by Index Recheck: 246882
> >  Buffers: shared hit=7 read=45421
> >  Worker 0:  actual time=28.265..260.381 rows=261264 loops=1
> >Heap Blocks: exact=260 lossy=23477
> >Buffers: shared hit=3 read=23478
> >  Worker 1:  actual time=28.224..261.627 rows=238663 loops=1
> >Heap Blocks: exact=210 lossy=21508
> >Buffers: shared hit=4 read=21943
> >
> > Why should the per-worker buffer info be shown when combined with the
> > VERBOSE flag, and not just with BUFFERS, when the patch shows the
> > per-worker info always?
> >
>
> It seems that the general explain print framework requires verbose mode to
> show per-worker stats; for example, that is how Buffers hits and JIT are
> printed. Some specific nodes that involve parallelism, however, always show
> worker blocks. This is why we see that some worker blocks don't have buffers
> stat

Re: Possibility to disable `ALTER SYSTEM`

2024-03-25 Thread Bruce Momjian
On Mon, Mar 25, 2024 at 09:40:55PM +0100, Jelte Fennema-Nio wrote:
> On Mon, 25 Mar 2024 at 20:16, Bruce Momjian  wrote:
> > I am wondering if the fact that you would be able to do:
> >
> > ALTER SYSTEM SET externally_managed_configuration = false
> >
> > and then be unable to use ALTER SYSTEM to revert the change is
> > significant.
> 
> This is not possible, due to the externally_managed_configuration GUC
> having the GUC_DISALLOW_IN_AUTO_FILE flag.

Ah, good, thanks.

> > Isn't "configuration" too generic a term for disabling ALTER SYSTEM?
> 
> maybe "externally_managed_auto_config"

How many people associate "auto" with ALTER SYSTEM?  I assume not many. 

To me, externally_managed_configuration is promising a lot more than it
delivers because there is still a lot of configuration it doesn't
control.  I am also confused why the purpose of the feature, external
management of configuration, is part of the variable name.  We usually
name parameters for what they control.

It seems this is really controlling the ability to alter system
variables at the SQL level, maybe sql_alter_system_vars.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: Add bump memory context type and use it for tuplesorts

2024-03-25 Thread David Rowley
On Tue, 26 Mar 2024 at 03:53, Tom Lane  wrote:
> I agree with this completely.  However, the current design for chunk
> headers is mighty restrictive about how many kinds of contexts we can
> have.  We need to open that back up.

Andres mentioned how we could do this in [1].  One possible issue with
that is that slab.c has no external chunks, so it would restrict slab to
512MB chunks.  I doubt that's ever going to realistically be an issue.
That's just not a good use case for slab, so I'd be ok with that.

> Could we move the knowledge of exactly which context type it is out
> of the per-chunk header and keep it in the block header?  This'd
> require that every context type have a standardized way of finding
> the block header from a chunk.  We could repurpose the existing
> MemoryContextMethodID bits to allow having a small number of different
> ways, perhaps.

I wasn't 100% clear on your opinion about using 010 vs expanding the
bit-space. Based on the following it sounded like you were not
outright rejecting the idea of consuming the 010 pattern.

On Sat, 17 Feb 2024 at 12:14, Tom Lane  wrote:
> If we do kick this can down the road, then I concur with eating 010
> next, as it seems the least likely to occur in glibc-malloced
> chunks.

David

[1] https://postgr.es/m/20240217200845.ywlwenjrlbyoc...@awork3.anarazel.de
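
To make the bit-space constraint concrete, here is a simplified illustration;
this is NOT the actual memutils_memorychunk.h layout, just the general shape
of the problem.

#include <stdint.h>

/*
 * Three low-order bits in the chunk header encode the owning context type,
 * so only eight patterns exist.  Leaving some patterns unused makes it more
 * likely that a wild pointer (e.g. into a glibc-malloc'd block) fails to
 * decode to a valid context type, which is the corruption-detection
 * property being traded away as patterns get consumed.
 */
#define MCTX_METHOD_BITS	3
#define MCTX_METHOD_MASK	((1U << MCTX_METHOD_BITS) - 1)	/* 0b111 */

static inline unsigned int
chunk_method_id(uint64_t header_word)
{
	return (unsigned int) (header_word & MCTX_METHOD_MASK);
}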




Re: Possibility to disable `ALTER SYSTEM`

2024-03-25 Thread Jelte Fennema-Nio
On Mon, 25 Mar 2024 at 20:16, Bruce Momjian  wrote:
> I am wondering if the fact that you would be able to do:
>
> ALTER SYSTEM SET externally_managed_configuration = false
>
> and then be unable to use ALTER SYSTEM to revert the change is
> significant.

This is not possible, due to the externally_managed_configuration GUC
having the GUC_DISALLOW_IN_AUTO_FILE flag.

> Isn't "configuration" too generic a term for disabling ALTER SYSTEM?

maybe "externally_managed_auto_config"




Re: Why is parula failing?

2024-03-25 Thread David Rowley
On Thu, 21 Mar 2024 at 14:19, Tom Lane  wrote:
>
> David Rowley  writes:
> > We could also do something like the attached just in case we're
> > barking up the wrong tree.
>
> Yeah, checking indisvalid isn't a bad idea.  I'd put another
> one further down, just before the DROP of table ab, so we
> can see the state both before and after the unstable tests.

So it's taken quite a while to finally fail again.

Effectively, we're getting:

  relname  | relpages | reltuples | indisvalid | autovacuum_count | autoanalyze_count
 ----------+----------+-----------+------------+------------------+-------------------
- ab_a2_b2 |        0 |        -1 |            |                0 |                 0
+ ab_a2_b2 |        0 |        48 |            |                0 |                 0

I see AddNewRelationTuple() does set reltuples to -1, so I can't quite
figure out why 48 is in there.  Even if auto-analyze had somehow
mistakenly run and the autoanalyze_count stats just were not
up-to-date yet, the table has zero blocks, and I don't see how
acquire_sample_rows() would set *totalrows to anything other than 0.0
in this case.  For the vacuum case, I see that reltuples is set from:

/* now we can compute the new value for pg_class.reltuples */
vacrel->new_live_tuples = vac_estimate_reltuples(vacrel->rel, rel_pages,
vacrel->scanned_pages,
vacrel->live_tuples);

Again, hard to see how that could come to anything other than zero
given that rel_pages and scanned_pages should be 0.

Looking at the binary representation of a float of -1 vs 48, they're
not nearly the same: 0xBF800000 vs 0x42400000, so it's not looking
like a flipped bit.
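
For reference, a quick way to double-check those bit patterns:

#include <inttypes.h>
#include <stdio.h>
#include <string.h>

/* Print the raw IEEE 754 bit patterns of the two reltuples values. */
int
main(void)
{
	float		vals[] = {-1.0f, 48.0f};

	for (int i = 0; i < 2; i++)
	{
		uint32_t	bits;

		memcpy(&bits, &vals[i], sizeof(bits));
		printf("%g -> 0x%08" PRIX32 "\n", vals[i], bits);
	}
	return 0;					/* prints -1 -> 0xBF800000 and 48 -> 0x42400000 */
}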

It would be good to have log_autovacuum_min_duration = 0 on this
machine for a while.

David




Re: Large block sizes support in Linux

2024-03-25 Thread Thomas Munro
On Tue, Mar 26, 2024 at 3:34 AM Pankaj Raghav  wrote:
> One question: Does ZFS do something like FUA request to force the device
> to clear the cache before it can update the node to point to the new page?
>
> If it doesn't do it, there is no guarantee from device to update the data
> atomically unless it has bigger atomic guarantees?

It flushes the whole disk write cache (unless you turn that off).
AFAIK it can't use FUA instead yet (it knows some things about it;
there are mentions under the Linux-specific parts of the tree, but that
may be more to do with understanding and implementing it when
exporting a virtual block device, or something like that (?), and I
don't believe it knows how to use it for its own underlying log or
ordering).  FUA would clearly be better: no waiting for random extra
data to be flushed.




Re: pgsql: Track last_inactive_time in pg_replication_slots.

2024-03-25 Thread Bharath Rupireddy
On Tue, Mar 26, 2024 at 1:30 AM Nathan Bossart  wrote:
>
> On Mon, Mar 25, 2024 at 04:49:12PM +, Bertrand Drouvot wrote:
> > On Mon, Mar 25, 2024 at 12:25:37PM -0400, Robert Haas wrote:
> >> In the same vein, I think deactivated_at or inactive_since might be
> >> good names to consider. I think they get at the same thing as
> >> released_time, but they avoid introducing a completely new word
> >> (release, as opposed to active/inactive).
> >
> > Yeah, I'd vote for inactive_since then.
>
> Having only skimmed some of the related discussions, I'm inclined to agree
> that inactive_since provides the clearest description for the column.

I think we all have some agreement on inactive_since. So, I'm
attaching the patch for that change.

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


v1-0001-Use-a-less-confusing-name-for-slot-s-last_inactiv.patch
Description: Binary data


Re: Large block sizes support in Linux

2024-03-25 Thread Bruce Momjian
On Mon, Mar 25, 2024 at 02:53:56PM +0100, Pankaj Raghav wrote:
> This is an excellent question that needs a bit of community discussion to
> expose a device agnostic value that userspace can trust.
> 
> There might be a talk this year at LSFMM about untorn writes[1] in buffered IO
> path. I will make sure to bring this question up.
> 
> At the moment, Linux exposes the physical block size by also taking atomic
> guarantees into account; especially for NVMe it uses the NAWUPF and AWUPF
> while setting the physical block size (/sys/block//queue/physical_block_size).
> 
> A system admin could use the value exposed by phy_bs as a hint to set
> full_page_writes=off.
> Of course this also requires the device to give atomic guarantees.
> 
> The most optimal would be DB page size == FS block size == Device atomic size.

One other thing I remember is that some people modified the ZFS file
system parameters enough that they made Postgres non-durable and
corrupted their database.  This is a very hard thing to get right
because the user has very little feedback when they break things.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: Popcount optimization using AVX512

2024-03-25 Thread Nathan Bossart
On Mon, Mar 25, 2024 at 06:42:36PM +, Amonson, Paul D wrote:
> Ok, CI turned green after my re-post of the patches.  Can this please get
> merged?

Thanks for the new patches.  I intend to take another look soon.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: pgsql: Track last_inactive_time in pg_replication_slots.

2024-03-25 Thread Nathan Bossart
On Mon, Mar 25, 2024 at 04:49:12PM +, Bertrand Drouvot wrote:
> On Mon, Mar 25, 2024 at 12:25:37PM -0400, Robert Haas wrote:
>> In the same vein, I think deactivated_at or inactive_since might be
>> good names to consider. I think they get at the same thing as
>> released_time, but they avoid introducing a completely new word
>> (release, as opposed to active/inactive).
> 
> Yeah, I'd vote for inactive_since then.

Having only skimmed some of the related discussions, I'm inclined to agree
that inactive_since provides the clearest description for the column.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: New Table Access Methods for Multi and Single Inserts

2024-03-25 Thread Bharath Rupireddy
On Sat, Mar 23, 2024 at 5:47 AM Jeff Davis  wrote:
>
> Comments:

Thanks for looking into it.

> * Do I understand correctly that CMV, RMV, and CTAS experience a
> performance benefit, but COPY FROM does not? And is that because COPY
> already used table_multi_insert, whereas CMV and RMV did not?

Yes, that's right. COPY FROM is already optimized with multi inserts.

I now have a feeling that I need to simplify the patches. I'm thinking
of dropping the COPY FROM patch using the new multi insert API for the
following reasons:
1. We can now remove some of the new APIs (table_multi_insert_slots
and table_multi_insert_next_free_slot) that were just invented for
COPY FROM.
2. COPY FROM is already optimized with multi inserts, so no real gain
is expected with the new multi insert API.
3. As we are inching towards feature freeze, simplifying the patches
by having only the necessary things increases the probability of
getting this in.
4. The real benefit of this whole new multi insert API is seen if used
for the commands CMV, RMV, CTAS. These commands got faster by 62.54%,
68.87%, 74.31% or 2.67, 3.21, 3.89 times respectively.
5. This leaves us with really simple APIs. No need for callback stuff for
dealing with indexes, triggers etc., as CMV, RMV, CTAS cannot have any
of them.

The new APIs are more extensible, memory management is taken care of
by the AM, and they use TableModifyState as the structure name and more
meaningful API names. The callbacks for triggers/indexes etc. aren't
taken care of as I'm now only focusing on the CTAS, CMV, RMV
optimizations.

Please see the attached v14 patches.
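
For reference, the intended usage pattern looks roughly like this; the sketch
uses only the functions visible in the patch below, and the final flush/end
call is left as a comment because it is not shown in this excerpt.

#include "postgres.h"
#include "access/heapam.h"

static void
insert_buffered(Relation rel, TupleTableSlot **srcslots, int nslots,
				CommandId cid)
{
	TableModifyState *mstate;

	/* set up buffered multi-inserts with a bulk-write buffer access strategy */
	mstate = heap_modify_begin(TM_KIND_INSERT, rel,
							   TM_FLAG_MULTI_INSERTS | TM_FLAG_BAS_BULKWRITE);

	for (int i = 0; i < nslots; i++)
		heap_modify_buffer_insert(mstate, cid, 0 /* options */ , srcslots[i]);

	/*
	 * A heap_modify_end()-style call would flush any remaining buffered
	 * tuples and release the state; its exact name is not visible in this
	 * excerpt.
	 */
}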

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From 2de89705c6b2d03020988db0cc8857a0bf19b38e Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Mon, 25 Mar 2024 07:09:25 +
Subject: [PATCH v14 1/3] Introduce table modify access methods

---
 src/backend/access/heap/heapam.c | 163 +++
 src/backend/access/heap/heapam_handler.c |   6 +
 src/include/access/heapam.h  |  48 +++
 src/include/access/tableam.h | 103 ++
 src/tools/pgindent/typedefs.list |   4 +
 5 files changed, 324 insertions(+)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 34bc60f625..d1ef2464ef 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -64,6 +64,7 @@
 #include "storage/standby.h"
 #include "utils/datum.h"
 #include "utils/inval.h"
+#include "utils/memutils.h"
 #include "utils/relcache.h"
 #include "utils/snapmgr.h"
 #include "utils/spccache.h"
@@ -2442,6 +2443,168 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
 	pgstat_count_heap_insert(relation, ntuples);
 }
 
+/*
+ * Initialize heap modify state.
+ */
+TableModifyState *
+heap_modify_begin(TableModifyKind kind, Relation rel, int flags)
+{
+	TableModifyState *state;
+	MemoryContext context;
+	MemoryContext oldcontext;
+
+	context = AllocSetContextCreate(CurrentMemoryContext,
+	"heap_modify memory context",
+	ALLOCSET_DEFAULT_SIZES);
+
+	oldcontext = MemoryContextSwitchTo(context);
+
+	state = palloc0(sizeof(TableModifyState));
+	state->kind = kind;
+	state->rel = rel;
+	state->flags = flags;
+	state->mctx = context;
+
+	if (kind == TM_KIND_INSERT)
+	{
+		HeapInsertState *istate;
+
+		istate = (HeapInsertState *) palloc0(sizeof(HeapInsertState));
+		istate->bistate = NULL;
+		istate->mistate = NULL;
+		state->data = istate;
+
+		if ((flags & TM_FLAG_MULTI_INSERTS) != 0)
+		{
+			HeapMultiInsertState *mistate;
+
+			mistate = (HeapMultiInsertState *) palloc0(sizeof(HeapMultiInsertState));
+			mistate->slots = (TupleTableSlot **) palloc0(sizeof(TupleTableSlot *) * HEAP_MAX_BUFFERED_SLOTS);
+			istate->mistate = mistate;
+		}
+
+		if ((flags & TM_FLAG_BAS_BULKWRITE) != 0)
+			istate->bistate = GetBulkInsertState();
+	}
+
+	MemoryContextSwitchTo(oldcontext);
+
+	return state;
+}
+
+/*
+ * Store passed-in tuple into in-memory buffered slots. When full, insert
+ * multiple tuples from the buffers into heap.
+ */
+void
+heap_modify_buffer_insert(TableModifyState *state, CommandId cid,
+		  int options, TupleTableSlot *slot)
+{
+	TupleTableSlot *dstslot;
+	HeapInsertState *istate;
+	HeapMultiInsertState *mistate;
+	MemoryContext oldcontext;
+
+	Assert(state->kind == TM_KIND_INSERT);
+	istate = (HeapInsertState *) state->data;
+	Assert(istate->mistate != NULL);
+	mistate = istate->mistate;
+	Assert(istate->bistate != NULL);
+
+	oldcontext = MemoryContextSwitchTo(state->mctx);
+
+	dstslot = mistate->slots[mistate->cur_slots];
+	if (dstslot == NULL)
+	{
+		/*
+		 * We use virtual tuple slots buffered slots for leveraging the
+		 * optimization it provides to minimize physical data copying. The
+		 * virtual slot gets materialized when we copy (via below
+		 * ExecCopySlot) the tuples from the source slot which can be of any
+		 * type. This way, it 

Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-25 Thread Nathan Bossart
I apologize that I haven't been able to keep up with this thread for a
while, but I'm happy to see the continued interest in $SUBJECT.

On Sun, Mar 24, 2024 at 03:05:44PM +0530, Bharath Rupireddy wrote:
> This commit particularly lets one specify the inactive_timeout for
> a slot via SQL functions pg_create_physical_replication_slot and
> pg_create_logical_replication_slot.

Off-list, Bharath brought to my attention that the current proposal was to
set the timeout at the slot level.  While I think that is an entirely
reasonable thing to support, the main use-case I have in mind for this
feature is for an administrator that wants to prevent inactive slots from
causing problems (e.g., transaction ID wraparound) on a server or a number
of servers.  For that use-case, I think a GUC would be much more
convenient.  Perhaps there could be a default inactive slot timeout GUC
that would be used in the absence of a slot-level setting.  Thoughts?
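
As a sketch of that precedence (the GUC name and the slot field are
illustrative, not committed code):

#include "postgres.h"
#include "replication/slot.h"

int			replication_slot_inactive_timeout = 0; /* hypothetical GUC, seconds */

/* A per-slot setting, if present, would win over the server-wide default. */
static int
effective_inactive_timeout(ReplicationSlot *slot)
{
	if (slot->data.inactive_timeout > 0)
		return slot->data.inactive_timeout; /* slot-level setting */

	return replication_slot_inactive_timeout;	/* fall back to the GUC */
}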

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Combine Prune and Freeze records emitted by vacuum

2024-03-25 Thread Heikki Linnakangas

On 24/03/2024 18:32, Melanie Plageman wrote:

On Thu, Mar 21, 2024 at 9:28 AM Heikki Linnakangas  wrote:


In heap_page_prune_and_freeze(), we now do some extra work on each live
tuple, to set the all_visible_except_removable correctly. And also to
update live_tuples, recently_dead_tuples and hastup. When we're not
freezing, that's a waste of cycles, the caller doesn't care. I hope it's
enough that it doesn't matter, but is it?


Last year on an early version of the patch set I did some pgbench
tpcb-like benchmarks -- since there is a lot of on-access pruning in
that workload -- and I don't remember it being a showstopper. The code
has changed a fair bit since then. However, I think it might be safer
to pass a flag "by_vacuum" to heap_page_prune_and_freeze() and skip
the rest of the loop after heap_prune_satisfies_vacuum() when
on-access pruning invokes it. I had avoided that because it felt ugly
and error-prone, however it addresses a few other of your points as
well.


Ok. I'm not a fan of the name 'by_vacuum' though. It'd be nice if the 
argument described what it does, rather than who it's for. For example, 
'need_all_visible'. If set to true, the function determines 
'all_visible', otherwise it does not.


I started to look closer at the loops in heap_prune_chain() and how they 
update all the various flags and counters. There's a lot going on there. 
We have:


- live_tuples counter
- recently_dead_tuples counter
- all_visible[_except_removable]
- all_frozen
- visibility_cutoff_xid
- hastup
- prstate.frozen array
- nnewlpdead
- deadoffsets array

And that doesn't even include all the local variables and the final 
dead/redirected arrays.


Some of those are set in the first loop that initializes 'htsv' for each 
tuple on the page. Others are updated in heap_prune_chain(). Some are 
updated in both. It's hard to follow which are set where.


I think recently_dead_tuples is updated incorrectly, for tuples that are 
part of a completely dead HOT chain. For example, imagine a hot chain 
with two tuples: RECENTLY_DEAD -> DEAD. heap_prune_chain() would follow 
the chain, see the DEAD tuple at the end of the chain, and mark both 
tuples for pruning. However, we already updated 'recently_dead_tuples' 
in the first loop, which is wrong if we remove the tuple.


Maybe that's the only bug like this, but I'm a little scared. Is there 
something we could do to make this simpler? Maybe move all the new work 
that we added to the first loop, into heap_prune_chain() ? Maybe 
introduce a few more helper heap_prune_record_*() functions, to update 
the flags and counters also for live and insert/delete-in-progress 
tuples and for dead line pointers? Something like 
heap_prune_record_live() and heap_prune_record_lp_dead().



The 'frz_conflict_horizon' stuff is still fuzzy to me. (Not necessarily
these patches' fault). This at least is wrong, because Max(a, b)
doesn't handle XID wraparound correctly:


   if (do_freeze)
   conflict_xid = Max(prstate.snapshotConflictHorizon,
  presult->frz_conflict_horizon);
   else
   conflict_xid = prstate.snapshotConflictHorizon;


Then there's this in lazy_scan_prune():


   /* Using same cutoff when setting VM is now unnecessary */
   if (presult.all_frozen)
       presult.frz_conflict_horizon = InvalidTransactionId;

This does the right thing in the end, but if all the tuples are frozen
shouldn't frz_conflict_horizon already be InvalidTransactionId? The
comment says it's "newest xmin on the page", and if everything was
frozen, all xmins are FrozenTransactionId. In other words, that should
be moved to heap_page_prune_and_freeze() so that it doesn't lie to its
caller. Also, frz_conflict_horizon is only set correctly if
'all_frozen==true', would be good to mention that in the comments too.


Yes, this is a good point. I've spent some time swapping all of this
back into my head. I think we should change the names of all these
conflict horizon variables and introduce some local variables again.
In the attached patch, I've updated the name of the variable in
PruneFreezeResult to vm_conflict_horizon, as it is only used for
emitting a VM update record. Now, I don't set it until the end of
heap_page_prune_and_freeze(). It is only updated from
InvalidTransactionId if the page is not all frozen. As you say, if the
page is all frozen, there can be no conflict.


Makes sense.


I've also changed PruneState->snapshotConflictHorizon to
PruneState->latest_xid_removed.

And I introduced the local variables visibility_cutoff_xid and
frz_conflict_horizon. I think it is important we distinguish between
the latest xid pruned, the latest xmin of tuples frozen, and the
latest xid of all live tuples on the page.

Though we end up using visibility_cutoff_xid as the freeze conflict
horizon if the page is a

Re: Possibility to disable `ALTER SYSTEM`

2024-03-25 Thread Bruce Momjian
On Mon, Mar 25, 2024 at 01:29:46PM -0400, Robert Haas wrote:
> What is less clear is whether there is a consensus in favor of this
> particular method of disabling ALTER SYSTEM, namely, via a GUC. The
> two alternate approaches that seem to enjoy some level of support are
> (a) an extension or (b) changing the permissions on the files.

I am wondering if the fact that you would be able to do:

ALTER SYSTEM SET externally_managed_configuration = false

and then be unable to use ALTER SYSTEM to revert the change is
significant.  I can't think of many such cases.

Isn't "configuration" too generic a term for disabling ALTER SYSTEM?

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: pgsql: Clean up role created in new subscription test.

2024-03-25 Thread Andres Freund
Hi,

On 2024-01-19 15:40:21 +0100, Peter Eisentraut wrote:
> On 19.01.24 15:26, Daniel Gustafsson wrote:
> > > On 18 Jan 2024, at 01:57, vignesh C  wrote:
> > 
> > > There are a lot of failures in CFBot at [1] with:
> > 
> > > More details of the same are available at [2].
> > > Do we need to clean up the objects leftover for the reported issues in 
> > > the test?
> > 
> > Not really, these should not need cleaning up, and it's quite odd that it 
> > only
> > happens on FreeBSD.  I need to investigate further so I'll mark this 
> > waiting on
> > author in the meantime
> 
> Most likely because only the FreeBSD job uses
> ENFORCE_REGRESSION_TEST_NAME_RESTRICTIONS.

I don't think it's that, but that the freebsd task tests the installcheck
equivalent in meson.  I haven't checked what your patch is doing, but perhaps
the issue is that it's seeing global objects concurrently created by another
test?

Greetings,

Andres Freund




Re: Possibility to disable `ALTER SYSTEM`

2024-03-25 Thread Robert Haas
On Mon, Mar 25, 2024 at 2:26 PM Tom Lane  wrote:
> I wonder whether this feature should include teaching the server
> to ignore postgresql.auto.conf altogether, which would make it
> relatively easy to get to a bulletproof configuration.

This has been debated a few times on the thread already, but a number
of problems with that idea have been raised, and as far as I can see,
everyone who suggested went on to recant and agree that we shouldn't
do that. If you feel a strong need to relitigate that, please check
the prior discussion first.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: psql not responding to SIGINT upon db reconnection

2024-03-25 Thread Robert Haas
On Fri, Mar 22, 2024 at 4:58 PM Tristan Partin  wrote:
> I had a question about parameter naming. Right now I have a mix of
> camel-case and snake-case in the function signature since that is what
> I inherited. Should I change that to be consistent? If so, which case
> would you like?

Uh... PostgreSQL is kind of the wild west in that regard. The thing to
do is look for nearby precedents, but that doesn't help much here
because in the very same file, libpq-fe.h, we have:

extern int  PQsetResultAttrs(PGresult *res, int numAttributes,
                             PGresAttDesc *attDescs);
extern int  PQsetvalue(PGresult *res, int tup_num, int field_num,
                       char *value, int len);

Since the existing naming is consistent with one of those two styles,
I'd probably just leave it be.

+   The function returns a value greater than 0 if the specified condition
+   is met, 0 if a timeout occurred, or -1 if an error
+   or interrupt occurred. In the event forRead and

We either need to tell people how to find out which error it was, or
if that's not possible and we can't reasonably make it possible, we
need to tell them why they shouldn't care. Because there's nothing
more delightful than someone who shows up and says "hey, I tried to do
XYZ, and I got an error," as if that were sufficient information for
me to do something useful.

+   end_time is the time in the future in seconds starting from the UNIX
+   epoch in which you would like the function to return if the condition is not met.

This sentence seems a bit contorted to me, like maybe Yoda wrote it. I
was about to try to rephrase it and maybe split it in two when I
wondered why we need to document how time_t works at all. Can't we
just say something like "If end_time is not -1, it specifies the time
at which this function should stop waiting for the condition to be
met" -- and maybe move it to the end of the first paragraph, so it's
before where we list the meanings of the return values?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




RE: Popcount optimization using AVX512

2024-03-25 Thread Amonson, Paul D
> -Original Message-
> From: Amonson, Paul D 
> Sent: Monday, March 25, 2024 8:20 AM
> To: Tom Lane 
> Cc: David Rowley ; Nathan Bossart
> ; Andres Freund ; Alvaro
> Herrera ; Shankaran, Akash
> ; Noah Misch ; Matthias
> van de Meent ; pgsql-
> hack...@lists.postgresql.org
> Subject: RE: Popcount optimization using AVX512
>

Ok, CI turned green after my re-post of the patches.  Can this please get 
merged?

Thanks,
Paul





Re: Large block sizes support in Linux

2024-03-25 Thread Pankaj Raghav
On 23/03/2024 03:41, Bruce Momjian wrote:
> On Fri, Mar 22, 2024 at 10:31:11PM +0100, Tomas Vondra wrote:
>> Right, but things change over time - current storage devices support
>> much larger sectors (LBA format), usually 4K. And if you do I/O with
>> this size, it's usually atomic.
>>
>> AFAIK if you built Postgres with 4K pages, on a device with 4K LBA
>> format, that would not need full-page writes - we always do I/O in 4k
>> pages, and block layer does I/O (during writeback from page cache) with
>> minimum guaranteed size = logical block size. 4K are great for OLTP
>> systems in general, it'd be even better if we didn't need to worry about
>> torn pages (but the tricky part is to be confident it's safe to disable
>> them on a particular system).
> 
> Yes, even if the file system is 8k, and the storage is 8k, we only know
> that torn pages are impossible if the file system never overwrites
> existing 8k pages, but writes new ones and then makes it active.  I
> think ZFS does that to handle snapshots.
> 

I think we can also avoid torn writes:
- if filesystem's data path always writes in multiples of 8k (with alignment)
- device supports 8k atomic writes.

Then we might be able to push the responsibility to the device without having
the overhead of a CoW FS or FPW=on. Of course, the performance here depends on
the vendor-specific implementation of atomics.

We are trying to enable the former by adding LBS support to XFS in Linux.

--
Pankaj




Re: Large block sizes support in Linux

2024-03-25 Thread Pankaj Raghav
Hi Thomas,

On 23/03/2024 05:53, Thomas Munro wrote:
> On Fri, Mar 22, 2024 at 10:56 PM Pankaj Raghav (Samsung)
>  wrote:
>> My team and I have been working on adding Large block size(LBS)
>> support to XFS in Linux[1]. Once this feature lands upstream, we will be
>> able to create XFS with FS block size > page size of the system on Linux.
>> We also gave a talk about it in Linux Plumbers conference recently[2]
>> for more context. The initial support is only for XFS but more FSs will
>> follow later.
> 
> Very cool!
> 
> (I used XFS on IRIX in the 90s, and it had large blocks then, a
> feature lost in the port to Linux AFAIK.)
> 

Yes, I heard this also from the Maintainer of XFS that they had to drop
this functionality when they did the port. :)

>> On an x86_64 system, fs block size was limited to 4k, but traditionally
>> Postgres uses 8k as its default internal page size. With LBS support,
>> fs block size can be set to 8K, thereby matching the Postgres page size.
>>
>> If the file system block size == DB page size, then Postgres can have
>> guarantees that a single DB page will be written as a single unit during
>> kernel write back and not split.
>>
>> My knowledge of Postgres internals is limited, so I'm wondering if there
>> are any optimizations or potential optimizations that Postgres could
>> leverage once we have LBS support on Linux?
> 
> FWIW here are a couple of things I wrote about our storage atomicity
> problem, for non-PostgreSQL hackers who may not understand our project
> jargon:
> 
> https://wiki.postgresql.org/wiki/Full_page_writes
> https://freebsdfoundation.org/wp-content/uploads/2023/02/munro_ZFS.pdf
> 
This is very useful, thanks a lot.

> The short version is that we (and MySQL, via a different scheme with
> different tradeoffs) could avoid writing all our stuff out twice if we
> could count on atomic writes of a suitable size on power failure, so
> the benefits are very large.  As far as I know, there are two things
> we need from the kernel and storage to do that on "overwrite"
> filesystems like XFS:
> 
> 1.  The disk must promise that its atomicity-on-power-failure is a
> multiple of our block size -- something like NVMe AWUPF, right?  My
> devices seem to say 0 :-(  Or I guess the filesystem has to
> compensate, but then it's not exactly an overwrite filesystem
> anymore...
> 

0 means 1 logical block, which might be 4k in your case. Typically device
vendors have to put extra hardware to guarantee bigger atomic block sizes.

> 2.  The kernel must promise that there is no code path in either
> buffered I/O or direct I/O that will arbitrarily chop up our 8KB (or
> other configured block size) writes on some smaller boundary, most
> likely sector I guess, on their way to the device, as you were saying.
> Not just in happy cases, but even under memory pressure, if
> interrupted, etc etc.
> 
> Sounds like you're working on problem #2 which is great news.
> 

Yes, you are spot on. :)

> I've been wondering for a while how a Unixoid kernel should report
> these properties to userspace where it knows them, especially on
> non-overwrite filesystems like ZFS where this sort of thing works

So it looks like ZFS (or any other CoW filesystem that supports larger
block sizes) is doing what postgres will do anyway with FPW=on, making
it safe to turn off FPW.

One question: does ZFS issue something like a FUA request to force the device
to flush its cache before it updates the node to point to the new page?

If it doesn't, there is no guarantee that the device will update the data
atomically unless it offers bigger atomic guarantees.

> already, without stuff like AWUPF working the way one might hope.
> Here was one throw-away idea on the back of a napkin about that, for
> what little it's worth:
>
> https://wiki.postgresql.org/wiki/FreeBSD/AtomicIO

As I replied in the previous mail to Tomas, we might be having a talk
about Untorn writes[1] in LSFMM this year. I hope to bring up some of the
discussions from here. Thanks!

[1] https://lore.kernel.org/linux-fsdevel/20240228061257.ga106...@mit.edu/




Re: Large block sizes support in Linux

2024-03-25 Thread Pankaj Raghav
Hi Tomas and Bruce,

>>> My knowledge of Postgres internals is limited, so I'm wondering if there
>>> are any optimizations or potential optimizations that Postgres could
>>> leverage once we have LBS support on Linux?
>>
>> We have discussed this in the past, and in fact in the early years we
>> thought we didn't need fsync since the BSD file system was 8k at the
>> time.
>>
>> What we later realized is that we have no guarantee that the file system
>> will write to the device in the specified block size, and even it it
>> does, the I/O layers between the OS and the device might not, since many
>> devices use 512 byte blocks or other sizes.
>>
> 
> Right, but things change over time - current storage devices support
> much larger sectors (LBA format), usually 4K. And if you do I/O with
> this size, it's usually atomic.
> 
> AFAIK if you built Postgres with 4K pages, on a device with 4K LBA
> format, that would not need full-page writes - we always do I/O in 4k
> pages, and block layer does I/O (during writeback from page cache) with
> minimum guaranteed size = logical block size. 4K are great for OLTP
> systems in general, it'd be even better if we didn't need to worry about
> torn pages (but the tricky part is to be confident it's safe to disable
> them on a particular system).
> 
> I did watch the talk linked by Pankaj, and IIUC the promise of the LBS
> patches is that this benefit would extend would apply even to larger
> page sizes (= fs page size). Which right now you can't even mount, but
> the patches allow that. So for example it would be possible to create an
> XFS filesystem with 8kB pages, and then we'd read/write 8kB pages as
> usual, and we'd know that the page cache always writes out either the
> whole page or none of it. Which right now is not guaranteed to happen,
> it's possible to e.g. write the page as two 4K requests, even if all
> other things are set properly (drive has 4K logical/physical sectors).
> 
> At least that's my understanding ...
>
> Pankaj, could you clarify what the guarantees provided by LBS are going
> to be? the talk uses wording like "should be" and "hint" in a couple
> places, and there's also stuff I'm not 100% familiar with.
> 
> If we create a filesystem with 8K blocks, and we only ever do writes
> (and reads) in 8K chunks (our default page size), what guarantees that
> gives us? What if the underlying device has LBA format with only 4K (or
> perhaps even just 512B), how would that affect the guarantees?
> 

Yes, the whole FS block is managed as one unit (also on a physically contiguous
page), so we send the whole FS block while performing writeback. This is not
guaranteed when FS block size = 4k and the DB page size is 8k, as it might be
sent as two different requests as you have indicated.

The LBA format will not affect the guarantee of sending the whole FS block
without splitting, as long as the FS block size is less than the maximum IO
transfer size*.

But another issue now is that even though the host has done its job, the device
might have a smaller atomic guarantee, thereby making it not powerfail safe.

> The other thing is - is there a reliable way to say when the guarantees
> actually apply? I mean, how would the administrator *know* it's safe to
> set full_page_writes=off, or even better how could we verify this when
> the database starts (and complain if it's not safe to disable FPW)?
> 

This is an excellent question that needs a bit of community discussion to
expose a device agnostic value that userspace can trust.

There might be a talk this year at LSFMM about untorn writes[1] in buffered IO
path. I will make sure to bring this question up.

At the moment, Linux exposes the physical blocksize by also taking atomic
guarantees into the picture; for NVMe in particular it uses NAWUPF and AWUPF
when setting the physical blocksize (/sys/block//queue/physical_block_size).

A system admin could use the value exposed by phy_bs as a hint that it is safe
to set full_page_writes=off. Of course, this also requires the device to give
atomic guarantees.

The most optimal would be DB page size == FS block size == Device atomic size.
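
As a quick illustration of the database side of that equation (the FS and
device sides have to be checked at the OS level), here is a minimal SQL
sketch; block_size and wal_block_size are fixed at build time and reported
as read-only settings, and the values in the comments are the defaults:

   SHOW block_size;        -- database page size, 8192 on a default build
   SHOW wal_block_size;    -- WAL page size, also 8192 by default
   SHOW full_page_writes;  -- whether torn-page protection is currently enabled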

> It's easy to e.g. take a backup on one filesystem and restore it on
> another one, and forget those may have different block sizes etc. I'm
> not sure it's possible in a 100% reliable way (tablespaces?).
> 
> 
> regards
> 

[1] https://lore.kernel.org/linux-fsdevel/20240228061257.ga106...@mit.edu/

* A small caveat, I am most familiar with NVMe, so my answers might be based on
my experience in NVMe.




Re: Possibility to disable `ALTER SYSTEM`

2024-03-25 Thread Magnus Hagander
On Mon, Mar 25, 2024 at 7:27 PM Tom Lane  wrote:

> Robert Haas  writes:
> > OK, great. The latest patch doesn't specifically talk about backing it
> > up with filesystem-level controls, but it does clearly say that this
> > feature is not going to stop a determined superuser from bypassing the
> > feature, which I think is the appropriate level of detail. We don't
> > actually know whether a user has filesystem-level controls available
> > on their system that are equal to the task; certainly chmod isn't good
> > enough, unless you can prevent the superuser from just running chmod
> > again, which you probably can't. An FS-level immutable flag or some
> > other kind of OS-level wizardry might well get the job done, but I
> > don't think our documentation needs to speculate about that.
>
> True.  For postgresql.conf, you can put it outside the data directory
> and make it be owned by some other user, and the job is done.  It's
> harder for postgresql.auto.conf because that always lives in the data
> directory which is necessarily postgres-writable, so even if you
> did those two things to it the superuser could just rename or
> remove it and then write postgresql.auto.conf of his choosing.
>

Just to add to that -- if you use chattr +i on it, the superuser in
postgres won't be able to rename it -- only the actual root user.

Just chowning it won't help of course, since then the rename route still works.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: Catalog domain not-null constraints

2024-03-25 Thread Dean Rasheed
On Fri, 22 Mar 2024 at 08:28, jian he  wrote:
>
> On Thu, Mar 21, 2024 at 7:23 PM Peter Eisentraut  wrote:
> >
> > Hmm.  CREATE DOMAIN uses column constraint syntax, but ALTER DOMAIN uses
> > table constraint syntax.  Attached is a patch to try to sort this out.
>
> also you should also change src/backend/utils/adt/ruleutils.c?
>
> src6=# \dD
>                                           List of domains
>  Schema |    Name     |  Type   | Collation | Nullable | Default |              Check
> --------+-------------+---------+-----------+----------+---------+----------------------------------
>  public | domain_test | integer |           | not null |         | CHECK (VALUE > 0) NOT NULL VALUE
> (1 row)
>
> probably change to CHECK (VALUE IS NOT NULL)

I'd say it should just output "NOT NULL", since that's the input
syntax that created the constraint. But then again, why display NOT
NULL constraints in that column at all, when there's a separate
"Nullable" column?

Also (not this patch's fault), psql doesn't seem to offer a way to
display domain constraint names -- something you need to know to drop
or alter them. Perhaps \dD+ could be made to do that?
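
In the meantime, a direct catalog query works as a stopgap; a minimal sketch,
reusing the domain name from the quoted example:

   SELECT conname, pg_get_constraintdef(oid) AS definition
     FROM pg_catalog.pg_constraint
    WHERE contypid = 'domain_test'::regtype;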

+   The syntax NOT NULL in this command is a
+   PostgreSQL extension.  (A standard-conforming
+   way to write the same would be CHECK (VALUE IS NOT
+   NULL).  However, per ,
+   such constraints are best avoided in practice anyway.)  The
+   NULL constraint is a
+   PostgreSQL extension (see also ).

I didn't verify this, but I thought that according to the SQL
standard, only non-NULL values should be passed to CHECK constraints,
so there is no standard-conforming way to write a NOT NULL domain
constraint.

FWIW, I think NOT NULL domain constraints are a useful feature to
have, and I suspect that there are more people out there who use them
and like them, than who care what the SQL standard says. If so, I'm in
favour of allowing them to be named and managed in the same way as NOT
NULL table constraints.
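
For illustration, a rough sketch of what that could look like under the
behaviour being discussed here (the domain and constraint names are made up,
and ALTER DOMAIN ... ADD ... NOT NULL is only available with this patch set):

   CREATE DOMAIN posint AS integer
       CONSTRAINT posint_not_null NOT NULL
       CONSTRAINT posint_positive CHECK (VALUE > 0);

   ALTER DOMAIN posint DROP CONSTRAINT posint_not_null;
   ALTER DOMAIN posint ADD CONSTRAINT posint_not_null NOT NULL;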

+   processCASbits($5, @5, "CHECK",
+  NULL, NULL, &n->skip_validation,
+  &n->is_no_inherit, yyscanner);
+   n->initially_valid = !n->skip_validation;

+   /* no NOT VALID support yet */
+   processCASbits($3, @3, "NOT NULL",
+  NULL, NULL, NULL,
+  &n->is_no_inherit, yyscanner);
+   n->initially_valid = true;

NO INHERIT is allowed for domain constraints? What does that even mean?

There's something very wonky about this:

CREATE DOMAIN d1 AS int CHECK (value > 0) NO INHERIT; -- Rejected
ERROR:  check constraints for domains cannot be marked NO INHERIT

CREATE DOMAIN d1 AS int;
ALTER DOMAIN d1 ADD CHECK (value > 0) NO INHERIT; -- Allowed

CREATE DOMAIN d2 AS int NOT NULL NO INHERIT; -- Now allowed (used to be a syntax error)

CREATE DOMAIN d3 AS int;
ALTER DOMAIN d3 ADD NOT NULL NO INHERIT; -- Allowed

Presumably all of those should be rejected in the grammar.

Regards,
Dean




Re: Possibility to disable `ALTER SYSTEM`

2024-03-25 Thread Tom Lane
Robert Haas  writes:
> OK, great. The latest patch doesn't specifically talk about backing it
> up with filesystem-level controls, but it does clearly say that this
> feature is not going to stop a determined superuser from bypassing the
> feature, which I think is the appropriate level of detail. We don't
> actually know whether a user has filesystem-level controls available
> on their system that are equal to the task; certainly chmod isn't good
> enough, unless you can prevent the superuser from just running chmod
> again, which you probably can't. An FS-level immutable flag or some
> other kind of OS-level wizardry might well get the job done, but I
> don't think our documentation needs to speculate about that.

True.  For postgresql.conf, you can put it outside the data directory
and make it be owned by some other user, and the job is done.  It's
harder for postgresql.auto.conf because that always lives in the data
directory which is necessarily postgres-writable, so even if you
did those two things to it the superuser could just rename or
remove it and then write postgresql.auto.conf of his choosing.

I wonder whether this feature should include teaching the server
to ignore postgresql.auto.conf altogether, which would make it
relatively easy to get to a bulletproof configuration.

regards, tom lane




Re: [PATCH] Improve amcheck to also check UNIQUE constraint in btree index.

2024-03-25 Thread Noah Misch
On Mon, Mar 25, 2024 at 12:03:10PM -0400, Peter Geoghegan wrote:
> On Sun, Mar 24, 2024 at 10:03 PM Noah Misch  wrote:

> Separately, I now see that the committed patch just reuses the code
> that has long been used to check that things are in the correct order
> across page boundaries: this is the bt_right_page_check_scankey check,
> which existed in the very earliest versions of amcheck. So while I
> agree that we could just keep the original scan key (from the last
> item on every leaf page), and then make the check at the start of the
> next page instead (as opposed to making it at the end of the previous
> leaf page, which is how it works now), it's not obvious that that
> would be a good trade-off, all things considered.
> 
> It might still be a little better that way around, overall, but you're
> not just talking about changing the recently committed checkunique
> patch (I think). You're also talking about restructuring the long
> established bt_right_page_check_scankey check (otherwise, what's the
> point?). I'm not categorically opposed to that, but it's not as if

I wasn't thinking about changing the pre-v17 bt_right_page_check_scankey()
code.  I got interested in this area when I saw the interaction of the new
"first key on the next page" logic with bt_right_page_check_scankey().  The
patch made bt_right_page_check_scankey() pass back rightfirstoffset.  The new
code then does palloc_btree_page() and PageGetItem() with that offset, which
bt_right_page_check_scankey() had already done.  That smelled like a misplaced
distribution of responsibility.  For a time, I suspected the new code should
move down into bt_right_page_check_scankey().  Then I transitioned to thinking
checkunique didn't need new code for the page boundary.

> it'll allow you to throw out a bunch of code -- AFAICT that proposal
> doesn't have that clear advantage going for it. The race condition
> that is described at great length in bt_right_page_check_scankey isn't
> ever going to be a problem for the recently committed checkunique
> patch (as you more or less pointed out yourself), but obviously it is
> still a concern for the cross-page order check.
> 
> In summary, the old bt_right_page_check_scankey check is strictly
> concerned with the consistency of a physical data structure (the index
> itself), whereas the new checkunique check makes sure that the logical
> content of the database is consistent (the index, the heap, and all
> associated transaction status metadata have to be consistent). That
> means that the concerns that are described at length in
> bt_right_page_check_scankey (nor anything like those concerns) don't
> apply to the new checkunique check. We agree on all that, I think. But
> it's less clear that that presents us with an opportunity to simplify
> this patch.

See above for why I anticipated a simplification opportunity with respect to
new-in-v17 code.  Still, it may not pan out.

> > Adding checkunique raised runtime from 58s to 276s, because it checks

Side note: my last email incorrectly described that as "raises runtime by
476%".  It should have said "by 376%" or "by a factor of 4.76".

> > visibility for every heap tuple.  It could do the heap fetch and visibility
> > check lazily, when the index yields two heap TIDs for one scan key.  That
> > should give zero visibility checks for this particular test case, and it
> > doesn't add visibility checks to bloated-table cases.

> It seems like the implication of everything that you said about
> refactoring/moving the check was that doing so would enable this
> optimization (at least an implementation along the lines of your
> pseudo code). If that was what you intended, then it's not obvious to
> me why it is relevant. What, if anything, does it have to do with
> making the new checkunique visibility checks happen lazily?

Their connection is just that they are the two big-picture topics I found in
post-commit review.  Decisions about the cross-page check are indeed separable
from decisions about lazy vs. eager visibility checks.

Thanks,
nm




Re: Possibility to disable `ALTER SYSTEM`

2024-03-25 Thread Robert Haas
On Mon, Mar 25, 2024 at 1:47 PM Tom Lane  wrote:
> FWIW, I never objected to the idea of being able to disable ALTER
> SYSTEM.  I felt that it ought to be part of a larger feature that
> would provide a more bulletproof guarantee that a superuser can't
> alter the system configuration; but I'm clearly in the minority
> on that.  I'm content with just having it disable ALTER SYSTEM
> and no more, as long as the documentation is sufficiently clear
> that an uncooperative superuser can easily bypass this if you don't
> back it up with filesystem-level controls.

OK, great. The latest patch doesn't specifically talk about backing it
up with filesystem-level controls, but it does clearly say that this
feature is not going to stop a determined superuser from bypassing the
feature, which I think is the appropriate level of detail. We don't
actually know whether a user has filesystem-level controls available
on their system that are equal to the task; certainly chmod isn't good
enough, unless you can prevent the superuser from just running chmod
again, which you probably can't. An FS-level immutable flag or some
other kind of OS-level wizardry might well get the job done, but I
don't think our documentation needs to speculate about that.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Built-in CTYPE provider

2024-03-25 Thread Jeff Davis
On Mon, 2024-03-25 at 08:29 +0100, Peter Eisentraut wrote:
> Right.  I thought when you said there is an ICU configuration for it,
> that it might be like collation options that you specify in the locale
> string.  But it appears it is only an internal API setting.  So that, in
> my mind, reinforces the opinion that we should leave initcap() as is and
> make a new function that exposes the new functionality.  (This does not
> have to be part of this patch set.)

OK, I'll propose a "title" or "titlecase" function for 18, along with
"casefold" (which I was already planning to propose).

What do you think about UPPER/LOWER and full case mapping? Should there
be extra arguments for full vs simple case mapping, or should it come
from the collation?

It makes sense that the "dotted vs dotless i" behavior comes from the
collation because that depends on locale. But full-vs-simple case
mapping is not really a locale question. For instance:

   select lower('0Σ' collate "en-US-x-icu") AS lower_sigma,
          lower('ΑΣ' collate "en-US-x-icu") AS lower_final_sigma,
          upper('ß' collate "en-US-x-icu") AS upper_eszett;

    lower_sigma | lower_final_sigma | upper_eszett
   -------------+-------------------+--------------
    0σ          | ας                | SS

produces the same results for any ICU collation.

There's also another reason to consider it an argument rather than a
collation property, which is that it might be dependent on some other
field in a row. I could imagine someone wanting to do:

   SELECT
 UPPER(some_field,
   full => true,
   dotless_i => CASE other_field WHEN ...)
   FROM ...

That makes sense for a function in the target list, because different
customers might be from different locales and therefore want different
treatment of the dotted-vs-dotless-i.

Thoughts? Should we use the collation by default but then allow
parameters to override? Or should we just consider this a new set of
functions?

(All of this is v18 material, of course.)

Regards,
Jeff Davis





Re: Possibility to disable `ALTER SYSTEM`

2024-03-25 Thread Tom Lane
Robert Haas  writes:
> Since those are just minor points, that brings us to the question of
> whether there is consensus to proceed with this. I believe that there
> is a clear consensus that there should be some way to disable ALTER
> SYSTEM. Sure, some people, particularly Tom, disagree, but I don't
> think there is any way of counting up the votes that leads to the
> conclusion that we shouldn't have this feature at all.

FWIW, I never objected to the idea of being able to disable ALTER
SYSTEM.  I felt that it ought to be part of a larger feature that
would provide a more bulletproof guarantee that a superuser can't
alter the system configuration; but I'm clearly in the minority
on that.  I'm content with just having it disable ALTER SYSTEM
and no more, as long as the documentation is sufficiently clear
that an uncooperative superuser can easily bypass this if you don't
back it up with filesystem-level controls.

regards, tom lane




Re: Propagate pathkeys from CTEs up to the outer query

2024-03-25 Thread Tom Lane
Richard Guo  writes:
> This patch was initially posted in that same thread and has received
> some comments from Tom in [2].  Due to the presence of multiple patches
> in that thread, it has led to confusion.  So fork a new thread here
> specifically dedicated to discussing the patch about exposing pathkeys
> from CTEs to the upper planner.

I got around to looking at this finally.  I was a bit surprised by
your choice of data structure.  You made a per-CTE-item cte_paths
list paralleling cte_plan_ids, but what I had had in mind was a
per-subplan list of paths paralleling glob->subplans and subroots.
This would mean that the code for ordinary SubqueryScans would
also need to fill in that list, but surely that's a trivial cost
compared to everything else we do to prepare a subplan.  I don't
think that we have any immediate need to remember that info for
an ordinary SubqueryScan, but it seems plausible that we will
in future.  Also, I'm not sure that a Path is fully interpretable
without the associated PlannerInfo (subroot), so keeping it
beside the list of subroots seems more future-proof than dissociating
it from that.  This approach would also be more amenable to postponing
creation of the subplans, as we speculated about earlier.  (I have
no near-term desire to actually do that, but maybe someday it will
happen.)
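
For context, the kind of query this is aimed at looks roughly like the sketch
below (table and column names are only illustrative); with the CTE's pathkeys
exposed to the upper planner, the outer ORDER BY can reuse the sort already
performed inside the materialized CTE instead of adding another Sort node:

   EXPLAIN (COSTS OFF)
   WITH c AS MATERIALIZED (
       SELECT * FROM pgbench_accounts ORDER BY aid
   )
   SELECT * FROM c ORDER BY aid;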

regards, tom lane




Re: Possibility to disable `ALTER SYSTEM`

2024-03-25 Thread Robert Haas
On Tue, Mar 19, 2024 at 9:13 AM Jelte Fennema-Nio  wrote:
> On Mon, 18 Mar 2024 at 18:27, Robert Haas  wrote:
> > I think for now we
> > should just file this under "Other platforms and clients," which only
> > has one existing setting. If the number of settings of this type
> > grows, we can split it out.
>
> Done. I also included a patch to rename COMPAT_OPTIONS_CLIENTS to
> COMPAT_OPTIONS_OTHER, since that enum variant naming doesn't match the
> new intent of the section.

I reviewed these patches. I think 0001 probably isn't strictly
necessary, but I don't think it's problematic either. And I'm quite
happy with 0002 also. In particular, I think the documentation - which
must be by far the most important of the patch - does an excellent job
explaining the limitations of this feature. My only quibbles are:

- 0002 deletes a blank line from postgresql.conf.sample, and I think
it shouldn't; and
- I think the last sentence of the documentation is odd and could be
dropped; who would expect changing a GUC to reset the contents of a
config file, anyway?

Since those are just minor points, that brings us to the question of
whether there is consensus to proceed with this. I believe that there
is a clear consensus that there should be some way to disable ALTER
SYSTEM. Sure, some people, particularly Tom, disagree, but I don't
think there is any way of counting up the votes that leads to the
conclusion that we shouldn't have this feature at all. If someone
feels otherwise, show us how you counted the votes. What is less clear
is whether there is a consensus in favor of this particular method of
disabling ALTER SYSTEM, namely, via a GUC. The two alternate
approaches that seem to enjoy some level of support are (a) an
extension or (b) changing the permissions on the files.

I haven't tried to count up how many people are specifically in favor
of each approach. I personally think that it doesn't matter very much,
because I interpret the comments in favor of one or another
implementation as saying "I want us to have this feature and of the
possible approaches I prefer $WHATEVER" rather than "the only
architecturally acceptable approach to this feature is $WHATEVER and
if we can't have that then i'd rather have nothing at all." Of course,
like everything else, that conclusion is open to debate, and certainly
to correction by the people who have voted in favor of one of the
alternate approaches, if I've misinterpreted their views.

But, as a practical matter, this is the patch we have, because this is
the patch that Gabriele and Jelte took time to write and polish.
Nobody else has taken the opportunity to produce a competing one. And,
if we nevertheless insist that it has to be done some other way, I
think the inevitable result will be that nothing gets into this
release at all, because we're less than 2 weeks from feature freeze,
and there's not time for a complete do-over of something that was
originally proposed all the way back in September. And my reading of
the thread, at least, is that more people will be happy if something
gets committed here, even if it's not exactly what they would have
preferred, than if we get nothing at all.

I'm going to wait a few days for any final comments. If it becomes
clear that there is in fact no consensus to commit this version of the
patch set (or something very similar) then I'll mark this as Returned
with Feedback. Otherwise, I plan to commit these patches (perhaps
after adjusting in accordance with my comments above).

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: pgsql: Track last_inactive_time in pg_replication_slots.

2024-03-25 Thread Bertrand Drouvot
Hi,

On Mon, Mar 25, 2024 at 12:25:37PM -0400, Robert Haas wrote:
> On Mon, Mar 25, 2024 at 12:12 PM Bertrand Drouvot
>  wrote:
> > Would "released_time" sounds better? (at the end this is exactly what it 
> > does
> > represent unless for the case where it is restored from disk for which the 
> > meaning
> > would still makes sense to me though). It seems to me that released_time 
> > does not
> > lead to any expectation then removing any confusion.
> 
> Yeah, that's not bad. I mean, I don't agree that released_time doesn't
> lead to any expectation,
> but what it leads me to expect is that you're
> going to tell me the time at which the slot was released. So if it's
> currently active, then I see NULL, because it's not released; but if
> it's inactive, then I see the time at which it became so.
> 
> In the same vein, I think deactivated_at or inactive_since might be
> good names to consider. I think they get at the same thing as
> released_time, but they avoid introducing a completely new word
> (release, as opposed to active/inactive).
> 

Yeah, I'd vote for inactive_since then.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: pg_stat_statements and "IN" conditions

2024-03-25 Thread Dmitry Dolgov
> On Sun, Mar 24, 2024 at 11:36:38PM +0900, Yasuo Honda wrote:
> Thanks for the information. I can apply these 4 patches from
> 0eb23285a2 . I tested this branch from Ruby on Rails and it gets some
> unexpected behavior from my point of view.
> Setting pg_stat_statements.query_id_const_merge_threshold = 5 does not
> normalize sql queries whose number of in clauses exceeds 5.
>
> Here are test steps.
> https://gist.github.com/yahonda/825ffccc4dcb58aa60e12ce33d25cd45#expected-behavior
>
> It would be appreciated if I can get my understanding correct.

From what I understand out of the description this ruby script uses
prepared statements, passing values as parameters, right? Unfortunately
the current version of the patch doesn't handle that, it works with
constants only [1]. The original incarnation of this feature was able to
handle that, but the implementation was considered to be not suitable --
thus, to make some progress, it was left outside.

The plan is, if everything goes fine at some point, to do a follow-up
patch to handle Params and the rest.
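
To illustrate the distinction with a rough sketch (the table name and the
number of list elements are made up for the example):

   -- Constants written directly in the query text are candidates for merging
   -- once the list exceeds pg_stat_statements.query_id_const_merge_threshold:
   SELECT * FROM accounts WHERE id IN (1, 2, 3, 4, 5, 6, 7);

   -- Values sent separately as parameters (as a driver using prepared
   -- statements does) are Params rather than constants, so the current
   -- version of the patch leaves such queries unmerged:
   PREPARE fetch_accounts(int, int, int, int, int, int, int) AS
       SELECT * FROM accounts WHERE id IN ($1, $2, $3, $4, $5, $6, $7);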

[1]: 
https://www.postgresql.org/message-id/20230211104707.grsicemegr7d3mgh%40erthalion.local




Re: pgsql: Track last_inactive_time in pg_replication_slots.

2024-03-25 Thread Robert Haas
On Mon, Mar 25, 2024 at 12:12 PM Bertrand Drouvot
 wrote:
> Now that I read your arguments I think that last__time could be
> both misleading because at the end they rely on users' "expectation".

Well, the user is always going to expect *something* -- that's just
how language works.

> Would "released_time" sounds better? (at the end this is exactly what it does
> represent unless for the case where it is restored from disk for which the 
> meaning
> would still makes sense to me though). It seems to me that released_time does 
> not
> lead to any expectation then removing any confusion.

Yeah, that's not bad. I mean, I don't agree that released_time doesn't
lead to any expectation, but what it leads me to expect is that you're
going to tell me the time at which the slot was released. So if it's
currently active, then I see NULL, because it's not released; but if
it's inactive, then I see the time at which it became so.

In the same vein, I think deactivated_at or inactive_since might be
good names to consider. I think they get at the same thing as
released_time, but they avoid introducing a completely new word
(release, as opposed to active/inactive).

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: pgsql: Track last_inactive_time in pg_replication_slots.

2024-03-25 Thread Bertrand Drouvot
Hi,

On Mon, Mar 25, 2024 at 07:32:11PM +0530, Amit Kapila wrote:
> On Mon, Mar 25, 2024 at 6:57 PM Robert Haas  wrote:
> > And I'm suspicious that having an exception for slots being synced is
> > a bad idea. That makes too much of a judgement about how the user will
> > use this field. It's usually better to just expose the data, and if
> > the user needs helps to make sense of that data, then give them that
> > help separately.
> 
> The reason we didn't set this for sync slots is that they won't be
> usable (one can't use them to decode WAL) unless standby is promoted
> [2]. But I see your point as well. So, I have copied the others
> involved in this discussion to see what they think.

Yeah I also see Robert's point. If we also sync the "last inactive time" field 
then
we would need to take care of the corner case mentioned by Shveta in [1] during
promotion.

[1]: 
https://www.postgresql.org/message-id/CAJpy0uCLu%2BmqAwAMum%3DpXE9YYsy0BE7hOSw_Wno5vjwpFY%3D63g%40mail.gmail.com

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Avoiding inadvertent debugging mode for pgbench

2024-03-25 Thread Nathan Bossart
Committed.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: pgsql: Track last_inactive_time in pg_replication_slots.

2024-03-25 Thread Bertrand Drouvot
Hi,

On Mon, Mar 25, 2024 at 11:49:00AM -0400, Robert Haas wrote:
> On Mon, Mar 25, 2024 at 11:16 AM Bertrand Drouvot
>  wrote:
> > > IIUC, Bertrand's point was that users can interpret last_active_time
> > > as a value that gets updated each time they decode a change which is
> > > not what we are doing. So, this can confuse users. Your expectation of
> > > answer (NULL) when the slot is active is correct and that is what will
> > > happen.
> >
> > Yeah, and so would be the confusion: why is last_active_time NULL while one 
> > is
> > using the slot?
> 
> I agree that users could get confused here, but the solution to that
> shouldn't be to give the field a name that is the opposite of what it
> actually does. I expect a field called last_inactive_time to tell me
> the last time that the slot was inactive. Here, it tells me the last
> time that a currently-inactive slot previously *WAS* active. How can
> you justify calling that the last *INACTIVE* time?
> 
> AFAICS, the user who has the confusion that you mention here is simply
> wrong. If they are looking at a field called "last active time" and
> the slot is active, then the correct answer is "right now" or
> "undefined" and that is what they will see. Sure, they might not
> understand that. But flipping the name of the field on its head cannot
> be the right way to help them.
> 
> With the current naming, I expect to have the exact opposite confusion
> as your hypothetical confused user. I'm going to be looking at a slot
> that's currently inactive, and it's going to tell me that the
> last_inactive_time was at some time in the past. And I'm going to say
> "what the heck is going on here, the slot is inactive *right now*!"
> 
> Half of me wonders whether we should avoid this whole problem by
> renaming it to something like last_state_change or
> last_state_change_time, or maybe just state_change like we do in
> pg_stat_activity, and making it mean the last time the slot flipped
> between active and inactive in either direction. I'm not sure if this
> is better, but unless I'm misunderstanding something, the current
> situation is terrible.
> 

Now that I read your arguments I think that last__time could be
both misleading because at the end they rely on users' "expectation".

Would "released_time" sound better? (at the end this is exactly what it does
represent, except for the case where it is restored from disk, for which the
meaning would still make sense to me though). It seems to me that released_time
does not lead to any expectation, thus removing any confusion.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




RE: speed up a logical replica setup

2024-03-25 Thread Hayato Kuroda (Fujitsu)
Dear Bharath, Peter,

> Looks like BF animals aren't happy, please check -
> > https://buildfarm.postgresql.org/cgi-bin/show_failures.pl.
> 
> Looks like sanitizer failures.  There were a few messages about that
> recently, but those were all just about freeing memory after use, which
> we don't necessarily require for client programs.  So maybe something else.

It seems that there are several types of failures, [1] and [2].

## Analysis for failure 1

The failure is caused by a time lag between when the walreceiver finishes and
when pg_is_in_recovery() starts returning true.

According to the output [1], it seems that the tool failed at 
wait_for_end_recovery()
with the message "standby server disconnected from the primary". Also, lines
"redo done at..." and "terminating walreceiver process due to administrator 
command"
meant that walreceiver was requested to shut down by XLogShutdownWalRcv().

According to the source, we confirm that walreceiver is shut down in
StartupXLOG()->FinishWalRecovery()->XLogShutdownWalRcv(). Also, SharedRecoveryState
is changed to RECOVERY_STATE_DONE (this is what makes pg_is_in_recovery() return
true) in the latter part of StartupXLOG().

So, if there is a delay between FinishWalRecovery() and changing the state, the
check in wait_for_end_recovery() would fail during that window. Since we allow
the walreceiver to be missing 10 times and it is checked once per second, the
failure occurs if the time lag is longer than 10 seconds.

I do not have a good way to fix it. One approach is to make NUM_CONN_ATTEMPTS
larger, but it's not a fundamental solution.

## Analysis for failure 2

According to [2], the physical replication slot which is specified as
primary_slot_name was not used by the walsender process. At that time the
walsender did not exist yet.

```
...
pg_createsubscriber: publisher: current wal senders: 0
pg_createsubscriber: command is: SELECT 1 FROM pg_catalog.pg_replication_slots 
WHERE active AND slot_name = 'physical_slot'
pg_createsubscriber: error: could not obtain replication slot information: got 
0 rows, expected 1 row
...
```

Currently the standby must be stopped before the command is run, and the current
code does not wait to ensure that replication has started. So there is a
possibility that the check runs before the walsender is launched.

One possible approach is to wait until the replication starts; a sketch of such
a check is below. An alternative is to relax the condition.
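
As a rough sketch, the "wait" variant could poll something like the query below
until the slot shows up as in use (the slot name is the one from the failing
test; the retry/timeout handling would live in pg_createsubscriber itself):

   SELECT active, active_pid
     FROM pg_catalog.pg_replication_slots
    WHERE slot_name = 'physical_slot';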

What do you think?

[1]: 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus&dt=2024-03-25%2013%3A03%3A07
[2]: 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=culicidae&dt=2024-03-25%2013%3A53%3A58

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 


Re: [PATCH] Improve amcheck to also check UNIQUE constraint in btree index.

2024-03-25 Thread Peter Geoghegan
On Sun, Mar 24, 2024 at 10:03 PM Noah Misch  wrote:
> > You're going to have to "couple" buffer locks in the style of
> > _bt_check_unique() (as well as keeping a buffer lock on "the first
> > leaf page a duplicate might be on" throughout) if you need the test to
> > work reliably.
>
> The amcheck feature has no lock coupling at its "first key on the next page"
> check.  I think that's fine, because amcheck takes one snapshot at the
> beginning and looks for pairs of visible-to-that-snapshot heap tuples with the
> same scan key.  _bt_check_unique(), unlike amcheck, must catch concurrent
> inserts.  If amcheck "checkunique" wanted to detect duplicates that would
> appear when all transactions commit, it would need lock coupling.  (I'm not
> suggesting it do that.)  Do you see a problem with the lack of lock coupling
> at "first key on the next page"?

Practically speaking, no, I see no problems.

> I agree, but perhaps the "first key on the next page" code is more complex
> than general-case code would be.  If the lack of lock coupling is fine, then I
> think memory context lifecycle is the only obstacle making index page
> boundaries special.  Are there factors beyond that?

I believe that my concern back in 2021 was that the general complexity
of cross-page checking was unlikely to be worth it. Note that
nbtsplitloc.c is *maximally* aggressive about avoiding split points
that fall within some group of duplicates, so with a unique index it
should be very rare.

Admittedly, I was probably thinking about the complexity of adding a
bunch of code just to be able to check uniqueness across page
boundaries. I did mention lock coupling by name, but that was more of
a catch-all term for the problems in this area.

> We already have
> state->lowkey kept across pages via MemoryContextAlloc().  Similar lines of
> code could preserve the scan key for checkunique, making the "first key on the
> next page" code unnecessary.

I suspect that I was overly focussed on the index structure itself
back when I made these remarks. I might not have considered that just
using an MVCC snapshot for the TIDs makes the whole process safe,
though that now seems quite obvious.

Separately, I now see that the committed patch just reuses the code
that has long been used to check that things are in the correct order
across page boundaries: this is the bt_right_page_check_scankey check,
which existed in the very earliest versions of amcheck. So while I
agree that we could just keep the original scan key (from the last
item on every leaf page), and then make the check at the start of the
next page instead (as opposed to making it at the end of the previous
leaf page, which is how it works now), it's not obvious that that
would be a good trade-off, all things considered.

It might still be a little better that way around, overall, but you're
not just talking about changing the recently committed checkunique
patch (I think). You're also talking about restructuring the long
established bt_right_page_check_scankey check (otherwise, what's the
point?). I'm not categorically opposed to that, but it's not as if
it'll allow you to throw out a bunch of code -- AFAICT that proposal
doesn't have that clear advantage going for it. The race condition
that is described at great length in bt_right_page_check_scankey isn't
ever going to be a problem for the recently committed checkunique
patch (as you more or less pointed out yourself), but obviously it is
still a concern for the cross-page order check.

In summary, the old bt_right_page_check_scankey check is strictly
concerned with the consistency of a physical data structure (the index
itself), whereas the new checkunique check makes sure that the logical
content of the database is consistent (the index, the heap, and all
associated transaction status metadata have to be consistent). That
means that the concerns that are described at length in
bt_right_page_check_scankey (nor anything like those concerns) don't
apply to the new checkunique check. We agree on all that, I think. But
it's less clear that that presents us with an opportunity to simplify
this patch.

> Adding checkunique raised runtime from 58s to 276s, because it checks
> visibility for every heap tuple.  It could do the heap fetch and visibility
> check lazily, when the index yields two heap TIDs for one scan key.  That
> should give zero visibility checks for this particular test case, and it
> doesn't add visibility checks to bloated-table cases.

The added runtime that you report seems quite excessive to me. I'm
really surprised that the code doesn't manage to avoid visibility
checks in the absence of duplicates that might both have TIDs
considered visible. Lazy visibility checking seems almost essential,
and not just a nice-to-have optimization.
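
For reference, the call whose cost is being discussed is the checkunique
option of amcheck's bt_index_check(); a sketch with a made-up index name:

   SELECT bt_index_check('my_unique_idx'::regclass,
                         heapallindexed => true,
                         checkunique => true);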

It seems like the implication of everything that you said about
refactoring/moving the check was that doing so would enable this
optimization (at least an implementation along the lines of your
pseudo code). If that was what you intended, then it's not obvious to
me why it is relevant. What, if anything, does it have to do with
making the new checkunique visibility checks happen lazily?

Re: documentation structure

2024-03-25 Thread Robert Haas
On Mon, Mar 25, 2024 at 11:40 AM Peter Eisentraut  wrote:
> I think a possible problem we need to consider with these proposals to
> combine chapters is that they could make the chapters themselves too
> deep and harder to navigate.  For example, if we combined the
> installation from source and binaries chapters, the structure of the new
> chapter would presumably be

I agree with this in theory, but in practice I think the patches that
I posted don't have this issue to a degree that is problematic, and I
posted some specific proposals on adjustments that we could make to
ameliorate the problem if other people feel differently.

> I think maybe more could also be done at the top-level structure, too.
> Right now, we have  ->  -> .  We could add  on
> top of that.
>
> We could also play with CSS or JavaScript to make the top-level table of
> contents more navigable, with collapsing subsections or whatever.
>
> We could also render additional tables of contents or indexes, so there
> is more than one way to navigate into the content from the top.
>
> We could also build better search.

These are all reasonable ideas. I think some better CSS and JavaScript
could definitely help, and I also wondered whether the entrypoint to
the documentation has to be the index page, or whether it could maybe
be a page we've crafted specifically for that purpose, that might
include some text as well as a bunch of links.

But that having been said, I don't believe that any of those ideas (or
anything else we do) will obviate the need for some curation of the
toplevel index. If you're going to add another level, as you propose
in the first point, you still need to make decisions about which
things properly go at which levels. If you're going to allow for
collapsing subsections, you still want the overall tree in which
subsections are expanded and collapsed to make logical sense. If
you have multiple ways to navigate to the content, one of them will
probably still be the index, and it should be good. And good search is
good, but it shouldn't be the only convenient way to find the content.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: [PATCH] plpython function causes server panic

2024-03-25 Thread Robert Haas
On Mon, Mar 25, 2024 at 11:36 AM Tom Lane  wrote:
> By that logic, we should rip out every Assert in the system, as well
> as all of the (extensive) resource leak checking that already happens
> during CommitTransaction.  We've always felt that those leak checks
> were worth the cost to help us find bugs --- which they have done and
> still do from time to time.  I don't see why this case is different,
> especially when the added cost compared to HEAD is not much more than
> one C function call.

Well, I explained why *I* thought it was different, but obviously you
don't agree.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: pgsql: Track last_inactive_time in pg_replication_slots.

2024-03-25 Thread Robert Haas
On Mon, Mar 25, 2024 at 11:16 AM Bertrand Drouvot
 wrote:
> > IIUC, Bertrand's point was that users can interpret last_active_time
> > as a value that gets updated each time they decode a change which is
> > not what we are doing. So, this can confuse users. Your expectation of
> > answer (NULL) when the slot is active is correct and that is what will
> > happen.
>
> Yeah, and so would be the confusion: why is last_active_time NULL while one is
> using the slot?

I agree that users could get confused here, but the solution to that
shouldn't be to give the field a name that is the opposite of what it
actually does. I expect a field called last_inactive_time to tell me
the last time that the slot was inactive. Here, it tells me the last
time that a currently-inactive slot previously *WAS* active. How can
you justify calling that the last *INACTIVE* time?

AFAICS, the user who has the confusion that you mention here is simply
wrong. If they are looking at a field called "last active time" and
the slot is active, then the correct answer is "right now" or
"undefined" and that is what they will see. Sure, they might not
understand that. But flipping the name of the field on its head cannot
be the right way to help them.

With the current naming, I expect to have the exact opposite confusion
as your hypothetical confused user. I'm going to be looking at a slot
that's currently inactive, and it's going to tell me that the
last_inactive_time was at some time in the past. And I'm going to say
"what the heck is going on here, the slot is inactive *right now*!"

Half of me wonders whether we should avoid this whole problem by
renaming it to something like last_state_change or
last_state_change_time, or maybe just state_change like we do in
pg_stat_activity, and making it mean the last time the slot flipped
between active and inactive in either direction. I'm not sure if this
is better, but unless I'm misunderstanding something, the current
situation is terrible.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Popcount optimization using AVX512

2024-03-25 Thread Joe Conway

On 3/25/24 11:12, Tom Lane wrote:
> "Amonson, Paul D"  writes:
>> I am re-posting the patches as CI for Mac failed (CI error not code/test
>> error). The patches are the same as last time.
>
> Just for a note --- the cfbot will re-test existing patches every
> so often without needing a bump.  The current cycle period seems to
> be about two days.

Just an FYI -- there seems to be an issue with all three of the macos
cfbot runners (mine included). I spent time over the weekend working
with Thomas Munro (added to CC list) trying different fixes to no avail.
Help from macos CI wizards would be gratefully accepted...


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: documentation structure

2024-03-25 Thread Peter Eisentraut

On 22.03.24 15:10, Robert Haas wrote:
> Sorry. I didn't mean to dispute the point that the section was added a
> few years ago, nor the point that most people just want to read about
> the binaries. I am confident that both of those things are true. What
> I do want to dispute is that having a four-sentence chapter in the
> documentation index that tells people something they can find much
> more easily without using the documentation at all is a good plan.


I think a possible problem we need to consider with these proposals to 
combine chapters is that they could make the chapters themselves too 
deep and harder to navigate.  For example, if we combined the 
installation from source and binaries chapters, the structure of the new 
chapter would presumably be


 Installation
     Installation from Binaries
     Installation from Source
         Requirements
         Getting the Source
         Building and Installation with Autoconf and Make
         Building and Installation with Meson
         etc.

This would mean that the entire "Installation from Source" part would be 
rendered on a single HTML page.


The rendering can be adjusted to some degree, but then we also need to 
make sure any new chunking makes sense in other chapters.  (And it might 
also change a bunch of externally known HTML links.)


I think maybe more could also be done at the top-level structure, too. 
Right now, we have  ->  -> .  We could add  on 
top of that.


We could also play with CSS or JavaScript to make the top-level table of 
contents more navigable, with collapsing subsections or whatever.


We could also render additional tables of contents or indexes, so there 
is more than one way to navigate into the content from the top.


We could also build better search.





Re: pgsql: Track last_inactive_time in pg_replication_slots.

2024-03-25 Thread Bertrand Drouvot
Hi,

On Mon, Mar 25, 2024 at 08:59:55PM +0530, Amit Kapila wrote:
> On Mon, Mar 25, 2024 at 8:46 PM Bertrand Drouvot
>  wrote:
> >
> > On Mon, Mar 25, 2024 at 08:38:16PM +0530, Amit Kapila wrote:
> > > On Mon, Mar 25, 2024 at 7:51 PM Robert Haas  wrote:
> > > >
> > > > On Mon, Mar 25, 2024 at 10:02 AM Amit Kapila  
> > > > wrote:
> > > > > We considered the other two names as last_inactive_at and
> > > > > last_active_time. For the first (last_inactive_at), there was an
> > > > > argument that most other fields that display time ends with _time. For
> > > > > the second (last_active_time), there was an argument that it could be
> > > > > misleading as one could think that it should be updated each time WAL
> > > > > record decoding is happening [1]. The other possibility is to name it
> > > > > last_used_time but I think it won't be much different from
> > > > > last_active_time.
> > > >
> > > > I don't understand the bit about updating it each time WAL record
> > > > decoding is happening. If it's the last active time, and the slot is
> > > > currently active, then the answer is either "right now" or "currently
> > > > undefined." I'd expect to see NULL in the system view in such a case.
> > > > And if that's so, then there's nothing to update each time a record is
> > > > decoded, because it's just still going to show NULL.
> > > >
> > >
> > > IIUC, Bertrand's point was that users can interpret last_active_time
> > > as a value that gets updated each time they decode a change which is
> > > not what we are doing. So, this can confuse users. Your expectation of
> > > answer (NULL) when the slot is active is correct and that is what will
> > > happen.
> >
> > Yeah, and so would be the confusion: why is last_active_time NULL while
> > one is using the slot?
> >
> 
> It is because we set it to zero when we acquire the slot and that
> value will remain the same while the slot is active. I am not sure I
> understood your question, so what I am saying might not make sense.

There is no "real" question, I was just highlighting the confusion in case we
name the field "last_active_time".

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: [PATCH] plpython function causes server panic

2024-03-25 Thread Tom Lane
Robert Haas  writes:
> On Sat, Mar 23, 2024 at 12:31 PM Tom Lane  wrote:
>> However, the calling logic seems a bit shy of a load, in that it
>> trusts IsInParallelMode() completely to decide whether to check for
>> leaked parallel contexts.  So we'd miss the case where somebody did
>> ExitParallelMode without having cleaned up workers.

> But if the user puts a call to ExitParallelMode() inside such a
> function, it's hard to imagine what goal they have other than to
> deliberately circumvent the safeguards. And they're always going to be
> able to do that somehow, if they're coding in C. So I'm not convinced
> that the sanity checks you've added are really going to do anything
> other than burn a handful of CPU cycles. If there's some plausible
> case in which they protect us against a user who has legitimately made
> an error, fine; but if we're just wandering down the slippery slope of
> believing we can defend against malicious C code, we absolutely should
> not do that, not even a little bit. The first CPU instruction we burn
> in the service of a hopeless cause is already one too many.

By that logic, we should rip out every Assert in the system, as well
as all of the (extensive) resource leak checking that already happens
during CommitTransaction.  We've always felt that those leak checks
were worth the cost to help us find bugs --- which they have done and
still do from time to time.  I don't see why this case is different,
especially when the added cost compared to HEAD is not much more than
one C function call.

Or in other words: the point is not about stopping malicious C code,
it's about recognizing that we make mistakes.
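
For concreteness, a rough sketch of the kind of commit-time check being
argued for -- a hypothetical helper, not the committed code, assuming it
would be called from xact.c during CommitTransaction() with the current
parallelModeLevel passed in:

#include "postgres.h"

#include "access/parallel.h"

/*
 * Hypothetical helper, not the committed code: clean up any leaked
 * parallel contexts even if parallel mode was already exited, and
 * complain if we are about to commit with parallel mode still
 * nominally active.
 */
static void
AtCommit_CheckParallelLeaks(int parallelModeLevel)
{
    /* Cheap no-op when there is nothing to clean up. */
    AtEOXact_Parallel(true);

    if (parallelModeLevel != 0)
        elog(WARNING, "parallelModeLevel is %d at commit",
             parallelModeLevel);
}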

regards, tom lane




Re: pgsql: Track last_inactive_time in pg_replication_slots.

2024-03-25 Thread Amit Kapila
On Mon, Mar 25, 2024 at 8:46 PM Bertrand Drouvot
 wrote:
>
> On Mon, Mar 25, 2024 at 08:38:16PM +0530, Amit Kapila wrote:
> > On Mon, Mar 25, 2024 at 7:51 PM Robert Haas  wrote:
> > >
> > > On Mon, Mar 25, 2024 at 10:02 AM Amit Kapila  wrote:
> > > > We considered the other two names as last_inactive_at and
> > > > last_active_time. For the first (last_inactive_at), there was an
> > > > argument that most other fields that display time ends with _time. For
> > > > the second (last_active_time), there was an argument that it could be
> > > > misleading as one could think that it should be updated each time WAL
> > > > record decoding is happening [1]. The other possibility is to name it
> > > > last_used_time but I think it won't be much different from
> > > > last_active_time.
> > >
> > > I don't understand the bit about updating it each time WAL record
> > > decoding is happening. If it's the last active time, and the slot is
> > > currently active, then the answer is either "right now" or "currently
> > > undefined." I'd expect to see NULL in the system view in such a case.
> > > And if that's so, then there's nothing to update each time a record is
> > > decoded, because it's just still going to show NULL.
> > >
> >
> > IIUC, Bertrand's point was that users can interpret last_active_time
> > as a value that gets updated each time they decode a change which is
> > not what we are doing. So, this can confuse users. Your expectation of
> > answer (NULL) when the slot is active is correct and that is what will
> > happen.
>
> Yeah, and so would be the confusion: why is last_active_time NULL while one is
> using the slot?
>

It is because we set it to zero when we acquire the slot and that
value will remain the same while the slot is active. I am not sure I
understood your question, so what I am saying might not make sense.

-- 
With Regards,
Amit Kapila.




Re: documentation structure

2024-03-25 Thread Peter Eisentraut

On 22.03.24 14:59, Robert Haas wrote:

And I don't believe that if someone were writing a physical book about
PostgreSQL from scratch, they'd ever end up with a top-level chapter
that looks anything like our GiST chapter. All of the index AM
chapters are quite obviously clones of each other, and they're all
quite short. Surely you'd make them sections within a chapter, not
entire chapters.

I do agree that PL/pgsql is more arguable. I can imagine somebody
writing a book about PostgreSQL and choosing to make that topic into a
whole chapter.


Yeah, I think there is probably a range of things, from pretty obvious 
to mostly controversial.





Re: add AVX2 support to simd.h

2024-03-25 Thread Nathan Bossart
On Mon, Mar 25, 2024 at 10:03:27AM +0700, John Naylor wrote:
> Seems pretty good. It'd be good to see the results of 2- vs.
> 4-register before committing, because that might lead to some
> restructuring, but maybe it won't, and v8 is already an improvement
> over HEAD.

I tested this the other day [0] (only for x86).  The results seemed to
indicate that the 4-register approach was still quite a bit better.

> /* Process the remaining elements one at a time. */
> 
> This now does all of them if that path is taken, so "remaining" can be 
> removed.

Right, will do.

[0] https://postgr.es/m/20240321183823.GA1800896%40nathanxps13
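
For anyone following along, an illustrative sketch of the shape being
discussed -- generic SSE2 intrinsics and hypothetical names, not the
actual simd.h/pg_lfind code -- showing a search loop that consumes four
vector registers per iteration and then handles the tail one element at
a time:

#include <emmintrin.h>
#include <stdbool.h>
#include <stdint.h>

/* Return true if 'key' appears in the first 'nelem' entries of 'base'. */
static bool
lfind32_four_regs(uint32_t key, const uint32_t *base, uint32_t nelem)
{
    const __m128i keys = _mm_set1_epi32((int) key);
    const uint32_t nper = 4 * (sizeof(__m128i) / sizeof(uint32_t)); /* 16 */
    uint32_t i = 0;

    /* Main loop: compare four 16-byte registers per iteration. */
    for (; i + nper <= nelem; i += nper)
    {
        __m128i v0 = _mm_loadu_si128((const __m128i *) &base[i]);
        __m128i v1 = _mm_loadu_si128((const __m128i *) &base[i + 4]);
        __m128i v2 = _mm_loadu_si128((const __m128i *) &base[i + 8]);
        __m128i v3 = _mm_loadu_si128((const __m128i *) &base[i + 12]);
        __m128i r01 = _mm_or_si128(_mm_cmpeq_epi32(keys, v0),
                                   _mm_cmpeq_epi32(keys, v1));
        __m128i r23 = _mm_or_si128(_mm_cmpeq_epi32(keys, v2),
                                   _mm_cmpeq_epi32(keys, v3));

        if (_mm_movemask_epi8(_mm_or_si128(r01, r23)) != 0)
            return true;
    }

    /* Process the remaining elements one at a time. */
    for (; i < nelem; i++)
    {
        if (base[i] == key)
            return true;
    }

    return false;
}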

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




RE: Popcount optimization using AVX512

2024-03-25 Thread Amonson, Paul D
> -Original Message-
> From: Tom Lane 
> Sent: Monday, March 25, 2024 8:12 AM
> To: Amonson, Paul D 
> Cc: David Rowley ; Nathan Bossart
> Subject: Re: Popcount optimization using AVX512
>...
> Just for a note --- the cfbot will re-test existing patches every so often 
> without
> needing a bump.  The current cycle period seems to be about two days.
> 
>   regards, tom lane

Good to know! Maybe this is why I thought it originally passed CI and then 
suddenly failed this morning. I noticed at least two other patch runs also 
failed in the same way.

Thanks,
Paul





Re: pgsql: Track last_inactive_time in pg_replication_slots.

2024-03-25 Thread Bertrand Drouvot
Hi,

On Mon, Mar 25, 2024 at 08:38:16PM +0530, Amit Kapila wrote:
> On Mon, Mar 25, 2024 at 7:51 PM Robert Haas  wrote:
> >
> > On Mon, Mar 25, 2024 at 10:02 AM Amit Kapila  wrote:
> > > We considered the other two names as last_inactive_at and
> > > last_active_time. For the first (last_inactive_at), there was an
> > > argument that most other fields that display time ends with _time. For
> > > the second (last_active_time), there was an argument that it could be
> > > misleading as one could think that it should be updated each time WAL
> > > record decoding is happening [1]. The other possibility is to name it
> > > last_used_time but I think it won't be much different from
> > > last_active_time.
> >
> > I don't understand the bit about updating it each time WAL record
> > decoding is happening. If it's the last active time, and the slot is
> > currently active, then the answer is either "right now" or "currently
> > undefined." I'd expect to see NULL in the system view in such a case.
> > And if that's so, then there's nothing to update each time a record is
> > decoded, because it's just still going to show NULL.
> >
> 
> IIUC, Bertrand's point was that users can interpret last_active_time
> as a value that gets updated each time they decode a change which is
> not what we are doing. So, this can confuse users. Your expectation of
> answer (NULL) when the slot is active is correct and that is what will
> happen.

Yeah, and so would be the confusion: why is last_active_time NULL while one is
using the slot?

> > Why does this field get set to the current time when the slot is
> > restored from disk?
> >
> 
> It is because we don't want to include the time the server is down in
> the last_inactive_time. Say, if we are shutting down the server at
> time X and the server remains down for another two hours, we don't
> want to include those two hours as the slot inactive time. The related
> theory is that this field will be used to invalidate inactive slots
> based on a threshold (say inactive_timeout). Say, before the shutdown,
> we release the slot and set the current_time for last_inactive_time
> for each slot and persist that information as well. Now, if the server
> is down for a long time, we may invalidate the slots as soon as the
server comes up. So, instead, we just set this field at the time we
read slots from disk and then reset it to 0/NULL as soon as the slot
becomes active.

Right, and we also want to invalidate the slot if its not-used duration
exceeds the timeout, so setting the field to zero when the slot is
restored from disk is also not an option.
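
To make the rule concrete, a rough sketch (hypothetical helper, not the
actual patch): a slot is only a candidate for inactive-timeout
invalidation once it carries a nonzero inactive_since, which is
re-stamped when slots are read from disk so that server downtime is not
counted:

#include "postgres.h"

#include "utils/timestamp.h"

/*
 * Hypothetical helper, not the committed code: has this slot been
 * inactive longer than the configured timeout?
 */
static bool
slot_inactive_timeout_elapsed(TimestampTz inactive_since,
                              int inactive_timeout_secs,
                              TimestampTz now)
{
    /* Feature disabled, or the slot is currently acquired and in use. */
    if (inactive_timeout_secs <= 0 || inactive_since == 0)
        return false;

    return TimestampDifferenceExceeds(inactive_since, now,
                                      inactive_timeout_secs * 1000);
}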

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Slow GRANT ROLE on PostgreSQL 16 with thousands of ROLEs

2024-03-25 Thread Nathan Bossart
On Mon, Mar 25, 2024 at 11:08:39AM -0400, Tom Lane wrote:
> * The magic constants (crossover list length and bloom filter size)
> need some testing to see if there are better values.  They should
> probably be made into named #defines, too.  I suspect, with little
> proof, that the bloom filter size isn't particularly critical --- but
> I know we pulled the crossover of 1000 out of thin air, and I have
> no certainty that it's even within an order of magnitude of being a
> good choice.

I'll try to construct a couple of tests to see if we can determine a proper
order of magnitude.

> * Code needs more than zero comments.

Yup.

> * Is it worth trying to make a subroutine, or at least a macro,
> so as not to have 2 copies of the code?

I think so.  I'll try that in the next version.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: [PATCH] plpython function causes server panic

2024-03-25 Thread Robert Haas
On Sat, Mar 23, 2024 at 12:31 PM Tom Lane  wrote:
> However, the calling logic seems a bit shy of a load, in that it
> trusts IsInParallelMode() completely to decide whether to check for
> leaked parallel contexts.  So we'd miss the case where somebody did
> ExitParallelMode without having cleaned up workers.  It's not like
> AtEOXact_Parallel and AtEOSubXact_Parallel cost a lot when they have
> nothing to do, so I think we should call them unconditionally, and
> separately from that issue a warning if parallelModeLevel isn't zero
> (and we're committing).

I wasn't worried about this case when I wrote this code. The general
flow that I anticipated was that somebody would run a query, execMain.c
would enter parallel mode, and then maybe eventually reach some
SQL-callable C function that hadn't gotten the memo about parallel query
but had been mistakenly labelled as PARALLEL RESTRICTED or PARALLEL SAFE
when it wasn't really. So the goal was for core functions that such a
function might reasonably attempt to call to notice that something bad
was happening.

But if the user puts a call to ExitParallelMode() inside such a
function, it's hard to imagine what goal they have other than to
deliberately circumvent the safeguards. And they're always going to be
able to do that somehow, if they're coding in C. So I'm not convinced
that the sanity checks you've added are really going to do anything
other than burn a handful of CPU cycles. If there's some plausible
case in which they protect us against a user who has legitimately made
an error, fine; but if we're just wandering down the slippery slope of
believing we can defend against malicious C code, we absolutely should
not do that, not even a little bit. The first CPU instruction we burn
in the service of a hopeless cause is already one too many.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Popcount optimization using AVX512

2024-03-25 Thread Tom Lane
"Amonson, Paul D"  writes:
> I am re-posting the patches as CI for Mac failed (CI error not code/test 
> error). The patches are the same as last time.

Just for a note --- the cfbot will re-test existing patches every
so often without needing a bump.  The current cycle period seems to
be about two days.

regards, tom lane




Re: Slow GRANT ROLE on PostgreSQL 16 with thousands of ROLEs

2024-03-25 Thread Tom Lane
Nathan Bossart  writes:
> Are there any changes you'd like to see for the Bloom patch [0]?  I'd like
> to see about getting that committed for v17.  One thing that crossed my
> mind is creating a combined list/filter that transparently created a filter
> when necessary (for reuse elsewhere), but I'm not sure that's v17 material.

Yeah, that thought occurred to me too, but I think we ought to have a
few more use-cases in view before trying to write an API.

As for the patch, I agree it could go into v17, but I think there is
still a little bit of work to do:

* The magic constants (crossover list length and bloom filter size)
need some testing to see if there are better values.  They should
probably be made into named #defines, too.  I suspect, with little
proof, that the bloom filter size isn't particularly critical --- but
I know we pulled the crossover of 1000 out of thin air, and I have
no certainty that it's even within an order of magnitude of being a
good choice.

* Code needs more than zero comments.

* Is it worth trying to make a subroutine, or at least a macro,
so as not to have 2 copies of the code?
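
As a strawman for that last point, here is what the shared subroutine
could look like -- hypothetical names and sizing, not the actual patch,
using the existing lib/bloomfilter.h API with the magic numbers pulled
out into #defines:

#include "postgres.h"

#include "lib/bloomfilter.h"
#include "nodes/pg_list.h"

/* Both values need testing; the crossover was pulled out of thin air. */
#define ROLES_LIST_FILTER_THRESHOLD     1000
#define ROLES_LIST_FILTER_WORK_MEM      1024    /* in kilobytes */

/*
 * Hypothetical helper: is 'role' a member of 'roles_list'?  Once the
 * list grows past the threshold, build a Bloom filter over it; the
 * filter can only prove absence, so possible hits still get confirmed
 * against the list.  Callers that later append to the list must also
 * bloom_add_element() the new entry to keep the filter in sync.
 */
static bool
roles_list_member(List *roles_list, bloom_filter **bf, Oid role)
{
    if (*bf == NULL && list_length(roles_list) > ROLES_LIST_FILTER_THRESHOLD)
    {
        ListCell   *lc;

        *bf = bloom_create(list_length(roles_list) * 10,    /* rough guess */
                           ROLES_LIST_FILTER_WORK_MEM, 0);
        foreach(lc, roles_list)
        {
            Oid         r = lfirst_oid(lc);

            bloom_add_element(*bf, (unsigned char *) &r, sizeof(Oid));
        }
    }

    if (*bf != NULL &&
        bloom_lacks_element(*bf, (unsigned char *) &role, sizeof(Oid)))
        return false;           /* definitely not a member */

    return list_member_oid(roles_list, role);
}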

regards, tom lane



