Re: planner chooses incremental but not the best one

2024-02-14 Thread Andrei Lepikhov

On 15/12/2023 15:58, Richard Guo wrote:

With the patch the estimate for the number of distinct 'b' values is
more accurate.

+1 to commit this patch.
It looks good and resolves what is effectively a bug in the code.


BTW, this patch does not change any existing regression test results.  I
attempted to devise a regression test that shows how this change can
improve query plans, but failed.  Should I try harder to find such a
test case?
The test that was changed covers different features. Its behaviour may 
change in the future and mask testing of this code. IMO, you should add a 
test that directly checks the appendrel->tuples correction.
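
For illustration only (not from the patch; table and column names are 
invented), such a test could populate a small partitioned table and check 
that the estimated number of groups tracks the corrected tuple count of the 
append relation:

-- Illustrative sketch: two partitions, 50 distinct values of b overall.
CREATE TABLE appendrel_est (a int, b int) PARTITION BY LIST (a);
CREATE TABLE appendrel_est_p1 PARTITION OF appendrel_est FOR VALUES IN (1);
CREATE TABLE appendrel_est_p2 PARTITION OF appendrel_est FOR VALUES IN (2);
INSERT INTO appendrel_est
  SELECT 1 + (i % 2), i % 50 FROM generate_series(1, 1000) i;
ANALYZE appendrel_est;
-- With appendrel->tuples corrected, the estimated number of groups here
-- should be close to 50 instead of being derived from a stale tuple count.
EXPLAIN SELECT DISTINCT b FROM appendrel_est;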


--
regards,
Andrei Lepikhov
Postgres Professional





Re: pg_upgrade and logical replication

2024-02-14 Thread Amit Kapila
On Wed, Feb 14, 2024 at 9:07 AM Hayato Kuroda (Fujitsu)
 wrote:
>
> > pg_upgrade/t/004_subscription.pl says
> >
> > |my $mode = $ENV{PG_TEST_PG_UPGRADE_MODE} || '--copy';
> >
> > ..but I think maybe it should not.
> >
> > When you try to use --link, it fails:
> > https://cirrus-ci.com/task/4669494061170688
> >
> > |Adding ".old" suffix to old global/pg_control ok
> > |
> > |If you want to start the old cluster, you will need to remove
> > |the ".old" suffix from
> > /tmp/cirrus-ci-build/build/testrun/pg_upgrade/004_subscription/data/t_004_su
> > bscription_old_sub_data/pgdata/global/pg_control.old.
> > |Because "link" mode was used, the old cluster cannot be safely
> > |started once the new cluster has been started.
> > |...
> > |
> > |postgres: could not find the database system
> > |Expected to find it in the directory
> > "/tmp/cirrus-ci-build/build/testrun/pg_upgrade/004_subscription/data/t_004_s
> > ubscription_old_sub_data/pgdata",
> > |but could not open file
> > "/tmp/cirrus-ci-build/build/testrun/pg_upgrade/004_subscription/data/t_004_s
> > ubscription_old_sub_data/pgdata/global/pg_control": No such file or 
> > directory
> > |# No postmaster PID for node "old_sub"
> > |[19:36:01.396](0.250s) Bail out!  pg_ctl start failed
> >
>
> Good catch! The primary reason for the failure is that the old cluster is
> reused even after the successful upgrade. The documentation says [1]:
>
> >
> If you use link mode, the upgrade will be much faster (no file copying) and 
> use less
> disk space, but you will not be able to access your old cluster once you 
> start the new
> cluster after the upgrade.
> >
>
> > You could rename pg_control.old to avoid that immediate error, but that 
> > doesn't
> > address the essential issue that "the old cluster cannot be safely started 
> > once
> > the new cluster has been started."
>
> Yeah, I agree that accessing the old cluster after the upgrade should be
> avoided.
> IIUC, pg_upgrade would be run three times in 004_subscription.
>
> 1. successful upgrade
> 2. failure due to the insufficient max_replication_slot
> 3. failure because the pg_subscription_rel has 'd' state
>
> And the old instance is reused in all of the runs. Therefore, the most
> reasonable fix is to change the ordering of the tests, i.e., the
> "successful upgrade" should be done last.
>

This sounds like a reasonable way to address the reported problem.
Justin, do let me know if you think otherwise?

Comment:
===
*
-# Setup an enabled subscription to verify that the running status and failover
-# option are retained after the upgrade.
+# Setup a subscription to verify that the failover option are retained after
+# the upgrade.
 $publisher->safe_psql('postgres', "CREATE PUBLICATION regress_pub1");
 $old_sub->safe_psql('postgres',
- "CREATE SUBSCRIPTION regress_sub1 CONNECTION '$connstr' PUBLICATION
regress_pub1 WITH (failover = true)"
+ "CREATE SUBSCRIPTION regress_sub1 CONNECTION '$connstr' PUBLICATION
regress_pub1 WITH (failover = true, enabled = false)"
 );

I think it is better not to create the subscription that we want to use for
the success case at such an early stage. Let's have separate subscriptions
for the failure and success cases. I think that will avoid the newly added
ALTER statements in the patch.
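
Just to sketch the shape of that suggestion (illustrative only, not the
actual test change; '$connstr' stands for the test's connection string):

-- A disabled subscription dedicated to the failure scenarios ...
CREATE SUBSCRIPTION regress_sub_fail CONNECTION '$connstr'
    PUBLICATION regress_pub1 WITH (failover = true, enabled = false);
-- ... and a separate subscription, created just before the successful
-- upgrade, so its running state never needs to be toggled with ALTER.
CREATE SUBSCRIPTION regress_sub_ok CONNECTION '$connstr'
    PUBLICATION regress_pub1 WITH (failover = true);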

-- 
With Regards,
Amit Kapila.




Re: Make COPY format extendable: Extract COPY TO format implementations

2024-02-14 Thread Sutou Kouhei
Hi,

In <20240213.173340.1518143507526518973@clear-code.com>
  "Re: Make COPY format extendable: Extract COPY TO format implementations" on 
Tue, 13 Feb 2024 17:33:40 +0900 (JST),
  Sutou Kouhei  wrote:

> I'll reply other comments later...

I've read the other comments, and my answers to them are the same as
Michael's.


I'll prepare the v15 patch with static inline functions and
fixed arguments after the fcinfo cache patches are merged. I
think that the v15 patch will conflict with the fcinfo
cache patches.


Thanks,
-- 
kou




Re: planner chooses incremental but not the best one

2024-02-14 Thread Andrei Lepikhov

On 18/12/2023 19:53, Tomas Vondra wrote:

On 12/18/23 11:40, Richard Guo wrote:
The challenge is where to get usable information about correlation
between columns. I only have a couple of very rough ideas of what we might
try. For example, if we have multi-column ndistinct statistics, we might
look at ndistinct(b,c) and ndistinct(b,c,d) and deduce something from

 ndistinct(b,c,d) / ndistinct(b,c)

If we know how many distinct values we have for the predicate column, we
could then estimate the number of groups. I mean, we know that for the
restriction "WHERE b = 3" we only have 1 distinct value, so we could
estimate the number of groups as

 1 * ndistinct(b,c)

Did you mean here ndistinct(c,d) and the formula:
ndistinct(b,c,d) / ndistinct(c,d) ?

Do you implicitly have in mind here the need to track the clauses that 
were applied to the data up to the moment of grouping?
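
For reference, the multi-column ndistinct figures mentioned above are
exactly what extended statistics can collect; a generic example (not from
the thread):

CREATE TABLE t (a int, b int, c int, d int);
-- Collect ndistinct estimates for combinations of b, c and d, which gives
-- the planner access to values such as ndistinct(b,c) and ndistinct(b,c,d).
CREATE STATISTICS t_nd (ndistinct) ON b, c, d FROM t;
ANALYZE t;
-- The collected groupings can be inspected afterwards:
SELECT statistics_name, n_distinct FROM pg_stats_ext WHERE tablename = 't';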


--
regards,
Andrei Lepikhov
Postgres Professional





Re: Make COPY format extendable: Extract COPY TO format implementations

2024-02-14 Thread Sutou Kouhei
Hi,

In 
  "Re: Make COPY format extendable: Extract COPY TO format implementations" on 
Wed, 14 Feb 2024 15:52:36 +0900,
  Michael Paquier  wrote:

>> How about InputFunctionCallSafeWithInfo(),
>> InputFunctionCallSafeInfo() or
>> InputFunctionCallInfoCallSafe()?
> 
> WithInfo() would not be a new thing.  There are a couple of APIs named
> like this when manipulating catalogs, so that sounds kind of a good
> choice from here.

Thanks for the info. Let's use InputFunctionCallSafeWithInfo().
See the attached patch:
v2-0001-Reuse-fcinfo-used-in-COPY-FROM.patch

I also attach a patch for COPY TO:
v1-0001-Reuse-fcinfo-used-in-COPY-TO.patch

I measured the COPY TO patch on my environment with:
COPY (SELECT 
1::int2,2::int2,3::int2,4::int2,5::int2,6::int2,7::int2,8::int2,9::int2,10::int2,11::int2,12::int2,13::int2,14::int2,15::int2,16::int2,17::int2,18::int2,19::int2,20::int2,
 generate_series(1, 100::int4)) TO '/dev/null' \watch c=5

master:
740.066ms
734.884ms
738.579ms
734.170ms
727.953ms

patched:
730.714ms
741.483ms
714.149ms
715.436ms
713.578ms

It seems that it improves performance a bit, but my
environment isn't suitable for benchmarking, so these may
not be valid numbers.


Thanks,
-- 
kou
>From b677732f46f735a5601b889f79671e91be41 Mon Sep 17 00:00:00 2001
From: Sutou Kouhei 
Date: Thu, 15 Feb 2024 15:01:08 +0900
Subject: [PATCH v2] Reuse fcinfo used in COPY FROM

Each NextCopyFrom() calls input functions or binary-input
functions. We can reuse fcinfo for them instead of creating a local
fcinfo for each call. This will improve performance.
---
 src/backend/commands/copyfrom.c  |   4 +
 src/backend/commands/copyfromparse.c |  20 ++--
 src/backend/utils/fmgr/fmgr.c| 126 +++
 src/include/commands/copyfrom_internal.h |   1 +
 src/include/fmgr.h   |  12 +++
 5 files changed, 154 insertions(+), 9 deletions(-)

diff --git a/src/backend/commands/copyfrom.c b/src/backend/commands/copyfrom.c
index 1fe70b9133..ed375c012e 100644
--- a/src/backend/commands/copyfrom.c
+++ b/src/backend/commands/copyfrom.c
@@ -1691,6 +1691,10 @@ BeginCopyFrom(ParseState *pstate,
 	/* We keep those variables in cstate. */
 	cstate->in_functions = in_functions;
 	cstate->typioparams = typioparams;
+	if (cstate->opts.binary)
+		cstate->fcinfo = PrepareReceiveFunctionCallInfo();
+	else
+		cstate->fcinfo = PrepareInputFunctionCallInfo();
 	cstate->defmap = defmap;
 	cstate->defexprs = defexprs;
 	cstate->volatile_defexprs = volatile_defexprs;
diff --git a/src/backend/commands/copyfromparse.c b/src/backend/commands/copyfromparse.c
index 7cacd0b752..7907e16ea8 100644
--- a/src/backend/commands/copyfromparse.c
+++ b/src/backend/commands/copyfromparse.c
@@ -861,6 +861,7 @@ NextCopyFrom(CopyFromState cstate, ExprContext *econtext,
 num_defaults = cstate->num_defaults;
 	FmgrInfo   *in_functions = cstate->in_functions;
 	Oid		   *typioparams = cstate->typioparams;
+	FunctionCallInfoBaseData *fcinfo = cstate->fcinfo;
 	int			i;
 	int		   *defmap = cstate->defmap;
 	ExprState **defexprs = cstate->defexprs;
@@ -961,12 +962,13 @@ NextCopyFrom(CopyFromState cstate, ExprContext *econtext,
 			 * If ON_ERROR is specified with IGNORE, skip rows with soft
 			 * errors
 			 */
-			else if (!InputFunctionCallSafe(&in_functions[m],
-			string,
-			typioparams[m],
-			att->atttypmod,
-			(Node *) cstate->escontext,
-			&nulls[m]))
+			else if (!InputFunctionCallSafeWithInfo(fcinfo,
+	&in_functions[m],
+	string,
+	typioparams[m],
+	att->atttypmod,
+	(Node *) cstate->escontext,
+	&nulls[m]))
 			{
 cstate->num_errors++;
 return true;
@@ -1966,7 +1968,7 @@ CopyReadBinaryAttribute(CopyFromState cstate, FmgrInfo *flinfo,
 	if (fld_size == -1)
 	{
 		*isnull = true;
-		return ReceiveFunctionCall(flinfo, NULL, typioparam, typmod);
+		return ReceiveFunctionCallWithInfo(cstate->fcinfo, flinfo, NULL, typioparam, typmod);
 	}
 	if (fld_size < 0)
 		ereport(ERROR,
@@ -1987,8 +1989,8 @@ CopyReadBinaryAttribute(CopyFromState cstate, FmgrInfo *flinfo,
 	cstate->attribute_buf.data[fld_size] = '\0';
 
 	/* Call the column type's binary input converter */
-	result = ReceiveFunctionCall(flinfo, &cstate->attribute_buf,
- typioparam, typmod);
+	result = ReceiveFunctionCallWithInfo(cstate->fcinfo, flinfo, &cstate->attribute_buf,
+		 typioparam, typmod);
 
 	/* Trouble if it didn't eat the whole buffer */
 	if (cstate->attribute_buf.cursor != cstate->attribute_buf.len)
diff --git a/src/backend/utils/fmgr/fmgr.c b/src/backend/utils/fmgr/fmgr.c
index e48a86be54..14c3ed2bdb 100644
--- a/src/backend/utils/fmgr/fmgr.c
+++ b/src/backend/utils/fmgr/fmgr.c
@@ -1672,6 +1672,73 @@ DirectInputFunctionCallSafe(PGFunction func, char *str,
 	return true;
 }
 
+/*
+ * Prepare callinfo for InputFunctionCallSafeWithInfo to reuse one callinfo
+ * instead of initializing it for each call. This is for performance.
+ */

Re: About a recently-added message

2024-02-14 Thread Kyotaro Horiguchi
At Thu, 15 Feb 2024 09:22:23 +0530, shveta malik  wrote 
in 
> On Thu, Feb 15, 2024 at 8:26 AM Amit Kapila  wrote:
> >
> > On Wed, Feb 14, 2024 at 7:51 PM Euler Taveira  wrote:
> > >
> > > On Wed, Feb 14, 2024, at 8:45 AM, Amit Kapila wrote:
> > >
> > > Now, I am less clear about whether to quote "logical" or not in the
> > > above message. Do you have any suggestions?
> > >
> > >
> > > The possible confusion comes from the fact that the sentence contains 
> > > "must be"
> > > in the middle of a comparison expression. For pg_createsubscriber, we are 
> > > using
> > >
> > > publisher requires wal_level >= logical
> > >
> > > I suggest to use something like
> > >
> > > slot synchronization requires wal_level >= logical
> > >
> >
> > This sounds like a better idea but shouldn't we directly use this in
> > 'errmsg' instead of a separate 'errhint'? Also, if change this, then
> > we should make a similar change for other messages in the same
> > function.
> 
> +1 on changing the msg(s) suggested way. Please find the patch for the
> same. It also removes double quotes around the variable names

Thanks for the discussion.

With a translator hat on, I would be happy if I could determine
whether a word requires translation with minimal background
information. In this case, a translator needs to know which values
wal_level can take. It's relatively easy in this case, but I'm not
sure if this is always the case. Therefore, I would be slightly
happier if "logical" were double-quoted.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center


Re: Skip collecting decoded changes of already-aborted transactions

2024-02-14 Thread Masahiko Sawada
On Fri, Feb 2, 2024 at 12:48 AM vignesh C  wrote:
>
> On Tue, 3 Oct 2023 at 15:54, vignesh C  wrote:
> >
> > On Mon, 3 Jul 2023 at 07:16, Masahiko Sawada  wrote:
> > >
> > > On Fri, Jun 23, 2023 at 12:39 PM Dilip Kumar  
> > > wrote:
> > > >
> > > > On Fri, Jun 9, 2023 at 10:47 AM Masahiko Sawada  
> > > > wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > In logical decoding, we don't need to collect decoded changes of
> > > > > aborted transactions. While streaming changes, we can detect
> > > > > concurrent abort of the (sub)transaction but there is no mechanism to
> > > > > skip decoding changes of transactions that are known to already be
> > > > > aborted. With the attached WIP patch, we check CLOG when decoding the
> > > > > transaction for the first time. If it's already known to be aborted,
> > > > > we skip collecting decoded changes of such transactions. That way,
> > > > > when the logical replication is behind or restarts, we don't need to
> > > > > decode large transactions that already aborted, which helps improve
> > > > > the decoding performance.
> > > > >
> > > > +1 for the idea of checking the transaction status only when we need
> > > > to flush it to the disk or send it downstream (if streaming in
> > > > progress is enabled).   Although this check is costly since we are
> > > > planning only for large transactions then it is worth it if we can
> > > > occasionally avoid disk or network I/O for the aborted transactions.
> > > >
> > >
> > > Thanks.
> > >
> > > I've attached the updated patch. With this patch, we check the
> > > transaction status only for large transactions at eviction time. For
> > > regression test purposes, I disable this transaction status check when
> > > logical_replication_mode is set to 'immediate'.
> >
> > May be there is some changes that are missing in the patch, which is
> > giving the following errors:
> > reorderbuffer.c: In function ‘ReorderBufferCheckTXNAbort’:
> > reorderbuffer.c:3584:22: error: ‘logical_replication_mode’ undeclared
> > (first use in this function)
> >  3584 | if (unlikely(logical_replication_mode ==
> > LOGICAL_REP_MODE_IMMEDIATE))
> >   |  ^~~~
>
> With no update to the thread and the compilation still failing I'm
> marking this as returned with feedback.  Please feel free to resubmit
> to the next CF when there is a new version of the patch.
>

I resumed working on this item. I've attached the new version patch.

I rebased the patch to the current HEAD and updated comments and
commit messages. The patch is straightforward and I'm somewhat
satisfied with it, but I'm thinking of adding some tests for it.
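
For anyone who wants to see the transaction-status check from the SQL
level, this is only an illustration of the kind of CLOG lookup the patch
performs internally, not part of the patch:

BEGIN;
SELECT pg_current_xact_id();    -- note the returned xid, e.g. 1234
ROLLBACK;
-- Roughly the same lookup the decoder can do against CLOG: the xid is
-- known to be aborted, so its changes need not be collected.
SELECT pg_xact_status('1234');  -- reports 'aborted'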

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
From c5c78f5d53d375f7a79b2561c551f7bb3ff57717 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada 
Date: Mon, 3 Jul 2023 10:28:00 +0900
Subject: [PATCH v3] Skip logical decoding of already-aborted transactions.

If we detect a concurrent abort of a streaming transaction, we discard
all changes and skip decoding further changes of the transaction. This
commit introduces a new check if a (streaming or non-streaming)
transaction is already aborted by CLOG lookup, enabling us to skip
decoding further changes of the transaction. This helps logical decoding
performance a lot in cases where the transaction is large and already
rolled back, since we can save disk or network I/O.

We perform this new check only for large transactions at eviction time,
since checking CLOG is costly and could cause a slowdown with lots of small
transactions, where most transactions commit.

Reviewed-by:
Discussion: https://postgr.es/m/CAD21AoDht9Pz_DFv_R2LqBTBbO4eGrpa9Vojmt5z5sEx3XwD7A@mail.gmail.com
---
 .../replication/logical/reorderbuffer.c   | 98 ---
 src/include/replication/reorderbuffer.h   | 13 ++-
 2 files changed, 94 insertions(+), 17 deletions(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index bbf0966182..f3284708bf 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -100,6 +100,7 @@
 #include "replication/snapbuild.h"	/* just for SnapBuildSnapDecRefcount */
 #include "storage/bufmgr.h"
 #include "storage/fd.h"
+#include "storage/procarray.h"
 #include "storage/sinval.h"
 #include "utils/builtins.h"
 #include "utils/combocid.h"
@@ -256,7 +257,7 @@ static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	   char *data);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
-	 bool txn_prepared);
+	 bool txn_prepared, bool streaming);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 		TransactionId xid, XLogSegNo segno);
@@ 

Re: speed up a logical replica setup

2024-02-14 Thread Shubham Khanna
On Thu, Feb 15, 2024 at 10:37 AM Hayato Kuroda (Fujitsu)
 wrote:
>
> Dear Euler,
>
> Here are my minor comments for 17.
>
> 01.
> ```
> /* Options */
> static const char *progname;
>
> static char *primary_slot_name = NULL;
> static bool dry_run = false;
>
> static bool success = false;
>
> static LogicalRepInfo *dbinfo;
> static int  num_dbs = 0;
> ```
>
> The comment seems out-of-date. There is only one option.
>
> 02. check_subscriber and check_publisher
>
> Missing pg_catalog prefix in some lines.
>
> 03. get_base_conninfo
>
> I think dbname would not be set. IIUC, dbname should be a pointer to a
> pointer.
>
>
> 04.
>
> I check the coverage and found two functions have been never called:
>  - drop_subscription
>  - drop_replication_slot
>
> Also, some cases were not tested. Below bullet showed notable ones for me.
> (Some of them would not be needed based on discussions)
>
> * -r is specified
> * -t is specified
> * -P option contains dbname
> * -d is not specified
> * GUC settings are wrong
> * primary_slot_name is specified on the standby
> * standby server is not working
>
> At the feature level, we may be able to check that the server log is surely
> removed in case of success.
>
> So, which tests should be added? drop_subscription() is called only during
> the cleanup phase, so it may be difficult to test. According to others, it seems 
> that
> -r and -t are not tested. GUC-settings have many test cases so not sure they
> should be. Based on this, others can be tested.
>
> PSA my top-up patch set.
>
> V18-0001: same as your patch, v17-0001.
> V18-0002: modify the alignment of codes.
> V18-0003: change an argument of get_base_conninfo. Per comment 3.
> === experimental patches ===
> V18-0004: Add testcases per comment 4.
> V18-0005: Remove -P option. I'm not sure it should be needed, but I made just 
> in case.

I created a cascaded physical replication setup like
node1->node2->node3 and ran pg_createsubscriber for node2. After
running the script, I started node2 again and found that the
pg_createsubscriber command had succeeded, after which the physical
replication between node2 and node3 was broken. I feel
pg_createsubscriber should check for this scenario and throw an error in
this case to avoid breaking the cascaded replication setup. I have
attached the script that was used to verify this.
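
One possible shape for such a check (a sketch of the idea only, not
pg_createsubscriber's implementation): before converting the standby, ask
it whether any downstream standbys are streaming from it and refuse to
proceed if so.

-- Run against the target standby (node2 in the attached script); a
-- non-zero count means another standby (node3) is attached and would be
-- broken by the conversion.
SELECT count(*) AS downstream_standbys FROM pg_stat_replication;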

Thanks and Regards,
Shubham Khanna.
#!/bin/bash

#
# This script tests a case of cascading replication.
#
# Creating system,:
#   node1-(physical replication)->node2-(physical replication)->node3
#

# Stop instances
pg_ctl stop -D node1
pg_ctl stop -D node2
pg_ctl stop -D node3

# Remove old files
rm -rf node1
rm -rf node2
rm -rf node3
rm log_node2 log_node1 log_node3

# Initialize primary
initdb -D node1

echo "wal_level = logical" >> node1/postgresql.conf
echo "max_replication_slots=3" >> node1/postgresql.conf

pg_ctl -D node1 -l log_node1 start

psql -d postgres -c "CREATE DATABASE db1";
psql -d db1 -c "CHECKPOINT";

sleep 3

# Initialize standby
pg_basebackup -v -R -D node2
echo "Port=9000" >> node2/postgresql.conf

pg_ctl -D node2 -l log_node2 start

# Initialize another standby
pg_basebackup -p 9000 -v -R -D node3
echo "Port=9001" >> node3/postgresql.conf
pg_ctl -D node3 -l log_node3 start

# Insert a tuple to primary
psql -d db1 -c "CREATE TABLE c1(a int)";
psql -d db1 -c "Insert into c1 Values(2)";
sleep 3

# And verify it can be seen on another standby
psql -d db1 -p 9001 -c "Select * from c1";

pg_createsubscriber -D node2/ -S "host=localhost port=9000 dbname=postgres" -d postgres -d db1 -r -v
pg_ctl -D node2 -l log_node2 start


Re: Returning non-terminated string in ECPG Informix-compatible function

2024-02-14 Thread Michael Paquier
On Thu, Feb 15, 2024 at 12:15:40PM +0700, Oleg Tselebrovskiy wrote:
> Greetings again.
> I was looking through more static analyzer output and found another problem.
> In ecpg/pgtypeslib/dt_common.c there are 4 calls of pgtypes_alloc.
> This function uses calloc and returns NULL if OOM, but we don't check its
> return value and immediately pass it to strcpy, which could lead to
> segfault.
>
> I suggest adding a check for a return value since all other calls of
> pgtypes_alloc are checked for NULL.

Right.

> @@ -654,7 +654,7 @@ intoasc(interval * i, char *str)
>  if (!tmp)
>  return -errno;
>  
> -memcpy(str, tmp, strlen(tmp));
> +strcpy(str, tmp);

For this stuff, Ashutosh's point was to integrate a test case into the
patch.  With the pgc you have posted, most of the work is done, but
this should be added to src/interfaces/ecpg/test/sql/ to add some
automated coverage.  See the area for references showing how this is
achieved.

> @@ -2837,6 +2843,8 @@ PGTYPEStimestamp_defmt_scan(char **str, char *fmt, 
> timestamp * d,
>  case 'T':
>  pfmt++;
>  tmp = pgtypes_alloc(strlen("%H:%M:%S") + strlen(pstr) + 1);
> +if(!tmp)
> +return 1;

These are obviously wrong as pgtypes_alloc() could return NULL.
--
Michael


signature.asc
Description: PGP signature


xmlBufferCreate return value not checked in pgxmlNodeSetToText

2024-02-14 Thread Oleg Tselebrovskiy

Greetings, everyone!

While analyzing output of Svace static analyzer [1] I've found a bug.

In the function pgxmlNodeSetToText there is a call of xmlBufferCreate that
doesn't have its return value checked. All four other calls of
xmlBufferCreate are wrapped in a try...catch that checks the return value.

I suggest adding the same checks here that are used in the other four calls
of xmlBufferCreate.

The proposed patch is attached.

[1] - https://svace.pages.ispras.ru/svace-website/en/

Oleg Tselebrovskiy, Postgres Pro

diff --git a/contrib/xml2/xpath.c b/contrib/xml2/xpath.c
index 94641930f7b..2d72ade9c20 100644
--- a/contrib/xml2/xpath.c
+++ b/contrib/xml2/xpath.c
@@ -122,62 +122,85 @@ pgxmlNodeSetToText(xmlNodeSetPtr nodeset,
    xmlChar *septagname,
    xmlChar *plainsep)
 {
-	xmlBufferPtr buf;
+	xmlBufferPtr buf = NULL;
 	xmlChar*result;
 	int			i;
+	PgXmlErrorContext *xmlerrcxt;
 
-	buf = xmlBufferCreate();
+	xmlerrcxt = pg_xml_init(PG_XML_STRICTNESS_LEGACY);
 
-	if ((toptagname != NULL) && (xmlStrlen(toptagname) > 0))
-	{
-		xmlBufferWriteChar(buf, "<");
-		xmlBufferWriteCHAR(buf, toptagname);
-		xmlBufferWriteChar(buf, ">");
-	}
-	if (nodeset != NULL)
+	PG_TRY();
 	{
-		for (i = 0; i < nodeset->nodeNr; i++)
-		{
-			if (plainsep != NULL)
-			{
-xmlBufferWriteCHAR(buf,
-   xmlXPathCastNodeToString(nodeset->nodeTab[i]));
+		buf = xmlBufferCreate();
 
-/* If this isn't the last entry, write the plain sep. */
-if (i < (nodeset->nodeNr) - 1)
-	xmlBufferWriteChar(buf, (char *) plainsep);
-			}
-			else
+		if (buf == NULL)
+			xml_ereport(xmlerrcxt, ERROR, ERRCODE_OUT_OF_MEMORY,
+		"could not allocate xmlBuffer");
+
+		if ((toptagname != NULL) && (xmlStrlen(toptagname) > 0))
+		{
+			xmlBufferWriteChar(buf, "<");
+			xmlBufferWriteCHAR(buf, toptagname);
+			xmlBufferWriteChar(buf, ">");
+		}
+		if (nodeset != NULL)
+		{
+			for (i = 0; i < nodeset->nodeNr; i++)
 			{
-if ((septagname != NULL) && (xmlStrlen(septagname) > 0))
+if (plainsep != NULL)
 {
-	xmlBufferWriteChar(buf, "<");
-	xmlBufferWriteCHAR(buf, septagname);
-	xmlBufferWriteChar(buf, ">");
-}
-xmlNodeDump(buf,
-			nodeset->nodeTab[i]->doc,
-			nodeset->nodeTab[i],
-			1, 0);
+	xmlBufferWriteCHAR(buf,
+	xmlXPathCastNodeToString(nodeset->nodeTab[i]));
 
-if ((septagname != NULL) && (xmlStrlen(septagname) > 0))
+	/* If this isn't the last entry, write the plain sep. */
+	if (i < (nodeset->nodeNr) - 1)
+		xmlBufferWriteChar(buf, (char *) plainsep);
+}
+else
 {
-	xmlBufferWriteChar(buf, "</");
-	xmlBufferWriteCHAR(buf, septagname);
-	xmlBufferWriteChar(buf, ">");
+	if ((septagname != NULL) && (xmlStrlen(septagname) > 0))
+	{
+		xmlBufferWriteChar(buf, "<");
+		xmlBufferWriteCHAR(buf, septagname);
+		xmlBufferWriteChar(buf, ">");
+	}
+	xmlNodeDump(buf,
+nodeset->nodeTab[i]->doc,
+nodeset->nodeTab[i],
+1, 0);
+
+	if ((septagname != NULL) && (xmlStrlen(septagname) > 0))
+	{
+		xmlBufferWriteChar(buf, "</");
+		xmlBufferWriteCHAR(buf, septagname);
+		xmlBufferWriteChar(buf, ">");
+	}
 }
 			}
 		}
-	}
 
-	if ((toptagname != NULL) && (xmlStrlen(toptagname) > 0))
+		if ((toptagname != NULL) && (xmlStrlen(toptagname) > 0))
+		{
+			xmlBufferWriteChar(buf, "</");
+			xmlBufferWriteCHAR(buf, toptagname);
+			xmlBufferWriteChar(buf, ">");
+		}
+		result = xmlStrdup(buf->content);
+	}
+	PG_CATCH();
 	{
-		xmlBufferWriteChar(buf, "</");
-		xmlBufferWriteCHAR(buf, toptagname);
-		xmlBufferWriteChar(buf, ">");
+		if (buf)
+			xmlBufferFree(buf);
+
+		pg_xml_done(xmlerrcxt, true);
+
+		PG_RE_THROW();
 	}
-	result = xmlStrdup(buf->content);
+	PG_END_TRY();
+
 	xmlBufferFree(buf);
+	pg_xml_done(xmlerrcxt, false);
+
 	return result;
 }
 


Re: Why is subscription/t/031_column_list.pl failing so much?

2024-02-14 Thread vignesh C
On Thu, 15 Feb 2024 at 08:42, Amit Kapila  wrote:
>
> On Wed, Feb 14, 2024 at 12:58 PM vignesh C  wrote:
> >
> > On Wed, 7 Feb 2024 at 16:27, vignesh C  wrote:
> > >
> > > I was able to reproduce the issue consistently with the changes shared
> > > by Tom Lane at [1].
> > > I have made changes to change ALTER SUBSCRIPTION to DROP+CREATE
> > > SUBSCRIPTION and verified that the test has passed consistently for
> > > >50 runs that I ran. Also, the increase in the test execution time for
> > > this case is very negligible:
> > > Without patch: 7.991 seconds
> > > With test change patch:   8.121 seconds
> > >
> > > The test changes for the same are attached.
> >
> > Alternative, this could also be fixed like the changes proposed by
> > Amit at [1]. In this case we ignore publications that are not found
> > for the purpose of computing RelSyncEntry attributes. We won't mark
> > such an entry as valid till all the publications are loaded without
> > anything missing. This means we won't publish operations on tables
> > corresponding to that publication till we find such a publication, and
> > that seems okay.
> >
> > Tomas had raised a performance issue forcing us to reload it for every
> > replicated change/row in case the publications are invalid at [2].
> >
>
> Did you measure the performance impact? I think it should impact the
> performance only when there is a dropped/non-existent publication as
> per the subscription definition.

The performance was almost the same in this case:
Without patch: 7.991 seconds
With skip_not_exist_publication option patch: 7.996 seconds

> Also, in the same email[2], Tomas
> raised another concern that in such cases it is better to break the
> replication.

I will handle this in the next version.

> >
>  How
> > about keeping the default option as it is and providing a new option
> > skip_not_exist_publication while creating/altering a subscription. In
> > this case if skip_not_exist_publication  is specified we will ignore
> > the case if publication is not present and publications will be kept
> > in invalid and get validated later.
> >
>
> I don't know if this deserves a separate option or not but I think it
> is better to discuss this in a separate thread.

I will start a separate thread for this.

> To resolve the BF
> failure, I still feel, we should just recreate the subscription. This
> is a pre-existing problem and we can track it via a separate patch
> with a test case targetting such failures.

+1 for going with recreating the subscription; the changes for this
are available at [1].
[1] - 
https://www.postgresql.org/message-id/CALDaNm1hLZW4H4Z61swK6WPW6pcNcjzXqw%3D6NqG7e-RMtkFaZA%40mail.gmail.com

Regards,
Vignesh




Re: Returning non-terminated string in ECPG Informix-compatible function

2024-02-14 Thread Oleg Tselebrovskiy

Greetings again.
I was looking through more static analyzer output and found another problem.

In ecpg/pgtypeslib/dt_common.c there are 4 calls of pgtypes_alloc.
This function uses calloc and returns NULL if OOM, but we don't check its
return value and immediately pass it to strcpy, which could lead to
segfault.


I suggest adding a check for a return value since all other calls of
pgtypes_alloc are checked for NULL.

A proposed patch (with previous and current changes) is attached

Oleg Tselebrovskiy, Postgres Pro

diff --git a/src/interfaces/ecpg/compatlib/informix.c b/src/interfaces/ecpg/compatlib/informix.c
index dccf39582da..80d40aa3e09 100644
--- a/src/interfaces/ecpg/compatlib/informix.c
+++ b/src/interfaces/ecpg/compatlib/informix.c
@@ -654,7 +654,7 @@ intoasc(interval * i, char *str)
 	if (!tmp)
 		return -errno;
 
-	memcpy(str, tmp, strlen(tmp));
+	strcpy(str, tmp);
 	free(tmp);
 	return 0;
 }
diff --git a/src/interfaces/ecpg/pgtypeslib/dt_common.c b/src/interfaces/ecpg/pgtypeslib/dt_common.c
index 99bdc94d6d7..d4ca0cbff6e 100644
--- a/src/interfaces/ecpg/pgtypeslib/dt_common.c
+++ b/src/interfaces/ecpg/pgtypeslib/dt_common.c
@@ -2659,6 +2659,8 @@ PGTYPEStimestamp_defmt_scan(char **str, char *fmt, timestamp * d,
  */
 pfmt++;
 tmp = pgtypes_alloc(strlen("%m/%d/%y") + strlen(pstr) + 1);
+if(!tmp)
+	return 1;
 strcpy(tmp, "%m/%d/%y");
 strcat(tmp, pfmt);
err = PGTYPEStimestamp_defmt_scan(&pstr, tmp, d, year, month, day, hour, minute, second, tz);
@@ -2784,6 +2786,8 @@ PGTYPEStimestamp_defmt_scan(char **str, char *fmt, timestamp * d,
 			case 'r':
 pfmt++;
 tmp = pgtypes_alloc(strlen("%I:%M:%S %p") + strlen(pstr) + 1);
+if(!tmp)
+	return 1;
 strcpy(tmp, "%I:%M:%S %p");
 strcat(tmp, pfmt);
err = PGTYPEStimestamp_defmt_scan(&pstr, tmp, d, year, month, day, hour, minute, second, tz);
@@ -2792,6 +2796,8 @@ PGTYPEStimestamp_defmt_scan(char **str, char *fmt, timestamp * d,
 			case 'R':
 pfmt++;
 tmp = pgtypes_alloc(strlen("%H:%M") + strlen(pstr) + 1);
+if(!tmp)
+	return 1;
 strcpy(tmp, "%H:%M");
 strcat(tmp, pfmt);
err = PGTYPEStimestamp_defmt_scan(&pstr, tmp, d, year, month, day, hour, minute, second, tz);
@@ -2837,6 +2843,8 @@ PGTYPEStimestamp_defmt_scan(char **str, char *fmt, timestamp * d,
 			case 'T':
 pfmt++;
 tmp = pgtypes_alloc(strlen("%H:%M:%S") + strlen(pstr) + 1);
+if(!tmp)
+	return 1;
 strcpy(tmp, "%H:%M:%S");
 strcat(tmp, pfmt);
err = PGTYPEStimestamp_defmt_scan(&pstr, tmp, d, year, month, day, hour, minute, second, tz);


Re: Refactoring backend fork+exec code

2024-02-14 Thread Robert Haas
On Thu, Feb 15, 2024 at 3:07 AM Andres Freund  wrote:
> > I think the last remaining question here is about the 0- vs 1-based indexing
> > of BackendIds. Is it a good idea to switch to 0-based indexing? And if we do
> > it, should we reserve PGPROC 0. I'm on the fence on this one.
>
> I lean towards it being a good idea. Having two internal indexing schemes was
> bad enough so far, but at least one would fairly quickly notice if one used
> the wrong one. If they're just offset by 1, it might end up taking longer,
> because that'll often also be a valid id.

Yeah, I think making everything 0-based is probably the best way
forward long term. It might require more cleanup work to get there,
but it's just a lot simpler in the end, IMHO.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Fix race condition in InvalidatePossiblyObsoleteSlot()

2024-02-14 Thread Michael Paquier
On Thu, Jan 18, 2024 at 02:20:28PM +, Bertrand Drouvot wrote:
> On Thu, Jan 18, 2024 at 04:59:39PM +0530, Bharath Rupireddy wrote:
>> I'm not sure if it
>> can happen at all, but I think we can rely on previous conflict reason
>> instead of previous effective_xmin/effective_catalog_xmin/restart_lsn.
> 
> I'm not sure what you mean here. I think we should still keep the "initial" 
> LSN
> so that the next check done with it still makes sense. The previous conflict
> reason as you're proposing also makes sense to me but for another reason: 
> PANIC
> in case the issue still happen (for cases we did not think about, means not
> covered by what the added previous LSNs are covering).

Using a PANIC seems like an overreaction to me for this path.  Sure, 
we should not record a conflict reason twice, but wouldn't an
assertion be more appropriate to check that a termination is not
logged twice for this code path run in the checkpointer?

+if (!terminated)
+{
+initial_restart_lsn = s->data.restart_lsn;
+initial_effective_xmin = s->effective_xmin;
+initial_catalog_effective_xmin = s->effective_catalog_xmin;
+}

It seems important to document why we are saving this data here: we do it
while we hold the LWLock for replication slots, before we perform any
termination, because we want to protect conflict reports from incorrect
data in case the slot data changes while the LWLock is temporarily released
during a termination.  No need to mention the pesky standby snapshot
records, I guess, as there could be different reasons why these xmins
advance.

>> 3. Is there a way to reproduce this racy issue, may be by adding some
>> sleeps or something? If yes, can you share it here, just for the
>> records and verify the whatever fix provided actually works?
> 
> Alexander was able to reproduce it on a slow machine and the issue was not 
> there
> anymore with v1 in place. I think it's tricky to reproduce as it would need 
> the
> slot to advance between the 2 checks.

I'd really want a test for that in the long term.  And an injection
point stuck between the point where ReplicationSlotControlLock is
released then re-acquired when there is an active PID should be
enough, isn't it?  For example just after the kill()?  It seems to me
that we should make the checkpointer stuck, perform a forced advance of the
xmins, release the checkpointer and see the effects of the change in
the second loop.  Now, modules/injection_points/ does not allow that,
yet, but I've posted a patch among these lines that can stop a process
and release it using a condition variable (should be 0006 on [1]).  I
was planning to start a new thread with a patch posted for the next CF
to add this kind of facility with a regression test for an old bug,
the patch needs a few hours of love, first.  I should be able to post
that next week.

[1]: 
https://www.postgresql.org/message-id/caexhw5uwp9rmct9ba91bufknqeurosawtmens4ah2puyzv2...@mail.gmail.com
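
For context, the SQL-level interface of that module currently looks roughly
like the following (a sketch only; the exact function set, and in
particular the wait/wakeup capability mentioned above, is still being
worked on):

CREATE EXTENSION injection_points;
-- Attach an action to a point compiled into the server with
-- INJECTION_POINT("name"); the 'error' action makes the backend error out
-- when it reaches that point.
SELECT injection_points_attach('some-point-name', 'error');
-- ... run the scenario under test ...
SELECT injection_points_detach('some-point-name');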
--
Michael


signature.asc
Description: PGP signature


Re: index prefetching

2024-02-14 Thread Robert Haas
On Thu, Feb 15, 2024 at 10:33 AM Andres Freund  wrote:
> The issue here is that we need to read index leaf pages (synchronously for
> now!) to get the tids to do readahead of table data. What you describe is done
> for the table data (IMO not a good idea medium term [1]), but the problem at
> hand is that once we've done readahead for all the tids on one index page, we
> can't do more readahead without looking at the next index leaf page.

Oh, right.

> However, if we want to issue table readahead for tids on the neighboring index
> leaf page, we'll - as the patch stands - not hold a pin on the "current" index
> leaf page. Which makes index prefetching as currently implemented incompatible
> with kill_prior_tuple, as that requires the index leaf page pin being held.

But I think it probably also breaks MVCC, as Peter was saying.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Properly pathify the union planner

2024-02-14 Thread David Rowley
On Wed, 7 Feb 2024 at 12:05, Andy Fan  wrote:
> +static int
> +pathkeys_useful_for_setop(PlannerInfo *root, List *pathkeys)
> +{
> +   int n_common_pathkeys;
> +
> +   if (root->setop_pathkeys == NIL)
> +   return 0;   /* no special setop 
> ordering requested */
> +
> +   if (pathkeys == NIL)
> +   return 0;   /* unordered path */
> +
> +   (void) pathkeys_count_contained_in(root->setop_pathkeys, pathkeys,
> +  &n_common_pathkeys);
> +
> +   return n_common_pathkeys;
> +}
>
> The two if-clauses looks unnecessary, it should be handled by
> pathkeys_count_contained_in already. The same issue exists in
> pathkeys_useful_for_ordering as well. Attached patch fix it in master.

I agree.  I'd rather not have those redundant checks in
pathkeys_useful_for_setop(), and I do want those functions to be as
similar as possible.  So I think adjusting it in master is a good
idea.

I've pushed your patch.

David




Re: index prefetching

2024-02-14 Thread Andres Freund
Hi,

On 2024-02-15 09:59:27 +0530, Robert Haas wrote:
> I would have thought that the way this prefetching would work is that
> we would bring pages into shared_buffers sooner than we currently do,
> but not actually pin them until we're ready to use them, so that it's
> possible they might be evicted again before we get around to them, if
> we prefetch too far and the system is too busy.

The issue here is that we need to read index leaf pages (synchronously for
now!) to get the tids to do readahead of table data. What you describe is done
for the table data (IMO not a good idea medium term [1]), but the problem at
hand is that once we've done readahead for all the tids on one index page, we
can't do more readahead without looking at the next index leaf page.

Obviously that would lead to a sawtooth like IO pattern, where you'd regularly
have to wait for IO for the first tuples referenced by an index leaf page.

However, if we want to issue table readahead for tids on the neighboring index
leaf page, we'll - as the patch stands - not hold a pin on the "current" index
leaf page. Which makes index prefetching as currently implemented incompatible
with kill_prior_tuple, as that requires the index leaf page pin being held.


> Alternately, it also seems OK to read those later pages and pin them right
> away, as long as (1) we don't also give up pins that we would have held in
> the absence of prefetching and (2) we have some mechanism for limiting the
> number of extra pins that we're holding to a reasonable number given the
> size of shared_buffers.

FWIW, there's already some logic for (2) in LimitAdditionalPins(). Currently
used to limit how many buffers a backend may pin for bulk relation extension.

Greetings,

Andres Freund


[1] The main reasons that I think that just doing readahead without keeping a
pin is a bad idea, at least medium term, are:

a) To do AIO you need to hold a pin on the page while the IO is in progress,
as the target buffer contents will be modified at some moment you don't
control, so that buffer should better not be replaced while IO is in
progress. So at the very least you need to hold a pin until the IO is over.

b) If you do not keep a pin until you actually use the page, you need to
either do another buffer lookup (expensive!) or you need to remember the
buffer id and revalidate that it's still pointing to the same block (cheaper,
but still not cheap).  That's not just bad because it's slow in an absolute
sense, more importantly it increases the potential performance downside of
doing readahead for fully cached workloads, because you don't gain anything,
but pay the price of two lookups/revalidation.

Note that these reasons really just apply to cases where we read ahead because
we are quite certain we'll need exactly those blocks (leaving errors or
queries ending early aside), not for "heuristic" prefetching. If we e.g. were
to issue prefetch requests for neighboring index pages while descending during
an ordered index scan, without checking that we'll need those, it'd make sense
to just do a "throway" prefetch request.




Re: Properly pathify the union planner

2024-02-14 Thread David Rowley
On Tue, 6 Feb 2024 at 22:05, Richard Guo  wrote:
> I'm thinking that maybe it'd be better to move the work of sorting the
> subquery's paths to the outer query level, specifically within the
> build_setop_child_paths() function, just before we stick SubqueryScanPath
> on top of the subquery's paths.  I think this is better because:
>
> 1. This minimizes the impact on subquery planning and reduces the
> footprint within the grouping_planner() function as much as possible.
>
> 2. This can help avoid the aforementioned add_path() issue because the
> two involved paths will be structured as:

Yes, this is a good idea. I agree with both of your points.

I've taken your suggested changes with minor fixups and expanded on it
to do the partial paths too.  I've also removed almost all of the
changes to planner.c.

I fixed a bug where I was overwriting the union child's
TargetEntry.ressortgroupref without consideration that it might be set
for some other purpose in the subquery.  I wrote
generate_setop_child_grouplist() to handle this which is almost like
generate_setop_grouplist() except it calls assignSortGroupRef() to
figure out the next free tleSortGroupRef, (or reuse the existing one
if the TargetEntry already has one set).
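
Stepping back, a rough illustration of the kind of query this patch is
aimed at (illustrative only; the exact plan shape of course depends on
costs and on the final patch):

CREATE TABLE t1 (a int);
CREATE TABLE t2 (a int);
CREATE INDEX ON t1 (a);
CREATE INDEX ON t2 (a);
-- With setop pathkeys pushed down to the union children, their pre-sorted
-- index paths can feed a merge-based UNION instead of re-sorting or
-- hashing the whole result.
EXPLAIN (COSTS OFF) SELECT a FROM t1 UNION SELECT a FROM t2;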

Earlier, I pushed a small comment change to pathnode.c in order to
shrink this patch down a little. It was also a change that could be
made in isolation of this work.

v2 attached.

David
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out 
b/contrib/postgres_fdw/expected/postgres_fdw.out
index b5a38aeb21..62b7103d0c 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -11454,6 +11454,10 @@ DROP INDEX base_tbl1_idx;
 DROP INDEX base_tbl2_idx;
 DROP INDEX async_p3_idx;
 -- UNION queries
+SET enable_sort TO off;
+SET enable_incremental_sort TO off;
+-- Adjust fdw_startup_cost so that we get an unordered path in the Append.
+ALTER SERVER loopback2 OPTIONS (ADD fdw_startup_cost '0.00');
 EXPLAIN (VERBOSE, COSTS OFF)
 INSERT INTO result_tbl
 (SELECT a, b, 'AAA' || c FROM async_p1 ORDER BY a LIMIT 10)
@@ -11535,6 +11539,9 @@ SELECT * FROM result_tbl ORDER BY a;
 (12 rows)
 
 DELETE FROM result_tbl;
+RESET enable_incremental_sort;
+RESET enable_sort;
+ALTER SERVER loopback2 OPTIONS (DROP fdw_startup_cost);
 -- Disable async execution if we use gating Result nodes for pseudoconstant
 -- quals
 EXPLAIN (VERBOSE, COSTS OFF)
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql 
b/contrib/postgres_fdw/sql/postgres_fdw.sql
index f410c3db4e..0ed9c5d61f 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -3857,6 +3857,11 @@ DROP INDEX base_tbl2_idx;
 DROP INDEX async_p3_idx;
 
 -- UNION queries
+SET enable_sort TO off;
+SET enable_incremental_sort TO off;
+-- Adjust fdw_startup_cost so that we get an unordered path in the Append.
+ALTER SERVER loopback2 OPTIONS (ADD fdw_startup_cost '0.00');
+
 EXPLAIN (VERBOSE, COSTS OFF)
 INSERT INTO result_tbl
 (SELECT a, b, 'AAA' || c FROM async_p1 ORDER BY a LIMIT 10)
@@ -3883,6 +3888,10 @@ UNION ALL
 SELECT * FROM result_tbl ORDER BY a;
 DELETE FROM result_tbl;
 
+RESET enable_incremental_sort;
+RESET enable_sort;
+ALTER SERVER loopback2 OPTIONS (DROP fdw_startup_cost);
+
 -- Disable async execution if we use gating Result nodes for pseudoconstant
 -- quals
 EXPLAIN (VERBOSE, COSTS OFF)
diff --git a/src/backend/optimizer/path/equivclass.c 
b/src/backend/optimizer/path/equivclass.c
index 4bd60a09c6..c036911309 100644
--- a/src/backend/optimizer/path/equivclass.c
+++ b/src/backend/optimizer/path/equivclass.c
@@ -2867,6 +2867,58 @@ add_child_join_rel_equivalences(PlannerInfo *root,
MemoryContextSwitchTo(oldcontext);
 }
 
+/*
+ * add_setop_child_rel_equivalences
+ * Add equivalence members for each non-resjunk target in 
'child_tlist'
+ * to the EquivalenceClass in the corresponding setop_pathkey's 
pk_class.
+ *
+ * 'root' is the PlannerInfo belonging to the top-level set operation.
+ * 'child_relids' are the Relids of the child relation we're adding
+ * EquivalenceMembers for.
+ * 'child_tlist' is the target list for the setop child relation.  The target
+ * list expressions are what we add as EquivalenceMembers.
+ * 'setop_pathkeys' is a list of PathKeys which must contain an entry for each
+ * non-resjunk target in 'child_tlist'.
+ */
+void
+add_setop_child_rel_equivalences(PlannerInfo *root, Relids child_relids,
+List 
*child_tlist, List *setop_pathkeys)
+{
+   ListCell   *lc;
+   ListCell   *lc2 = list_head(setop_pathkeys);
+
+   foreach(lc, child_tlist)
+   {
+   TargetEntry *tle = lfirst_node(TargetEntry, lc);
+   EquivalenceMember *parent_em;
+   PathKey*pk;
+
+   if (tle->resjunk)
+   continue;
+
+   if (lc2 == NULL)
+   

Re: index prefetching

2024-02-14 Thread Robert Haas
On Wed, Feb 14, 2024 at 7:43 PM Tomas Vondra
 wrote:
> I don't think it's just a bookkeeping problem. In a way, nbtree already
> does keep an array of tuples to kill (see btgettuple), but it's always
> for the current index page. So it's not that we immediately go and kill
> the prior tuple - nbtree already stashes it in an array, and kills all
> those tuples when moving to the next index page.
>
> The way I understand the problem is that with prefetching we're bound to
> determine the kill_prior_tuple flag with a delay, in which case we might
> have already moved to the next index page ...

Well... I'm not clear on all of the details of how this works, but
this sounds broken to me, for the reasons that Peter G. mentions in
his comments about desynchronization. If we currently have a rule that
you hold a pin on the index page while processing the heap tuples it
references, you can't just throw that out the window and expect things
to keep working. Saying that kill_prior_tuple doesn't work when you
throw that rule out the window is probably understating the extent of
the problem very considerably.

I would have thought that the way this prefetching would work is that
we would bring pages into shared_buffers sooner than we currently do,
but not actually pin them until we're ready to use them, so that it's
possible they might be evicted again before we get around to them, if
we prefetch too far and the system is too busy. Alternately, it also
seems OK to read those later pages and pin them right away, as long as
(1) we don't also give up pins that we would have held in the absence
of prefetching and (2) we have some mechanism for limiting the number
of extra pins that we're holding to a reasonable number given the size
of shared_buffers.

However, it doesn't seem OK at all to give up pins that the current
code holds sooner than the current code would do.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: 039_end_of_wal: error in "xl_tot_len zero" test

2024-02-14 Thread Michael Paquier
On Fri, Jan 19, 2024 at 11:35:30AM +1300, Thomas Munro wrote:
> On Fri, Jan 19, 2024 at 1:47 AM Anton Voloshin
>  wrote:
>> I believe there is a small problem in the 039_end_of_wal.pl's
>> "xl_tot_len zero" test. It supposes that after immediate shutdown the
>> server, upon startup recovery, should always produce a message matching
>> "invalid record length at .*: wanted 24, got 0". However, if the
>> circumstances are just right and we happened to hit exactly on the edge
>> of WAL page, then the message on startup recovery would be "invalid
>> magic number  in log segment .*, offset .*". The test does not take
>> that into account.
> 
> Thanks for figuring this out!  Right, I see.  I will look more closely
> when I'm back from summer vacation in a few days, but first reaction:

Thomas, are you planning to look at that?
--
Michael


signature.asc
Description: PGP signature


Re: logical decoding and replication of sequences, take 2

2024-02-14 Thread Robert Haas
On Wed, Feb 14, 2024 at 10:21 PM Tomas Vondra
 wrote:
> The way I think about non-transactional sequence changes is as if they
> were tiny transactions that happen "fully" (including commit) at the LSN
> where the LSN change is logged.

100% this.

> It does not mean we can arbitrarily reorder the changes, it only means
> the changes are applied as if they were independent transactions (but in
> the same order as they were executed originally). Both with respect to
> the other non-transactional changes, and to "commits" of other stuff.

Right, this is very important and I agree completely.

I'm feeling more confident about this now that I heard you say that
stuff -- this is really the key issue I've been worried about since I
first looked at this, and I wasn't sure that you were in agreement,
but it sounds like you are. I think we should (a) fix the locking bug
I found (but that can be independent of this patch) and (b) make sure
that this patch documents the points from the quoted material above so
that everyone who reads the code (and maybe tries to enhance it) is
clear on what the assumptions are.

(I haven't checked whether it documents that stuff or not. I'm just
saying it should, because I think it's a subtlety that someone might
miss.)
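
For anyone unfamiliar with the behaviour being described, a trivial session
shows why sequence changes have to be treated as such tiny self-contained
transactions (this is standard sequence behaviour, not something the patch
changes):

CREATE SEQUENCE s;
BEGIN;
SELECT nextval('s');   -- returns 1
ROLLBACK;
-- The increment survived the aborted transaction, so decoding must apply
-- it at its own LSN rather than discarding it with the transaction.
SELECT nextval('s');   -- returns 2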

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Memory consumed by paths during partitionwise join planning

2024-02-14 Thread Andrei Lepikhov

On 6/2/2024 19:51, Ashutosh Bapat wrote:

On Fri, Dec 15, 2023 at 5:22 AM Ashutosh Bapat wrote:
The patches are raw. make check has some crashes that I need to fix. I
am waiting to hear whether this is useful and whether the design is on
the right track.

Let me share a few words of opinion on this feature.
I generally like the idea of a refcount field. We had problems generating 
alternative paths in extensions and when designing the Asymmetric Join 
feature [1]: when we proposed an alternative path one level below and 
called the add_path() routine, we lost the path in the path list 
referenced from the upper RelOptInfo. This approach allows for more 
convenient solutions instead of hacking around it with a copy/restore of 
the pathlist.
But I'm not sure about freeing unreferenced paths; I would still want to 
see the alternatives in the pathlist.
Regarding partitioning: as I discovered while looking into planning issues 
connected to partitions, the painful problem is the rule according to 
which we try the whole range of possible paths for each partition. With 
indexes, this quickly increases the optimization work. IMO, a 
'symmetrical' approach could help here: it would restrict the scope of 
possible paths for upcoming partitions if we filter out some paths in the 
set of previously planned partitions.
Also, I am glad to see a positive opinion about the path_walker() 
routine. Somewhere else, for example in [2], it seems we need it too.


[1] 
https://www.postgresql.org/message-id/flat/CAOP8fzaVL_2SCJayLL9kj5pCA46PJOXXjuei6-3aFUV45j4LJQ%40mail.gmail.com
[2] 
https://www.postgresql.org/message-id/flat/CAMbWs496%2BN%3DUAjOc%3DrcD3P7B6oJe4rZw08e_TZRUsWbPxZW3Tw%40mail.gmail.com

--
regards,
Andrei Lepikhov
Postgres Professional





Re: About a recently-added message

2024-02-14 Thread shveta malik
On Thu, Feb 15, 2024 at 8:26 AM Amit Kapila  wrote:
>
> On Wed, Feb 14, 2024 at 7:51 PM Euler Taveira  wrote:
> >
> > On Wed, Feb 14, 2024, at 8:45 AM, Amit Kapila wrote:
> >
> > Now, I am less clear about whether to quote "logical" or not in the
> > above message. Do you have any suggestions?
> >
> >
> > The possible confusion comes from the fact that the sentence contains "must 
> > be"
> > in the middle of a comparison expression. For pg_createsubscriber, we are 
> > using
> >
> > publisher requires wal_level >= logical
> >
> > I suggest to use something like
> >
> > slot synchronization requires wal_level >= logical
> >
>
> This sounds like a better idea but shouldn't we directly use this in
> 'errmsg' instead of a separate 'errhint'? Also, if change this, then
> we should make a similar change for other messages in the same
> function.

+1 on changing the msg(s) in the suggested way. Please find the patch for
the same. It also removes the double quotes around the variable names.

thanks
Shveta


v1-0001-Fix-quotation-of-variable-names.patch
Description: Binary data


RE: Synchronizing slots from primary to standby

2024-02-14 Thread Zhijie Hou (Fujitsu)
On Thursday, February 15, 2024 10:49 AM Amit Kapila  
wrote:
> 
> On Wed, Feb 14, 2024 at 7:26 PM Bertrand Drouvot
>  wrote:
> >
> > On Wed, Feb 14, 2024 at 10:40:11AM +, Zhijie Hou (Fujitsu) wrote:
> > > On Wednesday, February 14, 2024 6:05 PM Amit Kapila
>  wrote:
> > > >
> > > > To ensure that restart_lsn has been moved to a recent position, we
> > > > need to log XLOG_RUNNING_XACTS and make sure the same is processed
> > > > as well by walsender. The attached patch does the required change.
> > > >
> > > > Hou-San can reproduce this problem by adding additional
> > > > checkpoints in the test and after applying the attached it fixes
> > > > the problem. Now, this patch is mostly based on the theory we
> > > > formed based on LOGs on BF and a reproducer by Hou-San, so still,
> > > > there is some chance that this doesn't fix the BF failures in which 
> > > > case I'll
> again look into those.
> > >
> > > I have verified that the patch can fix the issue on my machine(after
> > > adding few more checkpoints before slot invalidation test.) I also
> > > added one more check in the test to confirm the synced slot is not temp 
> > > slot.
> Here is the v2 patch.
> >
> > Thanks!
> >
> > +# To ensure that restart_lsn has moved to a recent WAL position, we
> > +need # to log XLOG_RUNNING_XACTS and make sure the same is processed
> > +as well $primary->psql('postgres', "CHECKPOINT");
> >
> > Instead of "CHECKPOINT" wouldn't a less heavy "SELECT
> pg_log_standby_snapshot();"
> > be enough?
> >
> 
> Yeah, that would be enough. However, the test still fails randomly due to the
> same reason. See [1]. So, as mentioned yesterday, now, I feel it is better to
> recreate the subscription/slot so that it can get the latest restart_lsn 
> rather than
> relying on pg_log_standby_snapshot() to move it.
> 
> > Not a big deal but maybe we could do the change while modifying
> > 040_standby_failover_slots_sync.pl in the next patch "Add a new slotsync
> worker".
> >
> 
> Right, we can do that or probably this test would have made more sense with a
> worker patch where we could wait for the slot to be synced.
> Anyway, let's try to recreate the slot/subscription idea. BTW, do you think 
> that
> adding a LOG when we are not able to sync will help in debugging such
> problems? I think eventually we can change it to DEBUG1 but for now, it can 
> help
> with stabilizing BF and or some other reported issues.

Here is the patch that attempts the re-create-subscription idea. I also think
that a LOG/DEBUG message would be useful for such analysis, so 0002 adds such
a log.

Best Regards,
Hou zj


0002-Add-a-log-if-remote-slot-didn-t-catch-up-to-locally-.patch
Description:  0002-Add-a-log-if-remote-slot-didn-t-catch-up-to-locally-.patch


0001-fix-BF-error-take-2.patch
Description: 0001-fix-BF-error-take-2.patch


Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-02-14 Thread Masahiko Sawada
On Sat, Feb 10, 2024 at 9:29 PM John Naylor  wrote:
>
> On Tue, Feb 6, 2024 at 9:58 AM Masahiko Sawada  wrote:
> >
> > On Fri, Feb 2, 2024 at 8:47 PM John Naylor  wrote:
>
> > > My todo:
> > > - benchmark tid store / vacuum again, since we haven't since varlen
> > > types and removing unnecessary locks.
>
> I ran a vacuum benchmark similar to the one in [1] (unlogged tables
> for reproducibility), but smaller tables (100 million records),
> deleting only the last 20% of the table, and including a parallel
> vacuum test. Scripts attached.
>
> monotonically ordered int column index:
>
> master:
> system usage: CPU: user: 4.27 s, system: 0.41 s, elapsed: 4.70 s
> system usage: CPU: user: 4.23 s, system: 0.44 s, elapsed: 4.69 s
> system usage: CPU: user: 4.26 s, system: 0.39 s, elapsed: 4.66 s
>
> v-59:
> system usage: CPU: user: 3.10 s, system: 0.44 s, elapsed: 3.56 s
> system usage: CPU: user: 3.07 s, system: 0.35 s, elapsed: 3.43 s
> system usage: CPU: user: 3.07 s, system: 0.36 s, elapsed: 3.44 s
>
> uuid column index:
>
> master:
> system usage: CPU: user: 18.22 s, system: 1.70 s, elapsed: 20.01 s
> system usage: CPU: user: 17.70 s, system: 1.70 s, elapsed: 19.48 s
> system usage: CPU: user: 18.48 s, system: 1.59 s, elapsed: 20.43 s
>
> v-59:
> system usage: CPU: user: 5.18 s, system: 1.18 s, elapsed: 6.45 s
> system usage: CPU: user: 6.56 s, system: 1.39 s, elapsed: 7.99 s
> system usage: CPU: user: 6.51 s, system: 1.44 s, elapsed: 8.05 s
>
> int & uuid indexes in parallel:
>
> master:
> system usage: CPU: user: 4.53 s, system: 1.22 s, elapsed: 20.43 s
> system usage: CPU: user: 4.49 s, system: 1.29 s, elapsed: 20.98 s
> system usage: CPU: user: 4.46 s, system: 1.33 s, elapsed: 20.50 s
>
> v59:
> system usage: CPU: user: 2.09 s, system: 0.32 s, elapsed: 4.86 s
> system usage: CPU: user: 3.76 s, system: 0.51 s, elapsed: 8.92 s
> system usage: CPU: user: 3.83 s, system: 0.54 s, elapsed: 9.09 s
>
> Overall, I'm pleased with these results, although I'm confused why
> sometimes with the patch the first run reports running faster than the
> others. I'm curious what others get. Traversing a tree that lives in
> DSA has some overhead, as expected, but still comes out way ahead of
> master.

Thanks! That's a great improvement.

I've also run the same scripts in my environment just in case and got
similar results:

monotonically ordered int column index:

master:
system usage: CPU: user: 14.81 s, system: 0.90 s, elapsed: 15.74 s
system usage: CPU: user: 14.91 s, system: 0.80 s, elapsed: 15.73 s
system usage: CPU: user: 14.85 s, system: 0.70 s, elapsed: 15.57 s

v-59:
system usage: CPU: user: 9.47 s, system: 1.04 s, elapsed: 10.53 s
system usage: CPU: user: 9.67 s, system: 0.81 s, elapsed: 10.50 s
system usage: CPU: user: 9.59 s, system: 0.86 s, elapsed: 10.47 s

uuid column index:

master:
system usage: CPU: user: 28.37 s, system: 1.38 s, elapsed: 29.81 s
system usage: CPU: user: 28.05 s, system: 1.37 s, elapsed: 29.47 s
system usage: CPU: user: 28.46 s, system: 1.36 s, elapsed: 29.88 s

v-59:
system usage: CPU: user: 14.87 s, system: 1.13 s, elapsed: 16.02 s
system usage: CPU: user: 14.84 s, system: 1.31 s, elapsed: 16.18 s
system usage: CPU: user: 10.96 s, system: 1.24 s, elapsed: 12.22 s

int & uuid indexes in parallel:

master:
system usage: CPU: user: 15.81 s, system: 1.43 s, elapsed: 34.31 s
system usage: CPU: user: 15.84 s, system: 1.41 s, elapsed: 34.34 s
system usage: CPU: user: 15.92 s, system: 1.39 s, elapsed: 34.33 s

v-59:
system usage: CPU: user: 10.93 s, system: 0.92 s, elapsed: 17.59 s
system usage: CPU: user: 10.92 s, system: 1.20 s, elapsed: 17.58 s
system usage: CPU: user: 10.90 s, system: 1.01 s, elapsed: 17.45 s

>
> There are still some micro-benchmarks we could do on tidstore, and
> it'd be good to find out worse-case memory use (1 dead tuple each on
> spread-out pages), but this is a decent demonstration.

I've tested a simple case where vacuum removes 33k dead tuples spread
about every 10 pages.

master:
198,000 bytes (=33000 * 6)
system usage: CPU: user: 29.49 s, system: 0.88 s, elapsed: 30.40 s

v-59:
2,834,432 bytes (reported by TidStoreMemoryUsage())
system usage: CPU: user: 15.96 s, system: 0.89 s, elapsed: 16.88 s

>
> > > I'm not sure what the test_node_types_* functions are testing that
> > > test_basic doesn't. They have a different, and confusing, way to stop
> > > at every size class and check the keys/values. It seems we can replace
> > > all that with two more calls (asc/desc) to test_basic, with the
> > > maximum level.
>
> v58-0008:
>
> + /* borrowed from RT_MAX_SHIFT */
> + const int max_shift = (pg_leftmost_one_pos64(UINT64_MAX) /
> +                        BITS_PER_BYTE) * BITS_PER_BYTE;
>
> This is harder to read than "64 - 8", and doesn't really help
> maintainability either.
> Maybe "(sizeof(uint64) - 1) * BITS_PER_BYTE" is a good compromise.
>
> + /* leaf nodes */
> + test_basic(test_info, 0);
>
> + /* internal nodes */
> + test_basic(test_info, 8);
> +
> + /* max-level nodes */
> + 

Re: make add_paths_to_append_rel aware of startup cost

2024-02-14 Thread Andy Fan


Thomas Munro  writes:

> On Thu, Oct 5, 2023 at 9:07 PM David Rowley  wrote:
>> Thanks. Pushed.
>
> FYI somehow this plan from a8a968a8212e flipped in this run:
>
> === dumping 
> /home/bf/bf-build/mylodon/HEAD/pgsql.build/testrun/recovery/027_stream_regress/data/regression.diffs
> ===
> diff -U3 
> /home/bf/bf-build/mylodon/HEAD/pgsql/src/test/regress/expected/union.out
> /home/bf/bf-build/mylodon/HEAD/pgsql.build/testrun/recovery/027_stream_regress/data/results/union.out
> --- /home/bf/bf-build/mylodon/HEAD/pgsql/src/test/regress/expected/union.out
> 2024-01-15 00:31:13.947555940 +
> +++ 
> /home/bf/bf-build/mylodon/HEAD/pgsql.build/testrun/recovery/027_stream_regress/data/results/union.out
> 2024-02-14 00:06:17.075584839 +
> @@ -1447,9 +1447,9 @@
> ->  Append
>   ->  Nested Loop
> Join Filter: (t1.tenthous = t2.tenthous)
> -   ->  Seq Scan on tenk1 t1
> +   ->  Seq Scan on tenk2 t2
> ->  Materialize
> - ->  Seq Scan on tenk2 t2
> + ->  Seq Scan on tenk1 t1
>   ->  Result
>  (8 rows)
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon=2024-02-14%2000%3A01%3A03

Thanks for this information! I will take a look at this.

-- 
Best Regards
Andy Fan





Re: Why is subscription/t/031_column_list.pl failing so much?

2024-02-14 Thread Amit Kapila
On Wed, Feb 14, 2024 at 12:58 PM vignesh C  wrote:
>
> On Wed, 7 Feb 2024 at 16:27, vignesh C  wrote:
> >
> > I was able to reproduce the issue consistently with the changes shared
> > by Tom Lane at [1].
> > I have made changes to change ALTER SUBSCRIPTION to DROP+CREATE
> > SUBSCRIPTION and verified that the test has passed consistently for
> > >50 runs that I ran. Also the increase in test execution time for this
> > case is negligible:
> > Without patch: 7.991 seconds
> > With test change patch:   8.121 seconds
> >
> > The test changes for the same are attached.
>
> Alternative, this could also be fixed like the changes proposed by
> Amit at [1]. In this case we ignore publications that are not found
> for the purpose of computing RelSyncEntry attributes. We won't mark
> such an entry as valid till all the publications are loaded without
> anything missing. This means we won't publish operations on tables
> corresponding to that publication till we find such a publication and
> that seems okay.
>
> Tomas had raised a performance issue forcing us to reload it for every
> replicated change/row in case the publications are invalid at [2].
>

Did you measure the performance impact? I think it should impact the
performance only when there is a dropped/non-existent publication as
per the subscription definition. Also, in the same email[2], Tomas
raised another concern that in such cases it is better to break the
replication.

>
> How about keeping the default option as it is and providing a new option
> skip_not_exist_publication while creating/altering a subscription. In
> skip_not_exist_publication while creating/altering a subscription. In
> this case if skip_not_exist_publication  is specified we will ignore
> the case if publication is not present and publications will be kept
> in invalid and get validated later.
>

I don't know if this deserves a separate option or not but I think it
is better to discuss this in a separate thread. To resolve the BF
failure, I still feel, we should just recreate the subscription. This
is a pre-existing problem and we can track it via a separate patch
with a test case targeting such failures.

> The attached patch has the changes for the same. Thoughts?
>
> [1] - 
> https://www.postgresql.org/message-id/CAA4eK1%2BT-ETXeRM4DHWzGxBpKafLCp__5bPA_QZfFQp7-0wj4Q%40mail.gmail.com
> [2] - 
> https://www.postgresql.org/message-id/dc08add3-10a8-738b-983a-191c7406707b%40enterprisedb.com
>

-- 
With Regards,
Amit Kapila.




Re: Can we include capturing logs of pgdata/pg_upgrade_output.d/*/log in buildfarm

2024-02-14 Thread vignesh C
On Thu, 15 Feb 2024 at 07:24, Michael Paquier  wrote:
>
> On Wed, Feb 14, 2024 at 03:51:08PM +0530, vignesh C wrote:
> > First regex is the testname_clusterinstance_data, second regex is the
> > timestamp used for pg_upgrade, third regex is for the text files
> > generated by pg_upgrade and fourth regex is for the log files
> > generated by pg_upgrade.
> >
> > Can we include these log files also in the buildfarm?
> >
> > [1] - 
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo=2024-02-10%2007%3A03%3A10
> > [2] - 
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo=2023-12-07%2003%3A56%3A20
>
> Indeed, these lack some patterns.  Why not send a pull request
> around [1] to get more patterns covered?

I have added a few more patterns to include the pg_upgrade generated
files. The attached patch has the changes for the same.
Adding Andrew also to get his thoughts on this.

Regards,
Vignesh
diff --git a/PGBuild/Modules/TestUpgrade.pm b/PGBuild/Modules/TestUpgrade.pm
index ad3e00e..83f62b7 100644
--- a/PGBuild/Modules/TestUpgrade.pm
+++ b/PGBuild/Modules/TestUpgrade.pm
@@ -139,6 +139,8 @@ sub check
  $self->{pgsql}/src/bin/pg_upgrade/log/*
  $self->{pgsql}/src/bin/pg_upgrade/tmp_check/*/*.diffs
  $self->{pgsql}/src/bin/pg_upgrade/tmp_check/data/pg_upgrade_output.d/log/*
+ $self->{pgsql}/src/bin/pg_upgrade/tmp_check/*/pgdata/pg_upgrade_output.d/*/*.txt
+ $self->{pgsql}/src/bin/pg_upgrade/tmp_check/*/pgdata/pg_upgrade_output.d/*/log/*.log
  $self->{pgsql}/src/test/regress/*.diffs"
 	);
 	$log->add_log($_) foreach (@logfiles);


Re: About a recently-added message

2024-02-14 Thread Amit Kapila
On Wed, Feb 14, 2024 at 7:51 PM Euler Taveira  wrote:
>
> On Wed, Feb 14, 2024, at 8:45 AM, Amit Kapila wrote:
>
> Now, I am less clear about whether to quote "logical" or not in the
> above message. Do you have any suggestions?
>
>
> The possible confusion comes from the fact that the sentence contains "must 
> be"
> in the middle of a comparison expression. For pg_createsubscriber, we are 
> using
>
> publisher requires wal_level >= logical
>
> I suggest to use something like
>
> slot synchronization requires wal_level >= logical
>

This sounds like a better idea but shouldn't we directly use this in
'errmsg' instead of a separate 'errhint'? Also, if we change this, then
we should make a similar change for other messages in the same
function.
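
For illustration, folding that text into the primary message might end up
looking roughly like the fragment below; the error code and the surrounding
condition are only assumptions here, not taken from the actual slotsync code:

	if (wal_level < WAL_LEVEL_LOGICAL)
		ereport(ERROR,
				errcode(ERRCODE_INVALID_PARAMETER_VALUE),
				errmsg("slot synchronization requires wal_level >= logical"));

The same shape would then apply to the neighbouring messages in that function.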

-- 
With Regards,
Amit Kapila.




Re: Synchronizing slots from primary to standby

2024-02-14 Thread Amit Kapila
On Wed, Feb 14, 2024 at 7:26 PM Bertrand Drouvot
 wrote:
>
> On Wed, Feb 14, 2024 at 10:40:11AM +, Zhijie Hou (Fujitsu) wrote:
> > On Wednesday, February 14, 2024 6:05 PM Amit Kapila 
> >  wrote:
> > >
> > > To ensure that restart_lsn has been moved to a recent position, we need 
> > > to log
> > > XLOG_RUNNING_XACTS and make sure the same is processed as well by
> > > walsender. The attached patch does the required change.
> > >
> > > Hou-San can reproduce this problem by adding additional checkpoints in the
> > > test and after applying the attached it fixes the problem. Now, this 
> > > patch is
> > > mostly based on the theory we formed based on LOGs on BF and a reproducer
> > > by Hou-San, so still, there is some chance that this doesn't fix the BF 
> > > failures in
> > > which case I'll again look into those.
> >
> > I have verified that the patch can fix the issue on my machine(after adding 
> > few
> > more checkpoints before slot invalidation test.) I also added one more 
> > check in
> > the test to confirm the synced slot is not temp slot. Here is the v2 patch.
>
> Thanks!
>
> +# To ensure that restart_lsn has moved to a recent WAL position, we need
> +# to log XLOG_RUNNING_XACTS and make sure the same is processed as well
> +$primary->psql('postgres', "CHECKPOINT");
>
> Instead of "CHECKPOINT" wouldn't a less heavy "SELECT 
> pg_log_standby_snapshot();"
> be enough?
>

Yeah, that would be enough. However, the test still fails randomly due
to the same reason. See [1]. So, as mentioned yesterday, now, I feel
it is better to recreate the subscription/slot so that it can get the
latest restart_lsn rather than relying on pg_log_standby_snapshot() to
move it.

> Not a big deal but maybe we could do the change while modifying
> 040_standby_failover_slots_sync.pl in the next patch "Add a new slotsync 
> worker".
>

Right, we can do that or probably this test would have made more sense
with a worker patch where we could wait for the slot to be synced.
Anyway, let's try to recreate the slot/subscription idea. BTW, do you
think that adding a LOG when we are not able to sync will help in
debugging such problems? I think eventually we can change it to DEBUG1
but for now, it can help with stabilizing BF and or some other
reported issues.

[1] - 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink=2024-02-15%2000%3A14%3A38

-- 
With Regards,
Amit Kapila.




Re: Can we include capturing logs of pgdata/pg_upgrade_output.d/*/log in buildfarm

2024-02-14 Thread Michael Paquier
On Wed, Feb 14, 2024 at 03:51:08PM +0530, vignesh C wrote:
> First regex is the testname_clusterinstance_data, second regex is the
> timestamp used for pg_upgrade, third regex is for the text files
> generated by pg_upgrade and fourth regex is for the log files
> generated by pg_upgrade.
> 
> Can we include these log files also in the buildfarm?
> 
> [1] - 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo=2024-02-10%2007%3A03%3A10
> [2] - 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo=2023-12-07%2003%3A56%3A20

Indeed, these lack some patterns.  Why not send a pull request
around [1] to get more patterns covered?
[1]: 
https://github.com/PGBuildFarm/client-code/blob/main/PGBuild/Modules/TestUpgrade.pm
--
Michael




Re: Add system identifier to backup manifest

2024-02-14 Thread Michael Paquier
On Wed, Feb 14, 2024 at 12:29:07PM +0530, Amul Sul wrote:
> Ok, I did it that way in the attached version: I passed the control file's
> full path as a second argument to verify_system_identifier(), which is what
> we get in verify_backup_file(), but it does not do anything useful with it,
> since we are using get_controlfile() to open the control file, which takes
> the directory as an input and computes the full path on its own.

I've read through the patch, and that's pretty cool.

-static void
-parse_manifest_file(char *manifest_path, manifest_files_hash **ht_p,
-   manifest_wal_range **first_wal_range_p)
+static manifest_data *
+parse_manifest_file(char *manifest_path)

In 0001, should the comment describing this routine be updated as
well?

+   identifier with pg_control of the backup directory or fails verification 

This is missing a  markup here.

+  PostgreSQL 17, it is 2; in older versions,
+  it is 1. 

Perhaps a couple of s here.

+   if (strcmp(parse->manifest_version, "1") != 0 &&
+       strcmp(parse->manifest_version, "2") != 0)
+       json_manifest_parse_failure(parse->context,
+                                   "unexpected manifest version");
+
+   /* Parse version. */
+   version = strtoi64(parse->manifest_version, &ep, 10);
+   if (*ep)
+       json_manifest_parse_failure(parse->context,
+                                   "manifest version not an integer");
+
+   /* Invoke the callback for version */
+   context->version_cb(context, version);

Shouldn't these two checks be reversed?  And is there actually a need
for the first check at all knowing that the version callback should be
in charge of performing the validation based on the version received?
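Roughly, the reordered flow would be something like this (whether the numeric
range check stays here at all or moves into the version callback is the open
question above):

+   /* Parse version first, then validate the numeric value. */
+   version = strtoi64(parse->manifest_version, &ep, 10);
+   if (*ep)
+       json_manifest_parse_failure(parse->context,
+                                   "manifest version not an integer");
+
+   if (version != 1 && version != 2)
+       json_manifest_parse_failure(parse->context,
+                                   "unexpected manifest version");
+
+   /* Invoke the callback for version */
+   context->version_cb(context, version);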

+my $node2;
+{
+   local $ENV{'INITDB_TEMPLATE'} = undef;

Not sure that it is a good idea to duplicate this pattern twice.
Shouldn't this be something we'd want to control with an option in the
init() method instead?

+static void
+verify_system_identifier(verifier_context *context, char *controlpath) 

Relying both on controlpath, being a full path to the control file
including the data directory, and context->backup_directory to read
the contents of the control file looks a bit weird.  Wouldn't it be
cleaner to just use one of them?
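
To sketch what using only the directory could look like (the manifest and
context field names and the reporting helper are assumptions based on this
patch, so the details may well differ):

static void
verify_system_identifier(verifier_context *context)
{
	ControlFileData *control_file;
	bool		crc_ok;

	/* get_controlfile() builds the full path from the directory itself */
	control_file = get_controlfile(context->backup_directory, &crc_ok);

	if (!crc_ok)
		report_fatal_error("pg_control CRC check failed in \"%s\"",
						   context->backup_directory);

	if (control_file->system_identifier != context->manifest->system_identifier)
		report_fatal_error("system identifier in pg_control does not match backup manifest");

	pfree(control_file);
}
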
--
Michael




Re: index prefetching

2024-02-14 Thread Peter Geoghegan
On Wed, Feb 14, 2024 at 7:28 PM Andres Freund  wrote:
> On 2024-02-13 14:54:14 -0500, Peter Geoghegan wrote:
> > This property of index scans is fundamental to how index scans work.
> > Pinning an index page as an interlock against concurrently TID
> > recycling by VACUUM is directly described by the index API docs [1],
> > even (the docs actually use terms like "buffer pin" rather than
> > something more abstract sounding). I don't think that anything
> > affecting that behavior should be considered an implementation detail
> > of the nbtree index AM as such (nor any particular index AM).
>
> Given that the interlock is only needed for non-mvcc scans, that non-mvcc
> scans are rare due to catalog accesses using snapshots these days and that
> most non-mvcc scans do single-tuple lookups, it might be viable to be more
> restrictive about prefetching iff non-mvcc snapshots are in use and to use
> a method of cleanup that allows multiple pages to be cleaned up otherwise.

I agree, but don't think that it matters all that much.

If you have an MVCC snapshot, that doesn't mean that TID recycle
safety problems automatically go away. It only means that you have one
known and supported alternative approach to dealing with such
problems. It's not like you just get that for free, just by using an
MVCC snapshot, though -- it has downsides. Downsides such as the
current _bt_killitems() behavior with a concurrently-modified leaf
page (modified when we didn't hold a leaf page pin). It'll just give
up on setting any LP_DEAD bits due to noticing that the leaf page's
LSN changed. (Plus there are implementation restrictions that I won't
repeat again now.)
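
For reference, the give-up path being described is roughly this (simplified
from _bt_killitems() in nbtutils.c; the real code has more going on around it):

	/* pin was dropped earlier, so re-read the leaf page */
	buf = _bt_getbuf(scan->indexRelation, so->currPos.currPage, BT_READ);
	if (BufferGetLSNAtomic(buf) != so->currPos.lsn)
	{
		/* page changed while we didn't hold a pin; hinting is not safe */
		_bt_relbuf(scan->indexRelation, buf);
		return;
	}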

When I refer to the buffer pin interlock, I'm mostly referring to the
general need for something like that in the context of index scans.
Principally in order to make kill_prior_tuple continue to work in
something more or less like its current form.

> However, I don't think we would necessarily have to relax the IAM pinning
> rules, just to be able to do prefetching of more than one index leaf
> page.

To be clear, we already do relax the IAM pinning rules. Or at least
nbtree selectively opts out, as I've gone into already.

> Restricting prefetching to entries within a single leaf page obviously
> has the disadvantage of not being able to benefit from concurrent IO whenever
> crossing a leaf page boundary, but at the same time processing entries from
> just two leaf pages would often allow for a sufficiently aggressive
> prefetching.  Pinning a small number of leaf pages instead of a single leaf
> page shouldn't be a problem.

You're probably right. I just don't see any need to solve that problem in v1.

> One argument for loosening the tight coupling between kill_prior_tuples and
> index scan progress is that the lack of kill_prior_tuples for bitmap scans is
> quite problematic. I've seen numerous production issues with bitmap scans
> caused by subsequent scans processing a growing set of dead tuples, where
> plain index scans were substantially slower initially but didn't get much
> slower over time.

I've seen production issues like that too. No doubt it's a problem.

> We might be able to design a system where the bitmap
> contains a certain number of back-references to the index, allowing later
> cleanup if there weren't any page splits or such.

That does seem possible, but do you really want a design for index
prefetching that relies on that massive enhancement (a total redesign
of kill_prior_tuple) happening at some point in the not-too-distant
future? Seems risky, from a project management point of view.

This back-references idea seems rather complicated, especially if it
needs to work with very large bitmap index scans. Since you'll still
have the basic problem of TID recycle safety to deal with (even with
an MVCC snapshot), you don't just have to revisit the leaf pages. You
also have to revisit the corresponding heap pages (generally they'll
be a lot more numerous than leaf pages). You'll have traded one
problem for another (which is not to say that it's not a good
trade-off).

Right now the executor uses an amgettuple interface, and knows nothing
about index related costs (e.g., pages accessed in any index, index
qual costs). While the index AM has some limited understanding of heap
access costs. So the index AM kinda knows a small bit about both types
of costs (possibly not enough, but something). That informs the
language I'm using to describe all this.

To do something like your "back-references to the index" thing well, I
think that you need more dynamic behavior around when you visit the
heap to get heap tuples pointed to by TIDs from index pages (i.e.
dynamic behavior that determines how many leaf pages to go before
going to the heap to get pointed-to TIDs). That is basically what I
meant by "put the index AM in control" -- it doesn't *strictly*
require that the index AM actually do that. Just that a single piece
of code has to have access to the full 

Re: index prefetching

2024-02-14 Thread Andres Freund
Hi,

On 2024-02-13 14:54:14 -0500, Peter Geoghegan wrote:
> This property of index scans is fundamental to how index scans work.
> Pinning an index page as an interlock against concurrently TID
> recycling by VACUUM is directly described by the index API docs [1],
> even (the docs actually use terms like "buffer pin" rather than
> something more abstract sounding). I don't think that anything
> affecting that behavior should be considered an implementation detail
> of the nbtree index AM as such (nor any particular index AM).

Given that the interlock is only needed for non-mvcc scans, that non-mvcc
scans are rare due to catalog accesses using snapshots these days and that
most non-mvcc scans do single-tuple lookups, it might be viable to be more
restrictive about prefetching iff non-mvcc snapshots are in use and to use
a method of cleanup that allows multiple pages to be cleaned up otherwise.
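
Purely as an illustration of that idea (nothing like this exists today; the
helper name is made up), the prefetch distance could simply be clamped for
non-MVCC scans:

/* Hypothetical helper: keep prefetching conservative for non-MVCC scans. */
static int
index_prefetch_max_distance(IndexScanDesc scan, int requested)
{
	/*
	 * Non-MVCC snapshots rely on the buffer-pin interlock against TID
	 * recycling, so stay within the currently pinned leaf page.
	 */
	if (!IsMVCCSnapshot(scan->xs_snapshot))
		return Min(requested, 1);

	return requested;
}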

However, I don't think we would necessarily have to relax the IAM pinning
rules, just to be able to do prefetching of more than one index leaf
page. Restricting prefetching to entries within a single leaf page obviously
has the disadvantage of not being able to benefit from concurrent IO whenever
crossing a leaf page boundary, but at the same time processing entries from
just two leaf pages would often allow for a sufficiently aggressive
prefetching.  Pinning a small number of leaf pages instead of a single leaf
page shouldn't be a problem.


One argument for loosening the tight coupling between kill_prior_tuples and
index scan progress is that the lack of kill_prior_tuples for bitmap scans is
quite problematic. I've seen numerous production issues with bitmap scans
caused by subsequent scans processing a growing set of dead tuples, where
plain index scans were substantially slower initially but didn't get much
slower over time.  We might be able to design a system where the bitmap
contains a certain number of back-references to the index, allowing later
cleanup if there weren't any page splits or such.



> I think that it makes sense to put the index AM in control here --
> that almost follows from what I said about the index AM API. The index
> AM already needs to be in control, in about the same way, to deal with
> kill_prior_tuple (plus it helps with the  LIMIT issue I described).

Depending on what "control" means I'm doubtful:

Imo there are decisions influencing prefetching that an index AM shouldn't
need to know about directly, e.g. how the plan shape influences how many
tuples are actually going to be consumed. Of course that determination could
be made in planner/executor and handed to IAMs, for the IAM to then "control"
the prefetching.

Another aspect is that *long* term I think we want to be able to execute
different parts of the plan tree when one part is blocked for IO. Of course
that's not always possible. But particularly with partitioned queries it often
is.  Depending on the form of "control" that's harder if IAMs are in control,
because control flow needs to return to the executor to be able to switch to a
different node, so we can't wait for IO inside the AM.

There probably are ways IAMs could be in "control" that would be compatible
with such constraints however.

Greetings,

Andres Freund




Re: index prefetching

2024-02-14 Thread Andres Freund
Hi,

On 2024-02-14 16:45:57 -0500, Melanie Plageman wrote:
> > > The LIMIT problem is not very clear to me either. Yes, if we get close
> > > to the end of the leaf page, we may need to visit the next leaf page.
> > > But that's kinda the whole point of prefetching - reading stuff ahead,
> > > and reading too far ahead is an inherent risk. Isn't that a problem we
> > > have even without LIMIT? The prefetch distance ramp up is meant to limit
> > > the impact.
> >
> > Right now, the index AM doesn't know anything about LIMIT at all. That
> > doesn't matter, since the index AM can only read/scan one full leaf
> > page before returning control back to the executor proper. The
> > executor proper can just shut down the whole index scan upon finding
> > that we've already returned N tuples for a LIMIT N.
> >
> > We don't do prefetching right now, but we also don't risk reading a
> > leaf page that'll just never be needed. Those two things are in
> > tension, but I don't think that that's quite the same thing as the
> > usual standard prefetching tension/problem. Here there is uncertainty
> > about whether what we're prefetching will *ever* be required -- not
> > uncertainty about when exactly it'll be required. (Perhaps this
> > distinction doesn't mean much to you. I'm just telling you how I think
> > about it, in case it helps move the discussion forward.)
>
> I don't think that the LIMIT problem is too different for index scans
> than heap scans. We will need some advice from planner to come down to
> prevent over-eager prefetching in all cases.

I'm not sure that that's really true. I think the more common and more
problematic case for partially executing a sub-tree of a query are nested
loops (worse because that happens many times within a query). Particularly for
anti-joins prefetching too aggressively could lead to a significant IO
amplification.

At the same time it's IMO more important to ramp up prefetching distance
fairly aggressively for index scans than it is for sequential scans. For
sequential scans it's quite likely that either the whole scan takes quite a
while (thus slowly ramping doesn't affect overall time that much) or that the
data is cached anyway because the tables are small and frequently used (in
which case we don't need to ramp). And even if smaller tables aren't cached,
because it's sequential IO, the IOs are cheaper as they're sequential.
Contrast that to index scans, where it's much more likely that you have cache
misses in queries that do an overall fairly small number of IOs and where that
IO is largely random.

I think we'll need some awareness at ExecInitNode() time about how the results
of the nodes are used. I see a few "classes":

1) All rows are needed, because the node is below an Agg, Hash, Materialize,
   Sort, etc.  Can be determined purely by the plan shape.

2) All rows are needed, because the node is completely consumed by the
   top-level (i.e. no limit, anti-joins or such in between) and the top-level
   wants to run the whole query. Unfortunately I don't think we know this at
   plan time at the moment (it's just determined by what's passed to
   ExecutorRun()).

3) Some rows are needed, but it's hard to know the precise number. E.g. because
   of a LIMIT further up.

4) Only a single row is going to be needed, albeit possibly after filtering on
   the node level. E.g. the anti-join case.


There are different times at which we could determine how each node is
consumed:

a) Determine node consumption "class" purely within ExecInit*, via different
   eflags.

   Today that couldn't deal with 2), but I think it'd not be too hard to modify
   callers that consume query results completely to tell that ExecutorStart(),
   not just ExecutorRun().

   A disadvantage would be that this prevents us from taking IO depth into
   account during costing. There very well might be plans that are cheaper
   than others because the plan shape allows more concurrent IO.


b) Determine node consumption class at plan time.

   This also couldn't deal with 2), but fixing that probably would be harder,
   because we'll often not know at plan time how the query will be
   executed. And in fact the same plan might be executed multiple ways, in
   case of prepared statements.

   The obvious advantage is of course that we can influence the choice of
   paths.


I suspect we'd eventually want a mix of both. Plan time to be able to
influence plan shape, ExecInit* to deal with not knowing how the query will be
consumed at plan time.  Which suggests that we could start with whichever is
easier and extend later.
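
To make the classes above concrete, a hypothetical sketch (none of these names
exist in the tree today) could be as simple as:

/* How completely is a node's output expected to be consumed? */
typedef enum NodeConsumption
{
	NODE_CONSUME_ALL_BY_SHAPE,	/* 1) all rows; knowable from plan shape */
	NODE_CONSUME_ALL_TOPLEVEL,	/* 2) all rows; known only at ExecutorRun() */
	NODE_CONSUME_SOME,			/* 3) some rows, e.g. a LIMIT further up */
	NODE_CONSUME_ONE,			/* 4) at most one row, e.g. anti-join */
} NodeConsumption;

With approach a) this would be derived from eflags in ExecInit*, with approach
b) it would be attached to the plan node at planning time.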


Greetings,

Andres Freund




Re: common signal handler protection

2024-02-14 Thread Nathan Bossart
On Wed, Feb 14, 2024 at 11:55:43AM -0800, Noah Misch wrote:
> I took a look over each of these.  +1 for all.  Thank you.

Committed.  Thanks!

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: index prefetching

2024-02-14 Thread Peter Geoghegan
On Wed, Feb 14, 2024 at 4:46 PM Melanie Plageman
 wrote:
> So, a pin on the index leaf page is sufficient to keep line pointers
> from being reused? If we stick to prefetching heap blocks referred to
> by index tuples in a single index leaf page, and we keep that page
> pinned, will we still have a problem?

That's certainly one way of dealing with it. Obviously, there are
questions about how you do that in a way that consistently avoids
creating new problems.

> I don't think that the LIMIT problem is too different for index scans
> than heap scans. We will need some advice from planner to come down to
> prevent over-eager prefetching in all cases.

I think that I'd rather use information at execution time instead, if
at all possible (perhaps in addition to a hint given by the planner).
But it seems a bit premature to discuss this problem now, except to
say that it might indeed be a problem.

> > It's certainly possible that you could figure out various workarounds
> > for each of these issues (plus the kill_prior_tuple issue) with a
> > prefetching design that "de-synchronizes" the index access and the
> > heap access. But it might well be better to extend the existing design
> > in a way that just avoids all these problems in the first place. Maybe
> > "de-synchronization" really can pay for itself (because the benefits
> > will outweigh these costs), but if you go that way then I'd really
> > prefer it that way.
>
> Forcing each index access to be synchronous and interleaved with each
> table access seems like an unprincipled design constraint. While it is
> true that we rely on that in our current implementation (when using
> non-MVCC snapshots), it doesn't seem like a principle inherent to
> accessing indexes and tables.

There is nothing sacred about the way plain index scans work right now
-- especially the part about buffer pins as an interlock.

If the pin thing really was sacred, then we could never have allowed
nbtree to selectively opt-out in cases where it's possible to provide
an equivalent correctness guarantee without holding onto buffer pins,
which, as I went into, is how it actually works in nbtree's
_bt_killitems() today (see commit 2ed5b87f96 for full details). And so
in principle I have no problem with the idea of revising the basic
definition of plain index scans -- especially if it's to make the
definition more abstract, without fundamentally changing it (e.g., to
make it no longer reference buffer pins, making life easier for
prefetching, while at the same time still implying the same underlying
guarantees sufficient to allow nbtree to mostly work the same way as
today).

All I'm really saying is:

1. The sort of tricks that we can do in nbtree's _bt_killitems() are
quite useful, and ought to be preserved in something like their
current form, even when prefetching is in use.

This seems to push things in the direction of centralizing control of
the process in index scan code. For example, it has to understand that
_bt_killitems() will be called at some regular cadence that is well
defined and sensible from an index point of view.

2. Are you sure that the leaf-page-at-a-time thing is such a huge
hindrance to effective prefetching?

I suppose that it might be much more important than I imagine it is
right now, but it'd be nice to have something a bit more concrete to
go on.

3. Even if it is somewhat important, do you really need to get that
part working in v1?

Tomas' original prototype worked with the leaf-page-at-a-time thing,
and that still seemed like a big improvement to me. While being less
invasive, in effect. If we can agree that something like that
represents a useful step in the right direction (not an evolutionary
dead end), then we can make good incremental progress within a single
release.

> I don't think the fact that it would also be valuable to do index
> prefetching is a reason not to do prefetching of heap pages. And,
> while it is true that were you to add index interior or leaf page
> prefetching, it would impact the heap prefetching, at the end of the
> day, the table AM needs some TID or TID-equivalents that whose blocks
> it can go fetch.

I wasn't really thinking of index page prefetching at all. Just the
cost of applying index quals to read leaf pages that might never
actually need to be read, due to the presence of a LIMIT. That is kind
of a new problem created by eagerly reading (without actually
prefetching) leaf pages.

> You could argue that my suggestion to have the index AM manage and
> populate a queue of TIDs for use by the table AM puts the index AM in
> control. I do think having so many members of the IndexScanDescriptor
> which imply a one-at-a-time (xs_heaptid, xs_itup, etc) synchronous
> interplay between fetching an index tuple and fetching a heap tuple is
> confusing and error prone.

But that's kinda how amgettuple is supposed to work -- cursors need it
to work that way. Having some kind of general notion of scan order is
also important 

Re: When extended query protocol ends?

2024-02-14 Thread Tatsuo Ishii
>>> From [1] I think the JDBC driver sends something like below if
>>> autosave=always option is specified.
>>>
>>> "BEGIN READ ONLY" Parse/Bind/Eexecute (in the extended query protocol)
>>> "SAVEPOINT PGJDBC_AUTOSAVE" (in the simple query protocol)
>>>
>>> It seems the SAVEPOINT is sent without finishing the extended query
>>> protocol (i.e. without Sync message). Is it possible for the JDBC
>>> driver to issue a Sync message before sending SAVEPOINT in simple
>>> query protocol? Or you can send SAVEPOINT using the extended query
>>> protocol.
>>>
>>> [1]
>>> https://www.pgpool.net/pipermail/pgpool-general/2023-December/009051.html
>> 
>> 
>> Can you ask the OP what version of the driver they are using. From what I
>> can tell we send BEGIN using SimpleQuery.
> 
> Sure. I will get back once I get the JDBC version.

Here it is:
> JDBC driver version used: 42.5.1. Regards, Karel.

Best regards,
--
Tatsuo Ishii
SRA OSS LLC
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp




Re: Add trim_trailing_whitespace to editorconfig file

2024-02-14 Thread Daniel Gustafsson
> On 14 Feb 2024, at 23:06, Melanie Plageman  wrote:
> 
> On Wed, Feb 14, 2024 at 11:35 AM Jelte Fennema-Nio  wrote:
>> 
>> This brings our .gitattributes and .editorconfig files more in line. I
>> had the problem that "git add" would complain often about trailing
>> whitespaces when I was changing sgml files specifically.
> 
> +1 from me. But when do we want it to be false? That is, why not
> declare it true for all file types?

Regression test .out files commonly have spaces at the end of the line.  (Not
to mention the ECPG .c files but they probably really shouldn't have.)

--
Daniel Gustafsson





Re: Add trim_trailing_whitespace to editorconfig file

2024-02-14 Thread Melanie Plageman
On Wed, Feb 14, 2024 at 11:35 AM Jelte Fennema-Nio  wrote:
>
> This brings our .gitattributes and .editorconfig files more in line. I
> had the problem that "git add" would complain often about trailing
> whitespaces when I was changing sgml files specifically.

+1 from me. But when do we want it to be false? That is, why not
declare it true for all file types?

- Melanie




Re: index prefetching

2024-02-14 Thread Melanie Plageman
On Wed, Feb 14, 2024 at 1:21 PM Peter Geoghegan  wrote:
>
> On Wed, Feb 14, 2024 at 8:34 AM Tomas Vondra
>  wrote:
> > > Another thing that argues against doing this is that we might not need
> > > to visit any more B-Tree leaf pages when there is a LIMIT n involved.
> > > We could end up scanning a whole extra leaf page (including all of its
> > > tuples) for want of the ability to "push down" a LIMIT to the index AM
> > > (that's not what happens right now, but it isn't really needed at all
> > > right now).
> > >
> >
> > I'm not quite sure I understand what is "this" that you argue against.
> > Are you saying we should not separate the two scans? If yes, is there a
> > better way to do this?
>
> What I'm concerned about is the difficulty and complexity of any
> design that requires revising "63.4. Index Locking Considerations",
> since that's pretty subtle stuff. In particular, if prefetching
> "de-synchronizes" (to use your term) the index leaf page level scan
> and the heap page scan, then we'll probably have to totally revise the
> basic API.

So, a pin on the index leaf page is sufficient to keep line pointers
from being reused? If we stick to prefetching heap blocks referred to
by index tuples in a single index leaf page, and we keep that page
pinned, will we still have a problem?

> > The LIMIT problem is not very clear to me either. Yes, if we get close
> > to the end of the leaf page, we may need to visit the next leaf page.
> > But that's kinda the whole point of prefetching - reading stuff ahead,
> > and reading too far ahead is an inherent risk. Isn't that a problem we
> > have even without LIMIT? The prefetch distance ramp up is meant to limit
> > the impact.
>
> Right now, the index AM doesn't know anything about LIMIT at all. That
> doesn't matter, since the index AM can only read/scan one full leaf
> page before returning control back to the executor proper. The
> executor proper can just shut down the whole index scan upon finding
> that we've already returned N tuples for a LIMIT N.
>
> We don't do prefetching right now, but we also don't risk reading a
> leaf page that'll just never be needed. Those two things are in
> tension, but I don't think that that's quite the same thing as the
> usual standard prefetching tension/problem. Here there is uncertainty
> about whether what we're prefetching will *ever* be required -- not
> uncertainty about when exactly it'll be required. (Perhaps this
> distinction doesn't mean much to you. I'm just telling you how I think
> about it, in case it helps move the discussion forward.)

I don't think that the LIMIT problem is too different for index scans
than heap scans. We will need some advice from planner to come down to
prevent over-eager prefetching in all cases.

> Another factor that complicates things here is mark/restore
> processing. The design for that has the idea of processing one page at
> a time baked-in. Kinda like with the kill_prior_tuple issue.

Yes, I mentioned this in my earlier email. I think we can resolve
mark/restore by resetting the prefetch and TID queues and restoring
the last used heap TID in the index scan descriptor.
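
A rough sketch of that restore path, where the queue-reset helpers and the
prefetch state are hypothetical and only index_restrpos() exists today:

/* Hypothetical: restore a marked position while prefetching is active. */
static void
index_restrpos_with_prefetch(IndexScanDesc scan, IndexPrefetchState *pfs)
{
	/* discard everything read ahead past the marked position */
	prefetch_queue_reset(pfs);	/* hypothetical */
	tid_queue_reset(pfs);		/* hypothetical */

	/* then resume the index scan from the mark, as today */
	index_restrpos(scan);
}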

> It's certainly possible that you could figure out various workarounds
> for each of these issues (plus the kill_prior_tuple issue) with a
> prefetching design that "de-synchronizes" the index access and the
> heap access. But it might well be better to extend the existing design
> in a way that just avoids all these problems in the first place. Maybe
> "de-synchronization" really can pay for itself (because the benefits
> will outweigh these costs), but if you go that way then I'd really
> prefer it that way.

Forcing each index access to be synchronous and interleaved with each
table access seems like an unprincipled design constraint. While it is
true that we rely on that in our current implementation (when using
non-MVCC snapshots), it doesn't seem like a principle inherent to
accessing indexes and tables.

> > > I think that it makes sense to put the index AM in control here --
> > > that almost follows from what I said about the index AM API. The index
> > > AM already needs to be in control, in about the same way, to deal with
> > > kill_prior_tuple (plus it helps with the  LIMIT issue I described).
> > >
> >
> > In control how? What would be the control flow - what part would be
> > managed by the index AM?
>
> ISTM that prefetching for an index scan is about the index scan
> itself, first and foremost. The heap accesses are usually the dominant
> cost, of course, but sometimes the index leaf page accesses really do
> make up a significant fraction of the overall cost of the index scan.
> Especially with an expensive index qual. So if you just assume that
> the TIDs returned by the index scan are the only thing that matters,
> you might have a model that's basically correct on average, but is
> occasionally very wrong. That's one reason for "putting the index AM
> in control".

I don't think the 

Re: Refactoring backend fork+exec code

2024-02-14 Thread Andres Freund
Hi,

On 2024-02-08 13:19:53 +0200, Heikki Linnakangas wrote:
> > > - /*
> > > -  * Assign the ProcSignalSlot for an auxiliary process.  Since it doesn't
> > > -  * have a BackendId, the slot is statically allocated based on the
> > > -  * auxiliary process type (MyAuxProcType).  Backends use slots indexed 
> > > in
> > > -  * the range from 1 to MaxBackends (inclusive), so we use MaxBackends +
> > > -  * AuxProcType + 1 as the index of the slot for an auxiliary process.
> > > -  *
> > > -  * This will need rethinking if we ever want more than one of a 
> > > particular
> > > -  * auxiliary process type.
> > > -  */
> > > - ProcSignalInit(MaxBackends + MyAuxProcType + 1);
> > > + ProcSignalInit();
> >
> > Now that we don't need the offset here, we could move ProcSignalInit() into
> > BaseInit() I think?
>
> Hmm, doesn't feel right to me. BaseInit() is mostly concerned with setting
> up backend-private structures, and it's also called for a standalone
> backend.

It already initializes a lot of shared subsystems (pgstat, replication slots
and arguably things like the buffer pool, temporary file access and WAL). And
note that it already requires that MyProc is already set (but it's not yet
"added" to the procarray, i.e. doesn't do visibility stuff at that stage).

I don't think that BaseInit() being called by standalone backends really poses
a problem? So is InitPostgres(), which does call ProcSignalInit() in
standalone processes.

My mental model is that BaseInit() is for stuff that's shared between
processes that do attach to databases and those that don't. Right now the
initialization flow is something like this ascii diagram:

standalone:     InitProcess() -> BaseInit() -> ProcArrayAdd() -> SharedInvalBackendInit() -> ProcSignalInit() -> StartupXLOG() -> pgstat_beinit() -> attach to db -> pgstat_bestart()
normal backend: InitProcess() -> BaseInit() -> ProcArrayAdd() -> SharedInvalBackendInit() -> ProcSignalInit() -> pgstat_beinit() -> attach to db -> pgstat_bestart()
aux process:    InitAuxiliaryProcess() -> BaseInit() -> ProcSignalInit() -> pgstat_beinit() -> pgstat_bestart()


The only reason ProcSignalInit() happens kinda late is that historically we
used BackendIds as the index, which were only assigned in
SharedInvalBackendInit() for normal processes. But that doesn't make sense
anymore after your changes.

Similarly, we do pgstat_beinit() quite late, but that's again only because it
uses MyBackendId, which today is only assigned during
SharedInvalBackendInit().  I don't think we can do pgstat_bestart() earlier
though, which is a shame, given the four calls to it inside InitPostgres().


> I feel the process initialization codepaths could use some cleanup in
> general. Not sure what exactly.

Very much agreed.


> > > +/*
> > > + * BackendIdGetProc -- get a backend's PGPROC given its backend ID
> > > + *
> > > + * The result may be out of date arbitrarily quickly, so the caller
> > > + * must be careful about how this information is used.  NULL is
> > > + * returned if the backend is not active.
> > > + */
> > > +PGPROC *
> > > +BackendIdGetProc(int backendID)
> > > +{
> > > + PGPROC *result;
> > > +
> > > + if (backendID < 1 || backendID > ProcGlobal->allProcCount)
> > > + return NULL;
> >
> > Hm, doesn't calling BackendIdGetProc() with these values a bug? That's not
> > about being out of date or such.
>
> Perhaps. I just followed the example of the old implementation, which also
> returns NULL on bogus inputs.

Fair enough. Makes it easier to not notice bugs, but that's not on this 
patchset to fix...



> I think the last remaining question here is about the 0- vs 1-based indexing
> of BackendIds. Is it a good idea to switch to 0-based indexing? And if we do
> it, should we reserve PGPROC 0. I'm on the fence on this one.

I lean towards it being a good idea. Having two internal indexing schemes was
bad enough so far, but at least one would fairly quickly notice if one used
the wrong one. If they're just offset by 1, it might end up taking longer,
because that'll often also be a valid id.  But I think you have the author's
prerogative on this one.

If we do so, I think it might be better to standardize on MyProcNumber instead
of MyBackendId. That'll force looking at code where indexing shifts by 1 - and
it also seems more descriptive, as inside postgres it's imo clearer what a
"proc number" is than what a "backend id" is. Particularly because the latter
is also used for things that aren't backends...


The only exception are SQL level users, for those I think it might make sense
to keep the current 1 based indexing, there's just a few functions where we'd
need to translate.



> @@ -791,6 +792,7 @@ ProcArrayEndTransactionInternal(PGPROC *proc, 
> TransactionId latestXid)
>  static 

Re: index prefetching

2024-02-14 Thread Melanie Plageman
On Wed, Feb 14, 2024 at 11:40 AM Melanie Plageman
 wrote:
>
> On Tue, Feb 13, 2024 at 2:01 PM Tomas Vondra
>  wrote:
> >
> > On 2/7/24 22:48, Melanie Plageman wrote:
> > > ...
> > > - switching scan directions
> > >
> > > If the index scan switches directions on a given invocation of
> > > IndexNext(), heap blocks may have already been prefetched and read for
> > > blocks containing tuples beyond the point at which we want to switch
> > > directions.
> > >
> > > We could fix this by having some kind of streaming read "reset"
> > > callback to drop all of the buffers which have been prefetched which
> > > are now no longer needed. We'd have to go backwards from the last TID
> > > which was yielded to the caller and figure out which buffers in the
> > > pgsr buffer ranges are associated with all of the TIDs which were
> > > prefetched after that TID. The TIDs are in the per_buffer_data
> > > associated with each buffer in pgsr. The issue would be searching
> > > through those efficiently.
> > >
> >
> > Yeah, that's roughly what I envisioned in one of my previous messages
> > about this issue - walking back the TIDs read from the index and added
> > to the prefetch queue.
> >
> > > The other issue is that the streaming read API does not currently
> > > support backwards scans. So, if we switch to a backwards scan from a
> > > forwards scan, we would need to fallback to the non streaming read
> > > method. We could do this by just setting the TID queue size to 1
> > > (which is what I have currently implemented). Or we could add
> > > backwards scan support to the streaming read API.
> > >
> >
> > What do you mean by "support for backwards scans" in the streaming read
> > API? I imagined it naively as
> >
> > 1) drop all requests in the streaming read API queue
> >
> > 2) walk back all "future" requests in the TID queue
> >
> > 3) start prefetching as if from scratch
> >
> > Maybe there's a way to optimize this and reuse some of the work more
> > efficiently, but my assumption is that the scan direction does not
> > change very often, and that we process many items in between.
>
> Yes, the steps you mention for resetting the queues make sense. What I
> meant by "backwards scan is not supported by the streaming read API"
> is that Thomas/Andres had mentioned that the streaming read API does
> not support backwards scans right now. Though, since the callback just
> returns a block number, I don't know how it would break.
>
> When switching between a forwards and backwards scan, does it go
> backwards from the current position or start at the end (or beginning)
> of the relation?

Okay, well I answered this question for myself, by, um, trying it :).
FETCH backward will go backwards from the current cursor position. So,
I don't see exactly why this would be an issue.

> If it is the former, then the blocks would most
> likely be in shared buffers -- which the streaming read API handles.
> It is not obvious to me from looking at the code what the gap is, so
> perhaps Thomas could weigh in.

I have the same problem with the sequential scan streaming read user,
so I am going to try and figure this backwards scan and switching scan
direction thing there (where we don't have other issues).

- Melanie




Re: Fix a typo in pg_rotate_logfile

2024-02-14 Thread Daniel Gustafsson
> On 14 Feb 2024, at 19:51, Nathan Bossart  wrote:
> 
> On Wed, Feb 14, 2024 at 10:04:49AM -0500, Tom Lane wrote:
>> Daniel Gustafsson  writes:
>>> On 14 Feb 2024, at 11:35, Dave Page  wrote:
 That said, pgAdmin III has been out of support for many years, and as
 far as I know, it (and similarly old versions of EDB's PEM which was
 based on it) were the only consumers of adminpack. I would not be sad
 to see it removed entirely
>> 
>>> Searching on Github and Debian Codesearch I cannot find any reference to 
>>> anyone
>>> using any function from adminpack.  With pgAdminIII being EOL it might be time to
>>> remove it now rather than be on the hook to maintain it for another 5 years
>>> until v17 goes EOL.  It'll still be around for years in V16->.
>> 
>> Works for me.
>> 
>>> Attached is a diff to show what it would look like to remove adminpack 
>>> (catalog
>>> version bump omitted on purpose to avoid conflicts until commit).
>> 
>> I don't see any references you missed, so +1.
> 
> Seems reasonable to me, too.

Thanks!  I'll put this in the next CF to keep it open for comments a bit
longer, but will close it early in the CF.

--
Daniel Gustafsson





Re: common signal handler protection

2024-02-14 Thread Noah Misch
On Mon, Dec 18, 2023 at 11:32:47AM -0600, Nathan Bossart wrote:
> rebased for cfbot

I took a look over each of these.  +1 for all.  Thank you.




Re: BitmapHeapScan streaming read user and prelim refactoring

2024-02-14 Thread Andres Freund
Hi,

On 2024-02-13 18:11:25 -0500, Melanie Plageman wrote:
> Attached is a patch set which refactors BitmapHeapScan such that it
> can use the streaming read API [1]. It also resolves the long-standing
> FIXME in the BitmapHeapScan code suggesting that the skip fetch
> optimization should be pushed into the table AMs. Additionally, it
> moves table scan initialization to after the index scan and bitmap
> initialization.

Thanks for working on this! While I have some quibbles with details, I think
this is quite a bit of progress in the right direction.


> patches 0001-0002 are assorted cleanup needed later in the set.
> patches 0003 moves the table scan initialization to after bitmap creation
> patch 0004 is, I think, a bug fix. see [2].

I'd not quite call it a bugfix, it's not like it leads to wrong
behaviour. Seems more like an optimization. But whatever :)



> The caveat is that these patches introduce breaking changes to two
> table AM functions for bitmapheapscan: table_scan_bitmap_next_block()
> and table_scan_bitmap_next_tuple().

That's to be expected, I don't think it's worth worrying about. Right now a
bunch of TAMs can't implement bitmap scans, this goes a fair bit towards
allowing that...





> From d6dd6eb21dcfbc41208f87d1d81ffe3960130889 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman 
> Date: Mon, 12 Feb 2024 18:50:29 -0500
> Subject: [PATCH v1 03/11] BitmapHeapScan begin scan after bitmap setup
>
> There is no reason for table_beginscan_bm() to begin the actual scan of
> the underlying table in ExecInitBitmapHeapScan(). We can begin the
> underlying table scan after the index scan has been completed and the
> bitmap built.
>
> The one use of the scan descriptor during initialization was
> ExecBitmapHeapInitializeWorker(), which set the scan descriptor snapshot
> with one from an array in the parallel state. This overwrote the
> snapshot set in table_beginscan_bm().
>
> By saving that worker snapshot as a member in the BitmapHeapScanState
> during initialization, it can be restored in table_beginscan_bm() after
> returning from the table AM specific begin scan function.

I don't understand what the point of passing two different snapshots to
table_beginscan_bm() is. What does that even mean? Why can't we just use the
correct snapshot initially?


> From a3f62e4299663d418531ae61bb16ea39f0836fac Mon Sep 17 00:00:00 2001
> From: Melanie Plageman 
> Date: Mon, 12 Feb 2024 19:03:24 -0500
> Subject: [PATCH v1 04/11] BitmapPrefetch use prefetch block recheck for skip
>  fetch
>
> Previously BitmapPrefetch() used the recheck flag for the current block
> to determine whether or not it could skip prefetching the proposed
> prefetch block. It makes more sense for it to use the recheck flag from
> the TBMIterateResult for the prefetch block instead.

I'd mention the commit that introduced the current logic and link to the
thread that you started about this.


> From d56be7741765d93002649ef912ef4b8256a5b9af Mon Sep 17 00:00:00 2001
> From: Melanie Plageman 
> Date: Mon, 12 Feb 2024 19:04:48 -0500
> Subject: [PATCH v1 05/11] Update BitmapAdjustPrefetchIterator parameter type
>  to BlockNumber
>
> BitmapAdjustPrefetchIterator() only used the blockno member of the
> passed in TBMIterateResult to ensure that the prefetch iterator and
> regular iterator stay in sync. Pass it the BlockNumber only. This will
> allow us to move away from using the TBMIterateResult outside of table
> AM specific code.

Hm - I'm not convinced this is a good direction - doesn't that arguably
*increase* TAM awareness? Perhaps it doesn't make much sense to use bitmap
heap scans in a TAM without blocks, but still.



> From 202b16d3a381210e8dbee69e68a8310be8ee11d2 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman 
> Date: Mon, 12 Feb 2024 20:15:05 -0500
> Subject: [PATCH v1 06/11] Push BitmapHeapScan skip fetch optimization into
>  table AM
>
> This resolves the long-standing FIXME in BitmapHeapNext() which said that
> the optimization to skip fetching blocks of the underlying table when
> none of the column data was needed should be pushed into the table AM
> specific code.

Long-standing? Sure, it's old enough to walk, but we have FIXMEs that are old
enough to drink, at least in some countries. :)


> The table AM agnostic functions for prefetching still need to know if
> skipping fetching is permitted for this scan. However, this dependency
> will be removed when that prefetching code is removed in favor of the
> upcoming streaming read API.

> ---
>  src/backend/access/heap/heapam.c  |  10 +++
>  src/backend/access/heap/heapam_handler.c  |  29 +++
>  src/backend/executor/nodeBitmapHeapscan.c | 100 ++
>  src/include/access/heapam.h   |   2 +
>  src/include/access/tableam.h  |  17 ++--
>  src/include/nodes/execnodes.h |   6 --
>  6 files changed, 74 insertions(+), 90 deletions(-)
>
> diff --git a/src/backend/access/heap/heapam.c 
> 

Re: postgres_fdw fails to see that array type belongs to extension

2024-02-14 Thread Tom Lane
I wrote:
> There's one big remaining to-do item, I think: experimentation with
> pg_upgrade proves that a binary upgrade fails to fix the extension
> membership of arrays/rowtypes.  That's because pg_dump needs to
> manually reconstruct the membership dependencies, and it thinks that
> it doesn't need to do anything for implicit arrays.  Normally the
> point of that is that we want to exactly reconstruct the extension's
> contents, but I think in this case we'd really like to add the
> additional pg_depend entries even if they weren't there before.
> Otherwise people wouldn't get the new behavior until they do a
> full dump/reload.

> I can see two ways we could do that:

> * add logic to pg_dump

> * modify ALTER EXTENSION ADD TYPE so that it automatically recurses
> from a base type to its array type (and I guess we'd need to add
> something for relation rowtypes and multiranges, too).

> I think I like the latter approach because it's like how we
> handle ownership: pg_dump doesn't emit any commands to explicitly
> change the ownership of dependent types, either.  (But see [1].)
> We could presumably steal some logic from ALTER TYPE OWNER.
> I've not tried to code that here, though.

Now that the multirange issue is fixed (3e8235ba4), here's a
new version that includes the needed recursion in ALTER EXTENSION.
I spent some more effort on a proper test case, too.
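
For illustration only (not part of the patch or its regression test; the
extension and type names below are invented): with the recursion in place,
adding a base type to an extension should also record membership for its
array type, which is what postgres_fdw's shippability test needs to see.

    CREATE TYPE public.rgb AS ENUM ('red', 'green', 'blue');
    ALTER EXTENSION myext ADD TYPE public.rgb;

    -- Expect both rgb and its array type to show up as extension
    -- members (deptype = 'e'):
    SELECT objid::regtype AS member
    FROM pg_depend
    WHERE refclassid = 'pg_extension'::regclass
      AND refobjid = (SELECT oid FROM pg_extension WHERE extname = 'myext')
      AND classid = 'pg_type'::regclass
      AND deptype = 'e';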

regards, tom lane

diff --git a/src/backend/catalog/pg_type.c b/src/backend/catalog/pg_type.c
index fe47be38d0..1a0460b491 100644
--- a/src/backend/catalog/pg_type.c
+++ b/src/backend/catalog/pg_type.c
@@ -540,7 +540,7 @@ TypeCreate(Oid newTypeOid,
  * etc.
  *
  * We make an extension-membership dependency if we're in an extension
- * script and makeExtensionDep is true (and isDependentType isn't true).
+ * script and makeExtensionDep is true.
  * makeExtensionDep should be true when creating a new type or replacing a
  * shell type, but not for ALTER TYPE on an existing type.  Passing false
  * causes the type's extension membership to be left alone.
@@ -600,7 +600,7 @@ GenerateTypeDependencies(HeapTuple typeTuple,
 	ObjectAddressSet(myself, TypeRelationId, typeObjectId);
 
 	/*
-	 * Make dependencies on namespace, owner, ACL, extension.
+	 * Make dependencies on namespace, owner, ACL.
 	 *
 	 * Skip these for a dependent type, since it will have such dependencies
 	 * indirectly through its depended-on type or relation.  An exception is
@@ -625,11 +625,18 @@ GenerateTypeDependencies(HeapTuple typeTuple,
 
 		recordDependencyOnNewAcl(TypeRelationId, typeObjectId, 0,
  typeForm->typowner, typacl);
-
-		if (makeExtensionDep)
-			recordDependencyOnCurrentExtension(&myself, rebuild);
 	}
 
+	/*
+	 * Make extension dependency if requested.
+	 *
+	 * We used to skip this for dependent types, but it seems better to record
+	 * their extension membership explicitly; otherwise code such as
+	 * postgres_fdw's shippability test will be fooled.
+	 */
+	if (makeExtensionDep)
+		recordDependencyOnCurrentExtension(&myself, rebuild);
+
 	/* Normal dependencies on the I/O and support functions */
 	if (OidIsValid(typeForm->typinput))
 	{
diff --git a/src/backend/commands/extension.c b/src/backend/commands/extension.c
index 226f85d0e3..6772d6a8d2 100644
--- a/src/backend/commands/extension.c
+++ b/src/backend/commands/extension.c
@@ -131,6 +131,9 @@ static void ApplyExtensionUpdates(Oid extensionOid,
   char *origSchemaName,
   bool cascade,
   bool is_create);
+static void ExecAlterExtensionContentsRecurse(AlterExtensionContentsStmt *stmt,
+			  ObjectAddress extension,
+			  ObjectAddress object);
 static char *read_whole_file(const char *filename, int *length);
 
 
@@ -3294,7 +3297,6 @@ ExecAlterExtensionContentsStmt(AlterExtensionContentsStmt *stmt,
 	ObjectAddress extension;
 	ObjectAddress object;
 	Relation	relation;
-	Oid			oldExtension;
 
 	switch (stmt->objtype)
 	{
@@ -3349,6 +3351,38 @@ ExecAlterExtensionContentsStmt(AlterExtensionContentsStmt *stmt,
 	check_object_ownership(GetUserId(), stmt->objtype, object,
 		   stmt->object, relation);
 
+	/* Do the update, recursing to any dependent objects */
+	ExecAlterExtensionContentsRecurse(stmt, extension, object);
+
+	/* Finish up */
+	InvokeObjectPostAlterHook(ExtensionRelationId, extension.objectId, 0);
+
+	/*
+	 * If get_object_address() opened the relation for us, we close it to keep
+	 * the reference count correct - but we retain any locks acquired by
+	 * get_object_address() until commit time, to guard against concurrent
+	 * activity.
+	 */
+	if (relation != NULL)
+		relation_close(relation, NoLock);
+
+	return extension;
+}
+
+/*
+ * ExecAlterExtensionContentsRecurse
+ *		Subroutine for ExecAlterExtensionContentsStmt
+ *
+ * Do the bare alteration of object's membership in extension,
+ * without permission checks.  Recurse to dependent objects, if any.
+ */
+static void
+ExecAlterExtensionContentsRecurse(AlterExtensionContentsStmt 

Re: index prefetching

2024-02-14 Thread Peter Geoghegan
On Wed, Feb 14, 2024 at 11:40 AM Melanie Plageman
 wrote:
> I wasn't quite sure how we could use
> index_compute_xid_horizon_for_tuples() for inspiration -- per Peter's
> suggestion. But, I'd like to understand.

The point I was trying to make with that example was: a highly generic
mechanism can sometimes work across disparate index AMs (that all at
least support plain index scans) when it just so happens that these
AMs don't actually differ in a way that could possibly matter to that
mechanism. While it's true that (say) nbtree and hash are very
different at a high level, it's nevertheless also true that the way
things work at the level of individual index pages is much more
similar than different.

With index deletion, we know that the differences between each
supported index AM either don't matter at all (which is what obviates
the need for index_compute_xid_horizon_for_tuples() to be directly
aware of which index AM the page it is passed comes from), or matter
only in small, incidental ways (e.g., nbtree stores posting lists in
its tuples, despite using IndexTuple structs).

With prefetching, it seems reasonable to suppose that an index-AM
specific approach would end up needing very little truly custom code.
This is pretty strongly suggested by the fact that the rules around
buffer pins (as an interlock against concurrent TID recycling by
VACUUM) are standardized by the index AM API itself. Those rules might
be slightly more natural with nbtree, but that's kinda beside the
point. While the basic organizing principle for where each index tuple
goes can vary enormously, it doesn't necessarily matter at all -- in
the end, you're really just reading each index page (that has TIDs to
read) exactly once per scan, in some fixed order, with interlaced
inline heap accesses (that go fetch heap tuples for each individual
TID read from each index page).

In general I don't accept that we need to do things outside the index
AM, because software architecture encapsulation something something. I
suspect that we'll need to share some limited information across
different layers of abstraction, because that's just fundamentally
what's required by the constraints we're operating under. Can't really
prove it, though.

-- 
Peter Geoghegan




Re: MAINTAIN privilege -- what do we need to un-revert it?

2024-02-14 Thread Nathan Bossart
On Wed, Feb 14, 2024 at 10:20:28AM -0800, Jeff Davis wrote:
> If those arguments are still unconvincing, then the next idea is to fix
> the search_path for all maintenance commands[3]. I tried this during
> the 16 cycle, but due to timing issues it was also reverted. I can
> proceed with this approach again, but I'd like a clear endorsement, in
> case there were other reasons to doubt the approach.

This seemed like the approach folks were most in favor of at the developer
meeting a couple weeks ago [0].  At least, that was my interpretation of
the discussion.

BTW I have been testing reverting commit 151c22d (i.e., un-reverting
MAINTAIN) every month or two, and last I checked, it still applies pretty
cleanly.  The only changes I've needed to make are to the catversion and to
a hard-coded version in a test (16 -> 17).

[0] 
https://wiki.postgresql.org/wiki/FOSDEM/PGDay_2024_Developer_Meeting#The_Path_to_un-reverting_the_MAINTAIN_privilege

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Fix a typo in pg_rotate_logfile

2024-02-14 Thread Nathan Bossart
On Wed, Feb 14, 2024 at 10:04:49AM -0500, Tom Lane wrote:
> Daniel Gustafsson  writes:
>> On 14 Feb 2024, at 11:35, Dave Page  wrote:
>>> That said, pgAdmin III has been out of support for many years, and as
>>> far as I know, it (and similarly old versions of EDB's PEM which was
>>> based on it) were the only consumers of adminpack. I would not be sad
>>> to see it removed entirely
> 
>> Searching on Github and Debian Codesearch I cannot find any reference to 
>> anyone
>> using any function from adminpack.  With pgAdminIII being EOL it might be better to
>> remove it now rather than be on the hook to maintain it for another 5 years
>> until v17 goes EOL.  It'll still be around for years in V16->.
> 
> Works for me.
> 
>> Attached is a diff to show what it would look like to remove adminpack 
>> (catalog
>> version bump omitted on purpose to avoid conflicts until commit).
> 
> I don't see any references you missed, so +1.

Seems reasonable to me, too.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Re[2]: [PATCH] allow pg_current_logfile() execution under pg_monitor role

2024-02-14 Thread Nathan Bossart
On Wed, Feb 14, 2024 at 01:45:31PM -0500, Tom Lane wrote:
> "Pavlo Golub"  writes:
>> Oh, thanks! I forgot, indeed, to update docs and catalog version! My 
>> bad!
> 
> Docs, yes, but don't include catversion bumps in submitted patches.
> They'll just lead to merge problems when somebody else changes the
> current catversion.  We rely on the committer to remember to do this
> (which is an imperfect system, but...)

+1, I only included it in the patch I posted so that I didn't forget it...

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Re[2]: [PATCH] allow pg_current_logfile() execution under pg_monitor role

2024-02-14 Thread Pavlo Golub
On Wed, Feb 14, 2024, 19:45 Tom Lane  wrote:

> "Pavlo Golub"  writes:
> > Oh, thanks! I forgot, indeed, to update docs and catalog version! My
> > bad!
>
> Docs, yes, but don't include catversion bumps in submitted patches.
> They'll just lead to merge problems when somebody else changes the
> current catversion.  We rely on the committer to remember to do this
> (which is an imperfect system, but...)
>

Thanks for the clarification.


> regards, tom lane
>


Re: Re[2]: [PATCH] allow pg_current_logfile() execution under pg_monitor role

2024-02-14 Thread Tom Lane
"Pavlo Golub"  writes:
> Oh, thanks! I forgot, indeed, to update docs and catalog version! My 
> bad!

Docs, yes, but don't include catversion bumps in submitted patches.
They'll just lead to merge problems when somebody else changes the
current catversion.  We rely on the committer to remember to do this
(which is an imperfect system, but...)

regards, tom lane




Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

2024-02-14 Thread Jelte Fennema-Nio
On Wed, 14 Feb 2024 at 18:41, Alvaro Herrera  wrote:
> Hmm, I think the changes to libpq_pipeline in 0005 should be in 0004.

Yeah, you're correct. Fixed that now.


v32-0005-Start-using-new-libpq-cancel-APIs.patch
Description: Binary data


v32-0004-libpq-Add-encrypted-and-non-blocking-versions-of.patch
Description: Binary data


Re: index prefetching

2024-02-14 Thread Peter Geoghegan
On Wed, Feb 14, 2024 at 8:34 AM Tomas Vondra
 wrote:
> > Another thing that argues against doing this is that we might not need
> > to visit any more B-Tree leaf pages when there is a LIMIT n involved.
> > We could end up scanning a whole extra leaf page (including all of its
> > tuples) for want of the ability to "push down" a LIMIT to the index AM
> > (that's not what happens right now, but it isn't really needed at all
> > right now).
> >
>
> I'm not quite sure I understand what is "this" that you argue against.
> Are you saying we should not separate the two scans? If yes, is there a
> better way to do this?

What I'm concerned about is the difficulty and complexity of any
design that requires revising "63.4. Index Locking Considerations",
since that's pretty subtle stuff. In particular, if prefetching
"de-synchronizes" (to use your term) the index leaf page level scan
and the heap page scan, then we'll probably have to totally revise the
basic API.

Maybe that'll actually turn out to be the right thing to do -- it
could just be the only thing that can unleash the full potential of
prefetching. But I'm not aware of any evidence that points in that
direction. Are you? (I might have just missed it.)

> The LIMIT problem is not very clear to me either. Yes, if we get close
> to the end of the leaf page, we may need to visit the next leaf page.
> But that's kinda the whole point of prefetching - reading stuff ahead,
> and reading too far ahead is an inherent risk. Isn't that a problem we
> have even without LIMIT? The prefetch distance ramp up is meant to limit
> the impact.

Right now, the index AM doesn't know anything about LIMIT at all. That
doesn't matter, since the index AM can only read/scan one full leaf
page before returning control back to the executor proper. The
executor proper can just shut down the whole index scan upon finding
that we've already returned N tuples for a LIMIT N.

We don't do prefetching right now, but we also don't risk reading a
leaf page that'll just never be needed. Those two things are in
tension, but I don't think that that's quite the same thing as the
usual standard prefetching tension/problem. Here there is uncertainty
about whether what we're prefetching will *ever* be required -- not
uncertainty about when exactly it'll be required. (Perhaps this
distinction doesn't mean much to you. I'm just telling you how I think
about it, in case it helps move the discussion forward.)

> > This property of index scans is fundamental to how index scans work.
> > This property of index scans is fundamental to how index scans work.
> > Pinning an index page as an interlock against concurrent TID
> > recycling by VACUUM is directly described by the index API docs [1],
> > even (the docs actually use terms like "buffer pin" rather than
> > something more abstract sounding). I don't think that anything
> > affecting that behavior should be considered an implementation detail
> > of the nbtree index AM as such (nor any particular index AM).
> >
>
> Good point.

The main reason why the index AM docs require this interlock is
because we need such an interlock to make non-MVCC snapshot scans
safe. If you remove the interlock (the buffer pin interlock that
protects against TID recycling by VACUUM), you can still avoid the
same race condition by using an MVCC snapshot. This is why using an
MVCC snapshot is a requirement for bitmap index scans. I believe that
it's also a requirement for index-only scans, but the index AM docs
don't spell that out.

Another factor that complicates things here is mark/restore
processing. The design for that has the idea of processing one page at
a time baked-in. Kinda like with the kill_prior_tuple issue.

It's certainly possible that you could figure out various workarounds
for each of these issues (plus the kill_prior_tuple issue) with a
prefetching design that "de-synchronizes" the index access and the
heap access. But it might well be better to extend the existing design
in a way that just avoids all these problems in the first place. Maybe
"de-synchronization" really can pay for itself (because the benefits
will outweigh these costs), but if you go that way then I'd really
prefer it that way.

> > I think that it makes sense to put the index AM in control here --
> > that almost follows from what I said about the index AM API. The index
> > AM already needs to be in control, in about the same way, to deal with
> > kill_prior_tuple (plus it helps with the  LIMIT issue I described).
> >
>
> In control how? What would be the control flow - what part would be
> managed by the index AM?

ISTM that prefetching for an index scan is about the index scan
itself, first and foremost. The heap accesses are usually the dominant
cost, of course, but sometimes the index leaf page accesses really do
make up a significant fraction of the overall cost of the index scan.
Especially with an expensive index qual. So if you just assume that
the TIDs returned by the index scan are the only thing that matters,
you might have a model that's 

MAINTAIN privilege -- what do we need to un-revert it?

2024-02-14 Thread Jeff Davis
The MAINTAIN privilege was reverted during the 16 cycle because of the
potential for someone to play tricks with search_path.

For instance, if user foo does:

   CREATE FUNCTION mod7(INT) RETURNS INT IMMUTABLE
 LANGUAGE plpgsql AS $$ BEGIN RETURN mod($1, 7); END; $$;
   CREATE TABLE x(i INT);
   CREATE UNIQUE INDEX x_mod7_idx ON x (mod7(i));
   GRANT MAINTAIN ON x TO bar;

Then user bar can create their own function named "bar.mod(int, int)",
and "SET search_path = bar, pg_catalog", and then issue a "REINDEX x"
and cause problems.
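
Spelled out, the scenario looks roughly like this (a sketch only; it
assumes bar can create objects in a schema named "bar", and the malicious
function body is just a placeholder):

    -- connected as user bar:
    CREATE FUNCTION bar.mod(INT, INT) RETURNS INT IMMUTABLE
      LANGUAGE plpgsql AS $$ BEGIN RETURN 0; END; $$;
    SET search_path = bar, pg_catalog;
    REINDEX TABLE x;  -- mod7() now resolves mod() to bar.mod(), so the
                      -- unique index is rebuilt with bogus values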

There are several factors required for that to be a problem:

  1. foo hasn't used a "SET search_path" clause on their function
  2. bar must have the privileges to create a function somewhere
  3. bar must have privileges on table x

There's an argument that we should blame factor #1. Robert stated[1]
that users should use SET search_path clauses on their functions, even
SECURITY INVOKER functions. And I've added a search_path cache which
improves the performance enough to make that more reasonable to do
generally.
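
For instance, the function in the example above could be declared as
(just a sketch of that recommendation, not something proposed in this
thread):

    CREATE OR REPLACE FUNCTION mod7(INT) RETURNS INT IMMUTABLE
      SET search_path = pg_catalog
      LANGUAGE plpgsql AS $$ BEGIN RETURN mod($1, 7); END; $$;

with the effect that mod() always resolves in pg_catalog, no matter what
search_path the role running REINDEX happens to have.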

There's also an argument that #2 is to blame. Given the realities of
our system, best practice is that users shouldn't have the privileges
to create objects, even in their own schema, unless required. (Joe made
this suggestion in an offline discussion.)

There's also an argument that #3 is not specific to the MAINTAIN
privilege. Clearly similar risks exist for other privileges, like
TRIGGER. And even the INSERT privilege, in the above example, would
allow bar to violate the unique constraint and corrupt the index[2].

If those arguments are still unconvincing, then the next idea is to fix
the search_path for all maintenance commands[3]. I tried this during
the 16 cycle, but due to timing issues it was also reverted. I can
proceed with this approach again, but I'd like a clear endorsement, in
case there were other reasons to doubt the approach.

Regards,
Jeff Davis

[1]
https://www.postgresql.org/message-id/ca+tgmoyep40ibw-a9npfdp8ahgoekpp3apdfztgburqmfcw...@mail.gmail.com

[2]
https://www.postgresql.org/message-id/fff566293c9165c69bb4c555da1ac02c63660664.ca...@j-davis.com

[3]
https://www.postgresql.org/message-id/e44327179e5c9015c8dda67351c04da552066017.ca...@j-davis.com





Re[2]: [PATCH] allow pg_current_logfile() execution under pg_monitor role

2024-02-14 Thread Pavlo Golub





On Mon, Feb 12, 2024 at 09:49:45AM -0600, Nathan Bossart wrote:

 Okay.  I'll plan on committing this in the next few days.


Here is what I have staged for commit.

Oh, thanks! I forgot, indeed, to update docs and catalog version! My 
bad!
In my defense, I was trying to find tests but I missed 
regress/sql/misc_functions.sql somehow.

Now I will know. Thanks again!

Best regards,
Pavlo Golub




Re: [PATCH] allow pg_current_logfile() execution under pg_monitor role

2024-02-14 Thread Nathan Bossart
On Wed, Feb 14, 2024 at 08:59:06AM +0100, Daniel Gustafsson wrote:
> LGTM.

Committed.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

2024-02-14 Thread Alvaro Herrera
On 2024-Feb-14, Jelte Fennema-Nio wrote:

> Attached is a new version of the final patches, with much improved
> docs (imho) and the new function names: PQcancelStart and
> PQcancelBlocking.

Hmm, I think the changes to libpq_pipeline in 0005 should be in 0004.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/




Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

2024-02-14 Thread Jelte Fennema-Nio
On Sun, 4 Feb 2024 at 16:39, Alvaro Herrera  wrote:
> Maybe this is okay?  I'll have a look at the whole final situation more
> carefully later; or if somebody else wants to share an opinion, please
> do so.

Attached is a new version of the final patches, with much improved
docs (imho) and the new function names: PQcancelStart and
PQcancelBlocking.


v31-0004-libpq-Add-encrypted-and-non-blocking-versions-of.patch
Description: Binary data


v31-0005-Start-using-new-libpq-cancel-APIs.patch
Description: Binary data


Re: Document efficient self-joins / UPDATE LIMIT techniques.

2024-02-14 Thread Joel Jacobson
On Tue, Feb 13, 2024, at 23:56, Corey Huinker wrote:
> This patch came out of a discussion at the last PgCon with the person 
> who made the "fringe feature" quote, who seemed quite supportive of 
> documenting the technique. The comment may have been in regards to 
> actually implementing a LIMIT clause on UPDATE and DELETE, which isn't 
> in the SQL standard and would be difficult to implement as the two 
> statements have no concept of ordering. Documenting the workaround 
> would alleviate some interest in implementing a nonstandard feature.

Thanks for sharing the background story.

> As for whether it's commonplace, when I was a consultant I had a number
> of customers who bemoaned how large updates caused big
> replica lag, basically punishing access to records they did care about 
> in order to properly archive or backfill records they don't care about. 
> I used the technique a lot, putting the update/delete in a loop, and 
> often running multiple copies of the same script at times when I/O 
> contention was low, but if load levels rose it was trivial to just kill 
> a few of the scripts until things calmed down.

I've also used the technique quite a lot, but only using the PK,
didn't know about the ctid trick, so many thanks for documenting it.
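
For anyone skimming the thread, the loop being described is roughly the
following (table name, predicate, and batch size are invented; the exact
form the documentation patch uses may differ):

    -- Delete old rows in bounded batches so no single statement floods
    -- the WAL; rerun until it reports DELETE 0.
    WITH batch AS (
        SELECT ctid
        FROM big_table
        WHERE created_at < now() - interval '90 days'
        LIMIT 10000
    )
    DELETE FROM big_table
    USING batch
    WHERE big_table.ctid = batch.ctid;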

/Joel




Re: What about Perl autodie?

2024-02-14 Thread Peter Eisentraut

On 08.02.24 16:53, Tom Lane wrote:

Daniel Gustafsson  writes:

On 8 Feb 2024, at 08:01, Peter Eisentraut  wrote:
I suppose we could start using it for completely new scripts.



+1, it would be nice to eventually be able to move to it everywhere so starting
now with new scripts may make the eventual transition smoother.


I'm still concerned about people carelessly using autodie-reliant
code in places where they shouldn't.  I offer two safer ways
forward:

1. Wait till v16 is the oldest supported branch, and then migrate
both HEAD and back branches to using autodie.

2. Don't wait, migrate them all now.  This would mean requiring
Perl 5.10.1 or later to run the TAP tests, even in back branches.

I think #2 might not be all that radical.  We have nothing older
than 5.14.0 in the buildfarm, so we don't really have much grounds
for claiming that 5.8.3 will work today.  And 5.10.1 came out in
2009, so how likely is it that anyone cares anymore?


A gentler way might be to start using some perlcritic policies like 
InputOutput::RequireCheckedOpen or the more general 
InputOutput::RequireCheckedSyscalls and add explicit error checking at 
the sites it points out.  And then if we start using autodie in the 
future, any inappropriate backpatching of calls lacking error checks 
would be caught.






Re: logical decoding and replication of sequences, take 2

2024-02-14 Thread Tomas Vondra



On 2/13/24 17:37, Robert Haas wrote:
> On Sun, Jan 28, 2024 at 1:07 AM Tomas Vondra
>  wrote:
>> Right, locks + apply in commit order gives us this guarantee (I can't
>> think of a case where it wouldn't be the case).
> 
> I couldn't find any cases of inadequate locking other than the one I 
> mentioned.
> 
>> Doesn't the whole logical replication critically depend on the commit
>> order? If you decide to arbitrarily reorder/delay the transactions, all
>> kinds of really bad things can happen. That's a generic problem, it
>> applies to all kinds of objects, not just sequences - a parallel apply
>> would need to detect this sort of dependencies (e.g. INSERT + DELETE of
>> the same key), and do something about it.
> 
> Yes, but here I'm not just talking about the commit order. I'm talking
> about the order of applying non-transactional operations relative to
> commits.
> 
> Consider:
> 
> T1: CREATE SEQUENCE s;
> T2: BEGIN;
> T2: SELECT nextval('s');
> T3: SELECT nextval('s');
> T2: ALTER SEQUENCE s INCREMENT 2;
> T2: SELECT nextval('s');
> T2: COMMIT;
> 

It's not clear to me if you're talking about nextval() that happens to
generate WAL, or nextval() covered by WAL generated by a previous call.

I'm going to assume it's the former, i.e. nextval() that generated WAL
describing the *next* sequence chunk, because without WAL there's
nothing to apply and therefore no issue with T3 ordering.

The way I think about non-transactional sequence changes is as if they
were tiny transactions that happen "fully" (including commit) at the LSN
where the sequence change is logged.
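
(To spell out what "non-transactional" means at the SQL level; this is
ordinary sequence behaviour, not something the patch introduces:)

    CREATE SEQUENCE s;
    BEGIN;
    SELECT nextval('s');   -- returns 1
    ROLLBACK;
    SELECT nextval('s');   -- returns 2: the increment survived the rollback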


> The commit order is T1 < T3 < T2, but T3 makes no transactional
> changes, so the commit order is really just T1 < T2. But it's
> completely wrong to say that all we need to do is apply T1 before we
> apply T2. The correct order of application is:
> 
> 1. T1.
> 2. T2's first nextval
> 3. T3's nextval
> 4. T2's transactional changes (i.e. the ALTER SEQUENCE INCREMENT and
> the subsequent nextval)
> 

Is that quite true? If T3 generated WAL (for the nextval call), it will
be applied at that particular LSN. AFAIK that guarantees it happens
after the first T2 change (which is also non-transactional) and before
the transactional T2 change (because that creates a new relfilenode).

> In other words, the fact that some sequence changes are
> non-transactional creates ordering hazards that don't exist if there
> are no non-transactional changes. So in that way, sequences are
> different from table modifications, where applying the transactions in
> order of commit is all we need to do. Here we need to apply the
> transactions in order of commit and also apply the non-transactional
> changes at the right point in the sequence. Consider the following
> alternative apply sequence:
> 
> 1. T1.
> 2. T2's transactional changes (i.e. the ALTER SEQUENCE INCREMENT and
> the subsequent nextval)
> 3. T3's nextval
> 4. T2's first nextval
> 
> That's still in commit order. It's also wrong.
> 

Yes, this would be wrong. Thankfully the apply is not allowed to reorder
the changes like this, because that's not what "non-transactional" means
in this context.

It does not mean we can arbitrarily reorder the changes, it only means
the changes are applied as if they were independent transactions (but in
the same order as they were executed originally). Both with respect to
the other non-transactional changes, and to "commits" of other stuff.

(for serial apply, at least)

> Imagine that you commit this patch and someone later wants to do
> parallel logical apply. So every time they finish decoding a
> transaction, they stick it in a queue to be applied by the next
> available worker. But, non-transactional changes are very simple, so
> we just directly apply those in the main process. Well, kaboom! But
> now this can happen with the above example.
> 
> 1. Decode T1. Add to queue for apply.
> 2. Before the (idle) apply worker has a chance to pull T1 out of the
> queue, decode the first nextval and try to apply it.
> 
> Oops. We're trying to apply a modification to a sequence that hasn't
> been created yet. I'm not saying that this kind of hypothetical is a
> reason not to commit the patch. But it seems like we're not on the
> same page about what the ordering requirements are here. I'm just
> making the argument that those non-transactional operations actually
> act like mini-transactions. They need to happen at the right time
> relative to the real transactions. A non-transactional operation needs
> to be applied after any transactions that commit before it is logged,
> and before any transactions that commit after it's logged.
> 

How is this issue specific to sequences? AFAIK this is a general problem
with transactions that depend on each other. Consider for example this:

T1: INSERT INTO t (id) VALUES (1);
T2: DELETE FROM t WHERE id = 1;

If you parallelize this in a naive way, maybe T2 gets applied before T1.
In which case the DELETE won't find the row yet.

There's 

Re: Constant Splitting/Refactoring

2024-02-14 Thread David Christensen
This should target PG 18, but that status is not available in the CF app yet, 
so just making a note here.

Re: BitmapHeapScan streaming read user and prelim refactoring

2024-02-14 Thread Mark Dilger



> On Feb 14, 2024, at 6:47 AM, Melanie Plageman  
> wrote:
> 
> Just curious, did your table AM implement
> table_scan_bitmap_next_block() and table_scan_bitmap_next_tuple(),
> and, if so, did you use the TBMIterateResult? Since it is not used in
> BitmapHeapNext() in my version, table AMs would have to change how
> they use TBMIterateResults anyway. But I assume they could add it to a
> table AM specific scan descriptor if they want access to a
> TBMIterateResult of their own making in both
> table_scan_bitmap_next_block() and next_tuple()?

My table AM does implement those two functions and does use the 
TBMIterateResult *tbmres argument, yes.  I would deal with the issue in very 
much the same way that your patches modify heapam.  I don't really have any 
additional comments about that.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: index prefetching

2024-02-14 Thread Melanie Plageman
On Tue, Feb 13, 2024 at 2:01 PM Tomas Vondra
 wrote:
>
> On 2/7/24 22:48, Melanie Plageman wrote:
> > ...
> > Issues
> > ---
> > - kill prior tuple
> >
> > This optimization doesn't work with index prefetching with the current
> > design. Kill prior tuple relies on alternating between fetching a
> > single index tuple and visiting the heap. After visiting the heap we
> > can potentially kill the immediately preceding index tuple. Once we
> > fetch multiple index tuples, enqueue their TIDs, and later visit the
> > heap, the next index page we visit may not contain all of the index
> > tuples deemed killable by our visit to the heap.
> >
>
> I admit I haven't thought about kill_prior_tuple until you pointed out.
> Yeah, prefetching separates (de-synchronizes) the two scans (index and
> heap) in a way that prevents this optimization. Or at least makes it
> much more complex :-(
>
> > In our case, we could try and fix this by prefetching only heap blocks
> > referred to by index tuples on the same index page. Or we could try
> > and keep a pool of index pages pinned and go back and kill index
> > tuples on those pages.
> >
>
> I think restricting the prefetching to a single index page would not be
> a huge issue performance-wise - that's what the initial patch version
> (implemented at the index AM level) did, pretty much. The prefetch queue
> would get drained as we approach the end of the index page, but luckily
> index pages tend to have a lot of entries. But it'd put an upper bound
> on the prefetch distance (much lower than the e_i_c maximum 1000, but
> I'd say common values are 10-100 anyway).
>
> But how would we know we're on the same index page? That knowledge is
> not available outside the index AM - the executor or indexam.c does not
> know this, right? Presumably we could expose this, somehow, but it seems
> like a violation of the abstraction ...

The easiest way to do this would be to have the index AM amgettuple()
functions set a new member in the IndexScanDescData which is either
the index page identifier or a boolean that indicates we have moved on
to the next page. Then, when filling the queue, we would stop doing so
when the page switches. Now, this wouldn't really work for the first
index tuple on each new page, so, perhaps we would need the index AMs
to implement some kind of "peek" functionality.

Or, we could provide the index AM with a max queue size and allow it
to fill up the queue with the TIDs it wants (which it could keep to
the same index page). And, for the index-only scan case, could have
some kind of flag which indicates if the caller is putting
TIDs+HeapTuples or TIDs+IndexTuples on the queue, which might reduce
the amount of space we need. I'm not sure who manages the memory here.

I wasn't quite sure how we could use
index_compute_xid_horizon_for_tuples() for inspiration -- per Peter's
suggestion. But, I'd like to understand.

> > - switching scan directions
> >
> > If the index scan switches directions on a given invocation of
> > IndexNext(), heap blocks may have already been prefetched and read for
> > blocks containing tuples beyond the point at which we want to switch
> > directions.
> >
> > We could fix this by having some kind of streaming read "reset"
> > callback to drop all of the buffers which have been prefetched which
> > are now no longer needed. We'd have to go backwards from the last TID
> > which was yielded to the caller and figure out which buffers in the
> > pgsr buffer ranges are associated with all of the TIDs which were
> > prefetched after that TID. The TIDs are in the per_buffer_data
> > associated with each buffer in pgsr. The issue would be searching
> > through those efficiently.
> >
>
> Yeah, that's roughly what I envisioned in one of my previous messages
> about this issue - walking back the TIDs read from the index and added
> to the prefetch queue.
>
> > The other issue is that the streaming read API does not currently
> > support backwards scans. So, if we switch to a backwards scan from a
> > forwards scan, we would need to fallback to the non streaming read
> > method. We could do this by just setting the TID queue size to 1
> > (which is what I have currently implemented). Or we could add
> > backwards scan support to the streaming read API.
> >
>
> What do you mean by "support for backwards scans" in the streaming read
> API? I imagined it naively as
>
> 1) drop all requests in the streaming read API queue
>
> 2) walk back all "future" requests in the TID queue
>
> 3) start prefetching as if from scratch
>
> Maybe there's a way to optimize this and reuse some of the work more
> efficiently, but my assumption is that the scan direction does not
> change very often, and that we process many items in between.

Yes, the steps you mention for resetting the queues make sense. What I
meant by "backwards scan is not supported by the streaming read API"
is that Thomas/Andres had mentioned that the streaming read API does
not support backwards scans 

Add trim_trailing_whitespace to editorconfig file

2024-02-14 Thread Jelte Fennema-Nio
This brings our .gitattributes and .editorconfig files more in line. I
had the problem that "git add" would often complain about trailing
whitespace when I was changing sgml files specifically.


v1-0001-Configure-trailing-whitespace-trimming-in-editorc.patch
Description: Binary data


Re: 035_standby_logical_decoding unbounded hang

2024-02-14 Thread Bertrand Drouvot
Hi,

On Sat, Feb 10, 2024 at 05:02:27PM -0800, Noah Misch wrote:
> Coincidentally, one of my buildfarm animals hung for several weeks in a
> different test, 035_standby_logical_decoding.pl.  A LOG_SNAPSHOT_INTERVAL_MS
> reduction was part of making it reproducible:
> 
> On Fri, Feb 02, 2024 at 04:01:45PM -0800, Noah Misch wrote:
> > On Fri, Feb 02, 2024 at 02:30:03PM -0800, Noah Misch wrote:
> > > On Fri, Feb 02, 2024 at 05:07:14PM -0500, Tom Lane wrote:
> > > > If you look at the buildfarm's failures page and filter down to
> > > > just subscriptionCheck failures, what you find is that all of the
> > > > last 6 such failures are in 031_column_list.pl:
> 
> > > https://www.postgresql.org/message-id/flat/16d6d9cc-f97d-0b34-be65-425183ed3721%40gmail.com
> > > reported a replacement BgWriterDelay value reproducing it.
> > 
> > Correction: the recipe changes LOG_SNAPSHOT_INTERVAL_MS in addition to
> > BgWriterDelay.
> 
> I'm reusing this thread just in case there's overlap with the
> 031_column_list.pl cause and fix.  The 035_standby_logical_decoding.pl hang is
> a race condition arising from an event sequence like this:
> 
> - Test script sends CREATE SUBSCRIPTION to subscriber, which loses the CPU.
> - Test script calls pg_log_standby_snapshot() on primary.  Emits 
> XLOG_RUNNING_XACTS.
> - checkpoint_timeout makes a primary checkpoint finish.  Emits 
> XLOG_RUNNING_XACTS.
> - bgwriter executes LOG_SNAPSHOT_INTERVAL_MS logic.  Emits XLOG_RUNNING_XACTS.
> - CREATE SUBSCRIPTION wakes up and sends CREATE_REPLICATION_SLOT to standby.
> 
> Other test code already has a solution for this, so the attached patches add a
> timeout and copy the existing solution.  I'm also attaching the hack that
> makes it 100% reproducible.

Thanks!

I did a few tests and confirm that the proposed solution fixes the corner case.

standby-slot-test-1-timeout-v1.patch LGTM.

Regarding standby-slot-test-2-race-v1.patch:

> +# See corresponding create_logical_slot_on_standby() code.
> +$node_standby->poll_query_until(
> + 'postgres', qq[
> + SELECT restart_lsn IS NOT NULL
> + FROM pg_catalog.pg_replication_slots WHERE slot_name = 'tap_sub'
> + ]) or die "timed out waiting for logical slot to calculate its 
> restart_lsn";
> +

What about creating a sub, say wait_for_restart_lsn_calculation() in Cluster.pm
and then making use of it in create_logical_slot_on_standby() and above?
(something
like wait_for_restart_lsn_calculation-v1.patch attached).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
commit 14d988c748f3e500c44d65e073c276e6d8af6156
Author: Bertrand Drouvot 
Date:   Wed Feb 14 15:21:35 2024 +

wait_for_restart_lsn_calculation

diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index e2e70d0dbf..21cf179db1 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -3174,6 +3174,28 @@ $SIG{TERM} = $SIG{INT} = sub {
 
 =pod
 
+=item $node->wait_for_restart_lsn_calculation(self, slot_name)
+
+Wait until the given logical replication slot on this standby has calculated its restart_lsn
+
+=cut
+
+sub wait_for_restart_lsn_calculation
+{
+	my ($self, $slot_name) = @_;
+	my ($stdout, $stderr);
+
+	$self->poll_query_until(
+		'postgres', qq[
+		SELECT restart_lsn IS NOT NULL
+		FROM pg_catalog.pg_replication_slots WHERE slot_name = '$slot_name'
+	])
+	  or die
+	  "timed out waiting for logical slot to calculate its restart_lsn";
+}
+
+=pod
+
 =item $node->create_logical_slot_on_standby(self, primary, slot_name, dbname)
 
 Create logical replication slot on given standby
@@ -3203,13 +3225,7 @@ sub create_logical_slot_on_standby
 	# xl_running_xacts WAL record from the restart_lsn onwards. First wait
 	# until the slot restart_lsn is determined.
 
-	$self->poll_query_until(
-		'postgres', qq[
-		SELECT restart_lsn IS NOT NULL
-		FROM pg_catalog.pg_replication_slots WHERE slot_name = '$slot_name'
-	])
-	  or die
-	  "timed out waiting for logical slot to calculate its restart_lsn";
+	$self->wait_for_restart_lsn_calculation($slot_name);
 
 	# Then arrange for the xl_running_xacts record for which pg_recvlogical is
 	# waiting.
diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index f7b7fc7f9e..85330720c5 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -465,12 +465,8 @@ $psql_subscriber{subscriber_stdin} .= "\n";
 
 $psql_subscriber{run}->pump_nb();
 
-# See corresponding create_logical_slot_on_standby() code.
-$node_standby->poll_query_until(
-	'postgres', qq[
-		SELECT restart_lsn IS NOT NULL
-		FROM pg_catalog.pg_replication_slots WHERE slot_name = 'tap_sub'
-	]) or die "timed out waiting for logical slot to calculate its restart_lsn";
+# Wait for restart_lsn calculation
+$node_standby->wait_for_restart_lsn_calculation('tap_sub');
 

Re: Fix a typo in pg_rotate_logfile

2024-02-14 Thread Tom Lane
Daniel Gustafsson  writes:
> On 14 Feb 2024, at 11:35, Dave Page  wrote:
>> That said, pgAdmin III has been out of support for many years, and as far as 
>> I know, it (and similarly old versions of EDB's PEM which was based on it) 
>> were the only consumers of adminpack. I would not be sad to see it removed 
>> entirely

> Searching on Github and Debian Codesearch I cannot find any reference to 
> anyone
> using any function from adminpack.  With pgAdminIII being EOL it might be to
> remove it now rather than be on the hook to maintain it for another 5 years
> until v17 goes EOL.  It'll still be around for years in V16->.

Works for me.

> Attached is a diff to show what it would look like to remove adminpack 
> (catalog
> version bump omitted on purpose to avoid conflicts until commit).

I don't see any references you missed, so +1.

regards, tom lane




Re: BitmapHeapScan streaming read user and prelim refactoring

2024-02-14 Thread Melanie Plageman
On Tue, Feb 13, 2024 at 11:34 PM Mark Dilger
 wrote:
>
> > On Feb 13, 2024, at 3:11 PM, Melanie Plageman  
> > wrote:
>
> Thanks for the patch...
>
> > Attached is a patch set which refactors BitmapHeapScan such that it
> > can use the streaming read API [1]. It also resolves the long-standing
> > FIXME in the BitmapHeapScan code suggesting that the skip fetch
> > optimization should be pushed into the table AMs. Additionally, it
> > moves table scan initialization to after the index scan and bitmap
> > initialization.
> >
> > patches 0001-0002 are assorted cleanup needed later in the set.
> > patches 0003 moves the table scan initialization to after bitmap creation
> > patch 0004 is, I think, a bug fix. see [2].
> > patches 0005-0006 push the skip fetch optimization into the table AMs
> > patches 0007-0009 change the control flow of BitmapHeapNext() to match
> > that required by the streaming read API
> > patch 0010 is the streaming read code not yet in master
> > patch 0011 is the actual bitmapheapscan streaming read user.
> >
> > patches 0001-0009 apply on top of master but 0010 and 0011 must be
> > applied on top of a commit before a 21d9c3ee4ef74e2 (until a rebased
> > version of the streaming read API is on the mailing list).
>
> I followed your lead and applied them to 
> 6a8ffe812d194ba6f4f26791b6388a4837d17d6c.  `make check` worked fine, though I 
> expect you know that already.

Thanks for taking a look!

> > The caveat is that these patches introduce breaking changes to two
> > table AM functions for bitmapheapscan: table_scan_bitmap_next_block()
> > and table_scan_bitmap_next_tuple().
>
> You might want an independent perspective on how much of a hassle those 
> breaking changes are, so I took a stab at that.  Having written a custom 
> proprietary TAM for postgresql 15 here at EDB, and having ported it and 
> released it for postgresql 16, I thought I'd try porting it to the the above 
> commit with your patches.  Even without your patches, I already see breaking 
> changes coming from commit f691f5b80a85c66d715b4340ffabb503eb19393e, which 
> creates a similar amount of breakage for me as does your patches.  Dealing 
> with the combined breakage might amount to a day of work, including testing, 
> half of which I think I've already finished.  In other words, it doesn't seem 
> like a big deal.
>
> Were postgresql 17 shaping up to be compatible with TAMs written for 16, your 
> patch would change that qualitatively, but since things are already 
> incompatible, I think you're in the clear.

Oh, good to know! I'm very happy to have the perspective of a table AM
author. Just curious, did your table AM implement
table_scan_bitmap_next_block() and table_scan_bitmap_next_tuple(),
and, if so, did you use the TBMIterateResult? Since it is not used in
BitmapHeapNext() in my version, table AMs would have to change how
they use TBMIterateResults anyway. But I assume they could add it to a
table AM specific scan descriptor if they want access to a
TBMIterateResult of their own making in both
table_scan_bitmap_next_block() and next_tuple()?

- Melanie




Re: About a recently-added message

2024-02-14 Thread Euler Taveira
On Wed, Feb 14, 2024, at 8:45 AM, Amit Kapila wrote:
> Now, I am less clear about whether to quote "logical" or not in the
> above message. Do you have any suggestions?

The possible confusion comes from the fact that the sentence contains "must be"
in the middle of a comparison expression. For pg_createsubscriber, we are using

publisher requires wal_level >= logical

I suggest to use something like

slot synchronization requires wal_level >= logical


--
Euler Taveira
EDB   https://www.enterprisedb.com/


Re: index prefetching

2024-02-14 Thread Tomas Vondra



On 2/14/24 08:10, Robert Haas wrote:
> On Thu, Feb 8, 2024 at 3:18 AM Melanie Plageman
>  wrote:
>> - kill prior tuple
>>
>> This optimization doesn't work with index prefetching with the current
>> design. Kill prior tuple relies on alternating between fetching a
>> single index tuple and visiting the heap. After visiting the heap we
>> can potentially kill the immediately preceding index tuple. Once we
>> fetch multiple index tuples, enqueue their TIDs, and later visit the
>> heap, the next index page we visit may not contain all of the index
>> tuples deemed killable by our visit to the heap.
> 
> Is this maybe just a bookkeeping problem? A Boolean that says "you can
> kill the prior tuple" is well-suited if and only if the prior tuple is
> well-defined. But perhaps it could be replaced with something more
> sophisticated that tells you which tuples are eligible to be killed.
> 

I don't think it's just a bookkeeping problem. In a way, nbtree already
does keep an array of tuples to kill (see btgettuple), but it's always
for the current index page. So it's not that we immediately go and kill
the prior tuple - nbtree already stashes it in an array, and kills all
those tuples when moving to the next index page.

The way I understand the problem is that with prefetching we're bound to
determine the kill_prior_tuple flag with a delay, in which case we might
have already moved to the next index page ...


So to make this work, we'd need to:

1) keep index pages pinned for all "in flight" TIDs (read from the
index, not yet consumed by the index scan)

2) keep a separate array of "to be killed" index tuples for each page

3) have a more sophisticated way to decide when to kill tuples and unpin
the index page (instead of just doing it when moving to the next index page)

Maybe that's what you meant by "more sophisticated bookkeeping", ofc.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Synchronizing slots from primary to standby

2024-02-14 Thread Bertrand Drouvot
Hi,

On Wed, Feb 14, 2024 at 10:40:11AM +, Zhijie Hou (Fujitsu) wrote:
> On Wednesday, February 14, 2024 6:05 PM Amit Kapila  
> wrote:
> > 
> > On Wed, Feb 14, 2024 at 2:14 PM Amit Kapila  wrote:
> > >
> > > On Wed, Feb 14, 2024 at 9:34 AM Zhijie Hou (Fujitsu)
> > >  wrote:
> > > >
> > > > Here is V87 patch that adds test for the suggested cases.
> > > >
> > >
> > > I have pushed this patch and it leads to a BF failure:
> > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2024-02-14%2004%3A43%3A37
> > >
> > > The test failures are:
> > > #   Failed test 'logical decoding is not allowed on synced slot'
> > > #   at
> > /home/bf/bf-build/flaviventris/HEAD/pgsql/src/test/recovery/t/040_standby_f
> > ailover_slots_sync.pl
> > > line 272.
> > > #   Failed test 'synced slot on standby cannot be altered'
> > > #   at
> > /home/bf/bf-build/flaviventris/HEAD/pgsql/src/test/recovery/t/040_standby_f
> > ailover_slots_sync.pl
> > > line 281.
> > > #   Failed test 'synced slot on standby cannot be dropped'
> > > #   at
> > /home/bf/bf-build/flaviventris/HEAD/pgsql/src/test/recovery/t/040_standby_f
> > ailover_slots_sync.pl
> > > line 287.
> > >
> > > The reason is that in LOGs, we see a different ERROR message than what
> > > is expected:
> > > 2024-02-14 04:52:32.916 UTC [1767765][client backend][3/4:0] ERROR:
> > > replication slot "lsub1_slot" is active for PID 1760871
> > >
> > > Now, we see the slot still active because a test before these tests (#
> > > Test that if the synchronized slot is invalidated while the remote
> > > slot is still valid, ) is not able to successfully persist the
> > > slot and the synced temporary slot remains active.
> > >
> > > The reason is clear by referring to below standby LOGS:
> > >
> > > LOG:  connection authorized: user=bf database=postgres
> > > application_name=040_standby_failover_slots_sync.pl
> > > LOG:  statement: SELECT pg_sync_replication_slots();
> > > LOG:  dropped replication slot "lsub1_slot" of dbid 5
> > > STATEMENT:  SELECT pg_sync_replication_slots(); ...
> > > SELECT conflict_reason IS NULL AND synced FROM pg_replication_slots
> > > WHERE slot_name = 'lsub1_slot';
> > >
> > > In the above LOGs, we should ideally see: "newly created slot
> > > "lsub1_slot" is sync-ready now" after the "LOG:  dropped replication
> > > slot "lsub1_slot" of dbid 5" but lack of that means the test didn't
> > > accomplish what it was supposed to. Ideally, the same test should have
> > > failed but the pass criteria for the test failed to check whether the
> > > slot is persisted or not.
> > >
> > > The probable reason for failure is that remote_slot's restart_lsn lags
> > > behind the oldest WAL segment on standby. Now, in the test, we do
> > > ensure that the publisher and subscriber are caught up by following
> > > steps:
> > > # Enable the subscription to let it catch up to the latest wal
> > > position $subscriber1->safe_psql('postgres',
> > > "ALTER SUBSCRIPTION regress_mysub1 ENABLE");
> > >
> > > $primary->wait_for_catchup('regress_mysub1');
> > >
> > > However, this doesn't guarantee that restart_lsn is moved to a
> > > position new enough that standby has a WAL corresponding to it.
> > >
> > 
> > To ensure that restart_lsn has been moved to a recent position, we need to 
> > log
> > XLOG_RUNNING_XACTS and make sure the same is processed as well by
> > walsender. The attached patch does the required change.
> > 
> > Hou-San can reproduce this problem by adding additional checkpoints in the
> > test and after applying the attached it fixes the problem. Now, this patch 
> > is
> > mostly based on the theory we formed based on LOGs on BF and a reproducer
> > by Hou-San, so still, there is some chance that this doesn't fix the BF 
> > failures in
> > which case I'll again look into those.
> 
> I have verified that the patch can fix the issue on my machine (after adding
> a few more checkpoints before the slot invalidation test). I also added one
> more check in the test to confirm the synced slot is not a temp slot. Here is
> the v2 patch.

Thanks!

+# To ensure that restart_lsn has moved to a recent WAL position, we need
+# to log XLOG_RUNNING_XACTS and make sure the same is processed as well
+$primary->psql('postgres', "CHECKPOINT");

Instead of "CHECKPOINT" wouldn't a less heavy "SELECT 
pg_log_standby_snapshot();"
be enough?

Not a big deal but maybe we could do the change while modifying
040_standby_failover_slots_sync.pl in the next patch "Add a new slotsync 
worker".

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Fix a typo in pg_rotate_logfile

2024-02-14 Thread Daniel Gustafsson
> On 14 Feb 2024, at 11:35, Dave Page  wrote:

> That said, pgAdmin III has been out of support for many years, and as far as 
> I know, it (and similarly old versions of EDB's PEM which was based on it) 
> were the only consumers of adminpack. I would not be sad to see it removed 
> entirely

Searching on Github and Debian Codesearch I cannot find any reference to anyone
using any function from adminpack.  With pgAdminIII being EOL it might be better to
remove it now rather than be on the hook to maintain it for another 5 years
until v17 goes EOL.  It'll still be around for years in V16->.

If anyone still uses pgAdminIII then I have a hard time believing they are
diligently updating to the latest major version of postgres..

Attached is a diff to show what it would look like to remove adminpack (catalog
version bump omitted on purpose to avoid conflicts until commit).

--
Daniel Gustafsson



v1-0001-Remove-the-adminpack-extension.patch
Description: Binary data


Re: index prefetching

2024-02-14 Thread Tomas Vondra
On 2/13/24 20:54, Peter Geoghegan wrote:
> On Tue, Feb 13, 2024 at 2:01 PM Tomas Vondra
>  wrote:
>> On 2/7/24 22:48, Melanie Plageman wrote:
>> I admit I haven't thought about kill_prior_tuple until you pointed out.
>> Yeah, prefetching separates (de-synchronizes) the two scans (index and
>> heap) in a way that prevents this optimization. Or at least makes it
>> much more complex :-(
> 
> Another thing that argues against doing this is that we might not need
> to visit any more B-Tree leaf pages when there is a LIMIT n involved.
> We could end up scanning a whole extra leaf page (including all of its
> tuples) for want of the ability to "push down" a LIMIT to the index AM
> (that's not what happens right now, but it isn't really needed at all
> right now).
> 

I'm not quite sure I understand what is "this" that you argue against.
Are you saying we should not separate the two scans? If yes, is there a
better way to do this?

The LIMIT problem is not very clear to me either. Yes, if we get close
to the end of the leaf page, we may need to visit the next leaf page.
But that's kinda the whole point of prefetching - reading stuff ahead,
and reading too far ahead is an inherent risk. Isn't that a problem we
have even without LIMIT? The prefetch distance ramp up is meant to limit
the impact.

> This property of index scans is fundamental to how index scans work.
> Pinning an index page as an interlock against concurrently TID
> recycling by VACUUM is directly described by the index API docs [1],
> even (the docs actually use terms like "buffer pin" rather than
> something more abstract sounding). I don't think that anything
> affecting that behavior should be considered an implementation detail
> of the nbtree index AM as such (nor any particular index AM).
> 

Good point.

> I think that it makes sense to put the index AM in control here --
> that almost follows from what I said about the index AM API. The index
> AM already needs to be in control, in about the same way, to deal with
> kill_prior_tuple (plus it helps with the  LIMIT issue I described).
> 

In control how? What would be the control flow - what part would be
managed by the index AM?

I initially did the prefetching entirely in each index AM, but it was
suggested doing this in the executor would be better. So I gradually
moved it to the executor. But the idea to combine this with the streaming
read API seems like a move from the executor back to the lower levels ... and
now you're suggesting to make the index AM responsible for this again.

I'm not saying any of those layering options is wrong, but it's not
clear to me which is the right one.

> There doesn't necessarily need to be much code duplication to make
> that work. Offhand I suspect it would be kind of similar to how
> deletion of LP_DEAD-marked index tuples by non-nbtree index AMs gets
> by with generic logic implemented by
> index_compute_xid_horizon_for_tuples -- that's all that we need to
> determine a snapshotConflictHorizon value for recovery conflict
> purposes. Note that index_compute_xid_horizon_for_tuples() reads
> *index* pages, despite not being aware of the caller's index AM and
> index tuple format.
> 
> (The only reason why nbtree needs a custom solution is because it has
> posting list tuples to worry about, unlike GiST and unlike Hash, which
> consistently use unadorned generic IndexTuple structs with heap TID
> represented in the standard/generic way only. While these concepts
> probably all originated in nbtree, they're still not nbtree
> implementation details.)
> 

I haven't looked at the details, but I agree the LP_DEAD deletion seems
like a sensible inspiration.

>>> Having disabled kill_prior_tuple is why the mvcc test fails. Perhaps
>>> there is an easier way to fix this, as I don't think the mvcc test
>>> failed on Tomas' version.
>>>
>>
>> I kinda doubt it worked correctly, considering I simply ignored the
>> optimization. It's far more likely it just worked by luck.
> 
> The test that did fail will have only revealed that the
> kill_prior_tuple wasn't operating as  expected -- which isn't the same
> thing as giving wrong answers.
> 

Possible. But AFAIK it did fail for Melanie, and I don't have a very
good explanation for the difference in behavior.

> Note that there are various ways that concurrent TID recycling might
> prevent _bt_killitems() from setting LP_DEAD bits. It's totally
> unsurprising that breaking kill_prior_tuple in some way could be
> missed. Andres wrote the MVCC test in question precisely because
> certain aspects of kill_prior_tuple were broken for months without
> anybody noticing.
> 
> [1] https://www.postgresql.org/docs/devel/index-locking.html

Yeah. There's clearly plenty of space for subtle issues.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Stack overflow issue

2024-02-14 Thread Alexander Korotkov
Hi!

On Fri, Jan 12, 2024 at 11:00 PM Robert Haas  wrote:
> On Fri, Jan 12, 2024 at 10:12 AM Heikki Linnakangas  wrote:
> > Here's one goto-free attempt. It adds a local loop to where the
> > recursion was, so that if you have a chain of subtransactions that need
> > to be aborted in CommitTransactionCommand, they are aborted iteratively.
> > The TBLOCK_SUBCOMMIT case already had such a loop.
> >
> > I added a couple of comments in the patch marked with "REVIEWER NOTE",
> > to explain why I changed some things. They are to be removed before
> > committing.
> >
> > I'm not sure if this is better than a goto. In fact, even if we commit
> > this, I think I'd still prefer to replace the remaining recursive calls
> with a goto. Recursion feels a bit weird to me, when we're unwinding the
> > states from the stack as we go.
>
> I'm not able to quickly verify whether this version is correct, but I
> do think the code looks nicer this way.
>
> I understand that's a question of opinion rather than fact, though.

I'd like to revive this thread.  The attached 0001 patch represents my
attempt to remove recursion in
CommitTransactionCommand()/AbortCurrentTransaction() by adding a
wrapper function.  This method doesn't use goto, doesn't require many
code changes, and subjectively provides good readability.
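
For readers following along, here is a toy, self-contained sketch of the
tail-recursion-to-iteration pattern the 0001 patch uses. The names below are
invented purely for illustration and are not the actual xact.c functions; the
real patch of course keeps the existing transaction-state machinery.

#include <stdbool.h>

/* Stand-in for the number of (sub)transaction states left to unwind. */
static int	pending_levels = 3;

/*
 * Formerly the recursive step: handle one state and report whether the
 * caller still has more parent states to process.
 */
static bool
commit_command_internal(void)
{
	if (pending_levels == 0)
		return false;			/* nothing left to unwind */
	pending_levels--;			/* handle one level; this used to recurse */
	return true;
}

/* Public entry point: a simple loop replaces the tail recursion. */
static void
commit_command(void)
{
	while (commit_command_internal())
		;
}

int
main(void)
{
	commit_command();
	return pending_levels;		/* 0: every level was handled iteratively */
}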

Regarding ShowTransactionStateRec() and the memory context functions, as I
understand from this thread, they are called in states where raising an
error can lead to a panic.  So, it's preferable to change them into loops
too rather than just adding check_stack_depth().  The 0002 and 0003 patches
by Heikki posted in [1] look good to me.  Can we accept them?

Also, there are a number of recursive functions that do not seem to be
called in critical states where raising an error can lead to a panic.  I've
extracted them from [2] into the attached 0002 patch.  I'd like to push
it if there is no objection.
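
As an illustration of the kind of change 0002 makes: the recursive walker
below is entirely made up, and only the check_stack_depth() guard (declared
in miscadmin.h) is the piece actually being added to each affected function.

#include "postgres.h"
#include "miscadmin.h"			/* check_stack_depth() */

/* A hypothetical recursive structure, just to show where the guard goes. */
typedef struct ExampleNode
{
	struct ExampleNode *child;
} ExampleNode;

static void
walk_example(ExampleNode *node)
{
	/*
	 * Raise a clean "stack depth limit exceeded" error instead of crashing
	 * the backend if the recursion goes too deep.
	 */
	check_stack_depth();

	if (node->child != NULL)
		walk_example(node->child);
}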

Links.
1. 
https://www.postgresql.org/message-id/6b48c746-9704-46dc-b9be-01fe4137c824%40iki.fi
2. 
https://www.postgresql.org/message-id/4530546a-3216-eaa9-4c92-92d33290a211%40mail.ru

--
Regards,
Alexander Korotkov


0002-Add-missing-check_stack_depth-to-some-recursive-f-v3.patch
Description: Binary data


0001-Turn-tail-recursion-into-iteration-in-CommitTrans-v3.patch
Description: Binary data


Re: About a recently-added message

2024-02-14 Thread Amit Kapila
On Wed, Feb 14, 2024 at 12:57 PM Kyotaro Horiguchi
 wrote:
>
> A recent commit added the following message:
>
> > "wal_level" must be >= logical.
>
> The use of the term "logical" here is a bit confusing, as it's unclear
> whether it's meant to be a natural language word or a token. (I
> believe it to be a token.)
>
> On the contrary, we already have the following message:
>
> > wal_level must be set to "replica" or "logical" at server start.
>
> This has the conflicting policy about quotation of variable names and
> enum values.
>
> I suggest making the quoting policy consistent.
>

As per the docs [1] (In messages containing configuration variable names,
do not include quotes when the names are visibly not natural English
words, such as when they have underscores, are all-uppercase, or have
mixed case. Otherwise, quotes must be added. Do include quotes in a
message where an arbitrary variable name is to be expanded.), I think
in this case the GUC's name shouldn't have been quoted. I think the patch
did this because it was developed in parallel with a thread where we were
discussing whether to quote GUC names or not [2]. I think it is better
not to quote it, as per the docs, until we get any further clarification.

Now, I am less clear about whether to quote "logical" or not in the
above message. Do you have any suggestions?

[1] - https://www.postgresql.org/docs/devel/error-style-guide.html
[2] - 
https://www.postgresql.org/message-id/cahut+psf3newxbsfky88qn1on1_dmd6343muwdmiim2ds9a...@mail.gmail.com

-- 
With Regards,
Amit Kapila.




Re: Psql meta-command conninfo+

2024-02-14 Thread Jim Jones



On 12.02.24 15:16, Maiquel Grassi wrote:
>
> (v14)
>
>
v14 applies cleanly and the SSL info is now shown as previously
suggested. Here is a more comprehensive test:

$ /usr/local/postgres-dev/bin/psql -x "\
    host=server.uni-muenster.de
    hostaddr=172.19.42.1
    user=jim dbname=postgres
    sslrootcert=server-certificates/server.crt
    sslcert=jim-certificates/jim.crt
    sslkey=jim-certificates/jim.key"
    
psql (17devel)
SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384,
compression: off)
Type "help" for help.

postgres=# SET ROLE foo;
SET
postgres=> \conninfo+
Current Connection Information
-[ RECORD 1
]--+---
Database   | postgres
Authenticated User | jim
System User    | cert:emailAddress=j...@uni-muenster.de,CN=jim,OU=WWU
IT,O=Universitaet Muenster,L=Muenster,ST=Nordrhein-Westfalen,C=DE
Current User   | foo
Session User   | jim
Backend PID    | 278294
Server Address | 172.19.42.1
Server Port    | 5432
Client Address | 192.168.178.27
Client Port    | 54948
Socket Directory   |
Host   | server.uni-muenster.de
Encryption | SSL
Protocol   | TLSv1.3
Cipher | TLS_AES_256_GCM_SHA384
Compression    | off

postgres=> \conninfo
You are connected to database "postgres" as user "jim" on host
"server.uni-muenster.de" (address "127.0.0.1") at port "5432".
SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384,
compression: off)

It is a little odd that the "Server Address" of \conninfo+ differs from
the "address" of \conninfo...

I think the documentation could add a little more info than just this:

> When no + is specified, it simply prints a textual description of a
few connection options. When + is given, more complete information is
displayed as a table.

Perhaps briefly mentioning the returned columns or simply listing them
would be IMHO a nice addition. For some users the semantics of
"Authenticated User", "System User", "Current User" and "Session User"
can be a little confusing. And yes, I realize the current documentation
of \conninfo is already a little vague :).

Thanks for working on this

-- 
Jim





Re: Add new error_action COPY ON_ERROR "log"

2024-02-14 Thread torikoshia

On 2024-02-12 15:46, Bharath Rupireddy wrote:

Thanks for your comments.

On Mon, Jan 29, 2024 at 8:41 AM torikoshia  
wrote:


On 2024-01-27 00:43, David G. Johnston wrote:

>> I'd like to have a new option "log", which skips soft errors and
>> logs
>> information that should have resulted in errors to PostgreSQL log.
>
> user-specified tables in the same database.  Setting up temporary
> tables or unlogged tables probably is going to be a more acceptable
> methodology than trying to get to the log files.

OTOH, not everyone thinks saving the information to tables is the best idea[2].


The added NOTICE message gives a summary of how many rows were
skipped, but not the line numbers. It's hard for the users to find the
malformed data, especially when they are bulk-inserting from data
files of multiple GBs in size (hard to open such files in any editor
too).

I understand the value of writing the info to server log or tables of
users' choice as being discussed in a couple of other threads.
However, I'd prefer we do the simplest thing first before we go debate
server log or table - let the users know what line numbers are
containing the errors that COPY ignored something like [1] with a
simple change like [2].


Agreed.
Unlike my patch, it hides the error information (i.e. 22P02: invalid 
input syntax for type integer: ), but I feel it's usually 
sufficient to know the row number and column where the error occurred.



It not only helps users figure out which rows
and attributes were malformed, but also helps them redirect them to
server logs with setting log_min_messages = notice [3]. In the worst
case scenario, a problem with this one NOTICE per malformed row is
that it can overload the psql session if all the rows are malformed.
I'm not sure if this is a big problem, but IMO better than a single
summary NOTICE message and simpler than writing to tables of users'
choice.


Maybe we could do what you suggested as the behavior when on_error is set 
to 'log'?



Thoughts?

FWIW, I presented the new COPY ... ON_ERROR option feature at
Hyderabad PostgreSQL User Group meetup and it was well-received by the
audience. The audience felt it's going to be extremely helpful for
bulk-loading tasks. Thank you all for working on this feature.


Thanks for letting us know!
I hope it will not be reverted, for the audience's sake :)


[1]
postgres=# CREATE TABLE check_ign_err (n int, m int[], k int);
CREATE TABLE
postgres=# COPY check_ign_err FROM STDIN WITH (on_error ignore);
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself, or an EOF 
signal.

1   {1} 1

a   {2} 2
3   {3} 33
4   {a, 4}  4

5   {5} 5
\.
NOTICE:  detected data type incompatibility at line number 2 for
column check_ign_err, COPY n
NOTICE:  detected data type incompatibility at line number 3 for
column check_ign_err, COPY k
NOTICE:  detected data type incompatibility at line number 4 for
column check_ign_err, COPY m
NOTICE:  detected data type incompatibility at line number 5 for
column check_ign_err, COPY n
NOTICE:  4 rows were skipped due to data type incompatibility
COPY 2

[2]
diff --git a/src/backend/commands/copyfromparse.c b/src/backend/commands/copyfromparse.c
index 906756362e..93ab5d4ebb 100644
--- a/src/backend/commands/copyfromparse.c
+++ b/src/backend/commands/copyfromparse.c
@@ -961,7 +961,16 @@ NextCopyFrom(CopyFromState cstate, ExprContext *econtext,
 										  [m]))
 		{
 			cstate->num_errors++;
-			return true;
+
+			if (cstate->opts.on_error != COPY_ON_ERROR_STOP)
+			{
+				ereport(NOTICE,
+						errmsg("detected data type incompatibility at line number %llu for column %s, COPY %s",
+							   (unsigned long long) cstate->cur_lineno,
+							   cstate->cur_relname,
+							   cstate->cur_attname));
+				return true;
+			}
 		}
 
 		cstate->cur_attname = NULL;

[3]
2024-02-12 06:20:29.363 UTC [427892] NOTICE:  detected data type
incompatibility at line number 2 for column check_ign_err, COPY n
2024-02-12 06:20:29.363 UTC [427892] CONTEXT:  COPY check_ign_err,
line 2, column n: "a"
2024-02-12 06:20:29.363 UTC [427892] NOTICE:  detected data type
incompatibility at line number 3 for column check_ign_err, COPY k
2024-02-12 06:20:29.363 UTC [427892] CONTEXT:  COPY check_ign_err,
line 3, column k: "33"
2024-02-12 06:20:29.363 UTC [427892] NOTICE:  detected data type
incompatibility at line number 4 for column check_ign_err, COPY m
2024-02-12 06:20:29.363 UTC [427892] CONTEXT:  COPY check_ign_err,
line 4, column m: "{a, 4}"
2024-02-12 06:20:29.363 UTC [427892] NOTICE:  detected data type
incompatibility at line number 5 for column check_ign_err, COPY n
2024-02-12 06:20:29.363 UTC [427892] CONTEXT:  

Re: About a recently-added message

2024-02-14 Thread Amit Kapila
On Wed, Feb 14, 2024 at 1:04 PM Kyotaro Horiguchi
 wrote:
>
> Just after this, I found another inconsistency regarding quotation.
>
> > 'dbname' must be specified in "%s".
>
> The use of single quotes doesn't seem to comply with our standard.
>

Agreed, I think we have two choices here: one is to use dbname without
any quotes ("dbname must be specified in \"%s\".", ...) or to use double
quotes ("\"%s\" must be specified in \"%s\".", "dbname", ...). I see
messages like "host name must be specified", so if we want to follow
that, the earlier option makes more sense. Any suggestions?
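
Spelled out as full calls, the two alternatives would look roughly like the
following; the ereport wrapper and the conninfo variable are assumptions
added only for illustration, while the message wording is taken from above.

	/* Alternative 1: leave the field name unquoted, as in "host name must be specified". */
	ereport(ERROR,
			errmsg("dbname must be specified in \"%s\".", conninfo));

	/* Alternative 2: quote both the field name and the string it must appear in. */
	ereport(ERROR,
			errmsg("\"%s\" must be specified in \"%s\".", "dbname", conninfo));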

-- 
With Regards,
Amit Kapila.




Re: Improve eviction algorithm in ReorderBuffer

2024-02-14 Thread Ajin Cherian
On Sat, Feb 10, 2024 at 2:23 AM Masahiko Sawada 
wrote:

> On Fri, Feb 9, 2024 at 7:35 PM Ajin Cherian  wrote:
> >
> >
> >
> > On Tue, Feb 6, 2024 at 5:06 PM Masahiko Sawada 
> wrote:
> >>
> >>
> >> I've attached the new version patch set.
> >>
> >> Regards,
> >>
> >>
> >> --
> >> Masahiko Sawada
> >> Amazon Web Services: https://aws.amazon.com
> >
> >
> > Thanks for the patch. I reviewed that patch and did minimal testing and
> it seems to show the speed up as claimed. Some minor comments:
>
> Thank you for the comments!
>
> > patch 0001:
> >
> > +static void
> > +bh_enlarge_node_array(binaryheap *heap)
> > +{
> > + if (heap->bh_size < heap->bh_space)
> > + return;
> >
> > why not check "if (heap->bh_size >= heap->bh_space)" outside this
> function to avoid calling this function when not necessary? This check was
> there in code before the patch.
>
> Agreed.
>
> >
> > patch 0003:
> >
> > +/*
> > + * The threshold of the number of transactions in the max-heap
> (rb->txn_heap)
> > + * to switch the state.
> > + */
> > +#define REORDE_BUFFER_MEM_TRACK_THRESHOLD 1024
> >
> > Typo: I think you meant REORDER_ and not REORDE_
>
> Fixed.
>
> These comments are addressed in the v4 patch set I just shared[1].
>
>
These changes look good to me. I've done some tests with a few varying
levels of subtransactions and I could see that the patch was at least 5%
better in all of them.
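
To make the suggestion quoted above concrete: keep the cheap capacity check
at the call site, so the common (no-growth) path avoids a function call.
A rough sketch, using field names from src/common/binaryheap.c and the
quoted patch; the doubling and repalloc details are my assumption, not the
actual patch.

static void
bh_enlarge_node_array(binaryheap *heap)
{
	heap->bh_space *= 2;
	heap->bh_nodes = repalloc(heap->bh_nodes,
							  sizeof(bh_node_type) * heap->bh_space);
}

void
binaryheap_add_unordered(binaryheap *heap, bh_node_type d)
{
	/* Grow only when needed; the check itself stays in the caller. */
	if (heap->bh_size >= heap->bh_space)
		bh_enlarge_node_array(heap);

	heap->bh_has_heap_property = false;
	heap->bh_nodes[heap->bh_size++] = d;
}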

regards,
Ajin Cherian
Fujitsu Australia


RE: Synchronizing slots from primary to standby

2024-02-14 Thread Zhijie Hou (Fujitsu)
On Wednesday, February 14, 2024 6:05 PM Amit Kapila  
wrote:
> 
> On Wed, Feb 14, 2024 at 2:14 PM Amit Kapila  wrote:
> >
> > On Wed, Feb 14, 2024 at 9:34 AM Zhijie Hou (Fujitsu)
> >  wrote:
> > >
> > > Here is V87 patch that adds test for the suggested cases.
> > >
> >
> > I have pushed this patch and it leads to a BF failure:
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris
> > t=2024-02-14%2004%3A43%3A37
> >
> > The test failures are:
> > #   Failed test 'logical decoding is not allowed on synced slot'
> > #   at
> /home/bf/bf-build/flaviventris/HEAD/pgsql/src/test/recovery/t/040_standby_f
> ailover_slots_sync.pl
> > line 272.
> > #   Failed test 'synced slot on standby cannot be altered'
> > #   at
> /home/bf/bf-build/flaviventris/HEAD/pgsql/src/test/recovery/t/040_standby_f
> ailover_slots_sync.pl
> > line 281.
> > #   Failed test 'synced slot on standby cannot be dropped'
> > #   at
> /home/bf/bf-build/flaviventris/HEAD/pgsql/src/test/recovery/t/040_standby_f
> ailover_slots_sync.pl
> > line 287.
> >
> > The reason is that in LOGs, we see a different ERROR message than what
> > is expected:
> > 2024-02-14 04:52:32.916 UTC [1767765][client backend][3/4:0] ERROR:
> > replication slot "lsub1_slot" is active for PID 1760871
> >
> > Now, we see the slot still active because a test before these tests (#
> > Test that if the synchronized slot is invalidated while the remote
> > slot is still valid, ) is not able to successfully persist the
> > slot and the synced temporary slot remains active.
> >
> > The reason is clear by referring to below standby LOGS:
> >
> > LOG:  connection authorized: user=bf database=postgres
> > application_name=040_standby_failover_slots_sync.pl
> > LOG:  statement: SELECT pg_sync_replication_slots();
> > LOG:  dropped replication slot "lsub1_slot" of dbid 5
> > STATEMENT:  SELECT pg_sync_replication_slots(); ...
> > SELECT conflict_reason IS NULL AND synced FROM pg_replication_slots
> > WHERE slot_name = 'lsub1_slot';
> >
> > In the above LOGs, we should ideally see: "newly created slot
> > "lsub1_slot" is sync-ready now" after the "LOG:  dropped replication
> > slot "lsub1_slot" of dbid 5" but lack of that means the test didn't
> > accomplish what it was supposed to. Ideally, the same test should have
> > failed but the pass criteria for the test failed to check whether the
> > slot is persisted or not.
> >
> > The probable reason for failure is that remote_slot's restart_lsn lags
> > behind the oldest WAL segment on standby. Now, in the test, we do
> > ensure that the publisher and subscriber are caught up by following
> > steps:
> > # Enable the subscription to let it catch up to the latest wal
> > position $subscriber1->safe_psql('postgres',
> > "ALTER SUBSCRIPTION regress_mysub1 ENABLE");
> >
> > $primary->wait_for_catchup('regress_mysub1');
> >
> > However, this doesn't guarantee that restart_lsn is moved to a
> > position new enough that standby has a WAL corresponding to it.
> >
> 
> To ensure that restart_lsn has been moved to a recent position, we need to log
> XLOG_RUNNING_XACTS and make sure the same is processed as well by
> walsender. The attached patch does the required change.
> 
> Hou-San can reproduce this problem by adding additional checkpoints in the
> test and after applying the attached it fixes the problem. Now, this patch is
> mostly based on the theory we formed based on LOGs on BF and a reproducer
> by Hou-San, so still, there is some chance that this doesn't fix the BF 
> failures in
> which case I'll again look into those.

I have verified that the patch fixes the issue on my machine (after adding a few
more checkpoints before the slot invalidation test). I also added one more check in
the test to confirm that the synced slot is not a temporary slot. Here is the v2 patch.

Best Regards,
Hou zj


v2-0001-fix-040-standby-failover-slots-sync.patch
Description: v2-0001-fix-040-standby-failover-slots-sync.patch


Re: Fix a typo in pg_rotate_logfile

2024-02-14 Thread Dave Page
Hi

On Mon, 12 Feb 2024 at 21:31, Bharath Rupireddy <
bharath.rupireddyforpostg...@gmail.com> wrote:

> On Tue, Feb 13, 2024 at 2:29 AM Daniel Gustafsson  wrote:
> >
> > On that note though, we might want to consider just dropping it
> altogether in
> > v17 (while fixing the incorrect hint in backbranches)?  I can't imagine
> > adminpack 1.0 being in heavy use today, and skimming pgAdmin code it
> seems it's
> > only used in pgAdmin3 and not 4. Maybe it's time to simply drop old code?
>
> https://codesearch.debian.net/search?q=pg_logfile_rotate=1
> shows no users for it though. There's pgadmin3 using it
>
> https://github.com/search?q=repo%3Apgadmin-org%2Fpgadmin3%20pg_logfile_rotate=code
> ,
> however the repo is archived. Surprisingly, core has to maintain the
> old code needed for adminpack 1.0 - pg_rotate_logfile_old SQL function
> and pg_rotate_logfile function in signalfuncs.c. These things could
> have been moved to adminpack.c back then and pointed CREATE FUNCTION
> pg_catalog.pg_logfile_rotate() to use it from adminpack.c. If we
> decide to remove adminpack 1.0 version completely, the 1.0 functions
> pg_file_read, pg_file_length and pg_logfile_rotate will also go away
> making adminpack code simpler.
>
> Having said that, it's good to hear from others, preferably from
> pgadmin developers - added Dave Page (dp...@pgadmin.org) in here for
> inputs.
>

As it happens we're currently implementing a redesigned version of that
functionality from pgAdmin III in pgAdmin 4. However, we are not using
adminpack for it.

FWIW, the reason for the weird naming is that originally all the
functionality for reading/managing files was added entirely as the
adminpack extension. It was only later that some of the functionality was
moved into core, and renamed along the way (everyone likes blue for their
bikeshed right?). The old functions (albeit, rewritten to use the new core
functions) were kept in adminpack for backwards compatibility.

That said, pgAdmin III has been out of support for many years, and as far
as I know, it (and similarly old versions of EDB's PEM which was based on
it) were the only consumers of adminpack. I would not be sad to see it
removed entirely - except for the fact that I fondly remember being invited
to join -core immediately after a heated discussion with Tom about it!

-- 
Dave Page
pgAdmin: https://www.pgadmin.org
PostgreSQL: https://www.postgresql.org
EDB: https://www.enterprisedb.com


Can we include capturing logs of pgdata/pg_upgrade_output.d/*/log in buildfarm

2024-02-14 Thread vignesh C
Hi,

Recently there have been a few upgrade TAP test failures in the buildfarm,
like in [1] & [2]. Analysing these failures requires the log files
that are generated by src/bin/pg_upgrade at the following
locations:
tmp_check/*/pgdata/pg_upgrade_output.d/*/*.txt   - e.g.
tmp_check/t_004_subscription_new_sub1_data/pgdata/pg_upgrade_output.d/20240214T052229.045/subs_invalid.txt
tmp_check/*/pgdata/pg_upgrade_output.d/*/*/*.log - e.g.
tmp_check/t_004_subscription_new_sub1_data/pgdata/pg_upgrade_output.d/20240214T052229.045/log/pg_upgrade_server.log

The first wildcard is the testname_clusterinstance_data directory, the second
is the timestamp used by pg_upgrade, the third matches the text files
generated by pg_upgrade, and the fourth matches the log files
generated by pg_upgrade.

Can we include these log files also in the buildfarm?

[1] - 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo=2024-02-10%2007%3A03%3A10
[2] - 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo=2023-12-07%2003%3A56%3A20

Regards,
Vignesh




Re: Synchronizing slots from primary to standby

2024-02-14 Thread Amit Kapila
On Wed, Feb 14, 2024 at 2:14 PM Amit Kapila  wrote:
>
> On Wed, Feb 14, 2024 at 9:34 AM Zhijie Hou (Fujitsu)
>  wrote:
> >
> > Here is V87 patch that adds test for the suggested cases.
> >
>
> I have pushed this patch and it leads to a BF failure:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris=2024-02-14%2004%3A43%3A37
>
> The test failures are:
> #   Failed test 'logical decoding is not allowed on synced slot'
> #   at 
> /home/bf/bf-build/flaviventris/HEAD/pgsql/src/test/recovery/t/040_standby_failover_slots_sync.pl
> line 272.
> #   Failed test 'synced slot on standby cannot be altered'
> #   at 
> /home/bf/bf-build/flaviventris/HEAD/pgsql/src/test/recovery/t/040_standby_failover_slots_sync.pl
> line 281.
> #   Failed test 'synced slot on standby cannot be dropped'
> #   at 
> /home/bf/bf-build/flaviventris/HEAD/pgsql/src/test/recovery/t/040_standby_failover_slots_sync.pl
> line 287.
>
> The reason is that in LOGs, we see a different ERROR message than what
> is expected:
> 2024-02-14 04:52:32.916 UTC [1767765][client backend][3/4:0] ERROR:
> replication slot "lsub1_slot" is active for PID 1760871
>
> Now, we see the slot still active because a test before these tests (#
> Test that if the synchronized slot is invalidated while the remote
> slot is still valid, ) is not able to successfully persist the
> slot and the synced temporary slot remains active.
>
> The reason is clear by referring to below standby LOGS:
>
> LOG:  connection authorized: user=bf database=postgres
> application_name=040_standby_failover_slots_sync.pl
> LOG:  statement: SELECT pg_sync_replication_slots();
> LOG:  dropped replication slot "lsub1_slot" of dbid 5
> STATEMENT:  SELECT pg_sync_replication_slots();
> ...
> SELECT conflict_reason IS NULL AND synced FROM pg_replication_slots
> WHERE slot_name = 'lsub1_slot';
>
> In the above LOGs, we should ideally see: "newly created slot
> "lsub1_slot" is sync-ready now" after the "LOG:  dropped replication
> slot "lsub1_slot" of dbid 5" but lack of that means the test didn't
> accomplish what it was supposed to. Ideally, the same test should have
> failed but the pass criteria for the test failed to check whether the
> slot is persisted or not.
>
> The probable reason for failure is that remote_slot's restart_lsn lags
> behind the oldest WAL segment on standby. Now, in the test, we do
> ensure that the publisher and subscriber are caught up by following
> steps:
> # Enable the subscription to let it catch up to the latest wal position
> $subscriber1->safe_psql('postgres',
> "ALTER SUBSCRIPTION regress_mysub1 ENABLE");
>
> $primary->wait_for_catchup('regress_mysub1');
>
> However, this doesn't guarantee that restart_lsn is moved to a
> position new enough that standby has a WAL corresponding to it.
>

To ensure that restart_lsn has been moved to a recent position, we
need to log XLOG_RUNNING_XACTS and make sure the same is processed as
well by walsender. The attached patch does the required change.

Hou-San can reproduce this problem by adding additional checkpoints in
the test, and applying the attached patch fixes it. Now, this patch is
mostly based on the theory we formed from the LOGs on BF and a
reproducer by Hou-San, so there is still some chance that this doesn't
fix the BF failures, in which case I'll look into those again.

-- 
With Regards,
Amit Kapila.


fix_040_standby_failover_slots_sync.1.patch
Description: Binary data


Re: About a recently-added message

2024-02-14 Thread Amit Kapila
On Wed, Feb 14, 2024 at 1:04 PM Kyotaro Horiguchi
 wrote:
>
> At Wed, 14 Feb 2024 16:26:52 +0900 (JST), Kyotaro Horiguchi 
>  wrote in
> > > "wal_level" must be >= logical.
> ..
> > > wal_level must be set to "replica" or "logical" at server start.
> ..
> > I suggest making the quoting policy consistent.
>
> Just after this, I found another inconsistency regarding quotation.
>
> > 'dbname' must be specified in "%s".
>
> The use of single quotes doesn't seem to comply with our standard.
>

Thanks for the report. I'll look into it after analyzing the BF
failure caused by the same commit.

-- 
With Regards,
Amit Kapila.




Re: Synchronizing slots from primary to standby

2024-02-14 Thread Amit Kapila
On Wed, Feb 14, 2024 at 9:34 AM Zhijie Hou (Fujitsu)
 wrote:
>
> Here is V87 patch that adds test for the suggested cases.
>

I have pushed this patch and it leads to a BF failure:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris=2024-02-14%2004%3A43%3A37

The test failures are:
#   Failed test 'logical decoding is not allowed on synced slot'
#   at 
/home/bf/bf-build/flaviventris/HEAD/pgsql/src/test/recovery/t/040_standby_failover_slots_sync.pl
line 272.
#   Failed test 'synced slot on standby cannot be altered'
#   at 
/home/bf/bf-build/flaviventris/HEAD/pgsql/src/test/recovery/t/040_standby_failover_slots_sync.pl
line 281.
#   Failed test 'synced slot on standby cannot be dropped'
#   at 
/home/bf/bf-build/flaviventris/HEAD/pgsql/src/test/recovery/t/040_standby_failover_slots_sync.pl
line 287.

The reason is that in LOGs, we see a different ERROR message than what
is expected:
2024-02-14 04:52:32.916 UTC [1767765][client backend][3/4:0] ERROR:
replication slot "lsub1_slot" is active for PID 1760871

Now, we see the slot still active because a test before these tests (#
Test that if the synchronized slot is invalidated while the remote
slot is still valid, ) is not able to successfully persist the
slot and the synced temporary slot remains active.

The reason is clear by referring to below standby LOGS:

LOG:  connection authorized: user=bf database=postgres
application_name=040_standby_failover_slots_sync.pl
LOG:  statement: SELECT pg_sync_replication_slots();
LOG:  dropped replication slot "lsub1_slot" of dbid 5
STATEMENT:  SELECT pg_sync_replication_slots();
...
SELECT conflict_reason IS NULL AND synced FROM pg_replication_slots
WHERE slot_name = 'lsub1_slot';

In the above LOGs, we should ideally see: "newly created slot
"lsub1_slot" is sync-ready now" after the "LOG:  dropped replication
slot "lsub1_slot" of dbid 5" but lack of that means the test didn't
accomplish what it was supposed to. Ideally, the same test should have
failed but the pass criteria for the test failed to check whether the
slot is persisted or not.

The probable reason for failure is that remote_slot's restart_lsn lags
behind the oldest WAL segment on standby. Now, in the test, we do
ensure that the publisher and subscriber are caught up by following
steps:
# Enable the subscription to let it catch up to the latest wal position
$subscriber1->safe_psql('postgres',
"ALTER SUBSCRIPTION regress_mysub1 ENABLE");

$primary->wait_for_catchup('regress_mysub1');

However, this doesn't guarantee that restart_lsn is moved to a
position new enough that the standby has WAL corresponding to it. One
easy fix is to re-create the subscription with the same slot_name
after we have ensured that the slot has been invalidated on the standby,
so that a new restart_lsn is assigned to the slot, but it is better to
analyze some more why the slot's restart_lsn only sometimes fails to
move far enough.

-- 
With Regards,
Amit Kapila.




RE: speed up a logical replica setup

2024-02-14 Thread Hayato Kuroda (Fujitsu)
Dear Euler,

>
If someone modifies data after promotion, fine; she has to deal with conflicts,
if any. IMO it is solved adding one or two sentences in the documentation.
>

OK. I could not find any issues, for now.

>
> >
> Regarding
> GUCs, almost all of them is PGC_POSTMASTER (so it cannot be modified unless 
> the
> server is restarted). The ones that are not PGC_POSTMASTER, does not affect 
> the
> pg_createsubscriber execution [1].
> >
> 
> IIUC,  primary_conninfo and primary_slot_name is PGC_SIGHUP.

Ditto.
>

Just to confirm - even if primary_slot_name were changed during
the conversion, the slot that was initially set would be dropped. Currently we
have not found any issues.

>
> >
> I'm just pointing out that this case is a different from pg_upgrade (from 
> which
> this idea was taken). I'm not saying that's a bad idea. I'm just arguing that
> you might be preventing some access read only access (monitoring) when it is
> perfectly fine to connect to the database and execute queries. As I said
> before, the current UI allows anyone to setup the standby to accept only local
> connections. Of course, it is an extra step but it is possible. However, once
> you apply v16-0007, there is no option but use only local connection during 
> the
> transformation. Is it an acceptable limitation?
> >
> 
> My remaining concern is written above. If they are not problematic, we may
> not have to restrict them for now. In that case, the changes
> 
> 1) overwriting a port number,
> 2) setting listen_addresses = ''

It can be implemented later if people are excited by it.

> are not needed, right? IIUC inconsistency of -P may be still problematic.

I still think we shouldn't have only the transformed primary_conninfo as
option.
>


Hmm, OK. So let me summarize the current status and discussion.

Policy)

Basically, we do not prohibit connections to the primary/standby.
primary_slot_name may be changed during the conversion, and
tuples may be inserted on the target just after the promotion, but neither
seems to cause issues.

API)

-D (data directory) and -d (databases) are definitely needed.

Regarding -P (a connection string for the source), we can require it for now.
But note that it may cause an inconsistency if the node pointed to by -P is
different from the node pointed to by primary_conninfo.

As for the connection string for the target server, we can choose between
two approaches:
a)
accept a native connection string as -S. This can reuse the same parsing
mechanism as -P, but it leaves room for a non-local server to be specified.

b)
accept a username/port as -U/-p
(Since the policy is as above, listen_addresses would not be overwritten.
Also, the port just specifies the listening port).
This avoids connecting to a non-local server, but more options may be needed.
(E.g., is a socket directory needed? What about a password?)

Other discussion points / reported issues)

Points raised by me [1] are not solved yet.

* What if the target version is PG16 or older?
* What if the found executables have a different version from pg_createsubscriber?
* What if the target is sending WAL to another server?
   I.e., there are clusters like `node1->node2->node3`, and the target is node2.
* Can we really clean up the standby in case of failure?
   Shouldn't we suggest removing the target and starting over?
* Can we move outputs to stdout?

[1]: 
https://www.postgresql.org/message-id/TYCPR01MB1207713BEC5C379A05D65E342F54B2%40TYCPR01MB12077.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/global/