Re: confirmed flush lsn seems to be move backward in certain error cases

2024-02-16 Thread Amit Kapila
On Fri, Feb 16, 2024 at 5:53 PM vignesh C  wrote:
>
>
> After the insert operation is replicated to the subscriber, the
> subscriber will set the lsn value sent by the publisher in the
> replication origin (in my case it was 0/1510978). The publisher will
> then send keepalive messages with the current WAL position in the
> publisher (in my case it was 0/15109B0), but the subscriber will simply
> send this position as the flush_lsn to the publisher as there are no
> ongoing transactions. Then, once the publisher is started, it will
> identify that the publication does not exist and stop the
> walsender/apply worker process. When the apply worker is restarted, we
> will get the remote_lsn (in my case it was 0/1510978) of the origin and
> set it to origin_startpos. We will start the apply worker with this
> origin_startpos (the origin's remote_lsn). This position will be sent
> as feedback to the walsender process from the below stack:
> run_apply_worker->start_apply->LogicalRepApplyLoop->send_feedback.
> It will use the following send_feedback() call of the
> LogicalRepApplyLoop function, as in the code below, since nothing is
> received from the walsender:
> LogicalRepApplyLoop function
> ...
> len = walrcv_receive(LogRepWorkerWalRcvConn, &buf, &fd);
> if (len != 0)
> {
>/* Loop to process all available data (without blocking). */
>for (;;)
>{
>   CHECK_FOR_INTERRUPTS();
>   ...
>}
> }
>
> /* confirm all writes so far */
> send_feedback(last_received, false, false);
> ...
>
> In send_feedback, we will set flushpos to the replication origin's
> remote_lsn and send it to the walsender process. The walsender process
> will receive this information and set confirmed_flush in:
> ProcessStandbyReplyMessage->LogicalConfirmReceivedLocation
>
> Then, when we immediately try to stop the publisher instance, a
> shutdown checkpoint will be triggered. In this case,
> confirmed_flush = 0/1510978 will be less than
> last_saved_confirmed_flush = 0/15109B0, which will result in an
> assertion failure.
>
> This issue is happening because we allow setting the confirmed_flush
> to a backward position.
>

I see your point.

> There are a couple of ways to fix this:
> a) One way is not to update the confirmed_flush if the lsn sent is an
> older value, as in Confirm_flush_dont_allow_backward.patch
>

@@ -1839,7 +1839,8 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn)

  SpinLockAcquire(&MyReplicationSlot->mutex);

- MyReplicationSlot->data.confirmed_flush = lsn;
+ if (lsn > MyReplicationSlot->data.confirmed_flush)
+ MyReplicationSlot->data.confirmed_flush = lsn;

  /* if we're past the location required for bumping xmin, do so */
  if (MyReplicationSlot->candidate_xmin_lsn != InvalidXLogRecPtr &&
@@ -1904,7 +1905,8 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn)
  else
  {
  SpinLockAcquire(&MyReplicationSlot->mutex);
- MyReplicationSlot->data.confirmed_flush = lsn;
+ if (lsn > MyReplicationSlot->data.confirmed_flush)
+ MyReplicationSlot->data.confirmed_flush = lsn;

BTW, from which code path does it update the prior value of
confirmed_flush? If it is through the else branch, then can we see if
it may change confirmed_flush to the prior position via the first
code path? I am asking because in the first code path, we can even
flush the retreated value of the confirmed_flush LSN.

-- 
With Regards,
Amit Kapila.




Re: Add new error_action COPY ON_ERROR "log"

2024-02-16 Thread Bharath Rupireddy
On Fri, Feb 16, 2024 at 8:17 PM torikoshia  wrote:
>
> I may be wrong since I seldom do data loading tasks, but I agree with
> you.
>
> I am also a little concerned about the case where there is a lot of
> malformed data and it causes lots of messages, but the information is
> usually valuable, and if users don't need it, they can suppress it by
> changing client_min_messages.
>
> Currently both the summary of failures and the individual information
> are logged at NOTICE level.
> If we should assume that there are cases where only the summary
> information is required, it'd be useful to set a lower log level, i.e.
> LOG, for the individual information.

How about we emit the summary at INFO level and the individual
information at NOTICE level? With this, the summary is given a
different priority than the individual info. With SET
client_min_messages = WARNING; one can still get the summary but not
the individual info. Also, to get all of this into the server log, one
can SET log_min_messages = INFO; or SET log_min_messages = NOTICE;.

Thoughts?
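
For background on why that works: INFO is special-cased in elog.c so it
always reaches the client, while NOTICE obeys client_min_messages. A
paraphrase of that gating logic (my own sketch, not the actual source;
the level values mirror elog.h's ordering):

/* Message levels in increasing severity, as in elog.h. */
enum { DEBUG1 = 14, LOG = 15, INFO = 17, NOTICE = 18, WARNING = 19 };

static int client_min_messages = NOTICE;	/* GUC, default NOTICE */
static int log_min_messages = WARNING;		/* GUC, default WARNING */

/* INFO always reaches the client, regardless of client_min_messages. */
static bool
should_output_to_client(int elevel)
{
	return elevel >= client_min_messages || elevel == INFO;
}

/* Server-log output follows log_min_messages with no special case. */
static bool
should_output_to_server(int elevel)
{
	return elevel >= log_min_messages;
}

So with the proposed split, SET client_min_messages = WARNING keeps the
INFO summary while silencing the per-row NOTICEs.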

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From 3cdb3512ec1cffbaeae00d0e3cc41c57021fd6ee Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Sat, 17 Feb 2024 05:43:19 +
Subject: [PATCH v2] Add detailed info when COPY skips soft errors

This commit emits individual info like the line number and column name
when COPY skips soft errors, because the summary containing the total
rows skipped isn't enough for users to know which rows in the input
data were malformed.

It emits the individual info and the summary at NOTICE and INFO level
respectively, to let users switch off the individual info by changing
client_min_messages to WARNING. Also, one can get all of this
information into the server log by changing log_min_messages.
---
 src/backend/commands/copyfrom.c  |  2 +-
 src/backend/commands/copyfromparse.c | 12 +++-
 src/test/regress/expected/copy2.out  |  6 +-
 3 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/src/backend/commands/copyfrom.c b/src/backend/commands/copyfrom.c
index 1fe70b9133..e11c2d1cff 100644
--- a/src/backend/commands/copyfrom.c
+++ b/src/backend/commands/copyfrom.c
@@ -1314,7 +1314,7 @@ CopyFrom(CopyFromState cstate)
 
 	if (cstate->opts.on_error != COPY_ON_ERROR_STOP &&
 		cstate->num_errors > 0)
-		ereport(NOTICE,
+		ereport(INFO,
 errmsg_plural("%llu row was skipped due to data type incompatibility",
 			  "%llu rows were skipped due to data type incompatibility",
 			  (unsigned long long) cstate->num_errors,
diff --git a/src/backend/commands/copyfromparse.c b/src/backend/commands/copyfromparse.c
index 7cacd0b752..747e173d9c 100644
--- a/src/backend/commands/copyfromparse.c
+++ b/src/backend/commands/copyfromparse.c
@@ -969,7 +969,17 @@ NextCopyFrom(CopyFromState cstate, ExprContext *econtext,
 			&values[m]))
 			{
 cstate->num_errors++;
-return true;
+
+if (cstate->opts.on_error != COPY_ON_ERROR_STOP)
+{
+	ereport(NOTICE,
+			errmsg("detected data type incompatibility at line number %llu for column %s; COPY %s",
+   (unsigned long long) cstate->cur_lineno,
+   cstate->cur_attname,
+   cstate->cur_relname));
+
+	return true;
+}
 			}
 
 			cstate->cur_attname = NULL;
diff --git a/src/test/regress/expected/copy2.out b/src/test/regress/expected/copy2.out
index 25c401ce34..15a1da2eac 100644
--- a/src/test/regress/expected/copy2.out
+++ b/src/test/regress/expected/copy2.out
@@ -730,7 +730,11 @@ COPY check_ign_err FROM STDIN WITH (on_error stop);
 ERROR:  invalid input syntax for type integer: "a"
 CONTEXT:  COPY check_ign_err, line 2, column n: "a"
 COPY check_ign_err FROM STDIN WITH (on_error ignore);
-NOTICE:  4 rows were skipped due to data type incompatibility
+NOTICE:  detected data type incompatibility at line number 2 for column n; COPY check_ign_err
+NOTICE:  detected data type incompatibility at line number 3 for column k; COPY check_ign_err
+NOTICE:  detected data type incompatibility at line number 4 for column m; COPY check_ign_err
+NOTICE:  detected data type incompatibility at line number 5 for column n; COPY check_ign_err
+INFO:  4 rows were skipped due to data type incompatibility
 SELECT * FROM check_ign_err;
  n |  m  | k 
 ---+-+---
-- 
2.34.1



Re: pg_upgrade and logical replication

2024-02-16 Thread Amit Kapila
On Sat, Feb 17, 2024 at 10:05 AM vignesh C  wrote:
>
> On Fri, 16 Feb 2024 at 10:50, Hayato Kuroda (Fujitsu)
>  wrote:
>
> Thanks for the updated patch. A few suggestions:
> 1) This can be moved, to keep it similar to other tests:
> +# Setup a disabled subscription. The upcoming test will check the
> +# pg_createsubscriber won't work, so it is sufficient.
> +$publisher->safe_psql('postgres', "CREATE PUBLICATION regress_pub1");
> +$old_sub->safe_psql('postgres',
> +   "CREATE SUBSCRIPTION regress_sub1 CONNECTION '$connstr'
> PUBLICATION regress_pub1 WITH (enabled = false)"
> +);
> +
> +$old_sub->stop;
> +
> +# --
> +# Check that pg_upgrade fails when max_replication_slots configured in the new
> +# cluster is less than the number of subscriptions in the old cluster.
> +# --
> +$new_sub->append_conf('postgresql.conf', "max_replication_slots = 0");
> +
> +# pg_upgrade will fail because the new cluster has insufficient
> +# max_replication_slots.
> +command_checks_all(
> +   [
> +   'pg_upgrade', '--no-sync', '-d', $old_sub->data_dir,
> +   '-D', $new_sub->data_dir, '-b', $oldbindir,
> +   '-B', $newbindir, '-s', $new_sub->host,
> +   '-p', $old_sub->port, '-P', $new_sub->port,
> +   $mode, '--check',
> +   ],
>
> like below and the extra comment can be removed:
> +# --
> +# Check that pg_upgrade fails when max_replication_slots configured in the 
> new
> +# cluster is less than the number of subscriptions in the old cluster.
> +# --
> +# Create a disabled subscription.
>

It is okay to adjust as you are suggesting, but I find Kuroda-San's
comment better than just saying "Create a disabled subscription.", as
it explicitly tells why it is okay to create a disabled subscription.

>
> 3) The old comments were slightly better:
> # Resume the initial sync and wait until all tables of subscription
> # 'regress_sub5' are synchronized
> $new_sub->append_conf('postgresql.conf',
> "max_logical_replication_workers = 10");
> $new_sub->restart;
> $new_sub->safe_psql('postgres', "ALTER SUBSCRIPTION regress_sub5 ENABLE");
> $new_sub->wait_for_subscription_sync($publisher, 'regress_sub5');
>
> Like:
> # Enable the subscription
> $new_sub->safe_psql('postgres', "ALTER SUBSCRIPTION regress_sub5 ENABLE");
>
> # Wait until all tables of subscription 'regress_sub5' are synchronized
> $new_sub->wait_for_subscription_sync($publisher, 'regress_sub5');
>

I would prefer Kuroda-San's version, as his comment explains the intent
of the test better, whereas your suggestion just restates what the next
line of code is doing and is self-explanatory.

-- 
With Regards,
Amit Kapila.




Re: Improve WALRead() to suck data directly from WAL buffers when possible

2024-02-16 Thread Bharath Rupireddy
On Fri, Feb 16, 2024 at 11:01 PM Jeff Davis  wrote:
>
> > Here, I'm with v23 patch set:
>
> Thank you, I'll look at these.

Thanks. Here's the v24 patch set after rebasing.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From d0317ed91b1483a5556c87388e0186462711e022 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Sat, 17 Feb 2024 04:41:29 +
Subject: [PATCH v24 4/4] Demonstrate reading unflushed WAL directly from WAL
 buffers

---
 src/backend/access/transam/xlogreader.c   |   3 +-
 .../read_wal_from_buffers--1.0.sql|  23 ++
 .../read_wal_from_buffers.c   | 266 +-
 .../read_wal_from_buffers/t/001_basic.pl  |  35 +++
 4 files changed, 325 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index ae9904e7e4..4658a86997 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1035,7 +1035,8 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 	 * record is.  This is so that we can check the additional identification
 	 * info that is present in the first page's "long" header.
 	 */
-	if (targetSegNo != state->seg.ws_segno && targetPageOff != 0)
+	if (state->seg.ws_segno != 0 &&
+		targetSegNo != state->seg.ws_segno && targetPageOff != 0)
 	{
 		XLogRecPtr	targetSegmentPtr = pageptr - targetPageOff;
 
diff --git a/src/test/modules/read_wal_from_buffers/read_wal_from_buffers--1.0.sql b/src/test/modules/read_wal_from_buffers/read_wal_from_buffers--1.0.sql
index 82fa097d10..72d05522fc 100644
--- a/src/test/modules/read_wal_from_buffers/read_wal_from_buffers--1.0.sql
+++ b/src/test/modules/read_wal_from_buffers/read_wal_from_buffers--1.0.sql
@@ -12,3 +12,26 @@ CREATE FUNCTION read_wal_from_buffers(IN lsn pg_lsn, IN bytes_to_read int,
 bytes_read OUT int)
 AS 'MODULE_PATHNAME', 'read_wal_from_buffers'
 LANGUAGE C STRICT;
+
+--
+-- get_wal_records_info_from_buffers()
+--
+-- SQL function to get info of WAL records available in WAL buffers.
+--
+CREATE FUNCTION get_wal_records_info_from_buffers(IN start_lsn pg_lsn,
+IN end_lsn pg_lsn,
+OUT start_lsn pg_lsn,
+OUT end_lsn pg_lsn,
+OUT prev_lsn pg_lsn,
+OUT xid xid,
+OUT resource_manager text,
+OUT record_type text,
+OUT record_length int4,
+OUT main_data_length int4,
+OUT fpi_length int4,
+OUT description text,
+OUT block_ref text
+)
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'get_wal_records_info_from_buffers'
+LANGUAGE C STRICT PARALLEL SAFE;
diff --git a/src/test/modules/read_wal_from_buffers/read_wal_from_buffers.c b/src/test/modules/read_wal_from_buffers/read_wal_from_buffers.c
index 9df5c07b4b..ed33a14127 100644
--- a/src/test/modules/read_wal_from_buffers/read_wal_from_buffers.c
+++ b/src/test/modules/read_wal_from_buffers/read_wal_from_buffers.c
@@ -14,11 +14,27 @@
 #include "postgres.h"
 
 #include "access/xlog.h"
-#include "fmgr.h"
+#include "access/xlog_internal.h"
+#include "access/xlogreader.h"
+#include "access/xlogrecovery.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "utils/builtins.h"
 #include "utils/pg_lsn.h"
 
 PG_MODULE_MAGIC;
 
+static int	read_from_wal_buffers(XLogReaderState *state, XLogRecPtr targetPagePtr,
+  int reqLen, XLogRecPtr targetRecPtr,
+  char *cur_page);
+
+static XLogRecord *ReadNextXLogRecord(XLogReaderState *xlogreader);
+static void GetWALRecordInfo(XLogReaderState *record, Datum *values,
+			 bool *nulls, uint32 ncols);
+static void GetWALRecordsInfo(FunctionCallInfo fcinfo,
+			  XLogRecPtr start_lsn,
+			  XLogRecPtr end_lsn);
+
 /*
  * SQL function to read WAL from WAL buffers. Returns number of bytes read.
  */
@@ -52,3 +68,251 @@ read_wal_from_buffers(PG_FUNCTION_ARGS)
 
 	PG_RETURN_INT32(read);
 }
+
+/*
+ * XLogReaderRoutine->page_read callback for reading WAL from WAL buffers.
+ */
+static int
+read_from_wal_buffers(XLogReaderState *state, XLogRecPtr targetPagePtr,
+	  int reqLen, XLogRecPtr targetRecPtr,
+	  char *cur_page)
+{
+	XLogRecPtr	read_upto,
+loc;
+	TimeLineID	tli = GetWALInsertionTimeLine();
+	Size		count;
+	Size		read = 0;
+
+	loc = targetPagePtr + reqLen;
+
+	/* Loop waiting for xlog to be available if necessary */
+	while (1)
+	{
+		read_upto = GetXLogInsertRecPtr();
+
+		if (loc <= read_upto)
+			break;
+
+		WaitXLogInsertionsToFinish(loc);
+
+		CHECK_FOR_INTERRUPTS();
+		pg_usleep(1000L);
+	}
+
+	if (targetPagePtr + XLOG_BLCKSZ <= read_upto)
+	{
+		/*
+		 * more than one block available; read only that block, have caller
+		 * come back if they need more.
+		 */
+		count = XLOG_BLCKSZ;
+	}
+	else if (targetPagePtr + reqLen > read_upto)
+	{
+		/* not enough data there */
+		return -1;
+	}
+	else
+	{
+		/* enough bytes available to satisfy the request */
+		count = read_upto - targetPagePtr;
+	}
+
+	/* read WAL from WAL 

Re: Synchronizing slots from primary to standby

2024-02-16 Thread Amit Kapila
On Fri, Feb 16, 2024 at 4:10 PM shveta malik  wrote:
>
> On Thu, Feb 15, 2024 at 10:48 PM Bertrand Drouvot
>  wrote:
>
> > 5 ===
> >
> > +   if (SlotSyncWorker->syncing)
> > {
> > -   SpinLockRelease(>mutex);
> > +   SpinLockRelease(>mutex);
> > ereport(ERROR,
> > 
> > errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> > errmsg("cannot synchronize replication 
> > slots concurrently"));
> > }
> >
> > worth to add a test in 040_standby_failover_slots_sync.pl for it?
>
> It will be very difficult to stabilize this test as we have to make
> sure that the concurrent users (SQL function(s) and/or worker(s)) are
> in that target function at the same time to hit it.
>

Yeah, I also think it would be tricky to write a stable test. Maybe one
could explore using the new injection point facility, but I don't think
it is worth it for this error check, as the check appears too
straightforward to be broken by future changes.
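
For concreteness, the injection-point approach would look roughly like
this (a sketch only: the point name, struct, and function are made up,
and it assumes the INJECTION_POINT() macro from the new facility plus a
wait-style callback on the test side):

#include "postgres.h"
#include "storage/spin.h"
#include "utils/injection_point.h"

typedef struct SlotSyncWorkerCtx
{
	slock_t		mutex;
	bool		syncing;
} SlotSyncWorkerCtx;

/* Sketch: mark ourselves as syncing, then pause at a named point. */
static void
begin_sync(SlotSyncWorkerCtx *ctx)
{
	SpinLockAcquire(&ctx->mutex);
	if (ctx->syncing)
	{
		SpinLockRelease(&ctx->mutex);
		ereport(ERROR,
				errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
				errmsg("cannot synchronize replication slots concurrently"));
	}
	ctx->syncing = true;
	SpinLockRelease(&ctx->mutex);

	/*
	 * A test could attach a wait-style callback here to hold this backend
	 * mid-sync while a second backend runs into the ERROR above.
	 */
	INJECTION_POINT("slot-sync-mid-sync");
}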

-- 
With Regards,
Amit Kapila.




Re: pg_upgrade and logical replication

2024-02-16 Thread vignesh C
On Fri, 16 Feb 2024 at 10:50, Hayato Kuroda (Fujitsu)
 wrote:
>
> Dear Vignesh,
>
> Thanks for reviewing! PSA new version.
>
> >
> > Thanks for the updated patch, few suggestions:
> > 1) Can we use a new publication for this subscription too so that the
> > publication and subscription naming will become consistent throughout
> > the test case:
> > +# Table will be in 'd' (data is being copied) state as table sync will fail
> > +# because of primary key constraint error.
> > +my $started_query =
> > +  "SELECT count(1) = 1 FROM pg_subscription_rel WHERE srsubstate = 'd'";
> > +$old_sub->poll_query_until('postgres', $started_query)
> > +  or die
> > +  "Timed out while waiting for the table state to become 'd' (datasync)";
> > +
> > +# Create another subscription and drop the subscription's replication 
> > origin
> > +$old_sub->safe_psql('postgres',
> > +   "CREATE SUBSCRIPTION regress_sub3 CONNECTION '$connstr'
> > PUBLICATION regress_pub2 WITH (enabled = false)"
> > +);
> >
> > So after the change it will become like subscription regress_sub3 for
> > publication regress_pub3, subscription regress_sub4 for publication
> > regress_pub4 and subscription regress_sub5 for publication
> > regress_pub5.
>
> A new publication was defined.
>
> > 2) The tab_upgraded1 table can be created along with create
> > publication and create subscription itself:
> > $publisher->safe_psql('postgres',
> > "CREATE PUBLICATION regress_pub3 FOR TABLE tab_upgraded1");
> > $old_sub->safe_psql('postgres',
> > "CREATE SUBSCRIPTION regress_sub4 CONNECTION '$connstr' PUBLICATION
> > regress_pub3 WITH (failover = true)"
> > );
>
> The definition of tab_upgraded1 was moved to the place you pointed.
>
> > 3) The tab_upgraded2 table can be created along with create
> > publication and create subscription itself to keep it consistent:
> >  $publisher->safe_psql('postgres',
> > -   "ALTER PUBLICATION regress_pub2 ADD TABLE tab_upgraded2");
> > +   "CREATE PUBLICATION regress_pub4 FOR TABLE tab_upgraded2");
> >  $old_sub->safe_psql('postgres',
> > -   "ALTER SUBSCRIPTION regress_sub2 REFRESH PUBLICATION");
> > +   "CREATE SUBSCRIPTION regress_sub5 CONNECTION '$connstr'
> > PUBLICATION regress_pub4"
> > +);
>
> Ditto.
>
> > With above fixes, the following can be removed:
> > # Initial setup
> > $publisher->safe_psql(
> > 'postgres', qq[
> > CREATE TABLE tab_upgraded1(id int);
> > CREATE TABLE tab_upgraded2(id int);
> > ]);
> > $old_sub->safe_psql(
> > 'postgres', qq[
> > CREATE TABLE tab_upgraded1(id int);
> > CREATE TABLE tab_upgraded2(id int);
> > ]);
>
> Yes, earlier definitions were removed instead.
> Also, some comments were adjusted based on these fixes.

Thanks for the updated patch. A few suggestions:
1) This can be moved, to keep it similar to other tests:
+# Setup a disabled subscription. The upcoming test will check the
+# pg_createsubscriber won't work, so it is sufficient.
+$publisher->safe_psql('postgres', "CREATE PUBLICATION regress_pub1");
+$old_sub->safe_psql('postgres',
+   "CREATE SUBSCRIPTION regress_sub1 CONNECTION '$connstr'
PUBLICATION regress_pub1 WITH (enabled = false)"
+);
+
+$old_sub->stop;
+
+# --
+# Check that pg_upgrade fails when max_replication_slots configured in the new
+# cluster is less than the number of subscriptions in the old cluster.
+# --
+$new_sub->append_conf('postgresql.conf', "max_replication_slots = 0");
+
+# pg_upgrade will fail because the new cluster has insufficient
+# max_replication_slots.
+command_checks_all(
+   [
+   'pg_upgrade', '--no-sync', '-d', $old_sub->data_dir,
+   '-D', $new_sub->data_dir, '-b', $oldbindir,
+   '-B', $newbindir, '-s', $new_sub->host,
+   '-p', $old_sub->port, '-P', $new_sub->port,
+   $mode, '--check',
+   ],

like below and the extra comment can be removed:
+# --
+# Check that pg_upgrade fails when max_replication_slots configured in the new
+# cluster is less than the number of subscriptions in the old cluster.
+# --
+# Create a disabled subscription.
+$publisher->safe_psql('postgres', "CREATE PUBLICATION regress_pub1");
+$old_sub->safe_psql('postgres',
+   "CREATE SUBSCRIPTION regress_sub1 CONNECTION '$connstr'
PUBLICATION regress_pub1 WITH (enabled = false)"
+);
+
+$old_sub->stop;
+
+$new_sub->append_conf('postgresql.conf', "max_replication_slots = 0");
+
+# pg_upgrade will fail because the new cluster has insufficient
+# max_replication_slots.
+command_checks_all(
+   [
+   'pg_upgrade', '--no-sync', '-d', $old_sub->data_dir,
+   '-D', $new_sub->data_dir, '-b', $oldbindir,
+   '-B', $newbindir, '-s', $new_sub->host,
+   '-p', $old_sub->port, '-P', $new_sub->port,
+   $mode, '--check',
+   ],


Re: Add bump memory context type and use it for tuplesorts

2024-02-16 Thread Tom Lane
Tomas Vondra  writes:
> On 2/17/24 00:14, Tom Lane wrote:
>> The conclusion was that the specific invalid values didn't matter as
>> much on the other platforms as they do with glibc.  But right now you
>> have a fifty-fifty chance that a pointer to garbage will look valid.
>> Do we want to increase those odds?

> Not sure. The ability to detect bogus pointers seems valuable, but is
> the difference between 4/8 and 3/8 really qualitatively different? If it
> is, maybe we should try to increase it by simply adding a bit.

I think it'd be worth taking a fresh look at the bit allocation in the
header word to see if we can squeeze another bit without too much
pain.  There's basically no remaining headroom in the current design,
and it starts to seem like we want some.  (I'm also wondering whether
the palloc_aligned stuff should have been done some other way than
by consuming a context type ID.)

regards, tom lane




Re: Add pg_basetype() function to obtain a DOMAIN base type

2024-02-16 Thread jian he
On Sat, Feb 17, 2024 at 2:16 AM Tomas Vondra
 wrote:
>
> Hi,
>
> On 1/2/24 01:00, jian he wrote:
> > On Mon, Dec 4, 2023 at 5:11 PM John Naylor  wrote:
> >>
> >> On Thu, Sep 28, 2023 at 12:22 AM Alexander Korotkov
> >>  wrote:
> >>> The one thing triggering my perfectionism is that the patch does two
> >>> syscache lookups instead of one.
> >>
> >> For an admin function used interactively, I'm not sure why that
> >> matters? Or do you see another use case?
> >
> > I did a minor refactor based on v1-0001.
> > I think pg_basetype should stay in "9.26.4. System Catalog Information
> > Functions", so I placed it before pg_char_to_encoding.
> > Now the functions listed in "Table 9.73. System Catalog Information
> > Functions" will be in alphabetical order.
> > I slightly changed src/include/catalog/pg_proc.dat.
> > Now it looks very similar to pg_typeof:
> >
> > src6=# \df pg_typeof
> >List of functions
> >Schema   |   Name| Result data type | Argument data types | Type
> > +---+--+-+--
> >  pg_catalog | pg_typeof | regtype  | "any"   | func
> > (1 row)
> >
> > src6=# \df pg_basetype
> > List of functions
> >Schema   |Name | Result data type | Argument data types | Type
> > +-+--+-+--
> >  pg_catalog | pg_basetype | regtype  | "any"   | func
> > (1 row)
> >
> > v2-0001 is as is in the first email thread, 0002 is my changes based on 
> > v2-0001.
>
>
> I think the patch(es) look reasonable, so just a couple minor comments.
>
> 1) We already have pg_typeof() function, so maybe we should use a
> similar naming convention pg_basetypeof()?
>
I am ok with pg_basetypeof.
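
For context, the loop under discussion looks roughly like this (a
sketch using the standard syscache accessors; not necessarily the
patch's implementation, and the function name here is made up):

#include "postgres.h"
#include "access/htup_details.h"
#include "catalog/pg_type.h"
#include "utils/syscache.h"

/*
 * Walk a (possibly nested) domain down to its base type: one syscache
 * lookup per level of domain nesting.
 */
static Oid
domain_base_type(Oid typid)
{
	for (;;)
	{
		HeapTuple	tup = SearchSysCache1(TYPEOID, ObjectIdGetDatum(typid));
		Form_pg_type typform;

		if (!HeapTupleIsValid(tup))
			elog(ERROR, "cache lookup failed for type %u", typid);
		typform = (Form_pg_type) GETSTRUCT(tup);
		if (typform->typtype != TYPTYPE_DOMAIN)
		{
			ReleaseSysCache(tup);
			return typid;		/* not a domain: we found the base type */
		}
		typid = typform->typbasetype;
		ReleaseSysCache(tup);
	}
}

A plain (non-domain) input costs a single lookup; each level of domain
nesting costs one more.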




Re: Add bump memory context type and use it for tuplesorts

2024-02-16 Thread Tomas Vondra



On 2/17/24 00:14, Tom Lane wrote:
> Tomas Vondra  writes:
>> Maybe I'm completely misunderstanding the implication of those limits,
>> but doesn't this mean the claim that we can support 8 memory context
>> types is not quite true, and the limit is 4, because the 4 IDs are
>> already used for malloc stuff?
> 
> Well, correct code would still work, but we will take a hit in our
> ability to detect bogus chunk pointers if we convert any of the four
> remaining bit-patterns to valid IDs.  That has costs for debugging.
> The particular bit patterns we left unused were calculated to make it
> likely that we could detect a malloced-instead-of-palloced chunk (at
> least with glibc); but in general, reducing the number of invalid
> patterns makes it more likely that a just-plain-bad pointer would
> escape detection.
> 
> I am less concerned about that than I was in 2022, because people
> have already had some time to flush out bugs associated with the
> GUC malloc->palloc conversion.  Still, maybe we should think harder
> about whether we can free up another ID bit before we go eating
> more ID types.  It's not apparent to me that the "bump context"
> idea is valuable enough to foreclose ever adding more context types,
> yet it will be pretty close to doing that if we commit it as-is.
> 
> If we do kick this can down the road, then I concur with eating 010
> next, as it seems the least likely to occur in glibc-malloced
> chunks.
> 

I don't know if the bump context for tuplesorts alone is worth it, but
I've been thinking it's not the only place doing something like that.
I'm aware of two other places doing this "dense allocation" - spell.c
and nodeHash.c. And in those cases it definitely made a big difference
(of course, the question is how big the difference would be now, with all the
palloc improvements).

But maybe we could switch all those places to a proper memcontext
(instead of something built on top of a memory context) ... Of course,
the code in spell.c/nodeHash.c is quite stable, so the custom code does
not cost much.

>> One thing that confuses me a bit is that the comments introduced by
>> 80ef92675823 talk about glibc, but the related discussion in [1] talks a
>> lot about FreeBSD, NetBSD, ... which don't actually use glibc (AFAIK).
> 
> The conclusion was that the specific invalid values didn't matter as
> much on the other platforms as they do with glibc.  But right now you
> have a fifty-fifty chance that a pointer to garbage will look valid.
> Do we want to increase those odds?
> 

Not sure. The ability to detect bogus pointers seems valuable, but is
the difference between 4/8 and 3/8 really qualitatively different? If it
is, maybe we should try to increase it by simply adding a bit.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: MAINTAIN privilege -- what do we need to un-revert it?

2024-02-16 Thread Jeff Davis
On Wed, 2024-02-14 at 13:02 -0600, Nathan Bossart wrote:
> This seemed like the approach folks were most in favor of at the
> developer
> meeting a couple weeks ago [0].  At least, that was my interpretation
> of
> the discussion.

Attached rebased version.

Note the changes in amcheck. It creates functions and calls those
functions from the comparators, so the comparators need to set the
search_path. I don't think that's terribly common, but it does
represent a behavior change and could break something.

Regards,
Jeff Davis

From 6f399aa46fb97e73ad88eac90fe53d7e19c88f2b Mon Sep 17 00:00:00 2001
From: Jeff Davis 
Date: Fri, 16 Feb 2024 14:04:23 -0800
Subject: [PATCH v4] Fix search_path to a safe value during maintenance
 operations.

While executing maintenance operations (ANALYZE, CLUSTER, REFRESH
MATERIALIZED VIEW, REINDEX, or VACUUM), set search_path to
'pg_catalog, pg_temp' to prevent inconsistent behavior.

Functions that are used for functional indexes, in index expressions,
or in materialized views and depend on a different search path must be
declared with CREATE FUNCTION ... SET search_path='...'.

This change was previously committed as 05e1737351, then reverted in
commit 2fcc7ee7af because it was too late in the cycle.

Preparation for the MAINTAIN privilege, which was previously reverted
due to search_path manipulation hazards.

Discussion: https://postgr.es/m/d4ccaf3658cb3c281ec88c851a09733cd9482f22.ca...@j-davis.com
Discussion: https://postgr.es/m/E1q7j7Y-000z1H-Hr%40gemulon.postgresql.org
Discussion: https://postgr.es/m/e44327179e5c9015c8dda67351c04da552066017.camel%40j-davis.com
Reviewed-by: Greg Stark
Reviewed-by: Nathan Bossart
---
 contrib/amcheck/t/004_verify_nbtree_unique.pl | 33 +++
 contrib/amcheck/verify_nbtree.c   |  2 ++
 src/backend/access/brin/brin.c|  2 ++
 src/backend/catalog/index.c   |  8 +
 src/backend/catalog/namespace.c   | 10 +++---
 src/backend/commands/analyze.c|  2 ++
 src/backend/commands/cluster.c|  2 ++
 src/backend/commands/indexcmds.c  |  6 
 src/backend/commands/matview.c|  2 ++
 src/backend/commands/vacuum.c |  2 ++
 src/bin/scripts/t/100_vacuumdb.pl |  4 ---
 src/include/utils/guc.h   |  6 
 .../test_oat_hooks/expected/alter_table.out   |  2 ++
 .../expected/test_oat_hooks.out   |  4 +++
 src/test/regress/expected/matview.out |  4 ++-
 src/test/regress/expected/privileges.out  | 12 +++
 src/test/regress/expected/vacuum.out  |  2 +-
 src/test/regress/sql/matview.sql  |  4 ++-
 src/test/regress/sql/privileges.sql   |  8 ++---
 src/test/regress/sql/vacuum.sql   |  2 +-
 20 files changed, 81 insertions(+), 36 deletions(-)

diff --git a/contrib/amcheck/t/004_verify_nbtree_unique.pl b/contrib/amcheck/t/004_verify_nbtree_unique.pl
index 3f474a158a..4b704e6815 100644
--- a/contrib/amcheck/t/004_verify_nbtree_unique.pl
+++ b/contrib/amcheck/t/004_verify_nbtree_unique.pl
@@ -20,8 +20,11 @@ $node->safe_psql(
 	'postgres', q(
 	CREATE EXTENSION amcheck;
 
+	CREATE SCHEMA test_amcheck;
+	SET search_path = test_amcheck;
+
 	CREATE FUNCTION ok_cmp (int4, int4)
-	RETURNS int LANGUAGE sql AS
+	RETURNS int LANGUAGE sql SET search_path = test_amcheck AS
 	$$
 		SELECT
 			CASE WHEN $1 < $2 THEN -1
@@ -34,7 +37,7 @@ $node->safe_psql(
 	--- Check 1: uniqueness violation.
 	---
 	CREATE FUNCTION ok_cmp1 (int4, int4)
-	RETURNS int LANGUAGE sql AS
+	RETURNS int LANGUAGE sql SET search_path = test_amcheck AS
 	$$
 		SELECT ok_cmp($1, $2);
 	$$;
@@ -43,7 +46,7 @@ $node->safe_psql(
 	--- Make values 768 and 769 look equal.
 	---
 	CREATE FUNCTION bad_cmp1 (int4, int4)
-	RETURNS int LANGUAGE sql AS
+	RETURNS int LANGUAGE sql SET search_path = test_amcheck AS
 	$$
 		SELECT
 			CASE WHEN ($1 = 768 AND $2 = 769) OR
@@ -56,13 +59,13 @@ $node->safe_psql(
 	--- Check 2: uniqueness violation without deduplication.
 	---
 	CREATE FUNCTION ok_cmp2 (int4, int4)
-	RETURNS int LANGUAGE sql AS
+	RETURNS int LANGUAGE sql SET search_path = test_amcheck AS
 	$$
 		SELECT ok_cmp($1, $2);
 	$$;
 
 	CREATE FUNCTION bad_cmp2 (int4, int4)
-	RETURNS int LANGUAGE sql AS
+	RETURNS int LANGUAGE sql SET search_path = test_amcheck AS
 	$$
 		SELECT
 			CASE WHEN $1 = $2 AND $1 = 400 THEN -1
@@ -74,13 +77,13 @@ $node->safe_psql(
 	--- Check 3: uniqueness violation with deduplication.
 	---
 	CREATE FUNCTION ok_cmp3 (int4, int4)
-	RETURNS int LANGUAGE sql AS
+	RETURNS int LANGUAGE sql SET search_path = test_amcheck AS
 	$$
 		SELECT ok_cmp($1, $2);
 	$$;
 
 	CREATE FUNCTION bad_cmp3 (int4, int4)
-	RETURNS int LANGUAGE sql AS
+	RETURNS int LANGUAGE sql SET search_path = test_amcheck AS
 	$$
 		SELECT bad_cmp2($1, $2);
 	$$;
@@ -142,7 +145,7 @@ my ($result, $stdout, $stderr);
 # We have not yet broken the index, so we should get no corruption
 

Re: Add bump memory context type and use it for tuplesorts

2024-02-16 Thread Tom Lane
Tomas Vondra  writes:
> Maybe I'm completely misunderstanding the implication of those limits,
> but doesn't this mean the claim that we can support 8 memory context
> types is not quite true, and the limit is 4, because the 4 IDs are
> already used for malloc stuff?

Well, correct code would still work, but we will take a hit in our
ability to detect bogus chunk pointers if we convert any of the four
remaining bit-patterns to valid IDs.  That has costs for debugging.
The particular bit patterns we left unused were calculated to make it
likely that we could detect a malloced-instead-of-palloced chunk (at
least with glibc); but in general, reducing the number of invalid
patterns makes it more likely that a just-plain-bad pointer would
escape detection.

I am less concerned about that than I was in 2022, because people
have already had some time to flush out bugs associated with the
GUC malloc->palloc conversion.  Still, maybe we should think harder
about whether we can free up another ID bit before we go eating
more ID types.  It's not apparent to me that the "bump context"
idea is valuable enough to foreclose ever adding more context types,
yet it will be pretty close to doing that if we commit it as-is.

If we do kick this can down the road, then I concur with eating 010
next, as it seems the least likely to occur in glibc-malloced
chunks.

> One thing that confuses me a bit is that the comments introduced by
> 80ef92675823 talk about glibc, but the related discussion in [1] talks a
> lot about FreeBSD, NetBSD, ... which don't actually use glibc (AFAIK).

The conclusion was that the specific invalid values didn't matter as
much on the other platforms as they do with glibc.  But right now you
have a fifty-fifty chance that a pointer to garbage will look valid.
Do we want to increase those odds?

regards, tom lane




Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

2024-02-16 Thread Jim Jones



On 16.02.24 21:31, David G. Johnston wrote:
> Yes.  In particular not all columns in the table need be specified in
> the copy command so while the parsed input data is all nulls the
> record itself may not be.

Yeah, you have a point there.
I guess if users want to avoid that happening, they can rely on NOT
NULL constraints.

Thanks

-- 
Jim





Re: Add bump memory context type and use it for tuplesorts

2024-02-16 Thread Tomas Vondra
On 11/6/23 19:54, Matthias van de Meent wrote:
>
> ...
>
> Tangent: Do we have specific notes on worst-case memory usage of
> memory contexts with various allocation patterns? This new bump
> allocator seems to be quite efficient, but in a worst-case allocation
> pattern it can still waste about 1/3 of its allocated memory due to
> never using free space on previous blocks after an allocation didn't
> fit on that block.
> It probably isn't going to be a huge problem in general, but this
> seems like something that could be documented as a potential problem
> when you're looking for which allocator to use and compare it with
> other allocators that handle different allocation sizes more
> gracefully.
> 

I don't think it's documented anywhere, but I agree it might be an
interesting piece of information. It probably did not matter too much
when we had just AllocSet, but now we have 3 very different allocators,
so maybe we should explain this.
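
To make the worst case concrete, here is a toy accounting of bump-style
allocation (my own illustration, not code from the patch): with
fixed-size blocks and chunks just over a third of the block size, the
unused tail of each block is abandoned, and roughly a third of the
allocated memory is wasted.

#include <stdio.h>

int
main(void)
{
	const long	block_size = 8192;
	const long	chunk_size = 2731;	/* just over block_size / 3 */
	long		used = 0,
				allocated = 0,
				avail = 0;

	for (int i = 0; i < 30000; i++)
	{
		if (chunk_size > avail)
		{
			/* chunk doesn't fit: abandon the tail, open a new block */
			allocated += block_size;
			avail = block_size;
		}
		avail -= chunk_size;
		used += chunk_size;
	}
	printf("used %ld of %ld bytes (%.1f%% wasted)\n",
		   used, allocated, 100.0 * (allocated - used) / allocated);
	return 0;
}

This prints roughly 33% wasted; the real allocator's doubling block
sizes and separate treatment of oversized chunks change the exact
numbers, but not the basic failure mode.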

When implementing these allocators, it didn't feel that important,
because the new allocators started as intended for a very specific part
of the code (as in "This piece of code has a very unique allocation
pattern, let's develop a custom allocator for it."), but if we feel we
want to make it simpler to use the allocators elsewhere ...

I think there are two obvious places where to document this - either in
the header of each memory context .c file, or a README in the mmgr
directory. Or some combination of it.

At some point I was thinking about writing a "proper paper" comparing
these allocators in a more scientific / thorough way, but I never got to
do it. I wonder if that'd be interesting for enough people.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Add bump memory context type and use it for tuplesorts

2024-02-16 Thread Tomas Vondra
Hi,

I wanted to take a look at this patch but it seems to need a rebase,
because of a seemingly trivial conflict in MemoryContextMethodID:

--- src/include/utils/memutils_internal.h
+++ src/include/utils/memutils_internal.h
@@ -123,12 +140,13 @@ typedef enum MemoryContextMethodID
 {
   MCTX_UNUSED1_ID,  /* 000 occurs in never-used memory */
   MCTX_UNUSED2_ID,  /* glibc malloc'd chunks usually match 001 */
-  MCTX_UNUSED3_ID,  /* glibc malloc'd chunks > 128kB match 010 */
+  MCTX_BUMP_ID,/* glibc malloc'd chunks > 128kB match 010
+ * XXX? */
   MCTX_ASET_ID,
   MCTX_GENERATION_ID,
   MCTX_SLAB_ID,
   MCTX_ALIGNED_REDIRECT_ID,
-  MCTX_UNUSED4_ID/* 111 occurs in wipe_mem'd memory */
+  MCTX_UNUSED3_ID/* 111 occurs in wipe_mem'd memory */
 } MemoryContextMethodID;


I wasn't paying much attention to these memcontext reworks in 2022, so
my instinct was simply to use one of those "UNUSED" IDs. But after
looking at the 80ef92675823 a bit more, are those IDs really unused? I
mean, we're relying on those values to detect bogus pointers, right? So
if we instead start using those values for a new memory context, won't
we lose the ability to detect those issues?

Maybe I'm completely misunderstanding the implication of those limits,
but doesn't this mean the claim that we can support 8 memory context
types is not quite true, and the limit is 4, because the 4 IDs are
already used for malloc stuff?
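
To spell out the encoding being discussed (my own illustration of
memutils_internal.h's scheme, not the actual source): the method ID
lives in the low 3 bits of the 64-bit chunk header word, so a stray
pointer is caught only if those bits happen to form one of the
reserved patterns.

#include <stdbool.h>
#include <stdint.h>

#define MEMORY_CONTEXT_METHODID_MASK	UINT64_C(0x7)

enum
{
	MCTX_UNUSED1_ID,			/* 000 occurs in never-used memory */
	MCTX_UNUSED2_ID,			/* 001: glibc malloc'd chunks */
	MCTX_UNUSED3_ID,			/* 010: glibc malloc'd chunks > 128kB */
	MCTX_ASET_ID,
	MCTX_GENERATION_ID,
	MCTX_SLAB_ID,
	MCTX_ALIGNED_REDIRECT_ID,
	MCTX_UNUSED4_ID				/* 111 occurs in wipe_mem'd memory */
};

static bool
looks_like_valid_chunk(uint64_t header)
{
	switch (header & MEMORY_CONTEXT_METHODID_MASK)
	{
		case MCTX_ASET_ID:
		case MCTX_GENERATION_ID:
		case MCTX_SLAB_ID:
		case MCTX_ALIGNED_REDIRECT_ID:
			return true;		/* 4 of the 8 patterns are live today */
		default:
			return false;		/* the other 4 catch bogus pointers */
	}
}

Every reserved pattern converted into a live context type (as the bump
patch proposes for 010) shrinks that safety net by one in eight.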

One thing that confuses me a bit is that the comments introduced by
80ef92675823 talk about glibc, but the related discussion in [1] talks a
lot about FreeBSD, NetBSD, ... which don't actually use glibc (AFAIK).
So how portable are those unused IDs, actually?

Or am I just too caffeine-deprived and missing something obvious?

regards


[1] https://postgr.es/m/2910981.1665080...@sss.pgh.pa.us

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




RE: Psql meta-command conninfo+

2024-02-16 Thread Maiquel Grassi
Hi!

>7) -h 0.0.0.0 - As you mentioned, this is one of the cases where host
>and "server address" differ.
>   I am not sure if it is an issue. Perhaps the other reviewers might
>have an opinion on it

>$ /usr/local/postgres-dev/bin/psql -x -U postgres -h 0.0.0.0 -c
>"\conninfo+" -c "\conninfo"
>Current Connection Information
>-[ RECORD 1 ]--+---
>Database   | postgres
>Authenticated User | postgres
>System User|
>Current User   | postgres
>Session User   | postgres
>Backend PID| 404395
>Server Address | 127.0.0.1
>Server Port| 5432
>Client Address | 127.0.0.1
>Client Port| 54806
>Socket Directory   |
>Host   | 0.0.0.0
>Encryption | SSL
>Protocol   | TLSv1.3
>Cipher | TLS_AES_256_GCM_SHA384
>Compression| off

I believe we don't actually have a problem here. To me, this makes sense.
I'm not a Networking expert, but from my standpoint, any string or IP
accepted by the "host" parameter of psql, pointing to "127.0.0.1" or to "::1", 
is correct.
I see it as a convention (would need to revisit Tanenbaum's books haha),
and I don't think there's a need to modify it. But as you mentioned, if someone
with more Networking knowledge could weigh in, a suggestion would be welcome.

>I'm not sure of the impact of this change in the existing \conninfo - at
>least the cfbot and "make -j check-world" didn't complain.
>I'll take a closer look at it as soon we have test cases.

This type of modification, in principle, does not generate errors, as
we are only changing the on-screen display of strings that were
already stored.

>To keep it consistent with the other options, we might wanna use "+ is
>appended" instead of "+ is specified" or "+ is given"

It seems to me that "appended" sounds good. I opted for it.
Following your suggestion, and merging it with previous ideas from the
documentation, I've created a provisional prototype of the column
descriptions. At a later stage, I'll finalize the descriptions by
removing the placeholders (blah blah blah).
Along with this, I'll evaluate an iteration suggested by Nathan
Bossart, who also proposed changes.

Thank you for your work.

Regards,
Maiquel Grassi.


v17-0001-psql-meta-command-conninfo-plus.patch


Re: Things I don't like about \du's "Attributes" column

2024-02-16 Thread Tom Lane
"David G. Johnston"  writes:
> Per the recent bug report, we should probably add something like (ignored)
> after the 50 connections for role1 since they are not allowed to login so
> the value is indeed ignored.  It is ignored to zero as opposed to unlimited
> for the Superuser so maybe a different word (not allowed)?

Not sure it's worth worrying about, but if we do I'd not bother to
show the irrelevant value at all: it's just making the display wider
to little purpose.  We could make the column read as "(irrelevant)",
or leave it blank.  I'd argue the same for password expiration
time BTW.

regards, tom lane




Re: Things I don't like about \du's "Attributes" column

2024-02-16 Thread David G. Johnston
On Mon, Feb 12, 2024 at 2:29 PM Pavel Luzanov wrote:

>  regress_du_role1 | no  | Inherit    | no  | 2024-12-31 00:00:00+03(invalid) | 50          | Group role without password but with valid until
>  regress_du_role2 | yes | Inherit    | yes |                                 | Not allowed | No connections allowed
>  regress_du_role3 | yes |            | yes |                                 | 10          | User without attributes
>  regress_du_su    | yes | Superuser +| yes |                                 | 3(ignored)  | Superuser but with connection limit
>
>
Per the recent bug report, we should probably add something like
(ignored) after the 50 connections for role1, since that role is not
allowed to log in, so the value is indeed ignored. It is ignored in
favor of zero, as opposed to unlimited for the superuser, so maybe a
different word (not allowed)?

David J.


Re: System username in pg_stat_activity

2024-02-16 Thread Andres Freund
On 2024-02-16 21:56:25 +0100, Magnus Hagander wrote:
> On Fri, Feb 16, 2024 at 9:51 PM Andres Freund  wrote:
> > I only skimmed the patch, but it sure looks to me that we could end up with
> > none of the branches setting 31,32, so I think you'd have to make sure to
> > handle that case.
> 
> That case sets nulls[] for both of them to true I believe? And when
> that is set I don't believe we need to set the values themselves.

Seems I skimmed too quickly :) - you're right.




Re: System username in pg_stat_activity

2024-02-16 Thread Magnus Hagander
On Fri, Feb 16, 2024 at 9:51 PM Andres Freund  wrote:
>
> Hi,
>
> On 2024-02-16 21:41:41 +0100, Magnus Hagander wrote:
> > > Maybe I am missing something, but why aren't we just getting the value 
> > > from
> > > the leader's entry, instead of copying it?
> >
> > The answer to that would be "because I didn't think of it" :)
>
> :)
>
>
> > Were you thinking of something like the attached?
>
> > @@ -435,6 +438,22 @@ pg_stat_get_activity(PG_FUNCTION_ARGS)
> >  		{
> >  			values[29] = Int32GetDatum(leader->pid);
> >  			nulls[29] = false;
> > +
> > +			/*
> > +			 * The authenticated user in a parallel worker is the same as
> > +			 * the one in the leader, so look it up there.
> > +			 */
> > +			if (leader->backendId)
> > +			{
> > +				LocalPgBackendStatus *leaderstat = pgstat_get_local_beentry_by_backend_id(leader->backendId);
> > +
> > +				if (leaderstat->backendStatus.st_auth_method != uaReject &&
> > +					leaderstat->backendStatus.st_auth_method != uaImplicitReject)
> > +				{
> > +					nulls[31] = nulls[32] = false;
> > +					values[31] = CStringGetTextDatum(hba_authname(leaderstat->backendStatus.st_auth_method));
> > +					values[32] = CStringGetTextDatum(leaderstat->backendStatus.st_auth_identity);
> > +				}
> > +			}
>
> Mostly, yes.
>
> I only skimmed the patch, but it sure looks to me that we could end up with
> none of the branches setting 31,32, so I think you'd have to make sure to
> handle that case.

That case sets nulls[] for both of them to true I believe? And when
that is set I don't believe we need to set the values themselves.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/




Re: LogwrtResult contended spinlock

2024-02-16 Thread Jeff Davis
On Mon, 2024-02-12 at 17:44 -0800, Jeff Davis wrote:
> It looks like there's some renewed interest in this patch:

After rebasing (attached as 0001), I'm seeing some test failures. It
looks like the local LogwrtResult is not being updated in as many
places, and that's hitting the Assert that I recently added. The fix is
easy (attached as 0002).

Though it looks like we can remove the non-shared LogwrtResult
entirely. Andres expressed some concern here:

https://www.postgresql.org/message-id/20210130020211.rtu5ir3dpjrbi...@alap3.anarazel.de

But then seemed to favor removing it here:

https://www.postgresql.org/message-id/20240213001150.4uqzh7tinuhvo...@awork3.anarazel.de

I'm inclined to think we can get rid of the non-shared copy.

A few other comments:

 * Should GetFlushRecPtr()/GetXLogWriteRecPtr() use a read memory
barrier?
 * Why did you add pg_memory_barrier() right before a spinlock
acquisition?
 * Is it an invariant that Write >= Flush at all times? Are there
guaranteed to be write barriers in the right place to ensure that?

I would also like it if we could add a new "Copy" pointer indicating
how much WAL data has been copied to the WAL buffers. That would be set
by WaitXLogInsertionsToFinish() so that subsequent calls are cheap.
Attached a patch (0003) for illustration purposes. It adds to the size
of XLogCtlData, but it's fairly large already, so I'm not sure if
that's a problem. If we do add this, there would be an invariant that
Copy >= Write at all times.
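
The underlying pattern for all of these pointers is a monotonic
advance: move the shared value forward with a compare-exchange loop so
that no writer can ever push it backward and readers never need the
spinlock. A sketch using C11 atomics in place of PostgreSQL's pg_atomic
wrappers (an illustration, not the patch's code):

#include <stdatomic.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

/*
 * Advance *ptr to target, but never backward.  Returns whatever value
 * ends up in place: ours, or a concurrent writer's newer one.
 */
static XLogRecPtr
monotonic_advance(_Atomic XLogRecPtr *ptr, XLogRecPtr target)
{
	XLogRecPtr	cur = atomic_load(ptr);

	while (cur < target)
	{
		/* on failure, cur is reloaded with the competing value */
		if (atomic_compare_exchange_weak(ptr, &cur, target))
			return target;
	}
	return cur;
}

With this shape, invariants like Write >= Flush (or Copy >= Write) hold
only if each pointer is advanced in the right order by the code doing
the corresponding work, which is what the barrier questions above are
probing.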

Regards,
Jeff Davis

From cd48125afc61faa46f15f444390ed01f64988320 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Tue, 2 Feb 2021 14:03:43 -0300
Subject: [PATCH v11j 1/3] Make XLogCtl->LogwrtResult accessible with atomics
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Currently, access to LogwrtResult is protected by a spinlock.  This
becomes severely contended in some scenarios, such as with a largish
replication flock: walsenders all calling GetFlushRecPtr repeatedly
cause the processor to heat up to the point where eggs can be fried on top.

This can be reduced to a non-problem by replacing XLogCtl->LogwrtResult
with a struct containing a pair of atomically accessed variables. Do so.
In a few places, we can adjust the exact location where the locals are
updated to account for the fact that we no longer need the spinlock.

Author: Álvaro Herrera 
Discussion: https://postgr.es/m/20200831182156.GA3983@alvherre.pgsql
---
 src/backend/access/transam/xlog.c | 106 ++
 src/include/port/atomics.h|  29 
 src/tools/pgindent/typedefs.list  |   1 +
 3 files changed, 78 insertions(+), 58 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 50c347a679..46394c5d39 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -294,16 +294,13 @@ static bool doPageWrites;
  *
  * LogwrtRqst indicates a byte position that we need to write and/or fsync
  * the log up to (all records before that point must be written or fsynced).
- * LogwrtResult indicates the byte positions we have already written/fsynced.
- * These structs are identical but are declared separately to indicate their
- * slightly different functions.
+ * LogWrtResult indicates the byte positions we have already written/fsynced.
+ * These structs are similar but are declared separately to indicate their
+ * slightly different functions; in addition, the latter is read and written
+ * using atomic operations.
  *
- * To read XLogCtl->LogwrtResult, you must hold either info_lck or
- * WALWriteLock.  To update it, you need to hold both locks.  The point of
- * this arrangement is that the value can be examined by code that already
- * holds WALWriteLock without needing to grab info_lck as well.  In addition
- * to the shared variable, each backend has a private copy of LogwrtResult,
- * which is updated when convenient.
+ * In addition to the shared variable, each backend has a private copy of
+ * LogwrtResult, each member of which is separately updated when convenient.
  *
  * The request bookkeeping is simpler: there is a shared XLogCtl->LogwrtRqst
  * (protected by info_lck), but we don't need to cache any copies of it.
@@ -326,6 +323,12 @@ static bool doPageWrites;
  *--
  */
 
+typedef struct XLogwrtAtomic
+{
+	pg_atomic_uint64 Write;		/* last byte + 1 written out */
+	pg_atomic_uint64 Flush;		/* last byte + 1 flushed */
+} XLogwrtAtomic;
+
 typedef struct XLogwrtRqst
 {
 	XLogRecPtr	Write;			/* last byte + 1 to write out */
@@ -461,6 +464,7 @@ typedef struct XLogCtlData
 {
 	XLogCtlInsert Insert;
 
+	XLogwrtAtomic LogwrtResult; /* uses atomics */
 	/* Protected by info_lck: */
 	XLogwrtRqst LogwrtRqst;
 	XLogRecPtr	RedoRecPtr;		/* a recent copy of Insert->RedoRecPtr */
@@ -478,12 +482,6 @@ typedef struct XLogCtlData
 	pg_time_t	lastSegSwitchTime;
 	XLogRecPtr	lastSegSwitchLSN;
 
-	/*
-	 * Protected by 

Re: System username in pg_stat_activity

2024-02-16 Thread Andres Freund
Hi,

On 2024-02-16 21:41:41 +0100, Magnus Hagander wrote:
> > Maybe I am missing something, but why aren't we just getting the value from
> > the leader's entry, instead of copying it?
>
> The answer to that would be "because I didn't think of it" :)

:)


> Were you thinking of something like the attached?

> @@ -435,6 +438,22 @@ pg_stat_get_activity(PG_FUNCTION_ARGS)
>  		{
>  			values[29] = Int32GetDatum(leader->pid);
>  			nulls[29] = false;
> +
> +			/*
> +			 * The authenticated user in a parallel worker is the same as
> +			 * the one in the leader, so look it up there.
> +			 */
> +			if (leader->backendId)
> +			{
> +				LocalPgBackendStatus *leaderstat = pgstat_get_local_beentry_by_backend_id(leader->backendId);
> +
> +				if (leaderstat->backendStatus.st_auth_method != uaReject &&
> +					leaderstat->backendStatus.st_auth_method != uaImplicitReject)
> +				{
> +					nulls[31] = nulls[32] = false;
> +					values[31] = CStringGetTextDatum(hba_authname(leaderstat->backendStatus.st_auth_method));
> +					values[32] = CStringGetTextDatum(leaderstat->backendStatus.st_auth_identity);
> +				}
> +			}

Mostly, yes.

I only skimmed the patch, but it sure looks to me that we could end up with
none of the branches setting 31,32, so I think you'd have to make sure to
handle that case.

Greetings,

Andres Freund




Re: System username in pg_stat_activity

2024-02-16 Thread Andres Freund
Hi,

On 2024-02-16 15:22:16 -0500, Tom Lane wrote:
> Magnus Hagander  writes:
> > I mean, we could split it into more than one view. But adding a new
> > view for every new thing we want to show is also not very good from
> > either a usability or performance perspective.  So where would we put
> > it?
>
> It'd have to be a new view with a row per session, showing static
> (or at least mostly static?) properties of the session.

Yep.


> Could we move some existing fields of pg_stat_activity into such a
> view?

I'd suspect that at least some of
 - leader_pid
 - datid
 - datname
 - usesysid
 - usename
 - backend_start
 - client_addr
 - client_hostname
 - client_port
 - backend_type

could be moved. Whether it's worth breaking existing queries, I don't quite know.

One option would be to not return (some) of them from pg_stat_get_activity(),
but add them to the view in a way that the planner can elide the reference.


> I'm not sure that this is worth the trouble TBH.  If it can be shown
> that pulling a few fields out of pg_stat_activity actually does make
> for a useful speedup, then maybe OK ... but Andres hasn't provided
> any evidence that there's a measurable issue.

If I thought that the two columns proposed here were all that we wanted to
add, I'd not be worried. But there have been quite a few other fields
proposed, e.g. tracking idle/active time on a per-connection granularity.

We even already have a patch to add pg_stat_session
https://commitfest.postgresql.org/47/3405/

Greetings,

Andres Freund




Re: System username in pg_stat_activity

2024-02-16 Thread Magnus Hagander
On Fri, Feb 16, 2024 at 8:55 PM Andres Freund  wrote:
>
> Hi,
>
> On 2024-01-12 17:16:53 +0100, Magnus Hagander wrote:
> > On Thu, Jan 11, 2024 at 5:55 PM Bertrand Drouvot
> >  wrote:
> > > On Thu, Jan 11, 2024 at 02:24:58PM +0100, Magnus Hagander wrote:
> > > > On Wed, Jan 10, 2024 at 3:12 PM Bertrand Drouvot
> > > >  wrote:
> > > > >
> > > > > If we go the 2 fields way, then what about auth_identity and 
> > > > > auth_method then?
> > > >
> > > >
> > > > Here is an updated patch based on this idea.
> > >
> > > Thanks!
> > >
> > > + 
> > > +  
> > > +   auth_method text
> > > +  
> > > +  
> > > +   The authentication method used for authenticating the connection, or
> > > +   NULL for background processes.
> > > +  
> > >
> > > I'm wondering if it would make sense to populate it for parallel workers too.
> > > I think it's doable thanks to d951052, but I'm not sure it's worth it (one
> > > could join based on the leader_pid though). OTOH that would be consistent with
> > > how the SYSTEM_USER behaves with parallel workers (it's populated).
> >
> > I guess one could conceptually argue that "authentication happens in
> > the leader". But we do populate it with the other user records, and
> > it'd be weird if this one was excluded.
> >
> > The tricky thing is that pgstat_bestart() is called long before we
> > deserialize the data. But from what I can tell it should be safe to
> > change it per the attached? That should AFAICT add only an extremely
> > short window of time before we report it, not enough to matter.
>
> I don't like that one bit. The whole subsystem initialization dance already is
> quite complicated, particularly for pgstat, we shouldn't make it more
> complicated. Especially not when the initialization is moved quite a bit away
> from all the other calls.
>
> Besides just that, I also don't think delaying visibility of the worker in
> pg_stat_activity until parallel worker initialization has completed is good,
> that's not all cheap work.
>
>
> Maybe I am missing something, but why aren't we just getting the value from
> the leader's entry, instead of copying it?

The answer to that would be "because I didn't think of it" :)

Were you thinking of something like the attached?

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index cf3de80394..9403df5886 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -23891,7 +23891,7 @@ SELECT * FROM pg_ls_dir('.') WITH ORDINALITY AS t(ls,n);
 
   

-
+
  system_user
 
 system_user
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 5cf9363ac8..51ed656b17 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -784,6 +784,28 @@ postgres   27093  0.0  0.0  30096  2752 ?Ss   11:34   0:00 postgres: ser
   
  
 
+ 
+  
+   auth_method text
+  
+  
+   The authentication method used for authenticating the connection, or
+   NULL for background processes.
+  
+ 
+
+ 
+  
+   auth_identity text
+  
+  
+   The identity (if any) that the user presented during the authentication
+   cycle before they were assigned a database role.  Contains the same
+   value as the identity part in , or NULL
+   for background processes.
+  
+ 
+
  
   
application_name text
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 04227a72d1..bb1060fc66 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -873,6 +873,8 @@ CREATE VIEW pg_stat_activity AS
 S.leader_pid,
 S.usesysid,
 U.rolname AS usename,
+S.auth_method,
+S.auth_identity,
 S.application_name,
 S.client_addr,
 S.client_hostname,
diff --git a/src/backend/libpq/auth.c b/src/backend/libpq/auth.c
index 9bbdc4beb0..6621969154 100644
--- a/src/backend/libpq/auth.c
+++ b/src/backend/libpq/auth.c
@@ -625,9 +625,20 @@ ClientAuthentication(Port *port)
 			status = CheckRADIUSAuth(port);
 			break;
 		case uaCert:
-			/* uaCert will be treated as if clientcert=verify-full (uaTrust) */
+
+			/*
+			 * uaCert will be treated as if clientcert=verify-full further
+			 * down
+			 */
+			break;
 		case uaTrust:
 			status = STATUS_OK;
+
+			/*
+			 * Trust doesn't set_authn_id(), but we still need to store the
+			 * auth_method
+			 */
+			MyClientConnectionInfo.auth_method = uaTrust;
 			break;
 	}
 
@@ -645,6 +656,12 @@ ClientAuthentication(Port *port)
 #endif
 	}
 
+	/*
+	 * All auth methods should have either called set_authn_id() or manually
+	 * set the auth_method if they were successful.
+	 */
+	Assert(status != STATUS_OK || MyClientConnectionInfo.auth_method != 

Re: Add LSN <-> time conversion functionality

2024-02-16 Thread Tomas Vondra
Hi,

I took a look at this today, to try to understand the purpose and how it
works. Let me share some initial thoughts and questions I have. Some of
this may be wrong/missing the point, so apologies for that.

The goal seems worthwhile in general - the way I understand it, the
patch aims to provide tracking of WAL "velocity", i.e. how much WAL was
generated over time. Which we now don't have, as we only maintain simple
cumulative stats/counters. And then uses it to estimate timestamp for a
given LSN, and vice versa, because that's what the pruning patch needs.

When I first read this, I immediately started wondering if this might
use the commit timestamp stuff we already have. Because for each commit
we already store the LSN and commit timestamp, right? But I'm not sure
that would be a good match - the commit_ts serves a very special purpose
of mapping XID => (LSN, timestamp), I don't see how to make it work for
(LSN=>timestamp) and (timestamp=>LSN) very easily.


As for the inner workings of the patch, my understanding is this:

- "LSNTimeline" consists of "LSNTime" entries representing (LSN,ts)
points, but those points are really "buckets" that grow larger and
larger for older periods of time.

- The entries are being added from bgwriter, i.e. on each loop we add
the current (LSN, timestamp) into the timeline.

- We then estimate LSN/timestamp using the data stored in LSNTimeline
(either LSN => timestamp, or the opposite direction).
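
To check my understanding, here's a rough sketch of the structure I
have in mind - all the names, the stand-in types and the interpolation
are my guesses, not the patch's actual code:

#include <stdint.h>

/* stand-ins for the real PG types, to keep the sketch self-contained */
typedef uint64_t XLogRecPtr;    /* LSN */
typedef int64_t TimestampTz;    /* microseconds */

typedef struct LSNTime
{
    TimestampTz time;
    XLogRecPtr  lsn;
} LSNTime;

#define LSN_TIMELINE_SIZE 64    /* see the size-64 comment discussed below */

typedef struct LSNTimeline
{
    int     length;                     /* number of valid entries */
    LSNTime data[LSN_TIMELINE_SIZE];    /* newest first; each older
                                         * bucket covers roughly twice
                                         * the time range of the next */
} LSNTimeline;

/* linear interpolation between two (lsn, time) points */
static TimestampTz
estimate_time_for_lsn(const LSNTime *older, const LSNTime *newer,
                      XLogRecPtr lsn)
{
    double  frac = (double) (lsn - older->lsn) /
                   (double) (newer->lsn - older->lsn);

    return older->time + (TimestampTz) ((newer->time - older->time) * frac);
}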


Some comments in arbitrary order:

- AFAIK each entry represents an interval of time, and the next (older)
interval is twice as long, right? So the first interval is 1 second,
then 2 seconds, 4 seconds, 8 seconds, ...

- But I don't understand how the LSNTimeline entries are "aging" and get
less accurate, while the "current" bucket is short. lsntime_insert()
seems to simply move to the next entry, but doesn't that mean we insert
the entries into larger and larger buckets?

- The comments never really spell what amount of time the entries cover
/ how granular it is. My understanding is it's simply measured in number
of entries added, which is assumed to be constant and driven by
bgwriter_delay, right? Which is 200ms by default. Which seems fine, but
isn't the hibernation (HIBERNATE_FACTOR) going to mess with it?

Is there some case where bgwriter would just loop without sleeping,
filling the timeline much faster? (I can't think of any, but ...)

- The LSNTimeline comment claims an array of size 64 is large enough to
not need to care about filling it, but maybe it should briefly explain
why we can never fill it (I guess 2^64 is just too many).

- I don't quite understand why 0005 adds the functions to pageinspect.
This has nothing to do with pages, right?

- Not sure why we need 0001. Just so that the "estimate" functions in
0002 have a convenient "start" point? Surely we could look at the
current LSNTimeline data and use the oldest value, or (if there's no
data) use the current timestamp/LSN?

- I wonder what happens if we lose the data - we know that if people
reset statistics for whatever reason (or just lose them because of a
crash, or because they're on a replica), bad things happen to
autovacuum. What's the (expected) impact on pruning?

- What about a SRF function that outputs the whole LSNTimeline? Would be
useful for debugging / development, I think. (Just a suggestion).


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: PGC_SIGHUP shared_buffers?

2024-02-16 Thread Thomas Munro
On Fri, Feb 16, 2024 at 5:29 PM Robert Haas  wrote:
> 3. Reserve lots of address space and then only use some of it. I hear
> rumors that some forks of PG have implemented something like this. The
> idea is that you convince the OS to give you a whole bunch of address
> space, but you try to avoid having all of it be backed by physical
> memory. If you later want to increase shared_buffers, you then get the
> OS to back more of it by physical memory, and if you later want to
> decrease shared_buffers, you hopefully have some way of giving the OS
> the memory back. As compared with the previous two approaches, this
> seems less likely to be noticeable to most PG code. Problems include
> (1) you have to somehow figure out how much address space to reserve,
> and that forms an upper bound on how big shared_buffers can grow at
> runtime and (2) you have to figure out ways to reserve address space
> and back more or less of it with physical memory that will work on all
> of the platforms that we currently support or might want to support in
> the future.

FTR I'm aware of a working experimental prototype along these lines,
that will be presented in Vancouver:

https://www.pgevents.ca/events/pgconfdev2024/sessions/session/31-enhancing-postgresql-plasticity-new-frontiers-in-memory-management/




Re: automating RangeTblEntry node support

2024-02-16 Thread Paul Jungwirth

On 1/15/24 02:37, Peter Eisentraut wrote:
In this updated patch set, I have also added the treatment of the Constraint type.  (I also noted 
that the manual read/write functions for the Constraint type are out-of-sync again, so simplifying 
this would be really helpful.)  I have also added commit messages to each patch.


The way I have re-ordered the patch series now, I think patches 0001 through 0003 are candidates for 
inclusion after review, patch 0004 still needs a bit more analysis and testing, as described therein.


I had to apply the first patch by hand (on 9f13376396), so this looks due for a rebase. Patches 2-4 
applied fine.


Compiles & passes tests after each patch.

The overall idea seems like a good improvement to me.

A few remarks about cleaning up the RangeTblEntry comments:

After the fourth patch we have a "Fields valid in all RTEs" comment twice in the struct, once at the 
top and once at the bottom. It's fine IMO but maybe the second could be "More fields valid in all RTEs"?


The new order of fields in RangeTblEntry matches the intro comment, which seems like another small 
benefit.


It seems like we are moving away from ever putting RTEKind-specific fields into a union as suggested 
by the FIXME comment here. It was written in 2002. Is it time to remove it?


This now needs to say "above" not "below":

/*
 * join_using_alias is an alias clause attached directly to JOIN/USING. It
 * is different from the alias field (below) in that it does not hide the
 * range variables of the tables being joined.
 */
Alias  *join_using_alias pg_node_attr(query_jumble_ignore);

Re bloating the serialization output, we could leave this last patch until after the work on that 
other thread is done to skip default-valued items.


Yours,

--
Paul  ~{:-)
p...@illuminatedcomputing.com




Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

2024-02-16 Thread David G. Johnston
On Fri, Feb 16, 2024 at 1:16 PM Jim Jones  wrote:

> In case all columns of a record have been set to null due to data type
> incompatibility, should we insert it at all?


Yes.  In particular not all columns in the table need be specified in the
copy command so while the parsed input data is all nulls the record itself
may not be.

The system should allow the user to exclude rows with incomplete data by
ignoring a not null constraint violation.

In short we shouldn't judge non-usefulness and instead give tools to the
user to decide for themselves.

David J.


Re: System username in pg_stat_activity

2024-02-16 Thread Magnus Hagander
On Fri, Feb 16, 2024 at 9:20 PM Andres Freund  wrote:
>
> Hi,
>
> On 2024-02-16 20:57:59 +0100, Magnus Hagander wrote:
> > On Fri, Feb 16, 2024 at 8:41 PM Andres Freund  wrote:
> > > On 2024-01-10 12:46:34 +0100, Magnus Hagander wrote:
> > > > The attached patch adds a column "authuser" to pg_stat_activity which
> > > > contains the username of the externally authenticated user, being the
> > > > same value as the SYSTEM_USER keyword returns in a backend.
> > >
> > > I continue to think that it's a bad idea to make pg_stat_activity ever 
> > > wider
> > > with columns that do not actually describe properties that change across 
> > > the
> > > course of a session.  Yes, there's the argument that that ship has 
> > > sailed, but
> > > I don't think that's a good reason to continue ever further down that 
> > > road.
> > >
> > > It's not just a usability issue, it also makes it more expensive to query
> > > pg_stat_activity. This is of course more pronounced with textual columns 
> > > than
> > > with integer ones.
> >
> > That's a fair point, but I do think that has in most ways already sailed, 
> > yes.
> >
> > I mean, we could split it into more than one view. But adding a new
> > view for every new thing we want to show is also not very good from
> > either a usability or performance perspective.  So where would we put
> > it?
>
> I think we should group new properties that don't change over the course of a
> session ([1]) in a new view (e.g. pg_stat_session). I don't think we need one
> view per property, but I do think it makes sense to split information that
> changes very frequently (like most pg_stat_activity contents) from information
> that doesn't (like auth_method, auth_identity).

That would make sense in many ways, but ends up with "other level of
annoyances". E.g. the database name and oid don't change, but would we
want to move those out of pg_stat_activity? Same for username? Don't
we just end up in a grayzone about what belongs where?

Also - were you envisioning just another view, or actually replacing
the pg_stat_get_activity() part? As in where do you think the cost
comes from?

(And as to Tom's question about a key column - the pid column can surely
be that? We already do that for pg_stat_ssl and pg_stat_gssapi, which
are both driven from pg_stat_get_activity() but show a different set
of columns.)


> Additionally I think something like pg_stat_session could also contain
> per-session cumulative counters like the session's contribution to
> pg_stat_database.{idle_in_transaction_time,active_time}

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/




Re: Where can I find the doxyfile?

2024-02-16 Thread John Morris
>> I have found it very strange that a tool like doxygen which can create
>> all sorts of call graphs, just ignores some comments.  The comments
>> above function are very important.

I agree with you. I hated doxygen for decades because of the irritating 
annotations it required.  When I discovered IDEs were creating doxygen-like 
popups from conventional comments, I tried to figure out what they were doing. 
In the process, I discovered filters, and I created a filter to match what I 
thought the IDEs were doing.

(As it turns out, IDEs implement their own rendering independent of doxygen.)

I find it ironic I’ve gone from being a long-time hater of doxygen to being its 
advocate.


  *   John Morris


Tiny tidbit of history. Back in the 70’s, I created a comment extractor for the 
language Ratfor. We used it to maintain an alphabetical index of functions and 
to display pseudo-code. We drew our call graphs by hand and saved them in 
manila folders. Hard to imagine now.



Re: PGC_SIGHUP shared_buffers?

2024-02-16 Thread Heikki Linnakangas

On 16/02/2024 06:28, Robert Haas wrote:

3. Reserve lots of address space and then only use some of it. I hear
rumors that some forks of PG have implemented something like this. The
idea is that you convince the OS to give you a whole bunch of address
space, but you try to avoid having all of it be backed by physical
memory. If you later want to increase shared_buffers, you then get the
OS to back more of it by physical memory, and if you later want to
decrease shared_buffers, you hopefully have some way of giving the OS
the memory back. As compared with the previous two approaches, this
seems less likely to be noticeable to most PG code. Problems include
(1) you have to somehow figure out how much address space to reserve,
and that forms an upper bound on how big shared_buffers can grow at
runtime and (2) you have to figure out ways to reserve address space
and back more or less of it with physical memory that will work on all
of the platforms that we currently support or might want to support in
the future.


A variant of this approach:

5. Re-map the shared_buffers when needed.

Between transactions, a backend should not hold any buffer pins. When 
there are no pins, you can munmap() the shared_buffers and mmap() it at 
a different address.
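
For illustration, a minimal POSIX sketch of that remap step - assuming
the segment comes from shm_open(), so the contents live in the shared
memory object and survive the munmap()/mmap() cycle; error handling
omitted:

#include <stddef.h>
#include <sys/types.h>
#include <sys/mman.h>
#include <unistd.h>

static void *
remap_shared_buffers(int shm_fd, void *old_addr, size_t old_size,
                     size_t new_size)
{
    munmap(old_addr, old_size);
    ftruncate(shm_fd, (off_t) new_size);    /* grow or shrink the segment */

    /* map it again, letting the kernel pick a new address */
    return mmap(NULL, new_size, PROT_READ | PROT_WRITE,
                MAP_SHARED, shm_fd, 0);
}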


--
Heikki Linnakangas
Neon (https://neon.tech)





Re: System username in pg_stat_activity

2024-02-16 Thread Tom Lane
Magnus Hagander  writes:
> I mean, we could split it into more than one view. But adding a new
> view for every new thing we want to show is also not very good from
> either a usability or performance perspective.  So where would we put
> it?

It'd have to be a new view with a row per session, showing static
(or at least mostly static?) properties of the session.

Could we move some existing fields of pg_stat_activity into such a
view?  In any case, there'd have to be a key column to use to join
it to pg_stat_activity.

I'm not sure that this is worth the trouble TBH.  If it can be shown
that pulling a few fields out of pg_stat_activity actually does make
for a useful speedup, then maybe OK ... but Andres hasn't provided
any evidence that there's a measurable issue.

regards, tom lane




Re: System username in pg_stat_activity

2024-02-16 Thread Andres Freund
Hi,

On 2024-02-16 20:57:59 +0100, Magnus Hagander wrote:
> On Fri, Feb 16, 2024 at 8:41 PM Andres Freund  wrote:
> > On 2024-01-10 12:46:34 +0100, Magnus Hagander wrote:
> > > The attached patch adds a column "authuser" to pg_stat_activity which
> > > contains the username of the externally authenticated user, being the
> > > same value as the SYSTEM_USER keyword returns in a backend.
> >
> > I continue to think that it's a bad idea to make pg_stat_activity ever wider
> > with columns that do not actually describe properties that change across the
> > course of a session.  Yes, there's the argument that that ship has sailed, 
> > but
> > I don't think that's a good reason to continue ever further down that road.
> >
> > It's not just a usability issue, it also makes it more expensive to query
> > pg_stat_activity. This is of course more pronounced with textual columns 
> > than
> > with integer ones.
> 
> That's a fair point, but I do think that has in most ways already sailed, yes.
> 
> I mean, we could split it into more than one view. But adding a new
> view for every new thing we want to show is also not very good from
> either a usability or performance perspective.  So where would we put
> it?

I think we should group new properties that don't change over the course of a
session ([1]) in a new view (e.g. pg_stat_session). I don't think we need one
view per property, but I do think it makes sense to split information that
changes very frequently (like most pg_stat_activity contents) from information
that doesn't (like auth_method, auth_identity).

Greetings,

Andres Freund

[1]

Additionally I think something like pg_stat_session could also contain
per-session cumulative counters like the session's contribution to
pg_stat_database.{idle_in_transaction_time,active_time}




Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

2024-02-16 Thread Jim Jones
Hi!

On 12.02.24 01:00, jian he wrote:
> attached v2.
> syntax: `on_error set_to_null`
> based on upthread discussion, now if you specified `on_error
> set_to_null` and your column has `not
> null` constraint, we convert the error field to null, so it may error
> while bulk inserting for violating NOT NULL constraint.
That's a very nice feature. Thanks for implementing it!

v2 applies cleanly and works as described.

\pset null '(NULL)'

CREATE TEMPORARY TABLE t1 (a int, b int);
COPY t1 (a,b) FROM STDIN;
1    a
2    1
3    2
4    b
a    c
\.
SELECT * FROM t1;

ERROR:  invalid input syntax for type integer: "a"
CONTEXT:  COPY t1, line 1, column b: "a"
 a | b
---+---
(0 rows)


CREATE TEMPORARY TABLE t2 (a int, b int);
COPY t2 (a,b) FROM STDIN WITH (on_error set_to_null);
1    a
2    1
3    2
4    b
a    c
\.
SELECT * FROM t2;

psql:test-copy-on_error-2.sql:12: NOTICE:  some columns of 3 rows, value
were converted to NULL due to data type incompatibility
COPY 5
   a    |   b
--------+--------
      1 | (NULL)
      2 |      1
      3 |      2
      4 | (NULL)
 (NULL) | (NULL)
(5 rows)


I have one question though:

In case all columns of a record have been set to null due to data type
incompatibility, should we insert it at all? See t2 example above.
I'm not sure if these records would be of any use in the table. What do
you think?

Since the parameter is already called "set_to_null", maybe it is not
necessary to mention in the NOTICE message that the values have been set
to null.
Perhaps something like "XX records were only partially copied due to
data type incompatibility"


-- 
Jim





Re: Extend pgbench partitioning to pgbench_history

2024-02-16 Thread Melanie Plageman
On Fri, Feb 16, 2024 at 12:50 PM Tomas Vondra
 wrote:
>
> Hello Gabriele,
>
> I think the improvement makes sense (it's indeed a bit strange to not
> partition the history table), and the patch looks good.
>
> I did think about whether this should be optional in some way - that is,
> separate from partitioning the accounts table, and users would have to
> explicitly enable (or disable) it. But I don't think we need to do that.
>
> The vast majority of users simply want to partition everything. And this
> is just one way to partition the database anyway, it's our opinion on
> how to do that, but there's many other options how we might partition
> the tables, and we don't (and don't want to) have options for that.

I wonder how common it would be to partition a history table by
account ID? I sort of imagined the most common kind of partitioning
for an audit table is by time (range). Anyway, I'm not objecting to
doing it by account ID, just asking if there is a reason to do so.

Speaking of which, Tomas said users might want to "partition
everything" -- so any reason not to also partition tellers and
branches?

This change to the docs seems a bit misleading:

   

-Create a partitioned pgbench_accounts table with
-NUM partitions of nearly equal size for
-the scaled number of accounts.
+Create partitioned pgbench_accounts and
pgbench_history
+tables with NUM partitions of
nearly equal size for
+the scaled number of accounts and future history records.
 Default is 0, meaning no partitioning.

   

It says that partitions of "future history records" will be equal in
size. While it's true that at the end of a pgbench run, if you use a
random distribution for aid, the pgbench_history partitions should be
roughly equally sized, it is confusing to say it will "create
pgbench_history with partitions of equal size". Maybe it would be
better to write a new sentence about partitioning pgbench_history
without worrying about mirroring the sentence structure of the
existing sentence.

> The only case that I can think of where this might matter is when
> running a benchmarks that will be compared to some earlier results
> (executed using an older pgbench version). That will be affected by
> this, but I don't think we make many promises about compatibility in
> this regard ... it's probably better to always compare results only from
> the same pgbench version, I guess.

As a frequent pgbench user, I always use the same pgbench version even
when comparing different versions of Postgres. Other changes have made
it difficult to compare results across pgbench versions without
providing it as an option (see 06ba4a63b85e). So, I don't think it is
a problem if it is noted in release notes.

- Melanie




Re: glibc qsort() vulnerability

2024-02-16 Thread Nathan Bossart
On Fri, Feb 16, 2024 at 01:45:52PM +0100, Mats Kindahl wrote:
> Looks good to me.

Committed.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: System username in pg_stat_activity

2024-02-16 Thread Magnus Hagander
On Fri, Feb 16, 2024 at 8:41 PM Andres Freund  wrote:
>
> Hi,
>
> On 2024-01-10 12:46:34 +0100, Magnus Hagander wrote:
> > The attached patch adds a column "authuser" to pg_stat_activity which
> > contains the username of the externally authenticated user, being the
> > same value as the SYSTEM_USER keyword returns in a backend.
>
> I continue to think that it's a bad idea to make pg_stat_activity ever wider
> with columns that do not actually describe properties that change across the
> course of a session.  Yes, there's the argument that that ship has sailed, but
> I don't think that's a good reason to continue ever further down that road.
>
> It's not just a usability issue, it also makes it more expensive to query
> pg_stat_activity. This is of course more pronounced with textual columns than
> with integer ones.

That's a fair point, but I do think that has in most ways already sailed, yes.

I mean, we could split it into more than one view. But adding a new
view for every new thing we want to show is also not very good from
either a usability or performance perspective.  So where would we put
it?

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/




Re: System username in pg_stat_activity

2024-02-16 Thread Andres Freund
Hi,

On 2024-01-12 17:16:53 +0100, Magnus Hagander wrote:
> On Thu, Jan 11, 2024 at 5:55 PM Bertrand Drouvot
>  wrote:
> > On Thu, Jan 11, 2024 at 02:24:58PM +0100, Magnus Hagander wrote:
> > > On Wed, Jan 10, 2024 at 3:12 PM Bertrand Drouvot
> > >  wrote:
> > > >
> > > > If we go the 2 fields way, then what about auth_identity and 
> > > > auth_method then?
> > >
> > >
> > > Here is an updated patch based on this idea.
> >
> > Thanks!
> >
> > + 
> > +  
> > +   auth_method text
> > +  
> > +  
> > +   The authentication method used for authenticating the connection, or
> > +   NULL for background processes.
> > +  
> >
> > I'm wondering if it would make sense to populate it for parallel workers 
> > too.
> > I think it's doable thanks to d951052, but I'm not sure it's worth it (one 
> > could
> > join based on the leader_pid though). OTOH that would be consistent with
> > how the SYSTEM_USER behaves with parallel workers (it's populated).
> 
> I guess one could conceptually argue that "authentication happens in
> the leader". But we do populate it with the other user records, and
> it'd be weird if this one was excluded.
> 
> The tricky thing is that pgstat_bestart() is called long before we
> deserialize the data. But from what I can tell it should be safe to
> change it per the attached? That should be AFAICT an extremely short
> window of time longer before we report it, not enough to matter.

I don't like that one bit. The whole subsystem initialization dance already is
quite complicated, particularly for pgstat, we shouldn't make it more
complicated. Especially not when the initialization is moved quite a bit away
from all the other calls.

Besides just that, I also don't think delaying visibility of the worker in
pg_stat_activity until parallel worker initialization has completed is good,
that's not all cheap work.


Maybe I am missing something, but why aren't we just getting the value from
the leader's entry, instead of copying it?

Greetings,

Andres Freund




Re: System username in pg_stat_activity

2024-02-16 Thread Andres Freund
Hi,

On 2024-01-10 12:46:34 +0100, Magnus Hagander wrote:
> The attached patch adds a column "authuser" to pg_stat_activity which
> contains the username of the externally authenticated user, being the
> same value as the SYSTEM_USER keyword returns in a backend.

I continue to think that it's a bad idea to make pg_stat_activity ever wider
with columns that do not actually describe properties that change across the
course of a session.  Yes, there's the argument that that ship has sailed, but
I don't think that's a good reason to continue ever further down that road.

It's not just a usability issue, it also makes it more expensive to query
pg_stat_activity. This is of course more pronounced with textual columns than
with integer ones.

Greetings,

Andres Freund




Re: System username in pg_stat_activity

2024-02-16 Thread Magnus Hagander
On Fri, Jan 19, 2024 at 1:43 PM Julien Rouhaud  wrote:
>
> Hi,
>
> On Thu, Jan 18, 2024 at 11:01 PM Magnus Hagander  wrote:
> >
> > I did. Here it is, and also including that suggested docs fix as well
> > as a rebase on current master.
>
> +if (MyClientConnectionInfo.authn_id)
> +strlcpy(lbeentry.st_auth_identity,
> MyClientConnectionInfo.authn_id, NAMEDATALEN);
> +else
> +MemSet(_auth_identity, 0,
> sizeof(lbeentry.st_auth_identity));
>
> Should we use pg_mbcliplen() here?  I don't think there's any strong
> guarantee that no multibyte character can be used.  I also agree with
> the nearby comment about MemSet being overkill.

Hm. Good question. I don't think there is such a guarantee, no. So
something like attached v5?
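
(For the archives, the clipping I have in mind is along these lines --
sketch only, the attached patch is what counts:)

    if (MyClientConnectionInfo.authn_id)
    {
        /* clip at a character boundary, not mid-multibyte-character */
        int    len = pg_mbcliplen(MyClientConnectionInfo.authn_id,
                                  strlen(MyClientConnectionInfo.authn_id),
                                  NAMEDATALEN - 1);

        memcpy(lbeentry.st_auth_identity,
               MyClientConnectionInfo.authn_id, len);
        lbeentry.st_auth_identity[len] = '\0';
    }
    else
        lbeentry.st_auth_identity[0] = '\0';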

Also, wouldn't that problem already exist a few lines down for the SSL parts?

> +   value as the identity part in , or NULL
> I was looking at
> https://www.postgresql.org/docs/current/auth-username-maps.html and
> noticed that this page is switching between system-user and
> system-username.  Should we clean that up while at it?

Seems like something we should clean up yes, but not as part of this patch.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index cf3de80394..9403df5886 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -23891,7 +23891,7 @@ SELECT * FROM pg_ls_dir('.') WITH ORDINALITY AS t(ls,n);
 
   

-
+
  system_user
 
 system_user
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 5cf9363ac8..51ed656b17 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -784,6 +784,28 @@ postgres   27093  0.0  0.0  30096  2752 ?Ss   11:34   0:00 postgres: ser
   
  
 
+ 
+  
+   auth_method text
+  
+  
+   The authentication method used for authenticating the connection, or
+   NULL for background processes.
+  
+ 
+
+ 
+  
+   auth_identity text
+  
+  
+   The identity (if any) that the user presented during the authentication
+   cycle before they were assigned a database role.  Contains the same
+   value as the identity part in , or NULL
+   for background processes.
+  
+ 
+
  
   
application_name text
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 849a03e4b6..269ab4e586 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -1529,6 +1529,9 @@ ParallelWorkerMain(Datum main_arg)
 	/* Attach to the leader's serializable transaction, if SERIALIZABLE. */
 	AttachSerializableXact(fps->serializable_xact_handle);
 
+	/* Report to the stats system that we are started */
+	pgstat_bestart();
+
 	/*
 	 * We've initialized all of our state now; nothing should change
 	 * hereafter.
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 04227a72d1..bb1060fc66 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -873,6 +873,8 @@ CREATE VIEW pg_stat_activity AS
 S.leader_pid,
 S.usesysid,
 U.rolname AS usename,
+S.auth_method,
+S.auth_identity,
 S.application_name,
 S.client_addr,
 S.client_hostname,
diff --git a/src/backend/libpq/auth.c b/src/backend/libpq/auth.c
index 9bbdc4beb0..6621969154 100644
--- a/src/backend/libpq/auth.c
+++ b/src/backend/libpq/auth.c
@@ -625,9 +625,20 @@ ClientAuthentication(Port *port)
 			status = CheckRADIUSAuth(port);
 			break;
 		case uaCert:
-			/* uaCert will be treated as if clientcert=verify-full (uaTrust) */
+
+			/*
+			 * uaCert will be treated as if clientcert=verify-full further
+			 * down
+			 */
+			break;
 		case uaTrust:
 			status = STATUS_OK;
+
+			/*
+			 * Trust doesn't set_authn_id(), but we still need to store the
+			 * auth_method
+			 */
+			MyClientConnectionInfo.auth_method = uaTrust;
 			break;
 	}
 
@@ -645,6 +656,12 @@ ClientAuthentication(Port *port)
 #endif
 	}
 
+	/*
+	 * All auth methods should have either called set_authn_id() or manually
+	 * set the auth_method if they were successful.
+	 */
+	Assert(status != STATUS_OK || MyClientConnectionInfo.auth_method != 0);
+
 	if (Log_connections && status == STATUS_OK &&
 		!MyClientConnectionInfo.authn_id)
 	{
diff --git a/src/backend/utils/activity/backend_status.c b/src/backend/utils/activity/backend_status.c
index 1a1050c8da..119319e85a 100644
--- a/src/backend/utils/activity/backend_status.c
+++ b/src/backend/utils/activity/backend_status.c
@@ -357,6 +357,17 @@ pgstat_bestart(void)
 	else
 		MemSet(_clientaddr, 0, sizeof(lbeentry.st_clientaddr));
 
+	lbeentry.st_auth_method = MyClientConnectionInfo.auth_method;
+	if (MyClientConnectionInfo.authn_id)

Re: Improving EXPLAIN's display of SubPlan nodes

2024-02-16 Thread Tom Lane
I wrote:
> So now I'm thinking that we do have enough detail in the present
> proposal, and we just need to think about whether there's some
> nicer way to present it than the particular spelling I used here.

Here's a rebase over 9f1337639 --- no code changes, but this affects
some of the new or changed expected outputs from that commit.

regards, tom lane

From f84343864e74501df627f986e326f766072398cd Mon Sep 17 00:00:00 2001
From: Tom Lane 
Date: Fri, 16 Feb 2024 14:32:23 -0500
Subject: [PATCH v2] Improve EXPLAIN's display of SubPlan nodes.

Represent the SubLinkType as best we can, and show the testexpr
where relevant.  To aid in interpreting testexprs, add the output
parameter IDs to the subplan name for subplans as well as initplans.

Discussion: https://postgr.es/m/2838538.1705692...@sss.pgh.pa.us
---
 .../postgres_fdw/expected/postgres_fdw.out|  16 +-
 src/backend/optimizer/plan/subselect.c|  30 ++--
 src/backend/utils/adt/ruleutils.c |  45 -
 src/test/regress/expected/aggregates.out  |   2 +-
 src/test/regress/expected/insert_conflict.out |   2 +-
 src/test/regress/expected/join.out|  24 +--
 src/test/regress/expected/memoize.out |   2 +-
 src/test/regress/expected/rowsecurity.out |  56 +++---
 src/test/regress/expected/select_parallel.out |  22 +--
 src/test/regress/expected/subselect.out   | 159 --
 src/test/regress/expected/updatable_views.out |  24 +--
 src/test/regress/sql/subselect.sql|  14 ++
 12 files changed, 255 insertions(+), 141 deletions(-)

diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index c355e8f3f7..bf067738e3 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -3264,14 +3264,14 @@ select sum(c1) filter (where (c1 / c1) * random() <= 1) from ft1 group by c2 ord
 
 explain (verbose, costs off)
 select sum(c2) filter (where c2 in (select c2 from ft1 where c2 < 5)) from ft1;
-QUERY PLAN 

+ QUERY PLAN  
+-
  Aggregate
-   Output: sum(ft1.c2) FILTER (WHERE (hashed SubPlan 1))
+   Output: sum(ft1.c2) FILTER (WHERE (ANY (ft1.c2 = $0) FROM HASHED SubPlan 1 (returns $0)))
->  Foreign Scan on public.ft1
  Output: ft1.c2
  Remote SQL: SELECT c2 FROM "S 1"."T 1"
-   SubPlan 1
+   SubPlan 1 (returns $0)
  ->  Foreign Scan on public.ft1 ft1_1
Output: ft1_1.c2
Remote SQL: SELECT c2 FROM "S 1"."T 1" WHERE ((c2 < 5))
@@ -11895,12 +11895,12 @@ CREATE FOREIGN TABLE foreign_tbl2 () INHERITS (foreign_tbl)
   SERVER loopback OPTIONS (table_name 'base_tbl');
 EXPLAIN (VERBOSE, COSTS OFF)
 SELECT a FROM base_tbl WHERE (a, random() > 0) IN (SELECT a, random() > 0 FROM foreign_tbl);
- QUERY PLAN  
--
+QUERY PLAN
+--
  Seq Scan on public.base_tbl
Output: base_tbl.a
-   Filter: (SubPlan 1)
-   SubPlan 1
+   Filter: (ANY ((base_tbl.a = $1) AND ((random() > '0'::double precision) = $2)) FROM SubPlan 1 (returns $1,$2))
+   SubPlan 1 (returns $1,$2)
  ->  Result
Output: base_tbl.a, (random() > '0'::double precision)
->  Append
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index 47e14723d2..d5919a9164 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -326,6 +326,7 @@ build_subplan(PlannerInfo *root, Plan *plan, PlannerInfo *subroot,
 	Node	   *result;
 	SubPlan*splan;
 	bool		isInitPlan;
+	StringInfoData splanname;
 	ListCell   *lc;
 
 	/*
@@ -560,22 +561,31 @@ build_subplan(PlannerInfo *root, Plan *plan, PlannerInfo *subroot,
    splan->plan_id);
 
 	/* Label the subplan for EXPLAIN purposes */
-	splan->plan_name = palloc(32 + 12 * list_length(splan->setParam));
-	sprintf(splan->plan_name, "%s %d",
-			isInitPlan ? "InitPlan" : "SubPlan",
-			splan->plan_id);
+	initStringInfo();
+	appendStringInfo(, "%s %d",
+	 isInitPlan ? "InitPlan" : "SubPlan",
+	 splan->plan_id);
 	if (splan->setParam)
 	{
-		char	   *ptr = splan->plan_name + strlen(splan->plan_name);
-
-		ptr += sprintf(ptr, " (returns ");
+		appendStringInfoString(, " (returns ");
 		foreach(lc, splan->setParam)
 		{
-			ptr += sprintf(ptr, "$%d%s",
-		   lfirst_int(lc),
-		 

Re: System username in pg_stat_activity

2024-02-16 Thread Magnus Hagander
On Fri, Jan 19, 2024 at 12:33 PM Aleksander Alekseev
 wrote:
>
> Hi,
>
> > > Did you forget to share the new revision (aka v4)? I can only see the
> > > "reorder_parallel_worker_bestart.patch" attached.
> >
> > I did. Here it is, and also including that suggested docs fix as well
> > as a rebase on current master.
>
> ```
> +lbeentry.st_auth_method = MyClientConnectionInfo.auth_method;
> +if (MyClientConnectionInfo.authn_id)
> +strlcpy(lbeentry.st_auth_identity,
> MyClientConnectionInfo.authn_id, NAMEDATALEN);
> +else
> +MemSet(_auth_identity, 0,
> sizeof(lbeentry.st_auth_identity));
> ```
>
> I believe using sizeof(lbeentry.st_auth_identity) instead of
> NAMEDATALEN is generally considered a better practice.

We use the NAMEDATALEN method in the rest of the function, so I did it
the same way for consistency. I think if we want to change that, we
should change the whole function at once to keep it consistent.


> Calling MemSet for a CString seems to be an overkill. I suggest
> setting lbeentry.st_auth_identity[0] to zero.

Fair enough. Will make that change.

//Magnus




Re: SQL Property Graph Queries (SQL/PGQ)

2024-02-16 Thread Andres Freund
Hi,

On 2024-02-16 15:53:11 +0100, Peter Eisentraut wrote:
> The patch is quite fragile, and treading outside the tested paths will
> likely lead to grave misbehavior.  Use with caution.  But I feel that
> the general structure is ok, and we just need to fill in the
> proverbial few thousand lines of code in the designated areas.

One aspect that I'm concerned with structurally is that the transformation,
from property graph queries to something postgres understands, is done via the
rewrite system. I doubt that that is a good idea. For one it bars the planner
from making plans that benefit from the graph query formulation. But more
importantly, we IMO should reduce usage of the rewrite system, not increase
it.

Greetings,

Andres Freund




Re: System username in pg_stat_activity

2024-02-16 Thread Magnus Hagander
On Fri, Jan 19, 2024 at 7:20 AM Bertrand Drouvot
 wrote:
>
> Hi,
>
> On Thu, Jan 18, 2024 at 04:01:33PM +0100, Magnus Hagander wrote:
> > On Mon, Jan 15, 2024 at 11:17 AM Bertrand Drouvot
> > > Did you forget to share the new revision (aka v4)? I can only see the
> > > "reorder_parallel_worker_bestart.patch" attached.
> >
> > I did. Here it is, and also including that suggested docs fix as well
> > as a rebase on current master.
> >
>
> Thanks!
>
> Just a few comments:
>
> 1 ===
>
> +   The authentication method used for authenticating the connection, or
> +   NULL for background processes.
>
> What about? "NULL for background processes (except for parallel workers which
> inherit this information from their leader process)"

Ugh. That doesn't read very well at all to me. Maybe just "NULL for
background processes without a user"?


> 2 ===
>
> +   cycle before they were assigned a database role.  Contains the same
> +   value as the identity part in , or NULL
> +   for background processes.
>
> Same comment about parallel workers.
>
> 3 ===
>
> +# pg_stat_activity should contain trust and empty string for trust auth
> +$res = $node->safe_psql(
> +   'postgres',
> +   "SELECT auth_method, auth_identity='' FROM pg_stat_activity WHERE 
> pid=pg_backend_pid()",
> +   connstr => "user=scram_role");
> +is($res, 'trust|t', 'Users with trust authentication should show correctly 
> in pg_stat_activity');
> +
> +# pg_stat_activity should contain NULL for auth of background processes
> +# (test is a bit out of place here, but this is where the other 
> pg_stat_activity.auth* tests are)
> +$res = $node->safe_psql(
> +   'postgres',
> +   "SELECT auth_method IS NULL, auth_identity IS NULL FROM 
> pg_stat_activity WHERE backend_type='checkpointer'",
> +);
> +is($res, 't|t', 'Background processes should show NULL for auth in 
> pg_stat_activity');
>
> What do you think about testing the parallel workers cases too? (I'm not 100%
> sure it's worth the extra complexity though).

I'm leaning towards not worth that.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/




Re: PGC_SIGHUP shared_buffers?

2024-02-16 Thread Andres Freund
Hi,

On 2024-02-16 09:58:43 +0530, Robert Haas wrote:
> I remember Magnus making a comment many years ago to the effect that
> every setting that is PGC_POSTMASTER is a bug, but some of those bugs
> are very difficult to fix. Perhaps the use of the word bug is
> arguable, but I think the sentiment is apt, especially with regard to
> shared_buffers. Changing without a server restart would be really
> nice, but it's hard to figure out how to do it. I can think of a few
> basic approaches, and I'd like to know (a) which ones people think are
> good and which ones people think suck (maybe they all suck) and (b) if
> anybody's got any other ideas not mentioned here.

IMO the ability to *shrink* shared_buffers dynamically and cheaply is more
important than growing it in a way, except that they are related of
course. Idling hardware is expensive, thus overcommitting hardware is very
attractive (I count "serverless" as part of that). To be able to overcommit
effectively, unused long-lived memory has to be released. I.e. shared buffers
needs to be shrinkable.



Perhaps worth noting that there are two things limiting the size of shared
buffers: 1) the available buffer space 2) the available buffer *mapping*
space. I think making the buffer mapping resizable is considerably harder than
the buffers themselves. Of course pre-reserving memory for a buffer mapping
suitable for a huge shared_buffers is more feasible than pre-allocating all
that memory for the buffers themselves. But it'd still mean you'd have a maximum
set at server start.


> 1. Complicate the Buffer->pointer mapping. Right now, BufferGetBlock()
> is basically just BufferBlocks + (buffer - 1) * BLCKSZ, which means
> that we're expecting to find all of the buffers in a single giant
> array. Years ago, somebody proposed changing the implementation to
> essentially WhereIsTheBuffer[buffer], which was heavily criticized on
> performance grounds, because it requires an extra memory access. A
> gentler version of this might be something like
> WhereIsTheChunkOfBuffers[buffer/CHUNK_SIZE]+(buffer%CHUNK_SIZE)*BLCKSZ;
> i.e. instead of allowing every single buffer to be at some random
> address, manage chunks of the buffer pool. This makes the lookup array
> potentially quite a lot smaller, which might mitigate performance
> concerns. For example, if you had one chunk per GB of shared_buffers,
> your mapping array would need only a handful of cache lines, or a few
> handfuls on really big systems.

Such a scheme still leaves you with a dependent memory read for a quite
frequent operation. It could turn out to not matter hugely if the mapping
array is cache resident, but I don't know if we can realistically bank on
that.

I'm also somewhat concerned about the coarse granularity being problematic. It
seems like it'd lead to a desire to make the granule small, causing slowness.
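
For concreteness, the lookup being discussed is roughly the following
(hypothetical names and sizes, not a real proposal):

#include <stddef.h>
#include <stdint.h>

#define BLCKSZ        8192
#define CHUNK_BUFFERS ((1024 * 1024 * 1024) / BLCKSZ)   /* one 1GB chunk */
#define MAX_CHUNKS    1024                              /* up to 1TB */

typedef int32_t Buffer;

static char *chunk_base[MAX_CHUNKS];    /* one base pointer per chunk */

static inline char *
buffer_get_block(Buffer buffer)
{
    int32_t idx = buffer - 1;           /* Buffers are 1-based */

    return chunk_base[idx / CHUNK_BUFFERS] +
        (size_t) (idx % CHUNK_BUFFERS) * BLCKSZ;
}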


One big advantage of a scheme like this is that it'd be a step towards a NUMA
aware buffer mapping and replacement. Practically everything beyond the size
of a small consumer device these days has NUMA characteristics, even if not
"officially visible". We could make clock sweeps (or a better victim buffer
selection algorithm) happen within each "chunk", with some additional
infrastructure to choose which of the chunks to search a buffer in. Using a
chunk on the current numa node, except when there is a lot of imbalance
between buffer usage or replacement rate between chunks.



> 2. Make a Buffer just a disguised pointer. Imagine something like
> typedef struct { Page bp; } *buffer. With this approach,
> BufferGetBlock() becomes trivial.

You also additionally need something that allows for efficient iteration over
all shared buffers. Making buffer replacement and checkpointing more expensive
isn't great.


> 3. Reserve lots of address space and then only use some of it. I hear
> rumors that some forks of PG have implemented something like this. The
> idea is that you convince the OS to give you a whole bunch of address
> space, but you try to avoid having all of it be backed by physical
> memory. If you later want to increase shared_buffers, you then get the
> OS to back more of it by physical memory, and if you later want to
> decrease shared_buffers, you hopefully have some way of giving the OS
> the memory back. As compared with the previous two approaches, this
> seems less likely to be noticeable to most PG code.

Another advantage is that you can shrink shared buffers fairly granularly and
cheaply with that approach, compared to having to move buffers entirely out of
a larger mapping to be able to unmap it.
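
Roughly, on POSIX - using a private mapping for simplicity; a real
shared_buffers would need a shared mapping and a different way of
returning pages to the OS, so treat this as a sketch of the
address-space handling only:

#include <stddef.h>
#include <sys/mman.h>

/* reserve address space up front; no physical memory is committed */
static void *
reserve_region(size_t max_size)
{
    return mmap(NULL, max_size, PROT_NONE,
                MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
}

/* make [offset, offset + len) usable; pages fault in on demand */
static void
back_region(void *base, size_t offset, size_t len)
{
    mprotect((char *) base + offset, len, PROT_READ | PROT_WRITE);
}

/* drop the pages and make the range inaccessible again */
static void
release_region(void *base, size_t offset, size_t len)
{
    madvise((char *) base + offset, len, MADV_DONTNEED);
    mprotect((char *) base + offset, len, PROT_NONE);
}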


> Problems include (1) you have to somehow figure out how much address space
> to reserve, and that forms an upper bound on how big shared_buffers can grow
> at runtime and

Presumably you'd normally not want to reserve more than the physical amount of
memory on the system. Sure, memory can be hot added, but IME that's quite
rare.


> (2) you have to figure out ways to reserve 

Re: table inheritance versus column compression and storage settings

2024-02-16 Thread Tom Lane
I wrote:
> I find it surprising that the committed patch does not touch
> pg_dump.  Is it really true that pg_dump dumps situations with
> differing compression/storage settings accurately already?

It's worse than I thought.  Run "make installcheck" with
today's HEAD, then:

$ pg_dump -Fc regression >r.dump
$ createdb r2
$ pg_restore -d r2 r.dump 
pg_restore: error: could not execute query: ERROR:  column "a" inherits 
conflicting storage methods
HINT:  To resolve the conflict, specify a storage method explicitly.
Command was: CREATE TABLE public.stchild4 (
a text
)
INHERITS (public.stparent1, public.stparent2);
ALTER TABLE ONLY public.stchild4 ALTER COLUMN a SET STORAGE MAIN;


pg_restore: error: could not execute query: ERROR:  relation "public.stchild4" 
does not exist
Command was: ALTER TABLE public.stchild4 OWNER TO postgres;

pg_restore: error: could not execute query: ERROR:  relation "public.stchild4" 
does not exist
Command was: COPY public.stchild4 (a) FROM stdin;
pg_restore: warning: errors ignored on restore: 3


What I'd intended to compare was the results of the query added to the
regression tests:

regression=# SELECT attrelid::regclass, attname, attstorage FROM pg_attribute
WHERE (attrelid::regclass::name like 'stparent%'
OR attrelid::regclass::name like 'stchild%')
and attname = 'a'
ORDER BY 1, 2;
 attrelid  | attname | attstorage 
---+-+
 stparent1 | a   | p
 stparent2 | a   | x
 stchild1  | a   | p
 stchild3  | a   | m
 stchild4  | a   | m
 stchild5  | a   | x
 stchild6  | a   | m
(7 rows)

r2=# SELECT attrelid::regclass, attname, attstorage FROM pg_attribute
WHERE (attrelid::regclass::name like 'stparent%'
OR attrelid::regclass::name like 'stchild%')
and attname = 'a'
ORDER BY 1, 2;
 attrelid  | attname | attstorage 
---+-+
 stparent1 | a   | p
 stchild1  | a   | p
 stchild3  | a   | m
 stparent2 | a   | x
 stchild5  | a   | p
 stchild6  | a   | m
(6 rows)

So not only does stchild4 fail to restore altogether, but stchild5
ends with the wrong attstorage.

This patch definitely needs more work.

regards, tom lane




Re: Add pg_basetype() function to obtain a DOMAIN base type

2024-02-16 Thread Tomas Vondra
Hi,

On 1/2/24 01:00, jian he wrote:
> On Mon, Dec 4, 2023 at 5:11 PM John Naylor  wrote:
>>
>> On Thu, Sep 28, 2023 at 12:22 AM Alexander Korotkov
>>  wrote:
>>> The one thing triggering my perfectionism is that the patch does two
>>> syscache lookups instead of one.
>>
>> For an admin function used interactively, I'm not sure why that
>> matters? Or do you see another use case?
> 
> I did a minor refactor based on v1-0001.
> I think pg_basetype should stay at "9.26.4. System Catalog Information
> Functions".
> So I placed it before pg_char_to_encoding.
> Now functions listed on "Table 9.73. System Catalog Information
> Functions" will look like alphabetical ordering.
> I slightly changed the src/include/catalog/pg_proc.dat.
> now it looks like very similar to pg_typeof
> 
> src6=# \df pg_typeof
>List of functions
>Schema   |   Name| Result data type | Argument data types | Type
> +---+--+-+--
>  pg_catalog | pg_typeof | regtype  | "any"   | func
> (1 row)
> 
> src6=# \df pg_basetype
> List of functions
>Schema   |Name | Result data type | Argument data types | Type
> +-+--+-+--
>  pg_catalog | pg_basetype | regtype  | "any"   | func
> (1 row)
> 
> v2-0001 is as is in the first email thread, 0002 is my changes based on 
> v2-0001.


I think the patch(es) look reasonable, so just a couple minor comments.

1) We already have pg_typeof() function, so maybe we should use a
similar naming convention pg_basetypeof()?

2) I was going to suggest using "any" argument, just like pg_typeof, but
I see 0002 patch already does that. Thanks!

3) I think the docs probably need some formatting - wrapping lines (to
make it consistent with the nearby stuff) and similar stuff.


Other than that it looks fine to me. It's a simple patch, so if we can
agree on the naming I'll get it cleaned up and pushed.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Extend pgbench partitioning to pgbench_history

2024-02-16 Thread Tomas Vondra
Hello Gabriele,

I think the improvement makes sense (it's indeed a bit strange to not
partition the history table), and the patch looks good.

I did think about whether this should be optional in some way - that is,
separate from partitioning the accounts table, and users would have to
explicitly enable (or disable) it. But I don't think we need to do that.

The vast majority of users simply want to partition everything. And this
is just one way to partition the database anyway, it's our opinion on
how to do that, but there's many other options how we might partition
the tables, and we don't (and don't want to) have options for that.

The only case that I can think of where this might matter is when
running a benchmarks that will be compared to some earlier results
(executed using an older pgbench version). That will be affected by
this, but I don't think we make many promises about compatibility in
this regard ... it's probably better to always compare results only from
the same pgbench version, I guess.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: why there is not VACUUM FULL CONCURRENTLY?

2024-02-16 Thread Antonin Houska
Alvaro Herrera  wrote:

> This is great to hear.
> 
> On 2024-Jan-31, Antonin Houska wrote:
> 
> > Is your plan to work on it soon or should I try to write a draft patch? (I
> > assume this is for PG >= 18.)
> 
> I don't have plans for it, so if you have resources, please go for it.

ok, I'm thinking about how the feature can be integrated into the core.

BTW, I'm failing to understand why cluster_rel() has no argument of the
BufferAccessStrategy type. According to buffer/README, the criterion for using
a specific strategy is that a page "is unlikely to be needed again
soon". Specifically for cluster_rel(), the page will *definitely* not be used
again (unless the VACUUM FULL/CLUSTER command fails): BufferTag contains the
relation file number and the old relation file is eventually dropped.

Am I missing anything?

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com




Re: Improve WALRead() to suck data directly from WAL buffers when possible

2024-02-16 Thread Jeff Davis
On Fri, 2024-02-16 at 13:08 +0530, Bharath Rupireddy wrote:
> I'd suggest we strike a balance here - error out in assert builds if
> startptr+count is past the current insert position and trust the
> callers for production builds.

It's not reasonable to have divergent behavior between assert-enabled
builds and production. I think for now I will just commit the Assert as
Andres suggested until we work out a few more details.

One idea is to use Álvaro's work to eliminate the spinlock, and then
add a variable to represent the last known point returned by
WaitXLogInsertionsToFinish(). Then we can cheaply Assert that the
caller requested something before that point.
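
Something like this (the variable name is invented):

    /* hypothetical, advanced inside WaitXLogInsertionsToFinish() */
    Assert(startptr + count <= LastKnownInsertionsFinishedUpTo);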

> Here, I'm with v23 patch set:

Thank you, I'll look at these.

Regards,
Jeff Davis





Re: Psql meta-command conninfo+

2024-02-16 Thread Nathan Bossart
Thanks for your work on this.  I haven't been keeping up with the
discussion, but I took a quick look at the latest patch.

+
+"Database", "Authenticated User", "System User" (only for PostgreSQL 
16 or higher),
+"Current User", "Session User", "Backend PID", "Server Address", 
"Server Port",
+"Client Address", "Client Port", "Socket Directory", and "Host" 
columns are listed
+by default when \conninfo+ is invoked. The columns 
"Encryption",
+"Protocol", "Cipher", and "Compression" are added to this output when 
TLS (SSL)
authentication is used. The same applies when GSS authentication is 
used, where the
+"GSSAPI" column is also added to the \conninfo+ 
output.
 

I might be alone on this, but I think this command should output the same
columns regardless of the version, whether it's using SSL, etc. and just
put NULL in any that do not apply.  IMHO that would simplify the code and
help prevent confusion.  Plus, I'm not aware of any existing meta-commands
that provide certain columns conditionally.

+   if (PQsslInUse(pset.db))
+   {
+   protocol = PQsslAttribute(pset.db, "protocol");
+   cipher = PQsslAttribute(pset.db, "cipher");
+   compression = PQsslAttribute(pset.db, "compression");
+   appendPQExpBuffer(,
+ "  ,'SSL' AS 
\"Encryption\",\n"
+ "  '%s' AS \"Protocol\",\n"
+ "  '%s' AS \"Cipher\",\n"
+ "  '%s' AS \"Compression\"\n",
+ protocol ? protocol : 
_("unknown"),
+ cipher ? cipher : 
_("unknown"),
+ (compression && 
strcmp(compression, "off") != 0) ? _("on") : _("off"));
+   }

Could we pull some of this information from pg_stat_ssl instead of from
libpq?  The reason I suggest this is because I think it would be nice if
the query that \conninfo+ uses could be copy/pasted as needed and not rely
on hard-coded values chosen by the client.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: psql: Add command to use extended query protocol

2024-02-16 Thread Aleksander Alekseev
Hi,

> > In one of my environments, this feature didn't work as expected.
> > Digging into it, I found that it is incompatible with FETCH_COUNT
> > being set. Sorry for not recognising this during the betas.
> >
> > Attached a simple patch with tests running the cursor declaration
> > through PQexecParams instead of PGexec.
>
> Hmm, strange.  I had been trying to make \bind work with extended
> protocol, and my findings were that there's interactions with the code
> that was added for pipeline mode(*).  I put research aside to work on
> other things, but intended to get back to it soon ... I'm really
> surprised that it works for you here.
>
> Maybe your tests are just not extensive enough to show that it fails.
>
> (*) This is not actually proven, but Peter had told me that his \bind
> stuff had previously worked when he first implemented it before pipeline
> landed.  Because that's the only significant change that has happened to
> the libpq code lately, it's a reasonable hypothesis.

A colleague of mine is very excited about the new \bind functionality
in psql. However he is puzzled by the fact that there is no obvious
way to bind a NULL value, except for something like:

```
create table t (v text);
insert into t values (case when $1 = '' then NULL else $1 end) \bind '' \g
select v, v is null from t;
```

Maybe we should also support something like ... \bind val1 \null val3 \g ?

-- 
Best regards,
Aleksander Alekseev




Re: Replace current implementations in crypt() and gen_salt() to OpenSSL

2024-02-16 Thread Daniel Gustafsson
> On 16 Feb 2024, at 15:49, Peter Eisentraut  wrote:

> Like, if we did a "crypt-aes", would that be FIPS-compliant?  I don't know.

If I remember my FIPS correct: Only if it used a FIPS certified implementation,
like the one in OpenSSL when the fips provider has been loaded.  The cipher
must be allowed *and* the implementation must be certified.

--
Daniel Gustafsson





Re: table inheritance versus column compression and storage settings

2024-02-16 Thread Tom Lane
Peter Eisentraut  writes:
> I have committed this.  It is great to get this behavior fixed and also 
> to get the internals more consistent.  Thanks.

I find it surprising that the committed patch does not touch
pg_dump.  Is it really true that pg_dump dumps situations with
differing compression/storage settings accurately already?

(Note that it proves little that the pg_upgrade test passes,
since if pg_dump were blind to the settings applicable to a
child table, the second dump would still be blind to them.)

regards, tom lane




Re: Replace current implementations in crypt() and gen_salt() to OpenSSL

2024-02-16 Thread Peter Eisentraut

On 16.02.24 14:30, Daniel Gustafsson wrote:

On 16 Feb 2024, at 13:57, Peter Eisentraut  wrote:

On 16.02.24 10:16, Daniel Gustafsson wrote:

2. The crypt() and gen_salt() methods built on top of them (modes of operation, 
kind of) are not FIPS-compliant.

I wonder if it's worth trying to make pgcrypto disallow non-FIPS compliant
ciphers when the compiled against OpenSSL is running with FIPS mode enabled, or
raise a WARNING when used?  It seems rather unlikely that someone running
OpenSSL with FIPS=yes want to use our DES cipher without there being an error
or misconfiguration somewhere.


I wonder on what level this kind of check would be done.  For example, the 
password hashing done for SCRAM is not FIPS-compliant either, but surely we 
don't want to disallow that.


Can you elaborate?  When building with OpenSSL all SCRAM hashing will use the
OpenSSL implementation of pg_hmac and pg_cryptohash, so it would be subject to
OpenSSL FIPS configuration no?


Yes, but the overall methods of composing all this into secrets and 
protocol messages etc. are not covered by FIPS.



Maybe this should be done on the level of block ciphers.  So if someone wanted to add a 
"crypt-aes" module, that would then continue to work.


That's a fair point, we can check individual ciphers.  I'll hack up a version
doing this.


Like, if we did a "crypt-aes", would that be FIPS-compliant?  I don't know.





Re: Add new error_action COPY ON_ERROR "log"

2024-02-16 Thread torikoshia

On 2024-02-16 17:15, Bharath Rupireddy wrote:
On Wed, Feb 14, 2024 at 5:04 PM torikoshia  
wrote:


[] let the users know what line numbers are
> containing the errors that COPY ignored something like [1] with a
> simple change like [2].

Agreed.
Unlike my patch, it hides the error information (i.e. 22P02: invalid
input syntax for type integer: ), but I feel that it's usually
sufficient to know the row number and column where the error occurred.


Right.


> It not only helps users figure out which rows
> and attributes were malformed, but also helps them redirect them to
> server logs with setting log_min_messages = notice [3]. In the worst
> case scenario, a problem with this one NOTICE per malformed row is
> that it can overload the psql session if all the rows are malformed.
> I'm not sure if this is a big problem, but IMO better than a single
> summary NOTICE message and simpler than writing to tables of users'
> choice.

Maybe we could do what you suggested for the behavior when 'log' is
set to on_error?


 My point is: why would someone want just a summary of failures,
without row and column info, especially for bulk loading tasks? I'd
suggest doing it independently of 'log' or 'table'.  I think we can
keep things simple just like the attached patch, and see how this
feature will be adopted. I'm sure we can come back and do things like
saving to 'log' or 'table' or 'separate_error_file' etc., if we
receive any firsthand feedback.

Thoughts?


I may be wrong since I seldom do data loading tasks, but I agree with
you.


I'm also a little concerned about the case where there is a lot of malformed
data and it causes lots of messages, but the information is usually
valuable, and if users don't need it, they can suppress it by changing
client_min_messages.
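
For illustration, a minimal sketch of that workflow (table and file names are
hypothetical; ON_ERROR ignore is the existing COPY option, while the per-row
NOTICE is what the patch under discussion adds):

SET client_min_messages = warning;  -- suppress per-row NOTICEs on the client
COPY t FROM '/tmp/data.csv' WITH (FORMAT csv, ON_ERROR ignore);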


Currently both the summary of failures and the per-row information are
logged at NOTICE level.
If we assume that there are cases where only the summary information is
required, it would be useful to log the per-row information at a lower
level, i.e. LOG.



--
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation




Re: RFC: Logging plan of the running query

2024-02-16 Thread torikoshia
On Thu, Feb 15, 2024 at 6:12 PM Robert Haas  
wrote:

Hi,

I've just been catching up on this thread.

+ if (MyProc->heldLocks)
+ {
+ ereport(LOG_SERVER_ONLY,
+ errmsg("ignored request for logging query plan due to lock conflicts"),
+ errdetail("You can try again in a moment."));
+ return;
+ }

I don't like this for several reasons.

First, I think it's not nice to have a request just get ignored. A
user will expect that if we cannot immediately respond to some
request, we'll respond later at the first time that it's safe to do
so, rather than just ignoring it and telling them to retry.

Second, I don't think that the error message is very good. It talks
about lock conflicts, but we don't know that there is any real
problem. We know that, if we enter this block, the server is in the
middle of trying to acquire some lock, and we also know that we could
attempt to acquire a lock as part of generating the EXPLAIN output,
and maybe that's an issue. But that's not a lock conflict. That's a
re-entrancy problem. I don't know that we want to talk about
re-entrancy problems in an error message, and I don't really think
this error message should exist in the first place, but if we're going
to error out in this case surely we shouldn't do so with an error
message that describes a problem we don't have.

Third, I think that the re-entrancy problems with this patch may
extend well beyond lock acquisition. This is one example of how it can
be unsafe to do something as complicated as EXPLAIN at any arbitrary
CHECK_FOR_INTERRUPTS(). It is not correct to say, as
http://postgr.es/m/caaaqye9euuzd8bkjxtvcd9e4n5c7kzhzcvucjxt-xds9x4c...@mail.gmail.com
does, that the problems with running EXPLAIN at an arbitrary point are
specific to running this code in an aborted transaction. The existence
of this code shows that there is at least one hazard even if the
current transaction is not aborted, and I see no analysis on this
thread indicating whether there are any more such hazards, or how we
could go about finding them all.

I think the issue is very general. We have lots of subsystems that
both (a) use global variables and (b) contain CHECK_FOR_INTERRUPTS().
If we process an interrupt while that code is in the middle of
manipulating its global variables and then again re-enter that code,
bad things might happen. We could try to rule that out by analyzing
all such subsystems and all places where CHECK_FOR_INTERRUPTS() may
appear, but that's very difficult. Suppose we took the alternative
approach recommended by Andrey Lepikhov in
http://postgr.es/m/b1b110ae-61f6-4fd9-9b94-f967db9b5...@app.fastmail.com
and instead set a flag that gets handled in some suitable place in the
executor code, like ExecProcNode(). If we did that, then we would know
that we're not in the middle of a syscache lookup, a catcache lookup,
a heavyweight lock acquisition, an ereport, or any other low-level
subsystem call that might create problems for the EXPLAIN code.

In that design, the hack shown above would go away, and we could be
much more certain that we don't need any other similar hacks for other
subsystems. The only downside is that we might experience a slightly
longer delay before a requested EXPLAIN plan actually shows up, but
that seems like a pretty small price to pay for being able to reason
about the behavior of the system. I don't *think* there are any cases
where we run in the executor for a particularly long time without a
new call to ExecProcNode().

I think this case is a bit like vacuum_delay_point(). You might think
that vacuum_delay_point() could be moved inside of
CHECK_FOR_INTERRUPTS(), but we've made the opposite decision: every
vacuum_delay_point() calls CHECK_FOR_INTERRUPTS() but not every
CHECK_FOR_INTERRUPTS() calls vacuum_delay_point(). That means that we
can allow vacuum_delay_point() only at cases where we know it's safe,
rather than at every CHECK_FOR_INTERRUPTS(). I think that's a pretty
smart decision, even for vacuum_delay_point(), and AFAICS the code
we're proposing to run here does things that are substantially more
complicated than what vacuum_delay_point() does. That code just does a
bit of reading of shared memory, reports a wait event, and sleeps.
That's really simple compared to code that could try to do relcache
builds, which can trigger syscache lookups, which can both trigger
heavyweight lock acquisition, lightweight lock acquisition, bringing
pages into shared_buffers possibly through physical I/O, processing of
invalidation messages, and a bunch of other stuff.

It's really hard for me to accept that the heavyweight lock problem
for which the patch contains a workaround is the only one that exists.
I can't see any reason why that should be true.


Thanks for the review and the very detailed explanation!

I'm convinced that it's unsafe to execute the EXPLAIN code during
CHECK_FOR_INTERRUPTS() and that we need to execute it in some other, safe
place, and I agree with the first and second points as well.
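
To make the alternative concrete, here is a minimal sketch of the flag-based
approach recommended above (all names are hypothetical, not taken from the
posted patch): the interrupt handler only records the request, and the
executor services it at a known-safe point.

/* set from the interrupt path; nothing else happens there */
static volatile sig_atomic_t LogQueryPlanPending = false;

void
HandleLogQueryPlanInterrupt(void)
{
    LogQueryPlanPending = true;
}

/* called at the top of ExecProcNode(), where no syscache lookup, heavyweight
 * lock acquisition, or ereport can be mid-flight */
static inline void
ProcessLogQueryPlanRequest(PlanState *node)
{
    if (LogQueryPlanPending)
    {
        LogQueryPlanPending = false;
        LogCurrentQueryPlan(node);  /* hypothetical helper that runs EXPLAIN */
    }
}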



On 2024-02-16 03:59, Andres 

Re: Do away with zero-padding assumption before WALRead()

2024-02-16 Thread Bharath Rupireddy
On Fri, Feb 16, 2024 at 7:10 AM Kyotaro Horiguchi
 wrote:
>
> Good catch! The comment seems wrong also to me. The subsequent bytes
> can be written simultaneously, and it's very normal that there are
> unflushed bytes are in OS's page buffer. The objective of the comment
> seems to be to declare that there's no need to clear out the remaining
> bytes, here. I agree that it's not a problem for now. However, I think
> we need two fixes here.
>
> 1. It's useless to copy the whole page regardless of the 'count'. It's
>   enough to copy only up to the 'count'. The patch looks good in this
>   regard.

Yes, there's no need to copy the whole page. Per my understanding
of page transfers between disk and OS page cache: upon an OS page
cache miss, the whole page (of disk block size) gets fetched from disk
even if we just read 'count' bytes (< disk block size).

> 2. Maybe we need a comment that states the page_read callback
>   functions leave garbage bytes beyond the returned count, due to
>   partial copying without clearing the unused portion.

Isn't the comment around page_read callback at
https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/include/access/xlogreader.h;h=2e9e5f43eb2de1ca9ba81afe76d21357065c61aa;hb=d57b7cc3338e9d9aa1d7c5da1b25a17c5a72dcce#l78
enough?

"The callback shall return the number of bytes read (never more than
XLOG_BLCKSZ), or -1 on failure."

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Do away with zero-padding assumption before WALRead()

2024-02-16 Thread Bharath Rupireddy
On Thu, Feb 15, 2024 at 3:49 PM Nazir Bilal Yavuz  wrote:
>
> > I'm wondering why the WALRead() callers are always reading XLOG_BLCKSZ
> > despite knowing exactly how much to read. Is it to tell the OS to
> > explicitly fetch the whole page from the disk? If yes, the OS will do
> > that anyway because the page transfers from disk to OS page cache are
> > always in terms of disk block sizes, no?
>
> I am curious about the same. The page size and disk block size could
> be different,

Yes, they can be different, but (see below)

> so the reason could be explicitly fetching the whole
> page from the disk as you said.

Upon OS page cache miss, the whole page (of disk block size) gets
fetched from disk even if we just read 'count' bytes (< disk block
size), no? This is my understanding about page transfers between disk
and OS page cache.

> Is this the reason or are there any
> other benefits of always reading XLOG_BLCKSZ instead of reading the
> sufficient part? I tried to search in older threads and code comments
> but I could not find an explanation.

FWIW, walsender for physical replication will just read as much as it
wants to read which can range from WAL of size < XLOG_BLCKSZ to
MAX_SEND_SIZE (XLOG_BLCKSZ * 16). I mean, it does not read the whole
page of bytes XLOG_BLCKSZ when it wants to read < XLOG_BLCKSZ.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Replace current implementations in crypt() and gen_salt() to OpenSSL

2024-02-16 Thread Daniel Gustafsson
> On 16 Feb 2024, at 13:57, Peter Eisentraut  wrote:
> 
> On 16.02.24 10:16, Daniel Gustafsson wrote:
>>> 2. The crypt() and gen_salt() methods built on top of them (modes of 
>>> operation, kind of) are not FIPS-compliant.
>> I wonder if it's worth trying to make pgcrypto disallow non-FIPS compliant
>> ciphers when the OpenSSL it was compiled against is running with FIPS mode
>> enabled, or raise a WARNING when used?  It seems rather unlikely that someone
>> running OpenSSL with FIPS=yes wants to use our DES cipher without there being
>> an error or misconfiguration somewhere.
> 
> I wonder on what level this kind of check would be done.  For example, the 
> password hashing done for SCRAM is not FIPS-compliant either, but surely we 
> don't want to disallow that.

Can you elaborate?  When building with OpenSSL all SCRAM hashing will use the
OpenSSL implementation of pg_hmac and pg_cryptohash, so it would be subject to
OpenSSL FIPS configuration no?

> Maybe this should be done on the level of block ciphers.  So if someone 
> wanted to add a "crypt-aes" module, that would then continue to work.

That's a fair point, we can check individual ciphers.  I'll hack up a version
doing this.
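
For instance, a minimal sketch of such a per-cipher probe (illustrative only,
assuming OpenSSL 3.x, where EVP_CIPHER_fetch() consults the active providers,
so fetching e.g. "DES-ECB" fails when only the fips provider is loaded):

#include <stdbool.h>
#include <openssl/evp.h>

static bool
cipher_is_available(const char *name)
{
    EVP_CIPHER *cipher = EVP_CIPHER_fetch(NULL, name, NULL);

    if (cipher == NULL)
        return false;       /* not provided under the current properties */
    EVP_CIPHER_free(cipher);
    return true;
}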

--
Daniel Gustafsson





Re: table inheritance versus column compression and storage settings

2024-02-16 Thread Peter Eisentraut
I have committed this.  It is great to get this behavior fixed and also 
to get the internals more consistent.  Thanks.





Re: Replace current implementations in crypt() and gen_salt() to OpenSSL

2024-02-16 Thread Peter Eisentraut

On 16.02.24 10:16, Daniel Gustafsson wrote:

2. The crypt() and gen_salt() methods built on top of them (modes of operation, 
kind of) are not FIPS-compliant.

I wonder if it's worth trying to make pgcrypto disallow non-FIPS compliant
ciphers when the OpenSSL it was compiled against is running with FIPS mode
enabled, or raise a WARNING when used?  It seems rather unlikely that someone
running OpenSSL with FIPS=yes wants to use our DES cipher without there being
an error or misconfiguration somewhere.


I wonder on what level this kind of check would be done.  For example, 
the password hashing done for SCRAM is not FIPS-compliant either, but 
surely we don't want to disallow that.  Maybe this should be done on the 
level of block ciphers.  So if someone wanted to add a "crypt-aes" 
module, that would then continue to work.





Re: POC, WIP: OR-clause support for indexes

2024-02-16 Thread jian he
On Fri, Feb 16, 2024 at 1:32 PM Andrei Lepikhov
 wrote:
>
> On 16/2/2024 07:00, jian he wrote:
> > On Wed, Feb 14, 2024 at 11:21 AM Andrei Lepikhov
> >  wrote:
> > My OS: Ubuntu 22.04.3 LTS
> > I already set the max_parallel_workers_per_gather to 10.
> > So for all cases, it should use parallelism first?
> >
> > a better question would be:
> > how to make the number of OR less than 29 still faster when
> > enable_or_transformation is ON by only set parameters?
> In my test environment this example gives some subtle supremacy to ORs
> over ANY with only 3 ors and less.
> Please, provide next EXPLAIN ANALYZE results for the case you want to
> discuss here:
> 1. with enable_or_transformation enabled
> 2. with enable_or_transformation disabled
> 3. with enable_or_transformation disabled but with manual transformation
> OR -> ANY done, to check the overhead of this optimization.
>

you previously mentioned playing with parallel_tuple_cost and
parallel_setup_cost.
(https://www.postgresql.org/message-id/e3338e82-a28d-4631-9eec-b9c0984b32d5%40postgrespro.ru)

So I did by
`
SET parallel_setup_cost = 0;
SET parallel_tuple_cost = 0;
`

After setting these parameters, overall enable_or_transformation ON
performs better.
Sorry for the noise.
So now I haven't found any corner case where enable_or_transformation
ON performs worse than when it's OFF.
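
For reference, the manual OR -> ANY transformation from item 3 of the list
above looks like this (illustrative table and column names):

SELECT * FROM t WHERE a = 1 OR a = 2 OR a = 3;
-- rewritten manually as:
SELECT * FROM t WHERE a = ANY (ARRAY[1, 2, 3]);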

+typedef struct OrClauseGroupEntry
+{
+ OrClauseGroupKey key;
+
+ Node   *node;
+ List   *consts;
+ Oid scalar_type;
+ List   *exprs;
+} OrClauseGroupEntry;

I found that the field `scalar_type` was never used.




RE: speed up a logical replica setup

2024-02-16 Thread Hayato Kuroda (Fujitsu)
Dear Euler,

Here are comments for v21.

01. main
```
/* rudimentary check for a data directory. */
...
/* subscriber PID file. */
```

The initial character must be uppercase, and the trailing period is not needed.

02. check_data_directory
```
snprintf(versionfile, MAXPGPATH, "%s/PG_VERSION", datadir);
```

You removed the version check against PG_VERSION, but I think it is still
needed.
Indeed, v21 can detect that pg_ctl/pg_resetwal/pg_createsubscriber have a
different version, but it cannot detect that the started instance has a
different version.
I.e., v20-0010 is partially needed.
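
A sketch of what restoring that check could look like (illustrative, modeled
on similar checks in other frontend tools, assuming the tool's usual includes;
not the actual v20-0010 patch):

static void
check_data_directory_version(const char *datadir)
{
    char        versionfile[MAXPGPATH];
    char        rawline[64];
    FILE       *fp;

    snprintf(versionfile, MAXPGPATH, "%s/PG_VERSION", datadir);
    fp = fopen(versionfile, "r");
    if (fp == NULL)
        pg_fatal("could not open file \"%s\": %m", versionfile);
    if (fgets(rawline, sizeof(rawline), fp) == NULL)
        pg_fatal("could not read file \"%s\"", versionfile);
    rawline[strcspn(rawline, "\r\n")] = '\0';   /* strip newline */
    if (strcmp(rawline, PG_MAJORVERSION) != 0)
        pg_fatal("data directory is of version %s, but this tool is version %s",
                 rawline, PG_MAJORVERSION);
    fclose(fp);
}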

03. store_pub_sub_info()
```
SimpleStringListCell *cell;
```

This definition can be moved inside the loop.

04. get_standby_sysid()
```
pfree(cf);
```

This can be pg_pfree().

05. check_subscriber
```
/* The target server must be a standby */
if (server_is_in_recovery(conn) == 0)
{
pg_log_error("The target server is not a standby");
return false;
}
```

What if the function returns -1? Should we detect that (maybe a disconnection)
here?

06. server_is_in_recovery
```
ret = strcmp("t", PQgetvalue(res, 0, 0));

PQclear(res);

if (ret == 0)
return 1;
else if (ret > 0)
return 0;
else
return -1;  /* should not happen */
```

But strcmp may return a negative value, right? Based on the comment atop the
function,
we should not do that. I think we can use a ternary operator instead.

07. server_is_in_recovery

In the first place, no one considers that this returns -1. So can we change it
to a bool function and raise pg_fatal() in case of error?
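
A sketch of items 06 and 07 combined (illustrative, not the actual patch):

static bool
server_is_in_recovery(PGconn *conn)
{
    PGresult   *res;
    bool        in_recovery;

    res = PQexec(conn, "SELECT pg_catalog.pg_is_in_recovery()");
    if (PQresultStatus(res) != PGRES_TUPLES_OK)
        pg_fatal("could not determine recovery status: %s",
                 PQresultErrorMessage(res));
    in_recovery = (strcmp(PQgetvalue(res, 0, 0), "t") == 0);
    PQclear(res);
    return in_recovery;
}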

08. check_subscriber
```
if (strcmp(PQgetvalue(res, 0, 1), "t") != 0)
{
pg_log_error("permission denied for function \"%s\"",
 
"pg_catalog.pg_replication_origin_advance(text, pg_lsn)");
return false;
}
```

I think the third argument must be 2.

09. check_subscriber
```
pg_log_debug("subscriber: primary_slot_name: %s", primary_slot_name);
```

The output seems strange if the primary_slot_name is not set.

10. setup_publisher()
```
PGconn *conn;
PGresult   *res;
```

Definitions can be in the loop.

11. create_publication()
```
if (PQntuples(res) == 1)
{
/*
 * If publication name already exists and puballtables is true, 
let's
 * use it. A previous run of pg_createsubscriber must have 
created
 * this publication. Bail out.
 */
```

Hmm, but pre-existing publications may not publish INSERT/UPDATE/DELETE/TRUNCATE.
They should be checked if we really want to reuse them.
(I think it is OK to just raise an ERROR.)

12. create_publication()

Based on the above, we do not have to check before creating publications. The
publisher
can detect the duplication. I prefer that.

13. create_logical_replication_slot()
```
if (PQresultStatus(res) != PGRES_TUPLES_OK)
{
pg_log_error("could not create replication slot \"%s\" 
on database \"%s\": %s",
 slot_name, dbinfo->dbname,
 PQresultErrorMessage(res));
return lsn;
}
```

I know lsn is always NULL, but can we use `return NULL`?

14. setup_subscriber()
```
PGconn *conn;

```

This definition can be in the loop.


15.

You said that in case of failure, cleanup is not needed if the process exits soon
[1].
But some functions call PQfinish() and then exit(1) or pg_fatal(). Should we
follow that?

16.

Some places refer to the PGresult or PGconn even after the cleanup. They must be fixed.
```
PQclear(res);
disconnect_database(conn);
pg_fatal("could not get system identifier: %s",
 PQresultErrorMessage(res));
```

I think this is the root cause of why the wrong error message is sometimes output.
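
A sketch of a fix (illustrative): copy the message out before freeing the
PGresult/PGconn that owns it.

    char       *msg = pg_strdup(PQresultErrorMessage(res));

    PQclear(res);
    disconnect_database(conn);
    pg_fatal("could not get system identifier: %s", msg);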


17.

Some places call PQerrorMessage() and other places call PQresultErrorMessage().
I think PQerrorMessage() should be used only after the connection
establishment
functions. Thoughts?

18. 041_pg_createsubscriber_standby.pl
```
use warnings;
```

We must set "FATAL => 'all'";

19.
```
my $node_p;
my $node_f;
my $node_s;
my $node_c;
my $result;
my $slotname;
```

I could not find forward declarations like these in other perl files.
The node names might be a bit confusing, but I could not find better ones.

20.
```
# On node P
# - create databases
# - create test tables
# - insert a row
# - create a physical replication slot
$node_p->safe_psql(
'postgres', q(
CREATE DATABASE pg1;
CREATE DATABASE pg2;
));
$node_p->safe_psql('pg1', 'CREATE TABLE tbl1 (a text)');
$node_p->safe_psql('pg1', "INSERT INTO tbl1 VALUES('first row')");

Re: A proposal to provide a timeout option for CREATE_REPLICATION_SLOT/pg_create_logical_replication_slot

2024-02-16 Thread Bharath Rupireddy
On Thu, Jun 9, 2022 at 1:01 PM Kyotaro Horiguchi
 wrote:
>
> > Currently CREATE_REPLICATION_SLOT/pg_create_logical_replication_slot waits
> > unboundedly if there are any in-progress write transactions [1]. []
> >
> > How about we provide a timeout for the command/function instead of letting
> > them wait unboundedly?
>
> How can the other sessions get affected by setting statement_timeout in a
> session?  And "SET LOCAL" narrows the effect down to within a
> transaction. I think that is sufficient.

SET LOCAL needs to be run within an explicit txn whereas CREATE
SUBSCRIPTION can't.

> On the other hand,
> CREATE_REPLICATION_SLOT doesn't honor statement_timeout, but honors
> lock_timeout. (It's a bit strange but I would hardly go so far as to
> say we should "fix" it..) If a program issues CREATE_REPLICATION_SLOT,
> it's hard to believe that the same program cannot issue SET (for
> lock_timeout) command as well.

Yes it can issue lock_timeout.

> When CREATE_REPLICATION_SLOT is called from a CREATE SUBSCRIPTION
> command, the latter command itself honors statement_timeout and
> disconnects the peer walsender. Thus, client_connection_check_interval
> set on publisher side kills the walsender shortly after the
> disconnection.

Right.

> In short, I don't see much point in the timeout of the function/command.

I played with it a bit today. There are a couple of ways to get around
the CREATE SUBSCRIPTION blocking issue - set statement_timeout [1] or
transaction_timeout [2] on the subscriber at the session level before
creating the subscription, or set lock_timeout [3] on the publisher.

Since we have a bunch of timeouts already (transaction_timeout being
the latest addition), I don't think we need another one here. So I
withdraw my initial idea on this thread to have a separate timeout to
create a logical replication slot.

[1]
postgres=# SET statement_timeout = '10s';
SET
postgres=# CREATE SUBSCRIPTION mysub CONNECTION 'dbname=postgres
port=5432' PUBLICATION mypub;

ERROR:  canceling statement due to statement timeout

[2]
postgres=# SET transaction_timeout = '10s';
SET
postgres=# CREATE SUBSCRIPTION mysub CONNECTION 'dbname=postgres
port=5432' PUBLICATION mypub;
FATAL:  terminating connection due to transaction timeout
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.

[3]
postgres=# CREATE SUBSCRIPTION mysub CONNECTION 'dbname=postgres
port=5432' PUBLICATION mypub;

ERROR:  could not create replication slot "mysub": ERROR:  canceling
statement due to lock timeout

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: glibc qsort() vulnerability

2024-02-16 Thread Mats Kindahl
On Fri, Feb 16, 2024 at 12:32 AM Nathan Bossart 
wrote:

> Here is what I have staged for commit.
>

Looks good to me.

Checked that all of the comparisons are in the expected order, except
inside compDESC, cmp_lsn, and resource_priority_cmp, where the order is
reversed.
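
For context, the overflow-safe comparator pattern at issue looks like this
(illustrative, not the committed code): returning a plain difference such as
"return a - b;" can overflow, while explicit comparisons cannot.

static int
cmp_asc(const void *a, const void *b)
{
    int         x = *(const int *) a;
    int         y = *(const int *) b;

    return (x > y) - (x < y);   /* ascending */
}

static int
cmp_desc(const void *a, const void *b)
{
    int         x = *(const int *) a;
    int         y = *(const int *) b;

    return (x < y) - (x > y);   /* reversed: descending */
}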

Best wishes,
Mats Kindahl

>
> --
> Nathan Bossart
> Amazon Web Services: https://aws.amazon.com
>


Re: pg_upgrade and logical replication

2024-02-16 Thread Amit Kapila
On Fri, Feb 16, 2024 at 10:50 AM Hayato Kuroda (Fujitsu)
 wrote:
>
> Thanks for reviewing! PSA new version.
>

+# Setup a disabled subscription. The upcoming test will check the
+# pg_createsubscriber won't work, so it is sufficient.
+$publisher->safe_psql('postgres', "CREATE PUBLICATION regress_pub1");

Why is pg_createsubscriber referred to here? I think it is a typo.

Other than that patch looks good to me.

-- 
With Regards,
Amit Kapila.




confirmed flush lsn seems to be move backward in certain error cases

2024-02-16 Thread vignesh C
Hi,

The following assertion failure was seen while testing one scenario
for other patch:
TRAP: failed Assert("s->data.confirmed_flush >=
s->last_saved_confirmed_flush"), File: "slot.c", Line: 1760, PID:
545314
postgres: checkpointer performing shutdown
checkpoint(ExceptionalCondition+0xbb)[0x564ee6870c58]
postgres: checkpointer performing shutdown
checkpoint(CheckPointReplicationSlots+0x18e)[0x564ee65e9c71]
postgres: checkpointer performing shutdown checkpoint(+0x1e1403)[0x564ee61be403]
postgres: checkpointer performing shutdown
checkpoint(CreateCheckPoint+0x78a)[0x564ee61bdace]
postgres: checkpointer performing shutdown
checkpoint(ShutdownXLOG+0x150)[0x564ee61bc735]
postgres: checkpointer performing shutdown checkpoint(+0x5ae28c)[0x564ee658b28c]
postgres: checkpointer performing shutdown
checkpoint(CheckpointerMain+0x31e)[0x564ee658ad55]
postgres: checkpointer performing shutdown
checkpoint(AuxiliaryProcessMain+0x1d1)[0x564ee65888d9]
postgres: checkpointer performing shutdown checkpoint(+0x5b7200)[0x564ee6594200]
postgres: checkpointer performing shutdown
checkpoint(PostmasterMain+0x14da)[0x564ee658f12f]
postgres: checkpointer performing shutdown checkpoint(+0x464fc6)[0x564ee6441fc6]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7ff6afa29d90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7ff6afa29e40]
postgres: checkpointer performing shutdown
checkpoint(_start+0x25)[0x564ee60b8e05]

I was able to reproduce this issue with the following steps:
-- Setup
-- Publisher node:
create table t1(c1 int);
create table t2(c1 int);
create publication pub1 for table t1;
create publication pub2 for table t2;

-- Subscriber node:
create table t1(c1 int);
create table t2(c1 int);
create subscription test1 connection 'dbname=postgres host=localhost
port=5432' publication pub1, pub2;
select * from pg_subscription;

-- Actual test
insert into t1 values(10);
insert into t2 values(20);
select pg_sleep(10);
drop publication pub2;
insert into t1 values(10);
insert into t2 values(20);

Stop the publisher to see the assertion.

For me the issue reproduces about twice in five times using the
assert_failure.sh script attached.

After the insert operation is replicated to the subscriber, the
subscriber will set the lsn value sent by the publisher in the
replication origin (in my case it was 0/1510978). publisher will then
send keepalive messages with the current WAL position in the publisher
(in my case it was 0/15109B0), but subscriber will simply send this
position as the flush_lsn to the publisher as there are no ongoing
transactions. Then since the publisher is started, it will identify
that publication does not exist and stop the walsender/apply worker
process. When the apply worker is restarted, we will get the
remote_lsn(in my case it was 0/1510978) of the origin and set it to
origin_startpos. We will start the apply worker with this
origin_startpos (origin's remote_lsn). This position will be sent as
feedback to the walsender process from the below stack:
run_apply_worker->start_apply->LogicalRepApplyLoop->send_feedback.
It will use the following send_feedback function call of
LogicalRepApplyLoop function as in below code here as nothing is
received from walsender:
LogicalRepApplyLoop function
...
len = walrcv_receive(LogRepWorkerWalRcvConn, &buf, &fd);
if (len != 0)
{
   /* Loop to process all available data (without blocking). */
   for (;;)
   {
  CHECK_FOR_INTERRUPTS();
  ...
   }
}

/* confirm all writes so far */
send_feedback(last_received, false, false);
...

In send_feedback, we will set flushpos to replication origin's
remote_lsn and send it to the walsender process. Walsender process
will receive this information and set confirmed_flush in:
ProcessStandbyReplyMessage->LogicalConfirmReceivedLocation

Then immediately we are trying to stop the publisher instance,
shutdown checkpoint process will be triggered. In this case:
confirmed_flush = 0/1510978 will be lesser than
last_saved_confirmed_flush = 0/15109B0 which will result in Assertion
failure.

This issue is happening because we allow setting the confirmed_flush
to a backward position.
There are a couple of ways to fix this:
a) One way is not to update the confirm_flush if the lsn sent is an
older value like in Confirm_flush_dont_allow_backward.patch
b) Another way is to remove the assertion in
CheckPointReplicationSlots and marking the slot as dirty only if
confirmed_flush is greater than last_saved_confirmed_flush like in
Assert_confirmed_flush_will_always_not_be_less_than_last_saved_confirmed_flush.patch

I preferred the first approach.

Thoughts?

Regards,
Vignesh


assert_failure.sh
Description: application/shellscript
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index a53815f2ed..98aa629ced 100644
--- a/src/backend/replication/logical/logical.c
+++ 

Re: Replace current implementations in crypt() and gen_salt() to OpenSSL

2024-02-16 Thread Joe Conway

On 2/16/24 04:16, Daniel Gustafsson wrote:

On 15 Feb 2024, at 16:49, Peter Eisentraut  wrote:



1. All the block ciphers currently supported by crypt() and gen_salt() are not 
FIPS-compliant.

2. The crypt() and gen_salt() methods built on top of them (modes of operation, 
kind of) are not FIPS-compliant.


I wonder if it's worth trying to make pgcrypto disallow non-FIPS compliant
ciphers when the OpenSSL it was compiled against is running with FIPS mode
enabled, or raise a WARNING when used?  It seems rather unlikely that someone
running OpenSSL with FIPS=yes wants to use our DES cipher without there being
an error or misconfiguration somewhere.

Something like the below untested pseudocode.

diff --git a/contrib/pgcrypto/pgcrypto.c b/contrib/pgcrypto/pgcrypto.c
index 96447c5757..3d4391ebe1 100644
--- a/contrib/pgcrypto/pgcrypto.c
+++ b/contrib/pgcrypto/pgcrypto.c
@@ -187,6 +187,14 @@ pg_crypt(PG_FUNCTION_ARGS)
 			   *resbuf;
 	text	   *res;
 
+#if defined FIPS_mode
+	if (FIPS_mode())
+#else
+	if (EVP_default_properties_is_fips_enabled(OSSL_LIB_CTX_get0_global_default()))
+#endif
+		ereport(ERROR,
+				(errmsg("not available when using OpenSSL in FIPS mode")));


Makes sense +1

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





RE: Replace current implementations in crypt() and gen_salt() to OpenSSL

2024-02-16 Thread Koshi Shibagaki (Fujitsu)
Dear Daniel

Thanks for your reply.

> I wonder if it's worth trying to make pgcrypto disallow non-FIPS compliant
> ciphers when the OpenSSL it was compiled against is running with FIPS mode
> enabled, or raise a WARNING when used?  It seems rather unlikely that
> someone running OpenSSL with FIPS=yes wants to use our DES cipher without
> there being an error or misconfiguration somewhere.

Indeed, users do not use non-FIPS compliant ciphers in crypt() and gen_salt() 
such as DES with FIPS mode enabled.
However, can we reduce human error by having these functions make the judgment 
as to whether ciphers can or cannot be used?

If pgcrypto checks whether FIPS mode is enabled or not, as in the pseudocode,
it is easier to achieve than replacing the implementations with OpenSSL.
Currently, OpenSSL internally determines if it is in FIPS mode or not, but would
it be a problem to have PostgreSQL take on that role?

---
Fujitsu Limited
Shibagaki Koshi
shibagaki.ko...@fujitsu.com






RE: Replace current implementations in crypt() and gen_salt() to OpenSSL

2024-02-16 Thread Koshi Shibagaki (Fujitsu)
Dear Peter

Thanks for the reply.

> 1. All the block ciphers currently supported by crypt() and gen_salt() are not
> FIPS-compliant.
>
> 2. The crypt() and gen_salt() methods built on top of them (modes of 
> operation,
> kind of) are not FIPS-compliant.
> 
> 3. The implementations (crypt-blowfish.c, crypt-des.c, etc.) are not 
> structured
> in a way that OpenSSL calls can easily be patched in.

Indeed, none of the algorithms can be used under FIPS, and huge engineering
effort might be needed for the replacement. If the benefit is smaller than the
cost, we should consider another way - e.g., prohibiting calls to these
functions in FIPS mode, as in the pseudocode Daniel sent. Replacing with
OpenSSL is one way; the objective is to eliminate the user's error in choosing
an encryption algorithm.


---
Fujitsu Limited
Shibagaki Koshi
shibagaki.ko...@fujitsu.com





Re: Psql meta-command conninfo+

2024-02-16 Thread Jim Jones



On 15.02.24 23:16, Maiquel Grassi wrote:
>
> Hi!
>
> (v16)
>
> In this version, I made a small adjustment to the indentation
> of the \conninfo code and described the columns as returned
> by \conninfo+ as suggested by Jim Jones.
>
>

I've performed the following tests with v16:

1) hostaddr=172.19.42.1

$ /usr/local/postgres-dev/bin/psql -x "\
    host=server.uni-muenster.de
    hostaddr=172.19.42.1
    user=jim dbname=postgres
    sslrootcert=server-certificates/server.crt
    sslcert=jim-certificates/jim.crt
    sslkey=jim-certificates/jim.key" -c "\conninfo+" -c "\conninfo"

Current Connection Information
-[ RECORD 1 ]--+---
Database   | postgres
Authenticated User | jim
System User    |
cert:emailAddress=wwwad...@uni-muenster.de,CN=jim,OU=WWU
IT,O=Universitaet Muenster,L=Muenster,ST=Nordrhein-Westfalen,C=DE
Current User   | jim
Session User   | jim
Backend PID    | 386839
Server Address | 172.19.42.1
Server Port    | 5432
Client Address | 192.168.178.27
Client Port    | 35602
Socket Directory   |
Host   | server.uni-muenster.de
Encryption | SSL
Protocol   | TLSv1.3
Cipher | TLS_AES_256_GCM_SHA384
Compression    | off

You are connected to database "postgres" as user "jim" on host
"server.uni-muenster.de" (address "172.19.42.1") at port "5432".
SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384,
compression: off)


The same with non-superusers

$ /usr/local/postgres-dev/bin/psql -x "\
    host=server.uni-muenster.de
    hostaddr=172.19.42.1
    user=jim dbname=postgres
    sslrootcert=server-certificates/server.crt
    sslcert=jim-certificates/jim.crt
    sslkey=jim-certificates/jim.key" -c "SET ROLE foo" -c "\conninfo+"
-c "\conninfo"
SET
Current Connection Information
-[ RECORD 1 ]--+---
Database   | postgres
Authenticated User | jim
System User    |
cert:emailAddress=wwwad...@uni-muenster.de,CN=jim,OU=WWU
IT,O=Universitaet Muenster,L=Muenster,ST=Nordrhein-Westfalen,C=DE
Current User   | foo
Session User   | jim
Backend PID    | 547733
Server Address | 172.19.42.1
Server Port    | 5432
Client Address | 192.168.178.27
Client Port    | 58508
Socket Directory   |
Host   | server.uni-muenster.de
Encryption | SSL
Protocol   | TLSv1.3
Cipher | TLS_AES_256_GCM_SHA384
Compression    | off

You are connected to database "postgres" as user "jim" on host
"server.uni-muenster.de" (address "172.19.42.1") at port "5432".
SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384,
compression: off)


2) -h 192.168.178.27

$ /usr/local/postgres-dev/bin/psql -x -U postgres -h 192.168.178.27 -c
"\conninfo+" -c "\conninfo"
Current Connection Information
-[ RECORD 1 ]--+---
Database   | postgres
Authenticated User | postgres
System User    |
Current User   | postgres
Session User   | postgres
Backend PID    | 399670
Server Address | 192.168.178.27
Server Port    | 5432
Client Address | 192.168.178.27
Client Port    | 44174
Socket Directory   |
Host   | 192.168.178.27
Encryption | SSL
Protocol   | TLSv1.3
Cipher | TLS_AES_256_GCM_SHA384
Compression    | off

You are connected to database "postgres" as user "postgres" on host
"192.168.178.27" at port "5432".
SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384,
compression: off)


3) via socket

$ /usr/local/postgres-dev/bin/psql -x -U postgres -c "\conninfo+" -c
"\conninfo"
Current Connection Information
-[ RECORD 1 ]--+-
Database   | postgres
Authenticated User | postgres
System User    |
Current User   | postgres
Session User   | postgres
Backend PID    | 394273
Server Address |
Server Port    | 5432
Client Address |
Client Port    |
Socket Directory   | /tmp
Host   |

You are connected to database "postgres" as user "postgres" via socket
in "/tmp" at port "5432".

4) -h 127.0.0.1

$ /usr/local/postgres-dev/bin/psql -x -U postgres -h 127.0.0.1 -c
"\conninfo+" -c "\conninfo"
Current Connection Information
-[ RECORD 1 ]--+---
Database   | postgres
Authenticated User | postgres
System User    |
Current User   | postgres
Session User   | postgres
Backend PID    | 396070
Server Address | 127.0.0.1
Server Port    | 5432
Client Address | 127.0.0.1
Client Port    | 52528
Socket Directory   |
Host   | 127.0.0.1
Encryption | SSL
Protocol   | TLSv1.3
Cipher | TLS_AES_256_GCM_SHA384
Compression    | off

You are connected to database "postgres" as user 

RE: speed up a logical replica setup

2024-02-16 Thread Hayato Kuroda (Fujitsu)
Dear Shubham,

Thanks for testing. It seems you ran with the v20 patch, but I confirmed
it can be reproduced with v21.

> 
> I found a couple of issues, while verifying the cascaded replication
> with the following scenarios:
> Scenario 1) Create cascade replication like node1->node2->node3
> without using replication slots (attached
> cascade_3node_setup_without_slots has the script for this):
> Then I ran pg_createsubscriber by specifying primary as node1 and
> standby as node3, this scenario runs successfully. I was not sure if
> this should be supported or not?

Hmm. After the script, the cascading would be broken. The replication would be:

```
Node1 -> node2
  |
Node3
```

And the operation is a bit strange. The consistent LSN is obtained from node1,
but node3 waits until it receives the record from node2.
Can it always succeed?

> Scenario 2) Create cascade replication like node1->node2->node3 using
> replication slots (attached cascade_3node_setup_with_slots has the
> script for this):
> Here, slot name was used as slot1 for node1 to node2 and slot2 for
> node2 to node3. Then I ran pg_createsubscriber by specifying primary
> as node1 and standby as node3. In this case pg_createsubscriber fails
> with the following error:
> pg_createsubscriber: error: could not obtain replication slot
> information: got 0 rows, expected 1 row
> [Inferior 1 (process 2623483) exited with code 01]
> 
> This is failing because slot name slot2 is used between node2->node3
> but pg_createsubscriber checked for slot1, the slot which is used 
> for replication between node1->node2.
> Thoughts?

Right. The inconsistency is quite strange.

Overall, I felt such a case must be rejected. How should we detect it at the
checking phase?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



Re: Add trim_trailing_whitespace to editorconfig file

2024-02-16 Thread Peter Eisentraut

v3-0001-Remove-non-existing-file-from-.gitattributes.patch

I have committed that one.

v3-0002-Require-final-newline-in-.po-files.patch

The .po files are imported from elsewhere, so I'm not sure this is going 
to have the desired effect.  Perhaps it's worth cleaning up, but it 
would require more steps.


v3-0003-Bring-editorconfig-in-line-with-gitattributes.patch

I question whether we need to add rules to .editorconfig about files 
that are generated or imported from elsewhere, since those are not meant 
to be edited.






Re: Things I don't like about \du's "Attributes" column

2024-02-16 Thread Pavel Luzanov

On 13.02.2024 00:29, Pavel Luzanov wrote:

The changes are split into two patches.
0001 - pg_roles view. I plan to organize a new thread for discussion.


Please see it here:
https://www.postgresql.org/message-id/db1d94ba-1e6e-4e86-baff-91e6e79071c1%40postgrespro.ru

--
Pavel Luzanov
Postgres Professional:https://postgrespro.com


Show password presence in pg_roles for authorized roles

2024-02-16 Thread Pavel Luzanov

Hello,

Currently, a role with the createrole attribute can create roles and set and
change their passwords,
but can't see the password. It can't even see whether a password is set or not.
As a result, you can mistakenly set the Valid until attribute on roles without
a password.
And there is no way to detect such a situation.
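
For illustration (hypothetical role names), this is the situation that cannot
currently be detected from pg_roles:

CREATE ROLE alice LOGIN PASSWORD 'secret' VALID UNTIL '2030-01-01';
CREATE ROLE bob LOGIN VALID UNTIL '2030-01-01';  -- no password, so meaningless
SELECT rolname, rolpassword, rolvaliduntil
FROM pg_roles WHERE rolname IN ('alice', 'bob');
-- rolpassword shows '********' for both, regardless of password presence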

In the patch for changing the \du command, I want to give the opportunity to 
show
incorrect values of the Valid until attribute. [1]

I suggest changing the pg_roles view to allow a role with the createrole 
attribute to see
information about the password of the roles that this role manages
(has membership with admin option).

There are several ways to implement it.

1.
Change the values of the rolpassword column. Now it always shows '********'.
The values should depend on the role executing the query.
If the query is executed by a superuser or a role with createrole and admin
membership,
then show '********' instead of the password, or NULL (no password).
For other roles, show '********'.

This is implemented in the attached patch.

2.
Change the values of the rolpassword column.
If the query is executed by a superuser or a role with createrole and admin
membership,
then show the real password or NULL (no password).
For other roles, show '********'.

3.
Leave the rolpassword column as it is for backward compatibility, but add
a new logical rolhaspassword column.
If the query is executed by a superuser or a role with create role and admin 
membership,
then show true/false depending on the password existence.
For other roles, show NULL.

Although it is possible that for security reasons such changes should not be 
made.

1.https://www.postgresql.org/message-id/ef4d000f-6766-4ae1-9f69-0d0caa8130d6%40postgrespro.ru

--
Pavel Luzanov
Postgres Professional:https://postgrespro.com
From 78a1b38386634becc6c82749c1e7e19c4f1cc94f Mon Sep 17 00:00:00 2001
From: Pavel Luzanov 
Date: Mon, 12 Feb 2024 23:46:15 +0300
Subject: [PATCH v4 1/2] Show password presence in pg_roles

Keeping with the changes made in v16 it does seem worthwhile modifying
pg_roles to be sensitive to the role querying the view having both
createrole and admin membership on the role being displayed.  With now
three possible outcomes: NULL if no password is in use, ******** if a
password is in use and the user has the ability to alter role, or
********.
---
 src/backend/catalog/system_views.sql | 45 ++--
 src/test/regress/expected/rules.out  | 37 ++-
 2 files changed, 52 insertions(+), 30 deletions(-)

diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 6791bff9dd..2de6802cf7 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -16,21 +16,34 @@
 
 CREATE VIEW pg_roles AS
 SELECT
-rolname,
-rolsuper,
-rolinherit,
-rolcreaterole,
-rolcreatedb,
-rolcanlogin,
-rolreplication,
-rolconnlimit,
-'********'::text as rolpassword,
-rolvaliduntil,
-rolbypassrls,
-setconfig as rolconfig,
-pg_authid.oid
-FROM pg_authid LEFT JOIN pg_db_role_setting s
-ON (pg_authid.oid = setrole AND setdatabase = 0);
+r.rolname,
+r.rolsuper,
+r.rolinherit,
+r.rolcreaterole,
+r.rolcreatedb,
+r.rolcanlogin,
+r.rolreplication,
+r.rolconnlimit,
+CASE WHEN curr_user.rolsuper OR
+ (curr_user.rolcreaterole AND m.admin_option)
+ THEN
+  CASE WHEN r.rolpassword IS NULL
+   THEN NULL::pg_catalog.text
+   ELSE '********'::pg_catalog.text
+  END
+ ELSE '********'::pg_catalog.text
+END rolpassword,
+r.rolvaliduntil,
+r.rolbypassrls,
+s.setconfig AS rolconfig,
+r.oid
+FROM pg_catalog.pg_authid r
+JOIN pg_catalog.pg_authid curr_user
+ON (curr_user.rolname = CURRENT_USER)
+LEFT JOIN pg_catalog.pg_auth_members m
+ON (curr_user.oid = m.member AND r.oid = m.roleid AND m.admin_option)
+LEFT JOIN pg_catalog.pg_db_role_setting s
+ON (r.oid = s.setrole AND s.setdatabase = 0);
 
 CREATE VIEW pg_shadow AS
 SELECT
@@ -65,7 +78,7 @@ CREATE VIEW pg_user AS
 usesuper,
 userepl,
 usebypassrls,
-'********'::text as passwd,
+'********'::text AS passwd,
 valuntil,
 useconfig
 FROM pg_shadow;
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index abc944e8b8..a704015de3 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1477,21 +1477,30 @@ pg_replication_slots| SELECT l.slot_name,
 l.failover
FROM (pg_get_replication_slots() l(slot_name, plugin, slot_type, datoid, temporary, active, active_pid, xmin, catalog_xmin, restart_lsn, confirmed_flush_lsn, wal_status, safe_wal_size, 

Re: Allow passing extra options to initdb for tests

2024-02-16 Thread Daniel Gustafsson
> On 15 Feb 2024, at 16:21, Tom Lane  wrote:
> 
> Daniel Gustafsson  writes:
>> On 15 Feb 2024, at 11:38, Peter Eisentraut  wrote:
>>> We don't have a man page for pg_regress, so there is no place to 
>>> comprehensively document all the options and their interactions.
> 
>> This comes up every now and again, just yesterday there was a question on
>> -general [0] about alternate output files which also are
>> undocumented.
> 
> Really?
> 
> https://www.postgresql.org/docs/devel/regress-variant.html

Doh, I missed that when looking.

>> Maybe it's time to add documentation for pg_regress?
> 
> I'm inclined to think that a formal man page wouldn't be that useful,
> since nobody ever invokes pg_regress directly;

Right, I was mostly thinking of a chapter along the lines of "I am an extension
author, what features does this tool have to help me write good tests".

> the complete lack of equivalent info about what
> to do in a meson build clearly needs to be rectified.

Indeed.

--
Daniel Gustafsson





Re: speed up a logical replica setup

2024-02-16 Thread Shubham Khanna
On Thu, Feb 15, 2024 at 4:53 PM Hayato Kuroda (Fujitsu)
 wrote:
>
> Dear Euler,
>
> > Policy)
> >
> > Basically, we do not prohibit to connect to primary/standby.
> > primary_slot_name may be changed during the conversion and
> > tuples may be inserted on target just after the promotion, but it seems no 
> > issues.
> >
> > API)
> >
> > -D (data directory) and -d (databases) are definitively needed.
> >
> > Regarding the -P (a connection string for source), we can require it for 
> > now.
> > But note that it may cause an inconsistency if the node pointed to by -P is
> > different
> > from the node pointed to by primary_conninfo.
> >
> > As for the connection string for the target server, we can choose two ways:
> > a)
> > accept native connection string as -S. This can reuse the same parsing
> > mechanism as -P,
> > but there is a room that non-local server is specified.
> >
> > b)
> > accept username/port as -U/-p
> > (Since the policy is like above, listen_addresses would not be overwritten. 
> > Also,
> > the port just specify the listening port).
> > This can avoid connecting to non-local, but more options may be needed.
> > (E.g., Is socket directory needed? What about password?)
> >
> > Other discussing point, reported issue)
> >
> > Points raised by me [1] are not solved yet.
> >
> > * What if the target version is PG16-?
> > * What if the found executables have diffent version with 
> > pg_createsubscriber?
> > * What if the target is sending WAL to another server?
> >I.e., there are clusters like `node1->node2-.node3`, and the target is 
> > node2.
> > * Can we really cleanup the standby in case of failure?
> >Shouldn't we suggest to remove the target once?
> > * Can we move outputs to stdout?
>
> Based on the discussion, I updated the patch set. Feel free to pick them and 
> include.
> Removing -P patch was removed, but removing -S still remained.
>
> Also, while testing the patch set, I found some issues.
>
> 1.
> Cfbot got angry [1]. This is because WIFEXITED and others are defined in 
> ,
> but the inclusion was removed per comment. Added the inclusion again.
>
> 2.
> As Shubham pointed out [3], when we convert an intermediate node of cascading 
> replication,
> the last node would get stuck. This is because a walreceiver process requires
> that nodes have the same
> system identifier (in WalReceiverMain), but it would be changed by
> pg_createsubscriber.
>
> 3.
> Moreover, when we convert the last node of a cascade, it won't work well,
> because we cannot create
> publications on a standby node.
>
> 4.
> If the standby server was initialized as PG16-, this command would fail.
> Because the API of pg_logical_create_replication_slot() were changed.
>
> 5.
> Also, the pg_ctl commands used must have the same version as the instance.
> I think we should require that all the executables and servers be the same
> major version.
>
> Based on them, below part describes attached ones:
>
> V20-0001: same as Euler's patch, v17-0001.
> V20-0002: Update docs per recent changes. Same as v19-0002
> V20-0003: Modify the alignment of codes. Same as v19-0003
> V20-0004: Change an argument of get_base_conninfo. Same as v19-0004
> === experimental patches ===
> V20-0005: Add testcases. Same as v19-0004
> V20-0006: Update a comment above global variables. Same as v19-0005
> V20-0007: Address comments from Vignesh. Some parts you don't like
>   are reverted.
> V20-0008: Fix error message in get_bin_directory(). Same as v19-0008
> V20-0009: Remove -S option. Refactored from v16-0007
> V20-0010: Add check versions of executables and the target, per above and [4]
> V20-0011: Detect a disconnection while waiting the recovery, per [4]
> V20-0012: Avoid running pg_createsubscriber for cascade physical replication, 
> per above.
>
> [1]: https://cirrus-ci.com/task/4619792833839104
> [2]: 
> https://www.postgresql.org/message-id/CALDaNm1r9ZOwZamYsh6MHzb%3D_XvhjC_5XnTAsVecANvU9FOz6w%40mail.gmail.com
> [3]: 
> https://www.postgresql.org/message-id/CAHv8RjJcUY23ieJc5xqg6-QeGr1Ppp4Jwbu7Mq29eqCBTDWfUw%40mail.gmail.com
> [4]: 
> https://www.postgresql.org/message-id/TYCPR01MB1207713BEC5C379A05D65E342F54B2%40TYCPR01MB12077.jpnprd01.prod.outlook.com

I found a couple of issues, while verifying the cascaded replication
with the following scenarios:
Scenario 1) Create cascade replication like node1->node2->node3
without using replication slots (attached
cascade_3node_setup_without_slots has the script for this):
Then I ran pg_createsubscriber by specifying primary as node1 and
standby as node3, this scenario runs successfully. I was not sure if
this should be supported or not?
Scenario 2) Create cascade replication like node1->node2->node3 using
replication slots (attached cascade_3node_setup_with_slots has the
script for this):
Here, slot name was used as slot1 for node1 to node2 and slot2 for
node2 to node3. Then I ran pg_createsubscriber by specifying primary
as node1 and standby as node3. In this case pg_createsubscriber fails

Re: Replace current implementations in crypt() and gen_salt() to OpenSSL

2024-02-16 Thread Daniel Gustafsson
> On 15 Feb 2024, at 16:49, Peter Eisentraut  wrote:

> 1. All the block ciphers currently supported by crypt() and gen_salt() are 
> not FIPS-compliant.
> 
> 2. The crypt() and gen_salt() methods built on top of them (modes of 
> operation, kind of) are not FIPS-compliant.

I wonder if it's worth trying to make pgcrypto disallow non-FIPS compliant
ciphers when the OpenSSL it was compiled against is running with FIPS mode
enabled, or raise a WARNING when used?  It seems rather unlikely that someone
running OpenSSL with FIPS=yes wants to use our DES cipher without there being
an error or misconfiguration somewhere.

Something like the below untested pseudocode.

diff --git a/contrib/pgcrypto/pgcrypto.c b/contrib/pgcrypto/pgcrypto.c
index 96447c5757..3d4391ebe1 100644
--- a/contrib/pgcrypto/pgcrypto.c
+++ b/contrib/pgcrypto/pgcrypto.c
@@ -187,6 +187,14 @@ pg_crypt(PG_FUNCTION_ARGS)
 			   *resbuf;
 	text	   *res;
 
+#if defined FIPS_mode
+	if (FIPS_mode())
+#else
+	if (EVP_default_properties_is_fips_enabled(OSSL_LIB_CTX_get0_global_default()))
+#endif
+		ereport(ERROR,
+				(errmsg("not available when using OpenSSL in FIPS mode")));
+
 	buf0 = text_to_cstring(arg0);
 	buf1 = text_to_cstring(arg1);

Greenplum implemented similar functionality but with a GUC, fips_mode=.
The problem with that is that it gives the illusion that enabling such a GUC
provides guarantees about FIPS, which isn't really the case since postgres
isn't FIPS certified.

--
Daniel Gustafsson





RE: Synchronizing slots from primary to standby

2024-02-16 Thread Zhijie Hou (Fujitsu)
On Friday, February 16, 2024 3:43 PM Amit Kapila  
wrote:
> 
> On Fri, Feb 16, 2024 at 11:43 AM Amit Kapila  wrote:
> >
> > Thanks for noticing this. I have pushed all your debug patches. Let's
> > hope if there is a BF failure next time, we can gather enough
> > information to know the reason of the same.
> >
> 
> There is a new BF failure [1] after adding these LOGs and I think I know what 
> is
> going wrong. First, let's look at standby LOGs:
> 
> 2024-02-16 06:18:18.442 UTC [241414][client backend][2/14:0] DEBUG:
> segno: 4 of purposed restart_lsn for the synced slot, oldest_segno: 4 
> available
> 2024-02-16 06:18:18.443 UTC [241414][client backend][2/14:0] DEBUG:
> xmin required by slots: data 0, catalog 741
> 2024-02-16 06:18:18.443 UTC [241414][client backend][2/14:0] LOG:  could
> not sync slot information as remote slot precedes local slot: remote slot
> "lsub1_slot":
> LSN (0/4000168), catalog xmin (739) local slot: LSN (0/4000168), catalog xmin
> (741)
> 
> So, from the above LOG, it is clear that the remote slot's catalog xmin (739)
> precedes the local catalog xmin (741) which makes the sync on standby to not
> complete.
> 
> Next, let's look at the LOG from the primary during the nearby time:
> 2024-02-16 06:18:11.354 UTC [238037][autovacuum worker][5/17:0] DEBUG:
>  analyzing "pg_catalog.pg_depend"
> 2024-02-16 06:18:11.360 UTC [238037][autovacuum worker][5/17:0] DEBUG:
>  "pg_depend": scanned 13 of 13 pages, containing 1709 live rows and 0 dead
> rows; 1709 rows in sample, 1709 estimated total rows ...
> 2024-02-16 06:18:11.372 UTC [238037][autovacuum worker][5/0:0] DEBUG:
> Autovacuum VacuumUpdateCosts(db=1, rel=14050, dobalance=yes,
> cost_limit=200, cost_delay=2 active=yes failsafe=no)
> 2024-02-16 06:18:11.372 UTC [238037][autovacuum worker][5/19:0] DEBUG:
>  analyzing "information_schema.sql_features"
> 2024-02-16 06:18:11.377 UTC [238037][autovacuum worker][5/19:0] DEBUG:
>  "sql_features": scanned 8 of 8 pages, containing 756 live rows and 0 dead 
> rows;
> 756 rows in sample, 756 estimated total rows
> 
> It shows us that the autovacuum worker has analyzed a catalog table and, for
> updating its statistics in the pg_statistic table, it would have acquired a new
> transaction id. Now,
> after the slot creation, a new transaction id that has updated the catalog is
> generated on primary and would have been replicated to standby. Due to this,
> catalog_xmin of primary's slot would precede standby's catalog_xmin and we see
> this failure.
> 
> As per this theory, we should disable autovacuum on primary to avoid updates 
> to
> catalog_xmin values.
> 

Agreed. Here is the patch to disable autovacuum in the test.
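
Presumably along these lines in the 040 test (a sketch with a hypothetical
node variable name, not the actual patch contents):

# keep autovacuum from consuming xids that advance the catalog_xmin
$publisher->append_conf('postgresql.conf', 'autovacuum = off');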

Best Regards,
Hou zj


0001-Disable-autovacuum-on-primary-to-stabilize-the-040_s.patch
Description:  0001-Disable-autovacuum-on-primary-to-stabilize-the-040_s.patch


Re: Add new error_action COPY ON_ERROR "log"

2024-02-16 Thread Bharath Rupireddy
On Wed, Feb 14, 2024 at 5:04 PM torikoshia  wrote:
>
> [] let the users know what line numbers are
> > containing the errors that COPY ignored something like [1] with a
> > simple change like [2].
>
> Agreed.
> Unlike my patch, it hides the error information(i.e. 22P02: invalid
> input syntax for type integer: ), but I feel that it's usually
> sufficient to know the row number and column where the error occurred.

Right.

> > It not only helps users figure out which rows
> > and attributes were malformed, but also helps them redirect them to
> > server logs with setting log_min_messages = notice [3]. In the worst
> > case scenario, a problem with this one NOTICE per malformed row is
> > that it can overload the psql session if all the rows are malformed.
> > I'm not sure if this is a big problem, but IMO better than a single
> > summary NOTICE message and simpler than writing to tables of users'
> > choice.
>
> Maybe we could do what you suggested for the behavior when 'log' is set
> to on_error?

My point is: why would someone want just the summary of failures
without row and column info, especially for bulk loading tasks? I'd
suggest doing it independently of 'log' or 'table'.  I think we can
keep things simple just like the attached patch, and see how this
feature will be adopted. I'm sure we can come back and do things like
saving to 'log' or 'table' or 'separate_error_file' etc., if we
receive any firsthand feedback.

Thoughts?

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


v1-0001-Add-detailed-info-when-COPY-skips-soft-errors.patch
Description: Binary data


Re: Synchronizing slots from primary to standby

2024-02-16 Thread Bertrand Drouvot
Hi,

On Fri, Feb 16, 2024 at 01:12:31PM +0530, Amit Kapila wrote:
> On Fri, Feb 16, 2024 at 11:43 AM Amit Kapila  wrote:
> >
> > Thanks for noticing this. I have pushed all your debug patches. Let's
> > hope if there is a BF failure next time, we can gather enough
> > information to know the reason of the same.
> >
> 
> There is a new BF failure [1] after adding these LOGs and I think I
> know what is going wrong. First, let's look at standby LOGs:
> 
> 2024-02-16 06:18:18.442 UTC [241414][client backend][2/14:0] DEBUG:
> segno: 4 of purposed restart_lsn for the synced slot, oldest_segno: 4
> available
> 2024-02-16 06:18:18.443 UTC [241414][client backend][2/14:0] DEBUG:
> xmin required by slots: data 0, catalog 741
> 2024-02-16 06:18:18.443 UTC [241414][client backend][2/14:0] LOG:  could
> not sync slot information as remote slot precedes local slot: remote
> slot "lsub1_slot": LSN (0/4000168), catalog xmin (739) local slot: LSN
> (0/4000168), catalog xmin (741)
> 
> So, from the above LOG, it is clear that the remote slot's catalog
> xmin (739) precedes the local catalog xmin (741) which makes the sync
> on standby to not complete.

Yeah, catalog_xmin was the other suspect (with restart_lsn) and agree it is the
culprit here.

> Next, let's look at the LOG from the primary during the nearby time:
> 2024-02-16 06:18:11.354 UTC [238037][autovacuum worker][5/17:0] DEBUG:
>  analyzing "pg_catalog.pg_depend"
> 2024-02-16 06:18:11.360 UTC [238037][autovacuum worker][5/17:0] DEBUG:
>  "pg_depend": scanned 13 of 13 pages, containing 1709 live rows and 0
> dead rows; 1709 rows in sample, 1709 estimated total rows
> ...
> 2024-02-16 06:18:11.372 UTC [238037][autovacuum worker][5/0:0] DEBUG:
> Autovacuum VacuumUpdateCosts(db=1, rel=14050, dobalance=yes,
> cost_limit=200, cost_delay=2 active=yes failsafe=no)
> 2024-02-16 06:18:11.372 UTC [238037][autovacuum worker][5/19:0] DEBUG:
>  analyzing "information_schema.sql_features"
> 2024-02-16 06:18:11.377 UTC [238037][autovacuum worker][5/19:0] DEBUG:
>  "sql_features": scanned 8 of 8 pages, containing 756 live rows and 0
> dead rows; 756 rows in sample, 756 estimated total rows
> 
> It shows us that autovacuum worker has analyzed catalog table and for
> updating its statistics in pg_statistic table, it would have acquired
> a new transaction id. Now, after the slot creation, a new transaction
> id that has updated the catalog is generated on primary and would have
> been replicated to standby. Due to this, catalog_xmin of primary's
> slot would precede standby's catalog_xmin and we see this failure.
> 
> As per this theory, we should disable autovacuum on primary to avoid
> updates to catalog_xmin values.

Makes sense to me.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com