out of memory in crosstab()

2022-11-15 Thread Amit Langote
Hi,

A customer seems to have run into $subject.  Here's a reproducer they shared:

CREATE TABLE test (id integer, category integer, rate numeric);
INSERT INTO test
SELECT x.id,
   y.category,
   random() * 10 AS rate
FROM generate_series(1, 100) AS x(id)
INNER JOIN generate_series(1, 25) AS y(category)
ON 1 = 1;
SELECT * FROM crosstab('SELECT id, category, rate FROM test ORDER BY
1, 2') AS final_result(id integer, "1" numeric, "2" numeric, "3"
numeric, "4" numeric, "5" numeric);
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
The connection to the server was lost. Attempting reset: Failed.
Time: 106095.766 ms (01:46.096)
!?> \q

With the following logged:

LOG:  server process (PID 121846) was terminated by signal 9: Killed
DETAIL:  Failed process was running: SELECT * FROM crosstab('SELECT
id, category, rate FROM test ORDER BY 1, 2') AS final_result(id
integer, "1" numeric, "2" numeric, "3" numeric, "4" numeric, "5"
numeric);

The problem seems to be spi_printtup() continuing to allocate memory
to expand _SPI_current->tuptable to store the result of crosstab()'s
input query that's executed using:

/* Retrieve the desired rows */
ret = SPI_execute(sql, true, 0);

Note that this asks SPI to retrieve and store *all* result rows of the
query in _SPI_current->tuptable, and if there happen to be very many
rows, as in the example above, spi_printtup() ends up asking for far
too much memory.

The easiest fix for this seems to be for crosstab() to open a cursor
(SPI_cursor_open) and fetch the rows in batches (SPI_cursor_fetch)
rather than all in one go.  I have implemented that in the attached.
Maybe the patch should also address other functions that potentially
have the same problem.

I also wondered about fixing this by making _SPI_current->tuptable use
a tuplestore, which can spill to disk, as its backing store rather than
a plain C HeapTuple array, but I haven't checked how big a change that
would be; SPI_tuptable is referenced in many places across the tree.
Still, I suspect that idea has enough merit to be worth trying someday.

Thoughts on whether this should be fixed and the fix be back-patched?

-- 
Thanks, Amit Langote
EDB: http://www.enterprisedb.com


v1-0001-Teach-crosstab-to-use-SPI_cursor_-interface.patch
Description: Binary data


Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-15 Thread John Naylor
On Wed, Nov 16, 2022 at 12:33 PM Masahiko Sawada 
wrote:
>
> On Wed, Nov 16, 2022 at 1:46 PM John Naylor
>  wrote:
> >
> >
> > On Tue, Nov 15, 2022 at 11:59 AM Masahiko Sawada 
wrote:
> > > Thanks! Please let me know if there is something I can help with.
> >
> > I didn't get very far because the tests fail on 0004 in rt_verify_node:
> >
> > TRAP: failed Assert("n4->chunks[i - 1] < n4->chunks[i]"), File:
"../src/backend/lib/radixtree.c", Line: 2186, PID: 18242
>
> Which tests do you use to get this assertion failure? I've confirmed
> there is a bug in 0005 patch but without it, "make check-world"
> passed.

Hmm, I started over and rebuilt and it didn't reproduce. Not sure what
happened, sorry for the noise.

I'm attaching a test I wrote to stress test branch prediction in search,
and while trying it out I found two possible issues.

It's based on the random int load test, but tests search speed. Run like
this:

select * from bench_search_random_nodes(10 * 1000 * 1000)

It also takes some care to include all the different node kinds,
restricting the possible keys by AND-ing with a filter. Here's a simple
demo:

filter = ((uint64)1<<40)-1;
LOG:  num_keys = 967, height = 4, n4 = 17513814, n32 = 6320, n128 =
62663, n256 = 3130

Just using random integers leads to >99% of nodes being the smallest
kind. I wanted to get close to having the same number of each, but
that's difficult while still using random inputs. I ended up using

filter = (((uint64) 0x7F<<32) | (0x07<<24) | (0xFF<<16) | 0xFF)

which gives

LOG:  num_keys = 9291812, height = 4, n4 = 262144, n32 = 79603, n128 =
182670, n256 = 1024

Which seems okay for the task. One puzzling thing I found while trying
various filters is that sometimes the reported tree height would change.
For example:

filter = (((uint64) 1<<32) | (0xFF<<24));
LOG:  num_keys = 944, height = 7, n4 = 47515559, n32 = 6209, n128 =
62632, n256 = 3161

1) Any idea why the tree height would be reported as 7 here? I didn't
expect that.

2) It seems that 0004 actually causes a significant slowdown in this test
(as in the attached, using the second filter above and with turboboost
disabled):

v9 0003: 2062 2051 2050
v9 0004: 2346 2316 2321

That means my idea for the pointer struct might have some problems, at
least as currently implemented. Maybe in the course of separating out and
polishing that piece, an inefficiency will fall out. Or, it might be
another reason to template local and shared separately. Not sure yet. I
also haven't tried to adjust this test for the shared memory case.

--
John Naylor
EDB: http://www.enterprisedb.com
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql 
b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
index 0874201d7e..e0205b364e 100644
--- a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -43,6 +43,14 @@ returns record
 as 'MODULE_PATHNAME'
 LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
 
+create function bench_search_random_nodes(
+cnt int8,
+OUT mem_allocated int8,
+OUT search_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
 create function bench_fixed_height_search(
 fanout int4,
 OUT fanout int4,
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c 
b/contrib/bench_radix_tree/bench_radix_tree.c
index 7abb237e96..a43fc61c2d 100644
--- a/contrib/bench_radix_tree/bench_radix_tree.c
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -29,6 +29,7 @@ PG_FUNCTION_INFO_V1(bench_seq_search);
 PG_FUNCTION_INFO_V1(bench_shuffle_search);
 PG_FUNCTION_INFO_V1(bench_load_random_int);
 PG_FUNCTION_INFO_V1(bench_fixed_height_search);
+PG_FUNCTION_INFO_V1(bench_search_random_nodes);
 
 static uint64
 tid_to_key_off(ItemPointer tid, uint32 *off)
@@ -347,6 +348,77 @@ bench_load_random_int(PG_FUNCTION_ARGS)
PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, 
nulls)));
 }
 
+/* copy of splitmix64() */
+static uint64
+hash64(uint64 x)
+{
+   x ^= x >> 30;
+   x *= UINT64CONST(0xbf58476d1ce4e5b9);
+   x ^= x >> 27;
+   x *= UINT64CONST(0x94d049bb133111eb);
+   x ^= x >> 31;
+   return x;
+}
+
+/* attempts to have a relatively even population of node kinds */
+Datum
+bench_search_random_nodes(PG_FUNCTION_ARGS)
+{
+   uint64  cnt = (uint64) PG_GETARG_INT64(0);
+   radix_tree *rt;
+   TupleDesc   tupdesc;
+   TimestampTz start_time,
+   end_time;
+   longsecs;
+   int usecs;
+   int64   search_time_ms;
+   Datum   values[2] = {0};
+   boolnulls[2] = {0};
+   /* from trial and error */
+   const uint64 filter = (((uint64) 0x7F<<32) | (0x07<<24) | (0xFF<<16) | 
0xFF);
+
+   /* Build a tuple descriptor for our result type */
+   if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+   elog(ERROR, "return type must be a row type");
+
+  

Re: Avoid overhead open-close indexes (catalog updates)

2022-11-15 Thread Michael Paquier
On Wed, Nov 16, 2022 at 06:58:01AM +0900, Michael Paquier wrote:
> This one has been left out on purpose.  I was tempted to use
> WithInfo() with a CatalogIndexState opened optionally, but I got the
> impression that it makes the code a bit harder to follow, and
> AddRoleMems() is already complex on its own.  Most DDL patterns
> working on roles involve a single role.  More roles could of course
> be added in one shot, but the extra logic complexity did not look
> that appealing to me, especially as some role updates are skipped.

I have worked more on that today, and applied all of it after splitting
the whole into three commits, as different areas were touched.
It looks like we are good for this thread, then.

I have spotted more optimizations possible, particularly for operator
classes, but that could happen later.
--
Michael


signature.asc
Description: PGP signature


Re: HOT chain validation in verify_heapam()

2022-11-15 Thread Himanshu Upadhyaya
On Thu, Nov 10, 2022 at 3:38 AM Andres Freund  wrote:

>
> > + }
> > +
> > + /*
> > +  * Loop over offset and populate predecessor array from
> all entries
> > +  * that are present in successor array.
> > +  */
> > + ctx.attnum = -1;
> > + for (ctx.offnum = FirstOffsetNumber; ctx.offnum <= maxoff;
> > +  ctx.offnum = OffsetNumberNext(ctx.offnum))
> > + {
> > + ItemId  curr_lp;
> > + ItemId  next_lp;
> > + HeapTupleHeader curr_htup;
> > + HeapTupleHeader next_htup;
> > + TransactionId curr_xmax;
> > + TransactionId next_xmin;
> > +
> > + OffsetNumber nextoffnum = successor[ctx.offnum];
> > +
> > + curr_lp = PageGetItemId(ctx.page, ctx.offnum);
>
> Why do we get the item when nextoffnum is 0?
>
> Yes, right, I will move this call to PageGetItemId, just after the next
"if" condition in the patch.

>
> > + if (nextoffnum == 0 || !lp_valid[ctx.offnum] ||
> !lp_valid[nextoffnum])
> > + {
> > + /*
> > +  * This is either the last updated tuple
> in the chain or a
> > +  * corruption raised for this tuple.
> > +  */
>
> "or a corruption raised" isn't quite right grammatically.
>
> will change to "This is either the last updated tuple in the chain or
corruption has been raised for this tuple"

>
> > + continue;
> > + }
> > + if (ItemIdIsRedirected(curr_lp))
> > + {
> > + next_lp = PageGetItemId(ctx.page,
> nextoffnum);
> > + if (ItemIdIsRedirected(next_lp))
> > + {
> > + report_corruption(&ctx,
> > +
>  psprintf("redirected line pointer pointing to another redirected line
> pointer at offset %u",
> > +
> (unsigned) nextoffnum));
> > + continue;
> > + }
> > + next_htup = (HeapTupleHeader) PageGetItem(
> ctx.page, next_lp);
> > + if (!HeapTupleHeaderIsHeapOnly(next_htup))
> > + {
> > + report_corruption(&ctx,
> > +
>  psprintf("redirected tuple at line pointer offset %u is not heap only
> tuple",
> > +
> (unsigned) nextoffnum));
> > + }
> > + if ((next_htup->t_infomask & HEAP_UPDATED)
> == 0)
> > + {
> > + report_corruption(&ctx,
> > +
>  psprintf("redirected tuple at line pointer offset %u is not heap updated
> tuple",
> > +
> (unsigned) nextoffnum));
> > + }
> > + continue;
> > + }
> > +
> > + /*
> > +  * Add a line pointer offset to the predecessor
> array if xmax is
> > +  * matching with xmin of next tuple (reaching via
> its t_ctid).
> > +  * Prior to PostgreSQL 9.4, we actually changed
> the xmin to
> > +  * FrozenTransactionId
>
> I'm doubtful it's a good idea to try to validate the 9.4 case. The
> likelihood
> of getting that right seems low and I don't see us gaining much by even
> trying.
>
>
> > so we must add offset to predecessor
> > +  * array(irrespective of xmax-xmin matching) if
> updated tuple xmin
> > +  * is frozen, so that we can later do validation
> related to frozen
> > +  * xmin. Raise corruption if we have two tuples
> having the same
> > +  * predecessor.
> > +  * We add the offset to the predecessor array
> irrespective of the
> > +  * transaction (t_xmin) status. We will do
> validation related to
> > +  * the transaction status (and also all other
> validations) when we
> > +  * loop over the predecessor array.
> > +  */
> > + curr_htup = (HeapTupleHeader) PageGetItem(ctx.page,
> curr_lp);
> > + curr_xmax = HeapTupleHeaderGetUpdateXid(curr_htup);
> > + next_lp = PageGetItemId(ctx.page, nextoffnum);
> > + next_htup = (HeapTupleHeader) PageGetItem(ctx.page,
> next_lp);
> > + next_xmin = HeapTupleHeaderGetXmin(next_htup);
> > + if (TransactionIdIsValid(curr_xmax) &&
> > + (TransactionIdEquals(curr_xmax, next_xmin)
> ||
> 

Re: [PoC] Let libpq reject unexpected authentication requests

2022-11-15 Thread Michael Paquier
On Thu, Oct 20, 2022 at 11:36:34AM -0700, Jacob Champion wrote:
> Maybe I should just add a basic Assert here, to trip if someone adds a
> new SASL mechanism, and point that lucky person to this thread with a
> comment?

I am beginning to look at the last version proposed, which has been
marked as RfC.  Does this patch need a refresh in light of a9e9a9f and
0873b2d?  The changes for libpq_append_conn_error() should be
straightforward.

The CF bot is still happy.
--
Michael


signature.asc
Description: PGP signature


Re: Small miscellaneous fixes

2022-11-15 Thread Michael Paquier
On Tue, Oct 04, 2022 at 08:23:16AM -0300, Ranier Vilela wrote:
> Both are correct, I missed the pqsignal calls.
> 
> Attached patch to change this.

The change for pgbench is missing and this is only changing
pg_test_fsync.  Switching to sig_atomic_t would be fine on non-WIN32
as these are used in signal handlers, but are we sure that this is
fine on WIN32 for pg_test_fsync where we rely on a separate thread to
control the timing of the alarm?
--
Michael


signature.asc
Description: PGP signature


Re: Standardizing how pg_waldump presents recovery conflict XID cutoffs

2022-11-15 Thread Andres Freund
On 2022-11-15 20:48:56 -0800, Peter Geoghegan wrote:
> On Tue, Nov 15, 2022 at 5:29 PM Andres Freund  wrote:
> > If we want to focus on the MVCC effects we could just go for something like
> > snapshotConflictHorizon or such.
> 
> Okay, let's go with snapshotConflictHorizon. I'll use that name in the
> next revision, which I should be able to post tomorrow.

Cool!




Re: HOT chain validation in verify_heapam()

2022-11-15 Thread Andres Freund
Hi,

On 2022-11-14 15:07:05 -0800, Peter Geoghegan wrote:
> I'd really like to know if the scary HOT chain freezing scenario is
> possible, for the very obvious reason. Have you tried to write a test
> case for that?

I tried. Unfortunately, even if the bug exists, we currently don't have the
infrastructure to write isolationtester tests for it. There are just too many
points where we'd need to wait, and I don't know of ways to wait there with
isolationtester.

I'm quite certain that it's possible to end up freezing an earlier row
version in a HOT chain in < 14; I got there with careful gdb orchestration.
It's of course possible I screwed something up, given I only did it once,
interactively. Not sure if trying to fix it is worth the risk of backpatching
all the changes necessary to switch to the retry approach.

Greetings,

Andres Freund




Re: when the startup process doesn't (logging startup delays)

2022-11-15 Thread Bharath Rupireddy
On Tue, Nov 15, 2022 at 10:55 PM Robert Haas  wrote:
>
> On Tue, Nov 15, 2022 at 8:33 AM Bharath Rupireddy
>  wrote:
> > Please review the v2 patch.
>
> It seems to me that this will call disable_startup_progress_timeout
> once per WAL record, which seems like an unnecessary expense. How
> about leaving the code inside the loop just as we have it, and putting
> if (StandbyMode) disable_startup_progress_timeout() before entering
> the loop?

That can be done, but only if we also disable the timeout in another
place: when StandbyMode is set to true in ReadRecord(), that is, after
the standby server finishes crash recovery and enters standby mode.

I'm attaching the v3 patch for further review. Please find the CF
entry here - https://commitfest.postgresql.org/41/4012/.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From 612f17d5e81181b69c3d711524de840c88cb3b4a Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Wed, 16 Nov 2022 06:20:15 +
Subject: [PATCH v3] Disable STARTUP_PROGRESS_TIMEOUT in standby mode

In standby mode, we don't report recovery progress, to avoid bloating
the server logs, since the standby is always in recovery unless
promoted. However, startup_progress_timeout_handler() still gets called
every log_startup_progress_interval seconds, which is unnecessary.

Therefore, disable the timeout in standby mode.

Reported-by: Thomas Munro
Authors: Bharath Rupireddy, Simon Riggs
Reviewed-by: Robert Haas
Backpatch-through: 15
Discussion: https://www.postgresql.org/message-id/CA%2BhUKGKCHSffAj8zZJKJvNX7ygnQFxVD6wm1d-2j3fVw%2BMafPQ%40mail.gmail.com
---
 src/backend/access/transam/xlogrecovery.c | 30 +--
 src/backend/postmaster/startup.c  | 29 +++---
 src/include/postmaster/startup.h  |  2 ++
 3 files changed, 56 insertions(+), 5 deletions(-)

diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index cb07694aea..0d57ea17ca 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1644,9 +1644,24 @@ PerformWalRecovery(void)
 (errmsg("redo starts at %X/%X",
		LSN_FORMAT_ARGS(xlogreader->ReadRecPtr))));
 
-		/* Prepare to report progress of the redo phase. */
-		if (!StandbyMode)
+		if (StandbyMode)
+		{
+			/*
+			 * To avoid server log bloat, we don't report recovery progress in
+			 * a standby as it will always be in recovery unless promoted. We
+			 * disable startup progress timeout in standby mode to avoid
+			 * calling startup_progress_timeout_handler() unnecessarily.
+			 */
+			disable_startup_progress_timeout();
+		}
+		else
+		{
+			/*
+			 * When not in standby mode, prepare to report progress of the redo
+			 * phase.
+			 */
 			begin_startup_progress_phase();
+		}
 
 		/*
 		 * main redo apply loop
@@ -3115,8 +3130,19 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 		(errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
 InArchiveRecovery = true;
 if (StandbyModeRequested)
+{
 	StandbyMode = true;
 
+	/*
+	 * To avoid server log bloat, we don't report
+	 * recovery progress in a standby as it will always be in
+	 * recovery unless promoted. We disable startup progress
+	 * timeout in standby mode to avoid calling
+	 * startup_progress_timeout_handler() unnecessarily.
+	 */
+	disable_startup_progress_timeout();
+}
+
 SwitchIntoArchiveRecovery(xlogreader->EndRecPtr, replayTLI);
 minRecoveryPoint = xlogreader->EndRecPtr;
 minRecoveryPointTLI = replayTLI;
diff --git a/src/backend/postmaster/startup.c b/src/backend/postmaster/startup.c
index f99186eab7..2705e425e8 100644
--- a/src/backend/postmaster/startup.c
+++ b/src/backend/postmaster/startup.c
@@ -314,11 +314,22 @@ startup_progress_timeout_handler(void)
 	startup_progress_timer_expired = true;
 }
 
+void
+disable_startup_progress_timeout(void)
+{
+	/* Feature is disabled. */
+	if (log_startup_progress_interval == 0)
+		return;
+
+	disable_timeout(STARTUP_PROGRESS_TIMEOUT, false);
+	startup_progress_timer_expired = false;
+}
+
 /*
  * Set the start timestamp of the current operation and enable the timeout.
  */
 void
-begin_startup_progress_phase(void)
+enable_startup_progress_timeout(void)
 {
 	TimestampTz fin_time;
 
@@ -326,8 +337,6 @@ begin_startup_progress_phase(void)
 	if (log_startup_progress_interval == 0)
 		return;
 
-	disable_timeout(STARTUP_PROGRESS_TIMEOUT, false);
-	startup_progress_timer_expired = false;
 	startup_progress_phase_start_time = GetCurrentTimestamp();
 	fin_time = TimestampTzPlusMilliseconds(startup_progress_phase_start_time,
 		   log_startup_progress_interval);
@@ -335,6 +344,20 @@ begin_startup_progress_phase(void)
 		 log_startup_progress_interval);
 }
 
+/*
+ * A thin wrapper to first disable and then enable the startup progress timeout.
+ */
+void

Re: Switching XLog source from archive to streaming when primary available

2022-11-15 Thread Bharath Rupireddy
On Wed, Nov 16, 2022 at 9:38 AM Ian Lawrence Barwick  wrote:
>
> While reviewing the patch backlog, we have determined that this patch adds
> one or more TAP tests but has not added the test to the "meson.build" file.

Thanks for pointing it out. Yeah, the test wasn't being picked up by
meson builds. I added the new test file name to
src/test/recovery/meson.build.

I'm attaching the v10 patch for further review.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From 70323a5c96a889a21a5339c531193b6440d2e270 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Wed, 16 Nov 2022 05:34:41 +
Subject: [PATCH v10] Allow standby to switch WAL source from archive to
 streaming replication

A standby typically switches to streaming replication (getting WAL
from the primary) only when receiving from the WAL archive finishes
(no more WAL left there) or fails for any reason. Reading WAL from
the archive may not always be efficient and cheap, because network
latency and disk I/O costs can differ on the archive as compared to
the primary, and the archive often sits far from the standby,
impacting recovery performance. Hence, reading WAL from the primary
rather than the archive, by setting this parameter, enables the
standby to catch up with the primary sooner, reducing replication
lag and avoiding WAL file accumulation on the primary.

This feature adds a new GUC that specifies the amount of time after
which the standby attempts to switch the WAL source from the WAL
archive to streaming replication. However, all WAL present in pg_wal
is exhausted before switching. If the standby fails to switch to
stream mode, it falls back to archive mode.

Reported-by: SATYANARAYANA NARLAPURAM
Author: Bharath Rupireddy
Reviewed-by: Cary Huang, Nathan Bossart
Reviewed-by: Kyotaro Horiguchi
Discussion: https://www.postgresql.org/message-id/CAHg+QDdLmfpS0n0U3U+e+dw7X7jjEOsJJ0aLEsrtxs-tUyf5Ag@mail.gmail.com
---
 doc/src/sgml/config.sgml  |  46 +
 doc/src/sgml/high-availability.sgml   |   7 +
 src/backend/access/transam/xlogrecovery.c | 158 --
 src/backend/utils/misc/guc_tables.c   |  12 ++
 src/backend/utils/misc/postgresql.conf.sample |   4 +
 src/include/access/xlogrecovery.h |   1 +
 src/test/recovery/meson.build |   1 +
 src/test/recovery/t/034_wal_source_switch.pl  | 126 ++
 8 files changed, 338 insertions(+), 17 deletions(-)
 create mode 100644 src/test/recovery/t/034_wal_source_switch.pl

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index bd50ea8e48..8a3a9ff296 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4852,6 +4852,52 @@ ANY num_sync ( 
+  streaming_replication_retry_interval (integer)
+  
+   streaming_replication_retry_interval configuration parameter
+  
+  
+  
+   
+Specifies the amount of time after which the standby attempts to
+switch the WAL source from the WAL archive to streaming replication
+(getting WAL from the primary). All WAL present in pg_wal is
+exhausted before switching. If the standby fails to switch to stream
+mode, it falls back to archive mode.
+If this value is specified without units, it is taken as milliseconds.
+The default is five minutes (5min).
+With a lower setting of this parameter, the standby makes frequent
+WAL source switch attempts if the primary is lost for a long time.
+To avoid this, set a reasonable value.
+A setting of 0 disables the feature. When disabled,
+the standby switches to stream mode only when receiving from the
+WAL archive finishes (no more WAL left there) or fails for any reason.
+This parameter can only be set in
+the postgresql.conf file or on the server
+command line.
+   
+   
+Reading WAL from the archive may not always be efficient and cheap,
+because network latency and disk I/O costs can differ on the archive
+as compared to the primary, and the archive often sits far from the
+standby, impacting recovery performance. Hence, reading WAL from the
+primary, by setting this parameter, as opposed to the archive enables
+the standby to catch up with the primary sooner, thus reducing
+replication lag and avoiding WAL file accumulation on the primary.
+  
+   
+Note that the standby may not always attempt to switch source from
+WAL archive to streaming replication at exact
+streaming_replication_retry_interval intervals.
+For example, if the parameter is set to 1min and
+fetching from WAL archive takes 5min, then the
+source switch attempt happens for the next WAL file after the current
+one is fetched from the WAL archive and applied.
+  
+  
+ 
+
  
   recovery_min_apply_delay (integer)
   

Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-15 Thread Masahiko Sawada
On Wed, Nov 16, 2022 at 2:17 PM John Naylor
 wrote:
>
>
>
> On Wed, Nov 16, 2022 at 11:46 AM John Naylor  
> wrote:
> >
> >
> > On Tue, Nov 15, 2022 at 11:59 AM Masahiko Sawada  
> > wrote:
> > > Thanks! Please let me know if there is something I can help with.
> >
> > I didn't get very far because the tests fail on 0004 in rt_verify_node:
> >
> > TRAP: failed Assert("n4->chunks[i - 1] < n4->chunks[i]"), File: 
> > "../src/backend/lib/radixtree.c", Line: 2186, PID: 18242
>
> Actually I do want to offer some general advice. Upthread I recommended a 
> purely refactoring patch that added the node-pointer struct but did nothing 
> else, so that the DSA changes would be smaller. 0004 attempted pointer 
> tagging in the same commit, which makes it no longer a purely refactoring 
> patch, so that 1) makes it harder to tell what part caused the bug and 2) 
> obscures what is necessary for DSA pointers and what was additionally 
> necessary for pointer tagging. Shared memory support is a prerequisite for a 
> shippable feature, but pointer tagging is (hopefully) a performance 
> optimization. Let's keep them separate.

Totally agreed. I'll separate them in the next version patch. Thank
you for your advice.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-15 Thread Masahiko Sawada
On Wed, Nov 16, 2022 at 1:46 PM John Naylor
 wrote:
>
>
> On Tue, Nov 15, 2022 at 11:59 AM Masahiko Sawada  
> wrote:
> > Thanks! Please let me know if there is something I can help with.
>
> I didn't get very far because the tests fail on 0004 in rt_verify_node:
>
> TRAP: failed Assert("n4->chunks[i - 1] < n4->chunks[i]"), File: 
> "../src/backend/lib/radixtree.c", Line: 2186, PID: 18242

Which tests do you use to get this assertion failure? I've confirmed
there is a bug in 0005 patch but without it, "make check-world"
passed.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Allow file inclusion in pg_hba and pg_ident files

2022-11-15 Thread Julien Rouhaud
Le mer. 16 nov. 2022 à 13:01, Ian Lawrence Barwick  a
écrit :

> 2022年11月14日(月) 14:41 Michael Paquier :
> >
> > On Sat, Nov 12, 2022 at 04:13:53PM +0800, Julien Rouhaud wrote:
> > > It's looks good to me.  I agree that file name and line number should
> be enough
> > > to diagnose any unexpected error.
> >
> > Thanks for checking.  I have looked at 0001 and 0002 again with a
> > fresh mind, and applied both of them this morning.
>
> Hi
>
> Just a quick note to mention that 0003 adds a new test ("
> 004_file_inclusion.pl")
> but doesn't seem to include it in the "meson.build" file.
>

Ah, thanks! Michael also mentioned that previously, but as we were focusing
on the preliminary refactoring patches I didn't check whether it was fixed
in the patches sent recently. I will definitely take care of it for the
next round.

>


Re: Memory leak in adjust_data_dir

2022-11-15 Thread Japin Li


On Wed, 16 Nov 2022 at 12:19, Tom Lane  wrote:
> Japin Li  writes:
>> Hi, hackers,
>
> ITYM pgsql-hackers, this is off-topic here.
>

Sorry for the typo in the email address.

>> When I'm reviewing patch [1], I find there is a memory leak in
>> adjust_data_dir(), the cmd was allocated by psprintf(), but forget
>> releasing.
>
> Can't get excited about it in pg_ctl; that program won't run
> long enough for anybody to notice.
>

Yeah, it won't run for a long time.  I noticed that the memory for
my_exec_path is released, so I think we should also release cmd.  IMO,
it's a bit confusing: when should we release the memory of short-lived
variables?

[Here is the origin contents which I send a wrong mail-list]

Hi, hackers,

When I'm reviewing patch [1], I find there is a memory leak in
adjust_data_dir(), the cmd was allocated by psprintf(), but forget
releasing.

[1] 
https://www.postgresql.org/message-id/CALte62y3yZpHNFnYVz1uACaFbmb6go9fyeRaO5uHF5XaxtarbA%40mail.gmail.com

diff --git a/src/bin/pg_ctl/pg_ctl.c b/src/bin/pg_ctl/pg_ctl.c
index ceab603c47..ace2d676fc 100644
--- a/src/bin/pg_ctl/pg_ctl.c
+++ b/src/bin/pg_ctl/pg_ctl.c
@@ -2159,6 +2159,7 @@ adjust_data_dir(void)
write_stderr(_("%s: could not determine the data directory 
using command \"%s\"\n"), progname, cmd);
exit(1);
}
+   free(cmd);
free(my_exec_path);
 
/* strip trailing newline and carriage return */

-- 
Regrads,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.




Re: New strategies for freezing, advancing relfrozenxid early

2022-11-15 Thread Andres Freund
Hi,

On 2022-11-15 19:02:12 -0800, Peter Geoghegan wrote:
> From 352867c5027fae6194ab1c6480cd326963e201b1 Mon Sep 17 00:00:00 2001
> From: Peter Geoghegan 
> Date: Sun, 12 Jun 2022 15:46:08 -0700
> Subject: [PATCH v6 1/6] Add page-level freezing to VACUUM.
> 
> Teach VACUUM to decide on whether or not to trigger freezing at the
> level of whole heap pages, not individual tuple fields.  OldestXmin is
> now treated as the cutoff for freezing eligibility in all cases, while
> FreezeLimit is used to trigger freezing at the level of each page (we
> now freeze all eligible XIDs on a page when freezing is triggered for
> the page).
> 
> This approach decouples the question of _how_ VACUUM could/will freeze a
> given heap page (which of its XIDs are eligible to be frozen) from the
> question of whether it actually makes sense to do so right now.
> 
> Just adding page-level freezing does not change all that much on its
> own: VACUUM will still typically freeze very lazily, since we're only
> forcing freezing of all of a page's eligible tuples when we decide to
> freeze at least one (on the basis of XID age and FreezeLimit).  For now
> VACUUM still freezes everything almost as lazily as it always has.
> Later work will teach VACUUM to apply an alternative eager freezing
> strategy that triggers page-level freezing earlier, based on additional
> criteria.
> ---
>  src/include/access/heapam.h  |  42 +-
>  src/backend/access/heap/heapam.c | 199 +--
>  src/backend/access/heap/vacuumlazy.c |  95 -
>  3 files changed, 222 insertions(+), 114 deletions(-)
> 
> diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
> index ebe723abb..ea709bf1b 100644
> --- a/src/include/access/heapam.h
> +++ b/src/include/access/heapam.h
> @@ -112,6 +112,38 @@ typedef struct HeapTupleFreeze
>   OffsetNumber offset;
>  } HeapTupleFreeze;
>  
> +/*
> + * State used by VACUUM to track what the oldest extant XID/MXID will become
> + * when determining whether and how to freeze a page's heap tuples via calls to
> + * heap_prepare_freeze_tuple.

Perhaps this could say something like "what the oldest extant XID/MXID
currently is and what it would be if we decide to freeze the page" or such?


> + * The relfrozenxid_out and relminmxid_out fields are the current target
> + * relfrozenxid and relminmxid for VACUUM caller's heap rel.  Any and all

"VACUUM caller's heap rel." could stand to be rephrased.


> + * unfrozen XIDs or MXIDs that remain in caller's rel after VACUUM finishes
> + * _must_ have values >= the final relfrozenxid/relminmxid values in 
> pg_class.
> + * This includes XIDs that remain as MultiXact members from any tuple's xmax.
> + * Each heap_prepare_freeze_tuple call pushes back relfrozenxid_out and/or
> + * relminmxid_out as needed to avoid unsafe values in rel's authoritative
> + * pg_class tuple.
> + *
> + * Alternative "no freeze" variants of relfrozenxid_nofreeze_out and
> + * relminmxid_nofreeze_out must also be maintained for !freeze pages.
> + */

relfrozenxid_nofreeze_out isn't really a "no freeze variant" :)

I think it might be better to just always maintain the nofreeze state.


> +typedef struct HeapPageFreeze
> +{
> + /* Is heap_prepare_freeze_tuple caller required to freeze page? */
> + boolfreeze;

s/freeze/freeze_required/?


> + /* Values used when page is to be frozen based on freeze plans */
> + TransactionId relfrozenxid_out;
> + MultiXactId relminmxid_out;
> +
> + /* Used by caller for '!freeze' pages */
> + TransactionId relfrozenxid_nofreeze_out;
> + MultiXactId relminmxid_nofreeze_out;
> +
> +} HeapPageFreeze;
> +

Given the number of parameters to heap_prepare_freeze_tuple, why don't we pass
in more of them in via HeapPageFreeze?


>  /* 
>   *   function prototypes for heap access method
>   *
> @@ -180,17 +212,17 @@ extern void heap_inplace_update(Relation relation, HeapTuple tuple);
>  extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
>                                        TransactionId relfrozenxid, TransactionId relminmxid,
>                                        TransactionId cutoff_xid, TransactionId cutoff_multi,
> +                                      TransactionId limit_xid, MultiXactId limit_multi,
>                                        HeapTupleFreeze *frz, bool *totally_frozen,
> -                                      TransactionId *relfrozenxid_out,
> -                                      MultiXactId *relminmxid_out);
> +                                      HeapPageFreeze *xtrack);

What does 'xtrack' stand for? Xid Tracking?


>   * VACUUM caller must assemble HeapFreezeTuple entries for 

Re: pg_rewind: warn when checkpoint hasn't happened after promotion

2022-11-15 Thread kuroda . keisuke

Hi, hackers


The issue here is that pg_rewind looks into the control file to determine
the source timeline, because the control file is not updated until the
first checkpoint ends after promotion finishes, even though file
blocks have already diverged.

Even in that case history file for the new timeline is already
created, so searching for the latest history file works.


I think this change is a good one because if I want
pg_rewind to run automatically after a promotion,
I don't have to wait for the checkpoint to complete.

The attached patch is Horiguchi-san's patch with
additional tests. The tests are based on James's tests;
"010_no_checkpoint_after_promotion.pl" tests that
pg_rewind executes successfully without a checkpoint
being run after promotion.

Best Regards,
Keisuke Kuroda
NTT COMWARE

diff --git a/doc/src/sgml/ref/pg_rewind.sgml b/doc/src/sgml/ref/pg_rewind.sgml
index 69d6924b3a..7f752b2ed0 100644
--- a/doc/src/sgml/ref/pg_rewind.sgml
+++ b/doc/src/sgml/ref/pg_rewind.sgml
@@ -334,15 +334,6 @@ GRANT EXECUTE ON function pg_catalog.pg_read_binary_file(text, bigint, bigint, b
 
   
 
-  
-   When executing pg_rewind using an online
-   cluster as source which has been recently promoted, it is necessary
-   to execute a CHECKPOINT after promotion such that its
-   control file reflects up-to-date timeline information, which is used by
-   pg_rewind to check if the target cluster
-   can be rewound using the designated source cluster.
-  
-
   
How It Works
 
diff --git a/src/bin/pg_rewind/file_ops.c b/src/bin/pg_rewind/file_ops.c
index db190bcba7..fe98f47519 100644
--- a/src/bin/pg_rewind/file_ops.c
+++ b/src/bin/pg_rewind/file_ops.c
@@ -309,9 +309,11 @@ sync_target_dir(void)
  * buffer is actually *filesize + 1. That's handy when reading a text file.
  * This function can be used to read binary files as well, you can just
  * ignore the zero-terminator in that case.
+ *
+ * If noerror is true, returns NULL when the file is not found.
  */
 char *
-slurpFile(const char *datadir, const char *path, size_t *filesize)
+slurpFile(const char *datadir, const char *path, size_t *filesize, bool noerror)
 {
 	int			fd;
 	char	   *buffer;
@@ -323,8 +325,13 @@ slurpFile(const char *datadir, const char *path, size_t *filesize)
 	snprintf(fullpath, sizeof(fullpath), "%s/%s", datadir, path);
 
 	if ((fd = open(fullpath, O_RDONLY | PG_BINARY, 0)) == -1)
+	{
+		if (noerror && errno == ENOENT)
+			return NULL;
+
 		pg_fatal("could not open file \"%s\" for reading: %m",
  fullpath);
+	}
 
	if (fstat(fd, &statbuf) < 0)
 		pg_fatal("could not open file \"%s\" for reading: %m",
diff --git a/src/bin/pg_rewind/file_ops.h b/src/bin/pg_rewind/file_ops.h
index e6277c4631..559787a072 100644
--- a/src/bin/pg_rewind/file_ops.h
+++ b/src/bin/pg_rewind/file_ops.h
@@ -21,7 +21,8 @@ extern void create_target(file_entry_t *entry);
 extern void remove_target(file_entry_t *entry);
 extern void sync_target_dir(void);
 
-extern char *slurpFile(const char *datadir, const char *path, size_t *filesize);
+extern char *slurpFile(const char *datadir, const char *path, size_t *filesize,
+   bool noerror);
 
 typedef void (*process_file_callback_t) (const char *path, file_type_t type, size_t size, const char *link_target);
 extern void traverse_datadir(const char *datadir, process_file_callback_t callback);
diff --git a/src/bin/pg_rewind/libpq_source.c b/src/bin/pg_rewind/libpq_source.c
index 011c9cce6e..751c96e6e4 100644
--- a/src/bin/pg_rewind/libpq_source.c
+++ b/src/bin/pg_rewind/libpq_source.c
@@ -68,7 +68,7 @@ static void libpq_queue_fetch_range(rewind_source *source, const char *path,
 	off_t off, size_t len);
 static void libpq_finish_fetch(rewind_source *source);
 static char *libpq_fetch_file(rewind_source *source, const char *path,
-			  size_t *filesize);
+			  size_t *filesize, bool noerror);
 static XLogRecPtr libpq_get_current_wal_insert_lsn(rewind_source *source);
 static void libpq_destroy(rewind_source *source);
 
@@ -620,9 +620,12 @@ appendArrayEscapedString(StringInfo buf, const char *str)
 
 /*
  * Fetch a single file as a malloc'd buffer.
+ *
+ * If noerror is true, returns NULL if pg_read_binary_file() failed.
  */
 static char *
-libpq_fetch_file(rewind_source *source, const char *path, size_t *filesize)
+libpq_fetch_file(rewind_source *source, const char *path, size_t *filesize,
+ bool noerror)
 {
 	PGconn	   *conn = ((libpq_source *) source)->conn;
 	PGresult   *res;
@@ -631,6 +634,34 @@ libpq_fetch_file(rewind_source *source, const char *path, size_t *filesize)
 	const char *paramValues[1];
 
 	paramValues[0] = path;
+
+	/*
+	 * check the existence of the file. We do this before executing
+	 * pg_read_binary_file so that server doesn't emit an error
+	 */
+	if (noerror)
+	{
+		res = PQexecParams(conn, "SELECT pg_stat_file($1, true)",
+		   1, NULL, paramValues, NULL, NULL, 1);
+
+		if (PQresultStatus(res) != PGRES_TUPLES_OK)
+		{
+			pg_fatal("could not stat remote file \"%s\": %s",
+	 path, 

Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-15 Thread John Naylor
On Wed, Nov 16, 2022 at 11:46 AM John Naylor 
wrote:
>
>
> On Tue, Nov 15, 2022 at 11:59 AM Masahiko Sawada 
wrote:
> > Thanks! Please let me know if there is something I can help with.
>
> I didn't get very far because the tests fail on 0004 in rt_verify_node:
>
> TRAP: failed Assert("n4->chunks[i - 1] < n4->chunks[i]"), File:
"../src/backend/lib/radixtree.c", Line: 2186, PID: 18242

Actually I do want to offer some general advice. Upthread I recommended a
purely refactoring patch that added the node-pointer struct but did nothing
else, so that the DSA changes would be smaller. 0004 attempted pointer
tagging in the same commit, which makes it no longer a purely refactoring
patch, so that 1) makes it harder to tell what part caused the bug and 2)
obscures what is necessary for DSA pointers and what was additionally
necessary for pointer tagging. Shared memory support is a prerequisite for
a shippable feature, but pointer tagging is (hopefully) a performance
optimization. Let's keep them separate.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: Allow file inclusion in pg_hba and pg_ident files

2022-11-15 Thread Ian Lawrence Barwick
2022年11月14日(月) 14:41 Michael Paquier :
>
> On Sat, Nov 12, 2022 at 04:13:53PM +0800, Julien Rouhaud wrote:
> > It's looks good to me.  I agree that file name and line number should be 
> > enough
> > to diagnose any unexpected error.
>
> Thanks for checking.  I have looked at 0001 and 0002 again with a
> fresh mind, and applied both of them this morning.

Hi

Just a quick note to mention that 0003 adds a new test ("004_file_inclusion.pl")
but doesn't seem to include it in the "meson.build" file.

Regards

Ian Barwick




Re: Standardizing how pg_waldump presents recovery conflict XID cutoffs

2022-11-15 Thread Peter Geoghegan
On Tue, Nov 15, 2022 at 5:29 PM Andres Freund  wrote:
> If we want to focus on the mvcc affects we could just go for something like
> snapshotConflictHorizon or such.

Okay, let's go with snapshotConflictHorizon. I'll use that name in the
next revision, which I should be able to post tomorrow.

-- 
Peter Geoghegan




Re: vacuumlo: add test to vacuumlo for test coverage

2022-11-15 Thread Ian Lawrence Barwick
2022年9月3日(土) 17:28 Dong Wook Lee :
>
> Hi hackers,
> I write a tiny patch about vacuumlo to improve test coverage.
> I hope my work is meaningful.

Hi

While reviewing the patch backlog, we have determined that this patch adds
one or more TAP tests but has not added the test to the "meson.build" file.

To do this, locate the relevant "meson.build" file for each test and add it
in the 'tests' dictionary, which will look something like this:

  'tap': {
'tests': [
  't/001_basic.pl',
],
  },

For some additional details please see this Wiki article:

  https://wiki.postgresql.org/wiki/Meson_for_patch_authors

For more information on the meson build system for PostgreSQL see:

  https://wiki.postgresql.org/wiki/Meson


Regards

Ian Barwick




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-15 Thread John Naylor
On Tue, Nov 15, 2022 at 11:59 AM Masahiko Sawada 
wrote:
> Thanks! Please let me know if there is something I can help with.

I didn't get very far because the tests fail on 0004 in rt_verify_node:

TRAP: failed Assert("n4->chunks[i - 1] < n4->chunks[i]"), File:
"../src/backend/lib/radixtree.c", Line: 2186, PID: 18242

--
John Naylor
EDB: http://www.enterprisedb.com


Re: logical decoding and replication of sequences, take 2

2022-11-15 Thread Ian Lawrence Barwick
2022年11月12日(土) 7:49 Tomas Vondra :
>
> Hi,
>
> I noticed on cfbot the patch no longer applies, so here's a rebased
> version. Most of the breakage was due to the column filtering reworks,
> grammar changes etc. A lot of bitrot, but mostly mechanical stuff.

(...)

Hi

Thanks for the update patch.

While reviewing the patch backlog, we have determined that this patch adds
one or more TAP tests but has not added the test to the "meson.build" file.

To do this, locate the relevant "meson.build" file for each test and add it
in the 'tests' dictionary, which will look something like this:

  'tap': {
'tests': [
  't/001_basic.pl',
],
  },

For some additional details please see this Wiki article:

  https://wiki.postgresql.org/wiki/Meson_for_patch_authors

For more information on the meson build system for PostgreSQL see:

  https://wiki.postgresql.org/wiki/Meson


Regards

Ian Barwick




Re: Testing autovacuum wraparound (including failsafe)

2022-11-15 Thread Ian Lawrence Barwick
2022年6月30日(木) 10:40 Masahiko Sawada :
>
> Hi,
>
> On Tue, Feb 1, 2022 at 11:58 AM Masahiko Sawada  wrote:
> >
> > On Fri, Jun 11, 2021 at 10:19 AM Andres Freund  wrote:
> > >
> > > Hi,
> > >
> > > On 2021-06-10 16:42:01 +0300, Anastasia Lubennikova wrote:
> > > > Cool. Thank you for working on that!
> > > > Could you please share a WIP patch for the $subj? I'd be happy to help 
> > > > with
> > > > it.
> > >
> > > I've attached the current WIP state, which hasn't evolved much since
> > > this message... I put the test in 
> > > src/backend/access/heap/t/001_emergency_vacuum.pl
> > > but I'm not sure that's the best place. But I didn't think
> > > src/test/recovery is great either.
> > >
> >
> > Thank you for sharing the WIP patch.
> >
> > Regarding point (1) you mentioned (StartupSUBTRANS() takes a long time
> > for zeroing out all pages), how about using single-user mode instead
> > of preparing the transaction? That is, after pg_resetwal we check the
> > ages of datfrozenxid by executing a query in single-user mode. That
> > way, we don’t need to worry about autovacuum concurrently running
> > while checking the ages of frozenxids. I’ve attached a PoC patch that
> > does the scenario like:
> >
> > 1. start cluster with autovacuum=off and create tables with a few data
> > and make garbage on them
> > 2. stop cluster and do pg_resetwal
> > 3. start cluster in single-user mode
> > 4. check age(datfrozenxid)
> > 5. stop cluster
> > 6. start cluster and wait for autovacuums to increase template0,
> > template1, and postgres datfrozenxids
>
> The above steps are wrong.
>
> I think we can expose a function in an extension used only by this
> test in order to set nextXid to a future value with zeroing out
> clog/subtrans pages. We don't need to fill all clog/subtrans pages
> between oldestActiveXID and nextXid. I've attached a PoC patch for
> adding this regression test and am going to register it to the next
> CF.
>
> BTW, while testing the emergency situation, I found there is a race
> condition where anti-wraparound vacuum isn't invoked with the settings
> autovacuum = off, autovacuum_max_workers = 1. An autovacuum worker
> sends a signal to the postmaster after advancing datfrozenxid in
> SetTransactionIdLimit(). But with the settings, if the autovacuum
> launcher attempts to launch a worker before the autovacuum worker who
> has signaled to the postmaster finishes, the launcher exits without
> launching a worker due to no free workers. The new launcher won’t be
> launched until new XID is generated (and only when new XID % 65536 ==
> 0). Although autovacuum_max_workers = 1 is not mandatory for this
> test, it's easier to verify the order of operations.

Hi

Thanks for the patch. While reviewing the patch backlog, we have determined that
the latest version of this patch was submitted before meson support was
implemented, so it should have a "meson.build" file added for consideration for
inclusion in PostgreSQL 16.

Regards

Ian Barwick




Re: Hash index build performance tweak from sorting

2022-11-15 Thread Simon Riggs
On Wed, 21 Sept 2022 at 02:32, David Rowley  wrote:
>

> I took this patch for a spin and saw a 2.5% performance increase using
> the random INT test that Tom posted. The index took an average of
> 7227.47 milliseconds on master and 7045.05 with the patch applied.

Thanks for the review, apologies for the delay in acting upon your comments.

My tests show the sorted and random tests are BOTH 4.6% faster with
the v3 changes using 5-test avg, but you'll be pleased to know your
kit is about 15.5% faster than mine, comparing absolute execution
times.

> On making a pass of the changes, I noted down a few things.

> 2. There are quite a few casts that are not required. e.g:
>
> _hash_doinsert(rel, itup, heapRel, (HashInsertState) );
> buildstate.spool = _h_spoolinit(heap, index, num_buckets,
> (HashInsertState) );
> buildstate.istate = (HashInsertState) 

Removed

> 3. Just a minor nitpick. Line wraps at 80 chars. You're doing this
> sometimes but not others. This seems just to be due to the additional
> function parameters that have been added.

Done

> 4. I added the following Assert to _hash_pgaddtup() as I expected the
> itup_off to be set to the same thing before and after this change. I
> see the Assert is failing in the regression tests.
>
> Assert(PageGetMaxOffsetNumber(page) + 1 ==
>_hash_binsearch(page, _hash_get_indextuple_hashkey(itup)));
>
> I think this is because _hash_binsearch() returns the offset with the
> first tuple with the given hashkey, so if there are duplicate hashkey
> values then it looks like PageAddItemExtended() will set needshuffle
> and memmove() the existing item(s) up one slot.  I don't know this
> hash index building code very well, but I wonder if it's worth having
> another version of _hash_binsearch() that can be used to make
> _hash_pgaddtup() put any duplicate hashkeys after the existing ones
> rather than before and shuffle the others up? It sounds like that
> might speed up normal insertions when there are many duplicate values
> to hash.

Sounds reasonable.

I tried changing src/backend/access/hash/hashinsert.c, line 307 (on
patched file) from

-   itup_off = _hash_binsearch(page, hashkey);

to

+   itup_off = _hash_binsearch_last(page, hashkey) + 1;

since exactly such a function already exists in code.

But this seems to cause a consistent ~1% regression in performance,
which surprises me.
Test was the random INSERT SELECT with 10E6 rows after the CREATE INDEX.

Not sure what to suggest, but the above change is not included in v3.

> I wonder if this might be the reason the random INT test didn't come
> out as good as your original test which had unique values. The unique
> values test would do less shuffling during PageAddItemExtended(). If
> so, that implies that skipping the binary search is only part of the
> gains here and that not shuffling tuples accounts for quite a bit of
> the gain you're seeing. If so, then it would be good to not have to
> shuffle duplicate hashkey tuples up in the page during normal
> insertions as well as when building the index.

There is still a 1.4% lead for the sorted test over the random one, in my tests.

> In any case, it would be nice to have some way to assert that we don't
> accidentally pass sorted==true to _hash_pgaddtup() when there's an
> existing item on the page with a higher hash value. Maybe we could
> just look at the hash value of the last tuple on the page and ensure
> it's <= to the current one?

Done

> 5. I think it would be nicer to move the insertstate.sorted = false;
> into the else branch in hashbuild(). However, you might have to do
> that anyway if you were to do what I mentioned in #1.

Done

> 1. In _h_spoolinit() the HSpool is allocated with palloc and then
> you're setting the istate field to a pointer to the HashInsertState
> which is allocated on the stack by the only calling function
> (hashbuild()).  Looking at hashbuild(), it looks like the return value
> of _h_spoolinit is never put anywhere to make it available outside of
> the function, so it does not seem like there is an actual bug there.
> However, it just seems like a bug waiting to happen. If _h_spoolinit()
> is pallocing memory, then we really shouldn't be setting pointer
> fields in that memory to point to something on the stack.  It might be
> nicer if the istate field in HSpool was a HashInsertStateData and
> _h_spoolinit() just memcpy'd the contents of that parameter.  That
> would make HSpool 4 bytes smaller and save additional pointer
> dereferences in _hash_doinsert().

> This is just my opinion, but I don't really see the value in having a
> typedef for a pointer to HashInsertStateData. I can understand that if
> the struct was local to a .c file, but you've got the struct and
> pointer typedef in the same header. I understand we often do this in
> the code, but I feel like we do it less often in newer code. e.g we do
> it in aset.c but not generation.c (which is much newer than aset.c).
> My personal preference would be just 

Re: SUBTRANS: Minimizing calls to SubTransSetParent()

2022-11-15 Thread Dilip Kumar
On Wed, Nov 16, 2022 at 8:41 AM Simon Riggs
 wrote:
>
> > No, that's not what XidInMVCCSnapshot does. If snapshot->suboverflowed
> > is set (ie, somebody somewhere/somewhen overflowed), then it does
> > SubTransGetTopmostTransaction and searches only the xips with the result.
> > This behavior requires that all live subxids be correctly mapped by
> > SubTransGetTopmostTransaction, or we'll draw false conclusions.
>
> Your comments are correct wrt to the existing coding, but not to the
> patch, which is coded as described and does not suffer those issues.
>

This will work because of these two changes in the patch: 1) even though
the snapshot is marked "overflowed", we will include all the
subtransaction information in snapshot->subxip; 2) as Simon mentioned,
in XidInMVCCSnapshot() we first search the subxip cache in
snapshot->subxip, and only if the XID is not found there do we look
into the SLRU.  So, because of 1), we will always find any
concurrent subtransaction in "snapshot->subxip".

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Suppressing useless wakeups in walreceiver

2022-11-15 Thread Nathan Bossart
On Wed, Nov 16, 2022 at 04:57:08PM +1300, Thomas Munro wrote:
> On Tue, Nov 15, 2022 at 5:49 PM Nathan Bossart  
> wrote:
>> Another option might be to just force initial reply/feedback messages when
>> streaming starts.  The attached patch improves src/test/recovery test
>> runtime just like the previous one I posted.
> 
> Yeah, looks good in my tests.  I think we just need to make it
> conditional so we don't force it if someone has
> wal_receiver_status_interval disabled.

Yeah, that makes sense.  IIUC setting "force" to false would have the same
effect for the initial round of streaming, but since writePtr/flushPtr will
be set in later rounds, no reply would be guaranteed.  That might be good
enough for the tests, but it seems a bit inconsistent.  So, your patch is
probably the way to go.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Add test module for Custom WAL Resource Manager feature

2022-11-15 Thread Jeff Davis
On Wed, 2022-11-16 at 10:27 +0900, Michael Paquier wrote:
> On Wed, Nov 16, 2022 at 10:26:32AM +0900, Michael Paquier wrote:
> > Not many buildfarm members test 32b builds, but lapwing does.
> 
> Well, it didn't take long:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lapwing&dt=2022-11-16%2000%3A40%3A11

Fixed, thank you. I'll be more diligent about pushing to github CI
first.

Regards,
Jeff Davis





Re: Switching XLog source from archive to streaming when primary available

2022-11-15 Thread Ian Lawrence Barwick
2022年10月18日(火) 11:02 Bharath Rupireddy :
>
> On Tue, Oct 11, 2022 at 8:40 AM Nathan Bossart  
> wrote:
> >
> > On Mon, Oct 10, 2022 at 11:33:57AM +0530, Bharath Rupireddy wrote:
> > > On Mon, Oct 10, 2022 at 3:17 AM Nathan Bossart  
> > > wrote:
> > >> I wonder if it would be better to simply remove this extra polling of
> > >> pg_wal as a prerequisite to your patch.  The existing commentary leads me
> > >> to think there might not be a strong reason for this behavior, so it 
> > >> could
> > >> be a nice way to simplify your patch.
> > >
> > > I don't think it's a good idea to remove that completely. As said
> > > above, it might help someone, we never know.
> >
> > It would be great to hear whether anyone is using this functionality.  If
> > no one is aware of existing usage and there is no interest in keeping it
> > around, I don't think it would be unreasonable to remove it in v16.
>
> It seems like exhausting all the WAL in pg_wal before switching to
> streaming after failing to fetch from archive is unremovable. I found
> this after experimenting with it, here are my findings:
> 1. The standby has to recover initial WAL files in the pg_wal
> directory even for the normal post-restart/first-time-start case, I
> mean, in non-crash recovery case.
> 2. The standby received WAL files from primary (walreceiver just
> writes and flushes the received WAL to WAL files under pg_wal)
> pretty-fast and/or standby recovery is slow, say both the standby
> connection to primary and archive connection are broken for whatever
> reasons, then it has WAL files to recover in pg_wal directory.
>
> I think the fundamental behaviour for the standby is that it has to
> fully recover to the end of WAL under pg_wal no matter who copies WAL
> files there. I fully understand the consequences of manually copying
> WAL files into pg_wal, for that matter, manually copying/tinkering any
> other files into/under the data directory is something we don't
> recommend and encourage.
>
> In summary, the standby state machine in WaitForWALToBecomeAvailable()
> exhausts all the WAL in pg_wal before switching to streaming after
> failing to fetch from archive. The v8 patch proposed upthread deviates
> from this behaviour. Hence, attaching v9 patch that keeps the
> behaviour as-is, that means, the standby exhausts all the WAL in
> pg_wal before switching to streaming after fetching WAL from archive
> for at least streaming_replication_retry_interval milliseconds.
>
> Please review the v9 patch further.

Thanks for the updated patch.

While reviewing the patch backlog, we have determined that this patch adds
one or more TAP tests but has not added the test to the "meson.build" file.

To do this, locate the relevant "meson.build" file for each test and add it
in the 'tests' dictionary, which will look something like this:

  'tap': {
'tests': [
  't/001_basic.pl',
],
  },

For some additional details please see this Wiki article:

  https://wiki.postgresql.org/wiki/Meson_for_patch_authors

For more information on the meson build system for PostgreSQL see:

  https://wiki.postgresql.org/wiki/Meson


Regards

Ian Barwick




Re: Skipping schema changes in publication

2022-11-15 Thread Ian Lawrence Barwick
2022年11月7日(月) 22:39 vignesh C :
>
> On Fri, 4 Nov 2022 at 08:19, Ian Lawrence Barwick  wrote:
> >
> > Hi
> >
> > cfbot reports the patch no longer applies [1]. As CommitFest 2022-11 is
> > currently underway, this would be an excellent time to update the patch.
> >
> > [1] http://cfbot.cputube.org/patch_40_3646.log
>
> Here is an updated patch which is rebased on top of HEAD.

Thanks for the updated patch.

While reviewing the patch backlog, we have determined that this patch adds
one or more TAP tests but has not added the test to the "meson.build" file.

To do this, locate the relevant "meson.build" file for each test and add it
in the 'tests' dictionary, which will look something like this:

  'tap': {
'tests': [
  't/001_basic.pl',
],
  },

For some additional details please see this Wiki article:

  https://wiki.postgresql.org/wiki/Meson_for_patch_authors

For more information on the meson build system for PostgreSQL see:

  https://wiki.postgresql.org/wiki/Meson


Regards

Ian Barwick




Re: logical replication restrictions

2022-11-15 Thread Ian Lawrence Barwick
2022年11月14日(月) 10:09 Takamichi Osumi (Fujitsu) :
>
> On Tuesday, November 8, 2022 2:27 PM Kuroda, Hayato/黒田 隼人 
>  wrote:
> > If you could not allocate a time to discuss this problem because of other
> > important tasks or events, we would like to take over the thread and modify
> > your patch.
> >
> > We've planned that we will start to address comments and reported bugs if
> > you would not respond by the end of this week.
> Hi,
>
>
> I've simply rebased the patch to make it applicable on top of HEAD
> and make the tests pass. Note there are still open pending comments
> and I'm going to start to address those.
>
> I've written Euler as the original author in the commit message
> to note his credit.

Hi

Thanks for the updated patch.

While reviewing the patch backlog, we have determined that this patch adds
one or more TAP tests but has not added the test to the "meson.build" file.

To do this, locate the relevant "meson.build" file for each test and add it
in the 'tests' dictionary, which will look something like this:

  'tap': {
'tests': [
  't/001_basic.pl',
],
  },

For some additional details please see this Wiki article:

  https://wiki.postgresql.org/wiki/Meson_for_patch_authors

For more information on the meson build system for PostgreSQL see:

  https://wiki.postgresql.org/wiki/Meson


Regards

Ian Barwick




Re: Suppressing useless wakeups in walreceiver

2022-11-15 Thread Thomas Munro
On Tue, Nov 15, 2022 at 5:49 PM Nathan Bossart  wrote:
> Another option might be to just force initial reply/feedback messages when
> streaming starts.  The attached patch improves src/test/recovery test
> runtime just like the previous one I posted.

Yeah, looks good in my tests.  I think we just need to make it
conditional so we don't force it if someone has
wal_receiver_status_interval disabled.
From 25f5bd221a683ea93f4c19fdfc03924a9492984a Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Wed, 16 Nov 2022 16:00:07 +1300
Subject: [PATCH] Fix slowdown in TAP tests due to recent walreceiver change.

Commit 05a7be93 changed the timing of the first reply sent by a
walreceiver, which caused some TAP tests to wait in wait_for_catchup for
~10 seconds (wal_receiver_status_interval).  Fix by sending an initial
reply message immediately, and do the same for HS for consistency.

Author: Nathan Bossart 
Discussion: https://postgr.es/m/742545.1668377284%40sss.pgh.pa.us
---
 src/backend/replication/walreceiver.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 8bd2ba37dd..87ee6b948c 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -433,6 +433,11 @@ WalReceiverMain(void)
 			for (int i = 0; i < NUM_WALRCV_WAKEUPS; ++i)
 WalRcvComputeNextWakeup(i, now);
 
+			/* Send initial reply/feedback messages. */
+			if (wal_receiver_status_interval > 0)
+XLogWalRcvSendReply(true, false);
+			XLogWalRcvSendHSFeedback(true);
+
 			/* Loop until end-of-streaming or error */
 			for (;;)
 			{
-- 
2.35.1



Re: closing file in adjust_data_dir

2022-11-15 Thread Ted Yu
On Tue, Nov 15, 2022 at 7:26 PM Japin Li  wrote:

>
> On Wed, 16 Nov 2022 at 11:15, Ted Yu  wrote:
> > On Tue, Nov 15, 2022 at 7:12 PM Japin Li  wrote:
> >> After some rethinking, I find the original code does not have problems.
> >>
> >> If fd is NULL or fgets() returns NULL, the process exits.  Otherwise, we
> >> call
> >> pclose() to close fd.  The code isn't straightforward, however, it is
> >> correct.
>
Hi,
Please take a look at the following:

https://en.cppreference.com/w/c/io/fgets

Quote: If the failure has been caused by some other error, sets the *error*
indicator (see ferror()) on stream. The contents of the array pointed to by
str are indeterminate (it may not even be null-terminated).

I think we shouldn't assume that the fd doesn't need to be closed when NULL
is returned from fgets().

Cheers


Re: closing file in adjust_data_dir

2022-11-15 Thread Japin Li


On Wed, 16 Nov 2022 at 11:15, Ted Yu  wrote:
> On Tue, Nov 15, 2022 at 7:12 PM Japin Li  wrote:
>> After some rethinking, I find the original code does not have problems.
>>
>> If fd is NULL or fgets() returns NULL, the process exits.  Otherwise, we
>> call
>> pclose() to close fd.  The code isn't straightforward, however, it is
>> correct.
>>
>>
>>
>> Please read this sentence from my first post:
>
> If the fgets() call doesn't return NULL, the pclose() would be skipped.

If fgets() returns non-NULL, it means the second condition is false, and
it will check the third condition, which calls pclose(), so pclose()
cannot be skipped, right?

-- 
Regrads,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.




Re: libpq compression (part 2)

2022-11-15 Thread Justin Pryzby
On Mon, Nov 14, 2022 at 07:44:24PM -0800, Andrey Borodin wrote:
> patchset needs a heavy rebase. If no one shows up to fix it I'll do

Despite what its git timestamp says, this is based on the most recent
patch from January, which I've had floating around since then.  It
needed to be rebased over at least:

  - guc_tables patch;
  - build and test with meson;
  - doc/

Some of my changes are separate so you can see what I've done.
check_libpq_compression() is in the wrong place, but I couldn't
immediately see where else to put it, since src/common can't include the
backend's guc headers.

Some of the makefile changes seem unnecessary (now?), and my meson
changes don't seem quite right, either.

There's no reason for Zstd to be a separate patch anymore.

It should be updated to parse compression level and options using the
infrastructure introduced for basebackup.

And address the architectural issue from 2 years ago:
https://www.postgresql.org/message-id/20220118043919.GA23027%40telsasoft.com

The global variable PqStream should be moved into some libpq structure
(Port?) and handled within secure_read().  And pqsecure_read shouldn't
be passed as a function pointer/callback.  And verify it passes tests
with all supported compression algorithms and connects to old servers.

-- 
Justin
From 6a74b2af0131499a3afa5132bc6dac08eb1884f6 Mon Sep 17 00:00:00 2001
From: usernamedt 
Date: Fri, 14 Jan 2022 01:38:55 +0500
Subject: [PATCH 1/4] Implement libpq compression

---
 doc/src/sgml/config.sgml  |   21 +
 doc/src/sgml/libpq.sgml   |   37 +
 doc/src/sgml/protocol.sgml|  158 +++
 src/Makefile.global.in|1 +
 src/backend/Makefile  |8 +
 src/backend/catalog/system_views.sql  |9 +
 src/backend/libpq/pqcomm.c|  277 -
 src/backend/postmaster/postmaster.c   |   10 +
 .../libpqwalreceiver/libpqwalreceiver.c   |   13 +-
 src/backend/utils/activity/backend_status.c   |   29 +
 src/backend/utils/adt/pgstatfuncs.c   |   50 +-
 src/backend/utils/misc/guc_funcs.c|   21 +
 src/backend/utils/misc/guc_tables.c   |   10 +
 src/bin/pgbench/pgbench.c |   17 +-
 src/bin/psql/command.c|   17 +
 src/common/Makefile   |4 +-
 src/common/z_stream.c |  663 +++
 src/common/zpq_stream.c   | 1022 +
 src/include/catalog/pg_proc.dat   |   18 +-
 src/include/common/z_stream.h |  109 ++
 src/include/common/zpq_stream.h   |  120 ++
 src/include/libpq/libpq-be.h  |3 +
 src/include/libpq/libpq.h |1 +
 src/include/libpq/pqcomm.h|3 +
 src/include/utils/backend_status.h|7 +
 src/interfaces/libpq/Makefile |   14 +
 src/interfaces/libpq/exports.txt  |2 +
 src/interfaces/libpq/fe-connect.c |  129 ++-
 src/interfaces/libpq/fe-exec.c|   10 +-
 src/interfaces/libpq/fe-misc.c|   92 +-
 src/interfaces/libpq/fe-protocol3.c   |   71 +-
 src/interfaces/libpq/libpq-fe.h   |4 +
 src/interfaces/libpq/libpq-int.h  |   14 +
 src/test/regress/expected/rules.out   |   14 +-
 src/tools/msvc/Mkvcbuild.pm   |2 +-
 35 files changed, 2884 insertions(+), 96 deletions(-)
 create mode 100644 src/common/z_stream.c
 create mode 100644 src/common/zpq_stream.c
 create mode 100644 src/include/common/z_stream.h
 create mode 100644 src/include/common/zpq_stream.h

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index bd50ea8e480..de4d9532cc5 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1045,6 +1045,27 @@ include_dir 'conf.d'
   
  
 
+ 
+  libpq_compression (string)
+  
+   libpq_compression configuration parameter
+  
+  
+  
+   
This parameter controls the client-server traffic compression methods that are available.
It allows compression requests to be rejected even if compression is supported by the server (for example, due to security or CPU-consumption concerns).
The default is on, which means that all supported compression methods are allowed.
For more precise control, a list of the allowed compression methods can be specified.
For example, to allow only lz4 and zlib, set this parameter to lz4,zlib.
Also, the maximum allowed compression level can be specified for each method; e.g. the setting lz4:1,zlib:2 sets the
maximum compression level for lz4 to 1 and for zlib to 2.
If a client requests compression with a higher compression level, the maximum allowed one is used instead.
The default (and recommended) maximum compression level for each algorithm is 1.
+   
+  

Re: closing file in adjust_data_dir

2022-11-15 Thread Ted Yu
On Tue, Nov 15, 2022 at 7:12 PM Japin Li  wrote:

>
> On Wed, 16 Nov 2022 at 10:52, Ted Yu  wrote:
> > On Tue, Nov 15, 2022 at 6:35 PM Japin Li  wrote:
> >>
> >> fd = popen(cmd, "r");
> >> -   if (fd == NULL || fgets(filename, sizeof(filename), fd) == NULL
> ||
> >> pclose(fd) != 0)
> >> +   if (fd == NULL || fgets(filename, sizeof(filename), fd) == NULL)
> >> {
> >> +   pclose(fd);
> >> write_stderr(_("%s: could not determine the data
> directory
> >> using command \"%s\"\n"), progname, cmd);
> >> exit(1);
> >> }
> >>
> >> Here, a segfault may occur if fd is NULL.  I think we can remove
> pclose()
> >> safely since the process will exit.
> >>
> >
> > That means we're going back to v1 of the patch.
> >
>
> After some rethinking, I find the original code does not have problems.
>
> If fd is NULL or fgets() returns NULL, the process exits.  Otherwise, we
> call
> pclose() to close fd.  The code isn't straightforward; however, it is
> correct.
>
>
>
> Please read this sentence from my first post:

If the fgets() call doesn't return NULL, the pclose() would be skipped.


Re: closing file in adjust_data_dir

2022-11-15 Thread Japin Li


On Wed, 16 Nov 2022 at 10:52, Ted Yu  wrote:
> On Tue, Nov 15, 2022 at 6:35 PM Japin Li  wrote:
>>
>> fd = popen(cmd, "r");
>> -   if (fd == NULL || fgets(filename, sizeof(filename), fd) == NULL ||
>> pclose(fd) != 0)
>> +   if (fd == NULL || fgets(filename, sizeof(filename), fd) == NULL)
>> {
>> +   pclose(fd);
>> write_stderr(_("%s: could not determine the data directory
>> using command \"%s\"\n"), progname, cmd);
>> exit(1);
>> }
>>
>> Here, a segfault may occur if fd is NULL.  I think we can remove pclose()
>> safely since the process will exit.
>>
>
> That means we're going back to v1 of the patch.
>

After some rethinking, I find the original code does not have problems.

If fd is NULL or fgets() returns NULL, the process exits.  Otherwise, we call
pclose() to close fd.  The code isn't straightforward; however, it is correct.



-- 
Regards,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.




Re: SUBTRANS: Minimizing calls to SubTransSetParent()

2022-11-15 Thread Simon Riggs
On Wed, 16 Nov 2022 at 00:09, Tom Lane  wrote:
>
> Simon Riggs  writes:
> > On Tue, 15 Nov 2022 at 21:03, Tom Lane  wrote:
> >> The subxidStatus.overflowed check quoted above has a similar sort
> >> of myopia: it's checking whether our current transaction has
> >> already suboverflowed.  But (a) that doesn't prove it won't suboverflow
> >> later, and (b) the relevant logic in XidInMVCCSnapshot needs to run
> >> SubTransGetTopmostTransaction if *any* proc in the snapshot has
> >> suboverflowed.
>
> > Not the way it is coded now.
>
> > First, we search the subxid cache in snapshot->subxip.
> > Then, and only if the snapshot overflowed (i.e. ANY xact overflowed),
> > do we check subtrans.
>
> No, that's not what XidInMVCCSnapshot does. If snapshot->suboverflowed
> is set (ie, somebody somewhere/somewhen overflowed), then it does
> SubTransGetTopmostTransaction and searches only the xips with the result.
> This behavior requires that all live subxids be correctly mapped by
> SubTransGetTopmostTransaction, or we'll draw false conclusions.

Your comments are correct with regard to the existing coding, but not to the
patch, which is coded as described and does not suffer those issues.
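To make the distinction concrete, here is a heavily simplified standalone sketch of the existing lookup order Tom describes.  This is not PostgreSQL code: SnapshotSketch and parent_of() are invented stand-ins, with parent_of() playing the role of SubTransGetTopmostTransaction() under a toy mapping.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

typedef struct SnapshotSketch
{
    bool          suboverflowed;  /* did ANY proc overflow its subxid cache? */
    TransactionId xip[4];         /* top-level xids */
    int           xcnt;
    TransactionId subxip[4];      /* subxids, valid only if !suboverflowed */
    int           subxcnt;
} SnapshotSketch;

/* Stand-in for SubTransGetTopmostTransaction(): toy mapping in which
 * subxid 101 belongs to top-level xid 100. */
static TransactionId parent_of(TransactionId xid)
{
    return (xid == 101) ? 100 : xid;
}

/* Existing order of operations: with any overflow, map the xid through
 * subtrans first and search only the top-level xip array; without
 * overflow, the subxip cache is searched directly. */
static bool xid_in_snapshot(TransactionId xid, const SnapshotSketch *snap)
{
    int i;

    if (snap->suboverflowed)
        xid = parent_of(xid);
    else
    {
        for (i = 0; i < snap->subxcnt; i++)
            if (snap->subxip[i] == xid)
                return true;
    }
    for (i = 0; i < snap->xcnt; i++)
        if (snap->xip[i] == xid)
            return true;
    return false;
}
```

The reordering described for the patch would instead search the subxip cache first and consult subtrans only afterwards, when the snapshot overflowed.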


> We could perhaps make it do what you suggest,

Already done in the patch since v5.


> but that would require
> a complete performance analysis to make sure we're not giving up
> more than we would gain.

I agree that a full performance analysis is sensible and an objective
analysis has been performed by Julien Tachoires. This is described
here, along with other explanations:
https://docs.google.com/presentation/d/1A7Ar8_LM5EdC2OHL_j3U9J-QwjMiGw9mmXeBLJOmFlg/edit?usp=sharing

It is important to understand the context here: there is already a
well documented LOSS of performance with the current coding. The patch
alleviates that, and I have not been able to find a performance case
where there is any negative impact.

Further tests welcome.


> Also, both GetSnapshotData and CopySnapshot assume that the subxips
> array is not used if suboverflowed is set, and don't bother
> (continuing to) populate it.  So we would need code changes and
> additional cycles in those areas too.

Already done in the patch since v5.

Any additional cycles apply only to the case of snapshot overflow,
which currently performs very badly.


> I'm not sure about your other claims, but I'm pretty sure this one
> point is enough to kill the patch.

Then please look again because there are misunderstandings above.

-- 
Simon Riggs    http://www.EnterpriseDB.com/




Re: closing file in adjust_data_dir

2022-11-15 Thread Ted Yu
On Tue, Nov 15, 2022 at 6:35 PM Japin Li  wrote:

>
> On Wed, 16 Nov 2022 at 10:06, Ted Yu  wrote:
> >> Hi,
> > That check is a few line above:
> >
> > +   if (fd == NULL || fgets(filename, sizeof(filename), fd) == NULL)
> > {
> >
> > Cheers
>
> Thanks for the explanation.  Comment on v2 patch.
>
> fd = popen(cmd, "r");
> -   if (fd == NULL || fgets(filename, sizeof(filename), fd) == NULL ||
> pclose(fd) != 0)
> +   if (fd == NULL || fgets(filename, sizeof(filename), fd) == NULL)
> {
> +   pclose(fd);
> write_stderr(_("%s: could not determine the data directory
> using command \"%s\"\n"), progname, cmd);
> exit(1);
> }
>
> Here, a segfault may occur if fd is NULL.  I think we can remove pclose()
> safely since the process will exit.
>
> --
> Regards,
> Japin Li.
> ChengDu WenWu Information Technology Co.,Ltd.
>

That means we're going back to v1 of the patch.

Cheers


Re: Allow file inclusion in pg_hba and pg_ident files

2022-11-15 Thread Julien Rouhaud
On Tue, Nov 15, 2022 at 08:46:55AM +0900, Michael Paquier wrote:
> On Mon, Nov 14, 2022 at 03:47:27PM +0800, Julien Rouhaud wrote:
> >
> > If you have an include_dir directive and multiple files have wrong 
> > permission
> > (or maybe broken symlink or something like that), you will get multiple 
> > errors
> > when trying to process that single directive.  I think it's friendlier to
> > report as much detail as we can, so users can make sure they fix everything
> > rather than iterating over the first error.  That's especially helpful if 
> > the
> > fix is done in some external tooling (puppet or whatever) rather than 
> > directly
> > on the server.
>
> Hmm, Okay.  Well, this would include only errors on I/O or permission
> problems for existing files..  I have not seen deployments that have
> dozens of sub-files for GUCs with an included dir, but perhaps I lack
> of experience on user histories.

While being the same inclusion infrastructure, it's likely that people will
have different usage.  I'm assuming that for GUCs the main usage is to have
your automation tool put one of your template conf for instance (small_vm.conf,
big_vm.conf or something like that) in an included directory, so you don't
really need a lot of them.  You can also rely on ALTER SYSTEM to avoid manually
handling configuration files entirely.

For authentication there are probably very different patterns.  The use case I
had when writing this patch is some complex application that relies on many
services, each service having dedicated role and authentication, and services
can be enabled or disabled dynamically pretty much anytime.  I had to write
code to merge possibly new entries with existing pg_hba/pg_ident configuration
files.  With this feature it would be much easier (and more robust) to simply have a
main pg_hba/pg_ident that includes some directory, and have enabling or disabling
a service simply create or remove a dedicated file for that service; we would
then usually have at least a dozen files there.

I'm assuming people doing multi-tenant can also have similar usage.
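A sketch of the layout described above, with directive names as proposed in the patch (treat the file names and the hba entry as illustrative only):

```
# pg_hba.conf (main file)
include_dir conf.d/hba

# conf.d/hba/service_foo.conf, created when the service is enabled
# and removed when it is disabled
host  foo_db  foo_role  10.0.0.0/24  scram-sha-256
```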

> > I think that the problem is that we have the same interface for processing 
> > the
> > files on a startup/reload and for filling the views, so if we want the 
> > views to
> > be helpful and report all errors we have to also allow a bogus "include" to
> > continue in the reload case too.  The same problem doesn't exists for GUCs, 
> > so
> > a slightly different behavior there might be acceptable.
>
> Well, there is as well the point that we don't have yet a view for
> GUCs that does the equivalent of HBA and ident,

Yes that's what I meant by "the same problem doesn't exists for GUCs".

> but that does not
> mean, it seems to me, that there should be an inconsistency in the way
> we process those commands because one has implemented a feature but
> not the other.  On the contrary, I'd rather try to make them
> consistent.

You mean stopping at the first error, even if it's only for the view reporting?
That will make the reload consistent, but the view will be a bit useless then.

> As things are in the patch, the only difference between
> "include_if_exists" and "include" is that the latter would report some
> information if the file goes missing, the former generates a LOG entry
> about the file being skipped without something in the system view.
> Now, wouldn't it be useful for the end-user to report that a file is
> skipped as an effect of "include_if_exists" in the system views?  If
> not, then what's the point of having this clause to begin with?

I don't really get the argument "the proposed monitoring view doesn't give this
information, so the feature isn't needed".  Also, unless I'm missing something
the other difference between "include" and "include_if_exists" is that the
first one will refuse to reload the conf (or start the server) if the file is
missing, while the other one will?

> My
> opinion is that both clauses are useful, still on the ground of
> consistency both clauses should work the same as for GUCs.  Both
> should report something in err_msg even if that's just about a file
> being skipped, though I agree that this could be considered confusing
> as well for an if_exists clause (does not look like a big deal to me
> based on the debuggability gain).

It would be nice to have the information that an "include_if_exists" file
didn't exist, but having a log-level message in the "error" column is a clear
POLA violation.  People will probably just do something like

SELECT file_name, line_number, error FROM pg_hba_file_rules WHERE error IS NOT
NULL;

and report an error if any row is found.  Having to parse the error field to
know if that's really an error or not is going to be a huge foot-gun.  Maybe we
could indeed report the problem in err_msg but for include_if_exists display it
in some other column of the view?




Re: Slow standby snapshot

2022-11-15 Thread Simon Riggs
On Wed, 16 Nov 2022 at 00:15, Tom Lane  wrote:
>
> Andres Freund  writes:
> > On 2022-11-15 23:14:42 +, Simon Riggs wrote:
> >> Hence more frequent compression is effective at reducing the overhead.
> >> But too frequent compression slows down the startup process, which
> >> can't then keep up.
> >> So we're just looking for an optimal frequency of compression for any
> >> given workload.
>
> > What about making the behaviour adaptive based on the amount of wasted 
> > effort
> > during those two operations, rather than just a hardcoded "emptiness" 
> > factor?
>
> Not quite sure how we could do that, given that those things aren't even
> happening in the same process.  But yeah, it does feel like the proposed
> approach is only going to be optimal over a small range of conditions.

I have not been able to think of a simple way to autotune it.

> > I don't think the xids % KAX_COMPRESS_FREQUENCY == 0 filter is a good idea -
> > if you have a workload with plenty subxids you might end up never 
> > compressing
> > because xids divisible by KAX_COMPRESS_FREQUENCY will end up as a subxid
> > most/all of the time.
>
> Yeah, I didn't think that was too safe either.

> It'd be more reliable
> to use a static counter to skip all but every N'th compress attempt
> (something we could do inside KnownAssignedXidsCompress itself, instead
> of adding warts at the call sites).

I was thinking exactly that myself, for the reason of keeping it all
inside KnownAssignedXidsCompress().
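A minimal standalone sketch of that idea follows; the names and the value of COMPRESS_EVERY_N are illustrative, not taken from the patch:

```c
#include <assert.h>
#include <stdbool.h>

#define COMPRESS_EVERY_N 16     /* illustrative skip factor */

static int kax_compress_runs;   /* counts actual compactions, for tracing */

/* Throttle inside the function itself: a static call counter skips all
 * but every N'th attempt, so call sites need no xid % N filter (which,
 * as noted above, can starve under subxid-heavy workloads). */
static void known_assigned_xids_compress(bool force)
{
    static int calls = 0;

    if (!force && (++calls % COMPRESS_EVERY_N) != 0)
        return;
    kax_compress_runs++;        /* actual compaction would happen here */
}
```

Because the counter is driven by call attempts rather than xid values, it cannot be starved by a workload whose xids all happen to land on subxids.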

--
Simon Riggs    http://www.EnterpriseDB.com/




Re: closing file in adjust_data_dir

2022-11-15 Thread Japin Li


On Wed, 16 Nov 2022 at 10:06, Ted Yu  wrote:
>> Hi,
> That check is a few line above:
>
> +   if (fd == NULL || fgets(filename, sizeof(filename), fd) == NULL)
> {
>
> Cheers

Thanks for the explanation.  Comment on v2 patch.

fd = popen(cmd, "r");
-   if (fd == NULL || fgets(filename, sizeof(filename), fd) == NULL || 
pclose(fd) != 0)
+   if (fd == NULL || fgets(filename, sizeof(filename), fd) == NULL)
{
+   pclose(fd);
write_stderr(_("%s: could not determine the data directory 
using command \"%s\"\n"), progname, cmd);
exit(1);
}

Here, a segfault may occur if fd is NULL.  I think we can remove pclose()
safely since the process will exit.

-- 
Regards,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.




Re: List of Bitmapset (was Re: ExecRTCheckPerms() and many prunable partitions)

2022-11-15 Thread Amit Langote
On Wed, Nov 16, 2022 at 3:25 AM Tom Lane  wrote:
> Amit Langote  writes:
> >  On Mon, Nov 14, 2022 at 11:57 PM Tom Lane  wrote:
> >> + * The new member is identified by the zero-based index of the List
> >> + * element it should go into, and the bit number to be set therein.
>
> > The comment sounds a bit ambiguous, especially the ", and the bit
> > number to be set therein." part.  If you meant to describe the
> > arguments, how about mentioning their names too, as in:
>
> Done that way in the patch I just posted.

Thanks.

> >> +   /* forboth will stop at the end of the shorter list, which is fine */
>
> > Isn't this comment unnecessary given that the while loop makes both
> > lists be the same length?
>
> No, the while loop ensures that a is at least as long as b.
> It could have started out longer, though.

Oops, I missed that case.

The latest version looks pretty good to me.
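As an aside, the addressing scheme being documented (a member identified by the zero-based list index plus the bit number to set therein, with the list grown as needed) can be illustrated with a toy array-based model; this is not the actual List/Bitmapset API:

```c
#include <assert.h>
#include <stdint.h>

#define MAX_ELEMS 8

/* Toy model: a "list of bitmapsets" as a fixed array of 64-bit words. */
typedef struct
{
    uint64_t word[MAX_ELEMS];
    int      nelems;            /* current list length */
} BitmapsetList;

/* Set bit number 'bit' in the 'elem'-th set, growing the list with
 * empty sets first, mirroring the while loop that makes list a at
 * least as long as needed before forboth walks the pair. */
static void list_add_member(BitmapsetList *list, int elem, int bit)
{
    while (list->nelems <= elem)
        list->word[list->nelems++] = 0;
    list->word[elem] |= UINT64_C(1) << bit;
}
```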

-- 
Thanks, Amit Langote
EDB: http://www.enterprisedb.com




Re: Documentation for building with meson

2022-11-15 Thread Justin Pryzby
On Wed, Nov 16, 2022 at 10:52:35AM +0900, Ian Lawrence Barwick wrote:
> 2022-10-20 (Thu) 11:43 Justin Pryzby :
> >
> > On Wed, Oct 19, 2022 at 11:35:10AM -0700, samay sharma wrote:
> > > Creating a new thread focussed on adding docs for building Postgres with
> > > meson. This is a spinoff from the original thread [1] and I've attempted 
> > > to
> > > address all the feedback provided there in the attached patch.
> > >
> > > Please let me know your thoughts.
> >
> > It's easier to review rendered documentation.
> > I made a rendered copy available here:
> > https://api.cirrus-ci.com/v1/artifact/task/6084297781149696/html_docs/html_docs/install-meson.html
> 
> For reference, are there any instructions anywhere on how to do this?  It'd be
> useful to be able to provide a link to the latest version of this 
> documentation
> (and also document the process for patch authors in general).

I've submitted patches which would do that for every patch (ideally,
excluding patches that don't touch the docs, although it looks like the
"exclusion" isn't working).
https://commitfest.postgresql.org/40/3709/

The most recent patches on that thread don't include the "docs as
artifacts" patch, but only the preparatory "build docs as a separate
task".  I think the other part is stalled waiting for some updates to
cfbot to allow knowing how many commits are in the patchset.

FYI, you can navigate from the cirrus task's URL to the git commit (and
its parents)

https://cirrus-ci.com/task/6084297781149696 =>
https://github.com/justinpryzby/postgres/commit/7b57f3323fc77e9b04ef2e76976776090eb8b5b5

-- 
Justin




Re: closing file in adjust_data_dir

2022-11-15 Thread Japin Li


On Wed, 16 Nov 2022 at 10:02, Japin Li  wrote:
> I think we should check whether fd is NULL or not; otherwise, a segmentation
> fault may occur.
>
> + if (pclose(fd) != 0)
> + {
> + write_stderr(_("%s: could not close the file following command 
> \"%s\"\n"), progname, cmd);
> + exit(1);
> + }

Sorry for the noise, I misunderstand it.

-- 
Regards,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.




Re: closing file in adjust_data_dir

2022-11-15 Thread Ted Yu
On Tue, Nov 15, 2022 at 6:02 PM Japin Li  wrote:

>
> On Wed, 16 Nov 2022 at 02:43, Ted Yu  wrote:
> > Hi,
> > I was looking at the commit:
> >
> > commit 2fe3bdbd691a5d11626308e7d660440be6c210c8
> > Author: Peter Eisentraut 
> > Date:   Tue Nov 15 15:35:37 2022 +0100
> >
> > Check return value of pclose() correctly
> >
> > In src/bin/pg_ctl/pg_ctl.c :
> >
> > if (fd == NULL || fgets(filename, sizeof(filename), fd) == NULL ||
> > pclose(fd) != 0)
> >
> > If the fgets() call doesn't return NULL, the pclose() would be skipped.
> > Since the original pclose() call was removed, wouldn't this lead to fd
> > leaking ?
> >
> > Please see attached patch for my proposal.
> >
> > Cheers
>
> I think we should check whether fd is NULL or not; otherwise, a segmentation
> fault may occur.
>
> +   if (pclose(fd) != 0)
> +   {
> +   write_stderr(_("%s: could not close the file following
> command \"%s\"\n"), progname, cmd);
> +   exit(1);
> +   }
>
> Hi,
That check is a few line above:

+   if (fd == NULL || fgets(filename, sizeof(filename), fd) == NULL)
{

Cheers


Re: hash_xlog_split_allocate_page: failed to acquire cleanup lock

2022-11-15 Thread Amit Kapila
On Mon, Nov 14, 2022 at 11:18 PM Andres Freund  wrote:
>
> On 2022-11-14 16:06:27 +0530, Amit Kapila wrote:
> > Pushed.
>
> Thanks.
>

Please find the attached patch to remove the buffer cleanup check on
the new bucket page. I think we should do this only for HEAD. Do
you have any suggestions or objections on this one?

-- 
With Regards,
Amit Kapila.


v3-0001-Don-t-acquire-cleanup-lock-on-the-new-bucket-page.patch
Description: Binary data


Re: closing file in adjust_data_dir

2022-11-15 Thread Japin Li


On Wed, 16 Nov 2022 at 02:43, Ted Yu  wrote:
> Hi,
> I was looking at the commit:
>
> commit 2fe3bdbd691a5d11626308e7d660440be6c210c8
> Author: Peter Eisentraut 
> Date:   Tue Nov 15 15:35:37 2022 +0100
>
> Check return value of pclose() correctly
>
> In src/bin/pg_ctl/pg_ctl.c :
>
> if (fd == NULL || fgets(filename, sizeof(filename), fd) == NULL ||
> pclose(fd) != 0)
>
> If the fgets() call doesn't return NULL, the pclose() would be skipped.
> Since the original pclose() call was removed, wouldn't this lead to fd
> leaking ?
>
> Please see attached patch for my proposal.
>
> Cheers

I think we should check whether fd is NULL or not; otherwise, a segmentation
fault may occur.

+   if (pclose(fd) != 0)
+   {
+   write_stderr(_("%s: could not close the file following command 
\"%s\"\n"), progname, cmd);
+   exit(1);
+   }

-- 
Regards,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.




Re: Assertion failure in SnapBuildInitialSnapshot()

2022-11-15 Thread Andres Freund
Hi,

On 2022-11-15 16:20:00 +0530, Amit Kapila wrote:
> On Tue, Nov 15, 2022 at 8:08 AM Andres Freund  wrote:
> > nor do we enforce in an obvious place that we
> > don't already hold a snapshot.
> >
> 
> We have a check for (FirstXactSnapshot == NULL) in
> RestoreTransactionSnapshot->SetTransactionSnapshot. Won't that be
> sufficient?

I don't think that would catch, e.g., a catalog snapshot being held, yet that'd
still be bad. And I think checking in SetTransactionSnapshot() is too late;
we've already overwritten MyProc->xmin by that point.


On 2022-11-15 17:21:44 +0530, Amit Kapila wrote:
> > One thing I noticed just now is that we don't assert
> > builder->building_full_snapshot==true. I think we should? That didn't use to
> > be an option, but now it is... It doesn't look to me like that's the issue,
> > but ...
> >
> 
> Agreed.
> 
> The attached patch contains both changes. It seems to me this issue
> can happen, if somehow, either slot's effective_xmin increased after
> we assign its initial value in CreateInitDecodingContext() or somehow
> its value is InvalidTransactionId when we have invoked
> SnapBuildInitialSnapshot(). The other possibility is that the
> initial_xmin_horizon check in SnapBuildFindSnapshot() doesn't insulate
> us from assigning builder->xmin value older than initial_xmin_horizon.
> I am not able to see if any of this can be true.

Yea, I don't immediately see anything either. Given the discussion in
https://www.postgresql.org/message-id/Yz2hivgyjS1RfMKs%40depesz.com I am
starting to wonder if we've introduced a race in the slot machinery.


> diff --git a/src/backend/replication/logical/snapbuild.c 
> b/src/backend/replication/logical/snapbuild.c
> index 5006a5c464..e85c75e0e6 100644
> --- a/src/backend/replication/logical/snapbuild.c
> +++ b/src/backend/replication/logical/snapbuild.c
> @@ -566,11 +566,13 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
>  {
>   Snapshotsnap;
>   TransactionId xid;
> + TransactionId safeXid;
>   TransactionId *newxip;
>   int newxcnt = 0;
>  
>   Assert(!FirstSnapshotSet);
>   Assert(XactIsoLevel == XACT_REPEATABLE_READ);
> + Assert(builder->building_full_snapshot);
>  
>   if (builder->state != SNAPBUILD_CONSISTENT)
>   elog(ERROR, "cannot build an initial slot snapshot before 
> reaching a consistent state");
> @@ -589,17 +591,13 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
>* mechanism. Due to that we can do this without locks, we're only
>* changing our own value.
>*/

Perhaps add something like "Creating a snapshot is expensive and an unenforced
xmin horizon would have bad consequences, therefore always double-check that
the horizon is enforced"?


> -#ifdef USE_ASSERT_CHECKING
> - {
> - TransactionId safeXid;
> -
> - LWLockAcquire(ProcArrayLock, LW_SHARED);
> - safeXid = GetOldestSafeDecodingTransactionId(false);
> - LWLockRelease(ProcArrayLock);
> + LWLockAcquire(ProcArrayLock, LW_SHARED);
> + safeXid = GetOldestSafeDecodingTransactionId(false);
> + LWLockRelease(ProcArrayLock);
>  
> - Assert(TransactionIdPrecedesOrEquals(safeXid, snap->xmin));
> - }
> -#endif
> + if (TransactionIdFollows(safeXid, snap->xmin))
> + elog(ERROR, "cannot build an initial slot snapshot when oldest 
> safe xid %u follows snapshot's xmin %u",
> +  safeXid, snap->xmin);
>  
>   MyProc->xmin = snap->xmin;
>  

s/when/as/

Greetings,

Andres Freund




Re: Documentation for building with meson

2022-11-15 Thread Ian Lawrence Barwick
2022-10-20 (Thu) 11:43 Justin Pryzby :
>
> On Wed, Oct 19, 2022 at 11:35:10AM -0700, samay sharma wrote:
> > Creating a new thread focussed on adding docs for building Postgres with
> > meson. This is a spinoff from the original thread [1] and I've attempted to
> > address all the feedback provided there in the attached patch.
> >
> > Please let me know your thoughts.
>
> It's easier to review rendered documentation.
> I made a rendered copy available here:
> https://api.cirrus-ci.com/v1/artifact/task/6084297781149696/html_docs/html_docs/install-meson.html

For reference, are there any instructions anywhere on how to do this?  It'd be
useful to be able to provide a link to the latest version of this documentation
(and also document the process for patch authors in general).

Regards

Ian Barwick




Re: ssl tests aren't concurrency safe due to get_free_port()

2022-11-15 Thread Andres Freund
Hi,

On 2022-11-15 15:56:37 -0500, Andrew Dunstan wrote:
> On 2022-11-06 Su 11:30, Andrew Dunstan wrote:
> >
> > One possible addition would be to add removing the reservation files in
> > an END handler. That would be pretty simple.
> >
> >
> 
> 
> Here's a version with that. I suggest we try it out and see if anything
> breaks.

Thanks! I agree it makes sense to go ahead with this.

I'd guess we should test drive this a bit in HEAD but eventually backpatch?


> diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm 
> b/src/test/perl/PostgreSQL/Test/Cluster.pm
> index d80134b26f..85fae32c14 100644
> --- a/src/test/perl/PostgreSQL/Test/Cluster.pm
> +++ b/src/test/perl/PostgreSQL/Test/Cluster.pm

I think this should also update the comment for get_free_port(), since the
race mentioned there is largely addressed by this patch?


> @@ -140,6 +143,27 @@ INIT
>  
>   # Tracking of last port value assigned to accelerate free port lookup.
>   $last_port_assigned = int(rand() * 16384) + 49152;
> +
> + # Set the port lock directory
> +
> + # If we're told to use a directory (e.g. from a buildfarm client)
> + # explicitly, use that
> + $portdir = $ENV{PG_TEST_PORT_DIR};
> + # Otherwise, try to use a directory at the top of the build tree
> + if (! $portdir && $ENV{MESON_BUILD_ROOT})
> + {
> + $portdir = $ENV{MESON_BUILD_ROOT} . '/portlock'
> + }
> + elsif (! $portdir && ($ENV{TESTDATADIR} || "") =~ /\W(src|contrib)\W/p)
> + {
> + my $dir = ${^PREMATCH};
> + $portdir = "$dir/portlock" if $dir;
> + }
> + # As a last resort use a directory under tmp_check
> + $portdir ||= $PostgreSQL::Test::Utils::tmp_check . '/portlock';
> + $portdir =~ s!\\!/!g;
> + # Make sure the directory exists
> + mkpath($portdir) unless -d $portdir;
>  }

Perhaps we should just export a directory in configure instead of this
guessing game?


>  =pod
> @@ -1505,6 +1529,7 @@ sub get_free_port
>   last;
>   }
>   }
> + $found = _reserve_port($port) if $found;
>   }
>   }
>  
> @@ -1535,6 +1560,40 @@ sub can_bind
>   return $ret;
>  }
>  
> +# Internal routine to reserve a port number
> +# Returns 1 if successful, 0 if port is already reserved.
> +sub _reserve_port
> +{
> + my $port = shift;
> + # open in rw mode so we don't have to reopen it and lose the lock
> + my $filename = "$portdir/$port.rsv";
> + sysopen(my $portfile, $filename, O_RDWR|O_CREAT)
> +   || die "opening port file $filename";

Perhaps add $! to the message so e.g. permission denied errors or such are
easier to debug?


> + # take an exclusive lock to avoid concurrent access
> + flock($portfile, LOCK_EX) || die "locking port file $filename";

dito


> + # see if someone else has or had a reservation of this port
> + my $pid = <$portfile>;
> + chomp $pid;
> + if ($pid +0 > 0)
> + {
> + if (kill 0, $pid)
> + {
> + # process exists and is owned by us, so we can't 
> reserve this port
> + close($portfile);
> + return 0;
> + }
> + }
> + # All good, go ahead and reserve the port, first rewind and truncate.
> + # If truncation fails it's not a tragedy, it just might leave some
> + # trailing junk in the file that won't affect us.
> + seek($portfile, 0, SEEK_SET);
> + truncate($portfile, 0);

Perhaps check truncate's return value?


> + print $portfile "$$\n";
> + close($portfile);
> + push(@port_reservation_files, $filename);
> + return 1;
> +}

Perhaps it'd be better to release the file lock explicitly? flock() has this
annoying behaviour of only releasing the lock when the last file descriptor
for a file is closed. We shouldn't end up with dup'd FDs or forks here, but it
seems like it might be more robust to just explicitly release the lock?
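For illustration, here is a C rendering of the reservation scheme with the explicit unlock applied.  File-name handling and the live-pid check are elided; this is a sketch of the locking pattern under discussion, not the Perl patch itself:

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/file.h>
#include <unistd.h>

/* Sketch: reserve a port by writing our pid into a lock file, holding
 * an exclusive flock() only while reading/rewriting the file, and
 * releasing it explicitly with LOCK_UN rather than relying on close()
 * dropping the last descriptor.  Returns 1 on success, 0 on failure. */
static int reserve_port_file(const char *filename)
{
    int fd = open(filename, O_RDWR | O_CREAT, 0644);

    if (fd < 0)
        return 0;
    if (flock(fd, LOCK_EX) != 0)        /* serialize concurrent reservers */
    {
        close(fd);
        return 0;
    }
    /* ... a real version would read any previous pid here and give up
     * if that process is still alive (kill(pid, 0) == 0) ... */
    if (ftruncate(fd, 0) == 0)          /* check truncate's result, per above */
        dprintf(fd, "%d\n", (int) getpid());
    flock(fd, LOCK_UN);                 /* release the lock explicitly */
    close(fd);
    return 1;
}
```

Releasing with LOCK_UN before close() keeps the critical section independent of how many descriptors happen to reference the file.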

Greetings,

Andres Freund




Re: [DOCS] Stats views and functions not in order?

2022-11-15 Thread Peter Smith
On Thu, Nov 10, 2022 at 10:04 AM David G. Johnston
 wrote:
>
...
>> > So ... how do we proceed?
>> >
>>
>> To proceed with the existing patches I need some guidance on exactly
>> which of the changes can be considered improvements versus which ones
>> are maybe just trading one 'random' order for another.
>>
>> How about below?
>>
>> Table 28.1. Dynamic Statistics Views -- Alphabetical order would be a
>> small improvement here, right?
>
>
> The present ordering seems mostly OK, though just like the "Progress" update 
> below the bottom 6 pg_stat_progress_* entries should be alphabetized; but 
> leaving them as a group at the end seems desirable.
>
> Move pg_stat_recovery_prefetch either after subscription or after activity - 
> the replication/received/subscription stuff all seems like it should be 
> grouped together.  As well as the security related ssl/gssapi.
>
>>
>> Table 28.2. Collected Statistics Views -- Leave this one unchanged
>> (per your comments above).
>
>
> I would suggest moving the 3 pg_statio_*_tables rows between the 
> pg_stat_*_tables and the pg_stat_xact_*_tables groups.
>
> Everything pertaining to cluster, database, tables, indexes, functions.  slru 
> and replication slots should likewise shift to the (near) top in the 
> cluster/database grouping.
>
>>
>> Table 28.12 Wait Events of type LWLock -- Seems a clear case of bad
>> merging. Alphabetical order is surely needed here, right?
>
>
> +1 Agreed.
>>
>>
>> Table 28.34 Additional Statistic Functions -- Alphabetical order would
>> be a small improvement here, right?
>
>
> No.  All "reset" items should be grouped at the end like they are.  I don't 
> see an alternative ordering among them that is clearly superior.  Same for 
> the first four.
>
>>
>>
>> Table 28.35 Per-Backend Statistics Functions --  Alphabetical order
>> would be a small improvement here, right?
>>
>
> This one I would rearrange alphabetically.  Or, at least, I have a different 
> opinion of what would make a decent order but it doesn't seem all that 
> clearly better than alphabetical.
>
>>
>> > I'd be inclined to alphabetize by SQL command name, but maybe
>> > leave Base Backup to the end since it's not a SQL command.
>> >
>>
>> Yes, I had previously only looked at the content of section 28.2
>> because I didn't want to get carried away by changing too much until
>> there was some support for doing the first part.
>>
>> Now PSA a separate patch for fixing section "28.4. Progress Reporting"
>> order as suggested.
>>
>
> This seems like a clear win.
>
> David J.

Thanks for the review and table ordering advice. AFAICT I have made
all the changes according to the suggestions.

Each re-ordering was done as a separate patch (so maybe they can be
pushed separately, in case some but not all are OK). PSA.

~~

I was also wondering (but have not yet done) if the content *outside*
the tables should be reordered to match the table 28.1/28.2 order.

e.g. Currently it is not quite the same:

CURRENT
28.2.3. pg_stat_activity
28.2.4. pg_stat_replication
28.2.5. pg_stat_replication_slots
28.2.6. pg_stat_wal_receiver
28.2.7. pg_stat_recovery_prefetch
28.2.8. pg_stat_subscription
28.2.9. pg_stat_subscription_stats
28.2.10. pg_stat_ssl
28.2.11. pg_stat_gssapi

28.2.12. pg_stat_archiver
28.2.13. pg_stat_bgwriter
28.2.14. pg_stat_wal
28.2.15. pg_stat_database
28.2.16. pg_stat_database_conflicts
28.2.17. pg_stat_all_tables
28.2.18. pg_stat_all_indexes
28.2.19. pg_statio_all_tables
28.2.20. pg_statio_all_indexes
28.2.21. pg_statio_all_sequences
28.2.22. pg_stat_user_functions
28.2.23. pg_stat_slru

SUGGESTED
28.2.3. pg_stat_activity
28.2.4. pg_stat_replication
28.2.6. pg_stat_wal_receiver
28.2.7. pg_stat_recovery_prefetch
28.2.8. pg_stat_subscription
28.2.10. pg_stat_ssl
28.2.11. pg_stat_gssapi

28.2.12. pg_stat_archiver
28.2.13. pg_stat_bgwriter
28.2.14. pg_stat_wal
28.2.15. pg_stat_database
28.2.16. pg_stat_database_conflicts
28.2.23. pg_stat_slru
28.2.5. pg_stat_replication_slots
28.2.17. pg_stat_all_tables
28.2.18. pg_stat_all_indexes
28.2.19. pg_statio_all_tables
28.2.20. pg_statio_all_indexes
28.2.21. pg_statio_all_sequences
28.2.22. pg_stat_user_functions
28.2.9. pg_stat_subscription_stats

Thoughts?

--
Kind Regards,
Peter Smith.
Fujitsu Australia


v5-0004-Re-order-Table-28.35-Per-Backend-Statistics-Funct.patch
Description: Binary data


v5-0002-Re-order-Table-28.2-Collected-Statistics-Views.patch
Description: Binary data


v5-0001-Re-order-sections-of-28.4.-Progress-Reporting.patch
Description: Binary data


v5-0003-Re-order-Table-28.12-Wait-Events-of-type-LWLock.patch
Description: Binary data


Re: Standardizing how pg_waldump presents recovery conflict XID cutoffs

2022-11-15 Thread Andres Freund
Hi,

On 2022-11-15 13:54:24 -0800, Peter Geoghegan wrote:
> On Tue, Nov 15, 2022 at 12:29 PM Andres Freund  wrote:
> > ... I strongly dislike latestCommittedXid. That seems at least as misleading
> > as latestRemovedXid and has the danger of confusion with latestCompletedXid
> > as you mention.
> 
> > How about latestAffectedXid?
> 
> I get why you don't care for latestCommittedXid, of course, but the
> name does have some advantages. Namely:
> 
> 1. Most conflicts come from PRUNE records (less often index deletion
> records) where the XID is some heap tuple's xmax, a
> committed-to-everybody XID on the primary (at the point of the
> original execution of the prune). It makes sense to emphasize the idea
> that snapshots running on a replica need to agree that this XID is
> definitely committed -- we need to kill any snapshots that don't
> definitely agree that this one particular XID is committed by now.

I don't agree that it makes sense there - to me it sounds like the record is
just carrying the globally latest committed xid rather than something just
describing the record.

I also just don't think "agreeing that a particular XID is committed" is a
good description of latestRemovedXID, there's just too many ways that
"agreeing xid has committed" can be understood. To me it's not obvious that
it's about mvcc snapshots.

If we want to focus on the mvcc effects we could just go for something like
snapshotConflictHorizon or such.



> Perhaps something like "mustBeCommittedCutoff" would work better? What
> do you think of that? The emphasis on how things need to work on the
> REDO side seems useful.

I don't think "committed" should be part of the name.

Greetings,

Andres Freund




Re: Add test module for Custom WAL Resource Manager feature

2022-11-15 Thread Michael Paquier
On Wed, Nov 16, 2022 at 10:26:32AM +0900, Michael Paquier wrote:
> Not many buildfarm members test 32b builds, but lapwing does.

Well, it didn't take long:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lapwing&dt=2022-11-16%2000%3A40%3A11
--
Michael


signature.asc
Description: PGP signature


Re: Add test module for Custom WAL Resource Manager feature

2022-11-15 Thread Michael Paquier
On Tue, Nov 15, 2022 at 04:29:08PM -0800, Jeff Davis wrote:
> Committed with some significant revisions (ae168c794f):
> 
>   * changed to insert a deterministic message, rather than a random
> one, which allows more complete testing
>   * fixed a couple bugs
>   * used a static initializer for the RmgrData rather than memset,
> which shows a better example
> 
> I also separately committed a patch to mark the argument of
> RegisterCustomRmgr as "const".

This is causing the CI job to fail for 32-bit builds.  Here is one
example in my own repository for what looks like an alignment issue: 
https://github.com/michaelpq/postgres/runs/9514121172

[01:17:23.152] ok 1 - custom WAL resource manager has successfully registered 
with the server
[01:17:23.152] not ok 2 - custom WAL resource manager has successfully written 
a WAL record
[01:17:23.152] 1..2
[01:17:23.152] # test failed
[01:17:23.152] --- stderr ---
[01:17:23.152] #   Failed test 'custom WAL resource manager has successfully 
written a WAL record'
[01:17:23.152] #   at 
/tmp/cirrus-ci-build/src/test/modules/test_custom_rmgrs/t/001_basic.pl line 56.
[01:17:23.152] #  got: 
'0/151E088|test_custom_rmgrs|TEST_CUSTOM_RMGRS_MESSAGE|40|14|0|payload (10 
bytes): payload123'
[01:17:23.152] # expected: 
'0/151E088|test_custom_rmgrs|TEST_CUSTOM_RMGRS_MESSAGE|44|18|0|payload (10 
bytes): payload123'
[01:17:23.152] # Looks like you failed 1 test of 2.
[01:17:23.152] 

Not many buildfarm members test 32b builds, but lapwing does.
--
Michael


signature.asc
Description: PGP signature


Re: Slow standby snapshot

2022-11-15 Thread Andres Freund
Hi,

On 2022-11-15 19:46:26 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > To me it sounds like known_assigned_xids_lck is pointless and the talk about
> > memory barriers a red herring, since all modifications have to happen with
> > ProcArrayLock held exclusively and all reads with ProcArrayLock held in share
> > mode. It can't be legal to modify head/tail or the contents of the array
> > outside of that. And lwlocks provide sufficient barrier semantics.
> 
> No ... RecordKnownAssignedTransactionIds calls KnownAssignedXidsAdd
> with exclusive_lock = false, and in the typical case that will not
> acquire ProcArrayLock at all.  Since there's only one writer, that
> seems safe enough, and I believe the commentary's claim that we
> really just need to be sure the head-pointer update is seen
> after the array updates.

Oh, right. I focused too much on the part of the comment quoted in your email.

I still think it's misleading for the comment to say that the tail can be
modified with the spinlock - I don't see how that'd ever be correct. Nor could
head be safely decreased with just the spinlock.


Too bad, that seems to make the idea of compressing in other backends a
non-starter unfortunately :(. Although - are we really gaining that much by
avoiding ProcArrayLock in the RecordKnownAssignedTransactionIds() case? It
only happens when latestObservedXid is increased, and we'll remove them at
commit with the exclusive lock held. Afaict that's the only KAX access that
doesn't also require ProcArrayLock?

Greetings,

Andres Freund




Re: pg_basebackup's --gzip switch misbehaves

2022-11-15 Thread Michael Paquier
On Tue, Nov 15, 2022 at 11:09:54AM +0100, Daniel Gustafsson wrote:
>> On 15 Nov 2022, at 00:58, Michael Paquier  wrote:
>> 
>> On Mon, Nov 14, 2022 at 03:27:14PM +0100, Daniel Gustafsson wrote:
>>> Ugh, yes, that's what it should say.
>> 
>> A split sounds fine by me.  On top of what Tom has mentioned, I have
>> spotted two small-ish things.
>> 
>> -This module is available from CPAN or an operating system package.
>> +This module is available from
+<ulink url="https://metacpan.org/release/IPC-Run">CPAN</ulink>
>> +or an operating system package.
>> 
>> It looks like there is a second one in install-windows.sgml.
> 
> Not sure I follow.  IPC::Run is already linked to with a ulink from that page
> (albeit with an empty tag rendering the URL instead).

Ah, I did not notice that there was already a link to that with
IPC::Run.  Anyway, shouldn't CPAN be marked at least as an <acronym>
if we are not going to use a link on it?  acronyms.sgml lists it, just
saying.

> A related nitpick I found though is that metacpan has changed their URL
> structure and these links now 301 redirect.  The attached 0001 fixes that 
> first
> before applying the other part.

WFM.
--
Michael


signature.asc
Description: PGP signature


Re: Unit tests for SLRU

2022-11-15 Thread Michael Paquier
On Tue, Nov 15, 2022 at 02:43:06PM +0400, Pavel Borisov wrote:
> I've looked through the patch again. I agree it looks better and can
> be committed.
> Mark it as RfC now.

Okay, applied, then.  The SQL function names speak for themselves; even
though some of them refer to pages while acting on segments, the
difference is easy enough to see in the code where the segment number
computations are done.
--
Michael


signature.asc
Description: PGP signature


Re: Slow standby snapshot

2022-11-15 Thread Tom Lane
Andres Freund  writes:
> To me it sounds like known_assigned_xids_lck is pointless and the talk about
> memory barriers a red herring, since all modifications have to happen with
> ProcArrayLock held exclusively and all reads with ProcArrayLock held in share
> mode. It can't be legal to modify head/tail or the contents of the array
> outside of that. And lwlocks provide sufficient barrier semantics.

No ... RecordKnownAssignedTransactionIds calls KnownAssignedXidsAdd
with exclusive_lock = false, and in the typical case that will not
acquire ProcArrayLock at all.  Since there's only one writer, that
seems safe enough, and I believe the commentary's claim that we
really just need to be sure the head-pointer update is seen
after the array updates.

regards, tom lane




Re: Slow standby snapshot

2022-11-15 Thread Andres Freund
Hi,

On 2022-11-15 19:15:15 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > On 2022-11-15 23:14:42 +, Simon Riggs wrote:
> >> Hence more frequent compression is effective at reducing the overhead.
> >> But too frequent compression slows down the startup process, which
> >> can't then keep up.
> >> So we're just looking for an optimal frequency of compression for any
> >> given workload.
> 
> > What about making the behaviour adaptive based on the amount of wasted 
> > effort
> > during those two operations, rather than just a hardcoded "emptiness" 
> > factor?
> 
> Not quite sure how we could do that, given that those things aren't even
> happening in the same process.

I'm not certain what the best approach is, but I don't think the
not-the-same-process part is a blocker.


Approach 1:

We could have an atomic variable in ProcArrayStruct that counts the amount of
wasted effort and have processes update it whenever they've wasted a
meaningful amount of effort.  Something like counting the skipped elements in
KnownAssignedXidsGetAndSetXmin in a function local static variable and
updating the shared counter whenever that reaches



Approach 2:

Perform conditional cleanup in non-startup processes - I think that'd actually
be ok, as long as ProcArrayLock is held exclusively.  We could count the amount
of skipped elements in KnownAssignedXidsGetAndSetXmin() in a local variable,
and whenever that gets too high, conditionally acquire ProcArrayLock
exclusively at the end of GetSnapshotData() and compress KAX. Reset the local
variable independent of getting the lock or not, to avoid causing a lot of
contention.

The nice part is that this would work even without the startup process
making progress. The not nice part is that it'd require a bit of code
study to figure out whether it's safe to modify KAX from outside the
startup process.



> But yeah, it does feel like the proposed
> approach is only going to be optimal over a small range of conditions.

In particular, it doesn't adapt at all to workloads that don't replay all that
much, but do compute a lot of snapshots.

Greetings,

Andres Freund




Re: Slow standby snapshot

2022-11-15 Thread Andres Freund
Hi,

On 2022-11-15 23:14:42 +, Simon Riggs wrote:
> On Tue, 15 Nov 2022 at 23:06, Tom Lane  wrote:
> >
> > BTW, while nosing around this code I came across this statement
> > (procarray.c, about line 4550 in HEAD):
> >
> >  * ... To add XIDs to the array, we just insert
> >  * them into slots to the right of the head pointer and then advance the 
> > head
> >  * pointer.  This wouldn't require any lock at all, except that on machines
> >  * with weak memory ordering we need to be careful that other processors
> >  * see the array element changes before they see the head pointer change.
> >  * We handle this by using a spinlock to protect reads and writes of the
> >  * head/tail pointers.  (We could dispense with the spinlock if we were to
> >  * create suitable memory access barrier primitives and use those instead.)
> >  * The spinlock must be taken to read or write the head/tail pointers unless
> >  * the caller holds ProcArrayLock exclusively.
> >
> > Nowadays we've *got* those primitives.  Can we get rid of
> > known_assigned_xids_lck, and if so would it make a meaningful
> > difference in this scenario?

Forgot to reply to this part:

I'm confused by the explanation of the semantics of the spinlock:

  "The spinlock must be taken to read or write the head/tail pointers
  unless the caller holds ProcArrayLock exclusively."

makes it sound like it'd be valid to modify the KnownAssignedXids without
holding ProcArrayLock exclusively. Doesn't that contradict only needing the
spinlock because of memory ordering?

And when would it be valid to do any modifications of KnownAssignedXids
without holding ProcArrayLock exclusively? Concurrent readers of
KnownAssignedXids will operate on a snapshot of head/tail (the spinlock is
only ever held to query them). Afaict any such modification would be racy,
because subsequent modifications of KnownAssignedXids could overwrite contents
of KnownAssignedXids that another backend is in the process of reading, based
on the stale snapshot of head/tail.


To me it sounds like known_assigned_xids_lck is pointless and the talk about
memory barriers a red herring, since all modifications have to happen with
ProcArrayLock held exclusively and all reads with ProcArrayLock held in share
mode. It can't be legal to modify head/tail or the contents of the array
outside of that. And lwlocks provide sufficient barrier semantics.



> I think you could do that *as well*, since it does act as an overhead
> but that is not related to the main issues

I think it might be a bigger effect than one might immediately think. Because
the spinlock will typically be on the same cacheline as head/tail, and because
every spinlock acquisition requires the cacheline to be modified (and thus
owned exclusively) by the current core, uses of head/tail will very commonly
be cache misses even in workloads without a lot of KAX activity.

Greetings,

Andres Freund




Re: Slow standby snapshot

2022-11-15 Thread Tom Lane
Simon Riggs  writes:
> but that is not related to the main issues:

> * COMMITs: xids are removed from the array by performing a binary
> search - this gets more and more expensive as the array gets wider
> * SNAPSHOTs: scanning the array for snapshots becomes more expensive
> as the array gets wider

Right.  The case complained of in this thread is SNAPSHOT cost,
since that's what KnownAssignedXidsGetAndSetXmin is used for.

> Hence more frequent compression is effective at reducing the overhead.
> But too frequent compression slows down the startup process, which
> can't then keep up.
> So we're just looking for an optimal frequency of compression for any
> given workload.

Hmm.  I wonder if my inability to detect a problem is because the startup
process does keep ahead of the workload on my machine, while it fails
to do so on the OP's machine.  I've only got a 16-CPU machine at hand,
which probably limits the ability of the primary to saturate the standby's
startup process.  If that's accurate, reducing the frequency of
compression attempts could be counterproductive in my workload range.
It would help the startup process when that is the bottleneck --- but
that wasn't what the OP complained of, so I'm not sure it helps him
either.

It seems like maybe what we should do is just drop the "nelements < 4 *
PROCARRAY_MAXPROCS" part of the existing heuristic, which is clearly
dangerous with large max_connection settings, and in any case doesn't
have a clear connection to either the cost of scanning or the cost
of compressing.  Or we could replace that with a hardwired constant,
like "nelements < 400".

regards, tom lane




Re: Add test module for Custom WAL Resource Manager feature

2022-11-15 Thread Jeff Davis
On Mon, 2022-11-14 at 09:34 +0530, Bharath Rupireddy wrote:
> Thanks. I would like to keep it simple.
> 
> I've added some more comments and attached v2 patch herewith. Please
> review.

Committed with some significant revisions (ae168c794f):

  * changed to insert a deterministic message, rather than a random
one, which allows more complete testing
  * fixed a couple bugs
  * used a static initializer for the RmgrData rather than memset,
which shows a better example

I also separately committed a patch to mark the argument of
RegisterCustomRmgr as "const".


-- 
Jeff Davis
PostgreSQL Contributor Team - AWS






Re: allowing for control over SET ROLE

2022-11-15 Thread Jeff Davis
On Tue, 2022-11-15 at 12:07 -0500, Robert Haas wrote:
> If anyone else wants to comment, or if either of those people want to
> comment further, please speak up soon.

Did you have some thoughts on:

https://postgr.es/m/a41d606d03b629c2ef0ed274ae3b04a2c266.ca...@j-davis.com

Regards,
Jeff Davis





Re: Unit tests for SLRU

2022-11-15 Thread Michael Paquier
On Tue, Nov 15, 2022 at 11:39:20AM +0100, Daniel Gustafsson wrote:
> + /* write given data to the page */
> + strncpy(TestSlruCtl->shared->page_buffer[slotno], data, BLCKSZ - 1);
> 
> Would it make sense to instead use pg_pwrite to closer match the code being
> tested?

Hmm.  I am not exactly sure what we'd gain with that, as it would
imply that we need to write directly to the file using SlruFileName()
after doing an OpenTransientFile() ourselves, duplicating what
SlruPhysicalWritePage() does to create a fd to feed to a pg_pwrite()?
Or I misunderstood your point.
--
Michael


signature.asc
Description: PGP signature


Re: Slow standby snapshot

2022-11-15 Thread Tom Lane
Andres Freund  writes:
> On 2022-11-15 23:14:42 +, Simon Riggs wrote:
>> Hence more frequent compression is effective at reducing the overhead.
>> But too frequent compression slows down the startup process, which
>> can't then keep up.
>> So we're just looking for an optimal frequency of compression for any
>> given workload.

> What about making the behaviour adaptive based on the amount of wasted effort
> during those two operations, rather than just a hardcoded "emptiness" factor?

Not quite sure how we could do that, given that those things aren't even
happening in the same process.  But yeah, it does feel like the proposed
approach is only going to be optimal over a small range of conditions.

> I don't think the xids % KAX_COMPRESS_FREQUENCY == 0 filter is a good idea -
> if you have a workload with plenty subxids you might end up never compressing
> because xids divisible by KAX_COMPRESS_FREQUENCY will end up as a subxid
> most/all of the time.

Yeah, I didn't think that was too safe either.  It'd be more reliable
to use a static counter to skip all but every N'th compress attempt
(something we could do inside KnownAssignedXidsCompress itself, instead
of adding warts at the call sites).

regards, tom lane




Re: Meson doesn't define HAVE_LOCALE_T for mscv

2022-11-15 Thread Juan José Santamaría Flecha
On Tue, Nov 15, 2022 at 9:02 PM Andres Freund  wrote:

>
> On 2022-11-15 15:35:31 +0100, Juan José Santamaría Flecha wrote:
> > On Tue, Nov 15, 2022 at 1:49 AM Andres Freund 
> wrote:
> > > Hm. Is it right that the changes are only done for msvc? win32_port.h
> > > defines the types for mingw as well afaict.
>
> > Yes, it does, but configure does nothing with them, so adding those
> > defines is a new feature for MinGW but a correction for MSVC.
>
> Any chance you checked if autoconf already detects locale_t with mingw?
> Possible that mingw supplies one of the relevant headers...
>
> Otherwise it looks like a sensible improvement to me.
>
I've checked the autoconf version of pg_config.h and it's not detected.
Also, manually inspecting <locale.h> I see no definition of locale_t in
MinGW.

Regards,

Juan José Santamaría Flecha


Re: SUBTRANS: Minimizing calls to SubTransSetParent()

2022-11-15 Thread Tom Lane
Simon Riggs  writes:
> On Tue, 15 Nov 2022 at 21:03, Tom Lane  wrote:
>> The subxidStatus.overflowed check quoted above has a similar sort
>> of myopia: it's checking whether our current transaction has
>> already suboverflowed.  But (a) that doesn't prove it won't suboverflow
>> later, and (b) the relevant logic in XidInMVCCSnapshot needs to run
>> SubTransGetTopmostTransaction if *any* proc in the snapshot has
>> suboverflowed.

> Not the way it is coded now.

> First, we search the subxid cache in snapshot->subxip.
> Then, and only if the snapshot overflowed (i.e. ANY xact overflowed),
> do we check subtrans.

No, that's not what XidInMVCCSnapshot does.  If snapshot->suboverflowed
is set (ie, somebody somewhere/somewhen overflowed), then it does
SubTransGetTopmostTransaction and searches only the xips with the result.
This behavior requires that all live subxids be correctly mapped by
SubTransGetTopmostTransaction, or we'll draw false conclusions.

We could perhaps make it do what you suggest, but that would require
a complete performance analysis to make sure we're not giving up
more than we would gain.

Also, both GetSnapshotData and CopySnapshot assume that the subxips
array is not used if suboverflowed is set, and don't bother
(continuing to) populate it.  So we would need code changes and
additional cycles in those areas too.

I'm not sure about your other claims, but I'm pretty sure this one
point is enough to kill the patch.

regards, tom lane




Re: Slow standby snapshot

2022-11-15 Thread Andres Freund
Hi,

On 2022-11-15 23:14:42 +, Simon Riggs wrote:
> * COMMITs: xids are removed from the array by performing a binary
> search - this gets more and more expensive as the array gets wider
> * SNAPSHOTs: scanning the array for snapshots becomes more expensive
> as the array gets wider
>
> Hence more frequent compression is effective at reducing the overhead.
> But too frequent compression slows down the startup process, which
> can't then keep up.

> So we're just looking for an optimal frequency of compression for any
> given workload.

What about making the behaviour adaptive based on the amount of wasted effort
during those two operations, rather than just a hardcoded "emptiness" factor?
It's common that basically no snapshots are taken, and constantly compressing
in that case is likely going to be wasted effort.


The heuristic the patch adds to KnownAssignedXidsRemoveTree() seems somewhat
misplaced to me. When called from ProcArrayApplyXidAssignment() we probably
should always compress - it'll only be issued when a substantial amount of
subxids have been assigned, so there'll be a bunch of cleanup work.  It makes
more sense from ExpireTreeKnownAssignedTransactionIds(), since it will very
commonly be called for individual xids - but even then, we afaict should take
into account how many xids we've just expired.

I don't think the xids % KAX_COMPRESS_FREQUENCY == 0 filter is a good idea -
if you have a workload with plenty subxids you might end up never compressing
because xids divisible by KAX_COMPRESS_FREQUENCY will end up as a subxid
most/all of the time.


Re cost of processing at COMMITs: We do a fresh binary search for each subxid
right now. There's a lot of locality in the xids that can be expired. Perhaps
we could have a cache for the position of the latest value in
KnownAssignedXidsSearch() and search linearly if the distance from the last
looked up value isn't large?


Greetings,

Andres




Re: Meson doesn't define HAVE_LOCALE_T for mscv

2022-11-15 Thread Juan José Santamaría Flecha
On Tue, Nov 15, 2022 at 8:53 PM Andres Freund  wrote:

>
> I don't think we should print mingw - that's really just redundant with
> gcc. But including host_system seems like a good idea. Not sure why I
> didn't
> do that.
>
>
I'll open a new thread for this. Also, I think this is skipping the
collate.linux.utf8.sql and infinite_recurse.sql tests on their intended
platforms.

Regards,

Juan José Santamaría Flecha


Re: Meson add host_system to PG_VERSION_STR

2022-11-15 Thread Michael Paquier
On Wed, Nov 16, 2022 at 12:08:56AM +0100, Juan José Santamaría Flecha wrote:
> As mentioned here [1] it might be interesting to complete the returned
> information by version() when compiled with meson by including the
> host_system.

The meson build provides extra_version, which would be able to do the
same, no?  The information would be appended to PG_VERSION_STR through
PG_VERSION.
--
Michael


signature.asc
Description: PGP signature


Re: meson oddities

2022-11-15 Thread Andres Freund
Hi,

On 2022-11-15 16:08:35 -0500, Tom Lane wrote:
> Andrew Dunstan  writes:
> > On 2022-11-15 Tu 14:04, Andres Freund wrote:
> >> I don't think it's a virtue to break the layout of the platform by
> >> e.g. installing 64bit libs into the directory containing 32bit libs.
>
> > You might end up surprising people who have installed from source for
> > years and will have the layout suddenly changed, especially on RedHat
> > flavored systems.

Just to make sure that's clear: meson defaults to lib/ or lib64/ (depending on
bitness obviously) on RedHat systems, not lib/i386-linux-gnu/ or
lib/x86_64-linux-gnu.


> Yeah, I'm not too pleased with this idea either.  The people who want
> to install according to some platform-specific plan have already figured
> out how to do that.  People who are accustomed to the way PG has done
> it in the past are not likely to think this is an improvement.

I think that's a good argument to not change the default for configure, but
imo not a good argument for forcing 'lib' rather than the appropriate platform
default in the meson build, given that that already requires changing existing
recipes.

Small note: I didn't intentionally make that change during the meson porting
work, it's just meson's default.

I can live with forcing lib/, but I don't think it's the better solution long
term. And this seems like the best point for switching we're going to get.


We'd just have to add 'libdir=lib' to the default_options array in the
toplevel meson.build.


> Also, unless you intend to drop the special cases involving whether
> the install path string contains "postgres" or "pgsql", it's already
> not platform-standard.

For me that's the best argument for forcing 'lib'. Still not quite enough to
swing me around, because it's imo a pretty reasonable thing to want to install
a 32bit and 64bit libpq, and I don't think we should make that harder.

Somewhat relatedly, I wonder if we should have a better way to enable/disable
the 'pgsql' path logic. It's pretty annoying that prefix basically doesn't
work if it doesn't contain 'pgsql' or 'postgres'.

Greetings,

Andres Freund




Bug in row_number() optimization

2022-11-15 Thread Sergey Shinderuk

Hello,

We've accidentally found a subtle bug introduced by

commit 9d9c02ccd1aea8e9131d8f4edb21bf1687e40782
Author: David Rowley
Date:   Fri Apr 8 10:34:36 2022 +1200

Teach planner and executor about monotonic window funcs


On a 32-bit system Valgrind reports use-after-free when running the 
"window" test:


==35487== Invalid read of size 4
==35487==at 0x48398A4: memcpy (vg_replace_strmem.c:1035)
==35487==by 0x1A2902: fill_val (heaptuple.c:287)
==35487==by 0x1A2902: heap_fill_tuple (heaptuple.c:336)
==35487==by 0x1A3C29: heap_form_minimal_tuple (heaptuple.c:1412)
==35487==by 0x3D4555: tts_virtual_copy_minimal_tuple (execTuples.c:290)
==35487==by 0x72FC33: ExecCopySlotMinimalTuple (tuptable.h:473)
==35487==by 0x72FC33: tuplesort_puttupleslot (tuplesortvariants.c:610)
==35487==by 0x403463: ExecSort (nodeSort.c:153)
==35487==by 0x3D0C8E: ExecProcNodeFirst (execProcnode.c:464)
==35487==by 0x40AF09: ExecProcNode (executor.h:259)
==35487==by 0x40AF09: begin_partition (nodeWindowAgg.c:1106)
==35487==by 0x40D259: ExecWindowAgg (nodeWindowAgg.c:2125)
==35487==by 0x3D0C8E: ExecProcNodeFirst (execProcnode.c:464)
==35487==by 0x405E17: ExecProcNode (executor.h:259)
==35487==by 0x405E17: SubqueryNext (nodeSubqueryscan.c:53)
==35487==by 0x3D41C7: ExecScanFetch (execScan.c:133)
==35487==by 0x3D41C7: ExecScan (execScan.c:199)
==35487==  Address 0xe3e8af0 is 168 bytes inside a block of size 8,192 
alloc'd

==35487==at 0x483463B: malloc (vg_replace_malloc.c:299)
==35487==by 0x712B63: AllocSetContextCreateInternal (aset.c:446)
==35487==by 0x3D82BE: CreateExprContextInternal (execUtils.c:253)
==35487==by 0x3D84DC: CreateExprContext (execUtils.c:303)
==35487==by 0x3D8750: ExecAssignExprContext (execUtils.c:482)
==35487==by 0x40BC1A: ExecInitWindowAgg (nodeWindowAgg.c:2382)
==35487==by 0x3D1232: ExecInitNode (execProcnode.c:346)
==35487==by 0x4035E0: ExecInitSort (nodeSort.c:265)
==35487==by 0x3D11AB: ExecInitNode (execProcnode.c:321)
==35487==by 0x40BD36: ExecInitWindowAgg (nodeWindowAgg.c:2432)
==35487==by 0x3D1232: ExecInitNode (execProcnode.c:346)
==35487==by 0x405E99: ExecInitSubqueryScan (nodeSubqueryscan.c:126)


It's faster to run just this test under Valgrind:

make installcheck-test TESTS='test_setup window'


This can also be reproduced on a 64-bit system by forcing int8 to be 
passed by reference:


--- a/src/include/pg_config_manual.h
+++ b/src/include/pg_config_manual.h
@@ -82,9 +82,7 @@
  *
  * Changing this requires an initdb.
  */
-#if SIZEOF_VOID_P >= 8
-#define USE_FLOAT8_BYVAL 1
-#endif
+#undef USE_FLOAT8_BYVAL

 /*
  * When we don't have native spinlocks, we use semaphores to simulate 
them.



Furthermore, zeroing freed memory makes the test fail:

--- a/src/include/utils/memdebug.h
+++ b/src/include/utils/memdebug.h
@@ -39,7 +39,7 @@ static inline void
 wipe_mem(void *ptr, size_t size)
 {
VALGRIND_MAKE_MEM_UNDEFINED(ptr, size);
-   memset(ptr, 0x7F, size);
+   memset(ptr, 0, size);
VALGRIND_MAKE_MEM_NOACCESS(ptr, size);
 }

$ cat src/test/regress/regression.diffs
diff -U3 
/home/sergey/pgwork/devel/src/src/test/regress/expected/window.out 
/home/sergey/pgwork/devel/src/src/test/regress/results/window.out
--- /home/sergey/pgwork/devel/src/src/test/regress/expected/window.out 
2022-11-03 18:26:52.203624217 +0300
+++ /home/sergey/pgwork/devel/src/src/test/regress/results/window.out 
2022-11-16 01:47:18.494273352 +0300

@@ -3721,7 +3721,8 @@
 ---+---++-++++
  personnel | 5 |   3500 | 12-10-2007  |  2 |  1 |  2 |  2
  sales | 3 |   4800 | 08-01-2007  |  3 |  1 |  3 |  3
-(2 rows)
+ sales | 4 |   4800 | 08-08-2007  |  3 |  0 |  3 |  3
+(3 rows)

 -- Tests to ensure we don't push down the run condition when it's not 
valid to

 -- do so.


The failing query is:

SELECT * FROM
  (SELECT *,
  count(salary) OVER (PARTITION BY depname || '') c1, -- w1
  row_number() OVER (PARTITION BY depname) rn, -- w2
  count(*) OVER (PARTITION BY depname) c2, -- w2
  count(*) OVER (PARTITION BY '' || depname) c3 -- w3
   FROM empsalary
) e WHERE rn <= 1 AND c1 <= 3;


As far as I understand, ExecWindowAgg for the intermediate WindowAgg 
node switches into pass-through mode, stops evaluating row_number(), and 
returns the previous value instead. But if int8 is passed by reference, 
the previous value stored in econtext->ecxt_aggvalues becomes a dangling 
pointer when the per-output-tuple memory context is reset.
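The hazard can be illustrated with a toy sketch (not PostgreSQL code; "Datum" here is just a pointer-sized integer, and "ctx" stands in for the per-output-tuple context): a by-value Datum survives a context reset, while a by-reference Datum is a pointer into the reset memory and dangles afterwards.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Toy illustration of pass-by-value vs. pass-by-reference Datums. */
typedef uintptr_t Datum;

/* By value: the 64-bit result lives in the Datum itself and survives
 * any context reset. */
Datum
store_byval(int64_t v)
{
	return (Datum) v;
}

/* By reference: the Datum is a pointer into the context, so it dangles
 * once the context is reset. */
Datum
store_byref(int64_t v, char *ctx)
{
	memcpy(ctx, &v, sizeof(v));
	return (Datum) ctx;
}
```

With USE_FLOAT8_BYVAL defined, int8 results take the first path; undefining it (as in the patch above) forces the second, which is what exposes the stale-pointer read in pass-through mode.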


Attaching a patch that makes the window test fail on a 64-bit system.

Best regards,

--
Sergey Shinderuk
https://postgrespro.com/

diff --git a/src/include/pg_config_manual.h b/src/include/pg_config_manual.h
index f2a106f983d..66ab343d51c 100644
--- a/src/include/pg_config_manual.h
+++ b/src/include/pg_config_manual.h
@@ -82,9 +82,7 @@
  *
 * Changing this requires an initdb.

Re: Slow standby snapshot

2022-11-15 Thread Simon Riggs
On Tue, 15 Nov 2022 at 23:06, Tom Lane  wrote:
>
> BTW, while nosing around this code I came across this statement
> (procarray.c, about line 4550 in HEAD):
>
>  * ... To add XIDs to the array, we just insert
>  * them into slots to the right of the head pointer and then advance the head
>  * pointer.  This wouldn't require any lock at all, except that on machines
>  * with weak memory ordering we need to be careful that other processors
>  * see the array element changes before they see the head pointer change.
>  * We handle this by using a spinlock to protect reads and writes of the
>  * head/tail pointers.  (We could dispense with the spinlock if we were to
>  * create suitable memory access barrier primitives and use those instead.)
>  * The spinlock must be taken to read or write the head/tail pointers unless
>  * the caller holds ProcArrayLock exclusively.
>
> Nowadays we've *got* those primitives.  Can we get rid of
> known_assigned_xids_lck, and if so would it make a meaningful
> difference in this scenario?

I think you could do that *as well*, since it does act as an overhead
but that is not related to the main issues:

* COMMITs: xids are removed from the array by performing a binary
search - this gets more and more expensive as the array gets wider
* SNAPSHOTs: scanning the array for snapshots becomes more expensive
as the array gets wider

Hence more frequent compression is effective at reducing the overhead.
But too frequent compression slows down the startup process, which
can't then keep up.

So we're just looking for an optimal frequency of compression for any
given workload.

-- 
Simon Riggs
http://www.EnterpriseDB.com/




Re: Add palloc_aligned() to allow arbitrary power of 2 memory alignment

2022-11-15 Thread Andres Freund
Hi,

On 2022-11-15 16:58:10 -0500, Greg Stark wrote:
> So I think it's kind of cute that you've implemented these as agnostic
> wrappers that work with any allocator ... but why?
> 
> I would have expected the functionality to just be added directly to
> the allocator to explicitly request whole aligned pages which IIRC
> it's already capable of doing but just doesn't have any way to
> explicitly request.

We'd need to support it in multiple allocators, and they'd need code quite
similar to this for allocations that go directly to malloc.

It's possible that we'd want to add optional support for aligned allocations
to e.g. aset.c but not other allocators - this patch would allow to add
support for that transparently.


> DirectIO doesn't really need a wide variety of allocation sizes or
> alignments, it's always going to be the physical block size which
> apparently can be as low as 512 bytes but I'm guessing we're always
> going to be using 4kB alignment and multiples of 8kB allocations.

Yep - I posted numbers in some other thread showing that using a larger
alignment is a good idea.


> Wouldn't just having a pool of 8kB pages all aligned on 4kB or 8kB
> alignment be simpler and more efficient than working around misaligned
> pointers and having all these branches and arithmetic happening?

I'm quite certain it'd increase memory usage, rather than reduce it - there's
not actually a whole lot of places that need aligned pages outside of bufmgr,
so the pool would just be unused most of the time. And you'd need special code
to return those pages to the pool when the operation using the aligned buffer
fails - whereas integrating with memory contexts already takes care of
that. Lastly, there's other places where we can benefit from aligned
allocations far smaller than 4kB (most typically cacheline aligned, I'd
guess).
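For reference, the allocator-agnostic wrapper idea can be sketched roughly as follows (names are made up for illustration, not the actual palloc_aligned() implementation): over-allocate, round the pointer up to the requested power-of-2 boundary, and stash the raw pointer just below the aligned block so it can be freed later.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch of an allocator-agnostic aligned wrapper. */
void *
alloc_aligned(size_t size, size_t alignto)
{
	void	   *raw;
	uintptr_t	aligned;

	/* alignto must be a power of 2 */
	if (alignto == 0 || (alignto & (alignto - 1)) != 0)
		return NULL;

	/* over-allocate: room for alignment slop plus the stashed pointer */
	raw = malloc(size + alignto + sizeof(void *));
	if (raw == NULL)
		return NULL;

	/* round up past the stash slot to the requested boundary */
	aligned = ((uintptr_t) raw + sizeof(void *) + alignto - 1)
		& ~((uintptr_t) alignto - 1);

	/* stash the raw pointer just below the aligned block */
	memcpy((void **) aligned - 1, &raw, sizeof(void *));
	return (void *) aligned;
}

void
free_aligned(void *ptr)
{
	void	   *raw;

	memcpy(&raw, (void **) ptr - 1, sizeof(void *));
	free(raw);
}
```

The same rounding arithmetic works regardless of which underlying allocator hands out the raw block, which is the point of keeping the wrapper generic.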

Greetings,

Andres Freund




Meson add host_system to PG_VERSION_STR

2022-11-15 Thread Juan José Santamaría Flecha
Hello all,

As mentioned here [1] it might be interesting to complete the returned
information by version() when compiled with meson by including the
host_system.

[1]
https://www.postgresql.org/message-id/20221115195318.5v5ynapmkusgyzks%40awork3.anarazel.de

Regards,

Juan José Santamaría Flecha


0001-meson-add-host_system-to-PG_VERSION_STR.patch
Description: Binary data


Re: Slow standby snapshot

2022-11-15 Thread Tom Lane
BTW, while nosing around this code I came across this statement
(procarray.c, about line 4550 in HEAD):

 * ... To add XIDs to the array, we just insert
 * them into slots to the right of the head pointer and then advance the head
 * pointer.  This wouldn't require any lock at all, except that on machines
 * with weak memory ordering we need to be careful that other processors
 * see the array element changes before they see the head pointer change.
 * We handle this by using a spinlock to protect reads and writes of the
 * head/tail pointers.  (We could dispense with the spinlock if we were to
 * create suitable memory access barrier primitives and use those instead.)
 * The spinlock must be taken to read or write the head/tail pointers unless
 * the caller holds ProcArrayLock exclusively.

Nowadays we've *got* those primitives.  Can we get rid of
known_assigned_xids_lck, and if so would it make a meaningful
difference in this scenario?
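The barrier-based scheme the comment alludes to can be sketched with C11 atomics (illustrative only; PostgreSQL would use its own pg_read_barrier/pg_write_barrier primitives, and the real array has head and tail pointers): the single writer publishes a slot with a release store of the head, and readers pair it with an acquire load, so no reader can see the new head before the slot contents.

```c
#include <assert.h>
#include <stdatomic.h>

/* Illustrative single-writer ring; names are made up. */
#define KAX_SLOTS 64

static int kax_slots[KAX_SLOTS];
static atomic_int kax_head;		/* next free slot; only the writer advances it */

/* Writer side: fill the slot, then advance head with release ordering
 * so the slot write is visible before the head change. */
void
kax_add(int xid)
{
	int		h = atomic_load_explicit(&kax_head, memory_order_relaxed);

	kax_slots[h % KAX_SLOTS] = xid;
	atomic_store_explicit(&kax_head, h + 1, memory_order_release);
}

/* Reader side: acquire-load head; any slot below it is then valid. */
int
kax_latest(void)
{
	int		h = atomic_load_explicit(&kax_head, memory_order_acquire);

	return h > 0 ? kax_slots[(h - 1) % KAX_SLOTS] : -1;
}
```

Whether dropping the spinlock this way yields a measurable win here is exactly the open question.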

regards, tom lane




Re: SUBTRANS: Minimizing calls to SubTransSetParent()

2022-11-15 Thread Simon Riggs
On Tue, 15 Nov 2022 at 21:03, Tom Lane  wrote:
>
> Simon Riggs  writes:
> > New version attached.
>
> I looked at this patch and I do not see how it can possibly be safe.

I grant you it is complex, so please bear with me.


> The most direct counterexample arises from the fact that
> HeapCheckForSerializableConflictOut checks SubTransGetTopmostTransaction
> in some cases.  You haven't tried to analyze when, but just disabled
> the optimization in serializable mode:
>
> + * 2. When IsolationIsSerializable() we sometimes need to access 
> topxid.
> + *This occurs only when SERIALIZABLE is requested by app user.
> +...
> +if (MyProc->subxidStatus.overflowed ||
> +IsolationIsSerializable() ||
>
> However, what this is checking is whether *our current transaction*
> is serializable.  If we're not serializable, but other transactions
> in the system are, then this fails to store information that they'll
> need for correctness.  You'd have to have some way of knowing that
> no transaction in the system is using serializable mode, and that
> none will start to do so while this transaction is still in-doubt.

Not true.

src/backend/storage/lmgr/README-SSI   says this...

* Any transaction which is run at a transaction isolation level
other than SERIALIZABLE will not be affected by SSI.  If you want to
enforce business rules through SSI, all transactions should be run at
the SERIALIZABLE transaction isolation level, and that should
probably be set as the default.

If HeapCheckForSerializableConflictOut() cannot find a subxid's parent
then it will not be involved in serialization errors.

So skipping the storage of subxids in subtrans for non-serializable
xacts is valid for both SSI and non-SSI xacts.

Thus the owning transaction can decide to skip the insert into
subtrans if it is not serializable.

> I fear that's already enough to kill the idea; but there's more.
> The subxidStatus.overflowed check quoted above has a similar sort
> of myopia: it's checking whether our current transaction has
> already suboverflowed.  But (a) that doesn't prove it won't suboverflow
> later, and (b) the relevant logic in XidInMVCCSnapshot needs to run
> SubTransGetTopmostTransaction if *any* proc in the snapshot has
> suboverflowed.

Not the way it is coded now.

First, we search the subxid cache in snapshot->subxip.
Then, and only if the snapshot overflowed (i.e. ANY xact overflowed),
do we check subtrans.

Thus, the owning xact knows that anyone else will find the first 64
xids in the subxid cache, so it need not insert them into subtrans,
even if someone else overflowed. When the owning xact overflows, it
knows it must now insert the subxid into subtrans before the xid is
used anywhere in storage, which the patch does. This allows each
owning xact to decide what to do, independent of the actions of
others.

> Lastly, I don't see what the "transaction on same page" business
> has got to do with anything.  The comment is certainly failing
> to make the case that it's safe to skip filling subtrans when that
> is true.

That seems strange, I grant you. It's the same logic that is used in
TransactionIdSetTreeStatus(), in reverse. I understand it 'cos I wrote
it.

TRANSACTION_STATUS_SUB_COMMITTED is only ever used if the topxid and
subxid are on different pages. Therefore TransactionIdDidCommit()
won't ever see a value of TRANSACTION_STATUS_SUB_COMMITTED unless they
are on separate pages. So the owning transaction can predict in
advance whether anyone will ever call SubTransGetParent() for one of
its xids. If they might, then we record the values just in case. If
they NEVER will, then we can skip recording them.


And just to be clear, all 3 of the above preconditions must be true
before the owning xact decides to skip writing a subxid to subtrans.

> I think we could salvage this small idea:
>
> + * Insert entries into subtrans for this xid, noting that the 
> entry
> + * points directly to the topxid, not the immediate parent. This 
> is
> + * done for two reasons:
> + * (1) so it is faster in a long chain of subxids, because the
> + * algorithm is then O(1), no matter how many subxids are 
> assigned.
>
> but some work would be needed to update the comments around
> SubTransGetParent and SubTransGetTopmostTransaction to explain that
> they're no longer reliably different.  I think that that is okay for
> the existing use-cases, but they'd better be documented.  In fact,
> couldn't we simplify them down to one function?  Given the restriction
> that we don't look back in pg_subtrans further than TransactionXmin,
> I don't think that updated code would ever need to resolve cases
> written by older code.  So we could remove the recursive checks
> entirely, or at least be confident that they don't recurse more
> than once.

Happy to do so; I'd left it that way to soften the blow of
absorbing the earlier thoughts.

(Since we know all 

Re: Slow standby snapshot

2022-11-15 Thread Tom Lane
Simon Riggs  writes:
> I've cleaned up the comments and used a #define for the constant.
> IMHO this is ready for commit. I will add it to the next CF.

I looked at this a little.  It's a simple enough patch, and if it
solves the problem then I sure like it better than the previous
ideas in this thread.

However ... I tried to reproduce the original complaint, and
failed entirely.  I do see KnownAssignedXidsGetAndSetXmin
eating a bit of time in the standby backends, but it's under 1%
and doesn't seem to be rising over time.  Perhaps we've already
applied some optimization that ameliorates the problem?  But
I tested v13 as well as HEAD, and got the same results.

regards, tom lane




Re: closing file in adjust_data_dir

2022-11-15 Thread Ted Yu
On Tue, Nov 15, 2022 at 10:43 AM Ted Yu  wrote:

> Hi,
> I was looking at the commit:
>
> commit 2fe3bdbd691a5d11626308e7d660440be6c210c8
> Author: Peter Eisentraut 
> Date:   Tue Nov 15 15:35:37 2022 +0100
>
> Check return value of pclose() correctly
>
> In src/bin/pg_ctl/pg_ctl.c :
>
> if (fd == NULL || fgets(filename, sizeof(filename), fd) == NULL ||
> pclose(fd) != 0)
>
> If the fgets() call returns NULL, the pclose() would be skipped.
> Since the original pclose() call was removed, wouldn't this lead to fd
> being leaked?
>
> Please see attached patch for my proposal.
>
> Cheers
>

There was potential leak of fd in patch v1.

Please take a look at patch v2.

Thanks


pg-ctl-close-fd-v2.patch
Description: Binary data


Re: Add palloc_aligned() to allow arbitrary power of 2 memory alignment

2022-11-15 Thread Greg Stark
So I think it's kind of cute that you've implemented these as agnostic
wrappers that work with any allocator ... but why?

I would have expected the functionality to just be added directly to
the allocator to explicitly request whole aligned pages which IIRC
it's already capable of doing but just doesn't have any way to
explicitly request.

DirectIO doesn't really need a wide variety of allocation sizes or
alignments, it's always going to be the physical block size which
apparently can be as low as 512 bytes but I'm guessing we're always
going to be using 4kB alignment and multiples of 8kB allocations.
Wouldn't just having a pool of 8kB pages all aligned on 4kB or 8kB
alignment be simpler and more efficient than working around misaligned
pointers and having all these branches and arithmetic happening?




Re: Avoid overhead open-close indexes (catalog updates)

2022-11-15 Thread Michael Paquier
On Tue, Nov 15, 2022 at 11:42:34AM -0300, Ranier Vilela wrote:
> I find it very difficult not to have some tuple to be updated,
> once inside CopyStatistics and the branch cost can get in the way,
> but I don't object with your solution.

The code assumes that it is a possibility.

> Missed AddRoleMems?
> Could you continue with CatalogTupleInsertWithInfo, what do you think?

This one has been left out on purpose.  I was tempted to use
WithInfo() with a CatalogIndexState opened optionally, but I got the
impression that it makes the code a bit harder to follow, and
AddRoleMems() is already complex on its own.  Most DDL patterns
working on roles involve just one role.  More roles could of course be
added in one shot, but the extra logic complexity did not look that
appealing to me, especially as some role updates are skipped.
--
Michael


signature.asc
Description: PGP signature


Re: Standardizing how pg_waldump presents recovery conflict XID cutoffs

2022-11-15 Thread Peter Geoghegan
On Tue, Nov 15, 2022 at 12:29 PM Andres Freund  wrote:
> ... I strongly dislike latestCommittedXid. That seems at least as misleading
> as latestRemovedXid and has the danger of confusion with latestCompletedXid
> as you mention.

> How about latestAffectedXid?

I get why you don't care for latestCommittedXid, of course, but the
name does have some advantages. Namely:

1. Most conflicts come from PRUNE records (less often index deletion
records) where the XID is some heap tuple's xmax, a
committed-to-everybody XID on the primary (at the point of the
original execution of the prune). It makes sense to emphasize the idea
that snapshots running on a replica need to agree that this XID is
definitely committed -- we need to kill any snapshots that don't
definitely agree that this one particular XID is committed by now.

2. It hints at the idea that we don't need to set any XID to do
cleanup for aborted transactions, per the optimization in
HeapTupleHeaderAdvanceLatestRemovedXid().

Perhaps something like "mustBeCommittedCutoff" would work better? What
do you think of that? The emphasis on how things need to work on the
REDO side seems useful.

-- 
Peter Geoghegan




Re: archive modules

2022-11-15 Thread Michael Paquier
On Tue, Nov 15, 2022 at 12:57:49PM -0800, Nathan Bossart wrote:
> On Tue, Nov 15, 2022 at 06:14:25PM +0100, Alvaro Herrera wrote:
>> Hmm, really?  It seems to me that we will have two slightly different
>> behaviors in 15 and master, which may be confusing later on.  I think
>> it'd be better to make them both work identically.
> 
> I don't have a strong opinion either way.  While consistency between v15
> and master seems nice, the behavior change might not be appropriate for a
> minor release.  BTW I was able to cherry-pick the committed patch to v15
> without any changes.  Peter, could you clarify what changes you'd like to
> see in a back-patched version?

FWIW, I am not sure that I would have done d627ce3, as I already
mentioned upthread, since the library loading should not be related to
archive_command.  If there is more support for doing that, I am
fine to withdraw, but the behavior between HEAD and REL_15_STABLE
ought to be consistent.
--
Michael


signature.asc
Description: PGP signature


Re: allowing for control over SET ROLE

2022-11-15 Thread Nathan Bossart
On Tue, Nov 15, 2022 at 12:07:06PM -0500, Robert Haas wrote:
> If anyone else wants to comment, or if either of those people want to
> comment further, please speak up soon. Otherwise, I am going to press
> forward with committing this.

I don't think I have any further thoughts about the approach, so I won't
balk if you proceed with this change.  It might be worth starting a new
thread to discuss whether to treat predefined roles as a special case, but
IMO that needn't hold up this patch.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: meson oddities

2022-11-15 Thread Tom Lane
Andrew Dunstan  writes:
> On 2022-11-15 Tu 14:04, Andres Freund wrote:
>> I don't think it's a virtue to break the layout of the platform by
>> e.g. installing 64bit libs into the directory containing 32bit libs.

> You might end up surprising people who have installed from source for
> years and will have the layout suddenly changed, especially on RedHat
> flavored systems.

Yeah, I'm not too pleased with this idea either.  The people who want
to install according to some platform-specific plan have already figured
out how to do that.  People who are accustomed to the way PG has done
it in the past are not likely to think this is an improvement.   

Also, unless you intend to drop the special cases involving whether
the install path string contains "postgres" or "pgsql", it's already
not platform-standard.

regards, tom lane




Re: Speed up transaction completion faster after many relations are accessed in a transaction

2022-11-15 Thread David Rowley
Thank you for looking at the patch.

On Fri, 4 Nov 2022 at 04:43, Ankit Kumar Pandey  wrote:
> I don't see any performance improvement in tests.

Are you able to share what your test was?

In order to see a performance improvement you're likely going to have
to obtain a large number of locks in the session so that the local
lock table becomes bloated, then continue to run some fast query and
observe that LockReleaseAll has become slower as a result of the hash
table becoming bloated.  Running pgbench running a SELECT on a hash
partitioned table with a good number of partitions to look up a single
row with -M prepared.  The reason this becomes slow is that the
planner will try a generic plan on the 6th execution which will lock
every partition and bloat the local lock table. From then on it will
use a custom plan which only locks a single leaf partition.

I just tried the following:

$ pgbench -i --partition-method=hash --partitions=1000 postgres

Master:
$ pgbench -T 60 -S -M prepared postgres | grep tps
tps = 21286.172326 (without initial connection time)

Patched:
$ pgbench -T 60 -S -M prepared postgres | grep tps
tps = 23034.063261 (without initial connection time)

If I try again with 10,000 partitions, I get:

Master:
$ pgbench -T 60 -S -M prepared postgres | grep tps
tps = 13044.290903 (without initial connection time)

Patched:
$ pgbench -T 60 -S -M prepared postgres | grep tps
tps = 22683.545686 (without initial connection time)

David




Re: SUBTRANS: Minimizing calls to SubTransSetParent()

2022-11-15 Thread Tom Lane
Simon Riggs  writes:
> New version attached.

I looked at this patch and I do not see how it can possibly be safe.

The most direct counterexample arises from the fact that
HeapCheckForSerializableConflictOut checks SubTransGetTopmostTransaction
in some cases.  You haven't tried to analyze when, but just disabled
the optimization in serializable mode:

+ * 2. When IsolationIsSerializable() we sometimes need to access 
topxid.
+ *This occurs only when SERIALIZABLE is requested by app user.
+...
+if (MyProc->subxidStatus.overflowed ||
+IsolationIsSerializable() ||

However, what this is checking is whether *our current transaction*
is serializable.  If we're not serializable, but other transactions
in the system are, then this fails to store information that they'll
need for correctness.  You'd have to have some way of knowing that
no transaction in the system is using serializable mode, and that
none will start to do so while this transaction is still in-doubt.

I fear that's already enough to kill the idea; but there's more.
The subxidStatus.overflowed check quoted above has a similar sort
of myopia: it's checking whether our current transaction has
already suboverflowed.  But (a) that doesn't prove it won't suboverflow
later, and (b) the relevant logic in XidInMVCCSnapshot needs to run
SubTransGetTopmostTransaction if *any* proc in the snapshot has
suboverflowed.

Lastly, I don't see what the "transaction on same page" business
has got to do with anything.  The comment is certainly failing
to make the case that it's safe to skip filling subtrans when that
is true.

I think we could salvage this small idea:

+ * Insert entries into subtrans for this xid, noting that the entry
+ * points directly to the topxid, not the immediate parent. This is
+ * done for two reasons:
+ * (1) so it is faster in a long chain of subxids, because the
+ * algorithm is then O(1), no matter how many subxids are assigned.

but some work would be needed to update the comments around
SubTransGetParent and SubTransGetTopmostTransaction to explain that
they're no longer reliably different.  I think that that is okay for
the existing use-cases, but they'd better be documented.  In fact,
couldn't we simplify them down to one function?  Given the restriction
that we don't look back in pg_subtrans further than TransactionXmin,
I don't think that updated code would ever need to resolve cases
written by older code.  So we could remove the recursive checks
entirely, or at least be confident that they don't recurse more
than once.
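The difference between the two linking schemes can be shown with a toy model (an array standing in for pg_subtrans, not the real SLRU code): immediate-parent links must be walked, while direct-to-topxid links resolve in one lookup.

```c
#include <assert.h>
#include <string.h>

/* Toy model: an array mapping xid -> recorded parent xid. */
#define MAX_XID 16

static int parent_of[MAX_XID];	/* 0 = no parent recorded */

/* Immediate-parent scheme: O(chain depth) per lookup. */
int
topmost_chained(int xid)
{
	while (parent_of[xid] != 0)
		xid = parent_of[xid];
	return xid;
}

/* Direct-to-topxid scheme (the patch's idea): O(1) per lookup. */
int
topmost_direct(int xid)
{
	return parent_of[xid] != 0 ? parent_of[xid] : xid;
}
```

Under the direct scheme, "parent" and "topmost" lookups return the same thing, which is why the two SubTrans functions could collapse into one.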

regards, tom lane




Re: archive modules

2022-11-15 Thread Nathan Bossart
On Tue, Nov 15, 2022 at 06:14:25PM +0100, Alvaro Herrera wrote:
> On 2022-Nov-15, Nathan Bossart wrote:
> 
>> On Tue, Nov 15, 2022 at 10:31:44AM +0100, Peter Eisentraut wrote:
> 
>> > The surrounding code has changed a bit between PG15 and master, so if we
>> > wanted to backpatch this, we'd need another patch from you.  However, at
>> > this point, I'm content to just leave it be in PG15.
>> 
>> Sounds good to me.
> 
> Hmm, really?  It seems to me that we will have two slightly different
> behaviors in 15 and master, which may be confusing later on.  I think
> it'd be better to make them both work identically.

I don't have a strong opinion either way.  While consistency between v15
and master seems nice, the behavior change might not be appropriate for a
minor release.  BTW I was able to cherry-pick the committed patch to v15
without any changes.  Peter, could you clarify what changes you'd like to
see in a back-patched version?

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: ssl tests aren't concurrency safe due to get_free_port()

2022-11-15 Thread Andrew Dunstan

On 2022-11-06 Su 11:30, Andrew Dunstan wrote:
>
> One possible addition would be to add removing the reservation files in
> an END handler. That would be pretty simple.
>
>


Here's a version with that. I suggest we try it out and see if anything
breaks.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index d80134b26f..85fae32c14 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -93,9 +93,9 @@ use warnings;
 
 use Carp;
 use Config;
-use Fcntl qw(:mode);
+use Fcntl qw(:mode :flock :seek O_CREAT O_RDWR);
 use File::Basename;
-use File::Path qw(rmtree);
+use File::Path qw(rmtree mkpath);
 use File::Spec;
 use File::stat qw(stat);
 use File::Temp ();
@@ -109,12 +109,15 @@ use Time::HiRes qw(usleep);
 use Scalar::Util qw(blessed);
 
 our ($use_tcp, $test_localhost, $test_pghost, $last_host_assigned,
-	$last_port_assigned, @all_nodes, $died);
+	$last_port_assigned, @all_nodes, $died, $portdir);
 
 # the minimum version we believe to be compatible with this package without
 # subclassing.
 our $min_compat = 12;
 
+# list of file reservations made by get_free_port
+my @port_reservation_files;
+
 INIT
 {
 
@@ -140,6 +143,27 @@ INIT
 
 	# Tracking of last port value assigned to accelerate free port lookup.
 	$last_port_assigned = int(rand() * 16384) + 49152;
+
+	# Set the port lock directory
+
+	# If we're told to use a directory (e.g. from a buildfarm client)
+	# explicitly, use that
+	$portdir = $ENV{PG_TEST_PORT_DIR};
+	# Otherwise, try to use a directory at the top of the build tree
+	if (! $portdir && $ENV{MESON_BUILD_ROOT})
+	{
+		$portdir = $ENV{MESON_BUILD_ROOT} . '/portlock'
+	}
+	elsif (! $portdir && ($ENV{TESTDATADIR} || "") =~ /\W(src|contrib)\W/p)
+	{
+		my $dir = ${^PREMATCH};
+		$portdir = "$dir/portlock" if $dir;
+	}
+	# As a last resort use a directory under tmp_check
+	$portdir ||= $PostgreSQL::Test::Utils::tmp_check . '/portlock';
+	$portdir =~ s!\\!/!g;
+	# Make sure the directory exists
+	mkpath($portdir) unless -d $portdir;
 }
 
 =pod
@@ -1505,6 +1529,7 @@ sub get_free_port
 	last;
 }
 			}
+			$found = _reserve_port($port) if $found;
 		}
 	}
 
@@ -1535,6 +1560,40 @@ sub can_bind
 	return $ret;
 }
 
+# Internal routine to reserve a port number
+# Returns 1 if successful, 0 if port is already reserved.
+sub _reserve_port
+{
+	my $port = shift;
+	# open in rw mode so we don't have to reopen it and lose the lock
+	my $filename = "$portdir/$port.rsv";
+	sysopen(my $portfile, $filename, O_RDWR|O_CREAT)
+	  || die "opening port file $filename";
+	# take an exclusive lock to avoid concurrent access
+	flock($portfile, LOCK_EX) || die "locking port file $filename";
+	# see if someone else has or had a reservation of this port
+	my $pid = <$portfile>;
+	chomp $pid;
+	if ($pid +0 > 0)
+	{
+		if (kill 0, $pid)
+		{
+			# process exists and is owned by us, so we can't reserve this port
+			close($portfile);
+			return 0;
+		}
+	}
+	# All good, go ahead and reserve the port, first rewind and truncate.
+	# If truncation fails it's not a tragedy, it just might leave some
+	# trailing junk in the file that won't affect us.
+	seek($portfile, 0, SEEK_SET);
+	truncate($portfile, 0);
+	print $portfile "$$\n";
+	close($portfile);
+	push(@port_reservation_files, $filename);
+	return 1;
+}
+
 # Automatically shut down any still-running nodes (in the same order the nodes
 # were created in) when the test script exits.
 END
@@ -1562,6 +1621,8 @@ END
 		  if $exit_code == 0 && PostgreSQL::Test::Utils::all_tests_passing();
 	}
 
+	unlink @port_reservation_files;
+
 	$? = $exit_code;
 }
 


Re: meson oddities

2022-11-15 Thread Andrew Dunstan


On 2022-11-15 Tu 14:04, Andres Freund wrote:
>> But ISTM we shouldn't be presuming what packagers will do, and that
>> there is some virtue in having a default layout under ${prefix} that is
>> consistent across platforms, as is now the case with autoconf/configure.
> I don't think it's a virtue to break the layout of the platform by
> e.g. installing 64bit libs into the directory containing 32bit libs.


You might end up surprising people who have installed from source for
years and will have the layout suddenly changed, especially on RedHat
flavored systems.

I can work around it in the buildfarm, which does make some assumptions
about the layout (e.g. in the cross version pg_upgrade stuff), by
explicitly using --libdir.


>> But it's less clear to me that a bunch of defines belong in LDFLAGS.
>> Shouldn't that be only things that ld itself will recognize?
> I don't think there's a clear cut line what is for ld and what
> isn't. Including stuff that influences both preprocessor and
> linker. -ffreestanding will e.g. change preprocessor, compiler (I think), and
> linker behaviour.
>

Well it sure looks odd.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Standardizing how pg_waldump presents recovery conflict XID cutoffs

2022-11-15 Thread Andres Freund
Hi,

I like the idea of this, but:

On 2022-11-15 10:24:05 -0800, Peter Geoghegan wrote:
> I'm not necessarily that attached to the name latestCommittedXid. It
> is more accurate, but it's also a little bit too similar to another
> common XID symbol name, latestCompletedXid. Can anyone suggest an
> alternative?

... I strongly dislike latestCommittedXid. That seems at least as misleading
as latestRemovedXid and has the danger of confusion with latestCompletedXid
as you mention.

How about latestAffectedXid? Based on a quick scroll through the changed
structures it seems like it'd be reasonably descriptive for most?

Greetings,

Andres Freund




Re: HOT chain validation in verify_heapam()

2022-11-15 Thread Robert Haas
On Tue, Nov 15, 2022 at 2:50 PM Andres Freund  wrote:
> On 2022-11-15 11:36:21 -0500, Robert Haas wrote:
> > On Mon, Nov 14, 2022 at 5:02 PM Andres Freund  wrote:
> > > It seems like we should do a bit more validation within a chain of
> > > tuples. E.g. that no live tuple can follow an !DidCommit xmin?
> >
> > I think this check is already present in stronger form. If we see a
> > !DidCommit xmin, the xmin of the next tuple in the chain not only can't be
> > committed, but had better be the same.
>
> As I think I mentioned before, I don't think the "better be the same" aspect
> is correct, think subxacts. E.g.
>
> off 0: xmin: top, xmax: child_1
> off 1: xmin: child_1, xmax: invalid
>
> If top hasn't committed yet, the current logic afaict will warn about this
> situation, no? And I don't think we can generally determine the subxid
> parent at this point, unfortunately (might have truncated subtrans).

Woops, you're right.

> Different aspect: Is it ok that we use TransactionIdDidCommit() without a
> preceding IsInProgress() check?

Well, the code doesn't match the comments here, sadly. The comments
claim that we want to check that if the prior tuple's xmin was aborted
or in progress the current one is in the same state. If that's
actually what the code checked, we'd definitely need to check both
TransactionIdInProgress and TransactionIdCommit and be wary of the
possibility of the value changing concurrently. But the code doesn't
actually check the status of more than one XID, nor does it care about
the distinction between aborted and in progress, so I don't think that
the current code is buggy in that particular way, just in a bunch of
other ways.

> I do think there's some potential for additional checks that don't run into
> the above issue, e.g. checking that no in-progress xids follow an explicitly
> aborted xact, that a committed xid can't follow an uncommitted xid etc.

Yeah, maybe so.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Report checkpoint progress with pg_stat_progress_checkpoint (was: Report checkpoint progress in server logs)

2022-11-15 Thread Andres Freund
Hi,

On 2022-11-04 09:25:52 +0100, Drouvot, Bertrand wrote:
>  
> @@ -7023,29 +7048,63 @@ static void
>  CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
>  {
>   CheckPointRelationMap();
> +
> + pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
> +  PROGRESS_CHECKPOINT_PHASE_REPLI_SLOTS);
>   CheckPointReplicationSlots();
> +
> + pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
> +  PROGRESS_CHECKPOINT_PHASE_SNAPSHOTS);
>   CheckPointSnapBuild();
> +
> + pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
> +  PROGRESS_CHECKPOINT_PHASE_LOGICAL_REWRITE_MAPPINGS);
>   CheckPointLogicalRewriteHeap();
> +
> + pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
> +  PROGRESS_CHECKPOINT_PHASE_REPLI_ORIGIN);
>   CheckPointReplicationOrigin();
>  
>   /* Write out all dirty data in SLRUs and the main buffer pool */
>   TRACE_POSTGRESQL_BUFFER_CHECKPOINT_START(flags);
>   CheckpointStats.ckpt_write_t = GetCurrentTimestamp();
> +
> + pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
> +  PROGRESS_CHECKPOINT_PHASE_CLOG_PAGES);
>   CheckPointCLOG();
> +
> + pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
> +  PROGRESS_CHECKPOINT_PHASE_COMMITTS_PAGES);
>   CheckPointCommitTs();
> +
> + pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
> +  PROGRESS_CHECKPOINT_PHASE_SUBTRANS_PAGES);
>   CheckPointSUBTRANS();
> +
> + pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
> +  PROGRESS_CHECKPOINT_PHASE_MULTIXACT_PAGES);
>   CheckPointMultiXact();
> +
> + pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
> +  PROGRESS_CHECKPOINT_PHASE_PREDICATE_LOCK_PAGES);
>   CheckPointPredicate();
> +
> + pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
> +  PROGRESS_CHECKPOINT_PHASE_BUFFERS);
>   CheckPointBuffers(flags);
>  
>   /* Perform all queued up fsyncs */
>   TRACE_POSTGRESQL_BUFFER_CHECKPOINT_SYNC_START();
>   CheckpointStats.ckpt_sync_t = GetCurrentTimestamp();
> + pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
> +  PROGRESS_CHECKPOINT_PHASE_SYNC_FILES);
>   ProcessSyncRequests();
>   CheckpointStats.ckpt_sync_end_t = GetCurrentTimestamp();
>   TRACE_POSTGRESQL_BUFFER_CHECKPOINT_DONE();
>  
>   /* We deliberately delay 2PC checkpointing as long as possible */
> + pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
> +  PROGRESS_CHECKPOINT_PHASE_TWO_PHASE);
>   CheckPointTwoPhase(checkPointRedo);
>  }

This is quite the code bloat. Can we make this less duplicative?


> +CREATE VIEW pg_stat_progress_checkpoint AS
> +SELECT
> +S.pid AS pid,
> +CASE S.param1 WHEN 1 THEN 'checkpoint'
> +  WHEN 2 THEN 'restartpoint'
> +  END AS type,
> +( CASE WHEN (S.param2 & 4) > 0 THEN 'immediate ' ELSE '' END ||
> +  CASE WHEN (S.param2 & 8) > 0 THEN 'force ' ELSE '' END ||
> +  CASE WHEN (S.param2 & 16) > 0 THEN 'flush-all ' ELSE '' END ||
> +  CASE WHEN (S.param2 & 32) > 0 THEN 'wait ' ELSE '' END ||
> +  CASE WHEN (S.param2 & 128) > 0 THEN 'wal ' ELSE '' END ||
> +  CASE WHEN (S.param2 & 256) > 0 THEN 'time ' ELSE '' END
> +) AS flags,
> +( '0/0'::pg_lsn +
> +  ((CASE
> +WHEN S.param3 < 0 THEN pow(2::numeric, 64::numeric)::numeric
> +ELSE 0::numeric
> +END) +
> +   S.param3::numeric)
> +) AS start_lsn,

I don't think we should embed this much complexity in the view
definitions. It's hard to read, it bloats the catalog, and we can't fix it
once released.  This stuff seems like it should be in a helper function.

I don't have any idea what that pow stuff is supposed to be doing.


> +to_timestamp(946684800 + (S.param4::float8 / 100)) AS start_time,

I don't think this is a reasonable path - embedding way too much low-level
details about the timestamp format in the view definition. Why do we need to
do this?



Greetings,

Andres Freund




Re: Moving forward with TDE

2022-11-15 Thread Jacob Champion
On Tue, Nov 15, 2022 at 11:39 AM David Christensen
 wrote:
> Good to know about the next steps, thanks.

You're welcome!

> This was just a refresh of the old patches on the wiki to work as written on
> HEAD. If there are known TODOs here, then that work still needs to be done.
>
> I was going to take 2) and Stephen was going to work on 3); I am not sure 
> about the other two but will review the thread you pointed to. Thanks for 
> pointing that out.

I've attached the diffs I'm carrying to build this under meson (as
well as -Wshadow; my removal of the two variables probably needs some
scrutiny). It looks like the testcrypto executable will need
substantial changes after the common/hex.h revert.

--Jacob
From eb4b55f5d03e362cf340f322c0cefbf95f53657a Mon Sep 17 00:00:00 2001
From: Jacob Champion 
Date: Mon, 14 Nov 2022 16:02:57 -0800
Subject: [PATCH] WIP: fix meson

Note that the testcrypto executable needs to be rewritten.
---
 src/backend/bootstrap/bootstrap.c |  2 --
 src/backend/crypto/kmgr.c |  1 -
 src/backend/crypto/meson.build| 14 ++
 src/backend/meson.build   |  1 +
 src/bin/meson.build   |  1 +
 src/bin/pg_alterckey/meson.build  | 16 
 src/common/meson.build|  3 +++
 7 files changed, 35 insertions(+), 3 deletions(-)
 create mode 100644 src/backend/crypto/meson.build
 create mode 100644 src/bin/pg_alterckey/meson.build

diff --git a/src/backend/bootstrap/bootstrap.c 
b/src/backend/bootstrap/bootstrap.c
index ffaf9c7cc4..07b2f15343 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -256,8 +256,6 @@ BootstrapModeMain(int argc, char *argv[], bool check_only)
break;
case 'K':
{
-   int i;
- 
/* method 0/disabled cannot be 
specified */
for (i = DISABLED_ENCRYPTION_METHOD + 1;
 i < NUM_ENCRYPTION_METHODS; 
i++)
diff --git a/src/backend/crypto/kmgr.c b/src/backend/crypto/kmgr.c
index a83effbca8..b4248965f8 100644
--- a/src/backend/crypto/kmgr.c
+++ b/src/backend/crypto/kmgr.c
@@ -141,7 +141,6 @@ BootStrapKmgr(void)
if (bootstrap_old_key_datadir == NULL)
{
unsigned char *bootstrap_keys_wrap[KMGR_NUM_DATA_KEYS];
-   int key_lens[KMGR_NUM_DATA_KEYS];
PgCipherCtx *cluster_key_ctx;
 
/* Create KEK encryption context */
diff --git a/src/backend/crypto/meson.build b/src/backend/crypto/meson.build
new file mode 100644
index 00..2b72b26ebf
--- /dev/null
+++ b/src/backend/crypto/meson.build
@@ -0,0 +1,14 @@
+backend_sources += files(
+  'bufenc.c',
+  'kmgr.c',
+)
+
+install_data(
+  'ckey_aws.sh.sample',
+  'ckey_direct.sh.sample',
+  'ckey_passphrase.sh.sample',
+  'ckey_piv_nopin.sh.sample',
+  'ckey_piv_pin.sh.sample',
+  'ssl_passphrase.sh.sample',
+  install_dir: dir_data / 'auth_commands',
+)
diff --git a/src/backend/meson.build b/src/backend/meson.build
index 37562bae13..2162aef554 100644
--- a/src/backend/meson.build
+++ b/src/backend/meson.build
@@ -9,6 +9,7 @@ subdir('backup')
 subdir('bootstrap')
 subdir('catalog')
 subdir('commands')
+subdir('crypto')
 subdir('executor')
 subdir('foreign')
 subdir('jit')
diff --git a/src/bin/meson.build b/src/bin/meson.build
index 5fd5a9d2f9..8dd9133213 100644
--- a/src/bin/meson.build
+++ b/src/bin/meson.build
@@ -1,4 +1,5 @@
 subdir('initdb')
+subdir('pg_alterckey')
 subdir('pg_amcheck')
 subdir('pg_archivecleanup')
 subdir('pg_basebackup')
diff --git a/src/bin/pg_alterckey/meson.build b/src/bin/pg_alterckey/meson.build
new file mode 100644
index 00..72ec8274c3
--- /dev/null
+++ b/src/bin/pg_alterckey/meson.build
@@ -0,0 +1,16 @@
+pg_alterckey_sources = files(
+  'pg_alterckey.c',
+)
+
+if host_system == 'windows'
+  pg_alterckey_sources += rc_bin_gen.process(win32ver_rc, extra_args: [
+'--NAME', 'pg_alterckey',
+'--FILEDESC', 'pg_alterckey - alter the cluster key',])
+endif
+
+pg_alterckey = executable('pg_alterckey',
+  pg_alterckey_sources,
+  dependencies: [frontend_code, libpq],
+  kwargs: default_bin_args,
+)
+bin_targets += pg_alterckey
diff --git a/src/common/meson.build b/src/common/meson.build
index 1c9b8a3a01..102b4d7b23 100644
--- a/src/common/meson.build
+++ b/src/common/meson.build
@@ -12,6 +12,7 @@ common_sources = files(
   'ip.c',
   'jsonapi.c',
   'keywords.c',
+  'kmgr_utils.c',
   'kwlookup.c',
   'link-canary.c',
   'md5_common.c',
@@ -34,12 +35,14 @@ common_sources = files(
 
 if ssl.found()
   common_sources += files(
+'cipher_openssl.c',
 'cryptohash_openssl.c',
 'hmac_openssl.c',
 'protocol_openssl.c',
   )
 else
   common_sources += files(
+'cipher.c',
 'cryptohash.c',
 'hmac.c',
 'md5.c',
-- 
2.25.1



Re: Report checkpoint progress with pg_stat_progress_checkpoint (was: Report checkpoint progress in server logs)

2022-11-15 Thread Robert Haas
On Fri, Nov 4, 2022 at 4:27 AM Drouvot, Bertrand
 wrote:
> Please find attached a rebase in v7.

I don't think it's a good thing that this patch is using the
progress-reporting machinery. The point of that machinery is that we
want any backend to be able to report progress for any command it
happens to be running, and we don't know which command that will be at
any given point in time, or how many backends will be running any
given command at once. So we need some generic set of counters that
can be repurposed for whatever any particular backend happens to be
doing right at the moment.

But none of that applies to the checkpointer. Any information about
the checkpointer that we want to expose can just be advertised in a
dedicated chunk of shared memory, perhaps even by simply adding it to
CheckpointerShmemStruct. Then you can give the fields whatever names,
types, and sizes you like, and you don't have to do all of this stuff
with mapping down to integers and back. The only real disadvantage
that I can see is that you then have to think a bit harder about what the
concurrency model is here, and maybe you end up reimplementing
something similar to what the progress-reporting stuff does for you,
and *maybe* that is a sufficient reason to do it this way.

But I'm doubtful. This feels like a square-peg-round-hole situation.

--
Robert Haas
EDB: http://www.enterprisedb.com




Re: Meson doesn't define HAVE_LOCALE_T for mscv

2022-11-15 Thread Andres Freund
Hi,

Hm, the quoting was odd, making me think you had written a separate email
about the define issue. Hence the separate email...


On 2022-11-15 15:35:31 +0100, Juan José Santamaría Flecha wrote:
> On Tue, Nov 15, 2022 at 1:49 AM Andres Freund  wrote:
> > Hm. Is it right that the changes are only done for msvc? win32_port.h
> > defines the types for mingw as well afaict.

> Yes, it does, but configure does nothing with them, so adding those
> defines is a new feature for MinGW but a correction for MSVC.

Any chance you checked if autoconf already detects locale_t with mingw?
Possible that mingw supplies one of the relevant headers...

Otherwise it looks like a sensible improvement to me.

Greetings,

Andres Freund




Re: Meson doesn't define HAVE_LOCALE_T for mscv

2022-11-15 Thread Andres Freund
Hi,

On 2022-11-15 15:35:31 +0100, Juan José Santamaría Flecha wrote:
> I've seen that when building with meson on MinGW the output for version()
> is 'PostgreSQL 16devel on x86_64, compiled by gcc-12.2.0', which is not
> wrong but I cannot tell that it was done on MinGW. Should we include the
> 'host_system' in PG_VERSION_STR?

I don't think we should print mingw - that's really just redundant with
gcc. But including host_system seems like a good idea. Not sure why I didn't
do that.

Greetings,

Andres Freund




Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL

2022-11-15 Thread David Christensen
Enclosed is v9.

- code style consistency (FPI instead of FPW) internally.
- cleanup of no-longer needed checksum-related pieces from code and tests.
- test cleanup/simplification.
- other comment cleanup.

Passes all CI checks.

Best,

David


v9-0001-Teach-pg_waldump-to-extract-FPIs-from-the-WAL-str.patch
Description: Binary data


Re: HOT chain validation in verify_heapam()

2022-11-15 Thread Andres Freund
Hi,

On 2022-11-15 11:36:21 -0500, Robert Haas wrote:
> On Mon, Nov 14, 2022 at 5:02 PM Andres Freund  wrote:
> > It seems like we should do a bit more validation within a chain of
> > tuples. E.g. that no live tuple can follow an !DidCommit xmin?
> 
> I think this check is already present in stronger form. If we see a
> !DidCommit xmin, the xmin of the next tuple in the chain not only can't be
> committed, but had better be the same.

As I think I mentioned before, I don't think the "better be the same" aspect
is correct, think subxacts. E.g.

off 0: xmin: top, xmax: child_1
off 1: xmin: child_1, xmax: invalid

If top hasn't committed yet, the current logic afaict will warn about this
situation, no? And I don't think we can generally determine the subxid parent at
this point, unfortunately (might have truncated subtrans).


Different aspect: Is it ok that we use TransactionIdDidCommit() without a
preceding IsInProgress() check?


I do think there's some potential for additional checks that don't run into
the above issue, e.g. checking that no in-progress xids follow an explicitly
aborted xact, that a committed xid can't follow an uncommitted xid etc.

Greetings,

Andres Freund




Re: Moving forward with TDE

2022-11-15 Thread David Christensen


> On Nov 15, 2022, at 1:08 PM, Jacob Champion  wrote:
> 
> On Mon, Oct 24, 2022 at 9:29 AM David Christensen
>  wrote:
>> I would love to open a discussion about how to move forward and get
>> some of these features built out.  The historical threads here are
>> quite long and complicated; is there a "current state" other than the
>> wiki that reflects the general thinking on this feature?  Any major
>> developments in direction that would not be reflected in the code from
>> May 2021?
> 
> I don't think the patchset here has incorporated the results of the
> discussion [1] that happened at the end of 2021. For example, it looks
> like AES-CTR is still in use for the pages, which I thought was
> already determined to be insufficient.

Good to know about the next steps, thanks. 

> The following next steps were proposed in that thread:
> 
>> 1. modify temporary file I/O to use a more centralized API
>> 2. modify the existing cluster file encryption patch to use XTS with a
>>   IV that uses more than the LSN
>> 3. add XTS regression test code like CTR
>> 4. create WAL encryption code using CTR
> 
> Does this patchset need review before those steps are taken (or was
> there additional conversation/work that I missed)?

This was just a refresh of the old patches on the wiki to work as written on
HEAD. If there are known TODOs here, then that work still needs to be done.

I was going to take 2) and Stephen was going to work on 3); I am not sure about 
the other two but will review the thread you pointed to. Thanks for pointing 
that out. 

David




