Re: [PoC] Improve dead tuple storage for lazy vacuum

2023-12-18 Thread John Naylor
On Tue, Dec 19, 2023 at 12:37 PM Masahiko Sawada  wrote:
>
> On Mon, Dec 18, 2023 at 3:41 PM John Naylor  wrote:
> > Let's do it in just one place. In TidStoreCreate(), do
> >
> > /* clamp max_bytes to at least the size of the empty tree with
> > allocated blocks, so it doesn't immediately appear full */
> > ts->control->max_bytes = Max(max_bytes, {rt, shared_rt}_memory_usage);
> >
> > Then we can get rid of all the worry about 1MB/2MB, 64kB, 70kB -- all that.
>
> But doesn't it mean that even if we create a shared tidstore with
> small memory, say 64kB, it actually uses 1MB?

This sounds like an argument for controlling the minimum DSA segment
size. (I'm not really in favor of that, but open to others' opinion)

I wasn't talking about that above -- I was saying we should have only
one place where we clamp max_bytes so that the tree doesn't
immediately appear full.




Re: Change GUC hashtable to use simplehash?

2023-12-18 Thread Jeff Davis
On Mon, 2023-12-18 at 13:39 +0700, John Naylor wrote:
> For now just two:
> v10-0002 is Jeff's change to the search path cache, but with the
> chunked interface that I found to be faster.

Did you consider specializing for the case of an aligned pointer? If
it's a string (c string or byte string) it's almost always going to be
aligned, right?

I hacked up a patch (attached). I lost track of which benchmark we're
using to test the performance, but when I test in a loop it seems
substantially faster.

It reads past the NUL byte, but only to the next alignment boundary,
which I think is OK (though I think I'd need to fix the patch for when
maxalign < 8).

Regards,
Jeff Davis

From 055d5cc24404584fd98109fabdcf83348e5c49b4 Mon Sep 17 00:00:00 2001
From: Jeff Davis 
Date: Mon, 18 Dec 2023 16:44:27 -0800
Subject: [PATCH v10jd] Optimize hash function further.

---
 src/backend/catalog/namespace.c  | 46 +---
 src/include/common/hashfn_unstable.h |  9 ++
 2 files changed, 51 insertions(+), 4 deletions(-)

diff --git a/src/backend/catalog/namespace.c b/src/backend/catalog/namespace.c
index c23f46aca3..368a7fabec 100644
--- a/src/backend/catalog/namespace.c
+++ b/src/backend/catalog/namespace.c
@@ -245,15 +245,44 @@ static bool MatchNamedCall(HeapTuple proctup, int nargs, List *argnames,
  * offers a more convenient API.
  */
 
+/* From: https://graphics.stanford.edu/~seander/bithacks.html#ZeroInWord */
+#define haszero64(v) \
+	(((v) - 0x0101010101010101UL) & ~(v) & 0x8080808080808080UL)
+
 static inline uint32
-spcachekey_hash(SearchPathCacheKey key)
+cstring_hash_aligned(const char *str, uint64 seed)
+{
+	const char *const start = str;
+	const char *buf = start;
+	int chunk_len = 0;
+	fasthash_state hs;
+
+	fasthash_init(&hs, FH_UNKNOWN_LENGTH, seed);
+
+	Assert(PointerIsAligned(start, uint64));
+	while (!haszero64(*(uint64 *)buf))
+	{
+		fasthash_accum64(&hs, buf);
+		buf += sizeof(uint64);
+	}
+
+	while (buf[chunk_len] != '\0')
+		chunk_len++;
+	fasthash_accum(&hs, buf, chunk_len);
+	buf += chunk_len;
+
+	return fasthash_final32(&hs, buf - start);
+}
+
+static inline uint32
+cstring_hash_unaligned(const char *str, uint64 seed)
 {
-	const char *const start = key.searchPath;
-	const char *buf = key.searchPath;
+	const char *const start = str;
+	const char *buf = str;
 	fasthash_state hs;
 
 	/* WIP: maybe roleid should be mixed in normally */
-	fasthash_init(&hs, FH_UNKNOWN_LENGTH, key.roleid);
+	fasthash_init(&hs, FH_UNKNOWN_LENGTH, seed);
 	while (*buf)
 	{
 		int			chunk_len = 0;
@@ -269,6 +298,15 @@ spcachekey_hash(SearchPathCacheKey key)
 	return fasthash_final32(&hs, buf - start);
 }
 
+static inline uint32
+spcachekey_hash(SearchPathCacheKey key)
+{
+	if (PointerIsAligned(key.searchPath, uint64))
+		return cstring_hash_aligned(key.searchPath, key.roleid);
+	else
+		return cstring_hash_unaligned(key.searchPath, key.roleid);
+}
+
 static inline bool
 spcachekey_equal(SearchPathCacheKey a, SearchPathCacheKey b)
 {
diff --git a/src/include/common/hashfn_unstable.h b/src/include/common/hashfn_unstable.h
index bf1dbee28d..553fab0415 100644
--- a/src/include/common/hashfn_unstable.h
+++ b/src/include/common/hashfn_unstable.h
@@ -105,6 +105,15 @@ fasthash_combine(fasthash_state *hs)
 	hs->accum = 0;
 }
 
+/* Accumulate 8 bytes from aligned pointer and combine it into the hash */
+static inline void
+fasthash_accum64(fasthash_state *hs, const char *ptr)
+{
+	Assert(PointerIsAligned(ptr, uint64));
+	hs->accum = *(uint64 *)ptr;
+	fasthash_combine(hs);
+}
+
 /* Accumulate up to 8 bytes of input and combine it into the hash */
 static inline void
 fasthash_accum(fasthash_state *hs, const char *k, int len)
-- 
2.34.1



Re: Proposal to add page headers to SLRU pages

2023-12-18 Thread Li, Yong
> This work is being done in file.c – it seems to me the proper way to
> proceed would be to continue writing on-disk upgrade logic here.

> Besides that this looks good to me, would like to hear what others have to 
> say.

Thank you, Rishu, for taking the time to review the code.  I've updated the patch
and moved the on-disk upgrade logic to pg_upgrade/file.c.

I have also added this thread to the current Commitfest and hope this patch
will be part of the 17 release.

The commitfest link:
https://commitfest.postgresql.org/46/4709/


Regards,
Yong




slur_page_header_v2.patch


Re: "pgoutput" options missing on documentation

2023-12-18 Thread Amit Kapila
On Tue, Dec 19, 2023 at 12:07 PM Peter Smith  wrote:
>
> On Tue, Dec 19, 2023 at 1:35 AM Emre Hasegeli  wrote:
> >
> > > Fair enough. I think we should push your first patch only in HEAD as
> > > this is a minor improvement over the current behaviour. What do you
> > > think?
> >
> > I agree.
>
> Patch 0001
>
> AFAICT parse_output_parameters possible errors are never tested. For
> example, there is no code coverage [1] touching any of these ereports.
>
> IMO there should be some simple test cases -- I am happy to create
> some tests if you agree they should exist.
>

I don't think having tests for all sorts of error checking will add
much value as compared to the overhead they bring.

> ~~~
>
> While looking at the function parse_output_parameters() I noticed that
> if an unrecognised option is passed the function emits an elog instead
> of an ereport
>

We don't expect an unrecognized option here, and for such a thing we use
elog in the code. See the similar usage in
parseCreateReplSlotOptions().

I think we should move on to the 0002 patch now. For that one, I suggest
preparing separate back-branch patches.

-- 
With Regards,
Amit Kapila.




Re: [DOC] Introducing Quick Start Guide to PL/pgSQL and PL/Python Documentation

2023-12-18 Thread Japin Li


On Sat, 16 Dec 2023 at 18:49, Ishaan Adarsh  wrote:
> Hi
>
> I have made some documentation enhancements for PL/pgSQL and PL/Python
> sections. The changes include the addition of a Quick Start Guide to
> facilitate a smoother onboarding experience for users.
>
> Patch File Name:
> 0001-plpyhton-plpgsql-docu-changes.patch
>
> Patch Description:
> This patch introduces a Quick Start Guide to the documentation for PL/pgSQL
> and PL/Python. The Quick Start Guide provides users with a step-by-step
> walkthrough of setting up and using these procedural languages. The goal is
> to improve user accessibility and streamline the learning process.
>
> Changes Made:
> 1. Added a new section titled "Quick Start Guide" to both PL/pgSQL and
> PL/Python documentation.
> 2. Included step-by-step instructions for users to get started with these
> procedural languages.
> 3. Provided explanations, code snippets, and examples to illustrate key
> concepts.
>
> Discussion Points:
> I am seeking your feedback on the following aspects:
> - Clarity and completeness of the Quick Start Guide
> - Any additional information or examples that could enhance the guide
> - Suggestions for improving the overall documentation structure
>
> Your insights and suggestions are highly valuable, and I appreciate your
> time in reviewing this documentation enhancement.

1.
It seems you are missing a tag in plpgsql "Create the Makefile":

+  
+Create the Makefile
+
+
+  Create a Makefile in the pg_plpgsql_ext directory 
with the following content:
+

2.
We expect to use CREATE EXTENSION to load the extension; should we add the
following to extension--version.sql?

-- complain if script is sourced in psql, rather than via CREATE EXTENSION
\echo Use "CREATE EXTENSION pair" to load this file. \quit

--
Regards,
Japin Li
ChengDu WenWu Information Technology Co., Ltd.




Re: "pgoutput" options missing on documentation

2023-12-18 Thread Peter Smith
On Tue, Dec 19, 2023 at 1:35 AM Emre Hasegeli  wrote:
>
> > Fair enough. I think we should push your first patch only in HEAD as
> > this is a minor improvement over the current behaviour. What do you
> > think?
>
> I agree.

Patch 0001

AFAICT parse_output_parameters possible errors are never tested. For
example, there is no code coverage [1] touching any of these ereports.

IMO there should be some simple test cases -- I am happy to create
some tests if you agree they should exist.

~~~

While looking at the function parse_output_parameters() I noticed that
if an unrecognised option is passed, the function emits an elog instead
of an ereport:

--
test_pub=# SELECT * FROM pg_logical_slot_get_changes('test_slot_v1',
NULL, NULL, 'banana', '1');
2023-12-19 17:08:21.627 AEDT [8921] ERROR:  unrecognized pgoutput option: banana
2023-12-19 17:08:21.627 AEDT [8921] CONTEXT:  slot "test_slot_v1",
output plugin "pgoutput", in the startup callback
2023-12-19 17:08:21.627 AEDT [8921] STATEMENT:  SELECT * FROM
pg_logical_slot_get_changes('test_slot_v1', NULL, NULL, 'banana',
'1');
ERROR:  unrecognized pgoutput option: banana
CONTEXT:  slot "test_slot_v1", output plugin "pgoutput", in the startup callback
--

But that doesn't seem right. AFAIK elog messages use errmsg_internal
so this message would not get translated.

PSA a patch to fix that.

==
[1] code coverage --
https://coverage.postgresql.org/src/backend/replication/pgoutput/pgoutput.c.gcov.html

Kind Regards,
Peter Smith.
Fujitsu Australia


parse_output_parameters.diff


Update the comment in nodes.h to cover Cardinality

2023-12-18 Thread Richard Guo
By chance I discovered that the comment for the typedefs of "double"s
does not cover Cardinality.  Should we update that comment accordingly,
maybe something like below?

- * Typedefs for identifying qualifier selectivities and plan costs as such.
- * These are just plain "double"s, but declaring a variable as Selectivity
- * or Cost makes the intent more obvious.
+ * Typedefs for identifying qualifier selectivities, plan costs and
+ * estimated rows or other count as such.  These are just plain "double"s,
+ * but declaring a variable as Selectivity, Cost or Cardinality makes the
+ * intent more obvious.

Thanks
Richard


v1-0001-Add-comment-for-Cardinality-typedef.patch


Re: Synchronizing slots from primary to standby

2023-12-18 Thread Amit Kapila
On Tue, Dec 19, 2023 at 4:51 AM Peter Smith  wrote:
>
>
> ==
> doc/src/sgml/system-views.sgml
>
> 3.
> +  
> +  The hot standby can have any of these sync_state values for the slots 
> but
> +  on a hot standby, the slots with state 'r' and 'i' can neither be used
> +  for logical decoding nor dropped by the user.
> +  The sync_state has no meaning on the primary server; the primary
> +  sync_state value is default 'n' for all slots but may (if leftover
> +  from a promoted standby)  also be 'r'.
> +  
>
> I still feel we are exposing too much useless information about the
> primary server values.
>
> Isn't it sufficient to just say "The sync_state values have no meaning
> on a primary server.", and not bother to mention what those
> meaningless values might be -- e.g. if they are meaningless then who
> cares what they are or how they got there?
>

I feel it would be good to mention somewhere that primary can have
slots in 'r' state, if not here, some other place.

>
> 7.
> +/*
> + * Exit the slot sync worker with given exit-code.
> + */
> +static void
> +slotsync_worker_exit(const char *msg, int code)
> +{
> + ereport(LOG, errmsg("%s", msg));
> + proc_exit(code);
> +}
>
> This could be written differently (don't pass the exit code, instead
> pass a bool) like:
>
> static void
> slotsync_worker_exit(const char *msg, bool restart_worker)
>
> By doing it this way, you can keep the special exit code values (0,1)
> within this function where you can comment all about them instead of
> having scattered comments about exit codes in the callers.
>
> SUGGESTION
> ereport(LOG, errmsg("%s", msg));
> /* <comment about restart or not> */
> proc_exit(restart_worker ? 1 : 0);
>

Hmm, I don't see the need for this function in the first place. We can
use proc_exit in the two callers directly.

-- 
With Regards,
Amit Kapila.




Re: Making the initial and maximum DSA segment sizes configurable

2023-12-18 Thread Masahiko Sawada
Hi,

On Wed, Mar 22, 2023 at 12:15 AM Masahiko Sawada  wrote:
>
> Hi all,
>
> I started this new thread from another thread[1] where we're
> discussing a new storage for TIDs, TidStore, since we found a
> difficulty about the memory usage limit for TidStores on DSA.
>
> TidStore is a new data structure to efficiently store TIDs, backed by
> a radix tree. In the patch series proposed on that thread, in addition
> to radix tree and TidStore, there is another patch for lazy (parallel)
> vacuum to replace the array of dead tuple TIDs with a TidStore. To
> support parallel vacuum, radix tree (and TidStore) can be created on a
> local memory as well as on DSA. Also, it has memory usage limit
> functionality; we can specify the memory limit (e.g.,
> maintenance_work_mem) to TidStoreCreate() function. Once the total DSA
> segment size (area->control->total_segment_size) exceeds the limit,
> TidStoreIsFull() returns true. The lazy vacuum can continue scanning
> heap blocks to collect dead tuple TIDs until TidStoreIsFull() returns
> true. Currently lazy vacuum is the sole user of TidStore, but maybe it
> can be used by other code such as tidbitmap.c, where it will be limited
> by work_mem.
>
> During the development, we found out that DSA memory growth is
> unpredictable, leading to inefficient memory limitation.
>
> DSA is built on top of DSM segments and it manages a set of DSM
> segments, adding new segments as required and detaching them when they
> are no longer needed. The DSA segment starts with 1MB in size and a
> new segment size is at least big enough to follow a geometric series
> that approximately doubles the total storage each time we create a new
> segment. Because of this fact, it's not efficient to simply compare
> the memory limit to the total segment size. For example, if
> maintenance_work_mem is 512MB, the total segment size will be like:
>
> 2 * (1 + 2 + 4 + 8 + 16 + 32 + 64 + 128) = 510MB -> less than the
> limit, continue heap scan.
>
> 2 * (1 + 2 + 4 + 8 + 16 + 32 + 64 + 128) + 256 = 766MB -> stop (exceeds
> the limit by 254MB).
>
> One might think we can use dsa_set_size_limit(), but that doesn't work;
> lazy vacuum ends up with an error. If we set DSA_ALLOC_NO_OOM, we might
> end up stopping the insertion halfway.
>
> Besides excessively allocating memory, since the initial DSM segment
> size is fixed at 1MB, memory usage of a shared TidStore will start from
> 1MB+. This is higher than the minimum values of both work_mem and
> maintenance_work_mem, 64kB and 1MB respectively. Increasing the
> minimum m_w_m to 2MB might be acceptable but not for work_mem.
>
> Researching possible solutions, we found that aset.c also has a
> similar characteristic; allocates an 8K block (by default) upon the
> first allocation in a context, and doubles that size for each
> successive block request. But we can specify the initial block size
> and max blocksize. This made me think of an idea to specify both to
> DSA and both values are calculated based on m_w_m. I've attached the
> patch for this idea. The changes to dsa.c are straightforward since
> dsa.c already uses macros DSA_INITIAL_SEGMENT_SIZE and
> DSA_MAX_SEGMENT_SIZE. I just made these values configurable.
>
> FYI with this patch, we can create a DSA in parallel_vacuum_init()
> with initial and maximum block sizes as follows:
>
> initial block size = min(m_w_m / 4, 1MB)
> max block size = max(m_w_m / 8, 8MB)
>
> In most cases, we can start with a 1MB initial segment, the same as
> before. For larger memory, the heap scan stops after DSA allocates
> 1.25 times more memory than m_w_m. For example, if m_w_m = 512MB, the
> initial and maximum segment sizes are 1MB and 64MB respectively,
> and then DSA allocates the segments as follows until heap scanning
> stops:
>
> 2 * (1 + 2 + 4 + 8 + 16 + 32 + 64) + (64 * 4) = 510MB -> less than the
> limit, continue heap scan.
>
> 2 * (1 + 2 + 4 + 8 + 16 + 32 + 64) + (64 * 5) = 574MB -> stop
> (allocated additional 62MB).
>
> It also works with smaller memory; if the limit is 1MB, we start with
> a 256kB initial segment and heap scanning stops after DSA has allocated
> 1.5MB (= 256kB + 256kB + 512kB + 512kB).
>
> There is room for considering better formulas for initial and maximum
> block sizes but making both values configurable is a promising idea.
> And the analogous behavior to aset could be a good thing for
> readability and maintainability. There is another test result where I
> used this idea on top of a radix tree[2].
>
> We need to consider the total number of allocated DSA segments as the
> total number of DSM segments available on the system is fixed[3]. But
> it seems not problematic even with this patch since we allocate only a
> few additional segments (in above examples 17 segs vs. 19 segs). There
> was no big difference also in performance[2].
>

The last time I posted this email the timing seemed bad, since it was
close to the feature freeze and the email was very long. The tidstore
and radix tree developments are still in-progress.

Re: Add a perl function in Cluster.pm to generate WAL

2023-12-18 Thread Bharath Rupireddy
On Tue, Dec 19, 2023 at 9:51 AM Michael Paquier  wrote:
>
> On Mon, Dec 18, 2023 at 08:48:09AM -0300, Euler Taveira wrote:
> > It is cheaper.
>
> Agreed that this could just use a set of pg_logical_emit_message()
> when jumping across N segments.

Thanks. I missed the point of using pg_logical_emit_message() over
CREATE .. DROP TABLE to generate WAL. And I agree that it's better
and relatively cheaper in terms of the amount of WAL generated.

> Another thing that seems quite
> important to me is to force a flush of WAL with the last segment
> switch, and the new "flush" option of pg_logical_emit_message() can
> be very handy for this purpose.

I used pg_logical_emit_message() in non-transactional mode without
needing an explicit WAL flush as the pg_switch_wal() does a WAL flush
at the end [1].

Attached v4 patch.

[1]
/*
 * If this was an XLOG_SWITCH record, flush the record and the empty
 * padding space that fills the rest of the segment, and perform
 * end-of-segment actions (eg, notifying archiver).
 */
if (class == WALINSERT_SPECIAL_SWITCH)
{
TRACE_POSTGRESQL_WAL_SWITCH();
XLogFlush(EndPos);

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From b7fa7545eb983aaf92e3d7e99bdf76ef42b8e40e Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Tue, 19 Dec 2023 05:49:20 +
Subject: [PATCH v4] Add a TAP test function to generate WAL

This commit adds a perl function in Cluster.pm to generate WAL.
Some TAP tests are now using their own way to generate WAL.
Generalizing this functionality enables multiple TAP tests to
reuse the functionality.
---
 src/test/perl/PostgreSQL/Test/Cluster.pm  | 22 +
 src/test/recovery/t/001_stream_rep.pl |  6 +--
 src/test/recovery/t/019_replslot_limit.pl | 48 +--
 .../t/035_standby_logical_decoding.pl |  7 +--
 4 files changed, 38 insertions(+), 45 deletions(-)

diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index a020377761..ad575ed6d6 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -3178,6 +3178,28 @@ sub create_logical_slot_on_standby
 
 =pod
 
+=item $node->advance_wal($n)
+
+Advance WAL of given node by $n segments
+
+=cut
+
+sub advance_wal
+{
+	my ($self, $n) = @_;
+
+	# Advance by $n segments (= (wal_segment_size * $n) bytes).
+	for (my $i = 0; $i < $n; $i++)
+	{
+		$self->safe_psql('postgres', qq{
+			SELECT pg_logical_emit_message(false, '', 'foo');
+			SELECT pg_switch_wal();
+			});
+	}
+}
+
+=pod
+
 =back
 
 =cut
diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl
index 95f9b0d772..f0de921b4b 100644
--- a/src/test/recovery/t/001_stream_rep.pl
+++ b/src/test/recovery/t/001_stream_rep.pl
@@ -522,11 +522,7 @@ $node_primary->safe_psql('postgres',
 my $segment_removed = $node_primary->safe_psql('postgres',
 	'SELECT pg_walfile_name(pg_current_wal_lsn())');
 chomp($segment_removed);
-$node_primary->psql(
-	'postgres', "
-	CREATE TABLE tab_phys_slot (a int);
-	INSERT INTO tab_phys_slot VALUES (generate_series(1,10));
-	SELECT pg_switch_wal();");
+$node_primary->advance_wal(1);
 my $current_lsn =
   $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn();");
 chomp($current_lsn);
diff --git a/src/test/recovery/t/019_replslot_limit.pl b/src/test/recovery/t/019_replslot_limit.pl
index 7d94f15778..e4b75c6545 100644
--- a/src/test/recovery/t/019_replslot_limit.pl
+++ b/src/test/recovery/t/019_replslot_limit.pl
@@ -59,7 +59,7 @@ $result = $node_primary->safe_psql('postgres',
 is($result, "reserved|t", 'check the catching-up state');
 
 # Advance WAL by five segments (= 5MB) on primary
-advance_wal($node_primary, 1);
+$node_primary->advance_wal(1);
 $node_primary->safe_psql('postgres', "CHECKPOINT;");
 
 # The slot is always "safe" when fitting max_wal_size
@@ -69,7 +69,7 @@ $result = $node_primary->safe_psql('postgres',
 is($result, "reserved|t",
 	'check that it is safe if WAL fits in max_wal_size');
 
-advance_wal($node_primary, 4);
+$node_primary->advance_wal(4);
 $node_primary->safe_psql('postgres', "CHECKPOINT;");
 
 # The slot is always "safe" when max_slot_wal_keep_size is not set
@@ -100,7 +100,7 @@ $result = $node_primary->safe_psql('postgres',
 is($result, "reserved", 'check that max_slot_wal_keep_size is working');
 
 # Advance WAL again then checkpoint, reducing remain by 2 MB.
-advance_wal($node_primary, 2);
+$node_primary->advance_wal(2);
 $node_primary->safe_psql('postgres', "CHECKPOINT;");
 
 # The slot is still working
@@ -118,7 +118,7 @@ $node_standby->stop;
 $result = $node_primary->safe_psql('postgres',
 	"ALTER SYSTEM SET wal_keep_size to '8MB'; SELECT pg_reload_conf();");
 # Advance WAL again then checkpoint, reducing remain by 6 MB.
-advance_wal($node_primary, 6);
+$node_primary->advance_wal(6);
 $result = $node_primary->safe_psql('postgres',

Re: [PoC] Improve dead tuple storage for lazy vacuum

2023-12-18 Thread Masahiko Sawada
On Mon, Dec 18, 2023 at 3:41 PM John Naylor  wrote:
>
> On Fri, Dec 15, 2023 at 3:15 PM Masahiko Sawada  wrote:
> >
> > On Fri, Dec 15, 2023 at 10:30 AM John Naylor  
> > wrote:
>
> > >  parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
> > > - int nrequested_workers, int max_items,
> > > - int elevel, BufferAccessStrategy bstrategy)
> > > + int nrequested_workers, int vac_work_mem,
> > > + int max_offset, int elevel,
> > > + BufferAccessStrategy bstrategy)
> > >
> > > It seems very strange to me that this function has to pass the
> > > max_offset. In general, it's been simpler to assume we have a constant
> > > max_offset, but in this case that fact is not helping. Something to
> > > think about for later.
> >
> > max_offset was previously used in old TID encoding in tidstore. Since
> > tidstore has entries for each block, I think we no longer need it.
>
> It's needed now to properly size the allocation of TidStoreIter which
> contains...
>
> +/* Result struct for TidStoreIterateNext */
> +typedef struct TidStoreIterResult
> +{
> + BlockNumber blkno;
> + int num_offsets;
> + OffsetNumber offsets[FLEXIBLE_ARRAY_MEMBER];
> +} TidStoreIterResult;
>
> Maybe we can palloc the offset array to be "almost always" big enough,
> with logic to resize if needed? If it's not too hard, it seems worth it to
> avoid churn in the parameter list.

Yes, I was thinking of that.

>
> > > v45-0010:
> > >
> > > Thinking about this some more, I'm not sure we need to do anything
> > > different for the *starting* segment size. (Controlling *max* size
> > > does seem important, however.) For the corner case of m_w_m = 1MB,
> > > it's fine if vacuum quits pruning immediately after (in effect) it
> > > finds the DSA has gone to 2MB. It's not worth bothering with, IMO. If
> > > the memory accounting starts >1MB because we're adding the trivial
> > > size of some struct, let's just stop doing that. The segment
> > > allocations are what we care about.
> >
> > IIUC it's for work_mem, whose the minimum value is 64kB.
> >
> > >
> > > v45-0011:
> > >
> > > + /*
> > > + * max_bytes is forced to be at least 64kB, the current minimum valid
> > > + * value for the work_mem GUC.
> > > + */
> > > + max_bytes = Max(64 * 1024L, max_bytes);
> > >
> > > Why?
> >
> > This is to avoid creating a radix tree within very small memory. The
> > minimum work_mem value is a reasonable lower bound that PostgreSQL
> > uses internally. It's actually copied from tuplesort.c.
>
> There is no explanation for why it should be done like tuplesort.c. Also...
>
> - tree->leaf_ctx = SlabContextCreate(ctx,
> -"radix tree leaves",
> -RT_SLAB_BLOCK_SIZE(sizeof(RT_VALUE_TYPE)),
> -sizeof(RT_VALUE_TYPE));
> + tree->leaf_ctx = SlabContextCreate(ctx,
> +"radix tree leaves",
> +Min(RT_SLAB_BLOCK_SIZE(sizeof(RT_VALUE_TYPE)),
> +work_mem),
> +sizeof(RT_VALUE_TYPE));
>
> At first, my eyes skipped over this apparent re-indent, but hidden
> inside here is another (undocumented) attempt to clamp the size of
> something. There are too many of these sprinkled in various places,
> and they're already a maintenance hazard -- a different one was left
> behind in v45-0011:
>
> @@ -201,6 +183,7 @@ TidStoreCreate(size_t max_bytes, int max_off,
> dsa_area *area)
> ts->control->max_bytes = max_bytes - (70 * 1024);
>   }
>
> Let's do it in just one place. In TidStoreCreate(), do
>
> /* clamp max_bytes to at least the size of the empty tree with
> allocated blocks, so it doesn't immediately appear full */
> ts->control->max_bytes = Max(max_bytes, {rt, shared_rt}_memory_usage);
>
> Then we can get rid of all the worry about 1MB/2MB, 64kB, 70kB -- all that.

But doesn't it mean that even if we create a shared tidstore with
small memory, say 64kB, it actually uses 1MB?

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

2023-12-18 Thread Dilip Kumar
On Mon, Dec 18, 2023 at 11:00 PM Robert Haas  wrote:
>
> On Mon, Dec 18, 2023 at 12:04 PM Robert Haas  wrote:
> > certain sense they are competing for the same job. However, they do
> > aim to alleviate different TYPES of contention: the group XID update
> > stuff should be most valuable when lots of processes are trying to
> > update the same page, and the banks should be most valuable when there
> > is simultaneous access to a bunch of different pages. So I'm not
> > convinced that this patch is a reason to remove the group XID update
> > mechanism, but someone might argue otherwise.
>
> Hmm, but, on the other hand:
>
> Currently all readers and writers are competing for the same LWLock.
> But with this change, the readers will (mostly) no longer be competing
> with the writers. So, in theory, that might reduce lock contention
> enough to make the group update mechanism pointless.

Thanks for your feedback. I agree that with a bank-wise lock we might
not need group updates for some of the use cases you mention, where
readers and writers are contending on the centralized lock, because in
most cases the readers will be distributed across different banks.
OTOH, there are use cases where the writer commit is the bottleneck (on
the SLRU lock), like pgbench simple-update or TPC-B, and there we will
still benefit from the group update. During group update testing we
have seen benefits in such a scenario[1] at high client counts. So, as
per my understanding, by distributing the SLRU locks there are
scenarios where we will no longer need the group update, but OTOH there
are also scenarios where we will still benefit from it.


[1] 
https://www.postgresql.org/message-id/CAFiTN-u-XEzhd%3DhNGW586fmQwdTy6Qy6_SXe09tNB%3DgBcVzZ_A%40mail.gmail.com



--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Synchronizing slots from primary to standby

2023-12-18 Thread Amit Kapila
On Tue, Dec 19, 2023 at 6:58 AM Peter Smith  wrote:
>
> Here are some comments for the patch v49-0002.
>
> (This is in addition to my review comments for v48-0002 [1])
>
> ==
> src/backend/access/transam/xlogrecovery.c
>
>
> 1. FinishWalRecovery
>
> + *
> + * We do not update the sync_state from READY to NONE here, as any failed
> + * update could leave some slots in the 'NONE' state, causing issues during
> + * slot sync after restarting the server as a standby. While updating after
> + * switching to the new timeline is an option, it does not simplify the
> + * handling for both READY and NONE state slots. Therefore, we retain the
> + * READY state slots after promotion as they can provide useful information
> + * about their origin.
> + */
>
> Do you know if that wording is correct? e.g., If you were updating
> from READY to NONE and there was a failed update, that would leave
> some slots still in a READY state, right? So why does the comment say
> "could leave some slots in the 'NONE' state"?
>

The comment is correct because after a restart the server will start as
a 'standby', so 'READY'-marked slots are okay, but the slots that we
changed to 'NONE' would now appear as user-created slots, which would
be wrong.

-- 
With Regards,
Amit Kapila.




Re: Clang optimiser vs preproc.c

2023-12-18 Thread Thomas Munro
On Tue, Dec 19, 2023 at 11:42 AM Thomas Munro  wrote:
> Hrmph.  Well something weird is going on, but it might indeed involve
> me being confused about debug options of the compiler itself.  How can
> one find out which build options were used for clang/llvm compiler +
> libraries?  My earlier reports were from a little machine at home, so
> let's try again on an i9-9900 CPU @ 3.10GHz (a bit snappier) running
> Debian 12, again using packages from apt.llvm.org:
>
> 17 ~198s
> 16 ~14s
> 15 ~11s

And on another Debian machine (this time a VM) also using apt.llvm.org
packages, the huge ~3 minute time occurs with clang-16... hrrrnnnff...
seems like there must be some other variable here that I haven't
spotted yet...




Re: Add a perl function in Cluster.pm to generate WAL

2023-12-18 Thread Michael Paquier
On Mon, Dec 18, 2023 at 08:48:09AM -0300, Euler Taveira wrote:
> It is cheaper.

Agreed that this could just use a set of pg_logical_emit_message()
when jumping across N segments.  Another thing that seems quite
important to me is to force a flush of WAL with the last segment
switch, and the new "flush" option of pg_logical_emit_message() can
be very handy for this purpose.
--
Michael




Re: WAL Insertion Lock Improvements

2023-12-18 Thread Nathan Bossart
On Tue, Jul 25, 2023 at 04:43:16PM +0900, Michael Paquier wrote:
> 0001 has been now applied.  I have done more tests while looking at
> this patch since yesterday and was surprised to see higher TPS numbers
> on HEAD with the same tests as previously, and the patch was still
> shining with more than 256 clients.

I found this code when searching for callers that use atomic exchanges as
atomic writes with barriers (for a separate thread [0]).  Can't we use
pg_atomic_write_u64() here since the locking functions that follow should
serve as barriers?

I've attached a patch to demonstrate what I'm thinking.  This might be more
performant, although maybe less so after commit 64b1fb5.  Am I missing
something obvious here?  If not, I might rerun the benchmarks to see
whether it makes any difference.

[0] 
https://www.postgresql.org/message-id/flat/20231110205128.GB1315705%40nathanxps13

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 315a78cda9..b43972bd2e 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -1752,10 +1752,10 @@ LWLockUpdateVar(LWLock *lock, pg_atomic_uint64 *valptr, uint64 val)
 	PRINT_LWDEBUG("LWLockUpdateVar", lock, LW_EXCLUSIVE);
 
 	/*
-	 * Note that pg_atomic_exchange_u64 is a full barrier, so we're guaranteed
-	 * that the variable is updated before waking up waiters.
+	 * We rely on the barrier provided by LWLockWaitListLock() to ensure the
+	 * variable is updated before waking up waiters.
 	 */
-	pg_atomic_exchange_u64(valptr, val);
+	pg_atomic_write_u64(valptr, val);
 
 	proclist_init(&wakeup);
 
@@ -1881,10 +1881,10 @@ void
 LWLockReleaseClearVar(LWLock *lock, pg_atomic_uint64 *valptr, uint64 val)
 {
 	/*
-	 * Note that pg_atomic_exchange_u64 is a full barrier, so we're guaranteed
-	 * that the variable is updated before releasing the lock.
+	 * We rely on the barrier provided by LWLockRelease() to ensure the
+	 * variable is updated before releasing the lock.
 	 */
-	pg_atomic_exchange_u64(valptr, val);
+	pg_atomic_write_u64(valptr, val);
 
 	LWLockRelease(lock);
 }


Re: Improve eviction algorithm in ReorderBuffer

2023-12-18 Thread Masahiko Sawada
On Sun, Dec 17, 2023 at 11:40 AM Amit Kapila  wrote:
>
> On Fri, Dec 15, 2023 at 11:29 AM Masahiko Sawada  
> wrote:
> >
> > On Fri, Dec 15, 2023 at 12:37 PM Amit Kapila  
> > wrote:
> > >
> > > On Wed, Dec 13, 2023 at 6:01 AM Masahiko Sawada  
> > > wrote:
> > > >
> > >
> > > IIUC, you are giving preference to multiple list ideas as compared to
> > > (a) because you don't need to adjust the list each time the
> > > transaction size changes, is that right?
> >
> > Right.
> >
> > >  If so, I think there is a
> > > cost to keep that data structure up-to-date but it can help in
> > > reducing the number of times we need to serialize.
> >
> > Yes, there is a trade-off.
> >
> > What I don't want to do is to keep all transactions ordered since it's
> > too costly. The proposed idea uses multiple lists to keep all
> > transactions roughly ordered. The maintenance cost would be cheap
> > since each list is unordered.
> >
> > It might be a good idea to have a threshold to switch how to pick the
> > largest transaction based on the number of transactions in the
> > reorderbuffer. If there are many transactions, we can use the proposed
> > algorithm to find a possibly-largest transaction, otherwise use the
> > current way.
> >
>
> Yeah, that makes sense.
>
> > >
> > > >
> > > > > 1) A scenario where suppose you have one very large transaction that
> > > > > is consuming ~40% of the memory and 5-6 comparatively smaller
> > > > > transactions that are just above 10% of the memory limit.  And now for
> > > > > coming under the memory limit instead of getting 1 large transaction
> > > > > evicted out, we are evicting out multiple times.
> > > >
> > > > Given the large transaction list will have up to 10 transactions, I
> > > > think it's cheap to pick the largest transaction among them. It's O(N)
> > > > but N won't be large.
> > > >
> > > > > 2) Another scenario where all the transactions are under 10% of the
> > > > > memory limit but let's say there are some transactions are consuming
> > > > > around 8-9% of the memory limit each but those are not very old
> > > > > transactions whereas there are certain old transactions which are
> > > > > fairly small and consuming under 1% of memory limit and there are many
> > > > > such transactions.  So how it would affect if we frequently select
> > > > > many of these transactions to come under memory limit instead of
> > > > > selecting a couple of large transactions which are consuming 8-9%?
> > > >
> > > > Yeah, probably we can do something for small transactions (i.e. small
> > > > and on-memory transactions). One idea is to pick the largest
> > > > transaction among them by iterating over all of them. Given that the
> > > > more transactions are evicted, the less transactions the on-memory
> > > > transaction list has, unlike the current algorithm, we would still
> > > > win. Or we could even split it into several sub-lists in order to
> > > > reduce the number of transactions to check. For example, splitting it
> > > > into two lists: transactions consuming 5% < and 5% >=  of the memory
> > > > limit, and checking the 5% >= list preferably.
> > > >
> > >
> > > Which memory limit are you referring to here? Is it 
> > > logical_decoding_work_mem?
> >
> > logical_decoding_work_mem.
> >
> > >
> > > > The cost for
> > > > maintaining these lists could increase, though.
> > > >
> > >
> > > Yeah, can't we maintain a single list of all xacts that are consuming
> > > equal to or greater than the memory limit? Considering that the memory
> > > limit is logical_decoding_work_mem, then I think just picking one
> > > transaction to serialize would be sufficient.
> >
> > IIUC we serialize a transaction when the sum of all transactions'
> > memory usage in the reorderbuffer exceeds logical_decoding_work_mem.
> > In what cases are multiple transactions consuming equal to or greater
> > than the logical_decoding_work_mem?
> >
>
> The individual transactions shouldn't cross
> 'logical_decoding_work_mem'. I got a bit confused by your proposal to
> maintain the lists: "...splitting it into two lists: transactions
> consuming 5% < and 5% >=  of the memory limit, and checking the 5% >=
> list preferably.". In the previous sentence, what did you mean by
> transactions consuming 5% >= of the memory limit? I got the impression
> that you are saying to maintain them in a separate transaction list
> which doesn't seems to be the case.

What I meant is that there would be three lists in total: the first
maintains the transactions consuming more than 10% of
logical_decoding_work_mem, the second maintains the other transactions
consuming at least 5% of logical_decoding_work_mem, and the third
maintains the rest, consuming more than 0 but less than 5% of
logical_decoding_work_mem.
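
To illustrate the idea (this is only a rough sketch, not proposed code;
the class name, thresholds, and data layout are all hypothetical), the
three unordered lists could be bucketed by each transaction's share of
the memory limit, and eviction would scan only the first non-empty
bucket:

```python
class TxnBuckets:
    """Keep transactions "roughly ordered" with three unordered lists."""

    def __init__(self, limit):
        self.limit = limit  # stand-in for logical_decoding_work_mem
        # bucket 0: >= 10% of limit, bucket 1: >= 5%, bucket 2: the rest
        self.buckets = [[], [], []]

    def _bucket_for(self, size):
        if size >= self.limit * 0.10:
            return 0
        if size >= self.limit * 0.05:
            return 1
        return 2

    def add(self, name, size):
        # O(1) maintenance: no ordering is kept inside a bucket.
        self.buckets[self._bucket_for(size)].append((name, size))

    def pick_largest(self):
        # Check the >=10% list preferably, then >=5%, then the rest.
        # The scan is O(n) but only within one bucket, and the
        # large-transaction buckets stay small.
        for bucket in self.buckets:
            if bucket:
                victim = max(bucket, key=lambda t: t[1])
                bucket.remove(victim)
                return victim
        return None
```

The point of the sketch is the trade-off discussed above: inserts and
size updates stay cheap because each list is unordered, while the
bucket boundaries keep the pick "possibly largest" rather than exactly
largest.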

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: add non-option reordering to in-tree getopt_long

2023-12-18 Thread Tom Lane
Nathan Bossart  writes:
> On Mon, Dec 18, 2023 at 02:41:22PM -0500, Tom Lane wrote:
>> We just had a user complaint that seems to trace to exactly this
>> bogus reporting in pg_ctl [1].  Although I was originally not
>> very pleased with changing our getopt_long to do switch reordering,
>> I'm now wondering if we should back-patch these changes as bug
>> fixes.  It's probably not worth the risk, but ...

> I'm not too concerned about the risks of back-patching these commits, but
> if this 19-year-old bug was really first reported today, I'd agree that
> fixing it in the stable branches is probably not worth it.

Agreed, if it actually is 19 years old.  I'm wondering a little bit
if there could be some moderately-recent glibc behavior change
involved.  I'm not excited enough about it to go trawl their change
log, but we should keep our ears cocked for similar reports.

regards, tom lane




Re: Proposal to add page headers to SLRU pages

2023-12-18 Thread Bagga, Rishu
On Thu, Dec 8, 2023 at 1:36 AM Li, Yong  wrote:

>Given so many different approaches were discussed, I have started a  
>wiki to record and collaborate all efforts towards SLRU 
>improvements.  The wiki provides a concise overview of all the ideas 
>discussed and can serve as a portal for all historical 
>discussions.  Currently, the wiki summarizes four recent threads 
>ranging from identifier format change to page header change, to moving 
>SLRU into the main buffer pool, to reduce lock contention on SLRU 
>latches.  We can keep the patch related discussions in this thread and 
>use the wiki as a live document for larger scale collaborations.

>The wiki page is 
>here:  https://wiki.postgresql.org/wiki/SLRU_improvements

>Regarding the benefits of this patch, here is a detailed explanation:

1.  Checksum is added to each page, allowing us to verify if a page has
been corrupted when read from the disk.
2.  The ad-hoc LSN group structure is removed from the SLRU cache 
control data and is replaced by the page LSN in the page header. 
This allows us to use the same WAL protocol as used by pages in the 
main buffer pool: flush all redo logs up to the page LSN before 
flushing the page itself. If we move SLRU caches into the main 
buffer pool, this change fits naturally.
3.  It leaves further optimizations open. We can continue to pursue the 
goal of moving SLRU into the main buffer pool, or we can follow the 
lock partition idea. This change by itself does not conflict with 
either proposal.
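
As a rough illustration of points 1 and 2 above (a hypothetical sketch,
not the patch itself: the real SLRU page header would follow
PostgreSQL's PageHeaderData layout and pd_checksum algorithm, not the
header layout or CRC-32 used here):

```python
import struct
import zlib

# Toy page header: page LSN (8 bytes) followed by a checksum (4 bytes).
HDR = struct.Struct("<QI")

def write_page(lsn, payload):
    # Checksum covers the LSN and the payload, not the checksum field.
    crc = zlib.crc32(struct.pack("<Q", lsn) + payload)
    return HDR.pack(lsn, crc) + payload

def read_page(page):
    # Verify the checksum when the page is read back from disk.
    lsn, crc = HDR.unpack_from(page)
    payload = page[HDR.size:]
    if zlib.crc32(struct.pack("<Q", lsn) + payload) != crc:
        raise ValueError("SLRU page checksum failure")
    # The caller can apply the usual WAL rule: flush redo logs up to
    # `lsn` before flushing this page.
    return lsn, payload
```

The per-page LSN is what lets SLRU pages obey the same
flush-WAL-before-page protocol as pages in the main buffer pool.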

>Also, the patch is now complete and is ready for review.  All
>check-world tests including tap tests passed with this patch.




Hi Yong, 

I agree we should break the SLRU optimization effort into smaller
chunks, after having worked on some of the bigger patches and faced
difficulty making progress that way.

The patch looks mostly good to me; one thing I see differently about
the upgrade portion, though, is where we should keep the logic for
rewriting the CLOG files.

There is a precedent, introduced back in Postgres v9.6, for making
on-disk page format changes across versions in the visibility map: [1]

code comment: 
 * In versions of PostgreSQL prior to catversion 201603011, PostgreSQL's
 * visibility map included one bit per heap page; it now includes two.
 * When upgrading a cluster from before that time to a current PostgreSQL
 * version, we could refuse to copy visibility maps from the old cluster
 * to the new cluster; the next VACUUM would recreate them, but at the
 * price of scanning the entire table.  So, instead, we rewrite the old
 * visibility maps in the new format.  



This work is being done in file.c – it seems to me the proper way to 
proceed would be to continue writing on-disk upgrade logic here.


Besides that, this looks good to me; I would like to hear what others have to say.


Thanks, 

Rishu Bagga 

Amazon Web Services (AWS)

[1] 
https://github.com/postgres/postgres/commit/7087166a88fe0c04fc6636d0d6d6bea1737fc1fb



Re: Synchronizing slots from primary to standby

2023-12-18 Thread Peter Smith
Here are some comments for the patch v49-0002.

(This is in addition to my review comments for v48-0002 [1])

==
src/backend/access/transam/xlogrecovery.c


1. FinishWalRecovery

+ *
+ * We do not update the sync_state from READY to NONE here, as any failed
+ * update could leave some slots in the 'NONE' state, causing issues during
+ * slot sync after restarting the server as a standby. While updating after
+ * switching to the new timeline is an option, it does not simplify the
+ * handling for both READY and NONE state slots. Therefore, we retain the
+ * READY state slots after promotion as they can provide useful information
+ * about their origin.
+ */

Do you know if that wording is correct? e.g., If you were updating
from READY to NONE and there was a failed update, that would leave
some slots still in a READY state, right? So why does the comment say
"could leave some slots in the 'NONE' state"?

==
src/backend/replication/slot.c

2. ReplicationSlotAlter

+ /*
+ * Do not allow users to drop the slots which are currently being synced
+ * from the primary to the standby.
+ */
+ if (RecoveryInProgress() &&
+ MyReplicationSlot->data.sync_state != SYNCSLOT_STATE_NONE)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cannot alter replication slot \"%s\"", name),
+ errdetail("This slot is being synced from the primary server.")));
+

The comment looks wrong -- should say "Do not allow users to alter..."

==

3.
+##
+# Test that synchronized slot can neither be decoded nor dropped by the user
+##
+

3a,
/Test that synchronized slot/Test that a synchronized slot/

3b.
Isn't there a missing test? Should this part also check that it cannot
ALTER the replication slot being synced? e.g. test for the new v49
error message that was added in ReplicationSlotAlter()

~~~

4.
+# Disable hot_standby_feedback
+$standby1->safe_psql('postgres', 'ALTER SYSTEM SET
hot_standby_feedback = off;');
+$standby1->restart;
+

Can there be a comment added to explain why you are doing the
'hot_standby_feedback' toggle?

~~~

5.
+##
+# Promote the standby1 to primary. Confirm that:
+# a) the sync-ready('r') slot 'lsub1_slot' is retained on the new primary
+# b) the initiated('i') slot 'logical_slot' is dropped on promotion
+# c) logical replication for regress_mysub1 is resumed succesfully
after failover
+##

/succesfully/successfully/

~~~

6.
+
+# Confirm that data in tab_mypub3 replicated on subscriber
+is( $subscriber1->safe_psql('postgres', q{SELECT count(*) FROM tab_int;}),
+ "$primary_row_count",
+ 'data replicated from the new primary');

The comment is wrong -- it names a different table ('tab_mypub3' ?) to
what the SQL says.

==
[1] My v48-0002 review comments.
https://www.postgresql.org/message-id/CAHut%2BPsyZQZ1A4XcKw-D%3DvcTg16pN9Dw0PzE8W_X7Yz_bv00rQ%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Transaction timeout

2023-12-18 Thread Japin Li


On Mon, 18 Dec 2023 at 17:40, Andrey M. Borodin  wrote:
>> On 18 Dec 2023, at 14:32, Japin Li  wrote:
>>
>>
>> Thanks for updating the patch
>
> Sorry for the noise, but commitfest bot found one more bug in handling 
> statement timeout. PFA v11.
>

On Windows, there still have an error:

diff -w -U3 C:/cirrus/src/test/isolation/expected/timeouts.out 
C:/cirrus/build/testrun/isolation/isolation/results/timeouts.out
--- C:/cirrus/src/test/isolation/expected/timeouts.out  2023-12-18 
10:22:21.772537200 +
+++ C:/cirrus/build/testrun/isolation/isolation/results/timeouts.out
2023-12-18 10:26:08.039831800 +
@@ -103,24 +103,7 @@
 0
 (1 row)

-step stt2_check: SELECT 1;
-FATAL:  terminating connection due to transaction timeout
-server closed the connection unexpectedly
+PQconsumeInput failed: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.

-step itt4_set: SET idle_in_transaction_session_timeout = '1ms'; SET 
statement_timeout = '10s'; SET lock_timeout = '10s'; SET transaction_timeout = 
'10s';
-step itt4_begin: BEGIN ISOLATION LEVEL READ COMMITTED;
-step sleep_there: SELECT pg_sleep(0.01);
-pg_sleep
-
-
-(1 row)
-
-step stt3_check_itt4: SELECT count(*) FROM pg_stat_activity WHERE 
application_name = 'isolation/timeouts/itt4' 
-step stt3_check_itt4: <... completed>
-count
--
-0
-(1 row)
-


--
Regards,
Japin Li
ChengDu WenWu Information Technology Co., Ltd.




Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

2023-12-18 Thread Masahiko Sawada
On Mon, Dec 18, 2023 at 4:41 PM jian he  wrote:
>
> On Mon, Dec 18, 2023 at 1:09 PM torikoshia  wrote:
> >
> > Hi,
> >
> > > save the error metadata to  system catalogs would be more expensive,
> > > please see below explanation.
> > > I have no knowledge of publications.
> > > but i feel there is a feature request: publication FOR ALL TABLES
> > > exclude regex_pattern.
> > > Anyway, that would be another topic.
> >
> > I think saving error metadata to system catalog is not a good idea, too.
> > And I believe Sawada-san just pointed out missing features and did not
> > suggested that we use system catalog.
> >
> > > I don't think "specify the maximum number  of errors to tolerate
> > > before raising an ERROR." is very useful
> >
> > That may be so.
> > I imagine it's useful in some use case since some loading tools have
> > such options.
> > Anyway I agree it's not necessary for initial patch as mentioned in [1].
> >
> > > I suppose we can specify an ERRORFILE directory. similar
> > > implementation [2], demo in [3]
> > > it will generate 2 files, one file shows the malform line content as
> > > is, another file shows the error info.
> >
> > That may be a good option when considering "(2) logging errors to
> > somewhere".
> >
> > What do you think about the proposal to develop these features in
> > incrementally?
> >
>
> I am more with  tom's idea [1], that is when errors happen (data type
> conversion only), do not fail, AND we save the error to a table.  I
> guess we can implement this logic together, only with a new COPY
> option.

If we want only such a feature we need to implement it together (the
patch could be split, though). But if some parts of the feature are
useful for users as well, I'd recommend implementing it incrementally.
That way, the patches can get small and it would be easy for reviewers
and committers to review/commit them.

>
> imagine a case (it's not that contrived, imho), while conversion from
> text to table's int, postgres isspace is different from the source
> text file's isspace logic.
> then all the lines are malformed. If we just say on error continue and
> not save error meta info, the user is still confused which field has
> the wrong data, then the user will probably try to incrementally test
> which field contains malformed data.
>
> Since we need to save the error somewhere.
> Everyone has the privilege to INSERT can do COPY.
> I think we also need to handle the access privilege also.
> So like I mentioned above, one copy_error error table hub, then
> everyone can view/select their own copy failure record.

The error table hub idea is still unclear to me. I assume that there
are error tables at least on each database. And an error table can
have error data that happened during COPY FROM, including malformed
lines. Do the error tables grow without bounds and the users have to
delete rows at some point? If so, who can do that? How can we ensure
that users can see only the errored rows they generated? And the issue
with logical replication also needs to be resolved. Anyway, if we go
this direction, we need to discuss the overall design.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

2023-12-18 Thread Masahiko Sawada
On Mon, Dec 18, 2023 at 9:16 AM jian he  wrote:
>
> On Fri, Dec 15, 2023 at 4:49 AM Masahiko Sawada  wrote:
> >
> > Hi,
> >
> > I've read this thread and the latest patch. IIUC with SAVE_ERROR
> > option, COPY FROM creates an error table for the target table and
> > writes error information there.
> >
> > While I agree that the final shape of this feature would be something
> > like that design, I'm concerned some features are missing in order to
> > make this feature useful in practice. For instance, error logs are
> > inserted to error tables without bounds, meaning that users who want
> > to tolerate errors during COPY FROM  will have to truncate or drop the
> > error tables periodically, or the database will grow with error logs
> > without limit. Ideally such maintenance work should be done by the
> > database. There might be some users who want to log such conversion
> > errors in server logs to avoid such maintenance work. I think we
> > should provide an option for where to write, at least. Also, since the
> > error tables are normal user tables internally, error logs are also
> > replicated to subscribers if there is a publication FOR ALL TABLES,
> > unlike system catalogs. I think some users would not like such
> > behavior.
>
> save the error metadata to  system catalogs would be more expensive,
> please see below explanation.
> I have no knowledge of publications.
> but i feel there is a feature request: publication FOR ALL TABLES
> exclude regex_pattern.
> Anyway, that would be another topic.

I don't think the new regex idea would be a good solution for the
existing users who are using FOR ALL TABLES publication. It's not
desirable that they have to change the publication because of this
feature. With the current patch, a logical replication using FOR ALL
TABLES publication will stop immediately after an error information is
inserted into a new error table unless the same error table is created
on subscribers.

>
> > Looking at SAVE_ERROR feature closely, I think it consists of two
> > separate features. That is, it enables COPY FROM to load data while
> > (1) tolerating errors and (2) logging errors to somewhere (i.e., an
> > error table). If we implement only (1), it would be like COPY FROM
> > tolerate errors infinitely and log errors to /dev/null. The user
> > cannot see the error details but I guess it could still help some
> > cases as Andres mentioned[1] (it might be a good idea to send the
> > number of rows successfully loaded in a NOTICE message if some rows
> > could not be loaded). Then with (2), COPY FROM can log error
> > information to somewhere such as tables and server logs and the user
> > can select it. So I'm thinking we may be able to implement this
> > feature incrementally. The first step would be something like an
> > option to ignore all errors or an option to specify the maximum number
> > of errors to tolerate before raising an ERROR. The second step would
> > be to support logging destinations such as server logs and tables.
> >
> > Regards,
> >
> > [1] 
> > https://www.postgresql.org/message-id/20231109002600.fuihn34bjqqgmbjm%40awork3.anarazel.de
> >
> > --
> > Masahiko Sawada
> > Amazon Web Services: https://aws.amazon.com
>
> > feature incrementally. The first step would be something like an
> > option to ignore all errors or an option to specify the maximum number
> > of errors to tolerate before raising an ERROR. The second step would
>
> I don't think "specify the maximum number  of errors to tolerate
> before raising an ERROR." is very useful
>
> QUOTE from [1]
> MAXERROR [AS] error_count
> If the load returns the error_count number of errors or greater, the
> load fails. If the load returns fewer errors, it continues and returns
> an INFO message that states the number of rows that could not be
> loaded. Use this parameter to allow loads to continue when certain
> rows fail to load into the table because of formatting errors or other
> inconsistencies in the data.
> Set this value to 0 or 1 if you want the load to fail as soon as the
> first error occurs. The AS keyword is optional. The MAXERROR default
> value is 0 and the limit is 10.
> The actual number of errors reported might be greater than the
> specified MAXERROR because of the parallel nature of Amazon Redshift.
> If any node in the Amazon Redshift cluster detects that MAXERROR has
> been exceeded, each node reports all of the errors it has encountered.
> END OF QUOTE
>
> option MAXERROR error_count. iiuc, it fails while validating line
> error_count + 1, else it raises a notice, tells you how many rows have
> errors.
>
> * case when error_count is small, and the copy fails, it only tells
> you that at least the error_count line has malformed data. but what if
> the actual malformed rows are very big. In this case, this failure
> error message is not that helpful.
> * case when error_count is very big, and the copy does not fail. then
> the actual malformed data rows are very big (still les

Re: Synchronizing slots from primary to standby

2023-12-18 Thread Peter Smith
Here are some review comments for v48-0002

==
doc/src/sgml/config.sgml

1.
+  If slot synchronization is enabled then it is also necessary to
+  specify dbname in the
+  primary_conninfo string. This will only
be used for
+  slot synchronization. It is ignored for streaming.

I felt the "If slot synchronization is enabled" part should also
include an xref to the enable_slotsync GUC, otherwise there is no
information here about how to enable it.

SUGGESTION
If slot synchronization is enabled (see XXX) 

==
doc/src/sgml/logicaldecoding.sgml

2.
+
+ The ability to resume logical replication after failover depends upon the
+ pg_replication_slots.sync_state
+ value for the synchronized slots on the standby at the time of failover.
+ Only slots that have attained "ready" sync_state ('r') on the standby
+ before failover can be used for logical replication after failover. Slots
+ that have not yet reached 'r' state (they are still 'i') will be dropped,
+ therefore logical replication for those slots cannot be resumed. For
+ example, if the synchronized slot could not become sync-ready on standby
+ due to a disabled subscription, then the subscription cannot be resumed
+ after failover even when it is enabled.
+ If the primary is idle, then the synchronized slots on the standby may
+ take a noticeable time to reach the ready ('r') sync_state. This can
+ be sped up by calling the
+ pg_log_standby_snapshot function on the primary.
+

2a.
/sync-ready on standby/sync-ready on the standby/

~

2b.
Should "If the primary is idle" be in a new paragraph?

==
doc/src/sgml/system-views.sgml

3.
+  
+  The hot standby can have any of these sync_state values for the slots but
+  on a hot standby, the slots with state 'r' and 'i' can neither be used
+  for logical decoding nor dropped by the user.
+  The sync_state has no meaning on the primary server; the primary
+  sync_state value is default 'n' for all slots but may (if leftover
+  from a promoted standby)  also be 'r'.
+  

I still feel we are exposing too much useless information about the
primary server values.

Isn't it sufficient to just say "The sync_state values have no meaning
on a primary server.", and not bother to mention what those
meaningless values might be -- e.g. if they are meaningless then who
cares what they are or how they got there?

==
src/backend/replication/logical/slotsync.c

4. synchronize_one_slot

+ /* Slot ready for sync, so sync it. */
+ if (sync_state == SYNCSLOT_STATE_READY)
+ {
+ /*
+ * Sanity check: With hot_standby_feedback enabled and
+ * invalidations handled appropriately as above, this should never
+ * happen.
+ */
+ if (remote_slot->restart_lsn < slot->data.restart_lsn)
+ elog(ERROR,
+ "not synchronizing local slot \"%s\" LSN(%X/%X)"
+ " to remote slot's LSN(%X/%X) as synchronization "
+ " would move it backwards", remote_slot->name,
+ LSN_FORMAT_ARGS(slot->data.restart_lsn),
+ LSN_FORMAT_ARGS(remote_slot->restart_lsn));
+
+ if (remote_slot->confirmed_lsn != slot->data.confirmed_flush ||
+ remote_slot->restart_lsn != slot->data.restart_lsn ||
+ remote_slot->catalog_xmin != slot->data.catalog_xmin)
+ {
+ /* Update LSN of slot to remote slot's current position */
+ local_slot_update(remote_slot);
+ ReplicationSlotSave();
+ slot_updated = true;
+ }
+ }
+ /* Slot not ready yet, let's attempt to make it sync-ready now. */
+ else if (sync_state == SYNCSLOT_STATE_INITIATED)
+ {
+ /*
+ * Wait for the primary server to catch-up. Refer to the comment
+ * atop the file for details on this wait.
+ */
+ if (remote_slot->restart_lsn < slot->data.restart_lsn ||
+ TransactionIdPrecedes(remote_slot->catalog_xmin,
+   slot->data.catalog_xmin))
+ {
+ if (!wait_for_primary_slot_catchup(wrconn, remote_slot, NULL))
+ {
+ ReplicationSlotRelease();
+ return false;
+ }
+ }
+
+ /*
+ * Wait for primary is over, update the lsns and mark the slot as
+ * READY for further syncs.
+ */
+ local_slot_update(remote_slot);
+ SpinLockAcquire(&slot->mutex);
+ slot->data.sync_state = SYNCSLOT_STATE_READY;
+ SpinLockRelease(&slot->mutex);
+
+ /* Save the changes */
+ ReplicationSlotMarkDirty();
+ ReplicationSlotSave();
+ slot_updated = true;
+
+ ereport(LOG,
+ errmsg("newly locally created slot \"%s\" is sync-ready now",
+remote_slot->name));
+ }

4a.
It would be more natural in the code if you do the
SYNCSLOT_STATE_INITIATED logic before the SYNCSLOT_STATE_READY because
that is the order those states come in.

~

4b.
I'm not sure if it is worth it, but I was thinking that some duplicate
code can be avoided by doing if/if instead of if/else

if (sync_state == SYNCSLOT_STATE_INITIATED)
{
..
}
if (sync_state == SYNCSLOT_STATE_READY)
{
}

By arranging it this way maybe the SYNCSLOT_STATE_INITIATED code block
doesn't need to do anything except update the sync_state =
SYNCSLOT_STATE_READY; Then it can just fall through to the
S

Re: Clang optimiser vs preproc.c

2023-12-18 Thread Thomas Munro
On Sun, Dec 17, 2023 at 1:29 AM Andres Freund  wrote:
> On 2023-12-15 22:19:56 -0500, Tom Lane wrote:
> > Thomas Munro  writes:
> > > On Sat, Dec 16, 2023 at 3:44 PM Tom Lane  wrote:
> > >> Thomas Munro  writes:
> > >>> FYI, it looks like there is a big jump in CPU time to compile preproc.c 
> > >>> at -O2:
> > >>> clang15: ~16s
> > >>> clang16: ~211s
> > >>> clang17: ~233s
>
> Is this with non-assert clangs? Because I see times that seem smaller by more
> than what can be explained by hardware differences:
>
> preproc.c:
> 17   10.270s
> 16    9.685s
> 15    8.300s
>
> gram.c:
> 17    1.936s
> 16    2.131s
> 15    2.161s
>
> That's still bad, but a far cry away from 233s.

Hrmph.  Well something weird is going on, but it might indeed involve
me being confused about debug options of the compiler itself.  How can
one find out which build options were used for clang/llvm compiler +
libraries?  My earlier reports were from a little machine at home, so
let's try again on an i9-9900 CPU @ 3.10GHz (a bit snappier) running
Debian 12, again using packages from apt.llvm.org:

17 ~198s
16 ~14s
15 ~11s

OK so even if we ignore the wild outlier it is getting significantly
slower.  But... huh, there goes the big jump, but at a different
version than I saw with FBSD's packages.   Here's what perf says it's
doing:

+   99.42%  20.12%  clang-17  libLLVM-17.so.1  [.] llvm::slpvectorizer::BoUpSLP::getTreeCost
+   96.91%   0.00%  clang-17  libLLVM-17.so.1  [.] llvm::SLPVectorizerPass::runImpl
+   96.91%   0.00%  clang-17  libLLVM-17.so.1  [.] llvm::SLPVectorizerPass::vectorizeChainsInBlock
+   96.91%   0.00%  clang-17  libLLVM-17.so.1  [.] llvm::SLPVectorizerPass::vectorizeSimpleInstructions
+   96.91%   0.00%  clang-17  libLLVM-17.so.1  [.] llvm::SLPVectorizerPass::vectorizeInsertElementInst
+   96.91%   0.00%  clang-17  libLLVM-17.so.1  [.] llvm::SLPVectorizerPass::tryToVectorizeList
+   73.79%   0.00%  clang-17  libLLVM-17.so.1  [.] 0x7fbead445cb0
+   36.88%  36.88%  clang-17  libLLVM-17.so.1  [.] 0x01e45cda
+    3.95%   3.95%  clang-17  libLLVM-17.so.1  [.] 0x01e45d11




Re: add non-option reordering to in-tree getopt_long

2023-12-18 Thread Nathan Bossart
On Mon, Dec 18, 2023 at 02:41:22PM -0500, Tom Lane wrote:
> We just had a user complaint that seems to trace to exactly this
> bogus reporting in pg_ctl [1].  Although I was originally not
> very pleased with changing our getopt_long to do switch reordering,
> I'm now wondering if we should back-patch these changes as bug
> fixes.  It's probably not worth the risk, but ...

I'm not too concerned about the risks of back-patching these commits, but
if this 19-year-old bug was really first reported today, I'd agree that
fixing it in the stable branches is probably not worth it.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Add --check option to pgindent

2023-12-18 Thread Tristan Partin

On Mon Dec 18, 2023 at 11:21 AM CST, Jelte Fennema-Nio wrote:

On Mon, 18 Dec 2023 at 17:50, Tristan Partin  wrote:
> I could propose something. It would help if I had an example of such
> a tool that already exists.

Basically the same behaviour as what you're trying to add now for
--check, only instead of printing the diff it would actually change
the files just like running pgindent without a --check flag does. i.e.
allow pgindent --check to not run in "dry-run" mode
My pre-commit hook looks like this currently (removed boring cruft around it):

  if ! src/tools/pgindent/pgindent --check $files > /dev/null; then
exit 0
  fi
  echo "Running pgindent on changed files"
  src/tools/pgindent/pgindent $files
  echo "Commit abandoned. Rerun git commit to adopt pgindent changes"
  exit 1

But I would like it to look like:

  if src/tools/pgindent/pgindent --check --write $files > /dev/null; then
exit 0
  fi
  echo "Commit abandoned. Rerun git commit to adopt pgindent changes"
  exit 1


To me, the two options seem at odds, like you either check or write. How 
would you feel about just capturing the diffs that are printed and 
patch(1)-ing them?


--
Tristan Partin
Neon (https://neon.tech)




Re: Add --check option to pgindent

2023-12-18 Thread Tristan Partin

On Mon Dec 18, 2023 at 10:50 AM CST, Tristan Partin wrote:

On Mon Dec 18, 2023 at 10:14 AM CST, Jelte Fennema-Nio wrote:
> On Mon, 18 Dec 2023 at 13:42, Daniel Gustafsson  wrote:
> > I think this is pretty much ready to go, the attached v4 squashes the changes
> > and fixes the man-page which also needed an update.  The referenced Wiki page
> > will need an edit or two after this goes in, but that's easy enough.
>
> One thing I'm wondering: When both --check and --diff are passed,
> should pgindent still early exit with 2 on the first incorrectly
> formatted file? Or should it show diffs for all failing files? I'm
> leaning towards the latter making more sense.

Makes sense. Let me iterate on the squashed version of the patch.


Here is an additional patch which implements the behavior you described. 
The first patch is just Daniel's squashed version of my patches.


--
Tristan Partin
Neon (https://neon.tech)
From 5cd4c5e9af0fc6e2e385b79111a6cca66df6cad9 Mon Sep 17 00:00:00 2001
From: Daniel Gustafsson 
Date: Mon, 18 Dec 2023 13:22:12 +0100
Subject: [PATCH v5 1/2] Rename non-destructive modes in pgindent

This renames --silent-diff to --check and --show-diff to --diff,
in order to make the options a little bit more self-explanatory
for developers used to similar formatters/linters.  --check and
--diff are also allowed to be combined.

Author: Tristan Partin 
Discussion: https://postgr.es/m/cxlx2xyth9s6.140sc6y61v...@neon.tech
---
 src/tools/pgindent/pgindent | 35 +
 src/tools/pgindent/pgindent.man | 12 +--
 2 files changed, 24 insertions(+), 23 deletions(-)

diff --git a/src/tools/pgindent/pgindent b/src/tools/pgindent/pgindent
index bce63d95da..eb2f52f4b9 100755
--- a/src/tools/pgindent/pgindent
+++ b/src/tools/pgindent/pgindent
@@ -22,8 +22,8 @@ my $indent_opts =
 my $devnull = File::Spec->devnull;
 
 my ($typedefs_file, $typedef_str, @excludes,
-	$indent, $build, $show_diff,
-	$silent_diff, $help, @commits,);
+	$indent, $build, $diff,
+	$check, $help, @commits,);
 
 $help = 0;
 
@@ -34,15 +34,12 @@ my %options = (
 	"list-of-typedefs=s" => \$typedef_str,
 	"excludes=s" => \@excludes,
 	"indent=s" => \$indent,
-	"show-diff" => \$show_diff,
-	"silent-diff" => \$silent_diff,);
+	"diff" => \$diff,
+	"check" => \$check,);
 GetOptions(%options) || usage("bad command line argument");
 
 usage() if $help;
 
-usage("Cannot have both --silent-diff and --show-diff")
-  if $silent_diff && $show_diff;
-
 usage("Cannot use --commit with command line file list")
   if (@commits && @ARGV);
 
@@ -294,7 +291,7 @@ sub run_indent
 	return $source;
 }
 
-sub show_diff
+sub diff
 {
 	my $indented = shift;
 	my $source_filename = shift;
@@ -323,8 +320,8 @@ Options:
 	--list-of-typedefs=STR  string containing typedefs, space separated
 	--excludes=PATH file containing list of filename patterns to ignore
 	--indent=PATH   path to pg_bsd_indent program
-	--show-diff show the changes that would be made
-	--silent-diff   exit with status 2 if any changes would be made
+	--diff  show the changes that would be made
+	--check exit with status 2 if any changes would be made
 The --excludes and --commit options can be given more than once.
 EOF
 	if ($help)
@@ -417,17 +414,21 @@ foreach my $source_filename (@files)
 
 	if ($source ne $orig_source)
 	{
-		if ($silent_diff)
-		{
-			exit 2;
-		}
-		elsif ($show_diff)
+		if (!$diff && !$check)
 		{
-			print show_diff($source, $source_filename);
+			write_source($source, $source_filename);
 		}
 		else
 		{
-			write_source($source, $source_filename);
+			if ($diff)
+			{
+print diff($source, $source_filename);
+			}
+
+			if ($check)
+			{
+exit 2;
+			}
 		}
 	}
 }
diff --git a/src/tools/pgindent/pgindent.man b/src/tools/pgindent/pgindent.man
index fe411ee699..caab5cde91 100644
--- a/src/tools/pgindent/pgindent.man
+++ b/src/tools/pgindent/pgindent.man
@@ -31,13 +31,13 @@ find the file src/tools/pgindent/exclude_file_patterns. The --excludes option
 can be used more than once to specify multiple files containing exclusion
 patterns.
 
-There are also two non-destructive modes of pgindent. If given the --show-diff
+There are also two non-destructive modes of pgindent. If given the --diff
 option pgindent will show the changes it would make, but doesn't actually make
-them. If given instead the --silent-diff option, pgindent will exit with a
-status of 2 if it finds any indent changes are required, but will not
-make the changes or give any other information. This mode is intended for
-possible use in a git pre-commit hook. An example of its use in a git hook
-can be seen at https://wiki.postgresql.org/wiki/Working_with_Git#Using_git_hooks
+them. If given instead the --check option, pgindent will exit with a status of
+2 if it finds any indent changes are required, but will not make the changes.
+This mode is intended for possible use in a git pre-commit hook. The --

Re: index prefetching

2023-12-18 Thread Robert Haas
On Sat, Dec 9, 2023 at 1:08 PM Tomas Vondra
 wrote:
> But there's a layering problem that I don't know how to solve - I don't
> see how we could make indexam.c entirely oblivious to the prefetching,
> and move it entirely to the executor. Because how else would you know
> what to prefetch?

Yeah, that seems impossible.

Some thoughts:

* I think perhaps the subject line of this thread is misleading. It
doesn't seem like there is any index prefetching going on here at all,
and there couldn't be, unless you extended the index AM API with new
methods. What you're actually doing is prefetching heap pages that
will be needed by a scan of the index. I think this confusing naming
has propagated itself into some parts of the patch, e.g.
index_prefetch() reads *from the heap* which is not at all clear from
the comment saying "Prefetch the TID, unless it's sequential or
recently prefetched." You're not prefetching the TID: you're
prefetching the heap tuple to which the TID points. That's not an
academic distinction IMHO -- the TID would be stored in the index, so
if we were prefetching the TID, we'd have to be reading index pages,
not heap pages.

* Regarding layering, my first thought was that the changes to
index_getnext_tid() and index_getnext_slot() are sensible: read ahead
by some number of TIDs, keep the TIDs you've fetched in an array
someplace, use that to drive prefetching of blocks on disk, and return
the previously-read TIDs from the queue without letting the caller
know that the queue exists. I think that's the obvious design for a
feature of this type, to the point where I don't really see that
there's a viable alternative design. Driving something down into the
individual index AMs would make sense if you wanted to prefetch *from
the indexes*, but it's unnecessary otherwise, and best avoided.

* But that said, the skip_all_visible flag passed down to
index_prefetch() looks like a VERY strong sign that the layering here
is not what it should be. Right now, when some code calls
index_getnext_tid(), that function does not need to know or care
whether the caller is going to fetch the heap tuple or not. But with
this patch, the code does need to care. So knowledge of the executor
concept of an index-only scan trickles down into indexam.c, which now
has to be able to make decisions that are consistent with the ones
that the executor will make. That doesn't seem good at all.

* I think it might make sense to have two different prefetching
schemes. Ideally they could share some structure. If a caller is using
index_getnext_slot(), then it's easy for prefetching to be fully
transparent. The caller can just ask for TIDs and the prefetching
distance and TID queue can be fully under the control of something
that is hidden from the caller. But when using index_getnext_tid(),
the caller needs to have an opportunity to evaluate each TID and
decide whether we even want the heap tuple. If yes, then we feed that
TID to the prefetcher; if no, we don't. That way, we're not
replicating executor logic in lower-level code. However, that also
means that the IOS logic needs to be aware that this TID queue exists
and interact with whatever controls the prefetch distance. Perhaps
after calling index_getnext_tid() you call
index_prefetcher_put_tid(prefetcher, tid, bool fetch_heap_tuple) and
then you call index_prefetcher_get_tid() to drain the queue. Perhaps
also the prefetcher has a "fill" callback that gets invoked when the
TID queue isn't as full as the prefetcher wants it to be. Then
index_getnext_slot() can just install a trivial fill callback that
says index_prefetcher_put_tid(prefetcher, index_getnext_tid(...),
true), but IOS can use a more sophisticated callback that checks the
VM to determine what to pass for the third argument.

* I realize that I'm being a little inconsistent in what I just said,
because in the first bullet point I said that this wasn't really index
prefetching, and now I'm proposing function names that still start
with index_prefetch. It's not entirely clear to me what the best thing
to do about the terminology is here -- could it be a heap prefetcher,
or a TID prefetcher, or an index scan prefetcher? I don't really know,
but whatever we can do to make the naming more clear seems like a
really good idea. Maybe there should be a clearer separation between
the queue of TIDs that we're going to return from the index and the
queue of blocks that we want to prefetch to get the corresponding heap
tuples -- making that separation crisper might ease some of the naming
issues.

* Not that I want to be critical because I think this is a great start
on an important project, but it does look like there's an awful lot of
stuff here that still needs to be sorted out before it would be
reasonable to think of committing this, both in terms of design
decisions and just general polish. There's a lot of stuff marked with
XXX and I think that's great because most of those seem to be good
questions but that does leave th

Re: encoding affects ICU regex character classification

2023-12-18 Thread Jeff Davis
On Fri, 2023-12-15 at 16:48 -0800, Jeremy Schneider wrote:
> This goes back to my other thread (which sadly got very little
> discussion): PosgreSQL really needs to be safe by /default/

Doesn't a built-in provider help create a safer option?

The built-in provider's version of Unicode will be consistent with
unicode_assigned(), which is a first step toward rejecting code points
that the provider doesn't understand. And by rejecting unassigned code
points, we get all kinds of Unicode compatibility guarantees that avoid
the kinds of change risks that you are worried about.

Regards,
Jeff Davis





Fixing backslash dot for COPY FROM...CSV

2023-12-18 Thread Daniel Verite
  Hi,

PFA a patch that attempts to fix the bug that \. on a line
by itself is handled incorrectly by COPY FROM ... CSV.
This issue has been discussed several times previously,
for instance in [1] and [2], and mentioned in the
doc for \copy in commit 42d3125.

There's one case that works today: when
the line is part of a multi-line quoted section,
and the data is read from a file, not from the client.
In other situations, an error is raised or the data is cut at
the point of \. without an error.

The patch addresses that issue in the server and in psql,
except for the case of inlined data, where \. cannot be 
both valid data and an EOF marker at the same time, so
it keeps treating it as an EOF marker. 


[1]
https://www.postgresql.org/message-id/10e3eff6-eb04-4b3f-aeb4-b920192b9...@manitou-mail.org
[2]
https://www.postgresql.org/message-id/8aeab305-5e94-4fa5-82bf-6da6baee6e05%40app.fastmail.com


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite
From 868b1e065cf714f315bab65129fd05a1d60fc81b Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Daniel=20V=C3=A9rit=C3=A9?= 
Date: Mon, 18 Dec 2023 20:47:02 +0100
Subject: [PATCH v1] Support backslash-dot on a line by itself as valid data in
 COPY FROM...CSV

---
 doc/src/sgml/ref/psql-ref.sgml   |  4 
 src/backend/commands/copyfromparse.c | 13 ++-
 src/bin/psql/copy.c  | 32 
 src/test/regress/expected/copy.out   | 15 +
 src/test/regress/sql/copy.sql| 11 ++
 5 files changed, 51 insertions(+), 24 deletions(-)

diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml
index cc7d797159..d94e3cacfc 100644
--- a/doc/src/sgml/ref/psql-ref.sgml
+++ b/doc/src/sgml/ref/psql-ref.sgml
@@ -1119,10 +1119,6 @@ INSERT INTO tbl1 VALUES ($1, $2) \bind 'first value' 'second value' \g
 destination, because all data must pass through the client/server
 connection.  For large amounts of data the SQL
 command might be preferable.
-Also, because of this pass-through method, \copy
-... from in CSV mode will erroneously
-treat a \. data value alone on a line as an
-end-of-input marker.
 
 
 
diff --git a/src/backend/commands/copyfromparse.c b/src/backend/commands/copyfromparse.c
index f553734582..b4c291b25b 100644
--- a/src/backend/commands/copyfromparse.c
+++ b/src/backend/commands/copyfromparse.c
@@ -1137,7 +1137,6 @@ CopyReadLineText(CopyFromState cstate)
boolresult = false;
 
/* CSV variables */
-   boolfirst_char_in_line = true;
boolin_quote = false,
last_was_esc = false;
charquotec = '\0';
@@ -1335,10 +1334,9 @@ CopyReadLineText(CopyFromState cstate)
}
 
/*
-* In CSV mode, we only recognize \. alone on a line.  This is because
-* \. is a valid CSV data value.
+* In CSV mode, \. is a valid CSV data value anywhere in the line.
 */
-   if (c == '\\' && (!cstate->opts.csv_mode || first_char_in_line))
+   if (c == '\\' && !cstate->opts.csv_mode)
{
charc2;
 
@@ -1442,14 +1440,7 @@ CopyReadLineText(CopyFromState cstate)
}
}
 
-   /*
-* This label is for CSV cases where \. appears at the start of a
-* line, but there is more text after it, meaning it was a data value.
-* We are more strict for \. in CSV mode because \. could be a data
-* value, while in non-CSV mode, \. cannot be a data value.
-*/
 not_end_of_copy:
-   first_char_in_line = false;
}   /* end of outer 
loop */
 
/*
diff --git a/src/bin/psql/copy.c b/src/bin/psql/copy.c
index dbbbdb8898..f31a7acbb6 100644
--- a/src/bin/psql/copy.c
+++ b/src/bin/psql/copy.c
@@ -620,20 +620,34 @@ handleCopyIn(PGconn *conn, FILE *copystream, bool isbinary, PGresult **res)
/* current line is done? */
if (buf[buflen - 1] == '\n')
{
-   /* check for EOF marker, but not on a partial line */
-   if (at_line_begin)
+   /*
+* When at the beginning of the line, check for EOF marker.
+* If the marker is found and the data is inlined,
+* we must stop at this point.  If not, the \. line can be
+* sent to the server, and we let it decide whether
+* it's an E

Re: Built-in CTYPE provider

2023-12-18 Thread Jeff Davis
On Fri, 2023-12-15 at 16:30 -0800, Jeremy Schneider wrote:
> Looking closer, patches 3 and 4 look like an incremental extension of
> this earlier idea;

Yes, it's essentially the same thing extended to a few more files. I
don't know if "incremental" is the right word though; this is a
substantial extension of the idea.

>  the perl scripts download data from unicode.org and
> we've specifically defined Unicode version 15.1 and the scripts turn
> the
> data tables inside-out into C data structures optimized for lookup.
> That
> C code is then checked in to the PostgreSQL source code files
> unicode_category.h and unicode_case_table.h - right?

Yes. The standard build process shouldn't be downloading files, so the
static tables are checked in. Also, seeing the diffs of the static
tables improves the visibility of changes in case there's some mistake
or big surprise.

> Am I reading correctly that these two patches add C functions
> pg_u_prop_* and pg_u_is* (patch 3) and unicode_*case (patch 4) but we
> don't yet reference these functions anywhere? So this is just getting
> some plumbing in place?

Correct. Perhaps I should combine these into the builtin provider
thread, but these are independently testable and reviewable.

> > 
> My prediction is that updating this built-in provider eventually
> won't
> be any different from ICU or glibc.

The built-in provider will have several advantages because it's tied to
a PG major version:

  * A physical replica can't have different semantics than the primary.
  * Easier to document and test.
  * Changes are more transparent and can be documented in the release
notes, so that administrators can understand the risks and blast radius
at pg_upgrade time.

> Later on down the road, from a user perspective, I think we should be
> careful about confusion where providers are used inconsistently. It's
> not great if one function follows built-in Unicode 15.1 rules but
> another
> function uses Unicode 13 rules because it happened to call an ICU
> function or a glibc function. We could easily end up with multiple
> providers processing different parts of a single SQL statement, which
> could lead to strange results in some cases.

The whole concept of "providers" is that they aren't consistent with
each other. ICU, libc, and the builtin provider will all be based on
different versions of Unicode. That's by design.

The built-in provider will be a bit better in the sense that it's
consistent with the normalization functions, and the other providers
aren't.

Regards,
Jeff Davis







Re: add non-option reordering to in-tree getopt_long

2023-12-18 Thread Tom Lane
Michael Paquier  writes:
> On Thu, Jul 13, 2023 at 09:38:42PM -0700, Nathan Bossart wrote:
>> Take the following examples of client programs that accept one non-option:
>> 
>> ~$ pg_resetwal a b c
>> pg_resetwal: error: too many command-line arguments (first is "b")
>> pg_resetwal: hint: Try "pg_resetwal --help" for more information.
>> 
>> Yet pg_ctl gives:
>> 
>> ~$ pg_ctl start a b c
>> pg_ctl: too many command-line arguments (first is "start")
>> Try "pg_ctl --help" for more information.
>> 
>> In this example, isn't "a" the first extra non-option that should be
>> reported?

> Good point.  This is interpreting "first" as being the first option
> that's invalid.  Here my first impression was that pg_ctl got that
> right, where "first" refers to the first subcommand that would be
> valid.  Objection withdrawn.

We just had a user complaint that seems to trace to exactly this
bogus reporting in pg_ctl [1].  Although I was originally not
very pleased with changing our getopt_long to do switch reordering,
I'm now wondering if we should back-patch these changes as bug
fixes.  It's probably not worth the risk, but ...

regards, tom lane

[1] 
https://www.postgresql.org/message-id/flat/CANzqJaDCagH5wNkPQ42%3DFx3mJPR-YnB3PWFdCAYAVdb9%3DQ%2Bt-A%40mail.gmail.com




Re: trying again to get incremental backup

2023-12-18 Thread Robert Haas
On Fri, Dec 15, 2023 at 6:58 AM Peter Eisentraut  wrote:
> A separate bikeshedding topic: The GUC "summarize_wal", could that be
> "wal_something" instead?  (wal_summarize? wal_summarizer?)  It would be
> nice if these settings names group together a bit, both with existing
> wal_* ones and also with the new ones you are adding
> (wal_summary_keep_time).

Yeah, this is highly debatable, so bikeshed away. IMHO, the question
here is whether we care more about (1) having the name of the GUC
sound nice grammatically or (2) having the GUC begin with the same
string as other, related GUCs. I think that Tom Lane tends to prefer
the former, and probably some other people do too, while some other
people tend to prefer the latter. Ideally it would be possible to
satisfy both goals at once here, but everything I thought about that
started with "wal" sounded too awkward for me to like it; hence the
current choice of name. But if there's consensus on something else, so
be it.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: trying again to get incremental backup

2023-12-18 Thread Robert Haas
On Fri, Dec 15, 2023 at 6:53 AM Peter Eisentraut  wrote:
> The first fixes up some things in nls.mk related to a file move.  The
> second is some cleanup because some function you are using has been
> removed in the meantime; you probably found that yourself while rebasing.

Incorporated these. As you guessed,
MemoryContextResetAndDeleteChildren -> MemoryContextReset had already
been done locally.

> The pg_walsummary patch doesn't have a nls.mk, but you also comment that
> it doesn't have tests yet, so I assume it's not considered complete yet
> anyway.

I think this was more of a case of me just not realizing that I should
add that. I'll add something simple to the next version, but I'm not
very good at this NLS stuff.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

2023-12-18 Thread Robert Haas
On Mon, Dec 18, 2023 at 12:53 PM Andrey M. Borodin  wrote:
> One page still accommodates 32K transaction statuses under one lock. It feels 
> like a lot. About 1 second of transactions on a typical installation.
>
> When the group commit was committed did we have a benchmark to estimate 
> efficiency of this technology? Can we repeat that test again?

I think we did, but it might take some research to find it in the
archives. If we can, I agree that repeating it feels like a good idea.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

2023-12-18 Thread Andrey M. Borodin



> On 18 Dec 2023, at 22:30, Robert Haas  wrote:
> 
> On Mon, Dec 18, 2023 at 12:04 PM Robert Haas  wrote:
>> certain sense they are competing for the same job. However, they do
>> aim to alleviate different TYPES of contention: the group XID update
>> stuff should be most valuable when lots of processes are trying to
>> update the same page, and the banks should be most valuable when there
>> is simultaneous access to a bunch of different pages. So I'm not
>> convinced that this patch is a reason to remove the group XID update
>> mechanism, but someone might argue otherwise.
> 
> Hmm, but, on the other hand:
> 
> Currently all readers and writers are competing for the same LWLock.
> But with this change, the readers will (mostly) no longer be competing
> with the writers. So, in theory, that might reduce lock contention
> enough to make the group update mechanism pointless.

One page still accommodates 32K transaction statuses under one lock. It feels 
like a lot. About 1 second of transactions on a typical installation.

When the group commit was committed did we have a benchmark to estimate 
efficiency of this technology? Can we repeat that test again?


Best regards, Andrey Borodin.



Re: common signal handler protection

2023-12-18 Thread Nathan Bossart
rebased for cfbot

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From c5fc2186960c483d53789f27fcf84771e98c5ca3 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Tue, 28 Nov 2023 14:58:20 -0600
Subject: [PATCH v5 1/3] Check that MyProcPid == getpid() in all signal
 handlers.

In commit 97550c0711, we added a similar check to the SIGTERM
handler for the startup process.  This commit adds this check to
all signal handlers installed with pqsignal().  This is done by
using a wrapper function that performs the check before calling the
actual handler.

The hope is that this will offer more general protection against
child processes of Postgres backends inadvertently modifying shared
memory due to inherited signal handlers.  Another potential
follow-up improvement is to use this wrapper handler function to
restore errno instead of relying on each individual handler
function to do so.

This commit makes the changes in commit 97550c0711 obsolete but
leaves reverting it for a follow-up commit.

Reviewed-by: Andres Freund
Discussion: https://postgr.es/m/20231121212008.GA3742740%40nathanxps13
---
 src/port/pqsignal.c | 86 +++--
 1 file changed, 84 insertions(+), 2 deletions(-)

diff --git a/src/port/pqsignal.c b/src/port/pqsignal.c
index 23e7f685c1..f04a7153de 100644
--- a/src/port/pqsignal.c
+++ b/src/port/pqsignal.c
@@ -46,23 +46,99 @@
 #include "c.h"
 
 #include 
+#ifndef FRONTEND
+#include 
+#endif
 
 #ifndef FRONTEND
 #include "libpq/pqsignal.h"
+#include "miscadmin.h"
+#endif
+
+#ifdef PG_SIGNAL_COUNT			/* Windows */
+#define PG_NSIG (PG_SIGNAL_COUNT)
+#elif defined(NSIG)
+#define PG_NSIG (NSIG)
+#else
+#define PG_NSIG (64)			/* XXX: wild guess */
+#endif
+
+/* Check a couple of common signals to make sure PG_NSIG is accurate. */
+StaticAssertDecl(SIGUSR2 < PG_NSIG, "SIGUSR2 >= PG_NSIG");
+StaticAssertDecl(SIGHUP < PG_NSIG, "SIGHUP >= PG_NSIG");
+StaticAssertDecl(SIGTERM < PG_NSIG, "SIGTERM >= PG_NSIG");
+StaticAssertDecl(SIGALRM < PG_NSIG, "SIGALRM >= PG_NSIG");
+
+static volatile pqsigfunc pqsignal_handlers[PG_NSIG];
+
+/*
+ * Except when called with SIG_IGN or SIG_DFL, pqsignal() sets up this function
+ * as the handler for all signals.  This wrapper handler function checks that
+ * it is called within a process that the server knows about (i.e., any process
+ * that has called InitProcessGlobals(), such as a client backend), and not a
+ * child process forked by system(3), etc.  This check ensures that such child
+ * processes do not modify shared memory, which is often detrimental.  If the
+ * check succeeds, the function originally provided to pqsignal() is called.
+ * Otherwise, the default signal handler is installed and then called.
+ */
+static void
+wrapper_handler(SIGNAL_ARGS)
+{
+#ifndef FRONTEND
+
+	/*
+	 * We expect processes to set MyProcPid before calling pqsignal() or
+	 * before accepting signals.
+	 */
+	Assert(MyProcPid);
+	Assert(MyProcPid != PostmasterPid || !IsUnderPostmaster);
+
+	if (unlikely(MyProcPid != (int) getpid()))
+	{
+		pqsignal(postgres_signal_arg, SIG_DFL);
+		raise(postgres_signal_arg);
+		return;
+	}
 #endif
 
+	(*pqsignal_handlers[postgres_signal_arg]) (postgres_signal_arg);
+}
+
 /*
  * Set up a signal handler, with SA_RESTART, for signal "signo"
  *
  * Returns the previous handler.
+ *
+ * NB: If called within a signal handler, race conditions may lead to bogus
+ * return values.  You should either avoid calling this within signal handlers
+ * or ignore the return value.
+ *
+ * XXX: Since no in-tree callers use the return value, and there is little
+ * reason to do so, it would be nice if we could convert this to a void
+ * function instead of providing potentially-bogus return values.
+ * Unfortunately, that requires modifying the pqsignal() in legacy-pqsignal.c,
+ * which in turn requires an SONAME bump, which is probably not worth it.
  */
 pqsigfunc
 pqsignal(int signo, pqsigfunc func)
 {
+	pqsigfunc	orig_func = pqsignal_handlers[signo];	/* assumed atomic */
 #if !(defined(WIN32) && defined(FRONTEND))
 	struct sigaction act,
 oact;
+#else
+	pqsigfunc	ret;
+#endif
 
+	Assert(signo < PG_NSIG);
+
+	if (func != SIG_IGN && func != SIG_DFL)
+	{
+		pqsignal_handlers[signo] = func;	/* assumed atomic */
+		func = wrapper_handler;
+	}
+
+#if !(defined(WIN32) && defined(FRONTEND))
 	act.sa_handler = func;
 	sigemptyset(&act.sa_mask);
 	act.sa_flags = SA_RESTART;
@@ -72,9 +148,15 @@ pqsignal(int signo, pqsigfunc func)
 #endif
 	if (sigaction(signo, &act, &oact) < 0)
 		return SIG_ERR;
-	return oact.sa_handler;
+	else if (oact.sa_handler == wrapper_handler)
+		return orig_func;
+	else
+		return oact.sa_handler;
 #else
 	/* Forward to Windows native signal system. */
-	return signal(signo, func);
+	if ((ret = signal(signo, func)) == wrapper_handler)
+		return orig_func;
+	else
+		return ret;
 #endif
 }
-- 
2.25.1

>From 44bae78090d44f2ecdc9fbaea43d9adc1827c445 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Tue

Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

2023-12-18 Thread Robert Haas
On Mon, Dec 18, 2023 at 12:04 PM Robert Haas  wrote:
> certain sense they are competing for the same job. However, they do
> aim to alleviate different TYPES of contention: the group XID update
> stuff should be most valuable when lots of processes are trying to
> update the same page, and the banks should be most valuable when there
> is simultaneous access to a bunch of different pages. So I'm not
> convinced that this patch is a reason to remove the group XID update
> mechanism, but someone might argue otherwise.

Hmm, but, on the other hand:

Currently all readers and writers are competing for the same LWLock.
But with this change, the readers will (mostly) no longer be competing
with the writers. So, in theory, that might reduce lock contention
enough to make the group update mechanism pointless.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Add --check option to pgindent

2023-12-18 Thread Jelte Fennema-Nio
On Mon, 18 Dec 2023 at 17:50, Tristan Partin  wrote:
> I could propose something. It would help if I had an example of such
> a tool that already exists.

Basically the same behaviour as what you're trying to add now for
--check, only instead of printing the diff it would actually change
the files just like running pgindent without a --check flag does. i.e.
allow pgindent --check to not run in "dry-run" mode
My pre-commit hook looks like this currently (removed boring cruft around it):

  if ! src/tools/pgindent/pgindent --check $files > /dev/null; then
exit 0
  fi
  echo "Running pgindent on changed files"
  src/tools/pgindent/pgindent $files
  echo "Commit abandoned. Rerun git commit to adopt pgindent changes"
  exit 1

But I would like it to look like:

  if src/tools/pgindent/pgindent --check --write $files > /dev/null; then
exit 0
  fi
  echo "Commit abandoned. Rerun git commit to adopt pgindent changes"
  exit 1




Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

2023-12-18 Thread Robert Haas
On Tue, Dec 12, 2023 at 8:29 AM Alvaro Herrera  wrote:
> The problem I see is that the group update mechanism is designed around
> contention of the global xact-SLRU control lock; it uses atomics to
> coordinate a single queue when the lock is contended.  So if we split up
> the global SLRU control lock using banks, then multiple processes using
> different bank locks might not contend.  OK, this is fine, but what
> happens if two separate groups of processes encounter contention on two
> different bank locks?  I think they will both try to update the same
> queue, and coordinate access to that *using different bank locks*.  I
> don't see how can this work correctly.
>
> I suspect the first part of that algorithm, where atomics are used to
> create the list without a lock, might work fine.  But will each "leader"
> process, each of which is potentially using a different bank lock,
> coordinate correctly?  Maybe this works correctly because only one
> process will find the queue head not empty?  If this is what happens,
> then there needs to be comments about it.  Without any explanation,
> this seems broken and potentially dangerous, as some transaction commit
> bits might become lost given high enough concurrency and bad luck.

I don't want to be dismissive of this concern, but I took a look at
this part of the patch set and I don't see a correctness problem. I
think the idea of the mechanism is that we have a single linked list
in shared memory that can accumulate those waiters. At some point a
process pops the entire list of waiters, which allows a new group of
waiters to begin accumulating. The process that pops the list must
perform the updates for every process in the just-popped list without
failing, else updates would be lost. In theory, there can be any
number of lists that some process has popped and is currently working
its way through at the same time, although in practice I bet it's
quite rare for there to be more than one. The only correctness problem
is if it's possible for a process that popped the list to error out
before it finishes doing the updates that it "promised" to do by
popping the list.

Having individual bank locks doesn't really change anything here.
That's just a matter of what lock has to be held in order to perform
the update that was promised, and the algorithm described in the
previous paragraph doesn't really care about that. Nor is there a
deadlock hazard here as long as processes only take one lock at a
time, which I believe is the case. The only potential issue that I see
here is with performance. I've heard some questions about whether this
machinery performs well even as it stands, but certainly if we divide
up the lock into a bunch of bankwise locks then that should tend in
the direction of making a mechanism like this less valuable, because
both mechanisms are trying to alleviate contention, and so in a
certain sense they are competing for the same job. However, they do
aim to alleviate different TYPES of contention: the group XID update
stuff should be most valuable when lots of processes are trying to
update the same page, and the banks should be most valuable when there
is simultaneous access to a bunch of different pages. So I'm not
convinced that this patch is a reason to remove the group XID update
mechanism, but someone might argue otherwise.

A related concern is that, if by chance we do end up with multiple
updaters from different pages in the same group, it will now be more
expensive to sort that out because we'll have to potentially keep
switching locks. So that could make the group XID update mechanism
less performant than it is currently. I think that's probably not an
issue because I think it should be a rare occurrence, as the comments
say. A bit more cost in a code path that is almost never taken won't
matter. But if that path is more commonly taken than I think, then
maybe making it more expensive could hurt. It might be worth adding
some debugging to see how often we actually go down that path in a
highly stressed system.

BTW:

+* as well.  The main reason for now allowing requesters of
different pages

now -> not

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: optimize atomic exchanges

2023-12-18 Thread Nathan Bossart
On Fri, Dec 15, 2023 at 04:56:27AM -0800, Andres Freund wrote:
> I don't think we need the inline asm. Otherwise looks good.

Committed with that change.  Thanks for reviewing!  I am going to watch the
buildfarm especially closely for this one.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Add --check option to pgindent

2023-12-18 Thread Tristan Partin

On Mon Dec 18, 2023 at 10:14 AM CST, Jelte Fennema-Nio wrote:

On Mon, 18 Dec 2023 at 13:42, Daniel Gustafsson  wrote:
> I think this is pretty much ready to go, the attached v4 squashes the changes
> and fixes the man-page which also needed an update.  The referenced Wiki page
> will need an edit or two after this goes in, but that's easy enough.

One thing I'm wondering: When both --check and --diff are passed,
should pgindent still early exit with 2 on the first incorrectly
formatted file? Or should it show diffs for all failing files? I'm
leaning towards the latter making more sense.


Makes sense. Let me iterate on the squashed version of the patch.


Related (but not required for this patch): For my pre-commit hook I
would very much like it if there was an option to have pgindent write
the changes to disk, but still exit non-zero, e.g. a --write flag that
could be combined with --check just like --diff and --check can be
combined with this patch. Currently my pre-commit hook need two
separate calls to pgindent to both abort the commit and write the
required changes to disk.


I could propose something. It would help if I had an example of such 
a tool that already exists.


--
Tristan Partin
Neon (https://neon.tech)




Re: micro-optimizing json.c

2023-12-18 Thread Nathan Bossart
On Fri, Dec 08, 2023 at 05:56:20PM -0500, Tom Lane wrote:
> Nathan Bossart  writes:
>> Here are a couple more easy micro-optimizations in nearby code.  I've split
>> them into individual patches for review, but I'll probably just combine
>> them into one patch before committing.
> 
> LGTM

Committed.  Thanks for reviewing!

For the record, I did think about changing appendStringInfoString() into a
macro or an inline function so that any calls with a string literal would
benefit from this sort of optimization, but I was on-the-fence about it
because it requires some special knowledge, i.e., you have to know to
provide string literals to remove the runtime calls to strlen().  Perhaps
this is worth further exploration...

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: psql JSON output format

2023-12-18 Thread Jelte Fennema-Nio
On Mon, 18 Dec 2023 at 16:38, Christoph Berg  wrote:
> We'd want both patches even if they do the same thing on two different
> levels, I'd say.

Makes sense. One thing I was still wondering is if it wouldn't be
easier to wrap all queries in "copy (select whatever) to stdout
(format json)" automatically when the -J flag is passed psql. Because
it would be nice not to have to implement this very similar logic in
two places.

But I guess that approach does not work for commands that don't work
inside COPY, i.e. DML and DDL. I'm assuming your current patch works
fine with DML/DDL. If that's indeed the case then I agree it makes
sense to have this patch. And another big benefit is that it wouldn't
require a new Postgres server function for the json functionality of
psql.




Re: Add --check option to pgindent

2023-12-18 Thread Jelte Fennema-Nio
On Mon, 18 Dec 2023 at 17:14, Jelte Fennema-Nio  wrote:
> One thing I'm wondering: When both --check and --diff are passed,
> should pgindent still early exit with 2 on the first incorrectly
> formatted file? Or should it show diffs for all failing files? I'm
> leaning towards the latter making more sense.

To be clear, in the latter case it would still exit with 2 at the end
of the script, but only after showing diffs for all files.




Re: Add --check option to pgindent

2023-12-18 Thread Jelte Fennema-Nio
On Mon, 18 Dec 2023 at 13:42, Daniel Gustafsson  wrote:
> I think this is pretty much ready to go, the attached v4 squashes the changes
> and fixes the man-page which also needed an update.  The referenced Wiki page
> will need an edit or two after this goes in, but that's easy enough.

One thing I'm wondering: When both --check and --diff are passed,
should pgindent still early exit with 2 on the first incorrectly
formatted file? Or should it show diffs for all failing files? I'm
leaning towards the latter making more sense.

Related (but not required for this patch): For my pre-commit hook I
would very much like it if there was an option to have pgindent write
the changes to disk, but still exit non-zero, e.g. a --write flag that
could be combined with --check just like --diff and --check can be
combined with this patch. Currently my pre-commit hook needs two
separate calls to pgindent to both abort the commit and write the
required changes to disk.




Re: Add --check option to pgindent

2023-12-18 Thread Matthias van de Meent
On Mon, 18 Dec 2023 at 16:45, Tristan Partin  wrote:
>
> On Mon Dec 18, 2023 at 6:41 AM CST, Daniel Gustafsson wrote:
> > > On 15 Dec 2023, at 16:43, Tristan Partin  wrote:
> >
> > > Here is a v3.
> >
> > I think this is pretty much ready to go, the attached v4 squashes the 
> > changes
> > and fixes the man-page which also needed an update.  The referenced Wiki 
> > page
> > will need an edit or two after this goes in, but that's easy enough.
>
> I have never edited the Wiki before. How can I do that? More than happy
> to do it.

As mentioned at the bottom of the main page of the wiki:

  NOTE: due to recent spam activity "editor" privileges are granted
manually for the time being.
  If you just created a new community account or if your current
account used to have "editor" privileges ask on either the PostgreSQL
-www Mailinglist or the PostgreSQL IRC Channel (direct your request to
'pginfra', multiple individuals in the channel highlight on that
string) for help. Please include your community account name in those
requests.

After that, it's just a case of logging in on the wiki (link top right
corner, which uses the community account) and then just go on to your
preferred page, click edit, and do your modifications. Don't forget to
save the modifications - I'm not sure whether the wiki editor's state
is locally persisted.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: Add --check option to pgindent

2023-12-18 Thread Tristan Partin

On Mon Dec 18, 2023 at 7:56 AM CST, Euler Taveira wrote:

On Mon, Dec 18, 2023, at 9:41 AM, Daniel Gustafsson wrote:
> > On 15 Dec 2023, at 16:43, Tristan Partin  wrote:
> 
> > Here is a v3.
> 
> I think this is pretty much ready to go, the attached v4 squashes the changes
> and fixes the man-page which also needed an update.  The referenced Wiki page
> will need an edit or two after this goes in, but that's easy enough.

... and pgbuildfarm client [1] needs to be changed too.


[1] 
https://github.com/PGBuildFarm/client-code/blob/main/PGBuild/Modules/CheckIndent.pm#L55-L90


Thanks Euler. I am on it.

--
Tristan Partin
Neon (https://neon.tech)




Re: Add --check option to pgindent

2023-12-18 Thread Tristan Partin

On Mon Dec 18, 2023 at 6:41 AM CST, Daniel Gustafsson wrote:

> On 15 Dec 2023, at 16:43, Tristan Partin  wrote:

> Here is a v3.

I think this is pretty much ready to go, the attached v4 squashes the changes
and fixes the man-page which also needed an update.  The referenced Wiki page
will need an edit or two after this goes in, but that's easy enough.


I have never edited the Wiki before. How can I do that? More than happy 
to do it.


--
Tristan Partin
Neon (https://neon.tech)




Re: psql JSON output format

2023-12-18 Thread Christoph Berg
Re: Jelte Fennema-Nio
> This seems useful to me too, but my use cases would also be solved (and
> possibly better solved) by adding JSON support to COPY as proposed
> here: 
> https://www.postgresql.org/message-id/flat/CALvfUkBxTYy5uWPFVwpk_7ii2zgT07t3d-yR_cy4sfrrLU%3Dkcg%40mail.gmail.com

Thanks for the pointer, I had not scrolled back enough to see that
thread.

I'm happy to see that this patch is also settling on "array of
objects".

> I'm wondering if your follow-up patch would be better served by that too or 
> not.

I'd need it to work on query results. Which could of course be wrapped
into "copy (select whatever) to stdout (format json)", but doing it in
psql without mangling the query is cleaner. And (see the other mail),
the psql format selection works nicely with existing queries like
`psql -l`.

And "copy format json" wouldn't support \x expanded mode.

We'd want both patches even if they do the same thing on two different
levels, I'd say.

Christoph




Re: A wrong comment about search_indexed_tlist_for_var

2023-12-18 Thread Matt Skelley
The following review has been posted through the commitfest application:
make installcheck-world:  not tested
Implements feature:   tested, passed
Spec compliant:   tested, passed
Documentation:tested, passed

Comment is updated correctly.

Re: psql JSON output format

2023-12-18 Thread Jelte Fennema-Nio
On Mon, 18 Dec 2023 at 15:56, Christoph Berg  wrote:
> I noticed psql was lacking JSON formatting of query results which I
> need for a follow-up patch.

This seems useful to me too, but my use cases would also be solved (and
possibly better solved) by adding JSON support to COPY as proposed
here: 
https://www.postgresql.org/message-id/flat/CALvfUkBxTYy5uWPFVwpk_7ii2zgT07t3d-yR_cy4sfrrLU%3Dkcg%40mail.gmail.com

I'm wondering if your follow-up patch would be better served by that too or not.




Re: psql JSON output format

2023-12-18 Thread Christoph Berg
Re: To PostgreSQL Hackers
> On the command line, the format is selected by `psql --json` and `psql -J`.

Among other uses, it enables easy post-processing of psql output using `jq`:

$ psql -lJ | jq
[
  {
"Name": "myon",
"Owner": "myon",
"Encoding": "UTF8",
"Locale Provider": "libc",
"Collate": "de_DE.utf8",
"Ctype": "de_DE.utf8",
"ICU Locale": null,
"ICU Rules": null,
"Access privileges": null
  },
...
]

Christoph




psql JSON output format

2023-12-18 Thread Christoph Berg
I noticed psql was lacking JSON formatting of query results which I
need for a follow-up patch. It also seems useful generally, so here's
a patch:

postgres=# \pset format json
Output format is json.
postgres=# select * from (values ('one', 2, 'three'), ('four', 5, 'six')) as 
sub(a, b, c);
[
{ "a": "one", "b": "2", "c": "three" },
{ "a": "four", "b": "5", "c": "six" }
]
postgres=# \x
Expanded display is on.
postgres=# select * from (values ('one', 2, 'three'), ('four', 5, 'six')) as 
sub(a, b, c);
[{
  "a": "one",
  "b": "2",
  "c": "three"
},{
  "a": "four",
  "b": "5",
  "c": "six"
}]
postgres=#

Both normal and expanded output format are optimized for readability
while still saving screen space.

Both formats output the same JSON structure, an array of objects.
Other variants like array-of-arrays or line-separated objects
("jsonline") might be possible, but I didn't want to overengineer it.

On the command line, the format is selected by `psql --json` and `psql -J`.
(I'm not attached to the short option, but -J was free and it's in
line with `psql -H` to select HTML.)

Christoph
From 5de0629e3664b981b2a246c480a90636ddfc8dd7 Mon Sep 17 00:00:00 2001
From: Christoph Berg 
Date: Fri, 8 Sep 2023 15:59:29 +0200
Subject: [PATCH] Add JSON output format to psql

Query results are formatted as an array of JSON objects. In non-expanded
mode, one object per line is printed, and in expanded mode, one
key-value pair. Use `\pset format json` to enable, or run `psql --json`
or `psql -J`. NULL values are printed as `null`, independently from the
psql null setting, all other values are printed as quoted strings.
---
 doc/src/sgml/ref/psql-ref.sgml |  26 +-
 src/bin/psql/command.c |   6 +-
 src/bin/psql/help.c|   1 +
 src/bin/psql/startup.c |   6 +-
 src/bin/psql/tab-complete.c|   2 +-
 src/fe_utils/print.c   | 128 +++--
 src/include/fe_utils/print.h   |   1 +
 src/test/regress/expected/psql.out |  67 +++
 src/test/regress/sql/psql.sql  |  25 ++
 9 files changed, 253 insertions(+), 9 deletions(-)

diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml
index cc7d797159..f1f7eda082 100644
--- a/doc/src/sgml/ref/psql-ref.sgml
+++ b/doc/src/sgml/ref/psql-ref.sgml
@@ -274,6 +274,17 @@ EOF
   
 
 
+
+  -J
+  --json
+  
+  
+  Switches to JSON output mode.  This is
+  equivalent to \pset format json.
+  
+  
+
+
 
   -l
   --list
@@ -2956,6 +2967,7 @@ lo_import 152801
   asciidoc,
   csv,
   html,
+  json,
   latex,
   latex-longtable, troff-ms,
   unaligned, or wrapped.
@@ -2972,7 +2984,7 @@ lo_import 152801
   in by other programs, for example, tab-separated or comma-separated
   format.  However, the field separator character is not treated
   specially if it appears in a column's value;
-  so CSV format may be better suited for such
+  the CSV or JSON formats may be better suited for such
   purposes.
   
 
@@ -2997,6 +3009,18 @@ lo_import 152801
   \pset csv_fieldsep.
   
 
+  json format
+  
+   JSON format
+   in psql
+  
+  writes output in JSON format, described in
+  https://www.ietf.org/rfc/rfc4627.txt";>RFC 4627.
+  The result is an array of JSON objects. In non-expanded mode, one
+  object per line is printed, while in expanded mode, each key-value
+  pair is a separate line.
+  
+
   wrapped format is like aligned but wraps
   wide data values across lines to make the output fit in the target
   column width.  The target width is determined as described under
diff --git a/src/bin/psql/command.c b/src/bin/psql/command.c
index 82cc091568..d8d11e9519 100644
--- a/src/bin/psql/command.c
+++ b/src/bin/psql/command.c
@@ -4318,6 +4318,9 @@ _align2string(enum printFormat in)
 		case PRINT_HTML:
 			return "html";
 			break;
+		case PRINT_JSON:
+			return "json";
+			break;
 		case PRINT_LATEX:
 			return "latex";
 			break;
@@ -4408,6 +4411,7 @@ do_pset(const char *param, const char *value, printQueryOpt *popt, bool quiet)
 			{"asciidoc", PRINT_ASCIIDOC},
 			{"csv", PRINT_CSV},
 			{"html", PRINT_HTML},
+			{"json", PRINT_JSON},
 			{"latex", PRINT_LATEX},
 			{"troff-ms", PRINT_TROFF_MS},
 			{"unaligned", PRINT_UNALIGNED},
@@ -4448,7 +4452,7 @@ do_pset(const char *param, const char *value, printQueryOpt *popt, bool quiet)
 			}
 			else
 			{
-pg_log_error("\\pset: allowed formats are aligned, asciidoc, csv, html, latex, latex-longtable, troff-ms, unaligned, wrapped");
+pg_log_error("\\pset: allowed formats are aligned, asciidoc, csv, html, json, latex, latex-longtable, troff-ms, unaligned, wrapped");
 return false;
 			}
 		}
diff --git a/src/bin/psql/help.c b/src/bin/psql/

Re: Detecting some cases of missing backup_label

2023-12-18 Thread Stephen Frost
Greetings,

* Stephen Frost (sfr...@snowman.net) wrote:
> * Andres Freund (and...@anarazel.de) wrote:
> > I recently mentioned to Robert (and also Heikki earlier), that I think I 
> > see a
> > way to detect an omitted backup_label in a relevant subset of the cases 
> > (it'd
> > apply to the pg_control as well, if we moved to that).  Robert encouraged me
> > to share the idea, even though it does not provide complete protection.
> 
> That would certainly be nice.
> 
> > The subset I think we can address is the following:
> > 
> > a) An omitted backup_label would lead to corruption, i.e. without the
> >backup_label we won't start recovery at the right position. Obviously 
> > it'd
> >be better to also catch a wrong procedure when it'd not cause corruption 
> > -
> >perhaps my idea can be extended to handle that, with a small bit of
> >overhead.
> > 
> > b) The backup has been taken from a primary. Unfortunately that probably 
> > can't
> >be addressed - but the vast majority of backups are taken from a primary,
> >so I think it's still a worthwhile protection.
> 
> Agreed that this is a worthwhile set to try and address, even if we
> can't address other cases.
> 
> > Here's my approach
> > 
> > 1) We add a XLOG_BACKUP_START WAL record when starting a base backup on a
> >primary, emitted just *after* the checkpoint completed
> > 
> > 2) When replaying a base backup start record, we create a state file that
> >includes the corresponding LSN in the filename
> > 
> > 3) On the primary, the state file for XLOG_BACKUP_START is *not* created at
> >that time. Instead the state file is created during pg_backup_stop().
> > 
> > 4) When replaying a XLOG_BACKUP_END record, we verify that the state file
> >created by XLOG_BACKUP_START is present, and error out if not.  Backups
> >that started before the redo LSN from backup_label are ignored
> >(necessitates remembering that LSN, but we've been discussing that 
> > anyway).
> > 
> > 
> > Because the backup state file on the primary is only created during
> > pg_backup_stop(), a copy of the data directory taken between 
> > pg_backup_start()
> > and pg_backup_stop() does *not* contain the corresponding "backup state
> > file". Because of this, an omitted backup_label is detected if recovery does
> > not start early enough - recovery won't encounter the XLOG_BACKUP_START 
> > record
> > and thus would not create the state file, leading to an error in 4).
> 
> While I see the idea here, I think, doesn't it end up being an issue if
> things happen like this:
> 
> pg_backup_start -> XLOG_BACKUP_START WAL written -> new checkpoint
> happens -> pg_backup_stop -> XLOG_BACKUP_STOP WAL written -> crash
> 
> In that scenario, we'd go back to the new checkpoint (the one *after*
> the checkpoint that happened before we wrote XLOG_BACKUP_START), start
> replaying, and then hit the XLOG_BACKUP_STOP and then error out, right?
> Even though we're actually doing crash recovery and everything should be
> fine as long as we replay all of the WAL.

Andres and I discussed this in person at PGConf.eu and the idea is that
if we find a XLOG_BACKUP_STOP record then we can check if the state file
was written out and if so then we can conclude that we are doing crash
recovery and not restoring from a backup and therefore we don't error
out.  This also implies that we don't consider PG to be recovered at the
XLOG_BACKUP_STOP point, if the state file exists, but instead we have to
be sure to replay all WAL that's been written.  Perhaps we even
explicitly refuse to use restore_command in this case?

We do error out if we hit a XLOG_BACKUP_STOP and the state file
doesn't exist, as that implies that we started replaying from a point
after a XLOG_BACKUP_START record was written but are working from a copy
of the data directory which didn't include the state file.

Of course, we need to actually implement and test these different cases
to make sure it all works but I'm at least feeling better about the idea
and wanted to share that here.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: "pgoutput" options missing on documentation

2023-12-18 Thread Emre Hasegeli
> Fair enough. I think we should push your first patch only in HEAD as
> this is a minor improvement over the current behaviour. What do you
> think?

I agree.




Re: Simplify newNode()

2023-12-18 Thread Heikki Linnakangas

On 15/12/2023 00:44, Tom Lane wrote:

Good point. Looking closer, modern compilers will actually turn the
MemSetLoop() in MemoryContextAllocZeroAligned() into a call to memset()
anyway! Funny. That is true for recent versions of gcc, clang, and MSVC.

Not here ...


Hmm, according to godbolt, the change happened in GCC version 10.1: 
starting with that release, it is turned into a memset(). On clang, the 
same change happened in version 3.4.1.


I think we have consensus on patch v2. It's simpler and not less 
performant than what we have now, at least on modern compilers. Barring 
objections, I'll commit that.


I'm not planning to spend more time on this, but there might be some 
room for further optimization if someone is interested to do the 
micro-benchmarking. The obvious thing would be to persuade modern 
compilers to not switch to memset() in MemoryContextAllocZeroAligned 
(*), making the old macro logic work the same way it used to on old 
compilers.


Also, instead of palloc0, it might be better for newNode() to call 
palloc followed by memset. That would allow the compiler to partially 
optimize away the memset. Most callers fill at least some of the fields 
after calling makeNode(), so the compiler could generate code that 
clears only the uninitialized fields and padding bytes.


(*) or rather, a new function like MemoryContextAllocZeroAligned but 
without the 'context' argument. We want to keep the savings in the 
callers from eliminating the extra argument.


--
Heikki Linnakangas
Neon (https://neon.tech)





Re: Add --check option to pgindent

2023-12-18 Thread Euler Taveira
On Mon, Dec 18, 2023, at 9:41 AM, Daniel Gustafsson wrote:
> > On 15 Dec 2023, at 16:43, Tristan Partin  wrote:
> 
> > Here is a v3.
> 
> I think this is pretty much ready to go, the attached v4 squashes the changes
> and fixes the man-page which also needed an update.  The referenced Wiki page
> will need an edit or two after this goes in, but that's easy enough.

... and pgbuildfarm client [1] needs to be changed too.


[1] 
https://github.com/PGBuildFarm/client-code/blob/main/PGBuild/Modules/CheckIndent.pm#L55-L90


--
Euler Taveira
EDB   https://www.enterprisedb.com/


Re: Remove MSVC scripts from the tree

2023-12-18 Thread Peter Eisentraut

On 18.12.23 11:49, vignesh C wrote:

Few comments:
1) Now that the MSVC build scripts are removed, should we have the
reference to "MSVC build scripts" here?
ltree.h:


I think this note is correct and can be kept, as it explains the 
historical context.



2) I had seen that if sed/gzip is not available meson build will fail:
2.a)
Program gsed sed found: NO
meson.build:334:6: ERROR: Program 'gsed sed' not found or not executable


Yes, this would need to be improved.  Currently, sed is only required if 
either selinux or dtrace is enabled, which isn't supported on Windows. 
But we should adjust the build scripts to not fail the top-level setup 
run unless those options are enabled.



2.b)
Program gzip found: NO
meson.build:337:7: ERROR: Program 'gzip' not found or not executable


gzip is only required for certain test suites, so again we should adjust 
the build scripts to not fail the build but instead skip the tests as 
appropriate.






Re: Add new for_each macros for iterating over a List that do not require ListCell pointer

2023-12-18 Thread Jelte Fennema-Nio
The more I think about it and look at the code, the more I like the
usage of the loop style proposed in the previous 0003 patch (which
automatically declares a loop variable for the scope of the loop using
a second for loop).

I did some testing on godbolt.org and both versions of the macros
result in the same assembly when compiling with -O2 (and even -O1)
when compiling with ancient versions of gcc (5.1) and clang (3.0):
https://godbolt.org/z/WqfTbhe4e

So attached is now an updated patchset that only includes these even
easier to use foreach macros. I also updated some of the comments and
moved modifying foreach_delete_current and foreach_current_index to
their own commit.

On Thu, 14 Dec 2023 at 16:54, Jelte Fennema-Nio  wrote:
>
> On Fri, 1 Dec 2023 at 05:20, Nathan Bossart  wrote:
> > Could we simplify it with something like the following?
>
> Great suggestion! Updated the patchset accordingly.
>
> This made it also easy to change the final patch to include the
> automatic scoped declaration logic for all of the new macros. I quite
> like how the calling code changes to not have to declare the variable.
> But it's definitely a larger divergence from the status quo than
> without patch 0003. So I'm not sure if it's desired.
>
> Finally, I also renamed the functions to use foreach instead of
> for_each, since based on this thread that seems to be the generally
> preferred naming.


v7-0001-Add-macros-for-looping-through-a-list-without-nee.patch
Description: Binary data


v7-0003-Use-new-foreach_xyz-macros-in-a-few-places.patch
Description: Binary data


v7-0002-Make-some-macros-that-used-cell-work-with-new-for.patch
Description: Binary data


Re: Postgres picks suboptimal index after building of an extended statistics

2023-12-18 Thread Alexander Korotkov
Hi!

I'd like to get this subject off the ground.  The problem originally
described in [1] obviously comes from wrong selectivity estimation.
 "Dependencies" extended statistics lead to significant selectivity miss
24/1000 instead of 1/1000.  When the estimation is correct, the PostgreSQL
optimizer is capable of choosing the appropriate unique index for the query.

Tom pointed out in [2] that this might be a problem of "Dependencies"
extended statistics.  I've rechecked this.  The dataset consists of two
parts.  The first part defined in the CREATE TABLE statement contains
independent values.  The second part defined in the INSERT statement
contains dependent values.  "Dependencies" extended statistics estimate
dependency degree as a fraction of rows containing dependent values.
According to this definition, it correctly estimates the dependency degree
as about 0.5 for all the combinations.  So, the error in the estimate comes
from the synergy of two factors: MCV estimation of the frequency of
individual values, and spreading of average dependency degree for those
values, which is not relevant for them.  I don't think there is a way to
fix "dependencies" extended statistics because it works exactly as
designed.  I have to note that instead of fixing "dependencies" extended
statistics you can just add multi-column MCV statistics, which would fix
the problem.

CREATE STATISTICS aestat(dependencies,ndistinct,mcv) ON x,y,z FROM a;

Independently on how well our statistics work, it looks pitiful that we
couldn't fix that using the knowledge of unique constraints on the table.
Despite statistics, which give us just more or less accurate estimates, the
constraint is something we really enforce and thus can rely on.  The
patches provided by Andrei in [1], [3], and [4] fix this issue at the index
scan path level.  As Thomas pointed out in [5], that could lead to
inconsistency between the number of rows used for unique index scan
estimation and the value displayed in EXPLAIN (and used for other paths).
Even though this was debated in [6], I can confirm this inconsistency.
Thus, with the patch published in [4], I can see the 28-row estimation with
a unique index scan.

  QUERY PLAN
---
 Index Only Scan using a_pkey on a  (cost=0.28..8.30 rows=28 width=12)
   Index Cond: ((x = 1) AND (y = 1) AND (z = 1))
(2 rows)

Also, there is a set of patches [7], [8], and [9], which makes the
optimizer consider path selectivity as long as path costs during the path
selection.  I've rechecked that none of these patches could resolve the
original problem described in [1].  Also, I think they are quite tricky.
The model of our optimizer assumes that paths in the list should be the
different ways of getting the same result.  If we choose the paths by their
selectivity, that breaks this model.  I don't say there is no way for
this.  But if we do this, that would require significant rethinking of our
optimizer model and possible revision of a significant part of it.  Anyway,
I think if there is still interest in this, that should be moved into a
separate thread to keep this thread focused on the problem described in [1].

Finally, I'd like to note that the issue described in [1] is mostly the
selectivity estimation problem.  It could be solved by adding the
multi-column MCV statistics.  The patches published so far look more like
hacks for particular use cases rather than appropriate solutions.  It still
looks promising to me to use the knowledge of unique constraints during
selectivity estimation [10].  Even though it's hard to implement and
possibly implies some overhead, it fits the current model.  I also think
unique constraints could probably be used in some way to improve estimates
even when there is no full match.

Links.
1.
https://www.postgresql.org/message-id/0ca4553c-1f34-12ba-9122-44199d1ced41%40postgrespro.ru
2. https://www.postgresql.org/message-id/3119052.1657231656%40sss.pgh.pa.us
3.
https://www.postgresql.org/message-id/90a1d6ef-c777-b95d-9f77-0065ad4522df%40postgrespro.ru
4.
https://www.postgresql.org/message-id/a5a18d86-c0e5-0ceb-9a18-be1beb2d2944%40postgrespro.ru
5.
https://www.postgresql.org/message-id/f8044836-5d61-a4e0-af82-5821a2a1f0a7%40enterprisedb.com
6.
https://www.postgresql.org/message-id/90a1d6ef-c777-b95d-9f77-0065ad4522df%40postgrespro.ru
7.
https://www.postgresql.org/message-id/2df148b5-0bb8-f80b-ac03-251682fab585%40postgrespro.ru
8.
https://www.postgresql.org/message-id/6fb43191-2df3-4791-b307-be754e648276%40postgrespro.ru
9.
https://www.postgresql.org/message-id/154f786a-06a0-4fb1-b8a4-16c66149731b%40postgrespro.ru
10.
https://www.postgresql.org/message-id/f8044836-5d61-a4e0-af82-5821a2a1f0a7%40enterprisedb.com

--
Regards,
Alexander Korotkov


Re: "pgoutput" options missing on documentation

2023-12-18 Thread Amit Kapila
On Mon, Dec 18, 2023 at 1:08 PM Emre Hasegeli  wrote:
>
> > I found the existing error code appropriate because for syntax
> > specification, either we need to mandate this at the grammar level or
> > at the API level. Also, I think we should give a message similar to an
> > existing message: "publication_names parameter missing". For example,
> > we can say, "proto_version parameter missing". BTW, I also don't like
> > the other changes parse_output_parameters() done in 0001, if we want
> > to improve all the similar messages there are other places in the code
> > as well, so we can separately make the case for the same.
>
> Okay, I am changing these back.  I think we should keep the word
> "option".  It is used on other error messages.
>

Fair enough. I think we should push your first patch only in HEAD as
this is a minor improvement over the current behaviour. What do you
think?

-- 
With Regards,
Amit Kapila.




Re: planner chooses incremental but not the best one

2023-12-18 Thread Tomas Vondra



On 12/18/23 11:40, Richard Guo wrote:
> 
> On Mon, Dec 18, 2023 at 7:31 AM Tomas Vondra
> mailto:tomas.von...@enterprisedb.com>>
> wrote:
> 
> Oh! Now I see what you meant by using the new formula in 84f9a35e3
> depending on how we sum tuples. I agree that seems like the right thing.
> 
> I'm not sure it'll actually help with the issue, though - if I apply the
> patch, the plan does not actually change (and the cost changes just a
> little bit).
> 
> I looked at this only very briefly, but I believe it's due to the
> assumption of independence I mentioned earlier - we end up using the new
> formula introduced in 84f9a35e3, but it assumes the
> selectivity and number of groups are independent. But that's not the
> case here, because the groups are very clearly correlated (with the
> condition on "b").
> 
> 
> You're right.  The patch allows us to adjust the estimate of distinct
> values for appendrels using the new formula introduced in 84f9a35e3.
> But if the restrictions are correlated with the grouping expressions,
> the new formula does not behave well.  That's why the patch does not
> help in case [1], where 'b' and 'c' are correlated.
> 
> OTOH, if the restrictions are not correlated with the grouping
> expressions, the new formula would perform quite well.  And in this case
> the patch would help a lot, as shown in [2] where estimate_num_groups()
> gives a much more accurate estimate with the help of this patch.
> 
> So this patch could be useful in certain situations.  I'm wondering if
> we should at least have this patch (if it is right).
>

I do agree the patch seems to do the right thing, and it's worth pushing
on its own.

> 
> If that's the case, I'm not sure how to fix this :-(
> 
> 
> The commit message of 84f9a35e3 says
> 
>     This could possibly be improved upon in the future by identifying
>     correlated restrictions and using a hybrid of the old and new
>     formulae.
> 
> Maybe this is something we can consider trying.  But anyhow this is not
> an easy task I suppose.

Yeah, if it was easy, it'd have been done in 84f9a35e3 already ;-)

The challenge is where to get usable information about correlation
between columns. I only have a couple of very rough ideas of what we might
try. For example, if we have multi-column ndistinct statistics, we might
look at ndistinct(b,c) and ndistinct(b,c,d) and deduce something from

ndistinct(b,c,d) / ndistinct(b,c)

If we know how many distinct values we have for the predicate column, we
could then estimate the number of groups. I mean, we know that for the
restriction "WHERE b = 3" we only have 1 distinct value, so we could
estimate the number of groups as

1 * ndistinct(b,c)

I'm well aware this is only a very trivial example, and for more complex
examples it's likely way more complicated. But hopefully it illustrates
the general idea.

The other idea would be to maybe look at multi-column MCV, and try using
it to deduce cross-column correlation too (it could be more accurate for
arbitrary predicates).

And finally, we might try being a bit more pessimistic and look at what
the "worst case" behavior could be. So for example instead of trying to
estimate the real number of groups, we'd ask "What's the minimum number
of groups we're likely to get?". And use that to cost the incremental
sort. But I don't think we do this elsewhere, and I'm not sure we want
to start with this risk-based approach here. It might be OK, because we
usually expect the incremental sort to be much cheaper, ...

Is this something you would be interested in exploring? I don't have
capacity to work on this myself, but I'd be available for reviews,
feedback and so on.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Add --check option to pgindent

2023-12-18 Thread Daniel Gustafsson
> On 15 Dec 2023, at 16:43, Tristan Partin  wrote:

> Here is a v3.

I think this is pretty much ready to go, the attached v4 squashes the changes
and fixes the man-page which also needed an update.  The referenced Wiki page
will need an edit or two after this goes in, but that's easy enough.

--
Daniel Gustafsson



v4-0001-Rename-non-destructive-modes-in-pgindent.patch
Description: Binary data


RE: Synchronizing slots from primary to standby

2023-12-18 Thread Zhijie Hou (Fujitsu)
On Monday, December 11, 2023 5:31 PM shveta malik  
wrote:
> 
> On Thu, Dec 7, 2023 at 1:33 PM Peter Smith 
> wrote:
> >
> > Hi.
> >
> > Here are my review comments for patch v43-0002.
> >
> 
> > ==
> > src/backend/access/transam/xlogrecovery.c
> >
> > 13.
> > + /*
> > + * Shutdown the slot sync workers to prevent potential conflicts between
> > + * user processes and slotsync workers after a promotion. Additionally,
> > + * drop any slots that have initiated but not yet completed the sync
> > + * process.
> > + */
> > + ShutDownSlotSync();
> > + slotsync_drop_initiated_slots();
> > +
> >
> > Is this where maybe the 'sync_state' should also be updated for
> > everything so you are not left with confusion about different states
> > on a node that is no longer a standby node?
> >
> 
> yes, this is the place. But this needs more thought as it may cause
> too much disk activity during promotion. so let me analyze and come
> back.

Per off-list discussion with Amit.

I think it's fine to keep both READY and NONE on a primary. Because even if we
update the sync_state from READY to NONE on promotion, it doesn't reduce the
complexity of handling the READY and NONE states. And it's not
straightforward to choose the right place to update sync_state; here is the
analysis:

(related steps on promotion)
1 (patch) shutdown slotsync worker
2 (patch) drop 'i' state slots.
3 remove standby.signal and recovery.signal
4 switch to a new timeline and write the timeline history file 
5 set SharedRecoveryState = RECOVERY_STATE_DONE which means 
RecoveryInProgress() will return false.

We could not update the sync_state before step 3 because if the update fails 
after
updating some of the slots' states, then the server will be shut down, leaving some
'NONE' state slots. After restarting, the server is still a standby so the slot
sync worker will fail to sync these 'NONE' state slots.

We also could not update it after step 3 and before step 4, because if any ERROR
occurs when updating, then after restarting the server, although the server will
become a master (as standby.signal is removed), it can still be made an
active standby again by creating a standby.signal file, because the timeline has not
been switched. And in this case, the slot sync worker will also fail to sync
these 'NONE' state slots.

Updating the sync_state after step 4 and before step 5 is OK, but still
It doesn't simplify the handling for both READY and NONE state slots.
Therefore, I think we can retain the READY state slots after promotion as they
can provide information about the slot's origin. I added some comments around
the slotsync cleanup code (in FinishWalRecovery) to mention the reason.

Best Regards,
Hou zj



Re: [DOC] Introducing Quick Start Guide to PL/pgSQL and PL/Python Documentation

2023-12-18 Thread Euler Taveira
On Sat, Dec 16, 2023, at 7:49 AM, Ishaan Adarsh wrote:
> I have made some documentation enhancements for PL/pgSQL and PL/Python 
> sections. The changes include the addition of a Quick Start Guide to 
> facilitate a smoother onboarding experience for users.

Great! Add your patch to the next CF [1] so we don't miss it.

[1] https://commitfest.postgresql.org/46/


--
Euler Taveira
EDB   https://www.enterprisedb.com/


Re: Add a perl function in Cluster.pm to generate WAL

2023-12-18 Thread Euler Taveira
On Mon, Dec 18, 2023, at 2:39 AM, Bharath Rupireddy wrote:
> Rebase needed, attached v3 patch.

I think you don't understand the suggestion proposed by Michael and Kyotaro. If
you do a comparison with the following SQL commands:

euler=# select pg_walfile_name(pg_current_wal_lsn());
 pg_walfile_name  
--
00010040
(1 row)

euler=# select pg_logical_emit_message(true, 'prefix', 'message4');
pg_logical_emit_message 
-
0/40A8
(1 row)

euler=# select pg_switch_wal();
pg_switch_wal 
---
0/40F0
(1 row)

euler=# create table cc (b int);
CREATE TABLE
euler=# drop table cc;
DROP TABLE
euler=# select pg_switch_wal();
pg_switch_wal 
---
0/41017C88
(1 row)

euler=# select pg_walfile_name(pg_current_wal_lsn());
 pg_walfile_name  
--
00010041
(1 row)

You get

$ pg_waldump 00010040
rmgr: Standby len (rec/tot): 50/50, tx:  0, lsn: 
0/4028, prev 0/3F0001C0, desc: RUNNING_XACTS nextXid 295180 
latestCompletedXid 295179 oldestRunningXid 295180
rmgr: LogicalMessage len (rec/tot): 65/65, tx: 295180, lsn: 
0/4060, prev 0/4028, desc: MESSAGE transactional, prefix "prefix"; 
payload (8 bytes): 6D 65 73 73 61 67 65 34
rmgr: Transaction len (rec/tot): 46/46, tx: 295180, lsn: 
0/40A8, prev 0/4060, desc: COMMIT 2023-12-18 08:35:06.821322 -03
rmgr: XLOG    len (rec/tot): 24/24, tx:  0, lsn: 
0/40D8, prev 0/40A8, desc: SWITCH 

$ pg_waldump 00010041
rmgr: Standby len (rec/tot): 50/50, tx:  0, lsn: 
0/4128, prev 0/40D8, desc: RUNNING_XACTS nextXid 295181 
latestCompletedXid 295180 oldestRunningXid 295181
rmgr: Storage len (rec/tot): 42/42, tx:  0, lsn: 
0/4160, prev 0/4128, desc: CREATE base/33287/88102
rmgr: Heap2   len (rec/tot): 60/60, tx: 295181, lsn: 
0/4190, prev 0/4160, desc: NEW_CID rel: 1663/33287/1247, tid: 14/16, 
cmin: 0, cmax: 4294967295, combo: 4294967295
rmgr: Heap    len (rec/tot): 54/  3086, tx: 295181, lsn: 
0/41D0, prev 0/4190, desc: INSERT off: 16, flags: 0x00, blkref #0: rel 
1663/33287/1247 blk 14 FPW
rmgr: Btree   len (rec/tot): 53/  5133, tx: 295181, lsn: 
0/41000CE0, prev 0/41D0, desc: INSERT_LEAF off: 252, blkref #0: rel 
1663/33287/2703 blk 2 FPW
.
.
.
rmgr: Btree   len (rec/tot): 72/72, tx: 295181, lsn: 
0/41016E48, prev 0/41014F00, desc: INSERT_LEAF off: 111, blkref #0: rel 
1663/33287/2674 blk 7
rmgr: Heap2   len (rec/tot): 69/69, tx: 295181, lsn: 
0/41016E90, prev 0/41016E48, desc: PRUNE snapshotConflictHorizon: 295177, 
nredirected: 0, ndead: 7, nunused: 0, redirected: [], dead: [20, 21, 22, 23, 
24, 26, 27], unused: [], blkref #0: rel 1663/33287/1249 blk 17
rmgr: Transaction len (rec/tot):385/   385, tx: 295181, lsn: 
0/41016ED8, prev 0/41016E90, desc: INVALIDATION ; inval msgs: catcache 80 
catcache 79 catcache 80 catcache 79 catcache 55 catcache 54 catcache 7 catcache 
6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 
catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 snapshot 2608 relcache 
88102
rmgr: Standby len (rec/tot): 42/42, tx: 295181, lsn: 
0/41017060, prev 0/41016ED8, desc: LOCK xid 295181 db 33287 rel 88102 
rmgr: Transaction len (rec/tot):405/   405, tx: 295181, lsn: 
0/41017090, prev 0/41017060, desc: COMMIT 2023-12-18 08:35:22.342122 -03; inval 
msgs: catcache 80 catcache 79 catcache 80 catcache 79 catcache 55 catcache 54 
catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 
catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 
snapshot 2608 relcache 88102
rmgr: Standby len (rec/tot): 42/42, tx: 295182, lsn: 
0/41017228, prev 0/41017090, desc: LOCK xid 295182 db 33287 rel 88102 
rmgr: Heap2   len (rec/tot): 61/61, tx: 295182, lsn: 
0/41017258, prev 0/41017228, desc: PRUNE snapshotConflictHorizon: 295177, 
nredirected: 0, ndead: 3, nunused: 0, redirected: [], dead: [9, 12, 15], 
unused: [], blkref #0: rel 1663/33287/2608 blk 3
rmgr: Heap2   len (rec/tot): 60/60, tx: 295182, lsn: 
0/41017298, prev 0/41017258, desc: NEW_CID rel: 1663/33287/1247, tid: 14/17, 
cmin: 4294967295, cmax: 0, combo: 4294967295
rmgr: Heap    len (rec/tot): 54/54, tx: 295182, lsn: 
0/410172D8, prev 0/41017298, desc: DELETE xmax: 295182, off: 17, infobits: 
[KEYS_UPDATED], flags: 0x00, blkref #0: rel 1663/33287/1247 blk 14
.
.
.
rmgr: Heap2   len (rec/tot): 60/60, tx: 295182, lsn: 
0/410178D8, prev 0/410178A0, desc: NEW_CID rel: 1663/33287/2608, tid: 3/24, 
cmin: 4294967295, cmax: 2, combo: 4294967295
rmgr: Heap    len (rec/tot): 54/54, tx: 295182, lsn: 
0/41017918, prev 0/410178D8, desc: DELETE

Re: Synchronizing slots from primary to standby

2023-12-18 Thread Amit Kapila
On Fri, Dec 15, 2023 at 11:03 AM shveta malik  wrote:
>
> Sorry, I missed attaching the patch. PFA v48.
>

Few comments on v48_0002

1.
+static void
+slotsync_reread_config(WalReceiverConn *wrconn)
{
...
+ pfree(old_primary_conninfo);
+ pfree(old_primary_slotname);
+
+ if (restart)
+ {
+ char*msg = "slot sync worker will restart because of a parameter change";
+
+ /*
+ * The exit code 1 will make postmaster restart the slot sync worker.
+ */
+ slotsync_worker_exit(msg, 1 /* proc_exit code */ );
+ }
...

I don't see the need to explicitly pfree in case we are already
exiting the process because anyway the memory will be released. We can
avoid using the 'restart' variable for this. Also, directly exiting here
(and at the other place where this function is used) probably makes sense.
I see that in maybe_reread_subscription(), we exit with a 0 code
and the apply worker still restarts, so why use a different exit code
here?

2.
+static void
+check_primary_info(WalReceiverConn *wrconn, bool *am_cascading_standby)
{
...
+ remote_in_recovery = DatumGetBool(slot_getattr(tupslot, 1, &isnull));
+ Assert(!isnull);
+
+ /* No need to check further, return that we are cascading standby */
+ if (remote_in_recovery)
+ {
+ *am_cascading_standby = true;
+ CommitTransactionCommand();
+ return;
...
}

Don't we need to clear the result and tuple in case of early return?

3. It would be a good idea to mention about requirements like a
physical slot on primary, hot_standby_feedback, etc. in the commit
message.

4.
+static bool
+wait_for_primary_slot_catchup(WalReceiverConn *wrconn, RemoteSlot *remote_slot,
+   bool *wait_attempts_exceeded)
{
...
+ tupslot = MakeSingleTupleTableSlot(res->tupledesc, &TTSOpsMinimalTuple);
+ if (!tuplestore_gettupleslot(res->tuplestore, true, false, tupslot))
+ {
+ ereport(WARNING,
+ (errmsg("slot \"%s\" creation aborted", remote_slot->name),
+ errdetail("This slot was not found on the primary server")));
...
+ /*
+ * It is possible to get null values for LSN and Xmin if slot is
+ * invalidated on the primary server, so handle accordingly.
+ */
+ new_invalidated = DatumGetBool(slot_getattr(tupslot, 1, &isnull));
+ Assert(!isnull);
+
+ new_restart_lsn = DatumGetLSN(slot_getattr(tupslot, 2, &isnull));
+ if (new_invalidated || isnull)
+ {
+ ereport(WARNING,
+ (errmsg("slot \"%s\" creation aborted", remote_slot->name),
+ errdetail("This slot was invalidated on the primary server")));
...
}

a. The errdetail message should end with a full stop. Please check all
other errdetail messages in the patch to follow the same guideline.
b. I think saying slot creation aborted is not completely correct
because we would have created the slot especially when it is in 'i'
state. Can we change it to something like: "aborting initial sync for
slot \"%s\""?
c. Also, if the remote_slot is invalidated, ideally, we can even drop
the local slot but it seems that the patch will drop the same before
the next-sync cycle with any other slot that needs to be dropped. If
so, can we add the comment to indicate the same?

5.
+static void
+local_slot_update(RemoteSlot *remote_slot)
+{
+ Assert(MyReplicationSlot->data.invalidated == RS_INVAL_NONE);
+
+ LogicalConfirmReceivedLocation(remote_slot->confirmed_lsn);
+ LogicalIncreaseXminForSlot(remote_slot->confirmed_lsn,
+remote_slot->catalog_xmin);
+ LogicalIncreaseRestartDecodingForSlot(remote_slot->confirmed_lsn,
+   remote_slot->restart_lsn);
+
+ SpinLockAcquire(&MyReplicationSlot->mutex);
+ MyReplicationSlot->data.invalidated = remote_slot->invalidated;
+ SpinLockRelease(&MyReplicationSlot->mutex);
...
...

If required, the invalidated flag is updated in the caller as well, so
why do we need to update it here as well?

-- 
With Regards,
Amit Kapila.




Re: Remove MSVC scripts from the tree

2023-12-18 Thread vignesh C
On Wed, 6 Dec 2023 at 12:59, Michael Paquier  wrote:
>
> On Wed, Dec 06, 2023 at 12:15:50PM +0530, Shubham Khanna wrote:
> > Patch is not applying. Please share the Rebased Version. Please find the 
> > error:
>
> Thanks.  Here you go with a v6.

Few comments:
1) Now that the MSVC build scripts are removed, should we have the
reference to "MSVC build scripts" here?
ltree.h:
.
/*
 * LOWER_NODE used to be defined in the Makefile via the compile flags.
 * However the MSVC build scripts neglected to do the same which resulted in
 * MSVC builds not using LOWER_NODE.  Since then, the MSVC scripts have been
 * modified to look for -D compile flags in Makefiles, so here, in order to
 * get the historic behavior of LOWER_NODE not being defined on MSVC, we only
 * define it when not building in that environment.  This is important as we
 * want to maintain the same LOWER_NODE behavior after a pg_upgrade.
 */
#ifndef _MSC_VER
#define LOWER_NODE
#endif
.

2) I had seen that if sed/gzip is not available meson build will fail:
2.a)
Program gsed sed found: NO
meson.build:334:6: ERROR: Program 'gsed sed' not found or not executable

2.b)
Program gzip found: NO
meson.build:337:7: ERROR: Program 'gzip' not found or not executable

Should we mention sed and gzip here?
+  
+   Bison and
+Flex
+   
+   
+Bison and
Flex are
+required.  Only Bison versions 2.3 and later
+will work. Flex must be version
2.5.35 or later.
+   

Regards,
Vignesh




Re: planner chooses incremental but not the best one

2023-12-18 Thread Richard Guo
On Mon, Dec 18, 2023 at 7:31 AM Tomas Vondra 
wrote:

> Oh! Now I see what you meant by using the new formula in 84f9a35e3
> depending on how we sum tuples. I agree that seems like the right thing.
>
> I'm not sure it'll actually help with the issue, though - if I apply the
> patch, the plan does not actually change (and the cost changes just a
> little bit).
>
> I looked at this only very briefly, but I believe it's due to the
> assumption of independence I mentioned earlier - we end up using the new
> formula introduced in 84f9a35e3, but it assumes it assumes the
> selectivity and number of groups are independent. But that'd not the
> case here, because the groups are very clearly correlated (with the
> condition on "b").


You're right.  The patch allows us to adjust the estimate of distinct
values for appendrels using the new formula introduced in 84f9a35e3.
But if the restrictions are correlated with the grouping expressions,
the new formula does not behave well.  That's why the patch does not
help in case [1], where 'b' and 'c' are correlated.

OTOH, if the restrictions are not correlated with the grouping
expressions, the new formula would perform quite well.  And in this case
the patch would help a lot, as shown in [2] where estimate_num_groups()
gives a much more accurate estimate with the help of this patch.

So this patch could be useful in certain situations.  I'm wondering if
we should at least have this patch (if it is right).


> If that's the case, I'm not sure how to fix this :-(


The commit message of 84f9a35e3 says

This could possibly be improved upon in the future by identifying
correlated restrictions and using a hybrid of the old and new
formulae.

Maybe this is something we can consider trying.  But anyhow this is not
an easy task I suppose.

[1]
https://www.postgresql.org/message-id/CAMbWs4-FtsJ0dQUv6g%3D%3DXR_gsq%3DFj9oiydW6gbqwQ_wrbU0osw%40mail.gmail.com
[2]
https://www.postgresql.org/message-id/CAMbWs4-ocromEKMtVDH3RBMuAJQaQDK0qi4k6zOuvpOnMWZauw%40mail.gmail.com

Thanks
Richard


Re: Move walreceiver state assignment (to WALRCV_STREAMING) in WalReceiverMain()

2023-12-18 Thread Drouvot, Bertrand

Hi,

On 12/13/23 3:33 PM, Michael Paquier wrote:

On Tue, Dec 12, 2023 at 04:54:32PM -0300, Euler Taveira wrote:

Couldn't it give up before starting if you apply your patch? My main concern is
due to a slow system, the walrcv_connect() took too long in WalReceiverMain()
and the code above kills the walreceiver while it is still in the process of starting.
Since you cannot control the hardcoded WALRCV_STARTUP_TIMEOUT value, you might
have issues during overload periods.


Sounds like a fair point to me, 


Thanks for looking at it! I'm not sure about it, see my comment in [1].


Another thing that I'm a bit surprised with is why it would be OK to
switch the status to STREAMING only when first_stream is set, discarding
the restart case.


Yeah, that looks like a miss on my side. Thanks for pointing out!

Please find attached v2 addressing this remark.

[1]: 
https://www.postgresql.org/message-id/c76c0a65-f754-4614-b616-1d48f9195745%40gmail.com

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

From 12da1b3c6295e2369b3a65c8f4ee40882def310f Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot 
Date: Mon, 11 Dec 2023 20:05:25 +
Subject: [PATCH v2] move walrcv->walRcvState assignment to WALRCV_STREAMING

walrcv->walRcvState = WALRCV_STREAMING was set too early. Move the assignment
to a more appropriate place.
---
 src/backend/replication/walreceiver.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)
 100.0% src/backend/replication/

diff --git a/src/backend/replication/walreceiver.c 
b/src/backend/replication/walreceiver.c
index 26ded928a7..5f612d354e 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -242,7 +242,6 @@ WalReceiverMain(void)
}
/* Advertise our PID so that the startup process can kill us */
walrcv->pid = MyProcPid;
-   walrcv->walRcvState = WALRCV_STREAMING;
 
/* Fetch information required to start streaming */
walrcv->ready_to_display = false;
@@ -412,6 +411,9 @@ WalReceiverMain(void)
options.proto.physical.startpointTLI = startpointTLI;
if (walrcv_startstreaming(wrconn, &options))
{
+   SpinLockAcquire(&walrcv->mutex);
+   walrcv->walRcvState = WALRCV_STREAMING;
+   SpinLockRelease(&walrcv->mutex);
if (first_stream)
ereport(LOG,
(errmsg("started streaming WAL 
from primary at %X/%X on timeline %u",
-- 
2.34.1



Re: Move walreceiver state assignment (to WALRCV_STREAMING) in WalReceiverMain()

2023-12-18 Thread Drouvot, Bertrand

Hi,

On 12/12/23 8:54 PM, Euler Taveira wrote:

On Tue, Dec 12, 2023, at 12:58 PM, Drouvot, Bertrand wrote:

Currently walrcv->walRcvState is set to WALRCV_STREAMING at the
beginning of WalReceiverMain().

But it seems that after this assignment things could be wrong before the
walreicever actually starts streaming (like not being able to connect
to the primary).

It looks to me that WALRCV_STREAMING should be set once walrcv_startstreaming()
returns true: this is the proposal of this patch.


Per the state name (streaming), it seems it should be set later as you
proposed, 


Thanks for looking at it!


however, I'm concerned about the previous state (WALRCV_STARTING). If
I'm reading the code correctly, WALRCV_STARTING is assigned at
RequestXLogStreaming():

     SetInstallXLogFileSegmentActive();
     RequestXLogStreaming(tli, ptr, PrimaryConnInfo,
  PrimarySlotName,
  wal_receiver_create_temp_slot);
     flushedUpto = 0;
     }

     /*
  * Check if WAL receiver is active or wait to start up.
  */
     if (!WalRcvStreaming())
     {
     lastSourceFailed = true;
     break;
     }

After a few lines the function WalRcvStreaming() has:

     SpinLockRelease(&walrcv->mutex);

     /*
  * If it has taken too long for walreceiver to start up, give up. Setting
  * the state to STOPPED ensures that if walreceiver later does start up
  * after all, it will see that it's not supposed to be running and die
  * without doing anything.
  */
     if (state == WALRCV_STARTING)
     {
     pg_time_t   now = (pg_time_t) time(NULL);

     if ((now - startTime) > WALRCV_STARTUP_TIMEOUT)
     {
     bool    stopped = false;

     SpinLockAcquire(&walrcv->mutex);
     if (walrcv->walRcvState == WALRCV_STARTING)
     {
     state = walrcv->walRcvState = WALRCV_STOPPED;
     stopped = true;
     }
     SpinLockRelease(&walrcv->mutex);

     if (stopped)
     ConditionVariableBroadcast(&walrcv->walRcvStoppedCV);
     }
     }

Couldn't it give up before starting if you apply your patch? My main concern is
due to a slow system, the walrcv_connect() took too long in WalReceiverMain()
and the code above kills the walreceiver while it is still in the process of starting.


Yeah, so it looks to me that the sequence of events is:

1) The startup process sets walrcv->walRcvState = WALRCV_STARTING (in 
RequestXLogStreaming())
2) The startup process sets the walrcv->startTime (in RequestXLogStreaming())
3) The startup process then asks the postmaster to start the walreceiver
4) Then the startup process checks if WalRcvStreaming() is true

Note that 3) is not waiting for the walreceiver to actually start: it "just" 
sets
a flag and kills (SIGUSR1) the postmaster (in SendPostmasterSignal()).

It means that if the time between 1 and 4 is <= WALRCV_STARTUP_TIMEOUT (10 
seconds)
then WalRcvStreaming() returns true (even if the walreceiver is not streaming 
yet).

So it looks to me that even if the walreceiver does take time to start 
streaming,
as long as the time between 1 and 4 is <= 10 seconds we are fine.

And I think it's fine because WalRcvStreaming() does not actually "only" check 
that the
walreceiver is streaming but as its comment states:

"
/*
 * Is walreceiver running and streaming (or at least attempting to connect,
 * or starting up)?
 */
"

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Simplify newNode()

2023-12-18 Thread John Naylor
On Fri, Dec 15, 2023 at 5:44 AM Tom Lane  wrote:
>
> I did check that the v1 patch successfully inlines newNode() and
> reduces it to just a MemoryContextAllocZeroAligned call, so it's
> correct that modern compilers do that better than whatever I tested
> in 2008.  But I wonder what is happening in v2 to reduce the code
> size so much.  MemoryContextAllocZeroAligned is not 20kB by itself.

I poked at this a bit and it seems to come from what Heikki said
upthread about fewer instructions before the calls: Running objdump on
v1 and v2 copyfuncs.o and diff'ing shows there are fewer MOV
instructions (some extraneous stuff removed):

e9 da 5f 00 00  jmp    <_copyReindexStmt>
-   48 8b 05 00 00 00 00    mov    rax,QWORD PTR [rip+0x0]
-   be 18 00 00 00  mov    esi,0x18
-   48 8b 38        mov    rdi,QWORD PTR [rax]
-   e8 00 00 00 00  call   MemoryContextAllocZeroAligned-0x4
+   bf 18 00 00 00  mov    edi,0x18
+   e8 00 00 00 00  call   palloc0-0x4

That's 10 bytes savings.

-   48 8b 05 00 00 00 00    mov    rax,QWORD PTR [rip+0x0]
-   48 8b 38        mov    rdi,QWORD PTR [rax]
-   e8 00 00 00 00  call   MemoryContextAllocZeroAligned-0x4
+   e8 00 00 00 00  call   palloc0-0x4

...another 10 bytes. Over and over again.

Because of the size differences, the compiler is inlining more: e.g.
in v1 _copyFieldStore has 4 call sites, but in v2 it got inlined.

About the patch, I'm wondering if this whitespace is intentional, but
it's otherwise straightforward:

--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -132,6 +132,7 @@ typedef struct Node

 #define nodeTag(nodeptr) (((const Node*)(nodeptr))->type)

+
 /*




Re: remaining sql/json patches

2023-12-18 Thread jian he
Hi! another minor issue I found:

+SELECT pg_get_expr(adbin, adrelid)
+FROM pg_attrdef
+WHERE adrelid = 'test_jsonb_constraints'::regclass
+ORDER BY 1;
+
+SELECT pg_get_expr(adbin, adrelid) FROM pg_attrdef WHERE adrelid =
'test_jsonb_constraints'::regclass;

I think these two queries are the same. Why do we test it twice?




Re: Transaction timeout

2023-12-18 Thread Andrey M. Borodin


> On 18 Dec 2023, at 14:32, Japin Li  wrote:
> 
> 
> Thanks for updating the patch

Sorry for the noise, but commitfest bot found one more bug in handling 
statement timeout. PFA v11.


Best regards, Andrey Borodin.


v11-0001-Introduce-transaction_timeout.patch
Description: Binary data




Re: Transaction timeout

2023-12-18 Thread Japin Li


On Mon, 18 Dec 2023 at 13:49, Andrey M. Borodin  wrote:
>> On 16 Dec 2023, at 05:58, Japin Li  wrote:
>>
>>
>> On Fri, 15 Dec 2023 at 17:51, Andrey M. Borodin  wrote:
 On 8 Dec 2023, at 15:29, Japin Li  wrote:

 Thanks for updating the patch. LGTM.
>>>
>>> PFA v9. Changes:
>>> 1. Added tests for idle_in_transaction_timeout
>>> 2. Suppress statement_timeout if it’s shorter than transaction_timeout
>>>
>> +   if (StatementTimeout > 0
>> +   && IdleInTransactionSessionTimeout < TransactionTimeout)
>>   ^
>>
>> Should be StatementTimeout?
> Yes, that’s an oversight. I’ve adjusted tests so they catch this problem.
>
>> Maybe we should add documentation to describe this behavior.
>
> I've added a paragraph about it to config.sgml, but I'm not sure about the 
> comprehensiveness of the wording.
>

Thanks for updating the patch, no objections.

--
Regrads,
Japin Li
ChengDu WenWu Information Technology Co., Ltd.




Re: ALTER COLUMN ... SET EXPRESSION to alter stored generated column's expression

2023-12-18 Thread Peter Eisentraut

On 11.12.23 13:22, Amul Sul wrote:


create table t1 (a int, b int generated always as (a + 1) stored);
alter table t1 add column c int, alter column b set expression as (a
+ c);
ERROR:  42703: column "c" does not exist

I think intuitively, this ought to work.  Maybe just moving the new
pass
after AT_PASS_ADD_COL would do it.


I think we can't support that (like alter type) since we need to place this new
pass before AT_PASS_OLD_INDEX & AT_PASS_OLD_CONSTR to re-add indexes and
constraints for the validation.


Could we have AT_PASS_ADD_COL before AT_PASS_OLD_*?  So overall it would be

...
AT_PASS_ALTER_TYPE,
AT_PASS_ADD_COL, // moved
AT_PASS_SET_EXPRESSION,  // new
AT_PASS_OLD_INDEX,
AT_PASS_OLD_CONSTR,
AT_PASS_ADD_CONSTR,
...

This appears to not break any existing tests.





Re: trying again to get incremental backup

2023-12-18 Thread Peter Eisentraut

Another set of comments, about the patch that adds pg_combinebackup:

Make sure all the options are listed in a consistent order.  We have 
lately changed everything to be alphabetical.  This includes:


- reference page pg_combinebackup.sgml

- long_options listing

- getopt_long() argument

- subsequent switch

- (--help output, but it looks ok as is)

Also, in pg_combinebackup.sgml, the option --sync-method is listed as if 
it does not take an argument, but it does.






Re: introduce dynamic shared memory registry

2023-12-18 Thread Nikita Malakhov
Hi!

This patch looks like a good solution for a pain in the ass, and I'm also in
favor of this patch being committed.
I have looked through the code and agree with Andrei: the code looks good.
Just a suggestion - maybe it is worth adding a function for detaching the
segment,
for cases when we unload and/or re-load the extension?

--
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/


Re: introduce dynamic shared memory registry

2023-12-18 Thread Andrei Lepikhov

On 18/12/2023 13:39, Andrei Lepikhov wrote:

On 5/12/2023 10:46, Nathan Bossart wrote:

I don't presently have any concrete plans to use this for anything, but I
thought it might be useful for extensions for caching, etc. and wanted to
see whether there was any interest in the feature.


I am delighted that you commenced this thread.
When designing extensions, I feel pain every time a single shared value or 
global statistic forces the extension to be loadable only at startup. That 
requirement reduces the flexibility of even very lightweight extensions and 
makes them harder to use in a cloud environment.


After looking into the code, I have some comments:
1. The code looks good; I didn't find possible mishaps. Some proposed 
changes are in the attachment.
2. I think a separate file for this feature is overkill; judging by the 
gist of the code, it is really a part of the DSA module.
3. The dsm_registry_init_or_attach routine allocates a DSM segment. Why 
not create a dsa_area for the requestor and return it?
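The create-or-attach discipline in init_dsm_registry (quick local exit, then a lock so that exactly one process creates the shared table while the rest attach) can be sketched with threads standing in for backends - a toy model of the pattern, not the patch's code:

```c
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t registry_lock = PTHREAD_MUTEX_INITIALIZER;
static int	shared_handle = -1; /* plays the role of DSHASH_HANDLE_INVALID */
static int	n_creates = 0;

/* Every "backend" calls this; only the first caller creates the shared
 * table, everyone else attaches to the existing handle. */
int
registry_init_or_attach(void)
{
	int			h;

	pthread_mutex_lock(&registry_lock);
	if (shared_handle == -1)
	{
		shared_handle = 42;		/* stand-in for dsa_create() + dshash_create() */
		n_creates++;
	}
	h = shared_handle;			/* stand-in for attaching to the handle */
	pthread_mutex_unlock(&registry_lock);
	return h;
}

static void *
backend(void *arg)
{
	(void) arg;
	return (void *) (long) registry_init_or_attach();
}
```

The asserts added to the real init_dsm_registry in the attached diff check exactly the invariant this models: once the local pointer (here, the returned handle) is set, the shared state must already exist.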


--
regards,
Andrei Lepikhov
Postgres Professional
diff --git a/src/backend/storage/ipc/dsm_registry.c 
b/src/backend/storage/ipc/dsm_registry.c
index ea80f45716..0343ce987f 100644
--- a/src/backend/storage/ipc/dsm_registry.c
+++ b/src/backend/storage/ipc/dsm_registry.c
@@ -45,8 +45,8 @@ static const dshash_parameters dsh_params = {
LWTRANCHE_DSM_REGISTRY_HASH
 };
 
-static dsa_area *dsm_registry_dsa;
-static dshash_table *dsm_registry_table;
+static dsa_area *dsm_registry_dsa = NULL;
+static dshash_table *dsm_registry_table = NULL;
 
 static void init_dsm_registry(void);
 
@@ -83,13 +83,20 @@ init_dsm_registry(void)
 {
/* Quick exit if we already did this. */
if (dsm_registry_table)
+   {
+   Assert(dsm_registry_dsa != NULL);
return;
+   }
+
+   Assert(dsm_registry_dsa == NULL);
 
/* Otherwise, use a lock to ensure only one process creates the table. 
*/
LWLockAcquire(DSMRegistryLock, LW_EXCLUSIVE);
 
if (DSMRegistryCtx->dshh == DSHASH_HANDLE_INVALID)
{
+   Assert(DSMRegistryCtx->dsah == DSHASH_HANDLE_INVALID);
+
/* Initialize dynamic shared hash table for registry. */
dsm_registry_dsa = dsa_create(LWTRANCHE_DSM_REGISTRY_DSA);
dsa_pin(dsm_registry_dsa);
@@ -102,6 +109,8 @@ init_dsm_registry(void)
}
else
{
+   Assert(DSMRegistryCtx->dsah != DSHASH_HANDLE_INVALID);
+
/* Attach to existing dynamic shared hash table. */
dsm_registry_dsa = dsa_attach(DSMRegistryCtx->dsah);
dsa_pin_mapping(dsm_registry_dsa);
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 665d471418..e0e7b3b765 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -209,7 +209,7 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_LAUNCHER_HASH,
LWTRANCHE_DSM_REGISTRY_DSA,
LWTRANCHE_DSM_REGISTRY_HASH,
-   LWTRANCHE_FIRST_USER_DEFINED
+   LWTRANCHE_FIRST_USER_DEFINED,
 }  BuiltinTrancheIds;
 
 /*