Re: pg_upgrade and logical replication

2023-03-03 Thread Amit Kapila
On Thu, Mar 2, 2023 at 4:21 PM Julien Rouhaud  wrote:
>
> On Thu, Mar 02, 2023 at 03:47:53PM +0530, Amit Kapila wrote:
> >
> > Why don't we try to support the direct upgrade of logical replication
> > nodes? Have you tried to analyze what are the obstacles and whether we
> > can have solutions for those? For example, one of the challenges is to
> > support the upgrade of slots, can we copy (from the old cluster) and
> > recreate them in the new cluster by resetting LSNs? We can also reset
> > origins during the upgrade of subscribers and recommend to first
> > upgrade the subscriber node.
>
> I'm not sure I get your question.  This whole thread is about direct upgrade
> of logical replication nodes, at least the subscribers, and what is currently
> preventing it.
>

It is only about subscribers and nothing about publishers.

> For the publisher nodes, that may be something nice to support (I'm assuming
> it could be useful for more complex replication setups) but I'm not interested
> in that at the moment as my goal is to reduce downtime for major upgrade of
> physical replica, thus *not* doing pg_upgrade of the primary node, whether
> physical or logical.  I don't see why it couldn't be done later on, if/when
> someone has a use case for it.
>

I thought there would be value in providing a way to upgrade both
publisher and subscriber. Now, you have come up with a use case linking
it to a physical replica, where allowing an upgrade of only subscriber
nodes is useful. It is possible that users will find your steps easy to
perform and not error-prone, but it may be better to get some validation
of that. I haven't yet analyzed all the steps in detail, but let's see
what others think.

-- 
With Regards,
Amit Kapila.




Re: Move defaults toward ICU in 16?

2023-03-03 Thread Jeff Davis
On Thu, 2023-03-02 at 10:37 +0100, Peter Eisentraut wrote:
> I would skip this.  There was a brief discussion about this at [0], 
> where I pointed out that if we are going to do something like that, 
> there would be other candidates among the optional dependencies to 
> promote, such as certainly openssl and arguably lz4.  If we don't do 
> this consistently across dependencies, then there will be confusion.

The difference is that ICU affects semantics of collations, and
collations are not really an optional feature. If postgres is built
without ICU, that will affect the default at initdb time (after patch
003, anyway), which will then affect the default collations in all
databases.
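
For what it's worth, the provider behind each database's default collation is
visible directly in the catalog, so the effect of that initdb-time choice is
easy to check (just an illustration, nothing new here):

  -- 'c' = libc, 'i' = ICU
  SELECT datname, datlocprovider FROM pg_database;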

> In practice, I don't think it matters.  Almost all installations are 
> made by packagers, who will make their own choices.

Mostly true, but the discussion at [0] reveals that some people do
build postgresql themselves for whatever reason.

When I first started out with postgres I always built from source. That
was quite a while ago, so maybe that means nothing; but it would be sad
to think that the build-it-yourself experience doesn't matter.

> >    0002: update template0 in new cluster (as described above)
> 
> This makes sense.  I'm confused what the code originally wanted to 
> achieve, e.g.,
> 
> -/*
> - * Check that every database that already exists in the new cluster
> is
> - * compatible with the corresponding database in the old one.
> - */
> -static void
> -check_databases_are_compatible(void)
> 
> Was there once support for the new cluster having additional databases
> in place?  Weird.

It looks like 33755e8edf was the last significant change here. CC
Heikki for comment.

> In any case, I think you can remove additional code from get_db_infos()
> related to fields that are no longer used, such as db_encoding, and the
> corresponding struct fields in DbInfo.

You're right: there's not much of an intersection between the code that
needs a DbInfo and the code that needs the locale fields. I created a
separate DbLocaleInfo struct for the template0 locale information, and
removed the locale fields from DbInfo.

I also added a TAP test.

> >    0003: default initdb to use ICU
> 
> What are the changes in the citext tests about?

There's a test in citext_utf8.sql for:

  SELECT 'i'::citext = 'İ'::citext AS t;

citext_eq uses DEFAULT_COLLATION_OID, comparing the results after
applying lower(). Apparently:

  lower('İ' collate "en_US") = 'i' -- true
  lower('İ' collate "tr-TR-x-icu") = 'i' -- true
  lower('İ' collate "en-US-x-icu") = 'i' -- false

The test was passing before because the comparison seems to be true for all
libc locales. But for ICU, it seems to be true only in the "tr-TR" locale at
colstrength=secondary (and true for other ICU locales at
colstrength=primary).
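
Restating that as a runnable query, for anyone who wants to see it directly
(assuming an ICU-enabled build and that these collations exist in the
database):

  SELECT lower('İ' COLLATE "en_US")       = 'i' AS libc_en_us,  -- true
         lower('İ' COLLATE "tr-TR-x-icu") = 'i' AS icu_tr_tr,   -- true
         lower('İ' COLLATE "en-US-x-icu") = 'i' AS icu_en_us;   -- false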

We can't fix the test by being explicit about the collation, because
citext doesn't pass it down; it always uses the default collation. We
could fix citext to pass it down properly, but that seems like a
different patch.

In any case, citext doesn't seem very important to those using ICU (we
have a doc note suggesting ICU instead), so I don't see a strong reason
to test the combination. So, I just exit the test early if it's ICU. I
added a better comment.


>   Is it the same issue as in unaccent?  In that case, the OR should be an
> AND?  Maybe add a comment?
> 
> Why is unaccent "broken" if the default collation is provided by ICU
> and LC_CTYPE=C?  Is that a general problem?  Should we prevent this
> combination?

A different issue: unaccent is calling t_isspace(), which is just not
properly handling locales. t_isspace() always passes NULL as the last
argument to char2wchar. There are TODO comments throughout that file.

Specifically what happens:
  lc_ctype_is_c(DEFAULT_COLLATION_OID) returns false
  so it calls char2wchar(), which calls mbstowcs()
  which returns an error because the LC_CTYPE=C

Right now, that's a longstanding issue for all users of t_isspace() and
related functions: tsearch, ltree, pg_trgm, dict_xsyn, and unaccent. I
assume it was known and considered unimportant, otherwise we wouldn't
have left the TODO comments in there.

I believe it's only a problem when the provider is ICU and the LC_CTYPE
is C. I think a quick fix would be to just test LC_CTYPE directly (from
the environment or setlocale(LC_CTYPE, NULL)) rather than try to
extract it from the default collation. It sounds like a separate patch,
and should be handled as a bugfix and backported.

A better fix would be to support character classification in ICU. I
don't think that's hard, but ICU has quite a few options, and we'd need
to discuss which are the right ones to support. We may also want to
pass collation information down rather than just using the database
default, but that may depend on the caller and we should discuss that,
as well.

> What are the changes in the ecpg tests about?  Looks harmless, but if
> there is a need, maybe it should be 

Re: Simplify standby state machine a bit in WaitForWALToBecomeAvailable()

2023-03-03 Thread Bharath Rupireddy
On Sat, Mar 4, 2023 at 8:14 AM Bharath Rupireddy
 wrote:
>
> On Sat, Mar 4, 2023 at 8:00 AM Michael Paquier  wrote:
> >
> > On Fri, Mar 03, 2023 at 04:33:39PM -0800, Nathan Bossart wrote:
> > > Given both Bharath and I missed this, perhaps we should add a comment
> > > about this behavior.
> >
> > Makes sense to me to document that in a better way.  What about the
> > addition of a short paragraph at the top of XLogFileReadAnyTLI() that
> > explains the behaviors we expect depending on the values of
> > XLogSource?  The case of XLOG_FROM_ANY with the order to scan the
> > archives then pg_wal/ for each timeline is the most important bit,
> > surely.
>
> +1. I will send a patch in a bit.

Okay, here's a patch attached.

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


v1-0001-Clarify-XLogFileReadAnyTLI-s-behaviour-for-XLOG_F.patch
Description: Binary data


Re: Simplify standby state machine a bit in WaitForWALToBecomeAvailable()

2023-03-03 Thread Bharath Rupireddy
On Sat, Mar 4, 2023 at 8:00 AM Michael Paquier  wrote:
>
> On Fri, Mar 03, 2023 at 04:33:39PM -0800, Nathan Bossart wrote:
> > Given both Bharath and I missed this, perhaps we should add a comment about
> > this behavior.
>
> Makes sense to me to document that in a better way.  What about the
> addition of a short paragraph at the top of XLogFileReadAnyTLI() that
> explains the behaviors we expect depending on the values of
> XLogSource?  The case of XLOG_FROM_ANY with the order to scan the
> archives then pg_wal/ for each timeline is the most important bit,
> surely.

+1. I will send a patch in a bit.

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Let libpq reject unexpected authentication requests

2023-03-03 Thread Michael Paquier
On Tue, Feb 28, 2023 at 03:38:21PM -0800, Jacob Champion wrote:
> 0001 and 0002 are the core features. 0003 is a more future-looking
> refactoring of the internals, to make it easier to handle more SASL
> mechanisms, but it's not required and contains some unexercised code.

I was refreshing my mind with 0001 yesterday, and except for the two
parts where we need to worry about AUTH_REQ_OK being sent too early
and the business with gssenc, it is rather straightforward.  It
also looks like the participants of the thread are OK with the
design you are proposing (list of keywords, potentially negative
patterns).  I think that I can get this part merged for this CF, at
least; not sure about the rest :p
--
Michael




Re: Simplify standby state machine a bit in WaitForWALToBecomeAvailable()

2023-03-03 Thread Michael Paquier
On Fri, Mar 03, 2023 at 04:33:39PM -0800, Nathan Bossart wrote:
> Given both Bharath and I missed this, perhaps we should add a comment about
> this behavior.

Makes sense to me to document that in a better way.  What about the
addition of a short paragraph at the top of XLogFileReadAnyTLI() that
explains the behaviors we expect depending on the values of
XLogSource?  The case of XLOG_FROM_ANY with the order to scan the
archives then pg_wal/ for each timeline is the most important bit,
surely.
--
Michael




Re: RADIUS tests and improvements

2023-03-03 Thread Thomas Munro
New improved version:

* fixed stupid misuse of PG_FINALLY() (oops, must have been thinking
of another language)
* realised that it was strange to have a GUC for the timeout, and made
a new HBA parameter instead
* added documentation for that
* used TimestampDifferenceMilliseconds() instead of open-coded TimestampTz maths

I don't exactly love the PG_TRY()/PG_CATCH() around the
CHECK_FOR_INTERRUPTS().  In fact this kind of CFI-with-cleanup problem
has been haunting me across several projects.  For cases that memory
contexts and resource owners can't help with, I don't currently know
what else to do here.  Better ideas welcome.  If I just let that
socket leak because I know this backend will soon exit, I'd expect a
knock at the door from the programming police.

I don't actually know why we have
src/test/authentication/t/...{password,sasl,peer}..., but then
src/test/{kerberos,ldap,ssl}/t/001_auth.pl.  For this one, I just
copied the second style, creating src/test/radius/t/001_auth.pl.  I
can't explain why it should be like that, though.  If I propose
another test for PAM, where should it go?
From d7f57dfe4a316c8ac1270a5f35f837447e335153 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Sat, 31 Dec 2022 14:41:57 +1300
Subject: [PATCH v2 1/3] Add simple test for RADIUS authentication.

This is similar to the existing tests for ldap and kerberos.  It
requires FreeRADIUS to be installed, and opens ports that may be
considered insecure, so users have to opt in with PG_TEST_EXTRA=radius.

Discussion: https://postgr.es/m/CA%2BhUKGKxNoVjkMCksnj6z3BwiS3y2v6LN6z7_CisLK%2Brv%2B0V4g%40mail.gmail.com
---
 src/test/Makefile |   2 +-
 src/test/meson.build  |   1 +
 src/test/radius/Makefile  |  23 +
 src/test/radius/meson.build   |  12 +++
 src/test/radius/t/001_auth.pl | 187 ++
 5 files changed, 224 insertions(+), 1 deletion(-)
 create mode 100644 src/test/radius/Makefile
 create mode 100644 src/test/radius/meson.build
 create mode 100644 src/test/radius/t/001_auth.pl

diff --git a/src/test/Makefile b/src/test/Makefile
index dbd3192874..687164412c 100644
--- a/src/test/Makefile
+++ b/src/test/Makefile
@@ -12,7 +12,7 @@ subdir = src/test
 top_builddir = ../..
 include $(top_builddir)/src/Makefile.global
 
-SUBDIRS = perl regress isolation modules authentication recovery subscription
+SUBDIRS = perl regress isolation modules authentication recovery radius subscription
 
 ifeq ($(with_icu),yes)
 SUBDIRS += icu
diff --git a/src/test/meson.build b/src/test/meson.build
index 5f3c9c2ba2..b5da17b531 100644
--- a/src/test/meson.build
+++ b/src/test/meson.build
@@ -5,6 +5,7 @@ subdir('isolation')
 
 subdir('authentication')
 subdir('recovery')
+subdir('radius')
 subdir('subscription')
 subdir('modules')
 
diff --git a/src/test/radius/Makefile b/src/test/radius/Makefile
new file mode 100644
index 00..56768a3ca9
--- /dev/null
+++ b/src/test/radius/Makefile
@@ -0,0 +1,23 @@
+#-
+#
+# Makefile for src/test/radius
+#
+# Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# src/test/radius/Makefile
+#
+#-
+
+subdir = src/test/radius
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+check:
+	$(prove_check)
+
+installcheck:
+	$(prove_installcheck)
+
+clean distclean maintainer-clean:
+	rm -rf tmp_check
diff --git a/src/test/radius/meson.build b/src/test/radius/meson.build
new file mode 100644
index 00..ea7afc4555
--- /dev/null
+++ b/src/test/radius/meson.build
@@ -0,0 +1,12 @@
+# Copyright (c) 2022, PostgreSQL Global Development Group
+
+tests += {
+  'name': 'radius',
+  'sd': meson.current_source_dir(),
+  'bd': meson.current_build_dir(),
+  'tap': {
+'tests': [
+  't/001_auth.pl',
+],
+  },
+}
diff --git a/src/test/radius/t/001_auth.pl b/src/test/radius/t/001_auth.pl
new file mode 100644
index 00..44db62a3d7
--- /dev/null
+++ b/src/test/radius/t/001_auth.pl
@@ -0,0 +1,187 @@
+
+# Copyright (c) 2021-2022, PostgreSQL Global Development Group
+
+# Debian: apt-get install freeradius
+# Homebrew: brew install freeradius-server
+# FreeBSD: pkg install freeradius3
+# MacPorts: port install freeradius
+
+use strict;
+use warnings;
+use File::Copy;
+use PostgreSQL::Test::Utils;
+use PostgreSQL::Test::Cluster;
+use Test::More;
+
+my $radiusd_dir = "${PostgreSQL::Test::Utils::tmp_check}/radiusd_data";
+my $radiusd_conf = "radiusd.conf";
+my $radiusd_users = "users.txt";
+my $radiusd_prefix;
+my $radiusd;
+
+if ($ENV{PG_TEST_EXTRA} !~ /\bradius\b/)
+{
+	plan skip_all => 'Potentially unsafe test RADIUS not enabled in PG_TEST_EXTRA';
+}
+elsif ($^O eq 'freebsd')
+{
+	$radiusd = '/usr/local/sbin/radiusd';
+}
+elsif ($^O eq 'linux' && -f '/usr/sbin/freeradius')
+{
+	$radiusd = 

Re: Simplify standby state machine a bit in WaitForWALToBecomeAvailable()

2023-03-03 Thread Nathan Bossart
On Fri, Mar 03, 2023 at 01:38:38PM +0530, Bharath Rupireddy wrote:
> On Wed, Mar 1, 2023 at 1:46 PM Michael Paquier  wrote:
>> Well, did you notice 4d894b41?  It introduced this change:
>>
>> -   readFile = XLogFileReadAnyTLI(readSegNo, DEBUG2, currentSource);
>> +   readFile = XLogFileReadAnyTLI(readSegNo, DEBUG2,
>> +       currentSource == XLOG_FROM_ARCHIVE ? XLOG_FROM_ANY :
>> +       currentSource);
>>
>> And this patch basically undoes that, meaning that we would basically
>> look at the archives first for all the expected TLIs, but only if no
>> files were found in pg_wal/.
>>
>> The change is subtle, see XLogFileReadAnyTLI().  On HEAD we go through
>> each timeline listed and check both archives and then pg_wal/ after
>> the last source that failed was the archives.  The patch does
>> something different: it checks all the timelines for the archives,
>> then all the timelines in pg_wal/ with two separate calls to
>> XLogFileReadAnyTLI().
> 
> Thanks. Yeah, the patch proposed here just reverts that commit [1]
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=4d894b41cd12179b710526eba9dc62c2b99abc4d.
> 
> That commit fixes an issue - "If there is a WAL segment with same ID
> but different TLI present in both the WAL archive and pg_xlog, prefer
> the one with higher TLI.".

Given both Bharath and I missed this, perhaps we should add a comment about
this behavior.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: generate_series for timestamptz and time zone problem

2023-03-03 Thread Andreas 'ads' Scherbaum

On 31/01/2023 08:50, Gurjeet Singh wrote:

On Mon, Jan 30, 2023 at 4:07 PM Tom Lane  wrote:

Gurjeet Singh  writes:

[ generate_series_with_timezone.v6.patch ]

The cfbot isn't terribly happy with this.  It looks like UBSan
is detecting some undefined behavior.  Possibly an uninitialized
variable?

It was a classic case of out-of-bounds access. I was trying to
access the 4th argument even in the case where the 3-argument variant of
generate_series() was called.

Please see attached v7 of the patch. It now checks PG_NARGS() before
accessing the optional parameter.

This mistake would've been caught early if there were assertions
preventing access beyond the number of arguments passed to the
function. I'll send the assert_enough_args.patch, that adds these
checks, in a separate thread to avoid potentially confusing cfbot.


Tested this patch on current head.
The patch applies, with a few offsets.

Functionality-wise it works as documented; I also tried with
"America/New_York" and "Europe/Berlin" as the time zone.
The included tests cover both an entire year (including a new year) and
a DST switch (date_add() for 2021-10-31 in Europe/Warsaw, which is
the date the country switches to standard time).
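
For reference, a sketch of the kind of query this enables (the extra time
zone argument is the one proposed by the patch, so treat the signatures as
the patch's, not released PostgreSQL's):

  SET TimeZone = 'UTC';
  SELECT date_add('2021-10-30 00:00:00+02'::timestamptz,
                  '1 day'::interval, 'Europe/Warsaw');
  SELECT * FROM generate_series('2021-10-30 00:00:00+02'::timestamptz,
                                '2021-11-02 00:00:00+01'::timestamptz,
                                '1 day'::interval, 'Europe/Warsaw');
  -- The step across 2021-10-31 should follow Europe/Warsaw's switch to
  -- standard time rather than the session's time zone.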


Minor nitpick: the texts use both "time zone" and "timezone".


Regards,

--
Andreas 'ads' Scherbaum
German PostgreSQL User Group
European PostgreSQL User Group - Board of Directors
Volunteer Regional Contact, Germany - PostgreSQL Project





Re: running logical replication as the subscription owner

2023-03-03 Thread Jelte Fennema
I'm definitely in favor of making it easier to use logical replication
in a safe manner. In Citus we need to replicate logically, and we're
currently using some quite nasty and undocumented hacks to do so:
we create a subscription per table owner, where each subscription
is owned by a temporary user that has the same permissions as the
table owner. These temporary users start out as superusers, because
otherwise we cannot make them subscription owners, but once a
subscription is assigned to them we take away the superuser permissions
from them[1]. We also need to hook into ALTER/DROP SUBSCRIPTION
commands to make sure that these temporary owners cannot edit their
own subscription[2].
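
In SQL terms the dance looks roughly like this (a simplified sketch with
made-up names, not the literal Citus code, and it glosses over cloning the
table owner's permissions):

  CREATE USER citus_temp_subscription_owner SUPERUSER;
  CREATE SUBSCRIPTION citus_move_sub
      CONNECTION 'host=source-node dbname=app'
      PUBLICATION citus_move_pub
      WITH (connect = false);
  ALTER SUBSCRIPTION citus_move_sub OWNER TO citus_temp_subscription_owner;
  ALTER ROLE citus_temp_subscription_owner NOSUPERUSER;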

Getting this right was not easy. And even then it has the serious downside
that we need multiple subscriptions/replication slots, which causes
extra complexity in various ways and eats much more aggressively
into the replication slot limits than we'd like. Having one
subscription that could safely apply into tables owned by multiple
users would make this sooo much easier.

[1]: 
https://github.com/citusdata/citus/blob/main/src/backend/distributed/replication/multi_logical_replication.c#L1487-L1573
[2]: 
https://github.com/citusdata/citus/blob/main/src/backend/distributed/commands/utility_hook.c#L455-L487




Re: Raising the SCRAM iteration count

2023-03-03 Thread Daniel Gustafsson
> On 27 Feb 2023, at 08:06, Michael Paquier  wrote:

> +   conn->scram_sha_256_iterations = atoi(value);
> +   }
> This should match on "scram_iterations", which is the name of the
> GUC.

Fixed.

> Would the long-term plan be to use multiple variables in conn if
> we ever get to : that would require more parsing?

I personally don't think we'll see more than 2 or at most 3 values so parsing
that format shouldn't be a problem, but it can always be revisited if/when we
get there.

> Perhaps there should be a test with \password to make sure that libpq
> gets the call when the GUC is updated by a SET command?

That would indeed be nice, but is there a way to do this without a complicated
pump TAP expression?  I was unable to think of a way but I might be missing
something?

--
Daniel Gustafsson



v6-0001-Make-SCRAM-iteration-count-configurable.patch
Description: Binary data


Re: Making empty Bitmapsets always be NULL

2023-03-03 Thread Tom Lane
David Rowley  writes:
> On Fri, 3 Mar 2023 at 15:17, Tom Lane  wrote:
>> Another point here is that I'm pretty sure that just about all
>> bitmapsets we deal with are only one or two words, so I'm not
>> convinced you're going to get any performance win to justify
>> the added management overhead.

> It's true that the majority of Bitmapsets are going to be just 1 word,
> but it's important to acknowledge that we do suffer in some more
> extreme cases when Bitmapsets become large. Partitioning with large
> numbers of partitions is one such case.

Maybe, but optimizing for that while pessimizing every other case
doesn't sound very attractive from here.  I think we need some
benchmarks on normal-size bitmapsets before considering doing much
in this area.

Also, if we're going to make any sort of changes here it'd probably
behoove us to make struct Bitmapset private in bitmapset.c, so that
we can have confidence that nobody is playing games with them.
I had a go at that and was pleasantly surprised to find that
actually nobody has; the attached passes check-world.  It'd probably
be smart to commit this as a follow-on to 00b41463c, whether or not
we go any further.

Also, given that we do this, I don't think that check_bitmapset_invariants
as you propose it is worth the trouble.  The reason we've gone to such
lengths with checking List invariants is that initially we had a large
number of places doing creative and not-too-structured things with Lists,
plus we've made several absolutely fundamental changes to that data
structure.  Despite the far larger bug surface, I don't recall that those
invariant checks ever found anything after the initial rounds of changes.
So I don't buy that there's an argument for a similarly expensive set
of checks here.  bitmapset.c is small enough that we should be able to
pretty much prove it correct by eyeball.

regards, tom lane

diff --git a/src/backend/nodes/Makefile b/src/backend/nodes/Makefile
index af12c64878..a6eb2a87bb 100644
--- a/src/backend/nodes/Makefile
+++ b/src/backend/nodes/Makefile
@@ -54,7 +54,6 @@ node_headers = \
 	commands/trigger.h \
 	executor/tuptable.h \
 	foreign/fdwapi.h \
-	nodes/bitmapset.h \
 	nodes/extensible.h \
 	nodes/lockoptions.h \
 	nodes/miscnodes.h \
diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index 7ba3cf635b..c11daeb303 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -24,6 +24,34 @@
 #include "port/pg_bitutils.h"
 
 
+/*
+ * Data representation
+ *
+ * Larger bitmap word sizes generally give better performance, so long as
+ * they're not wider than the processor can handle efficiently.  We use
+ * 64-bit words if pointers are that large, else 32-bit words.
+ */
+#if SIZEOF_VOID_P >= 8
+
+#define BITS_PER_BITMAPWORD 64
+typedef uint64 bitmapword;		/* must be an unsigned type */
+typedef int64 signedbitmapword; /* must be the matching signed type */
+
+#else
+
+#define BITS_PER_BITMAPWORD 32
+typedef uint32 bitmapword;		/* must be an unsigned type */
+typedef int32 signedbitmapword; /* must be the matching signed type */
+
+#endif
+
+struct Bitmapset
+{
+	NodeTag		type;			/* it's a Node */
+	int			nwords;			/* number of words in array */
+	bitmapword	words[FLEXIBLE_ARRAY_MEMBER];	/* really [nwords] */
+};
+
 #define WORDNUM(x)	((x) / BITS_PER_BITMAPWORD)
 #define BITNUM(x)	((x) % BITS_PER_BITMAPWORD)
 
diff --git a/src/backend/nodes/gen_node_support.pl b/src/backend/nodes/gen_node_support.pl
index ecbcadb8bf..fd5d61bfd4 100644
--- a/src/backend/nodes/gen_node_support.pl
+++ b/src/backend/nodes/gen_node_support.pl
@@ -65,7 +65,6 @@ my @all_input_files = qw(
   commands/trigger.h
   executor/tuptable.h
   foreign/fdwapi.h
-  nodes/bitmapset.h
   nodes/extensible.h
   nodes/lockoptions.h
   nodes/miscnodes.h
@@ -132,6 +131,19 @@ my @special_read_write;
 # node types we don't want any support functions for, just node tags
 my @nodetag_only;
 
+# Nodes with custom copy/equal implementations are skipped from
+# .funcs.c but need case statements in .switch.c.
+my @custom_copy_equal;
+
+# Similarly for custom read/write implementations.
+my @custom_read_write;
+
+# Similarly for custom query jumble implementation.
+my @custom_query_jumble;
+
+# Track node types with manually assigned NodeTag numbers.
+my %manual_nodetag_number;
+
 # types that are copied by straight assignment
 my @scalar_types = qw(
   bits32 bool char double int int8 int16 int32 int64 long uint8 uint16 uint32 uint64
@@ -166,18 +178,12 @@ push @no_equal,   qw(List);
 push @no_query_jumble,qw(List);
 push @special_read_write, qw(List);
 
-# Nodes with custom copy/equal implementations are skipped from
-# .funcs.c but need case statements in .switch.c.
-my @custom_copy_equal;
-
-# Similarly for custom read/write implementations.
-my @custom_read_write;
-
-# Similarly for custom query jumble implementation.
-my @custom_query_jumble;
-
-# Track node types with manually assigned NodeTag 

Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

2023-03-03 Thread Thomas Munro
On Fri, Feb 24, 2023 at 11:12 PM Anton A. Melnikov  wrote:
> On 17.02.2023 06:21, Thomas Munro wrote:
> > BTW there are at least two other places where PostgreSQL already knows
> > that concurrent reads and writes are possibly non-atomic (and we also
> > don't even try to get the alignment right, making the question moot):
> > pg_basebackup, which enables full_page_writes implicitly if you didn't
> > have the GUC on already, and pg_rewind, which refuses to run if you
> > haven't enabled full_page_writes explicitly (as recently discussed on
> > another thread recently; that's an annoying difference, and also an
> > annoying behaviour if you know your storage doesn't really need it!)
>
> It seems a good topic for a separate thread patch. Would you provide a
> link to the thread you mentioned please?

https://www.postgresql.org/message-id/flat/367d01a7-90bb-9b70-4cda-248e81cc475c%40cosium.com

> > Therefore, we need a solution for Windows too.  I tried to write the
> > equivalent code, in the attached.  I learned a few things: Windows
> > locks are mandatory (that is, if you lock a file, reads/writes can
> > fail, unlike Unix).  Combined with async release, that means we
> > probably also need to lock the file in xlog.c, when reading it in
> > xlog.c:ReadControlFile() (see comments).  Before I added that, on a CI
> > run, I found that the read in there would sometimes fail, and adding
> > the locking fixed that.  I am a bit confused, though, because I
> > expected that to be necessary only if someone managed to crash while
> > holding the write lock, which the CI tests shouldn't do.  Do you have
> > any ideas?
>
> Unfortunately, no ideas so far. Was it a pg_control CRC or I/O errors?
> Maybe logs of such a fail were saved somewhere? I would like to see
> them if possible.

I think it was this one:

https://cirrus-ci.com/task/5004082033721344

For example, see subscription/011_generated which failed like this:

2023-02-16 06:57:25.724 GMT [5736][not initialized] PANIC:  could not
read file "global/pg_control": Permission denied

That was fixed after I changed it to also do locking in xlog.c
ReadControlFile(), in the version you tested.  There must be something
I don't understand going on, because that cluster wasn't even running
before: it had just been created by initdb.

But, anyway, I have a new idea that makes that whole problem go away
(though I'd still like to understand what happened there):

With the "pseudo-advisory lock for Windows" idea from the last email,
we don't need to bother with file system level locking in many places.
Just in controldata_utils.c, for FE/BE interlocking (that is, we don't
need to use that for interlocking of concurrent reads and writes that
are entirely in the backend, because we already have an LWLock that we
could use more consistently).  Changes:

1. xlog.c mostly uses no locking
2. basebackup.c now acquires ControlFileLock
3. only controldata_utils.c uses the new file locking, for FE-vs-BE interlocking
4. lock past the end (pseudo-advisory locking for Windows)

Note that when we recover from a basebackup or pg_backup_start()
backup, we use the backup label to find a redo start location in the
WAL (see read_backup_label()), BUT we still read the copied pg_control
file (one that might be too "new", so we don't use its redo pointer).
So it had better not be torn, or the recovery will fail.  So, in this
version I protected that sendFile() with ControlFileLock.  But...

Isn't that a bit strange?  To go down this path we would also need to
document the need to copy the control file with the file locked to
avoid a rare failure, in the pg_backup_start() documentation.  That's
annoying (I don't even know how to do it with easy shell commands;
maybe we should provide a program that locks and cats the file...?).
Could we make better use of the safe copy that we have in the log?
Then the pg_backup_start() subproblem would disappear.  Conceptually,
that'd be just like the way we use FPI for data pages copied during a
backup.  I'm not sure about any of that, though, it's just an idea,
not tested.

> > Or maybe the problem is/was too theoretical before...
>
> As far as I understand, this problem has always existed, but the probability
> of hitting it is extremely small in practice, which is pointed out directly in
> the docs [4]:
> "So while it is theoretically a weak spot, pg_control does not seem
> to be a problem in practice."

I guess that was talking about power loss atomicity again?  Knowledge
of the read/write atomicity problem seems to be less evenly
distributed (and I think it became more likely in Linux > 3.something;
and the Windows situation possibly hadn't been examined by anyone
before).

> > Here's a patch like that.
>
> In this patch, the problem is solved for the live database and
> perhaps remains for some possible cases of an external backup. On the whole,
> I think it can be stated that this is a sensible step forward.
>
> Just like last time, I ran a long stress test under Windows 

Re: zstd compression for pg_dump

2023-03-03 Thread Jacob Champion
On Fri, Mar 3, 2023 at 10:55 AM Justin Pryzby  wrote:
> Thanks for looking.  If your zstd library is compiled with thread
> support, could you also try with :workers=N ?  I believe this is working
> correctly, but I'm going to ask for help verifying that...

Unfortunately not (Ubuntu 20.04):

pg_dump: error: could not set compression parameter: Unsupported parameter

But that lets me review the error! I think these error messages should
say which options caused them.

> It'd be especially useful to test under windows, where pgdump/restore
> use threads instead of forking...  If you have a windows environment but
> not set up for development, I think it's possible to get cirrusci to
> compile a patch for you and then retrieve the binaries provided as an
> "artifact" (credit/blame for this idea should be directed to Thomas
> Munro).

I should be able to do that next week.

> > With this particular dataset, I don't see much improvement with
> > zstd:long.
>
> Yeah.  I think this could be because either 1) you already got very good
> compression without looking at more data; and/or 2) the neighboring data
> is already very similar, maybe equally or more similar than the further
> data, from which there's nothing to gain.

What kinds of improvements do you see with your setup? I'm wondering
when we would suggest that people use it.

> I don't want to start exposing lots of fine-grained parameters at this
> point.  In the immediate case, it looks like it may require more than
> just adding another parameter:
>
>   Note: If windowLog is set to larger than 27,
> --long=windowLog or --memory=windowSize needs to be passed to the
> decompressor.

Hm. That would complicate things.

Thanks,
--Jacob




Re: Schema variables - new implementation for Postgres 15

2023-03-03 Thread Dmitry Dolgov
> On Tue, Feb 28, 2023 at 06:12:50AM +0100, Pavel Stehule wrote:
>
> fresh rebase

I'm continuing to review, this time going through shadowing stuff in
transformColumnRef, IdentifyVariable etc. Well, that's a lot of leg work
for rather little outcome :) I guess all attempts to simplify this part
weren't successful?

Couple of questions to it. In IdentifyVariable in the branch handling
two values the commentary says:

/*
 * a.b can mean "schema"."variable" or "variable"."field",
 * Check both variants, and returns InvalidOid with
 * not_unique flag, when both interpretations are
 * possible. Second node can be star. In this case, the
 * only allowed possibility is "variable"."*".
 */

I read this as "variable"."*" is a valid combination, but the very next
part of this condition says differently:

/*
 * Session variables doesn't support unboxing by star
 * syntax. But this syntax have to be calculated here,
 * because can come from non session variables related
 * expressions.
 */
Assert(IsA(field2, A_Star));

Is the first commentary not quite correct?

Another question, about how the shadowing warning should work between namespaces.
Let's say I've got two namespaces, public and test; both have a session
variable with the same name, but only one has a table with the same name:

-- in public
create table test_agg(a int);
create type for_test_agg as (a int);
create variable test_agg for_test_agg;

-- in test
create type for_test_agg as (a int);
create variable test_agg for_test_agg;

Now if we try to trigger the shadowing warning from the public
namespace, it behaves differently in the two cases below:

-- in public
=# let test.test_agg.a = 10;
=# let test_agg.a = 20;
=# set session_variables_ambiguity_warning to on;

-- note the value returned from the table
=# select jsonb_agg(test_agg.a) from test_agg;
WARNING:  42702: session variable "test_agg.a" is shadowed
LINE 1: select jsonb_agg(test_agg.a) from test_agg;
 ^
DETAIL:  Session variables can be shadowed by columns, routine's 
variables and routine's arguments with the same name.
LOCATION:  transformColumnRef, parse_expr.c:940
 jsonb_agg
---
 [1]

-- no warning, note the session variable value
=# select jsonb_agg(test.test_agg.a) from test_agg;
 jsonb_agg
---
 [10]

It happens because in the second scenario the logic inside transformColumnRef
will not set up the node variable (there is no corresponding table in the
"test" schema), and the subsequent conditions covering session variable
shadowing depend on it. Is it supposed to work like this?




Re: SQL JSON path enhanced numeric literals

2023-03-03 Thread Dean Rasheed
On Tue, 28 Feb 2023 at 07:44, Peter Eisentraut
 wrote:
>
> Attached is a patch to add nondecimal integer literals and underscores
> in numeric literals to the SQL JSON path language.  This matches the
> recent additions to the core SQL syntax.  It follows ECMAScript in
> combination with the current SQL draft.
>

I think this new feature ought to be mentioned in the docs somewhere.
Perhaps a sentence or two in the note below table 9.49 would suffice,
since it looks like that's where jsonpath numbers are mentioned for
the first time.
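
For instance, the sort of thing that becomes legal with the patch applied
(today's jsonpath parser rejects all of these, so they are illustrative only):

  SELECT jsonb_path_query('[255]',     '$[*] ? (@ == 0xFF)');
  SELECT jsonb_path_query('[8]',       '$[*] ? (@ == 0o10)');
  SELECT jsonb_path_query('[1000000]', '$[*] ? (@ == 1_000_000)');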

In jsonpath_scan.l, I think the hex/oct/bininteger cases could do with
a comment, such as

/* Non-decimal integers in ECMAScript; must not have underscore after radix */
hexinteger		0[xX]{hexdigit}(_?{hexdigit})*
octinteger		0[oO]{octdigit}(_?{octdigit})*
bininteger		0[bB]{bindigit}(_?{bindigit})*

since that's different from the main lexer's syntax.

Perhaps it's worth mentioning that difference in the docs.

Otherwise, this looks good to me.

Regards,
Dean




Re: Evaluate arguments of correlated SubPlans in the referencing ExprState

2023-03-03 Thread Tom Lane
Andres Freund  writes:
> On 2023-03-02 13:00:31 -0800, Andres Freund wrote:
>> I'm not opposed to EXPR_PARAM_SET, to be clear. I'll send an updated
>> version later. I was just thinking about the correctness in the current
>> world.

> Attached.

I've looked through this, and it looks basically OK so I marked it RfC.
I do have a few nitpicks that you might or might not choose to adopt:

It'd be good to have a header comment for ExecInitExprRec documenting
the arguments, particularly that resv/resnull are where to put the
subplan's eventual result.

You could avoid having to assume ExprState's resvalue/resnull being
safe to use by instead using the target resv/resnull.  This would
require putting those into the EEOP_PARAM_SET step so that
ExecEvalParamSet knows where to fetch from, so maybe it's not an
improvement, but perhaps worth considering.

+   /* type isn't needed, but an old value could be confusing */
+   scratch.d.param.paramtype = InvalidOid;
I'd just store the param's type, rather than justifying why you didn't.
It's cheap enough and even less confusing.

I think that ExecEvalParamSet should either set prm->execPlan to NULL,
or maybe better Assert that it is already NULL.

It's a bit weird to keep this in ExecScanSubPlan, when the code there
no longer depends on it:
+   Assert(list_length(subplan->parParam) == list_length(subplan->args));
I'd put that before the forboth() in ExecInitSubPlanExpr instead,
where it does matter.

regards, tom lane




Re: meson: Optionally disable installation of test modules

2023-03-03 Thread Andrew Dunstan


On 2023-03-03 Fr 01:47, Peter Eisentraut wrote:

On 02.03.23 08:09, Nazir Bilal Yavuz wrote:

On Wed, 1 Mar 2023 at 22:21, Peter Eisentraut
 wrote:


Looks good to me.  I did a small pass over it to adjust some namings.
For example, I renamed test_install_files to test_install_data, so it's
consistent with the overall meson naming:

-install_data(
+test_install_data += files(

Let me know if you have any concerns about this version. Otherwise, I'm
happy to commit it.


That makes sense, thanks!


committed





These changes have broken the buildfarm adaptation work in different 
ways on different platforms.


On Windows (but not Linux), the install_test_files are apparently 
getting installed under runpython in the build directory rather than in 
the tmp_install location, so those tests fail. Meanwhile, it's not clear 
to me how to install them in a standard installation, which means that 
on Linux the corresponding -running tests are failing.



cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com


Re: libpq: PQgetCopyData() and allocation overhead

2023-03-03 Thread Jeroen Vermeulen
Interested, yes.  But there may be reasons not to do that.  Last time I
looked the binary format wasn't documented.

Anyway, I tried re-implementing pqGetCopyData3() using the callback.
Wasn't hard, but I did have to add a way for the callback to return an
error.  Kept it pretty dumb for now, hoping a sensible rule will become
obvious later.

Saw no obvious performance impact, so that's good.


Jeroen

On Fri, 3 Mar 2023 at 19:53, Tom Lane  wrote:

> Jeroen Vermeulen  writes:
> > The printf() is just the simplest example that sprang to mind though.
> > There may be other use-cases out there involving  libraries that require
> > zero-terminated strings, and I figured an ability to set a sentinel could
> > help those.
>
> Well, since it won't help for binary COPY, I'm skeptical that this is
> something we should cater to.  Anybody who's sufficiently worried about
> performance to be trying to remove malloc/free cycles ought to be
> interested in binary COPY as well.
>
> regards, tom lane
>
diff --git a/bench.c b/bench.c
new file mode 100644
index 00..35f8c7a36f
--- /dev/null
+++ b/bench.c
@@ -0,0 +1,132 @@
+/*
+ * Minimal benchmark for PQgetCopyData alternative.
+ *
+ * Define CALL to 0 (to use the classic PQgetCopyData) or 1 (to use the
+ * proposed new function), then run the binary through "time" to get time and
+ * CPU usage stats.
+ *
+ * DO NOT UPSTREAM THIS FILE.  It's just a demonstration for the prototype
+ * patch.
+ */
+#include <stdio.h>
+#include <stdlib.h>
+
+#include <libpq-fe.h>
+
+/* Define CALL to...
+ * 0: Use classic PQgetCopyData()
+ * 1: Use experimental PQhandleCopyData()
+ */
+
+/* Benchmark results (best result per category, out of 4 runs):
+ *
+ * PQgetCopyData:
+ * real - 0m32.972s
+ * user - 0m11.364s
+ * sys - 0m1.255s
+ *
+ * PQhandleCopyData:
+ * real - 0m32.839s
+ * user - 0m3.407s
+ * sys - 0m0.872s
+ */
+
+#if CALL == 1
+/*
+ * Print line, add newline.
+ */
+static int
+print_row_and_newline(void *, const char *buf, size_t len)
+{
+	fwrite(buf, 1, len, stdout);
+	return 0;
+}
+#endif
+
+
+int
+main()
+{
+#if !defined(CALL)
+#error "Set CALL: 0 = PQgetCopyData, 1 = PQhandleCopyData."
+#elif CALL == 0
+	fprintf(stderr, "Testing classic PQgetCopyData().\n");
+#elif CALL == 1
+	fprintf(stderr, "Testing experimental PQhandleCopyData.\n");
+#else
+#error "Unknown CALL value."
+#endif
+
+	PGconn	   *cx = PQconnectdb("");
+
+	if (!cx)
+	{
+		fprintf(stderr, "Could not connect.\n");
+		exit(1);
+	}
+	PGresult   *tx = PQexec(cx, "BEGIN");
+
+	if (!tx)
+	{
+		fprintf(stderr, "No result from BEGIN!\n");
+		exit(1);
+	}
+	int			s = PQresultStatus(tx);
+
+	if (s != PGRES_COMMAND_OK)
+	{
+		fprintf(stderr, "Failed to start transaction: status %d.\n", s);
+		exit(1);
+	}
+
+	PGresult   *r = PQexec(
+		   cx,
+		   "COPY ("
+		   "SELECT generate_series, 'row #' || generate_series "
+		   "FROM generate_series(1, 1)"
+		   ") TO STDOUT"
+	);
+
+	if (!r)
+	{
+		fprintf(stderr, "No result!\n");
+		exit(1);
+	}
+	int			status = PQresultStatus(r);
+
+	if (status != PGRES_COPY_OUT)
+	{
+		fprintf(stderr, "Failed to start COPY: status %d.\n", status);
+		exit(1);
+	}
+
+	int			bytes;
+#if CALL == 0
+	char	   *buffer = NULL;
+
+	for (
+		 bytes = PQgetCopyData(cx, &buffer, 0);
+		 bytes > 0;
+		 bytes = PQgetCopyData(cx, &buffer, 0)
+		)
+	{
+		if (buffer)
+		{
+			printf("%s", buffer);
+			PQfreemem(buffer);
+		}
+	}
+#elif CALL == 1
+	while ((bytes = PQhandleCopyData(cx, print_row_and_newline, NULL, 0)) > 0);
+#else
+#error "Unknown CALL value."
+#endif
+
+	if (bytes != -1)
+	{
+		fprintf(stderr, "Got unexpected result: %d.\n", bytes);
+		exit(1);
+	}
+
+	/* (Don't bother cleaning up.) */
+}
diff --git a/src/interfaces/ecpg/ecpglib/execute.c b/src/interfaces/ecpg/ecpglib/execute.c
index 641851983d..aaf31ea216 100644
--- a/src/interfaces/ecpg/ecpglib/execute.c
+++ b/src/interfaces/ecpg/ecpglib/execute.c
@@ -32,6 +32,16 @@
 #include "sqlda-compat.h"
 #include "sqlda-native.h"
 
+/*
+ * Print non-zero-terminated line received from COPY.
+ */
+static int
+print_row(void *, const char *buf, size_t len)
+{
+	fwrite(buf, 1, len, stdout);
+	return 0;
+}
+
 /*
  *	This function returns a newly malloced string that has ' and \
  *	escaped.
@@ -1876,16 +1886,10 @@ ecpg_process_output(struct statement *stmt, bool clear_result)
 			break;
 		case PGRES_COPY_OUT:
 			{
-char	   *buffer;
 int			res;
 
 ecpg_log("ecpg_process_output on line %d: COPY OUT data transfer in progress\n", stmt->lineno);
-while ((res = PQgetCopyData(stmt->connection->connection,
-			&buffer, 0)) > 0)
-{
-	printf("%s", buffer);
-	PQfreemem(buffer);
-}
+while ((res = PQhandleCopyData(stmt->connection->connection, print_row, NULL, 0)) > 0);
 if (res == -1)
 {
 	/* COPY done */
diff --git a/src/interfaces/libpq/exports.txt b/src/interfaces/libpq/exports.txt
index e8bcc88370..add1ff1591 100644
--- a/src/interfaces/libpq/exports.txt
+++ b/src/interfaces/libpq/exports.txt
@@ 

Re: zstd compression for pg_dump

2023-03-03 Thread Justin Pryzby
On Fri, Mar 03, 2023 at 10:32:53AM -0800, Jacob Champion wrote:
> On Sat, Feb 25, 2023 at 5:22 PM Justin Pryzby  wrote:
> > This resolves cfbot warnings: windows and cppcheck.
> > And refactors zstd routines.
> > And updates docs.
> > And includes some fixes for earlier patches that these patches conflicts
> > with/depends on.
> 
> This'll need a rebase (cfbot took a while to catch up).

Soon.

> The patchset includes basebackup modifications, which are part of a
> different CF entry; was that intended?

Yes, it's intentional - if zstd:long mode were to be merged first, then
this patch should include long mode from the start.
Or, if pgdump+zstd were merged first, then long mode could be added to
both places.

> I tried this on a local, 3.5GB, mostly-text table (from the UK Price

Thanks for looking.  If your zstd library is compiled with thread
support, could you also try with :workers=N ?  I believe this is working
correctly, but I'm going to ask for help verifying that...

It'd be especially useful to test under windows, where pgdump/restore
use threads instead of forking...  If you have a windows environment but
not set up for development, I think it's possible to get cirrusci to
compile a patch for you and then retrieve the binaries provided as an
"artifact" (credit/blame for this idea should be directed to Thomas
Munro).

> With this particular dataset, I don't see much improvement with
> zstd:long.

Yeah.  I think this could be because either 1) you already got very good
compression without looking at more data; and/or 2) the neighboring data
is already very similar, maybe equally or more similar than the further
data, from which there's nothing to gain.

> (At nearly double the CPU time, I get a <1% improvement in
> compression size.) I assume it's heavily data dependent, but from the
> notes on --long [2] it seems like they expect you to play around with
> the window size to further tailor it to your data. Does it make sense
> to provide the long option without the windowLog parameter?

I don't want to start exposing lots of fine-grained parameters at this
point.  In the immediate case, it looks like it may require more than
just adding another parameter:

  Note: If windowLog is set to larger than 27,
--long=windowLog or --memory=windowSize needs to be passed to the
decompressor.

-- 
Justin




Re: libpq: PQgetCopyData() and allocation overhead

2023-03-03 Thread Tom Lane
Jeroen Vermeulen  writes:
> The printf() is just the simplest example that sprang to mind though.
> There may be other use-cases out there involving  libraries that require
> zero-terminated strings, and I figured an ability to set a sentinel could
> help those.

Well, since it won't help for binary COPY, I'm skeptical that this is
something we should cater to.  Anybody who's sufficiently worried about
performance to be trying to remove malloc/free cycles ought to be
interested in binary COPY as well.

regards, tom lane




Re: libpq-fe.h should compile *entirely* standalone

2023-03-03 Thread Tom Lane
I wrote:
> We can easily do better, as attached, but I wonder which other
> headers should get the same treatment.

After a bit of further research I propose the attached.  I'm not
sure exactly what subset of ECPG headers is meant to be exposed
to clients, but we can adjust these patterns if new info emerges.

This is actually moving the inclusion-check goalposts quite far,
but HEAD seems to pass cleanly, and again we can always adjust later.
Any objections?

regards, tom lane

diff --git a/src/tools/pginclude/cpluspluscheck b/src/tools/pginclude/cpluspluscheck
index 2c5042eb41..6ad5ef6942 100755
--- a/src/tools/pginclude/cpluspluscheck
+++ b/src/tools/pginclude/cpluspluscheck
@@ -161,7 +161,31 @@ do
 	# OK, create .c file to include this .h file.
 	{
 	echo 'extern "C" {'
-	test "$f" != src/include/postgres_fe.h && echo '#include "postgres.h"'
+	# Ideally we'd pre-include only the appropriate one of
+	# postgres.h, postgres_fe.h, or c.h.  We don't always have enough
+	# info to guess which, but in some subdirectories there's a
+	# reasonable choice to make, and otherwise we use postgres.h.
+	# Also, those three files should compile with no pre-include, as
+	# should src/interfaces headers meant to be exposed to clients.
+	case "$f" in
+		src/include/postgres.h) ;;
+		src/include/postgres_fe.h) ;;
+		src/include/c.h) ;;
+		src/interfaces/libpq/libpq-fe.h) ;;
+		src/interfaces/libpq/libpq-events.h) ;;
+		src/interfaces/ecpg/ecpglib/ecpglib_extern.h)
+		echo '#include "postgres_fe.h"' ;;
+		src/interfaces/ecpg/ecpglib/*) ;;
+		src/interfaces/*)
+		echo '#include "postgres_fe.h"' ;;
+		src/bin/*)
+		echo '#include "postgres_fe.h"' ;;
+		src/port/*) ;;
+		src/common/*)
+		echo '#include "c.h"' ;;
+		*)
+		echo '#include "postgres.h"' ;;
+	esac
 	echo "#include \"$f\""
 	echo '};'
 	} >$tmp/test.cpp
diff --git a/src/tools/pginclude/headerscheck b/src/tools/pginclude/headerscheck
index abbba7aa63..f1810c09bb 100755
--- a/src/tools/pginclude/headerscheck
+++ b/src/tools/pginclude/headerscheck
@@ -142,7 +142,31 @@ do
 
 	# OK, create .c file to include this .h file.
 	{
-	test "$f" != src/include/postgres_fe.h && echo '#include "postgres.h"'
+	# Ideally we'd pre-include only the appropriate one of
+	# postgres.h, postgres_fe.h, or c.h.  We don't always have enough
+	# info to guess which, but in some subdirectories there's a
+	# reasonable choice to make, and otherwise we use postgres.h.
+	# Also, those three files should compile with no pre-include, as
+	# should src/interfaces headers meant to be exposed to clients.
+	case "$f" in
+		src/include/postgres.h) ;;
+		src/include/postgres_fe.h) ;;
+		src/include/c.h) ;;
+		src/interfaces/libpq/libpq-fe.h) ;;
+		src/interfaces/libpq/libpq-events.h) ;;
+		src/interfaces/ecpg/ecpglib/ecpglib_extern.h)
+		echo '#include "postgres_fe.h"' ;;
+		src/interfaces/ecpg/ecpglib/*) ;;
+		src/interfaces/*)
+		echo '#include "postgres_fe.h"' ;;
+		src/bin/*)
+		echo '#include "postgres_fe.h"' ;;
+		src/port/*) ;;
+		src/common/*)
+		echo '#include "c.h"' ;;
+		*)
+		echo '#include "postgres.h"' ;;
+	esac
 	echo "#include \"$f\""
 	} >$tmp/test.c
 


Re: zstd compression for pg_dump

2023-03-03 Thread Jacob Champion
On Sat, Feb 25, 2023 at 5:22 PM Justin Pryzby  wrote:
> This resolves cfbot warnings: windows and cppcheck.
> And refactors zstd routines.
> And updates docs.
> And includes some fixes for earlier patches that these patches conflicts
> with/depends on.

This'll need a rebase (cfbot took a while to catch up). The patchset
includes basebackup modifications, which are part of a different CF
entry; was that intended?

I tried this on a local, 3.5GB, mostly-text table (from the UK Price
Paid dataset [1]) and the comparison against the other methods was
impressive. (I'm no good at constructing compression benchmarks, so
this is a super naive setup. Client's on the same laptop as the
server.)

$ time ./src/bin/pg_dump/pg_dump -d postgres -t pp_complete -Z zstd > /tmp/zstd.dump
real    1m17.632s
user    0m35.521s
sys     0m2.683s

$ time ./src/bin/pg_dump/pg_dump -d postgres -t pp_complete -Z lz4 > /tmp/lz4.dump
real    1m13.125s
user    0m19.795s
sys     0m3.370s

$ time ./src/bin/pg_dump/pg_dump -d postgres -t pp_complete -Z gzip > /tmp/gzip.dump
real    2m24.523s
user    2m22.114s
sys     0m1.848s

$ ls -l /tmp/*.dump
-rw-rw-r-- 1 jacob jacob 1331493925 Mar  3 09:45 /tmp/gzip.dump
-rw-rw-r-- 1 jacob jacob 2125998939 Mar  3 09:42 /tmp/lz4.dump
-rw-rw-r-- 1 jacob jacob 1215834718 Mar  3 09:40 /tmp/zstd.dump

Default gzip was the only method that bottlenecked on pg_dump rather
than the server, and default zstd outcompressed it at a fraction of
the CPU time. So, naively, this looks really good.

With this particular dataset, I don't see much improvement with
zstd:long. (At nearly double the CPU time, I get a <1% improvement in
compression size.) I assume it's heavily data dependent, but from the
notes on --long [2] it seems like they expect you to play around with
the window size to further tailor it to your data. Does it make sense
to provide the long option without the windowLog parameter?

Thanks,
--Jacob

[1] https://landregistry.data.gov.uk/
[2] https://github.com/facebook/zstd/releases/tag/v1.3.2




Re: libpq: PQgetCopyData() and allocation overhead

2023-03-03 Thread Jeroen Vermeulen
On Fri, 3 Mar 2023 at 18:14, Tom Lane  wrote:

> Jeroen Vermeulen  writes:
> > On Fri, 3 Mar 2023 at 17:33, Tom Lane  wrote:
> >> Let's not do that.  Declare it const char *, or maybe better const void *.
>
> > Personally I would much prefer "char" over "void" here:
> > * It really is a char buffer, containing text.
>
> Not in binary-mode COPY.
>

True.  And in that case zero-termination doesn't matter much either.  But
overall libpq's existing choice seems reasonable.


> > As for const, I would definitely have preferred that.  But if the caller
> > needs a zero-terminated string, forcing them into a memcpy() would kind of
> > defeat the purpose.
>
> I'm willing to grant that avoiding malloc-and-free is worth the trouble.
> I'm not willing to allow applications to scribble on libpq buffers to
> avoid memcpy.  Even your not-a-patch patch fails to make the case that
> this is essential, because you could have used fwrite() instead of
> printf() (which would be significantly faster yet btw, printf formatting
> ain't cheap).
>

Your house, your rules.  For my own use-case "const" is just peachy.

The printf() is just the simplest example that sprang to mind though.
There may be other use-cases out there involving  libraries that require
zero-terminated strings, and I figured an ability to set a sentinel could
help those.


> > Can do that, sure.  I'll also try benchmarking a variant that doesn't take
> > a callback at all, but gives you the buffer pointer in addition to the
> > size/status return.  I don't generally like callbacks.
>
> Um ... that would require an assumption that libpq neither changes nor
> moves that buffer before returning to the caller.  I don't much like
> that either.
>

Not an assumption about _before returning to the caller_ I guess, because
the function would be on top of that anyway.  The concern would be libpq
changing or moving the buffer _before the caller is done with the line._
Which would require some kind of clear rule about what invalidates the
buffer.  Yes, that is easier with the callback.


Jeroen


Re: pgsql: Harden new test case against force_parallel_mode = regress.

2023-03-03 Thread Tom Lane
Robert Haas  writes:
> On Fri, Mar 3, 2023 at 12:46 PM Tom Lane  wrote:
>> Couldn't we install the leader's snapshot into both transactions?

> Yeah, maybe that would Just Work. Not sure.

Well, IIUC the worker is currently getting a brand new snapshot
for its startup transaction, which is exactly what you said upthread
it should never do.  Seems like that could have more failure modes
than just this one.

regards, tom lane




Re: pgsql: Harden new test case against force_parallel_mode = regress.

2023-03-03 Thread Robert Haas
On Fri, Mar 3, 2023 at 12:46 PM Tom Lane  wrote:
> Couldn't we install the leader's snapshot into both transactions?

Yeah, maybe that would Just Work. Not sure.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: libpq: PQgetCopyData() and allocation overhead

2023-03-03 Thread Tom Lane
Jelte Fennema  writes:
> On Fri, 3 Mar 2023 at 17:33, Tom Lane  wrote:
>> If you mean exposing PQExpBuffer to users of libpq-fe.h, I'd be very
>> seriously against that.  I realize that libpq exposes it at an ABI
>> level, but that doesn't mean we want non-Postgres code to use it.

> The code comment in the pqexpbuffer.h header suggests that client
> applications are fine to use the API too:

Our own client apps, sure.  But you have to buy into the whole Postgres
compilation environment to use PQExpBuffer.  (If you don't believe me,
just try including pqexpbuffer.h by itself.)  That's a non-starter
for most clients.

regards, tom lane




Re: Fix comments in gistxlogDelete, xl_heap_freeze_page and xl_btree_delete

2023-03-03 Thread Robert Haas
On Fri, Mar 3, 2023 at 11:28 AM Drouvot, Bertrand
 wrote:
> Thanks for having looked at it!

+1. Committed.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: libpq: PQgetCopyData() and allocation overhead

2023-03-03 Thread Jelte Fennema
On Fri, 3 Mar 2023 at 17:33, Tom Lane  wrote:
> If you mean exposing PQExpBuffer to users of libpq-fe.h, I'd be very
> seriously against that.  I realize that libpq exposes it at an ABI
> level, but that doesn't mean we want non-Postgres code to use it.

The code comment in the pqexpbuffer.h header suggests that client
applications are fine to use the API too:

> * This module is essentially the same as the backend's StringInfo data type,
> * but it is intended for use in frontend libpq and client applications.

I know both pg_auto_failover and pgcopydb use it quite a lot.




Re: pgsql: Harden new test case against force_parallel_mode = regress.

2023-03-03 Thread Tom Lane
Robert Haas  writes:
> On Fri, Mar 3, 2023 at 11:37 AM Tom Lane  wrote:
>> +ERROR:  role "regress_psql_user" does not exist
>> +CONTEXT:  while setting parameter "session_authorization" to 
>> "regress_psql_user"

> Oh, that's interesting (and sad). A parallel worker has a "startup
> transaction" that is used to restore library and GUC state, and then
> after that transaction commits, it starts up a new transaction that
> uses the same snapshot and settings as the transaction in the parallel
> leader. So the problem here is that the startup transaction can't see
> the uncommitted work of some unrelated (as far as it knows)
> transaction, and that prevents restoring the session_authorization
> GUC.

Got it.

> That startup transaction has broken stuff before, and it would be nice
> to get rid of it. Unfortunately, I don't remember right now why we
> need it in the first place. I'm fairly sure that if you load the
> library and GUC state without any transaction, that doesn't work,
> because a bunch of important processing gets skipped. And I think if
> you try to do those things in the "real" transaction that fails for
> some reason too, maybe that there's no guarantee that all the relevant
> GUCs can be changed at that point, but I'm fuzzy on the details at the
> moment.

Couldn't we install the leader's snapshot into both transactions?

regards, tom lane




Re: pgsql: Harden new test case against force_parallel_mode = regress.

2023-03-03 Thread Robert Haas
On Fri, Mar 3, 2023 at 11:37 AM Tom Lane  wrote:
> The workers were failing at startup, eg (from [1]):
>
> +ERROR:  role "regress_psql_user" does not exist
> +CONTEXT:  while setting parameter "session_authorization" to 
> "regress_psql_user"
>
> Maybe this says that worker startup needs to install the snapshot before
> doing any catalog accesses?  Anyway, I'd be happy to revert this test
> hack if you care to make the case work.

Oh, that's interesting (and sad). A parallel worker has a "startup
transaction" that is used to restore library and GUC state, and then
after that transaction commits, it starts up a new transaction that
uses the same snapshot and settings as the transaction in the parallel
leader. So the problem here is that the startup transaction can't see
the uncommitted work of some unrelated (as far as it knows)
transaction, and that prevents restoring the session_authorization
GUC.

That startup transaction has broken stuff before, and it would be nice
to get rid of it. Unfortunately, I don't remember right now why we
need it in the first place. I'm fairly sure that if you load the
library and GUC state without any transaction, that doesn't work,
because a bunch of important processing gets skipped. And I think if
you try to do those things in the "real" transaction that fails for
some reason too, maybe that there's no guarantee that all the relevant
GUCs can be changed at that point, but I'm fuzzy on the details at the
moment.

So I don't know how to fix this right now, but thanks for the details.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: libpq: PQgetCopyData() and allocation overhead

2023-03-03 Thread Tom Lane
Jeroen Vermeulen  writes:
> On Fri, 3 Mar 2023 at 17:33, Tom Lane  wrote:
>> Let's not do that.  Declare it const char *, or maybe better const void *.

> Personally I would much prefer "char" over "void" here:
> * It really is a char buffer, containing text.

Not in binary-mode COPY.

> As for const, I would definitely have preferred that.  But if the caller
> needs a zero-terminated string, forcing them into a memcpy() would kind of
> defeat the purpose.

I'm willing to grant that avoiding malloc-and-free is worth the trouble.
I'm not willing to allow applications to scribble on libpq buffers to
avoid memcpy.  Even your not-a-patch patch fails to make the case that
this is essential, because you could have used fwrite() instead of
printf() (which would be significantly faster yet btw, printf formatting
ain't cheap).

> Can do that, sure.  I'll also try benchmarking a variant that doesn't take
> a callback at all, but gives you the buffer pointer in addition to the
> size/status return.  I don't generally like callbacks.

Um ... that would require an assumption that libpq neither changes nor
moves that buffer before returning to the caller.  I don't much like
that either.

regards, tom lane




libpq-fe.h should compile *entirely* standalone

2023-03-03 Thread Tom Lane
I realized that headerscheck is failing to enforce $SUBJECT.
This is bad, since we aren't really using libpq-fe.h ourselves
in a way that would ensure that c.h symbols don't creep into it.

We can easily do better, as attached, but I wonder which other
headers should get the same treatment.

regards, tom lane

diff --git a/src/tools/pginclude/headerscheck b/src/tools/pginclude/headerscheck
index abbba7aa63..69897092b2 100755
--- a/src/tools/pginclude/headerscheck
+++ b/src/tools/pginclude/headerscheck
@@ -142,7 +142,14 @@ do
 
 	# OK, create .c file to include this .h file.
 	{
-	test "$f" != src/include/postgres_fe.h && echo '#include "postgres.h"'
+	# Ideally we'd pre-include only the appropriate one of
+	# postgres.h, postgres_fe.h, or c.h, but we don't have enough
+	# info here to guess what to do in most cases.
+	if test "$f" != src/include/postgres_fe.h -a \
+		"$f" != src/interfaces/libpq/libpq-fe.h
+	then
+		echo '#include "postgres.h"'
+	fi
 	echo "#include \"$f\""
 	} >$tmp/test.c
 


Re: libpq: PQgetCopyData() and allocation overhead

2023-03-03 Thread Jeroen Vermeulen
On Fri, 3 Mar 2023 at 17:33, Tom Lane  wrote:

>
> If you mean exposing PQExpBuffer to users of libpq-fe.h, I'd be very
> seriously against that.  I realize that libpq exposes it at an ABI
> level, but that doesn't mean we want non-Postgres code to use it.
> I also don't see what it'd add for this particular use-case.
>

Fair enough.  Never even got around to checking whether it was in the API
already.



> One thing I don't care for at all in the proposed API spec is the bit
> about how the handler function can scribble on the passed buffer.
> Let's not do that.  Declare it const char *, or maybe better const void *.
>

Personally I would much prefer "char" over "void" here:
* It really is a char buffer, containing text.
* If there is to be any type punning, best have it explicit.
* Reduces risk of getting the two pointer arguments the wrong way around.

As for const, I would definitely have preferred that.  But if the caller
needs a zero-terminated string, forcing them into a memcpy() would kind of
defeat the purpose.

I even tried poking a terminating zero in there from inside the function,
but that made the code significantly less efficient.  Optimiser
assumptions, I suppose.


Rather than duplicating most of pqGetCopyData3, I'd suggest revising
> it to take a callback, where the callback is either user-supplied
> or is supplied by PQgetCopyData to emulate the existing behavior.
> This would both avoid duplicate coding and provide a simple check that
> you've made a usable callback API (in particular, one that can do
> something sane for error cases).
>

Can do that, sure.  I'll also try benchmarking a variant that doesn't take
a callback at all, but gives you the buffer pointer in addition to the
size/status return.  I don't generally like callbacks.


Jeroen


Re: pgsql: Harden new test case against force_parallel_mode = regress.

2023-03-03 Thread Tom Lane
I wrote:
> The workers were failing at startup, eg (from [1]):

argh, forgot to add the link:

[1] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hippopotamus&dt=2023-03-02%2022%3A31%3A17

regards, tom lane




Re: pgsql: Harden new test case against force_parallel_mode = regress.

2023-03-03 Thread Matthias van de Meent
On Fri, 3 Mar 2023 at 17:16, Robert Haas  wrote:
>
> On Thu, Mar 2, 2023 at 5:47 PM Tom Lane  wrote:
> > Harden new test case against force_parallel_mode = regress.
> >
> > Per buildfarm: worker processes can't see a role created in
> > the current transaction.
>
> Now why would that happen? Surely the snapshot for each command is
> passed down from leader to worker, and the worker is not free to
> invent a snapshot from nothing.

Probably because we initialize which user and database to use in the
backend before we load the parent process' snapshot:

in ParallelWorkerMain (parallel.c, as of HEAD @ b6a0d469):

  /* Restore database connection. */
  BackgroundWorkerInitializeConnectionByOid(fps->database_id,
  fps->authenticated_user_id,
  0);
[...]

  /* Crank up a transaction state appropriate to a parallel worker. */
  tstatespace = shm_toc_lookup(toc, PARALLEL_KEY_TRANSACTION_STATE, false);
  StartParallelWorkerTransaction(tstatespace);

  /* Restore combo CID state. */
  combocidspace = shm_toc_lookup(toc, PARALLEL_KEY_COMBO_CID, false);
  RestoreComboCIDState(combocidspace);

-Matthias




Re: pgsql: Harden new test case against force_parallel_mode = regress.

2023-03-03 Thread Tom Lane
Robert Haas  writes:
> On Thu, Mar 2, 2023 at 5:47 PM Tom Lane  wrote:
>> Per buildfarm: worker processes can't see a role created in
>> the current transaction.

> Now why would that happen? Surely the snapshot for each command is
> passed down from leader to worker, and the worker is not free to
> invent a snapshot from nothing.

The workers were failing at startup, eg (from [1]):

+ERROR:  role "regress_psql_user" does not exist
+CONTEXT:  while setting parameter "session_authorization" to 
"regress_psql_user"

Maybe this says that worker startup needs to install the snapshot before
doing any catalog accesses?  Anyway, I'd be happy to revert this test
hack if you care to make the case work.

regards, tom lane




Re: libpq: PQgetCopyData() and allocation overhead

2023-03-03 Thread Tom Lane
Jelte Fennema  writes:
> Did you try with PQExpBuffer? I still think that probably fits better
> in the API design of libpq.

If you mean exposing PQExpBuffer to users of libpq-fe.h, I'd be very
seriously against that.  I realize that libpq exposes it at an ABI
level, but that doesn't mean we want non-Postgres code to use it.
I also don't see what it'd add for this particular use-case.

One thing I don't care for at all in the proposed API spec is the bit
about how the handler function can scribble on the passed buffer.
Let's not do that.  Declare it const char *, or maybe better const void *.

Rather than duplicating most of pqGetCopyData3, I'd suggest revising
it to take a callback, where the callback is either user-supplied
or is supplied by PQgetCopyData to emulate the existing behavior.
This would both avoid duplicate coding and provide a simple check that
you've made a usable callback API (in particular, one that can do
something sane for error cases).

regards, tom lane




Re: Fix comments in gistxlogDelete, xl_heap_freeze_page and xl_btree_delete

2023-03-03 Thread Drouvot, Bertrand

Hi,

On 3/3/23 12:30 PM, Amit Kapila wrote:

On Thu, Mar 2, 2023 at 6:35 PM Drouvot, Bertrand
 wrote:


On 1/6/23 11:05 AM, Drouvot, Bertrand wrote:

Hi hackers,

Please find attached a patch to $SUBJECT.

The wrong comments have been discovered by Robert in [1].

Submitting this here as a separate thread so it does not get lost in the 
logical decoding
on standby thread.

[1]: 
https://www.postgresql.org/message-id/CA%2BTgmoYTTsxP8y6uknZvCBNCRq%2B1FJ4zGbX8Px1TGW459fGsaQ%40mail.gmail.com

Looking forward to your feedback,

Regards,



It looks like I did not create a CF entry for this one: fixed with [1].

Also, while at it, adding a commit message in V2 attached.



LGTM.



Thanks for having looked at it!

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Minimal logical decoding on standbys

2023-03-03 Thread Drouvot, Bertrand

Hi,

On 3/3/23 8:58 AM, Jeff Davis wrote:

On Thu, 2023-03-02 at 11:45 -0800, Jeff Davis wrote:

In this case it looks easier to add the right API than to be sure
about
whether it's needed or not.


I attached a sketch of one approach. 


Oh, that's very cool, thanks a lot!


I'm not very confident that it's
the right API or even that it works as I intended it, but if others
like the approach I can work on it some more.



I'll look at it early next week.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Minimal logical decoding on standbys

2023-03-03 Thread Drouvot, Bertrand

Hi,

On 3/2/23 8:45 PM, Jeff Davis wrote:

On Thu, 2023-03-02 at 10:20 +0100, Drouvot, Bertrand wrote:

Right, but in our case, right after the wakeup (the one due to the CV
broadcast,
aka the one that will remove it from the wait queue) we'll exit the
loop due to:

"
  /* check whether we're done */
  if (loc <= RecentFlushPtr)
  break;
"

as the CV broadcast means that a flush/replay occurred.


But does it mean that the flush/replay advanced *enough* to be greater
than or equal to loc?



Yes I think so: loc is when we started waiting initially
and RecentFlushPtr is >= to when the broadcast has been sent.


- If it is awakened due to the CV broadcast, then we'll right after
exit the loop (see above)


...


I think that's not needed as we'd exit the loop right after we are
awakened by a CV broadcast.


See the comment here:
WalSndWaitForWal
  * If this process has been taken out of the wait list, then we know
  * that it has been signaled by ConditionVariableSignal (or
  * ConditionVariableBroadcast), so we should return to the caller. But
  * that doesn't guarantee that the exit condition is met, only that we
  * ought to check it.

You seem to be arguing that in this case, it doesn't matter; that
walreceiver knows what walsender is waiting for, and will never wake it
up before it's ready. I don't think that's true, and even if it is, it
needs explanation.



What I think is that, in this particular case, we are sure that
the loop exit condition is met as we know that loc <= RecentFlushPtr.
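
For reference, this is the generic condition-variable pattern that comment is
describing: a wakeup only means "re-check the condition", and the loop is what
guarantees the caller doesn't return before the condition truly holds. A
simplified sketch with placeholder names, not the actual WalSndWaitForWal code:

#include "postgres.h"
#include "storage/condition_variable.h"

static void
wait_until(ConditionVariable *cv, uint32 wait_event_info,
		   bool (*condition_is_met) (void))
{
	ConditionVariablePrepareToSleep(cv);
	while (!condition_is_met())		/* e.g. loc <= RecentFlushPtr */
		ConditionVariableSleep(cv, wait_event_info);
	ConditionVariableCancelSleep();
}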



I agree that's a good idea and that it should/would work too. I just
wanted to highlight that in this particular
case it might not be necessary to build this new API.


In this case it looks easier to add the right API than to be sure about
whether it's needed or not.



What I meant is that of course I might be wrong.

If we can't agree that the new API is unnecessary in this particular case, then
I agree that building the new API is the way to go ;-) (plus it offers the
advantage of being able to be more precise when reporting the wait event).

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: psql: Add role's membership options to the \du+ command

2023-03-03 Thread David G. Johnston
On Fri, Mar 3, 2023 at 4:01 AM Pavel Luzanov 
wrote:

> Hello,
>
> On 22.02.2023 00:34, David G. Johnston wrote:
>
> I didn't even know this function existed. But I see that it was changed in
> 3d14e171 with updated documentation:
>
> https://www.postgresql.org/docs/devel/functions-info.html#FUNCTIONS-INFO-ACCESS
> Maybe that's enough.
>
>
> I think that should probably have ADMIN as one of the options as well.
> Also curious what it reports for an empty membership.
>
>
> I've been experimenting for a few days and I want to admit that this is a
> very difficult and not obvious topic.
> I'll try to summarize what I think.
>
> 1.
> About ADMIN value for pg_has_role.
> Implementation of ADMIN value will be different from USAGE and SET.
> To be True, USAGE value requires the full chain of memberships to have
> INHERIT option.
> Similar with SET: the full chain of memberships must have SET option.
> But for ADMIN, only last member in the chain must have ADMIN option and
> all previous members
> must have INHERIT (to administer directly) or SET option (to switch to
> role, last in the chain).
> Therefore, it is not obvious to me that the function needs the ADMIN value.
>

Or you can SET to some role that then has an unbroken INHERIT chain to the
administered role.

ADMIN basically implies SET/USAGE but it doesn't work the other way around.

I'd be fine with "pg_can_admin_role" being a newly created function that
provides this true/false answer but it seems indisputable that today there
is no core-provided means to answer the question "can one role get ADMIN
rights on another role".  Modifying \du to show this seems out-of-scope but
the pg_has_role function already provides that question for INHERIT and SET
so it is at least plausible to extend it to include ADMIN, even if the
phrase "has role" seems a bit of a misnomer.  I do cover this aspect with
the Role Graph pseudo-extension but given the presence and ease-of-use of a
boolean-returning function this seems like a natural addition.  We've also
survived quite long without it - this isn't a new concept in v16, just a
bit refined.


>
> 2.
> pg_has_role function description starts with: Does user have privilege for
> role?
> - This is not exact: function works not only with users, but with
> NOLOGIN roles too.
> - Term "privilege": this term used for ACL columns, such usage may be
> confusing,
>   especially after adding INHERIT and SET in addition to ADMIN option.
>

Yes, it missed the whole "there are only roles now" memo.  I don't have an
issue with using privilege here though - you have to use the GRANT command
which "defines access privileges".  Otherwise "membership option" or maybe
just "option" would need to be explored.


>
> 3.
> It is possible to grant membership with all three options turned off:
> grant a to b with admin false, inherit false, set false;
> But such membership is completely useless (if I didn't miss something).
> Maybe such grants should be prohibited. At least this may be documented in
> the GRANT command.
>

I have no issue with prohibiting the "empty membership" if someone wants to
code that up.


> 4.
> Since v16 it is possible to grant membership from one role to another
> several times with different grantors.
> And only grantor can revoke membership.
> - This is not documented anywhere.
>

Yeah, a pass over the GRANTED BY actual operation versus documentation
seems warranted.


> - Current behavior of \du command with duplicated roles in "Member of"
> column strongly confusing.
>   This is one of the goals of the discussion patch.
>

This indeed needs to be fixed, one way (include grantor) or the other
(de-duplicate), with the current proposal of including grantor getting my
vote.


>
> I think to write about this to pgsql-docs additionally to this topic.
>

I wouldn't bother starting yet another thread in this area right now, this
one can absorb some related changes as well as the subject line item.

David J.


Re: pgsql: Harden new test case against force_parallel_mode = regress.

2023-03-03 Thread Robert Haas
On Thu, Mar 2, 2023 at 5:47 PM Tom Lane  wrote:
> Harden new test case against force_parallel_mode = regress.
>
> Per buildfarm: worker processes can't see a role created in
> the current transaction.

Now why would that happen? Surely the snapshot for each command is
passed down from leader to worker, and the worker is not free to
invent a snapshot from nothing.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: libpq: PQgetCopyData() and allocation overhead

2023-03-03 Thread Jeroen Vermeulen
Thank you.

I meant to try PQExpBuffer — as you say, it fits much better with existing
libpq style.  But then I hit on the callback idea which was even more
efficient, by a fair margin.  It was also considerably simpler both inside
libpq and in the client code, eliminating all sorts of awkward questions
about who deallocates the buffer in which situations.  So I ditched the
"dynamic buffer" concept and went with the callback.

One other possible alternative suggests itself: instead of taking a
callback and a context pointer, the function could probably just return a
struct: status/size, and buffer.  Then the caller would have to figure out
whether there's a line in the buffer, and if so, process it.  It seems like
more work for the client code, but it may make the compiler's optimisation
work easier.
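
For illustration only, the two shapes under discussion might look roughly like
this; every name here is hypothetical and nothing like it exists in libpq today:

#include <stddef.h>
#include <libpq-fe.h>

/* Callback variant: libpq hands each received row to the handler. */
typedef void (*PQcopyRowHandler) (void *context, const char *data, size_t len);

extern int	PQhandleCopyRow(PGconn *conn, PQcopyRowHandler handler,
							void *context, int async);

/* Struct-return variant: the caller gets a pointer into libpq's buffer,
 * valid only until the next libpq call on the same connection. */
typedef struct PQcopyRow
{
	int			status;			/* >0: row length; 0: would block (async);
								 * -1: COPY done; -2: error */
	const char *data;			/* not NUL-terminated; length is "status" */
} PQcopyRow;

extern PQcopyRow PQpeekCopyRow(PGconn *conn, int async);

Either way the buffer-lifetime question raised earlier remains: the pointer is
only meaningful under some explicit rule about what invalidates it.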


Jeroen

On Fri, 3 Mar 2023 at 16:52, Jelte Fennema  wrote:

> On Thu, 2 Mar 2023 at 20:45, Jeroen Vermeulen  wrote:
> > I'm attaching a diff now.  This is not a patch, it's just a discussion
> piece.
>
> Did you try with PQExpBuffer? I still think that probably fits better
> in the API design of libpq. Obviously if it's significantly slower
> than the callback approach in this patch then it's worth considering
> using the callback approach. Overall it definitely seems reasonable to
> me to have an API that doesn't do an allocation per row.
>


Re: Timeline ID hexadecimal format

2023-03-03 Thread Robert Haas
On Tue, Jan 31, 2023 at 2:16 PM Greg Stark  wrote:
> I don't see any advantage in converting every place where we refer to
> timelines into hex and then having to refer to things like timeline
> 1A. It doesn't seem any more intuitive to someone understanding what's
> going on than referring to timeline 26.

The point, though, is that the WAL files we have on disk already say
1A. If we change the log messages to match, that's easier for users.
We could alternatively change the naming convention for WAL files on
disk, but that feels like a much bigger compatibility break.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




running logical replication as the subscription owner

2023-03-03 Thread Robert Haas
Hi,

Here's a patch against master for $SUBJECT. It lacks documentation
changes and might have bugs, so please review if you're concerned
about this issue.

To recap, under CVE-2020-14349, Noah documented that untrusted users
shouldn't own tables into which the system is performing logical
replication. Otherwise, the users could hook up triggers or default
expressions or whatever to those tables and they would execute with
the subscription owner's privileges, which would allow the table
owners to escalate to superuser. However, I'm unsatisfied with just
documenting the hazard, because I feel like almost everyone who uses
logical replication wants to do the exact thing that this
documentation says you shouldn't do. If you don't use logical
replication or run everything as the superuser or just don't care
about security, well then you don't have any problem here, but
otherwise you probably do.

The proposed fix is to perform logical replication actions (SELECT,
INSERT, UPDATE, DELETE, and TRUNCATE) as the user who owns the table
rather than as the owner of the subscription. The session still runs
as the subscription owner, but the active user ID is switched to the
table owner for the duration of each operation. To prevent table
owners from doing tricky things to attack the subscription owner, we
impose SECURITY_RESTRICTED_OPERATION while running as the table owner.
To avoid inconveniencing users when this restriction adds no
meaningful security, we refrain from imposing
SECURITY_RESTRICTED_OPERATION when the table owner can SET ROLE to the
subscription owner anyway. Such a user need not use logical
replication to break into the subscription owner's account: they have
access to it anyway.
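
Very roughly, the per-operation switch described above could look like the
sketch below. This is only an illustration, not the attached patch; in
particular, member_can_set_role() is assumed here for the "can SET ROLE to the
subscription owner" test and may not match the actual code.

#include "postgres.h"
#include "miscadmin.h"
#include "utils/acl.h"

static void
apply_as_table_owner(Oid table_owner, Oid subscription_owner)
{
	Oid			saved_userid;
	int			saved_sec_context;
	int			sec_context;

	GetUserIdAndSecContext(&saved_userid, &saved_sec_context);

	sec_context = saved_sec_context;
	if (!member_can_set_role(table_owner, subscription_owner))	/* assumed helper */
		sec_context |= SECURITY_RESTRICTED_OPERATION;

	/* Run the replicated INSERT/UPDATE/DELETE/TRUNCATE as the table owner. */
	SetUserIdAndSecContext(table_owner, sec_context);

	/* ... perform the operation here ... */

	SetUserIdAndSecContext(saved_userid, saved_sec_context);
}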

There is also a possibility of an attack in the other direction. Maybe
the subscription owner would like to obtain the table owner's
permissions, or at the very least, use logical replication as a
vehicle to perform operations they can't perform directly. A malicious
subscription owner could hook up logical replication to a table into
which the table owner doesn't want replication to occur. To block such
attacks, the patch requires that the subscription owner have the
ability to SET ROLE to each table owner. If the subscription owner is
a superuser, which is usual, this will be automatic. Otherwise, the
superuser will need to grant to the subscription owner the roles that
own relevant tables. This can usefully serve as a kind of access
control to make sure that the subscription doesn't touch any tables
other than the ones it's supposed to be touching: just make those
tables owned by a different user and don't grant them to the
subscription owner. Previously, we provided no way at all of
controlling the tables that replication can target.

This fix interacts in an interesting way with Mark Dilger's work,
committed by Jeff Davis, to make logical replication respect table
permissions. I initially thought that with this change, that commit
would end up just being reverted, with the permissions scheme
described above replacing the existing one. However, I then realized
that it's still good to perform those checks. Normally, a table owner
can do any DML operation on a table they own, so those checks will
never fail. However, if a table owner has revoked their ability to,
say, INSERT into one of their own tables, then logical replication
shouldn't bypass that and perform the INSERT anyway. So I now think
that the checks added by that commit complement the ones added by this
proposed patch, rather than competing with them.

It is unclear to me whether we should try to back-port this. It's
definitely a behavior change, and changing the behavior in the back
branches is not a nice thing to do. On the other hand, at least in my
opinion, the security consequences of the status quo are pretty dire.
I tend to suspect that a really high percentage of people who are
using logical replication at all are vulnerable to this, and lots of
people having a way to escalate to superuser isn't good.

Comments?

-- 
Robert Haas
EDB: http://www.enterprisedb.com


v1-0001-Perform-logical-replication-actions-as-the-table-.patch
Description: Binary data


Re: libpq: PQgetCopyData() and allocation overhead

2023-03-03 Thread Jelte Fennema
On Thu, 2 Mar 2023 at 20:45, Jeroen Vermeulen  wrote:
> I'm attaching a diff now.  This is not a patch, it's just a discussion piece.

Did you try with PQExpBuffer? I still think that probably fits better
in the API design of libpq. Obviously if it's significantly slower
than the callback approach in this patch then it's worth considering
using the callback approach. Overall it definitely seems reasonable to
me to have an API that doesn't do an allocation per row.




Re: Timeline ID hexadecimal format

2023-03-03 Thread Sébastien Lardière

On 02/03/2023 09:12, Peter Eisentraut wrote:

On 24.02.23 17:27, Sébastien Lardière wrote:

diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index be05a33205..7e26b51031 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -1332,7 +1332,8 @@ restore_command = 'cp /mnt/server/archivedir/%f %p'
  you like, add comments to a history file to record your own notes about
  how and why this particular timeline was created.  Such comments will be
  especially valuable when you have a thicket of different timelines as
-    a result of experimentation.
+    a result of experimentation. In both WAL segment file names and history files,
+    the timeline ID number is expressed in hexadecimal.
 
   


I think here it would be more helpful to show actual examples. Like, 
here is a possible file name, this is what the different parts mean.


So you mean explaining the WAL filename and the history filename? Is this
the right place for it?






diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index e5c41cc6c6..3b5d041d92 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4110,7 +4110,9 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"'  # Windows

  current when the base backup was taken.  The
  value latest recovers
  to the latest timeline found in the archive, which is useful in
-    a standby server. latest is the default.
+    a standby server. A numerical value expressed in hexadecimal must be
+    prefixed with 0x, for example 0x11.
+    latest is the default.
+    latest is the default.
 
   


This applies to all configuration parameters, so it doesn't need to be 
mentioned explicitly for individual ones.


Probably, but is there another parameter with the same consequence?

Is it worth documenting this point globally?




diff --git a/doc/src/sgml/ref/pg_waldump.sgml 
b/doc/src/sgml/ref/pg_waldump.sgml

index 343f0482a9..4ae8f2ebdd 100644
--- a/doc/src/sgml/ref/pg_waldump.sgml
+++ b/doc/src/sgml/ref/pg_waldump.sgml
@@ -215,7 +215,8 @@ PostgreSQL documentation
 
  Timeline from which to read WAL records. The default is to use the
  value in startseg, if that is specified; otherwise, the
-    default is 1.
+    default is 1. The value must be expressed in decimal, contrary to the hexadecimal
+    value given in WAL segment file names and history files.
 
    
   


Maybe this could be fixed instead?




Indeed, and strtoul is probably a better option than sscanf, don't you think?
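
For what it's worth, a minimal standalone sketch of accepting both forms with
strtoul (illustrative only, not the actual pg_waldump code):

#include <stdio.h>
#include <stdlib.h>

/*
 * Parse a timeline ID, accepting plain decimal ("26") as well as a
 * 0x-prefixed hexadecimal value ("0x1A", the way the timeline appears in
 * WAL segment names such as 0000001A0000000500000028).  Base 0 lets
 * strtoul pick the base from the prefix.
 */
static int
parse_tli(const char *arg, unsigned int *tli)
{
	char	   *endptr;
	unsigned long val = strtoul(arg, &endptr, 0);

	if (endptr == arg || *endptr != '\0')
		return 0;				/* not a number, or trailing junk */
	*tli = (unsigned int) val;
	return 1;
}

int
main(void)
{
	unsigned int tli;

	if (parse_tli("0x1A", &tli))
		printf("timeline %u (hex %X)\n", tli, tli);	/* prints: timeline 26 (hex 1A) */
	return 0;
}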




--
Sébastien





Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-03-03 Thread Masahiko Sawada
On Thu, Mar 2, 2023 at 1:07 PM Amit Kapila  wrote:
>
> On Thu, Mar 2, 2023 at 7:38 AM Masahiko Sawada  wrote:
> >
> > On Wed, Mar 1, 2023 at 6:21 PM Hayato Kuroda (Fujitsu)
> >  wrote:
> > >
> > > >
> > > > Apart from a bad-use case example I mentioned, in general, piling up
> > > > WAL files due to the replication slot has many bad effects on the
> > > > system. I'm concerned that the side effect of this feature (at least
> > > > of the current design) is too huge compared to the benefit, and afraid
> > > > that users might end up using this feature without understanding the
> > > > side effect well. It might be okay if we thoroughly document it but
> > > > I'm not sure.
> > >
> > > One approach is that change max_slot_wal_keep_size forcibly when 
> > > min_send_delay
> > > is set. But it may lead to disable the slot because WALs needed by the 
> > > time-delayed
> > > replication may be also removed. Just the right value cannot be set by us 
> > > because
> > > it is quite depends on the min_send_delay and workload.
> > >
> > > How about throwing the WARNING when min_send_delay > 0 but
> > > max_slot_wal_keep_size < 0? Differ from previous, version the subscription
> > > parameter min_send_delay will be sent to publisher. Therefore, we can 
> > > compare
> > > min_send_delay and max_slot_wal_keep_size when publisher receives the 
> > > parameter.
> >
> > Since max_slot_wal_keep_size can be changed by reloading the config
> > file, each walsender warns it also at that time?
> >
>
> I think Kuroda-San wants to emit a WARNING at the time of CREATE
> SUBSCRIPTION. But it won't be possible to emit a WARNING at the time
> of ALTER SUBSCRIPTION. Also, as you say if the user later changes the
> value of max_slot_wal_keep_size, then even if we issue LOG/WARNING in
> walsender, it may go unnoticed. If we really want to give WARNING for
> this then we can probably give it as soon as user has set non-default
> value of min_send_delay to indicate that this can lead to retaining
> WAL on the publisher and they should consider setting
> max_slot_wal_keep_size.
>
> Having said that, I think users can always tune max_slot_wal_keep_size
> and min_send_delay (as none of these requires restart) if they see any
> indication of unexpected WAL size growth. There could be multiple ways
> to check it but I think one can refer wal_status in
> pg_replication_slots, the extended value can be an indicator of this.
>
> > Not sure it's
> > helpful. I think it's a legitimate use case to set min_send_delay > 0
> > and max_slot_wal_keep_size = -1, and users might not even notice the
> > WARNING message.
> >
>
> I think it would be better to tell about this in the docs along with
> the 'min_send_delay' description. The key point is whether this would
> be an acceptable trade-off for users who want to use this feature. I
> think it can harm only if users use this without understanding the
> corresponding trade-off. As we kept the default to no delay, it is
> expected from users using this have an understanding of the trade-off.

I imagine that a typical use case would be to set min_send_delay to
several hours to days. I'm concerned that it may not be an
acceptable trade-off for many users that the system cannot collect any
garbage during that period.

I think we can have the apply worker write the decoded changes
somewhere on disk (not as temporary files) and return the flush
LSN, so that it can apply them later and the publisher
can advance the slot's LSN. The feature would be more complex, but from the
user's perspective it would be better.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Missing free_var() at end of accum_sum_final()?

2023-03-03 Thread Dean Rasheed
> On 20.02.23 23:16, Joel Jacobson wrote:
> > In the new attached patch, Andres fixed buffer idea has been implemented
> > throughout the entire numeric.c code base.
>

I have been going over this patch, and IMO it's far too invasive for
the fairly modest performance gains (and performance regressions in
some cases) that it gives (which seem to be somewhat smaller on my
machine).

One code change that I am particularly opposed to is changing all the
low-level functions like add_var(), mul_var(), etc., so that they no
longer accept the result being the same variable as any of the inputs.
That is a particularly convenient thing to be able to do, and without
it, various functions become more complex and less readable, and have
to resort to using more temporary variables.

I actually find the whole business of attaching a static buffer and
new buf_len fields to NumericVar quite ugly, and the associated extra
complexity in alloc_var(), free_var(), zero_var(), and
set_var_from_var() is all part of that. Now that might be worth it, if
this gave significant performance gains across the board, but the
trouble is it doesn't. AFAICS, it seems to be just as likely to
degrade performance. For example:

SELECT sqrt(6*sum(1/x/x)) FROM generate_series(1::numeric, 1000::numeric) g(x);

is consistently 1-2% slower for me, with this patch. That's not much,
but then neither are most of the gains. In a lot of cases, it's so
close to the level of noise that I don't think most users will notice
one way or the other.

So IMO the results just don't justify such an extensive set of
changes, and I think we should abandon this fixed buffer approach.

Having said that, I think the results from the COPY test are worth
looking at more closely. Your results seem to suggest around a 5%
improvement. On my machine it was only around 3%, but that still might
be worth having, if it didn't involve such invasive changes throughout
the rest of the code.

As an experiment, I took another look at my earlier patch, making
make_result() construct the result using the same allocated memory as
the variable's digit buffer (if allocated). That eliminates around a
third of all free_var() calls from numeric.c, and for most functions,
it saves both a palloc() and a pfree(). In the case of numeric_in(), I
realised that it's possible to go further, by reusing the decimal
digits buffer for the NumericVar's digits, and then later for the
final Numeric result. Also, by carefully aligning things, it's
possible to arrange it so that the final make_result() doesn't need to
copy/move the digits at all. With that I get something closer to a 15%
improvement in the COPY test, which is definitely worth having.

In the pi series above, it gave a 3-4% performance improvement, and
that seemed to be a common pattern across a number of other tests.
It's also a much less invasive change, since it's only really changing
make_result(), which makes the knock-on effects much more manageable,
and reduces the chances of any performance regressions.

I didn't do all the tests that you did though, so it would be
interesting to see how it fares in those.

Regards,
Dean
diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c
new file mode 100644
index a83feea..c024fcf
--- a/src/backend/utils/adt/numeric.c
+++ b/src/backend/utils/adt/numeric.c
@@ -474,8 +474,14 @@ static void dump_var(const char *str, Nu
 #define dump_var(s,v)
 #endif
 
+/*
+ * Macros to allocate/free numeric digit buffers. Whenever we allocate a digit
+ * buffer, we give it an extra NUMERIC_HDRSZ (8 bytes) of space, so that the
+ * same memory block can be used by make_result() to construct a Numeric result
+ * from it, avoiding palloc/pfree overhead.
+ */
 #define digitbuf_alloc(ndigits)  \
-	((NumericDigit *) palloc((ndigits) * sizeof(NumericDigit)))
+	((NumericDigit *) palloc(NUMERIC_HDRSZ + (ndigits) * sizeof(NumericDigit)))
 #define digitbuf_free(buf)	\
 	do { \
 		 if ((buf) != NULL) \
@@ -783,8 +789,6 @@ numeric_in(PG_FUNCTION_ARGS)
 			ereturn(escontext, (Datum) 0,
 	(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
 	 errmsg("value overflows numeric format")));
-
-		free_var();
 	}
 
 	PG_RETURN_NUMERIC(res);
@@ -1141,8 +1145,6 @@ numeric_recv(PG_FUNCTION_ARGS)
 		(void) apply_typmod_special(res, typmod, NULL);
 	}
 
-	free_var();
-
 	PG_RETURN_NUMERIC(res);
 }
 
@@ -1305,8 +1307,6 @@ numeric		(PG_FUNCTION_ARGS)
 	(void) apply_typmod(, typmod, NULL);
 	new = make_result();
 
-	free_var();
-
 	PG_RETURN_NUMERIC(new);
 }
 
@@ -1566,7 +1566,6 @@ numeric_round(PG_FUNCTION_ARGS)
 	 */
 	res = make_result();
 
-	free_var();
 	PG_RETURN_NUMERIC(res);
 }
 
@@ -1615,7 +1614,6 @@ numeric_trunc(PG_FUNCTION_ARGS)
 	 */
 	res = make_result();
 
-	free_var();
 	PG_RETURN_NUMERIC(res);
 }
 
@@ -1642,7 +1640,6 @@ numeric_ceil(PG_FUNCTION_ARGS)
 	ceil_var(, );
 
 	res = make_result();
-	free_var();
 
 	PG_RETURN_NUMERIC(res);
 }
@@ -1670,7 +1667,6 @@ numeric_floor(PG_FUNCTION_ARGS)
 	

Re: Non-superuser subscription owners

2023-03-03 Thread Robert Haas
On Wed, Mar 1, 2023 at 7:34 PM Andres Freund  wrote:
> ISTM that this would require annotating most functions in the system. There's
> many problems besides accessing database contents. Just a few examples:
>
> - dblink functions to access another system / looping back
> - pg_read_file()/pg_file_write() allows almost arbitrary mischief
> - pg_stat_reset[_shared]()
> - creating/dropping logical replication slots
> - use untrusted PL functions
> - many more
>
> A single wrongly annotated function would be sufficient to escape. This
> includes user defined functions.
>
> This basically proposes that we can implement a safe sandbox for executing
> arbitrary code in a privileged context. IMO history suggests that that's a
> hard thing to do.

Yeah, that's true, but I don't think switching users all the time is
going to be great either. And it's not like other people haven't gone
this way: that's what plperl (not plperlu) is all about, and
JavaScript running in your browser, and so on. Those things aren't
problem-free, of course, but we're all using them.

When I was initially thinking about this, I thought that maybe we
could just block access to tables and utility statements. That's got
problems in both directions. On the one hand, there are functions like
the ones you propose here that have side effects which we might not
want to allow, and on the other hand, somebody might have an index
expression that does a lookup in a table that they "never change". The
latter case is problematic for non-security reasons, because there's
an invisible dump-ordering constraint that must be obeyed for
dump/restore to work at all, but there's no security issue. Still, I'm
not sure this idea is completely dead in the water. It doesn't seem
unreasonable to me that if you have that kind of case, you have to
somehow opt into the behavior: yeah, I know that index functions I'm
executing are going to read from tables, and I consent to that. And
similarly, if your index expression calls pg_stat_reset_shared(), that
probably ought to be blocked by default too, and if you want to allow
it, you have to say so. Yes, that does require labelling functions, or
maybe putting run-time checks in them:

RequireAvailableCapability(CAP_MODIFY_DATABASE_STATE);

If that capability isn't available in the present context, the call
errors out. That way, it's possible for the required capabilities to
depend on the arguments to the function, and we can change markings in
minor releases without needing catalog changes.
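
Purely as a hypothetical sketch of that idea (none of these names exist in
PostgreSQL; the capability set and the error wording are only illustrations of
the mechanism being floated):

#include "postgres.h"

typedef enum SandboxCapability
{
	CAP_ACCESS_TABLE_DATA = 1 << 0,
	CAP_MODIFY_DATABASE_STATE = 1 << 1
} SandboxCapability;

/* In a real design this would be per-backend state, restricted while
 * evaluating index expressions and similar contexts. */
static int	available_capabilities = ~0;

static void
RequireAvailableCapability(SandboxCapability cap)
{
	if ((available_capabilities & cap) == 0)
		ereport(ERROR,
				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
				 errmsg("database state cannot be modified now"),
				 errdetail("The database system is evaluating an index expression.")));
}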

There's another way of thinking about this problem, which involves
supposing that the invoker should only be induced to do things that
the definer could also have done. Basically do every privilege check
twice, and require that both pass. The problem I have with that is
that there are various operations which depend on your identity, not
just your privileges. For instance, a GRANT statement records a
grantor, and a REVOKE statement infers a grantor whose grant is to be
revoked. The operation isn't just allowed or disallowed based on who
performed it -- it actually does something different depending on who
performs it. I believe we have a number of cases like that, and I
think that they suggest that that whole model is pretty flawed. Even
if that were no issue, this also seems extremely complex to implement,
because we have an absolute crap-ton of places that perform privilege
checks and getting all of those places to check privileges as a second
user seems nightmarish. I also think that it might lead to
confusing error messages: alice tried to do X but we're not allowing
it because bob isn't allowed to do X. Eh, what? As opposed to the
sandboxing approach, where I think you get something more like:

ERROR: database state cannot be modified now
DETAIL: The database system is evaluating an index expression.
HINT: Do $SOMETHING to allow this.

I don't want to press too hard on my idea here. I'm sure it has a
bunch of problems apart from those already mentioned, and those
already mentioned are not trivial. However, I do think there might be
ways to make it work, and I'm not at all convinced that trying to
switch users all over the place is going to be be better, either for
security or usability. Is there some other whole kind of approach we
can take here that we haven't discussed yet?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: [EXTERNAL] Re: Support load balancing in libpq

2023-03-03 Thread Jelte Fennema
> I want to note that the Fisher-Yates algorithm is implemented in a
> difficult to understand manner.
> +if (j < i) /* avoid fetching undefined data if j=i */
> This stuff does not make sense in case of shuffling arrays inplace. It
> is important only for making a new copy of an array and only in
> languages that cannot access uninitialized values. I'd suggest just
> removing this line (in both cases).

Done. Also added another patch to remove the same check from another
place in the codebase where it is unnecessary.
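
For reference, an in-place Fisher-Yates shuffle needs no such guard, since
swapping an element with itself is harmless. A standalone sketch (rand() just
stands in for whatever PRNG libpq actually uses):

#include <stdlib.h>

static void
shuffle(int *items, int n)
{
	for (int i = n - 1; i > 0; i--)
	{
		int			j = rand() % (i + 1);	/* 0 <= j <= i */
		int			tmp = items[i];

		items[i] = items[j];
		items[j] = tmp;
	}
}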


v10-0003-Support-load-balancing-in-libpq.patch
Description: Binary data


v10-0002-Refactor-libpq-to-store-addrinfo-in-a-libpq-owne.patch
Description: Binary data


v10-0004-Remove-unnecessary-check-from-Fisher-Yates-imple.patch
Description: Binary data


v10-0001-libpq-Run-pgindent-after-a9e9a9f32b3.patch
Description: Binary data


Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher

2023-03-03 Thread Önder Kalacı
Hi Hayato, Amit, all


>
>
> ```
> +   /* Build scankey for every non-expression attribute in the index.
> */
> ```
>
> I think that single line comments should not terminated by ".".
>

Hmm, looking into execReplication.c, many of the single-line comments
are terminated by ".".  Also, on HEAD, the same comment is a single-line
comment. So, I'd rather stick to that?



>
> ```
> +   /* There should always be at least one attribute for the index
> scan. */
> ```
>
> Same as above.
>

Same as above :)


>
>
> ```
> +#ifdef USE_ASSERT_CHECKING
> +   IndexInfo  *indexInfo = BuildIndexInfo(idxrel);
> +
> +   Assert(!IsIndexOnlyOnExpression(indexInfo));
> +#endif
> ```
>
> I may misunderstand, but the condition of usable index has alteady been
> checked
> when the oid was set but anyway the you confirmed the condition again
> before
> really using that, right?
> So is it OK not to check another assumption that the index is b-tree,
> non-partial,
> and one column reference?

IIUC we can do that by adding new function like
> IsIndexUsableForReplicaIdentityFull()
> that checks these condition, and then call at
> RelationFindReplTupleByIndex() if
> idxIsRelationIdentityOrPK is false.
>

I think adding a function like IsIndexUsableForReplicaIdentityFull is
useful. I can use it inside
FindUsableIndexForReplicaIdentityFull() and assert here. Also good for
readability.

So, I mainly moved this assert to a more generic place,
RelationFindReplTupleByIndex, with a more generic check.


>
> 032_subscribe_use_index.pl
>
> ```
> +# Testcase start: SUBSCRIPTION CREATE/DROP INDEX WORKS WITHOUT ISSUES
> ...
> +# Testcase end: SUBSCRIPTION RE-CALCULATES INDEX AFTER CREATE/DROP INDEX
> ```
>
> There is still non-consistent case.
>
>
Fixed, thanks

> Anyway, I see your point, so
> if you want to keep the naming as you proposed at least don't change
> the ordering for get_opclass_input_type() call because that looks odd
> to me.


(A small comment from Amit's previous e-mail)

Sure, applied now.

Attaching v31.

Thanks for the reviews!
Onder KALACI


v31_0001_use_index_on_subs_when_pub_rep_ident_full.patch
Description: Binary data


Re: Testing autovacuum wraparound (including failsafe)

2023-03-03 Thread Heikki Linnakangas

On 03/03/2023 13:34, Heikki Linnakangas wrote:

On 16/11/2022 06:38, Ian Lawrence Barwick wrote:

Thanks for the patch. While reviewing the patch backlog, we have determined that
the latest version of this patch was submitted before meson support was
implemented, so it should have a "meson.build" file added for consideration for
inclusion in PostgreSQL 16.


I wanted to do some XID wraparound testing again, to test the 64-bit
SLRUs patches [1], and revived this.


Forgot attachment.

- Heikki
From 34b6b37d7f8d75943924a49e2205033db6e8293c Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Fri, 3 Mar 2023 12:01:28 +0200
Subject: [PATCH 1/1] Add tests for XID wraparound.

The test module includes helper functions to quickly burn through lots
of XIDs. They are used in the tests, and are also handy for manually
testing XID wraparound.

Author: Masahiko Sawada, Heikki Linnakangas
Discussion: https://www.postgresql.org/message-id/CAD21AoDVhkXp8HjpFO-gp3TgL6tCKcZQNxn04m01VAtcSi-5sA%40mail.gmail.com
---
 src/test/modules/Makefile |   1 +
 src/test/modules/meson.build  |   1 +
 src/test/modules/xid_wraparound/.gitignore|   4 +
 src/test/modules/xid_wraparound/Makefile  |  23 ++
 src/test/modules/xid_wraparound/README|   3 +
 src/test/modules/xid_wraparound/meson.build   |  36 +++
 .../xid_wraparound/t/001_emergency_vacuum.pl  | 110 +
 .../modules/xid_wraparound/t/002_limits.pl| 114 +
 .../xid_wraparound/t/003_wraparounds.pl   |  46 
 .../xid_wraparound/xid_wraparound--1.0.sql|  12 +
 .../modules/xid_wraparound/xid_wraparound.c   | 221 ++
 .../xid_wraparound/xid_wraparound.control |   4 +
 12 files changed, 575 insertions(+)
 create mode 100644 src/test/modules/xid_wraparound/.gitignore
 create mode 100644 src/test/modules/xid_wraparound/Makefile
 create mode 100644 src/test/modules/xid_wraparound/README
 create mode 100644 src/test/modules/xid_wraparound/meson.build
 create mode 100644 src/test/modules/xid_wraparound/t/001_emergency_vacuum.pl
 create mode 100644 src/test/modules/xid_wraparound/t/002_limits.pl
 create mode 100644 src/test/modules/xid_wraparound/t/003_wraparounds.pl
 create mode 100644 src/test/modules/xid_wraparound/xid_wraparound--1.0.sql
 create mode 100644 src/test/modules/xid_wraparound/xid_wraparound.c
 create mode 100644 src/test/modules/xid_wraparound/xid_wraparound.control

diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index c629cbe3830..99f5fa23f16 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -14,6 +14,7 @@ SUBDIRS = \
 		  plsample \
 		  snapshot_too_old \
 		  spgist_name_ops \
+		  xid_wraparound \
 		  test_bloomfilter \
 		  test_copy_callbacks \
 		  test_custom_rmgrs \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 1baa6b558d1..39de43dee33 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -10,6 +10,7 @@ subdir('plsample')
 subdir('snapshot_too_old')
 subdir('spgist_name_ops')
 subdir('ssl_passphrase_callback')
+subdir('xid_wraparound')
 subdir('test_bloomfilter')
 subdir('test_copy_callbacks')
 subdir('test_custom_rmgrs')
diff --git a/src/test/modules/xid_wraparound/.gitignore b/src/test/modules/xid_wraparound/.gitignore
new file mode 100644
index 000..5dcb3ff9723
--- /dev/null
+++ b/src/test/modules/xid_wraparound/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/xid_wraparound/Makefile b/src/test/modules/xid_wraparound/Makefile
new file mode 100644
index 000..7a6e0f66762
--- /dev/null
+++ b/src/test/modules/xid_wraparound/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/xid_wraparound/Makefile
+
+MODULE_big = xid_wraparound
+OBJS = \
+	$(WIN32RES) \
+	xid_wraparound.o
+PGFILEDESC = "xid_wraparound - tests for XID wraparound"
+
+EXTENSION = xid_wraparound
+DATA = xid_wraparound--1.0.sql
+
+TAP_TESTS = 1
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/xid_wraparound
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/xid_wraparound/README b/src/test/modules/xid_wraparound/README
new file mode 100644
index 000..3aab464dec7
--- /dev/null
+++ b/src/test/modules/xid_wraparound/README
@@ -0,0 +1,3 @@
+This module contains tests for XID wraparound. The tests use two
+helper functions to quickly consume lots of XIDs, to reach XID
+wraparound faster.
diff --git a/src/test/modules/xid_wraparound/meson.build b/src/test/modules/xid_wraparound/meson.build
new file mode 100644
index 000..ea513e0536c
--- /dev/null
+++ b/src/test/modules/xid_wraparound/meson.build
@@ -0,0 +1,36 @@
+# Copyright (c) 2022-2023, PostgreSQL Global Development Group
+
+# FIXME: prevent install during main install, but not 

Re: Improve WALRead() to suck data directly from WAL buffers when possible

2023-03-03 Thread Bharath Rupireddy
On Wed, Mar 1, 2023 at 9:45 AM Nathan Bossart  wrote:
>
> On Tue, Feb 28, 2023 at 10:38:31AM +0530, Bharath Rupireddy wrote:
> > On Tue, Feb 28, 2023 at 6:14 AM Nathan Bossart  
> > wrote:
> >> Why do we only read a page at a time in XLogReadFromBuffersGuts()?  What is
> >> preventing us from copying all the data we need in one go?
> >
> > Note that most of the WALRead() callers request a single page of
> > XLOG_BLCKSZ bytes even if the server has less or more available WAL
> > pages. It's the streaming replication wal sender that can request less
> > than XLOG_BLCKSZ bytes and upto MAX_SEND_SIZE (16 * XLOG_BLCKSZ). And,
> > if we read, say, MAX_SEND_SIZE at once while holding
> > WALBufMappingLock, that might impact concurrent inserters (at least, I
> > can say it in theory) - one of the main intentions of this patch is
> > not to impact inserters much.
>
> Perhaps we should test both approaches to see if there is a noticeable
> difference.  It might not be great for concurrent inserts to repeatedly
> take the lock, either.  If there's no real difference, we might be able to
> simplify the code a bit.

I took a stab at this - acquire WALBufMappingLock separately for each
requested WAL buffer page vs acquire WALBufMappingLock once for all
requested WAL buffer pages. I chose the pgbench tpcb-like benchmark
that has 3 UPDATE statements and 1 INSERT statement. I ran pgbench for
30min with scale factor 100 and 4096 clients with primary and 1 async
standby, see [1]. I captured wait_events to see the contention on
WALBufMappingLock. I haven't noticed any contention on the lock and no
difference in TPS either; see [2] for results on HEAD, see [3] for
results on the v6 patch, which has the "acquire WALBufMappingLock separately
for each requested WAL buffer page" strategy, and see [4] for results
on the v7 patch (attached herewith), which has the "acquire WALBufMappingLock
once for all requested WAL buffer pages" strategy. Another thing to
note from the test results is the reduction in WALRead IO wait events
from 136 on HEAD to 1 on the v6 or v7 patch. So, the read from WAL buffers
is really helping here.

With these observations, I'd like to use the approach that acquires
WALBufMappingLock once for all requested WAL buffer pages unlike v6
and the previous patches.
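
Schematically, the two strategies only differ in where the lock is taken;
copy_page_from_wal_buffers() below is a made-up helper, and the real patch
differs in details such as the lock mode and partial-page handling:

#include "postgres.h"
#include "storage/lwlock.h"

extern void copy_page_from_wal_buffers(char *dst, int pageno);	/* hypothetical */

/* v6 approach: lock taken and released once per requested page. */
static void
read_pages_lock_per_page(char *dst, int npages)
{
	for (int i = 0; i < npages; i++)
	{
		LWLockAcquire(WALBufMappingLock, LW_SHARED);
		copy_page_from_wal_buffers(dst + i * XLOG_BLCKSZ, i);
		LWLockRelease(WALBufMappingLock);
	}
}

/* v7 approach: lock held across all requested pages. */
static void
read_pages_lock_once(char *dst, int npages)
{
	LWLockAcquire(WALBufMappingLock, LW_SHARED);
	for (int i = 0; i < npages; i++)
		copy_page_from_wal_buffers(dst + i * XLOG_BLCKSZ, i);
	LWLockRelease(WALBufMappingLock);
}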

I'm attaching the v7 patch set with this change for further review.

[1]
shared_buffers = '8GB'
wal_buffers = '1GB'
max_wal_size = '16GB'
max_connections = '5000'
archive_mode = 'on'
archive_command='cp %p /home/ubuntu/archived_wal/%f'
./pgbench --initialize --scale=100 postgres
./pgbench -n -M prepared -U ubuntu postgres -b tpcb-like -c4096 -j4096 -T1800

[2]
HEAD:
done in 20.03 s (drop tables 0.00 s, create tables 0.01 s, client-side
generate 15.53 s, vacuum 0.19 s, primary keys 4.30 s).
tps = 11654.475345 (without initial connection time)

50950253  Lock| transactionid
16472447  Lock| tuple
3869523  LWLock  | LockManager
 739283  IPC | ProcArrayGroupUpdate
 718549  |
 439877  LWLock  | WALWrite
 130737  Client  | ClientRead
 121113  LWLock  | BufferContent
  70778  LWLock  | WALInsert
  43346  IPC | XactGroupUpdate
  18547
  18546  Activity| LogicalLauncherMain
  18545  Activity| AutoVacuumMain
  18272  Activity| ArchiverMain
  17627  Activity| WalSenderMain
  17207  Activity| WalWriterMain
  15455  IO  | WALSync
  14963  LWLock  | ProcArray
  14747  LWLock  | XactSLRU
  13943  Timeout | CheckpointWriteDelay
  10519  Activity| BgWriterHibernate
   8022  Activity| BgWriterMain
   4486  Timeout | SpinDelay
   4443  Activity| CheckpointerMain
   1435  Lock| extend
670  LWLock  | XidGen
373  IO  | WALWrite
283  Timeout | VacuumDelay
268  IPC | ArchiveCommand
249  Timeout | VacuumTruncate
136  IO  | WALRead
115  IO  | WALInitSync
 74  IO  | DataFileWrite
 67  IO  | WALInitWrite
 36  IO  | DataFileFlush
 35  IO  | DataFileExtend
 17  IO  | DataFileRead
  4  IO  | SLRUWrite
  3  IO  | BufFileWrite
  2  IO  | DataFileImmediateSync
  1 Tuples only is on.
  1  LWLock  | SInvalWrite
  1  LWLock  | LockFastPath
  1  IO  | ControlFileSyncUpdate

[3]
done in 19.99 s (drop tables 0.00 s, create tables 0.01 s, client-side
generate 15.52 s, vacuum 0.18 s, primary keys 4.28 s).
tps = 11689.584538 (without initial connection time)

50678977  Lock| transactionid
16252048  Lock| tuple
4146827  LWLock  | LockManager
 768256  |
 719923  IPC | ProcArrayGroupUpdate
 432836  LWLock  | WALWrite
 140354  Client  | ClientRead
 124203  

Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher

2023-03-03 Thread Önder Kalacı
Hi Vignesh,

Thanks for the review


> 1) We are currently calling RelationGetIndexList twice, once in
> FindUsableIndexForReplicaIdentityFull function and in the caller too,
> we could avoid one of the calls by passing the indexlist to the
> function or removing the check here, index list check can be handled
> in FindUsableIndexForReplicaIdentityFull.
> +   if (remoterel->replident == REPLICA_IDENTITY_FULL &&
> +   RelationGetIndexList(localrel) != NIL)
> +   {
> +   /*
> +* If we had a primary key or relation identity with a
> unique index,
> +* we would have already found and returned that oid.
> At this point,
> +* the remote relation has replica identity full and
> we have at least
> +* one local index defined.
> +*
> +* We are looking for one more opportunity for using
> an index. If
> +* there are any indexes defined on the local
> relation, try to pick
> +* a suitable index.
> +*
> +* The index selection safely assumes that all the
> columns are going
> +* to be available for the index scan given that
> remote relation has
> +* replica identity full.
> +*/
> +   return FindUsableIndexForReplicaIdentityFull(localrel);
> +   }
> +
>

makes sense, done


>
> 2) Copyright year should be mentioned as 2023
> diff --git a/src/test/subscription/t/032_subscribe_use_index.pl
> b/src/test/subscription/t/032_subscribe_use_index.pl
> new file mode 100644
> index 00..db0a7ea2a0
> --- /dev/null
> +++ b/src/test/subscription/t/032_subscribe_use_index.pl
> @@ -0,0 +1,861 @@
> +# Copyright (c) 2021-2022, PostgreSQL Global Development Group
> +
> +# Test logical replication behavior with subscriber uses available index
> +use strict;
> +use warnings;
> +use PostgreSQL::Test::Cluster;
> +use PostgreSQL::Test::Utils;
> +use Test::More;
> +
>

I changed it to #Copyright (c) 2022-2023, but I'm not sure if it should be
only 2023 or left like this.


>
> 3) Many of the tests are using the same tables, we need not
> drop/create publication/subscription for each of the team, we could
> just drop and create required indexes and verify the update/delete
> statements.
> +# 
> +# Testcase start: SUBSCRIPTION USES INDEX
> +#
> +# Basic test where the subscriber uses index
> +# and only updates 1 row and deletes
> +# 1 other row
> +#
> +
> +# create tables pub and sub
> +$node_publisher->safe_psql('postgres',
> +   "CREATE TABLE test_replica_id_full (x int)");
> +$node_publisher->safe_psql('postgres',
> +   "ALTER TABLE test_replica_id_full REPLICA IDENTITY FULL;");
> +$node_subscriber->safe_psql('postgres',
> +   "CREATE TABLE test_replica_id_full (x int)");
> +$node_subscriber->safe_psql('postgres',
> +   "CREATE INDEX test_replica_id_full_idx ON
> test_replica_id_full(x)");
>
> +# 
> +# Testcase start: SUBSCRIPTION CREATE/DROP INDEX WORKS WITHOUT ISSUES
> +#
> +# This test ensures that after CREATE INDEX, the subscriber can
> automatically
> +# use one of the indexes (provided that it fulfils the requirements).
> +# Similarly, after DROP index, the subscriber can automatically switch to
> +# sequential scan
> +
> +# create tables pub and sub
> +$node_publisher->safe_psql('postgres',
> +   "CREATE TABLE test_replica_id_full (x int NOT NULL, y int)");
> +$node_publisher->safe_psql('postgres',
> +   "ALTER TABLE test_replica_id_full REPLICA IDENTITY FULL;");
> +$node_subscriber->safe_psql('postgres',
> +   "CREATE TABLE test_replica_id_full (x int NOT NULL, y int)");
>

Well, not all the tables are exactly the same; there are 4-5 different
tables, though the table names are mostly the same.

Plus, the overhead does not seem large enough to justify complicating
the test. Many of the src/test/subscription/t files follow this pattern.

Do you have strong opinions on changing this?


> 4) These additional blank lines can be removed to keep it consistent:
> 4.a)
> +# Testcase end: SUBSCRIPTION DOES NOT USE PARTIAL INDEX
> +# 
> +
> +
> +# 
> +# Testcase start: SUBSCRIPTION DOES NOT USE INDEXES WITH ONLY EXPRESSIONS
>
> 4.b)
> +# Testcase end: Unique index that is not primary key or replica identity
> +# 
> +
> +
> +
> +# 
> +# Testcase start: SUBSCRIPTION BEHAVIOR WITH ENABLE_INDEXSCAN
>

Thanks, fixed.

Attached v30
From 29ec8ec8f9e69db3850fa1f7af83e06baa334910 Mon Sep 17 00:00:00 2001
From: Onder Kalaci 
Date: Wed, 22 Feb 2023 14:12:56 +0300
Subject: [PATCH 

Re: Allow tailoring of ICU locales with custom rules

2023-03-03 Thread Peter Eisentraut

On 02.03.23 16:39, Laurenz Albe wrote:

On Wed, 2023-02-22 at 18:35 +0100, Peter Eisentraut wrote:

- there doesn't seem to be a way to add rules to template1.
If someone wants to have icu rules and initial contents to their new
databases, I think they need to create a custom template database
(from template0) for that purpose, in addition to template1.
   From a usability standpoint, this is a bit cumbersome, as it's
normally the role of template1.
To improve on that, shouldn't initdb be able to create template0 with
rules too?


Right, that would be an initdb option.  Is that too many initdb options
then?  It would be easy to add, if we think it's worth it.


An alternative would be to document that you can drop "template1" and
create it again using the ICU collation rules you need.

But I'd prefer an "initdb" option.


Ok, here is a version with an initdb option and also a createdb option 
(to expose the CREATE DATABASE option).


You can mess with people by setting up your databases like this:

initdb -D data --locale-provider=icu --icu-rules=' < c < b < e < d'

;-)
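
For reference, the same rules could presumably also be applied at the SQL
level through the options this patch adds (a rough, untested sketch based on
the syntax summaries in the patch; the object names are made up):

CREATE COLLATION scrambled (provider = icu, locale = 'en',
                            rules = ' < c < b < e < d');

CREATE DATABASE scrambled_db TEMPLATE template0 LOCALE_PROVIDER icu
       ICU_LOCALE 'en' ICU_RULES ' < c < b < e < d';
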
From 615763ccf5a1c18c3da1286eb4c86d19eb397ac0 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Fri, 3 Mar 2023 11:46:56 +0100
Subject: [PATCH v7] Allow tailoring of ICU locales with custom rules

This exposes the ICU facility to add custom collation rules to a
standard collation.

Discussion: 
https://www.postgresql.org/message-id/flat/821c71a4-6ef0-d366-9acf-bb8e367f7...@enterprisedb.com
---
 doc/src/sgml/catalogs.sgml|  18 
 doc/src/sgml/ref/create_collation.sgml|  22 
 doc/src/sgml/ref/create_database.sgml |  14 +++
 doc/src/sgml/ref/createdb.sgml|  10 ++
 doc/src/sgml/ref/initdb.sgml  |  10 ++
 src/backend/catalog/pg_collation.c|   5 +
 src/backend/commands/collationcmds.c  |  23 +++-
 src/backend/commands/dbcommands.c |  51 -
 src/backend/utils/adt/pg_locale.c |  41 ++-
 src/backend/utils/init/postinit.c |  11 +-
 src/bin/initdb/initdb.c   |  15 ++-
 src/bin/pg_dump/pg_dump.c |  37 +++
 src/bin/psql/describe.c   | 100 +++---
 src/bin/scripts/createdb.c|  11 ++
 src/include/catalog/pg_collation.h|   2 +
 src/include/catalog/pg_database.dat   |   2 +-
 src/include/catalog/pg_database.h |   3 +
 src/include/utils/pg_locale.h |   1 +
 .../regress/expected/collate.icu.utf8.out |  30 ++
 src/test/regress/expected/psql.out|  18 ++--
 src/test/regress/sql/collate.icu.utf8.sql |  13 +++
 21 files changed, 379 insertions(+), 58 deletions(-)

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index c1e4048054..746baf5053 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -2428,6 +2428,15 @@ pg_collation 
Columns
   
  
 
+ 
+  
+   collicurules text
+  
+  
+   ICU collation rules for this collation object
+  
+ 
+
  
   
collversion text
@@ -3106,6 +3115,15 @@ pg_database 
Columns
   
  
 
+ 
+  
+   daticurules text
+  
+  
+   ICU collation rules for this database
+  
+ 
+
  
   
datcollversion text
diff --git a/doc/src/sgml/ref/create_collation.sgml 
b/doc/src/sgml/ref/create_collation.sgml
index 136976165c..289f8147f1 100644
--- a/doc/src/sgml/ref/create_collation.sgml
+++ b/doc/src/sgml/ref/create_collation.sgml
@@ -27,6 +27,7 @@
 [ LC_CTYPE = lc_ctype, ]
 [ PROVIDER = provider, ]
 [ DETERMINISTIC = boolean, ]
+[ RULES = rules, ]
 [ VERSION = version ]
 )
 CREATE COLLATION [ IF NOT EXISTS ] name FROM 
existing_collation
@@ -149,6 +150,19 @@ Parameters
  
 
 
+
+ rules
+
+ 
+  
+   Specifies additional collation rules to customize the behavior of the
+   collation.  This is supported for ICU only.  See https://unicode-org.github.io/icu/userguide/collation/customization/"/>
+   for details on the syntax.
+  
+ 
+
+
 
  version
 
@@ -228,6 +242,14 @@ Examples
 
   
 
+  
+   To create a collation using the ICU provider, based on the English ICU
+   locale, with custom rules:
+
+
+
+  
+
   
To create a collation from an existing collation:
 
diff --git a/doc/src/sgml/ref/create_database.sgml 
b/doc/src/sgml/ref/create_database.sgml
index 57d13e34c2..13793bb6b7 100644
--- a/doc/src/sgml/ref/create_database.sgml
+++ b/doc/src/sgml/ref/create_database.sgml
@@ -30,6 +30,7 @@
[ LC_COLLATE [=] lc_collate ]
[ LC_CTYPE [=] lc_ctype ]
[ ICU_LOCALE [=] icu_locale ]
+   [ ICU_RULES [=] icu_rules ]
[ LOCALE_PROVIDER [=] locale_provider ]
[ COLLATION_VERSION = collation_version ]
[ TABLESPACE [=] tablespace_name ]
@@ -192,6 +193,19 

What's MultiXactId?

2023-03-03 Thread jack...@gmail.com
In src/include/access/htup_details.h, I found this:
#define HEAP_XMAX_IS_MULTI 0x1000 /* t_xmax is a MultiXactId */
What is a MultiXactId? Can you give me a scenario that sets this bit to 1?


jack...@gmail.com
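
One common scenario that produces a MultiXactId, and hence sets
HEAP_XMAX_IS_MULTI (0x1000) in t_infomask, is two transactions holding a
row-level lock on the same tuple at the same time. A minimal sketch, with a
made-up table:

CREATE TABLE t (id int PRIMARY KEY);
INSERT INTO t VALUES (1);

-- session 1:
BEGIN;
SELECT * FROM t WHERE id = 1 FOR SHARE;

-- session 2, while session 1 is still open:
BEGIN;
SELECT * FROM t WHERE id = 1 FOR SHARE;

-- xmax now has to record two lockers, so it is replaced by a MultiXactId.
-- With the pageinspect extension installed, the bit can be observed:
SELECT t_infomask & x'1000'::int <> 0 AS xmax_is_multi
FROM heap_page_items(get_raw_page('t', 0));
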


RE: [Proposal] Add foreign-server health checks infrastructure

2023-03-03 Thread Hayato Kuroda (Fujitsu)
Hi Katsuragi-san,

Thank you for reviewing! PSA new version.

> >> I rethought the pqSocketPoll part. Current interpretation of
> >> arguments seems a little bit confusing because a specific pattern
> >> of arguments has a different meaning. What do you think about
> >> introducing a new argument like `int forConnCheck`? This seems
> >> straightforward and readable.
> >
> > I think it may be better, so fixed.
> > But now we must consider another thing - will we support combination
> > of conncheck
> > and {read|write} or timeout? Especially about timeout, if someone
> > calls pqSocketPoll()
> > with forConnCheck = 1 and end_time = -1, the process may be stuck
> > because it waits
> > till socket peer closed connection.
> > Currently the forConnCheck can be specified with other request, but
> > timeout must be zero.
> 
> Yes, we need to consider these.
> I'm wondering whether we need a special care for the combination
> of event and timeout. Surely, if forConnCheck is set and end_time = -1,
> pqSocketPoll blocks until the connection close. However, the
> behavior matches the meaning of the arguments and does not seem
> confusing (also not an error state). Do we need to restrict this
> kind of usage in the pqSocketPoll side? I think like the following
> might be fine.
> 
> ```
>   if (!forRead && !forWrite)
>   {
>   if (!(forConnCheck && PQconnCheckable()))
>   return 0;
>   }
> ```

Seems right, I was too pessimistic. Fixed.

> > While making the patch, I come up with idea that
> > postgres_fdw_verify_connection_states*
> > returns NULL if PQconnCheck() cannot work on this platform. This can be
> > done by
> > adding PQconnCheckable(). It may be reasonable because the checking is
> > not really
> > done on this environment. How do you think?
> 
> I agree with you.

Changed.

> Followings are comments for v33. Please check.
> 0001:
> 1. the comment of PQconnCheck
> +/*
> + * Check whether the socket peer closed connection or not.
> + *
> 
> Check whether the socket peer closed 'the' connection or not?

Changed.

> 2. the comment of pqSocketCheck
> - * or both.  Returns >0 if one or more conditions are met, 0 if it
> timed
> - * out, -1 if an error occurred.
> + * or both. Moreover, this function can check the health of socket on
> some
> + * limited platforms if end_time is 0.
> 
> the health of socket -> the health of the connection?
> if end_time is 0 -> if forConnCehck is specified?

Fixed.

> 3. the comment of pqSocketPoll
> - * If neither forRead nor forWrite are set, immediately return a
> timeout
> - * condition (without waiting).  Return >0 if condition is met, 0
> - * if a timeout occurred, -1 if an error or interrupt occurred.
> + * Moreover, this function can check the health of socket on some
> limited
> + * platforms if end_time is 0.
> 
> the health of socket -> the health of the connection?
> if end_time is 0 -> if forConnCehck is specified?

Fixed.

> 4. the code of pqSocketPoll
> +#if defined(POLLRDHUP)
> + if (forConnCheck)
> + input_fd.events |= POLLRDHUP;
> +#endif
> 
> I think it is better to use PQconnCheckable() to remove the macro.

IIUC the macro check is needed. FreeBSD, macOS, and some other platforms do
not have the POLLRDHUP macro, so the code cannot compile there. I checked
with my CI and got the following error.

```
...
FAILED: src/interfaces/libpq/libpq.a.p/fe-misc.c.o 
...
../src/interfaces/libpq/fe-misc.c:1149:22: error: use of undeclared identifier 
'POLLRDHUP'
input_fd.events |= POLLRDHUP;


So the POLLRDHUP reference must be invisible to those platforms.

> 5. the code of pqSocketCheck and the definition of pqSocketPoll
> - result = pqSocketPoll(conn->sock, forRead, forWrite,
> end_time);
> + result = pqSocketPoll(conn->sock, forRead, forWrite,
> forConnCheck,
> end_time);
> 
> -pqSocketPoll(int sock, int forRead, int forWrite, time_t end_time)
> +pqSocketPoll(int sock, int forRead, int forWrite, int forConnCheck,
> time_t end_time)
> 
> Should these be divided into two lines?

pgindent did not complain about them, but I divided the lines anyway.

> 6. the comment of verify_cached_connections
> + * This function emits warnings if a disconnection is found, and this
> returns
> + * false only when the lastly verified server seems to be disconnected.

Updated. I think comments were not correct, so these were also fixed.

> It seems better to write the case where this function returns
> true.
> 7. the comment of postgres_fdw_verify_connection_states
> + * This function emits a warning if a disconnection is found. This
> returns
> + * false only when the verified server seems to be disconnected, and
> returns
> + * NULL if the connection check had not been done.
> 
> It seems better to write the case where this function returns
> true.


> 8. the code of
> + (errcode(ERRCODE_CONNECTION_FAILURE),
> +  errmsg("%s", str.data),
> +  errdetail("Socket close is 

Re: Add support for unit "B" to pg_size_pretty()

2023-03-03 Thread Dean Rasheed
On Fri, 3 Mar 2023 at 11:23, David Rowley  wrote:
>
> On Fri, 3 Mar 2023 at 09:32, Dean Rasheed  wrote:
> > Hmm, I think it would be easier to just have a separate table for
> > pg_size_bytes(), rather than reusing pg_size_pretty()'s table.
>
> Maybe that's worthwhile if we were actually thinking of adding any
> non-base 2 units in the future, but if we're not, perhaps it's better
> just to have the smaller alias array which for Peter's needs will just
> require 1 element + the NULL one instead of 6 + NULL.
>

Maybe. It's the tradeoff between having a smaller array and more code
(2 loops) vs a larger array and less code (1 loop).

> In any case, I'm not really sure I see what the path forward would be for
> adding something like base-10 units to pg_size_bytes(). If we were to
> change MB to mean 10^6 rather than 2^20, I think many people would get
> upset.
>

Yeah, that's probably true. Given the way this and configuration
parameters currently work, I think we're stuck with 1MB meaning 2^20
bytes.
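
For reference, the current base-2 interpretation in a quick example:

SELECT pg_size_bytes('1MB');             -- 1048576, i.e. 2^20
SELECT pg_size_pretty(1048576::bigint);  -- '1024 kB'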

Regards,
Dean




Re: Missing update of all_hasnulls in BRIN opclasses

2023-03-03 Thread Tomas Vondra



On 3/3/23 11:32, Alvaro Herrera wrote:
> 
> Thanks for doing all this.  (Do I understand correctly that this patch
> is not in the commitfest?)
> 
> I think my mental model for this was that "allnulls" meant that either
> there are no values for the column in question or that the values were
> all nulls (For minmax without NULL handling, which is where this all
> started, these two things are essentially the same: the range is not to
> be returned.  So this became a bug the instant I added handling for NULL
> values.)  I failed to realize that these were two different things, and
> this is likely the origin of all these troubles.
> 
> What do you think of using the unused bit in BrinTuple->bt_info to
> denote a range that contains no heap tuples?  This also means we need it
> in BrinMemTuple, I think we can do this:
> 
> @@ -44,6 +44,7 @@ typedef struct BrinValues
>  typedef struct BrinMemTuple
>  {
>   boolbt_placeholder; /* this is a placeholder tuple */
> + boolbt_empty_range; /* range has no tuples */
>   BlockNumber bt_blkno;   /* heap blkno that the tuple is for */
>   MemoryContext bt_context;   /* memcxt holding the bt_columns values 
> */
>   /* output arrays for brin_deform_tuple: */
> @@ -69,7 +70,7 @@ typedef struct BrinTuple
>*
>* 7th (high) bit: has nulls
>* 6th bit: is placeholder tuple
> -  * 5th bit: unused
> +  * 5th bit: range has no tuples
>* 4-0 bit: offset of data
>* ---
>*/
> @@ -82,7 +83,7 @@ typedef struct BrinTuple
>   * bt_info manipulation macros
>   */
>  #define BRIN_OFFSET_MASK 0x1F
> -/* bit 0x20 is not used at present */
> +#define BRIN_EMPTY_RANGE 0x20
>  #define BRIN_PLACEHOLDER_MASK0x40
>  #define BRIN_NULLS_MASK  0x80
>  
> (Note that bt_empty_range uses a hole in the struct, so there's no ABI
> change.)
> 
> This is BRIN-tuple-level, not column-level, so conceptually it seems
> more appropriate.  (In the case where both are empty in union_tuples, we
> can return without entering the per-attribute loop at all, though I
> admit it's not a very interesting case.)  This approach avoids having to
> invent the strange combination of all+has to mean empty.
> 

Oh, that's an interesting idea! I hadn't realized there's an unused bit
at the tuple level, and I absolutely agree it'd be a better match than
having this in individual summaries (like now).

It'd mean we'd not have the option to fix this within the opclasses,
because we only pass them the BrinValues and not the tuple. But if you
think that's reasonable, that'd be OK.

The other thing I was unsure about is whether the bit could be set for any
existing tuples, but AFAICS that shouldn't be possible - brin_form_tuple does
palloc0, so it should be 0.

I suspect doing this might make the patch quite a bit simpler, actually.

> 
> On 2023-Feb-24, Tomas Vondra wrote:
> 
>> I wonder what's the best
>> way to test this in an automated way - it's very dependent on timing of
>> the concurrent updated. For example we need to do something like this:
>>
>> T1: run pg_summarize_range() until it inserts the placeholder tuple
>> T2: do an insert into the page range (updates placeholder)
>> T1: continue pg_summarize_range() to merge into the placeholder
>>
>> But there are no convenient ways to do this, I think. I had to check the
>> various cases using breakpoints in gdb etc.
> 
> Yeah, I struggled with this during initial development but in the end
> did nothing.  I think we would need to introduce some new framework,
> perhaps Korotkov stop-events stuff at 
> https://postgr.es/m/CAPpHfdsTeb+hBT5=qxghjng_chcjldanq9sdy9vnwbf2e2p...@mail.gmail.com
> which seemed to me a good fit -- we would add a stop point after the
> placeholder tuple is inserted.
> 

Yeah, but we don't have that at the moment. I actually ended up adding a
couple of sleeps during development, which allowed me to hit just the right
order of operations - a poor man's version of those stop-events. It did
work, but it's hardly an acceptable approach.

>> I'm not very happy with the union_tuples() changes - it's quite verbose,
>> perhaps a bit too verbose. We have to check for empty ranges first, and
>> then various combinations of allnulls/hasnulls flags for both BRIN
>> tuples. There are 9 combinations, and the current code just checks them
>> one by one - I was getting repeatedly confused by the original code, but
>> maybe it's too much.
> 
> I think it's okay.  I tried to make it more compact (by saying "these
> two combinations here are case 2, and these two other are case 4", and
> keeping each of the other combinations a separate case; so there are
> really 7 cases).  But that doesn't make it any easier to follow, on the
> contrary it was more convoluted.  I think a dozen extra lines of source
> is not a problem.
> 

OK

>> The alternative is to apply the same fix to every BRIN_PROCNUM_UNION
>> 

Re: Allow ordered partition scans in more cases

2023-03-03 Thread David Rowley
On Thu, 23 Feb 2023 at 02:10, Ronan Dunklau  wrote:
> I haven't looked too deeply into it, but it seems reasonable that the whole
> sort would cost cheaper than individual sorts on partitions + incremental
> sorts, except when the whole sort would spill to disk much more than the
> incremental ones. I find it quite difficult to reason about what that 
> threshold
> should be, but I managed to find a case which could fit in a test:

Thanks for coming up with that test case.  It's a little disappointing
to see that so many rows had to be added to get the plan to change. I
wonder if it's really worth testing this particular case. ~1800 rows
is a little more significant than I'd have hoped. The buildfarm has a
few dinosaurs that would likely see a noticeable slowdown from that.

What's on my mind now is if turning 1 Sort into N Sorts is a
particularly good idea from a work_mem standpoint. I see that we don't
do tuplesort_end() until executor shutdown, so that would mean that we
could end up using 1 x work_mem per Sort node.  I idly wondered if we
couldn't do tuplesort_end() after spitting out the final tuple when
EXEC_FLAG_REWIND is not set, but that would still mean we could use N
work_mems when EXEC_FLAG_REWIND *is* set. We only really have
visibility of that during execution too, so can't really make a
decision at plan time based on that.

I'm not quite sure if I'm being overly concerned here or not. All it
would take to get a sort per partition today would be to put a
suitable index on just 1 of the partitions. So this isn't exactly a
new problem, it's just making an old problem perhaps a little more
likely.  The problem does also exist for things like partition-wise
joins too for Hash and Merge joins.  Partition-wise joins are disabled
by default, however.

David




Re: Testing autovacuum wraparound (including failsafe)

2023-03-03 Thread Heikki Linnakangas

On 16/11/2022 06:38, Ian Lawrence Barwick wrote:

Thanks for the patch. While reviewing the patch backlog, we have determined that
the latest version of this patch was submitted before meson support was
implemented, so it should have a "meson.build" file added for consideration for
inclusion in PostgreSQL 16.


I wanted to do some XID wraparound testing again, to test the 64-bit 
SLRUs patches [1], and revived this.


I took a different approach to consuming the XIDs. Instead of setting 
nextXID directly, bypassing GetNewTransactionId(), this patch introduces 
a helper function to call GetNewTransactionId() repeatedly. But because 
that's slow, it does include a shortcut to skip over "uninteresting" 
XIDs. Whenever nextXid is close to an SLRU page boundary or XID 
wraparound, it calls GetNewTransactionId(), and otherwise it bumps up 
nextXid close to the next "interesting" value. That's still a lot slower 
than just setting nextXid, but exercises the code more realistically.


I've written some variant of this helper function many times over the 
years, for ad hoc testing. I'd love to have it permanently in the git tree.


In addition to Masahiko's test for emergency vacuum, this includes two 
other tests. 002_limits.pl tests the "warn limit" and "stop limit" in 
GetNewTransactionId(), and 003_wraparound.pl burns through 10 billion 
transactions in total, exercising XID wraparound in general. 
Unfortunately these tests are pretty slow; the tests run for about 4 
minutes on my laptop in total, and use about 20 GB of disk space. So 
perhaps these need to be put in a special test suite that's not run as 
part of "check-world". Or perhaps leave out the 003_wraparounds.pl test, 
that's the slowest of the tests. But I'd love to have these in the git 
tree in some form.
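
As an aside, while such a test runs, a standard way to watch how far XID
consumption has progressed is a query like the following (not part of the
patch itself):

SELECT datname, age(datfrozenxid) AS xid_age
FROM pg_database
ORDER BY xid_age DESC;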


[1] 
https://www.postgresql.org/message-id/CAJ7c6TPKf0W3MfpP2vr=kq7-nm5g12vtbhi7miu_5m8ag3c...@mail.gmail.com)


- Heikki





Re: Fix comments in gistxlogDelete, xl_heap_freeze_page and xl_btree_delete

2023-03-03 Thread Amit Kapila
On Thu, Mar 2, 2023 at 6:35 PM Drouvot, Bertrand
 wrote:
>
> On 1/6/23 11:05 AM, Drouvot, Bertrand wrote:
> > Hi hackers,
> >
> > Please find attached a patch to $SUBJECT.
> >
> > The wrong comments have been discovered by Robert in [1].
> >
> > Submitting this here as a separate thread so it does not get lost in the 
> > logical decoding
> > on standby thread.
> >
> > [1]: 
> > https://www.postgresql.org/message-id/CA%2BTgmoYTTsxP8y6uknZvCBNCRq%2B1FJ4zGbX8Px1TGW459fGsaQ%40mail.gmail.com
> >
> > Looking forward to your feedback,
> >
> > Regards,
> >
>
> It looks like I did not create a CF entry for this one: fixed with [1].
>
> Also, while at it, adding a commit message in V2 attached.
>

LGTM.

-- 
With Regards,
Amit Kapila.




Re: Add support for unit "B" to pg_size_pretty()

2023-03-03 Thread David Rowley
On Fri, 3 Mar 2023 at 09:32, Dean Rasheed  wrote:
> Hmm, I think it would be easier to just have a separate table for
> pg_size_bytes(), rather than reusing pg_size_pretty()'s table. I.e.,
> size_bytes_units[], which would only need name and multiplier columns
> (not round and limit). Done that way, it would be easier to add other
> units later (e.g., non-base-2 units).

Maybe that's worthwhile if we were actually thinking of adding any
non-base 2 units in the future, but if we're not, perhaps it's better
just to have the smaller alias array which for Peter's needs will just
require 1 element + the NULL one instead of 6 + NULL.

In any case, I'm not really sure I see what the path forward would be for
adding something like base-10 units to pg_size_bytes(). If we were to change
MB to mean 10^6 rather than 2^20, I think many people would get upset.

David




Re: Simplify standby state machine a bit in WaitForWALToBecomeAvailable()

2023-03-03 Thread Michael Paquier
On Fri, Mar 03, 2023 at 01:38:38PM +0530, Bharath Rupireddy wrote:
> I will withdraw the patch proposed in this thread.

Okay, thanks for confirming.
--
Michael




Re: Add documentation for coverage reports with meson

2023-03-03 Thread Michael Paquier
On Fri, Mar 03, 2023 at 10:10:15AM +0100, Peter Eisentraut wrote:
> genhtml is part of the lcov package.  I think it would be confusing to
> mention it explicitly, since you won't be able to find it as something to
> install.  Maybe leave the original list and change "programs" to "packages"?

Makes sense.

> In the installation chapter we use titles like "Building and Installation
> with Autoconf and Make" and "Building and Installation with Meson".  We
> should use analogous wordings here.

OK, changed to something like that.

> This ignores which directory you have to be in.  The meson calls have to be
> at the top level, the ninja calls have to be in the build directory. We
> should be more precise here, otherwise someone trying this will find that it
> doesn't work.

Hmm.  I can see that it is possible to pass the directory to operate in
with -C; still, it is simpler to move into the build directory.

> Personally I use "meson compile" instead of "ninja"; I'm not sure what the
> best recommendation is, but that least that way all the initial commands are
> "meson something" instead of going back and forth.

Using meson compile is fine by me for the docs.  Note that I cannot
see an option with meson to do coverage reports, and my environment
uses 1.0.1.  Only ninja handles that.

Updated version attached.
--
Michael
diff --git a/doc/src/sgml/regress.sgml b/doc/src/sgml/regress.sgml
index a08c7a78af..344c834280 100644
--- a/doc/src/sgml/regress.sgml
+++ b/doc/src/sgml/regress.sgml
@@ -825,53 +825,77 @@ PG_TEST_NOCLEAN=1 make -C src/bin/pg_dump check
 instrumentation, so that it becomes possible to examine which
 parts of the code are covered by the regression tests or any other
 test suite that is run with the code.  This is currently supported
-when compiling with GCC, and it requires the gcov
-and lcov programs.
+when compiling with GCC, and it requires the gcov
+and lcov packages.

 
-   
-A typical workflow looks like this:
+   
+Coverage with Autoconf and Make
+
+ A typical workflow looks like this:
 
 ./configure --enable-coverage ... OTHER OPTIONS ...
 make
 make check # or other test suite
 make coverage-html
 
-Then point your HTML browser
-to coverage/index.html.
-   
+ Then point your HTML browser
+ to coverage/index.html.
+
 
-   
-If you don't have lcov or prefer text output over an
-HTML report, you can run
+
+ If you don't have lcov or prefer text output over an
+ HTML report, you can run
 
 make coverage
 
-instead of make coverage-html, which will
-produce .gcov output files for each source file
-relevant to the test.  (make coverage and make
-coverage-html will overwrite each other's files, so mixing them
-might be confusing.)
-   
+ instead of make coverage-html, which will
+ produce .gcov output files for each source file
+ relevant to the test.  (make coverage and make
+ coverage-html will overwrite each other's files, so mixing them
+ might be confusing.)
+
 
-   
-You can run several different tests before making the coverage report;
-the execution counts will accumulate.  If you want
-to reset the execution counts between test runs, run:
+
+ You can run several different tests before making the coverage report;
+ the execution counts will accumulate.  If you want
+ to reset the execution counts between test runs, run:
 
 make coverage-clean
 
-   
+
 
-   
-You can run the make coverage-html or make
-coverage command in a subdirectory if you want a coverage
-report for only a portion of the code tree.
-   
+
+ You can run the make coverage-html or make
+ coverage command in a subdirectory if you want a coverage
+ report for only a portion of the code tree.
+
 
-   
-Use make distclean to clean up when done.
-   
+
+ Use make distclean to clean up when done.
+
+   
+
+   
+Coverage with Meson
+
+ A typical workflow looks like this:
+
+meson setup -Db_coverage=true ... OTHER OPTIONS ... builddir/
+cd builddir/
+meson compile
+meson test
+ninja coverage-html
+
+ Then point your HTML browser
+ to ./meson-logs/coveragereport/index.html.
+
+
+
+ You can run several different tests before making the coverage report;
+ the execution counts will accumulate.
+
+   
   
 
 




Re: Deduplicate logicalrep_read_tuple()

2023-03-03 Thread Amit Kapila
On Fri, Mar 3, 2023 at 4:13 PM Bharath Rupireddy
 wrote:
>
> On Thu, Jan 19, 2023 at 8:36 AM Peter Smith  wrote:
> >
> > On Wed, Jan 18, 2023 at 6:26 PM Bharath Rupireddy
> >  wrote:
> > >
> > > Hi,
> > >
> > > logicalrep_read_tuple() duplicates code for LOGICALREP_COLUMN_TEXT and
> > > LOGICALREP_COLUMN_BINARY introduced by commit 9de77b5. While it
> > > doesn't hurt anyone, deduplication makes code a bit leaner by 57 bytes
> > > [1]. I've attached a patch for $SUBJECT.
> > >
> > > Thoughts?
> > >
> >
> > The code looks the same but there is a subtle comment difference where
> > previously only LOGICALREP_COLUMN_BINARY case said:
> >  /* not strictly necessary but per StringInfo practice */
> >
> > So if you de-duplicate the code then should that comment be modified to say
> > /* not strictly necessary for LOGICALREP_COLUMN_BINARY but per
> > StringInfo practice */
>
> Thanks. Done so in the attached v2.
>

LGTM. Unless Peter or someone has any comments on this, I'll push this
early next week.

-- 
With Regards,
Amit Kapila.




Re: [PoC] Improve dead tuple storage for lazy vacuum

2023-03-03 Thread John Naylor
On Wed, Mar 1, 2023 at 6:59 PM Masahiko Sawada 
wrote:
>
> On Wed, Mar 1, 2023 at 3:37 PM John Naylor 
wrote:
> >
> > I think we're trying to solve the wrong problem here. I need to study
> > this more, but it seems that code that needs to stay within a memory limit
> > only needs to track what's been allocated in chunks within a block, since
> > writing there is what invokes a page fault.
>
> Right. I guess we've discussed what we use for calculating the *used*
> memory amount but I don't remember.
>
> I think I was confused by the fact that we use some different
> approaches to calculate the amount of used memory. Parallel hash and
> tidbitmap use the allocated chunk size whereas hash_agg_check_limits()
> in nodeAgg.c uses MemoryContextMemAllocated(), which uses the
> allocated block size.

That's good to know. The latter says:

 * After adding a new group to the hash table, check whether we need to enter
 * spill mode. Allocations may happen without adding new groups (for instance,
 * if the transition state size grows), so this check is imperfect.

I'm willing to claim that vacuum can be imperfect also, given the tid
store's properties: 1) on average much more efficient in used space, and 2)
no longer bound by the 1GB limit.

> > I'm not sure how this affects progress reporting, because it would be
> > nice if it didn't report dead_tuple_bytes bigger than max_dead_tuple_bytes.
>
> Yes, the progress reporting could be confusable. Particularly, in
> shared tidstore cases, the dead_tuple_bytes could be much bigger than
> max_dead_tuple_bytes. Probably what we need might be functions for
> MemoryContext and dsa_area to get the amount of memory that has been
> allocated, by not tracking every chunk space. For example, the
> functions would be like what SlabStats() does; iterate over every
> block and calculates the total/free memory usage.

I'm not sure we need to invent new infrastructure for this. Looking at v29
in vacuumlazy.c, the order of operations for memory accounting is:

First, get the block-level space -- stop and vacuum indexes if we exceed
the limit:

/*
 * Consider if we definitely have enough space to process TIDs on page
 * already.  If we are close to overrunning the available space for
 * dead_items TIDs, pause and do a cycle of vacuuming before we tackle
 * this page.
 */
if (TidStoreIsFull(vacrel->dead_items)) --> which is basically "if
(TidStoreMemoryUsage(ts) > ts->control->max_bytes)"

Then, after pruning the current page, store the tids and then get the
block-level space again:

else if (prunestate.num_offsets > 0)
{
  /* Save details of the LP_DEAD items from the page in dead_items */
  TidStoreSetBlockOffsets(...);

  pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
   TidStoreMemoryUsage(dead_items));
}

Since the block-level measurement is likely overestimating quite a bit, I
propose to simply reverse the order of the actions here, effectively
reporting progress for the *last page* and not the current one: First
update progress with the current memory usage, then add tids for this page.
If this allocated a new block, only a small bit of that will be written to.
If this block pushes it over the limit, we will detect that up at the top
of the loop. It's kind of like our earlier attempts at a "fudge factor",
but simpler and less brittle. And, as far as OS pages we have actually
written to, I think it'll effectively respect the memory limit, at least in
the local mem case. And the numbers will make sense.

Thoughts?

But now that I'm looking more closely at the details of memory accounting,
I don't like that TidStoreMemoryUsage() is called twice per page pruned
(see above). Maybe it wouldn't noticeably slow things down, but it's a bit
sloppy. It seems like we should call it once per loop and save the result
somewhere. If that's the right way to go, that possibly indicates that
TidStoreIsFull() is not a useful interface, at least in this form.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: psql: Add role's membership options to the \du+ command

2023-03-03 Thread Pavel Luzanov

Hello,

On 22.02.2023 00:34, David G. Johnston wrote:
I didn't even know this function existed. But I see that it was 
changed in 3d14e171 with updated documentation:

https://www.postgresql.org/docs/devel/functions-info.html#FUNCTIONS-INFO-ACCESS
Maybe that's enough.


I think that should probably have ADMIN as one of the options as well. 
Also curious what it reports for an empty membership.


I've been experimenting for a few days, and I have to admit that this is
a very difficult and non-obvious topic.

I'll try to summarize what I think.

1.
About an ADMIN value for pg_has_role.
The implementation of an ADMIN value would differ from USAGE and SET.
For USAGE to be true, the full chain of memberships must have the INHERIT
option.

Similarly for SET: the full chain of memberships must have the SET option.
But for ADMIN, only the last member in the chain must have the ADMIN option,
and all previous members must have the INHERIT option (to administer
directly) or the SET option (to switch to the role that is last in the chain).

Therefore, it is not obvious to me that the function needs an ADMIN value.

2.
The pg_has_role function description starts with: "Does user have privilege
for role?"
    - This is not exact: the function works not only with users, but with
NOLOGIN roles too.
    - The term "privilege" is otherwise used for ACL columns, so this usage
may be confusing, especially after adding INHERIT and SET in addition to
the ADMIN option.

3.
It is possible to grant membership with all three options turned off:
    grant a to b with admin false, inherit false, set false;
But such a membership is completely useless (if I didn't miss something).
Maybe such grants should be prohibited; at the very least this could be
documented for the GRANT command.  (See the sketch after this list.)


4.
Since v16 it is possible to grant membership from one role to another
several times with different grantors.

And only the grantor can revoke such a membership.
    - This is not documented anywhere.
    - The current behavior of the \du command, with duplicated roles in the
"Member of" column, is quite confusing.

  Addressing this is one of the goals of the patch under discussion.

I plan to write to pgsql-docs about this, in addition to this thread.
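
A small SQL sketch of points 1 and 3 above (role names as in point 3; the
expected results follow from the option semantics described in point 1):

CREATE ROLE a;
CREATE ROLE b;
GRANT a TO b WITH ADMIN FALSE, INHERIT FALSE, SET FALSE;  -- allowed, but useless?

SELECT pg_has_role('b', 'a', 'USAGE');  -- false: INHERIT is off on the grant
SELECT pg_has_role('b', 'a', 'SET');    -- false: SET is off on the grant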

--
Pavel Luzanov
Postgres Professional: https://postgrespro.com


Re: cataloguing NOT NULL constraints

2023-03-03 Thread Alvaro Herrera
On 2023-Mar-03, Peter Eisentraut wrote:

> On 28.02.23 20:15, Alvaro Herrera wrote:
> > So I reworked this to use a new contype value for the NOT NULL
> > pg_constraint rows; I attach it here.  I think it's fairly clean.
> > 
> > 0001 is just a trivial change that seemed obvious as soon as I ran into
> > the problem.
> 
> This looks harmless enough, but I wonder what the reason for it is. What
> command can cause this error (no test case?)?  Is there ever a confusion
> about what table is in play?

Hmm, I realize now that the only reason I have this is that I had a bug
at some point: the case where it's not evident which table it is, is
when you're adding a PK to a partitioned table and one of the partitions
doesn't have the NOT NULL marking.  But if you add a PK with the patch,
the partitions are supposed to get the nullability marking
automatically; the bug is that they didn't.  So we don't need patch 0001
at all.

> > 0002 is the most interesting part.

Another thing I realized after posting, is that the constraint naming
business is mistaken.  It's currently written to work similarly to CHECK
constraints, that is: each descendent needs to have the constraint named
the same (this is so that descent works correctly when altering/dropping
the constraint afterwards).  But for NOT NULL constraints, that is not
necessary, because when descending down the hierarchy, we can just match
the constraint based on column name, since each column has at most one
NOT NULL constraint.  So the games with constraint renaming are
altogether unnecessary and can be removed from the patch.  We just need
to ensure that coninhcount/conislocal is updated appropriately.


> Where did this syntax come from:
> 
> --- a/doc/src/sgml/ref/create_table.sgml
> +++ b/doc/src/sgml/ref/create_table.sgml
> @@ -77,6 +77,7 @@ CREATE [ [ GLOBAL | LOCAL ] { TEMPORARY | TEMP } |
> UNLOGGED ] TABLE [ IF NOT EXI
> 
>  [ CONSTRAINT constraint_name ]
>  { CHECK ( expression ) [ NO INHERIT ] |
> +  NOT NULL column_name |
>    UNIQUE [ NULLS [ NOT ] DISTINCT ] ( column_name [, ... ] ) index_parameters |
>    PRIMARY KEY ( column_name [, ... ] ) index_parameters |
>    EXCLUDE [ USING index_method ] ( exclude_element
> 
> I don't see that in the standard.

Yeah, I made it up because I needed table-level constraints for some
reason that doesn't come to mind right now.

> If we need it for something, we should at least document that it's an
> extension.

OK.

> The test tables in foreign_key.sql are declared with columns like
> 
> id bigint NOT NULL PRIMARY KEY,
> 
> which is a bit weird and causes expected output diffs in your patch.  Is
> that interesting for this patch?  Otherwise I suggest dropping the NOT NULL
> from those table definitions to avoid these extra diffs.

The behavior is completely different if you drop the primary key.  If
you don't have NOT NULL, then when you drop the PK the column becomes
nullable.  If you do have a NOT NULL constraint in addition to the PK,
and drop the PK, then the column remains non-nullable.
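
A minimal sketch of that difference with the patch applied (made-up table
name; t_pkey is the default PK constraint name):

CREATE TABLE t (id bigint NOT NULL PRIMARY KEY);
ALTER TABLE t DROP CONSTRAINT t_pkey;
-- id stays non-nullable, because the explicit NOT NULL constraint remains;
-- had the column been declared only as PRIMARY KEY, it would now be nullable.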

Now, if you want to suggest that dropping the PK ought to leave the
column as NOT NULL (that is, it automatically acquires a NOT NULL
constraint), then let's discuss that.  But I understand the standard as
saying otherwise.


> > 0003:
> > Since nobody liked the idea of listing the constraints in psql \d's
> > footer, I changed \d+ so that the "not null" column shows the name of
> > the constraint if there is one, or the string "(primary key)" if the
> > attnotnull marking for the column comes from the primary key.  The new
> > column is going to be quite wide in some cases; if we want to hide it
> > further, we could add the mythical \d++ and have *that* list the
> > constraint name, keeping \d+ as current.
> 
> I think my rough preference here would be to leave the existing output style
> (column header "Nullable", content "not null") alone and display the
> constraint name somewhere in the footer optionally.

Well, there is resistance to showing the name of the constraint in the
footer also because it's too verbose.  In the end, I think a
"super-verbose" mode is the most convincing way forward.  (I think the
list of partitions in the footer of a partitioned table is a terrible
design.  Let's not repeat that.)

> In practice, the name of the constraint is rarely needed.

That is true.

> I do like the idea of mentioning primary key-ness inside the table somehow.

Maybe change the "not null" to "primary key" in the Nullable column and
nothing else.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Cómo ponemos nuestros dedos en la arcilla del otro. Eso es la amistad; jugar
al alfarero y ver qué formas se pueden sacar del otro" (C. Halloway en
La Feria de las Tinieblas, R. Bradbury)




Re: Deduplicate logicalrep_read_tuple()

2023-03-03 Thread Bharath Rupireddy
On Thu, Jan 19, 2023 at 8:36 AM Peter Smith  wrote:
>
> On Wed, Jan 18, 2023 at 6:26 PM Bharath Rupireddy
>  wrote:
> >
> > Hi,
> >
> > logicalrep_read_tuple() duplicates code for LOGICALREP_COLUMN_TEXT and
> > LOGICALREP_COLUMN_BINARY introduced by commit 9de77b5. While it
> > doesn't hurt anyone, deduplication makes code a bit leaner by 57 bytes
> > [1]. I've attached a patch for $SUBJECT.
> >
> > Thoughts?
> >
>
> The code looks the same but there is a subtle comment difference where
> previously only LOGICALREP_COLUMN_BINARY case said:
>  /* not strictly necessary but per StringInfo practice */
>
> So if you de-duplicate the code then should that comment be modified to say
> /* not strictly necessary for LOGICALREP_COLUMN_BINARY but per
> StringInfo practice */

Thanks. Done so in the attached v2.

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


v2-0001-Deduplicate-logicalrep_read_tuple.patch
Description: Binary data


Re: min/max aggregation for jsonb

2023-03-03 Thread David Rowley
On Fri, 3 Mar 2023 at 23:17, Daneel Yaitskov  wrote:
> I wanted to use min/max aggregation functions for jsonb type and noticed
> there is no functions for this type, meanwhile string/array types are 
> supported.

It's not really clear to me how you'd want these to sort. If you just
want to sort by the output that you see from the type's output
function, then you might get what you need by casting to text.
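
A minimal sketch of that workaround (note it orders by the text
representation, not by any json-aware ordering):

SELECT min(j::text), max(j::text)
FROM (VALUES ('{"a": 1}'::jsonb), ('{"b": 2}'::jsonb)) AS v(j);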

> Is there a concern about implementing support for jsonb in min/max?

I imagine a lack of any meaningful way of comparing two jsonb values
to find out which is greater than the other is of some concern.

David




Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher

2023-03-03 Thread Amit Kapila
On Fri, Mar 3, 2023 at 3:02 PM Önder Kalacı  wrote:
>>
>> 7.
>> - /* Build scankey for every attribute in the index. */
>> - for (attoff = 0; attoff <
>> IndexRelationGetNumberOfKeyAttributes(idxrel); attoff++)
>> + /* Build scankey for every non-expression attribute in the index. */
>> + for (index_attoff = 0; index_attoff <
>> IndexRelationGetNumberOfKeyAttributes(idxrel);
>> + index_attoff++)
>>   {
>>   Oid operator;
>>   Oid opfamily;
>> + Oid optype = get_opclass_input_type(opclass->values[index_attoff]);
>>   RegProcedure regop;
>> - int pkattno = attoff + 1;
>> - int mainattno = indkey->values[attoff];
>> - Oid optype = get_opclass_input_type(opclass->values[attoff]);
>> + int table_attno = indkey->values[index_attoff];
>>
>> I don't think here we need to change variable names if we retain
>> mainattno as it is instead of changing it to table_attno. The current
>> naming doesn't seem bad for the current usage of the patch.
>
>
> Hmm, I'm actually not convinced that the variable naming on HEAD is good for
> the current patch. The main difference is that now we allow indexes like:
> CREATE INDEX idx ON table(foo(col), col_2)
>
> (See # Testcase start: SUBSCRIPTION CAN USE INDEXES WITH
> EXPRESSIONS AND COLUMNS)
>
> In such indexes, we could skip the attributes of the index. So, skey_attoff 
> is not
> equal to  index_attoff anymore. So, calling out this explicitly via the 
> variable names
> seems more robust to me. Plus, mainattno sounded vague to me when I first read
> this function.
>

Yeah, I understand this part. By looking at the diff, it appeared to
me that this was an unnecessary change. Anyway, I see your point, so
if you want to keep the naming as you proposed, at least don't change
the ordering of the get_opclass_input_type() call, because that looks
odd to me.
>
> Attached v29 for this review. Note that I'll be working on the disable index 
> scan changes after
>

Okay, thanks!

-- 
With Regards,
Amit Kapila.




Re: Missing update of all_hasnulls in BRIN opclasses

2023-03-03 Thread Alvaro Herrera

Thanks for doing all this.  (Do I understand correctly that this patch
is not in the commitfest?)

I think my mental model for this was that "allnulls" meant that either
there are no values for the column in question or that the values were
all nulls (For minmax without NULL handling, which is where this all
started, these two things are essentially the same: the range is not to
be returned.  So this became a bug the instant I added handling for NULL
values.)  I failed to realize that these were two different things, and
this is likely the origin of all these troubles.

What do you think of using the unused bit in BrinTuple->bt_info to
denote a range that contains no heap tuples?  This also means we need it
in BrinMemTuple, I think we can do this:

@@ -44,6 +44,7 @@ typedef struct BrinValues
 typedef struct BrinMemTuple
 {
boolbt_placeholder; /* this is a placeholder tuple */
+   boolbt_empty_range; /* range has no tuples */
BlockNumber bt_blkno;   /* heap blkno that the tuple is for */
MemoryContext bt_context;   /* memcxt holding the bt_columns values 
*/
/* output arrays for brin_deform_tuple: */
@@ -69,7 +70,7 @@ typedef struct BrinTuple
 *
 * 7th (high) bit: has nulls
 * 6th bit: is placeholder tuple
-* 5th bit: unused
+* 5th bit: range has no tuples
 * 4-0 bit: offset of data
 * ---
 */
@@ -82,7 +83,7 @@ typedef struct BrinTuple
  * bt_info manipulation macros
  */
 #define BRIN_OFFSET_MASK   0x1F
-/* bit 0x20 is not used at present */
+#define BRIN_EMPTY_RANGE   0x20
 #define BRIN_PLACEHOLDER_MASK  0x40
 #define BRIN_NULLS_MASK0x80
 
(Note that bt_empty_range uses a hole in the struct, so there's no ABI
change.)

This is BRIN-tuple-level, not column-level, so conceptually it seems
more appropriate.  (In the case where both are empty in union_tuples, we
can return without entering the per-attribute loop at all, though I
admit it's not a very interesting case.)  This approach avoids having to
invent the strange combination of all+has to mean empty.


On 2023-Feb-24, Tomas Vondra wrote:

> I wonder what's the best
> way to test this in an automated way - it's very dependent on timing of
> the concurrent updated. For example we need to do something like this:
> 
> T1: run pg_summarize_range() until it inserts the placeholder tuple
> T2: do an insert into the page range (updates placeholder)
> T1: continue pg_summarize_range() to merge into the placeholder
> 
> But there are no convenient ways to do this, I think. I had to check the
> various cases using breakpoints in gdb etc.

Yeah, I struggled with this during initial development but in the end
did nothing.  I think we would need to introduce some new framework,
perhaps Korotkov stop-events stuff at 
https://postgr.es/m/CAPpHfdsTeb+hBT5=qxghjng_chcjldanq9sdy9vnwbf2e2p...@mail.gmail.com
which seemed to me a good fit -- we would add a stop point after the
placeholder tuple is inserted.

> I'm not very happy with the union_tuples() changes - it's quite verbose,
> perhaps a bit too verbose. We have to check for empty ranges first, and
> then various combinations of allnulls/hasnulls flags for both BRIN
> tuples. There are 9 combinations, and the current code just checks them
> one by one - I was getting repeatedly confused by the original code, but
> maybe it's too much.

I think it's okay.  I tried to make it more compact (by saying "these
two combinations here are case 2, and these two other are case 4", and
keeping each of the other combinations a separate case; so there are
really 7 cases).  But that doesn't make it any easier to follow, on the
contrary it was more convoluted.  I think a dozen extra lines of source
is not a problem.

> The alternative is to apply the same fix to every BRIN_PROCNUM_UNION
> opclass procedure out there. I guess doing that for minmax+inclusion is
> not a huge deal, but what about external opclasses? And without the fix
> the indexes are effectively broken. Fixing this outside in brin.c (in
> the union procedure) fixes this for every opclass procedure, without any
> actual limitation of functinality (14+ does that anyway).

About the hypothetical question, you could just as well ask about
unicorns.  I have never seen any hint that any external opclass exists.
I am all for maintaining compatibility, but I think this concern is
overblown for BRIN.  Anyway, I think your proposed fix is better than
changing individual 'union' support procs, so it doesn't matter.

As far as I understood, you're now worried that there will be an
incompatibility because we will fail to call the 'union' procedure in
cases where we previously called it?  In other words, you fear that some
hypothetical opclass was handling the NULL values in some way that's
incompatible with this?  I haven't thought terribly hard about this, but
I can't see a 

min/max aggregation for jsonb

2023-03-03 Thread Daneel Yaitskov
Hi,

I wanted to use the min/max aggregation functions for the jsonb type and
noticed there are no such functions for this type, while string/array types
are supported.
Is there a concern about implementing support for jsonb in min/max?

jsonb is a byte array.
json faces the same limitations.


-- 

Best regards,
Daniil Iaitskov


Re: cataloguing NOT NULL constraints

2023-03-03 Thread Peter Eisentraut

On 28.02.23 20:15, Alvaro Herrera wrote:

So I reworked this to use a new contype value for the NOT NULL
pg_constraint rows; I attach it here.  I think it's fairly clean.

0001 is just a trivial change that seemed obvious as soon as I ran into
the problem.


This looks harmless enough, but I wonder what the reason for it is. 
What command can cause this error (no test case?)?  Is there ever a 
confusion about what table is in play?



0002 is the most interesting part.


Where did this syntax come from:

--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -77,6 +77,7 @@ CREATE [ [ GLOBAL | LOCAL ] { TEMPORARY | TEMP } | 
UNLOGGED ] TABLE [ IF NOT EXI


 [ CONSTRAINT constraint_name ]
 { CHECK ( expression ) [ NO INHERIT ] |
+  NOT NULL column_name |
   UNIQUE [ NULLS [ NOT ] DISTINCT ] ( column_name [, ... ] ) index_parameters |
   PRIMARY KEY ( column_name [, ... ] ) index_parameters |
   EXCLUDE [ USING index_method ] ( exclude_element


I don't see that in the standard.

If we need it for something, we should at least document that it's an 
extension.


The test tables in foreign_key.sql are declared with columns like

id bigint NOT NULL PRIMARY KEY,

which is a bit weird and causes expected output diffs in your patch.  Is 
that interesting for this patch?  Otherwise I suggest dropping the NOT 
NULL from those table definitions to avoid these extra diffs.



0003:
Since nobody liked the idea of listing the constraints in psql \d's
footer, I changed \d+ so that the "not null" column shows the name of
the constraint if there is one, or the string "(primary key)" if the
attnotnull marking for the column comes from the primary key.  The new
column is going to be quite wide in some cases; if we want to hide it
further, we could add the mythical \d++ and have *that* list the
constraint name, keeping \d+ as current.


I think my rough preference here would be to leave the existing output 
style (column header "Nullable", content "not null") alone and display 
the constraint name somewhere in the footer optionally.  In practice, 
the name of the constraint is rarely needed.


I do like the idea of mentioning primary key-ness inside the table somehow.

As you wrote elsewhere, we can leave this patch alone for now.





Re: meson: Non-feature feature options

2023-03-03 Thread Nazir Bilal Yavuz
Hi,

On Fri, 3 Mar 2023 at 12:16, Peter Eisentraut
 wrote:
>
> On 02.03.23 11:41, Nazir Bilal Yavuz wrote:
> > I am kind of confused. I added these checks for considering other SSL
> > implementations in the future, for this reason I have two nested if
> > checks. The top one is for checking if we need to search an SSL
> > library and the nested one is for checking if we need to search this
> > specific SSL library. What do you think?
>
> I suppose that depends on how you envision integrating other SSL
> libraries into this logic.  It's not that important right now; if the
> structure makes sense to you, that's fine.
>
> Please send an updated patch with the small changes that have been
> mentioned.
>

The updated patch is attached.

Regards,
Nazir Bilal Yavuz
Microsoft
From 9cb9d50ba008e2a385a7b72219a759490a3de00e Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Fri, 3 Mar 2023 12:24:46 +0300
Subject: [PATCH v3] meson: Refactor SSL option

---
 .cirrus.yml  |   7 +-
 meson.build  | 121 ++-
 meson_options.txt|   4 +-
 src/interfaces/libpq/meson.build |   2 +-
 src/makefiles/meson.build|   2 +-
 src/test/ssl/meson.build |   2 +-
 6 files changed, 80 insertions(+), 58 deletions(-)

diff --git a/.cirrus.yml b/.cirrus.yml
index f2129787529..aaf4066366c 100644
--- a/.cirrus.yml
+++ b/.cirrus.yml
@@ -181,7 +181,7 @@ task:
 su postgres <<-EOF
   meson setup \
 --buildtype=debug \
--Dcassert=true -Dssl=openssl -Duuid=bsd -Dtcl_version=tcl86 -Ddtrace=auto \
+-Dcassert=true -Duuid=bsd -Dtcl_version=tcl86 -Ddtrace=auto \
 -DPG_TEST_EXTRA="$PG_TEST_EXTRA" \
 -Dextra_lib_dirs=/usr/local/lib -Dextra_include_dirs=/usr/local/include/ \
 build
@@ -243,7 +243,6 @@ LINUX_CONFIGURE_FEATURES: _CONFIGURE_FEATURES >-
 
 LINUX_MESON_FEATURES: _MESON_FEATURES >-
   -Dllvm=enabled
-  -Dssl=openssl
   -Duuid=e2fs
 
 
@@ -497,7 +496,7 @@ task:
   -Dextra_include_dirs=${brewpath}/include \
   -Dextra_lib_dirs=${brewpath}/lib \
   -Dcassert=true \
-  -Dssl=openssl -Duuid=e2fs -Ddtrace=auto \
+  -Duuid=e2fs -Ddtrace=auto \
   -Dsegsize_blocks=6 \
   -DPG_TEST_EXTRA="$PG_TEST_EXTRA" \
   build
@@ -568,7 +567,7 @@ task:
   # Use /DEBUG:FASTLINK to avoid high memory usage during linking
   configure_script: |
 vcvarsall x64
-meson setup --backend ninja --buildtype debug -Dc_link_args=/DEBUG:FASTLINK -Dcassert=true -Db_pch=true -Dssl=openssl -Dextra_lib_dirs=c:\openssl\1.1\lib -Dextra_include_dirs=c:\openssl\1.1\include -DTAR=%TAR% -DPG_TEST_EXTRA="%PG_TEST_EXTRA%" build
+meson setup --backend ninja --buildtype debug -Dc_link_args=/DEBUG:FASTLINK -Dcassert=true -Db_pch=true -Dextra_lib_dirs=c:\openssl\1.1\lib -Dextra_include_dirs=c:\openssl\1.1\include -DTAR=%TAR% -DPG_TEST_EXTRA="%PG_TEST_EXTRA%" build
 
   build_script: |
 vcvarsall x64
diff --git a/meson.build b/meson.build
index 26be83afb61..1e9411eb247 100644
--- a/meson.build
+++ b/meson.build
@@ -43,6 +43,7 @@ cc = meson.get_compiler('c')
 
 not_found_dep = dependency('', required: false)
 thread_dep = dependency('threads')
+auto_features = get_option('auto_features')
 
 
 
@@ -1171,7 +1172,16 @@ cdata.set('USE_SYSTEMD', systemd.found() ? 1 : false)
 # Library: SSL
 ###
 
-if get_option('ssl') == 'openssl'
+ssl = not_found_dep
+ssl_library = 'none'
+sslopt = get_option('ssl')
+
+if (sslopt == 'auto' and auto_features.disabled())
+  sslopt = 'none'
+endif
+
+if sslopt in ['auto', 'openssl']
+  openssl_required = sslopt == 'openssl'
 
   # Try to find openssl via pkg-config et al, if that doesn't work
   # (e.g. because it's provided as part of the OS, like on FreeBSD), look for
@@ -1192,59 +1202,72 @@ if get_option('ssl') == 'openssl'
 ssl_int = [ssl_lib, crypto_lib]
 
 ssl = declare_dependency(dependencies: ssl_int,
- include_directories: postgres_inc)
-  else
-cc.has_header('openssl/ssl.h', args: test_c_args, dependencies: ssl, required: true)
-cc.has_header('openssl/err.h', args: test_c_args, dependencies: ssl, required: true)
-
-ssl_int = [ssl]
+include_directories: postgres_inc)
+  elif cc.has_header('openssl/ssl.h', args: test_c_args, dependencies: ssl, required: openssl_required) and \
+cc.has_header('openssl/err.h', args: test_c_args, dependencies: ssl, required: openssl_required)
+  ssl_int = [ssl]
   endif
 
-  check_funcs = [
-['CRYPTO_new_ex_data', {'required': true}],
-['SSL_new', {'required': true}],
-
-# Function introduced in OpenSSL 1.0.2.
-['X509_get_signature_nid'],
-
-# Functions introduced in OpenSSL 1.1.0. We used to check for
-# OPENSSL_VERSION_NUMBER, but that didn't work with 1.1.0, because LibreSSL
-# defines OPENSSL_VERSION_NUMBER to claim version 2.0.0, even though it
-# 

Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher

2023-03-03 Thread Önder Kalacı
Hi Amit, all


> Few comments on 0001
> 
> 1.
> +   such suitable indexes, the search on the subscriber s ide can be
> very inefficient,
>
> unnecessary space in 'side'
>

Fixed in v28


>
> 2.
> -   identity.  If the table does not have any suitable key, then it can be
> set
> -   to replica identity full, which means the entire row
> becomes
> -   the key.  This, however, is very inefficient and should only be used
> as a
> -   fallback if no other solution is possible.  If a replica identity other
> +   identity. When replica identity FULL is specified,
> +   indexes can be used on the subscriber side for searching the rows.
>
> I think it is better to retain the first sentence (If the table does
> not ... entire row becomes the key.) as that says what will be part of
> the key.
>
>
right, that sentence looks useful, added back.


> 3.
> -   comprising the same or fewer columns must also be set on the subscriber
> -   side.  See  for
> details on
> -   how to set the replica identity.  If a table without a replica
> identity is
> -   added to a publication that replicates UPDATE
> +   comprising the same or fewer columns must also be set on the
> subscriber side.
> +   See  for
> +   details on how to set the replica identity.  If a table without a
> replica
> +   identity is added to a publication that replicates
> UPDATE
>
> I don't see any change in this except line length. If so, I don't
> think we should change it as part of this patch.
>
>
Yes, fixed. But the first line (starting with See ...)

> 4.
>  /*
>   * Setup a ScanKey for a search in the relation 'rel' for a tuple 'key'
> that
>   * is setup to match 'rel' (*NOT* idxrel!).
>   *
> - * Returns whether any column contains NULLs.
> + * Returns how many columns should be used for the index scan.
> + *
> + * This is not generic routine, it expects the idxrel to be
> + * a btree, non-partial and have at least one column
> + * reference (e.g., should not consist of only expressions).
>   *
> - * This is not generic routine, it expects the idxrel to be replication
> - * identity of a rel and meet all limitations associated with that.
> + * By definition, replication identity of a rel meets all
> + * limitations associated with that. Note that any other
> + * index could also meet these limitations.
>
> The comment changes look quite asymmetric to me. Normally, we break
> the line if the line length goes beyond 80 cols. Please check and
> change other places in the patch if they have a similar symptom.
>

Went over the patch and expanded each line to ~80 chars.

I'm guessing 80 is not a hard limit; in some cases I went slightly over, to 81-82.


>
> 5.
> + * There are no fundamental problems for supporting non-btree
> + * and/or partial indexes.
>
> Can we mention partial indexes in the above comment? It seems to me
> that because the required tuple may not satisfy the expression (in the
> case of partial indexes) it may not be easy to support it.
>

Expanded the comment and explained the differences a little further.


>
> 6.
> build_replindex_scan_key()
> {
> ...
> + for (index_attoff = 0; index_attoff <
> IndexRelationGetNumberOfKeyAttributes(idxrel);
> + index_attoff++)
> ...
> ...
> +#ifdef USE_ASSERT_CHECKING
> + IndexInfo  *indexInfo = BuildIndexInfo(idxrel);
> +
> + Assert(!IsIndexOnlyOnExpression(indexInfo));
> +#endif
> ...
> }
>
> We can avoid building index info multiple times. This can be either
> checked at the beginning of the function outside attribute offset loop
> or we can probably cache it. I understand this is for assert builds
> but seems easy to avoid it doing multiple times and it also looks odd
> to do it multiple times for the same index.
>

Applied your suggestions. Although I do not have strong opinions, I think it was
easier to follow when the indexInfo was built for each iteration.
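
For reference, the rearrangement is roughly the following (just a sketch,
reusing the BuildIndexInfo()/IsIndexOnlyOnExpression() calls already visible
in the quoted hunk), hoisted above the scankey loop:

#ifdef USE_ASSERT_CHECKING
	/* built once, before looping over the index attributes */
	IndexInfo  *indexInfo = BuildIndexInfo(idxrel);

	Assert(!IsIndexOnlyOnExpression(indexInfo));
#endif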


>
> 7.
> - /* Build scankey for every attribute in the index. */
> - for (attoff = 0; attoff <
> IndexRelationGetNumberOfKeyAttributes(idxrel); attoff++)
> + /* Build scankey for every non-expression attribute in the index. */
> + for (index_attoff = 0; index_attoff <
> IndexRelationGetNumberOfKeyAttributes(idxrel);
> + index_attoff++)
>   {
>   Oid operator;
>   Oid opfamily;
> + Oid optype = get_opclass_input_type(opclass->values[index_attoff]);
>   RegProcedure regop;
> - int pkattno = attoff + 1;
> - int mainattno = indkey->values[attoff];
> - Oid optype = get_opclass_input_type(opclass->values[attoff]);
> + int table_attno = indkey->values[index_attoff];
>
> I don't think here we need to change variable names if we retain
> mainattno as it is instead of changing it to table_attno. The current
> naming doesn't seem bad for the current usage of the patch.
>

Hmm, I'm actually not convinced that the variable naming on HEAD is good for
the current patch. The main difference is that now we allow indexes like:
   CREATE INDEX idx ON table(foo(col), col_2)

(See # Testcase start: SUBSCRIPTION CAN USE INDEXES WITH
EXPRESSIONS AND 

Re: meson: Non-feature feature options

2023-03-03 Thread Peter Eisentraut

On 02.03.23 11:41, Nazir Bilal Yavuz wrote:

I am kind of confused. I added these checks for considering other SSL
implementations in the future, for this reason I have two nested if
checks. The top one is for checking if we need to search an SSL
library and the nested one is for checking if we need to search this
specific SSL library. What do you think?


I suppose that depends on how you envision integrating other SSL 
libraries into this logic.  It's not that important right now; if the 
structure makes sense to you, that's fine.


Please send an updated patch with the small changes that have been 
mentioned.






Re: Add documentation for coverage reports with meson

2023-03-03 Thread Peter Eisentraut

On 28.02.23 09:49, Michael Paquier wrote:

-when compiling with GCC, and it requires the gcov
-and lcov programs.
+when compiling with GCC, and it requires the gcov,
+lcov and genhtml programs.


genhtml is part of the lcov package.  I think it would be confusing to 
mention it explicitly, since you won't be able to find it as something 
to install.  Maybe leave the original list and change "programs" to 
"packages"?



-   
-A typical workflow looks like this:
+   
+Coverage with configure
+
+ A typical workflow looks like this:


In the installation chapter we use titles like "Building and 
Installation with Autoconf and Make" and "Building and Installation with 
Meson".  We should use analogous wordings here.



+
+ A typical workflow looks like this:
+
+meson setup -Db_coverage=true ... OTHER OPTIONS ...
+ninja
+meson test
+ninja coverage-html
+
+ Then point your HTML browser
+ to ./meson-logs/coveragereport/index.html.
+


This ignores which directory you have to be in.  The meson calls have to 
be at the top level, the ninja calls have to be in the build directory. 
We should be more precise here, otherwise someone trying this will find 
that it doesn't work.


Personally I use "meson compile" instead of "ninja"; I'm not sure what 
the best recommendation is, but at least that way all the initial 
commands are "meson something" instead of going back and forth.






Re: In-place persistance change of a relation

2023-03-03 Thread Kyotaro Horiguchi
At Wed, 1 Mar 2023 14:56:25 -0500, "Gregory Stark (as CFM)" 
 wrote in 
> On Mon, 6 Feb 2023 at 23:48, Kyotaro Horiguchi  
> wrote:
> >
> > Thank you for the comment!
> >
> > At Fri, 3 Feb 2023 08:42:52 +0100, Heikki Linnakangas  
> > wrote in
> > > I want to call out this part of this patch:
> 
> Looks like this patch has received some solid feedback from Heikki and
> you have a path forward. It's not currently building in the build farm
> either.
> 
> I'll set the patch to Waiting on Author for now.

Concretely, the patch set consists of three parts. The attached patch is the first part -
the storage mark files, which are used to identify storage files that
have not been committed and should be removed during the next
startup. This feature resolves the issue of orphaned storage files
that may result from a crash occurring during the execution of a
transaction involving the creation of a new table.

I'll post all of the three parts shortly.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 1665e3428b9d777989864ea302eef8368a739e7e Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Thu, 2 Mar 2023 17:25:12 +0900
Subject: [PATCH v26] Storage mark files

In certain situations, specific operations followed by a crash-restart
can result in orphaned storage files.  These files cannot be removed
through standard methods.  To address this issue, this commit
implements 'mark files' that convey information about the storage
files. Specifically, the "UNCOMMITED" mark file is introduced to denote
files that have not been committed and should be removed during the
next startup.
---
 src/backend/access/rmgrdesc/smgrdesc.c|  37 +++
 src/backend/access/transam/README |  10 +
 src/backend/access/transam/xact.c |   7 +
 src/backend/access/transam/xlogrecovery.c |  18 ++
 src/backend/backup/basebackup.c   |   9 +-
 src/backend/catalog/storage.c | 270 ++-
 src/backend/storage/file/fd.c |   4 +-
 src/backend/storage/file/reinit.c | 313 +++---
 src/backend/storage/smgr/md.c |  95 ++-
 src/backend/storage/smgr/smgr.c   |  32 +++
 src/backend/storage/sync/sync.c   |  21 +-
 src/bin/pg_rewind/parsexlog.c |  16 ++
 src/common/relpath.c  |  47 ++--
 src/include/catalog/storage.h |   3 +
 src/include/catalog/storage_xlog.h|  35 ++-
 src/include/common/relpath.h  |   9 +-
 src/include/storage/fd.h  |   1 +
 src/include/storage/md.h  |   8 +-
 src/include/storage/reinit.h  |   8 +-
 src/include/storage/smgr.h|  17 ++
 src/test/recovery/t/013_crash_restart.pl  |  21 ++
 src/tools/pgindent/typedefs.list  |   6 +
 22 files changed, 843 insertions(+), 144 deletions(-)

diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c
index bd841b96e8..f8187385c4 100644
--- a/src/backend/access/rmgrdesc/smgrdesc.c
+++ b/src/backend/access/rmgrdesc/smgrdesc.c
@@ -40,6 +40,37 @@ smgr_desc(StringInfo buf, XLogReaderState *record)
 		 xlrec->blkno, xlrec->flags);
 		pfree(path);
 	}
+	else if (info == XLOG_SMGR_UNLINK)
+	{
+		xl_smgr_unlink *xlrec = (xl_smgr_unlink *) rec;
+		char	   *path = relpathperm(xlrec->rlocator, xlrec->forkNum);
+
+		appendStringInfoString(buf, path);
+		pfree(path);
+	}
+	else if (info == XLOG_SMGR_MARK)
+	{
+		xl_smgr_mark *xlrec = (xl_smgr_mark *) rec;
+		char	   *path = GetRelationPath(xlrec->rlocator.dbOid,
+		   xlrec->rlocator.spcOid,
+		   xlrec->rlocator.relNumber,
+		   InvalidBackendId,
+		   xlrec->forkNum, xlrec->mark);
+		char	   *action = "";
+
+		switch (xlrec->action)
+		{
+			case XLOG_SMGR_MARK_CREATE:
+action = "CREATE";
+break;
+			case XLOG_SMGR_MARK_UNLINK:
+action = "DELETE";
+break;
+		}
+
+		appendStringInfo(buf, "%s %s", action, path);
+		pfree(path);
+	}
 }
 
 const char *
@@ -55,6 +86,12 @@ smgr_identify(uint8 info)
 		case XLOG_SMGR_TRUNCATE:
 			id = "TRUNCATE";
 			break;
+		case XLOG_SMGR_UNLINK:
+			id = "UNLINK";
+			break;
+		case XLOG_SMGR_MARK:
+			id = "MARK";
+			break;
 	}
 
 	return id;
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 22c8ae9755..bf83d19abd 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -741,6 +741,16 @@ we must panic and abort recovery.  The DBA will have to manually clean up
 then restart recovery.  This is part of the reason for not writing a WAL
 entry until we've successfully done the original action.
 
+
+Smgr MARK files
+
+
+An smgr mark file is an empty file that is created alongside a new
+relation storage file to signal that the storage file must be cleaned
+up during recovery.  In contrast to the four actions above, failing to
+remove 

RE: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher

2023-03-03 Thread Hayato Kuroda (Fujitsu)
Dear Önder,

Thanks for updating the patch!
I played with your patch and I confirmed that the parallel apply worker and the
tablesync worker could pick the index in typical cases.

The following are comments for v27-0001. Please ignore them if they are already fixed in a newer version.

execReplication.c

```
+   /* Build scankey for every non-expression attribute in the index. */
```

I think that single-line comments should not be terminated by ".".

```
+   /* There should always be at least one attribute for the index scan. */
```

Same as above.


```
+#ifdef USE_ASSERT_CHECKING
+   IndexInfo  *indexInfo = BuildIndexInfo(idxrel);
+
+   Assert(!IsIndexOnlyOnExpression(indexInfo));
+#endif
```

I may misunderstand, but the condition for a usable index has already been checked
when the oid was set; anyway, you confirmed the condition again before
really using it, right?
So is it OK not to check the other assumptions, i.e. that the index is a btree,
non-partial, and has at least one plain column reference?

IIUC we can do that by adding a new function like IsIndexUsableForReplicaIdentityFull()
that checks these conditions, and then calling it in RelationFindReplTupleByIndex() if
idxIsRelationIdentityOrPK is false.
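
A minimal sketch of such a helper (illustrative only; it assumes the existing
IndexInfo fields ii_Am and ii_Predicate, plus the IsIndexOnlyOnExpression()
routine added by the patch) could look like:

```
static bool
IsIndexUsableForReplicaIdentityFull(IndexInfo *indexInfo)
{
	/* must be a btree index */
	if (indexInfo->ii_Am != BTREE_AM_OID)
		return false;

	/* must not be a partial index */
	if (indexInfo->ii_Predicate != NIL)
		return false;

	/* must reference at least one plain column */
	if (IsIndexOnlyOnExpression(indexInfo))
		return false;

	return true;
}
```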

032_subscribe_use_index.pl

```
+# Testcase start: SUBSCRIPTION CREATE/DROP INDEX WORKS WITHOUT ISSUES
...
+# Testcase end: SUBSCRIPTION RE-CALCULATES INDEX AFTER CREATE/DROP INDEX
```

There is still an inconsistent case here; the testcase start and end comments do not match.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



Re: Simplify standby state machine a bit in WaitForWALToBecomeAvailable()

2023-03-03 Thread Bharath Rupireddy
On Wed, Mar 1, 2023 at 1:46 PM Michael Paquier  wrote:
>
> On Tue, Jan 03, 2023 at 02:53:10PM +0530, Bharath Rupireddy wrote:
> > The proposed patch makes the inherent state change to pg_wal after
> > failure to read from archive in XLogFileReadAnyTLI() to explicit by
> > setting currentSource to XLOG_FROM_PG_WAL in the state machine. I
> > think it doesn't alter the existing state machine or add any new extra
> > lookups in pg_wal.
>
> Well, did you notice 4d894b41?  It introduced this change:
>
> -   readFile = XLogFileReadAnyTLI(readSegNo, DEBUG2, 
> currentSource);
> +   readFile = XLogFileReadAnyTLI(readSegNo, DEBUG2,
> +   currentSource == XLOG_FROM_ARCHIVE ? XLOG_FROM_ANY :
> +currentSource);
>
> And this patch basically undoes that, meaning that we would basically
> look at the archives first for all the expected TLIs, but only if no
> files were found in pg_wal/.
>
> The change is subtle, see XLogFileReadAnyTLI().  On HEAD we go through
> each timeline listed and check both archives and then pg_wal/ after
> the last source that failed was the archives.  The patch does
> something different: it checks all the timelines for the archives,
> then all the timelines in pg_wal/ with two separate calls to
> XLogFileReadAnyTLI().

Thanks. Yeah, the patch proposed here just reverts that commit [1]
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=4d894b41cd12179b710526eba9dc62c2b99abc4d.

That commit fixes an issue - "If there is a WAL segment with same ID
but different TLI present in both the WAL archive and pg_xlog, prefer
the one with higher TLI.".
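
To spell out the difference in polling order (illustrative pseudocode only,
not the real XLogFileReadAnyTLI(); try_archive()/try_pg_wal() are made-up
stand-ins for the per-source, per-timeline segment lookup):

    /* HEAD: both sources are tried per timeline before moving on */
    foreach(cell, expectedTLEs)
    {
        if (try_archive(tli) || try_pg_wal(tli))
            return fd;
    }

    /* proposed patch: all timelines from the archive, then all from pg_wal */
    foreach(cell, expectedTLEs)
        if (try_archive(tli))
            return fd;
    foreach(cell, expectedTLEs)
        if (try_pg_wal(tli))
            return fd;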

I will withdraw the patch proposed in this thread.

[1]
commit 4d894b41cd12179b710526eba9dc62c2b99abc4d
Author: Heikki Linnakangas 
Date:   Fri Feb 14 15:15:09 2014 +0200

Change the order that pg_xlog and WAL archive are polled for WAL segments.

If there is a WAL segment with same ID but different TLI present in both
the WAL archive and pg_xlog, prefer the one with higher TLI. Before this
patch, the archive was polled first, for all expected TLIs, and only if no
file was found was pg_xlog scanned. This was a change in behavior from 9.3,
which first scanned archive and pg_xlog for the highest TLI, then archive
and pg_xlog for the next highest TLI and so forth. This patch reverts the
behavior back to what it was in 9.2.

The reason for this is that if for example you try to do archive recovery
to timeline 2, which branched off timeline 1, but the WAL for timeline 2 is
not archived yet, we would replay past the timeline switch point on
timeline 1 using the archived files, before even looking timeline 2's files
in pg_xlog

Report and patch by Kyotaro Horiguchi. Backpatch to 9.3 where the behavior
was changed.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Flush SLRU counters in checkpointer process

2023-03-03 Thread Anthonin Bonnefoy
Here's the patch rebased with Andres' suggestions.
Happy to update it if there's any additional change required.


On Wed, Mar 1, 2023 at 8:46 PM Gregory Stark (as CFM) 
wrote:

> On Thu, 12 Jan 2023 at 03:46, Anthonin Bonnefoy
>  wrote:
> >
> >
> > That would make sense. I've created a new patch with everything moved in
> pgstat_report_checkpointer().
> > I did split the checkpointer flush in a pgstat_flush_checkpointer()
> function as it seemed more readable. Thought?
>
> This patch appears to need a rebase. Is there really any feedback
> needed or is it ready for committer once it's rebased?
>
>
>
> --
> Gregory Stark
> As Commitfest Manager
>


flush-slru-counters-v3.patch
Description: Binary data