Re: BackgroundPsql's set_query_timer_restart() may not work

2023-11-28 Thread Bharath Rupireddy
On Tue, Nov 28, 2023 at 12:23 PM Tom Lane  wrote:
>
> Bharath Rupireddy  writes:
> > A nitpick on the patch - how about honoring the passed-in parameter
> > with something like "$self->{query_timer_restart} = 1 if !defined
> > $self->{query_timer_restart};" instead of just setting it to 1 (a value
> > other than undef) with "$self->{query_timer_restart} = 1;"?
>
> I wondered about that too, but the evidence of existing callers is
> that nobody cares.  If we did make the code do something like that,
> (a) I don't think your fragment is right, and (b) we'd need to rewrite
> the function's comment to explain it.  I'm not seeing a reason to
> think it's worth spending effort on.

Hm. I don't mind doing just the $self->{query_timer_restart} = 1; like
in Sawada-san's patch.

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: [HACKERS] Changing references of password encryption to hashing

2023-11-28 Thread Stephen Frost
Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Tue, Nov 28, 2023 at 12:24 PM Stephen Frost  wrote:
> > I don’t know what they’re doing now, as you don’t say, and so I really 
> > couldn’t say if ldap is better or worse for them. In some cases, sure, 
> > perhaps ldap is better than … something else,
> 
> That's EXACTLY right. You can't say whether LDAP is better or worse in
> every scenario. And therefore you should not be proposing to remove
> it.

I had been hoping you might shed some light on just what use cases you
were referring to so that we could have a constructive discussion about
whether ldap is actually a reasonable solution.  I even explicitly pointed
out that there may still exist some environments, such as those with only
OpenLDAP and no Kerberos, that could be a valid reason to keep ldap, but I
haven't heard of any such environment in quite a long time.

Until fairly recently, an argument could be made that we needed to keep
ldap around because pgAdmin didn't support Kerberos, but that's been
addressed for a couple of years now.

At some point, once there aren't any good use-cases for it to be kept,
we should remove it or at least deprecate and discourage its use, though
as I suspect everyone already knows, I really don't like deprecating
things because it means we just end up keeping them around forever.

As I tried to get at in my prior response, we should be encouraging
the use of secure options when they're available and that's something
which I actively do, on these lists and in working with clients.  That I
continue to find cases of perfectly good AD deployments which could be
using Kerberos but instead are using ldap is frustrating.

What I don't understand is this push-back against even talking about the
issues with ldap, or against acknowledging that passing around cleartext
passwords is a security issue (modern password-based authentication
methods, like SCRAM, don't even do this anymore, because it's not a good idea).

If my opinion that we should remove it is that offensive, then fine,
let's drop that part of the discussion and move on to the question of:
why are people still using ldap?

I'm pretty sure the answer is that far too often people think that's
just how you integrate PG into an AD environment - and they think it's
either the only option or sometimes believe it's a secure option.  We've
had people claim on these lists that using ldap doesn't pass around
cleartext passwords, which I suspect means that maybe they thought our
'ldap' auth was actually somehow doing Kerberos under the hood (I really
wish it was, but sadly that's obviously not the case).

For some users, the issue is that our Kerberos/pg_ident/et al system
isn't as good as it could be in an AD environment, which is one of the
things that I was trying to allude to in my prior reply.  In particular,
we don't currently have support for working with AD's ldap for what it
was actually meant for, which is using it to query for information like
group membership.  There had been some work on this not too long ago but
sadly I haven't seen recent activity around that.  What I've heard
requested are things like having an option to say "users in group X are
allowed to log in to role Y", or "users in group X are allowed to log
into database Z".  Would also be great to have built-in support for
syncing from ldap the list of users which should exist in PG
(pg_ldap_sync does an alright job of this today but we could certainly
do better by having it built-in, and ideally with a background worker
that connects to ldap, does an initial sync, and then listens for
changes, instead of having to have a frequent cronjob doing the sync).

> >> I think that is, to borrow a phrase from Tom, arrant nonsense. Sure,
> >> MD5 authentication has a pass-the-hash vulnerability, and that sucks.
> >
> > So, given that we all agree with the CVE-worthy issue that exists with 
> > every release where we include md5 auth, we should be applying for a CVE
> > prior to each release, no?
> 
> You know, I said in my previous email that you were so sure that you
> were right that there was no point in trying to have a serious
> discussion, and I can't really see how you could have proved that
> point more thoroughly than you did here. You twisted my words around
> to make it seem like I was agreeing with your point when you know full
> well that I was doing the exact opposite of that.

I quoted the part where you agreed that md5 has a known vulnerability to
point out that, given that, we should be making our users aware of it
through the normal process that we use for such things.  I wasn't claiming
that you agreed that we should remove md5, unless you're referring to some
other part of my response which you didn't quote, in which case it'd be
helpful if you could point out which part you meant so that I can try to
clarify.

I understand that you still don't want to remove md5 or ldap and I don't
think anyone following this discussion would be confused in 

Re: autovectorize page checksum code included elsewhere

2023-11-28 Thread John Naylor
On Tue, Nov 28, 2023 at 7:51 AM Andres Freund  wrote:
>
> Hi,
>
> On 2023-11-25 14:09:14 +0700, John Naylor wrote:
> > * Note: I have seen the threads with the idea of compiling multiple
> > entire binaries, and switching at postmaster start. I think it's a
> > good idea, but I also suspect detecting flags from the packager is an
> > easier intermediate step.
>
> It's certainly an easier incremental step - but will it get us all that far?

Granted, not much.

(TBH, I'm not sure how to design the multiple-binaries idea, but
surely we're not the first project to think of this...)

On Tue, Nov 28, 2023 at 4:21 AM Nathan Bossart  wrote:
> soon, but if we'd also like to start adding AVX2 enhancements (and I think
> we will), I'm assuming we'll want to provide an easy way for users to
> declare that they are building for v3+ CPUs.

Yeah, I now remember seeing instruction selection change a single shift
with -v3, which made the UTF-8 DFA significantly faster:

https://www.postgresql.org/message-id/CAFBsxsHR08mHEf06PvrMRstfcyPJLwF69g0r1pvRrxWD4GEVoQ%40mail.gmail.com

I imagine a number of places would get automatic improvements, and I
think others have said the same thing.

On Tue, Nov 28, 2023 at 5:26 AM Nathan Bossart  wrote:
> > I'm not sure why I thought checking each feature might be necessary.
> > --with-isa-level could essentially just be an alias for adding all the
> > CFLAGS for the extensions provided at that level, and --with-isa-level=auto
> > would just mean -march=native.  With those flags set, the ./configure
> > checks would succeed with the base set of compiler flags passed in, which
> > could be used as a heuristic for inlining (like CRC32C does today).
>
> Or, perhaps you mean removing those ./configure checks completely and
> assuming that the compiler knows about the intrinsics required for the
> specified ISA level...

With the multiple-binaries approach, we might be able to assume that,
since it'll be opt-in (the default is just to use the baseline), but I'm
not sure. And to avoid needing a newish compiler, we could probably just
use the equivalent, e.g. for -v2:

-march=x86-64 -mmmx -msse -msse2 -mfxsr -msahf -mcx16 -mpopcnt -msse3
-msse4.1 -msse4.2 -mssse3

...and it seems we'll want to make something up for Arm.
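
To make this concrete, here is a minimal standalone C sketch (not
PostgreSQL code; the function names are made up for illustration) of the
kind of runtime dispatch the multiple-binaries/ISA-level discussion is
circling around, using the GCC/Clang __builtin_cpu_supports() facility
(x86 only):

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

static uint64_t
sum_baseline(const uint64_t *v, size_t n)
{
	uint64_t	s = 0;

	for (size_t i = 0; i < n; i++)
		s += v[i];
	return s;
}

/* same code, but the compiler is allowed to autovectorize it with AVX2 */
__attribute__((target("avx2")))
static uint64_t
sum_avx2(const uint64_t *v, size_t n)
{
	uint64_t	s = 0;

	for (size_t i = 0; i < n; i++)
		s += v[i];
	return s;
}

int
main(void)
{
	uint64_t	v[] = {1, 2, 3, 4};
	uint64_t	(*sum) (const uint64_t *, size_t);

	/* pick the implementation once, at startup */
	__builtin_cpu_init();
	sum = __builtin_cpu_supports("avx2") ? sum_avx2 : sum_baseline;
	printf("%llu\n", (unsigned long long) sum(v, 4));
	return 0;
}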




Re: [PATCH] Tracking statements entry timestamp in pg_stat_statements

2023-11-28 Thread Alexander Lakhin

Hello Alexander,

25.11.2023 23:46, Alexander Korotkov wrote:

On Sat, Nov 25, 2023 at 10:45 PM Andrei Zubkov  wrote:


I've noted a strange space in a commit message of 0001 patch:
"I t is needed for upcoming patch..."
It looks like a typo.

Thank you for catching it.  I'll fix this before commit.


I've found one more typo in that commit: minimun.

Best regards,
Alexander




RE: Synchronizing slots from primary to standby

2023-11-28 Thread Zhijie Hou (Fujitsu)
On Tuesday, November 28, 2023 8:07 PM Drouvot, Bertrand 
 wrote:

Hi,

> On 11/27/23 9:57 AM, Zhijie Hou (Fujitsu) wrote:
> > On Monday, November 27, 2023 4:51 PM shveta malik
>  wrote:
> >
> > Here is the updated version (v39_2) which includes all the changes made
> > in 0002.
> > Please use it for review, and sorry for the confusion.
> >
> 
> Thanks!
> 
> As far v39_2-0001:
> 
> "
>  Altering the failover option of the subscription is currently not
>  permitted. However, this restriction may be lifted in future versions.
> "
> 
> Should we mention that we can alter the related replication slot?

Will add.

> 
> + 
> +  The implementation of failover requires that replication
> +  has successfully finished the initial table synchronization
> +  phase. So even when failover is enabled for a
> +  subscription, the internal failover state remains
> +  temporarily pending until the initialization phase
> +  completes. See column subfailoverstate
> +  of pg_subscription
> +  to know the actual failover state.
> + 
> 
> I think we have a corner case here. If one alters the replication slot on the
> primary, then "subfailoverstate" is not updated accordingly on the subscriber.
> Given the 2 remarks above, would it make sense to prevent altering a
> replication slot associated with a subscription?

Thanks for the review!

I think we cannot distinguish between a user-created logical slot and a
subscriber-created slot, as there is no related info in the slot's data. And
the user could change the slot on the subscription with "alter sub set
(slot_name)", so maintaining this info would need some effort.

Besides, I think this case overlaps the previously discussed "alter sub set
(slot_name)" issue[1]. Both cases arise because the slot's failover setting
can differ from the subscription's failover setting. I think we could handle
them similarly, i.e., the user needs to take care not to change failover to a
wrong value. Or would you prefer the other approach mentioned in that
thread[1]? (always alter the slot at the startup of the apply worker)

[1] 
https://www.postgresql.org/message-id/564b195a-180c-42e9-902b-b1a8b50218ee%40gmail.com

Best Regards,
Hou zj


Re: EXCLUDE COLLATE in CREATE/ALTER TABLE document

2023-11-28 Thread Shubham Khanna
On Fri, Nov 17, 2023 at 4:55 AM shihao zhong  wrote:
>
> Hi Jian,
>
> Thanks for your comments, a new version is attached.
>
> Thanks,
> Shihao
>
> On Fri, Nov 10, 2023 at 9:59 AM jian he  wrote:
>>
>> On Wed, Nov 1, 2023 at 10:30 AM shihao zhong  wrote:
>> >
>> > Thank you for your feedback on my previous patch. I have fixed the issue
>> > and attached a new patch for your review. Could you please take a look at
>> > it if you have a sec? Thanks
>> >
>>
>> Your patch works fine. You can see it here:
>> https://cirrus-ci.com/task/6481922939944960
>> In an ideal world, since the doc is already built, we could probably
>> view it as a plain HTML file by just clicking the CI test result.
>>
>> in src/sgml/ref/create_table.sgml:
>> "Each exclude_element can optionally specify an operator class and/or
>> ordering options; these are described fully under CREATE INDEX."
>>
>> You may need to update this sentence to reflect that exclude_element
>> can also optionally specify collation.

I have reviewed the changes and they look fine.

Thanks and Regards,
Shubham Khanna.




Re: SSL tests fail on OpenSSL v3.2.0

2023-11-28 Thread Tom Lane
FTR, I've pushed this and the buildfarm seems happy.  In particular,
I just updated indri to the latest MacPorts packages including
OpenSSL 3.2.0, so we'll have coverage of that going forward.

regards, tom lane




Re: Synchronizing slots from primary to standby

2023-11-28 Thread shveta malik
On Tue, Nov 28, 2023 at 9:28 PM Drouvot, Bertrand
 wrote:
>
> Hi,
>
> On 11/28/23 10:40 AM, shveta malik wrote:
> > On Tue, Nov 28, 2023 at 12:19 PM Drouvot, Bertrand
> >  wrote:
> >>
> >> Hi,
> >>
> >> On 11/28/23 4:13 AM, shveta malik wrote:
> >>> On Mon, Nov 27, 2023 at 4:08 PM Amit Kapila  
> >>> wrote:
> 
>  On Mon, Nov 27, 2023 at 2:27 PM Zhijie Hou (Fujitsu)
>   wrote:
> >
> > Here is the updated version (v39_2) which includes all the changes made
> > in 0002.
> > Please use it for review, and sorry for the confusion.
> >
> 
>  --- a/src/backend/replication/logical/launcher.c
>  +++ b/src/backend/replication/logical/launcher.c
>  @@ -8,20 +8,27 @@
>  *   src/backend/replication/logical/launcher.c
>  *
>  * NOTES
>  - *   This module contains the logical replication worker launcher which
>  - *   uses the background worker infrastructure to start the logical
>  - *   replication workers for every enabled subscription.
>  + *   This module contains the replication worker launcher which
>  + *   uses the background worker infrastructure to:
>  + *   a) start the logical replication workers for every enabled 
>  subscription
>  + *  when not in standby_mode.
>  + *   b) start the slot sync worker for logical failover slots 
>  synchronization
>  + *  from the primary server when in standby_mode.
> 
>  I was wondering whether we really need a launcher on standby to invoke
>  the sync-slot worker. If so, why? I guess it may be required for previous
>  versions where we were managing work for multiple slot-sync workers
>  which is also questionable in the sense of whether launcher is the
>  right candidate for the same but now with the single slot-sync worker,
>  it doesn't seem worth having it. What do you think?
> 
>  --
> >>>
> >>> Yes, earlier a manager process was needed to manage multiple slot-sync
> >>> workers and distribute load among them, but now that does not seem
> >>> necessary. I gave it a try (PoC) and it seems to work well.  If  there
> >>> are no objections to this approach, I can share the patch soon.
> >>>
> >>
> >> +1 on this new approach, thanks!
> >
> > PFA v40. This patch has removed Logical Replication Launcher support
> > to launch slotsync worker.
>
> Thanks!
>
> >  The slot-sync worker is now registered as
> > bgworker with postmaster, with
> > bgw_start_time=BgWorkerStart_ConsistentState and
> > bgw_restart_time=60sec.
> >
> > On removal of launcher, now all the validity checks have been shifted
> > to slot-sync worker itself.  This brings us to some point of concerns:
> >
> > a) We still need to maintain the RecoveryInProgress() check in the slotsync
> > worker. Since the worker has the start time of
> > BgWorkerStart_ConsistentState, it will be started on a non-standby as
> > well. So to ensure that it exits on a non-standby, "RecoveryInProgress"
> > has been introduced at the beginning of the worker. But once it exits,
> > postmaster will not restart it since it will be a clean exit i.e.
> > proc_exit(0) (the restart logic of postmaster comes into play only
> > when there is an abnormal exit). But to exit for the first time on a
> > non-standby, we need that Recovery related check in the worker.
> >
> > b) "enable_syncslot" check is moved to slotsync worker now. Since
> > enable_syncslot is PGC_SIGHUP, so proc_exit(1) is currently used to
> > exit the worker if 'enable_syncslot' is found to be disabled.
> > 'proc_exit(1)' has been used in order to ensure that the worker is
> > restarted and GUCs are checked again after restart_time. Downside of
> > this approach is, if someone has kept "enable_syncslot" as disabled
> > permanently even on standby, slotsync worker will keep on restarting
> > and exiting.
> >
> > So to overcome the above pain-points, I think a potential approach
> > will be to start the slotsync worker only if 'enable_syncslot' is on and
> > the system is a standby.
>
> That makes sense to me.
>
> > Potential ways (each with some issues) are:
> >
> > 1) Use the current way i.e. register the slot-sync worker as a bgworker with
> > postmaster, but introduce extra checks in 'maybe_start_bgworkers'. But
> > this seems more like a hack. This will need extra changes as currently,
> > once 'maybe_start_bgworkers' is attempted by postmaster, it will
> > attempt again to start any worker only if the worker had an abnormal exit
> > and restart_time != 0. The current postmaster will not attempt to start a
> > worker on any GUC change.
> >
> > 2) Another way may be to treat the slotsync worker as a special case and
> > separate out the start/restart of the slotsync worker from bgworker, and
> > follow what we do for the autovacuum launcher (StartAutoVacLauncher) to
> > keep starting it in the postmaster loop (ServerLoop). In this way, we
> > may be able to add more checks before starting the worker. But by opting
> > for this approach, we will have to manage the slotsync worker completely 

Re: common signal handler protection

2023-11-28 Thread Nathan Bossart
On Tue, Nov 28, 2023 at 06:37:50PM -0800, Andres Freund wrote:
> For a moment I was, wrongly, worried this would break signal handlers we
> intentionally inherit from postmaster. It's fine though, because we block
> signals in fork_process() until somewhere in InitPostmasterChild(), after
> we've called InitProcessGlobals(). But perhaps that should be commented upon
> somewhere?

Good call.  I expanded on the MyProcPid assertion in wrapper_handler() a
bit.
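
(For anyone skimming the attached patch, the core technique reduces to
the following standalone demo -- my illustration, not the patch's code:
the wrapper falls back to the default disposition when it finds itself
running in a process other than the one that installed it.)

#include <signal.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

static pid_t installer_pid;

static void
wrapper(int signo)
{
	if (getpid() != installer_pid)
	{
		/* an inheriting child process: revert to the default action */
		signal(signo, SIG_DFL);
		raise(signo);
		return;
	}

	/* the "real" handler body would run here */
	const char msg[] = "handled in installer process\n";

	(void) write(STDOUT_FILENO, msg, sizeof(msg) - 1);
}

int
main(void)
{
	int			status;

	installer_pid = getpid();
	signal(SIGUSR1, wrapper);

	raise(SIGUSR1);				/* handled normally by wrapper() */

	if (fork() == 0)
	{
		raise(SIGUSR1);			/* child: default action terminates it */
		_exit(0);				/* not reached */
	}
	wait(&status);
	printf("child terminated by signal %d\n",
		   WIFSIGNALED(status) ? WTERMSIG(status) : 0);
	return 0;
}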

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
From 8c36c9470bee558bcab3fa70d456ca976dc02ac2 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Tue, 28 Nov 2023 14:58:20 -0600
Subject: [PATCH v4 1/3] Check that MyProcPid == getpid() in all signal
 handlers.

In commit 97550c0711, we added a similar check to the SIGTERM
handler for the startup process.  This commit adds this check to
all signal handlers installed with pqsignal().  This is done by
using a wrapper function that performs the check before calling the
actual handler.

The hope is that this will offer more general protection against
child processes of Postgres backends inadvertently modifying shared
memory due to inherited signal handlers.  Another potential
follow-up improvement is to use this wrapper handler function to
restore errno instead of relying on each individual handler
function to do so.

This commit makes the changes in commit 97550c0711 obsolete but
leaves reverting it for a follow-up commit.

Reviewed-by: Andres Freund
Discussion: https://postgr.es/m/20231121212008.GA3742740%40nathanxps13
---
 src/port/pqsignal.c | 86 +++--
 1 file changed, 84 insertions(+), 2 deletions(-)

diff --git a/src/port/pqsignal.c b/src/port/pqsignal.c
index 23e7f685c1..f04a7153de 100644
--- a/src/port/pqsignal.c
+++ b/src/port/pqsignal.c
@@ -46,23 +46,99 @@
 #include "c.h"
 
#include <signal.h>
+#ifndef FRONTEND
+#include <unistd.h>
+#endif
 
 #ifndef FRONTEND
 #include "libpq/pqsignal.h"
+#include "miscadmin.h"
+#endif
+
+#ifdef PG_SIGNAL_COUNT			/* Windows */
+#define PG_NSIG (PG_SIGNAL_COUNT)
+#elif defined(NSIG)
+#define PG_NSIG (NSIG)
+#else
+#define PG_NSIG (64)			/* XXX: wild guess */
+#endif
+
+/* Check a couple of common signals to make sure PG_NSIG is accurate. */
+StaticAssertDecl(SIGUSR2 < PG_NSIG, "SIGUSR2 >= PG_NSIG");
+StaticAssertDecl(SIGHUP < PG_NSIG, "SIGHUP >= PG_NSIG");
+StaticAssertDecl(SIGTERM < PG_NSIG, "SIGTERM >= PG_NSIG");
+StaticAssertDecl(SIGALRM < PG_NSIG, "SIGALRM >= PG_NSIG");
+
+static volatile pqsigfunc pqsignal_handlers[PG_NSIG];
+
+/*
+ * Except when called with SIG_IGN or SIG_DFL, pqsignal() sets up this function
+ * as the handler for all signals.  This wrapper handler function checks that
+ * it is called within a process that the server knows about (i.e., any process
+ * that has called InitProcessGlobals(), such as a client backend), and not a
+ * child process forked by system(3), etc.  This check ensures that such child
+ * processes do not modify shared memory, which is often detrimental.  If the
+ * check succeeds, the function originally provided to pqsignal() is called.
+ * Otherwise, the default signal handler is installed and then called.
+ */
+static void
+wrapper_handler(SIGNAL_ARGS)
+{
+#ifndef FRONTEND
+
+	/*
+	 * We expect processes to set MyProcPid before calling pqsignal() or
+	 * before accepting signals.
+	 */
+	Assert(MyProcPid);
+	Assert(MyProcPid != PostmasterPid || !IsUnderPostmaster);
+
+	if (unlikely(MyProcPid != (int) getpid()))
+	{
+		pqsignal(postgres_signal_arg, SIG_DFL);
+		raise(postgres_signal_arg);
+		return;
+	}
 #endif
 
+	(*pqsignal_handlers[postgres_signal_arg]) (postgres_signal_arg);
+}
+
 /*
  * Set up a signal handler, with SA_RESTART, for signal "signo"
  *
  * Returns the previous handler.
+ *
+ * NB: If called within a signal handler, race conditions may lead to bogus
+ * return values.  You should either avoid calling this within signal handlers
+ * or ignore the return value.
+ *
+ * XXX: Since no in-tree callers use the return value, and there is little
+ * reason to do so, it would be nice if we could convert this to a void
+ * function instead of providing potentially-bogus return values.
+ * Unfortunately, that requires modifying the pqsignal() in legacy-pqsignal.c,
+ * which in turn requires an SONAME bump, which is probably not worth it.
  */
 pqsigfunc
 pqsignal(int signo, pqsigfunc func)
 {
+	pqsigfunc	orig_func = pqsignal_handlers[signo];	/* assumed atomic */
 #if !(defined(WIN32) && defined(FRONTEND))
 	struct sigaction act,
 oact;
+#else
+	pqsigfunc	ret;
+#endif
 
+	Assert(signo < PG_NSIG);
+
+	if (func != SIG_IGN && func != SIG_DFL)
+	{
+		pqsignal_handlers[signo] = func;	/* assumed atomic */
+		func = wrapper_handler;
+	}
+
+#if !(defined(WIN32) && defined(FRONTEND))
 	act.sa_handler = func;
	sigemptyset(&act.sa_mask);
 	act.sa_flags = SA_RESTART;
@@ -72,9 +148,15 @@ pqsignal(int signo, pqsigfunc func)
 #endif
	if (sigaction(signo, &act, &oact) < 0)
 		return SIG_ERR;
-	return 

Re: Synchronizing slots from primary to standby

2023-11-28 Thread Peter Smith
Hi. Here are some review comments for the patch v39_2-0001.

Multiple items from my previous review [1] seemed unanswered, so it
wasn't clear if they were discarded because they were wrong or maybe
accidentally missed. I've repeated all those again here, as well as some
new comments.

==
1. General.

Previously (see [1] #0) I asked a question about whether there is some
documentation missing. It seems not yet answered.

==
Commit message

2.
Users can set this flag during CREATE SUBSCRIPTION or during
pg_create_logical_replication_slot API. Examples:

CREATE SUBSCRIPTION mysub CONNECTION '..' PUBLICATION mypub
WITH (failover = true);

(failover is the last arg)
SELECT * FROM pg_create_logical_replication_slot('myslot',
'pgoutput', false, true, true);

~

I felt it is better to say "Ex1" / "Ex2" (or "1" / "2" or something
similar) to indicate more clearly where these examples start and finish;
otherwise they just sort of get lost among the text.

==
doc/src/sgml/catalogs.sgml

3.
From previous review ([1] #6) I suggested reordering fields. Hou-san
wrote: "but I think the functionality of two fields are different and
I didn’t find the correlation between two fields except for the type
of value."

Yes, that is true. OTOH, I felt grouping the attributes by the same
types made the docs easier to read.

==
src/backend/commands/subscriptioncmds.c

4. CreateSubscription

+ /*
+ * If only the slot_name is specified (without create_slot option),
+ * it is possible that the user intends to use an existing slot on
+ * the publisher, so here we enable failover for the slot if
+ * requested.
+ */
+ else if (opts.slot_name && failover_enabled)
+ {

An unanswered question from my previous review (see [1] #11a): how does
this condition ensure that *only* the slot name was specified (as
the comment says)?

~~~

5. AlterSubscription

  errmsg("ALTER SUBSCRIPTION with refresh and copy_data is not allowed
when two_phase is enabled"),
  errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION with refresh =
false, or with copy_data = false, or use DROP/CREATE
SUBSCRIPTION.")));

+ if (sub->failoverstate == LOGICALREP_FAILOVER_STATE_ENABLED && opts.copy_data)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("ALTER SUBSCRIPTION with refresh and copy_data is not allowed
when failover is enabled"),
+ errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION with refresh =
false, or with copy_data = false, or use DROP/CREATE
SUBSCRIPTION.")));
+

There are translation issues here, the same as reported in my previous
review (see [1] #12b, and also several other places as noted in [1]). Hou-san
replied that I "can start a separate thread to change the twophase
related messages, and we can change accordingly if it's accepted.",
but that's not right IMO, because it is only the fact that this
syncslot patch is reusing a similar message that warrants the need to
extract a "common" message part in the first place. So I think it is the
responsibility of this syncslot patch to make this change.

==
src/backend/replication/logical/tablesync.c

6. process_syncing_tables_for_apply

+ if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING)
+ ereport(LOG,
+ (errmsg("logical replication apply worker for subscription \"%s\"
will restart so that two_phase can be enabled",
+ MySubscription->name)));
+
+ if (MySubscription->failoverstate == LOGICALREP_FAILOVER_STATE_PENDING)
+ ereport(LOG,
+ (errmsg("logical replication apply worker for subscription \"%s\"
will restart so that failover can be enabled",
+ MySubscription->name)));

6a.
You may end up logging 2 restart messages for the same restart. Is that OK?

~

6b.
This is another example where you should share the same common message
(for fewer translations).

==
src/backend/replication/logical/worker.c

7.
+ * The logical slot on the primary can be synced to the standby by specifying
+ * the failover = true when creating the subscription. Enabling failover allows
+ * us to smoothly transition to the standby in case the primary gets promoted,
+ * ensuring that we can subscribe to the new primary without losing any data.

/the failover = true/the failover = true option/

or

/the failover = true/failover = true/

~~~

8.
+
 #include "postgres.h"

Unnecessary extra blank line

==
src/backend/replication/slot.c

9. validate_standby_slots

There was no reply to the comment in my previous review (see [1] #27).
Maybe you disagree, or maybe it was accidentally overlooked?

~~~

10. check_standby_slot_names

In my previous review I asked ([1] #28) why a special check was needed
for "*". Hou-san replied that "SplitIdentifierString() does not give
error for '*' and '*' can be considered as valid value which if
accepted can mislead user".

Sure, but won't the code then just try to find a replication slot called
"*", and that will fail? That was my point: if the slot name lookup is
going to fail anyway, then why have the extra code for the special "*"
case up-front? Note -- I 

Re: Synchronizing slots from primary to standby

2023-11-28 Thread Amit Kapila
On Tue, Nov 28, 2023 at 2:17 PM Drouvot, Bertrand
 wrote:
>
> On 11/2/23 1:27 AM, Zhijie Hou (Fujitsu) wrote:
> > On Tuesday, October 31, 2023 6:45 PM Amit Kapila  
> > wrote:
> >> We have create_replication_slot and drop_replication_slot in repl_gram.y. 
> >> How
> >> about if introduce alter_replication_slot and handle the 'failover' flag 
> >> with that?
> >> The idea is we will either enable 'failover' at the time 
> >> create_replication_slot by
> >> providing an optional failover option or execute a separate command
> >> alter_replication_slot. I think we probably need to perform this command
> >> before the start of streaming.
> >
> > Here is an attempt to achieve the same. I added a new replication command
> > alter_replication_slot and introduced a walreceiver api walrcv_alter_slot to
> > execute the command. The subscription will call the api to enable/disable
> > the failover of the slot on publisher.
> >
> > The patch disallows altering the failover option for the subscription. But 
> > we
> > could release the restriction by using the following approaches in next 
> > version:
> >
> >> I think we will have the following options to allow alter of the 'failover'
> >> property: (a) we can allow altering 'failover' only for the 'disabled'
> >> subscription; to achieve that, we need to open a connection during alter
> >> subscription and change this property of slot; (b) apply worker detects the
> >> change in 'failover' option; run the alter_replication_slot command; this 
> >> needs
> >> more analysis as apply_worker is already doing streaming and changing slot
> >> property in between could be tricky.
> >
>
> What do you think about also adding a pg_alter_logical_replication_slot() or 
> such
> function?
>
> That would allow users to alter manually created logical replication slots 
> without
> the need to make a replication connection.
>

But then won't that make it inconsistent with the subscription
failover state? I think if we don't have a simple solution for this,
we can always do it as an enhancement to the main feature once we have
good ideas to solve it.

-- 
With Regards,
Amit Kapila.




Re: Fix assertion in autovacuum worker

2023-11-28 Thread Andres Freund
Hi,

On 2023-11-28 20:42:47 -0600, Nathan Bossart wrote:
> On Tue, Nov 28, 2023 at 04:03:49PM -0800, Andres Freund wrote:
> > On 2023-11-28 16:05:16 -0600, Nathan Bossart wrote:
> >> From a glance, it looks to me like the problem is that pgstat_shutdown_hook
> >> is registered as a before_shmem_exit callback, while ProcKill is registered
> >> as an on_shmem_exit callback.
> > 
> > That's required, as pgstat_shutdown_hook() needs to acquire lwlocks, which 
> > you
> > can't after ProcKill(). It's also not unique to pgstat, several other
> > before_shmem_exit() callbacks acquire lwlocks (e.g. AtProcExit_Twophase(),
> > ParallelWorkerShutdown(), do_pg_abort_backup(), the first three uses of
> > before_shmem_exit when git grepping) - which makes sense, they are 
> > presumably
> > before_shmem_exit() because they need to manage shared state, which often
> > needs locks.
> > 
> > In normal backends this is fine-ish, because ShutdownPostgres() is 
> > registered
> > very late (and thus is called early in the shutdown sequence), and the
> > AbortOutOfAnyTransaction() contained therein indirectly calls
> > LWLockReleaseAll() and very little happens outside of the transaction 
> > context.
> 
> Right.  Perhaps we could add a LWLockReleaseAll() to
> pgstat_shutdown_hook() instead of the autovacuum code, but I'm afraid that
> is still just a hack.

Yea, we'd need that in just about all before_shmem_exit() callbacks.  I could
see an argument for doing it in proc_exit_prepare(). While that'd be a fairly
gross layering violation, we already reset a bunch of stuff in
there:
/*
 * Forget any pending cancel or die requests; we're doing our best to
 * close up shop already.  Note that the signal handlers will not set
 * these flags again, now that proc_exit_inprogress is set.
 */
InterruptPending = false;
ProcDiePending = false;
QueryCancelPending = false;
InterruptHoldoffCount = 1;
CritSectionCount = 0;


> >> I would expect your patch to fix this particular issue, but I'm wondering
> >> whether there's a bigger problem here.
> > 
> > Yes, there is - our subsystem initialization, shutdown, error recovery
> > infrastructure is a mess.  We've interwoven transaction handling far too
> > tightly with error handling, the order of subsystem initialization is 
> > basically
> > random and differs between operating systems (due to EXEC_BACKEND) and 
> > "mode"
> > of execution (the order differs when using single user mode) and we've
> > distributed error recovery into ~10 places (all the sigsetjmp()s in backend
> > code, xact.c, and a few other places like WalSndErrorCleanup()).
> 
> :(
> 
> I do remember looking into uniting all the various sigsetjmp() calls
> before.  That could be worth another try.  The rest will probably require
> additional thought...

It'd definitely be worth some effort. I'm quite sure that we have a number of
hard to find bugs around this.

Greetings,

Andres Freund




Re: Fix assertion in autovacuum worker

2023-11-28 Thread Nathan Bossart
On Tue, Nov 28, 2023 at 04:03:49PM -0800, Andres Freund wrote:
> On 2023-11-28 16:05:16 -0600, Nathan Bossart wrote:
>> From a glance, it looks to me like the problem is that pgstat_shutdown_hook
>> is registered as a before_shmem_exit callback, while ProcKill is registered
>> as an on_shmem_exit callback.
> 
> That's required, as pgstat_shutdown_hook() needs to acquire lwlocks, which you
> can't after ProcKill(). It's also not unique to pgstat, several other
> before_shmem_exit() callbacks acquire lwlocks (e.g. AtProcExit_Twophase(),
> ParallelWorkerShutdown(), do_pg_abort_backup(), the first three uses of
> before_shmem_exit when git grepping) - which makes sense, they are presumably
> before_shmem_exit() because they need to manage shared state, which often
> needs locks.
> 
> In normal backends this is fine-ish, because ShutdownPostgres() is registered
> very late (and thus is called early in the shutdown sequence), and the
> AbortOutOfAnyTransaction() contained therein indirectly calls
> LWLockReleaseAll() and very little happens outside of the transaction context.

Right.  Perhaps we could add a LWLockReleaseAll() to
pgstat_shutdown_hook() instead of the autovacuum code, but I'm afraid that
is still just a hack.

>> I would expect your patch to fix this particular issue, but I'm wondering
>> whether there's a bigger problem here.
> 
> Yes, there is - our subsystem initialization, shutdown, error recovery
> infrastructure is a mess.  We've interwoven transaction handling far too
> tightly with error handling, the order of subsystem initialization is basically
> random and differs between operating systems (due to EXEC_BACKEND) and "mode"
> of execution (the order differs when using single user mode) and we've
> distributed error recovery into ~10 places (all the sigsetjmp()s in backend
> code, xact.c, and a few other places like WalSndErrorCleanup()).

:(

I do remember looking into uniting all the various sigsetjmp() calls
before.  That could be worth another try.  The rest will probably require
additional thought...

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: common signal handler protection

2023-11-28 Thread Andres Freund
Hi,

On 2023-11-27 16:16:25 -0600, Nathan Bossart wrote:
> On Wed, Nov 22, 2023 at 02:59:45PM -0800, Andres Freund wrote:
> > Hm. I wonder if this indicates an issue.  Do the preceding changes perhaps
> > make it more likely that a child process of the startup process could hang
> > around for longer, because the signal we're delivering doesn't terminate 
> > child
> > processes, because we'd just reset the signal handler?  I think it's fine 
> > for
> > the startup process, because we ask the startup process to shut down with
> > SIGTERM, which defaults to exiting.
> > 
> > But we do have a few processes that we do ask to shut down with other
> > signals, which do not trigger an exit by default, e.g. Checkpointer, 
> > archiver,
> > walsender are asked to shut down using SIGUSR2 IIRC.  The only one where 
> > that
> > could be an issue is archiver, I guess?
> 
> This did cross my mind.  AFAICT most default handlers already trigger an
> exit (including SIGUSR2), and for the ones that don't, we'd just end up in
> the same situation as if the signal arrived a moment later after the child
> process has installed its own handlers.  And postmaster doesn't send
> certain signals (e.g., SIGHUP) to the whole process group, so we don't have
> the opposite problem where things like reloading configuration files
> terminates these child processes.
> 
> So, I didn't notice any potential issues.  Did you have anything else in
> mind?

No, I just was wondering about issues in this area. I couldn't immediately
think of a concrete scenario, so I thought I'd mention it here.

Greetings,

Andres Freund




Re: common signal handler protection

2023-11-28 Thread Andres Freund
Hi,

On 2023-11-28 15:39:55 -0600, Nathan Bossart wrote:
> From e4bea5353c2685457545b67396095e9b96156982 Mon Sep 17 00:00:00 2001
> From: Nathan Bossart 
> Date: Tue, 28 Nov 2023 14:58:20 -0600
> Subject: [PATCH v3 1/3] Check that MyProcPid == getpid() in all signal
>  handlers.
> 
> In commit 97550c0711, we added a similar check to the SIGTERM
> handler for the startup process.  This commit adds this check to
> all signal handlers installed with pqsignal().  This is done by
> using a wrapper function that performs the check before calling the
> actual handler.
> 
> The hope is that this will offer more general protection against
> child processes of Postgres backends inadvertently modifying shared
> memory due to inherited signal handlers.  Another potential
> follow-up improvement is to use this wrapper handler function to
> restore errno instead of relying on each individual handler
> function to do so.
> 
> This commit makes the changes in commit 97550c0711 obsolete but
> leaves reverting it for a follow-up commit.

For a moment I was, wrongly, worried this would break signal handlers we
intentionally inherit from postmaster. It's fine though, because we block
signals in fork_process() until somewhere in InitPostmasterChild(), after
we've called InitProcessGlobals(). But perhaps that should be commented upon
somewhere?

Greetings,

Andres Freund




Re: remaining sql/json patches

2023-11-28 Thread Andres Freund
Hi,

On 2023-11-28 20:58:41 -0500, Andrew Dunstan wrote:
> On 2023-11-28 Tu 19:32, Tom Lane wrote:
> > Andrew Dunstan  writes:
> > So I'm now a bit baffled.  Can you provide more color on what
> > your test setup is?
> 
> 
> *sigh* yes, you're right. I inadvertently used a setup that used meson for
> building REL16_STABLE and HEAD. When I switched it to autoconf I got results
> that are similar to the earlier branches:
> 
> 
>  REL_16_STABLE 
> Time: 3401.625 ms (00:03.402)
>  HEAD 
> Time: 3419.088 ms (00:03.419)
> 
> 
> It's not clear to me why that should be. I didn't have assertions enabled
> anywhere. It's the same version of bison, same compiler throughout. Maybe
> meson sets a higher level of optimization? It shouldn't really matter, ISTM.

Is it possible that you have CFLAGS set in your environment? For reasons that
I find very debatable, configure.ac only adds -O2 when CFLAGS is not set:

# C[XX]FLAGS are selected so:
# If the user specifies something in the environment, that is used.
# else:  If the template file set something, that is used.
# else:  If coverage was enabled, don't set anything.
# else:  If the compiler is GCC, then we use -O2.
# else:  If the compiler is something else, then we use -O, unless debugging.

if test "$ac_env_CFLAGS_set" = set; then
  CFLAGS=$ac_env_CFLAGS_value
elif test "${CFLAGS+set}" = set; then
  : # (keep what template set)
elif test "$enable_coverage" = yes; then
  : # no optimization by default
elif test "$GCC" = yes; then
  CFLAGS="-O2"
else
  # if the user selected debug mode, don't use -O
  if test "$enable_debug" != yes; then
CFLAGS="-O"
  fi
fi

So if you have CFLAGS set in the environment, we'll not add -O2 to the
compilation flags.

I'd check what the actual flags are when building some .o file.

Greetings,

Andres Freund




RE: [PoC] pg_upgrade: allow to upgrade publisher node

2023-11-28 Thread Hayato Kuroda (Fujitsu)
Dear Sawada-san,

> > Actually, we do not expect NULL input. IIUC all slots have a
> > slot_name, and the subquery uses its name. But will that be kept
> > forever? I think we can avoid any risk.
> >
> > > I've not tested it yet but even if it returns NULL, perhaps
> > > get_old_cluster_logical_slot_infos() would still set curr->caught_up
> > > to false, no?
> >
> > Hmm. I checked the C99 specification [1] of strcmp, but it does not
> > define the case when NULL is input. So it depends on the implementation.
> 
> I think PQgetvalue() returns an empty string if the result value is null.
>

Oh, you are right... I found the below paragraph in [1].

> An empty string is returned if the field value is null. See PQgetisnull to 
> distinguish
> null values from empty-string values.

So I agree with what you said - the current code can accept NULL.
But I'm still not sure whether the error message is really good or not.
If we regard an empty string as false, a slot which has an empty name will be
reported like:
"The slot \"\" has not consumed the WAL yet" in
check_old_cluster_for_valid_slots().
Isn't that inappropriate?

(Note again - currently we do not find such a case, so it may be overkill)

[1]: https://www.postgresql.org/docs/devel/libpq-exec.html#LIBPQ-PQGETVALUE
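
As a tiny standalone illustration of that distinction (assuming libpq;
the connection string and query here are just examples):

#include <stdio.h>
#include <stdlib.h>
#include <libpq-fe.h>

int
main(void)
{
	PGconn	   *conn = PQconnectdb("dbname=postgres");
	PGresult   *res;

	if (PQstatus(conn) != CONNECTION_OK)
	{
		fprintf(stderr, "%s", PQerrorMessage(conn));
		exit(1);
	}

	res = PQexec(conn, "SELECT NULL::text, ''::text");
	if (PQresultStatus(res) == PGRES_TUPLES_OK)
	{
		/* both columns print as empty strings ... */
		printf("col0=\"%s\" col1=\"%s\"\n",
			   PQgetvalue(res, 0, 0), PQgetvalue(res, 0, 1));
		/* ... but only the first one is actually NULL */
		printf("isnull: %d %d\n",
			   PQgetisnull(res, 0, 0), PQgetisnull(res, 0, 1));
	}
	PQclear(res);
	PQfinish(conn);
	return 0;
}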

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



Re: remaining sql/json patches

2023-11-28 Thread Andrew Dunstan



On 2023-11-28 Tu 19:32, Tom Lane wrote:

Andrew Dunstan  writes:

Cool, I took this and ran with it a bit. (See attached) Here are
comparative timings for 1000 iterations parsing most of the
information_schema.sql, all the way back to 9.3:
...
 REL_15_STABLE 
Time: 3372.491 ms (00:03.372)
 REL_16_STABLE 
Time: 1654.056 ms (00:01.654)
 HEAD 
Time: 1614.949 ms (00:01.615)
This is fairly repeatable.

These results astonished me, because I didn't recall us having done
anything that'd be likely to double the speed of the raw parser.
So I set out to replicate them, intending to bisect to find where
the change happened.  And ... I can't replicate them.  What I got
is essentially level performance from HEAD back to d10b19e22
(Stamp HEAD as 14devel):

HEAD: 3742.544 ms
d31d30973a (16 stamp): 3871.441 ms
596b5af1d (15 stamp): 3759.319 ms
d10b19e22 (14 stamp): 3730.834 ms

The run-to-run variation is a couple percent, which means that
these differences are down in the noise.  This is using your
test code from github (but with 5000 iterations not 1000).
Builds are pretty vanilla with asserts off, on an M1 MacBook Pro.
The bison version might matter here: it's 3.8.2 from MacPorts.

I wondered if you'd tested assert-enabled builds, but there
doesn't seem to be much variation with that turned on either.

So I'm now a bit baffled.  Can you provide more color on what
your test setup is?



*sigh* yes, you're right. I inadvertently used a setup that used meson 
for building REL16_STABLE and HEAD. When I switched it to autoconf I got 
results that are similar to the earlier branches:



 REL_16_STABLE 
Time: 3401.625 ms (00:03.402)
 HEAD 
Time: 3419.088 ms (00:03.419)


It's not clear to me why that should be. I didn't have assertions 
enabled anywhere. It's the same version of bison, same compiler 
throughout. Maybe meson sets a higher level of optimization? It 
shouldn't really matter, ISTM.



cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: [PATCH] ltree hash functions

2023-11-28 Thread jian he
On Wed, Nov 29, 2023 at 6:09 AM Tommy Pavlicek  wrote:
>
> On Thu, Jul 6, 2023 at 2:18 AM Daniel Gustafsson  wrote:
> >
> > > On 19 Jun 2023, at 11:18, Tommy Pavlicek  wrote:
> >
> > > Tommy, are you interested in extending ALTER OPERATOR to allow this,
> > > which would also allow fixing the ltree operator?
> > >
> > > Yes, I can do that. I took a look over the code and email thread and it
> > > seems like it should be relatively straightforward. I'll put a patch
> > > together for that and then update this patch to alter the operator.
> >
> > Did you have a chance to look at this for an updated patch for this 
> > commitfest?
>
> I finally had a chance to look at this and I've updated the patch to
> alter the = operator to enable hash joins.
>
> This is ready to be looked at now.
>
> Is there anything I need to do to move this forward?
>

You only changed the Makefile; don't you also need to change contrib/ltree/meson.build?

+drop index tstidx;
+create index tstidx on ltreetest using hash (t);
+set enable_seqscan=off;
+
+SELECT * FROM ltreetest WHERE t = '12.3' order by t asc;

Do you need to use EXPLAIN to demonstrate the index usage?




Re: Streaming I/O, vectored I/O (WIP)

2023-11-28 Thread Melanie Plageman
On Wed, Nov 29, 2023 at 01:17:19AM +1300, Thomas Munro wrote:

Thanks for posting a new version. I've included a review of 0004.

> I've included just the pg_prewarm example user for now while we
> discuss the basic infrastructure.  The rest are rebased and in my
> public Github branch streaming-read (repo macdice/postgres) if anyone
> is interested (don't mind the red CI failures, they're just saying I
> ran out of monthly CI credits on the 29th, so close...)

I agree it makes sense to commit the interface with just prewarm as a
user. Then we can start new threads for the various streaming read users
(e.g. vacuum, sequential scan, bitmapheapscan).
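
(Aside, for readers new to the thread: the payoff of coalescing
neighboring blocks is the classic vectored-I/O pattern. Here is a toy
standalone example -- not PostgreSQL code, and not from the patch --
that reads three neighboring 8kB blocks with one preadv() call:)

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

#define BLCKSZ 8192

int
main(void)
{
	char		block[BLCKSZ];
	char		bufs[3][BLCKSZ];
	struct iovec iov[3];
	int			fd;
	ssize_t		nread;

	/* create a 4-block scratch file: block i is filled with 'a' + i */
	fd = open("scratch.dat", O_RDWR | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
		return 1;
	for (int i = 0; i < 4; i++)
	{
		memset(block, 'a' + i, sizeof(block));
		if (write(fd, block, sizeof(block)) != sizeof(block))
			return 1;
	}

	/* read blocks 1..3 with a single vectored call */
	for (int i = 0; i < 3; i++)
	{
		iov[i].iov_base = bufs[i];
		iov[i].iov_len = BLCKSZ;
	}
	nread = preadv(fd, iov, 3, (off_t) 1 * BLCKSZ);
	printf("read %zd bytes: %c %c %c\n", nread,
		   bufs[0][0], bufs[1][0], bufs[2][0]);
	close(fd);
	return 0;
}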

> From db5de8ab5a1a804f41006239302fdce954cab331 Mon Sep 17 00:00:00 2001
> From: Thomas Munro 
> Date: Sat, 22 Jul 2023 17:31:54 +1200
> Subject: [PATCH v2 4/8] Provide vectored variant of ReadBuffer().
> 
> diff --git a/src/backend/storage/buffer/bufmgr.c 
> b/src/backend/storage/buffer/bufmgr.c
> index f7c67d504c..8ae3a72053 100644
> --- a/src/backend/storage/buffer/bufmgr.c
> +++ b/src/backend/storage/buffer/bufmgr.c
> @@ -1046,175 +1048,326 @@ ReadBuffer_common(SMgrRelation smgr, char 
> relpersistence, ForkNumber forkNum,
>   if (mode == RBM_ZERO_AND_LOCK || mode == 
> RBM_ZERO_AND_CLEANUP_LOCK)
>   flags |= EB_LOCK_FIRST;
>  
> - return ExtendBufferedRel(BMR_SMGR(smgr, relpersistence),
> -  forkNum, 
> strategy, flags);
> + *hit = false;
> +
> + return ExtendBufferedRel(bmr, forkNum, strategy, flags);
>   }
>  
> - TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
> -
> smgr->smgr_rlocator.locator.spcOid,
> -
> smgr->smgr_rlocator.locator.dbOid,
> -
> smgr->smgr_rlocator.locator.relNumber,
> -
> smgr->smgr_rlocator.backend);
> + buffer = PrepareReadBuffer(bmr,
> +forkNum,
> +blockNum,
> +strategy,
> +hit,
> +&allocated);
> +
> + /* At this point we do NOT hold any locks. */
> +
> + if (mode == RBM_ZERO_AND_CLEANUP_LOCK || mode == RBM_ZERO_AND_LOCK)
> + {
> + /* if we just want zeroes and a lock, we're done */
> + ZeroBuffer(buffer, mode);
> + }
> + else if (!*hit)
> + {
> + /* we might need to perform I/O */
> + CompleteReadBuffers(bmr,
> + &buffer,
> + forkNum,
> + blockNum,
> + 1,
> + mode == 
> RBM_ZERO_ON_ERROR,
> + strategy);
> + }
> +
> + return buffer;
> +}
> +
> +/*
> + * Prepare to read a block.  The buffer is pinned.  If this is a 'hit', then
> + * the returned buffer can be used immediately.  Otherwise, a physical read
> + * should be completed with CompleteReadBuffers().  PrepareReadBuffer()
> + * followed by CompleteReadBuffers() is equivalent ot ReadBuffer(), but the

ot -> to

> + * caller has the opportunity to coalesce reads of neighboring blocks into 
> one
> + * CompleteReadBuffers() call.
> + *
> + * *found is set to true for a hit, and false for a miss.
> + *
> + * *allocated is set to true for a miss that allocates a buffer for the first
> + * time.  If there are multiple calls to PrepareReadBuffer() for the same
> + * block before CompleteReadBuffers() or ReadBuffer_common() finishes the
> + * read, then only the first such call will receive *allocated == true, which
> + * the caller might use to issue just one prefetch hint.
> + */
> +Buffer
> +PrepareReadBuffer(BufferManagerRelation bmr,
> +   ForkNumber forkNum,
> +   BlockNumber blockNum,
> +   BufferAccessStrategy strategy,
> +   bool *found,
> +   bool *allocated)
> +{
> + BufferDesc *bufHdr;
> + boolisLocalBuf;
> + IOContext   io_context;
> + IOObjectio_object;
>  
> + Assert(blockNum != P_NEW);
> +
> + if (bmr.rel)
> + {
> + bmr.smgr = RelationGetSmgr(bmr.rel);
> + bmr.relpersistence = bmr.rel->rd_rel->relpersistence;
> + }
> +
> + isLocalBuf = SmgrIsTemp(bmr.smgr);
>   if 

Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

2023-11-28 Thread Peter Geoghegan
On Tue, Nov 28, 2023 at 3:52 PM Peter Geoghegan  wrote:
> While I still think that we need heuristics that apply speculative
> criteria to decide whether or not to go to the next page directly
> (when we have a choice at all), that doesn't mean that the v7
> heuristics can't be improved on, with a little more thought. It's a
> bit tricky, since we're probably also benefiting from the same
> heuristics all the time -- probably even for this same test case.

Correction: this particular test case happens to be one where the
optimal strategy is to do *exactly* what the master branch does
currently. The master branch is unbeatable, so the only reasonable
goal for the patch is to not lose (or to lose by no more than a
completely negligible amount).

I'm now prepared to say that this behavior is not okay -- I definitely
need to fix this. It's a bug.

Because each distinct value never fits on one leaf page (it's more
like 1.5 - 2 pages, even though we're deduplicating heavily), and
because Postgres 12 optimizations are so effective with low
cardinality/posting-list-heavy indexes such as this, we're bound to
lose quite often. The only reason it doesn't happen _every single
time_ we descend the index is because the test script uses CREATE
INDEX, rather than using retail inserts (I tend to prefer the latter
for this sort of analysis). Since nbtsort.c isn't as clever/aggressive
about suffix truncation as the nbtsplitloc.c split strategies would
have been, had we used them (had there been retail inserts), many
individual leaf pages are left with high keys that aren't particularly
good targets for the high key precheck optimization (see Postgres 12
commit 29b64d1d).

If I wanted to produce a truly adversarial case for this issue (which
this is already close to), I'd go with the following:

1. Retail inserts that leave each leaf page full of one single value,
which will allow each high key to still make a "clean break" from the
right sibling page -- it'll have the right sibling's value. Maybe
insert 1200 - 1300 tuples per distinct index value for this.

In other words, bulk loading that results in an index that never has
to append a heap TID tiebreaker during suffix truncation, but comes
very close to needing to. Bulk loading where nbtsplitloc.c needs to
use SPLIT_MANY_DUPLICATES all the time, but never quite gets to the
point of needing a SPLIT_SINGLE_VALUE split.

2. A SAOP query with an array with every second value in the index as
an element. Something like "WHERE arr in (2, 4, 6, 8, ...)".

The patch will read every single leaf page, whereas master will
*reliably* only read every second leaf page. I didn't need to "trick"
the patch in a contrived sort of way to get this bad outcome -- this
scenario is fairly realistic. So this behavior is definitely not
something that I'm prepared to defend. As I said, it's a bug.

It'll be fixed in the next revision.

--
Peter Geoghegan




Re: remaining sql/json patches

2023-11-28 Thread Tom Lane
Andrew Dunstan  writes:
> Cool, I took this and ran with it a bit. (See attached) Here are 
> comparative timings for 1000 iterations parsing most of the 
> information_schema.sql, all the way back to 9.3:
> ...
>  REL_15_STABLE 
> Time: 3372.491 ms (00:03.372)
>  REL_16_STABLE 
> Time: 1654.056 ms (00:01.654)
>  HEAD 
> Time: 1614.949 ms (00:01.615)
> This is fairly repeatable.

These results astonished me, because I didn't recall us having done
anything that'd be likely to double the speed of the raw parser.
So I set out to replicate them, intending to bisect to find where
the change happened.  And ... I can't replicate them.  What I got
is essentially level performance from HEAD back to d10b19e22
(Stamp HEAD as 14devel):

HEAD: 3742.544 ms
d31d30973a (16 stamp): 3871.441 ms
596b5af1d (15 stamp): 3759.319 ms
d10b19e22 (14 stamp): 3730.834 ms

The run-to-run variation is a couple percent, which means that
these differences are down in the noise.  This is using your
test code from github (but with 5000 iterations not 1000).
Builds are pretty vanilla with asserts off, on an M1 MacBook Pro.
The bison version might matter here: it's 3.8.2 from MacPorts.

I wondered if you'd tested assert-enabled builds, but there
doesn't seem to be much variation with that turned on either.

So I'm now a bit baffled.  Can you provide more color on what
your test setup is?

regards, tom lane




Re: remaining sql/json patches

2023-11-28 Thread Andres Freund
Hi,

On 2023-11-28 15:57:45 -0500, Andrew Dunstan wrote:
> To avoid upsetting the cfbot, I published the code here:
> 

Neat. I wonder if we ought to include something like this into core, so that
we can more easily evaluate performance effects going forward.

Andres




Re: Fix assertion in autovacuum worker

2023-11-28 Thread Andres Freund
Hi,

On 2023-11-28 16:05:16 -0600, Nathan Bossart wrote:
> On Tue, Nov 28, 2023 at 07:00:16PM +0100, David Geier wrote:
> > PostgreSQL hit the following assertion during error cleanup, after being OOM
> > in dsa_allocate0():
> > 
> > void dshash_detach(dshash_table *hash_table) {
> > ASSERT_NO_PARTITION_LOCKS_HELD_BY_ME(hash_table);
> > 
> > called from pgstat_shutdown_hook(), called from shmem_exit(), called from
> > proc_exit(), called from the exception handler.
> 
> Nice find.

+1


> > AutoVacWorkerMain() -> pgstat_report_autovac() -> pgstat_get_entry_ref_locked()
> > -> pgstat_get_entry_ref() -> dshash_find_or_insert() -> resize(). resize() locks
> > all partitions so the hash table can safely be resized. Then it calls
> > dsa_allocate0(). If dsa_allocate0() fails to allocate, it errors out. The
> > exception handler calls proc_exit() which normally calls LWLockReleaseAll()
> > via AbortTransaction() but only if there's an active transaction. However,
> > pgstat_report_autovac() runs before a transaction got started and hence
> > LWLockReleaseAll() doesn't run before pgstat_shutdown_hook() is called.
> 
> From a glance, it looks to me like the problem is that pgstat_shutdown_hook
> is registered as a before_shmem_exit callback, while ProcKill is registered
> as an on_shmem_exit callback.

That's required, as pgstat_shutdown_hook() needs to acquire lwlocks, which you
can't after ProcKill(). It's also not unique to pgstat, several other
before_shmem_exit() callbacks acquire lwlocks (e.g. AtProcExit_Twophase(),
ParallelWorkerShutdown(), do_pg_abort_backup(), the first three uses of
before_shmem_exit when git grepping) - which makes sense, they are presumably
before_shmem_exit() because they need to manage shared state, which often
needs locks.

In normal backends this is fine-ish, because ShutdownPostgres() is registered
very late (and thus is called early in the shutdown sequence), and the
AbortOutOfAnyTransaction() contained therein indirectly calls
LWLockReleaseAll() and very little happens outside of the transaction context.
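
(To illustrate the ordering constraint for anyone following along, here
is a toy standalone model -- not the real implementation -- of the two
LIFO callback stacks involved: everything registered with
before_shmem_exit() runs before anything registered with
on_shmem_exit(), which is why a callback that still needs LWLocks must
be on the "before" stack:)

#include <stdio.h>

#define MAX_CB 8

typedef void (*cb_t) (void);

static cb_t before_cbs[MAX_CB];
static cb_t on_cbs[MAX_CB];
static int	n_before;
static int	n_on;

static void before_shmem_exit(cb_t f) { before_cbs[n_before++] = f; }
static void on_shmem_exit(cb_t f) { on_cbs[n_on++] = f; }

static void
shmem_exit(void)
{
	while (n_before > 0)
		before_cbs[--n_before]();	/* may still take LWLocks here */
	while (n_on > 0)
		on_cbs[--n_on]();		/* ProcKill() lives on this stack */
}

static void pgstat_shutdown(void) { puts("pgstat shutdown (needs LWLocks)"); }
static void proc_kill(void) { puts("ProcKill (LWLocks unusable afterwards)"); }

int
main(void)
{
	on_shmem_exit(proc_kill);
	before_shmem_exit(pgstat_shutdown);
	shmem_exit();				/* prints pgstat shutdown, then ProcKill */
	return 0;
}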


> I would expect your patch to fix this particular issue, but I'm wondering
> whether there's a bigger problem here.

Yes, there is - our subsystem initialization, shutdown, error recovery
infrastructure is a mess.  We've interwoven transaction handling far too
tightly with error handling, the order of subsystem initialization is basically
random and differs between operating systems (due to EXEC_BACKEND) and "mode"
of execution (the order differs when using single user mode) and we've
distributed error recovery into ~10 places (all the sigsetjmp()s in backend
code, xact.c, and a few other places like WalSndErrorCleanup()).

Greetings,

Andres Freund




Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

2023-11-28 Thread Peter Geoghegan
On Tue, Nov 28, 2023 at 9:19 AM Peter Geoghegan  wrote:
> > I'm not convinced this is a problem we have to solve. It's possible it
> > only affects cases that are implausible in practice (the script forces a
> > particular scan type, and maybe it would not be picked in practice). But
> > maybe it's fixable ...
>
> I would expect the patch to do quite well (relative to what is
> actually possible) on cases like the two extremes that I've focussed
> on so far. It seems possible that it will do less well on cases that
> are somewhere in the middle (that also have lots of distinct values on
> each page).

Actually, I think that it's more likely that the problems that you saw
are related to low cardinality data, which seems like it might not be
a great fit for the heuristics that the patch uses to decide whether
to continue the ongoing primitive index scan, or start a new one
instead. I'm referring to the heuristics I describe here:

https://postgr.es/m/CAH2-WzmTHoCsOmSgLg=yyft9loertuckxyg2gzn+28pzonf...@mail.gmail.com

The patch itself discusses these heuristics in a large comment block
after the point that _bt_checkkeys() calls _bt_advance_array_keys().

I hardly paid any attention to low cardinality data in my performance
validation work -- it was almost always indexes that had few or no
duplicates (just pgbench_accounts if we're talking pure stress-tests),
just because those are more complicated, and so seemed more important.
I'm not quite prepared to say that there is definitely a problem here,
right this minute, but if there was then it wouldn't be terribly
surprising (the problems are usually wherever it is that I didn't look
for them).

Attached is a sample of my debug instrumentation for one such query,
based on running the test script that Tomas posted -- thanks for
writing this script, Tomas (I'll use it as the basis for some of my
own performance validation work going forward). I don't mind sharing
the patch that outputs this stuff if anybody is interested (it's kind
of a monstrosity, so I'm disinclined to post it with the patch until I
have a reason). Even without this instrumentation, you can get some
idea of the kinds of issues I'm talking about just by viewing EXPLAIN
ANALYZE output for a bitmap index scan -- that breaks out the index
page accesses separately, which is a number that we should expect to
remain lower than what the master branch shows in approximately all
cases.

While I still think that we need heuristics that apply speculative
criteria to decide whether or not to go to the next page directly
(when we have a choice at all), that doesn't mean that the v7
heuristics can't be improved on, with a little more thought. It's a
bit tricky, since we're probably also benefiting from the same
heuristics all the time -- probably even for this same test case. We
do lose against the master branch, on balance, and by enough to
concern me, though. (I don't want to promise that it'll never happen
at all, but it should be very limited, which this wasn't.)

I didn't bother to ascertain how much longer it takes to execute the
query, since that question seems rather beside the point. The
important thing to me is whether or not this behavior actually makes
sense, all things considered, and what exactly can be done about it if
it doesn't make sense. I will need to think about this some more. This
is just a status update.

Thanks
-- 
Peter Geoghegan


index_walk_dump.txt.bz2
Description: BZip2 compressed data


Re: [HACKERS] Changing references of password encryption to hashing

2023-11-28 Thread Robert Haas
On Tue, Nov 28, 2023 at 12:24 PM Stephen Frost  wrote:
> I don’t know what they’re doing now, as you don’t say, and so I really 
> couldn’t say if ldap is better or worse for them. In some cases, sure, 
> perhaps ldap is better than … something else,

That's EXACTLY right. You can't say whether LDAP is better or worse in
every scenario. And therefore you should not be proposing to remove
it.

>> I think that is, to borrow a phrase from Tom, arrant nonsense. Sure,
>> MD5 authentication has a pass-the-hash vulnerability, and that sucks.
>
> So, given that we all agree with the CVE-worthy issue that exists with every 
> release where we include md5 auth, we should be applying for a CVE prior to 
> each release, no?

You know, I said in my previous email that you were so sure that you
were right that there was no point in trying to have a serious
discussion, and I can't really see how you could have proved that
point more thoroughly than you did here. You twisted my words around
to make it seem like I was agreeing with your point when you know full
well that I was doing the exact opposite of that.

Please don't do that.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: [PATCH] ltree hash functions

2023-11-28 Thread Tommy Pavlicek
On Thu, Jul 6, 2023 at 2:18 AM Daniel Gustafsson  wrote:
>
> > On 19 Jun 2023, at 11:18, Tommy Pavlicek  wrote:
>
> > Tommy, are you interested in extending ALTER OPERATOR to allow this,
> > which would also allow fixing the ltree operator?
> >
> > Yes, I can do that. I took a look over the code and email thread and it 
> > seems like it should be relatively straight forward. I'll put a patch 
> > together for that and then update this patch to alter the operator.
>
> Did you have a chance to look at this for an updated patch for this 
> commitfest?

I finally had a chance to look at this and I've updated the patch to
alter the = operator to enable hash joins.

This is ready to be looked at now.

Is there anything I need to do to move this forward?

Cheers,
Tommy


0001-ltree-hash-v2.patch
Description: Binary data


Re: Fix assertion in autovacuum worker

2023-11-28 Thread Nathan Bossart
On Tue, Nov 28, 2023 at 07:00:16PM +0100, David Geier wrote:
> PostgreSQL hit the following assertion during error cleanup, after being OOM
> in dsa_allocate0():
> 
> void dshash_detach(dshash_table *hash_table) {
> ASSERT_NO_PARTITION_LOCKS_HELD_BY_ME(hash_table);
> 
> called from pgstat_shutdown_hook(), called from shmem_exit(), called from
> proc_exit(), called from the exception handler.

Nice find.

> AutoVacWorkerMain() pgstat_report_autovac() pgstat_get_entry_ref_locked()
> pgstat_get_entry_ref() dshash_find_or_insert() resize(). resize() locks all
> partitions so the hash table can safely be resized. Then it calls
> dsa_allocate0(). If dsa_allocate0() fails to allocate, it errors out. The
> exception handler calls proc_exit() which normally calls LWLockReleaseAll()
> via AbortTransaction() but only if there's an active transaction. However,
> pgstat_report_autovac() runs before a transaction got started and hence
> LWLockReleaseAll() doesn't run before pgstat_shutdown_hook() is called.

From a glance, it looks to me like the problem is that pgstat_shutdown_hook
is registered as a before_shmem_exit callback, while ProcKill is registered
as an on_shmem_exit callback.  However, IIUC even moving them to the same
list wouldn't be sufficient because pgstat_shutdown_hook is registered
after ProcKill, and the code that calls the callbacks walks backwards
through the list.
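
To make the ordering hazard concrete, here is a minimal sketch of the LIFO
callback pattern being described -- hypothetical names, not the actual ipc.c
code: callbacks are invoked in reverse registration order, so a callback
registered later runs earlier at exit time.

typedef void (*exit_callback) (int code, void *arg);

typedef struct
{
	exit_callback function;
	void	   *arg;
} ExitCallbackItem;

static ExitCallbackItem callback_list[32];
static int	callback_count = 0;

static void
register_exit_callback(exit_callback function, void *arg)
{
	callback_list[callback_count].function = function;
	callback_list[callback_count].arg = arg;
	callback_count++;
}

static void
run_exit_callbacks(int code)
{
	/* walk backwards: the last callback registered is the first one run */
	while (--callback_count >= 0)
		callback_list[callback_count].function(code,
											   callback_list[callback_count].arg);
}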

I would expect your patch to fix this particular issue, but I'm wondering
whether there's a bigger problem here.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: common signal handler protection

2023-11-28 Thread Nathan Bossart
Here is a new patch set with feedback addressed.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
From e4bea5353c2685457545b67396095e9b96156982 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Tue, 28 Nov 2023 14:58:20 -0600
Subject: [PATCH v3 1/3] Check that MyProcPid == getpid() in all signal
 handlers.

In commit 97550c0711, we added a similar check to the SIGTERM
handler for the startup process.  This commit adds this check to
all signal handlers installed with pqsignal().  This is done by
using a wrapper function that performs the check before calling the
actual handler.

The hope is that this will offer more general protection against
child processes of Postgres backends inadvertently modifying shared
memory due to inherited signal handlers.  Another potential
follow-up improvement is to use this wrapper handler function to
restore errno instead of relying on each individual handler
function to do so.

This commit makes the changes in commit 97550c0711 obsolete but
leaves reverting it for a follow-up commit.

Reviewed-by: Andres Freund
Discussion: https://postgr.es/m/20231121212008.GA3742740%40nathanxps13
---
 src/port/pqsignal.c | 80 +++--
 1 file changed, 78 insertions(+), 2 deletions(-)

diff --git a/src/port/pqsignal.c b/src/port/pqsignal.c
index 23e7f685c1..a7e9dc49fc 100644
--- a/src/port/pqsignal.c
+++ b/src/port/pqsignal.c
@@ -46,23 +46,93 @@
 #include "c.h"
 
 #include <signal.h>
+#ifndef FRONTEND
+#include <unistd.h>
+#endif
 
 #ifndef FRONTEND
 #include "libpq/pqsignal.h"
+#include "miscadmin.h"
+#endif
+
+#ifdef PG_SIGNAL_COUNT			/* Windows */
+#define PG_NSIG (PG_SIGNAL_COUNT)
+#elif defined(NSIG)
+#define PG_NSIG (NSIG)
+#else
+#define PG_NSIG (64)			/* XXX: wild guess */
 #endif
 
+/* Check a couple of common signals to make sure PG_NSIG is accurate. */
+StaticAssertDecl(SIGUSR2 < PG_NSIG, "SIGUSR2 >= PG_NSIG");
+StaticAssertDecl(SIGHUP < PG_NSIG, "SIGHUP >= PG_NSIG");
+StaticAssertDecl(SIGTERM < PG_NSIG, "SIGTERM >= PG_NSIG");
+StaticAssertDecl(SIGALRM < PG_NSIG, "SIGALRM >= PG_NSIG");
+
+static volatile pqsigfunc pqsignal_handlers[PG_NSIG];
+
+/*
+ * Except when called with SIG_IGN or SIG_DFL, pqsignal() sets up this function
+ * as the handler for all signals.  This wrapper handler function checks that
+ * it is called within a process that the server knows about (i.e., any process
+ * that has called InitProcessGlobals(), such as a client backend), and not a
+ * child process forked by system(3), etc.  This check ensures that such child
+ * processes do not modify shared memory, which is often detrimental.  If the
+ * check succeeds, the function originally provided to pqsignal() is called.
+ * Otherwise, the default signal handler is installed and then called.
+ */
+static void
+wrapper_handler(SIGNAL_ARGS)
+{
+#ifndef FRONTEND
+	Assert(MyProcPid);
+
+	if (unlikely(MyProcPid != (int) getpid()))
+	{
+		pqsignal(postgres_signal_arg, SIG_DFL);
+		raise(postgres_signal_arg);
+		return;
+	}
+#endif
+
+	(*pqsignal_handlers[postgres_signal_arg]) (postgres_signal_arg);
+}
+
 /*
  * Set up a signal handler, with SA_RESTART, for signal "signo"
  *
  * Returns the previous handler.
+ *
+ * NB: If called within a signal handler, race conditions may lead to bogus
+ * return values.  You should either avoid calling this within signal handlers
+ * or ignore the return value.
+ *
+ * XXX: Since no in-tree callers use the return value, and there is little
+ * reason to do so, it would be nice if we could convert this to a void
+ * function instead of providing potentially-bogus return values.
+ * Unfortunately, that requires modifying the pqsignal() in legacy-pqsignal.c,
+ * which in turn requires an SONAME bump, which is probably not worth it.
  */
 pqsigfunc
 pqsignal(int signo, pqsigfunc func)
 {
+	pqsigfunc	orig_func = pqsignal_handlers[signo];	/* assumed atomic */
 #if !(defined(WIN32) && defined(FRONTEND))
 	struct sigaction act,
 oact;
+#else
+	pqsigfunc	ret;
+#endif
+
+	Assert(signo < PG_NSIG);
 
+	if (func != SIG_IGN && func != SIG_DFL)
+	{
+		pqsignal_handlers[signo] = func;	/* assumed atomic */
+		func = wrapper_handler;
+	}
+
+#if !(defined(WIN32) && defined(FRONTEND))
 	act.sa_handler = func;
 	sigemptyset(&act.sa_mask);
 	act.sa_flags = SA_RESTART;
@@ -72,9 +142,15 @@ pqsignal(int signo, pqsigfunc func)
 #endif
 	if (sigaction(signo, &act, &oact) < 0)
 		return SIG_ERR;
-	return oact.sa_handler;
+	else if (oact.sa_handler == wrapper_handler)
+		return orig_func;
+	else
+		return oact.sa_handler;
 #else
 	/* Forward to Windows native signal system. */
-	return signal(signo, func);
+	if ((ret = signal(signo, func)) == wrapper_handler)
+		return orig_func;
+	else
+		return ret;
 #endif
 }
-- 
2.25.1

From a506a4ae932936222554c8be75b290eac57824d5 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Tue, 28 Nov 2023 15:18:13 -0600
Subject: [PATCH v3 2/3] Centralize logic for restoring errno in signal
 handlers.

Presently, we rely on each individual signal 

Re: remaining sql/json patches

2023-11-28 Thread Andrew Dunstan



On 2023-11-28 Tu 15:49, Andrew Dunstan wrote:


On 2023-11-28 Tu 00:10, John Naylor wrote:
On Mon, Nov 27, 2023 at 8:57 PM Andrew Dunstan  
wrote:

Interesting. But inferring a speed effect from such changes is
difficult. I don't have a good idea about measuring parser speed, but a
tool to do that would be useful. Amit has made a start on such
measurements, but it's only a start. I'd prefer to have evidence rather
than speculation.

Tom shared this test a while back, and that's the one I've used in the
past. The downside for a micro-benchmark like that is that it can
monopolize the CPU cache. Cache misses in real world queries are
likely much more dominant.

https://www.postgresql.org/message-id/14616.1558560...@sss.pgh.pa.us




Cool, I took this and ran with it a bit. (See attached) Here are 
comparative timings for 1000 iterations parsing most of the 
information_schema.sql, all the way back to 9.3:



 REL9_3_STABLE 
Time: 3998.701 ms
 REL9_4_STABLE 
Time: 3987.596 ms
 REL9_5_STABLE 
Time: 4129.049 ms
 REL9_6_STABLE 
Time: 4145.777 ms
 REL_10_STABLE 
Time: 4140.927 ms (00:04.141)
 REL_11_STABLE 
Time: 4145.078 ms (00:04.145)
 REL_12_STABLE 
Time: 3528.625 ms (00:03.529)
 REL_13_STABLE 
Time: 3356.067 ms (00:03.356)
 REL_14_STABLE 
Time: 3401.406 ms (00:03.401)
 REL_15_STABLE 
Time: 3372.491 ms (00:03.372)
 REL_16_STABLE 
Time: 1654.056 ms (00:01.654)
 HEAD 
Time: 1614.949 ms (00:01.615)


This is fairly repeatable.

The first good news is that the parser is pretty fast. Even 4ms to 
parse almost all the information schema setup is pretty good.


The second piece of good news is that recent modifications have vastly 
improved the speed. So even if the changes from the SQL/JSON patches 
eat up a bit of that gain, I think we're in good shape.


In a few days I'll re-run the test with the SQL/JSON patches applied.




To avoid upsetting the cfbot, I published the code here: 




cheers


andrew



Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: remaining sql/json patches

2023-11-28 Thread Andrew Dunstan



On 2023-11-28 Tu 00:10, John Naylor wrote:

On Mon, Nov 27, 2023 at 8:57 PM Andrew Dunstan  wrote:

Interesting. But inferring a speed effect from such changes is
difficult. I don't have a good idea about measuring parser speed, but a
tool to do that would be useful. Amit has made a start on such
measurements, but it's only a start. I'd prefer to have evidence rather
than speculation.

Tom shared this test a while back, and that's the one I've used in the
past. The downside for a micro-benchmark like that is that it can
monopolize the CPU cache. Cache misses in real world queries are
likely much more dominant.

https://www.postgresql.org/message-id/14616.1558560...@sss.pgh.pa.us




Cool, I took this and ran with it a bit. (See attached) Here are 
comparative timings for 1000 iterations parsing most of the 
information_schema.sql, all the way back to 9.3:



 REL9_3_STABLE 
Time: 3998.701 ms
 REL9_4_STABLE 
Time: 3987.596 ms
 REL9_5_STABLE 
Time: 4129.049 ms
 REL9_6_STABLE 
Time: 4145.777 ms
 REL_10_STABLE 
Time: 4140.927 ms (00:04.141)
 REL_11_STABLE 
Time: 4145.078 ms (00:04.145)
 REL_12_STABLE 
Time: 3528.625 ms (00:03.529)
 REL_13_STABLE 
Time: 3356.067 ms (00:03.356)
 REL_14_STABLE 
Time: 3401.406 ms (00:03.401)
 REL_15_STABLE 
Time: 3372.491 ms (00:03.372)
 REL_16_STABLE 
Time: 1654.056 ms (00:01.654)
 HEAD 
Time: 1614.949 ms (00:01.615)


This is fairly repeatable.

The first good news is that the parser is pretty fast. Even 4ms to parse 
almost all the information schema setup is pretty good.


The second piece of good news is that recent modifications have vastly 
improved the speed. So even if the changes from the SQL/JSON patches eat 
up a bit of that gain, I think we're in good shape.


In a few days I'll re-run the test with the SQL/JSON patches applied.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Testing autovacuum wraparound (including failsafe)

2023-11-28 Thread Masahiko Sawada
On Tue, Nov 28, 2023 at 7:16 PM Daniel Gustafsson  wrote:
>
> > On 28 Nov 2023, at 03:00, Masahiko Sawada  wrote:
> >
> > On Mon, Nov 27, 2023 at 10:40 PM Daniel Gustafsson  wrote:
> >>
> >>> On 27 Nov 2023, at 14:06, Masahiko Sawada  wrote:
> >>
> >>> Is it true that we can modify the timeout after creating
> >>> BackgroundPsql object? If so, it seems we don't need to introduce the
> >>> new timeout argument. But how?
> >>
> >> I can't remember if that's leftovers that incorrectly remains from an 
> >> earlier
> >> version of the BackgroundPsql work, or if it's a very bad explanation of
> >> ->set_query_timer_restart().  The timeout will use the timeout_default 
> >> value
> >> and that cannot be overridden, it can only be reset per query.
> >
> > Thank you for confirming this. I see there is the same problem also in
> > interactive_psql(). So I've attached the 0001 patch to fix these
> > documentation issues.
>
> -A default timeout of $PostgreSQL::Test::Utils::timeout_default is set up,
> -which can be modified later.
> +A default timeout of $PostgreSQL::Test::Utils::timeout_default is set up.
>
> Since it cannot be modified, I think we should just say "A timeout of .." and
> not call it a default timeout.  This obviously only matters for the backpatch 
> since
> the sentence is removed in 0002.

Agreed.

I've attached new version patches (0002 and 0003 are unchanged except
for the commit message). I'll push them, barring any objections.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


v7-0001-Fix-wrong-description-of-BackgroundPsql-s-timeout.patch
Description: Binary data


v7-0003-Add-tests-for-XID-wraparound.patch
Description: Binary data


v7-0002-Add-option-to-specify-timeout-seconds-to-Backgrou.patch
Description: Binary data


Re: Catalog domain not-null constraints

2023-11-28 Thread Peter Eisentraut

On 23.11.23 14:13, Aleksander Alekseev wrote:

=# create domain connotnull1 integer;
=# create domain connotnull2 integer;
=# alter domain connotnull1 add not null value;
=# alter domain connotnull2 set not null;
=# \dD
ERROR:  unexpected null value in cached tuple for catalog
pg_constraint column conkey


Yeah, for domain not-null constraints pg_constraint.conkey is indeed 
null.  Should we put something in there?


Attached is an updated patch that avoids the error by taking a separate 
code path for domain constraints in ruleutils.c.  But maybe there is 
another way to arrange this.

From c297c458766a0e1ee65408d3ced469f32cf5e7d8 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Tue, 28 Nov 2023 20:38:16 +0100
Subject: [PATCH v2 2/2] Catalog domain not-null constraints

This applies the explicit catalog representation of not-null
constraints introduced by b0e96f3119 for table constraints also to
domain not-null constraints.

TODO: catversion

Discussion: 
https://www.postgresql.org/message-id/flat/9ec24d7b-633d-463a-84c6-7acff769c9e8%40eisentraut.org
---
 src/backend/catalog/information_schema.sql |   2 +-
 src/backend/catalog/pg_constraint.c|  40 +++
 src/backend/commands/typecmds.c| 313 +++--
 src/backend/utils/adt/ruleutils.c  |   8 +
 src/backend/utils/cache/typcache.c |   2 +-
 src/bin/pg_dump/pg_dump.c  |   2 +-
 src/include/catalog/pg_constraint.h|   1 +
 src/test/regress/expected/domain.out   |  49 +++-
 src/test/regress/sql/domain.sql|  26 ++
 9 files changed, 339 insertions(+), 104 deletions(-)

diff --git a/src/backend/catalog/information_schema.sql 
b/src/backend/catalog/information_schema.sql
index 10b34c3c5b..95a1b34560 100644
--- a/src/backend/catalog/information_schema.sql
+++ b/src/backend/catalog/information_schema.sql
@@ -448,7 +448,7 @@ CREATE VIEW check_constraints AS
 SELECT current_database()::information_schema.sql_identifier AS 
constraint_catalog,
rs.nspname::information_schema.sql_identifier AS constraint_schema,
con.conname::information_schema.sql_identifier AS constraint_name,
-   pg_catalog.format('%s IS NOT NULL', 
at.attname)::information_schema.character_data AS check_clause
+   pg_catalog.format('%s IS NOT NULL', coalesce(at.attname, 
'VALUE'))::information_schema.character_data AS check_clause
  FROM pg_constraint con
 LEFT JOIN pg_namespace rs ON rs.oid = con.connamespace
 LEFT JOIN pg_class c ON c.oid = con.conrelid
diff --git a/src/backend/catalog/pg_constraint.c 
b/src/backend/catalog/pg_constraint.c
index e9d4d6006e..4f1a5e1c84 100644
--- a/src/backend/catalog/pg_constraint.c
+++ b/src/backend/catalog/pg_constraint.c
@@ -629,6 +629,46 @@ findNotNullConstraint(Oid relid, const char *colname)
return findNotNullConstraintAttnum(relid, attnum);
 }
 
+HeapTuple
+findDomainNotNullConstraint(Oid typid)
+{
+   Relationpg_constraint;
+   HeapTuple   conTup,
+   retval = NULL;
+   SysScanDesc scan;
+   ScanKeyData key;
+
+   pg_constraint = table_open(ConstraintRelationId, AccessShareLock);
+   ScanKeyInit(&key,
+   Anum_pg_constraint_contypid,
+   BTEqualStrategyNumber, F_OIDEQ,
+   ObjectIdGetDatum(typid));
+   scan = systable_beginscan(pg_constraint, 
ConstraintRelidTypidNameIndexId,
+ true, NULL, 1, &key);
+
+   while (HeapTupleIsValid(conTup = systable_getnext(scan)))
+   {
+   Form_pg_constraint con = (Form_pg_constraint) GETSTRUCT(conTup);
+
+   /*
+* We're looking for a NOTNULL constraint that's marked 
validated.
+*/
+   if (con->contype != CONSTRAINT_NOTNULL)
+   continue;
+   if (!con->convalidated)
+   continue;
+
+   /* Found it */
+   retval = heap_copytuple(conTup);
+   break;
+   }
+
+   systable_endscan(scan);
+   table_close(pg_constraint, AccessShareLock);
+
+   return retval;
+}
+
 /*
  * Given a pg_constraint tuple for a not-null constraint, return the column
  * number it is for.
diff --git a/src/backend/commands/typecmds.c b/src/backend/commands/typecmds.c
index aaf9728697..57389be9ed 100644
--- a/src/backend/commands/typecmds.c
+++ b/src/backend/commands/typecmds.c
@@ -126,15 +126,19 @@ static Oid findTypeSubscriptingFunction(List 
*procname, Oid typeOid);
 static Oid findRangeSubOpclass(List *opcname, Oid subtype);
 static Oid findRangeCanonicalFunction(List *procname, Oid typeOid);
 static Oid findRangeSubtypeDiffFunction(List *procname, Oid subtype);
-static void validateDomainConstraint(Oid domainoid, char *ccbin);
+static void validateDomainCheckConstraint(Oid domainoid, char 

Re: Memory consumed by paths during partitionwise join planning

2023-11-28 Thread David Rowley
On Fri, 28 Jul 2023 at 02:06, Ashutosh Bapat
 wrote:
> Table 1: Join between unpartitioned tables
> Number of tables | without patch  | with patch | % reduction |
> being joined     |                |            |             |
> --------------------------------------------------------------
>                2 |  29.0 KiB      |   28.9 KiB |       0.66% |
>                3 |  79.1 KiB      |   76.7 KiB |       3.03% |
>                4 | 208.6 KiB      |  198.2 KiB |       4.97% |
>                5 | 561.6 KiB      |  526.3 KiB |       6.28% |

I have mostly just random thoughts and some questions...

Have you done any analysis of the node types that are consuming all
that memory? Are Path and subclasses of Path really that dominant in
this?  The memory savings aren't that impressive and it makes me
wonder if this is worth the trouble.

What's the goal of the memory reduction work?  If it's for
performance, does this increase performance?  Tracking refcounts of
Paths isn't free.

Why do your refcounts start at 1?  This seems weird:

+ /* Free the path if none references it. */
+ if (path->ref_count == 1)

Does ref_count really need to be an int?  There's a 16-bit hole you
could use between param_info and parallel_aware.  I doubt you'd need
32 bits of space for this.

I agree that it would be nice to get rid of the "if (!IsA(old_path,
IndexPath))" hack in add_path() and it seems like this idea could
allow that. However, I think if we were to have something like this
then you'd want all the logic to be neatly contained inside pathnode.c
without adding any burden in other areas to track or check Path
refcounts.

I imagine the patch would be neater if you added some kind of Path
traversal function akin to expression_tree_walker() that allows you to
visit each subpath recursively and run a callback function on each.
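
To make that concrete, here is a rough, hypothetical sketch of such a walker
(only a few path types shown; a real version inside pathnode.c would have to
handle every node that carries subpaths):

typedef bool (*path_walker_callback) (Path *path, void *context);

static bool
path_tree_walker(Path *path, path_walker_callback callback, void *context)
{
	ListCell   *lc;

	if (path == NULL)
		return false;
	if (callback(path, context))
		return true;

	switch (nodeTag(path))
	{
		case T_MaterialPath:
			return path_tree_walker(((MaterialPath *) path)->subpath,
									callback, context);
		case T_SortPath:
			return path_tree_walker(((SortPath *) path)->subpath,
									callback, context);
		case T_NestPath:
		case T_MergePath:
		case T_HashPath:
			if (path_tree_walker(((JoinPath *) path)->outerjoinpath,
								 callback, context))
				return true;
			return path_tree_walker(((JoinPath *) path)->innerjoinpath,
									callback, context);
		case T_AppendPath:
			foreach(lc, ((AppendPath *) path)->subpaths)
			{
				if (path_tree_walker((Path *) lfirst(lc), callback, context))
					return true;
			}
			return false;
		default:
			/* plain scan paths and anything not handled above */
			return false;
	}
}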

David




Re: [PoC] pg_upgrade: allow to upgrade publisher node

2023-11-28 Thread Masahiko Sawada
On Tue, Nov 28, 2023 at 10:58 PM Hayato Kuroda (Fujitsu)
 wrote:
>
> Dear Sawada-san,
>
> > On Tue, Nov 28, 2023 at 7:04 PM Hayato Kuroda (Fujitsu)
> >  wrote:
> > >
> > >
> > > Yeah, we followed binary_upgrade_create_empty_extension(). Also, we set as
> > > un-strict to keep a caller function simpler.
> > >
> > > Currently get_old_cluster_logical_slot_infos() executes a query and it 
> > > contains
> > > binary_upgrade_logical_slot_has_caught_up(). In pg_upgrade layer, we
> > assumed
> > > either true or false is returned.
> > >
> > > But if proisstrict is changed true, we must handle the case when NULL is
> > returned.
> > > It is small but backseat operation.
> >
> > Which cases are you concerned pg_upgrade could pass NULL to
> > binary_upgrade_logical_slot_has_caught_up()?
>
> Actually, we do not expect it to input NULL. IIUC all slots have a
> slot_name, and the subquery uses that name. But will it be kept forever? I think we
> can avoid any risk.
>
> > I've not tested it yet but even if it returns NULL, perhaps
> > get_old_cluster_logical_slot_infos() would still set curr->caught_up
> > to false, no?
>
> Hmm. I checked the C99 specification [1] of strcmp, but it does not define the
> case when NULL is input. So it depends on the implementation.

I think PQgetvalue() returns an empty string if the result value is null.
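
For illustration, a caller that wants to be explicit about NULLs could test
PQgetisnull() before comparing -- a small sketch, with the column index
(i_caught_up) invented for illustration:

	if (PQgetisnull(res, 0, i_caught_up))
		curr->caught_up = false;	/* treat NULL as "not caught up" */
	else
		curr->caught_up = (strcmp(PQgetvalue(res, 0, i_caught_up),
								  "t") == 0);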

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Properly pathify the union planner

2023-11-28 Thread David Rowley
On Fri, 24 Nov 2023 at 18:43, Ashutosh Bapat
 wrote:
>
> On Fri, Nov 24, 2023 at 3:59 AM David Rowley  wrote:
> > For now, as a temporary fix, I've just #ifdef'd out the code in
> > add_path() that's pfreeing the old path.  I have drafted a patch that
> > refcounts Paths, but I'm unsure if that's the correct solution as I'm
> > only maintaining the refcounts in add_path() and add_partial_path(). I
> > think a true correct solution would bump the refcount when a path is
> > used as some other path's subpath.  That would mean having to
> > recursively pfree paths up until we find one with a refcount>0. Seems
> > a bit expensive for add_path() to do.
>
> Please find my proposal to refcount paths at [1]. I did that to reduce
> the memory consumed by partitionwise joins. I remember another thread
> where freeing a path that was referenced by upper sort path created
> minor debugging problem. [2]. I paused my work on my proposal since
> there didn't seem enough justification. But it looks like the
> requirement is coming up repeatedly. I am willing to resume my work if
> it's needed. The email lists next TODOs. As to making the add_path()
> expensive, I didn't find any noticeable impact on planning time.

I missed that thread. Thanks for pointing it out.

I skim read your patch and I see it does seem to have the workings for
tracking refcounts when the path is a subpath of another path.  I
imagine that would allow the undocumented hack that is "if
(!IsA(old_path, IndexPath))" in add_path() to disappear.

I wondered if the problem of pfreeing paths that are in the pathlist
of another relation could be fixed in another way.  If we have an
AdoptedPath path type that just inherits the costs from its single
subpath and we wrap a Path up in one of these before we do add_path()
a Path which is not parented by the relation we're adding the path to,
since we don't recursively pfree() Paths in add_path(), we'd only ever
pfree the AdoptedPath rather than pfreeing a Path that directly exists
in another relation's pathlist.

Another simpler option would be to just not pfree the Path if the Path's
parent is not the add_path rel.
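
A sketch of that check, assuming parent_rel is the rel passed to add_path()
(the IsA() test mirrors the existing IndexPath special case):

	/* only reclaim paths owned by the rel we're adding to */
	if (old_path->parent == parent_rel && !IsA(old_path, IndexPath))
		pfree(old_path);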

David

> [1] 
> https://www.postgresql.org/message-id/CAExHW5tUcVsBkq9qT%3DL5vYz4e-cwQNw%3DKAGJrtSyzOp3F%3DXacA%40mail.gmail.com
> [2] 
> https://www.postgresql.org/message-id/CAM2%2B6%3DUC1mcVtM0Y_LEMBEGHTM58HEkqHPn7vau_V_YfuZjEGg%40mail.gmail.com




Re: Missing docs on AT TIME ZONE precedence?

2023-11-28 Thread Tom Lane
Andrew Dunstan  writes:
> On 2023-11-28 Tu 10:27, Tom Lane wrote:
>> OK.  How about rewriting that first para like this?

> LGTM. Thanks.

Thanks for reviewing.  While checking things over one more time,
I noticed that there was an additional violation of this precept,
dating back to long before we understood the hazards: SET is
given its own priority, when it could perfectly well share that
of IDENT.  I adjusted that and pushed.

regards, tom lane




Re: May be BUG. Periodic burst growth of the checkpoint_req counter on replica.

2023-11-28 Thread Alexander Korotkov
Hi, Anton!

On Thu, Mar 16, 2023 at 2:39 PM Anton A. Melnikov  wrote:
> On 15.03.2023 21:29, Gregory Stark (as CFM) wrote:
>
> > These patches that are "Needs Review" and have received no comments at
> > all since before March 1st are these. If your patch is amongst this
> > list I would suggest any of:
> >
> > 1) Move it yourself to the next CF (or withdraw it)
> > 2) Post to the list with any pending questions asking for specific
> > feedback -- it's much more likely to get feedback than just a generic
> > "here's a patch plz review"...
> > 3) Mark it Ready for Committer and possibly post explaining the
> > resolution to any earlier questions to make it easier for a committer
> > to understand the state
> >
>
> There are two different patch variants and some discussion expected.
> So moved them to the next CF.

Thank you for your detailed observation regarding the spike growth of
the checkpoint_req counter on the replica following a large insert
operation on the master.  After reviewing your description and the
log, I agree with Kyotaro Horiguchi that the behavior you've outlined,
though potentially perceived as annoying, does not constitute a bug in
PostgreSQL.

After examining the second patch
("v2-0001-Add-restartpoint-stats.patch"), it appears that adding
additional statistics as outlined in the patch is the most suitable
approach to address the concerns raised. This solution provides more
visibility into the system's behavior without altering its core
mechanics. However, it's essential that this additional functionality
is accompanied by comprehensive documentation to ensure clear
understanding and ease of use by the PostgreSQL community.

Please consider expanding the documentation to include detailed
explanations of the new statistics and their implications in various
scenarios.

--
Regards,
Alexander Korotkov




Re: Python installation selection in Meson

2023-11-28 Thread Andres Freund
Hi,

On 2023-11-28 19:02:42 +0100, Peter Eisentraut wrote:
> I noticed that under meson, the selection of the Python installation using
> the 'PYTHON' option doesn't work completely.  The 'PYTHON' option determines
> the Python binary that will be used to call the various build support
> programs.  But it doesn't affect the Python installation used for PL/Python.
> For that, we need to pass the program determined by the 'PYTHON' option back
> into the find_installation() routine of the python module.  (Otherwise,
> find_installation() will just search for an installation on its own.)  See
> attached patch.  I ran this through Cirrus, seems to work.

Makes sense!

Greetings,

Andres Freund




Re: XID formatting and SLRU refactorings

2023-11-28 Thread Alexander Korotkov
Hi!

On Tue, Nov 28, 2023 at 2:06 PM Aleksander Alekseev
 wrote:
>
> >> > I think what's done in patch 0001 is just an extension of existing
> >> > logic and moving it into separate function.
> >>
> >> That's right. I'm arguing that now is a good time to clean it up.
> >>
> >> I won't insist if Alexander prefers to commit it as it is, though. But
> >> let's at least explain how this works in the comment, instead of the XXX.
> >
> > I agree with you that would be good to add a comment instead of XXX and 
> > commit.
>
> +1
>
> One could argue that getting rid of short filenames entirely in the
> long term (i.e. always long_segment_names == true) could be a better
> strategy. Maybe it's not but I believe this should be discussed
> separately from the patchset under question.


Heikki, thank you for catching this.

This mess with file name formats has already lasted quite a while.  I don't
think we should rush to unify it, given that we're going to change it in the
near future anyway.

Please, find the revised patchset with relevant comment.

--
Regards,
Alexander Korotkov


0003-Make-use-FullTransactionId-in-2PC-filenames-v62.patch
Description: Binary data


0002-Use-larger-segment-file-names-for-pg_notify-v62.patch
Description: Binary data


0004-Add-SLRU-tests-for-64-bit-page-case-v62.patch
Description: Binary data


0001-Index-SLRUs-by-64-bit-integers-rather-than-by-32-v62.patch
Description: Binary data


Python installation selection in Meson

2023-11-28 Thread Peter Eisentraut
I noticed that under meson, the selection of the Python installation 
using the 'PYTHON' option doesn't work completely.  The 'PYTHON' option 
determines the Python binary that will be used to call the various build 
support programs.  But it doesn't affect the Python installation used 
for PL/Python.  For that, we need to pass the program determined by the 
'PYTHON' option back into the find_installation() routine of the python 
module.  (Otherwise, find_installation() will just search for an 
installation on its own.)  See attached patch.  I ran this through 
Cirrus, seems to work.

From d83c43c1bed2a096549fada42951b346b269c551 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Tue, 28 Nov 2023 18:56:18 +0100
Subject: [PATCH] Meson python detection

When we look for the Python installation using the meson python
module, we should make it use the python program previously determined
by the 'PYTHON' option.  Otherwise, it will just use its own search
and the 'PYTHON' option won't affect it.
---
 meson.build | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/meson.build b/meson.build
index ee58ee7a06..24fddc677b 100644
--- a/meson.build
+++ b/meson.build
@@ -1063,7 +1063,7 @@ pyopt = get_option('plpython')
 python3_dep = not_found_dep
 if not pyopt.disabled()
   pm = import('python')
-  python3_inst = pm.find_installation(required: pyopt)
+  python3_inst = pm.find_installation(python.path(), required: pyopt)
   if python3_inst.found()
 python3_dep = python3_inst.dependency(embed: true, required: pyopt)
 # Remove this check after we depend on Meson >= 1.1.0
-- 
2.43.0



Fix assertion in autovacuum worker

2023-11-28 Thread David Geier

Hi hackers,

PostgreSQL hit the following assertion during error cleanup, after being 
OOM in dsa_allocate0():


void dshash_detach(dshash_table *hash_table) { 
ASSERT_NO_PARTITION_LOCKS_HELD_BY_ME(hash_table);


called from pgstat_shutdown_hook(), called from shmem_exit(), called 
from proc_exit(), called from the exception handler.


The partition locks got previously acquired by

AutoVacWorkerMain() pgstat_report_autovac() 
pgstat_get_entry_ref_locked() pgstat_get_entry_ref() 
dshash_find_or_insert() resize(). resize() locks all partitions so the 
hash table can safely be resized. Then it calls dsa_allocate0(). If 
dsa_allocate0() fails to allocate, it errors out. The exception handler 
calls proc_exit() which normally calls LWLockReleaseAll() via 
AbortTransaction() but only if there's an active transaction. However, 
pgstat_report_autovac() runs before a transaction got started and hence 
LWLockReleaseAll() doesn't run before pgstat_shutdown_hook() is called.


See attached patch for an attempt to fix this issue.

--
David Geier
(ServiceNow)
From 5580e3680b2211235e4bc2b5dcbfe6b4f5b8eee5 Mon Sep 17 00:00:00 2001
From: David Geier 
Date: Tue, 28 Nov 2023 18:52:46 +0100
Subject: [PATCH] Fix autovacuum cleanup on error

---
 src/backend/postmaster/autovacuum.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index f929b62e8a..5de55649d9 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -1584,6 +1584,9 @@ AutoVacWorkerMain(int argc, char *argv[])
 		/* Report the error to the server log */
 		EmitErrorReport();
 
+		/* Make sure all locks are released so assertions don't hit in at-exit callbacks */
+		LWLockReleaseAll();
+
 		/*
 		 * We can now go away.  Note that because we called InitProcess, a
 		 * callback was registered to do ProcKill, which will clean up
-- 
2.39.2



Re: Parallel CREATE INDEX for BRIN indexes

2023-11-28 Thread Tomas Vondra
On 11/28/23 16:39, Matthias van de Meent wrote:
> On Thu, 23 Nov 2023 at 14:35, Tomas Vondra
>  wrote:
>> On 11/23/23 13:33, Matthias van de Meent wrote:
>>> The union operator may leak (lots of) memory, so I think it makes
>>> sense to keep a context around that can be reset after we've extracted
>>> the merge result.
>>>
>>
>> But does the current code actually achieve that? It does create a "brin
>> union" context, but then it only does this:
>>
>> /* Use our own memory context to avoid retail pfree */
>> cxt = AllocSetContextCreate(CurrentMemoryContext,
>> "brin union",
>> ALLOCSET_DEFAULT_SIZES);
>> oldcxt = MemoryContextSwitchTo(cxt);
>> db = brin_deform_tuple(bdesc, b, NULL);
>> MemoryContextSwitchTo(oldcxt);
>>
>> Surely that does not limit the amount of memory used by the actual union
>> functions in any way?
> 
> Oh, yes, of course. For some reason I thought that covered the calls
> to the union operator function too, but it indeed only covers
> deserialization. I do think it is still worthwhile to not do the
> create/delete cycle, but won't hold the patch back for that.
> 

I think the union_tuples() changes are better left for a separate patch.

>>>> However, I don't think the number of union_tuples calls is likely to be
>>>> very high, especially for large tables. Because we split the table into
>>>> 2048 chunks, and then cap the chunk size by 8192. For large tables
>>>> (where this matters) we're likely close to 8192.
>>>
>>> I agree that the merging part of the index creation is the last part,
>>> and usually has no high impact on the total performance of the reindex
>>> operation, but in memory-constrained environments releasing and then
>>> requesting the same chunk of memory over and over again just isn't
>>> great.
>>
>> OK, I'll take a look at the scratch context you suggested.
>>
>> My point however was we won't actually do that very often, because on
>> large tables the BRIN ranges are likely smaller than the parallel scan
>> chunk size, so few overlaps. OTOH if the table is small, or if the BRIN
>> ranges are large, there'll be few of them.
> 
> That's true, so maybe I'm concerned about something that amounts to
> only marginal gains.
> 

However, after thinking about this a bit more, I think we actually do
need to do something about the memory management when merging tuples.
AFAIK the general assumption was that union_tuple() only runs for a
single range, and then the whole context gets freed. But the way the
merging was implemented, it all runs in a single context. And while a
single union_tuple() may not need a lot of memory, in total it may be
annoying. I just added a palloc(1MB) into union_tuples and ended up with
~160MB allocated in the PortalContext on just a 2GB table. In practice the
memory will grow more slowly, but not great :-/

The attached 0003 patch adds a memory context that's reset after
producing a merged BRIN tuple for each page range.
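
In outline it is the usual scratch-context pattern -- a simplified sketch,
not the exact patch code, with nranges standing in for the number of page
ranges to merge:

	MemoryContext scratch_cxt;
	MemoryContext oldcxt;

	scratch_cxt = AllocSetContextCreate(CurrentMemoryContext,
										"brin union scratch",
										ALLOCSET_DEFAULT_SIZES);

	for (int i = 0; i < nranges; i++)
	{
		oldcxt = MemoryContextSwitchTo(scratch_cxt);

		/* deform the tuples and call the union support proc here */

		MemoryContextSwitchTo(oldcxt);

		/* after copying the merged tuple out, drop the per-range garbage */
		MemoryContextReset(scratch_cxt);
	}

	MemoryContextDelete(scratch_cxt);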

> I noticed that the v4 patch doesn't yet update the documentation in
> indexam.sgml with am->amcanbuildparallel.

Should be fixed by 0002. I decided to add a simple note to ambuild(),
not sure if something more is needed.

> Once that is included and reviewed I think this will be ready, unless
> you want to address any of my comments upthread (that I marked with
> 'not in this patch') in this patch.
> 

Thanks. I believe the attached version addresses it. There's also 0004
with some indentation tweaks per pgindent.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From 219a451250cbbbc1235c60f4776c2ee6bc84432d Mon Sep 17 00:00:00 2001
From: Tomas Vondra 
Date: Tue, 7 Nov 2023 17:04:28 +0100
Subject: [PATCH v20231128 1/4] parallel CREATE INDEX for BRIN v3

---
 contrib/bloom/blutils.c   |   1 +
 src/backend/access/brin/brin.c| 866 +-
 src/backend/access/gin/ginutil.c  |   1 +
 src/backend/access/gist/gist.c|   1 +
 src/backend/access/hash/hash.c|   1 +
 src/backend/access/nbtree/nbtree.c|   1 +
 src/backend/access/spgist/spgutils.c  |   1 +
 src/backend/access/transam/parallel.c |   4 +
 src/backend/catalog/index.c   |   2 +-
 src/backend/utils/sort/tuplesortvariants.c| 207 +
 src/include/access/amapi.h|   2 +
 src/include/access/brin.h |   3 +
 src/include/utils/tuplesort.h |  11 +
 .../modules/dummy_index_am/dummy_index_am.c   |   1 +
 14 files changed, 1096 insertions(+), 6 deletions(-)

diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index 4830cb3fee6..a781c5d98d6 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -122,6 +122,7 @@ blhandler(PG_FUNCTION_ARGS)
 	amroutine->amclusterable = false;
 	amroutine->ampredlocks = false;
 	amroutine->amcanparallel = false;
+	

Re: [HACKERS] Changing references of password encryption to hashing

2023-11-28 Thread Stephen Frost
Greetings,

On Tue, Nov 28, 2023 at 20:19 Robert Haas  wrote:

> On Tue, Nov 28, 2023 at 10:16 AM Stephen Frost  wrote:
> > We pass a completely cleartext password to the server from the client.
> > Yes, we might encrypt it on the way with TLS, but even SSH realized how
> > terrible that is long, long ago and strongly discourages it these days.
> >
> > The problem with ldap as an auth method is that a single compromised PG
> > server in an AD/ldap environment can collect up those username+password
> > credentials and gain access to those users *domain* level access.  The
> > CFO logging into a PG server with LDAP auth is giving up their complete
> > access credentials to the entire AD domain.  That's terrible.
>
> Good grief. I really think you need to reassess your approach to this
> whole area. It is massively unhelpful to be so prescriptive about what
> people are, or are not, allowed to be comfortable with from a security
> perspective. And it seems like you're so convinced that you're right
> that you're completely closed off to anyone else's input, so there's
> not even any point in arguing.


You asked what the issue is with our ldap auth method. I explained it
pretty clearly. Is there something incorrect about the issues I’ve pointed
out with ldap auth?  You basically came back at me saying that I’m being
unhelpful for responding with what the issues are.

In the reality I inhabit, many people could *massively* improve their
> security by switching to LDAP from the really lame things that they're
> doing right now. They would be FAR better off. It would not even be
> close. I pretty much know that you aren't going to believe that and
> are going to offer reasons why it isn't and can't possibly be true, or
> else just say that those people are so hopeless that we shouldn't even
> care about them. All I can say is that you're worrying about security
> problems that are so minute as to be invisible compared to what I see.


I don’t know what they’re doing now, as you don’t say, and so I really
couldn’t say if ldap is better or worse for them. In some cases, sure,
perhaps ldap is better than … something else, but in 99% of cases, ldap is
being used because it seems simpler than setting up GSSAPI … but that’s a
lack of clearer documentation on our side on how to set up GSSAPI with AD
and PG, imv.   I’ve tried to improve on that situation by explaining
clearly how to set up GSSAPI authentication with AD and then consistently
pointing people to that.  I don’t put this out there as fiat without any
effort behind it, I’ve been working to move folks in the right direction,
I’ve put in that effort on the lists and continue to do so, feel free to
review the archives.

What I don’t understand is the argument for using ldap in an AD environment
instead of GSSAPI/SSPI. Is there some reason you feel ldap is better?
When there are two such options and one is clearly much more secure,
shouldn’t we push people to the more secure option… possibly even to the
exclusion and removal of the insecure option?  Ditto below for md5 …

> I disagree- it's a known pass-the-hash vulnerability and frankly every
> > release we do with it still existing is deserving of an immediate CVE
> > (I've been asked off-list why we don't do this, in fact).
>
> I agree that the issues with MD5 are way worse than the ones with
> LDAP, but you're arguing here, as you did with exclusive backup mode,
> that the existence of options that are bad, or that you consider to be
> bad, is a problem even if nothing whatever compels people to use them.


I’m not impressed with the conflation of this discussion and that of
exclusive backup mode.  I wasn’t the one who deprecated exclusive backup
mode, tho I did agree with that move, and the reason for removing it was as
much about the complications of documenting it as for the fact that it was
actively broken.

I think that is, to borrow a phrase from Tom, arrant nonsense. Sure,
> MD5 authentication has a pass-the-hash vulnerability, and that sucks.


So, given that we all agree with the CVE-worthy issue that exists with
every release where we include md5 auth, we should be applying for a CVE
prior to each release, no?

So let's say we remove it. Then presumably we should also remove
> password authentication, which has an even easier-to-exploit
> vulnerability, and trust, which is easier to exploit still. And
> presumably it would be right to do all of that even if SCRAM
> authentication had never been implemented, so then we'd have no
> support for any form of password authentication at all. And we'd
> remove LDAP, too, so the most obvious way of doing password
> authentication without putting the passwords in the database would
> also not exist any more. Maybe we should nuke a few more too --- is
> PAM any better than LDAP in this regard? Pretty soon you end up in a
> world where it's either impossible to log into the database at all, or
> everybody's got to use, I don't know, Kerberos and SSL 

Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

2023-11-28 Thread Peter Geoghegan
On Tue, Nov 28, 2023 at 4:29 AM Tomas Vondra
 wrote:
> I haven't looked at the code, but I decided to do a bit of blackbox perf
> and stress testing, to get some feeling of what to expect in terms of
> performance improvements, and see if there happen to be some unexpected
> regressions. Attached is a couple simple bash scripts doing a
> brute-force test with tables of different size / data distribution,
> number of values in the SAOP expression, etc.

My own stress-testing has focussed on the two obvious extremes for
this patch, using variants of pgbench with SAOP SELECTs on
pgbench_accounts:

1. The case where there is almost no chance of finding any two index
tuples together on the same page, because the constants are completely
random. This workload makes the patch's attempts at "coalescing
together" index page reads pure overhead, with no possible benefit.
Obviously that's a cost that needs to be kept under control.

2. The case where there are 255 tuples with distinct values that
are clustered together (both in the key space and in physical index
pages). Usually they'll span two index pages, but they might all fit
together on one index page, allowing us to descend to it directly and
read it only once.

With 32 clients, I typically see a regression of about 1.5% for the
first case relative to master, measured in throughput/TPS. The second
case typically sees throughput that's ~4.8x master (i.e. a ~380%
increase). I consider both of these extremes to be fairly unrealistic.
With fewer array constants, the speed-ups you'll see in sympathetic
cases are still very significant, but nothing like 4.8x.  They're more
like the 30% numbers that you saw.

As you know, I'm not actually all that excited about cases like 2 --
it's not where users are likely to benefit from the patch. The truly
interesting cases are those cases where we can completely avoid heap
accesses in the first place (not just *repeat* accesses to the same
index pages), due to the patch's ability to consistently use index
quals rather than filter quals. It's not that hard to show cases where
there are 100x+ fewer pages accessed -- often with cases that have very few
array constants. It's just that these cases aren't that interesting
from a performance validation point of view -- it's obvious that
filter quals are terrible.

Another thing that the patch does particularly well on is cases where
the array keys don't have any matches at all, but there is significant
clustering (relatively common when multiple SAOPs are used as index
quals, which becomes far more likely due to the planner changes). We
don't just skip over parts of the index that aren't relevant -- we
also skip over parts of the arrays that aren't relevant. Some of my
adversarial test cases that take ~1 millisecond for the patch to
execute will practically take forever on master (I had to have my test
suite not run those tests against master). You just need lots of array
keys.

What master does on those adversarial cases with billions of distinct
combinations of array keys (not that unlikely if there are 4 or 5
SAOPs with mere hundreds or thousands of array keys each) is so
inefficient that we might as well call it infinitely slower than the
patch. This is interesting to me from a performance
robustness/stability point of view. The slowest kind of SAOP index
scan with the patch becomes a full index scan -- just as it would be
if we were using any type of non-SAOP qual before now. The worst case
is a lot easier to reason about.

> I'm not convinced this is a problem we have to solve. It's possible it
> only affects cases that are implausible in practice (the script forces a
> particular scan type, and maybe it would not be picked in practice). But
> maybe it's fixable ...

I would expect the patch to do quite well (relative to what is
actually possible) on cases like the two extremes that I've focussed
on so far. It seems possible that it will do less well on cases that
are somewhere in the middle (that also have lots of distinct values on
each page).

We effectively do a linear search on a page that we know has at least
one more match (following a precheck that uses the high key). We hope
that the next match (for the next array value) closely follows an
initial match. But what if there are only 2 or 3 matches on each leaf
page, that are spaced relatively far apart? You're going to have to
grovel through the whole page.

It's not obvious that that's a problem to be fixed -- we're still only
descending the index once and still only locking the leaf page once,
so we'll probably still win relative to master. And it's not that easy
to imagine beating a linear search -- it's not like there is just one
"next value" to search for in these cases. But it's something that
deserves further consideration.

-- 
Peter Geoghegan




Re: archive modules loose ends

2023-11-28 Thread Nathan Bossart
Here is a new version of the patch with feedback addressed.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
From 3cda5bb87c82738ad6f8a082ef5dfeb49cd51392 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Tue, 28 Nov 2023 11:17:13 -0600
Subject: [PATCH v3 1/1] archiver exception handling

---
 contrib/basic_archive/basic_archive.c | 134 +-
 doc/src/sgml/archive-modules.sgml |  11 ++-
 src/backend/archive/shell_archive.c   |   5 -
 src/backend/postmaster/pgarch.c   |  93 +-
 4 files changed, 107 insertions(+), 136 deletions(-)

diff --git a/contrib/basic_archive/basic_archive.c b/contrib/basic_archive/basic_archive.c
index 4d78c31859..d575faab93 100644
--- a/contrib/basic_archive/basic_archive.c
+++ b/contrib/basic_archive/basic_archive.c
@@ -40,26 +40,18 @@
 
 PG_MODULE_MAGIC;
 
-typedef struct BasicArchiveData
-{
-	MemoryContext context;
-} BasicArchiveData;
-
 static char *archive_directory = NULL;
 
-static void basic_archive_startup(ArchiveModuleState *state);
 static bool basic_archive_configured(ArchiveModuleState *state);
 static bool basic_archive_file(ArchiveModuleState *state, const char *file, const char *path);
-static void basic_archive_file_internal(const char *file, const char *path);
 static bool check_archive_directory(char **newval, void **extra, GucSource source);
 static bool compare_files(const char *file1, const char *file2);
-static void basic_archive_shutdown(ArchiveModuleState *state);
 
 static const ArchiveModuleCallbacks basic_archive_callbacks = {
-	.startup_cb = basic_archive_startup,
+	.startup_cb = NULL,
 	.check_configured_cb = basic_archive_configured,
 	.archive_file_cb = basic_archive_file,
-	.shutdown_cb = basic_archive_shutdown
+	.shutdown_cb = NULL
 };
 
 /*
@@ -93,24 +85,6 @@ _PG_archive_module_init(void)
 	return _archive_callbacks;
 }
 
-/*
- * basic_archive_startup
- *
- * Creates the module's memory context.
- */
-void
-basic_archive_startup(ArchiveModuleState *state)
-{
-	BasicArchiveData *data;
-
-	data = (BasicArchiveData *) MemoryContextAllocZero(TopMemoryContext,
-	   sizeof(BasicArchiveData));
-	data->context = AllocSetContextCreate(TopMemoryContext,
-		  "basic_archive",
-		  ALLOCSET_DEFAULT_SIZES);
-	state->private_data = (void *) data;
-}
-
 /*
  * check_archive_directory
  *
@@ -171,74 +145,6 @@ basic_archive_configured(ArchiveModuleState *state)
  */
 static bool
 basic_archive_file(ArchiveModuleState *state, const char *file, const char *path)
-{
-	sigjmp_buf	local_sigjmp_buf;
-	MemoryContext oldcontext;
-	BasicArchiveData *data = (BasicArchiveData *) state->private_data;
-	MemoryContext basic_archive_context = data->context;
-
-	/*
-	 * We run basic_archive_file_internal() in our own memory context so that
-	 * we can easily reset it during error recovery (thus avoiding memory
-	 * leaks).
-	 */
-	oldcontext = MemoryContextSwitchTo(basic_archive_context);
-
-	/*
-	 * Since the archiver operates at the bottom of the exception stack,
-	 * ERRORs turn into FATALs and cause the archiver process to restart.
-	 * However, using ereport(ERROR, ...) when there are problems is easy to
-	 * code and maintain.  Therefore, we create our own exception handler to
-	 * catch ERRORs and return false instead of restarting the archiver
-	 * whenever there is a failure.
-	 */
-	if (sigsetjmp(local_sigjmp_buf, 1) != 0)
-	{
-		/* Since not using PG_TRY, must reset error stack by hand */
-		error_context_stack = NULL;
-
-		/* Prevent interrupts while cleaning up */
-		HOLD_INTERRUPTS();
-
-		/* Report the error and clear ErrorContext for next time */
-		EmitErrorReport();
-		FlushErrorState();
-
-		/* Close any files left open by copy_file() or compare_files() */
-		AtEOSubXact_Files(false, InvalidSubTransactionId, InvalidSubTransactionId);
-
-		/* Reset our memory context and switch back to the original one */
-		MemoryContextSwitchTo(oldcontext);
-		MemoryContextReset(basic_archive_context);
-
-		/* Remove our exception handler */
-		PG_exception_stack = NULL;
-
-		/* Now we can allow interrupts again */
-		RESUME_INTERRUPTS();
-
-		/* Report failure so that the archiver retries this file */
-		return false;
-	}
-
-	/* Enable our exception handler */
-	PG_exception_stack = &local_sigjmp_buf;
-
-	/* Archive the file! */
-	basic_archive_file_internal(file, path);
-
-	/* Remove our exception handler */
-	PG_exception_stack = NULL;
-
-	/* Reset our memory context and switch back to the original one */
-	MemoryContextSwitchTo(oldcontext);
-	MemoryContextReset(basic_archive_context);
-
-	return true;
-}
-
-static void
-basic_archive_file_internal(const char *file, const char *path)
 {
 	char		destination[MAXPGPATH];
 	char		temp[MAXPGPATH + 256];
@@ -272,7 +178,7 @@ basic_archive_file_internal(const char *file, const char *path)
 			fsync_fname(destination, false);
 			fsync_fname(archive_directory, true);
 
-			return;
+			return true;
 		}
 
 		ereport(ERROR,
@@ -312,6 +218,8 @@ 

Re: [PATCH] LockAcquireExtended improvement

2023-11-28 Thread Andres Freund
Hi,

On 2023-11-28 20:52:31 +0800, Jingxian Li wrote:
> postgres=*# lock table test in exclusive mode ;
>
>
> T4
>
> Case 1:
>
> postgres=*# lock table test in share row exclusive mode nowait;
>
> ERROR: could not obtain lock on relation "test"
>
> 
>
> Case 2:
>
> postgres=*# lock table test in share row exclusive mode;
>
> LOCK TABLE
>
> 
>
>
> At T4 moment in session A, (case 1) when executing SQL “lock table test in 
> share row exclusive mode nowait;”, an error occurs with message “could not 
> obtain lock on relation test"; however, (case 2) when executing the SQL above 
> without nowait, lock can be obtained successfully.
>
> Digging into the source code, I find that in case 2 the lock was obtained in
> the function ProcSleep instead of LockAcquireExtended. Due to nowait logic
> processed before WaitOnLock-ProcSleep, acquiring lock failed in case
> 1. Can any changes be made so that the act of such lock granted occurs
> before WaitOnLock?

I don't think that'd make sense - lock reordering is done to prevent deadlocks
and is quite expensive. Why should NOWAIT incur that cost?


> 
>
> Providing a more universal case:
>
> Transaction A already holds an n-mode lock on table test. If transaction A
> then requests an m-mode lock on table test, m and n satisfy the following
> constraint:
>
> (lockMethodTable->conflictTab[n] & lockMethodTable->conflictTab[m])
> == lockMethodTable->conflictTab[m]
>
> Obviously, in this case, m<=n.
>
> Should the m-mode lock be granted before WaitOnLock?
>
> 
>
> In the case of m=n (i.e. we already hold the lock), the m-mode lock is
> immediately granted in the LocalLock path, without the need for a lock
> conflict check.

Sure - it'd not help anybody to wait for a lock we already hold - in fact it'd
create a lot of deadlocks.


> Based on the facts above, can we obtain a weaker lock (m<n) on the same
> object within the same transaction without doing a lock conflict check?

Perhaps. There's no inherent "lock strength" ordering for all locks though.

Greetings,

Andres Freund




Re: pg_upgrade and logical replication

2023-11-28 Thread vignesh C
On Mon, 27 Nov 2023 at 06:53, Peter Smith  wrote:
>
> Here are some review comments for patch set v19*
>
> //
>
> v19-0001.
>
> No comments
>
> ///
>
> v19-0002.
>
> (I saw that both changes below seemed cut/paste from similar
> functions, but I will ask the questions anyway).
>
> ==
> src/backend/commands/subscriptioncmds.c
>
> 1.
> +/* Potentially set by pg_upgrade_support functions */
> +Oid binary_upgrade_next_pg_subscription_oid = InvalidOid;
> +
>
> The comment "by pg_upgrade_support functions" seemed a bit vague. IMO
> you might as well tell the name of the function that sets this.
>
> SUGGESTION
> Potentially set by the pg_upgrade_support function --
> binary_upgrade_set_next_pg_subscription_oid().

Modified

> ~~~
>
> 2. CreateSubscription
>
> + if (!OidIsValid(binary_upgrade_next_pg_subscription_oid))
> + ereport(ERROR,
> + (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> + errmsg("pg_subscription OID value not set when in binary upgrade mode")));
>
> Doesn't this condition mean some kind of impossible internal error
> occurred -- i.e. should this be elog instead of ereport?

This is kind of a sanity check to prevent setting the subscription id
with an invalid oid. This can happen if the server is started in
binary upgrade mode and create subscription is called without calling
binary_upgrade_set_next_pg_subscription_oid.

The comment is handled in the v20 version patch attached at:
https://www.postgresql.org/message-id/CALDaNm0ST1iSrJLD_CV6hQs%3Dw4GZRCRdftQvQA3cO8Hq3QUvYw%40mail.gmail.com

Regards,
Vignesh




Re: pg_upgrade and logical replication

2023-11-28 Thread vignesh C
On Sat, 25 Nov 2023 at 17:50, Amit Kapila  wrote:
>
> 2.
> + * b) SUBREL_STATE_SYNCDONE: A relation upgraded while in this state
> + * would retain the replication origin in certain cases.
>
> I think this is vague. Can we briefly describe cases where the origins
> would be retained?

Modified

> 3. I think the cases where the publisher is also upgraded restoring
> the origin's LSN is of no use. Currently, I can't see a problem with
> restoring stale originLSN in such cases as we won't be able to
> distinguish during the upgrade but I think we should document it in
> the comments somewhere in the patch.

Added comments

These are handled in the v20 version patch attached at:
https://www.postgresql.org/message-id/CALDaNm0ST1iSrJLD_CV6hQs%3Dw4GZRCRdftQvQA3cO8Hq3QUvYw%40mail.gmail.com

Regards,
Vignesh




Re: [HACKERS] Changing references of password encryption to hashing

2023-11-28 Thread Robert Haas
On Tue, Nov 28, 2023 at 10:16 AM Stephen Frost  wrote:
> We pass a completely cleartext password to the server from the client.
> Yes, we might encrypt it on the way with TLS, but even SSH realized how
> terrible that is long, long ago and strongly discourages it these days.
>
> The problem with ldap as an auth method is that a single compromised PG
> server in an AD/ldap environment can collect up those username+password
> credentials and gain access to those users *domain* level access.  The
> CFO logging into a PG server with LDAP auth is giving up their complete
> access credentials to the entire AD domain.  That's terrible.

Good grief. I really think you need to reassess your approach to this
whole area. It is massively unhelpful to be so prescriptive about what
people are, or are not, allowed to be comfortable with from a security
perspective. And it seems like you're so convinced that you're right
that you're completely closed off to anyone else's input, so there's
not even any point in arguing.

In the reality I inhabit, many people could *massively* improve their
security by switching to LDAP from the really lame things that they're
doing right now. They would be FAR better off. It would not even be
close. I pretty much know that you aren't going to believe that and
are going to offer reasons why it isn't and can't possibly be true, or
else just say that those people are so hopeless that we shouldn't even
care about them. All I can say is that you're worrying about security
problems that are so minute as to be invisible compared to what I see.

> I disagree- it's a known pass-the-hash vulnerability and frankly every
> release we do with it still existing is deserving of an immediate CVE
> (I've been asked off-list why we don't do this, in fact).

I agree that the issues with MD5 are way worse than the ones with
LDAP, but you're arguing here, as you did with exclusive backup mode,
that the existence of options that are bad, or that you consider to be
bad, is a problem even if nothing whatever compels people to use them.
I think that is, to borrow a phrase from Tom, arrant nonsense. Sure,
MD5 authentication has a pass-the-hash vulnerability, and that sucks.
So let's say we remove it. Then presumably we should also remove
password authentication, which has an even easier-to-exploit
vulnerability, and trust, which is easier to exploit still. And
presumably it would be right to do all of that even if SCRAM
authentication had never been implemented, so then we'd have no
support for any form of password authentication at all. And we'd
remove LDAP, too, so the most obvious way of doing password
authentication without putting the passwords in the database would
also not exist any more. Maybe we should nuke a few more too --- is
PAM any better than LDAP in this regard? Pretty soon you end up in a
world where it's either impossible to log into the database at all, or
everybody's got to use, I don't know, Kerberos and SSL certificates,
or something.

But what will happen in practice is that either users will just give
up on PostgreSQL because it refuses to let them do what they want to
do, or those SSL certificates will end up unencrypted in a
world-writable file somewhere, or get emailed around to everyone in
the group, or maybe checked into the application's git repository that
is public to the whole company behind the firewall. There won't be a
passphrase on the certificate, or it will be set to 'secret123', or
whatever. The expiration date will be set to the furthest date in the
future that the software supports, or the user won't realize they need
to worry about that and the application will suddenly go down when the
certificate expires.

You can't make people do the right thing by removing their options.
You can only make them work a little harder to get around the security
control that you tried to impose by fiat, and in the process, make
them frustrated. I imagine that the reason you didn't suggest removing
'trust' is that you admit the existence of environments where it's
safe enough -- someone working with a local PG on a loopback port on a
single-user machine, maybe a connection from a certain IP where that
IP and the DB are both behind a firewall, or a burner system that
someone just literally doesn't care about. But the "safe enough"
argument applies to everything else, too. It's the job of users to
adequately secure their systems against external attack, and it's our
job to help them, not to enforce our own views on them. It is
absolutely fine to tell users, in a thoughtful and balanced way, what
the advantages and disadvantages of different choices are. But
removing their choices is high-handed and ineffective.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: SSL tests fail on OpenSSL v3.2.0

2023-11-28 Thread Tristan Partin

On Tue Nov 28, 2023 at 10:06 AM CST, Tom Lane wrote:

"Tristan Partin"  writes:
> On Tue Nov 28, 2023 at 9:42 AM CST, Tom Lane wrote:
>> I have no interest in supporting BoringSSL.  I just replied to
>> Daniel's comment because it seemed to resolve the last concern
>> about whether your patch is OK.

> If you haven't started fixing the tests, then I'll get to work on it and 
> send a new revision before the end of the day. Thanks!


No need, I can finish it up from here.


Sweet. I appreciate your help.

--
Tristan Partin
Neon (https://neon.tech)




Re: SSL tests fail on OpenSSL v3.2.0

2023-11-28 Thread Tom Lane
"Tristan Partin"  writes:
> On Tue Nov 28, 2023 at 9:42 AM CST, Tom Lane wrote:
>> I have no interest in supporting BoringSSL.  I just replied to
>> Daniel's comment because it seemed to resolve the last concern
>> about whether your patch is OK.

> If you haven't started fixing the tests, then I'll get to work on it and 
> send a new revision before the end of the day. Thanks!

No need, I can finish it up from here.

regards, tom lane




Re: Synchronizing slots from primary to standby

2023-11-28 Thread Drouvot, Bertrand

Hi,

On 11/28/23 10:40 AM, shveta malik wrote:

On Tue, Nov 28, 2023 at 12:19 PM Drouvot, Bertrand
 wrote:


Hi,

On 11/28/23 4:13 AM, shveta malik wrote:

On Mon, Nov 27, 2023 at 4:08 PM Amit Kapila  wrote:


On Mon, Nov 27, 2023 at 2:27 PM Zhijie Hou (Fujitsu)
 wrote:


Here is the updated version (v39_2) which includes all the changes made in 0002.
Please use it for review, and sorry for the confusion.



--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -8,20 +8,27 @@
*   src/backend/replication/logical/launcher.c
*
* NOTES
- *   This module contains the logical replication worker launcher which
- *   uses the background worker infrastructure to start the logical
- *   replication workers for every enabled subscription.
+ *   This module contains the replication worker launcher which
+ *   uses the background worker infrastructure to:
+ *   a) start the logical replication workers for every enabled subscription
+ *  when not in standby_mode.
+ *   b) start the slot sync worker for logical failover slots synchronization
+ *  from the primary server when in standby_mode.

I was wondering do we really need a launcher on standby to invoke
sync-slot worker. If so, why? I guess it may be required for previous
versions where we were managing work for multiple slot-sync workers
which is also questionable in the sense of whether launcher is the
right candidate for the same but now with the single slot-sync worker,
it doesn't seem worth having it. What do you think?

--


Yes, earlier a manager process was needed to manage multiple slot-sync
workers and distribute load among them, but now that does not seem
necessary. I gave it a try (PoC) and it seems to work well.  If there
are no objections to this approach, I can share the patch soon.



+1 on this new approach, thanks!


PFA v40. This patch removes the Logical Replication Launcher's support
for launching the slotsync worker.


Thanks!


 The slot-sync worker is now registered as
a bgworker with the postmaster, with
bgw_start_time=BgWorkerStart_ConsistentState and
bgw_restart_time=60sec.

With the launcher removed, all the validity checks have been shifted
to the slot-sync worker itself.  This brings us to some points of concern:

a) We still need to maintain the RecoveryInProgress() check in the slotsync
worker. Since the worker has the start time of
BgWorkerStart_ConsistentState, it will be started on a non-standby as
well. So to ensure that it exits on a non-standby, a "RecoveryInProgress"
check has been introduced at the beginning of the worker. But once it
exits, the postmaster will not restart it, since that is a clean exit i.e.
proc_exit(0) (the restart logic of the postmaster comes into play only
when there is an abnormal exit). But to exit for the first time on a
non-standby, we need that recovery-related check in the worker (a sketch
follows below, after point (b)).

b) "enable_syncslot" check is moved to slotsync worker now. Since
enable_syncslot is PGC_SIGHUP, so proc_exit(1) is currently used to
exit the worker if 'enable_syncslot' is found to be disabled.
'proc_exit(1)' has been used in order to ensure that the worker is
restarted and GUCs are checked again after restart_time. Downside of
this approach is, if someone has kept "enable_syncslot" as disabled
permanently even on standby, slotsync worker will keep on restarting
and exiting.
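
As a concrete illustration of (a), here is a minimal sketch of the
registration and the early-exit check. The entry-point name and worker
name are assumptions made for the example, not necessarily what the
patch uses:

/* registration, during postmaster startup */
BackgroundWorker bgw;

memset(&bgw, 0, sizeof(bgw));
bgw.bgw_flags = BGWORKER_SHMEM_ACCESS | BGWORKER_BACKEND_DATABASE_CONNECTION;
bgw.bgw_start_time = BgWorkerStart_ConsistentState;
bgw.bgw_restart_time = 60;		/* seconds */
snprintf(bgw.bgw_library_name, BGW_MAXLEN, "postgres");
snprintf(bgw.bgw_function_name, BGW_MAXLEN, "ReplSlotSyncWorkerMain");
snprintf(bgw.bgw_name, BGW_MAXLEN, "slot sync worker");
RegisterBackgroundWorker(&bgw);

/* at the top of the worker's main function */
if (!RecoveryInProgress())
	proc_exit(0);	/* clean exit on a primary, so postmaster will not restart it */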

So to overcome the above pain-points, I think a potential approach
will be to start the slotsync worker only if 'enable_syncslot' is on and
the system is a standby. 


That makes sense to me.


Potential ways (each with some issues) are:

1) Use the current way i.e. register slot-sync worker as bgworker with
postmaster, but introduce extra checks in 'maybe_start_bgworkers'. But
this seems more like a hack. This will need extra changes as currently
once 'maybe_start_bgworkers' is attempted by postmaster, it will
attempt again to start any worker only if the worker had abnormal exit
and restart_time != 0. The current postmaster will not attempt to start a
worker on any GUC change.

2) Another way may be to treat the slotsync worker as a special case,
separate out its start/restart from the bgworker machinery, and follow
what we do for the autovacuum launcher (StartAutoVacLauncher) to keep
starting it in the postmaster loop (ServerLoop). In this way, we may be
able to add more checks before starting the worker. But by opting for
this approach, we will have to manage the slotsync worker completely
ourselves, as it will no longer be part of the existing
bgworker-registration infra. If this seems okay and there are no other
better options, it can be analyzed further in detail.

3) Another approach could be, in order to solve issue (a), to introduce a
new start_time 'BgWorkerStart_ConsistentState_HotStandby' which means
start a bgworker only if a consistent state is reached and the system is
a standby. And for issue (b), let's retain the check of enable_syncslot in
the worker itself but make it 'PGC_POSTMASTER'. This will ensure we
can safely exit the worker(proc_exit(0) if 

Re: Annoying build warnings from latest Apple toolchain

2023-11-28 Thread Robert Haas
On Tue, Nov 21, 2023 at 9:59 AM Peter Eisentraut  wrote:
> Btw., I'm also seeing warnings like this.  I'm using homebrew.  Here is
> a sample:
>
> [145/147] Linking target src/test/modules/test_shm_mq/test_shm_mq.dylib
> -macosx_version_min has been renamed to -macos_version_min
> ld: warning: -undefined error is deprecated
> ld: warning: ignoring duplicate libraries: '-lgcc'
> [146/147] Linking target src/test/modules/test_slru/test_slru.dylib

I poked at this issue a bit more. In meson.build, for Darwin, we have this:

  # meson defaults to -Wl,-undefined,dynamic_lookup for modules, which we
  # don't want because a) it's different from what we do for autoconf, b) it
  # causes warnings starting in macOS Ventura
  ldflags_mod += ['-Wl,-undefined,error']

I don't really understand how meson works, but I blindly tried
commenting that out. What I found is that doing so reduces the number
of warnings that I get locally from 226 to 113. The difference seems
to be that, with the unpatched meson.build file, I get warnings both
about binaries and also about loadable modules, but with the patched
version, the loadable modules stop emitting warnings, and the binaries
continue to do so. This gives the flavor:

-[] Linking target contrib/adminpack/adminpack.dylib
-[] Linking target contrib/amcheck/amcheck.dylib
...
-[] Linking target src/backend...version_procs/latin2_and_win1250.dylib
-[] Linking target src/backend...version_procs/utf8_and_iso8859_1.dylib
 [] Linking target src/backend/postgres
-[] Linking target src/backend/replication/pgoutput/pgoutput.dylib
-[] Linking target src/backend/snowball/dict_snowball.dylib
 [] Linking target src/bin/initdb/initdb
 [] Linking target src/bin/pg_amcheck/pg_amcheck
 [] Linking target src/bin/pg_archivecleanup/pg_archivecleanup
 [] Linking target src/bin/pg_basebackup/pg_basebackup
...

The lines with - at the beginning are the warnings that disappear when
I comment out the addition to ldflags_mod. The lines without a - at
the beginning are the ones that appear either way.

The first, rather inescapable, conclusion is that the comment isn't
completely correct. It claims that we need to add -Wl,-undefined,error
on macOS Ventura to avoid warnings, but on my system it has exactly
the opposite effect: it produces warnings. I think we must have
misdiagnosed what the triggering condition actually is -- maybe it
depends on CPU architecture or choice of compiler or something, but
it's not as simple as "on Ventura you need this" because I am on
Ventura and I anti-need this.

The second conclusion that I draw is that there's something in meson
itself which is adding -Wl,-undefined,error when building binaries.
The snippet above is the only place in the entire source tree where we
specify a -undefined flag for a compile. The fact that the warning
still shows up when I comment that out means that in other cases,
meson itself is adding the flag, seemingly wrongly. But I don't know
how to make it not do that. I tried adding an option to ldflags, but
the linker isn't happy with adding something like
-Wl,-undefined,warning --- then it complains about both
-Wl,-undefined,error and -Wl,-undefined,warning. Apparently, what it
really wants is for the option to not be specified at all:

https://stackoverflow.com/questions/77525544/apple-linker-warning-ld-warning-undefined-error-is-deprecated

See also https://github.com/mesonbuild/meson/issues/12450

What a stupid, annoying decision on Apple's part. It seems like
-Wl,-undefined,error is the default behavior, so they could have just
ignored that flag if present, but instead they complain about being
asked to do what they were going to do anyway.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Missing docs on AT TIME ZONE precedence?

2023-11-28 Thread Andrew Dunstan



On 2023-11-28 Tu 10:27, Tom Lane wrote:

Andrew Dunstan  writes:

Looks good. Perhaps the comments above the UNBOUNDED precedence setting
(esp. the first paragraph) need strengthening, with a stern injunction
to avoid different precedence for non-reserved keywords if at all possible.

OK.  How about rewriting that first para like this?

  * Sometimes it is necessary to assign precedence to keywords that are not
  * really part of the operator hierarchy, in order to resolve grammar
  * ambiguities.  It's best to avoid doing so whenever possible, because such
  * assignments have global effect and may hide ambiguities besides the one
  * you intended to solve.  (Attaching a precedence to a single rule with
  * %prec is far safer and should be preferred.)  If you must give precedence
  * to a new keyword, try very hard to give it the same precedence as IDENT.
  * If the keyword has IDENT's precedence then it clearly acts the same as
  * non-keywords and other similar keywords, thus reducing the risk of
  * unexpected precedence effects.
  *
  * We used to need to assign IDENT an explicit precedence just less than Op,
  * to support target_el without AS.  While that's not really necessary since
  * we removed postfix operators, we continue to do so because it provides a
  * reference point for a precedence level that we can assign to other
  * keywords that lack a natural precedence level.





LGTM. Thanks.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: SSL tests fail on OpenSSL v3.2.0

2023-11-28 Thread Tristan Partin

On Tue Nov 28, 2023 at 9:42 AM CST, Tom Lane wrote:

"Tristan Partin"  writes:
> When you say "this" are you referring to the patch I sent or adding 
> support for BoringSSL?


I have no interest in supporting BoringSSL.  I just replied to
Daniel's comment because it seemed to resolve the last concern
about whether your patch is OK.


If you haven't started fixing the tests, then I'll get to work on it and 
send a new revision before the end of the day. Thanks!


--
Tristan Partin
Neon (https://neon.tech)




Re: Add recovery to pg_control and remove backup_label

2023-11-28 Thread Stephen Frost
Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Sun, Nov 26, 2023 at 3:42 AM Stephen Frost  wrote:
> > What would really be helpful would be hearing from these individuals
> > directly as to what the issues are with the changes, such that perhaps
> > we can do things better in the future to avoid whatever the issue is
> > they're having with the changes.  Simply saying we shouldn't make
> > changes in this area isn't workable and the constant push-back is
> > actively discouraging to folks trying to make improvements.  Obviously
> > it's a biased view, but we've not had issues making the necessary
> > adjustments in pgbackrest with each release and I feel like if the
> > authors of wal-g or barman did that they would have spoken up.
> 
> I'm happy if people show up to comment on proposed changes, but I
> think you're being a little bit unrealistic here. I have had to help
> plenty of people who have screwed up their backups in one way or
> another, generally by using some home-grown script, sometimes by
> misusing some existing backup tool. Those people are EDB customers;
> they don't read and participate in discussions here. If they did,
> perhaps they wouldn't be paying EDB to have me and my colleagues sort
> things out for them when it all goes wrong. I'm not trying to say that
> EDB doesn't have customers who participate in mailing list
> discussions, because we do, but it's a small minority, and I don't
> think that should surprise anyone. Moreover, the people who don't
> wouldn't necessarily have the background, expertise, or *time* to
> assess specific proposals in detail. If your point is that my
> perspective on what's helpful or unhelpful is not valid because I've
> only helped 30 people who had problems in this area, but that the
> perspective of those 30 people who were helped would be more valid,
> well, I don't agree with that. I think your perspective and David's
> are valid precisely *because* you've worked a lot on pgbackrest and no
> doubt interacted with lots of users; I think Andres's perspective is
> valid precisely *because* of his experience working with the fleet at
> Microsoft and individual customers at EDB and 2Q before that; and I
> think my perspective is valid for the same kinds of reasons.

I didn't mean to imply that anyone's perspective wasn't valid.  I was
simply trying to get at the root question of: what *is* the issue with
the changes that are being made?  If the answer to that is: we made
this change, which was hard for folks to deal with, and could have
been avoided by doing X, then I really, really want to hear what X
was!  If the answer is, well, the changes weren't hard, but we didn't
like having to make any changes at all ... then I just don't have any
sympathy for that.  People who write backup software for PG, be it
pgbackrest authors, wal-g authors, or homegrown script authors, will
need to adapt between major versions as we discover things that are
broken (such as exclusive mode, and such as the clear risk that's been
demonstrated of a torn copy of pg_control getting copied, resulting in
a completely invalid backup) and fix them.

> I am more in agreement with the idea that it would be nice to hear
> from backup tool authors, but I think even that has limited value.
> Surely we can all agree that if the backup tool is correctly written,
> none of this matters, because you'll make the tool do the right things
> and then you'll be fine. The difficulty here, and the motivation
> behind this proposal and others like it, is that too many users fail
> to follow the procedure correctly. If we hear from the authors of
> well-written backup tools, I expect they will tell us they can adapt
> their tool to whatever we do. And if we hear from the authors of
> poorly-written tools, well, I don't think their opinions would form a
> great basis for making decisions.

Uhhh.  No, I disagree with this- I'd argue that pgbackrest was broken
until the most recent releases where we implemented a check to ensure
that the pg_control we copy has a valid PG CRC.  Did we know it was
broken before this discussion?  No, but that doesn't change the fact
that we certainly could have ended up copying an invalid pg_control and
thus have an invalid backup, which even our 'pgbackrest verify' wouldn't
have caught because that just checks that the checksum that pgbackrest
calculates for every file hasn't changed since we copied it- but that
didn't do anything for the issue about pg_control having an invalid
internal checksum due to a torn write when we copied it.

So, yes, it does matter.  We didn't make pgbackrest do the right thing
in this case because we thought it was true that you couldn't get a torn
read of pg_control; Thomas showed that wasn't true and that puts all of
our users at risk.  Thankfully somewhat minimal since we always copy
pg_control from the primary ... but still, it's not right, and we've
now taken steps to address it.  Unfortunately, other tools are going 

Re: SSL tests fail on OpenSSL v3.2.0

2023-11-28 Thread Tom Lane
"Tristan Partin"  writes:
> When you say "this" are you referring to the patch I sent or adding 
> support for BoringSSL?

I have no interest in supporting BoringSSL.  I just replied to
Daniel's comment because it seemed to resolve the last concern
about whether your patch is OK.

regards, tom lane




Re: Parallel CREATE INDEX for BRIN indexes

2023-11-28 Thread Matthias van de Meent
On Thu, 23 Nov 2023 at 14:35, Tomas Vondra
 wrote:
> On 11/23/23 13:33, Matthias van de Meent wrote:
>> The union operator may leak (lots of) memory, so I think it makes
>> sense to keep a context around that can be reset after we've extracted
>> the merge result.
>>
>
> But does the current code actually achieve that? It does create a "brin
> union" context, but then it only does this:
>
> /* Use our own memory context to avoid retail pfree */
> cxt = AllocSetContextCreate(CurrentMemoryContext,
>                             "brin union",
>                             ALLOCSET_DEFAULT_SIZES);
> oldcxt = MemoryContextSwitchTo(cxt);
> db = brin_deform_tuple(bdesc, b, NULL);
> MemoryContextSwitchTo(oldcxt);
>
> Surely that does not limit the amount of memory used by the actual union
> functions in any way?

Oh, yes, of course. For some reason I thought that covered the calls
to the union operator function too, but it indeed only covers
deserialization. I do think it is still worthwhile to not do the
create/delete cycle, but won't hold the patch back for that.
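
To make the scratch-context idea concrete, a sketch (assuming a context
created once per index build and reused for every merge; the exact call
details are illustrative):

/* deserialize and merge inside the scratch context */
oldcxt = MemoryContextSwitchTo(scratch_cxt);
db = brin_deform_tuple(bdesc, b, NULL);
/* ... invoke the opclass union support proc to merge db into a ... */
MemoryContextSwitchTo(oldcxt);

/* one cheap reset frees whatever deserialization and the union proc leaked */
MemoryContextReset(scratch_cxt);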

>>> However, I don't think the number of union_tuples calls is likely to be
>>> very high, especially for large tables. Because we split the table into
>>> 2048 chunks, and then cap the chunk size by 8192. For large tables
>>> (where this matters) we're likely close to 8192.
>>
>> I agree that the merging part of the index creation is the last part,
>> and usually has no high impact on the total performance of the reindex
>> operation, but in memory-constrained environments releasing and then
>> requesting the same chunk of memory over and over again just isn't
>> great.
>
> OK, I'll take a look at the scratch context you suggested.
>
> My point however was we won't actually do that very often, because on
> large tables the BRIN ranges are likely smaller than the parallel scan
> chunk size, so few overlaps. OTOH if the table is small, or if the BRIN
> ranges are large, there'll be few of them.

That's true, so maybe I'm concerned about something that amounts to
only marginal gains.

I noticed that the v4 patch doesn't yet update the documentation in
indexam.sgml with am->amcanbuildparallel.
Once that is included and reviewed I think this will be ready, unless
you want to address any of my comments upthread (that I marked with
'not in this patch') in this patch.


Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: SSL tests fail on OpenSSL v3.2.0

2023-11-28 Thread Tristan Partin

On Tue Nov 28, 2023 at 9:31 AM CST, Tom Lane wrote:

"Tristan Partin"  writes:
> How are you guys running the tests? I have PG_TEST_EXTRA=ssl and 
> everything passes for me. Granted, I am using the Meson build.


I'm doing what it says in test/ssl/README:

make check PG_TEST_EXTRA=ssl

I don't know whether the meson build has support for running these
extra "unsafe" tests or not.


Thanks Tom. I'll check again. Maybe I didn't set LD_LIBRARY_PATH when 
running the tests. I have openssl installing to a non-default prefix.


--
Tristan Partin
Neon (https://neon.tech)




Re: SSL tests fail on OpenSSL v3.2.0

2023-11-28 Thread Tom Lane
"Tristan Partin"  writes:
> How are you guys running the tests? I have PG_TEST_EXTRA=ssl and 
> everything passes for me. Granted, I am using the Meson build.

I'm doing what it says in test/ssl/README:

make check PG_TEST_EXTRA=ssl

I don't know whether the meson build has support for running these
extra "unsafe" tests or not.

regards, tom lane




Re: Missing docs on AT TIME ZONE precedence?

2023-11-28 Thread Tom Lane
Andrew Dunstan  writes:
> Looks good. Perhaps the comments above the UNBOUNDED precedence setting 
> (esp. the first paragraph) need strengthening, with a stern injunction 
> to avoid different precedence for non-reserved keywords if at all possible.

OK.  How about rewriting that first para like this?

 * Sometimes it is necessary to assign precedence to keywords that are not
 * really part of the operator hierarchy, in order to resolve grammar
 * ambiguities.  It's best to avoid doing so whenever possible, because such
 * assignments have global effect and may hide ambiguities besides the one
 * you intended to solve.  (Attaching a precedence to a single rule with
 * %prec is far safer and should be preferred.)  If you must give precedence
 * to a new keyword, try very hard to give it the same precedence as IDENT.
 * If the keyword has IDENT's precedence then it clearly acts the same as
 * non-keywords and other similar keywords, thus reducing the risk of
 * unexpected precedence effects.
 * 
 * We used to need to assign IDENT an explicit precedence just less than Op,
 * to support target_el without AS.  While that's not really necessary since
 * we removed postfix operators, we continue to do so because it provides a
 * reference point for a precedence level that we can assign to other
 * keywords that lack a natural precedence level.

regards, tom lane




Re: SSL tests fail on OpenSSL v3.2.0

2023-11-28 Thread Tristan Partin
How are you guys running the tests? I have PG_TEST_EXTRA=ssl and 
everything passes for me. Granted, I am using the Meson build.


--
Tristan Partin
Neon (https://neon.tech)




Re: [HACKERS] Changing references of password encryption to hashing

2023-11-28 Thread Stephen Frost
Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Tue, Nov 28, 2023 at 9:55 AM Stephen Frost  wrote:
> > I do think we should use the correct terminology in our documentation
> > and would support your working on improving things in this area.
> 
> +1.
> 
> > I do wonder if perhaps we would be better off by having someone spend
> > time on removing terribly insecure authentication methods like md5 and
> > ldap though ...
> 
> Wait, what's insecure about LDAP?

We pass a completely cleartext password to the server from the client.
Yes, we might encrypt it on the way with TLS, but even SSH realized how
terrible that is long, long ago and strongly discourages it these days.

The problem with ldap as an auth method is that a single compromised PG
server in an AD/ldap environment can collect up those username+password
credentials and gain access to those users *domain* level access.  The
CFO logging into a PG server with LDAP auth is giving up their complete
access credentials to the entire AD domain.  That's terrible.

> I think we should eventually remove MD5, but I think there's no rush.

I disagree- it's a known pass-the-hash vulnerability and frankly every
release we do with it still existing is deserving of an immediate CVE
(I've been asked off-list why we don't do this, in fact).

> People who care about security will have already switched, and people
> who don't care about security are not required to start caring.

I wish it were this simple.  It's just not though.

> Eventually the maintenance burden will become large enough that it
> makes sense to phase it out for that reason, but I haven't seen any
> evidence that we're anywhere close to that point.

This seems to invite the idea that what people who care about this need
to do is make it painful for us to continue to keep it around, which I
really don't think is best for anyone.  We know it's bad, we know it is
broken, we need to remove it, not pretend like it's not broken or not
bad.

Thanks,

Stephen




Re: SSL tests fail on OpenSSL v3.2.0

2023-11-28 Thread Tristan Partin

On Tue Nov 28, 2023 at 9:00 AM CST, Tom Lane wrote:

Daniel Gustafsson  writes:
> Thats not an issue, we don't support building with BoringSSL.

Right.  I'll work on getting this pushed, unless someone else
is already on it?


When you say "this" are you referring to the patch I sent or adding 
support for BoringSSL?


--
Tristan Partin
Neon (https://neon.tech)




Re: [HACKERS] Changing references of password encryption to hashing

2023-11-28 Thread Robert Haas
On Tue, Nov 28, 2023 at 9:55 AM Stephen Frost  wrote:
> I do think we should use the correct terminology in our documentation
> and would support your working on improving things in this area.

+1.

> I do wonder if perhaps we would be better off by having someone spend
> time on removing terribly insecure authentication methods like md5 and
> ldap though ...

Wait, what's insecure about LDAP?

I think we should eventually remove MD5, but I think there's no rush.
People who care about security will have already switched, and people
who don't care about security are not required to start caring.
Eventually the maintenance burden will become large enough that it
makes sense to phase it out for that reason, but I haven't seen any
evidence that we're anywhere close to that point.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: SSL tests fail on OpenSSL v3.2.0

2023-11-28 Thread Tom Lane
Daniel Gustafsson  writes:
> Thats not an issue, we don't support building with BoringSSL.

Right.  I'll work on getting this pushed, unless someone else
is already on it?

regards, tom lane




Re: [HACKERS] Changing references of password encryption to hashing

2023-11-28 Thread Stephen Frost
Greetings,

* Bruce Momjian (br...@momjian.us) wrote:
> Is there any interest in fixing our documentation that says encrypted
> when it means hashed?  Should I pursue this?

I do think we should use the correct terminology in our documentation
and would support your working on improving things in this area.

I do wonder if perhaps we would be better off by having someone spend
time on removing terribly insecure authentication methods like md5 and
ldap though ...

Thanks,

Stephen




Re: logical decoding and replication of sequences, take 2

2023-11-28 Thread Tomas Vondra
On 11/28/23 12:32, Amit Kapila wrote:
> On Mon, Nov 27, 2023 at 11:45 PM Tomas Vondra
>  wrote:
>>
>> I spent a bit of time looking at the proposed change, and unfortunately
>> logging just the boolean flag does not work. A good example is this bit
>> from a TAP test added by the patch for built-in replication (which was
>> not included with the WIP patch):
>>
>>   BEGIN;
>>   ALTER SEQUENCE s RESTART WITH 1000;
>>   SAVEPOINT sp1;
>>   INSERT INTO seq_test SELECT nextval('s') FROM generate_series(1,100);
>>   ROLLBACK TO sp1;
>>   COMMIT;
>>
>> This is expected to produce:
>>
>>   1131|0|t
>>
>> but produces
>>
>>   1000|0|f
>>
>> instead. The reason is very simple - as implemented, the patch simply
>> checks if the relfilenode is from the same top-level transaction, which
>> it is, and sets the flag to "true". So we know the sequence changes need
>> to be queued and replayed as part of this transaction.
>>
>> But then during decoding, we still queue the changes into the subxact,
>> which then aborts, and the changes are discarded. That is not how it's
>> supposed to work, because the new relfilenode is still valid, someone
>> might do nextval() and commit. And the nextval() may not get WAL-logged,
>> so we'd lose this.
>>
>> What I guess we might do is log not just a boolean flag, but the XID of
>> the subtransaction that created the relfilenode. And then during
>> decoding we'd queue the changes into this subtransaction ...
>>
>> 0006 in the attached patch series does this, and it seems to fix the TAP
>> test failure. I left it at the end, to make it easier to run tests
>> without the patch applied.
>>
> 
> Offhand, I don't have any better idea than what you have suggested for
> the problem but this needs some thoughts including the questions asked
> by you. I'll spend some time on it and respond back.
> 

I've been experimenting with the idea of logging the XID, and for a moment
I was worried it actually can't work, because subtransactions may not
simply be nested but may form a tree. And what if the
sequence was altered in a different branch (a sibling subxact), not in the
immediate parent. In which case the new SubTransactionGetXid() would
fail, because it just walks the current chain of subtransactions.

I've been thinking about cases like this:

   BEGIN;
   CREATE SEQUENCE s;                                        # XID 1000
   SELECT alter_sequence();                                  # XID 1001
   SAVEPOINT s1;
   SELECT COUNT(nextval('s')) FROM generate_series(1,100);   # XID 1000
   ROLLBACK TO s1;
   SELECT COUNT(nextval('s')) FROM generate_series(1,100);   # XID 1000
   COMMIT;

The XID values are what the sequence wal record will reference, assuming
that the main transaction XID is 1000.

Initially, I thought it's wrong that the nextval() calls reference XID
of the main transaction, because the last relfilenode comes from 1001,
which is the subxact created by alter_sequence() thanks to the exception
handling block. And that's where the approach in reorderbuffer would
queue the changes.

But I think this is actually correct too. When a subtransaction commits
(e.g. when alter_sequence() completes), it essentially becomes part of
the parent. And AtEOSubXact_cleanup() updates rd_newRelfilelocatorSubid
accordingly, setting it to parentSubid.

This also means that SubTransactionGetXid() can't actually fail, because
the ID has to reference an active subtransaction in the current stack.
I'm still concerned about the cost of the lookup, because the list may
be long and the subxact we're looking for may be quite high, but I guess
we might have another field, caching the XID. It'd need to be updated
only in AtEOSubXact_cleanup, and at that point we know it's the
immediate parent, so it'd be pretty cheap I think.
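
For illustration, such a lookup could be sketched like this (hypothetical:
TransactionState is private to xact.c, and the real patch may well
structure it differently):

static TransactionId
SubTransactionGetXid(SubTransactionId subid)
{
	TransactionState s;

	for (s = CurrentTransactionState; s != NULL; s = s->parent)
	{
		if (s->subTransactionId == subid)
			return XidFromFullTransactionId(s->fullTransactionId);
	}

	elog(ERROR, "subtransaction %u is not on the active stack", subid);
	return InvalidTransactionId;	/* keep compiler quiet */
}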


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Partial aggregates pushdown

2023-11-28 Thread Stephen Frost
Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Mon, Nov 27, 2023 at 4:23 PM Tom Lane  wrote:
> > Well, one of the founding principles of postgres_fdw was to be able
> > to talk to PG servers that are not of the same version as yours.
> > If we break that in the name of performance, we are going to have
> > a lot of unhappy users.  Even the ones who do get the benefit of
> > the speedup are going to be unhappy when it breaks because they
> > didn't upgrade local and remote at exactly the same time.
> 
> I agree with this.

+1.  We do want to continue to make this work- to the extent possible.
I don't think there's any problem with saying that when talking to an
older server, you don't get the same capabilities as you do when talking
to a newer server.

> > Just because we'd like to have it doesn't make the patch workable
> > in the real world.
> 
> And also with this in concept - I'd like to plan arbitrarily
> complicated queries perfectly and near-instantly, and then execute
> them at faster-than-light speed, but we can't. However, I don't
> understand the fatalism with respect to the feature at hand. As I said
> before, it's not like no other product has made this work. Sure, some
> of those products may not have the extensible system of data types
> that we do, or may not care about cross-version communication, but
> those don't seem like good enough reasons to just immediately give up.

Certainly there are other projects out there which are based on PG that
have managed to make this work and work really quite well.

> TBH, I suspect even some PG forks have made this work, like maybe PGXC
> or PGXL, although I don't know for certain. We might not like the
> trade-offs they made to get there, but we haven't even talked through
> possible design ideas yet, so it seems way too early to give up.

Yes, Citus[1] and Greenplum[2], to just name two.

I certainly understand the concern around the security of this and would
have thought the approach we'd use would be to not just take internal
state and pass it along but rather to provide a way for aggregates to
opt-in to supporting this and have them serialize/deserialize with
new dedicated functions that have appropriate checks to avoid bad things
happening.  That could also be versioned, perhaps, if we feel that's
necessary (I'm a bit skeptical, but it would hopefully address the
concern about different versions having different data that they want to
pass along).
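
As a sketch of what such an opt-in export function could look like (every
name here is hypothetical, and a leading version tag is only one way to
make the format self-describing):

#define AGG_EXPORT_VERSION 1	/* bump whenever the wire format changes */

static bytea *
avg_export_state(int64 count, int64 sum)
{
	StringInfoData buf;

	pq_begintypsend(&buf);
	pq_sendint32(&buf, AGG_EXPORT_VERSION);	/* importer rejects unknown versions */
	pq_sendint64(&buf, count);
	pq_sendint64(&buf, sum);
	return pq_endtypsend(&buf);
}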

> One of the things that I think is a problem in this area is that the
> ways we have to configure FDW connections are just not very rich.

Agreed.

> We're trying to cram everything into a set of strings that can be
> attached to the foreign server or the user mapping, but that's not a
> very good fit for something like how all the local SQL functions that
> might exist map onto all of the remote SQL functions that might exist.
> Now you might well say that we don't want the act of configuring a
> foreign data wrapper to be insanely complicated, and I would agree
> with that. But, on the other hand, as Larry Wall once said, a good
> programming language makes simple things simple and complicated things
> possible. I think our current configuration system is only
> accomplishing the first of those goals.

We've already got issues in this area with extensions- there's no way
for a user to say what version of an extension exists on the remote side
and no way for an extension to do anything different based on that
information.  Perhaps we could work on a solution to both of these
issues, but at the least I don't see holding back on this effort for a
problem that already exists but which we've happily accepted because of
the benefit it provides, like being able to push-down postgis bounding
box conditionals to allow for indexed lookups.

Thanks,

Stephen

[1]: https://docs.citusdata.com/en/v11.1/develop/reference_sql.html
[2]: 
https://postgresconf.org/conferences/Beijing/program/proposals/implementation-of-distributed-aggregation-in-greenplum




RE: [PoC] pg_upgrade: allow to upgrade publisher node

2023-11-28 Thread Hayato Kuroda (Fujitsu)
Dear Sawada-san,

> On Tue, Nov 28, 2023 at 7:04 PM Hayato Kuroda (Fujitsu)
>  wrote:
> >
> >
> > Yeah, we followed binary_upgrade_create_empty_extension(). Also, we set as
> > un-strict to keep a caller function simpler.
> >
> > Currently get_old_cluster_logical_slot_infos() executes a query and it 
> > contains
> > binary_upgrade_logical_slot_has_caught_up(). In pg_upgrade layer, we
> > assumed
> > either true or false is returned.
> >
> > But if proisstrict is changed true, we must handle the case when NULL is
> > returned.
> > It is small but backseat operation.
> 
> Which cases are you concerned pg_upgrade could pass NULL to
> binary_upgrade_logical_slot_has_caught_up()?

Actually, we do not expect NULL to be passed in. IIUC all slots have a
slot_name, and the subquery uses that name. But will that hold forever? I
think we can avoid any risk.

> I've not tested it yet but even if it returns NULL, perhaps
> get_old_cluster_logical_slot_infos() would still set curr->caught_up
> to false, no?

Hmm. I checked the C99 specification [1] for strcmp, but it does not define
the case where NULL is passed in. So it depends on the implementation.

[1]: https://www.dii.uchile.cl/~daespino/files/Iso_C_1999_definition.pdf
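
For example, if we wanted pg_upgrade to be defensive regardless, the fetch
could look something like this (the column index name is hypothetical):

/* in get_old_cluster_logical_slot_infos(), after running the query */
if (PQgetisnull(res, slotnum, i_caught_up))
	curr->caught_up = false;	/* treat an unexpected NULL as not caught up */
else
	curr->caught_up = (strcmp(PQgetvalue(res, slotnum, i_caught_up), "t") == 0);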

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



Re: Missing docs on AT TIME ZONE precedence?

2023-11-28 Thread Alvaro Herrera
On 2023-Nov-27, Tom Lane wrote:

> I don't like the existing coding for more reasons than just
> underdocumentation.  Global assignment of precedence is a really,
> really dangerous tool for solving ambiguous-grammar problems, because
> it can mask problems unrelated to the one you think you are solving:
> basically, it eliminates bison's complaints about grammar ambiguities
> related to the token you mark.  (Commits 12b716457 and 28a61fc6c are
> relevant here.)  Attaching precedence to individual productions is
> far safer, because it won't have any effect that extends beyond that
> production.  You still need a precedence attached to the lookahead
> token; but I think we should try very hard to not assign a precedence
> different from IDENT's to any unreserved keywords.

Ooh, this is very useful, thank you.

> After a bit of fooling around I found a patch that seems to meet
> that criterion; attached.

It looks good and passes tests, including the ecpg ones.

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/
"Entristecido, Wutra (canción de Las Barreras)
echa a Freyr a rodar
y a nosotros al mar"




Re: Streaming I/O, vectored I/O (WIP)

2023-11-28 Thread Heikki Linnakangas

On 28/11/2023 14:17, Thomas Munro wrote:

On Thu, Sep 28, 2023 at 7:33 AM Heikki Linnakangas  wrote:

in streaming_read.h:


typedef bool (*PgStreamingReadBufferDetermineNextCB) (PgStreamingRead *pgsr,
                                                      uintptr_t pgsr_private,
                                                      void *per_io_private,
                                                      BufferManagerRelation *bmr,
                                                      ForkNumber *forkNum,
                                                      BlockNumber *blockNum,
                                                      ReadBufferMode *mode);


I was surprised that 'bmr', 'forkNum' and 'mode' are given separately on
each read. I see that you used that in the WAL replay prefetching, so I
guess that makes sense.


In this version I have introduced an alternative simple callback.
It's approximately what we had already tried out in an earlier version
before I started streamifying recovery, but in this version you can
choose, so recovery can opt for the wider callback.


Ok. Two APIs is a bit redundant, but because most callers would prefer 
the simpler API, that's probably a good tradeoff.
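
For comparison, the simple variant could be as narrow as something like
this (shape assumed for illustration: only the next block number is
produced, with the relation, fork and mode fixed at stream creation time):

typedef BlockNumber (*PgStreamingReadSimpleCB) (PgStreamingRead *pgsr,
                                                uintptr_t pgsr_private,
                                                void *per_io_private);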



I've added some ramp-up logic.  The idea is that after we streamify
everything in sight, we don't want to penalise users that don't really
need more than one or two blocks, but don't know that yet.  Here is
how the system calls look when you do pg_prewarm():

pread64(32, ..., 8192, 0) = 8192 <--- start with just one block
pread64(32, ..., 16384, 8192) = 16384
pread64(32, ..., 32768, 24576) = 32768
pread64(32, ..., 65536, 57344) = 65536
pread64(32, ..., 131072, 122880) = 131072    <--- soon reading 16 blocks at a time
pread64(32, ..., 131072, 253952) = 131072
pread64(32, ..., 131072, 385024) = 131072



I guess it could be done in quite a few different ways and I'm open to
better ideas.  This way inserts prefetching stalls but ramps up
quickly and is soon out of the way.  I wonder if we would want to make
that a policy that a caller can disable, if you want to skip the
ramp-up and go straight for the largest possible I/O size?  Then I
think we'd need a 'flags' argument to the streaming read constructor
functions.


I think a 'flags' argument and a way to opt-out of the slow start would 
make sense. pg_prewarm in particular knows that it will read the whole 
relation.
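
A tiny sketch of what the ramp-up policy plus such a flag could look like
(the flag name is invented for the example):

/* ramp-up: double the combined I/O size until the cap is reached */
static int
next_read_size(int current_blocks, int max_blocks, int flags)
{
	if (flags & PGSR_FLAG_FULL_SPEED)	/* e.g. pg_prewarm: reads everything */
		return max_blocks;

	return Min(current_blocks * 2, max_blocks);
}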



A small detour:  While contemplating how this interacts with parallel
sequential scan, which also has a notion of ramping up, I noticed
another problem.  One parallel seq scan process does this:

fadvise64(32, 35127296, 131072, POSIX_FADV_WILLNEED) = 0
preadv(32, [...], 2, 35127296) = 131072
preadv(32, [...], 2, 35258368) = 131072
fadvise64(32, 36175872, 131072, POSIX_FADV_WILLNEED) = 0
preadv(32, [...], 2, 36175872) = 131072
preadv(32, [...], 2, 36306944) = 131072
...

We don't really want those fadvise() calls.  We don't get them with
parallelism disabled, because streaming_read.c is careful not to
generate advice for sequential workloads based on ancient wisdom from
this mailing list, re-confirmed on recent Linux: WILLNEED hints
actually get in the way of Linux's own prefetching and slow you down,
so we only want them for truly random access.  But the logic can't see
that another process is making holes in this process's sequence.


Hmm, aside from making the sequential pattern invisible to this process, 
are we defeating Linux's logic too, just by performing the reads from 
multiple processes? The processes might issue the reads to the kernel 
out-of-order.


How bad is the slowdown when you issue WILLNEED hints on sequential access?


The two obvious solutions are (1) pass in a flag at the start saying
"I promise this is sequential even if it doesn't look like it, no
hints please" and (2) invent "shared" (cross-process) streaming
reads, and teach all the parallel seq scan processes to get their
buffers from there.

Idea (2) is interesting to think about but even if it is a useful idea
(not sure) it is certainly overkill just to solve this little problem
for now.  So perhaps I should implement (1), which would be another
reason to add a flags argument.  It's not a perfect solution though
because some more 'data driven' parallel scans (indexes, bitmaps, ...)
have a similar problem that is less amenable to top-down kludgery.


(1) seems fine to me.

--
Heikki Linnakangas
Neon (https://neon.tech)





[PATCH] LockAcquireExtended improvement

2023-11-28 Thread Jingxian Li
Hi hackers,
 
I found a problem when doing the test shown below:
 
Time  Session A                                                  Session B
---------------------------------------------------------------------------------------------------------------
T1    postgres=# create table test(a int);
      CREATE TABLE
      postgres=# insert into test values (1);
      INSERT 0 1

T2    postgres=# begin;
      BEGIN
      postgres=*# lock table test in access exclusive mode;
      LOCK TABLE

T3                                                               postgres=# begin;
                                                                 BEGIN
                                                                 postgres=*# lock table test in exclusive mode;

T4    Case 1:
      postgres=*# lock table test in share row exclusive mode nowait;
      ERROR: could not obtain lock on relation "test"

      Case 2:
      postgres=*# lock table test in share row exclusive mode;
      LOCK TABLE

At the T4 moment in session A, (case 1) when executing the SQL "lock table
test in share row exclusive mode nowait;", an error occurs with the message
'could not obtain lock on relation "test"'; however, (case 2) when executing
the SQL above without nowait, the lock can be obtained successfully.

Digging into the source code, I find that in case 2 the lock was obtained in
the function ProcSleep instead of LockAcquireExtended. Because the nowait
logic is processed before WaitOnLock->ProcSleep, acquiring the lock failed in
case 1. Can any changes be made so that such a lock is granted before
WaitOnLock?
 

 
Providing a more universal case:
 
Transaction A already holds an n-mode lock on table test. If transaction A
then requests an m-mode lock on table test, m and n satisfy the following
constraint:

(lockMethodTable->conflictTab[n] & lockMethodTable->conflictTab[m])
== lockMethodTable->conflictTab[m]
 
Obviously, in this case, m<=n.
 
Should the m-mode lock be granted before WaitOnLock?
 

 
In the case of m=n (i.e. we already hold the lock), the m-mode lock is
immediately granted in the LocalLock path, without the need for a lock
conflict check.

Based on the facts above, can we obtain a weaker lock (m<n) on the same
object within the same transaction without doing a lock conflict check?
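
For reference, the "weaker lock" condition above boils down to a bitmask
subset test along these lines (a sketch, not the actual patch):

static bool
conflicts_covered_by_held(LockMethod lockMethodTable,
						  LOCKMODE held, LOCKMODE requested)
{
	LOCKMASK	heldMask = lockMethodTable->conflictTab[held];
	LOCKMASK	reqMask = lockMethodTable->conflictTab[requested];

	/*
	 * Everything that would conflict with the requested mode already
	 * conflicts with the mode we hold, so no new conflicts are possible.
	 */
	return (heldMask & reqMask) == reqMask;
}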

v1-0001-LockAcquireExtended-improvement.patch
Description: Binary data


Re: POC, WIP: OR-clause support for indexes

2023-11-28 Thread Robert Haas
On Mon, Nov 27, 2023 at 8:08 PM Peter Geoghegan  wrote:
> Maybe it should be the responsibility of some other phase of query
> processing, invented solely to make life easier for the optimizer, but
> not formally part of query planning per se.

I don't really see why that would be useful. Adding more stages to the
query pipeline adds cognitive burden for which there must be some
corresponding benefit. Even if this happened very early in query
planning as a completely separate pass over the query tree, that would
minimize the need for code changes outside the optimizer to need to
care about it. But I suspect that this shouldn't happen very early in
query planning as a completely separate pass, but someplace later
where it can be done together with other useful optimizations (e.g.
eval_const_expressions, or even path construction).

> > The right place to do
> > optimization is in the optimizer.
>
> Then why doesn't the optimizer do query rewriting? Isn't that also a
> kind of optimization, at least in part?

I mean, I think rewriting mostly means applying rules.

> ISTM that the real problem is that this is true in the first place. If
> the optimizer had only one representation for any two semantically
> equivalent spellings of the same qual, then it would always use the
> best available representation. That seems even smarter, because that
> way the planner can be dumb and still look fairly smart at runtime.

Sure, well, that's another way of attacking the problem, but the
in-array representation is more convenient to loop over than the
or-clause representation, so if you get to a point where looping over
all the values is a thing you want to do, you're going to want
something that looks like that. If I just care about the fact that the
values I'm looking for are 3, 4, and 6, I want someone to hand me 3,
4, and 6, not x = 3, x = 4, and x = 6, and then I have to skip over
the x = part each time.
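
Concretely, once the values live in an array constant (as for x = ANY(...)),
iterating over them uses existing infrastructure; a sketch assuming int4
elements, where arrayconst stands for the Const node holding the array:

Datum	   *values;
bool	   *nulls;
int			nvalues;

deconstruct_array(DatumGetArrayTypeP(arrayconst->constvalue),
				  INT4OID, sizeof(int32), true, TYPALIGN_INT,
				  &values, &nulls, &nvalues);

for (int i = 0; i < nvalues; i++)
{
	int32		val = DatumGetInt32(values[i]);	/* 3, 4, 6, ... */

	/* ... use val directly, no "x =" to skip over ... */
}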

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Partial aggregates pushdown

2023-11-28 Thread Robert Haas
On Tue, Nov 28, 2023 at 5:24 AM Ashutosh Bapat
 wrote:
> If my memory serves me right, PGXC implemented partial aggregation
> only when the output of partial aggregate was a SQL data type
> (non-Internal, non-Unknown). But I may be wrong. But at that time,
> JSONB wasn't there or wasn't that widespread.
>
> Problem with Internal is it's just a binary string whose content can
> change across version and which can be interpreted differently across
> different versions. There is no metadata in it to know how to
> interpret it. We can add that metadata to JSONB. The result of partial
> aggregate can be sent as a JSONB. If the local server finds the JSONB
> familiar it will construct the right partial aggregate value otherwise
> it will throw an error. If there's a way to even avoid that error (by
> looking at server version etc.) the error can be avoided too. But
> JSONB leaves very very less chance that the value will be interpreted
> wrong. Downside is we are tying PARTIAL's output to be JSONB thus
> tying SQL syntax with a data type.
>
> Does that look acceptable?

If somebody had gone to the trouble of making this work, and had done
a good job, I wouldn't vote against it, but in a vacuum, I'm not sure
it's the best design. The problem in my view is that working with JSON
is not actually very pleasant. It's super-easy to generate, and
super-easy for humans to read. But parsing and validating it is a
pain. You basically have to have two parsers, one to do syntactical
validation and then a second one to ensure that the structure of the
document and the contents of each item are as expected. See
parse_manifest.c for an example of what I mean by that. Now, if we add
new code, it can reuse the JSON parser we've already got, so it's not
that you need to write a new JSON parser for every new application of
JSON, but the semantic validator (a la parse_manifest.c) isn't
necessarily any less code than a whole new parser for a bespoke
format.

To make that a bit more concrete, for something like string_agg(), is
it easier to write a validator for the existing deserialization
function that accepts a bytea blob, or to write a validator for a JSON
blob that we could be passing instead? My suspicion is that the former
is less work and easier to verify, but it's possible I'm wrong about
that and they're more or less equal. I don't really see any way that
the JSON thing is straight-up better; at best it's a toss-up in terms
of amount of code. Now somebody could still make an argument that they
would like JSON better because it would be more useful for some
purpose other than this feature, and that is fine, but here I'm just
thinking about this feature in particular.

My personal suspicion is that the easiest way to support internal-type
aggregates here is to convert them to use an array or record type as a
transition state instead, or maybe serialize the internal state to one
of those things instead of to bytea. I suspect that would allow us to
leverage more of our existing validation infrastructure than using
JSON or sticking with bytea. But I'm certainly amenable to other
points of view. I'm not trying to pretend that my gut feeling is
necessarily correct; I'm just explaining what I currently think.
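
To illustrate that direction, here is a minimal sketch (all names
hypothetical) of an avg-style aggregate whose transition state is a
plain SQL array rather than "internal", so the partial state is an
ordinary value that could cross a postgres_fdw boundary:

CREATE FUNCTION my_avg_trans(state bigint[], val int) RETURNS bigint[]
    LANGUAGE sql IMMUTABLE AS
    $$ SELECT ARRAY[state[1] + val, state[2] + 1] $$;

CREATE FUNCTION my_avg_combine(a bigint[], b bigint[]) RETURNS bigint[]
    LANGUAGE sql IMMUTABLE AS
    $$ SELECT ARRAY[a[1] + b[1], a[2] + b[2]] $$;

CREATE FUNCTION my_avg_final(state bigint[]) RETURNS numeric
    LANGUAGE sql IMMUTABLE AS
    $$ SELECT state[1]::numeric / nullif(state[2], 0) $$;

CREATE AGGREGATE my_avg(int) (
    sfunc       = my_avg_trans,      -- per-row transition
    stype       = bigint[],          -- (sum, count), exportable as-is
    initcond    = '{0,0}',
    combinefunc = my_avg_combine,    -- merges two partial states
    finalfunc   = my_avg_final
);

Because the state is just a bigint[], a remote PARTIAL aggregate could
return it like any other value, and the existing input and validation
machinery for the type applies.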

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Streaming I/O, vectored I/O (WIP)

2023-11-28 Thread Heikki Linnakangas

On 28/11/2023 14:17, Thomas Munro wrote:

On Thu, Sep 28, 2023 at 7:33 AM Heikki Linnakangas  wrote:

+ /* Avoid a slightly more expensive kernel call if there is no benefit. */
+ if (iovcnt == 1)
+ returnCode = pg_pread(vfdP->fd,
+   iov[0].iov_base,
+   iov[0].iov_len,
+   offset);
+ else
+ returnCode = pg_preadv(vfdP->fd, iov, iovcnt, offset);


How about pushing down this optimization to pg_preadv() itself?
pg_preadv() is currently just a macro if the system provides preadv(),
but it could be a "static inline" that does the above dance. I think
that optimization is platform-dependent anyway; pread() might not be any
faster on some OSes. In particular, if the system doesn't provide
preadv() and we use the implementation in src/port/preadv.c, it's the
same kernel call anyway.


Done.  I like it, I just feel a bit bad about moving the p*v()
replacement functions around a couple of times already!  I figured it
might as well be static inline even if we use the fallback (= Solaris
and Windows).


LGTM. I think this 0001 patch is ready for commit, independently of the 
rest of the patches.
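
(For readers following along, the shape of the pushed-down fast path is
roughly the following sketch; pg_preadv_fallback here is a stand-in for
the replacement implementation in src/port, not an actual symbol:)

#include <sys/uio.h>            /* struct iovec, preadv() */

static inline ssize_t
pg_preadv(int fd, const struct iovec *iov, int iovcnt, off_t offset)
{
    /* Avoid the slightly more expensive vectored call when there is
     * no benefit; one iovec is equivalent to a plain pread(). */
    if (iovcnt == 1)
        return pg_pread(fd, iov[0].iov_base, iov[0].iov_len, offset);
#ifdef HAVE_PREADV
    return preadv(fd, iov, iovcnt, offset);
#else
    return pg_preadv_fallback(fd, iov, iovcnt, offset);
#endif
}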


In v2-0002-Provide-vectored-variants-of-FileRead-and-FileWri-1.patch, fd.h:


+/* Filename components */
+#define PG_TEMP_FILES_DIR "pgsql_tmp"
+#define PG_TEMP_FILE_PREFIX "pgsql_tmp"
+


These seem out of place, we already have them in common/file_utils.h. 
Other than that, 
v2-0002-Provide-vectored-variants-of-FileRead-and-FileWri-1.patch and 
v2-0003-Provide-vectored-variants-of-smgrread-and-smgrwri.patch look 
good to me.


--
Heikki Linnakangas
Neon (https://neon.tech)





Re: [PoC] pg_upgrade: allow to upgrade publisher node

2023-11-28 Thread Masahiko Sawada
On Tue, Nov 28, 2023 at 6:50 PM Amit Kapila  wrote:
>
> On Tue, Nov 28, 2023 at 1:32 PM Bharath Rupireddy
>  wrote:
> >
> > On Tue, Nov 28, 2023 at 11:06 AM Masahiko Sawada  
> > wrote:
> > >
> > > One month has already been passed since this main patch got committed
> > > but reading this change, I have some questions on new
> > > binary_upgrade_logical_slot_has_caught_up() function:
> > >
> > > Is there any reason why this function can be executed only in binary
> > > upgrade mode? It seems to me that other functions in
> > > pg_upgrade_support.c must be called only in binary upgrade mode
> > > because it does some hacky changes internally. On the other hand,
> > > binary_upgrade_logical_slot_has_caught_up() just calls
> > > LogicalReplicationSlotHasPendingWal(), which doesn't change anything
> > > internally. If we make this function usable in normal mode, the user
> > > would be able to check each slot's upgradability without pg_upgrade
> > > --check command (or without stopping the server if the user can ensure
> > > no more meaningful WAL records are generated).
> >
> > It may happen that such a user-facing function tells there's no
> > unconsumed WAL, but later on the WAL gets generated during pg_upgrade.
> > Therefore, the information the function gives turns out to be
> > incorrect. I don't see a real-world use-case for such a function right
> > now. If there's one, it's not a big change to turn it into a
> > user-facing function.
> >
>
> Yeah, as of now, I don't see a use case for it and in fact, it could
> lead to unpredictable results. Immediately after calling the function,
> there could be more activity on the server which could make the
> results incorrect. I think to check the slot's upgradeability, one can
> rely on the results of the pg_upgrade --check functionality.

Fair point.

This function is already a user-executable function, as it's in
pg_catalog, but it is restricted to being executed only in binary
upgrade mode even though it doesn't change anything internally. So it
wasn't clear to me why we put such a restriction in place.

>
> > > ---
> > > Also, the function checks if the user has the REPLICATION privilege
> > > but I think that only superuser can connect to the server in binary
> > > upgrade mode in the first place.
> >
> > If that were true, I don't see a problem in having
> > CheckSlotPermissions() there, in fact it can act as an assertion.
> >
>
> I think we can change it to assertion or may elog(ERROR, ...) with a
> comment as to why we don't expect this can happen.

+1 for an assertion, to match other checks in the function.

>
> > > ---
> > > The following error message doesn't match the function name:
> > >
> > > /* We must check before dereferencing the argument */
> > > if (PG_ARGISNULL(0))
> > > elog(ERROR, "null argument to
> > > binary_upgrade_validate_wal_records is not allowed");
> > >
>
> This should be fixed.
>
> > > ---
> > > { oid => '8046', descr => 'for use by pg_upgrade',
> > >   proname => 'binary_upgrade_logical_slot_has_caught_up', proisstrict => 
> > > 'f',
> > >   provolatile => 'v', proparallel => 'u', prorettype => 'bool',
> > >   proargtypes => 'name',
> > >   prosrc => 'binary_upgrade_logical_slot_has_caught_up' },
> > >
> > > The function is not a strict function but we check in the function if
> > > the passed argument is not null. I think it would be clearer to make
> > > it a strict function.
> >
> > I think it has been done that way similar to
> > binary_upgrade_create_empty_extension().

binary_upgrade_create_empty_extension() needs to be a non-strict
function since it needs to accept NULL in some arguments such as
extConfig. On the other hand,
binary_upgrade_logical_slot_has_caught_up() doesn't handle NULL and
it's conventional to make such a function a strict function.
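
Concretely, the suggestion amounts to flipping the flag in the catalog
entry quoted earlier and deleting the manual NULL check (a sketch; since
't' is the default for proisstrict, the attribute could also simply be
dropped):

{ oid => '8046', descr => 'for use by pg_upgrade',
  proname => 'binary_upgrade_logical_slot_has_caught_up', proisstrict => 't',
  provolatile => 'v', proparallel => 'u', prorettype => 'bool',
  proargtypes => 'name',
  prosrc => 'binary_upgrade_logical_slot_has_caught_up' },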

> >
> > > ---
> > > LogicalReplicationSlotHasPendingWal() is defined in logical.c but I
> > > guess it's more suitable to be in slotfunc.s where similar functions
> > > such as pg_logical_replication_slot_advance() is also defined.
> >
> > Why not in logicalfuncs.c?
> >
>
> I am not sure if either of those is better than logical.c. IIRC, I
> thought it was okay to keep in logical.c as others primarily deal with
> exposed SQL functions and I felt it somewhat matches with the intent
> of logical.c ("The goal is to encapsulate most of the internal
> complexity for consumers of logical decoding, so they can create and
> consume a changestream with a low amount of code..").

I see your point. To me it looks like the functions in logical.c are
APIs and internal functions to manage the logical decoding context and
replication slots (e.g., restart_lsn). On the other hand,
LogicalReplicationSlotHasPendingWal() seems to be a user of logical
decoding. But anyway, it seems that three hackers have different
opinions. So we can keep it as is unless someone has a good reason
to change it.

On Tue, Nov 28, 2023 at 7:04 PM Hayato Kuroda (Fujitsu)
 wrote:
>
>
> Yeah, we followed 

Re: Fix a wrong comment in setrefs.c

2023-11-28 Thread Heikki Linnakangas

On 03/11/2023 08:10, Richard Guo wrote:
On Tue, Sep 26, 2023 at 9:51 AM Richard Guo wrote:


On Tue, Sep 26, 2023 at 5:45 AM Tom Lane <t...@sss.pgh.pa.us> wrote:

I'm inclined to write the comment more like "Usually the equal()
check is redundant, but in setop plans it may not be, since
prepunion.c assigns ressortgroupref equal to the column resno
without regard to whether that matches the topmost level's
sortgrouprefs and without regard to whether any implicit coercions
are added in the setop tree.  We might have to clean that up
someday;
but for now, just ignore any false matches."


+1.  It explains the situation much more clearly and accurately.

To make it easier to review, I've updated the patch to be so.


Committed, thanks!

--
Heikki Linnakangas
Neon (https://neon.tech)





Re: ALTER COLUMN ... SET EXPRESSION to alter stored generated column's expression

2023-11-28 Thread Peter Eisentraut

On 23.11.23 15:13, Amul Sul wrote:

The exact sequencing of this seems to be tricky.  It's clear that we
need to do it earlier than at the end.  I also think it should be
strictly after AT_PASS_ALTER_TYPE so that the new expression can refer
to the new type of a column.  It should also be after AT_PASS_ADD_COL,
so that the new expression can refer to any newly added column.  But
then it's after AT_PASS_OLD_INDEX and AT_PASS_OLD_CONSTR, is that a
problem?

AT_PASS_ALTER_TYPE and AT_PASS_ADD_COL cannot be together; the ALTER TYPE
cannot see that column. I think we can adopt the same behaviour.


With your v5 patch, I see the following behavior:

create table t1 (a int, b int generated always as (a + 1) stored);
alter table t1 add column c int, alter column b set expression as (a + c);
ERROR:  42703: column "c" does not exist

I think intuitively, this ought to work.  Maybe just moving the new pass 
after AT_PASS_ADD_COL would do it.



While looking at this, I figured that converting the AT_PASS_* macros to 
an enum would make this code simpler and easier to follow.  For patches 
like yours you wouldn't have to renumber the whole list.  We could put 
this patch before yours if people agree with it.




I tried to reuse the code by borrowing it from ALTER TYPE; see if that
looks good to you.

But I have concerns with that code reuse: we drop and re-add the indexes
and constraints, which seems unnecessary for SET EXPRESSION, where the
column attributes stay the same. I don't know why ALTER TYPE does that
for indexes, since finish_heap_swap() does the reindexing anyway. We
could skip re-adding the index for SET EXPRESSION, which would be fine,
but we could not skip the re-addition of constraints, since rebuilding
constraints for checking might need a good amount of copied code,
especially for foreign key constraints.

Please have a look at the attached version: the 0001 patch does the code
refactoring, and 0002 is the implementation, using the newly refactored
code to re-add indexes and constraints for the validation. Tests are
added for the same.


This looks reasonable after a first reading.  (I have found that using
git diff --patience makes the 0001 patch look more readable.  Maybe
that's helpful.)

From eac97b6f2a081f327d0649d35a346db0b4bb9d99 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Tue, 28 Nov 2023 12:04:03 +0100
Subject: [PATCH] Turn AT_PASS_* macros into an enum

---
 src/backend/commands/tablecmds.c | 55 +---
 src/tools/pgindent/typedefs.list |  1 +
 2 files changed, 30 insertions(+), 26 deletions(-)

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 323d9bf870..7c675b834c 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -142,20 +142,24 @@ static List *on_commits = NIL;
  * a pass determined by subcommand type.
  */
 
-#define AT_PASS_UNSET           -1  /* UNSET will cause ERROR */
-#define AT_PASS_DROP            0   /* DROP (all flavors) */
-#define AT_PASS_ALTER_TYPE      1   /* ALTER COLUMN TYPE */
-#define AT_PASS_OLD_INDEX       2   /* re-add existing indexes */
-#define AT_PASS_OLD_CONSTR      3   /* re-add existing constraints */
-/* We could support a RENAME COLUMN pass here, but not currently used */
-#define AT_PASS_ADD_COL         4   /* ADD COLUMN */
-#define AT_PASS_ADD_CONSTR      5   /* ADD constraints (initial examination) */
-#define AT_PASS_COL_ATTRS       6   /* set column attributes, eg NOT NULL */
-#define AT_PASS_ADD_INDEXCONSTR 7   /* ADD index-based constraints */
-#define AT_PASS_ADD_INDEX       8   /* ADD indexes */
-#define AT_PASS_ADD_OTHERCONSTR 9   /* ADD other constraints, defaults */
-#define AT_PASS_MISC            10  /* other stuff */
-#define AT_NUM_PASSES           11
+typedef enum AlterTablePass
+{
+   AT_PASS_UNSET = -1,         /* UNSET will cause ERROR */
+   AT_PASS_DROP,               /* DROP (all flavors) */
+   AT_PASS_ALTER_TYPE,         /* ALTER COLUMN TYPE */
+   AT_PASS_OLD_INDEX,          /* re-add existing indexes */
+   AT_PASS_OLD_CONSTR,         /* re-add existing constraints */
+   /* We could support a RENAME COLUMN pass here, but not currently used */
+   AT_PASS_ADD_COL,            /* ADD COLUMN */
+   AT_PASS_ADD_CONSTR,         /* ADD constraints (initial examination) */
+   AT_PASS_COL_ATTRS,          /* set column attributes, eg NOT NULL */
+   AT_PASS_ADD_INDEXCONSTR,    /* ADD index-based constraints */
+   AT_PASS_ADD_INDEX,          /* ADD indexes */
+   AT_PASS_ADD_OTHERCONSTR,    /* ADD other constraints, defaults */
+   AT_PASS_MISC,               /* other stuff */
+} AlterTablePass;

Re: Synchronizing slots from primary to standby

2023-11-28 Thread Drouvot, Bertrand

Hi,

On 11/27/23 9:57 AM, Zhijie Hou (Fujitsu) wrote:

On Monday, November 27, 2023 4:51 PM shveta malik  
wrote:

Here is the updated version(v39_2) which include all the changes made in 0002.
Please use for review, and sorry for the confusion.



Thanks!

As far as v39_2-0001 is concerned:

"
Altering the failover option of the subscription is currently not
permitted. However, this restriction may be lifted in future versions.
"

Should we mention that we can alter the related replication slot?

+ 
+  The implementation of failover requires that replication
+  has successfully finished the initial table synchronization
+  phase. So even when failover is enabled for a
+  subscription, the internal failover state remains
+  temporarily pending until the initialization phase
+  completes. See column subfailoverstate
+  of pg_subscription
+  to know the actual failover state.
+ 

I think we have a corner case here. If one alters the replication slot on the
primary, then "subfailoverstate" is not updated accordingly on the subscriber.
Given the two remarks above, would it make sense to prevent altering a
replication slot associated with a subscription?

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: XID formatting and SLRU refactorings

2023-11-28 Thread Aleksander Alekseev
Hi,

>> > I think what's done in patch 0001 is just an extension of existing
>> > logic and moving it into separate function.
>>
>> That's right. I'm arguing that now is a good time to clean it up.
>>
>> I won't insist if Alexander prefers to commit it as it is, though. But
>> let's at least explain how this works in the comment, instead of the XXX.
>
> I agree with you that it would be good to add a comment instead of the
> XXX, and commit.

+1

One could argue that getting rid of short filenames entirely in the
long term (i.e. always long_segment_names == true) could be a better
strategy. Maybe it's not, but I believe this should be discussed
separately from the patchset in question.

-- 
Best regards,
Aleksander Alekseev




Re: POC, WIP: OR-clause support for indexes

2023-11-28 Thread Andrei Lepikhov

On 28/11/2023 04:03, Robert Haas wrote:

On Mon, Nov 27, 2023 at 3:02 AM Andrei Lepikhov
 wrote:

On 25/11/2023 08:23, Alexander Korotkov wrote:

I think patch certainly gets better in this aspect.  One thing I can't
understand is why do we use home-grown code for resolving
hash-collisions.  You can just define custom hash and match functions
in HASHCTL.  Even if we need to avoid repeated JumbleExpr calls, we
still can save pre-calculated hash value into hash entry and use
custom hash and match.  This doesn't imply us to write our own
collision-resolving code.


Thanks, it was an insightful suggestion.
I implemented it, and the code has become shorter (see attachment).
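
(For the archives, the shape of that suggestion is roughly the
following; the entry type and callback names are illustrative, not the
patch's actual identifiers:)

HASHCTL     ctl;
HTAB       *or_group_htab;

memset(&ctl, 0, sizeof(ctl));
ctl.keysize = sizeof(OrClauseGroupKey);
ctl.entrysize = sizeof(OrClauseGroupEntry);
ctl.hash = orclause_hash;       /* may reuse a cached JumbleExpr value */
ctl.match = orclause_match;     /* equal() on the stored expressions */
ctl.hcxt = CurrentMemoryContext;

or_group_htab = hash_create("OR clause groups", 64, &ctl,
                            HASH_ELEM | HASH_FUNCTION |
                            HASH_COMPARE | HASH_CONTEXT);

With that, dynahash owns collision resolution, and the custom match
callback keeps things correct even when two different expressions hash
to the same value.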


Neither the code comments nor the commit message really explain the
design idea here. That's unfortunate, principally because it makes
review difficult.


Yeah, it is still an issue. We will think about how to improve this; any 
suggestions are welcome.



I'm very skeptical about the idea of using JumbleExpr for any part of
this. It seems fairly expensive, and it might produce false matches.
If expensive is OK, then why not just use equal()? If it's not, then
this probably isn't really OK either. But in any case there should be
comments explaining why this strategy was chosen.


We used the equal() routine without hashing in earlier versions. Hashing
resolves issues with many different OR clauses. Is it expensive? Maybe,
but we assume this transformation should only be applied to sufficiently
simple expressions.



The use of op_mergejoinable() seems pretty random to me. Why should we
care about that? If somebody writes a<1 or a<2 or a<3 or a<4, you can
transform that to a < ANY (ARRAY[1,2,3,4]) if you want.

You are right. The only reason was to obtain a working patch to
benchmark and look for corner cases. We will rewrite that place but
still live with the equivalence operator.



The reader shouldn't be left to guess whether a rule like this was made for
reasons of correctness or for reasons of efficiency or something else.
Looking further, I see that the reason for this is likely that the
operator for the transformation result is constructed using
list_make1(makeString((char *) "=")), but trying to choose an operator
based on the operator name is, I think, pretty clearly unacceptable.


Yes, it was a big mistake. It is fixed in the new version (I guess).


I am extremely dubious about the use of select_common_type() here. Why
not do this only when the types already match exactly? Maybe the
concern is unknown literals, but perhaps that should be handled in
some other way. If you do this kind of thing, you need to justify why
it can't fail or produce wrong answers.


Perhaps. We implemented your approach in the next version; at least
then we can see the consequences.



Honestly, it seems very hard to avoid the conclusion that this
transformation is being done at too early a stage. Parse analysis is
not the time to try to do query optimization. I can't really believe
that there's a way to produce a committable patch along these lines.
Ideally, a transformation like this should be done after we know what
plan shape we're using (or considering using), so that we can make
cost-based decisions about whether to transform or not. But at the
very least it should happen somewhere in the planner. There's really
no justification for parse analysis rewriting the SQL that the user
entered.


Here, we assume that an array operation is generally better than many
ORs. As a result, it should be more effective to apply the OR->ANY
transformation in the parser (it is a relatively lightweight operation
there) and, as a second phase, decompose that array in the optimizer.
We implemented earlier prototypes in different places in the optimizer,
and I'm convinced that only this approach resolves the issues we found.

Does this approach look weird? Maybe. We can debate it in this thread.
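
In SQL terms, the transformation under discussion is simply:

-- before: a chain of ORs over the same column ...
SELECT * FROM t WHERE x = 1 OR x = 2 OR x = 42;
-- ... after: one scalar-array comparison the executor can loop over
SELECT * FROM t WHERE x = ANY (ARRAY[1, 2, 42]);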

--
regards,
Andrei Lepikhov
Postgres Professional
From a5f8eba522ff907f7a3fffb0635b61d38933e385 Mon Sep 17 00:00:00 2001
From: "Andrey V. Lepikhov" 
Date: Thu, 23 Nov 2023 16:00:13 +0700
Subject: [PATCH] Transform OR clause to ANY expressions.

Replace (X=N1) OR (X=N2) ... with X = ANY(N1, N2) at the preliminary stage of
optimization, when we are still working with an expression tree.
Sometimes this can lead to a suboptimal plan, but we think it is better to
have an array of elements instead of a lot of OR clauses. There is room for
further optimization in decomposing that array into more optimal parts.
---
 .../postgres_fdw/expected/postgres_fdw.out|   8 +-
 src/backend/nodes/queryjumblefuncs.c  |  30 ++
 src/backend/parser/parse_expr.c   | 289 +-
 src/backend/utils/misc/guc_tables.c   |  10 +
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/include/nodes/queryjumble.h   |   1 +
 src/include/parser/parse_expr.h   |   1 +
 src/test/regress/expected/create_index.out| 141 +++--
 src/test/regress/expected/guc.out |   3 +-
 

Re: logical decoding and replication of sequences, take 2

2023-11-28 Thread Amit Kapila
On Mon, Nov 27, 2023 at 11:45 PM Tomas Vondra
 wrote:
>
> I spent a bit of time looking at the proposed change, and unfortunately
> logging just the boolean flag does not work. A good example is this bit
> from a TAP test added by the patch for built-in replication (which was
> not included with the WIP patch):
>
>   BEGIN;
>   ALTER SEQUENCE s RESTART WITH 1000;
>   SAVEPOINT sp1;
>   INSERT INTO seq_test SELECT nextval('s') FROM generate_series(1,100);
>   ROLLBACK TO sp1;
>   COMMIT;
>
> This is expected to produce:
>
>   1131|0|t
>
> but produces
>
>   1000|0|f
>
> instead. The reason is very simple - as implemented, the patch simply
> checks if the relfilenode is from the same top-level transaction, which
> it is, and sets the flag to "true". So we know the sequence changes need
> to be queued and replayed as part of this transaction.
>
> But then during decoding, we still queue the changes into the subxact,
> which then aborts, and the changes are discarded. That is not how it's
> supposed to work, because the new relfilenode is still valid, someone
> might do nextval() and commit. And the nextval() may not get WAL-logged,
> so we'd lose this.
>
> What I guess we might do is log not just a boolean flag, but the XID of
> the subtransaction that created the relfilenode. And then during
> decoding we'd queue the changes into this subtransaction ...
>
> 0006 in the attached patch series does this, and it seems to fix the TAP
> test failure. I left it at the end, to make it easier to run tests
> without the patch applied.
>

Offhand, I don't have any better idea than what you have suggested for
the problem, but this needs some thought, including the questions you
asked. I'll spend some time on it and respond.

-- 
With Regards,
Amit Kapila.




Re: pg_upgrade and logical replication

2023-11-28 Thread vignesh C
On Mon, 27 Nov 2023 at 17:12, Amit Kapila  wrote:
>
> On Mon, Nov 27, 2023 at 3:18 PM vignesh C  wrote:
> >
> > On Sat, 25 Nov 2023 at 17:50, Amit Kapila  wrote:
> > >
> > > On Sat, Nov 25, 2023 at 7:21 AM vignesh C  wrote:
> > > >
> > >
> > > Few comments on v19:
> > > ==
> > > 1.
> > > +
> > > + The subscriptions will be migrated to the new cluster in a disabled 
> > > state.
> > > + After migration, do this:
> > > +
> > > +
> > > +
> > > + 
> > > +  
> > > +   Enable the subscriptions by executing
> > > +   ALTER
> > > SUBSCRIPTION ... ENABLE.
> > >
> > > The reason for this restriction is not very clear to me. Is it because
> > > we are using pg_dump for subscription and the existing functionality
> > > is doing it? If so, I think currently even connect is false.
> >
> > This was done this way so that the apply worker doesn't get started
> > while the upgrade is happening. Now that we have set
> > max_logical_replication_workers to 0, the apply workers will not get
> > started during the upgrade process. I think now we can create the
> > subscriptions with the same options as the old cluster in case of
> > upgrade.
> >
>
> Okay, but what is your plan to change it. Currently, we are relying on
> existing pg_dump code to dump subscriptions data, do you want to
> change that? There is a reason for the current behavior of pg_dump
> which as mentioned in docs is: "When dumping logical replication
> subscriptions, pg_dump will generate CREATE SUBSCRIPTION commands that
> use the connect = false option, so that restoring the subscription
> does not make remote connections for creating a replication slot or
> for initial table copy. That way, the dump can be restored without
> requiring network access to the remote servers. It is then up to the
> user to reactivate the subscriptions in a suitable way. If the
> involved hosts have changed, the connection information might have to
> be changed. It might also be appropriate to truncate the target tables
> before initiating a new full table copy."
>
> I guess one reason to not enable subscription after restore was that
> it can't work without origins, and also one can restore the dump in a
> totally different environment, and one may choose not to dump all the
> corresponding tables which I don't think is true for an upgrade. So,
> that could be one reason to do differently for upgrades. Do we see
> reasons similar to pg_dump/restore due to which after upgrade
> subscriptions may not work?

I felt that the behavior for upgrade can be slightly different from
that of dump, as the subscription relations and the replication origin
will be updated when the subscriber is upgraded. And as the logical
replication workers will not be started during the upgrade, we can
preserve the subscription's enabled status too. Just adding an
"ALTER SUBSCRIPTION sub-name ENABLE" for the subscriptions that were
enabled in the old cluster, as in the attached patch, should be fine
for the upgrade case. The behavior of dump is not changed; it is
retained as is.
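
In other words, for each subscription that was enabled in the old
cluster, the binary-upgrade dump would gain one extra statement, along
these lines (names hypothetical):

CREATE SUBSCRIPTION sub1 CONNECTION 'host=publisher port=5432'
    PUBLICATION pub1 WITH (connect = false);
ALTER SUBSCRIPTION sub1 ENABLE;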

Regards,
Vignesh
From 7bbce54014434f23ba1e30390bbb903ebf174134 Mon Sep 17 00:00:00 2001
From: Vignesh C 
Date: Tue, 28 Nov 2023 15:35:42 +0530
Subject: [PATCH v20 2/2] Retain the subscription oids during upgrade.

Retain the subscription oids during upgrade.
---
 src/backend/commands/subscriptioncmds.c   | 25 +--
 src/backend/utils/adt/pg_upgrade_support.c| 10 
 src/bin/pg_dump/pg_dump.c |  8 ++
 src/bin/pg_upgrade/t/004_subscription.pl  |  4 +++
 src/include/catalog/binary_upgrade.h  |  1 +
 src/include/catalog/pg_proc.dat   |  4 +++
 .../expected/spgist_name_ops.out  |  6 +++--
 7 files changed, 54 insertions(+), 4 deletions(-)

diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index edc82c11be..f839989208 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -75,6 +75,12 @@
 /* check if the 'val' has 'bits' set */
 #define IsSet(val, bits)  (((val) & (bits)) == (bits))
 
+/*
+ * This will be set by the pg_upgrade_support function --
+ * binary_upgrade_set_next_pg_subscription_oid().
+ */
+Oid			binary_upgrade_next_pg_subscription_oid = InvalidOid;
+
 /*
  * Structure to hold a bitmap representing the user-provided CREATE/ALTER
  * SUBSCRIPTION command options and the parsed/default values of each of them.
@@ -679,8 +685,23 @@ CreateSubscription(ParseState *pstate, CreateSubscriptionStmt *stmt,
 	memset(values, 0, sizeof(values));
 	memset(nulls, false, sizeof(nulls));
 
-	subid = GetNewOidWithIndex(rel, SubscriptionObjectIndexId,
-			   Anum_pg_subscription_oid);
+	/* Use binary-upgrade override for pg_subscription.oid? */
+	if (IsBinaryUpgrade)
+	{
+		if (!OidIsValid(binary_upgrade_next_pg_subscription_oid))
+			ereport(ERROR,
+	(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+	 

Re: XID formatting and SLRU refactorings

2023-11-28 Thread Pavel Borisov
On Tue, 28 Nov 2023 at 14:37, Heikki Linnakangas  wrote:

> On 28/11/2023 12:14, Pavel Borisov wrote:
> > On Tue, 28 Nov 2023 at 13:13, Heikki Linnakangas 
> wrote:
> >>
> >> On 27/11/2023 01:43, Alexander Korotkov wrote:
> >>> v61 looks good to me.  I'm going to push it as long as there are no
> objections.
> >> This was discussed earlier, but is still present in v61:
> >>
> >>> +/*
> >>> + * An internal function used by SlruScanDirectory().
> >>> + *
> >>> + * Returns true if a file with a name of a given length may be a
> correct
> >>> + * SLRU segment.
> >>> + */
> >>> +static inline bool
> >>> +SlruCorrectSegmentFilenameLength(SlruCtl ctl, size_t len)
> >>> +{
> >>> + if (ctl->long_segment_names)
> >>> + return (len == 15); /* see SlruFileName() */
> >>> + else
> >>> + /*
> >>> +  * Commit 638cf09e76d allowed 5-character lengths. Later
> commit
> >>> +  * 73c986adde5 allowed 6-character length.
> >>> +  *
> >>> +  * XXX should we still consider such names to be valid?
> >>> +  */
> >>> + return (len == 4 || len == 5 || len == 6);
> >>> +}
> >>> +
> >>
> >> I think it's pretty sloppy that the "short" filenames can be 4, 5 or 6
> >> chars long. For pg_multixact/members, which introduced the 5-char case,
> >> I think we should always pad the filenames 5 characters, and for
> >> commit_ts which introduced the 6 char case, always pad to 6 characters.
> >>
> >> Instead of a "long_segment_names" boolean, how about an integer field,
> >> to specify the length.
> >>
> >> That means that we'll need pg_upgrade to copy pg_multixact/members files
> >> under the new names. That should be pretty straightforward.
> >
> > I think what's done in patch 0001 is just an extension of the existing
> > logic, moved into a separate function.
>
> That's right. I'm arguing that now is a good time to clean it up.
>
> I won't insist if Alexander prefers to commit it as it is, though. But
> let's at least explain how this works in the comment, instead of the XXX.
>
I agree with you that it would be good to add a comment instead of the
XXX, and commit.

Pavel


Re: XID formatting and SLRU refactorings

2023-11-28 Thread Heikki Linnakangas

On 28/11/2023 12:14, Pavel Borisov wrote:

On Tue, 28 Nov 2023 at 13:13, Heikki Linnakangas  wrote:


On 27/11/2023 01:43, Alexander Korotkov wrote:

v61 looks good to me.  I'm going to push it as long as there are no objections.

This was discussed earlier, but is still present in v61:


+/*
+ * An internal function used by SlruScanDirectory().
+ *
+ * Returns true if a file with a name of a given length may be a correct
+ * SLRU segment.
+ */
+static inline bool
+SlruCorrectSegmentFilenameLength(SlruCtl ctl, size_t len)
+{
+ if (ctl->long_segment_names)
+ return (len == 15); /* see SlruFileName() */
+ else
+ /*
+  * Commit 638cf09e76d allowed 5-character lengths. Later commit
+  * 73c986adde5 allowed 6-character length.
+  *
+  * XXX should we still consider such names to be valid?
+  */
+ return (len == 4 || len == 5 || len == 6);
+}
+


I think it's pretty sloppy that the "short" filenames can be 4, 5 or 6
chars long. For pg_multixact/members, which introduced the 5-char case,
I think we should always pad the filenames 5 characters, and for
commit_ts which introduced the 6 char case, always pad to 6 characters.

Instead of a "long_segment_names" boolean, how about an integer field,
to specify the length.

That means that we'll need pg_upgrade to copy pg_multixact/members files
under the new names. That should be pretty straightforward.


I think what's done in patch 0001 is just an extension of the existing
logic, moved into a separate function.


That's right. I'm arguing that now is a good time to clean it up.

I won't insist if Alexander prefers to commit it as it is, though. But 
let's at least explain how this works in the comment, instead of the XXX.


--
Heikki Linnakangas
Neon (https://neon.tech)





Re: Table AM Interface Enhancements

2023-11-28 Thread Pavel Borisov
Hi, Alexander!

I think table AM extensibility is a very good idea generally, not only in
the scope of APIs that are needed in OrioleDB. Thanks for your proposals!

For patches

> 0001-Allow-locking-updated-tuples-in-tuple_update-and--v1.patch
> 0002-Add-EvalPlanQual-delete-returning-isolation-test-v1.patch
>
> The new isolation test is related to the previous patch.  These two
> patches were previously discussed in [2].


As the discussion in [2] seems close to the patches being committed, and
the only reason they are not in v16 yet is that it was too close to
feature freeze, I've copied the most recent versions of patches 0001 and
0002 from this thread into [2] to finish and commit them there.

I'm planning to review some of the other patches from the current patchset
soon.

[2].
https://www.postgresql.org/message-id/CAPpHfdua-YFw3XTprfutzGp28xXLigFtzNbuFY8yPhqeq6X5kg%40mail.gmail.com

Kind regards,
Pavel Borisov


Re: Partial aggregates pushdown

2023-11-28 Thread Ashutosh Bapat
On Tue, Nov 28, 2023 at 5:21 AM Robert Haas  wrote:
>
> TBH, I suspect even some PG forks have made this work, like maybe PGXC
> or PGXL, although I don't know for certain. We might not like the
> trade-offs they made to get there, but we haven't even talked through
> possible design ideas yet, so it seems way too early to give up.

If my memory serves me right, PGXC implemented partial aggregation
only when the output of the partial aggregate was a SQL data type
(non-Internal, non-Unknown). But I may be wrong. At that time, JSONB
didn't exist or wasn't that widespread.

The problem with Internal is that it's just a binary string whose
content can change across versions and which can be interpreted
differently by different versions. There is no metadata in it to say
how to interpret it. We can add that metadata with JSONB. The result of
a partial aggregate can be sent as JSONB. If the local server finds the
JSONB familiar, it will construct the right partial aggregate value;
otherwise it will throw an error. If there's a way to avoid even that
error (by looking at the server version etc.), the error can be avoided
too. JSONB leaves very little chance that the value will be interpreted
wrongly. The downside is that we are tying PARTIAL's output to JSONB,
thus tying SQL syntax to a data type.

Does that look acceptable?
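
To make the idea concrete, the partial state of an avg(int8) might
travel as something like this (a purely hypothetical shape, including
the metadata field names):

SELECT '{"agg": "avg(int8)",
         "server_version": 170000,
         "state": {"sum": 1234567, "count": 42}}'::jsonb;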

-- 
Best Wishes,
Ashutosh Bapat




Re: WIP: libpq: add a possibility to not send D(escribe) when executing a prepared statement

2023-11-28 Thread Ivan Trofimov

Hi Tom! Thank you for considering this.


> It adds a whole new set of programmer-error possibilities, and I doubt
> it saves enough in typical cases to justify creating such a foot-gun.

Although I agree it adds a considerable amount of complexity, I'd argue
it doesn't bring the complexity to a new level, since matching queries
against responses is a concept users of asynchronous processing are
already familiar with, especially so when pipelining is in play.


In the case of a single-row select this can easily save as much as half
of the network traffic, which is likely to be encrypted/decrypted through
multiple hops (a connection pooler, for example), and has to be
serialized/parsed on the server, the client, the pooler, etc.
For example, I have a service which bombards its Postgres database with
~20kRPS of "SELECT * FROM users WHERE id=$1", with "users" being a table
of just a bunch of textual ids, a couple of timestamps and some enums,
and for that service alone this change would save ~10 megabytes of
server-originated traffic per second; and I have hundreds of such
services at my workplace.


I can provide more elaborate network/CPU measurements of different 
workloads if needed.



> Instead, I'm tempted to suggest having PQprepare/PQexecPrepared
> maintain a cache that maps statement name to result tupdesc, so that
> this is all handled internally to libpq

From the perspective of someone who maintains a library built on top of
libpq and is familiar with other such libraries, I think this is much
easier done at the level above libpq, simply because there is more
control over when and how invalidation/eviction is done, and the level
above also has a more straightforward way to access the cache across
different asynchronous processing points.



> I just think that successful use of that option requires a client-
> side coding structure that allows tying a previously-obtained
> tuple descriptor to the current query with confidence. The proposed
> API fails badly at that, or at least leaves it up to the end-user
> programmer while providing no tools to help her get it right

I understand your concerns about the usability/safety of what I propose,
and I think I have an idea of how to make this much less of a foot-gun:
what if we add a new function

PGresult *
PQexecPreparedPredescribed(PGconn *conn,
                           const char *stmtName,
                           PGresult *description,
                           ...);

which requires both a prepared statement and its tuple descriptor (or
these two could even be tied together by a struct), and exposes its
implementation (basically what I've prototyped in the patch) in
libpq-int.h?
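
A usage sketch of that approach (PQexecPreparedPredescribed is, of
course, only the proposal here; PQdescribePrepared is existing libpq
API):

const char *paramValues[1] = {"42"};
PGresult   *desc;
PGresult   *res;

/* Fetch the tuple descriptor once, right after preparing. */
desc = PQdescribePrepared(conn, "get_user");

/* Execute any number of times, passing the cached descriptor so the
 * RowDescription round trip can be skipped. */
res = PQexecPreparedPredescribed(conn, "get_user", desc,
                                 1, paramValues, NULL, NULL, 0);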


This way users of the synchronous API get a nice thing too, one which is
arguably pretty hard to misuse: if the description isn't available
upfront, then there's no point in reaching for the new function, since
PQexecPrepared is strictly better performance- and usability-wise; and
if the description is available, it's most likely cached alongside the
statement.

If a user still manages to provide an erroneous description, well, they
either get a parsing error or the erroneous description back; I don't
see how libpq could misbehave badly here.

Exposing the implementation in the internal headers gives users a way
to juggle the actual foot-gun, but implies they know very well what
they are doing and are ready to be on their own.


What do you think of such an approach?




Re: Testing autovacuum wraparound (including failsafe)

2023-11-28 Thread Daniel Gustafsson
> On 28 Nov 2023, at 03:00, Masahiko Sawada  wrote:
> 
> On Mon, Nov 27, 2023 at 10:40 PM Daniel Gustafsson  wrote:
>> 
>>> On 27 Nov 2023, at 14:06, Masahiko Sawada  wrote:
>> 
>>> Is it true that we can modify the timeout after creating
>>> BackgroundPsql object? If so, it seems we don't need to introduce the
>>> new timeout argument. But how?
>> 
>> I can't remember if that's leftovers that incorrectly remains from an earlier
>> version of the BackgroundPsql work, or if it's a very bad explanation of
>> ->set_query_timer_restart().  The timeout will use the timeout_default value
>> and that cannot be overridden, it can only be reset per query.
> 
> Thank you for confirming this. I see there is the same problem also in
> interactive_psql(). So I've attached the 0001 patch to fix these
> documentation issues.

-A default timeout of $PostgreSQL::Test::Utils::timeout_default is set up,
-which can be modified later.
+A default timeout of $PostgreSQL::Test::Utils::timeout_default is set up.

Since it cannot be modified, I think we should just say "A timeout of .."
and not call it a default timeout.  This obviously only matters for the
backpatch since the sentence is removed in 0002.

> Which could be backpatched.

+1

>> With your patch the timeout still cannot be changed, but at least set during
>> start which seems good enough until there are tests warranting more 
>> complexity.
>> The docs should be corrected to reflect this in your patch.
> 
> I've incorporated the comments except for the following one and
> attached updated version of the rest patches:

LGTM.
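
(For readers following along, the resulting interface is roughly this
sketch; the timeout can be chosen at start, but not changed afterwards:)

my $psql = $node->background_psql(
    'postgres',
    timeout => $PostgreSQL::Test::Utils::timeout_default * 5);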

--
Daniel Gustafsson





Re: XID formatting and SLRU refactorings

2023-11-28 Thread Pavel Borisov
Hi, Heikki!

On Tue, 28 Nov 2023 at 13:13, Heikki Linnakangas  wrote:
>
> On 27/11/2023 01:43, Alexander Korotkov wrote:
> > v61 looks good to me.  I'm going to push it as long as there are no 
> > objections.
> This was discussed earlier, but is still present in v61:
>
> > +/*
> > + * An internal function used by SlruScanDirectory().
> > + *
> > + * Returns true if a file with a name of a given length may be a correct
> > + * SLRU segment.
> > + */
> > +static inline bool
> > +SlruCorrectSegmentFilenameLength(SlruCtl ctl, size_t len)
> > +{
> > + if (ctl->long_segment_names)
> > + return (len == 15); /* see SlruFileName() */
> > + else
> > + /*
> > +  * Commit 638cf09e76d allowed 5-character lengths. Later 
> > commit
> > +  * 73c986adde5 allowed 6-character length.
> > +  *
> > +  * XXX should we still consider such names to be valid?
> > +  */
> > + return (len == 4 || len == 5 || len == 6);
> > +}
> > +
>
> I think it's pretty sloppy that the "short" filenames can be 4, 5 or 6
> chars long. For pg_multixact/members, which introduced the 5-char case,
> I think we should always pad the filenames 5 characters, and for
> commit_ts which introduced the 6 char case, always pad to 6 characters.
>
> Instead of a "long_segment_names" boolean, how about an integer field,
> to specify the length.
>
> That means that we'll need pg_upgrade to copy pg_multixact/members files
> under the new names. That should be pretty straightforward.

I think what's done in patch 0001 is just an extension of the existing
logic, moved into a separate function.

- if ((len == 4 || len == 5 || len == 6) &&
+ if (SlruCorrectSegmentFilenameLength(ctl, len) &&
  strspn(clde->d_name, "0123456789ABCDEF") == len)
  {
- segno = (int) strtol(clde->d_name, NULL, 16);
+ segno = strtoi64(clde->d_name, NULL, 16);
  segpage = segno * SLRU_PAGES_PER_SEGMENT;

I'd prefer to leave it as it is, as part of the 64-bit extension patch.

Regards,
Pavel.




Re: proposal: change behavior on collation version mismatch

2023-11-28 Thread Daniel Verite
Jeremy Schneider wrote:

> 1) "collation changes are uncommon" (which is relatively correct)
> 2) "most users would rather have ease-of-use than 100% safety, since
> it's uncommon"
> 
> And I think this led to the current behavior of issuing a warning rather
> than an error,

There's a technical reason for this being a warning.
If it were an error, any attempt to do anything with the collation
would fail, which includes REINDEX on indexes using
that collation.
And yet that's precisely what you're supposed to do in that
situation.
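
That is, the standard remediation, which an error would itself block,
is along these lines (object names illustrative):

REINDEX INDEX my_text_idx;
ALTER COLLATION "en-x-icu" REFRESH VERSION;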


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




RE: [PoC] pg_upgrade: allow to upgrade publisher node

2023-11-28 Thread Hayato Kuroda (Fujitsu)
Dear Bharath, Sawada-san,

Welcome back!

> >
> > ---
> > { oid => '8046', descr => 'for use by pg_upgrade',
> >   proname => 'binary_upgrade_logical_slot_has_caught_up', proisstrict => 
> > 'f',
> >   provolatile => 'v', proparallel => 'u', prorettype => 'bool',
> >   proargtypes => 'name',
> >   prosrc => 'binary_upgrade_logical_slot_has_caught_up' },
> >
> > The function is not a strict function but we check in the function if
> > the passed argument is not null. I think it would be clearer to make
> > it a strict function.
> 
> I think it has been done that way similar to
> binary_upgrade_create_empty_extension().

Yeah, we followed binary_upgrade_create_empty_extension(). Also, we kept
it non-strict to keep the caller function simpler.

Currently get_old_cluster_logical_slot_infos() executes a query that
contains binary_upgrade_logical_slot_has_caught_up(). At the pg_upgrade
layer, we assumed either true or false is returned.

But if proisstrict is changed to true, we must handle the case where
NULL is returned. It is a small change, but a step backwards.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



Re: Synchronizing slots from primary to standby

2023-11-28 Thread shveta malik
On Fri, Nov 17, 2023 at 5:08 PM Amit Kapila  wrote:
>
> On Thu, Nov 16, 2023 at 5:34 PM shveta malik  wrote:
> >
> > PFA v35.
> >
>
> Review v35-0002*
> ==
> 1.
> As quoted in the commit message,
> >
> If a logical slot is invalidated on the primary, slot on the standby is also
> invalidated. If a logical slot on the primary is valid but is invalidated
> on the standby due to conflict (say required rows removed on the primary),
> then that slot is dropped and recreated on the standby in next sync-cycle.
> It is okay to recreate such slots as long as these are not consumable on the
> standby (which is the case currently).
> >
>
> I think this won't happen normally because of the physical slot and
> hot_standby_feedback but probably can occur in cases like if the user
> temporarily switches hot_standby_feedback from on to off. Are there
> any other reasons? I think we can mention the cases along with it as
> well at least for now. Additionally, I think this should be covered in
> code comments as well.
>
> 2.
>  #include "postgres.h"
> -
> +#include "access/genam.h"
>
> Spurious line removal.
>
> 3.
>A password needs to be provided too, if the sender demands password
>authentication.  It can be provided in the
>primary_conninfo string, or in a separate
> -  ~/.pgpass file on the standby server (use
> -  replication as the database name).
> -  Do not specify a database name in the
> -  primary_conninfo string.
> +  ~/.pgpass file on the standby server.
> + 
> + 
> +  Specify dbname in
> +  primary_conninfo string to allow synchronization
> +  of slots from the primary server to the standby server.
> +  This will only be used for slot synchronization. It is ignored
> +  for streaming.
>
> Is there a reason to remove part of the earlier sentence "use
> replication as the database name"?
>
> 4.
> +   enable_syncslot configuration
> parameter
> +  
> +  
> +  
> +   
> +It enables a physical standby to synchronize logical failover slots
> +from the primary server so that logical subscribers are not blocked
> +after failover.
> +   
> +   
> +It is enabled by default. This parameter can only be set in the
> +postgresql.conf file or on the server
> command line.
> +   
>
> I think you forgot to update the documentation for the default value
> of this variable.
>
> 5.
> + *   a) start the logical replication workers for every enabled subscription
> + *  when not in standby_mode
> + *   b) start the slot-sync worker for logical failover slots synchronization
> + *  from the primary server when in standby_mode.
>
> Either use a full stop after both lines or none of these.
>
> 6.
> +static void slotsync_worker_cleanup(SlotSyncWorkerInfo * worker);
>
> There shouldn't be space between * and the worker.
>
> 7.
> + if (!SlotSyncWorker->hdr.in_use)
> + {
> + LWLockRelease(SlotSyncWorkerLock);
> + ereport(ERROR,
> + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> + errmsg("replication slot-sync worker not initialized, "
> + "cannot attach")));
> + }
> +
> + if (SlotSyncWorker->hdr.proc)
> + {
> + LWLockRelease(SlotSyncWorkerLock);
> + ereport(ERROR,
> + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> + errmsg("replication slot-sync worker is "
> + "already running, cannot attach")));
> + }
>
> Using slot-sync in the error messages looks a bit odd to me. Can we
> use  "replication slot sync worker ..." in both these and other
> similar messages? I think it would be better if we don't split the
> messages into multiple lines in these cases as messages don't appear
> too long to me.
>
> 8.
> +/*
> + * Detach the worker from DSM and update 'proc' and 'in_use'.
> + * Logical replication launcher will come to know using these
> + * that the worker has shutdown.
> + */
> +void
> +slotsync_worker_detach(int code, Datum arg)
> +{
>
> I think the reference to DSM is leftover from the previous version of
> the patch. Can we change the above comments as per the new code?
>
> 9.
> +static bool
> +slotsync_worker_launch()
> {
> ...
> + /* TODO: do we really need 'generation', analyse more here */
> + worker->hdr.generation++;
>
> We should do something about this TODO. As per my understanding, we
> don't need a generation number for the slot sync worker as we have one
> such worker but I guess the patch requires it because we are using
> existing logical replication worker infrastructure. This brings the
> question of whether we really need a separate SlotSyncWorkerInfo or if
> we can use existing LogicalRepWorker and distinguish it with
> LogicalRepWorkerType? I guess you didn't use it because most of the
> fields in LogicalRepWorker will be unused for slot sync worker.
>
> 10.
> + * Can't use existing functions like 'get_database_oid' from dbcommands.c for
> + * validity purpose as they need db 

Re: [PATCH] psql: Add tab-complete for optional view parameters

2023-11-28 Thread Dean Rasheed
On Tue, 28 Nov 2023 at 03:42, Shubham Khanna
 wrote:
>
> I reviewed the given Patch and it is working fine.
>

Thanks for checking. Patch pushed.

Regards,
Dean




Re: [PoC] pg_upgrade: allow to upgrade publisher node

2023-11-28 Thread Amit Kapila
On Tue, Nov 28, 2023 at 1:32 PM Bharath Rupireddy
 wrote:
>
> On Tue, Nov 28, 2023 at 11:06 AM Masahiko Sawada  
> wrote:
> >
> > One month has already been passed since this main patch got committed
> > but reading this change, I have some questions on new
> > binary_upgrade_logical_slot_has_caught_up() function:
> >
> > Is there any reason why this function can be executed only in binary
> > upgrade mode? It seems to me that other functions in
> > pg_upgrade_support.c must be called only in binary upgrade mode
> > because it does some hacky changes internally. On the other hand,
> > binary_upgrade_logical_slot_has_caught_up() just calls
> > LogicalReplicationSlotHasPendingWal(), which doesn't change anything
> > internally. If we make this function usable in normal mode, the user
> > would be able to check each slot's upgradability without pg_upgrade
> > --check command (or without stopping the server if the user can ensure
> > no more meaningful WAL records are generated).
>
> It may happen that such a user-facing function tells there's no
> unconsumed WAL, but later on the WAL gets generated during pg_upgrade.
> Therefore, the information the function gives turns out to be
> incorrect. I don't see a real-world use-case for such a function right
> now. If there's one, it's not a big change to turn it into a
> user-facing function.
>

Yeah, as of now, I don't see a use case for it and in fact, it could
lead to unpredictable results. Immediately after calling the function,
there could be more activity on the server which could make the
results incorrect. I think to check the slot's upgradeability, one can
rely on the results of the pg_upgrade --check functionality.

> > ---
> > Also, the function checks if the user has the REPLICATION privilege
> > but I think that only superuser can connect to the server in binary
> > upgrade mode in the first place.
>
> If that were true, I don't see a problem in having
> CheckSlotPermissions() there, in fact it can act as an assertion.
>

I think we can change it to assertion or may elog(ERROR, ...) with a
comment as to why we don't expect this can happen.

> > ---
> > The following error message doesn't match the function name:
> >
> > /* We must check before dereferencing the argument */
> > if (PG_ARGISNULL(0))
> > elog(ERROR, "null argument to
> > binary_upgrade_validate_wal_records is not allowed");
> >

This should be fixed.

> > ---
> > { oid => '8046', descr => 'for use by pg_upgrade',
> >   proname => 'binary_upgrade_logical_slot_has_caught_up', proisstrict => 
> > 'f',
> >   provolatile => 'v', proparallel => 'u', prorettype => 'bool',
> >   proargtypes => 'name',
> >   prosrc => 'binary_upgrade_logical_slot_has_caught_up' },
> >
> > The function is not a strict function but we check in the function if
> > the passed argument is not null. I think it would be clearer to make
> > it a strict function.
>
> I think it has been done that way similar to
> binary_upgrade_create_empty_extension().
>
> > ---
> > LogicalReplicationSlotHasPendingWal() is defined in logical.c but I
> > guess it's more suitable to be in slotfunc.s where similar functions
> > such as pg_logical_replication_slot_advance() is also defined.
>
> Why not in logicalfuncs.c?
>

I am not sure if either of those is better than logical.c. IIRC, I
thought it was okay to keep in logical.c as others primarily deal with
exposed SQL functions and I felt it somewhat matches with the intent
of logical.c ("The goal is to encapsulate most of the internal
complexity for consumers of logical decoding, so they can create and
consume a changestream with a low amount of code..").

-- 
With Regards,
Amit Kapila.




Re: Synchronizing slots from primary to standby

2023-11-28 Thread shveta malik
On Tue, Nov 28, 2023 at 3:10 PM shveta malik  wrote:
>
> On Tue, Nov 28, 2023 at 12:19 PM Drouvot, Bertrand
>  wrote:
> >
> > Hi,
> >
> > On 11/28/23 4:13 AM, shveta malik wrote:
> > > On Mon, Nov 27, 2023 at 4:08 PM Amit Kapila  
> > > wrote:
> > >>
> > >> On Mon, Nov 27, 2023 at 2:27 PM Zhijie Hou (Fujitsu)
> > >>  wrote:
> > >>>
> > >>> Here is the updated version(v39_2) which include all the changes made 
> > >>> in 0002.
> > >>> Please use for review, and sorry for the confusion.
> > >>>
> > >>
> > >> --- a/src/backend/replication/logical/launcher.c
> > >> +++ b/src/backend/replication/logical/launcher.c
> > >> @@ -8,20 +8,27 @@
> > >>*   src/backend/replication/logical/launcher.c
> > >>*
> > >>* NOTES
> > >> - *   This module contains the logical replication worker launcher which
> > >> - *   uses the background worker infrastructure to start the logical
> > >> - *   replication workers for every enabled subscription.
> > >> + *   This module contains the replication worker launcher which
> > >> + *   uses the background worker infrastructure to:
> > >> + *   a) start the logical replication workers for every enabled 
> > >> subscription
> > >> + *  when not in standby_mode.
> > >> + *   b) start the slot sync worker for logical failover slots 
> > >> synchronization
> > >> + *  from the primary server when in standby_mode.
> > >>
> > >> I was wondering do we really need a launcher on standby to invoke
> > >> sync-slot worker. If so, why? I guess it may be required for previous
> > >> versions where we were managing work for multiple slot-sync workers
> > >> which is also questionable in the sense of whether launcher is the
> > >> right candidate for the same but now with the single slot-sync worker,
> > >> it doesn't seem worth having it. What do you think?
> > >>
> > >> --
> > >
> > > Yes, earlier a manager process was needed to manage multiple slot-sync
> > > workers and distribute load among them, but now that does not seem
> > > necessary. I gave it a try (PoC) and it seems to work well.  If  there
> > > are no objections to this approach, I can share the patch soon.
> > >
> >
> > +1 on this new approach, thanks!
>
> PFA v40. This patch has removed Logical Replication Launcher support
> to launch slotsync worker.  The slot-sync worker is now registered as
> bgworker with postmaster, with
> bgw_start_time=BgWorkerStart_ConsistentState and
> bgw_restart_time=60sec.
>
> On removal of the launcher, all the validity checks have now been shifted
> to the slot-sync worker itself.  This brings us to some points of concern:
>
> a) We still need to maintain the RecoveryInProgress() check in the slotsync
> worker. Since the worker has a start time of
> BgWorkerStart_ConsistentState, it will be started on a non-standby as
> well. So to ensure that it exits on a non-standby, "RecoveryInProgress"
> has been introduced at the beginning of the worker. But once it exits,
> the postmaster will not restart it, since it will be a clean exit i.e.
> proc_exit(0) (the restart logic of the postmaster comes into play only
> when there is an abnormal exit). But to exit for the first time on a
> non-standby, we need that recovery-related check in the worker.
>
> b) "enable_syncslot" check is moved to slotsync worker now. Since
> enable_syncslot is PGC_SIGHUP, so proc_exit(1) is currently used to
> exit the worker if 'enable_syncslot' is found to be disabled.
> 'proc_exit(1)' has been used in order to ensure that the worker is
> restarted and GUCs are checked again after restart_time. Downside of
> this approach is, if someone has kept "enable_syncslot" as disabled
> permanently even on standby, slotsync worker will keep on restarting
> and exiting.
>
> So to overcome the above pain-points, I think a potential approach
> will be to start slotsync worker only if 'enable_syncslot' is on and
> the system is non-standby. Potential ways (each with some issues) are:
>

Correction here: start the slotsync worker only if 'enable_syncslot' is
on and the system is a standby.

> 1) Use the current way, i.e. register the slot-sync worker as a bgworker
> with the postmaster, but introduce extra checks in 'maybe_start_bgworkers'.
> But this seems more like a hack. This will need extra changes, as
> currently, once 'maybe_start_bgworkers' is attempted by the postmaster, it
> will attempt to start a worker again only if the worker had an abnormal
> exit and restart_time != 0. The current postmaster will not attempt to
> start a worker on any GUC change.
>
> 2) Another way may be to treat the slotsync worker as a special case and
> separate out the start/restart of the slotsync worker from bgworkers,
> following what we do for the autovacuum launcher (StartAutoVacLauncher) to
> keep starting it in the postmaster loop (ServerLoop). In this way, we
> may be able to add more checks before starting the worker. But by opting
> for this approach, we will have to manage the slotsync worker completely
> by ourselves, as it will no longer be part of the existing
> bgworker-registration infra. If this seems okay and there 
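
(For reference, the registration described in the quoted mail looks
roughly like this sketch; ReplSlotSyncWorkerMain is an illustrative
entry-point name:)

BackgroundWorker bgw;

memset(&bgw, 0, sizeof(bgw));
bgw.bgw_flags = BGWORKER_SHMEM_ACCESS | BGWORKER_BACKEND_DATABASE_CONNECTION;
bgw.bgw_start_time = BgWorkerStart_ConsistentState;
bgw.bgw_restart_time = 60;      /* seconds */
snprintf(bgw.bgw_library_name, BGW_MAXLEN, "postgres");
snprintf(bgw.bgw_function_name, BGW_MAXLEN, "ReplSlotSyncWorkerMain");
snprintf(bgw.bgw_name, BGW_MAXLEN, "replication slot sync worker");
snprintf(bgw.bgw_type, BGW_MAXLEN, "replication slot sync worker");
RegisterBackgroundWorker(&bgw);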

Re: SSL tests fail on OpenSSL v3.2.0

2023-11-28 Thread Daniel Gustafsson
> On 28 Nov 2023, at 01:29, Bo Anderson  wrote:

> It probably doesn’t exist in BoringSSL but neither does a lot of things.

That's not an issue; we don't support building with BoringSSL.

--
Daniel Gustafsson





Re: XID formatting and SLRU refactorings

2023-11-28 Thread Heikki Linnakangas

On 27/11/2023 01:43, Alexander Korotkov wrote:

v61 looks good to me.  I'm going to push it as long as there are no objections.

This was discussed earlier, but is still present in v61:


+/*
+ * An internal function used by SlruScanDirectory().
+ *
+ * Returns true if a file with a name of a given length may be a correct
+ * SLRU segment.
+ */
+static inline bool
+SlruCorrectSegmentFilenameLength(SlruCtl ctl, size_t len)
+{
+   if (ctl->long_segment_names)
+   return (len == 15); /* see SlruFileName() */
+   else
+   /*
+* Commit 638cf09e76d allowed 5-character lengths. Later commit
+* 73c986adde5 allowed 6-character length.
+*
+* XXX should we still consider such names to be valid?
+*/
+   return (len == 4 || len == 5 || len == 6);
+}
+


I think it's pretty sloppy that the "short" filenames can be 4, 5 or 6 
chars long. For pg_multixact/members, which introduced the 5-char case, 
I think we should always pad the filenames 5 characters, and for 
commit_ts which introduced the 6 char case, always pad to 6 characters.


Instead of a "long_segment_names" boolean, how about an integer field, 
to specify the length.


That means that we'll need pg_upgrade to copy pg_multixact/members files 
under the new names. That should be pretty straightforward.
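
(In code terms, the suggestion amounts to something like this sketch,
where a per-SLRU segment_name_length field replaces the boolean:)

static inline bool
SlruCorrectSegmentFilenameLength(SlruCtl ctl, size_t len)
{
    /* one canonical, zero-padded length per SLRU directory */
    return len == (size_t) ctl->segment_name_length;
}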


--
Heikki Linnakangas
Neon (https://neon.tech)




