date:20231130

On Thu, Nov 30, 2023 at 5:21 PM Tomas Vondra
 wrote:
>
> On 11/30/23 11:56, Amit Kapila wrote:
>
> >
> > This was the key point that I wanted to discuss or hear opinions
> > about. So, if we wish to have some sort of global sequences then it is
> > not clear to me what benefits will we get by having replication of
> > non-global sequences. One thing that comes to mind is replication
> > covers a subset of use cases (like help in case of failover or
> > switchover to subscriber) and till the time we have some
> > implementation of global sequences, it can help users.
> >
>
> What are you going to do about use cases like using logical replication
> for upgrade to the next major version?

As per my understanding, they should work as it is when using a global
sequence. Just for the sake of example, considering we have a
same-name global sequence on both pub and sub now it should work
during and after major version upgrades.

>
> Or applications that prefer (or
> have to) use traditional sequences?
>

I think we have to suggest them to use global sequence for the use
cases where they want those to work with logical replication use
cases. Now, if still users want their existing sequences to work then
we can probably see if there is a way to provide an option via Alter
Sequence to change it to a global sequence.

-- 
With Regards,
Amit Kapila.

RE: Synchronizing slots from primary to standby

2023-11-30 Thread Zhijie Hou (Fujitsu)

On Friday, December 1, 2023 12:51 PM shveta malik  
wrote:

Hi,

> 
> On Fri, Dec 1, 2023 at 9:40 AM Zhijie Hou (Fujitsu)
>  wrote:
> >
> > On Wednesday, November 29, 2023 5:12 PM Zhijie Hou (Fujitsu)
>  wrote:
> >
> > I was reviewing slotsync worker design and here
> > are few comments on 0002 patch:
> 
> Thanks for reviewing the patch.
> 
> >
> >
> > 3. In synchronize_one_slot, do we need to skip the slot sync and drop if the
> > local slot is a physical one ?
> >
> 
> IMO, if a local slot exists which is a physical one, it will be a user
> created slot and in that case worker will error out on finding
> existing slot with same name. And the case where local slot is
> physical one but not user-created is not possible on standby (assuming
> we have correct check on primary disallowing setting 'failover'
> property for physical slot). Do you have some other scenario in mind,
> which I am missing here?

I was thinking about the race condition when it has confirmed that the slot is
not a user created one and enter "sync_state == SYNCSLOT_STATE_READY" branch,
but at this moment, if someone uses "DROP_REPLICATION_SLOT" to drop this slot 
and
recreate another one(e.g. a physical one), then the slotsync worker will
overwrite the fields of this physical slot. Although this affects user created
logical slots in similar cases as well.

And the same is true for slotsync_drop_initiated_slots() and
drop_obsolete_slots(), as we don't lock the slots in the list, if user tri to
drop and re-create old slot concurrently, then we could drop user created slot
here.

Best Regards,
Hou zj

Re: pg_upgrade and logical replication

2023-11-30 Thread Peter Smith

Here are review comments for patch v21-0001

==
src/bin/pg_upgrade/check.c

1. check_old_cluster_subscription_state

+/*
+ * check_old_cluster_subscription_state()
+ *
+ * Verify that each of the subscriptions has all their corresponding tables in
+ * i (initialize) or r (ready).
+ */
+static void
+check_old_cluster_subscription_state(void)

Function comment should also mention it also validates the origin.

~~~

2.
In this function there are a couple of errors written to the
"subs_invalid.txt" file:

+ fprintf(script, "replication origin is missing for database:\"%s\"
subscription:\"%s\"\n",
+ PQgetvalue(res, i, 0),
+ PQgetvalue(res, i, 1));

and

+ fprintf(script, "database:\"%s\" subscription:\"%s\" schema:\"%s\"
relation:\"%s\" state:\"%s\" not in required state\n",
+ active_db->db_name,
+ PQgetvalue(res, i, 0),
+ PQgetvalue(res, i, 1),
+ PQgetvalue(res, i, 2),
+ PQgetvalue(res, i, 3));

The format of those messages is not consistent. It could be improved
in a number of ways to make them more similar. e.g. below.

SUGGESTION #1
the replication origin is missing for database:\"%s\" subscription:\"%s\"\n
the table sync state \"%s\" is not allowed for database:\"%s\"
subscription:\"%s\" schema:\"%s\" relation:\"%s\"\n

SUGGESTION #2
database:\"%s\" subscription:\"%s\" -- replication origin is missing\n
database:\"%s\" subscription:\"%s\" schema:\"%s\" relation:\"%s\" --
upgrade when table sync state is \"%s\" is not supported\n

etc.

==
src/bin/pg_upgrade/t/004_subscription.pl

3.
+# Initial setup
+$publisher->safe_psql('postgres', "CREATE TABLE tab_upgraded1(id int)");
+$publisher->safe_psql('postgres', "CREATE TABLE tab_upgraded2(id int)");
+$old_sub->safe_psql('postgres', "CREATE TABLE tab_upgraded1(id int)");
+$old_sub->safe_psql('postgres', "CREATE TABLE tab_upgraded2(id int)");

IMO it is tidier to combine multiple DDLS whenever you can.

~~~

4.
+# Create a subscription in enabled state before upgrade
+$publisher->safe_psql('postgres', "CREATE PUBLICATION regress_pub1");
+$old_sub->safe_psql('postgres',
+ "CREATE SUBSCRIPTION regress_sub1 CONNECTION '$connstr' PUBLICATION
regress_pub1"
+);
+$old_sub->wait_for_subscription_sync($publisher, 'regress_sub1');

That publication has an empty set of tables. Should there be some
comment to explain why it is OK like this?

~~~

5.
+# Wait till the table tab_upgraded1 reaches 'ready' state
+my $synced_query =
+  "SELECT count(1) = 1 FROM pg_subscription_rel WHERE srsubstate = 'r'";
+$old_sub->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for the table to reach ready state";
+
+$publisher->safe_psql('postgres',
+ "INSERT INTO tab_upgraded1 VALUES (generate_series(1,50))"
+);
+$publisher->wait_for_catchup('regress_sub2');

IMO better without the blank line, so then everything more clearly
belongs to this same comment.

~~~

6.
+# Pre-setup for preparing subscription table in init state. Add tab_upgraded2
+# to the publication.
+$publisher->safe_psql('postgres',
+ "ALTER PUBLICATION regress_pub2 ADD TABLE tab_upgraded2");
+
+$old_sub->safe_psql('postgres',
+ "ALTER SUBSCRIPTION regress_sub2 REFRESH PUBLICATION");

Ditto. IMO better without the blank line, so then everything more
clearly belongs to this same comment.

~~~

7.
+command_ok(
+ [
+ 'pg_upgrade', '--no-sync', '-d', $old_sub->data_dir,
+ '-D', $new_sub->data_dir, '-b', $oldbindir,
+ '-B', $newbindir, '-s', $new_sub->host,
+ '-p', $old_sub->port, '-P', $new_sub->port,
+ $mode
+ ],
+ 'run of pg_upgrade for old instance when the subscription tables are
in init/ready state'
+);

Maybe those 'command_ok' args can be formatted neatly (like you've
done later for the 'command_checks_all').

~~~

8.
+# --
+# Check that the data inserted to the publisher when the subscriber
is down will
+# be replicated to the new subscriber once the new subscriber is started.
+# --

8a.
SUGGESTION
...when the new subscriber is down will be replicated once it is started.

~

8b.
I thought this main comment should also say something like "Also check
that the old subscription states and relations origins are all
preserved."

~~~

9.
+$publisher->safe_psql('postgres', "INSERT INTO tab_upgraded1 VALUES(51)");
+$publisher->safe_psql('postgres', "INSERT INTO tab_upgraded2 VALUES(1)");

IMO it is tidier to combine multiple DDLS whenever you can.

~~~

10.
+# The subscription's running status should be preserved
+$result =
+  $new_sub->safe_psql('postgres',
+ "SELECT subenabled FROM pg_subscription ORDER BY subname");
+is($result, qq(t
+f),
+ "check that the subscription's running status are preserved"
+);

I felt this was a bit too tricky. It might be more readable to do 2
separate SELECTs with explicit subnames. Alternatively, leave the code
as-is but improve the comment to explicitly say something like:

# old subscription regress_sub was enabled
# old subscription regress_sub1 was disable

processes stuck in shutdown following OOM/recovery

2023-11-30 Thread Justin Pryzby

If postgres starts, and one of its children is immediately killed, and
the cluster is also told to stop, then, instead, the whole system gets
wedged.

$ initdb -D ./pgdev.dat1
$ pg_ctl -D ./pgdev.dat1 start -o '-c port=5678'
$ kill -9 2524495; sleep 0.05; pg_ctl -D ./pgdev.dat1 stop -m fast # 2524495 is 
a child's pid
.. failed
pg_ctl: server does not shut down

$ ps -wwwf --ppid 2524494
UID  PIDPPID  C STIME TTY  TIME CMD
pryzbyj  2524552 2524494  0 20:47 ?00:00:00 postgres: checkpointer 

(gdb) bt
#0  0x7f0ce2d08c03 in epoll_wait (epfd=10, events=0x55cb4cbaac28, 
maxevents=1, timeout=timeout@entry=156481) at 
../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x55cb4c219208 in WaitEventSetWaitBlock (set=set@entry=0x55cb4cbaabc0, 
cur_timeout=cur_timeout@entry=156481, 
occurred_events=occurred_events@entry=0x7ffd80130410, 
nevents=nevents@entry=1) at ../src/backend/storage/ipc/latch.c:1583
#2  0x55cb4c219e02 in WaitEventSetWait (set=0x55cb4cbaabc0, 
timeout=timeout@entry=30, 
occurred_events=occurred_events@entry=0x7ffd80130410, nevents=nevents@entry=1, 
wait_event_info=wait_event_info@entry=83886084) at 
../src/backend/storage/ipc/latch.c:1529
#3  0x55cb4c219f87 in WaitLatch (latch=, 
wakeEvents=wakeEvents@entry=41, timeout=timeout@entry=30, 
wait_event_info=wait_event_info@entry=83886084)
at ../src/backend/storage/ipc/latch.c:539
#4  0x55cb4c1aabc2 in CheckpointerMain () at 
../src/backend/postmaster/checkpointer.c:523
#5  0x55cb4c1a8207 in AuxiliaryProcessMain 
(auxtype=auxtype@entry=CheckpointerProcess) at 
../src/backend/postmaster/auxprocess.c:153
#6  0x55cb4c1ae63d in StartChildProcess 
(type=type@entry=CheckpointerProcess) at 
../src/backend/postmaster/postmaster.c:5331
#7  0x55cb4c1b07f3 in ServerLoop () at 
../src/backend/postmaster/postmaster.c:1792
#8  0x55cb4c1b1c56 in PostmasterMain (argc=argc@entry=5, 
argv=argv@entry=0x55cb4cbaa380) at ../src/backend/postmaster/postmaster.c:1466
#9  0x55cb4c0f4c1b in main (argc=5, argv=0x55cb4cbaa380) at 
../src/backend/main/main.c:198

I noticed this because of the counter-effective behavior of systemd+PGDG
unit files to run "pg_ctl stop" whenever a backend is killed for OOM:
https://www.postgresql.org/message-id/ZVI112aVNCHOQgfF@pryzbyj2023

This affects v15, and fails at 7ff23c6d27 but not its parent.

commit 7ff23c6d277d1d90478a51f0dd81414d343f3850 (HEAD)
Author: Thomas Munro 
Date:   Mon Aug 2 17:32:20 2021 +1200

Run checkpointer and bgwriter in crash recovery.

-- 
Justin

Re: Synchronizing slots from primary to standby

2023-11-30 Thread shveta malik

On Fri, Dec 1, 2023 at 9:40 AM Zhijie Hou (Fujitsu)
 wrote:
>
> On Wednesday, November 29, 2023 5:12 PM Zhijie Hou (Fujitsu) 
>  wrote:
>
> I was reviewing slotsync worker design and here
> are few comments on 0002 patch:

Thanks for reviewing the patch.

>
> 1.
>
> +   if (!WalRcv ||
> +   (WalRcv->slotname[0] == '\0') ||
> +   XLogRecPtrIsInvalid(WalRcv->latestWalEnd))
>
> I think we'd better take spinlock when accessing these shared memory fields.
>
> 2.
>
> /*
>  * The slot sync feature itself is disabled, exit.
>  */
> if (!enable_syncslot)
> {
> ereport(LOG,
> errmsg("exiting slot sync worker as 
> enable_syncslot is disabled."));
>
> Can we check the GUC when registering the worker(SlotSyncWorkerRegister),
> so that the worker won't be started if enable_syncslot is false.
>
> 3. In synchronize_one_slot, do we need to skip the slot sync and drop if the
> local slot is a physical one ?
>

IMO, if a local slot exists which is a physical one, it will be a user
created slot and in that case worker will error out on finding
existing slot with same name. And the case where local slot is
physical one but not user-created is not possible on standby (assuming
we have correct check on primary disallowing setting 'failover'
property for physical slot). Do you have some other scenario in mind,
which I am missing here?

> 4.
>
> *locally_invalidated =
> (remote_slot->invalidated == RS_INVAL_NONE) &&
> (local_slot->data.invalidated != 
> RS_INVAL_NONE);
>
> When reading the invalidated flag of local slot, I think we'd better take
> spinlock.
>
> Best Regards,
> Hou zj

Re: optimize atomic exchanges

On Thu, Nov 30, 2023 at 07:56:27PM -0800, Andres Freund wrote:
> On 2023-11-30 21:18:15 -0600, Nathan Bossart wrote:
>> Some rudimentary tests show a >40% speedup with this patch on x86_64.
> 
> On bigger machines, with contention, the wins are likely much higher. I see
> two orders of magnitude higher throughput in a test program that I had around,
> on a two socket cascade lake machine.  Of course it's also much less
> powerfull...

Nice.  Thanks for trying it out.

One thing on my mind is whether we should bother with the inline assembly
versions.  It looks like gcc has had __atomic since 4.7.0 (2012), so I'm
not sure we gain much from them.  OTOH they are pretty simple and seem
unlikely to cause too much trouble.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Re: Add new for_each macros for iterating over a List that do not require ListCell pointer

On Wed, Oct 25, 2023 at 12:39:01PM +0200, Jelte Fennema wrote:
> Attached is a slightly updated version, with a bit simpler
> implementation of foreach_delete_current.
> Instead of decrementing i and then adding 1 to it when indexing the
> list, it now indexes the list using a postfix decrement.

Both the macros and the comments in 0001 seem quite repetitive to me.
Could we simplify it with something like the following?

#define foreach_internal(var, lst, func) \
for (ForEachState var##__state = {(lst), 0}; \
 (var##__state.l != NIL && \
  var##__state.i < var##__state.l->length && \
 (var = func(&var##__state.l->elements[var##__state.i]), true)); \
 var##__state.i++)

#define foreach_ptr(var, lst)   foreach_internal(var, lst, lfirst)
#define foreach_int(var, lst)   foreach_internal(var, lst, lfirst_int)
#define foreach_oid(var, lst)   foreach_internal(var, lst, lfirst_oid)
#define foreach_xid(var, lst)   foreach_internal(var, lst, lfirst_xid)

#define foreach_node(type, var, lst) \
for (ForEachState var##__state = {(lst), 0}; \
 (var##__state.l != NIL && \
  var##__state.i < var##__state.l->length && \
 (var = lfirst_node(type, 
&var##__state.l->elements[var##__state.i]), true));\
 var##__state.i++)

There might be a way to use foreach_internal for foreach_node, too, but
this is probably already too magical...

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

RE: Synchronizing slots from primary to standby

2023-11-30 Thread Zhijie Hou (Fujitsu)

On Wednesday, November 29, 2023 5:12 PM Zhijie Hou (Fujitsu) 
 wrote:

I was reviewing slotsync worker design and here
are few comments on 0002 patch:

1.

+   if (!WalRcv ||
+   (WalRcv->slotname[0] == '\0') ||
+   XLogRecPtrIsInvalid(WalRcv->latestWalEnd))

I think we'd better take spinlock when accessing these shared memory fields.

2.

/*
 * The slot sync feature itself is disabled, exit.
 */
if (!enable_syncslot)
{
ereport(LOG,
errmsg("exiting slot sync worker as 
enable_syncslot is disabled."));

Can we check the GUC when registering the worker(SlotSyncWorkerRegister),
so that the worker won't be started if enable_syncslot is false.

3. In synchronize_one_slot, do we need to skip the slot sync and drop if the
local slot is a physical one ?

4.

*locally_invalidated =
(remote_slot->invalidated == RS_INVAL_NONE) &&
(local_slot->data.invalidated != RS_INVAL_NONE);

When reading the invalidated flag of local slot, I think we'd better take
spinlock.

Best Regards,
Hou zj

Re: meson: Stop using deprecated way getting path of files

On 2023-11-29 10:50:53 -0800, Andres Freund wrote:
> I plan to apply this soon, unless I hear some opposition / better ideas / 

Pushed.

Re: Annoying build warnings from latest Apple toolchain

Hi,

On 2023-11-30 21:24:21 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > On 2023-11-28 10:48:04 -0500, Robert Haas wrote:
> >> What a stupid, annoying decision on Apple's part. It seems like
> >> -Wl,-undefined,error is the default behavior, so they could have just
> >> ignored that flag if present, but instead they complain about being
> >> asked to do what they were going to do anyway.
> 
> > Especially because I think it didn't actually use to be the default when
> > building a dylib.
> 
> Indeed.  Whoever's in charge of their linker now seems to be quite
> clueless, or at least aggressively anti-backwards-compatibility.

It looks like it even affects a bunch of Apple's own products [1]...


> > While not helpful for this, I just noticed that there is
> > -no_warn_duplicate_libraries, which would at least get rid of the
> >   ld: warning: ignoring duplicate libraries: '-lxml2'
> > style warnings.
> 
> Oh!  If that's been there long enough to rely on, that does seem
> very helpful.

I think it's new too. But we can just check if the flag is supported.


This is really ridiculous. For at least part of venturas life they warned
about -Wl,-undefined,dynamic_lookup, which lead to 9a95a510adf3. They don't
seem to do that anymore, but now you can't specify -Wl,-undefined,error
anymore without a warning.


Attached is a prototype testing this via CI on both Sonoma and Ventura.


It's certainly possible to encounter the duplicate library issue with
autoconf, but probably not that common. So I'm not sure if we should inject
-Wl,-no_warn_duplicate_libraries as well?

Greetings,

Andres Freund


[1] https://developer.apple.com/forums/thread/733317
>From 64cc70f05bb5874802192800e5c8af04b1a1df79 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Thu, 28 Sep 2023 08:21:18 -0700
Subject: [PATCH v3] meson: Improve Sonoma compability

Still triggers a bunch of "ld: warning: -undefined error is deprecated", but
fewer than before. There should not be any "ld: warning: ignoring duplicate
libraries:" warnings anymore.

ci-os-only: macos
---
 meson.build  | 14 --
 .cirrus.tasks.yml| 13 ++---
 src/tools/ci/ci_macports_packages.sh |  8 +++-
 3 files changed, 29 insertions(+), 6 deletions(-)

diff --git a/meson.build b/meson.build
index 0f2c76ec25e..eab7b6e722d 100644
--- a/meson.build
+++ b/meson.build
@@ -237,10 +237,20 @@ elif host_system == 'darwin'
 cflags += ['-isysroot', pg_sysroot]
 ldflags += ['-isysroot', pg_sysroot]
   endif
+
   # meson defaults to -Wl,-undefined,dynamic_lookup for modules, which we
   # don't want because a) it's different from what we do for autoconf, b) it
-  # causes warnings starting in macOS Ventura
-  ldflags_mod += ['-Wl,-undefined,error']
+  # causes warnings in macOS Ventura. But using -Wl,-undefined,error causes a
+  # warning starting in Sonoma. So only add -Wl,-undefined,error if it does
+  # not cause a warning.
+  if cc.has_multi_link_arguments('-Wl,-undefined,error', '-Werror')
+ldflags_mod += '-Wl,-undefined,error'
+  endif
+
+  # Starting in Sonoma, the linker warns about the same library being
+  # linked twice.  Which can easily happen when multiple dependencies
+  # depend on the same library. Quiesce the ill considered warning.
+  ldflags += cc.get_supported_link_arguments('-Wl,-no_warn_duplicate_libraries')
 
 elif host_system == 'freebsd'
   sema_kind = 'unnamed_posix'
diff --git a/.cirrus.tasks.yml b/.cirrus.tasks.yml
index e137769850d..21ee53b9c2f 100644
--- a/.cirrus.tasks.yml
+++ b/.cirrus.tasks.yml
@@ -410,8 +410,6 @@ task:
 
 
 task:
-  name: macOS - Ventura - Meson
-
   env:
 CPUS: 4 # always get that much for cirrusci macOS instances
 BUILD_JOBS: $CPUS
@@ -419,7 +417,6 @@ task:
 # work OK. See
 # https://postgr.es/m/20220927040208.l3shfcidovpzqxfh%40awork3.anarazel.de
 TEST_JOBS: 8
-IMAGE: ghcr.io/cirruslabs/macos-ventura-base:latest
 
 CIRRUS_WORKING_DIR: ${HOME}/pgsql/
 CCACHE_DIR: ${HOME}/ccache
@@ -430,6 +427,16 @@ task:
 CFLAGS: -Og -ggdb
 CXXFLAGS: -Og -ggdb
 
+
+  matrix:
+- name: macOS - Sonoma - Meson
+  env:
+IMAGE: ghcr.io/cirruslabs/macos-sonoma-base:latest
+
+- name: macOS - Ventura - Meson
+  env:
+IMAGE: ghcr.io/cirruslabs/macos-ventura-base:latest
+
   <<: *macos_task_template
 
   depends_on: SanityCheck
diff --git a/src/tools/ci/ci_macports_packages.sh b/src/tools/ci/ci_macports_packages.sh
index 4bc594a31d1..3161ea08ad3 100755
--- a/src/tools/ci/ci_macports_packages.sh
+++ b/src/tools/ci/ci_macports_packages.sh
@@ -13,7 +13,13 @@ set -e
 
 packages="$@"
 
-macports_url="https://github.com/macports/macports-base/releases/download/v2.8.1/MacPorts-2.8.1-13-Ventura.pkg";
+macos_ver=$(sw_vers -productVersion | sed 's/\.//g')
+if [ $macos_ver -ge 1400 ]; then
+macports_url="https://github.com/macports/macports-base/releases/download/v2.8.1/MacPorts-2.8.1-14-Sonoma.pkg";
+else
+macports_url="

Re: [Proposal] global sequence implemented by snowflake ID

2023-11-30 Thread Michael Paquier

On Thu, Nov 30, 2023 at 12:51:38PM +0100, Tomas Vondra wrote:
> As for implementation/replication, I haven't checked the code, but I'd
> imagine the AM should be able to decide whether something needs to be
> replicated (and how) or not. So the traditional sequences would
> replicate, and the alternative sequences would not replicate anything.

Yep, exactly.  Keeping compatibility for the in-core sequence
computation is very important (including the fact that this stuff uses
pseudo-heap tables for its metadata with the values computed).

>> This was the key point that I wanted to discuss or hear opinions
>> about. So, if we wish to have some sort of global sequences then it is
>> not clear to me what benefits will we get by having replication of
>> non-global sequences. One thing that comes to mind is replication
>> covers a subset of use cases (like help in case of failover or
>> switchover to subscriber) and till the time we have some
>> implementation of global sequences, it can help users.
> 
> What are you going to do about use cases like using logical replication
> for upgrade to the next major version? Or applications that prefer (or
> have to) use traditional sequences?

Yeah, and that's why the logical replication of sequence has value.
Giving the possibility for users or application developers to use a
custom computation method may be useful for some applications, but not
others.  The use cases are too much different, so IMO both are useful,
when applied to each user's requirements.
--
Michael


signature.asc
Description: PGP signature

Re: Custom explain options

2023-11-30 Thread Andrei Lepikhov


On 30/11/2023 22:40, Konstantin Knizhnik wrote:


On 30/11/2023 5:59 am, Andrei Lepikhov wrote:

On 21/10/2023 19:16, Konstantin Knizhnik wrote:
EXPLAIN statement has a list of options (i.e. ANALYZE, BUFFERS, 
COST,...) which help to provide useful details of query execution.
In Neon we have added PREFETCH option which shows information about 
page prefetching during query execution (prefetching is more critical 
for Neon
architecture because of separation of compute and storage, so it is 
implemented not only for bitmap heap scan as in Vanilla Postgres, but 
also for seqscan, indexscan and indexonly scan). Another possible 
candidate  for explain options is local file cache (extra caching 
layer above shared buffers which is used to somehow replace file 
system cache in standalone Postgres).


I think that it will be nice to have a generic mechanism which allows 
extensions to add its own options to EXPLAIN.


Generally, I welcome this idea: Extensions can already do a lot of 
work, and they should have a tool to report their state, not only into 
the log.
But I think your approach needs to be elaborated. At first, it would 
be better to allow registering extended instruments for specific node 
types to avoid unneeded calls.
Secondly, looking into the Instrumentation usage, I don't see the 
reason to limit the size: as I see everywhere it exists locally or in 
the DSA where its size is calculated on the fly. So, by registering an 
extended instrument, we can reserve a slot for the extension. The 
actual size of underlying data can be provided by the extension routine.



Thank you for review.

I agree that support of extended instruments is desired. I just tried to 
minimize number of changes to make this patch smaller.


I got it. But having a substantial number of extensions in support, I 
think the extension part of instrumentation could have advantages and be 
worth elaborating on.


In all this cases we are using array of `Instrumentation` and if it 
contains varying part, then it is not clear where to place it.
Yes, there is also code which serialize and sends instrumentations 
between worker processes  and I have updated this code in my PR to send 
actual amount of custom instrumentation data. But it can not help with 
the cases above.

I see next basic instruments in the code:
- Instrumentation (which should be named NodeInstrumentation)
- MemoizeInstrumentation
- JitInstrumentation
- AggregateInstrumentation
- HashInstrumentation
- TuplesortInstrumentation

As a variant, extensibility can be designed with parent 
'AbstractInstrumentation' node, containing node type and link to 
extensible part. sizeof(Instr) calls should be replaced with the 
getInstrSize() call - not so much places in the code; memcpy() also can 
be replaced with the copy_instr() routine.


--
regards,
Andrei Lepikhov
Postgres Professional

Re: optimize atomic exchanges

Hi,

On 2023-11-30 21:18:15 -0600, Nathan Bossart wrote:
> On Wed, Nov 29, 2023 at 03:29:05PM -0600, Nathan Bossart wrote:
> > I haven't done any sort of performance testing on this yet.  Some
> > preliminary web searches suggest that there is unlikely to be much
> > difference between cmpxchg and xchg, but presumably there's some difference
> > between xchg and doing cmpxchg in a while loop (as is done in
> > atomics/generic.h today).  I'll report back once I've had a chance to do
> > some testing...
> 
> Some rudimentary tests show a >40% speedup with this patch on x86_64.

On bigger machines, with contention, the wins are likely much higher. I see
two orders of magnitude higher throughput in a test program that I had around,
on a two socket cascade lake machine.  Of course it's also much less
powerfull...

Greetings,

Andres Freund

Re: Synchronizing slots from primary to standby

2023-11-30 Thread shveta malik

On Thu, Nov 30, 2023 at 5:37 PM Ajin Cherian  wrote:
>
> On Wed, Nov 29, 2023 at 8:17 PM Zhijie Hou (Fujitsu)
>  wrote:
> >
> > This has been fixed.
> >
> > Best Regards,
> > Hou zj
>
> Thanks for addressing my comments. Some comments from my testing of patch v41
>
> 1. In my opinion, the second message "aborting the wait...moving to
> the next slot" does not hold much value. There might not even be a
> "next slot", there might be just one slot. I think the first LOG is
> enough to indicate that the sync-slot is waiting as it repeats this
> log till the slot catches up. I know these messages hold great value
> for debugging but in production,  "waiting..", "aborting the wait.."
> might not be as helpful, maybe change it to debug?
>
> 2023-11-30 05:13:49.811 EST [6115] LOG:  waiting for remote slot
> "sub1" LSN (0/3047A90) and catalog xmin (745) to pass local slot LSN
> (0/3047AC8) and catalog xmin (745)
> 2023-11-30 05:13:57.909 EST [6115] LOG:  aborting the wait for remote
> slot "sub1" and moving to the next slot, will attempt creating it
> again
> 2023-11-30 05:14:07.921 EST [6115] LOG:  waiting for remote slot
> "sub1" LSN (0/3047A90) and catalog xmin (745) to pass local slot LSN
> (0/3047AC8) and catalog xmin (745)
>

Sure, the message can be trimmed down. But I am not very sure if we
should convert it to DEBUG. It might be useful to know what exactly is
happening with this slot through the log file.Curious to know what
others think here?

>
> 2. If a slot on the standby is in the "i" state as it hasn't been
> synced and it was invalidated on the primary, should you continuously
> retry creating this invalidated slot on the standby?
>
> 2023-11-30 06:21:41.844 EST [10563] LOG:  waiting for remote slot
> "sub1" LSN (0/0) and catalog xmin (785) to pass local slot LSN
> (0/EED9330) and catalog xmin (785)
> 2023-11-30 06:21:41.845 EST [10563] WARNING:  slot "sub1" invalidated
> on the primary server, slot creation aborted
> 2023-11-30 06:21:51.892 EST [10563] LOG:  waiting for remote slot
> "sub1" LSN (0/0) and catalog xmin (785) to pass local slot LSN
> (0/EED9330) and catalog xmin (785)
> 2023-11-30 06:21:51.893 EST [10563] WARNING:  slot "sub1" invalidated
> on the primary server, slot creation aborted
>

No, it should not be synced after that. It should be marked as
invalidated and skipped. And perhaps the state should also be moved to
'r' as we are done with it; but since it is invalidated, it will not
be used further even if 'r'.

> 3. If creation of a slot on the standby fails for one slot because a
> slot of the same name exists, then thereafter no new sync slots are
> created on standby. Is this expected? I do see that previously created
> slots are kept up to date, just that no new slots are created after
> that.
>

yes, it is done so as per the suggestion/discussion in [1]. It is done
so that users can catch this issue at the earliest.

[1]: 
https://www.postgresql.org/message-id/CAA4eK1J5D-Z7dFa89acf7O%2BCa6Y9bygTpi52KAKVCg%2BPE%2BZfog%40mail.gmail.com

thanks
Shveta

Re: optimize atomic exchanges

On Wed, Nov 29, 2023 at 03:29:05PM -0600, Nathan Bossart wrote:
> I haven't done any sort of performance testing on this yet.  Some
> preliminary web searches suggest that there is unlikely to be much
> difference between cmpxchg and xchg, but presumably there's some difference
> between xchg and doing cmpxchg in a while loop (as is done in
> atomics/generic.h today).  I'll report back once I've had a chance to do
> some testing...

Some rudimentary tests show a >40% speedup with this patch on x86_64.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Re: Bug in pgbench prepared statements

2023-11-30 Thread Lev Kokotov

> I see prepareCommand() is called one more time in
> prepareCommandsInPipeline(). Should you also check the return value
> there?

Yes, good catch. New patch attached.

> It may also be useful to throw this patch on the January commitfest if
> no one else comes along to review/commit it.

First time contributing, not familiar with the process here, but happy to
learn.

Best,

Lev
postgresml.org


On Thu, Nov 30, 2023 at 2:10 PM Tristan Partin  wrote:

> On Wed Nov 29, 2023 at 7:38 PM CST, Lev Kokotov wrote:
> > Patch attached, if there is any interest in fixing this small bug.
>
> I see prepareCommand() is called one more time in
> prepareCommandsInPipeline(). Should you also check the return value
> there?
>
> It may also be useful to throw this patch on the January commitfest if
> no one else comes along to review/commit it. This first set of changes
> looks good to me.
>
> --
> Tristan Partin
> Neon (https://neon.tech)
>


pgbench.patch
Description: Binary data

Re: Report planning memory in EXPLAIN ANALYZE

2023-11-30 Thread Andrei Lepikhov


On 30/11/2023 18:40, Alvaro Herrera wrote:

Well, SUMMARY is enabled by default with ANALYZE, and I'd rather not
have planner memory consumption displayed by default with all EXPLAIN
ANALYZEs.  So yeah, I still think this deserves its own option.

But let's hear others' opinions on this point.  I'm only one vote here.


I agree; it should be disabled by default. The fact that memory 
consumption outputs with byte precision (very uncomfortable for 
perception) is a sign that the primary audience is developers and for 
debugging purposes.


--
regards,
Andrei Lepikhov
Postgres Professional

Re: Annoying build warnings from latest Apple toolchain

Andres Freund  writes:
> On 2023-11-28 10:48:04 -0500, Robert Haas wrote:
>> What a stupid, annoying decision on Apple's part. It seems like
>> -Wl,-undefined,error is the default behavior, so they could have just
>> ignored that flag if present, but instead they complain about being
>> asked to do what they were going to do anyway.

> Especially because I think it didn't actually use to be the default when
> building a dylib.

Indeed.  Whoever's in charge of their linker now seems to be quite
clueless, or at least aggressively anti-backwards-compatibility.

> While not helpful for this, I just noticed that there is
> -no_warn_duplicate_libraries, which would at least get rid of the
>   ld: warning: ignoring duplicate libraries: '-lxml2'
> style warnings.

Oh!  If that's been there long enough to rely on, that does seem
very helpful.

regards, tom lane

Re: Testing autovacuum wraparound (including failsafe)

2023-11-30 Thread Masahiko Sawada

On Thu, Nov 30, 2023 at 4:35 PM Masahiko Sawada  wrote:
>
> On Wed, Nov 29, 2023 at 5:27 AM Masahiko Sawada  wrote:
> >
> > On Tue, Nov 28, 2023 at 7:16 PM Daniel Gustafsson  wrote:
> > >
> > > > On 28 Nov 2023, at 03:00, Masahiko Sawada  wrote:
> > > >
> > > > On Mon, Nov 27, 2023 at 10:40 PM Daniel Gustafsson  
> > > > wrote:
> > > >>
> > > >>> On 27 Nov 2023, at 14:06, Masahiko Sawada  
> > > >>> wrote:
> > > >>
> > > >>> Is it true that we can modify the timeout after creating
> > > >>> BackgroundPsql object? If so, it seems we don't need to introduce the
> > > >>> new timeout argument. But how?
> > > >>
> > > >> I can't remember if that's leftovers that incorrectly remains from an 
> > > >> earlier
> > > >> version of the BackgroundPsql work, or if it's a very bad explanation 
> > > >> of
> > > >> ->set_query_timer_restart().  The timeout will use the timeout_default 
> > > >> value
> > > >> and that cannot be overridden, it can only be reset per query.
> > > >
> > > > Thank you for confirming this. I see there is the same problem also in
> > > > interactive_psql(). So I've attached the 0001 patch to fix these
> > > > documentation issues.
> > >
> > > -A default timeout of $PostgreSQL::Test::Utils::timeout_default is set up,
> > > -which can be modified later.
> > > +A default timeout of $PostgreSQL::Test::Utils::timeout_default is set up.
> > >
> > > Since it cannot be modified, I think we should just say "A timeout of .." 
> > > and
> > > call it a default timeout.  This obviously only matters for the backpatch 
> > > since
> > > the sentence is removed in 0002.
> >
> > Agreed.
> >
> > I've attached new version patches (0002 and 0003 are unchanged except
> > for the commit message). I'll push them, barring any objections.
> >
>
> Pushed.

FYI I've configured the buildfarm animal perentie to run regression
tests including xid_wraparound:

https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=perentie&br=HEAD

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Re: Annoying build warnings from latest Apple toolchain

Hi,

On 2023-11-28 10:48:04 -0500, Robert Haas wrote:
> The second conclusion that I draw is that there's something in meson
> itself which is adding -Wl,-undefined,error when building binaries.

Right.


> What a stupid, annoying decision on Apple's part. It seems like
> -Wl,-undefined,error is the default behavior, so they could have just
> ignored that flag if present, but instead they complain about being
> asked to do what they were going to do anyway.

Especially because I think it didn't actually use to be the default when
building a dylib.


While not helpful for this, I just noticed that there is
-no_warn_duplicate_libraries, which would at least get rid of the
  ld: warning: ignoring duplicate libraries: '-lxml2'
style warnings.

Greetings,

Andres Freund

Re: Refactoring backend fork+exec code

Hi,

On 2023-12-01 01:36:13 +0200, Heikki Linnakangas wrote:
> On 30/11/2023 22:26, Andres Freund wrote:
> > On 2023-11-30 01:36:25 +0200, Heikki Linnakangas wrote:
> > >  From a96b6e92fdeaa947bf32774c425419b8f987b8e2 Mon Sep 17 00:00:00 2001
> > > From: Heikki Linnakangas 
> > > Date: Thu, 30 Nov 2023 00:01:25 +0200
> > > Subject: [PATCH v3 1/7] Refactor CreateSharedMemoryAndSemaphores
> > >
> > > For clarity, have separate functions for *creating* the shared memory
> > > and semaphores at postmaster or single-user backend startup, and
> > > for *attaching* to existing shared memory structures in EXEC_BACKEND
> > > case. CreateSharedMemoryAndSemaphores() is now called only at
> > > postmaster startup, and a new AttachSharedMemoryStructs() function is
> > > called at backend startup in EXEC_BACKEND mode.
> >
> > I assume CreateSharedMemoryAndSemaphores() is still called during crash
> > restart?
>
> Yes.
>
> >  I wonder if it shouldn't split three ways:
> > 1) create
> > 2) initialize
> > 3) attach
>
> Why? What would be the difference between create and initialize phases?

Mainly because I somehow mis-remembered how we deal with the shared memory
allocation when crashing. I somehow had remembered that we reused the same
allocation across restarts, but reinitialized it from scratch. There's a
kernel of truth to that, because we can end up re-attaching to an existing
sysv shared memory segment. But not more. Perhaps I was confusing things with
the listen sockets?

Andres

Re: Something seems weird inside tts_virtual_copyslot()

2023-11-30 Thread David Rowley

On Fri, 1 Dec 2023 at 13:14, Andres Freund  wrote:
> So I think adding an assert to ExecCopySlot(), perhaps with a comment saying
> that the restriction could be lifted with a bit of work, would be fine.

Thanks for looking at this again.

How about the attached?  I wrote the comment you mentioned and also
removed the Assert from tts_virtual_copyslot().

I also noted in the copyslot callback declaration that implementers
can assume the number of attributes in the source and destination
slots match.

David
diff --git a/src/backend/executor/execTuples.c 
b/src/backend/executor/execTuples.c
index 2c2712ceac..5a2db83987 100644
--- a/src/backend/executor/execTuples.c
+++ b/src/backend/executor/execTuples.c
@@ -253,8 +253,6 @@ tts_virtual_copyslot(TupleTableSlot *dstslot, 
TupleTableSlot *srcslot)
 {
TupleDesc   srcdesc = srcslot->tts_tupleDescriptor;
 
-   Assert(srcdesc->natts <= dstslot->tts_tupleDescriptor->natts);
-
tts_virtual_clear(dstslot);
 
slot_getallattrs(srcslot);
diff --git a/src/include/executor/tuptable.h b/src/include/executor/tuptable.h
index 4210d6d838..ce96111865 100644
--- a/src/include/executor/tuptable.h
+++ b/src/include/executor/tuptable.h
@@ -174,7 +174,8 @@ struct TupleTableSlotOps
 
/*
 * Copy the contents of the source slot into the destination slot's own
-* context. Invoked using callback of the destination slot.
+* context. Invoked using callback of the destination slot.  'dstslot' 
and
+* 'srcslot' can be assumed to have the same number of attributes.
 */
void(*copyslot) (TupleTableSlot *dstslot, TupleTableSlot 
*srcslot);
 
@@ -477,12 +478,19 @@ ExecCopySlotMinimalTuple(TupleTableSlot *slot)
  *
  * If a source's system attributes are supposed to be accessed in the target
  * slot, the target slot and source slot types need to match.
+ *
+ * Currently, source and target slot's TupleDesc must have the same number of
+ * attributes.  Future work could see this relaxed to allow the source to
+ * contain additional attributes and have the code here only copy over the
+ * leading attributes.
  */
 static inline TupleTableSlot *
 ExecCopySlot(TupleTableSlot *dstslot, TupleTableSlot *srcslot)
 {
Assert(!TTS_EMPTY(srcslot));
Assert(srcslot != dstslot);
+   Assert(dstslot->tts_tupleDescriptor->natts ==
+  srcslot->tts_tupleDescriptor->natts);
 
dstslot->tts_ops->copyslot(dstslot, srcslot);

Re: [PATCH] ltree hash functions

2023-11-30 Thread Tommy Pavlicek

On Tue, Nov 28, 2023 at 7:38 PM jian he  wrote:
> you only change Makefile, you also need to change contrib/ltree/meson.build?
> Do you need to use EXPLAIN to demo the index usage?

Thanks! Yes, I missed the Meson build file. I added additional
commands with EXPLAIN (COSTS OFF) as I found in other places.

Patch updated for those comments (and a touch of cleanup in the tests) attached.

0001-ltree-hash-v3.patch
Description: Binary data

Re: PostgreSql: Canceled on conflict out to old pivot

Hi,

On 2023-11-30 18:51:35 -0500, Tom Lane wrote:
> On what grounds do you assert that?  Operations on shared catalogs
> are visible across databases.  Admittedly they can't be written by
> ordinary DML, and I'm not sure that we make any promises about DDL
> writes honoring serializability.  But I'm unwilling to add
> "optimizations" that assume that that will never happen.

I'd say the issue is more that it's quite expensive to collect the
information. I tried in the past to make the xmin computation in
GetSnapshotData() be database specific, but it quickly shows in profiles, and
GetSnapshotData() unfortunately is really performance / scalability critical.

If that weren't the case, we could check a shared horizon for shared tables,
and a non-shared horizon otherwise.

In some cases we can compute a "narrower" horizon when it's worth the cost,
but quite often we lack the necessary data, because various backends have
stored the "global" xmin in the procarray.

Greetings,

Andres Freund

Re: Something seems weird inside tts_virtual_copyslot()

Hi,

On 2023-11-06 11:16:26 +1300, David Rowley wrote:
> On Sat, 4 Nov 2023 at 15:15, Andres Freund  wrote:
> >
> > On 2023-11-01 11:35:50 +1300, David Rowley wrote:
> > > I changed the Assert in tts_virtual_copyslot() to check the natts
> > > match in each of the slots and all of the regression tests still pass,
> > > so it seems we have no tests where there's an attribute number
> > > mismatch...
> >
> > If we want to prohibit that, I think we ought to assert this in
> > ExecCopySlot(), rather than just tts_virtual_copyslot.
> >
> > Even that does survive the test - but I don't think it'd be really wrong to
> > copy from a slot with more columns into one with fewer. And it seems 
> > plausible
> > that that could happen somewhere, e.g. when copying from a slot in a child
> > partition with additional columns into a slot from the parent, where the
> > column types/order otherwise matches, so we don't have to use the attribute
> > mapping infrastructure.
> 
> Do you have any examples of when this could happen?

> I played around with partitioned tables and partitions with various
> combinations of dropped columns and can't see any cases of this. Given
> the assert's condition has been backwards for 5 years now
> (4da597edf1b), it just seems a bit unlikely that we have cases where
> the source slot can have more attributes than the destination.

I think my concerns might be unfounded - I was worried about stuff like
attribute mapping deciding that it's safe to copy without an attribute
mapping, because all the types match. But it looks like we do check that the
attnums match as well. There's similar code in a bunch of other places,
e.g. ExecEvalWholeRowVar(), but that also verifies ->natts matches.

So I think adding an assert to ExecCopySlot(), perhaps with a comment saying
that the restriction could be lifted with a bit of work, would be fine.

Greetings,

Andres Freund

Re: PostgreSql: Canceled on conflict out to old pivot

Heikki Linnakangas  writes:
> On 30/11/2023 18:24, Wirch, Eduard wrote:
>> My understanding of serializable isolation is that only transactions
>> which can somehow affect each other can conflict. It should be clear
>> for PostgreSql, that transactions belonging to different databases
>> cannot affect each other. Why do they cause serializable conflicts?

On what grounds do you assert that?  Operations on shared catalogs
are visible across databases.  Admittedly they can't be written by
ordinary DML, and I'm not sure that we make any promises about DDL
writes honoring serializability.  But I'm unwilling to add
"optimizations" that assume that that will never happen.

regards, tom lane

Re: Refactoring backend fork+exec code

2023-11-30 Thread Heikki Linnakangas

On 30/11/2023 22:26, Andres Freund wrote:

On 2023-11-30 01:36:25 +0200, Heikki Linnakangas wrote:

 From a96b6e92fdeaa947bf32774c425419b8f987b8e2 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Thu, 30 Nov 2023 00:01:25 +0200
Subject: [PATCH v3 1/7] Refactor CreateSharedMemoryAndSemaphores

For clarity, have separate functions for *creating* the shared memory
and semaphores at postmaster or single-user backend startup, and
for *attaching* to existing shared memory structures in EXEC_BACKEND
case. CreateSharedMemoryAndSemaphores() is now called only at
postmaster startup, and a new AttachSharedMemoryStructs() function is
called at backend startup in EXEC_BACKEND mode.

I assume CreateSharedMemoryAndSemaphores() is still called during crash
restart?

Yes.

 I wonder if it shouldn't split three ways:
1) create
2) initialize
3) attach

Why? What would be the difference between create and initialize phases?

--
Heikki Linnakangas
Neon (https://neon.tech)

Re: PostgreSql: Canceled on conflict out to old pivot

2023-11-30 Thread Heikki Linnakangas

On 30/11/2023 18:24, Wirch, Eduard wrote:

> > The longest transaction that could occur is 1 min long.
I hate to drill on this, but are you very sure about that? A transaction

in a different database?

Don't be sorry for that, drilling down is important. ;) It took me so
long to reply because I had to prepare the information carefully. You're
right, on that day I observed the behavior, there were indeed long
running transactions in different DBs!

A-ha! :-D

My understanding of serializable isolation is that only transactions
which can somehow affect each other can conflict. It should be clear
for PostgreSql, that transactions belonging to different databases
cannot affect each other. Why do they cause serializable conflicts?

When the system runs low on the memory reserved to track the potential
conflicts, it "summarizes" old transactions and writes them to disk. The
summarization loses a lot of information: all we actually store on disk
is the commit sequence number of the earliest "out-conflicting"
transaction. We don't store the database oid that the transaction ran in.

The whole SSI mechanism is conservative so that you can get false
serialization errors even when there is no actual problem. This is just
an extreme case of that.

Perhaps we should keep more information in the on-disk summary format.
Preserving the database OID would make a lot of sense, it would be only
4 bytes extra per transaction. But we don't preserve it today.

If you want something visual, I prepared a SO question with similar
content like this mail, but added an image of the tx flow:
https://stackoverflow.com/questions/77544821/postgresql-canceled-on-conflict-out-to-old-pivot

Nice graphs!

--
Heikki Linnakangas
Neon (https://neon.tech)

Re: should check collations when creating partitioned index

Peter Eisentraut  writes:
>> Here is an updated patch that works as indicated above.
>> 
>> The behavior if you try to create an index with mismatching collations 
>> now is that it will skip over the column and complain at the end with 
>> something like
>> 
>> ERROR:  0A000: unique constraint on partitioned table must include all 
>> partitioning columns
>> DETAIL:  UNIQUE constraint on table "t1" lacks column "b" which is part 
>> of the partition key.
>> 
>> which perhaps isn't intuitive, but I think it would be the same if you 
>> somehow tried to build an index with different operator classes than the 
>> partitioning.  I think these less-specific error messages are ok in such 
>> edge cases.

> If there are no further comments on this patch version, I plan to go 
> ahead and commit it soon.

Sorry for slow response --- I've been dealing with a little too much
$REAL_LIFE lately.  Anyway, I'm content with the v2 patch.  I see
that the existing code works a little harder than this to produce
an on-point error message for mismatching operator, but after
studying that I'm not totally convinced that it's ideal behavior
either.  I think we can wait for some field complaints to see if
we need a better error message for mismatching collation, and
if so what the shape of the bad input is exactly.

regards, tom lane

Re: Refactoring backend fork+exec code

2023-11-30 Thread Heikki Linnakangas


On 30/11/2023 22:26, Andres Freund wrote:

Aside: Somewhat odd that InitAuxiliaryProcess() doesn't call
InitLWLockAccess().


Yeah that caught my eye too.

It seems to have been an oversight in commit 1c6821be31f. Before that, 
in 9.4, the lwlock stats were printed for aux processes too, on shutdown.


Committed a fix for that to master.

--
Heikki Linnakangas
Neon (https://neon.tech)

Re: Bug in pgbench prepared statements

2023-11-30 Thread Tristan Partin


On Wed Nov 29, 2023 at 7:38 PM CST, Lev Kokotov wrote:

Patch attached, if there is any interest in fixing this small bug.


I see prepareCommand() is called one more time in 
prepareCommandsInPipeline(). Should you also check the return value 
there?


It may also be useful to throw this patch on the January commitfest if 
no one else comes along to review/commit it. This first set of changes 
looks good to me.


--
Tristan Partin
Neon (https://neon.tech)

Re: meson: Stop using deprecated way getting path of files

2023-11-30 Thread Andrew Dunstan




On 2023-11-30 Th 16:00, Tristan Partin wrote:

On Wed Nov 29, 2023 at 1:42 PM CST, Andres Freund wrote:

Hi,

On 2023-11-29 13:11:23 -0600, Tristan Partin wrote:
> What is our limiting factor on bumping the minimum Meson version?

Old distro versions, particularly ones where the distro just has an 
older
python. It's one thing to require installing meson but quite another 
to also

require building python. There's some other ongoing discussion about
establishing the minimum baseline in a somewhat more, uh, formalized 
way:
https://postgr.es/m/CA%2BhUKGLhNs5geZaVNj2EJ79Dx9W8fyWUU3HxcpZy55sMGcY%3DiA%40mail.gmail.com 



I'll take a look there. According to Meson, the following versions had 
Python version bumps:


0.61.5: 3.6
0.56.2: 3.5
0.45.1: 3.4

Taking a look at pkgs.org, Debian 10, Ubuntu 20.04, and Oracle Linux 7 
(a RHEL re-spin), and CentOS 7, all have >= Python 3.6.8. Granted, 
this isn't the whole picture of what Postgres supports from version 
16+. To put things in perspective, Python 3.6 was released on December 
23, 2016, which is coming up on 7 years. Python 3.6 reached end of 
life on the same date in 2021.


Is there a complete list somewhere that talks about what platforms 
each version of Postgres supports?



You can look at animals in the buildfarm. For meson only release 16 and 
up matter.



cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Re: Add missing error codes to PANIC/FATAL error reports in xlogrecovery

Hi,

On 2023-11-30 16:02:55 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > Wondering if we should add a ERRCODE_CLUSTER_CORRUPTED for cases like this. 
> > We
> > have ERRCODE_DATA_CORRUPTED and ERRCODE_INDEX_CORRUPTED, which make
> > ERRCODE_DATA_CORRUPTED feel a bit too specific in this kind of situation?
> 
> Maybe.  We didn't officially define DATA_CORRUPTED as referring to
> table data, but given the existence of INDEX_CORRUPTED maybe we
> should treat it as that.

I'm on the fence about it. Certainly DATA_CORRUPTED would be more appropriate
than INTERNAL_ERROR.

> > Hm, this one arguably is not corruption, but we still cannot
> > continue. ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE or maybe a new error 
> > code?
> 
> ... I don't really like turning a whole bunch of error cases into
> the same error code without some closer analysis.

Other than this instance, they all indicate that the cluster is toast in some
way or another. So *_CORRUPTED seems appropriate. And even this instance would
be better off as _CORRUPTED than as INTERNAL_ERROR. There's so many of the
latter that you can't realistically alert on them occurring.

I don't like my idea of ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE much, that's
not something you realistically can alert on, and this error certainly is an
instance of "you're screwed until you manually intervene".

Greetings,

Andres Freund

Re: Add missing error codes to PANIC/FATAL error reports in xlogrecovery

Andres Freund  writes:
> Wondering if we should add a ERRCODE_CLUSTER_CORRUPTED for cases like this. We
> have ERRCODE_DATA_CORRUPTED and ERRCODE_INDEX_CORRUPTED, which make
> ERRCODE_DATA_CORRUPTED feel a bit too specific in this kind of situation?

Maybe.  We didn't officially define DATA_CORRUPTED as referring to
table data, but given the existence of INDEX_CORRUPTED maybe we
should treat it as that.  In any case ...

> Hm, this one arguably is not corruption, but we still cannot
> continue. ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE or maybe a new error code?

... I don't really like turning a whole bunch of error cases into
the same error code without some closer analysis.  I think you
are right that these need a bit more case-by-case thought.

regards, tom lane

Re: meson: Stop using deprecated way getting path of files

2023-11-30 Thread Tristan Partin


On Wed Nov 29, 2023 at 1:42 PM CST, Andres Freund wrote:

Hi,

On 2023-11-29 13:11:23 -0600, Tristan Partin wrote:
> What is our limiting factor on bumping the minimum Meson version?

Old distro versions, particularly ones where the distro just has an older
python. It's one thing to require installing meson but quite another to also
require building python. There's some other ongoing discussion about
establishing the minimum baseline in a somewhat more, uh, formalized way:
https://postgr.es/m/CA%2BhUKGLhNs5geZaVNj2EJ79Dx9W8fyWUU3HxcpZy55sMGcY%3DiA%40mail.gmail.com


I'll take a look there. According to Meson, the following versions had 
Python version bumps:


0.61.5: 3.6
0.56.2: 3.5
0.45.1: 3.4

Taking a look at pkgs.org, Debian 10, Ubuntu 20.04, and Oracle Linux 
7 (a RHEL re-spin), and CentOS 7, all have >= Python 3.6.8. Granted, 
this isn't the whole picture of what Postgres supports from version 16+. 
To put things in perspective, Python 3.6 was released on December 23, 
2016, which is coming up on 7 years. Python 3.6 reached end of life on 
the same date in 2021.


Is there a complete list somewhere that talks about what platforms each 
version of Postgres supports?


--
Tristan Partin
Neon (https://neon.tech)

Detecting some cases of missing backup_label

Hi,

I recently mentioned to Robert (and also Heikki earlier), that I think I see a
way to detect an omitted backup_label in a relevant subset of the cases (it'd
apply to the pg_control as well, if we moved to that).  Robert encouraged me
to share the idea, even though it does not provide complete protection.


The subset I think we can address is the following:

a) An omitted backup_label would lead to corruption, i.e. without the
   backup_label we won't start recovery at the right position. Obviously it'd
   be better to also catch a wrong procedure when it'd not cause corruption -
   perhaps my idea can be extended to handle that, with a small bit of
   overhead.

b) The backup has been taken from a primary. Unfortunately that probably can't
   be addressed - but the vast majority of backups are taken from a primary,
   so I think it's still a worthwhile protection.


Here's my approach

1) We add a XLOG_BACKUP_START WAL record when starting a base backup on a
   primary, emitted just *after* the checkpoint completed

2) When replaying a base backup start record, we create a state file that
   includes the corresponding LSN in the filename

3) On the primary, the state file for XLOG_BACKUP_START is *not* created at
   that time. Instead the state file is created during pg_backup_stop().

4) When replaying a XLOG_BACKUP_END record, we verif that the state file
   created by XLOG_BACKUP_START is present, and error out if not.  Backups
   that started before the redo LSN from backup_label are ignored
   (necessitates remembering that LSN, but we've been discussing that anyway).


Because the backup state file on the primary is only created during
pg_backup_stop(), a copy of the data directory taken between pg_backup_start()
and pg_backup_stop() does *not* contain the corresponding "backup state
file". Because of this, an omitted backup_label is detected if recovery does
not start early enough - recovery won't encounter the XLOG_BACKUP_START record
and thus would not create the state file, leading to an error in 4).

It is not a problem that the primary does not create the state file before the
pg_backup_stop() - if the primary crashes before pg_backup_stop(), there is no
XLOG_BACKUP_END and thus no error will be raised.  It's a bit odd that the
sequence differs between normal processing and recovery, but I think that's
nothing a good comment couldn't explain.


I haven't worked out the details, but I think we might be able extend this to
catch errors even if there is no checkpoint during the base backup, by
emitting the WAL record *before* the RequestCheckpoint(), and creating the
corresponding state file during backup_label processing at the start of
recovery.  That'd probably make the logic for when we can remove the backup
state files a bit more complicated, but I think we could deal with that.


Comments? Swear words?

Greetings,

Andres Freund

Re: CRC32C Parallel Computation Optimization on ARM

On Thu, Nov 23, 2023 at 08:05:26AM +, Xiang Gao wrote:
> On Date: Wed, 22 Nov 2023 15:06:18PM -0600, Nathan Bossart wrote:
>>pg_crc32c_armv8.o: CFLAGS += ${CFLAGS_CRC} ${CFLAGS_CRYPTO}
> 
> It does not work correctly. CFLAGS ='-march=armv8-a+crc,
> -march=armv8-a+crypto', what actually works is '-march=armv8-a+crypto'.
> 
> We set a new variable CLAGS_CRC_CRYPTO,In configure.ac,
> 
> If test x"$CFLAGS_CRC" != x"" || test x"CFLAGS_CRYPTO" != x""; then
>   CLAGS_CRC_CRYPTO = '-march=armv8-a+crc+crypto'
> fi
> 
> then in makefile,
> pg_crc32c_armv8.o: CFLAGS +=${ CLAGS_CRC_CRYPTO }

Ah, I see.  We need to append +crc and/or +crypto based on what the
compiler understands.  That seems fine to me...

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Re: Refactoring backend fork+exec code

2023-11-30 Thread Thomas Munro

On Fri, Dec 1, 2023 at 9:31 AM Andres Freund  wrote:
> On 2023-11-30 12:44:33 -0600, Tristan Partin wrote:
> > >  +/*
> > >  + * Set reference point for stack-depth checking.  This might 
> > > seem
> > >  + * redundant in !EXEC_BACKEND builds; but it's not because the 
> > > postmaster
> > >  + * launches its children from signal handlers, so we might be 
> > > running on
> > >  + * an alternative stack. XXX still true?
> > >  + */
> > >  +(void) set_stack_base();
> >
> > Looks like there is still this XXX left. Can't say I completely understand
> > the second sentence either.
>
> We used to start some child processes of postmaster in signal handlers. That
> was fixed in
>
> commit 7389aad6366
> Author: Thomas Munro 
> Date:   2023-01-12 12:34:23 +1300
>
> Use WaitEventSet API for postmaster's event loop.
>
>
> In some cases signal handlers run on a separate stack, which meant that the
> set_stack_base() we did in postmaster would yield a completely bogus stack
> depth estimation.  So this comment should likely have been removed. Thomas?

Right, I should delete that comment in master and 16.  While wondering
what to write instead, my first thought is that it is better to leave
the actual call there though, because otherwise there is a small
difference in stack reference point between EXEC_BACKEND and
!EXEC_BACKEND builds, consumed by a few postmaster stack frames.  So
the new comment would just say that.

(I did idly wonder if there is a longjmp trick to zap those frames
post-fork, not looked into and probably doesn't really improve
anything we care about...)

postgres_fdw test timeouts

I noticed that the postgres_fdw test periodically times out on Windows:


https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2023-11-10%2003%3A12%3A58
https://cirrus-ci.com/task/5504294095421440
https://cirrus-ci.com/task/4814111003901952
https://cirrus-ci.com/task/5343037535027200
https://cirrus-ci.com/task/5893655966253056
https://cirrus-ci.com/task/4568159320014848
https://cirrus-ci.com/task/5238067850641408
(and many more from cfbot)

>From a quick sampling, the relevant test logs end with either

ERROR:  server "unknownserver" does not exist
STATEMENT:  SELECT postgres_fdw_disconnect('unknownserver');

or

ERROR:  relation "public.non_existent_table" does not exist
CONTEXT:  remote SQL command: SELECT a, b, c FROM 
public.non_existent_table
STATEMENT:  SELECT * FROM async_pt;

before the test seems to hang.

AFAICT the failures began around September 10th, which leads me to wonder
if this is related to commit 04a09ee.  That is little more than a wild
guess, though.  I haven't been able to deduce much else from the logs I can
find, and I didn't find any previous reports about this in the archives
after lots of searching, so I thought I'd at least park these notes here in
case anyone else has ideas.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Re: GUC names in messages

2023-11-30 Thread Peter Eisentraut


On 30.11.23 06:59, Michael Paquier wrote:

ereport(elevel,
(errcode(ERRCODE_UNDEFINED_OBJECT),
-errmsg("unrecognized configuration parameter \"%s\" 
in file \"%s\" line %d",
-   item->name,
+   /* translator: %s%s%s is for an optionally quoted GUC 
name */
+errmsg("unrecognized configuration parameter %s%s%s 
in file \"%s\" line %d",
+   GUC_FORMAT(item->name),
item->filename, 
item->sourceline)));


I think this is completely over-engineered and wrong.  If we start down 
this road, then the next person is going to start engineering some rules 
by which we should quote file names and other things.  Which will lead 
to more confusion, not less.  The whole point of this quoting thing is 
that you do it all the time or not, not dynamically based on what's 
inside of it.


The original version of this string (and similar ones) seems the most 
correct, simple, and useful one to me.

Re: Refactoring backend fork+exec code

Hi,

On 2023-11-30 12:44:33 -0600, Tristan Partin wrote:
> >  +/*
> >  + * Set reference point for stack-depth checking.  This might seem
> >  + * redundant in !EXEC_BACKEND builds; but it's not because the 
> > postmaster
> >  + * launches its children from signal handlers, so we might be 
> > running on
> >  + * an alternative stack. XXX still true?
> >  + */
> >  +(void) set_stack_base();
> 
> Looks like there is still this XXX left. Can't say I completely understand
> the second sentence either.

We used to start some child processes of postmaster in signal handlers. That
was fixed in

commit 7389aad6366
Author: Thomas Munro 
Date:   2023-01-12 12:34:23 +1300

Use WaitEventSet API for postmaster's event loop.

In some cases signal handlers run on a separate stack, which meant that the
set_stack_base() we did in postmaster would yield a completely bogus stack
depth estimation.  So this comment should likely have been removed. Thomas?

Greetings,

Andres Freund

Re: Refactoring backend fork+exec code

Hi,

On 2023-11-30 01:36:25 +0200, Heikki Linnakangas wrote:
> - patch 1 splits CreateSharedMemoryAndSemaphores into two functions:
> CreateSharedMemoryAndSemaphores is now only called at postmaster startup,
> and a new function called AttachSharedMemoryStructs() is called in backends
> in EXEC_BACKEND mode. I extracted the common parts of those functions to a
> new static function. (Some of this refactoring used to be part of the 3rd
> patch in the series, but it seems useful on its own, so I split it out.)

I like that idea.



> From a96b6e92fdeaa947bf32774c425419b8f987b8e2 Mon Sep 17 00:00:00 2001
> From: Heikki Linnakangas 
> Date: Thu, 30 Nov 2023 00:01:25 +0200
> Subject: [PATCH v3 1/7] Refactor CreateSharedMemoryAndSemaphores
>
> For clarity, have separate functions for *creating* the shared memory
> and semaphores at postmaster or single-user backend startup, and
> for *attaching* to existing shared memory structures in EXEC_BACKEND
> case. CreateSharedMemoryAndSemaphores() is now called only at
> postmaster startup, and a new AttachSharedMemoryStructs() function is
> called at backend startup in EXEC_BACKEND mode.

I assume CreateSharedMemoryAndSemaphores() is still called during crash
restart?  I wonder if it shouldn't split three ways:
1) create
2) initialize
3) attach


> From 3478cafcf74a5c8d649e0287e6c72669a29c0e70 Mon Sep 17 00:00:00 2001
> From: Heikki Linnakangas 
> Date: Thu, 30 Nov 2023 00:02:03 +0200
> Subject: [PATCH v3 2/7] Pass BackgroundWorker entry in the parameter file in
>  EXEC_BACKEND mode
>
> This makes it possible to move InitProcess later in SubPostmasterMain
> (in next commit), as we no longer need to access shared memory to read
> background worker entry.

>  static void read_backend_variables(char *id, Port *port);
> @@ -4831,7 +4833,7 @@ SubPostmasterMain(int argc, char *argv[])
>   strcmp(argv[1], "--forkavlauncher") == 0 ||
>   strcmp(argv[1], "--forkavworker") == 0 ||
>   strcmp(argv[1], "--forkaux") == 0 ||
> - strncmp(argv[1], "--forkbgworker=", 15) == 0)
> + strncmp(argv[1], "--forkbgworker", 14) == 0)
>   PGSharedMemoryReAttach();
>   else
>   PGSharedMemoryNoReAttach();
> @@ -4962,10 +4964,8 @@ SubPostmasterMain(int argc, char *argv[])
>
>   AutoVacWorkerMain(argc - 2, argv + 2);  /* does not return */
>   }
> - if (strncmp(argv[1], "--forkbgworker=", 15) == 0)
> + if (strncmp(argv[1], "--forkbgworker", 14) == 0)


Now that we don't need to look at parameters anymore, these should probably be
just a strcmp(), like the other cases?


> From 0d071474e12a70ff8113c7b0731c5b97fec45007 Mon Sep 17 00:00:00 2001
> From: Heikki Linnakangas 
> Date: Wed, 29 Nov 2023 23:47:25 +0200
> Subject: [PATCH v3 3/7] Refactor how InitProcess is called
>
> The order of process initialization steps is now more consistent
> between !EXEC_BACKEND and EXEC_BACKEND modes. InitProcess() is called
> at the same place in either mode. We can now also move the
> AttachSharedMemoryStructs() call into InitProcess() itself. This
> reduces the number of "#ifdef EXEC_BACKEND" blocks.

Yay.


> diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
> index cdfdd6fbe1d..6c708777dde 100644
> --- a/src/backend/storage/lmgr/proc.c
> +++ b/src/backend/storage/lmgr/proc.c
> @@ -461,6 +461,12 @@ InitProcess(void)
>*/
>   InitLWLockAccess();
>   InitDeadLockChecking();
> +
> +#ifdef EXEC_BACKEND
> + /* Attach process to shared data structures */
> + if (IsUnderPostmaster)
> + AttachSharedMemoryStructs();
> +#endif
>  }
>
>  /*
> @@ -614,6 +620,12 @@ InitAuxiliaryProcess(void)
>* Arrange to clean up at process exit.
>*/
>   on_shmem_exit(AuxiliaryProcKill, Int32GetDatum(proctype));
> +
> +#ifdef EXEC_BACKEND
> + /* Attach process to shared data structures */
> + if (IsUnderPostmaster)
> + AttachSharedMemoryStructs();
> +#endif
>  }

Aside: Somewhat odd that InitAuxiliaryProcess() doesn't call
InitLWLockAccess().


I think a short comment explaining why we can attach to shmem structs after
already accessing shared memory earlier in the function would be worthwhile.


> From ce51876f87f1e4317e25baf64184749448fcd033 Mon Sep 17 00:00:00 2001
> From: Heikki Linnakangas 
> Date: Thu, 30 Nov 2023 00:07:34 +0200
> Subject: [PATCH v3 7/7] Refactor postmaster child process launching
>
> - Move code related to launching backend processes to new source file,
>   launch_backend.c
>
> - Introduce new postmaster_child_launch() function that deals with the
>   differences between EXEC_BACKEND and fork mode.
>
> - Refactor the mechanism of passing information from the parent to
>   child process. Instead of using different command-line arguments when
>   launching the child process in EXEC_BACKEND mode, pass a
>   variable-length blob of data along with all the global variables. The
>   contents of that blob depend o

Re: Add missing error codes to PANIC/FATAL error reports in xlogrecovery

2023-11-30 Thread Robert Haas

On Thu, Nov 30, 2023 at 2:47 PM Andres Freund  wrote:
> Another aside: Isn't the hint here obsolete since we've removed exclusive
> backups? I can't think of any scenario now where removing backup_label would
> be correct in a non-exclusive backup.

That's an extremely good point.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: Add missing error codes to PANIC/FATAL error reports in xlogrecovery

Hi,

On 2023-11-30 10:54:12 -0800, Krishnakumar R wrote:
> diff --git a/src/backend/access/transam/xlogrecovery.c 
> b/src/backend/access/transam/xlogrecovery.c
> index c6156a..2f50928e7e 100644
> --- a/src/backend/access/transam/xlogrecovery.c
> +++ b/src/backend/access/transam/xlogrecovery.c
> @@ -630,7 +630,8 @@ InitWalRecovery(ControlFileData *ControlFile, bool 
> *wasShutdown_ptr,
>   if (!ReadRecord(xlogprefetcher, LOG, false,
>   
> checkPoint.ThisTimeLineID))
>   ereport(FATAL,
> - (errmsg("could not find 
> redo location referenced by checkpoint record"),
> + 
> (errcode(ERRCODE_DATA_CORRUPTED),
> +  errmsg("could not find 
> redo location referenced by checkpoint record"),
>errhint("If you are 
> restoring from a backup, touch \"%s/recovery.signal\" or 
> \"%s/standby.signal\" and add required recovery options.\n"
>"If 
> you are not restoring from a backup, try removing the file 
> \"%s/backup_label\".\n"
>"Be 
> careful: removing \"%s/backup_label\" will result in a corrupt cluster if 
> restoring from a backup.",

Wondering if we should add a ERRCODE_CLUSTER_CORRUPTED for cases like this. We
have ERRCODE_DATA_CORRUPTED and ERRCODE_INDEX_CORRUPTED, which make
ERRCODE_DATA_CORRUPTED feel a bit too specific in this kind of situation?

OTOH, just having anything other than ERRCODE_INTERNAL_ERROR is better.


> @@ -640,7 +641,8 @@ InitWalRecovery(ControlFileData *ControlFile, bool 
> *wasShutdown_ptr,
>   else
>   {
>   ereport(FATAL,
> - (errmsg("could not locate required 
> checkpoint record"),
> + (errcode(ERRCODE_DATA_CORRUPTED),
> +  errmsg("could not locate required 
> checkpoint record"),
>errhint("If you are restoring from a 
> backup, touch \"%s/recovery.signal\" or \"%s/standby.signal\" and add 
> required recovery options.\n"
>"If you are not 
> restoring from a backup, try removing the file \"%s/backup_label\".\n"
>"Be careful: removing 
> \"%s/backup_label\" will result in a corrupt cluster if restoring from a 
> backup.",

Another aside: Isn't the hint here obsolete since we've removed exclusive
backups? I can't think of any scenario now where removing backup_label would
be correct in a non-exclusive backup.


> @@ -817,7 +820,8 @@ InitWalRecovery(ControlFileData *ControlFile, bool 
> *wasShutdown_ptr,
>*/
>   switchpoint = 
> tliSwitchPoint(ControlFile->checkPointCopy.ThisTimeLineID, expectedTLEs, 
> NULL);
>   ereport(FATAL,
> - (errmsg("requested timeline %u is not a child 
> of this server's history",
> + (errcode(ERRCODE_DATA_CORRUPTED),
> +  errmsg("requested timeline %u is not a child 
> of this server's history",
>   recoveryTargetTLI),
>errdetail("Latest checkpoint is at %X/%X on 
> timeline %u, but in the history of the requested timeline, the server forked 
> off from that timeline at %X/%X.",
>  
> LSN_FORMAT_ARGS(ControlFile->checkPoint),

Hm, this one arguably is not corruption, but we still cannot
continue. ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE or maybe a new error code?

Greetings,

Andres Freund

Add missing error codes to PANIC/FATAL error reports in xlogrecovery

2023-11-30 Thread Krishnakumar R

Hi,

Please find a patch attached which adds missing sql error code in
error reports which are FATAL or PANIC, in xlogrecovery.
This will help with deducing patterns when looking at error reports
from multiple postgres instances.

--
Thanks and Regards,
Krishnakumar (KK).
[Microsoft]
From 4cc518f25710c512ba3f9452392dc6ea67c2248b Mon Sep 17 00:00:00 2001
From: "Krishnakumar R (KK)" 
Date: Thu, 30 Nov 2023 00:56:40 -0800
Subject: [PATCH v1] Add missing error codes to PANIC/FATAL error reports.

---
 src/backend/access/transam/xlogrecovery.c | 45 +++
 1 file changed, 30 insertions(+), 15 deletions(-)

diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index c6156a..2f50928e7e 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -630,7 +630,8 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 if (!ReadRecord(xlogprefetcher, LOG, false,
 checkPoint.ThisTimeLineID))
 	ereport(FATAL,
-			(errmsg("could not find redo location referenced by checkpoint record"),
+			(errcode(ERRCODE_DATA_CORRUPTED),
+			 errmsg("could not find redo location referenced by checkpoint record"),
 			 errhint("If you are restoring from a backup, touch \"%s/recovery.signal\" or \"%s/standby.signal\" and add required recovery options.\n"
 	 "If you are not restoring from a backup, try removing the file \"%s/backup_label\".\n"
 	 "Be careful: removing \"%s/backup_label\" will result in a corrupt cluster if restoring from a backup.",
@@ -640,7 +641,8 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 		else
 		{
 			ereport(FATAL,
-	(errmsg("could not locate required checkpoint record"),
+	(errcode(ERRCODE_DATA_CORRUPTED),
+	 errmsg("could not locate required checkpoint record"),
 	 errhint("If you are restoring from a backup, touch \"%s/recovery.signal\" or \"%s/standby.signal\" and add required recovery options.\n"
 			 "If you are not restoring from a backup, try removing the file \"%s/backup_label\".\n"
 			 "Be careful: removing \"%s/backup_label\" will result in a corrupt cluster if restoring from a backup.",
@@ -764,7 +766,8 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 			 * simplify processing around checkpoints.
 			 */
 			ereport(PANIC,
-	(errmsg("could not locate a valid checkpoint record")));
+	(errcode(ERRCODE_DATA_CORRUPTED),
+	 errmsg("could not locate a valid checkpoint record")));
 		}
 		memcpy(&checkPoint, XLogRecGetData(xlogreader), sizeof(CheckPoint));
 		wasShutdown = ((record->xl_info & ~XLR_INFO_MASK) == XLOG_CHECKPOINT_SHUTDOWN);
@@ -817,7 +820,8 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 		 */
 		switchpoint = tliSwitchPoint(ControlFile->checkPointCopy.ThisTimeLineID, expectedTLEs, NULL);
 		ereport(FATAL,
-(errmsg("requested timeline %u is not a child of this server's history",
+(errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("requested timeline %u is not a child of this server's history",
 		recoveryTargetTLI),
  errdetail("Latest checkpoint is at %X/%X on timeline %u, but in the history of the requested timeline, the server forked off from that timeline at %X/%X.",
 		   LSN_FORMAT_ARGS(ControlFile->checkPoint),
@@ -833,7 +837,8 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 		tliOfPointInHistory(ControlFile->minRecoveryPoint - 1, expectedTLEs) !=
 		ControlFile->minRecoveryPointTLI)
 		ereport(FATAL,
-(errmsg("requested timeline %u does not contain minimum recovery point %X/%X on timeline %u",
+(errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("requested timeline %u does not contain minimum recovery point %X/%X on timeline %u",
 		recoveryTargetTLI,
 		LSN_FORMAT_ARGS(ControlFile->minRecoveryPoint),
 		ControlFile->minRecoveryPointTLI)));
@@ -861,12 +866,14 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 			 checkPoint.newestCommitTsXid)));
 	if (!TransactionIdIsNormal(XidFromFullTransactionId(checkPoint.nextXid)))
 		ereport(PANIC,
-(errmsg("invalid next transaction ID")));
+(errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("invalid next transaction ID")));
 
 	/* sanity check */
 	if (checkPoint.redo > CheckPointLoc)
 		ereport(PANIC,
-(errmsg("invalid redo in checkpoint record")));
+(errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("invalid redo in checkpoint record")));
 
 	/*
 	 * Check whether we need to force recovery from WAL.  If it appears to
@@ -877,7 +884,8 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 	{
 		if (wasShutdown)
 			ereport(PANIC,
-	(errmsg("invalid redo record in shutdown checkpoint")));
+	(errcode(ERRCODE_DATA_CORRUPTED),
+	 errmsg("invalid redo record in shutdown checkpoint")));
 		InRecovery = true;
 	}
 	else if (ControlFile->state != DB_S

Re: Refactoring backend fork+exec code

2023-11-30 Thread Tristan Partin


On Wed Nov 29, 2023 at 5:36 PM CST, Heikki Linnakangas wrote:

On 11/10/2023 14:12, Heikki Linnakangas wrote:
Here's another rebased patch set. Compared to previous version, I did a 
little more refactoring around CreateSharedMemoryAndSemaphores and 
InitProcess:


- patch 1 splits CreateSharedMemoryAndSemaphores into two functions: 
CreateSharedMemoryAndSemaphores is now only called at postmaster 
startup, and a new function called AttachSharedMemoryStructs() is called 
in backends in EXEC_BACKEND mode. I extracted the common parts of those 
functions to a new static function. (Some of this refactoring used to be 
part of the 3rd patch in the series, but it seems useful on its own, so 
I split it out.)


- patch 3 moves the call to AttachSharedMemoryStructs() to 
InitProcess(), reducing the boilerplate code a little.



The patches are incrementally useful, so if you don't have time to 
review all of them, a review on a subset would be useful too.



 From 8886db1ed6bae21bf6d77c9bb1230edbb55e24f9 Mon Sep 17 00:00:00 2001
 From: Heikki Linnakangas 
 Date: Thu, 30 Nov 2023 00:04:22 +0200
 Subject: [PATCH v3 4/7] Pass CAC as argument to backend process


For me, being new to the code, it would be nice to have more of an 
explanation as to why this is "better." I don't doubt it; it would just 
help me and future readers of this commit in the future. More of an 
explanation in the commit message would suffice.


My other comment on this commit is that we now seem to have lost the 
context on what CAC stands for. Before we had the member variable to 
explain it. A comment on the enum would be great or changing cac named 
variables to canAcceptConnections. I did notice in patch 7 that there 
are still some variables named canAcceptConnections around, so I'll 
leave this comment up to you.



  From 98f8397b32a0b36e221475b32697c9c5bbca86a0 Mon Sep 17 00:00:00 2001
  From: Heikki Linnakangas 
  Date: Wed, 11 Oct 2023 13:38:06 +0300
  Subject: [PATCH v3 5/7] Remove ConnCreate and ConnFree, and allocate Port in
   stack


I like it separate.


 From 79aab42705a8cb0e16e61c33052fc56fdd4fca76 Mon Sep 17 00:00:00 2001
 From: Heikki Linnakangas 
 Date: Wed, 11 Oct 2023 13:38:10 +0300
 Subject: [PATCH v3 6/7] Introduce ClientSocket, rename some funcs



 +static intBackendStartup(ClientSocket *port);


s/port/client_sock


 -port->remote_hostname = strdup(remote_host);
 +port->remote_hostname = pstrdup(remote_host);
 +MemoryContextSwitchTo(oldcontext);


Something funky with the whitespace here, but my eyes might also be 
playing tricks on me. Mixing spaces in tabs like what do in this 
codebase makes it difficult to review :).



 From ce51876f87f1e4317e25baf64184749448fcd033 Mon Sep 17 00:00:00 2001
 From: Heikki Linnakangas 
 Date: Thu, 30 Nov 2023 00:07:34 +0200
 Subject: [PATCH v3 7/7] Refactor postmaster child process launching



 +entry_kinds[child_type].main_fn(startup_data, 
startup_data_len);
 +Assert(false);


Seems like you want the pg_unreachable() macro here instead of 
Assert(false). Similar comment at the end of SubPostmasterMain().



 +if (fwrite(param, paramsz, 1, fp) != 1)
 +{
 +ereport(LOG,
 +(errcode_for_file_access(),
 + errmsg("could not write to file \"%s\": %m", 
tmpfilename)));
 +FreeFile(fp);
 +return -1;
 +}
 +
 +/* Release file */
 +if (FreeFile(fp))
 +{
 +ereport(LOG,
 +(errcode_for_file_access(),
 + errmsg("could not write to file \"%s\": %m", 
tmpfilename)));
 +return -1;
 +}


Two pieces of feedback here. I generally find write(2) more useful than 
fwrite(3) because write(2) will report a useful errno, whereas fwrite(2) 
just uses ferror(3). The additional errno information might be valuable 
context in the log message. Up to you if you think it is also valuable.


The log message if FreeFile() fails doesn't seem to make sense to me. 
I didn't see any file writing in that code path, but it is possible that 
I missed something.



 +/*
 + * Set reference point for stack-depth checking.  This might seem
 + * redundant in !EXEC_BACKEND builds; but it's not because the 
postmaster
 + * launches its children from signal handlers, so we might be running 
on
 + * an alternative stack. XXX still true?
 + */
 +(void) set_stack_base();


Looks like there is still this XXX left. Can't say I completely 
understand the second sentence either.



 +/*
 + * make sure stderr is in binary mode before anything can possibly be
 + * written to it, in case it's actually the syslogger pipe, so the 
pipe
 + * chunking protocol isn't disturbed. Non-logpipe data gets 
translated on
 +

Re: Parallel CREATE INDEX for BRIN indexes

2023-11-30 Thread Matthias van de Meent

On Thu, 30 Nov 2023 at 01:10, Tomas Vondra
 wrote:
>
> On 11/29/23 23:59, Matthias van de Meent wrote:
>> On Wed, 29 Nov 2023 at 21:56, Tomas Vondra
>>  wrote:
>>>
>>> On 11/29/23 21:30, Matthias van de Meent wrote:
 On Wed, 29 Nov 2023 at 18:55, Tomas Vondra
  wrote:
> I did try to measure how much it actually saves, but none of the tests I
> did actually found measurable improvement. So I'm tempted to just not
> include this part, and accept that we may deserialize some of the tuples
> unnecessarily.
>
> Did you actually observe measurable improvements in some cases?

 The improvements would mostly stem from brin indexes with multiple
 (potentially compressed) by-ref types, as they go through more complex
 and expensive code to deserialize, requiring separate palloc() and
 memcpy() calls each.
 For single-column and by-value types the improvements are expected to
 be negligible, because there is no meaningful difference between
 copying a single by-ref value and copying its container; the
 additional work done for each tuple is marginal for those.

 For an 8-column BRIN index ((sha256((id)::text::bytea)::text),
 (sha256((id+1)::text::bytea)::text),
 (sha256((id+2)::text::bytea)::text), ...) instrumented with 0003 I
 measured a difference of 10x less time spent in the main loop of
 _brin_end_parallel, from ~30ms to 3ms when dealing with 55k 1-block
 ranges. It's not a lot, but worth at least something, I guess?

>>>
>>> It is something, but I can't really convince myself it's worth the extra
>>> code complexity. It's a somewhat extreme example, and the parallelism
>>> certainly saves much more than this.
>>
>> True. For this, I usually keep in mind that the docs on multi-column
>> indexes still indicate to use 1 N-column brin index over N 1-column
>> brin indexes (assuming the same storage parameters), so multi-column
>> BRIN indexes should not be considered to be uncommon:
>>
>> "The only reason to have multiple BRIN indexes instead of one
>> multicolumn BRIN index on a single table is to have a different
>> pages_per_range storage parameter."
>>
>> Note that most of the time in my example index is spent in creating
>> the actual tuples due to the use of hashing for data generation; for
>> index or plain to-text formatting the improvement is much more
>> pronounced: If I use an 8-column index (id::text, id, ...), index
>> creation takes ~500ms with 4+ workers. Of this, deforming takes some
>> 20ms, though when skipping the deforming step (i.e.,with my patch) it
>> takes ~3.5ms. That's a 3% shaved off the build time when the index
>> shape is beneficial.
>>
>
> That's all true, and while 3.5% is not something to ignore, my POV is
> that the parallelism speeds this up from ~2000ms to ~500ms. Yes, it
> would be great to shave off the extra 1% (relative to the original
> duration). But I don't have a great idea how to do code that in a way
> that is readable, and I don't want to stall the patch indefinitely
> because of a comparatively small improvement.
>
> Therefore I propose we get the simpler code committed and leave this as
> a future improvement.

That's fine with me, it is one reason why I kept it as a separate patch file.

 The attached patch fixes the issue that you called out .
 It also further updates _brin_end_parallel: the final 'write empty
 tuples' loop is never hit and is thus removed, because if there were
 any tuples in the spool we'd have filled the empty ranges at the end
 of the main loop, and if there were no tuples in the spool then the
 memtuple would still be at its original initialized value of 0 thus
 resulting in a constant false condition. I also updated some comments.

>>>
>>> Ah, right. I'll take a look tomorrow, but I guess I didn't realize we
>>> insert the empty ranges in the main loop, because we're already looking
>>> at the *next* summary.
>>
>> Yes, merging adds some significant complexity here. I don't think we
>> can easily get around that though...
>>
>>> But I think the idea was to insert empty ranges if there's a chunk of
>>> empty ranges at the end of the table, after the last tuple the index
>>> build reads. But I'm not sure that can actually happen ...
>>
>> This would be trivial to construct with partial indexes; e.g. WHERE
>> (my_pk IS NULL) would consist of exclusively empty ranges.
>> I don't see a lot of value in partial BRIN indexes, but I may be
>> overlooking something.
>>
>
> Oh, I haven't even thought about partial BRIN indexes! I'm sure for
> those it's even more important to actually fill-in the empty ranges,
> otherwise we end up scanning the whole supposedly filtered-out part of
> the table. I'll do some testing with that.

I just ran some more tests in less favorable environments, and it
looks like I hit a bug:

% SET max_parallel_workers = 0;
% CREATE INDEX ... USING brin (...);
ERROR:  cannot update tuples during a p

Re: Transaction timeout

2023-11-30 Thread Andrey M. Borodin


> On 20 Nov 2023, at 06:33, 邱宇航  wrote:

Nikolay, Peter, Fujii, Tung, Yuhang, thank you for reviewing this.
I'll address feedback soon, this patch has been for a long time on my TODO list.
I've started with fixing problem of COMMIT AND CHAIN by restarting timeout 
counter.
Tomorrow I plan to fix raising of the timeout when the transaction is idle.
Renaming transaction_timeout to something else (to avoid confusion with 
prepared xacts) also seems correct to me.

Thanks!


Best regards, Andrey Borodin.


v5-0001-Intorduce-transaction_timeout.patch
Description: Binary data

Re: [HACKERS] Changing references of password encryption to hashing

2023-11-30 Thread Robert Haas

On Wed, Nov 29, 2023 at 5:02 PM Nathan Bossart  wrote:
> On Wed, Nov 29, 2023 at 04:02:11PM -0500, Robert Haas wrote:
> > I'd fully support having good documentation that says "hey, here are
> > the low security authentication configurations, here are the
> > medium-security ones, here are the high security ones, and here's why
> > these ones are better than those ones and what they protect against
> > and what risks remain." That would be awesome.
>
> +1.  IMO the "Password Authentication" section [0] does this pretty well
> already.

That's limited to just the password-based methods, though, so some
broader discussion of the whole suite of available techniques could be
useful. It does call out the known weaknesses of the md5 and password,
though, which is good.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: pg_upgrade and logical replication

2023-11-30 Thread vignesh C

On Thu, 30 Nov 2023 at 13:35, Amit Kapila  wrote:
>
> On Wed, Nov 29, 2023 at 3:02 PM Amit Kapila  wrote:
> >
>
> In general, the test cases are a bit complex to understand, so, it
> will be difficult to enhance these later. The complexity comes from
> the fact that one upgrade test is trying to test multiple things (a)
> Enabled/Disabled subscriptions; (b) relation states 'i' and 'r' are
> preserved after the upgrade. (c) rows from non-refreshed tables are
> not copied, etc. I understand that you may want to cover as many
> things possible in one test to have fewer upgrade tests which could
> save some time but I think it makes the test somewhat difficult to
> understand and enhance. Can we try to split it such that (a) and (b)
> are tested in one test and others could be separated out?

Yes, I had tried to combine a few tests as it was taking more time to
run. I have refactored the tests by removing tab_not_upgraded1 related
test which is more of a logical replication test, adding more
comments, removing intermediate select count checks. So now we have
test1) which checks for upgrade with subscriber having table in
init/ready state, test2) Check that the data inserted to the publisher
when the subscriber is down will  be replicated to the new subscriber
once the new subscriber is started (these are done as continuation of
the previous test). test3) Check that pg_upgrade fails when
max_replication_slots configured in the new cluster is less than the
number of subscriptions in the old cluster. test4) Check upgrade fails
with old instance with relation in 'd' datasync(invalid) state and
missing replication origin.
In test4 I have combined both datasync relation state and missing
replication origin as the validation for both is in the same file. I
felt the readability is better now, do let me know if any of the test
is still difficult to understand.

> Few other comments:
> ===
> 1.
> +$old_sub->safe_psql('postgres',
> + "CREATE SUBSCRIPTION regress_sub CONNECTION '$connstr' PUBLICATION
> regress_pub"
> +);
> +
> +$old_sub->wait_for_subscription_sync($publisher, 'regress_sub');
> +
> +# After the above wait_for_subscription_sync call the table can be either in
> +# 'syncdone' or in 'ready' state. Now wait till the table reaches
> 'ready' state.
> +my $synced_query =
> +  "SELECT count(1) = 1 FROM pg_subscription_rel WHERE srsubstate = 'r'";
> +$old_sub->poll_query_until('postgres', $synced_query)
> +  or die "Timed out while waiting for the table to reach ready state";
>
> Can the table be in 'i' state after above test? If not, then above
> comment is misleading.

This part of test is to get the table in ready state/ modified the
comments appropriately

> 2.
> +# --
> +# Check that pg_upgrade is successful when all tables are in ready or in
> +# init state.
> +# --
> +$publisher->safe_psql('postgres',
> + "INSERT INTO tab_upgraded1 VALUES (generate_series(2,50), 'before
> initial sync')"
> +);
> +$publisher->wait_for_catchup('regress_sub');
>
> The previous comment applies to this one as well.

I have removed this comment and moved it before the upgrade command as
it is more appropriate there.

> 3.
> +$publisher->safe_psql('postgres', "CREATE PUBLICATION regress_pub1");
> +$old_sub->safe_psql('postgres',
> + "CREATE SUBSCRIPTION regress_sub1 CONNECTION '$connstr' PUBLICATION
> regress_pub1"
> +);
> +$old_sub->wait_for_subscription_sync($publisher, 'regress_sub1');
> +
> +# Change configuration to prepare a subscription table in init state
> +$old_sub->append_conf('postgresql.conf',
> + "max_logical_replication_workers = 0");
> +$old_sub->restart;
> +
> +# Add tab_upgraded2 to the publication. Now publication has tab_upgraded1
> +# and tab_upgraded2 tables.
> +$publisher->safe_psql('postgres',
> + "ALTER PUBLICATION regress_pub ADD TABLE tab_upgraded2");
> +
> +$old_sub->safe_psql('postgres',
> + "ALTER SUBSCRIPTION regress_sub REFRESH PUBLICATION");
>
> These two cases for Create and Alter look confusing. I think it would
> be better if Alter's case is moved before the comment: "Check that
> pg_upgrade is successful when all tables are in ready or in init
> state.".

I have added more comments to make it clear now. I have moved the
"check that pg_upgrade is successful when all tables ..." before the
upgrade command to be more clearer. Added comment "Pre-setup for
preparing subscription table in init state. Add tab_upgraded2 to the
publication." and "# The table tab_upgraded2 will be in init state as
the subscriber configuration  for max_logical_replication_workers is
set to 0."

> 4.
> +# Insert a row in tab_upgraded1 and tab_not_upgraded1 publisher table while
> +# it's down.
> +insert_line_at_pub('while old_sub is down');
>
> Isn't sub routine insert_line_at_pub() inserts in all three tables? If
> so, then the above comment seems to be wrong and I think it is better
> to explain the inte

Re: Postgres Partitions Limitations (5.11.2.3)

2023-11-30 Thread Laurenz Albe

On Thu, 2023-11-30 at 19:22 +0530, Ashutosh Bapat wrote:
> May be attach the patch to hackers thread (this) as well?

If you want, sure.  I thought it was good enough if the thread
is accessible via the commitfest app.

Yours,
Laurenz Albe
From ecdce740586e33eeb394d47564b10f813896ff11 Mon Sep 17 00:00:00 2001
From: Laurenz Albe 
Date: Mon, 9 Jan 2023 16:38:58 +0100
Subject: [PATCH] Fix omission in partitioning limitation documentation

UNIQUE and PRIMARY KEY constraints can be created on ONLY the
partitioned table.  We already had an example demonstrating that,
but forgot to mention it in the documentation of the limits of
partitioning.

Author: Laurenz Albe
Discussion: https://postgr.es/m/167299368731.659.16130012959616771...@wrigleys.postgresql.org
---
 doc/src/sgml/ddl.sgml | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml
index 6e92bbddd2..b4a75f9c8f 100644
--- a/doc/src/sgml/ddl.sgml
+++ b/doc/src/sgml/ddl.sgml
@@ -4282,7 +4282,10 @@ ALTER INDEX measurement_city_id_logdate_key
 Using ONLY to add or drop a constraint on only
 the partitioned table is supported as long as there are no
 partitions.  Once partitions exist, using ONLY
-will result in an error.  Instead, constraints on the partitions
+will result in an error (the exception to this are
+UNIQUE and PRIMARY KEY
+constraints, which will be created with an invalid index, as shown in
+the example above).  Instead, constraints on the partitions
 themselves can be added and (if they are not present in the parent
 table) dropped.

-- 
2.39.0

Re: Set all variable-length fields of pg_attribute to null on column drop

2023-11-30 Thread Robert Haas

On Thu, Nov 30, 2023 at 6:24 AM Peter Eisentraut  wrote:
> I noticed that when a column is dropped, RemoveAttributeById() clears
> out certain fields in pg_attribute, but it leaves the variable-length
> fields at the end (attacl, attoptions, and attfdwoptions) unchanged.
> This is probably harmless, but it seems wasteful and unclean, and leaves
> potentially dangling data lying around (for example, attacl could
> contain references to users that are later also dropped).
>
> I suggest the attached patch to set those fields to null when a column
> is marked as dropped.

I haven't reviewed the patch, but +1 for the idea.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: pg_upgrade and logical replication

2023-11-30 Thread vignesh C

On Thu, 30 Nov 2023 at 06:37, Peter Smith  wrote:
>
> Here are some review comments for patch v20-0001
>
> ==
>
> 1. getSubscriptions
>
> + if (dopt->binary_upgrade && fout->remoteVersion >= 17)
> + appendPQExpBufferStr(query, " s.subenabled\n");
> + else
> + appendPQExpBufferStr(query, " false AS subenabled\n");
>
> Probably I misunderstood this logic... AFAIK the CREATE SUBSCRIPTION
> is normally default *enabled*, so why does this code set default
> differently as 'false'. OTOH, if this is some special case default
> needed because the subscription upgrade is not supported before PG17
> then maybe it needs a comment to explain.

No changes needed to be done in this case, explanation for the same is
given at [1]

> ~~~
>
> 2. dumpSubscription
>
> + if (strcmp(subinfo->subenabled, "t") == 0)
> + {
> + appendPQExpBufferStr(query,
> + "\n-- For binary upgrade, must preserve the subscriber's running state.\n");
> + appendPQExpBuffer(query, "ALTER SUBSCRIPTION %s ENABLE;\n", qsubname);
> + }
>
> (this is a bit similar to previous comment)
>
> Probably I misunderstood this logic... but AFAIK the CREATE
> SUBSCRIPTION is normally default *enabled*. In the CREATE SUBSCRIPTION
> top of this function I did not see any "enabled=xxx" code, so won't
> this just default to enabled=true per normal. In other words, what
> happens if the subscription being upgraded was already DISABLED -- How
> does it remain disabled still after upgrade?
>
> But I saw there is a test case for this so perhaps the code is fine?
> Maybe it just needs more explanatory comments for this area?

No changes needed to be done in this case, explanation for the same is
given at [1]

> ==
> src/bin/pg_upgrade/t/004_subscription.pl
>
> 3.
> +# The subscription's running status should be preserved
> +my $result =
> +  $new_sub->safe_psql('postgres',
> + "SELECT subenabled FROM pg_subscription WHERE subname = 'regress_sub'");
> +is($result, qq(f),
> + "check that the subscriber that was disable on the old subscriber
> should be disabled in the new subscriber"
> +);
> +$result =
> +  $new_sub->safe_psql('postgres',
> + "SELECT subenabled FROM pg_subscription WHERE subname = 'regress_sub1'");
> +is($result, qq(t),
> + "check that the subscriber that was enabled on the old subscriber
> should be enabled in the new subscriber"
> +);
> +$new_sub->safe_psql('postgres', "DROP SUBSCRIPTION regress_sub1");
> +
>
> BEFORE
> check that the subscriber that was disable on the old subscriber
> should be disabled in the new subscriber
>
> SUGGESTION
> check that a subscriber that was disabled on the old subscriber is
> disabled on the new subscriber
> ~
>
> BEFORE
> check that the subscriber that was enabled on the old subscriber
> should be enabled in the new subscriber
>
> SUGGESTION
> check that a subscriber that was enabled on the old subscriber is
> enabled on the new subscriber

These statements are combined now

> ~~~
>
> 4.
> +is($result, qq($remote_lsn), "remote_lsn should have been preserved");
> +
> +
> +# Check the number of rows for each table on each server
>
>
> Double blank lines.

Modified

> ~~~
>
> 5.
> +$old_sub->safe_psql('postgres', "ALTER SUBSCRIPTION regress_sub1 DISABLE");
> +$old_sub->safe_psql('postgres',
> + "ALTER SUBSCRIPTION regress_sub1 SET (slot_name = none)");
> +$old_sub->safe_psql('postgres', "DROP SUBSCRIPTION regress_sub1");
> +
>
> Probably it would be tidier to combine all of those.

Modified

The changes for the same is present in the v21 version patch attached at [2]

[1] - 
https://www.postgresql.org/message-id/CAA4eK1JpWkRBFMDC3wOCK%3DHzCXg8XT1jH-tWb%3Db%2B%2B_8YS2%3DQSQ%40mail.gmail.com
[2] - 
https://www.postgresql.org/message-id/CALDaNm37E4tmSZd%2Bk1ixtKevX3eucmhdOnw4pGmykZk4C1Nm4Q%40mail.gmail.com

Regards,
Vignesh

Re: pg_upgrade and logical replication

2023-11-30 Thread vignesh C

On Wed, 29 Nov 2023 at 15:02, Amit Kapila  wrote:
>
> On Tue, Nov 28, 2023 at 4:12 PM vignesh C  wrote:
> >
>
> Few comments on the latest patch:
> ===
> 1.
> + if (fout->remoteVersion >= 17)
> + appendPQExpBufferStr(query, " o.remote_lsn AS suboriginremotelsn,\n");
> + else
> + appendPQExpBufferStr(query, " NULL AS suboriginremotelsn,\n");
> +
> + if (dopt->binary_upgrade && fout->remoteVersion >= 17)
> + appendPQExpBufferStr(query, " s.subenabled\n");
> + else
> + appendPQExpBufferStr(query, " false AS subenabled\n");
> +
> + appendPQExpBufferStr(query,
> + "FROM pg_subscription s\n");
> +
> + if (fout->remoteVersion >= 17)
> + appendPQExpBufferStr(query,
> + "LEFT JOIN pg_catalog.pg_replication_origin_status o \n"
> + "ON o.external_id = 'pg_' || s.oid::text \n");
>
> Why 'subenabled' have a check for binary_upgrade but
> 'suboriginremotelsn' doesn't?

Combined these two now.

> 2.
> +Datum
> +binary_upgrade_add_sub_rel_state(PG_FUNCTION_ARGS)
> +{
> + Relation rel;
> + HeapTuple tup;
> + Oid subid;
> + Form_pg_subscription form;
> + char*subname;
> + Oid relid;
> + char relstate;
> + XLogRecPtr sublsn;
> +
> + CHECK_IS_BINARY_UPGRADE;
> +
> + /* We must check these things before dereferencing the arguments */
> + if (PG_ARGISNULL(0) || PG_ARGISNULL(1) || PG_ARGISNULL(2))
> + elog(ERROR, "null argument to binary_upgrade_add_sub_rel_state is
> not allowed");
> +
> + subname = text_to_cstring(PG_GETARG_TEXT_PP(0));
> + relid = PG_GETARG_OID(1);
> + relstate = PG_GETARG_CHAR(2);
> + sublsn = PG_ARGISNULL(3) ? InvalidXLogRecPtr : PG_GETARG_LSN(3);
> +
> + tup = SearchSysCache1(RELOID, ObjectIdGetDatum(relid));
> + if (!HeapTupleIsValid(tup))
> + ereport(ERROR,
> + errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> + errmsg("relation %u does not exist", relid));
> + ReleaseSysCache(tup);
> +
> + rel = table_open(SubscriptionRelationId, RowExclusiveLock);
>
> Why there is no locking for relation? I see that during subscription
> operation, we do acquire AccessShareLock on the relation before adding
> a corresponding entry in pg_subscription_rel. See the following code:
>
> CreateSubscription()
> {
> ...
> foreach(lc, tables)
> {
> RangeVar   *rv = (RangeVar *) lfirst(lc);
> Oid relid;
>
> relid = RangeVarGetRelid(rv, AccessShareLock, false);
>
> /* Check for supported relkind. */
> CheckSubscriptionRelkind(get_rel_relkind(relid),
> rv->schemaname, rv->relname);
>
> AddSubscriptionRelState(subid, relid, table_state,
> InvalidXLogRecPtr);
> ...
> }

Modified

> 3.
> +Datum
> +binary_upgrade_add_sub_rel_state(PG_FUNCTION_ARGS)
> {
> ...
> ...
> + AddSubscriptionRelState(subid, relid, relstate, sublsn);
> ...
> }
>
> I see a problem with directly using this function which is that it
> doesn't release locks which means it expects either the caller to
> release those locks or postpone to release them at the transaction
> end. However, all the other binary_upgrade support functions don't
> postpone releasing locks till the transaction ends. I think we should
> add an additional parameter to indicate whether we want to release
> locks and then pass it true from the binary upgrade support function.

Modified

> 4.
> extern void getPublicationTables(Archive *fout, TableInfo tblinfo[],
>   int numTables);
>  extern void getSubscriptions(Archive *fout);
> +extern void getSubscriptionTables(Archive *fout);
>
> getSubscriptions() and getSubscriptionTables() are defined in the
> opposite order in .c file. I think it is better to change the order in
> .c file unless there is a reason for not doing so.

Modified

> 5. At this stage, no need to update/send the 0002 patch, we can look
> at it after the main patch is committed. That is anyway not directly
> related to the main patch.

Removed it from this version.

> Apart from the above, I have modified a few comments and messages in
> the attached. Kindly review and include the changes if you are fine
> with those.

Merged them.

The attached v21 version patch has the change for the same.

Regards,
Vignesh
From 5fadbe8c9a54855026b1dd870f5d74ef4da20f38 Mon Sep 17 00:00:00 2001
From: Vignesh C 
Date: Mon, 30 Oct 2023 12:31:59 +0530
Subject: [PATCH v21] Preserve the full subscription's state during pg_upgrade

Previously, only the subscription metadata information was preserved.  Without
the list of relations and their state it's impossible to re-enable the
subscriptions without missing some records as the list of relations can only be
refreshed after enabling the subscription (and therefore starting the apply
worker).  Even if we added a way to refresh the subscription while enabling a
publication, we still wouldn't know which relations are new on the publication
side, and therefore should be fully synced, and which shouldn't.

To fix this problem, this patch teaches pg_dump to restore the content of
pg_subscription_rel from the old cluster by using
binary_upgrade_add_sub_rel_state SQL function. This is supported only
in binary upgrade mode.

Re: PostgreSql: Canceled on conflict out to old pivot

2023-11-30 Thread Wirch, Eduard

Thanks for the detailed answer, Heikki.

> > The longest transaction that could occur is 1 min long.
> I hate to drill on this, but are you very sure about that? A transaction
in a different database?

Don't be sorry for that, drilling down is important. ;) It took me so long
to reply because I had to prepare the information carefully. You're right,
on that day I observed the behavior, there were indeed long running
transactions in different DBs! My understanding of serializable isolation
is that only transactions which can somehow affect each other can conflict.
It should be clear for PostgreSql, that transactions belonging to different
databases cannot affect each other. Why do they cause serializable
conflicts?

If you want something visual, I prepared a SO question with similar content
like this mail, but added an image of the tx flow:
https://stackoverflow.com/questions/77544821/postgresql-canceled-on-conflict-out-to-old-pivot

Cheers,
Eduard


Am Di., 28. Nov. 2023 um 09:53 Uhr schrieb Heikki Linnakangas <
hlinn...@iki.fi>:

> On 28/11/2023 07:41, Wirch, Eduard wrote:
> > ERROR: could not serialize access due to read/write dependencies among
> > transactions
> >Detail: Reason code: Canceled on identification as a pivot, with
> > conflict out to old committed transaction 61866959.
> >
> > There is a variation of the error:
> >
> > PSQLException: ERROR: could not serialize access due to read/write
> > dependencies among transactions
> >Detail: Reason code: Canceled on conflict out to old pivot 61940806.
>
> Both of these errors are coming from CheckForSerializableConflictOut(),
> and are indeed variations of the same kind of conflict.
>
> > We're logging the id, begin and end of every transaction. Transaction
> > 61940806 was committed without errors. The transaction responsible for
> > the above error was started 40min later (and failed immediately). With
> > 61866959 it is even more extreme: the first conflict error occurred 2.5h
> > after 61866959 was committed.
>
> Weird indeed. There is only one caller of
> CheckForSerializableConflictOut(), and it does this:
>
> >   /*
> >* Find top level xid.  Bail out if xid is too early to be a
> conflict, or
> >* if it's our own xid.
> >*/
> >   if (TransactionIdEquals(xid, GetTopTransactionIdIfAny()))
> >   return;
> >   xid = SubTransGetTopmostTransaction(xid);
> >   if (TransactionIdPrecedes(xid, TransactionXmin))
> >   return;
> >
> >   CheckForSerializableConflictOut(relation, xid, snapshot);
>
> That check with TransactionXmin is very clear: if 'xid' precedes the
> xmin of the current transaction, IOW if there were no transactions with
> 'xid' or older running when the current transcaction started,
> CheckForSerializableConflictOut() is not called.
>
> > The DB table access pattern is too complex to lay out here. There are
> > like 20 tables that are read/written to. Transactions are usually short
> > living. The longest transaction that could occur is 1 min long. My
> > understanding of serializable isolation is that only overlapping
> > transactions can conflict. I can be pretty sure that in the above cases
> > there is no single transaction, which overlaps with 61940806 and with
> > the failing transaction 40 min later.
>
> I hate to drill on this, but are you very sure about that? I don't see
> how this could happen if there are no long-running transactions. Maybe a
> forgotten two-phase commit transaction? A transaction in a different
> database? A developer who did "begin;" in psql and went for lunch?
>
> > Such long running transactions
> > would cause different types of errors in our system ("out of shared
> > memory", "You might need to increase max_pred_locks_per_transaction").
>
> I don't see why that would necessarily be the case, unless it's
> something very specific to your application.
>
> > Why does PostgreSql detect a conflict with a transaction which was
> > committed more than 1h before? Can there be a long dependency chain
> > between many short running transactions? Does the high load prevent
> > Postgres from doing some clean up?
>
> The dependencies don't chain like that, but there is a system of
> "summarizing" old transactions to limit the shared memory usage. When a
> transaction has dependencies on other transactions, we track those
> dependencies in shared memory. But if we run short on the space reserved
> for that, we summarize the dependencies, losing granularity. We lose
> information of which relations/pages/tuples the xid accessed and which
> transactions exactly it had a dependency on. That is safe, but can cause
> false positives.
>
> The amount of shared memory reserved for tracking the dependencies is
> determined by max_pred_locks_per_transaction, so you could try
> increasing that to reduce those false positives, even if you never get
> the "out of shared memory" error.
>
> --
> Heikki Linnakangas
> Neon (https://neon.tech)
>
>

Re: Random pg_upgrade test failure on drongo

2023-11-30 Thread Alexander Lakhin


Hello Andrew and Kuroda-san,

27.11.2023 16:58, Andrew Dunstan wrote:

It's also interesting, what is full version/build of OS on drongo and
fairywren.



It's WS 2019 1809/17763.4252. The latest available AFAICT is 17763.5122



I've updated it to 17763.5122 now.



Thank you for the information! It had pushed me to upgrade my Server
2019 1809 17763.592 to 17763.4252. And then I discovered that I have
difficulties with reproducing the issue on all my VMs after reboot (even
on old versions/builds). It took me a while to understand what's going on
and what affects reproduction of the issue.
I was puzzled by the fact that I can't reproduce the issue with my
unlink-open test program under seemingly the same conditions as before,
until I realized that the issue reproduced only when the target directory
opened in Windows Explorer.

Now I'm sorry for bringing more mystery into the topic and for misleading
information.

So, the issue reproduced only when something scans the working directory
for files/opens them. I added the same logic into my test program (see
unlink-open-scandir attached) and now I see the failure on Windows Server
2019 (Version 10.0.17763.4252).
A script like this:
start cmd /c "unlink-open-scandir test1 10 5000 >log1 2>&1"
...
start cmd /c "unlink-open-scandir test10 10 5000 >log10 2>&1"

results in:
C:\temp>find "failed" log*
-- LOG1
-- LOG10
fopen() after unlink() failed (13)
-- LOG2
fopen() after unlink() failed (13)
-- LOG3
fopen() after unlink() failed (13)
-- LOG4
fopen() after unlink() failed (13)
-- LOG5
fopen() after unlink() failed (13)
-- LOG6
fopen() after unlink() failed (13)
-- LOG7
fopen() after unlink() failed (13)
-- LOG8
fopen() after unlink() failed (13)
-- LOG9
fopen() after unlink() failed (13)

C:\temp>type log10
...
iteration 108
fopen() after unlink() failed (13)

The same observed on:
Windows 10 Version 1809 (OS Build 17763.1)

But no failures on:
Windows 10 Version 22H2 (OS Build 19045.3693)
Windows 11 Version 21H2 (OS Build 22000.613)

So the behavior change really took place, but my previous estimations were
incorrect (my apologies).

BTW, "rename" mode of the test program can produce more rare errors on
rename:
-- LOG3
MoveFileEx() failed (0)

but not on open.

30.11.2023 13:00, Hayato Kuroda (Fujitsu) wrote:


Thanks for your interest for the issue. I have been tracking the failure but 
been not occurred.
Your analysis seems to solve BF failures, by updating OSes.


Yes, but I don't think that leaving Server 2019 behind (I suppose Windows
Server 2019 build 20348 would have the same behaviour as Windows 10 19045)
is affordable. (Though looking at Cirrus CI logs, I see that what is
entitled "Windows Server 2019" in fact is Windows Server 2022 there.)


I think that's because unlink() is performed asynchronously on those old
Windows versions, but rename() is always synchronous.
  
OK. Actually I could not find descriptions about them, but your experiment showed facts.


I don't know how this peculiarity is called, but it looks like when some
other process captures the file handle, unlink() exits as if the file was
deleted completely, but the subsequent open() fails.

Best regards,
Alexander#include 
#include 
#include 

char buf[1048576] = {0};

int main(int argc, char *argv[])
{
	int nmb;
	int res;
	int num_iterations;
	char *filename, *curfilename;
	char tmpfilename[MAX_PATH];
	FILE *fh;
	char do_rename;
	WIN32_FIND_DATA FindFileData;

	if (argc < 4)
		{ printf("Usage: %s {filename} {size_mb} {num_iterations}", argv[0]); return 1; }

	filename = argv[1];

	nmb = atoi(argv[2]);
	if (nmb < 1) nmb = 1;

	num_iterations = atoi(argv[3]);
	if (num_iterations < 1) num_iterations = 1;

	do_rename = (strcmp(argv[argc - 1], "rename") == 0);
	
	for (int i = 1; i <= num_iterations; i++) {
		printf("iteration %d\n", i);

		fh = fopen(filename, "wb");
		if (fh == NULL)
			{ printf("first fopen() failed (%d)\n", errno); return 1; }

		for (int n = 0; n <= nmb; n++) {
			if (fwrite(buf, 1, sizeof(buf), fh) != sizeof(buf))
{ printf("fwrite() failed (%d)\n", errno); return 1; }
		}

		res = fclose(fh);
		if (res < 0)
			{ printf("first fclose() failed (%d)\n", errno); return 1; }

		curfilename = filename;
		if (do_rename)
		{
			sprintf(tmpfilename, "%s.tmp", filename);
			res = MoveFileEx(filename, tmpfilename, MOVEFILE_REPLACE_EXISTING); // from dirmod.c
			if (res == 0)
{ printf("MoveFileEx() failed (%d)\n", errno); return 1; }
			curfilename = tmpfilename;
		}
			
		res = unlink(curfilename);
		if (res < 0)
			{ printf("first unlink() failed (%d)\n", errno); return 1; }

		fh = fopen(filename, "w");
		if (fh == NULL)
			{ printf("fopen() after unlink() failed (%d)\n", errno); return 2; }

		res = fclose(fh);
		if (res < 0)
			{ printf("fclose() failed (%d)\n", errno); return 1; }

		res = unlink(filename);
		if (res < 0)
			{ printf("unlink() failed (%d)\n", errno); return 1; }

Dumped SQL failed to execute with ERROR "GROUP BY position -1 is not in select list"

2023-11-30 Thread Haotian Chen

Hi hackers,



I found that dumped view SQL failed to execute due to the explicit cast

of negative number, and I took a look at the defined SQL in view and then

found -1 in the group by clause. I suppose it’s the main reason the sql

cannot be executed and raised ERROR "GROUP BY position -1 is not in select list"



postgres=# create view v1 as select * from t1 group by a,b,-1::int;

CREATE VIEW

postgres=# \d+ v1;

 View "public.v1"

 Column |  Type   | Collation | Nullable | Default | Storage | Description

+-+---+--+-+-+-

 a  | integer |   |  | | plain   |

 b  | integer |   |  | | plain   |

View definition:

 SELECT a,

b

   FROM t1

  GROUP BY a, b, (- 1);



After exploring codes, I suppose we should treat operator plus constant

as -'nnn'::typename instead of const, my patch just did this by handling

Opexpr especially, but I am not sure it’s the best way or not, BTW do you

guys have any suggestions and another approach?



--
Best Regards,
Haotian


v1-0001-fix-dump-view-fails-with-group-by-clause.patch
Description: v1-0001-fix-dump-view-fails-with-group-by-clause.patch

Re: proposal: possibility to read dumped table's name from file

2023-11-30 Thread Pavel Stehule

čt 30. 11. 2023 v 14:05 odesílatel Daniel Gustafsson 
napsal:

> > On 30 Nov 2023, at 07:13, Pavel Stehule  wrote:
> > čt 30. 11. 2023 v 4:40 odesílatel Tom Lane  t...@sss.pgh.pa.us>> napsal:
> > Daniel Gustafsson mailto:dan...@yesql.se>> writes:
> > > I took another look at this, found some more polish that was needed,
> added
> > > another testcase and ended up pushing it.
> >
> > mamba is unhappy because this uses  functions without
> > casting their arguments to unsigned char:
>
> Thanks for the heads-up.
>
> > here is a patch
>
> I agree with this fix, and have applied it.
>

Thank you

Pavel

>
> --
> Daniel Gustafsson
>
>

Re: Custom explain options

2023-11-30 Thread Konstantin Knizhnik




On 30/11/2023 5:59 am, Andrei Lepikhov wrote:

On 21/10/2023 19:16, Konstantin Knizhnik wrote:
EXPLAIN statement has a list of options (i.e. ANALYZE, BUFFERS, 
COST,...) which help to provide useful details of query execution.
In Neon we have added PREFETCH option which shows information about 
page prefetching during query execution (prefetching is more critical 
for Neon
architecture because of separation of compute and storage, so it is 
implemented not only for bitmap heap scan as in Vanilla Postgres, but 
also for seqscan, indexscan and indexonly scan). Another possible 
candidate  for explain options is local file cache (extra caching 
layer above shared buffers which is used to somehow replace file 
system cache in standalone Postgres).


I think that it will be nice to have a generic mechanism which allows 
extensions to add its own options to EXPLAIN.


Generally, I welcome this idea: Extensions can already do a lot of 
work, and they should have a tool to report their state, not only into 
the log.
But I think your approach needs to be elaborated. At first, it would 
be better to allow registering extended instruments for specific node 
types to avoid unneeded calls.
Secondly, looking into the Instrumentation usage, I don't see the 
reason to limit the size: as I see everywhere it exists locally or in 
the DSA where its size is calculated on the fly. So, by registering an 
extended instrument, we can reserve a slot for the extension. The 
actual size of underlying data can be provided by the extension routine.



Thank you for review.

I agree that support of extended instruments is desired. I just tried to 
minimize number of changes to make this patch smaller.


Concerning limiting instrumentation size, may be I missed something, but 
I do not see any goo way to handle this:


```

./src/backend/executor/nodeMemoize.c1106:        si = 
&node->shared_info->sinstrument[ParallelWorkerNumber];
./src/backend/executor/nodeAgg.c4322:        si = 
&node->shared_info->sinstrument[ParallelWorkerNumber];
./src/backend/executor/nodeIncrementalSort.c107: 
instrumentSortedGroup(&(node)->shared_info->sinfo[ParallelWorkerNumber].groupName##GroupInfo, 
\
./src/backend/executor/execParallel.c808: InstrInit(&instrument[i], 
estate->es_instrument);
./src/backend/executor/execParallel.c1052: 
InstrAggNode(planstate->instrument, &instrument[n]);
./src/backend/executor/execParallel.c1306: 
InstrAggNode(&instrument[ParallelWorkerNumber], planstate->instrument);
./src/backend/commands/explain.c1763:            Instrumentation 
*instrument = &w->instrument[n];
./src/backend/commands/explain.c2168:            Instrumentation 
*instrument = &w->instrument[n];

```

In all this cases we are using array of `Instrumentation` and if it 
contains varying part, then it is not clear where to place it.
Yes, there is also code which serialize and sends instrumentations 
between worker processes  and I have updated this code in my PR to send 
actual amount of custom instrumentation data. But it can not help with 
the cases above.

Re: Implement missing join selectivity estimation for range types


Hi!

Thank you for your work on the subject, I think it's a really useful 
feature and it allows optimizer to estimate more correctly clauses with 
such type of operator.


I rewieved your patch and noticed that some comments are repeated into 
multirangejoinsel functions, I suggest combining them.


The proposed changes are in the attached patch.


If this topic is about calculating selectivity, have you thought about 
adding cardinality calculation test for queries with this type of operator?


For example, you could form queries similar to those that you use in 
src/test/regress/sql/multirangetypes.sql and 
src/test/regress/sql/rangetypes.sql.


I added a few in the attached patch.

--
Regards,
Alena Rybakina
diff --git a/src/backend/utils/adt/multirangetypes_selfuncs.c 
b/src/backend/utils/adt/multirangetypes_selfuncs.c
index c670d225a0c..7708768b89f 100644
--- a/src/backend/utils/adt/multirangetypes_selfuncs.c
+++ b/src/backend/utils/adt/multirangetypes_selfuncs.c
@@ -1620,14 +1620,15 @@ multirangejoinsel(PG_FUNCTION_ARGS)

hist1_lower, nhist1);
break;
 
+   /*
+* Start by comparing lower bounds and if they are equal
+* compare upper bounds for comparison operators
+*/
case OID_MULTIRANGE_LESS_EQUAL_OP:
 
/*
 * A <= B
 *
-* Start by comparing lower bounds and if they 
are equal
-* compare upper bounds
-*
 * Negation of OID_RANGE_GREATER_OP.
 *
 * Overestimate by comparing only the lower 
bounds. Higher
@@ -1644,9 +1645,6 @@ multirangejoinsel(PG_FUNCTION_ARGS)
/*
 * A < B
 *
-* Start by comparing lower bounds and if they 
are equal
-* compare upper bounds
-*
 * Underestimate by comparing only the lower 
bounds. Higher
 * accuracy would require us to add P(lower1 = 
lower2) *
 * P(upper1 < upper2)
@@ -1661,9 +1659,6 @@ multirangejoinsel(PG_FUNCTION_ARGS)
/*
 * A >= B
 *
-* Start by comparing lower bounds and if they 
are equal
-* compare upper bounds
-*
 * Negation of OID_RANGE_LESS_OP.
 *
 * Overestimate by comparing only the lower 
bounds. Higher
@@ -1680,9 +1675,6 @@ multirangejoinsel(PG_FUNCTION_ARGS)
/*
 * A > B == B < A
 *
-* Start by comparing lower bounds and if they 
are equal
-* compare upper bounds
-*
 * Underestimate by comparing only the lower 
bounds. Higher
 * accuracy would require us to add P(lower1 = 
lower2) *
 * P(upper1 > upper2)
@@ -1773,18 +1765,16 @@ multirangejoinsel(PG_FUNCTION_ARGS)
case OID_MULTIRANGE_ADJACENT_MULTIRANGE_OP:
case OID_MULTIRANGE_ADJACENT_RANGE_OP:
case OID_RANGE_ADJACENT_MULTIRANGE_OP:
-
-   /*
-* just punt for now, estimation would require 
equality
-* selectivity for bounds
-*/
case OID_MULTIRANGE_CONTAINS_ELEM_OP:
case OID_MULTIRANGE_ELEM_CONTAINED_OP:
 
-   /*
-* just punt for now, estimation would require 
extraction of
-* histograms for the anyelement
-*/
+  /*
+   * just punt for now:
+   * if it is a type of adjucent operation 
estimation
+   * it will require equality selectivity for 
bounds;
+   * if it is one of type of contain operation
+   * it will extraction of histograms for the any 
element.
+   */
default:

Re: Postgres Partitions Limitations (5.11.2.3)

2023-11-30 Thread Ashutosh Bapat

On Fri, Oct 27, 2023 at 12:28 PM Laurenz Albe  wrote:
>
> On Mon, 2023-01-09 at 16:40 +0100, Laurenz Albe wrote:
> > > "Using ONLY to add or drop a constraint on only the partitioned table is
> > > supported as long as there are no partitions. Once partitions exist, using
> > > ONLY will result in an error. Instead, constraints on the partitions
> > > themselves can be added and (if they are not present in the parent table)
> > > dropped." This seems in contradiction to the example involving adding a
> > > unique constraint while minimizing locking at the bottom of "5.11.2.2.
> > > Partition Maintenance", which seems to run fine on my local Pg instance:
> > >
> > > This technique can be used with UNIQUE and PRIMARY KEY constraints too; 
> > > the
> > > indexes are created implicitly when the constraint is created. Example:
> >
> > No, that is actually an omission in the documentation.
> >
> > The attached patch tries to improve that.
>
> I am sending a reply to the hackers list, so that I can add the patch to the 
> commitfest.

May be attach the patch to hackers thread (this) as well?

-- 
Best Wishes,
Ashutosh Bapat

Re: pg_dump/nls.mk is missing a file

> On 30 Nov 2023, at 04:00, Kyotaro Horiguchi  wrote:

> Upon reviewing my translation, I discovered that filter.c was not
> included in the nls.mk of pg_dump.

Fixed.  I did leave the other headers in there since I don't feel comfortable
with changing that part in an otherwise unrelated thread (and maybe there are
more in the tree such that a bigger cleanup is possible?).

--
Daniel Gustafsson

Re: Extra periods in pg_dump messages

> On 30 Nov 2023, at 02:39, Kyotaro Horiguchi  wrote:

> In the bleeding-edge version of pg_dump, when a conditionspecifying an
> index, for example, is described in an object filter file, the
> following message is output. However, there is a period at the end of
> the line. Shouldn't this be removed?

Yes, ending with a period is for detail and hint messages. Fixed.

--
Daniel Gustafsson

Re: about help message for new pg_dump's --filter option

> On 30 Nov 2023, at 02:52, Kyotaro Horiguchi  wrote:
> 
> At Thu, 30 Nov 2023 10:20:40 +0900 (JST), Kyotaro Horiguchi 
>  wrote in 
>> Hello.
>> 
>> Recently, a new --filter option was added to pg_dump. I might be
>> wrong, but the syntax of the help message for this feels off. Is the
>> word 'on' not necessary after 'based'?
>> 
>>> --filter=FILENAMEinclude or exclude objects and data from dump
>>>  based expressions in FILENAME
> 
> Hmm. A similar message is spelled as "based on expression". Thus, the
> attached correct this message in this line.

Right, that's an unfortunate miss, it should've been "based on expression" like
you propose.  Fixed.

--
Daniel Gustafsson

Re: proposal: possibility to read dumped table's name from file

> On 30 Nov 2023, at 07:13, Pavel Stehule  wrote:
> čt 30. 11. 2023 v 4:40 odesílatel Tom Lane  > napsal:
> Daniel Gustafsson mailto:dan...@yesql.se>> writes:
> > I took another look at this, found some more polish that was needed, added
> > another testcase and ended up pushing it.
> 
> mamba is unhappy because this uses  functions without
> casting their arguments to unsigned char:

Thanks for the heads-up.

> here is a patch

I agree with this fix, and have applied it.

--
Daniel Gustafsson

Re: Synchronizing slots from primary to standby

2023-11-30 Thread Ajin Cherian

On Wed, Nov 29, 2023 at 8:17 PM Zhijie Hou (Fujitsu)
 wrote:
>
> This has been fixed.
>
> Best Regards,
> Hou zj

Thanks for addressing my comments. Some comments from my testing of patch v41

1. In my opinion, the second message "aborting the wait...moving to
the next slot" does not hold much value. There might not even be a
"next slot", there might be just one slot. I think the first LOG is
enough to indicate that the sync-slot is waiting as it repeats this
log till the slot catches up. I know these messages hold great value
for debugging but in production,  "waiting..", "aborting the wait.."
might not be as helpful, maybe change it to debug?

2023-11-30 05:13:49.811 EST [6115] LOG:  waiting for remote slot
"sub1" LSN (0/3047A90) and catalog xmin (745) to pass local slot LSN
(0/3047AC8) and catalog xmin (745)
2023-11-30 05:13:57.909 EST [6115] LOG:  aborting the wait for remote
slot "sub1" and moving to the next slot, will attempt creating it
again
2023-11-30 05:14:07.921 EST [6115] LOG:  waiting for remote slot
"sub1" LSN (0/3047A90) and catalog xmin (745) to pass local slot LSN
(0/3047AC8) and catalog xmin (745)

2. If a slot on the standby is in the "i" state as it hasn't been
synced and it was invalidated on the primary, should you continuously
retry creating this invalidated slot on the standby?

2023-11-30 06:21:41.844 EST [10563] LOG:  waiting for remote slot
"sub1" LSN (0/0) and catalog xmin (785) to pass local slot LSN
(0/EED9330) and catalog xmin (785)
2023-11-30 06:21:41.845 EST [10563] WARNING:  slot "sub1" invalidated
on the primary server, slot creation aborted
2023-11-30 06:21:51.892 EST [10563] LOG:  waiting for remote slot
"sub1" LSN (0/0) and catalog xmin (785) to pass local slot LSN
(0/EED9330) and catalog xmin (785)
2023-11-30 06:21:51.893 EST [10563] WARNING:  slot "sub1" invalidated
on the primary server, slot creation aborted

3. If creation of a slot on the standby fails for one slot because a
slot of the same name exists, then thereafter no new sync slots are
created on standby. Is this expected? I do see that previously created
slots are kept up to date, just that no new slots are created after
that.

regards,
Ajin Cherian
Fujitsu australia

Re: logical decoding and replication of sequences, take 2

On Thu, Nov 30, 2023 at 5:28 AM Tomas Vondra
 wrote:
>
> 3) "bad case" - small transactions that generate a lot of relfilenodes
>
>   select alter_sequence();
>
> where the function is defined like this (I did create 1000 sequences
> before the test):
>
>   CREATE OR REPLACE FUNCTION alter_sequence() RETURNS void AS $$
>   DECLARE
>   v INT;
>   BEGIN
>   v := 1 + (random() * 999)::int;
>   execute format('alter sequence s%s restart with 1000', v);
>   perform nextval('s');
>   END;
>   $$ LANGUAGE plpgsql;
>
> This performs terribly, but it's entirely unrelated to sequences.
> Current master has exactly the same problem, if transactions do DDL.
> Like this, for example:
>
>   CREATE OR REPLACE FUNCTION create_table() RETURNS void AS $$
>   DECLARE
>   v INT;
>   BEGIN
>   v := 1 + (random() * 999)::int;
>   execute format('create table t%s (a int)', v);
>   execute format('drop table t%s', v);
>   insert into t values (1);
>   END;
>   $$ LANGUAGE plpgsql;
>
> This has the same impact on master. The perf report shows this:
>
>   --98.06%--pg_logical_slot_get_changes_guts
>|
> --97.88%--LogicalDecodingProcessRecord
>  |
>  --97.56%--xact_decode
>   |
>--97.51%--DecodeCommit
> |
> |--91.92%--SnapBuildCommitTxn
> | |
> |  --91.65%--SnapBuildBuildSnapshot
> |   |
> |   --91.14%--pg_qsort
>
> The sequence decoding is maybe ~1%. The reason why SnapBuildSnapshot
> takes so long is because:
>
> -
>   Breakpoint 1, SnapBuildBuildSnapshot (builder=0x21f60f8)
>   at snapbuild.c:498
>   498+ sizeof(TransactionId) *   builder->committed.xcnt
>   (gdb) p builder->committed.xcnt
>   $4 = 11532
> -
>
> And with each iteration it grows by 1.
>

Can we somehow avoid this either by keeping DDL-related xacts open or
aborting them? Also, will it make any difference to use setval as
do_setval() seems to be logging each time?

If possible, can you share the scripts? Kuroda-San has access to the
performance machine, he may be able to try it as well.

-- 
With Regards,
Amit Kapila.

Re: [Proposal] global sequence implemented by snowflake ID

2023-11-30 Thread Tomas Vondra

On 11/30/23 11:56, Amit Kapila wrote:
> On Thu, Nov 30, 2023 at 6:48 AM Michael Paquier  wrote:
>>
>> On Tue, Nov 28, 2023 at 02:23:44PM +0530, Amit Kapila wrote:
>>> It is interesting to see you want to work towards globally distributed
>>> sequences. I think it would be important to discuss how and what we
>>> want to achieve with sequences w.r.t logical replication and or
>>> active-active configuration. There is a patch [1] for logical
>>> replication of sequences which will primarily achieve the failover
>>> case, i.e. if the publisher goes down and the subscriber takes over
>>> the role, one can re-direct connections to it. Now, if we have global
>>> sequences, one can imagine that even after failover the clients can
>>> still get unique values of sequences. It will be a bit more flexible
>>> to use global sequences, for example, we can use the sequence on both
>>> nodes at the same time which won't be possible with the replication of
>>> sequences as they will become inconsistent. Now, it is also possible
>>> that both serve different use cases and we need both functionalities
>>> but it would be better to have some discussion on the same.
>>>
>>> Thoughts?
>>>
>>> [1] - https://commitfest.postgresql.org/45/3823/
>>
>> Thanks for pointing this out.  I've read through the patch proposed by
>> Tomas and both are independent things IMO.  The logical decoding patch
>> relies on the SEQ_LOG records to find out which last_value/is_called
>> to transfer, which is something directly depending on the in-core
>> sequence implementation.  Sequence AMs are concepts that cover much
>> more ground, leaving it up to the implementor to do what they want
>> while hiding the activity with a RELKIND_SEQUENCE (generated columns
>> included).
>>
> 
> Right, I understand that implementation-wise and or concept-wise they
> are different. It is more about the use case, see below.
> 
>> To put it short, I have the impression that one and the other don't
>> really conflict, but just cover different ground.  However, I agree
>> that depending on the sequence AM implementation used in a cluster
>> (snowflake IDs guarantee unicity with their machine ID), replication
>> may not be necessary because the sequence implementation may be able
>> to ensure that no replication is required from the start.
>>

I certainly do agree solutions like UUID or SnowflakeID may be a better
choice for distributed systems (especially in active-active case),
because there's no internal state to replicate. That's what I'd use for
such systems, I think.

As for implementation/replication, I haven't checked the code, but I'd
imagine the AM should be able to decide whether something needs to be
replicated (and how) or not. So the traditional sequences would
replicate, and the alternative sequences would not replicate anything.

> 
> This was the key point that I wanted to discuss or hear opinions
> about. So, if we wish to have some sort of global sequences then it is
> not clear to me what benefits will we get by having replication of
> non-global sequences. One thing that comes to mind is replication
> covers a subset of use cases (like help in case of failover or
> switchover to subscriber) and till the time we have some
> implementation of global sequences, it can help users.
> 

What are you going to do about use cases like using logical replication
for upgrade to the next major version? Or applications that prefer (or
have to) use traditional sequences?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [PATCH] Native spinlock support on RISC-V

2023-11-30 Thread Christoph Berg

Re: Thomas Munro
> I randomly remembered this topic after seeing an s390x announcement
> from Christoph[1], and figured he or someone else might be interested
> in the same observation about that platform.  That is, we finally got
> around to defining this for ARM, but I bet one internet point that
> RISCV64 and s390x have that property too (but would need to find a
> written reference, or perhaps a similar declaration in the Linux
> sources, etc):

Fwiw, while s390x is admittedly more of historical interest, riscv64
is well on the way of becoming an official release architecture with
the next Debian stable release, so having it supported properly by
PostgreSQL would be appreciated.

Christoph

Re: Report planning memory in EXPLAIN ANALYZE

2023-11-30 Thread Alvaro Herrera

On 2023-Nov-30, Ashutosh Bapat wrote:

> Hi Alvaro,
> Thanks for the review and the edits. Sorry for replying late.
> 
> On Tue, Nov 21, 2023 at 11:56 PM Alvaro Herrera  
> wrote:
> >
> > I gave this a quick look.  I think the usefulness aspect is already
> > established in general terms; the bit I'm not so sure about is whether
> > we want it enabled by default.  For many cases it'd just be noise.
> > Perhaps we want it hidden behind something like "EXPLAIN (MEMORY)" or
> > such, particularly since things like "allocated" (which, per David,
> > seems to be the really useful metric) seems too much a PG-developer
> > value rather than an end-user value.
> 
> It is not default but hidden behind "SUMMARY".Do you still think it
> requires another sub-flag MEMORY?

Well, SUMMARY is enabled by default with ANALYZE, and I'd rather not
have planner memory consumption displayed by default with all EXPLAIN
ANALYZEs.  So yeah, I still think this deserves its own option.

But let's hear others' opinions on this point.  I'm only one vote here.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"All rings of power are equal,
But some rings of power are more equal than others."
 (George Orwell's The Lord of the Rings)

RE: Is this a problem in GenericXLogFinish()?

2023-11-30 Thread Hayato Kuroda (Fujitsu)

Dear Amit,

> I think it is acceptable especially if it increases code
> coverage. Can you once check that?

PSA the screen shot. I did "PROVE_TESTS="t/027*" make check" in 
src/test/recovery, and
generated a report.
Line 661 was not hit in the HEAD, but the screen showed that it was executed.

[1]: 
https://coverage.postgresql.org/src/backend/access/hash/hash_xlog.c.gcov.html#661

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

Set all variable-length fields of pg_attribute to null on column drop

2023-11-30 Thread Peter Eisentraut

I noticed that when a column is dropped, RemoveAttributeById() clears 
out certain fields in pg_attribute, but it leaves the variable-length 
fields at the end (attacl, attoptions, and attfdwoptions) unchanged. 
This is probably harmless, but it seems wasteful and unclean, and leaves 
potentially dangling data lying around (for example, attacl could 
contain references to users that are later also dropped).


I suggest the attached patch to set those fields to null when a column 
is marked as dropped.From d587588df6d2bea8ea9b9e13b897e4206b1a0884 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Thu, 30 Nov 2023 12:19:20 +0100
Subject: [PATCH v1] Set all variable-length fields of pg_attribute to null on
 column drop

When a column is dropped, the fields attacl, attoptions, and
attfdwoptions were kept unchanged.  This is probably harmless, but it
seems wasteful, and leaves potentially dangling data lying around (for
example, attacl could contain references to users that are later also
dropped).

Change this to set those fields to null when a column is marked as
dropped.
---
 src/backend/catalog/heap.c | 39 --
 1 file changed, 21 insertions(+), 18 deletions(-)

diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
index 7224d96695..b93894889d 100644
--- a/src/backend/catalog/heap.c
+++ b/src/backend/catalog/heap.c
@@ -1647,6 +1647,9 @@ RemoveAttributeById(Oid relid, AttrNumber attnum)
HeapTuple   tuple;
Form_pg_attribute attStruct;
charnewattname[NAMEDATALEN];
+   Datum   valuesAtt[Natts_pg_attribute] = {0};
+   boolnullsAtt[Natts_pg_attribute] = {0};
+   boolreplacesAtt[Natts_pg_attribute] = {0};
 
/*
 * Grab an exclusive lock on the target table, which we will NOT release
@@ -1695,24 +1698,24 @@ RemoveAttributeById(Oid relid, AttrNumber attnum)
 "pg.dropped.%d", attnum);
namestrcpy(&(attStruct->attname), newattname);
 
-   /* clear the missing value if any */
-   if (attStruct->atthasmissing)
-   {
-   Datum   valuesAtt[Natts_pg_attribute] = {0};
-   boolnullsAtt[Natts_pg_attribute] = {0};
-   boolreplacesAtt[Natts_pg_attribute] = {0};
-
-   /* update the tuple - set atthasmissing and attmissingval */
-   valuesAtt[Anum_pg_attribute_atthasmissing - 1] =
-   BoolGetDatum(false);
-   replacesAtt[Anum_pg_attribute_atthasmissing - 1] = true;
-   valuesAtt[Anum_pg_attribute_attmissingval - 1] = (Datum) 0;
-   nullsAtt[Anum_pg_attribute_attmissingval - 1] = true;
-   replacesAtt[Anum_pg_attribute_attmissingval - 1] = true;
-
-   tuple = heap_modify_tuple(tuple, RelationGetDescr(attr_rel),
- valuesAtt, 
nullsAtt, replacesAtt);
-   }
+   /* Clear the missing value */
+   attStruct->atthasmissing = false;
+   nullsAtt[Anum_pg_attribute_attmissingval - 1] = true;
+   replacesAtt[Anum_pg_attribute_attmissingval - 1] = true;
+
+   /*
+* Clear the other variable-length fields.  This saves some space in
+* pg_attribute and removes no longer useful information.
+*/
+   nullsAtt[Anum_pg_attribute_attacl - 1] = true;
+   replacesAtt[Anum_pg_attribute_attacl - 1] = true;
+   nullsAtt[Anum_pg_attribute_attoptions - 1] = true;
+   replacesAtt[Anum_pg_attribute_attoptions - 1] = true;
+   nullsAtt[Anum_pg_attribute_attfdwoptions - 1] = true;
+   replacesAtt[Anum_pg_attribute_attfdwoptions - 1] = true;
+
+   tuple = heap_modify_tuple(tuple, RelationGetDescr(attr_rel),
+ valuesAtt, nullsAtt, 
replacesAtt);
 
CatalogTupleUpdate(attr_rel, &tuple->t_self, tuple);
 
-- 
2.43.0

Re: Report planning memory in EXPLAIN ANALYZE

2023-11-30 Thread Ashutosh Bapat

Hi Alvaro,
Thanks for the review and the edits. Sorry for replying late.

On Tue, Nov 21, 2023 at 11:56 PM Alvaro Herrera  wrote:
>
> I gave this a quick look.  I think the usefulness aspect is already
> established in general terms; the bit I'm not so sure about is whether
> we want it enabled by default.  For many cases it'd just be noise.
> Perhaps we want it hidden behind something like "EXPLAIN (MEMORY)" or
> such, particularly since things like "allocated" (which, per David,
> seems to be the really useful metric) seems too much a PG-developer
> value rather than an end-user value.

It is not default but hidden behind "SUMMARY".Do you still think it
requires another sub-flag MEMORY?

>
> If EXPLAIN (MEMORY) is added, then probably auto_explain needs a
> corresponding flag, and doc updates.
>
> Some code looks to be in weird places.  Why is calc_mem_usage, which
> operates on MemoryContextCounters, in explain.c instead of mcxt.c?
> why is MemUsage in explain.h instead of memnodes.h?  I moved both.  I
> also renamed them, to MemoryContextSizeDifference() and MemoryUsage
> respectively; fixup patch attached.

That looks better. The functions are now available outside explain.

>
> I see no reason for this to be three separate patches anymore.

Squashed into one along with your changes.

>
> The EXPLAIN docs (explain.sgml) need an update to mention the new flag
> and the new output, too.

Updated section describing  "SUMMARY" flag.

PFA updated patch.

-- 
Best Wishes,
Ashutosh Bapat
From e38cde855a8aaaf77b9a01c948db7cde106993a6 Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat 
Date: Wed, 12 Jul 2023 14:34:14 +0530
Subject: [PATCH] EXPLAIN reports memory consumed for planning a query

The memory consumed is reported as "Planner Memory" property in EXPLAIN
output when any of the options ANALYZE or SUMMARY is specified.

A separate memory context is allocated for measuring memory consumption
to eliminate any effect of previous activity on the calculation.  The
memory context statistics is noted before and after calling
pg_plan_query() from ExplainOneQuery(). The difference in the two
statistics is used to calculate the memory consumption.

We report two values "used" and "allocated".  The difference between
memory statistics counters totalspace and freespace gives the memory
that remains palloc'ed at the end of planning. It is reported as memory
"used". This does not account for chunks of memory that were palloc'ed
but subsequently pfree'ed. But such memory remains allocated in the
memory context is given by the totalspace counter. The value of this
counter is reported as memory "allocated".

Ashutosh Bapat, reviewed by David Rowley and Alvaro Herrera
---
 doc/src/sgml/ref/explain.sgml |  8 +++-
 src/backend/commands/explain.c| 55 ++-
 src/backend/commands/prepare.c| 24 +++-
 src/backend/utils/mmgr/mcxt.c | 38 ++
 src/include/commands/explain.h|  3 +-
 src/include/nodes/memnodes.h  | 10 +
 src/include/utils/memutils.h  |  5 +++
 src/test/regress/expected/explain.out | 26 +++--
 src/tools/pgindent/typedefs.list  |  1 +
 9 files changed, 160 insertions(+), 10 deletions(-)

diff --git a/doc/src/sgml/ref/explain.sgml b/doc/src/sgml/ref/explain.sgml
index 5ba6486da1..ba33e9be2d 100644
--- a/doc/src/sgml/ref/explain.sgml
+++ b/doc/src/sgml/ref/explain.sgml
@@ -239,8 +239,12 @@ ROLLBACK;
 SUMMARY
 
  
-  Include summary information (e.g., totaled timing information) after the
-  query plan.  Summary information is included by default when
+  Include summary information after the query plan. The summary information
+  includes the totaled timing information and planner's memory consumption.
+  The memory consumption is reported as two values "used" and "allocated".
+  "used" indicates the net memory used by the planner. The "allocated"
+  indicates the memory allocated, which may slightly higher than the memory
+  used, by the planner. Summary information is included by default when
   ANALYZE is used but otherwise is not included by
   default, but can be enabled using this option.  Planning time in
   EXPLAIN EXECUTE includes the time required to fetch
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index f1d71bc54e..8c7f27b661 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -122,6 +122,8 @@ static const char *explain_get_index_name(Oid indexId);
 static void show_buffer_usage(ExplainState *es, const BufferUsage *usage,
 			  bool planning);
 static void show_wal_usage(ExplainState *es, const WalUsage *usage);
+static void show_planning_memory(ExplainState *es,
+ const MemoryUsage *usage);
 static void ExplainIndexScanDetails(Oid indexid, ScanDirection indexorderdir,
 	ExplainState *es);
 static void ExplainScanTarget(Scan *plan, ExplainState *es);
@@ -393,13

Re: Extend pgbench partitioning to pgbench_history

2023-11-30 Thread Gabriele Bartolini

Please discard the previous patch and use this one (it had a leftover
comment from an initial attempt to limit this to hash case).

Thanks,
Gabriele

On Thu, 30 Nov 2023 at 11:29, Gabriele Bartolini <
gabriele.bartol...@enterprisedb.com> wrote:

> Hi there,
>
> While benchmarking a new feature involving tablespace support in
> CloudNativePG (Kubernetes operator), I wanted to try out the partitioning
> feature of pgbench. I saw it supporting both range and hash partitioning,
> but limited to pgbench_accounts.
>
> With the attached patch, I extend the partitioning capability to the
> pgbench_history table too.
>
> I have been thinking of adding an option to control this, but I preferred
> to ask in this list whether it really makes sense or not (I struggle indeed
> to see use cases where accounts is partitioned and history is not).
>
> Please let me know what you think.
>
> Thanks,
> Gabriele
> --
> Gabriele Bartolini
> Vice President, Cloud Native at EDB
> enterprisedb.com
>


-- 
Gabriele Bartolini
Vice President, Cloud Native at EDB
enterprisedb.com


0001-Include-pgbench_history-in-partitioning-method-for-p.patch
Description: Binary data

Re: Is this a problem in GenericXLogFinish()?

2023-11-30 Thread Alexander Lakhin


Hello Kuroda-san,

30.11.2023 10:28, Hayato Kuroda (Fujitsu) wrote:

Again, good catch. Here is my analysis and fix patch.
I think it is sufficient to add an initialization for writebuf.


I agree with the change. It aligns hash_xlog_squeeze_page() with
hash_xlog_move_page_contents() in regard to initialization of writebuf.
Interestingly enough, the discrepancy exists since these functions
appeared with c11453ce0, and I can't see what justified adding the
initialization inside hash_xlog_move_page_contents() only.
So I think that if this doesn't affect performance, having aligned coding
(writebuf initialized in both functions) is better.


I want to add a test case for it as well. I've tested on my env and found that 
proposed
tuples seems sufficient.
(We must fill the primary bucket, so initial 500 is needed. Also, we must add
many dead pages to lead squeeze operation. Final page seems OK to smaller 
value.)

I compared the execution time before/after patching, and they are not so 
different
(1077 ms -> 1125 ms).

How do you think?


I can confirm that the test case proposed demonstrates what is expected
and the duration increase is tolerable, as to me.

Best regards,
Alexander

Re: POC, WIP: OR-clause support for indexes


On 30.11.2023 11:00, Alena Rybakina wrote:

Hi!




Honestly, it seems very hard to avoid the conclusion that this
transformation is being done at too early a stage. Parse analysis is
not the time to try to do query optimization. I can't really believe
that there's a way to produce a committable patch along these lines.
Ideally, a transformation like this should be done after we know what
plan shape we're using (or considering using), so that we can make
cost-based decisions about whether to transform or not. But at the
very least it should happen somewhere in the planner. There's really
no justification for parse analysis rewriting the SQL that the user
entered.


Here, we assume that array operation is generally better than many ORs.
As a result, it should be more effective to make OR->ANY 
transformation in the parser (it is a relatively lightweight 
operation here) and, as a second phase, decompose that in the optimizer.
We implemented earlier prototypes in different places of the 
optimizer, and I'm convinced that only this approach resolves the 
issues we found.

Does this approach look weird? Maybe. We can debate it in this thread.


I think this is incorrect, and the example of A. Korotkov confirms 
this. If we perform the conversion at the parsing stage, we will skip 
the more important conversion using OR expressions. I'll show you in 
the example below.


First of all, I will describe my idea to combine two approaches to 
obtaining plans with OR to ANY transformation and ANY to OR 
transformation. I think they are both good, and we can't work with 
just one of them, we should consider both the option of OR 
expressions, and with ANY.




Just in case, I have attached a patch or->any transformation where this 
happens at the index creation stage.


I get diff file during make check, but judging by the changes, it shows 
that the transformation is going well. I also attached it.



--
Regards,
Alena Rybakina
Postgres Professional
From 9f75b58525b4892db2c360401771f69599ed21c8 Mon Sep 17 00:00:00 2001
From: Alena Rybakina 
Date: Wed, 29 Nov 2023 18:58:55 +0300
Subject: [PATCH] Transform OR clause to ANY expressions.

Replace (X=N1) OR (X=N2) ... with X = ANY(N1, N2) on the preliminary stage of
optimization when we are still working with a tree expression.
Sometimes it can lead to not optimal plan. But we think it is better to have
array of elements instead of a lot of OR clauses. Here is a room for further
optimizations on decomposing that array into more optimal parts.

---
 src/backend/nodes/queryjumblefuncs.c  |  30 ++
 src/backend/optimizer/path/indxpath.c | 394 +-
 src/backend/utils/misc/guc_tables.c   |  10 +
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/include/nodes/queryjumble.h   |   1 +
 src/include/optimizer/paths.h |   1 +
 src/test/regress/expected/create_index.out| 120 ++
 src/test/regress/expected/guc.out |   3 +-
 src/test/regress/expected/join.out|  48 +++
 src/test/regress/expected/partition_prune.out | 179 
 src/test/regress/expected/sysviews.out|   3 +-
 src/test/regress/expected/tidscan.out |  17 +
 src/test/regress/sql/create_index.sql |  32 ++
 src/test/regress/sql/join.sql |  10 +
 src/test/regress/sql/partition_prune.sql  |  22 +
 src/test/regress/sql/tidscan.sql  |   6 +
 src/tools/pgindent/typedefs.list  |   1 +
 17 files changed, 875 insertions(+), 3 deletions(-)

diff --git a/src/backend/nodes/queryjumblefuncs.c b/src/backend/nodes/queryjumblefuncs.c
index 281907a4d83..c98fda73d17 100644
--- a/src/backend/nodes/queryjumblefuncs.c
+++ b/src/backend/nodes/queryjumblefuncs.c
@@ -283,6 +283,36 @@ _jumbleNode(JumbleState *jstate, Node *node)
 	}
 }
 
+JumbleState *
+JumbleExpr(Expr *expr, uint64 *queryId)
+{
+	JumbleState *jstate = NULL;
+
+	Assert(queryId != NULL);
+
+	jstate = (JumbleState *) palloc(sizeof(JumbleState));
+
+	/* Set up workspace for query jumbling */
+	jstate->jumble = (unsigned char *) palloc(JUMBLE_SIZE);
+	jstate->jumble_len = 0;
+	jstate->clocations_buf_size = 32;
+	jstate->clocations = (LocationLen *)
+		palloc(jstate->clocations_buf_size * sizeof(LocationLen));
+	jstate->clocations_count = 0;
+	jstate->highest_extern_param_id = 0;
+
+	/* Compute query ID */
+	_jumbleNode(jstate, (Node *) expr);
+	*queryId = DatumGetUInt64(hash_any_extended(jstate->jumble,
+jstate->jumble_len,
+0));
+
+	if (*queryId == UINT64CONST(0))
+		*queryId = UINT64CONST(1);
+
+	return jstate;
+}
+
 static void
 _jumbleList(JumbleState *jstate, Node *node)
 {
diff --git a/src/backend/optimizer/path/indxpath.c b/src/backend/optimizer/path/indxpath.c
index 03a5fbdc6dc..acf4937db01 100644
--- a/src/backend/optimizer/path/indxpath.c
+++ b/src/backend/optimizer/path/indxpath.c
@@ -34,7 +34,12 @@
 #include "optimizer/restrictinfo.h"
 #include "utils/lsyscache.h"
 #include "u

Re: GUC names in messages

2023-11-30 Thread Alvaro Herrera



> +/*
> + * Return whether the GUC name should be enclosed in double-quotes.
> + *
> + * Quoting is intended for names which could be mistaken for normal English
> + * words.  Quotes are only applied to GUC names that are written entirely 
> with
> + * lower-case alphabetical characters.
> + */
> +static bool
> +quotes_needed_for_GUC_name(const char *name)
> +{
> + for (const char *p = name; *p; p++)
> + {
> + if ('a' > *p || *p > 'z')
> + return false;
> + }
> +
> + return true;
> +}

I think you need a function that the name possibly quoted, in a way that
lets the translator handle the quoting:

  static char buffer[SOMEMAXLEN];

  quotes_needed = ...;

  if (quotes_needed)
 /* translator: a quoted configuration parameter name */
 snprintf(buffer, _("\"%s\""), name);
 return buffer
  else
 /* no translation needed in this case */
 return name;

then the calling code just does a single %s that prints the string
returned by this function.  (Do note that the function is not reentrant,
like pg_dump's fmtId.  Shouldn't be a problem ...)

> @@ -3621,8 +3673,8 @@ set_config_option_ext(const char *name, const char 
> *value,
>   {
>   if (changeVal && !makeDefault)
>   {
> - elog(DEBUG3, "\"%s\": setting ignored because previous 
> source is higher priority",
> -  name);
> + elog(DEBUG3, "%s%s%s: setting ignored because previous 
> source is higher priority",
> +  GUC_FORMAT(name));

Note that elog() doesn't do translation, and DEBUG doesn't really need
to worry too much about style anyway.  I'd leave these as-is.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"La primera ley de las demostraciones en vivo es: no trate de usar el sistema.
Escriba un guión que no toque nada para no causar daños." (Jakob Nielsen)

Re: [Proposal] global sequence implemented by snowflake ID

On Thu, Nov 30, 2023 at 6:48 AM Michael Paquier  wrote:
>
> On Tue, Nov 28, 2023 at 02:23:44PM +0530, Amit Kapila wrote:
> > It is interesting to see you want to work towards globally distributed
> > sequences. I think it would be important to discuss how and what we
> > want to achieve with sequences w.r.t logical replication and or
> > active-active configuration. There is a patch [1] for logical
> > replication of sequences which will primarily achieve the failover
> > case, i.e. if the publisher goes down and the subscriber takes over
> > the role, one can re-direct connections to it. Now, if we have global
> > sequences, one can imagine that even after failover the clients can
> > still get unique values of sequences. It will be a bit more flexible
> > to use global sequences, for example, we can use the sequence on both
> > nodes at the same time which won't be possible with the replication of
> > sequences as they will become inconsistent. Now, it is also possible
> > that both serve different use cases and we need both functionalities
> > but it would be better to have some discussion on the same.
> >
> > Thoughts?
> >
> > [1] - https://commitfest.postgresql.org/45/3823/
>
> Thanks for pointing this out.  I've read through the patch proposed by
> Tomas and both are independent things IMO.  The logical decoding patch
> relies on the SEQ_LOG records to find out which last_value/is_called
> to transfer, which is something directly depending on the in-core
> sequence implementation.  Sequence AMs are concepts that cover much
> more ground, leaving it up to the implementor to do what they want
> while hiding the activity with a RELKIND_SEQUENCE (generated columns
> included).
>

Right, I understand that implementation-wise and or concept-wise they
are different. It is more about the use case, see below.

> To put it short, I have the impression that one and the other don't
> really conflict, but just cover different ground.  However, I agree
> that depending on the sequence AM implementation used in a cluster
> (snowflake IDs guarantee unicity with their machine ID), replication
> may not be necessary because the sequence implementation may be able
> to ensure that no replication is required from the start.
>

This was the key point that I wanted to discuss or hear opinions
about. So, if we wish to have some sort of global sequences then it is
not clear to me what benefits will we get by having replication of
non-global sequences. One thing that comes to mind is replication
covers a subset of use cases (like help in case of failover or
switchover to subscriber) and till the time we have some
implementation of global sequences, it can help users.

-- 
With Regards,
Amit Kapila.

Re: Various small doc improvements; plpgsql, schemas, permissions, oidvector

2023-11-30 Thread Shubham Khanna

On Thu, Nov 30, 2023 at 3:59 PM David G. Johnston
 wrote:
>
> Extending my prior email which is now redundant.
>
> On Tue, Oct 3, 2023 at 7:00 PM David G. Johnston  
> wrote:
>>
>> On Tue, Oct 3, 2023 at 4:15 PM Karl O. Pinc  wrote:
>>>
>>> On Tue, 3 Oct 2023 14:51:31 -0700
>>> "David G. Johnston"  wrote:
>>>
>>> Isn't the entire section about "deviating from the normal flow of the
>>> code"?  That's what makes me want "Exception" in the section title.
>>
>>
>> It is about how error handling in a procedure diverts the flow from the 
>> normal code path to some other code path - what that path is labelled is 
>> less important than the thing that causes the diversion - an error.
>>
>>>
>>> ?  I remain (overly?) focused on the word "exception", since that's
>>> whats in the brain of the user that's writing RAISE EXCEPTION.
>>> It matters if exceptions and errors are different.  If they're
>>> not, then it also matters since it's exceptions that the user's
>>> code raises.
>>>
>>
>> It's unfortunate the keyword to raise the message level "ERROR" is 
>> "EXCEPTION" in that command but I'd rather simply handle that one anomaly 
>> that make the rest of the system use the word exception, especially seem to 
>> be fairly consistent in our usage of ERROR already.  I'm sympathetic that 
>> other systems out there also encourage the usage of exception in this 
>> context instead of error - but not to the point of opening up this 
>> long-standing decision for rework.
>>
>>>
>>> Have you any thoughts on the "permissions", "privleges" and
>>> "attributes" vocabulary/concepts used in this area?
>>
>>
>> I think we benefit from being able to equate permissions and privileges and 
>> trying to separate them is going to be more harmful than helpful.  The 
>> limited things that role attributes permit, and how they fall outside the 
>> privilege/permission concept as we use it, isn't something that I've noticed 
>> is a problem that needs addressing.
>>
>>
>>>  (I'm slightly
>>> nervous about the renumbering making the thread hard to follow.)
>>>
>>
>> 0009 - Something just seems off with this one.  Unless there are more places 
>> with this type in use I would just move the relevant notes (i.e., the one in 
>> proallargtypes) to that column and be done with it.  If there are multiple 
>> places then moving the notes to the main docs and cross-referencing to them 
>> seems warranted.  I also wouldn't call it legacy.
>>
>> 0010 -
>>
>> When creating new objects, if a schema qualification is not given with the 
>> name the first extant entry in the search_path is chosen; then an error will 
>> be raised should the supplied name already exist in that schema.
>> In contexts where the object must already exist, but its name is not schema 
>> qualified, the extant search_path schemas will be consulted serially until 
>> one of them contains an appropriate object, returning it, or all schemas are 
>> consulted, resulting in an object not found error.
>>
>> I'm not seeing much value in presenting the additional user/public details 
>> here.  Especially as it would then seem appropriate to include pg_temp.  And 
>> now we have to deal with the fact that by default the public schema isn't so 
>> public anymore.
>>
>
> 0011 - (first pass, going from memory, might have missed some needed details)
>
> Aside from non-atomic SQL routine bodies (functions and procedures) the 
> result of the server executing SQL sent by the connected client does not 
> result in raw SQL, or textual expressions, being stored for later evaluation. 
>  All objects are identified (or created) during execution and their effects 
> stored within the system catalogs and assigned system identifiers (oids) to 
> provide an absolute and immutable reference to be used while establishing 
> inter-object dependencies.  In short, indirect actions taken by the server, 
> based upon stored knowledge, can and often will execute while in a 
> search_path that only contains the pg_catalog schema so that the stored 
> knowledge can be found.
>
> For routines written in any language except Atomic SQL the textual body of 
> the routine is stored as-is within the database.  When executing such a 
> routine the (parent) session basically opens up a new connection to the 
> server (one per routine) and within that new sub-session sends the SQL 
> contained within the routine to the server for execution just like any other 
> client, and therefore any object references present in that SQL need to be 
> resolved to a schema as previously discussed.  By default, upon connecting, 
> the newly created session is updated so that its settings take on the current 
> values in the parent session.  When authoring a routine this is often 
> undesirable as the behavior of the routine now depends upon an environment 
> that is not definitively known to the routine author.  Schema-qualifying 
> object references within the routine body is one tool to remove such 
> uncertainty.  Another is by

Extend pgbench partitioning to pgbench_history

2023-11-30 Thread Gabriele Bartolini

Hi there,

While benchmarking a new feature involving tablespace support in
CloudNativePG (Kubernetes operator), I wanted to try out the partitioning
feature of pgbench. I saw it supporting both range and hash partitioning,
but limited to pgbench_accounts.

With the attached patch, I extend the partitioning capability to the
pgbench_history table too.

I have been thinking of adding an option to control this, but I preferred
to ask in this list whether it really makes sense or not (I struggle indeed
to see use cases where accounts is partitioned and history is not).

Please let me know what you think.

Thanks,
Gabriele
-- 
Gabriele Bartolini
Vice President, Cloud Native at EDB
enterprisedb.com


0001-Include-pgbench_history-in-partitioning-method-for-p.patch
Description: Binary data

Re: Is this a problem in GenericXLogFinish()?

On Thu, Nov 30, 2023 at 12:58 PM Hayato Kuroda (Fujitsu)
 wrote:
>
> >
> > Good catch, thank you for reporting! I will investigate more about it and 
> > post my
> > analysis.
> >
>
> Again, good catch. Here is my analysis and fix patch.
> I think it is sufficient to add an initialization for writebuf.
>
> In the reported case, neither is_prim_bucket_same_wrt nor 
> is_prev_bucket_same_wrt
> is set in the WAL record, and ntups is also zero. This means that the wbuf is 
> not
> written in the WAL record on primary side (see [1]).
> Also, there are no reasons to read (and lock) other buffer page because we do
> not modify it. Based on them, I think that just initializing as InvalidBuffer 
> is sufficient.
>

Agreed.

>
> I want to add a test case for it as well. I've tested on my env and found 
> that proposed
> tuples seems sufficient.
> (We must fill the primary bucket, so initial 500 is needed. Also, we must add
> many dead pages to lead squeeze operation. Final page seems OK to smaller 
> value.)
>
> I compared the execution time before/after patching, and they are not so 
> different
> (1077 ms -> 1125 ms).
>

In my environment, it increased from 375ms to 385ms (median of five
runs). I think it is acceptable especially if it increases code
coverage. Can you once check that?

-- 
With Regards,
Amit Kapila.

Re: [PGDOCS] Inconsistent linkends to "monitoring" views.

2023-11-30 Thread John Naylor

On Wed, Nov 8, 2023 at 2:02 PM John Naylor  wrote:

> My 2 cents: Comment typos are visible to readers, so more annoying
> when seen in isolation, and less likely to have surroundings that
> could change in back branches. Consistency would preferred all else
> being equal, but then again nothing is wrong with the existing links.
> In any case, no one has come out in favor of the patch, so it seems
> like it should be rejected unless that changes.

This is done.

RE: Random pg_upgrade test failure on drongo

2023-11-30 Thread Hayato Kuroda (Fujitsu)

Dear Alexander, Andrew,
 
Thanks for your analysis!
 
> I see that behavior on:
> Windows 10 Version 1607 (OS Build 14393.0)
> Windows Server 2016 Version 1607 (OS Build 14393.0)
> Windows Server 2019 Version 1809 (OS Build 17763.1)
>
> But it's not reproduced on:
> Windows 10 Version 1809 (OS Build 17763.1) (triple-checked)
> Windows Server 2019 Version 1809 (OS Build 17763.592)
> Windows 10 Version 22H2 (OS Build 19045.3693)
> Windows 11 Version 21H2 (OS Build 22000.613)
>
> So it looks like the failure occurs depending not on Windows edition, but
> rather on it's build. For Windows Server 2019 the "good" build is
> somewhere between 17763.1 and 17763.592, but for Windows 10 it's between
> 14393.0 and 17763.1.
> (Maybe there was some change related to
> FILE_DISPOSITION_POSIX_SEMANTICS/
> FILE_DISPOSITION_ON_CLOSE implementation; I don't know where to find
> information about that change.)
>
> It's also interesting, what is full version/build of OS on drongo and
> fairywren.
 
Thanks for your interest for the issue. I have been tracking the failure but 
been not occurred.
Your analysis seems to solve BF failures, by updating OSes.
 
> I think that's because unlink() is performed asynchronously on those old
> Windows versions, but rename() is always synchronous.
 
OK. Actually I could not find descriptions about them, but your experiment 
showed facts.
 
> I've managed to reproduce that issue (or at least a situation that
> manifested similarly) with a sleep added in miscinit.c:
>  ereport(IsPostmasterEnvironment ? LOG : NOTICE,
> (errmsg("database system is shut down")));
> +   pg_usleep(50L);
>
> With this change, I get the same warning as in [1] when running in
> parallel 10 tests 002_pg_upgrade with a minimal olddump (on iterations
> 33, 46, 8). And with my PoC patch applied, I could see the same warning
> as well (on iteration 6).
>
> I believe that's because rename() can't rename a directory containing an
> open file, just as unlink() can't remove it.
>
> In the light of the above, I think that the issue in question should be
> fixed in accordance with/as a supplement to [2].
 
OK, I understood that we need to fix more around here. For now, we should focus 
our points.
 
Your patch seems good, but it needs more sight from windows-friendly developers.
How do other think?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

Re: [PoC] pg_upgrade: allow to upgrade publisher node

On Wed, Nov 29, 2023 at 7:33 AM Hayato Kuroda (Fujitsu)
 wrote:
>
> > > Actually, we do not expect that it won't input NULL. IIUC all of slots 
> > > have
> > > slot_name, and subquery uses its name. But will it be kept forever? I 
> > > think we
> > > can avoid any risk.
> > >
> > > > I've not tested it yet but even if it returns NULL, perhaps
> > > > get_old_cluster_logical_slot_infos() would still set curr->caught_up
> > > > to false, no?
> > >
> > > Hmm. I checked the C99 specification [1] of strcmp, but it does not 
> > > define the
> > > case when the NULL is input. So it depends implementation.
> >
> > I think PQgetvalue() returns an empty string if the result value is null.
> >
>
> Oh, you are right... I found below paragraph from [1].
>
> > An empty string is returned if the field value is null. See PQgetisnull to 
> > distinguish
> > null values from empty-string values.
>
> So I agree what you said - current code can accept NULL.
> But still not sure the error message is really good or not.
> If we regard an empty string as false, the slot which has empty name will be 
> reported like:
> "The slot \"\" has not consumed the WAL yet" in 
> check_old_cluster_for_valid_slots().
> Isn't it inappropriate?
>

I see your point that giving a better message (which would tell the
actual problem) to the user in this case also has a value. OTOH, as
you said, this case won't happen in practical scenarios, so I am fine
either way with a slight tilt toward retaining a better error message
(aka the current way). Sawada-San/Bharath, do you have any suggestions
on this?

-- 
With Regards,
Amit Kapila.

Proposal: Filter irrelevant change before reassemble transactions during logical decoding

2023-11-30 Thread li jie

Hi,

During logical decoding, if there is a large write transaction, some
spill files will be written to disk,
depending on the setting of max_changes_in_memory.

This behavior can effectively avoid OOM, but if the transaction
generates a lot of change before commit,
a large number of files may fill the disk. For example, you can update
a TB-level table.
Of course,   this is also inevitable.

But I found an inelegant phenomenon. If the updated large table is not
published, its changes will also
be written with a large number of spill files. Look at an example below:

publisher:
```
create table tbl_pub(id int, val1 text, val2 text,val3 text);
create table tbl_t1(id int, val1 text, val2 text,val3 text);
CREATE PUBLICATION mypub FOR TABLE public.tbl_pub;
```

subscriber:
```
create table tbl_pub(id int, val1 text, val2 text,val3 text);
create table tbl_t1(id int, val1 text, val2 text,val3 text);
CREATE SUBSCRIPTION mysub CONNECTION 'host=127.0.0.1 port=5432
user=postgres dbname=postgres' PUBLICATION mypub;
```

publisher:
```
begin;
insert into tbl_t1 select i,repeat('xyzzy', i),repeat('abcba',
i),repeat('dfds', i) from generate_series(0,99) i;
```

Later you will see a large number of spill files in the
"/$PGDATA/pg_replslot/mysub/" directory.
```
$ll -sh
total 4.5G
4.0K -rw--- 1 postgres postgres 200 Nov 30 09:24 state
17M -rw--- 1 postgres postgres 17M Nov 30 08:22 xid-750-lsn-0-1000.spill
12M -rw--- 1 postgres postgres 12M Nov 30 08:20 xid-750-lsn-0-100.spill
17M -rw--- 1 postgres postgres 17M Nov 30 08:23 xid-750-lsn-0-1100.spill
..
```

We can see that table tbl_t1 is not published in mypub. It is also not
sent downstream because it is subscribed.
After the transaction is reorganized, the pgoutput decoding plug-in
filters out these change  of unpublished relation
when sending logical changes. see function pgoutput_change.

Above all, if after constructing a change and before queuing a change
into a transaction, we filter out unpublished
relation-related changes, will it make logical decoding less laborious
and avoid disk growth as much as possible?


This is just an immature idea. I haven't started to implement it yet.
Maybe it was designed this way because there
are key factors that I didn't consider. So I want to hear everyone's
opinions, especially the designers of logic decoding.

Re: XID formatting and SLRU refactorings (was: Add 64-bit XIDs into PostgreSQL 15)

2023-11-30 Thread Alexander Korotkov

On Thu, Nov 30, 2023 at 10:29 AM Pavel Borisov  wrote:
> Agree. The fix is attached.

What an oversight.
Thank you, pushed!

--
Regards,
Alexander Korotkov

Re: POC, WIP: OR-clause support for indexes


On 30.11.2023 11:30, Andrei Lepikhov wrote:

On 30/11/2023 15:00, Alena Rybakina wrote:
2. The second patch is my patch version when I moved the OR 
transformation in the s index formation stage:


So, I got the best query plan despite the possible OR to ANY 
transformation:


If the user uses a clause like "x IN (1,2) AND y=100", it will break 
your 'good' solution.


No, unfortunately I still see the plan with Seq scan node:

postgres=# explain analyze select * from test where x in (1,2) and y = 100;

 QUERY PLAN

 Gather  (cost=1000.00..12690.10 rows=1 width=12) (actual 
time=72.985..74.832 rows=0 loops=1)

   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on test  (cost=0.00..11690.00 rows=1 width=12) 
(actual time=68.573..68.573 rows=0 loops=3)
 Filter: ((x = ANY ('{1,2}'::integer[])) AND (y = '100'::double 
precision))

 Rows Removed by Filter: 33
 Planning Time: 0.264 ms
 Execution Time: 74.887 ms

(8 rows)

In my opinion, the general approach here is to stay with OR->ANY 
transformation at the parsing stage and invent one more way for 
picking an index by looking into the array and attempting to find a 
compound index.
Having a shorter list of expressions, where uniform ORs are grouped 
into arrays, the optimizer will do such work with less overhead. 


Looking at the current index generation code, implementing this approach 
will require a lot of refactoring so that functions starting with 
get_indexes do not rely on the current baserestrictinfo, but use only 
the indexrestrictinfo, which is a copy of baserestrictinfo. And I think, 
potentially, there may be complexity also with the equivalences that we 
can get from OR expressions. All interesting transformations are 
available only for OR expressions, not for ANY, that is, it makes sense 
to try the last chance to find a suitable plan with the available OR 
expressions and if that plan turns out to be better, use it.



--
Regards,
Alena Rybakina
Postgres Professional

Re: [dynahash] do not refill the hashkey after hash_search

2023-11-30 Thread John Naylor

On Thu, Sep 14, 2023 at 3:28 PM Junwang Zhao  wrote:
>
> Add a v2 with some change to fix warnings about unused-parameter.
>
> I will add this to Commit Fest.

Pushed v2 after removing asserts, as well as the unnecessary cast that
I complained about earlier.

Some advice: I was added as a reviewer in CF without my knowledge. I
see people doing it sometimes, but I don't recommend that, since the
reviewer field in CF implies the person volunteered to continue giving
feedback for patch updates.

RE: [Proposal] global sequence implemented by snowflake ID

2023-11-30 Thread Hayato Kuroda (Fujitsu)

Dear Nikita,

Thanks for reading my patch!

>
I have reviewed the patch in this topic and have a question mentioning the 
machine ID -
INSERT INTO snowflake_sequence.machine_id
SELECT round((random() * (0 - 511))::numeric, 0) + 511;

This kind of ID generation does not seem to guarantee from not having the same 
ID in a pool
of instances, does it?
>

You are right. For now the part is randomly assigned, but it might be 
duplicated on another instance.
Maybe we should provide a way for setting it manually. Or, we may able to use 
another way
for determining machine ID.
(system_identifier is too long to use here...)

Note that the contrib module was provided just for the reference. We are now
discussing high-level items, like needs, use-cases and approaches. Could you
please your opinion around here if you have?

The implementation may be completely changed, so I did not change yet. Of 
course,
your comment is quite helpful so that it will be handled eventually.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

Re: POC, WIP: OR-clause support for indexes

2023-11-30 Thread Andrei Lepikhov


On 30/11/2023 15:00, Alena Rybakina wrote:
2. The second patch is my patch version when I moved the OR 
transformation in the s index formation stage:


So, I got the best query plan despite the possible OR to ANY 
transformation:


If the user uses a clause like "x IN (1,2) AND y=100", it will break 
your 'good' solution.
In my opinion, the general approach here is to stay with OR->ANY 
transformation at the parsing stage and invent one more way for picking 
an index by looking into the array and attempting to find a compound index.
Having a shorter list of expressions, where uniform ORs are grouped into 
arrays, the optimizer will do such work with less overhead.


--
regards,
Andrei Lepikhov
Postgres Professional

Re: XID formatting and SLRU refactorings (was: Add 64-bit XIDs into PostgreSQL 15)

2023-11-30 Thread Pavel Borisov

On Thu, 30 Nov 2023 at 08:03, Tom Lane  wrote:

> Alexander Lakhin  writes:
> > And a warning:
> > $ CC=gcc-12 CFLAGS="-Wall -Wextra -Wno-unused-parameter
> -Wno-sign-compare -Wno-clobbered
> > -Wno-missing-field-initializers" ./configure -q && make -s
> > slru.c:63:1: warning: ‘inline’ is not at beginning of declaration
> [-Wold-style-declaration]
> > 63 | static int  inline
> >| ^~
>
> > Maybe it's worth fixing before committing...
>
> This should have been fixed before commit, because there are now a
> dozen buildfarm animals complaining about it, as well as who-knows-
> how-many developers' compilers.
>
>  calliphoridae | 2023-11-30 02:48:59 |
> /home/bf/bf-build/calliphoridae/HEAD/pgsql.build/../pgsql/src/backend/access/transam/slru.c:63:1:
> warning: 'inline' is not at beginning of declaration
> [-Wold-style-declaration]
>  canebrake | 2023-11-29 14:22:10 |
> /home/bf/bf-build/canebrake/HEAD/pgsql.build/../pgsql/src/backend/access/transam/slru.c:63:1:
> warning: 'inline' is not at beginning of declaration
> [-Wold-style-declaration]
>  culicidae | 2023-11-30 02:49:06 |
> /home/bf/bf-build/culicidae/HEAD/pgsql.build/../pgsql/src/backend/access/transam/slru.c:63:1:
> warning: 'inline' is not at beginning of declaration
> [-Wold-style-declaration]
>  desmoxytes| 2023-11-30 03:11:15 |
> /home/bf/bf-build/desmoxytes/HEAD/pgsql.build/../pgsql/src/backend/access/transam/slru.c:63:1:
> warning: 'inline' is not at beginning of declaration
> [-Wold-style-declaration]
>  flaviventris  | 2023-11-30 02:53:19 |
> /home/bf/bf-build/flaviventris/HEAD/pgsql.build/../pgsql/src/backend/access/transam/slru.c:63:1:
> warning: 'inline' is not at beginning of declaration
> [-Wold-style-declaration]
>  francolin | 2023-11-30 02:26:08 |
> ../pgsql/src/backend/access/transam/slru.c:63:1: warning: 'inline' is not
> at beginning of declaration [-Wold-style-declaration]
>  grassquit | 2023-11-30 02:58:36 |
> /home/bf/bf-build/grassquit/HEAD/pgsql.build/../pgsql/src/backend/access/transam/slru.c:63:1:
> warning: 'inline' is not at beginning of declaration
> [-Wold-style-declaration]
>  komodoensis   | 2023-11-30 03:07:52 |
> /home/bf/bf-build/komodoensis/HEAD/pgsql.build/../pgsql/src/backend/access/transam/slru.c:63:1:
> warning: 'inline' is not at beginning of declaration
> [-Wold-style-declaration]
>  phycodurus| 2023-11-29 14:29:02 |
> /home/bf/bf-build/phycodurus/HEAD/pgsql.build/../pgsql/src/backend/access/transam/slru.c:63:1:
> warning: 'inline' is not at beginning of declaration
> [-Wold-style-declaration]
>  piculet   | 2023-11-30 02:32:57 |
> ../pgsql/src/backend/access/transam/slru.c:63:1: warning: 'inline' is not
> at beginning of declaration [-Wold-style-declaration]
>  pogona| 2023-11-29 14:22:31 |
> /home/bf/bf-build/pogona/HEAD/pgsql.build/../pgsql/src/backend/access/transam/slru.c:63:1:
> warning: 'inline' is not at beginning of declaration
> [-Wold-style-declaration]
>  rorqual   | 2023-11-30 02:32:41 |
> ../pgsql/src/backend/access/transam/slru.c:63:1: warning: 'inline' is not
> at beginning of declaration [-Wold-style-declaration]
>  serinus   | 2023-11-30 02:47:05 |
> /home/bf/bf-build/serinus/HEAD/pgsql.build/../pgsql/src/backend/access/transam/slru.c:63:1:
> warning: 'inline' is not at beginning of declaration
> [-Wold-style-declaration]
>  skink | 2023-11-29 14:23:05 |
> /home/bf/bf-build/skink-master/HEAD/pgsql.build/../pgsql/src/backend/access/transam/slru.c:63:1:
> warning: 'inline' is not at beginning of declaration
> [-Wold-style-declaration]
>  taipan| 2023-11-30 03:03:52 |
> /home/bf/bf-build/taipan/HEAD/pgsql.build/../pgsql/src/backend/access/transam/slru.c:63:1:
> warning: 'inline' is not at beginning of declaration
> [-Wold-style-declaration]
>  tamandua  | 2023-11-30 02:49:50 |
> /home/bf/bf-build/tamandua/HEAD/pgsql.build/../pgsql/src/backend/access/transam/slru.c:63:1:
> warning: 'inline' is not at beginning of declaration
> [-Wold-style-declaration]
>
> regards, tom lane
>
Agree. The fix is attached.

Regards,
Pavel


0001-Fix-warning-due-non-standard-inline-declaration-in-4.patch
Description: Binary data

Re: [Proposal] global sequence implemented by snowflake ID

2023-11-30 Thread Nikita Malakhov

Hi!

I have reviewed the patch in this topic and have a question mentioning the
machine ID -
INSERT INTO snowflake_sequence.machine_id
SELECT round((random() * (0 - 511))::numeric, 0) + 511;

This kind of ID generation does not seem to guarantee from not having the
same ID in a pool
of instances, does it?

On Thu, Nov 30, 2023 at 4:18 AM Michael Paquier  wrote:

> On Tue, Nov 28, 2023 at 02:23:44PM +0530, Amit Kapila wrote:
> > It is interesting to see you want to work towards globally distributed
> > sequences. I think it would be important to discuss how and what we
> > want to achieve with sequences w.r.t logical replication and or
> > active-active configuration. There is a patch [1] for logical
> > replication of sequences which will primarily achieve the failover
> > case, i.e. if the publisher goes down and the subscriber takes over
> > the role, one can re-direct connections to it. Now, if we have global
> > sequences, one can imagine that even after failover the clients can
> > still get unique values of sequences. It will be a bit more flexible
> > to use global sequences, for example, we can use the sequence on both
> > nodes at the same time which won't be possible with the replication of
> > sequences as they will become inconsistent. Now, it is also possible
> > that both serve different use cases and we need both functionalities
> > but it would be better to have some discussion on the same.
> >
> > Thoughts?
> >
> > [1] - https://commitfest.postgresql.org/45/3823/
>
> Thanks for pointing this out.  I've read through the patch proposed by
> Tomas and both are independent things IMO.  The logical decoding patch
> relies on the SEQ_LOG records to find out which last_value/is_called
> to transfer, which is something directly depending on the in-core
> sequence implementation.  Sequence AMs are concepts that cover much
> more ground, leaving it up to the implementor to do what they want
> while hiding the activity with a RELKIND_SEQUENCE (generated columns
> included).
>
> To put it short, I have the impression that one and the other don't
> really conflict, but just cover different ground.  However, I agree
> that depending on the sequence AM implementation used in a cluster
> (snowflake IDs guarantee unicity with their machine ID), replication
> may not be necessary because the sequence implementation may be able
> to ensure that no replication is required from the start.
> --
> Michael
>


-- 
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/

Re: pg_upgrade and logical replication

On Wed, Nov 29, 2023 at 3:02 PM Amit Kapila  wrote:
>

In general, the test cases are a bit complex to understand, so, it
will be difficult to enhance these later. The complexity comes from
the fact that one upgrade test is trying to test multiple things (a)
Enabled/Disabled subscriptions; (b) relation states 'i' and 'r' are
preserved after the upgrade. (c) rows from non-refreshed tables are
not copied, etc. I understand that you may want to cover as many
things possible in one test to have fewer upgrade tests which could
save some time but I think it makes the test somewhat difficult to
understand and enhance. Can we try to split it such that (a) and (b)
are tested in one test and others could be separated out?

Few other comments:
===
1.
+$old_sub->safe_psql('postgres',
+ "CREATE SUBSCRIPTION regress_sub CONNECTION '$connstr' PUBLICATION
regress_pub"
+);
+
+$old_sub->wait_for_subscription_sync($publisher, 'regress_sub');
+
+# After the above wait_for_subscription_sync call the table can be either in
+# 'syncdone' or in 'ready' state. Now wait till the table reaches
'ready' state.
+my $synced_query =
+  "SELECT count(1) = 1 FROM pg_subscription_rel WHERE srsubstate = 'r'";
+$old_sub->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for the table to reach ready state";

Can the table be in 'i' state after above test? If not, then above
comment is misleading.

2.
+# --
+# Check that pg_upgrade is successful when all tables are in ready or in
+# init state.
+# --
+$publisher->safe_psql('postgres',
+ "INSERT INTO tab_upgraded1 VALUES (generate_series(2,50), 'before
initial sync')"
+);
+$publisher->wait_for_catchup('regress_sub');

The previous comment applies to this one as well.

3.
+$publisher->safe_psql('postgres', "CREATE PUBLICATION regress_pub1");
+$old_sub->safe_psql('postgres',
+ "CREATE SUBSCRIPTION regress_sub1 CONNECTION '$connstr' PUBLICATION
regress_pub1"
+);
+$old_sub->wait_for_subscription_sync($publisher, 'regress_sub1');
+
+# Change configuration to prepare a subscription table in init state
+$old_sub->append_conf('postgresql.conf',
+ "max_logical_replication_workers = 0");
+$old_sub->restart;
+
+# Add tab_upgraded2 to the publication. Now publication has tab_upgraded1
+# and tab_upgraded2 tables.
+$publisher->safe_psql('postgres',
+ "ALTER PUBLICATION regress_pub ADD TABLE tab_upgraded2");
+
+$old_sub->safe_psql('postgres',
+ "ALTER SUBSCRIPTION regress_sub REFRESH PUBLICATION");

These two cases for Create and Alter look confusing. I think it would
be better if Alter's case is moved before the comment: "Check that
pg_upgrade is successful when all tables are in ready or in init
state.".

4.
+# Insert a row in tab_upgraded1 and tab_not_upgraded1 publisher table while
+# it's down.
+insert_line_at_pub('while old_sub is down');

Isn't sub routine insert_line_at_pub() inserts in all three tables? If
so, then the above comment seems to be wrong and I think it is better
to explain the intention of this insert.

5.
+my $result =
+  $new_sub->safe_psql('postgres',
+ "SELECT subenabled FROM pg_subscription WHERE subname = 'regress_sub'");
+is($result, qq(f),
+ "check that the subscriber that was disable on the old subscriber
should be disabled in the new subscriber"
+);
+$result =
+  $new_sub->safe_psql('postgres',
+ "SELECT subenabled FROM pg_subscription WHERE subname = 'regress_sub1'");
+is($result, qq(t),
+ "check that the subscriber that was enabled on the old subscriber
should be enabled in the new subscriber"
+);

Can't the above be tested with a single query?

6.
+$new_sub->safe_psql('postgres', "DROP SUBSCRIPTION regress_sub1");
+
+# Subscription relations should be preserved. The upgraded subscriber
won't know
+# about 'tab_not_upgraded1' because the subscription is not yet refreshed.
+$result =
+  $new_sub->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(2),
+ "there should be 2 rows in pg_subscription_rel(representing
tab_upgraded1 and tab_upgraded2)"
+);

Here the DROP SUBSCRIPTION looks confusing. Let's try to move it after
the verification of objects after the upgrade.

7.
1.
+sub insert_line_at_pub
+{
+ my $payload = shift;
+
+ foreach ("tab_upgraded1", "tab_upgraded2", "tab_not_upgraded1")
+ {
+ $publisher->safe_psql('postgres',
+ "INSERT INTO " . $_ . " (val) VALUES('$payload')");
+ }
+}
+
+# Initial setup
+foreach ("tab_upgraded1", "tab_upgraded2", "tab_not_upgraded1")
+{
+ $publisher->safe_psql('postgres',
+ "CREATE TABLE " . $_ . " (id serial, val text)");
+ $old_sub->safe_psql('postgres',
+ "CREATE TABLE " . $_ . " (id serial, val text)");
+}
+insert_line_at_pub('before initial sync');

This makes the test slightly difficult to understand and we don't seem
to achieve much by writing sub routines.

--
With Regards,
Amit Kapila.

Re: POC, WIP: OR-clause support for indexes