Re: logical decoding and replication of sequences, take 2

2024-02-20 Thread Robert Haas
On Wed, Feb 21, 2024 at 1:06 PM Dilip Kumar  wrote:
> > But I am wondering why this flag is always set to true in
> > DecodeTXNNeedSkip() irrespective of the commit or abort, given that
> > aborted transactions are not supposed to be replayed.  So if my
> > observation is correct that this shouldn't be set to true for an
> > aborted transaction, then we have a problem with sequences, where we
> > are identifying transactional changes as non-transactional changes,
> > because for transactional changes this should depend on the commit
> > status.
>
> I have checked this case with Amit Kapila.  It seems that in the cases
> where we have sent a prepared transaction or streamed an in-progress
> transaction, we would need to send the abort as well; for that reason
> we set 'ctx->processing_required' to true, so that if these WAL
> records are not streamed we do not allow an upgrade of such slots.

I don't find this explanation clear enough for me to understand.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: ALTER TABLE SET ACCESS METHOD on partitioned tables

2024-02-20 Thread Peter Eisentraut

On 21.02.24 07:40, Michael Paquier wrote:

This means that all partitioned tables would have pg_class.relam set,
and that relam would never be 0:
- The USING clause takes priority over default_table_access_method.
- If no USING clause, default_table_access_method is the AM used

Any partitions created from this partitioned table would inherit the
AM set, ignoring default_table_access_method.

Alvaro has made a very good point a couple of days ago at [1] where we
should try to make the behavior stick closer to tablespaces, where it
could be possible to set relam to 0 for a partitioned table, where a
partition would inherit the AM set in the GUC when a USING clause is
not defined (if USING specifies the AM, we'd just use it).


Yes, I think most people agreed that that would be the preferred behavior.





Re: Injection points: some tools to wait and wake

2024-02-20 Thread Michael Paquier
On Wed, Feb 21, 2024 at 07:08:03AM +0900, Michael Paquier wrote:
> Well, both you and Andrey are asking for it now, so let's do it.  The
> implementation is simple:
> - Store in InjectionPointSharedState an array of wait_counts and an
> array of names.  There is only one condition variable.
> - When a point wants to wait, it takes the spinlock and looks within
> the array of names until it finds a free slot, adds its name into the
> array to reserve a wait counter at the same position, releases the
> spinlock.  Then it loops on the condition variable for an update of
> the counter it has reserved.  It is possible to make something more
> efficient, but at a small size it would not really matter.
> - The wakeup takes a point name in argument, acquires the spinlock,
> and checks if it can find the point into the array, pinpoints the
> location of the counter to update and updates it.  Then it broadcasts
> the change.
> - The wait loop checks its counter, leaves its loop, cancels the
> sleep, takes the spinlock to unregister from the array, and leaves.
> 
> I would just hardcode the number of points that can wait, say 5 of
> them tracked in shmem?  Does that look like what you are looking at?

I was looking at that, and it proves to work OK, so you can do stuff
like waits and wakeups for multiple processes in a controlled manner.
The attached patch authorizes up to 32 waiters.  I have switched
things so that the information reported in pg_stat_activity is the
name of the injection point itself.

Comments and ideas are welcome.
--
Michael
From 75302cba302b83ce2a6d6eaf30b163f473b87276 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Wed, 21 Feb 2024 16:36:25 +0900
Subject: [PATCH v2 1/2] injection_points: Add routines to wait and wake
 processes

This commit is made of two parts:
- A new callback that can be attached to a process to make it wait on a
condition variable.  The condition checked is registered in shared
memory by the module injection_points.
- A new SQL function to update the shared state and broadcast the update
using a condition variable.

The shared state used by the module is registered using the DSM
registry, and is optional.
---
 .../injection_points--1.0.sql |  10 ++
 .../injection_points/injection_points.c   | 151 ++
 src/tools/pgindent/typedefs.list  |   1 +
 3 files changed, 162 insertions(+)

diff --git a/src/test/modules/injection_points/injection_points--1.0.sql b/src/test/modules/injection_points/injection_points--1.0.sql
index 5944c41716..eed0310cf6 100644
--- a/src/test/modules/injection_points/injection_points--1.0.sql
+++ b/src/test/modules/injection_points/injection_points--1.0.sql
@@ -24,6 +24,16 @@ RETURNS void
 AS 'MODULE_PATHNAME', 'injection_points_run'
 LANGUAGE C STRICT PARALLEL UNSAFE;
 
+--
+-- injection_points_wakeup()
+--
+-- Wakes a condition variable waited on in an injection point.
+--
+CREATE FUNCTION injection_points_wakeup(IN point_name TEXT)
+RETURNS void
+AS 'MODULE_PATHNAME', 'injection_points_wakeup'
+LANGUAGE C STRICT PARALLEL UNSAFE;
+
 --
 -- injection_points_detach()
 --
diff --git a/src/test/modules/injection_points/injection_points.c b/src/test/modules/injection_points/injection_points.c
index e843e6594f..052b20f9c8 100644
--- a/src/test/modules/injection_points/injection_points.c
+++ b/src/test/modules/injection_points/injection_points.c
@@ -18,18 +18,72 @@
 #include "postgres.h"
 
 #include "fmgr.h"
+#include "storage/condition_variable.h"
 #include "storage/lwlock.h"
 #include "storage/shmem.h"
+#include "storage/dsm_registry.h"
 #include "utils/builtins.h"
 #include "utils/injection_point.h"
 #include "utils/wait_event.h"
 
 PG_MODULE_MAGIC;
 
+/* Maximum number of waits usable in injection points at once */
+#define INJ_MAX_WAIT	32
+#define INJ_NAME_MAXLEN	64
+
+/* Shared state information for injection points. */
+typedef struct InjectionPointSharedState
+{
+	/* protects accesses to wait_counts */
+	slock_t		lock;
+
+	/* Counters advancing when injection_points_wakeup() is called */
+	int			wait_counts[INJ_MAX_WAIT];
+
+	/* Names of injection points attached to wait counters */
+	char		name[INJ_MAX_WAIT][INJ_NAME_MAXLEN];
+
+	/*
+	 * Condition variable used for waits and wakeups, checking upon the set of
+	 * wait_counts when waiting.
+	 */
+	ConditionVariable wait_point;
+} InjectionPointSharedState;
+
+/* Pointer to shared-memory state. */
+static InjectionPointSharedState *inj_state = NULL;
+
 extern PGDLLEXPORT void injection_error(const char *name);
 extern PGDLLEXPORT void injection_notice(const char *name);
+extern PGDLLEXPORT void injection_wait(const char *name);
 
 
+static void
+injection_point_init_state(void *ptr)
+{
+	InjectionPointSharedState *state = (InjectionPointSharedState *) ptr;
+
+	SpinLockInit(&state->lock);
+	memset(state->wait_counts, 0, sizeof(state->wait_counts));
+	memset(state->name, 0, sizeof(state->name));
+	ConditionVariableInit(&state->wait_point);
+}
+
+static void

Re: logical decoding and replication of sequences, take 2

2024-02-20 Thread Dilip Kumar
On Tue, Feb 20, 2024 at 4:32 PM Dilip Kumar  wrote:
>
> On Tue, Feb 20, 2024 at 3:38 PM Robert Haas  wrote:
>
> > Let's say fast_forward is true. Then smgr_decode() is going to skip
> > recording anything about the relfilenode, so we'll identify all
> > sequence changes as non-transactional. But look at how this case is
> > handled in seq_decode():
> >
> > + if (ctx->fast_forward)
> > + {
> > + /*
> > + * We need to set processing_required flag to notify the sequence
> > + * change existence to the caller. Usually, the flag is set when
> > + * either the COMMIT or ABORT records are decoded, but this must be
> > + * turned on here because the non-transactional logical message is
> > + * decoded without waiting for these records.
> > + */
> > + if (!transactional)
> > + ctx->processing_required = true;
> > +
> > + return;
> > + }
>
> It appears that the 'processing_required' flag was introduced as part
> of supporting upgrades for logical replication slots. Its purpose is
> to determine whether a slot is fully caught up, meaning that there are
> no pending decodable changes left before it can be upgraded.
>
> So now if some change was transactional but we have identified it as
> non-transactional, then we will mark this flag 'ctx->processing_required
> = true;', so we temporarily set this flag incorrectly.  But even if the
> flag had been correctly identified initially, it would have been set
> to true again in the DecodeTXNNeedSkip() function regardless of
> whether the transaction is committed or aborted.  As a result, the
> flag would eventually be set to 'true', and the behavior would align
> with the intended logic.
>
> But I am wondering why this flag is always set to true in
> DecodeTXNNeedSkip() irrespective of the commit or abort, given that
> aborted transactions are not supposed to be replayed.  So if my
> observation is correct that this shouldn't be set to true for an
> aborted transaction, then we have a problem with sequences, where we
> are identifying transactional changes as non-transactional changes,
> because for transactional changes this should depend on the commit
> status.

I have checked this case with Amit Kapila.  It seems that in the cases
where we have sent a prepared transaction or streamed an in-progress
transaction, we would need to send the abort as well; for that reason
we set 'ctx->processing_required' to true, so that if these WAL
records are not streamed we do not allow an upgrade of such slots.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Removing unneeded self joins

2024-02-20 Thread Richard Guo
On Mon, Feb 19, 2024 at 1:24 PM Andrei Lepikhov wrote:

> On 18/2/2024 23:18, Alexander Korotkov wrote:
> > On Sun, Feb 18, 2024 at 5:04 PM Alexander Korotkov wrote:
> >> On Sun, Feb 18, 2024 at 3:00 PM Alexander Lakhin wrote:
> >>> Please look at the following query which fails with an error since
> >>> d3d55ce57:
> >>>
> >>> create table t (i int primary key);
> >>>
> >>> select t3.i from t t1
> >>>   join t t2 on t1.i = t2.i,
> >>>   lateral (select t1.i limit 1) t3;
> >>>
> >>> ERROR:  non-LATERAL parameter required by subquery
> >>
> >> Thank you for spotting.  I'm looking at this.
> >
> > Attached is a draft patch fixing this query.  Could you, please, recheck?
> I reviewed this patch. Why do you check only the target list? I guess
> these links can be everywhere. See the patch in the attachment with the
> elaborated test and slightly changed code.


I just noticed that this fix has been committed in 489072ab7a, but it's
just flat wrong.

* The fix walks the subquery and replaces all the Vars with a varno
equal to the relid of the rel being removed, without checking
varlevelsup.  That is to say, a Var that belongs to the subquery itself
might also be replaced, which is wrong.  For instance,

create table t (i int primary key);

explain (costs off)
select t3.i from t t1
  join t t2 on t1.i = t2.i
  join lateral (select * from (select t1.i offset 0) offset 0) t3 on true;
ERROR:  no relation entry for relid 2


* The fix only traverses one level within the subquery, so Vars that
appear in subqueries with multiple levels cannot be replaced.  For
instance,

explain (costs off)
select t4.i from t t1
  join t t2 on t1.i = t2.i
  join lateral (select t3.i from t t3, (select t1.i) offset 0) t4 on true;
ERROR:  non-LATERAL parameter required by subquery


I think the right fix for these issues is to introduce a new element
'sublevels_up' in ReplaceVarnoContext, and enhance replace_varno_walker
to 1) recurse into subselects with sublevels_up increased, and 2)
perform the replacement only when varlevelsup is equal to sublevels_up.

Attached is a patch for the fix.

While writing the fix, I noticed some outdated comments.  Such as in
remove_rel_from_query, the first for loop updates otherrel's attr_needed
as well as lateral_vars, but the comment only mentions attr_needed.  So
this patch also fixes some outdated comments.

While writing the test cases, I found that the test cases for SJE are
quite messy.  Below are what I have noticed:

* There are several test cases using catalog tables like pg_class,
pg_stats, pg_index, etc. for testing join removal.  I don't see a reason
why we need to use catalog tables, and I think this just raises the risk
of instability.

* In many test cases, a mix of uppercase and lowercase keywords is used
in one query.  I think it'd be better to maintain consistency by using
either all uppercase or all lowercase keywords in one query.

* In most situations, we verify the plan and the output of a query like:

explain (costs off)
select ...;
select ...;

The two select queries are supposed to be the same.  But in the SJE test
cases, I have noticed instances where the two select queries differ from
each other.

This patch also includes some cosmetic tweaks for SJE test cases.  It
does not change the test cases using catalog tables though.  I think
that should be a separate patch.

Thanks
Richard


v1-0001-Replace-lateral-references-to-removed-rels-in-subqueries.patch
Description: Binary data


Re: What about Perl autodie?

2024-02-20 Thread Peter Eisentraut

On 14.02.24 17:52, Peter Eisentraut wrote:
A gentler way might be to start using some perlcritic policies like 
InputOutput::RequireCheckedOpen or the more general 
InputOutput::RequireCheckedSyscalls and add explicit error checking at 
the sites it points out.


Here is a start for that.  I added the required stanza to perlcriticrc 
and started with an explicit list of functions to check:


functions = chmod flock open read rename seek symlink system

and fixed all the issues it pointed out.

I picked those functions because most existing code already checked 
those, so the omissions are probably unintended, or in some cases also 
because I thought it would be important for test correctness (e.g., some 
tests using chmod).
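For reference, the perlcriticrc stanza enabling this presumably looks something like the following (exact severity settings aside); RequireCheckedSyscalls accepts a space-separated `functions` list:

```ini
[InputOutput::RequireCheckedSyscalls]
functions = chmod flock open read rename seek symlink system
```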


I didn't design any beautiful error messages; I mostly just used "or die
$!", which matches most existing code.  Also, this is developer-level
code, so having the system error plus a source code reference should be
ok.


In the second patch, I changed the perlcriticrc stanza to use an 
exclusion list instead of an explicit inclusion list.  That way, you can 
see what we are currently *not* checking.  I'm undecided which way 
around is better, and exactly what functions we should be checking.  (Of 
course, in principle, all of them, but since this is test and build 
support code, not production code, there are probably some reasonable 
compromises to be made.)
From c32941ce95281ab21691c4181962d20a820b1f20 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Tue, 20 Feb 2024 10:12:12 +0100
Subject: [PATCH v1 1/2] perlcritic InputOutput::RequireCheckedSyscalls

---
 .../t/010_pg_archivecleanup.pl  |  2 +-
 src/bin/pg_basebackup/t/010_pg_basebackup.pl|  8 
 src/bin/pg_ctl/t/001_start_stop.pl  |  2 +-
 src/bin/pg_resetwal/t/002_corrupted.pl  |  2 +-
 src/bin/pg_rewind/t/009_growing_files.pl|  2 +-
 src/bin/pg_rewind/t/RewindTest.pm   |  4 ++--
 src/pl/plperl/text2macro.pl |  4 ++--
 src/test/kerberos/t/001_auth.pl |  2 +-
 .../ssl_passphrase_callback/t/001_testfunc.pl   |  2 +-
 src/test/perl/PostgreSQL/Test/Cluster.pm| 12 ++--
 src/test/perl/PostgreSQL/Test/Utils.pm  | 16 
 src/test/ssl/t/SSL/Server.pm| 10 +-
 src/tools/msvc_gendef.pl|  4 ++--
 src/tools/perlcheck/perlcriticrc|  4 
 src/tools/pgindent/pgindent | 17 +
 15 files changed, 48 insertions(+), 43 deletions(-)

diff --git a/src/bin/pg_archivecleanup/t/010_pg_archivecleanup.pl b/src/bin/pg_archivecleanup/t/010_pg_archivecleanup.pl
index 792f5677c87..91a98c71e99 100644
--- a/src/bin/pg_archivecleanup/t/010_pg_archivecleanup.pl
+++ b/src/bin/pg_archivecleanup/t/010_pg_archivecleanup.pl
@@ -36,7 +36,7 @@ sub create_files
 {
foreach my $fn (map { $_->{name} } @_)
{
-   open my $file, '>', "$tempdir/$fn";
+   open my $file, '>', "$tempdir/$fn" or die $!;
 
print $file 'CONTENT';
close $file;
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index 86cc01a640b..159da3029af 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -77,7 +77,7 @@
 ok(-d "$tempdir/backup", 'backup directory was created and left behind');
 rmtree("$tempdir/backup");
 
-open my $conf, '>>', "$pgdata/postgresql.conf";
+open my $conf, '>>', "$pgdata/postgresql.conf" or die $!;
 print $conf "max_replication_slots = 10\n";
 print $conf "max_wal_senders = 10\n";
 print $conf "wal_level = replica\n";
@@ -175,7 +175,7 @@
qw(backup_label tablespace_map postgresql.auto.conf.tmp
current_logfiles.tmp global/pg_internal.init.123))
 {
-   open my $file, '>>', "$pgdata/$filename";
+   open my $file, '>>', "$pgdata/$filename" or die $!;
print $file "DONOTCOPY";
close $file;
 }
@@ -185,7 +185,7 @@
 # unintended side effects.
 if ($Config{osname} ne 'darwin')
 {
-   open my $file, '>>', "$pgdata/.DS_Store";
+   open my $file, '>>', "$pgdata/.DS_Store" or die $!;
print $file "DONOTCOPY";
close $file;
 }
@@ -424,7 +424,7 @@
my $tblspcoid = $1;
my $escapedRepTsDir = $realRepTsDir;
$escapedRepTsDir =~ s/\\//g;
-   open my $mapfile, '>', $node2->data_dir . '/tablespace_map';
+   open my $mapfile, '>', $node2->data_dir . '/tablespace_map' or die $!;
print $mapfile "$tblspcoid $escapedRepTsDir\n";
close $mapfile;
 
diff --git a/src/bin/pg_ctl/t/001_start_stop.pl b/src/bin/pg_ctl/t/001_start_stop.pl
index fd56bf7706a..cbdaee57fb1 100644
--- a/src/bin/pg_ctl/t/001_start_stop.pl
+++ b/src/bin/pg_ctl/t/001_start_stop.pl
@@ -23,7 +23,7 @@
 command_ok([ $ENV{PG_REGRESS}, '--config-auth', "$tempdir/data" ],
'configure 

Re: Add lookup table for replication slot invalidation causes

2024-02-20 Thread Bharath Rupireddy
On Wed, Feb 21, 2024 at 11:56 AM Michael Paquier  wrote:
>
> Note the recent commit 74a730631065 where Alvaro has changed for the
> lwlock tranche names.  That's quite elegant.

Yes, that's absolutely neat. FWIW, designated initializer syntax can
be used in a few more places though. I'm not sure how worthwhile it
will be, but I'll see if I can quickly put up a patch for it.

> > With these two asserts, the behavior (asserts on null and non-existent
> > inputs) is the same as what GetSlotInvalidationCause has right now.
>
> Well, I won't fight you over that.

Haha :)

> A new cause would require an update of SlotInvalidationCause, so if
> you keep RS_INVAL_MAX_CAUSES close to it that's impossible to miss.
> IMO, it makes just more sense to keep that in slot.c because of the
> static assert as well.

Hm, okay. Moved that to slot.c but left a note in the comment atop the
enum to update it.

> + * If you add a new invalidation cause here, remember to add its name in
> + * SlotInvalidationCauses in the same order as that of the cause.
>
> The order does not matter with the way v2 does things with
> SlotInvalidationCauses[], no?

Ugh. Corrected that now.

Please see the attached v3 patch.

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


v3-0001-Add-lookup-table-for-replication-slot-invalidatio.patch
Description: Binary data


Test to dump and restore objects left behind by regression

2024-02-20 Thread Ashutosh Bapat
Hi All,
In [1] we found that a test to dump and restore objects left behind
by the regression test is missing. Such a test would cover many
dump/restore scenarios without much effort. It will also help identify
problems described in the same thread [2] during development itself.

I am starting a new thread to discuss such a test. Attached is a WIP
version of the test. The test does fail at the restore step when
commit 74563f6b90216180fc13649725179fc119dddeb5 is reverted
reintroducing the problem.

The attached WIP test is inspired by
src/bin/pg_upgrade/t/002_pg_upgrade.pl, which tests binary-upgrade
dumps. The attached test tests the non-binary-upgrade dumps.

Similar to 002_pg_upgrade.pl, the test uses SQL dumps before and after
dump and restore to make sure that the objects are restored correctly.
The test has some shortcomings:
1. Objects which are not dumped at all are never tested.
2. Since the rows are dumped in varying order by the two clusters, the
test only tests schema dump and restore.
3. The order of columns of the inheritance child table differs
depending upon the DDLs used to reach a given state. This introduces
diffs in the SQL dumps before and after restore. The test ignores
these diffs by hardcoding the diff in the test.

Even with 1 and 2, the test is useful to detect dump/restore anomalies.
I think we should improve 3, but I don't have a good and simple
solution. I didn't find any way to compare two given clusters in our
TAP test framework. Building it will be a lot of work. Not sure if
it's worth it.

Suggestions welcome.

[1] 
https://www.postgresql.org/message-id/CAExHW5vyqv%3DXLTcNMzCNccOrHiun_XhYPjcRqeV6dLvZSamriQ%40mail.gmail.com
[2] https://www.postgresql.org/message-id/3462358.1708107856%40sss.pgh.pa.us

--
Best Wishes,
Ashutosh Bapat
From 78903d2cad4e94e05db74b2473f82aabb498f987 Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat 
Date: Wed, 21 Feb 2024 11:02:40 +0530
Subject: [PATCH 1/2] WIP Test to dump and restore object left behind by
 regression

Regression run leaves many database objects behind in a variety of states.
Dumping and restoring these objects covers many dump/restore scenarios not
covered elsewhere.  src/bin/pg_upgrade/t/002_pg_upgrade.pl tests pg_upgrade this
way, but it does not cover non-binary-upgrade dump and restore.

The test takes a dump of the regression database from one cluster and restores it
on another cluster. It compares the dumps from both clusters in SQL format
to make sure that the objects dumped were restored properly. Obviously if some
objects were not dumped, those remain untested. Hence this isn't a complete
test.

The order in which data rows are dumped varies a lot between dump and restore,
hence the test only compares schema dumps.

The order in which columns of an inherited table are dumped varies depending
upon the DDLs used to set up inheritance. This introduces some difference in
the SQL dumps taken from the two clusters. Those differences are explicitly
ignored in the test.

TODO:
1. We could test formats other than -Fc

Ashutosh Bapat
---
 src/bin/pg_dump/Makefile |   4 +
 src/bin/pg_dump/t/006_pg_dump_regress.pl | 180 +++
 2 files changed, 184 insertions(+)
 create mode 100644 src/bin/pg_dump/t/006_pg_dump_regress.pl

diff --git a/src/bin/pg_dump/Makefile b/src/bin/pg_dump/Makefile
index 930c741c95..29a3d67953 100644
--- a/src/bin/pg_dump/Makefile
+++ b/src/bin/pg_dump/Makefile
@@ -42,6 +42,10 @@ OBJS = \
 	pg_backup_tar.o \
 	pg_backup_utils.o
 
+# required for 006_pg_dump_regress.pl
+REGRESS_SHLIB=$(abs_top_builddir)/src/test/regress/regress$(DLSUFFIX)
+export REGRESS_SHLIB
+
 all: pg_dump pg_restore pg_dumpall
 
 pg_dump: pg_dump.o common.o pg_dump_sort.o $(OBJS) | submake-libpq submake-libpgport submake-libpgfeutils
diff --git a/src/bin/pg_dump/t/006_pg_dump_regress.pl b/src/bin/pg_dump/t/006_pg_dump_regress.pl
new file mode 100644
index 00..c3016f975d
--- /dev/null
+++ b/src/bin/pg_dump/t/006_pg_dump_regress.pl
@@ -0,0 +1,180 @@
+# Copyright (c) 2022-2024, PostgreSQL Global Development Group
+
+# Test dump and restore of regression data. This is expected to cover dump and
+# restore of most of the types of objects left behind in different states by
+# the regression test.
+use strict;
+use warnings FATAL => 'all';
+
+use Cwd qw(abs_path);
+use File::Basename qw(dirname);
+use File::Compare;
+use File::Find qw(find);
+use File::Path qw(rmtree);
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use PostgreSQL::Test::AdjustUpgrade;
+use Test::More;
+
+sub generate_regress_data
+{
+	my ($node) = @_;
+
+	# The default location of the source code is the root of this directory.
+	my $srcdir = abs_path("../../..");
+
+	# Grab any regression options that may be passed down by caller.
+	my $extra_opts = $ENV{EXTRA_REGRESS_OPTS} || "";
+
+	# --dlpath is needed to be able to find the location of regress.so
+	# and any libraries the regression tests require.
+	my $dlpath = 

Re: Synchronizing slots from primary to standby

2024-02-20 Thread shveta malik
On Tue, Feb 20, 2024 at 6:19 PM Masahiko Sawada  wrote:
>
> On Tue, Feb 20, 2024 at 12:33 PM shveta malik  wrote:
> Thank you for the explanation. It makes sense to me to move the check.
>
>
> 2. If the wal_level is not logical, the server will need to restart
> anyway to change the wal_level and have the slotsync worker work. Does
> it make sense to have the postmaster exit if the wal_level is not
> logical and sync_replication_slots is enabled? For instance, we have
> similar checks in PostmasterMain():
>
> if (summarize_wal && wal_level == WAL_LEVEL_MINIMAL)
> ereport(ERROR,
> (errmsg("WAL cannot be summarized when wal_level is
> \"minimal\"")));

Thanks for the feedback. I have addressed it in v93.

thanks
Shveta




Re: A new message seems missing a punctuation

2024-02-20 Thread Amit Kapila
On Tue, Feb 20, 2024 at 7:25 PM shveta malik  wrote:
>
> On Tue, Feb 20, 2024 at 5:01 PM Amit Kapila  wrote:
> >
> >
> > We do expose the required information (restart_lsn, catalog_xmin,
> > synced, temporary, etc) via pg_replication_slots. So, I feel the LOG
> > message here is sufficient to DEBUG (or know the details) when the
> > slot sync doesn't succeed.
> >
>
> Please find the patch having the suggested changes.
>

LGTM. I'll push this tomorrow unless someone has further comments/suggestions.

-- 
With Regards,
Amit Kapila.




Re: ALTER TABLE SET ACCESS METHOD on partitioned tables

2024-02-20 Thread Michael Paquier
On Tue, Feb 20, 2024 at 03:47:46PM +0100, Peter Eisentraut wrote:
> It would be helpful if this patch could more extensively document in its
> commit message what semantic changes it makes.  Various options of possible
> behaviors were discussed in this thread, but it's not clear which behaviors
> were chosen in this particular patch version.
> 
> The general idea is that you can set an access method on a partitioned
> table.  That much seems very agreeable.  But then what happens with this
> setting, how can you override it, how can you change it, what happens when
> you change it, what happens with existing partitions and new partitions,
> etc. -- and which of these behaviors are new and old.  Many things to
> specify.

The main point in this patch is the following code block in
DefineRelation(), that defines the semantics about the AM set for a
partitioned table:
+else if (RELKIND_HAS_TABLE_AM(relkind) || relkind == RELKIND_PARTITIONED_TABLE)
 {
+    if (stmt->partbound)
+    {
+        /*
+         * For partitions, if no access method is specified, use the AM of
+         * the parent table.
+         */
+        Assert(list_length(inheritOids) == 1);
+        accessMethodId = get_rel_relam(linitial_oid(inheritOids));
+        Assert(OidIsValid(accessMethodId));
+    }
+    else
+        accessMethodId = get_table_am_oid(default_table_access_method, false);
 }

This means that all partitioned tables would have pg_class.relam set,
and that relam would never be 0:
- The USING clause takes priority over default_table_access_method.
- If no USING clause, default_table_access_method is the AM used

Any partitions created from this partitioned table would inherit the
AM set, ignoring default_table_access_method.  

Alvaro has made a very good point a couple of days ago at [1] where we
should try to make the behavior stick closer to tablespaces, where it
could be possible to set relam to 0 for a partitioned table, where a
partition would inherit the AM set in the GUC when a USING clause is
not defined (if USING specifies the AM, we'd just use it).

Existing partitions should not be changed if the AM of their
partitioned table changes, so you can think of the AM as a hint for
the creation of new partitions.

[1]: 
https://www.postgresql.org/message-id/202402011550.sfszd46247zi@alvherre.pgsql
--
Michael


signature.asc
Description: PGP signature


Re: logical decoding and replication of sequences, take 2

2024-02-20 Thread Amit Kapila
On Wed, Feb 14, 2024 at 10:21 PM Tomas Vondra
 wrote:
>
> On 2/13/24 17:37, Robert Haas wrote:
>
> > In other words, the fact that some sequence changes are
> > non-transactional creates ordering hazards that don't exist if there
> > are no non-transactional changes. So in that way, sequences are
> > different from table modifications, where applying the transactions in
> > order of commit is all we need to do. Here we need to apply the
> > transactions in order of commit and also apply the non-transactional
> > changes at the right point in the sequence. Consider the following
> > alternative apply sequence:
> >
> > 1. T1.
> > 2. T2's transactional changes (i.e. the ALTER SEQUENCE INCREMENT and
> > the subsequent nextval)
> > 3. T3's nextval
> > 4. T2's first nextval
> >
> > That's still in commit order. It's also wrong.
> >
>
> Yes, this would be wrong. Thankfully the apply is not allowed to reorder
> the changes like this, because that's not what "non-transactional" means
> in this context.
>
> It does not mean we can arbitrarily reorder the changes, it only means
> the changes are applied as if they were independent transactions (but in
> the same order as they were executed originally).
>

In this regard, I have another scenario in mind where the apply order
could be different for the changes in the same transactions. For
example,

Transaction T1
Begin;
Insert ..
Insert ..
nextval .. --consider this generates WAL
..
Insert ..
nextval .. --consider this generates WAL

In this case, if the nextval operations are applied in a different
order (i.e., before the inserts), then there could be some inconsistency.
Say, if it doesn't follow the above order during apply, then a trigger
fired on both pub and sub for each row insert that refers to the
current sequence value to make some decision could behave differently
on the publisher and subscriber. If this is not how the patch will
behave then fine, but otherwise, isn't this something that we should
be worried about?

-- 
With Regards,
Amit Kapila.




Re: Unlinking Parallel Hash Join inner batch files sooner

2024-02-20 Thread Andrei Lepikhov

Hi,

I see in [1] that the reporter mentioned a delay between the error
message in parallel HashJoin and the return of control back to psql.
Your patch might reduce this delay.
Also, I have the same complaint from users who processed gigabytes of
data in parallel HashJoin. Presumably, they were also stuck in the
unlinking of tons of temporary files. So, are you going to do
something with this code?


[1] 
https://www.postgresql.org/message-id/18349-83d33dd3d0c855c3%40postgresql.org


--
regards,
Andrei Lepikhov
Postgres Professional





Re: Add lookup table for replication slot invalidation causes

2024-02-20 Thread Michael Paquier
On Wed, Feb 21, 2024 at 09:49:37AM +0530, Bharath Rupireddy wrote:
> On Wed, Feb 21, 2024 at 5:04 AM Michael Paquier  wrote:
> Done that way. I'm fine with the designated initialization [1] that an
> ISO C99 compliant compiler offers. PostgreSQL installation guide
> https://www.postgresql.org/docs/current/install-requirements.html says
> that we need an at least C99-compliant ISO/ANSI C compiler.

Note the recent commit 74a730631065 where Alvaro has changed for the
lwlock tranche names.  That's quite elegant.

> Right, but an assertion isn't a bad idea there as it can generate a
> backtrace as opposed to the crash generating just SEGV note (and
> perhaps a crash dump) in server logs.
> 
> With these two asserts, the behavior (asserts on null and non-existent
> inputs) is the same as what GetSlotInvalidationCause has right now.

Well, I won't fight you over that.

>> +/* Maximum number of invalidation causes */
>> +#define RS_INVAL_MAX_CAUSES RS_INVAL_WAL_LEVEL

There is no need to add that to slot.h: it is only used in slot.c.

> 
> Right, but it needs to be updated whenever a new cause is added to
> enum ReplicationSlotInvalidationCause. Therefore, I think it's better
> to be closer there in slot.h.

A new cause would require an update of SlotInvalidationCause, so if
you keep RS_INVAL_MAX_CAUSES close to it that's impossible to miss.
IMO, it makes just more sense to keep that in slot.c because of the
static assert as well.

+ * If you add a new invalidation cause here, remember to add its name in
+ * SlotInvalidationCauses in the same order as that of the cause. 

The order does not matter with the way v2 does things with
SlotInvalidationCauses[], no?
--
Michael


signature.asc
Description: PGP signature


BRIN integer overflow

2024-02-20 Thread Oleg Tselebrovskiy

Greetings, everyone!

While analyzing the output of the Svace static analyzer [1], I've found a bug.

Function bringetbitmap that is used in BRIN's IndexAmRoutine should return an
int64 value, but the actual return value is int, since totalpages is int and
totalpages * 10 is also int. This could lead to integer overflow.

I suggest changing totalpages to int64 to avoid the potential overflow.
Also, in all other "amgetbitmap" functions (such as hashgetbitmap,
gistgetbitmap, gingetbitmap, blgetbitmap) the return value is of the
correct int64 type.

The proposed patch is attached.

[1] - https://svace.pages.ispras.ru/svace-website/en/

Oleg Tselebrovskiy, Postgres Pro

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 1087a9011ea..f72667e484e 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -559,7 +559,7 @@ bringetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
 	BrinOpaque *opaque;
 	BlockNumber nblocks;
 	BlockNumber heapBlk;
-	int			totalpages = 0;
+	int64		totalpages = 0;
 	FmgrInfo   *consistentFn;
 	MemoryContext oldcxt;
 	MemoryContext perRangeCxt;


Re: WIP Incremental JSON Parser

2024-02-20 Thread Andrew Dunstan



On 2024-02-20 Tu 19:53, Jacob Champion wrote:

On Tue, Feb 20, 2024 at 2:10 PM Andrew Dunstan  wrote:

Well, that didn't help a lot, but meanwhile the CFBot seems to have
decided in the last few days that it's now happy, so full steam ahead! ;-)

I haven't been able to track down the root cause yet, but I am able to
reproduce the failure consistently on my machine:

 ERROR:  could not parse backup manifest: file size is not an
integer: '16384, "Last-Modified": "2024-02-20 23:07:43 GMT",
"Checksum-Algorithm": "CRC32C", "Checksum": "66c829cd" },
 { "Path": "base/4/2606", "Size": 24576, "Last-Modified": "20

Full log is attached.




*sigh* That's weird. I wonder why you can reproduce it and I can't. Can 
you give me details of the build? OS, compiler, path to source, build 
setup etc.? Anything that might be remotely relevant.



cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Speeding up COPY TO for uuids and arrays

2024-02-20 Thread Michael Paquier
On Mon, Feb 19, 2024 at 03:08:45PM +0100, Laurenz Albe wrote:
> On Sat, 2024-02-17 at 12:24 -0800, Andres Freund wrote:
>> I wonder if we should move the core part for converting to hex to numutils.c,
>> we already have code the for the inverse. There does seem to be further
>> optimization potential in the conversion, and that seems better done 
>> somewhere
>> central rather than one type's output function. OTOH, it might not be worth
>> it, given the need to add the dashes.
>
> I thought about it, but then decided not to do that.
> Calling a function that converts the bytes to hex and then adding the
> hyphens will slow down processing, and I think the code savings would be
> minimal at best.

Yeah, I'm not sure either if that's worth doing, the current
conversion code is simple enough.  I'd be curious to hear about ideas
to optimize that more locally, though.

I was curious about the UUID one, and COPYing out 10M rows with a
single UUID attribute brings down the runtime of a COPY from 3.8s to
2.3s here on a simple benchmark, with uuid_out showing up at the top
of profiles easily on HEAD.  Some references for HEAD:
31.63%  5300  postgres  postgres[.] uuid_out
19.79%  3316  postgres  postgres[.] appendStringInfoChar
11.27%  1887  postgres  postgres[.] CopyAttributeOutText
 6.36%  1065  postgres  postgres[.] pgstat_progress_update_param

And with the patch for uuid_out:
22.66%  2147  postgres  postgres   [.] CopyAttributeOutText
12.99%  1231  postgres  postgres   [.] uuid_out
11.41%  1081  postgres  postgres   [.] pgstat_progress_update_param
 4.79%   454  postgres  postgres   [.] CopyOneRowTo

That's very good, so I'd like to apply this part.  Let me know if
there are any objections.
--
Michael


signature.asc
Description: PGP signature


Re: Have pg_basebackup write "dbname" in "primary_conninfo"?

2024-02-20 Thread Ajin Cherian
On Wed, Feb 21, 2024 at 4:09 PM Hayato Kuroda (Fujitsu) <
kuroda.hay...@fujitsu.com> wrote:

> Dear Horiguchi-san,
>
> > GetConnection()@streamutil.c wants to ensure conninfo has a fallback
> > database name ("replication"). However, the function seems to be
> > ignoring the case where neither dbname nor connection string is given,
> > which is the first case Kuroda-san raised. The second case is the
> > intended behavior of the function.
> >
> > > /* pg_recvlogical uses dbname only; others use connection_string
> only.
> > */
> > > Assert(dbname == NULL || connection_string == NULL);
> >
> > And the function incorrectly assumes that the connection string
> > requires "dbname=replication".
> >
> > >  * Merge the connection info inputs given in form of connection
> string,
> > >  * options and default values (dbname=replication,
> replication=true,
> > etc.)
> >
> > But the name is a pseudo database name only used by pg_hba.conf
> > (maybe) , which cannot be used as an actual database name (without
> > quotation marks, or unless it is actually created). The function
> > should not add the fallback database name because the connection
> > string for physical replication connection doesn't require the dbname
> > parameter. (attached patch)
>
> I was also missing, but the requirement was that the dbname should be
> included
> only when the dbname option was explicitly specified [1]. Even mine and
> yours
> cannot handle like that. Libpq function
> PQconnectdbParams()->pqConnectOptions2()
> fills all the parameter to PGconn, at that time the information whether it
> is
> intentionally specified or not is discarded. Then,
> GenerateRecoveryConfig() would
> just write down all the connection parameters from PGconn, they cannot
> recognize.
>
Well, one option is that if a default dbname should be written to the
configuration file, then "postgres" is a better option than "replication"
or the username as the default, since at least a database of that name
exists.

regards,
Ajin Cherian
Fujitsu Australia


Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-02-20 Thread Bharath Rupireddy
On Tue, Feb 20, 2024 at 12:05 PM Bharath Rupireddy
 wrote:
>
> > [...] and was able to produce something like:
> >
> > postgres=# select 
> > slot_name,slot_type,active,active_pid,wal_status,invalidation_reason from 
> > pg_replication_slots;
> >   slot_name  | slot_type | active | active_pid | wal_status | invalidation_reason
> > -------------+-----------+--------+------------+------------+---------------------
> >  rep1        | physical  | f      |            | reserved   |
> >  master_slot | physical  | t      |    1482441 | unreserved | wal_removed
> > (2 rows)
> >
> > does that make sense to have an "active/working" slot "invalidated"?
>
> Thanks. Can you please provide the steps to generate this error? Are
> you setting max_slot_wal_keep_size on primary to generate
> "wal_removed"?

I'm able to reproduce [1] the state [2] where the slot got invalidated
first, then its wal_status became unreserved, but still the slot is
serving after the standby comes up online after it catches up with the
primary getting the WAL files from the archive. There's a good reason
for this state -
https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/replication/slotfuncs.c;h=d2fa5e669a32f19989b0d987d3c7329851a1272e;hb=ff9e1e764fcce9a34467d614611a34d4d2a91b50#l351.
This intermittent state can only happen for physical slots, not for
logical slots because logical subscribers can't get the missing
changes from the WAL stored in the archive.

And, the fact seems to be that an invalidated slot can never become
normal again, but it can still serve a standby if the standby is able
to catch up by fetching the required WAL (the WAL the slot couldn't
keep for it) from elsewhere (the archive via restore_command).

As far as the 0001 patch is concerned, it reports the
invalidation_reason as long as slot_contents.data.invalidated !=
RS_INVAL_NONE. I think this is okay.

Thoughts?

[1]
./initdb -D db17
echo "max_wal_size = 128MB
max_slot_wal_keep_size = 64MB
archive_mode = on
archive_command='cp %p
/home/ubuntu/postgres/pg17/bin/archived_wal/%f'" | tee -a
db17/postgresql.conf

./pg_ctl -D db17 -l logfile17 start

./psql -d postgres -p 5432 -c "SELECT
pg_create_physical_replication_slot('sb_repl_slot', true, false);"

rm -rf sbdata logfilesbdata
./pg_basebackup -D sbdata

echo "port=5433
primary_conninfo='host=localhost port=5432 dbname=postgres user=ubuntu'
primary_slot_name='sb_repl_slot'
restore_command='cp /home/ubuntu/postgres/pg17/bin/archived_wal/%f
%p'" | tee -a sbdata/postgresql.conf

touch sbdata/standby.signal

./pg_ctl -D sbdata -l logfilesbdata start
./psql -d postgres -p 5433 -c "SELECT pg_is_in_recovery();"

./pg_ctl -D sbdata -l logfilesbdata stop

./psql -d postgres -p 5432 -c "SELECT pg_logical_emit_message(true,
'mymessage', repeat('', 1000));"
./psql -d postgres -p 5432 -c "CHECKPOINT;"
./pg_ctl -D sbdata -l logfilesbdata start
./psql -d postgres -p 5432 -xc "SELECT * FROM pg_replication_slots;"

[2]
postgres=# SELECT * FROM pg_replication_slots;
-[ RECORD 1 ]---+-
slot_name   | sb_repl_slot
plugin  |
slot_type   | physical
datoid  |
database|
temporary   | f
active  | t
active_pid  | 710667
xmin|
catalog_xmin|
restart_lsn | 0/115D21A0
confirmed_flush_lsn |
wal_status  | unreserved
safe_wal_size   | 77782624
two_phase   | f
conflict_reason |
failover| f
synced  | f
invalidation_reason | wal_removed
last_inactive_at|
inactive_count  | 1

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




RE: Have pg_basebackup write "dbname" in "primary_conninfo"?

2024-02-20 Thread Hayato Kuroda (Fujitsu)
Dear Horiguchi-san,

> GetConnection()@streamutil.c wants to ensure conninfo has a fallback
> database name ("replication"). However, the function seems to be
> ignoring the case where neither dbname nor connection string is given,
> which is the first case Kuroda-san raised. The second case is the
> intended behavior of the function.
> 
> > /* pg_recvlogical uses dbname only; others use connection_string only.
> */
> > Assert(dbname == NULL || connection_string == NULL);
> 
> And the function incorrectly assumes that the connection string
> requires "dbname=replication".
> 
> >  * Merge the connection info inputs given in form of connection string,
> >  * options and default values (dbname=replication, replication=true,
> etc.)
> 
> But the name is a pseudo database name only used by pg_hba.conf
> (maybe) , which cannot be used as an actual database name (without
> quotation marks, or unless it is actually created). The function
> should not add the fallback database name because the connection
> string for physical replication connection doesn't require the dbname
> parameter. (attached patch)

I had also missed it, but the requirement was that the dbname should be included
only when the dbname option was explicitly specified [1]. Neither mine nor yours
can handle it like that. The libpq function
PQconnectdbParams()->pqConnectOptions2() fills all the parameters into the
PGconn; at that point the information about whether a parameter was
intentionally specified or not is discarded. Then GenerateRecoveryConfig()
would just write down all the connection parameters from the PGconn; it
cannot tell the difference.

One approach is that based on Horiguchi-san's one and initial patch, we can
avoid writing when the dbname is same as the username.

[1]: 
https://www.postgresql.org/message-id/CAA4eK1KH1d1J5giPMZVOtMe0iqncf1CpNwkBKoYAmXdC-kEGZg%40mail.gmail.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 





Re: logical decoding and replication of sequences, take 2

2024-02-20 Thread Amit Kapila
On Tue, Feb 20, 2024 at 5:39 PM Tomas Vondra
 wrote:
>
> On 2/20/24 06:54, Amit Kapila wrote:
> > On Thu, Dec 21, 2023 at 6:47 PM Tomas Vondra
> >  wrote:
> >>
> >> On 12/19/23 13:54, Christophe Pettus wrote:
> >>> Hi,
> >>>
> >>> I wanted to hop in here on one particular issue:
> >>>
>  On Dec 12, 2023, at 02:01, Tomas Vondra  
>  wrote:
>  - desirability of the feature: Random IDs (UUIDs etc.) are likely a much
>  better solution for distributed (esp. active-active) systems. But there
>  are important use cases that are likely to keep using regular sequences
>  (online upgrades of single-node instances, existing systems, ...).
> >>>
> >>> +1.
> >>>
> >>> Right now, the lack of sequence replication is a rather large
> >>> foot-gun on logical replication upgrades.  Copying the sequences
> >>> over during the cutover period is doable, of course, but:
> >>>
> >>> (a) There's no out-of-the-box tooling that does it, so everyone has
> >>> to write some scripts just for that one function.
> >>>
> >>> (b) It's one more thing that extends the cutover window.
> >>>
> >>
> >> I agree it's an annoying gap for this use case. But if this is the only
> >> use cases, maybe a better solution would be to provide such tooling
> >> instead of adding it to the logical decoding?
> >>
> >> It might seem a bit strange if most data is copied by replication
> >> directly, while sequences need special handling, ofc.
> >>
> >
> > One difference between the logical replication of tables and sequences
> > is that we can guarantee with synchronous_commit (and
> > synchronous_standby_names) that after failover transactions data is
> > replicated or not whereas for sequences we can't guarantee that
> > because of their non-transactional nature. Say, there are two
> > transactions T1 and T2, it is possible that T1's entire table data and
> > sequence data are committed and replicated but T2's sequence data is
> > replicated. So, after failover to logical subscriber in such a case if
> > one routes T2 again to the new node as it was not successful
> > previously then it would needlessly perform the sequence changes
> > again. I don't know how much that matters but that would probably be the
> > difference between the replication of tables and sequences.
> >
>
> I don't quite follow what the problem with synchronous_commit is :-(
>
> For sequences, we log the changes ahead, i.e. even if nextval() did not
> write anything into WAL, it's still safe because these changes are
> covered by the WAL generated some time ago (up to ~32 values back). And
> that's certainly subject to synchronous_commit, right?
>
> There certainly are issues with sequences and syncrep:
>
> https://www.postgresql.org/message-id/712cad46-a9c8-1389-aef8-faf0203c9...@enterprisedb.com
>
> but that's unrelated to logical replication.
>
> FWIW I don't think we'd re-apply sequence changes needlessly, because
> the worker does update the origin after applying non-transactional
> changes. So after the replication gets restarted, we'd skip what we
> already applied, no?
>

It will work for restarts but I was trying to discuss what happens in
the scenario after the publisher node goes down and we failover to the
subscriber node and make it a primary node (or a failover case). After
that, all unfinished transactions will be re-routed to the new
primary. Consider a theoretical case where we send sequence changes of
the yet uncommitted transactions directly from wal buffers (something
like 91f2cae7a4 does for physical replication) and then immediately
the primary or publisher node crashes. After failover to the
subscriber node, the application will re-route unfinished transactions
to the new primary. In such a situation, I think there is a chance
that we will update the sequence value when it would have already
received/applied that update via replication. This is what I was
saying that there is probably a difference between tables and
sequences, for tables such a replicated change would be rolled back.
Having said that, this is probably no different from what would happen
in the case of physical replication.

> But maybe there is an issue and I'm just not getting it. Could you maybe
> share an example of T1/T2, with a replication restart and what you think
> would happen?
>
> > I agree with your point above that for upgrades some tool like
> > pg_copysequence where we can provide a way to copy sequence data to
> > subscribers from the publisher would suffice the need.
> >
>
> Perhaps. Unfortunately it doesn't quite work for failovers, and it's yet
> another tool users would need to use.
>

But can logical replica be used for failover? We don't have any way to
replicate/sync the slots on subscribers and neither we have a
mechanism to replicate existing publications. I think if we want to
achieve failover to a logical subscriber we need to replicate/sync the
required logical and physical slots to the subscribers. I haven't
thought through it completely so there 

Re: 'Shutdown <= SmartShutdown' check while launching processes in postmaster.

2024-02-20 Thread Tom Lane
shveta malik  writes:
> I would like to know that why we have 'Shutdown <= SmartShutdown'
> check before launching few processes (WalReceiver, WalSummarizer,
> AutoVacuum worker) while rest of the processes (BGWriter, WalWriter,
> Checkpointer, Archiver etc) do not have any such check. If I have to
> launch a new process, what shall be the criteria to decide if I need
> this check?

Children that are stopped by the "if (pmState == PM_STOP_BACKENDS)"
stanza in PostmasterStateMachine should not be allowed to start
again later if we are trying to shut down.  (But "smart" shutdown
doesn't enforce that, since it's a very weak state that only
prohibits new client sessions.)  The processes that are allowed
to continue beyond that point are ones that are needed to perform
the shutdown checkpoint, or useful to make it finish faster.

regards, tom lane




Re: Add system identifier to backup manifest

2024-02-20 Thread Michael Paquier
On Mon, Feb 19, 2024 at 12:06:19PM +0530, Amul Sul wrote:
> On Mon, Feb 19, 2024 at 4:22 AM Michael Paquier  wrote:
>> And the new option should be documented at the top of the init()
>> routine for perldoc.
> 
> Added in the attached version.

I've done some wordsmithing on 0001 and it is OK, so I've applied it
to move the needle.  Hope that helps.
--
Michael


signature.asc
Description: PGP signature


Re: Add lookup table for replication slot invalidation causes

2024-02-20 Thread Bharath Rupireddy
On Wed, Feb 21, 2024 at 5:04 AM Michael Paquier  wrote:
>
> On Tue, Feb 20, 2024 at 05:53:03PM +0100, Jelte Fennema-Nio wrote:
> > Seems like a good improvement overall. But I'd prefer the definition
> > of the lookup table to use this syntax:
> >
> > const char *const SlotInvalidationCauses[] = {
> > [RS_INVAL_NONE] = "none",
> > [RS_INVAL_WAL_REMOVED] = "wal_removed",
> > [RS_INVAL_HORIZON] = "rows_removed",
> > [RS_INVAL_WAL_LEVEL] = "wal_level_insufficient",
> > };
>
> +1.

Done that way. I'm fine with the designated initializers [1] that an
ISO C99-compliant compiler offers. The PostgreSQL installation guide
https://www.postgresql.org/docs/current/install-requirements.html says
that we need at least a C99-compliant ISO/ANSI C compiler.

[1] https://open-std.org/JTC1/SC22/WG14/www/docs/n494.pdf
https://en.cppreference.com/w/c/99
https://www.ibm.com/docs/en/zos/2.4.0?topic=initializers-designated-aggregate-types-c-only

> > Regarding the actual patch:
> >
> > -   Assert(conflict_reason);
> >
> > Probably we should keep this Assert. As well as the Assert(0)
>
> The assert(0) at the end of the routine, likely so.  I don't see a
> huge point for the assert on conflict_reason as we'd crash anyway  on
> strcmp, no?

Right, but an assertion isn't a bad idea there as it can generate a
backtrace as opposed to the crash generating just SEGV note (and
perhaps a crash dump) in server logs.

With these two asserts, the behavior (asserts on null and non-existent
inputs) is the same as what GetSlotInvalidationCause has right now.

> +/* Maximum number of invalidation causes */
> +#define RS_INVAL_MAX_CAUSES RS_INVAL_WAL_LEVEL
>
> There is no need to add that to slot.h: it is only used in slot.c.

Right, but it needs to be updated whenever a new cause is added to
enum ReplicationSlotInvalidationCause. Therefore, I think it's better
to be closer there in slot.h.

Please see the attached v2 patch.

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


v2-0001-Add-lookup-table-for-replication-slot-invalidatio.patch
Description: Binary data


Re: Have pg_basebackup write "dbname" in "primary_conninfo"?

2024-02-20 Thread Ajin Cherian
On Wed, Feb 21, 2024 at 2:04 PM Kyotaro Horiguchi 
wrote:

>
> Although I haven't looked at the original thread, it seems that the
> dbname is used only by pg_sync_replication_slots(). If it is true,
> couldn't we make the SQL function require a database name to make a
> connection, instead of requiring it in physical-replication conninfo?
>
>
>
In the original thread, the intention is to not just provide this
functionality  using the function pg_sync_replication_slots(), but provide
a GUC option on standbys to sync logical replication slots periodically
even without calling that function. This requires connecting to a database.

regards,
Ajin Cherian
Fujitsu Australia


Re: Have pg_basebackup write "dbname" in "primary_conninfo"?

2024-02-20 Thread Ajin Cherian
On Wed, Feb 21, 2024 at 2:04 PM Kyotaro Horiguchi 
wrote:

>
> About the proposed patch, pg_basebackup cannot verify the validity of
> the dbname. It could be problematic.
>
> Although I haven't looked at the original thread, it seems that the
> dbname is used only by pg_sync_replication_slots(). If it is true,
> couldn't we make the SQL function require a database name to make a
> connection, instead of requiring it in physical-replication conninfo?
>
> regards.
>
> --
> Kyotaro Horiguchi
> NTT Open Source Software Center
>

I agree. If the intention is to meet the new requirement of the sync-slot
patch which requires a dbname in the primary_conninfo, then pseudo dbnames
will not work, whether it be the username or just "replication". I feel if
the user does not specify dbname explicitly in pg_basebackup it should be
left blank in the generated primary_conninfo string as well.

regards,
Ajin Cherian
Fujitsu Australia


'Shutdown <= SmartShutdown' check while launching processes in postmaster.

2024-02-20 Thread shveta malik
Hi hackers,

I would like to know that why we have 'Shutdown <= SmartShutdown'
check before launching few processes (WalReceiver, WalSummarizer,
AutoVacuum worker) while rest of the processes (BGWriter, WalWriter,
Checkpointer, Archiver etc) do not have any such check. If I have to
launch a new process, what shall be the criteria to decide if I need
this check?

Looking forward to your expert advice.

thanks
Shveta




Re: Have pg_basebackup write "dbname" in "primary_conninfo"?

2024-02-20 Thread Kyotaro Horiguchi
At Tue, 20 Feb 2024 19:56:10 +0530, Robert Haas  wrote 
in 
> It seems like maybe somebody should look into why this is happening,
> and perhaps fix it.

GetConnection()@streamutil.c wants to ensure conninfo has a fallback
database name ("replication"). However, the function seems to be
ignoring the case where neither dbname nor connection string is given,
which is the first case Kuroda-san raised. The second case is the
intended behavior of the function.

>   /* pg_recvlogical uses dbname only; others use connection_string only. 
> */
>   Assert(dbname == NULL || connection_string == NULL);

And the function incorrectly assumes that the connection string
requires "dbname=replication".

>* Merge the connection info inputs given in form of connection string,
>* options and default values (dbname=replication, replication=true, 
> etc.)

But the name is a pseudo database name only used by pg_hba.conf
(maybe), which cannot be used as an actual database name (without
quotation marks, or unless it is actually created). The function
should not add the fallback database name because the connection
string for physical replication connection doesn't require the dbname
parameter. (attached patch)

About the proposed patch, pg_basebackup cannot verify the validity of
the dbname. It could be problematic.

Although I haven't looked at the original thread, it seems that the
dbname is used only by pg_sync_replication_slots(). If it is true,
couldn't we make the SQL function require a database name to make a
connection, instead of requiring it in physical-replication conninfo?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/bin/pg_basebackup/streamutil.c b/src/bin/pg_basebackup/streamutil.c
index 56d1b15951..da63a04de4 100644
--- a/src/bin/pg_basebackup/streamutil.c
+++ b/src/bin/pg_basebackup/streamutil.c
@@ -78,7 +78,7 @@ GetConnection(void)
 
 	/*
 	 * Merge the connection info inputs given in form of connection string,
-	 * options and default values (dbname=replication, replication=true, etc.)
+	 * options and default values (replication=true, etc.)
 	 */
 	i = 0;
 	if (connection_string)
@@ -96,14 +96,6 @@ GetConnection(void)
 		keywords = pg_malloc0((argcount + 1) * sizeof(*keywords));
 		values = pg_malloc0((argcount + 1) * sizeof(*values));
 
-		/*
-		 * Set dbname here already, so it can be overridden by a dbname in the
-		 * connection string.
-		 */
-		keywords[i] = "dbname";
-		values[i] = "replication";
-		i++;
-
 		for (conn_opt = conn_opts; conn_opt->keyword != NULL; conn_opt++)
 		{
 			if (conn_opt->val != NULL && conn_opt->val[0] != '\0')


RE: Have pg_basebackup write "dbname" in "primary_conninfo"?

2024-02-20 Thread Hayato Kuroda (Fujitsu)
Dear Robert,

> > Just FYI - here is an extreme case. And note that I have applied proposed 
> > patch.
> >
> > When `pg_basebackup -D data_N2 -R` is used:
> > ```
> > primary_conninfo = 'user=hayato ... dbname=hayato ...
> > ```
> >
> > But when `pg_basebackup -d "" -D data_N2 -R` is used:
> > ```
> > primary_conninfo = 'user=hayato ... dbname=replication
> > ```
> 
> It seems like maybe somebody should look into why this is happening,
> and perhaps fix it.

I think this is caused by the below part [1] in GetConnection().

If both dbname and connection_string are NULL, we will enter the else part
and NULL would be substituted - the {"dbname", NULL} key-value pair is
generated only here.

Then, in PQconnectdbParams()->PQconnectStartParams()->pqConnectOptions2(),
the strange pair would be found and the dbname replaced with the username [2].

I think if both the connection string and the dbname are NULL, it should be
considered a physical replication connection. Here is a patch to fix it.
After applying it, both examples below output "dbname=replication".
You can also confirm.

```
pg_basebackup -D data_N2 -U postgres
pg_basebackup -D data_N2 -R -v

-> primary_conninfo = 'user=postgres ... dbname=replication ...
```

[1]
```
else
{
keywords = pg_malloc0((argcount + 1) * sizeof(*keywords));
values = pg_malloc0((argcount + 1) * sizeof(*values));
keywords[i] = "dbname";
values[i] = dbname;
i++;
}
```

[2]
```
/*
 * If database name was not given, default it to equal user name
 */
if (conn->dbName == NULL || conn->dbName[0] == '\0')
{
free(conn->dbName);
conn->dbName = strdup(conn->pguser);
if (!conn->dbName)
goto oom_error;
}
```


Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/

diff --git a/src/bin/pg_basebackup/streamutil.c b/src/bin/pg_basebackup/streamutil.c
index 56d1b15951..aea1ccba36 100644
--- a/src/bin/pg_basebackup/streamutil.c
+++ b/src/bin/pg_basebackup/streamutil.c
@@ -119,7 +119,7 @@ GetConnection(void)
 		keywords = pg_malloc0((argcount + 1) * sizeof(*keywords));
 		values = pg_malloc0((argcount + 1) * sizeof(*values));
 		keywords[i] = "dbname";
-		values[i] = dbname;
+		values[i] = dbname == NULL ? "replication" : dbname;
 		i++;
 	}
 


Re: Patch: Add parse_type Function

2024-02-20 Thread Erik Wienhold
On 2024-02-20 15:44 +0100, David E. Wheeler wrote:
> I’ve tweaked the comments and their order in v7, attached.
> 
> This goes back to the discussion of the error raising of
> to_regtype[1], so I’ve incorporated the patch from that thread into
> this patch, and set up the docs for to_regtypemod() with similar
> information.

Makes sense.

> [1] 
> https://www.postgresql.org/message-id/flat/57E1FDDC-5A38-452D-82D7-A44DA2E13862%40justatheory.com#1ae0b11634bc33c7ad3cd728e43d504e

> -   
> +role="func_signature">
>  
>   format_type
>  
> @@ -25462,7 +25462,7 @@ SELECT collation for ('foo' COLLATE "de_DE");
>
>  
>
> -   
> +role="func_signature">

The references are printed as "???" right now.  Can be fixed with
xreflabel="format_type" and xreflabel="to_regtype" on those 
elements.

> +to_regtypemod ( type 
> text )

The docs show parameter name "type" but pg_proc.data does not define
proargnames.  So the to_regtypemod() cannot be called using named
notation:

test=> select to_regtypemod(type => 'int'::text);
ERROR:  function to_regtypemod(type => text) does not exist

> +Parses a string of text, extracts a potential type name from it, and
> +translates its typmod, the modifier for the type, if any. Failure to

s/typmod, the modifier for the type/type modifier/

Because format_type() already uses "type modifier" and I think it helps
searchability to use the same wording.

-- 
Erik




Re: [PoC] Federated Authn/z with OAUTHBEARER

2024-02-20 Thread Jacob Champion
Hi all,

v14 rebases over latest and fixes a warning when assertions are
disabled. 0006 is temporary and hacks past a couple of issues I have
not yet root caused -- one of which makes me wonder if 0001 needs to
be considered alongside the recent pg_combinebackup and incremental
JSON work...?

--Jacob
1:  e7f87668ab ! 1:  b6e8358f44 common/jsonapi: support FRONTEND clients
@@ Commit message
 We can now partially revert b44669b2ca, now that json_errdetail() works
 correctly.
 
- ## src/bin/pg_verifybackup/parse_manifest.c ##
-@@ src/bin/pg_verifybackup/parse_manifest.c: 
json_parse_manifest(JsonManifestParseContext *context, char *buffer,
-   /* Run the actual JSON parser. */
-   json_error = pg_parse_json(lex, );
-   if (json_error != JSON_SUCCESS)
--  json_manifest_parse_failure(context, "parsing failed");
-+  json_manifest_parse_failure(context, json_errdetail(json_error, 
lex));
-   if (parse.state != JM_EXPECT_EOF)
-   json_manifest_parse_failure(context, "manifest ended 
unexpectedly");
- 
-
  ## src/bin/pg_verifybackup/t/005_bad_manifest.pl ##
 @@ src/bin/pg_verifybackup/t/005_bad_manifest.pl: use Test::More;
  my $tempdir = PostgreSQL::Test::Utils::tempdir;
@@ src/common/Makefile: override CPPFLAGS += 
-DVAL_LDFLAGS_EX="\"$(LDFLAGS_EX)\""
 +override CPPFLAGS := -DFRONTEND -I. -I$(top_srcdir)/src/common 
-I$(libpq_srcdir) $(CPPFLAGS)
  LIBS += $(PTHREAD_LIBS)
  
- # If you add objects here, see also src/tools/msvc/Mkvcbuild.pm
+ OBJS_COMMON = \
 
  ## src/common/jsonapi.c ##
 @@
@@ src/common/meson.build: foreach name, opts : pgcommon_variants
  'dependencies': opts['dependencies'] + [ssl],
}
 
+ ## src/common/parse_manifest.c ##
+@@ src/common/parse_manifest.c: 
json_parse_manifest(JsonManifestParseContext *context, char *buffer,
+   /* Run the actual JSON parser. */
+   json_error = pg_parse_json(lex, );
+   if (json_error != JSON_SUCCESS)
+-  json_manifest_parse_failure(context, "parsing failed");
++  json_manifest_parse_failure(context, json_errdetail(json_error, 
lex));
+   if (parse.state != JM_EXPECT_EOF)
+   json_manifest_parse_failure(context, "manifest ended 
unexpectedly");
+ 
+
  ## src/include/common/jsonapi.h ##
 @@
  #ifndef JSONAPI_H
2:  0ab79a168f ! 2:  5fa08a8033 libpq: add OAUTHBEARER SASL mechanism
@@ Commit message
 - fix libcurl initialization thread-safety
 - harden the libcurl flow implementation
 - figure out pgsocket/int difference on Windows
+- fix intermittent failure in the cleanup callback tests (race
+  condition?)
 - ...and more.
 
  ## configure ##
@@ src/interfaces/libpq/Makefile: endif
  endif
 
  ## src/interfaces/libpq/exports.txt ##
-@@ src/interfaces/libpq/exports.txt: PQclosePrepared   188
- PQclosePortal 189
- PQsendClosePrepared   190
+@@ src/interfaces/libpq/exports.txt: PQsendClosePrepared   190
  PQsendClosePortal 191
-+PQsetAuthDataHook 192
-+PQgetAuthDataHook 193
-+PQdefaultAuthDataHook 194
+ PQchangePassword  192
+ PQsendPipelineSync193
++PQsetAuthDataHook 194
++PQgetAuthDataHook 195
++PQdefaultAuthDataHook 196
 
  ## src/interfaces/libpq/fe-auth-oauth-curl.c (new) ##
 @@
@@ src/interfaces/libpq/fe-auth-oauth-curl.c (new)
 +   */
 +  cnt = sscanf(interval_str, "%f", );
 +
-+  Assert(cnt == 1); /* otherwise the lexer screwed up */
++  if (cnt != 1)
++  {
++  /*
++   * Either the lexer screwed up or our assumption above isn't 
true, and
++   * either way a developer needs to take a look.
++   */
++  Assert(cnt == 1);
++  return 1; /* don't fall through in release builds */
++  }
++
 +  parsed = ceilf(parsed);
 +
 +  if (parsed < 1)
@@ src/interfaces/libpq/fe-auth.c: pg_fe_sendauth(AuthRequest areq, int 
payloadlen,
{
/* Use this message if pg_SASL_continue didn't 
supply one */
if (conn->errorMessage.len == oldmsglen)
-@@ src/interfaces/libpq/fe-auth.c: PQencryptPasswordConn(PGconn *conn, 
const char *passwd, const char *user,
- 
-   return crypt_pwd;
+@@ src/interfaces/libpq/fe-auth.c: PQchangePassword(PGconn *conn, const 
char *user, const char *passwd)
+   }
+   }
  }
 +
 +PQauthDataHook_type PQauthDataHook = PQdefaultAuthDataHook;
@@ src/interfaces/libpq/fe-connect.c: keep_going:   
/* We will come back to here
case CONNECTION_AUTH_OK:
{
  

Re: WIP Incremental JSON Parser

2024-02-20 Thread Jacob Champion
On Tue, Feb 20, 2024 at 2:10 PM Andrew Dunstan  wrote:
> Well, that didn't help a lot, but meanwhile the CFBot seems to have
> decided in the last few days that it's now happy, so full steam ahead! ;-)

I haven't been able to track down the root cause yet, but I am able to
reproduce the failure consistently on my machine:

ERROR:  could not parse backup manifest: file size is not an
integer: '16384, "Last-Modified": "2024-02-20 23:07:43 GMT",
"Checksum-Algorithm": "CRC32C", "Checksum": "66c829cd" },
{ "Path": "base/4/2606", "Size": 24576, "Last-Modified": "20

Full log is attached.

--Jacob


regress_log_003_timeline.gz
Description: GNU Zip compressed data


Re: partitioning and identity column

2024-02-20 Thread Alexander Korotkov
On Tue, Feb 20, 2024 at 8:00 AM Alexander Lakhin  wrote:
> 20.02.2024 07:57, Ashutosh Bapat wrote:
> >> Could you please name functions, which you suspect, for me to recheck them?
> >> Perhaps we should consider fixing all of such functions, in light of
> >> b0f7dd915 and d57b7cc33...
> > Looks like the second commit has fixed all other places I knew except
> > Identity related functions. So worth fixing identity related functions
> > too. I see
> > dropconstraint_internal() has two calls to check_stack_depth() back to
> > back. The second one is not needed?
>
> Yeah, that's funny. It looks like such a double protection emerged
> because Alvaro protected the function (in b0f7dd915), which was waiting for
> adding check_stack_depth() in the other thread (resulted in d57b7cc33).
>
> Thank you for spending time on this!

Thank you, I removed the second check.

--
Regards,
Alexander Korotkov




Re: Lessons from using mmap()

2024-02-20 Thread Alexander Korotkov
Hi!

On Wed, Feb 21, 2024 at 1:21 AM Bruce Momjian  wrote:
> I think we have come to the same conclusion in the past, but I thought
> it would be good to share someone else's research, and it might be
> helpful if we ever revisit this idea.

I read this blog post before.  In my personal opinion, it is an example
of awful speculation.  If your DBMS needs WAL, then your first
question about using mmap() for your data is how to enforce the WAL
to be written really ahead of the data pages.  As far as I know there
is no solution for that with mmap() (or at least no solution with
acceptable portability).  The blog post advertises LMDB, and LMDB is really good.
But LMDB uses copy-on-write instead of WAL to ensure durability.  And
it supports only a single writer at a time.  This is just another
niche of solutions!
The blog post gives the impression that the developers of non-mmap()
DBMSes are idiots who didn't manage to use mmap() properly.  We're
not idiots, we develop DBMSes with high concurrency! :-)

--
Regards,
Alexander Korotkov




Re: Add bump memory context type and use it for tuplesorts

2024-02-20 Thread David Rowley
On Wed, 21 Feb 2024 at 00:02, Matthias van de Meent
 wrote:
>
> On Tue, 20 Feb 2024 at 11:02, David Rowley  wrote:
> > On Fri, 26 Jan 2024 at 01:29, Matthias van de Meent
> >  wrote:
> > > >> +allocSize = MAXALIGN(sizeof(BumpContext)) + Bump_BLOCKHDRSZ +
> > > >> +if (minContextSize != 0)
> > > >> +allocSize = Max(allocSize, minContextSize);
> > > >> +else
> > > >> +allocSize = Max(allocSize, initBlockSize);
> > >
> > > Shouldn't this be the following, considering the meaning of 
> > > "initBlockSize"?
> >
> > No, we want to make the blocksize exactly initBlockSize if we can. Not
> > initBlockSize plus all the header stuff.  We do it that way for all
> > the other contexts and I agree that it's a good idea as it keeps the
> > malloc request sizes powers of 2.
>
> One part of the reason of my comment was that initBlockSize was
> ignored in favour of minContextSize if that was configured, regardless
> of the value of initBlockSize. Is it deliberately ignored when
> minContextSize is set?

Ok, it's a good question.  It's to allow finer-grained control over
the initial block: it lets the first block be a fixed, given size
without affecting the size we double for the subsequent blocks.

e.g. BumpContextCreate(64*1024, 8*1024, 1024*1024);

would make the first block 64K and the next block 16K, followed by
32K, 64K ... 1MB.

Whereas, BumpContextCreate(0, 8*1024, 1024*1024) will start at 8K, 16K ... 1MB.

It seems useful as you might have a good idea of how much memory the
common case has and want to do that without having to allocate
subsequent blocks, but if slightly more memory is required sometimes,
you probably don't want the next malloc to be double the common size,
especially if the common size is large.

Apart from slab.c, this is how all the other contexts work.  It seems
best to keep this and not to go against the grain on this as there's
more to consider if we opt to change the context types of existing
contexts.

David




Re: Change the bool member of the Query structure to bits

2024-02-20 Thread Quan Zongliang




On 2024/2/20 19:18, Tomas Vondra wrote:

On 2/20/24 11:11, Quan Zongliang wrote:


Sorry. I forgot to save a file. This is the latest.

On 2024/2/20 18:07, Quan Zongliang wrote:


The Query structure has an increasing number of bool attributes. This
is likely to increase in the future. And they have the same
properties. Wouldn't it be better to store them in bits? Common
statements don't use them, so they have little impact. This also saves
memory space.



Hi,

Are we really adding bools to Query that often? A bit of git-blame says
it's usually multiple years to add a single new flag, which is what I'd
expect. I doubt that'll change.

As for the memory savings, can you quantify how much memory this would save?

I highly doubt that's actually true (or at least measurable). The Query
struct has ~256B, and the patch cuts that to ~232B. But we allocate stuff
in powers of 2, so we'll allocate a 256B chunk anyway. And we allocate
very few of those objects anyway ...

regards


That makes sense. Withdraw.





Re: Change the bool member of the Query structure to bits

2024-02-20 Thread Quan Zongliang




On 2024/2/20 23:45, Tom Lane wrote:

Quan Zongliang  writes:

The Query structure has an increasing number of bool attributes. This is
likely to increase in the future. And they have the same properties.
Wouldn't it be better to store them in bits? Common statements don't use
them, so they have little impact. This also saves memory space.


I'm -1 on that, for three reasons:

* The amount of space saved is quite negligible.  If queries had many
Query structs then it could matter, but they don't.

* This causes enough code churn to create a headache for back-patching.

* This'll completely destroy the readability of these flags in
pprint output.

I'm not greatly in love with the macro layer you propose, either,
but those details don't matter because I think we should just
leave well enough alone.

regards, tom lane

I get it. Withdraw.





Re: Add lookup table for replication slot invalidation causes

2024-02-20 Thread Michael Paquier
On Tue, Feb 20, 2024 at 05:53:03PM +0100, Jelte Fennema-Nio wrote:
> Seems like a good improvement overall. But I'd prefer the definition
> of the lookup table to use this syntax:
> 
> const char *const SlotInvalidationCauses[] = {
> [RS_INVAL_NONE] = "none",
> [RS_INVAL_WAL_REMOVED] = "wal_removed",
> [RS_INVAL_HORIZON] = "rows_removed",
> [RS_INVAL_WAL_LEVEL] = "wal_level_sufficient",
> };

+1.

> Regarding the actual patch:
> 
> -   Assert(conflict_reason);
> 
> Probably we should keep this Assert. As well as the Assert(0)

The Assert(0) at the end of the routine, likely so.  I don't see a
huge point in the assert on conflict_reason, as we'd crash anyway on
the strcmp, no?

> +   for (cause = RS_INVAL_NONE; cause <= RS_INVAL_MAX_CAUSES; cause++)
> 
> Strictly speaking this is a slight change in behaviour, since now
> "none" is also parsed. That seems fine to me though.

Yep.  This does not strike me as an issue.  We only use
GetSlotInvalidationCause() in synchronize_slots(), mapping to NULL in
the case of "none".

Agreed that this is an improvement.

+/* Maximum number of invalidation causes */
+#defineRS_INVAL_MAX_CAUSES RS_INVAL_WAL_LEVEL

There is no need to add that to slot.h: it is only used in slot.c.
--
Michael


signature.asc
Description: PGP signature


Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

2024-02-20 Thread Michail Nikolaev
Hello!

> I think the best way for this to work would be an index method that
> exclusively stores TIDs, and of which we can quickly determine new
> tuples, too. I was thinking about something like GIN's format, but
> using (generation number, tid) instead of ([colno, colvalue], tid) as
> key data for the internal trees, and would be unlogged (because the
> data wouldn't have to survive a crash)

Yeah, this seems to be a reasonable approach, but there are some
doubts related to it - it needs a new index type as well as unlogged
indexes to be introduced - this may make the patch too invasive to be
merged. Also, some way to remove the index from the catalog in case of
a crash may be required.

A few more thoughts:
* it is possible to go without a generation number - we may provide a
way to do some kind of fast index lookup (by TID) directly during the
second table scan phase.
* one more option is to maintain a Tuplesort (instead of an index)
with TIDs as a changelog and merge it with the index snapshot after
taking a new visibility snapshot. But it is not clear how to share the
same Tuplesort with multiple inserting backends.
* crazy idea - what about doing the scan in the index we are
building? We have the tuple, so we have all the data indexed in the
index. We may try to do an index scan using that data to get all
tuples and find the one with our TID :) Yes, in some cases it may be
too bad because of the huge number of TIDs we need to scan + also
btree copies the whole page even though we need a single item. But
some additional index method may help - feels like something related
to uniqueness (but that is btree-only anyway).

Thanks,
Mikhail.




Re: Add bump memory context type and use it for tuplesorts

2024-02-20 Thread David Rowley
On Tue, 20 Feb 2024 at 23:52, Matthias van de Meent
 wrote:
> What I meant was that
>
> > (char *) block + Bump_BLOCKHDRSZ
> vs
> >  ((char *) block) + Bump_BLOCKHDRSZ
>
> , when combined with my little experience with pointer addition and
> precedence, and a lack of compiler at the ready at that point in time,
> I was afraid that "(char *) block + Bump_BLOCKHDRSZ" would be parsed
> as "(char *) (block + Bump_BLOCKHDRSZ)", which would get different
> offsets across the two statements.
> Godbolt has since helped me understand that both are equivalent.

I failed to notice this.  I've made them the same regardless to
prevent future questions from being raised about the discrepancy
between the two.

David




Re: Injection points: some tools to wait and wake

2024-02-20 Thread Michael Paquier
On Tue, Feb 20, 2024 at 05:32:38PM +0300, Andrey M. Borodin wrote:
> I will try to simplify test to 2-step, but it would be much easier
> to implement if injection points could be awaken independently. 

I don't mind implementing something that wakes only a given point
name, that's just more state data to track in shmem for the module.

> No, "prepared injection point" is not about wait\wake logic. It's
> about having injection point in critical section. 
> Normal injection point will pstrdup(name) and fail. In [0] I need a
> test that waits after multixact generation before WAL-logging
> it. It's only possible in a critical section. 

It took me some time to understand what you mean here.  You are
referring to the allocations done in load_external_function() ->
expand_dynamic_library_name() -> substitute_libpath_macro(), which
is something that has to happen the first time a callback is loaded
into the local cache of a process.  So what you want to achieve is to
preload the callback in its cache in a first step without running it,
then be able to run it, so your Prepare[d] layer is just a way to
split InjectionPointRun() into two.  Fancy use case.  You could
disable the critical section around the INJECTION_POINT() as one
solution, though having something to pre-warm the local cache for a
point name and avoid the allocations done in the external load would
be a second way to achieve that.

"Prepare" is not the best term I would have used; perhaps just have
one INJECTION_POINT_PRELOAD() macro to warm up the cache outside the
critical section with a wrapper routine?  You could then leave
InjectionPointRun() as it is now.  Having a check on CritSectionCount
in the injection point internals to temporarily disable a critical
section would not feel right, as this is used as a safety check
everywhere else.
--
Michael


signature.asc
Description: PGP signature


Lessons from using mmap()

2024-02-20 Thread Bruce Momjian
This blog, and the blogs it links to, explains the complexities of using
mmap() for database data/index file I/O.


https://www.symas.com/post/are-you-sure-you-want-to-use-mmap-in-your-dbms

The blog starts by stating:

There are, however, severe correctness and performance issues
with mmap that are not immediately apparent. Such problems make it
difficult, if not impossible, to use mmap correctly and efficiently
in a modern DBMS.

The remainder of the article makes various arguments that such mmap use
is _possible_, but ends with a reasonable conclusion:

Ultimately, the answer to the question "are you sure you want
to use mmap in your DBMS?" should be rephrased - do you really
want to reimplement everything the OS already does for you? Do
you really believe you can do it correctly, better than the OS
already does? The DBMS world is littered with projects whose
authors believed, incorrectly, that they could.

I think we have come to the same conclusion in the past, but I thought
it would be good to share someone else's research, and it might be
helpful if we ever revisit this idea.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: Injection points: some tools to wait and wake

2024-02-20 Thread Michael Paquier
On Tue, Feb 20, 2024 at 03:55:08PM +, Bertrand Drouvot wrote:
> Okay, makes sense to keep this as it is as a "template" in case more stuff is
> added. 
> 
> +   /* Counter advancing when injection_points_wake() is called */
> +   int wait_counts;
> 
> In that case what about using an unsigned instead? (Nit)

uint32.  Sure.

> 1 ===
> 
> +void
> +injection_wait(const char *name)
> 
> Looks like name is not used in the function. I guess the reason it is a 
> parameter
> is because that's the way the callback function is being called in
> InjectionPointRun()?

Right.  The callback has to define this argument.

> 2 ===
> 
> +PG_FUNCTION_INFO_V1(injection_points_wake);
> +Datum
> +injection_points_wake(PG_FUNCTION_ARGS)
> +{
> 
> I think that This function will wake up all the "wait" injection points.
> Would that make sense to implement some filtering based on the name? That 
> could
> be useful for tests that would need multiple wait injection points and that 
> want
> to wake them up "sequentially".
> 
> We could think about it if there is such a need in the future though.

Well, both you and Andrey are asking for it now, so let's do it.  The
implementation is simple:
- Store in InjectionPointSharedState an array of wait_counts and an
array of names.  There is only one condition variable.
- When a point wants to wait, it takes the spinlock and looks within
the array of names until it finds a free slot, adds its name into the
array to reserve a wait counter at the same position, releases the
spinlock.  Then it loops on the condition variable for an update of
the counter it has reserved.  It is possible to make something more
efficient, but at a small size it would not really matter.
- The wakeup takes a point name in argument, acquires the spinlock,
and checks if it can find the point into the array, pinpoints the
location of the counter to update and updates it.  Then it broadcasts
the change.
- The wait loop checks its counter, leaves its loop, cancels the
sleep, takes the spinlock to unregister from the array, and leaves.

I would just hardcode the number of points that can wait, say 5 of
them tracked in shmem?  Does that look like what you are looking at?

> +# Register a injection point on the standby so as the follow-up
> 
> typo: "an injection"?

Oops.  Fixed locally.

> +for (my $i = 0; $i < 3000; $i++)
> +{
> 
> is 3000 due to?:
> 
> +checkpoint_timeout = 30s
> 
> If so, would that make sense to reduce the value for both?

That had better be based on PostgreSQL::Test::Utils::timeout_default,
actually, as in something like:
foreach my $i (0 .. 10 * $PostgreSQL::Test::Utils::timeout_default)
--
Michael


signature.asc
Description: PGP signature


Re: System username in pg_stat_activity

2024-02-20 Thread Magnus Hagander
On Fri, Feb 16, 2024 at 9:45 PM Andres Freund  wrote:
>
> Hi,
>
> On 2024-02-16 15:22:16 -0500, Tom Lane wrote:
> > Magnus Hagander  writes:
> > > I mean, we could split it into more than one view. But adding a new
> > > view for every new thing we want to show is also not very good from
> > > either a usability or performance perspective.  So where would we put
> > > it?
> >
> > It'd have to be a new view with a row per session, showing static
> > (or at least mostly static?) properties of the session.
>
> Yep.
>
>
> > Could we move some existing fields of pg_stat_activity into such a
> > view?
>
> I'd suspect that at least some of
>  - leader_pid
>  - datid
>  - datname
>  - usesysid
>  - usename
>  - backend_start
>  - client_addr
>  - client_hostname
>  - client_port
>  - backend_type
>
> could be moved. Whether's worth breaking existing queries, I don't quite know.

I think that's the big question. I think if we move all of those we
will break every single monitoring tool out there for postgres...
That's a pretty hefty price.


> One option would be to not return (some) of them from pg_stat_get_activity(),
> but add them to the view in a way that the planner can elide the reference.

Without having any numbers, I would think that the join to pg_authid
for example is likely more costly than returning all the other fields.
But that one does get eliminated as long as one doesn't query that
column. But if we make more things "joined in from the view", isn't
that likely to just make it more expensive in most cases?


> > I'm not sure that this is worth the trouble TBH.  If it can be shown
> > that pulling a few fields out of pg_stat_activity actually does make
> > for a useful speedup, then maybe OK ... but Andres hasn't provided
> > any evidence that there's a measurable issue.
>
> If I thought that the two columns proposed here were all that we wanted to
> add, I'd not be worried. But there have been quite a few other fields
> proposed, e.g. tracking idle/active time on a per-connection granularity.
>
> We even already have a patch to add pg_stat_session
> https://commitfest.postgresql.org/47/3405/

In a way, that's yet another different type of values though -- it
contains accumulated stats. So we really have 3 types -- "info" that's
not really stats (username, etc), "current state" (query, wait events,
state) and "accumulated stats" (counters since start). If we don't want
to combine them all, we should perhaps not combine any and actually
have 3 views?

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/




Re: System username in pg_stat_activity

2024-02-20 Thread Magnus Hagander
On Fri, Feb 16, 2024 at 9:31 PM Magnus Hagander  wrote:
>
> On Fri, Feb 16, 2024 at 9:20 PM Andres Freund  wrote:
> >
> > Hi,
> >
> > On 2024-02-16 20:57:59 +0100, Magnus Hagander wrote:
> > > On Fri, Feb 16, 2024 at 8:41 PM Andres Freund  wrote:
> > > > On 2024-01-10 12:46:34 +0100, Magnus Hagander wrote:
> > > > > The attached patch adds a column "authuser" to pg_stat_activity which
> > > > > contains the username of the externally authenticated user, being the
> > > > > same value as the SYSTEM_USER keyword returns in a backend.
> > > >
> > > > I continue to think that it's a bad idea to make pg_stat_activity ever 
> > > > wider
> > > > with columns that do not actually describe properties that change 
> > > > across the
> > > > course of a session.  Yes, there's the argument that that ship has 
> > > > sailed, but
> > > > I don't think that's a good reason to continue ever further down that 
> > > > road.
> > > >
> > > > It's not just a usability issue, it also makes it more expensive to 
> > > > query
> > > > pg_stat_activity. This is of course more pronounced with textual 
> > > > columns than
> > > > with integer ones.
> > >
> > > That's a fair point, but I do think that has in most ways already sailed, 
> > > yes.
> > >
> > > I mean, we could split it into more than one view. But adding a new
> > > view for every new thing we want to show is also not very good from
> > > either a usability or performance perspective.  So where would we put
> > > it?
> >
> > I think we should group new properties that don't change over the course of 
> > a
> > session ([1]) in a new view (e.g. pg_stat_session). I don't think we need 
> > one
> > view per property, but I do think it makes sense to split information that
> > changes very frequently (like most pg_stat_activity contents) from 
> > information
> > that doesn't (like auth_method, auth_identity).
>
> That would make sense in many ways, but ends up with "other level of
> annoyances". E.g. the database name and oid don't change, but would we
> want to move those out of pg_stat_activity? Same for username? Don't
> we just end up in a grayzone about what belongs where?
>
> Also - were you envisioning just another view, or actually replacing
> the pg_stat_get_activity() part? As in where do you think the cost
> comes?

Andres -- did you spot this question in the middle or did it get lost
in the flurry of others? :)

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/




Re: Shared detoast Datum proposal

2024-02-20 Thread Andy Fan

Tomas Vondra  writes:

Hi Tomas, 
>
> I took a quick look on this thread/patch today, so let me share a couple
> initial thoughts. I may not have a particularly coherent/consistent
> opinion on the patch or what would be a better way to do this yet, but
> perhaps it'll start a discussion ...

Thank you for this!

>
> The goal of the patch (as I understand it) is essentially to cache
> detoasted values, so that the value does not need to be detoasted
> repeatedly in different parts of the plan. I think that's perfectly
> sensible and worthwhile goal - detoasting is not cheap, and complex
> plans may easily spend a lot of time on it.

exactly.

>
> That being said, the approach seems somewhat invasive, and touching
> parts I wouldn't have expected to need a change to implement this. For
> example, I certainly would not have guessed the patch to need changes in
> createplan.c or setrefs.c.
>
> Perhaps it really needs to do these things, but neither the thread nor
> the comments are very enlightening as for why it's needed :-( In many
> cases I can guess, but I'm not sure my guess is correct. And comments in
> code generally describe what's happening locally / next line, not the
> bigger picture & why it's happening.

There was an explanation at [1], but it is probably too high-level.
Writing proper comments is challenging for me, but I am pretty happy
to try more. At the end of this writing I explain the data workflow,
which I feel would be useful for reviewers.

> IIUC we walk the plan to decide which Vars should be detoasted (and
> cached) once, and which cases should not do that because it'd inflate
> the amount of data we need to keep in a Sort, Hash etc.

Exactly.

> Not sure if
> there's a better way to do this - it depends on what happens in the
> upper parts of the plan, so we can't decide while building the paths.

I'd say I did this intentionally. Deciding such things in paths would
be more expensive than at the create_plan stage IMO.

> But maybe we could decide this while transforming the paths into a plan?
> (I realize the JIT thread nearby needs to do something like that in
> create_plan, and in that one I suggested maybe walking the plan would be
> a better approach, so I may be contradicting myself a little bit.).

I think that's pretty similar to what I'm doing now, just that I do it
*just after* create_plan. This is because create_plan doesn't always
transform the path into a plan in a top-down manner; the known
exception is create_mergejoin_plan. So I have to do the walk just
after create_plan is done.

In the create_mergejoin_plan, the Sort node is created *after* the
subplan for the Sort is created.

/* Recursively process the path tree, demanding the correct tlist result */
plan = create_plan_recurse(root, best_path, CP_EXACT_TLIST);

+ /*
+  * After the plan tree is built completed, we start to walk for which
+  * expressions should not used the shared-detoast feature.
+  */
+ set_plan_forbid_pre_detoast_vars_recurse(plan, NIL);

>
> In any case, set_plan_forbid_pre_detoast_vars_recurse should probably
> explain the overall strategy / reasoning in a bit more detail. Maybe
> it's somewhere in this thread, but that's not great for reviewers.

A lesson learnt, thanks.

A revised version of the comments from the latest patch:

/*
 * set_plan_forbid_pre_detoast_vars_recurse
 *   Walk the Plan tree in a top-down manner to gather the Vars which
 * should stay as small as possible and record them in
 * Plan.forbid_pre_detoast_vars.
 *
 * plan: the plan node to walk right now.
 * small_tlist: a list of nodes which the subplan should provide as
 * small as possible.
 */
static void
set_plan_forbid_pre_detoast_vars_recurse(Plan *plan, List *small_tlist)

>
> Similar for the setrefs.c changes. It seems a bit suspicious to piggy
> back the new code into fix_scan_expr/fix_scan_list and similar code.
> Those functions have a pretty clearly defined purpose, not sure we want
> to also extend them to also deal with this new thing. (FWIW I'd 100%
> do it this way if I hacked on a PoC of this, to make it work. But I'm
> not sure it's the right solution.)

The main reason for doing so is that I want to share the same walking
effort as fix_scan_expr; otherwise I have to walk the plan for
every expression again. I thought of this as a best practice in the
past and thought we could treat the pre_detoast_attrs as a valuable
side effect :(

> I don't know what to thing about the Bitset - maybe it's necessary, but
> how would I know? I don't have any way to measure the benefits, because
> the 0002 patch uses it right away. 

A revised version of the comments from the latest patch; the second
paragraph explains this decision:

/*
 * The attributes whose values are the detoasted versions in tts_values[*];
 * if so, this memory needs some extra clean-up. This memory can't be put
 * into ecxt_per_tuple_memory since many of these values need a longer
 * life span, for example the

Re: Running the fdw test from the terminal crashes into the core-dump

2024-02-20 Thread Tom Lane
Alvaro Herrera  writes:
> The real problem is that a MERGE ... DO NOTHING action reports that no
> permissions need to be checked on the target relation, which is not a
> problem when there are other actions in the MERGE command since they add
> privs to check.  When DO NOTHING is the only action, the added assert in
> a61b1f748 is triggered.

> So, this means we can fix this by simply requiring ACL_SELECT privileges
> on a DO NOTHING action.  We don't need to request specific privileges on
> any particular column (perminfo->selectedCols continues to be the empty
> set) -- which means that any role that has privileges on *any* column
> would get a pass.  If you're doing MERGE with any other action besides
> DO NOTHING, you already have privileges on at least some column, so
> adding ACL_SELECT breaks nothing.

LGTM.

regards, tom lane




Re: Running the fdw test from the terminal crashes into the core-dump

2024-02-20 Thread Alvaro Herrera
On 2024-Feb-18, Alvaro Herrera wrote:

> On 2024-Feb-18, Tom Lane wrote:
> 
> > So I'd blame this on faulty handling of the zero-partitions case in
> > the RTEPermissionInfo refactoring.  (I didn't bisect to prove that
> > a61b1f748 is exactly where it broke, but I do see that the query
> > successfully does nothing in v15.)
> 
> You're right, this is the commit that broke it.  It's unclear to me if
> Amit is available to look at it, so I'll give this a look tomorrow if he
> isn't.

OK, so it turns out that the bug is older than that -- it was actually
introduced with MERGE itself (7103ebb7aae8) and has nothing to do with
partitioning or RTEPermissionInfo; commit a61b1f748 is only related
because it added the Assert() that barfs when there are no privileges to
check.

The real problem is that a MERGE ... DO NOTHING action reports that no
permissions need to be checked on the target relation, which is not a
problem when there are other actions in the MERGE command since they add
privs to check.  When DO NOTHING is the only action, the added assert in
a61b1f748 is triggered.

So, this means we can fix this by simply requiring ACL_SELECT privileges
on a DO NOTHING action.  We don't need to request specific privileges on
any particular column (perminfo->selectedCols continues to be the empty
set) -- which means that any role that has privileges on *any* column
would get a pass.  If you're doing MERGE with any other action besides
DO NOTHING, you already have privileges on at least some column, so
adding ACL_SELECT breaks nothing.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Pido que me den el Nobel por razones humanitarias" (Nicanor Parra)
>From 5f2eb9992f40933aa58f8bb0e0bed7ebd99f2952 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Tue, 20 Feb 2024 18:30:37 +0100
Subject: [PATCH] MERGE ... DO NOTHING: require SELECT privileges

Verify that a user running MERGE with a DO NOTHING clause has
privileges to read the table, even if no columns are referenced.  Such
privileges were already required if the ON clause or any of the WHEN
conditions referenced any column at all, so there's no functional change
in practice.

This change fixes an assertion failure in the case where no column is
referenced by the command and the WHEN clauses are all DO NOTHING.

Backpatch to 15, where MERGE was introduced.

Reported-by: Alena Rybakina 
Discussion: https://postgr.es/m/4d65a385-7efa-4436-a825-0869f89d9...@postgrespro.ru
---
 src/backend/parser/parse_merge.c|  7 ++-
 src/test/regress/expected/merge.out | 11 ++-
 src/test/regress/sql/merge.sql  | 11 +++
 3 files changed, 27 insertions(+), 2 deletions(-)

diff --git a/src/backend/parser/parse_merge.c b/src/backend/parser/parse_merge.c
index 5f6a683ab9..73f7a48b3c 100644
--- a/src/backend/parser/parse_merge.c
+++ b/src/backend/parser/parse_merge.c
@@ -133,7 +133,11 @@ transformMergeStmt(ParseState *pstate, MergeStmt *stmt)
 		int			when_type = (mergeWhenClause->matched ? 0 : 1);
 
 		/*
-		 * Collect action types so we can check target permissions
+		 * Collect permissions to check, according to action types. We require
+		 * SELECT privileges for DO NOTHING because it'd be irregular to have
+		 * a target relation with zero privileges checked, in case DO NOTHING
+		 * is the only action.  There's no damage from that: any meaningful
+		 * MERGE command requires at least some access to the table anyway.
 		 */
 		switch (mergeWhenClause->commandType)
 		{
@@ -147,6 +151,7 @@ transformMergeStmt(ParseState *pstate, MergeStmt *stmt)
 targetPerms |= ACL_DELETE;
 break;
 			case CMD_NOTHING:
+targetPerms |= ACL_SELECT;
 break;
 			default:
 elog(ERROR, "unknown action in MERGE WHEN clause");
diff --git a/src/test/regress/expected/merge.out b/src/test/regress/expected/merge.out
index f87905fabd..125d7b6bca 100644
--- a/src/test/regress/expected/merge.out
+++ b/src/test/regress/expected/merge.out
@@ -3,6 +3,7 @@
 --
 CREATE USER regress_merge_privs;
 CREATE USER regress_merge_no_privs;
+CREATE USER regress_merge_none;
 DROP TABLE IF EXISTS target;
 NOTICE:  table "target" does not exist, skipping
 DROP TABLE IF EXISTS source;
@@ -159,12 +160,19 @@ ERROR:  cannot execute MERGE on relation "mv"
 DETAIL:  This operation is not supported for materialized views.
 DROP MATERIALIZED VIEW mv;
 -- permissions
+SET SESSION AUTHORIZATION regress_merge_none;
+MERGE INTO target
+USING (SELECT 1)
+ON true
+WHEN MATCHED THEN
+	DO NOTHING;
+ERROR:  permission denied for table target
+RESET SESSION AUTHORIZATION;
 MERGE INTO target
 USING source2
 ON target.tid = source2.sid
 WHEN MATCHED THEN
 	UPDATE SET balance = 0;
-ERROR:  permission denied for table source2
 GRANT INSERT ON target TO regress_merge_no_privs;
 SET SESSION AUTHORIZATION regress_merge_no_privs;
 MERGE INTO target
@@ -2248,3 +2256,4 @@ DROP TABLE source, source2;
 DROP FUNCTION merge_trigfunc();
 DROP USER regress_merge_privs;
 DROP USER 

Re: Proposal: Adjacent B-Tree index

2024-02-20 Thread Dilshod Urazov
> only 1 index lookup is needed.
Sorry, must be "only lookups of 1 index are needed".

--
Dilshod Urazov

Tue, 20 Feb 2024 at 21:09, Dilshod Urazov :

> > I'm not sure why two indexes are not sufficient here?
>
> Did I write that they are not sufficient? The whole point is that in
> relational DBMSs, which are widely used to store graphs, we can optimize
> storage in such cases. We can also optimize traversals, e.g. if we want to
> get all nodes that are adjacent to a given node with id = X in an oriented
> graph
>
> SELECT id, label
> FROM Nodes
> JOIN Edges ON Nodes.id = Edges.target
> WHERE Edges.source = X;
>
> only 1 index lookup is needed.
>
> > The entry could've been removed because (e.g.)
> > test's b column was updated thus inserting a new index entry for the
> > new HOT-chain's TID.
>
> If test's b column was updated and HOT optimization took place, no new index
> entry is created. The index tuple pointing to the old heap tuple remains
> valid, since it now points to the HOT chain.
>
> --
> Dilshod Urazov
>
> Mon, 19 Feb 2024 at 22:32, Matthias van de Meent <
> boekewurm+postg...@gmail.com>:
>
>> On Mon, 19 Feb 2024 at 18:48, Dilshod Urazov 
>> wrote:
>> >
>> > - Motivation
>> >
>> > A regular B-tree index provides efficient mapping of key values to
>> tuples within a table. However, if you have two tables connected in some
>> way, a regular B-tree index may not be efficient enough. In this case, you
>> would need to create an index for each table. The purpose will become
>> clearer if we consider a simple example which is the main use-case as I see
>> it.
>>
>> I'm not sure why two indexes are not sufficient here? PostgreSQL can
>> already do merge joins, which would have the same effect of hitting
>> the same location in the index at the same time between all tables,
>> without the additional overhead of having to scan two table's worth of
>> indexes in VACUUM.
>>
>> > During the vacuum of A an index tuple pointing to a dead tuple in A
>> should be cleaned as well as all index tuples for the same key.
>>
>> This is definitely not correct. If I have this schema
>>
>> table test (id int primary key, b text unique)
>> table test_ref(test_id int references test(id))
>>
>> and if an index would contain entries for both test and test_ref, it
>> can't just remove all test_ref entries because an index entry with the
>> same id was removed: The entry could've been removed because (e.g.)
>> test's b column was updated thus inserting a new index entry for the
>> new HOT-chain's TID.
>>
>> > would suffice for this new semantics.
>>
>> With the provided explanation I don't think this is a great idea.
>>
>> Kind regards,
>>
>> Matthias van de Meent.
>>
>


Re: Missing Group Key in grouped aggregate

2024-02-20 Thread Tomas Vondra
On 2/20/24 16:53, Tom Lane wrote:
> Erik Nordström  writes:
>> I noticed that, beginning with PG16, grouped aggregates are missing the
>> "Group Key" in the EXPLAIN output.
> 
>> It seems the Agg node has numCols (number of grouping cols) set to zero in
>> queries like
> 
>> SELECT foo, count(*) FROM bar WHERE foo=1 GROUP BY foo;
> 
>> In PG15, the "Group Key" is shown and the Agg node has numCols set as
>> expected.
> 
> Looks sane to me: the planner now notices that there can only
> be one group so it doesn't tell the GroupAgg node to worry about
> making groups.  If it were missing in a case where there could be
> multiple output groups, yes that'd be a bug.
> 
> If you want to run it to ground you could bisect to see where the
> behavior changed, but you'd probably just find it was intentional.
> 

I believe this changed in:

commit 8d83a5d0a2673174dc478e707de1f502935391a5
Author: Tom Lane 
Date:   Wed Jan 18 12:37:57 2023 -0500

Remove redundant grouping and DISTINCT columns.

Avoid explicitly grouping by columns that we know are redundant
for sorting, for example we need group by only one of x and y in
SELECT ... WHERE x = y GROUP BY x, y
This comes up more often than you might think, as shown by the
changes in the regression tests.  It's nearly free to detect too,
since we are just piggybacking on the existing logic that detects
redundant pathkeys.  (In some of the existing plans that change,
it's visible that a sort step preceding the grouping step already
didn't bother to sort by the redundant column, making the old plan
a bit silly-looking.)

...

It's not quite obvious from the commit message, but that's where git
bisect says the behavior changed.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Shared detoast Datum proposal

2024-02-20 Thread Tomas Vondra
Hi,

I took a quick look on this thread/patch today, so let me share a couple
initial thoughts. I may not have a particularly coherent/consistent
opinion on the patch or what would be a better way to do this yet, but
perhaps it'll start a discussion ...


The goal of the patch (as I understand it) is essentially to cache
detoasted values, so that the value does not need to be detoasted
repeatedly in different parts of the plan. I think that's perfectly
sensible and worthwhile goal - detoasting is not cheap, and complex
plans may easily spend a lot of time on it.

That being said, the approach seems somewhat invasive, and touching
parts I wouldn't have expected to need a change to implement this. For
example, I certainly would not have guessed the patch to need changes in
createplan.c or setrefs.c.

Perhaps it really needs to do these things, but neither the thread nor
the comments are very enlightening as to why it's needed :-( In many
cases I can guess, but I'm not sure my guess is correct. And comments in
code generally describe what's happening locally / next line, not the
bigger picture & why it's happening.

IIUC we walk the plan to decide which Vars should be detoasted (and
cached) once, and which cases should not do that because it'd inflate
the amount of data we need to keep in a Sort, Hash etc. Not sure if
there's a better way to do this - it depends on what happens in the
upper parts of the plan, so we can't decide while building the paths.

But maybe we could decide this while transforming the paths into a plan?
(I realize the JIT thread nearby needs to do something like that in
create_plan, and in that one I suggested maybe walking the plan would be
a better approach, so I may be contradicting myself a little bit.).

In any case, set_plan_forbid_pre_detoast_vars_recurse should probably
explain the overall strategy / reasoning in a bit more detail. Maybe
it's somewhere in this thread, but that's not great for reviewers.

Similar for the setrefs.c changes. It seems a bit suspicious to
piggyback the new code into fix_scan_expr/fix_scan_list and similar code.
Those functions have a pretty clearly defined purpose, and I'm not sure we want
to extend them to also deal with this new thing. (FWIW I'd 100% have
done it this way if I hacked on a PoC of this, to make it work. But I'm
not sure it's the right solution.)

I don't know what to think about the Bitset - maybe it's necessary, but
how would I know? I don't have any way to measure the benefits, because
the 0002 patch uses it right away. I think it should be done the other
way around, i.e. the patch should introduce the main feature first
(using the traditional Bitmapset), and then add Bitset on top of that.
That way we could easily measure the impact and see if it's useful.

On the whole, my biggest concern is memory usage & leaks. It's not
difficult to already have problems with large detoasted values, and if
we start keeping more of them, that may get worse. Or at least that's my
intuition - it can't really get better by keeping the values longer, right?

The other thing is the risk of leaks (in the sense of keeping detoasted
values longer than expected). I see the values are allocated in
tts_mcxt, and maybe that's the right solution - not sure.


FWIW while looking at the patch, I couldn't help but think about
expanded datums. There's similarity in what these two features do - keep
detoasted values for a while, so that we don't need to do the expensive
processing if we access them repeatedly. Of course, expanded datums are
not meant to be long-lived, while "shared detoasted values" are meant to
exist (potentially) for the query duration. But maybe there's something
we could learn from expanded datums? For example how the varlena pointer
is leveraged to point to the expanded object.

For example, what if we add a "TOAST cache" as a query-level hash table,
and modify the detoasting to first check the hash table (with the TOAST
pointer as a key)? It'd be fairly trivial to enforce a memory limit on
the hash table, evict values from it, etc. And it wouldn't require any
of the createplan/setrefs changes, I think ...


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Proposal: Adjacent B-Tree index

2024-02-20 Thread Dilshod Urazov
> I'm not sure why two indexes are not sufficient here?

Did I write that they are not sufficient? The whole point is that in
relational DBMSs, which are widely used to store graphs, we can optimize
storage in such cases. We can also optimize traversals, e.g. if we want to
get all nodes that are adjacent to a given node with id = X in an oriented
graph

SELECT id, label
FROM Nodes
JOIN Edges ON Nodes.id = Edges.target
WHERE Edges.source = X;

only 1 index lookup is needed.

> The entry could've been removed because (e.g.)
> test's b column was updated thus inserting a new index entry for the
> new HOT-chain's TID.

If test's b column was updated and HOT optimization took place, no new index
entry is created. The index tuple pointing to the old heap tuple remains
valid, since it now points to the HOT chain.

--
Dilshod Urazov

Mon, 19 Feb 2024 at 22:32, Matthias van de Meent <
boekewurm+postg...@gmail.com>:

> On Mon, 19 Feb 2024 at 18:48, Dilshod Urazov 
> wrote:
> >
> > - Motivation
> >
> > A regular B-tree index provides efficient mapping of key values to
> tuples within a table. However, if you have two tables connected in some
> way, a regular B-tree index may not be efficient enough. In this case, you
> would need to create an index for each table. The purpose will become
> clearer if we consider a simple example which is the main use-case as I see
> it.
>
> I'm not sure why two indexes are not sufficient here? PostgreSQL can
> already do merge joins, which would have the same effect of hitting
> the same location in the index at the same time between all tables,
> without the additional overhead of having to scan two table's worth of
> indexes in VACUUM.
>
> > During the vacuum of A an index tuple pointing to a dead tuple in A
> should be cleaned as well as all index tuples for the same key.
>
> This is definitely not correct. If I have this schema
>
> table test (id int primary key, b text unique)
> table test_ref(test_id int references test(id))
>
> and if an index would contain entries for both test and test_ref, it
> can't just remove all test_ref entries because an index entry with the
> same id was removed: The entry could've been removed because (e.g.)
> test's b column was updated thus inserting a new index entry for the
> new HOT-chain's TID.
>
> > would suffice for this new semantics.
>
> With the provided explanation I don't think this is a great idea.
>
> Kind regards,
>
> Matthias van de Meent.
>


Re: Integer underflow in fprintf in dsa.c

2024-02-20 Thread Daniel Gustafsson
> On 20 Feb 2024, at 17:13, Ilyasov Ian  wrote:

> Sorry for not answering quickly.

There is no need for any apology; there is no obligation to answer within any
specific timeframe.

> I attached a patch to the letter with changes to take into account Daniel 
> Gustafsson's comment.

Looks good on a quick skim, I'll take care of this shortly.

--
Daniel Gustafsson





Re: Add lookup table for replication slot invalidation causes

2024-02-20 Thread Jelte Fennema-Nio
On Tue, 20 Feb 2024 at 12:11, Bharath Rupireddy
 wrote:
> Thoughts?

Seems like a good improvement overall. But I'd prefer the definition
of the lookup table to use this syntax:

const char *const SlotInvalidationCauses[] = {
[RS_INVAL_NONE] = "none",
[RS_INVAL_WAL_REMOVED] = "wal_removed",
[RS_INVAL_HORIZON] = "rows_removed",
[RS_INVAL_WAL_LEVEL] = "wal_level_sufficient",
};


Regarding the actual patch:

-   Assert(conflict_reason);

Probably we should keep this Assert. As well as the Assert(0)


+   for (cause = RS_INVAL_NONE; cause <= RS_INVAL_MAX_CAUSES; cause++)

Strictly speaking this is a slight change in behaviour, since now
"none" is also parsed. That seems fine to me though.




Re: Streaming read-ready sequential scan code

2024-02-20 Thread Melanie Plageman
On Tue, Feb 20, 2024 at 6:13 AM Robert Haas  wrote:
>
> On Tue, Feb 20, 2024 at 4:35 AM Melanie Plageman
>  wrote:
> > I've written three alternative implementations of the actual streaming
> > read user for sequential scan which handle the question of where to
> > allocate the streaming read object and how to handle changing scan
> > direction in different ways.
>
> It's weird to me that the prospect of changing the scan direction
> causes such complexity. I mean, why doesn't a streaming read object
> have a forget_all_my_previous_requests() method or somesuch?

Basically, that is what pg_streaming_read_free() does. It goes through
and releases the buffers it had pinned and frees any per-buffer data
allocated.

The complexity with the sequential scan streaming read user and scan
direction is just that it has to detect when the scan direction
changes and do the releasing/freeing and reallocation. The scan
direction is passed to heapgettup[_pagemode](), so this is something
that can change on a tuple-to-tuple basis.

It is less that doing this is complicated and more that it is annoying
and distracting to have to check for and handle a very unimportant and
uncommon case in the main path of the common case.

- Melanie




RE: Integer underflow in fprintf in dsa.c

2024-02-20 Thread Ilyasov Ian
Sorry for not answering quickly.

Thank you for your comments.

I attached a patch to the letter with changes to take into account Daniel 
Gustafsson's comment.


Kind regards,
Ian Ilyasov.

Junior Software Developer at Postgres Professional
Subject: [PATCH] Integer underflow fix in fprintf in dsa.c.
---
Index: src/backend/utils/mmgr/dsa.c
===
diff --git a/src/backend/utils/mmgr/dsa.c b/src/backend/utils/mmgr/dsa.c
--- a/src/backend/utils/mmgr/dsa.c	(revision b78fa8547d02fc72ace679fb4d5289dccdbfc781)
+++ b/src/backend/utils/mmgr/dsa.c	(date 1708440995707)
@@ -1105,9 +1105,13 @@
 		{
 			dsa_segment_index segment_index;
 
-			fprintf(stderr,
-	"segment bin %zu (at least %d contiguous pages free):\n",
-	i, 1 << (i - 1));
+			if (i == 0)
+fprintf(stderr,
+		"segment bin %zu (no contiguous free pages):\n", i);
+			else
+fprintf(stderr,
+		"segment bin %zu (at least %d contiguous pages free):\n",
+		i, 1 << (i - 1));
 			segment_index = area->control->segment_bins[i];
 			while (segment_index != DSA_SEGMENT_INDEX_NONE)
 			{


Re: Fix race condition in InvalidatePossiblyObsoleteSlot()

2024-02-20 Thread Bertrand Drouvot
Hi,

On Tue, Feb 20, 2024 at 02:33:44PM +0900, Michael Paquier wrote:
> On Tue, Feb 20, 2024 at 08:51:17AM +0900, Michael Paquier wrote:
> > Prefixing these with "initial_" is fine, IMO.  That shows the
> > intention that these come from the slot's data before doing the
> > termination.  So I'm OK with what's been proposed in v3.
> 
> I was looking at that a second time, and just concluded that this is
> OK, so I've applied that down to 16, wordsmithing a bit the comments.

Thanks!
FWIW, I've started to write a POC regarding the test we mentioned up-thread.

The POC test is based on what has been submitted by Michael in [1]. The POC
test seems to work fine and it seems that nothing more is needed in [1] (at
some point I thought I would need the ability to wake up multiple "wait"
injection points in sequence but that was not necessary).

I'll polish and propose my POC test once [1] is pushed (unless you're curious
about it before).

[1]: https://www.postgresql.org/message-id/flat/ZdLuxBk5hGpol91B%40paquier.xyz

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: pg_restore problem to load constraints with tables

2024-02-20 Thread Tom Lane
Fabrice Chapuis  writes:
> When a table is reloaded wit pg_restore, it is recreated without indexes or
> constraints. There are automatically skipped. Is there a reason for this?

[ shrug ] That's how the -t switch is defined.  If you want something
else, you can use the -l and -L switches to pick out a custom
collection of objects to restore.

regards, tom lane




Re: Injection points: some tools to wait and wake

2024-02-20 Thread Bertrand Drouvot
Hi,

On Tue, Feb 20, 2024 at 08:28:28AM +0900, Michael Paquier wrote:
> On Mon, Feb 19, 2024 at 02:28:04PM +, Bertrand Drouvot wrote:
> > If slock_t protects "only" one counter, then what about using
> > pg_atomic_uint64 or pg_atomic_uint32 instead? And btw do we need
> > wait_counts at all? (see comment number 4)
> 
> We could, indeed, even if we use more than one counter.  Still, I
> would be tempted to keep it in case more data is added to this
> structure as that would make for less stuff to do while being normally
> non-critical.  This sentence may sound pedantic, though..

Okay, makes sense to keep this as it is as a "template" in case more stuff is
added. 

+   /* Counter advancing when injection_points_wake() is called */
+   int wait_counts;

In that case what about using an unsigned instead? (Nit)

> > I'm wondering if this loop and wait_counts are needed, why not doing
> > something like (and completely get rid of wait_counts)?
> > 
> > "
> > ConditionVariablePrepareToSleep(_state->wait_point);
> > ConditionVariableSleep(_state->wait_point, injection_wait_event);
> > ConditionVariableCancelSleep();
> > "
> > 
> > It's true that the comment above ConditionVariableSleep() mentions that:
> 
> Perhaps not, but it encourages good practices around the use of
> condition variables, and we need to track all that in shared memory
> anyway.  Ashutosh has argued in favor of the approach taken by the
> patch in the original thread when I've sent a version doing exactly
> what you are saying now to not track a state in shmem.

Oh okay I missed this previous discussion, let's keep it as it is then.

New comments:

1 ===

+void
+injection_wait(const char *name)

Looks like name is not used in the function. I guess the reason it is a
parameter is because that's the way the callback function is being called
in InjectionPointRun()?

2 ===

+PG_FUNCTION_INFO_V1(injection_points_wake);
+Datum
+injection_points_wake(PG_FUNCTION_ARGS)
+{

I think that this function will wake up all the "wait" injection points.
Would it make sense to implement some filtering based on the name? That could
be useful for tests that would need multiple wait injection points and that want
to wake them up "sequentially".

We could think about it if there is such a need in the future though.

3 ===

+# Register a injection point on the standby so as the follow-up

typo: "an injection"?

4 ===

+for (my $i = 0; $i < 3000; $i++)
+{

is 3000 due to?:

+checkpoint_timeout = 30s

If so, would it make sense to reduce the value for both?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Missing Group Key in grouped aggregate

2024-02-20 Thread Tom Lane
Erik Nordström  writes:
> I noticed that, beginning with PG16, grouped aggregates are missing the
> "Group Key" in the EXPLAIN output.

> It seems the Agg node has numCols (number of grouping cols) set to zero in
> queries like

> SELECT foo, count(*) FROM bar WHERE foo=1 GROUP BY foo;

> In PG15, the "Group Key" is shown and the Agg node has numCols set as
> expected.

Looks sane to me: the planner now notices that there can only
be one group so it doesn't tell the GroupAgg node to worry about
making groups.  If it were missing in a case where there could be
multiple output groups, yes that'd be a bug.

If you want to run it to ground you could bisect to see where the
behavior changed, but you'd probably just find it was intentional.

regards, tom lane




Re: Change the bool member of the Query structure to bits

2024-02-20 Thread Tom Lane
Quan Zongliang  writes:
> The Query structure has an increasing number of bool attributes. This is 
> likely to increase in the future. And they have the same properties. 
> Wouldn't it be better to store them in bits? Common statements don't use 
> them, so they have little impact. This also saves memory space.

I'm -1 on that, for three reasons:

* The amount of space saved is quite negligible.  If queries had many
Query structs then it could matter, but they don't.

* This causes enough code churn to create a headache for back-patching.

* This'll completely destroy the readability of these flags in
pprint output.

I'm not greatly in love with the macro layer you propose, either,
but those details don't matter because I think we should just
leave well enough alone.

regards, tom lane




Re: numeric_big in make check?

2024-02-20 Thread Dean Rasheed
On Tue, 20 Feb 2024 at 15:16, Tom Lane  wrote:
>
> Dean Rasheed  writes:
> > Looking at the script itself, the addition, subtraction,
> > multiplication and division tests at the top are probably pointless,
> > since I would expect those operations to be tested adequately (and
> > probably more thoroughly) by the transcendental test cases. In fact, I
> > think it would probably be OK to delete everything above line 650, and
> > just keep the bottom half of the script -- the pow(), exp(), ln() and
> > log() tests, which cover various edge cases, as well as exercising
> > basic arithmetic operations internally.
>
> I could go with that, but let's just transpose those into numeric.
>

Works for me.

Regards,
Dean




Re: numeric_big in make check?

2024-02-20 Thread Tom Lane
Dean Rasheed  writes:
> Looking at the script itself, the addition, subtraction,
> multiplication and division tests at the top are probably pointless,
> since I would expect those operations to be tested adequately (and
> probably more thoroughly) by the transcendental test cases. In fact, I
> think it would probably be OK to delete everything above line 650, and
> just keep the bottom half of the script -- the pow(), exp(), ln() and
> log() tests, which cover various edge cases, as well as exercising
> basic arithmetic operations internally.

I could go with that, but let's just transpose those into numeric.

regards, tom lane




Re: bug report: some issues about pg_15_stable(8fa4a1ac61189efffb8b851ee77e1bc87360c445)

2024-02-20 Thread Dean Rasheed
On Tue, 20 Feb 2024 at 14:49, Dean Rasheed  wrote:
>
> Also, if the concurrent update were an update of a key
> column that was included in the join condition, the re-scan would
> follow the update to a new matching source row, which is inconsistent
> with what would happen if it were a join to a regular relation.
>

In case it wasn't clear what I was talking about there, here's a simple example:

-- Setup
DROP TABLE IF EXISTS src1, src2, tgt;
CREATE TABLE src1 (a int, b text);
CREATE TABLE src2 (a int, b text);
CREATE TABLE tgt (a int, b text);

INSERT INTO src1 SELECT x, 'Src1 '||x FROM generate_series(1, 3) g(x);
INSERT INTO src2 SELECT x, 'Src2 '||x FROM generate_series(4, 6) g(x);
INSERT INTO tgt SELECT x, 'Tgt '||x FROM generate_series(1, 6, 2) g(x);

-- Session 1
BEGIN;
UPDATE tgt SET a = 2 WHERE a = 1;

-- Session 2
UPDATE tgt t SET b = s.b
  FROM (SELECT * FROM src1 UNION ALL SELECT * FROM src2) s
 WHERE s.a = t.a;
SELECT * FROM tgt;

-- Session 1
COMMIT;

and the result in tgt is:

 a |   b
---+
 2 | Src1 2
 3 | Src1 3
 5 | Src2 5
(3 rows)

whereas if that UNION ALL subquery had been a regular table with the
same contents, the result would have been:

 a |   b
---+
 2 | Tgt 1
 3 | Src1 3
 5 | Src2 5

i.e., the concurrently modified row would not have been updated.

Regards,
Dean




Re: Replace current implementations in crypt() and gen_salt() to OpenSSL

2024-02-20 Thread Daniel Gustafsson
> On 20 Feb 2024, at 13:40, Peter Eisentraut  wrote:
> 
> On 20.02.24 12:39, Daniel Gustafsson wrote:
>> A fifth option is to throw away our in-tree implementations and use the 
>> OpenSSL
>> API's for everything, which is where this thread started.  If the effort to
>> payoff ratio is palatable to anyone then patches are for sure welcome.
> 
> The problem is that, as I understand it, these crypt routines are not 
> designed in a way that you can just plug in a crypto library underneath.  
> Effectively, the definition of what, say, blowfish crypt does, is whatever is 
> in that source file, and transitively, whatever OpenBSD does.  

I don't disagree, but if the OP is willing to take a stab at it then..

> (Fun question: Does OpenBSD care about FIPS?)

No, LibreSSL ripped out FIPS support early on.

>  Of course, you could reimplement the same algorithms independently, using 
> OpenSSL or whatever.  But I don't think this will really improve the state of 
> the world in aggregate, because to a large degree we are relying on the 
> upstream to keep these implementations maintained, and if we rewrite them, we 
> become the upstream.

As a sidenote, we are already trailing behind upstream on this, the patch in
[0] sits on my TODO, but given the lack of complaints over the years it's not
been bumped to the top.

--
Daniel Gustafsson

[0] 
https://www.postgresql.org/message-id/flat/CAA-7PziyARoKi_9e2xdC75RJ068XPVk1CHDDdscu2BGrPuW9TQ%40mail.gmail.com#b20783dd6c72e95a8a0f6464d1228ed5





Re: bug report: some issues about pg_15_stable(8fa4a1ac61189efffb8b851ee77e1bc87360c445)

2024-02-20 Thread Dean Rasheed
On Mon, 19 Feb 2024 at 17:48, zwj  wrote:
>
> Hello,
>
>I found an issue while using the latest version of PG15 
> (8fa4a1ac61189efffb8b851ee77e1bc87360c445).
>This question is about 'merge into'.
>
>When two merge into statements are executed concurrently, I obtain the 
> following process and results.
>Firstly, the execution results of each Oracle are different, and secondly, 
> I tried to understand its execution process and found that it was not very 
> clear.
>

Hmm, looking at this I think there is a problem with how UNION ALL
subqueries are pulled up, and I don't think it's necessarily limited
to MERGE.

Looking at the plan for this MERGE operation:

explain (verbose, costs off)
merge into mergeinto_0023_tb01 a using (select aid,name,year from
mergeinto_0023_tb02 union all select aid,name,year from
mergeinto_0023_tb03) c on (a.id=c.aid) when matched then update set
year=c.year when not matched then insert(id,name,year)
values(c.aid,c.name,c.year);

QUERY PLAN
-
 Merge on public.mergeinto_0023_tb01 a
   ->  Merge Right Join
 Output: a.ctid, mergeinto_0023_tb02.year,
mergeinto_0023_tb02.aid, mergeinto_0023_tb02.name,
(ROW(mergeinto_0023_tb02.aid, mergeinto_0023_tb02.name,
mergeinto_0023_tb02.year))
 Merge Cond: (a.id = mergeinto_0023_tb02.aid)
 ->  Sort
   Output: a.ctid, a.id
   Sort Key: a.id
   ->  Seq Scan on public.mergeinto_0023_tb01 a
 Output: a.ctid, a.id
 ->  Sort
   Output: mergeinto_0023_tb02.year,
mergeinto_0023_tb02.aid, mergeinto_0023_tb02.name,
(ROW(mergeinto_0023_tb02.aid, mergeinto_0023_tb02.name,
mergeinto_0023_tb02.year))
   Sort Key: mergeinto_0023_tb02.aid
   ->  Append
 ->  Seq Scan on public.mergeinto_0023_tb02
   Output: mergeinto_0023_tb02.year,
mergeinto_0023_tb02.aid, mergeinto_0023_tb02.name,
ROW(mergeinto_0023_tb02.aid, mergeinto_0023_tb02.name,
mergeinto_0023_tb02.year)
 ->  Seq Scan on public.mergeinto_0023_tb03
   Output: mergeinto_0023_tb03.year,
mergeinto_0023_tb03.aid, mergeinto_0023_tb03.name,
ROW(mergeinto_0023_tb03.aid, mergeinto_0023_tb03.name,
mergeinto_0023_tb03.year)

The "ROW(...)" targetlist entries are added because
preprocess_rowmarks() adds a rowmark to the UNION ALL subquery, which
at that point is the only non-target relation in the jointree. It does
this intending that the same values be returned during EPQ rechecking.
However, pull_up_subqueries() causes the UNION all subquery and its
leaf subqueries to be pulled up into the main query as appendrel
entries. So when it comes to EPQ rechecking, the rowmark does
absolutely nothing, and EvalPlanQual() does a full re-scan of
mergeinto_0023_tb02 and mergeinto_0023_tb03 and a re-sort for each
concurrently modified row.

A similar thing happens for UPDATE and DELETE, if they're joined to a
UNION ALL subquery. However, AFAICS that doesn't cause a problem
(other than being pretty inefficient) since, for UPDATE and DELETE,
the join to the UNION ALL subquery will always be an inner join, I
think, and so the join output will always be correct.

However, for MERGE, the join may be an outer join, so during an EPQ
recheck, we're joining the target relation (fixed to return just the
updated row) to the full UNION ALL subquery. So if it's an outer join,
the join output will return all-but-one of the subquery rows as not
matched rows in addition to the one matched row that we want, whereas
the EPQ mechanism is expecting the plan to return just one row.

On the face of it, the simplest fix is to tweak is_simple_union_all()
to prevent UNION ALL subquery pullup for MERGE, forcing a
subquery-scan plan. A quick test shows that that fixes the reported
issue.

is_simple_union_all() already has a test for rowmarks, and a comment
saying that locking isn't supported, but since it is called before
preprocess_rowmarks(), it doesn't know that the subquery is about to
be marked.

However, that leaves the question of whether we should do the same for
UPDATE and DELETE. There doesn't appear to be a live bug there, so
maybe they're best left alone. Also, back-patching a change like that
might make existing queries less efficient. But I feel like I might be
overlooking something here, and this doesn't seem to be how EPQ
rechecks are meant to work (doing a full re-scan of non-target
relations). Also, if the concurrent update were an update of a key
column that was included in the join condition, the re-scan would
follow the update to a new matching source row, which is inconsistent
with what would happen if it were a join to a regular relation.

Thoughts?

Regards,
Dean




Re: to_regtype() Raises Error

2024-02-20 Thread David E. Wheeler


On Feb 5, 2024, at 09:01, David E. Wheeler  wrote:

> Ah, thank you. Updated patch attached.

I’ve moved this patch into the to_regtype patch thread[1], since it exhibits 
the same behavior.

Best,

David

[1] 
https://www.postgresql.org/message-id/60ef4e11-bc1c-4034-b37b-695662d28...@justatheory.com





Re: ALTER TABLE SET ACCESS METHOD on partitioned tables

2024-02-20 Thread Peter Eisentraut

On 19.07.23 20:13, Justin Pryzby wrote:

On Thu, Jun 01, 2023 at 08:50:50AM -0400, Michael Paquier wrote:

On Wed, May 31, 2023 at 06:35:34PM -0500, Justin Pryzby wrote:

What do you think the comment ought to say ?  It already says:

src/backend/catalog/heap.c-  * Make a dependency link to force the 
relation to be deleted if its
src/backend/catalog/heap.c-  * access method is.


This is the third location where we rely on the fact that
RELKIND_HAS_TABLE_AM() does not include RELKIND_PARTITIONED_TABLE, so
it seems worth documenting what we are relying on as a comment?  Say:
  * Make a dependency link to force the relation to be deleted if its
  * access method is.
  *
  * No need to add an explicit dependency for the toast table, as the
  * main table depends on it.  Partitioned tables have a table access
  * method defined, and RELKIND_HAS_TABLE_AM ignores them.


You said that this location "relies on" the macro not including
partitioned tables, but I would say the opposite: the places that use
RELKIND_HAS_TABLE_AM() and do *not* say "or relkind==PARTITIONED_TABLE"
are the ones that "rely on" that...

Anyway, this updates various comments.  No other changes.


It would be helpful if this patch could more extensively document in its 
commit message what semantic changes it makes.  Various options of 
possible behaviors were discussed in this thread, but it's not clear 
which behaviors were chosen in this particular patch version.


The general idea is that you can set an access method on a partitioned 
table.  That much seems very agreeable.  But then what happens with this 
setting, how can you override it, how can you change it, what happens 
when you change it, what happens with existing partitions and new 
partitions, etc. -- and which of these behaviors are new and old.  Many 
things to specify.






Re: Patch: Add parse_type Function

2024-02-20 Thread David E. Wheeler
On Feb 20, 2024, at 01:30, jian he  wrote:

> the second hint `-- grammar error expected` seems to contradict with
> the results?

Quite right, thank you, that’s actually a trapped error. I’ve tweaked the 
comments and their order in v7, attached.

This goes back to the discussion of the error raising of to_regtype[1], so I’ve 
incorporated the patch from that thread into this patch, and set up the docs 
for to_regtypemod() with similar information. The wording is still a little 
opaque for my taste, though, written more for someone who knows a bit about the 
internals, but it’s a start.

I’ve also fixed the wayward parameter in the function signature in the docs, 
and added a note about why I’ve also patched genbki.pl.

Best,

David

[1] 
https://www.postgresql.org/message-id/flat/57E1FDDC-5A38-452D-82D7-A44DA2E13862%40justatheory.com#1ae0b11634bc33c7ad3cd728e43d504e



v7-0001-Add-to_regtypemod-SQL-function.patch
Description: Binary data


pg_restore problem to load constraints with tables

2024-02-20 Thread Fabrice Chapuis
Hi,
When a table is reloaded with pg_restore, it is recreated without indexes or
constraints. They are automatically skipped. Is there a reason for this?

pg_restore -j 8 -v -d zof /shared/pgdump/aq/backup/dbtest/shtest --no-owner
--role=test -t  mytable 2>&1 | tee -a dbest.log

pg_restore: skipping item 7727 SEQUENCE SET xxx_seq
pg_restore: skipping item 5110 INDEX x_cons
pg_restore: skipping item 5143 CONSTRAINT xxx
pg_restore: skipping item 5670 FK CONSTRAINT xxx

Thanks for your feedback

Fabrice


Re: Injection points: some tools to wait and wake

2024-02-20 Thread Andrey M. Borodin



> On 20 Feb 2024, at 02:21, Michael Paquier  wrote:
> 
> On Mon, Feb 19, 2024 at 11:54:20AM +0300, Andrey M. Borodin wrote:
>> 1. injection_points_wake() will wake all of waiters. But it's not
>> suitable for complex tests. I think there must be a way to wake only
>> specific waiter by injection point name.
> 
> I don't disagree with that, but I don't have a strong argument for
> implementing that until there is an explicit need for it in core.  It
> is also possible to plug in your own module, outside of core, if you
> are looking for something more specific.  The backend APIs allow that.

In [0] I want to create a test for an edge case of reading a recent multixact.
An external module does not allow having the test within the repository.
In that test I need to sync backends in 3 steps (backend 1 starts to wait in an
injection point, backend 2 starts to sleep in another point, backend 1 is
released and observes a 3rd injection point). The "wake them all" implementation
allows only 2-step synchronization.
I will try to simplify the test to 2 steps, but it would be much easier to
implement if injection points could be woken independently.

>> 2. Alexander Korotkov's stopevents could be used in isolation
>> tests. This kind of tests is perfect for describing complex race
>> conditions. (as a side note, I'd be happy if we could have
>> primary\standby in isolation tests too)
> 
> This requires plugging is more into src/test/isolation/, with multiple
> connection strings.  This has been suggested in the past.

I think standby isolation tests are just a sugar-on-top feature here.
Wrt injection points, I'd like to see a function to wait until some injection 
point is observed.
With this function at hand developer can implement race condition tests as an 
isolation test.

>> 5. In many cases we need to have injection point under critical
>> section. I propose to have a "prepared injection point". See [0] for
>> example in v2-0003-Test-multixact-CV-sleep.patch
>> +   INJECTION_POINT_PREPARE("GetNewMultiXactId-done");
>> +
>>START_CRIT_SECTION();
>> 
>> +   INJECTION_POINT_RUN_PREPARED();
> 
> I don't see how that's different from a wait/wake logic?  The only
> thing you've changed is to stop a wait when a point is detached and
> you want to make the stop conditional.  Plugging in a condition
> variable is more flexible than a hardcoded sleep in terms of wait,
> while being more responsive.

No, a "prepared injection point" is not about wait/wake logic. It's about having
an injection point inside a critical section.
A normal injection point will pstrdup(name) and fail. In [0] I need a test that
waits after multixact generation, before WAL-logging it. That's only possible in
a critical section.

Thanks!


Best regards, Andrey Borodin.

[0] 
https://www.postgresql.org/message-id/0925f9a9-4d53-4b27-a87e-3d83a757b...@yandex-team.ru



Re: Have pg_basebackup write "dbname" in "primary_conninfo"?

2024-02-20 Thread Robert Haas
On Tue, Feb 20, 2024 at 5:58 PM Hayato Kuroda (Fujitsu)
 wrote:
> > Seems weird to me. You don't use dbname=replication to ask for a
> > replication connection, so why would we ever end up with that
> > anywhere? And especially in only one of two such closely related
> > cases?
>
> Just FYI - here is an extreme case. And note that I have applied proposed 
> patch.
>
> When `pg_basebackup -D data_N2 -R` is used:
> ```
> primary_conninfo = 'user=hayato ... dbname=hayato ...
> ```
>
> But when `pg_basebackup -d "" -D data_N2 -R` is used:
> ```
> primary_conninfo = 'user=hayato ... dbname=replication
> ```

It seems like maybe somebody should look into why this is happening,
and perhaps fix it.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: A new message seems missing a punctuation

2024-02-20 Thread shveta malik
On Tue, Feb 20, 2024 at 5:01 PM Amit Kapila  wrote:
>
>
> We do expose the required information (restart_lsn, catalog_xmin,
> synced, temporary, etc) via pg_replication_slots. So, I feel the LOG
> message here is sufficient to DEBUG (or know the details) when the
> slot sync doesn't succeed.
>

Please find the patch having the suggested changes.

thanks
Shveta


v2-0001-Reword-LOG-msg-for-slot-sync.patch
Description: Binary data


Re: Thoughts about NUM_BUFFER_PARTITIONS

2024-02-20 Thread Tomas Vondra
Hi,

On 2/20/24 03:16, wenhui qiu wrote:
> Hi Heikki Linnakangas
>I saw in the git log this commit:
> https://github.com/postgres/postgres/commit/3acc10c997f916f6a741d0b4876126b7b08e3892
> but I don't see an email discussing it. As the commit log
> tells us, we don't know exactly how large a value is optimal, and I believe
> it's more flexible to make it a parameter. Thank you very much,
> tomas.vondra, for explaining the relationship. I see that MAX_SIMUL_LWLOCKS
> was just doubled in this commit; is there a more appropriate ratio between
> them?
> 

I think the discussion for that commit is in [1] (and especially [2]).

That being said, I don't think MAX_SIMUL_LWLOCKS and NUM_BUFFER_PARTITIONS
need to be in any particular ratio. The only requirement is that there
needs to be enough slack, and 72 locks seemed to work quite fine until
now - I don't think we need to change that.

What might be necessary is improving held_lwlocks - we treat it as LIFO,
but more as an expectation than a hard rule. I'm not sure how often we
violate that rule (if at all), but if we do then it's going to get more
expensive as we increase the number of locks. But I'm not sure this is
actually a problem in practice, we usually hold very few LWLocks at the
same time.

As for making this a parameter, I'm rather opposed to the idea. If we
don't have a very clear idea how to set this limit, what's the chance
users with little knowledge of the internals will pick a good value?
Adding yet another knob would just mean users start messing with it in
random ways (typically increasing it to very high value, because "more
is better"), causing more harm than good.

Adding it as a GUC would also require making some parts dynamic (instead
of just doing static allocation with compile-time constants). That's not
great, but I'm not sure how significant the penalty might be.


IMHO adding a GUC might be acceptable only if we fail to come up with a
good value (which is going to be a trade off), and if someone
demonstrates a clear benefit of increasing the value (which I don't
think has happened in this thread yet).


regards


[1]
https://www.postgresql.org/message-id/flat/CAA4eK1LSTcMwXNO8ovGh7c0UgCHzGbN%3D%2BPjggfzQDukKr3q_DA%40mail.gmail.com

[2]
https://www.postgresql.org/message-id/CA%2BTgmoY58dQi8Z%3DFDAu4ggxHV-HYV03-R9on1LSP9OJU_fy_zA%40mail.gmail.com

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: numeric_big in make check?

2024-02-20 Thread Daniel Gustafsson
> On 20 Feb 2024, at 14:23, Dean Rasheed  wrote:

> If we did that, numeric_big would be even further down the list of
> expensive tests, and I'd say it should be run by default.

My motivation for raising this was to get a test which is executed as part of
parallel_schedule to make sure failures aren't missed.  If we get there by slimming
down numeric_big to keep the unique coverage then that sounds like a good plan
to me.

--
Daniel Gustafsson





Re: speed up a logical replica setup

2024-02-20 Thread vignesh C
On Mon, 19 Feb 2024 at 11:15, Hayato Kuroda (Fujitsu)
 wrote:
>
> Dear hackers,
>
> > Since it may be useful, I will post top-up patch on Monday, if there are no
> > updating.
>
> And here are top-up patches. Feel free to check and include.
>
> v22-0001: Same as v21-0001.
> === rebased patches ===
> v22-0002: Update docs per recent changes. Same as v20-0002.
> v22-0003: Add check versions of the target. Extracted from v20-0003.
> v22-0004: Remove -S option. Mostly same as v20-0009, but the commit message was
>   slightly changed.
> === Newbie ===
> V22-0005: Addressed my comments which seem to be trivial[1].
>   Comments #1, 3, 4, 8, 10, 14, 17 were addressed here.
> v22-0006: Consider the scenario where commands fail after the recovery.
>   drop_subscription() is removed and some messages are added per [2].
> V22-0007: Revise server_is_in_recovery() per [1]. Comments #5, 6, 7, were 
> addressed here.
> V22-0008: Fix a strange report when physical_primary_slot is null. Per 
> comment #9 [1].
> V22-0009: Prohibit reusing publications when they already exist. Per 
> comments #11 and 12 [1].
> V22-0010: Avoid calling PQclear()/PQfinish()/pg_free() if the process exits 
> soon. Per comment #15 [1].
> V22-0011: Update testcode. Per comments #17- [1].
>
> I did not handle below points because I have unclear points.
>
> a.
> This patch set cannot detect the disconnection between the target (standby) 
> and
> source (primary) during the catch-up, because the connection status must be
> fetched at the same time (= in the same query) as the recovery status, but it
> is now an independent function (server_is_in_recovery()).
>
> b.
> This patch set cannot detect the inconsistency reported by Shubham [3]. I 
> could not
> come up with solutions without removing -P...

Few comments:
1) The below code can lead to assertion failure if the publisher is
stopped while dropping the replication slot:
+   if (primary_slot_name != NULL)
+   {
+   conn = connect_database(dbinfo[0].pubconninfo);
+   if (conn != NULL)
+   {
+   drop_replication_slot(conn, &dbinfo[0],
primary_slot_name);
+   }
+   else
+   {
+   pg_log_warning("could not drop replication
slot \"%s\" on primary",
+  primary_slot_name);
+   pg_log_warning_hint("Drop this replication
slot soon to avoid retention of WAL files.");
+   }
+   disconnect_database(conn);
+   }

pg_createsubscriber: error: connection to database failed: connection
to server on socket "/tmp/.s.PGSQL.5432" failed: No such file or
directory
Is the server running locally and accepting connections on that socket?
pg_createsubscriber: warning: could not drop replication slot
"standby_1" on primary
pg_createsubscriber: hint: Drop this replication slot soon to avoid
retention of WAL files.
pg_createsubscriber: pg_createsubscriber.c:432: disconnect_database:
Assertion `conn != ((void *)0)' failed.
Aborted (core dumped)

This is happening because we are calling disconnect_database in the
connection failure case too, which has the following assert:
+static void
+disconnect_database(PGconn *conn)
+{
+   Assert(conn != NULL);
+
+   PQfinish(conn);
+}

2) There is a CheckDataVersion function which does exactly this, will
we be able to use this:
+   /* Check standby server version */
+   if ((ver_fd = fopen(versionfile, "r")) == NULL)
+   pg_fatal("could not open file \"%s\" for reading: %m",
versionfile);
+
+   /* Version number has to be the first line read */
+   if (!fgets(rawline, sizeof(rawline), ver_fd))
+   {
+   if (!ferror(ver_fd))
+   pg_fatal("unexpected empty file \"%s\"", versionfile);
+   else
+   pg_fatal("could not read file \"%s\": %m", versionfile);
+   }
+
+   /* Strip trailing newline and carriage return */
+   (void) pg_strip_crlf(rawline);
+
+   if (strcmp(rawline, PG_MAJORVERSION) != 0)
+   {
+   pg_log_error("standby server is of wrong version");
+   pg_log_error_detail("File \"%s\" contains \"%s\",
which is not compatible with this program's version \"%s\".",
+   versionfile,
rawline, PG_MAJORVERSION);
+   exit(1);
+   }
+
+   fclose(ver_fd);

3) Should this be added to typedefs.list:
+enum WaitPMResult
+{
+   POSTMASTER_READY,
+   POSTMASTER_STILL_STARTING
+};

4) pgCreateSubscriber should be mentioned after pg_controldata to keep
the ordering consistency:
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index aa94f6adf6..c5edd244ef 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -285,6 +285,7 @@



+   &pgCreateSubscriber;


5) Here pg_replication_slots should be 

Re: confirmed flush lsn seems to be move backward in certain error cases

2024-02-20 Thread vignesh C
On Fri, 16 Feb 2024 at 17:39, vignesh C  wrote:
>
> Hi,
>
> The following assertion failure was seen while testing one scenario
> for other patch:
> TRAP: failed Assert("s->data.confirmed_flush >=
> s->last_saved_confirmed_flush"), File: "slot.c", Line: 1760, PID:
> 545314
> postgres: checkpointer performing shutdown
> checkpoint(ExceptionalCondition+0xbb)[0x564ee6870c58]
> postgres: checkpointer performing shutdown
> checkpoint(CheckPointReplicationSlots+0x18e)[0x564ee65e9c71]
> postgres: checkpointer performing shutdown 
> checkpoint(+0x1e1403)[0x564ee61be403]
> postgres: checkpointer performing shutdown
> checkpoint(CreateCheckPoint+0x78a)[0x564ee61bdace]
> postgres: checkpointer performing shutdown
> checkpoint(ShutdownXLOG+0x150)[0x564ee61bc735]
> postgres: checkpointer performing shutdown 
> checkpoint(+0x5ae28c)[0x564ee658b28c]
> postgres: checkpointer performing shutdown
> checkpoint(CheckpointerMain+0x31e)[0x564ee658ad55]
> postgres: checkpointer performing shutdown
> checkpoint(AuxiliaryProcessMain+0x1d1)[0x564ee65888d9]
> postgres: checkpointer performing shutdown 
> checkpoint(+0x5b7200)[0x564ee6594200]
> postgres: checkpointer performing shutdown
> checkpoint(PostmasterMain+0x14da)[0x564ee658f12f]
> postgres: checkpointer performing shutdown 
> checkpoint(+0x464fc6)[0x564ee6441fc6]
> /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7ff6afa29d90]
> /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7ff6afa29e40]
> postgres: checkpointer performing shutdown
> checkpoint(_start+0x25)[0x564ee60b8e05]
>
> I was able to reproduce this issue with the following steps:
> -- Setup
> -- Publisher node:
> create table t1(c1 int);
> create table t2(c1 int);
> create publication pub1 for table t1;
> create publication pub2 for table t2;
>
> -- Subscriber node:
> create table t1(c1 int);
> create table t2(c1 int);
> create subscription test1 connection 'dbname=postgres host=localhost
> port=5432' publication pub1, pub2;
> select * from pg_subscription;
>
> -- Actual test
> insert into t1 values(10);
> insert into t2 values(20);
> select pg_sleep(10);
> drop publication pub2;
> insert into t1 values(10);
> insert into t2 values(20);
>
> Stop the publisher to see the assertion.
>
> For me the issue reproduces about twice in five times using the
> assert_failure.sh script attached.
>
> After the insert operation is replicated to the subscriber, the
> subscriber will set the lsn value sent by the publisher in the
> replication origin (in my case it was 0/1510978). publisher will then
> send keepalive messages with the current WAL position in the publisher
> (in my case it was 0/15109B0), but subscriber will simply send this
> position as the flush_lsn to the publisher as there are no ongoing
> transactions. Then since the publisher is started, it will identify
> that publication does not exist and stop the walsender/apply worker
> process. When the apply worker is restarted, we will get the
> remote_lsn(in my case it was 0/1510978) of the origin and set it to
> origin_startpos. We will start the apply worker with this
> origin_startpos (origin's remote_lsn). This position will be sent as
> feedback to the walsender process from the below stack:
> run_apply_worker->start_apply->LogicalRepApplyLoop->send_feedback.
> It will use the following send_feedback function call of
> LogicalRepApplyLoop function as in below code here as nothing is
> received from walsender:
> LogicalRepApplyLoop function
> ...
> len = walrcv_receive(LogRepWorkerWalRcvConn, &buf, &fd);
> if (len != 0)
> {
>/* Loop to process all available data (without blocking). */
>for (;;)
>{
>   CHECK_FOR_INTERRUPTS();
>   ...
>}
> }
>
> /* confirm all writes so far */
> send_feedback(last_received, false, false);
> ...
>
> In send_feedback, we will set flushpos to replication origin's
> remote_lsn and send it to the walsender process. Walsender process
> will receive this information and set confirmed_flush in:
> ProcessStandbyReplyMessage->LogicalConfirmReceivedLocation
>
> Then immediately we are trying to stop the publisher instance,
> shutdown checkpoint process will be triggered. In this case:
> confirmed_flush = 0/1510978 will be lesser than
> last_saved_confirmed_flush = 0/15109B0 which will result in Assertion
> failure.
>
> This issue is happening because we allow setting the confirmed_flush
> to a backward position.
> There are a couple of ways to fix this:
> a) One way is to not update confirmed_flush if the lsn sent is an
> older value like in Confirm_flush_dont_allow_backward.patch
> b) Another way is to remove the assertion in
> CheckPointReplicationSlots and marking the slot as dirty only if
> confirmed_flush is greater than last_saved_confirmed_flush like in
> Assert_confirmed_flush_will_always_not_be_less_than_last_saved_confirmed_flush.patch
>
> I preferred the first approach.

I have created the following 

Re: Synchronizing slots from primary to standby

2024-02-20 Thread Amit Kapila
On Tue, Feb 20, 2024 at 6:19 PM Masahiko Sawada  wrote:
>
> Thank you for the explanation. It makes sense to me to move the check.
>
> As for ValidateSlotSyncParams() called by SlotSyncWorkerAllowed(), I
> have two comments:
>
> 1. The error messages are not very descriptive and seem not to match
> other messages the postmaster says. When starting the standby server
> with misconfiguration about the slotsync, I got the following messages
> from the postmaster:
>
> 2024-02-20 17:01:16.356 JST [456741] LOG:  database system is ready to
> accept read-only connections
> 2024-02-20 17:01:16.358 JST [456741] LOG:  bad configuration for slot
> synchronization
> 2024-02-20 17:01:16.358 JST [456741] HINT:  "hot_standby_feedback"
> must be enabled.
>
> It says "bad configuration" but the server is still working, and it does
> not give further information, such as whether it skipped starting the
> slotsync worker etc. I think these messages could work for the slotsync worker
> but we might want to have more descriptive messages for the
> postmaster. For example, "skipped starting slot sync worker because
> hot_standby_feedback is disabled".
>

We are planning to change it to something like: "slot synchronization
requires hot_standby_feedback to be enabled". See [1].

> 2. If the wal_level is not logical, the server will need to restart
> anyway to change the wal_level and have the slotsync worker work. Does
> it make sense to have the postmaster exit if the wal_level is not
> logical and sync_replication_slots is enabled? For instance, we have
> similar checks in PostmsaterMain():
>
> if (summarize_wal && wal_level == WAL_LEVEL_MINIMAL)
> ereport(ERROR,
> (errmsg("WAL cannot be summarized when wal_level is
> \"minimal\"")));
>

+1. I think giving an error in this case makes sense.

Miscellaneous comments:

1.
+void
+ShutDownSlotSync(void)
+{
+ SpinLockAcquire(&SlotSyncCtx->mutex);
+
+ SlotSyncCtx->stopSignaled = true;

This flag is never reset back. I think we should reset this once the
promotion is complete. Though offhand, I don't see any problem with
this but it doesn't look clean and can be a source of bugs in the
future.

2.
+char *
+CheckDbnameInConninfo(void)
 {
  char	   *dbname;

Let's name this function as CheckAndGetDbnameFromConninfo().

Apart from the above, I have made cosmetic changes in the attached.

[1] - 
https://www.postgresql.org/message-id/CAJpy0uBWomyAjP0zyFdzhGxn%2BXsAb2OdJA%2BKfNyZRv2nV6PD9g%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
diff --git a/src/backend/replication/logical/slotsync.c 
b/src/backend/replication/logical/slotsync.c
index d139d53173..c133fed6c2 100644
--- a/src/backend/replication/logical/slotsync.c
+++ b/src/backend/replication/logical/slotsync.c
@@ -1389,7 +1389,7 @@ ShutDownSlotSync(void)
 /*
  * SlotSyncWorkerCanRestart
  *
- * Returns true if enough time has passed (SLOTSYNC_RESTART_INTERVAL_SEC)
+ * Returns true if enough time (SLOTSYNC_RESTART_INTERVAL_SEC) has passed
  * since it was launched last. Otherwise returns false.
  *
  * This is a safety valve to protect against continuous respawn attempts if the
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index a2269c46f7..7e9bdf9e33 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1242,14 +1242,13 @@ restart:
 * happening here. The persistent synced slots are thus safe 
but there
 * is a possibility that the slot sync worker has created a 
temporary
 * slot (which stays active even on release) and we are trying 
to drop
-* the same here. In practice, the chances of hitting this 
scenario is
-* very less as during slot synchronization, the temporary slot 
is
-* immediately converted to persistent and thus is safe due to 
the
-* shared lock taken on the database. So for the time being, 
we'll
-* just bail out in such a scenario.
+* the here. In practice, the chances of hitting this scenario 
are less
+* as during slot synchronization, the temporary slot is 
immediately
+* converted to persistent and thus is safe due to the shared 
lock
+* taken on the database. So, we'll just bail out in such a 
case.
 *
-* XXX: If needed, we can consider shutting down slot sync 
worker
-* before trying to drop synced temporary slots here.
+* XXX: We can consider shutting down the slot sync worker 
before
+* trying to drop synced temporary slots here.
 */
if (active_pid)
ereport(ERROR,


Re: numeric_big in make check?

2024-02-20 Thread Dean Rasheed
On Mon, 19 Feb 2024 at 15:35, Tom Lane  wrote:
>
> I thought I'd try to acquire some actual facts here, so I compared
> the code coverage shown by "make check" as of HEAD, versus "make
> check" after adding numeric_big to parallel_schedule.  I saw the
> following lines of numeric.c as being covered in the second run
> and not the first:
>

I don't think that tells the whole story. Code coverage only tells you
whether a particular line of code has been hit, not whether it has
been properly tested with all the values that might lead to different
cases. For example:

> sqrt_var():
> 10073 res_ndigits = res_weight + 1 - (-rscale - 1) / DEC_DIGITS;
>

To get code coverage of this line, all you need to do is ensure that
sqrt_var() is called with rscale < -1 (which can only happen from the
range-reduction code in ln_var()). You could do that by computing
ln(1e50), which results in calling sqrt_var() with rscale = -2,
causing that line of code in sqrt_var() to be hit. That would satisfy
code coverage, but I would argue that you've only really tested that
line of code properly if you also hit it with rscale = -3, and
probably a few more values, to check that the round-down division is
working as intended.

Similarly, looking at commit 18a02ad2a5, the crucial code change was
the following in power_var():

-val = Max(val, -NUMERIC_MAX_RESULT_SCALE);
-val = Min(val, NUMERIC_MAX_RESULT_SCALE);
val *= 0.434294481903252;/* approximate decimal result weight */

Any test that calls numeric_power() is going to hit those lines of
code, but to see a failure, it was necessary to hit them with the
absolute value of "val" greater than NUMERIC_MAX_RESULT_SCALE, which
is why that commit added 2 new test cases to numeric_big, calling
power_var() with "val" outside that range. Code coverage is never
going to tell you whether or not that is being tested, since the code
change was to delete lines. Even if that weren't the case, any line of
code involving Min() or Max() has 2 branches, and code coverage won't
tell you if you've hit both of them.

> Pretty poor return on investment for the runtime consumed.  I don't
> object to adding something to numeric.sql that's targeted at adding
> coverage for these (or anyplace else that's not covered), but let's
> not just throw cycles at the problem.
>

I agree that blindly performing a bunch of large computations (just
throwing cycles at the problem) is not a good approach to testing. And
maybe that's a fair characterisation of parts of numeric_big. However,
it also contains some fairly well-targeted tests intended to test
specific edge cases that only occur with particular ranges of inputs,
which don't necessarily show up as increased code coverage.

So I think this requires more careful analysis. Simply deleting
numeric_big and adding tests to numeric.sql until the same level of
code coverage is achieved will not give the same level of testing.

It's also worth noting that the cost of running numeric_big has come
down very significantly over the last few years, as can be seen by
running the current numeric_big script against old backends. There's a
lot of random variation in the timings, but the trend is very clear:

9.5    1.641s
9.6    0.856s
10     0.845s
11     0.750s
12     0.760s
13     0.672s
14     0.430s
15     0.347s
16     0.336s

Arguably, this is a useful benchmark to spot performance regressions
and test proposed performance-improving patches.

If I run "EXTRA_TESTS=numeric_big make check | grep 'ok ' | sort
-nrk5", numeric_big is not in the top 10 most expensive tests (it's
usually down at around 15'th place).

Looking at the script itself, the addition, subtraction,
multiplication and division tests at the top are probably pointless,
since I would expect those operations to be tested adequately (and
probably more thoroughly) by the transcendental test cases. In fact, I
think it would probably be OK to delete everything above line 650, and
just keep the bottom half of the script -- the pow(), exp(), ln() and
log() tests, which cover various edge cases, as well as exercising
basic arithmetic operations internally. We might want to check that
I/O of large numerics is still being tested properly though.

If we did that, numeric_big would be even further down the list of
expensive tests, and I'd say it should be run by default.

Regards,
Dean




Re: Synchronizing slots from primary to standby

2024-02-20 Thread Masahiko Sawada
On Tue, Feb 20, 2024 at 12:44 PM Amit Kapila  wrote:
>
> On Tue, Feb 20, 2024 at 8:25 AM Masahiko Sawada  wrote:
> >
> > Some comments not related to the patch but to the existing code:
> >
> > ---
> > It might have already been discussed but is the
> > src/backend/replication/logical the right place for slotsync.c? If
> > it's independent of logical decoding/replication, wouldn't directly
> > under src/backend/replication be more appropriate?
> >

Thank you for the comment.

>
> This point has not been discussed, so thanks for raising it. I think
> the reasoning behind keeping it in logical is that this file contains
> a code for logical slot syncing and a worker doing that. As it is
> mostly about logical slot syncing so there is an argument to keep it
> under logical. In the future, we may need to extend this functionality
> to have a per-db slot sync worker as well in which case it will
> probably be again somewhat related to logical slots.

That's a valid argument.

> Having said that,
> there is an argument to keep it under replication as well because the
> functionality it provides is for physical standbys.

Another argument to keep it under replication is that all files under
the replication/logical directory are logical decoding and logical
replication infrastructure. IOW, they are functionality built on top
of (logical) replication slots. On the other hand, the slotsync worker
(and slotsync functionality) looks like part of slot management
functionality, which sits at the same layer as slot.c.

BTW the slotsync.c of v91 patch includes "replication/logical.h" but
it isn't actually necessary and #include'ing "replication/slot.h" is
sufficient.

Regards

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Synchronizing slots from primary to standby

2024-02-20 Thread Masahiko Sawada
On Tue, Feb 20, 2024 at 12:33 PM shveta malik  wrote:
>
> On Tue, Feb 20, 2024 at 8:25 AM Masahiko Sawada  wrote:
> >
> >
> > I've reviewed the v91 patch. Here are random comments:
>
> Thanks for the comments.
>
> > ---
> >  /*
> >   * Checks the remote server info.
> >   *
> > - * We ensure that the 'primary_slot_name' exists on the remote server and 
> > the
> > - * remote server is not a standby node.
> > + * Check whether we are a cascading standby. For non-cascading standbys, it
> > + * also ensures that the 'primary_slot_name' exists on the remote server.
> >   */
> >
> > IIUC what validate_remote_info() does doesn't change with this
> > patch, so the previous comment seems clearer to me.
> >
> > ---
> > if (remote_in_recovery)
> > +   {
> > +   /*
> > +* If we are a cascading standby, no need to check further for
> > +* 'primary_slot_name'.
> > +*/
> > ereport(ERROR,
> > errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> > errmsg("cannot synchronize replication slots from a
> > standby server"));
> > +   }
> > +   else
> > +   {
> > +   boolprimary_slot_valid;
> >
> > -   primary_slot_valid = DatumGetBool(slot_getattr(tupslot, 2, &isnull));
> > -   Assert(!isnull);
> > +   primary_slot_valid = DatumGetBool(slot_getattr(tupslot, 2, &isnull));
> > +   Assert(!isnull);
> >
> > -   if (!primary_slot_valid)
> > -   ereport(ERROR,
> > -   errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > -   errmsg("bad configuration for slot synchronization"),
> > -   /* translator: second %s is a GUC variable name */
> > -   errdetail("The replication slot \"%s\" specified by
> > \"%s\" does not exist on the primary server.",
> > - PrimarySlotName, "primary_slot_name"));
> > +   if (!primary_slot_valid)
> > +   ereport(ERROR,
> > +   errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > +   errmsg("bad configuration for slot synchronization"),
> > +   /* translator: second %s is a GUC variable name */
> > +   errdetail("The replication slot \"%s\" specified
> > by \"%s\" does not exist on the primary server.",
> > + PrimarySlotName, "primary_slot_name"));
> > +   }
> >
> > I think it's a refactoring rather than changes required by the
> > slotsync worker. We can do that in a separate patch but why do we need
> > this change in the first place?
>
> In v90, this refactoring was made due to the fact that
> validate_remote_info() was supposed to behave differently for SQL
> function and slot-sync worker. SQL-function was supposed to ERROR out
> while the worker was supposed to LOG and become no-op. And thus the
> change was needed. In v91, we made this functionality same i.e. both
> sql function and worker will error out but missed to remove this
> refactoring. Thanks for catching this, I will revert it in the next
> version. To match the refactoring, I made the comment change too, will
> revert that as well.
>
> > ---
> > +ValidateSlotSyncParams(ERROR);
> > +
> >  /* Load the libpq-specific functions */
> >  load_file("libpqwalreceiver", false);
> >
> > -ValidateSlotSyncParams();
> > +(void) CheckDbnameInConninfo();
> >
> > Is there any reason why we move where to check the parameters?
>
> Earlier DBname verification was done inside ValidateSlotSyncParams()
> and thus it was needed to load 'libpqwalreceiver' before we call this
> function. Now we have moved dbname verification in a separate call and
> thus the above order got changed. ValidateSlotSyncParams() is a common
> function used by SQL function and worker. Earlier slot sync worker was
> checking all the GUCs after starting up and was exiting each time any
> GUC was invalid. It was suggested that it would be better to check the
> GUCs before starting the slot sync worker in the postmaster itself,
> making the ValidateSlotSyncParams() move to postmaster (see
> SlotSyncWorkerAllowed).  But it was not a good idea to load libpq in
> postmaster and thus we moved libpq related verification out of
> ValidateSlotSyncParams(). This resulted in the above change.  I hope
> it answers your query.

Thank you for the explanation. It makes sense to me to move the check.

As for ValidateSlotSyncParams() called by SlotSyncWorkerAllowed(), I
have two comments:

1. The error messages are not very descriptive and seem not to match
other messages the postmaster says. When starting the standby server
with misconfiguration about the slotsync, I got the following messages
from the postmaster:

2024-02-20 17:01:16.356 JST [456741] LOG:  database system is ready to
accept read-only connections
2024-02-20 17:01:16.358 JST [456741] LOG:  bad configuration for slot
synchronization
2024-02-20 17:01:16.358 JST [456741] HINT:  "hot_standby_feedback"
must be enabled.

It says "bad configuration" but is still working, and 

Re: Replace current implementations in crypt() and gen_salt() to OpenSSL

2024-02-20 Thread Peter Eisentraut

On 20.02.24 12:39, Daniel Gustafsson wrote:

A fifth option is to throw away our in-tree implementations and use the OpenSSL
API's for everything, which is where this thread started.  If the effort to
payoff ratio is palatable to anyone then patches are for sure welcome.


The problem is that, as I understand it, these crypt routines are not 
designed in a way that you can just plug in a crypto library underneath. 
 Effectively, the definition of what, say, blowfish crypt does, is 
whatever is in that source file, and transitively, whatever OpenBSD 
does.  (Fun question: Does OpenBSD care about FIPS?)  Of course, you 
could reimplement the same algorithms independently, using OpenSSL or 
whatever.  But I don't think this will really improve the state of the 
world in aggregate, because to a large degree we are relying on the 
upstream to keep these implementations maintained, and if we rewrite 
them, we become the upstream.






Re: Replace current implementations in crypt() and gen_salt() to OpenSSL

2024-02-20 Thread Daniel Gustafsson
> On 20 Feb 2024, at 13:24, Robert Haas  wrote:
> 
> On Tue, Feb 20, 2024 at 5:09 PM Daniel Gustafsson  wrote:
>> A fifth option is to throw away our in-tree implementations and use the 
>> OpenSSL
>> API's for everything, which is where this thread started.  If the effort to
>> payoff ratio is palatable to anyone then patches are for sure welcome.
> 
> That generally seems fine, although I'm fuzzy on what our policy
> actually is. We have fallback implementations for some things and not
> others, IIRC.

I'm not sure there is a well-formed policy, but IIRC the idea with cryptohash
was to provide in-core functionality iff OpenSSL isn't used, and only use the
OpenSSL implementations if it is.  Since pgcrypto cannot be built without
OpenSSL (since db7d1a7b0530e8cbd045744e1c75b0e63fb6916f) I don't think it's a
problem to continue the work from that commit and replace more with OpenSSL
implementations.

--
Daniel Gustafsson





Re: Replace current implementations in crypt() and gen_salt() to OpenSSL

2024-02-20 Thread Peter Eisentraut

On 20.02.24 12:27, Robert Haas wrote:

I don't think the first two of these proposals help anything. AIUI,
FIPS mode is supposed to be a system wide toggle that affects
everything on the machine. The third one might help if you can be
compliant by just choosing not to install that extension, and the
fourth one solves the problem by sledgehammer.

Does Linux provide some way of asking whether "fips=1" was specified
at kernel boot time?


What you are describing only happens on Red Hat systems, I think.  They 
have built additional integration around this, which is great.  But 
that's not something you can rely on being the case on all systems, not 
even all Linux systems.





RE: Have pg_basebackup write "dbname" in "primary_conninfo"?

2024-02-20 Thread Hayato Kuroda (Fujitsu)
Dear Robert,

> Seems weird to me. You don't use dbname=replication to ask for a
> replication connection, so why would we ever end up with that
> anywhere? And especially in only one of two such closely related
> cases?

Just FYI - here is an extreme case. And note that I have applied the proposed patch.

When `pg_basebackup -D data_N2 -R` is used:
```
primary_conninfo = 'user=hayato ... dbname=hayato ...
```

But when `pg_basebackup -d "" -D data_N2 -R` is used:
```
primary_conninfo = 'user=hayato ... dbname=replication
```

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



Re: Replace current implementations in crypt() and gen_salt() to OpenSSL

2024-02-20 Thread Robert Haas
On Tue, Feb 20, 2024 at 5:09 PM Daniel Gustafsson  wrote:
> A fifth option is to throw away our in-tree implementations and use the 
> OpenSSL
> API's for everything, which is where this thread started.  If the effort to
> payoff ratio is palatable to anyone then patches are for sure welcome.

That generally seems fine, although I'm fuzzy on what our policy
actually is. We have fallback implementations for some things and not
others, IIRC.

> > Does Linux provide some way of asking whether "fips=1" was specified
> > at kernel boot time?
>
> There is a crypto.fips_enabled sysctl but I have no idea how portable that is
> across distributions etc.

My guess would be that it's pretty portable, but my guesses about
Linux might not be very good. Still, if we wanted to go this route, it
probably wouldn't be too hard to figure out how portable this is.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Integer underflow in fprintf in dsa.c

2024-02-20 Thread Robert Haas
On Tue, Feb 20, 2024 at 5:30 PM Daniel Gustafsson  wrote:
> The message "at least 0 contiguous pages free" reads a bit nonsensical though,
> wouldn't it be preferable to check for i being zero and print a custom
> message for that case? Something like the below untested sketch?
>
> +   if (i == 0)
> +   fprintf(stderr,
> +   "segment bin %zu (no contiguous free pages):\n", i);
> +   else
> +   fprintf(stderr,
> +   "segment bin %zu (at least %d contiguous pages free):\n",
> +   i, 1 << (i - 1));

That does seem reasonable. However, this is just debugging code, so it
also probably isn't necessary to sweat anything too much.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: logical decoding and replication of sequences, take 2

2024-02-20 Thread Tomas Vondra



On 2/20/24 06:54, Amit Kapila wrote:
> On Thu, Dec 21, 2023 at 6:47 PM Tomas Vondra
>  wrote:
>>
>> On 12/19/23 13:54, Christophe Pettus wrote:
>>> Hi,
>>>
>>> I wanted to hop in here on one particular issue:
>>>
 On Dec 12, 2023, at 02:01, Tomas Vondra  
 wrote:
 - desirability of the feature: Random IDs (UUIDs etc.) are likely a much
 better solution for distributed (esp. active-active) systems. But there
 are important use cases that are likely to keep using regular sequences
 (online upgrades of single-node instances, existing systems, ...).
>>>
>>> +1.
>>>
>>> Right now, the lack of sequence replication is a rather large
>>> foot-gun on logical replication upgrades.  Copying the sequences
>>> over during the cutover period is doable, of course, but:
>>>
>>> (a) There's no out-of-the-box tooling that does it, so everyone has
>>> to write some scripts just for that one function.
>>>
>>> (b) It's one more thing that extends the cutover window.
>>>
>>
>> I agree it's an annoying gap for this use case. But if this is the only
>> use case, maybe a better solution would be to provide such tooling
>> instead of adding it to logical decoding?
>>
>> It might seem a bit strange if most data is copied by replication
>> directly, while sequences need special handling, ofc.
>>
> 
> One difference between the logical replication of tables and sequences
> is that we can guarantee with synchronous_commit (and
> synchronous_standby_names) that after failover transactions data is
> replicated or not whereas for sequences we can't guarantee that
> because of their non-transactional nature. Say, there are two
> transactions T1 and T2; it is possible that T1's entire table data and
> sequence data are committed and replicated, but only T2's sequence data
> is replicated. So, after failover to the logical subscriber in such a
> case, if one routes T2 again to the new node (as it was not successful
> previously) then it would needlessly perform the sequence changes
> again. I don't know how much that matters, but that would probably be
> the difference between the replication of tables and sequences.
> 

I don't quite follow what the problem with synchronous_commit is :-(

For sequences, we log the changes ahead, i.e. even if nextval() did not
write anything into WAL, it's still safe because these changes are
covered by the WAL generated some time ago (up to ~32 values back). And
that's certainly subject to synchronous_commit, right?
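For readers following along, the pre-logging behavior being referred to can be sketched in a few lines. This is a simplification with illustrative names and constants, not the actual sequence.c logic:

```c
#include <assert.h>

/* Standalone sketch of the "log ahead" idea: nextval() WAL-logs a value
 * ~32 ahead of the one it hands out, so most calls write no WAL at all.
 * SEQ_LOG_VALS and the field names are illustrative. */
#define SEQ_LOG_VALS 32

typedef struct Seq
{
	long	last;			/* last value handed out */
	long	wal_logged;		/* values up to here are covered by WAL */
	int		wal_records;	/* WAL records emitted so far */
} Seq;

static long
seq_nextval(Seq *seq)
{
	seq->last++;
	if (seq->last > seq->wal_logged)
	{
		/* pre-log the next SEQ_LOG_VALS values in a single record */
		seq->wal_logged = seq->last + SEQ_LOG_VALS;
		seq->wal_records++;
	}
	return seq->last;
}
```

In this toy model, 100 nextval() calls emit only 4 WAL records, and after a crash the sequence can only jump forward to the pre-logged value, never go backwards; that is why each returned value is covered by WAL written earlier and thus subject to synchronous_commit.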

There certainly are issues with sequences and syncrep:

https://www.postgresql.org/message-id/712cad46-a9c8-1389-aef8-faf0203c9...@enterprisedb.com

but that's unrelated to logical replication.

FWIW I don't think we'd re-apply sequence changes needlessly, because
the worker does update the origin after applying non-transactional
changes. So after the replication gets restarted, we'd skip what we
already applied, no?

But maybe there is an issue and I'm just not getting it. Could you maybe
share an example of T1/T2, with a replication restart and what you think
would happen?

> I agree with your point above that for upgrades some tool like
> pg_copysequence where we can provide a way to copy sequence data to
> subscribers from the publisher would suffice the need.
> 

Perhaps. Unfortunately it doesn't quite work for failovers, and it's yet
another tool users would need to use.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Integer underflow in fprintf in dsa.c

2024-02-20 Thread Daniel Gustafsson
> On 20 Feb 2024, at 12:28, Ильясов Ян  wrote:

> ​fprintf(stderr,
>"segment bin %zu (at least %d contiguous pages free):\n",
>i, 1 << (i - 1));
> 
> In case i equals zero the user will get "at least -2147483648 contiguous
> pages free".

That does indeed seem like an oversight.

> I believe that this is a mistake, and fprintf should print "at least 0
> contiguous pages free" in case i equals zero.

The message "at least 0 contiguous pages free" reads a bit nonsensical though,
wouldn't it be preferable to check for i being zero and print a custom message
for that case? Something like the below untested sketch?

+   if (i == 0)
+   fprintf(stderr,
+   "segment bin %zu (no contiguous free pages):\n", i);
+   else
+   fprintf(stderr,
+   "segment bin %zu (at least %d contiguous pages free):\n",
+   i, 1 << (i - 1));

--
Daniel Gustafsson





Re: Replace current implementations in crypt() and gen_salt() to OpenSSL

2024-02-20 Thread Daniel Gustafsson
> On 20 Feb 2024, at 12:18, Peter Eisentraut  wrote:

> I think we are going about this the wrong way.  It doesn't make sense to ask 
> OpenSSL what a piece of code that doesn't use OpenSSL should do.

Given that pgcrypto cannot be built without OpenSSL, and ideally we should be
using the OpenSSL implementations for everything, I don't think it's too far
fetched.

--
Daniel Gustafsson





Re: Replace current implementations in crypt() and gen_salt() to OpenSSL

2024-02-20 Thread Daniel Gustafsson
> On 20 Feb 2024, at 12:27, Robert Haas  wrote:
> 
> On Tue, Feb 20, 2024 at 4:49 PM Peter Eisentraut  wrote:
>> I think there are several less weird ways to address this:
>> 
>> * Just document it.
>> 
>> * Make a pgcrypto-level GUC setting.
>> 
>> * Split out these functions into a separate extension.
>> 
>> * Deprecate these functions.
>> 
>> Or some combination of these.
> 
> I don't think the first two of these proposals help anything. AIUI,
> FIPS mode is supposed to be a system wide toggle that affects
> everything on the machine. The third one might help if you can be
> compliant by just choosing not to install that extension, and the
> fourth one solves the problem by sledgehammer.

A fifth option is to throw away our in-tree implementations and use the OpenSSL
API's for everything, which is where this thread started.  If the effort to
payoff ratio is palatable to anyone then patches are for sure welcome.

> Does Linux provide some way of asking whether "fips=1" was specified
> at kernel boot time?

There is a crypto.fips_enabled sysctl but I have no idea how portable that is
across distributions etc.
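For reference, the same value is exposed under /proc on Linux, so a runtime check could be sketched as below. This is Linux-specific and purely illustrative, not a proposed patch:

```c
#include <assert.h>
#include <stdio.h>

/*
 * Returns 1 or 0 from the crypto.fips_enabled sysctl, or -1 when it
 * cannot be read (non-Linux systems, or kernels without the knob).
 */
static int
fips_mode_enabled(void)
{
	FILE	   *f = fopen("/proc/sys/crypto/fips_enabled", "r");
	int			v;

	if (f == NULL)
		return -1;
	if (fscanf(f, "%d", &v) != 1)
		v = -1;
	fclose(f);
	return v;
}
```

The -1 case is exactly the portability question raised above: on systems without this file, the caller would have to fall back to some other policy.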

--
Daniel Gustafsson





Re: speed up a logical replica setup

2024-02-20 Thread vignesh C
On Mon, 19 Feb 2024 at 11:15, Hayato Kuroda (Fujitsu)
 wrote:
>
> Dear hackers,
>
> > Since it may be useful, I will post top-up patch on Monday, if there are no
> > updating.
>
> And here are top-up patches. Feel free to check and include.
>
> v22-0001: Same as v21-0001.
> === rebased patches ===
> v22-0002: Update docs per recent changes. Same as v20-0002.
> v22-0003: Add check versions of the target. Extracted from v20-0003.
> v22-0004: Remove -S option. Mostly same as v20-0009, but commit message was
>   slightly changed.
> === Newbie ===
> V22-0005: Addressed my comments which seems to be trivial[1].
>   Comments #1, 3, 4, 8, 10, 14, 17 were addressed here.
> v22-0006: Consider the scenario when commands fail after the recovery.
>   drop_subscription() is removed and some messages are added per [2].
> V22-0007: Revise server_is_in_recovery() per [1]. Comments #5, 6, 7, were 
> addressed here.
> V22-0008: Fix a strange report when physical_primary_slot is null. Per 
> comment #9 [1].
> V22-0009: Prohibit reuse publications when it has already existed. Per 
> comments #11 and 12 [1].
> V22-0010: Avoid to call PQclear()/PQfinish()/pg_free() if the process exits 
> soon. Per comment #15 [1].
> V22-0011: Update testcode. Per comments #17- [1].
>
> I did not handle the below points because I have some unclear points.
>
> a.
> This patch set cannot detect a disconnection between the target (standby)
> and source (primary) during the catch-up, because the connection status
> must be obtained at the same time (i.e., in the same query) as the
> recovery status, but it is now an independent function
> (server_is_in_recovery()).
>
> b.
> This patch set cannot detect the inconsistency reported by Shubham [3]. I 
> could not
> come up with solutions without removing -P...

Few comments regarding the documentation:
1) max_replication_slots information seems to be present couple of times:

+
+ The target instance must have
+ max_replication_slots
+ and max_logical_replication_workers
+ configured to a value greater than or equal to the number of target
+ databases.
+

+   
+
+ The target instance must have
+ max_replication_slots
+ configured to a value greater than or equal to the number of target
+ databases and replication slots.
+
+   

2) Can we add an id to prerequisites and use it instead of referring
to r1-app-pg_createsubscriber-1:
- pg_createsubscriber checks if the
given target data
- directory has the same system identifier than the source data directory.
- Since it uses the recovery process as one of the steps, it starts the
- target server as a replica from the source server. If the system
- identifier is not the same,
pg_createsubscriber will
- terminate with an error.
+ Checks the target can be converted.  In particular, things listed in
+ above section would be
+ checked.  If these are not met
pg_createsubscriber
+ will terminate with an error.
 

3) The code also checks the following:
 Verify if a PostgreSQL binary (progname) is available in the same
directory as pg_createsubscriber.

But this is not present in the prerequisites section of the documentation.

4) Here we mention that the target server should be stopped, but the
same is not mentioned in prerequisites:
+   Here is an example of using pg_createsubscriber.
+   Before running the command, please make sure target server is stopped.
+
+$ pg_ctl -D /usr/local/pgsql/data stop
+
+

5) If there is an error during any of the pg_createsubscriber
operations, e.g. if CREATE SUBSCRIPTION fails, it might not be possible
to roll back to the earlier state which had physical-standby
replication. I felt we should document this and also add it to the
console message like how we do in case of pg_upgrade.

Regards,
Vignesh




Re: A new message seems missing a punctuation

2024-02-20 Thread Amit Kapila
On Tue, Feb 20, 2024 at 4:50 PM Robert Haas  wrote:
>
> On Tue, Feb 20, 2024 at 4:42 PM Amit Kapila  wrote:
> > > So why do we log a message about this?
> >
> > This was added after the main commit of this functionality to find
> > some BF failures (where we were expecting the slot to sync but due to
> > one of these conditions not being met the slot was not synced) and we
> > can probably change it to DEBUG1 as well. I think we would need this
> > information w.r.t this functionality to gather more information in
> > case expected slots are not being synced and it may be helpful for
> > users to also know why the slots are not synced, if that happens.
>
> Ah, OK. Do you think we need any kind of system view to provide more
> insight here or is a log message sufficient?
>

We do expose the required information (restart_lsn, catalog_xmin,
synced, temporary, etc.) via pg_replication_slots. So, I feel the LOG
message here is sufficient to debug (or know the details of) cases
where the slot sync doesn't succeed.

-- 
With Regards,
Amit Kapila.




Integer underflow in fprintf in dsa.c

2024-02-20 Thread Ильясов Ян
Hello hackers,

Using Svace* I think I've found a little bug in src/backend/utils/mmgr/dsa.c.
This bug is presented in REL_12_STABLE, REL_13_STABLE, REL_14_STABLE,
REL_15_STABLE, REL_16_STABLE and master. I see that it was introduced together
with dynamic shared memory areas in the commit 
13df76a537cca3b8884911d8fdf7c89a457a8dd3.
I also see that at least two people have encountered this fprintf output.
(https://postgrespro.com/list/thread-id/2419512,
https://www.postgresql.org/message-id/15e9501170d.e4b5a3858707.3339083113985275726%40zohocorp.com)

​fprintf(stderr,
   "segment bin %zu (at least %d contiguous pages free):\n",
   i, 1 << (i - 1));

In case i equals zero the user will get "at least -2147483648 contiguous
pages free".
I believe that this is a mistake, and fprintf should print "at least 0
contiguous pages free" in case i equals zero.

The patch that has a fix of this is attached.
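The guarded expression is easy to sanity-check in isolation; a minimal standalone sketch (separate from the server code) behaves as intended:

```c
#include <assert.h>
#include <stddef.h>

/* Guarded bin label: bin 0 holds segments with no contiguous free
 * pages; bin i (i > 0) those with at least 2^(i-1). The guard avoids
 * evaluating 1 << (i - 1) at i == 0, which produced the bogus
 * -2147483648 in the report. */
static int
contiguous_pages_floor(size_t i)
{
	return (i != 0) ? 1 << (i - 1) : 0;
}
```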

* ​- https://svace.pages.ispras.ru/svace-website/en/

Kind regards,
Ian Ilyasov.

Junior Software Developer at Postgres Professional
Subject: [PATCH] Integer underflow fix in fprintf in dsa.c.
---
Index: src/backend/utils/mmgr/dsa.c
<+>UTF-8
===
diff --git a/src/backend/utils/mmgr/dsa.c b/src/backend/utils/mmgr/dsa.c
--- a/src/backend/utils/mmgr/dsa.c	(revision b78fa8547d02fc72ace679fb4d5289dccdbfc781)
+++ b/src/backend/utils/mmgr/dsa.c	(date 1708426298001)
@@ -1107,7 +1107,7 @@
 
 			fprintf(stderr,
 	"segment bin %zu (at least %d contiguous pages free):\n",
-	i, 1 << (i - 1));
+	i, i != 0 ? 1 << (i - 1) : 0);
 			segment_index = area->control->segment_bins[i];
 			while (segment_index != DSA_SEGMENT_INDEX_NONE)
 			{


Re: Replace current implementations in crypt() and gen_salt() to OpenSSL

2024-02-20 Thread Robert Haas
On Tue, Feb 20, 2024 at 4:49 PM Peter Eisentraut  wrote:
> I think there are several less weird ways to address this:
>
> * Just document it.
>
> * Make a pgcrypto-level GUC setting.
>
> * Split out these functions into a separate extension.
>
> * Deprecate these functions.
>
> Or some combination of these.

I don't think the first two of these proposals help anything. AIUI,
FIPS mode is supposed to be a system wide toggle that affects
everything on the machine. The third one might help if you can be
compliant by just choosing not to install that extension, and the
fourth one solves the problem by sledgehammer.

Does Linux provide some way of asking whether "fips=1" was specified
at kernel boot time?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: A new message seems missing a punctuation

2024-02-20 Thread Robert Haas
On Tue, Feb 20, 2024 at 4:42 PM Amit Kapila  wrote:
> > So why do we log a message about this?
>
> This was added after the main commit of this functionality to find
> some BF failures (where we were expecting the slot to sync but due to
> one of these conditions not being met the slot was not synced) and we
> can probably change it to DEBUG1 as well. I think we would need this
> information w.r.t this functionality to gather more information in
> case expected slots are not being synced and it may be helpful for
> users to also know why the slots are not synced, if that happens.

Ah, OK. Do you think we need any kind of system view to provide more
insight here or is a log message sufficient?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Replace current implementations in crypt() and gen_salt() to OpenSSL

2024-02-20 Thread Peter Eisentraut

On 20.02.24 11:09, Daniel Gustafsson wrote:

On 20 Feb 2024, at 10:56, Koshi Shibagaki (Fujitsu) 
 wrote:



Let me confirm the discussion in threads. I think there are two topics.
1. prohibit the use of ciphers disallowed in FIPS mode at the level of block
cipher (crypt-bf, etc...) in crypt() and gen_salt()


That level might be overkill given that any cipher not in the FIPS certified
module mustn't be used, but it's also not the wrong place to put it IMHO.


I think we are going about this the wrong way.  It doesn't make sense to 
ask OpenSSL what a piece of code that doesn't use OpenSSL should do. 
(And would that even give a sensible answer?  Like, you can configure 
OpenSSL to load the fips module, but you can also load the legacy module 
alongside it(??).)  And as you say, even if this code supported modern 
block ciphers, it wouldn't be FIPS compliant.


I think there are several less weird ways to address this:

* Just document it.

* Make a pgcrypto-level GUC setting.

* Split out these functions into a separate extension.

* Deprecate these functions.

Or some combination of these.





Re: Change the bool member of the Query structure to bits

2024-02-20 Thread Tomas Vondra
On 2/20/24 11:11, Quan Zongliang wrote:
> 
> Sorry. I forgot to save a file. This is the latest.
> 
> On 2024/2/20 18:07, Quan Zongliang wrote:
>>
>> The Query structure has an increasing number of bool attributes. This
>> is likely to increase in the future. And they have the same
>> properties. Wouldn't it be better to store them in bits? Common
>> statements don't use them, so they have little impact. This also saves
>> memory space.
>>

Hi,

Are we really adding bools to Query that often? A bit of git-blame says
it's usually multiple years to add a single new flag, which is what I'd
expect. I doubt that'll change.

As for the memory savings, can you quantify how much memory this would save?

I highly doubt that's actually true (or at least measurable). The Query
struct has ~256B, the patch cuts that to ~232B. But we allocate stuff in
power-of-2, so we'll allocate 256B chunk anyway. And we allocate very
few of those objects anyway ...
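To put rough numbers on it, here is a hypothetical before/after of the layout; the field and flag names are illustrative, not the actual Query definition:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* "Before": one bool per property, as Query has today. */
typedef struct QueryWithBools
{
	int			type;
	bool		hasAggs;
	bool		hasWindowFuncs;
	bool		hasTargetSRFs;
	bool		hasSubLinks;
	bool		hasDistinctOn;
	bool		hasRecursive;
	bool		hasModifyingCTE;
	bool		hasForUpdate;
} QueryWithBools;

/* "After": one bit per property, packed into a single word. */
enum
{
	Q_HASAGGS = 1u << 0,
	Q_HASWINDOWFUNCS = 1u << 1,
	Q_HASTARGETSRFS = 1u << 2,
	/* ... one bit per former bool ... */
};

typedef struct QueryWithFlags
{
	int			type;
	uint32_t	flags;
} QueryWithFlags;
```

On a typical platform the packed struct is a few bytes smaller, but as noted above, power-of-2 chunk allocation means the saving may not actually be observable for Query.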

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: POC, WIP: OR-clause support for indexes

2024-02-20 Thread Ranier Vilela
On Tue, Feb 20, 2024 at 00:18, Andrei Lepikhov
<a.lepik...@postgrespro.ru> wrote:

> On 19/2/2024 19:53, Ranier Vilela wrote:
> > v17-0002
> > 1) move the vars *arrayconst and *dest, to after if, to avoid makeNode
> > (palloc).
> > + Const   *arrayconst;
> > + ScalarArrayOpExpr  *dest;
> > +
> > + pd = (PredicatesData *) lfirst(lc);
> > + if (pd->elems == NIL)
> > + /* The index doesn't participate in this operation */
> > + continue;
> >
> > + arrayconst = lsecond_node(Const, saop->args);
> > + dest = makeNode(ScalarArrayOpExpr);
> Thanks for the review!
> I'm not sure I understand you clearly. Does the patch in attachment fix
> the issue you raised?
>
Sorry if I wasn't clear.
What I meant is to move the initializations of the variables *arrayconst*
and *dest* to after the test (if (pd->elems == NIL)),
to avoid unnecessary initialization if the test fails.
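In other words, the pattern being suggested looks like this standalone sketch, with a counter standing in for the makeNode()/palloc cost (all names here are illustrative):

```c
#include <assert.h>
#include <stddef.h>

typedef struct PredicatesData
{
	int			nelems;			/* zero means: skip this entry */
} PredicatesData;

static int	allocations;		/* stands in for makeNode()/palloc calls */

static void *
alloc_node(void)
{
	allocations++;
	return &allocations;
}

static int
process(PredicatesData *pds, int n)
{
	int			handled = 0;

	for (int i = 0; i < n; i++)
	{
		PredicatesData *pd = &pds[i];

		if (pd->nelems == 0)
			continue;			/* bail out before paying for the node */

		/* initialization is only reached when actually needed */
		void	   *dest = alloc_node();

		(void) dest;
		handled++;
	}
	return handled;
}
```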

best regards,
Ranier Vilela


Re: Streaming read-ready sequential scan code

2024-02-20 Thread Robert Haas
On Tue, Feb 20, 2024 at 4:35 AM Melanie Plageman
 wrote:
> I've written three alternative implementations of the actual streaming
> read user for sequential scan which handle the question of where to
> allocate the streaming read object and how to handle changing scan
> direction in different ways.

It's weird to me that the prospect of changing the scan direction
causes such complexity. I mean, why doesn't a streaming read object
have a forget_all_my_previous_requests() method or somesuch?
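A hypothetical shape for such a method (the actual streaming read API may look quite different; this only illustrates dropping queued readahead on a direction change):

```c
#include <assert.h>

#define MAX_PENDING 64

/* Toy streaming read object holding not-yet-issued readahead requests. */
typedef struct StreamingRead
{
	long		pending[MAX_PENDING];	/* queued block numbers */
	int			npending;
} StreamingRead;

static void
streaming_read_queue(StreamingRead *sr, long blkno)
{
	if (sr->npending < MAX_PENDING)
		sr->pending[sr->npending++] = blkno;
}

/* Drop everything queued so far, e.g. when the scan direction flips,
 * instead of tearing down and rebuilding the whole object. */
static void
streaming_read_forget_all(StreamingRead *sr)
{
	sr->npending = 0;
}
```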

-- 
Robert Haas
EDB: http://www.enterprisedb.com



