Re: logical decoding and replication of sequences, take 2
On Wed, Feb 21, 2024 at 1:06 PM Dilip Kumar wrote:
> > But I am wondering why this flag is always set to true in
> > DecodeTXNNeedSkip() irrespective of the commit or abort. Because the
> > aborted transactions are not supposed to be replayed? So if my
> > observation is correct that for the aborted transaction, this
> > shouldn't be set to true then we have a problem with sequence where we
> > are identifying the transactional changes as non-transaction changes
> > because now for transactional changes this should depend upon commit
> > status.
>
> I have checked this case with Amit Kapila. So it seems in the cases
> where we have sent the prepared transaction or streamed in-progress
> transaction we would need to send the abort also, and for that reason,
> we are setting 'ctx->processing_required' as true so that if these
> WALs are not streamed we do not allow upgrade of such slots.

I don't find this explanation clear enough for me to understand.

--
Robert Haas
EDB: http://www.enterprisedb.com
Re: ALTER TABLE SET ACCESS METHOD on partitioned tables
On 21.02.24 07:40, Michael Paquier wrote:
> This means that all partitioned tables would have pg_class.relam set, and that relam would never be 0:
> - The USING clause takes priority over default_table_access_method.
> - If no USING clause, default_table_access_method is the AM used.
> Any partitions created from this partitioned table would inherit the AM set, ignoring default_table_access_method.
>
> Alvaro has made a very good point a couple of days ago at [1]: we should try to make the behavior stick closer to tablespaces, where it could be possible to set relam to 0 for a partitioned table, in which case a partition would inherit the AM set in the GUC when a USING clause is not defined (if USING specifies the AM, we'd just use it).

Yes, I think most people agreed that that would be the preferred behavior.
Re: Injection points: some tools to wait and wake
On Wed, Feb 21, 2024 at 07:08:03AM +0900, Michael Paquier wrote: > Well, both you and Andrey are asking for it now, so let's do it. The > implementation is simple: > - Store in InjectionPointSharedState an array of wait_counts and an > array of names. There is only one condition variable. > - When a point wants to wait, it takes the spinlock and looks within > the array of names until it finds a free slot, adds its name into the > array to reserve a wait counter at the same position, releases the > spinlock. Then it loops on the condition variable for an update of > the counter it has reserved. It is possible to make something more > efficient, but at a small size it would not really matter. > - The wakeup takes a point name in argument, acquires the spinlock, > and checks if it can find the point into the array, pinpoints the > location of the counter to update and updates it. Then it broadcasts > the change. > - The wait loop checks its counter, leaves its loop, cancels the > sleep, takes the spinlock to unregister from the array, and leaves. > > I would just hardcode the number of points that can wait, say 5 of > them tracked in shmem? Does that look like what you are looking at? I was looking at that, and it proves to work OK, so you can do stuff like waits and wakeups for multiple processes in a controlled manner. The attached patch authorizes up to 32 waiters. I have switched things so as the information reported in pg_stat_activity is the name of the injection point itself. Comments and ideas are welcome. -- Michael From 75302cba302b83ce2a6d6eaf30b163f473b87276 Mon Sep 17 00:00:00 2001 From: Michael Paquier Date: Wed, 21 Feb 2024 16:36:25 +0900 Subject: [PATCH v2 1/2] injection_points: Add routines to wait and wake processes This commit is made of two parts: - A new callback that can be attached to a process to make it wait on a condition variable. The condition checked is registered in shared memory by the module injection_points. 
- A new SQL function to update the shared state and broadcast the update using a condition variable. The shared state used by the module is registered using the DSM registry, and is optional. --- .../injection_points--1.0.sql | 10 ++ .../injection_points/injection_points.c | 151 ++ src/tools/pgindent/typedefs.list | 1 + 3 files changed, 162 insertions(+) diff --git a/src/test/modules/injection_points/injection_points--1.0.sql b/src/test/modules/injection_points/injection_points--1.0.sql index 5944c41716..eed0310cf6 100644 --- a/src/test/modules/injection_points/injection_points--1.0.sql +++ b/src/test/modules/injection_points/injection_points--1.0.sql @@ -24,6 +24,16 @@ RETURNS void AS 'MODULE_PATHNAME', 'injection_points_run' LANGUAGE C STRICT PARALLEL UNSAFE; +-- +-- injection_points_wakeup() +-- +-- Wakes a condition variable waited on in an injection point. +-- +CREATE FUNCTION injection_points_wakeup(IN point_name TEXT) +RETURNS void +AS 'MODULE_PATHNAME', 'injection_points_wakeup' +LANGUAGE C STRICT PARALLEL UNSAFE; + -- -- injection_points_detach() -- diff --git a/src/test/modules/injection_points/injection_points.c b/src/test/modules/injection_points/injection_points.c index e843e6594f..052b20f9c8 100644 --- a/src/test/modules/injection_points/injection_points.c +++ b/src/test/modules/injection_points/injection_points.c @@ -18,18 +18,72 @@ #include "postgres.h" #include "fmgr.h" +#include "storage/condition_variable.h" #include "storage/lwlock.h" #include "storage/shmem.h" +#include "storage/dsm_registry.h" #include "utils/builtins.h" #include "utils/injection_point.h" #include "utils/wait_event.h" PG_MODULE_MAGIC; +/* Maximum number of wait usable in injection points at once */ +#define INJ_MAX_WAIT 32 +#define INJ_NAME_MAXLEN 64 + +/* Shared state information for injection points. 
 */ +typedef struct InjectionPointSharedState +{ + /* protects accesses to wait_counts */ + slock_t lock; + + /* Counters advancing when injection_points_wakeup() is called */ + int wait_counts[INJ_MAX_WAIT]; + + /* Names of injection points attached to wait counters */ + char name[INJ_MAX_WAIT][INJ_NAME_MAXLEN]; + + /* + * Condition variable used for waits and wakeups, checking upon the set of + * wait_counts when waiting. + */ + ConditionVariable wait_point; +} InjectionPointSharedState; + +/* Pointer to shared-memory state. */ +static InjectionPointSharedState *inj_state = NULL; + extern PGDLLEXPORT void injection_error(const char *name); extern PGDLLEXPORT void injection_notice(const char *name); +extern PGDLLEXPORT void injection_wait(const char *name); +static void +injection_point_init_state(void *ptr) +{ + InjectionPointSharedState *state = (InjectionPointSharedState *) ptr; + + SpinLockInit(&state->lock); + memset(state->wait_counts, 0, sizeof(state->wait_counts)); + memset(state->name, 0, sizeof(state->name)); + ConditionVariableInit(&state->wait_point); +} + +static void
Re: logical decoding and replication of sequences, take 2
On Tue, Feb 20, 2024 at 4:32 PM Dilip Kumar wrote: > > On Tue, Feb 20, 2024 at 3:38 PM Robert Haas wrote: > > > Let's say fast_forward is true. Then smgr_decode() is going to skip > > recording anything about the relfilenode, so we'll identify all > > sequence changes as non-transactional. But look at how this case is > > handled in seq_decode(): > > > > + if (ctx->fast_forward) > > + { > > + /* > > + * We need to set processing_required flag to notify the sequence > > + * change existence to the caller. Usually, the flag is set when > > + * either the COMMIT or ABORT records are decoded, but this must be > > + * turned on here because the non-transactional logical message is > > + * decoded without waiting for these records. > > + */ > > + if (!transactional) > > + ctx->processing_required = true; > > + > > + return; > > + } > > It appears that the 'processing_required' flag was introduced as part > of supporting upgrades for logical replication slots. Its purpose is > to determine whether a slot is fully caught up, meaning that there are > no pending decodable changes left before it can be upgraded. > > So now if some change was transactional but we have identified it as > non-transaction then we will mark this flag 'ctx->processing_required > = true;' so we temporarily set this flag incorrectly, but even if the > flag would have been correctly identified initially, it would have > been set again to true in the DecodeTXNNeedSkip() function regardless > of whether the transaction is committed or aborted. As a result, the > flag would eventually be set to 'true', and the behavior would align > with the intended logic. > > But I am wondering why this flag is always set to true in > DecodeTXNNeedSkip() irrespective of the commit or abort. Because the > aborted transactions are not supposed to be replayed? 
So if my > observation is correct that for the aborted transaction, this > shouldn't be set to true then we have a problem with sequence where we > are identifying the transactional changes as non-transaction changes > because now for transactional changes this should depend upon commit > status. I have checked this case with Amit Kapila. So it seems in the cases where we have sent the prepared transaction or streamed in-progress transaction we would need to send the abort also, and for that reason, we are setting 'ctx->processing_required' as true so that if these WALs are not streamed we do not allow upgrade of such slots. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Removing unneeded self joins
On Mon, Feb 19, 2024 at 1:24 PM Andrei Lepikhov wrote: > On 18/2/2024 23:18, Alexander Korotkov wrote: > > On Sun, Feb 18, 2024 at 5:04 PM Alexander Korotkov > wrote: > >> On Sun, Feb 18, 2024 at 3:00 PM Alexander Lakhin > wrote: > >>> Please look at the following query which fails with an error since > >>> d3d55ce57: > >>> > >>> create table t (i int primary key); > >>> > >>> select t3.i from t t1 > >>> join t t2 on t1.i = t2.i, > >>> lateral (select t1.i limit 1) t3; > >>> > >>> ERROR: non-LATERAL parameter required by subquery > >> > >> Thank you for spotting. I'm looking at this. > > > > Attached is a draft patch fixing this query. Could you, please, recheck? > I reviewed this patch. Why do you check only the target list? I guess > these links can be everywhere. See the patch in the attachment with the > elaborated test and slightly changed code. I just noticed that this fix has been committed in 489072ab7a, but it's just flat wrong. * The fix walks the subquery and replaces all the Vars with a varno equal to the relid of the removing rel, without checking the varlevelsup. That is to say, a Var that belongs to the subquery itself might also be replaced, which is wrong. For instance, create table t (i int primary key); explain (costs off) select t3.i from t t1 join t t2 on t1.i = t2.i join lateral (select * from (select t1.i offset 0) offset 0) t3 on true; ERROR: no relation entry for relid 2 * The fix only traverses one level within the subquery, so Vars that appear in subqueries with multiple levels cannot be replaced. 
For instance,

explain (costs off)
select t4.i from t t1 join t t2 on t1.i = t2.i
join lateral (select t3.i from t t3, (select t1.i) offset 0) t4 on true;
ERROR:  non-LATERAL parameter required by subquery

I think the right fix for these issues is to introduce a new element 'sublevels_up' in ReplaceVarnoContext, and enhance replace_varno_walker to 1) recurse into subselects with sublevels_up increased, and 2) perform the replacement only when varlevelsup is equal to sublevels_up. Attached is a patch for the fix.

While writing the fix, I noticed some outdated comments. For example, in remove_rel_from_query, the first for loop updates otherrel's attr_needed as well as lateral_vars, but the comment only mentions attr_needed. So this patch also fixes some outdated comments.

While writing the test cases, I found that the test cases for SJE are quite messy. Below is what I have noticed:

* There are several test cases using catalog tables like pg_class, pg_stats, pg_index, etc. for testing join removal. I don't see a reason why we need to use catalog tables, and I think this just raises the risk of instability.

* In many test cases, a mix of uppercase and lowercase keywords is used in one query. I think it'd be better to maintain consistency by using either all uppercase or all lowercase keywords in one query.

* In most situations, we verify the plan and the output of a query like:

explain (costs off) select ...;
select ...;

The two select queries are supposed to be the same. But in the SJE test cases, I have noticed instances where the two select queries differ from each other.

This patch also includes some cosmetic tweaks for SJE test cases. It does not change the test cases using catalog tables though. I think that should be a separate patch.

Thanks
Richard

v1-0001-Replace-lateral-references-to-removed-rels-in-subqueries.patch
Description: Binary data
Re: What about Perl autodie?
On 14.02.24 17:52, Peter Eisentraut wrote:
> A gentler way might be to start using some perlcritic policies like InputOutput::RequireCheckedOpen or the more general InputOutput::RequireCheckedSyscalls and add explicit error checking at the sites it points out.

Here is a start for that. I added the required stanza to perlcriticrc and started with an explicit list of functions to check:

    functions = chmod flock open read rename seek symlink system

and fixed all the issues it pointed out. I picked those functions because most existing code already checked those, so the omissions are probably unintended, or in some cases also because I thought it would be important for test correctness (e.g., some tests using chmod). I didn't design any beautiful error messages, mostly just used "or die $!", which mostly matches existing code, and also this is developer-level code, so having the system error plus source code reference should be ok.

In the second patch, I changed the perlcriticrc stanza to use an exclusion list instead of an explicit inclusion list. That way, you can see what we are currently *not* checking. I'm undecided which way around is better, and exactly what functions we should be checking. (Of course, in principle, all of them, but since this is test and build support code, not production code, there are probably some reasonable compromises to be made.)
From c32941ce95281ab21691c4181962d20a820b1f20 Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Tue, 20 Feb 2024 10:12:12 +0100 Subject: [PATCH v1 1/2] perlcritic InputOutput::RequireCheckedSyscalls --- .../t/010_pg_archivecleanup.pl | 2 +- src/bin/pg_basebackup/t/010_pg_basebackup.pl| 8 src/bin/pg_ctl/t/001_start_stop.pl | 2 +- src/bin/pg_resetwal/t/002_corrupted.pl | 2 +- src/bin/pg_rewind/t/009_growing_files.pl| 2 +- src/bin/pg_rewind/t/RewindTest.pm | 4 ++-- src/pl/plperl/text2macro.pl | 4 ++-- src/test/kerberos/t/001_auth.pl | 2 +- .../ssl_passphrase_callback/t/001_testfunc.pl | 2 +- src/test/perl/PostgreSQL/Test/Cluster.pm| 12 ++-- src/test/perl/PostgreSQL/Test/Utils.pm | 16 src/test/ssl/t/SSL/Server.pm| 10 +- src/tools/msvc_gendef.pl| 4 ++-- src/tools/perlcheck/perlcriticrc| 4 src/tools/pgindent/pgindent | 17 + 15 files changed, 48 insertions(+), 43 deletions(-) diff --git a/src/bin/pg_archivecleanup/t/010_pg_archivecleanup.pl b/src/bin/pg_archivecleanup/t/010_pg_archivecleanup.pl index 792f5677c87..91a98c71e99 100644 --- a/src/bin/pg_archivecleanup/t/010_pg_archivecleanup.pl +++ b/src/bin/pg_archivecleanup/t/010_pg_archivecleanup.pl @@ -36,7 +36,7 @@ sub create_files { foreach my $fn (map { $_->{name} } @_) { - open my $file, '>', "$tempdir/$fn"; + open my $file, '>', "$tempdir/$fn" or die $!; print $file 'CONTENT'; close $file; diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl index 86cc01a640b..159da3029af 100644 --- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl +++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl @@ -77,7 +77,7 @@ ok(-d "$tempdir/backup", 'backup directory was created and left behind'); rmtree("$tempdir/backup"); -open my $conf, '>>', "$pgdata/postgresql.conf"; +open my $conf, '>>', "$pgdata/postgresql.conf" or die $!; print $conf "max_replication_slots = 10\n"; print $conf "max_wal_senders = 10\n"; print $conf "wal_level = replica\n"; @@ -175,7 +175,7 @@ qw(backup_label 
tablespace_map postgresql.auto.conf.tmp current_logfiles.tmp global/pg_internal.init.123)) { - open my $file, '>>', "$pgdata/$filename"; + open my $file, '>>', "$pgdata/$filename" or die $!; print $file "DONOTCOPY"; close $file; } @@ -185,7 +185,7 @@ # unintended side effects. if ($Config{osname} ne 'darwin') { - open my $file, '>>', "$pgdata/.DS_Store"; + open my $file, '>>', "$pgdata/.DS_Store" or die $!; print $file "DONOTCOPY"; close $file; } @@ -424,7 +424,7 @@ my $tblspcoid = $1; my $escapedRepTsDir = $realRepTsDir; $escapedRepTsDir =~ s/\\//g; - open my $mapfile, '>', $node2->data_dir . '/tablespace_map'; + open my $mapfile, '>', $node2->data_dir . '/tablespace_map' or die $!; print $mapfile "$tblspcoid $escapedRepTsDir\n"; close $mapfile; diff --git a/src/bin/pg_ctl/t/001_start_stop.pl b/src/bin/pg_ctl/t/001_start_stop.pl index fd56bf7706a..cbdaee57fb1 100644 --- a/src/bin/pg_ctl/t/001_start_stop.pl +++ b/src/bin/pg_ctl/t/001_start_stop.pl @@ -23,7 +23,7 @@ command_ok([ $ENV{PG_REGRESS}, '--config-auth', "$tempdir/data" ], 'configure
Re: Add lookup table for replication slot invalidation causes
On Wed, Feb 21, 2024 at 11:56 AM Michael Paquier wrote:
>
> Note the recent commit 74a730631065 where Alvaro has changed for the
> lwlock tranche names. That's quite elegant.

Yes, that's absolutely neat. FWIW, designated initializer syntax can be used in a few more places though. I'm not sure how worthwhile it will be, but I'll see if I can quickly put up a patch for it.

> > With these two asserts, the behavior (asserts on null and non-existent
> > inputs) is the same as what GetSlotInvalidationCause has right now.
>
> Well, I won't fight you over that.

Haha :)

> A new cause would require an update of SlotInvalidationCause, so if
> you keep RS_INVAL_MAX_CAUSES close to it that's impossible to miss.
> IMO, it makes just more sense to keep that in slot.c because of the
> static assert as well.

Hm, okay. Moved that to slot.c but left a note in the comment atop the enum to update it.

> + * If you add a new invalidation cause here, remember to add its name in
> + * SlotInvalidationCauses in the same order as that of the cause.
>
> The order does not matter with the way v2 does things with
> SlotInvalidationCauses[], no?

Ugh. Corrected that now. Please see the attached v3 patch.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

v3-0001-Add-lookup-table-for-replication-slot-invalidatio.patch
Description: Binary data
Test to dump and restore objects left behind by regression
Hi All,

In [1] we found that a test to dump and restore objects left behind by the regression test is missing. Such a test would cover many dump/restore scenarios without much effort. It will also help identify, during development itself, problems like those described in the same thread [2]. I am starting a new thread to discuss such a test. Attached is a WIP version of the test. The test does fail at the restore step when commit 74563f6b90216180fc13649725179fc119dddeb5 is reverted, reintroducing the problem.

The attached WIP test is inspired by src/bin/pg_upgrade/t/002_pg_upgrade.pl, which tests binary-upgrade dumps. The attached test tests the non-binary-upgrade dumps. Similar to 002_pg_upgrade.pl, the test uses SQL dumps before and after dump and restore to make sure that the objects are restored correctly.

The test has some shortcomings:
1. Objects which are not dumped at all are never tested.
2. Since the rows are dumped in varying order by the two clusters, the test only tests schema dump and restore.
3. The order of columns of the inheritance child table differs depending upon the DDLs used to reach a given state. This introduces diffs in the SQL dumps before and after restore. The test ignores these diffs by hardcoding the diff in the test.

Even with 1 and 2 the test is useful to detect dump/restore anomalies. I think we should improve 3, but I don't have a good and simpler solution. I didn't find any way to compare two given clusters in our TAP test framework. Building it will be a lot of work. Not sure if it's worth it. Suggestions welcome.
[1] https://www.postgresql.org/message-id/CAExHW5vyqv%3DXLTcNMzCNccOrHiun_XhYPjcRqeV6dLvZSamriQ%40mail.gmail.com
[2] https://www.postgresql.org/message-id/3462358.1708107856%40sss.pgh.pa.us

--
Best Wishes,
Ashutosh Bapat

From 78903d2cad4e94e05db74b2473f82aabb498f987 Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat
Date: Wed, 21 Feb 2024 11:02:40 +0530
Subject: [PATCH 1/2] WIP Test to dump and restore object left behind by regression

A regression run leaves many database objects behind in a variety of states. Dumping and restoring these objects covers many dump/restore scenarios not covered elsewhere. src/bin/pg_upgrade/t/002_pg_upgrade.pl tests pg_upgrade this way, but it does not cover non-binary-upgrade dump and restore.

The test takes a dump of the regression database from one cluster and restores it on another cluster. It compares the dumps from both the clusters in SQL format to make sure that the objects dumped were restored properly. Obviously if some objects were not dumped, those remain untested. Hence this isn't a complete test.

The order in which data rows are dumped varies a lot between dump and restore, hence the test only covers schema dumps. The order in which columns of an inherited table are dumped varies depending upon the DDLs used to set up inheritance. This introduces some difference in the SQL dumps taken from the two clusters. Those differences are explicitly ignored in the test.

TODO:
1.
We could test formats other than -Fc Ashutosh Bapat --- src/bin/pg_dump/Makefile | 4 + src/bin/pg_dump/t/006_pg_dump_regress.pl | 180 +++ 2 files changed, 184 insertions(+) create mode 100644 src/bin/pg_dump/t/006_pg_dump_regress.pl diff --git a/src/bin/pg_dump/Makefile b/src/bin/pg_dump/Makefile index 930c741c95..29a3d67953 100644 --- a/src/bin/pg_dump/Makefile +++ b/src/bin/pg_dump/Makefile @@ -42,6 +42,10 @@ OBJS = \ pg_backup_tar.o \ pg_backup_utils.o +# required for 006_pg_dump_regress.pl +REGRESS_SHLIB=$(abs_top_builddir)/src/test/regress/regress$(DLSUFFIX) +export REGRESS_SHLIB + all: pg_dump pg_restore pg_dumpall pg_dump: pg_dump.o common.o pg_dump_sort.o $(OBJS) | submake-libpq submake-libpgport submake-libpgfeutils diff --git a/src/bin/pg_dump/t/006_pg_dump_regress.pl b/src/bin/pg_dump/t/006_pg_dump_regress.pl new file mode 100644 index 00..c3016f975d --- /dev/null +++ b/src/bin/pg_dump/t/006_pg_dump_regress.pl @@ -0,0 +1,180 @@ +# Copyright (c) 2022-2024, PostgreSQL Global Development Group + +# Test dump and restore of regression data. This is expected to cover dump and +# restore of most of the types of objects left behind in different states by +# the regression test. +use strict; +use warnings FATAL => 'all'; + +use Cwd qw(abs_path); +use File::Basename qw(dirname); +use File::Compare; +use File::Find qw(find); +use File::Path qw(rmtree); + +use PostgreSQL::Test::Cluster; +use PostgreSQL::Test::Utils; +use PostgreSQL::Test::AdjustUpgrade; +use Test::More; + +sub generate_regress_data +{ + my ($node) = @_; + + # The default location of the source code is the root of this directory. + my $srcdir = abs_path("../../.."); + + # Grab any regression options that may be passed down by caller. + my $extra_opts = $ENV{EXTRA_REGRESS_OPTS} || ""; + + # --dlpath is needed to be able to find the location of regress.so + # and any libraries the regression tests require. + my $dlpath =
Re: Synchronizing slots from primary to standby
On Tue, Feb 20, 2024 at 6:19 PM Masahiko Sawada wrote: > > On Tue, Feb 20, 2024 at 12:33 PM shveta malik wrote: > Thank you for the explanation. It makes sense to me to move the check. > > > 2. If the wal_level is not logical, the server will need to restart > anyway to change the wal_level and have the slotsync worker work. Does > it make sense to have the postmaster exit if the wal_level is not > logical and sync_replication_slots is enabled? For instance, we have > similar checks in PostmasterMain(): > > if (summarize_wal && wal_level == WAL_LEVEL_MINIMAL) > ereport(ERROR, > (errmsg("WAL cannot be summarized when wal_level is > \"minimal\""))); Thanks for the feedback. I have addressed it in v93. thanks Shveta
Re: A new message seems missing a punctuation
On Tue, Feb 20, 2024 at 7:25 PM shveta malik wrote: > > On Tue, Feb 20, 2024 at 5:01 PM Amit Kapila wrote: > > > > > > We do expose the required information (restart_lsn, catalog_xmin, > > synced, temporary, etc) via pg_replication_slots. So, I feel the LOG > > message here is sufficient to DEBUG (or know the details) when the > > slot sync doesn't succeed. > > > > Please find the patch having the suggested changes. > LGTM. I'll push this tomorrow unless someone has further comments/suggestions. -- With Regards, Amit Kapila.
Re: ALTER TABLE SET ACCESS METHOD on partitioned tables
On Tue, Feb 20, 2024 at 03:47:46PM +0100, Peter Eisentraut wrote:
> It would be helpful if this patch could more extensively document in its commit message what semantic changes it makes. Various options of possible behaviors were discussed in this thread, but it's not clear which behaviors were chosen in this particular patch version.
>
> The general idea is that you can set an access method on a partitioned table. That much seems very agreeable. But then what happens with this setting, how can you override it, how can you change it, what happens when you change it, what happens with existing partitions and new partitions, etc. -- and which of these behaviors are new and old. Many things to specify.

The main point in this patch is the following code block in DefineRelation(), which defines the semantics of the AM set for a partitioned table:

+	else if (RELKIND_HAS_TABLE_AM(relkind) || relkind == RELKIND_PARTITIONED_TABLE)
 	{
+		if (stmt->partbound)
+		{
+			/*
+			 * For partitions, if no access method is specified, use the AM of
+			 * the parent table.
+			 */
+			Assert(list_length(inheritOids) == 1);
+			accessMethodId = get_rel_relam(linitial_oid(inheritOids));
+			Assert(OidIsValid(accessMethodId));
+		}
+		else
+			accessMethodId = get_table_am_oid(default_table_access_method, false);
 	}

This means that all partitioned tables would have pg_class.relam set, and that relam would never be 0:
- The USING clause takes priority over default_table_access_method.
- If there is no USING clause, default_table_access_method is the AM used.

Any partitions created from this partitioned table would inherit the AM set, ignoring default_table_access_method.

Alvaro has made a very good point a couple of days ago at [1]: we should try to make the behavior stick closer to tablespaces, where it could be possible to set relam to 0 for a partitioned table, in which case a partition would inherit the AM set in the GUC when a USING clause is not defined (if USING specifies the AM, we'd just use it).
Existing partitions should not be changed if the AM of their partitioned table changes, so you can think of the AM as a hint for the creation of new partitions. [1]: https://www.postgresql.org/message-id/202402011550.sfszd46247zi@alvherre.pgsql -- Michael signature.asc Description: PGP signature
Re: logical decoding and replication of sequences, take 2
On Wed, Feb 14, 2024 at 10:21 PM Tomas Vondra wrote:
>
> On 2/13/24 17:37, Robert Haas wrote:
> >
> > In other words, the fact that some sequence changes are
> > non-transactional creates ordering hazards that don't exist if there
> > are no non-transactional changes. So in that way, sequences are
> > different from table modifications, where applying the transactions in
> > order of commit is all we need to do. Here we need to apply the
> > transactions in order of commit and also apply the non-transactional
> > changes at the right point in the sequence. Consider the following
> > alternative apply sequence:
> >
> > 1. T1.
> > 2. T2's transactional changes (i.e. the ALTER SEQUENCE INCREMENT and
> > the subsequent nextval)
> > 3. T3's nextval
> > 4. T2's first nextval
> >
> > That's still in commit order. It's also wrong.
>
> Yes, this would be wrong. Thankfully the apply is not allowed to reorder
> the changes like this, because that's not what "non-transactional" means
> in this context.
>
> It does not mean we can arbitrarily reorder the changes, it only means
> the changes are applied as if they were independent transactions (but in
> the same order as they were executed originally).

In this regard, I have another scenario in mind where the apply order could be different for the changes in the same transaction. For example:

Transaction T1:
Begin;
Insert ..
Insert ..
nextval .. -- consider this generates WAL
..
Insert ..
nextval .. -- consider this generates WAL

In this case, if the nextval operations are applied in a different order (i.e. before the Inserts) then there could be some inconsistency. Say the apply doesn't follow the above order: then a trigger fired on both pub and sub for each row insert, one that refers to the current sequence value to make some decision, could have different behavior on the publisher and subscriber. If this is not how the patch will behave then fine, but otherwise, isn't this something that we should be worried about?
-- With Regards, Amit Kapila.
Re: Unlinking Parallel Hash Join inner batch files sooner
Hi,

I see in [1] that the reporter mentioned a delay between the error message in a parallel HashJoin and the return of control back to psql. Your patch might reduce this delay. Also, I have heard the same complaint from users who processed gigabytes of data in a parallel HashJoin. Presumably, they also got stuck in the unlink of tons of temporary files. So, are you going to do something with this code?

[1] https://www.postgresql.org/message-id/18349-83d33dd3d0c855c3%40postgresql.org

--
regards, Andrei Lepikhov
Postgres Professional
Re: Add lookup table for replication slot invalidation causes
On Wed, Feb 21, 2024 at 09:49:37AM +0530, Bharath Rupireddy wrote:
> On Wed, Feb 21, 2024 at 5:04 AM Michael Paquier wrote:
> Done that way. I'm fine with the designated initialization [1] that an
> ISO C99 compliant compiler offers. PostgreSQL installation guide
> https://www.postgresql.org/docs/current/install-requirements.html says
> that we need an at least C99-compliant ISO/ANSI C compiler.

Note the recent commit 74a730631065 where Alvaro has changed for the lwlock tranche names. That's quite elegant.

> Right, but an assertion isn't a bad idea there as it can generate a
> backtrace as opposed to the crash generating just SEGV note (and
> perhaps a crash dump) in server logs.
>
> With these two asserts, the behavior (asserts on null and non-existent
> inputs) is the same as what GetSlotInvalidationCause has right now.

Well, I won't fight you over that.

>> +/* Maximum number of invalidation causes */
>> +#define RS_INVAL_MAX_CAUSES RS_INVAL_WAL_LEVEL

There is no need to add that to slot.h: it is only used in slot.c.

> > Right, but it needs to be updated whenever a new cause is added to
> > enum ReplicationSlotInvalidationCause. Therefore, I think it's better
> > to be closer there in slot.h.

A new cause would require an update of SlotInvalidationCause, so if you keep RS_INVAL_MAX_CAUSES close to it that's impossible to miss. IMO, it makes just more sense to keep that in slot.c because of the static assert as well.

+ * If you add a new invalidation cause here, remember to add its name in
+ * SlotInvalidationCauses in the same order as that of the cause.

The order does not matter with the way v2 does things with SlotInvalidationCauses[], no?

--
Michael

signature.asc Description: PGP signature
BRIN integer overflow
Greetings, everyone!

While analyzing output of the Svace static analyzer [1] I've found a bug. The function bringetbitmap that is used in BRIN's IndexAmRoutine should return an int64 value, but the actual return value is int, since totalpages is int and totalpages * 10 is also int. This could lead to integer overflow. I suggest changing totalpages to int64 to avoid the potential overflow. Also, in all other "amgetbitmap functions" (such as hashgetbitmap, gistgetbitmap, gingetbitmap, blgetbitmap) the return value is of the correct int64 type. The proposed patch is attached.

[1] - https://svace.pages.ispras.ru/svace-website/en/

Oleg Tselebrovskiy, Postgres Pro

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 1087a9011ea..f72667e484e 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -559,7 +559,7 @@ bringetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
 	BrinOpaque *opaque;
 	BlockNumber nblocks;
 	BlockNumber heapBlk;
-	int			totalpages = 0;
+	int64		totalpages = 0;
 	FmgrInfo   *consistentFn;
 	MemoryContext oldcxt;
 	MemoryContext perRangeCxt;
Re: WIP Incremental JSON Parser
On 2024-02-20 Tu 19:53, Jacob Champion wrote: On Tue, Feb 20, 2024 at 2:10 PM Andrew Dunstan wrote: Well, that didn't help a lot, but meanwhile the CFBot seems to have decided in the last few days that it's now happy, so full steam ahead! ;-) I haven't been able to track down the root cause yet, but I am able to reproduce the failure consistently on my machine: ERROR: could not parse backup manifest: file size is not an integer: '16384, "Last-Modified": "2024-02-20 23:07:43 GMT", "Checksum-Algorithm": "CRC32C", "Checksum": "66c829cd" }, { "Path": "base/4/2606", "Size": 24576, "Last-Modified": "20 Full log is attached. *sigh* That's weird. I wonder why you can reproduce it and I can't. Can you give me details of the build? OS, compiler, path to source, build setup etc.? Anything that might be remotely relevant. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: Speeding up COPY TO for uuids and arrays
On Mon, Feb 19, 2024 at 03:08:45PM +0100, Laurenz Albe wrote: > On Sat, 2024-02-17 at 12:24 -0800, Andres Freund wrote: >> I wonder if we should move the core part for converting to hex to numutils.c, >> we already have code for the inverse. There does seem to be further >> optimization potential in the conversion, and that seems better done >> somewhere central rather than in one type's output function. OTOH, it might not be worth >> it, given the need to add the dashes. > > I thought about it, but then decided not to do that. > Calling a function that converts the bytes to hex and then adding the > hyphens will slow down processing, and I think the code savings would be > minimal at best. Yeah, I'm not sure either if that's worth doing, the current conversion code is simple enough. I'd be curious to hear about ideas to optimize that more locally, though. I was curious about the UUID one, and COPYing out 10M rows with a single UUID attribute brings down the runtime of a COPY from 3.8s to 2.3s here on a simple benchmark, with uuid_out showing up at the top of profiles easily on HEAD. Some references for HEAD:

 31.63%  5300  postgres  postgres  [.] uuid_out
 19.79%  3316  postgres  postgres  [.] appendStringInfoChar
 11.27%  1887  postgres  postgres  [.] CopyAttributeOutText
  6.36%  1065  postgres  postgres  [.] pgstat_progress_update_param

And with the patch for uuid_out:

 22.66%  2147  postgres  postgres  [.] CopyAttributeOutText
 12.99%  1231  postgres  postgres  [.] uuid_out
 11.41%  1081  postgres  postgres  [.] pgstat_progress_update_param
  4.79%   454  postgres  postgres  [.] CopyOneRowTo

That's very good, so I'd like to apply this part. Let me know if there are any objections. -- Michael signature.asc Description: PGP signature
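The kind of local optimization discussed above — formatting all 16 UUID bytes with a nibble lookup table instead of byte-at-a-time printf-style formatting, while emitting the dashes inline — can be sketched as follows. This is not the actual patch, only an illustration of the technique; uuid_to_string and hextab are made-up names.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static const char hextab[] = "0123456789abcdef";

/*
 * Format 16 raw UUID bytes as the canonical 8-4-4-4-12 text form into
 * "out", which must have room for at least 37 bytes (36 chars + NUL).
 * Each byte becomes two table lookups; dashes are inserted before
 * bytes 4, 6, 8 and 10.
 */
static void
uuid_to_string(const uint8_t *uuid, char *out)
{
	static const int dash_before[] = {4, 6, 8, 10};
	int			d = 0;
	char	   *p = out;

	for (int i = 0; i < 16; i++)
	{
		if (d < 4 && i == dash_before[d])
		{
			*p++ = '-';
			d++;
		}
		*p++ = hextab[uuid[i] >> 4];
		*p++ = hextab[uuid[i] & 0xF];
	}
	*p = '\0';
}
```

Whether this exact shape matches the committed patch is beside the point; it shows why the dash handling makes a fully generic bytes-to-hex helper in numutils.c less attractive.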
Re: Have pg_basebackup write "dbname" in "primary_conninfo"?
On Wed, Feb 21, 2024 at 4:09 PM Hayato Kuroda (Fujitsu) < kuroda.hay...@fujitsu.com> wrote: > Dear Horiguchi-san, > > > GetConnection()@streamutil.c wants to ensure conninfo has a fallback > > database name ("replication"). However, the function seems to be > > ignoring the case where neither dbname nor connection string is given, > > which is the first case Kuroda-san raised. The second case is the > > intended behavior of the function. > > > > > /* pg_recvlogical uses dbname only; others use connection_string > only. > > */ > > > Assert(dbname == NULL || connection_string == NULL); > > > > And the function incorrectly assumes that the connection string > > requires "dbname=replication". > > > > > * Merge the connection info inputs given in form of connection > string, > > > * options and default values (dbname=replication, > replication=true, > > etc.) > > > > But the name is a pseudo database name only used by pg_hba.conf > > (maybe), which cannot be used as an actual database name (without > > quotation marks, or unless it is actually created). The function > > should not add the fallback database name because the connection > > string for a physical replication connection doesn't require the dbname > > parameter. (attached patch) > > I had also missed this, but the requirement was that the dbname should be > included only when the dbname option was explicitly specified [1]. Neither > my patch nor yours can handle that. The libpq function > PQconnectdbParams()->pqConnectOptions2() fills in all the parameters in the > PGconn, and at that point the information about whether a value was > intentionally specified or not is discarded. GenerateRecoveryConfig() then > just writes down all the connection parameters from the PGconn; it cannot > tell them apart. > > Well, one option is that if a default dbname should be written to the configuration file, then "postgres" is a better option than "replication" or the user name as the default, since at least a database of that name exists.
regards, Ajin Cherian Fujitsu Australia
Re: Introduce XID age and inactive timeout based replication slot invalidation
On Tue, Feb 20, 2024 at 12:05 PM Bharath Rupireddy wrote: > >> [...] and was able to produce something like: > > > > postgres=# select > > slot_name,slot_type,active,active_pid,wal_status,invalidation_reason from > > pg_replication_slots;
> >   slot_name  | slot_type | active | active_pid | wal_status | invalidation_reason
> > -------------+-----------+--------+------------+------------+---------------------
> >  rep1        | physical  | f      |            | reserved   |
> >  master_slot | physical  | t      |    1482441 | unreserved | wal_removed
> > (2 rows)
> >
> > does that make sense to have an "active/working" slot "invalidated"? > > Thanks. Can you please provide the steps to generate this error? Are > you setting max_slot_wal_keep_size on primary to generate > "wal_removed"? I'm able to reproduce [1] the state [2] where the slot got invalidated first, then its wal_status became unreserved, but the slot is still serving after the standby comes back online, once it catches up with the primary by getting the WAL files from the archive. There's a good reason for this state - https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/replication/slotfuncs.c;h=d2fa5e669a32f19989b0d987d3c7329851a1272e;hb=ff9e1e764fcce9a34467d614611a34d4d2a91b50#l351. This intermittent state can only happen for physical slots, not for logical slots, because logical subscribers can't get the missing changes from the WAL stored in the archive. And the fact seems to be that an invalidated slot can never become normal again, but it can still serve a standby if the standby is able to catch up by fetching the required WAL (the WAL the slot couldn't keep for the standby) from elsewhere (the archive via restore_command). As far as the 0001 patch is concerned, it reports the invalidation_reason as long as slot_contents.data.invalidated != RS_INVAL_NONE. I think this is okay. Thoughts?
[1]
./initdb -D db17
echo "max_wal_size = 128MB
max_slot_wal_keep_size = 64MB
archive_mode = on
archive_command='cp %p /home/ubuntu/postgres/pg17/bin/archived_wal/%f'" | tee -a db17/postgresql.conf
./pg_ctl -D db17 -l logfile17 start
./psql -d postgres -p 5432 -c "SELECT pg_create_physical_replication_slot('sb_repl_slot', true, false);"
rm -rf sbdata logfilesbdata
./pg_basebackup -D sbdata
echo "port=5433
primary_conninfo='host=localhost port=5432 dbname=postgres user=ubuntu'
primary_slot_name='sb_repl_slot'
restore_command='cp /home/ubuntu/postgres/pg17/bin/archived_wal/%f %p'" | tee -a sbdata/postgresql.conf
touch sbdata/standby.signal
./pg_ctl -D sbdata -l logfilesbdata start
./psql -d postgres -p 5433 -c "SELECT pg_is_in_recovery();"
./pg_ctl -D sbdata -l logfilesbdata stop
./psql -d postgres -p 5432 -c "SELECT pg_logical_emit_message(true, 'mymessage', repeat('', 1000));"
./psql -d postgres -p 5432 -c "CHECKPOINT;"
./pg_ctl -D sbdata -l logfilesbdata start
./psql -d postgres -p 5432 -xc "SELECT * FROM pg_replication_slots;"

[2]
postgres=# SELECT * FROM pg_replication_slots;
-[ RECORD 1 ]-------+-------------
slot_name           | sb_repl_slot
plugin              |
slot_type           | physical
datoid              |
database            |
temporary           | f
active              | t
active_pid          | 710667
xmin                |
catalog_xmin        |
restart_lsn         | 0/115D21A0
confirmed_flush_lsn |
wal_status          | unreserved
safe_wal_size       | 77782624
two_phase           | f
conflict_reason     |
failover            | f
synced              | f
invalidation_reason | wal_removed
last_inactive_at    |
inactive_count      | 1

-- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
RE: Have pg_basebackup write "dbname" in "primary_conninfo"?
Dear Horiguchi-san, > GetConnection()@streamutil.c wants to ensure conninfo has a fallback > database name ("replication"). However, the function seems to be > ignoring the case where neither dbname nor connection string is given, > which is the first case Kuroda-san raised. The second case is the > intended behavior of the function. > > > /* pg_recvlogical uses dbname only; others use connection_string only. > */ > > Assert(dbname == NULL || connection_string == NULL); > > And the function incorrectly assumes that the connection string > requires "dbname=replication". > > > * Merge the connection info inputs given in form of connection string, > > * options and default values (dbname=replication, replication=true, > etc.) > > But the name is a pseudo database name only used by pg_hba.conf > (maybe), which cannot be used as an actual database name (without > quotation marks, or unless it is actually created). The function > should not add the fallback database name because the connection > string for a physical replication connection doesn't require the dbname > parameter. (attached patch) I had also missed this, but the requirement was that the dbname should be included only when the dbname option was explicitly specified [1]. Neither my patch nor yours can handle that. The libpq function PQconnectdbParams()->pqConnectOptions2() fills in all the parameters in the PGconn, and at that point the information about whether a value was intentionally specified or not is discarded. GenerateRecoveryConfig() then just writes down all the connection parameters from the PGconn; it cannot tell them apart. One approach, based on Horiguchi-san's patch and my initial one, is to avoid writing the dbname when it is the same as the username. [1]: https://www.postgresql.org/message-id/CAA4eK1KH1d1J5giPMZVOtMe0iqncf1CpNwkBKoYAmXdC-kEGZg%40mail.gmail.com Best Regards, Hayato Kuroda FUJITSU LIMITED https://www.fujitsu.com/
Re: logical decoding and replication of sequences, take 2
On Tue, Feb 20, 2024 at 5:39 PM Tomas Vondra wrote: > > On 2/20/24 06:54, Amit Kapila wrote: > > On Thu, Dec 21, 2023 at 6:47 PM Tomas Vondra > > wrote: > >> > >> On 12/19/23 13:54, Christophe Pettus wrote: > >>> Hi, > >>> > >>> I wanted to hop in here on one particular issue: > >>> > On Dec 12, 2023, at 02:01, Tomas Vondra > wrote: > - desirability of the feature: Random IDs (UUIDs etc.) are likely a much > better solution for distributed (esp. active-active) systems. But there > are important use cases that are likely to keep using regular sequences > (online upgrades of single-node instances, existing systems, ...). > >>> > >>> +1. > >>> > >>> Right now, the lack of sequence replication is a rather large > >>> foot-gun on logical replication upgrades. Copying the sequences > >>> over during the cutover period is doable, of course, but: > >>> > >>> (a) There's no out-of-the-box tooling that does it, so everyone has > >>> to write some scripts just for that one function. > >>> > >>> (b) It's one more thing that extends the cutover window. > >>> > >> > >> I agree it's an annoying gap for this use case. But if this is the only > >> use cases, maybe a better solution would be to provide such tooling > >> instead of adding it to the logical decoding? > >> > >> It might seem a bit strange if most data is copied by replication > >> directly, while sequences need special handling, ofc. > >> > > > > One difference between the logical replication of tables and sequences > > is that we can guarantee with synchronous_commit (and > > synchronous_standby_names) that after failover transactions data is > > replicated or not whereas for sequences we can't guarantee that > > because of their non-transactional nature. Say, there are two > > transactions T1 and T2, it is possible that T1's entire table data and > > sequence data are committed and replicated but T2's sequence data is > > replicated. 
So, after failover to the logical subscriber in such a case, if > > one routes T2 again to the new node as it was not successful > > previously then it would needlessly perform the sequence changes > > again. I don't know how much that matters but that would probably be the > > difference between the replication of tables and sequences. > > > > I don't quite follow what the problem with synchronous_commit is :-( > > For sequences, we log the changes ahead, i.e. even if nextval() did not > write anything into WAL, it's still safe because these changes are > covered by the WAL generated some time ago (up to ~32 values back). And > that's certainly subject to synchronous_commit, right? > > There certainly are issues with sequences and syncrep: > > https://www.postgresql.org/message-id/712cad46-a9c8-1389-aef8-faf0203c9...@enterprisedb.com > > but that's unrelated to logical replication. > > FWIW I don't think we'd re-apply sequence changes needlessly, because > the worker does update the origin after applying non-transactional > changes. So after the replication gets restarted, we'd skip what we > already applied, no? > It will work for restarts, but I was trying to discuss what happens in the scenario after the publisher node goes down and we fail over to the subscriber node and make it a primary node (a failover case). After that, all unfinished transactions will be re-routed to the new primary. Consider a theoretical case where we send sequence changes of the yet-uncommitted transactions directly from WAL buffers (something like 91f2cae7a4 does for physical replication) and then immediately the primary or publisher node crashes. After failover to the subscriber node, the application will re-route unfinished transactions to the new primary. In such a situation, I think there is a chance that we will update the sequence value even though the new primary has already received/applied that update via replication.
This is what I was saying: there is probably a difference between tables and sequences, as for tables such a replicated change would be rolled back. Having said that, this is probably no different from what would happen in the case of physical replication. > But maybe there is an issue and I'm just not getting it. Could you maybe > share an example of T1/T2, with a replication restart and what you think > would happen? > > > I agree with your point above that for upgrades some tool like > > pg_copysequence where we can provide a way to copy sequence data to > > subscribers from the publisher would suffice the need. > > > > Perhaps. Unfortunately it doesn't quite work for failovers, and it's yet > another tool users would need to use. > But can a logical replica be used for failover? We don't have any way to replicate/sync the slots on subscribers, nor do we have a mechanism to replicate existing publications. I think if we want to achieve failover to a logical subscriber we need to replicate/sync the required logical and physical slots to the subscribers. I haven't thought through it completely so there
Re: 'Shutdown <= SmartShutdown' check while launching processes in postmaster.
shveta malik writes: > I would like to know why we have 'Shutdown <= SmartShutdown' > check before launching few processes (WalReceiver, WalSummarizer, > AutoVacuum worker) while rest of the processes (BGWriter, WalWriter, > Checkpointer, Archiver etc) do not have any such check. If I have to > launch a new process, what shall be the criteria to decide if I need > this check? Children that are stopped by the "if (pmState == PM_STOP_BACKENDS)" stanza in PostmasterStateMachine should not be allowed to start again later if we are trying to shut down. (But "smart" shutdown doesn't enforce that, since it's a very weak state that only prohibits new client sessions.) The processes that are allowed to continue beyond that point are ones that are needed to perform the shutdown checkpoint, or useful to make it finish faster. regards, tom lane
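The rule Tom describes can be sketched as a toy model (this is not the real postmaster.c, and the names are illustrative): stoppable children are gated on the shutdown level, while children needed for the shutdown checkpoint are not.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Model of the gating described above. In the real postmaster, Shutdown
 * progresses monotonically through these levels; a "smart" shutdown only
 * blocks new client sessions, so it does not prevent relaunching children.
 */
typedef enum
{
	NoShutdown = 0,
	SmartShutdown,
	FastShutdown,
	ImmediateShutdown
} ShutdownLevel;

/* e.g. walreceiver, walsummarizer, autovacuum worker */
static bool
may_start_stoppable_child(ShutdownLevel shutdown)
{
	return shutdown <= SmartShutdown;
}

/* e.g. checkpointer, bgwriter: needed for (or speed up) the shutdown checkpoint */
static bool
may_start_checkpoint_helper(ShutdownLevel shutdown)
{
	(void) shutdown;			/* no shutdown-level gate for these */
	return true;
}
```

So the criterion for a new process is: would PM_STOP_BACKENDS stop it? If yes, gate its launch on `Shutdown <= SmartShutdown`; if it contributes to the shutdown checkpoint, don't.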
Re: Add system identifier to backup manifest
On Mon, Feb 19, 2024 at 12:06:19PM +0530, Amul Sul wrote: > On Mon, Feb 19, 2024 at 4:22 AM Michael Paquier wrote: >> And the new option should be documented at the top of the init() >> routine for perldoc. > > Added in the attached version. I've done some wordsmithing on 0001 and it is OK, so I've applied it to move the needle. Hope that helps. -- Michael signature.asc Description: PGP signature
Re: Add lookup table for replication slot invalidation causes
On Wed, Feb 21, 2024 at 5:04 AM Michael Paquier wrote: > > On Tue, Feb 20, 2024 at 05:53:03PM +0100, Jelte Fennema-Nio wrote: > > Seems like a good improvement overall. But I'd prefer the definition > > of the lookup table to use this syntax: > > > > const char *const SlotInvalidationCauses[] = { > > [RS_INVAL_NONE] = "none", > > [RS_INVAL_WAL_REMOVED] = "wal_removed", > > [RS_INVAL_HORIZON] = "rows_removed", > > [RS_INVAL_WAL_LEVEL] = "wal_level_sufficient", > > }; > > +1. Done that way. I'm fine with the designated initialization [1] that an ISO C99 compliant compiler offers. The PostgreSQL installation guide https://www.postgresql.org/docs/current/install-requirements.html says that we need an at least C99-compliant ISO/ANSI C compiler. [1] https://open-std.org/JTC1/SC22/WG14/www/docs/n494.pdf https://en.cppreference.com/w/c/99 https://www.ibm.com/docs/en/zos/2.4.0?topic=initializers-designated-aggregate-types-c-only > > Regarding the actual patch: > > > > - Assert(conflict_reason); > > > > Probably we should keep this Assert. As well as the Assert(0) > > The assert(0) at the end of the routine, likely so. I don't see a > huge point for the assert on conflict_reason as we'd crash anyway on > strcmp, no? Right, but an assertion isn't a bad idea there as it can generate a backtrace as opposed to the crash generating just a SEGV note (and perhaps a crash dump) in server logs. With these two asserts, the behavior (asserts on null and non-existent inputs) is the same as what GetSlotInvalidationCause has right now. > +/* Maximum number of invalidation causes */ > +#define RS_INVAL_MAX_CAUSES RS_INVAL_WAL_LEVEL > > There is no need to add that to slot.h: it is only used in slot.c. Right, but it needs to be updated whenever a new cause is added to enum ReplicationSlotInvalidationCause. Therefore, I think it's better to keep it close by in slot.h. Please see the attached v2 patch.
-- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com v2-0001-Add-lookup-table-for-replication-slot-invalidatio.patch Description: Binary data
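The pattern under discussion — a C99 designated-initializer lookup table indexed by the enum, plus a compile-time length check so that a newly added cause cannot be forgotten — can be sketched as a self-contained model. The names mirror the quoted patch, but this is an illustration, not the actual slot.c/slot.h code:

```c
#include <assert.h>
#include <string.h>

typedef enum ReplicationSlotInvalidationCause
{
	RS_INVAL_NONE,
	RS_INVAL_WAL_REMOVED,
	RS_INVAL_HORIZON,
	RS_INVAL_WAL_LEVEL,
} ReplicationSlotInvalidationCause;

/* Maximum number of invalidation causes */
#define RS_INVAL_MAX_CAUSES RS_INVAL_WAL_LEVEL

const char *const SlotInvalidationCauses[] = {
	[RS_INVAL_NONE] = "none",
	[RS_INVAL_WAL_REMOVED] = "wal_removed",
	[RS_INVAL_HORIZON] = "rows_removed",
	[RS_INVAL_WAL_LEVEL] = "wal_level_sufficient",
};

/*
 * With designated initializers, the textual order of the entries is
 * irrelevant; what matters is that every enum value is covered, which
 * this compile-time check enforces.
 */
_Static_assert(sizeof(SlotInvalidationCauses) / sizeof(SlotInvalidationCauses[0])
			   == RS_INVAL_MAX_CAUSES + 1,
			   "SlotInvalidationCauses must cover every cause");

static ReplicationSlotInvalidationCause
GetSlotInvalidationCause(const char *name)
{
	for (int i = 0; i <= RS_INVAL_MAX_CAUSES; i++)
	{
		if (strcmp(SlotInvalidationCauses[i], name) == 0)
			return (ReplicationSlotInvalidationCause) i;
	}
	return RS_INVAL_NONE;		/* callers are expected to pass a valid name */
}
```

This also makes Michael's point concrete: the "same order" comment becomes unnecessary once the index is spelled out in each initializer.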
Re: Have pg_basebackup write "dbname" in "primary_conninfo"?
On Wed, Feb 21, 2024 at 2:04 PM Kyotaro Horiguchi wrote: > > Although I haven't looked the original thread, it seems that the > dbname is used only by pg_sync_replication_slots(). If it is true, > couldn't we make the SQL function require a database name to make a > connection, instead of requiring it in physical-replication conninfo? > > > In the original thread, the intention is to not just provide this functionality using the function pg_sync_replication_slots(), but provide a GUC option on standbys to sync logical replication slots periodically even without calling that function. This requires connecting to a database. regards, Ajin Cherian Fujitsu Australia
Re: Have pg_basebackup write "dbname" in "primary_conninfo"?
On Wed, Feb 21, 2024 at 2:04 PM Kyotaro Horiguchi wrote: > > About the proposed patch, pg_basebackup cannot verify the validity of > the dbname. It could be problematic. > > Although I haven't looked the original thread, it seems that the > dbname is used only by pg_sync_replication_slots(). If it is true, > couldn't we make the SQL function require a database name to make a > connection, instead of requiring it in physical-replication conninfo? > > regards. > > -- > Kyotaro Horiguchi > NTT Open Source Software Center > I agree. If the intention is to meet the new requirement of the sync-slot patch which requires a dbname in the primary_conninfo, then pseudo dbnames will not work, whether it be the username or just "replication". I feel if the user does not specify dbname explicitly in pg_basebackup it should be left blank in the generated primary_conninfo string as well. regards, Ajin Cherian Fujitsu Australia
'Shutdown <= SmartShutdown' check while launching processes in postmaster.
Hi hackers, I would like to know why we have 'Shutdown <= SmartShutdown' check before launching few processes (WalReceiver, WalSummarizer, AutoVacuum worker) while rest of the processes (BGWriter, WalWriter, Checkpointer, Archiver etc) do not have any such check. If I have to launch a new process, what shall be the criteria to decide if I need this check? Looking forward to your expert advice. thanks Shveta
Re: Have pg_basebackup write "dbname" in "primary_conninfo"?
At Tue, 20 Feb 2024 19:56:10 +0530, Robert Haas wrote in > It seems like maybe somebody should look into why this is happening, > and perhaps fix it. GetConnection()@streamutil.c wants to ensure conninfo has a fallback database name ("replication"). However, the function seems to be ignoring the case where neither dbname nor connection string is given, which is the first case Kuroda-san raised. The second case is the intended behavior of the function. > /* pg_recvlogical uses dbname only; others use connection_string only. > */ > Assert(dbname == NULL || connection_string == NULL); And the function incorrectly assumes that the connection string requires "dbname=replication". >* Merge the connection info inputs given in form of connection string, >* options and default values (dbname=replication, replication=true, > etc.) But the name is a pseudo database name only used by pg_hba.conf (maybe), which cannot be used as an actual database name (without quotation marks, or unless it is actually created). The function should not add the fallback database name because the connection string for a physical replication connection doesn't require the dbname parameter. (attached patch) About the proposed patch, pg_basebackup cannot verify the validity of the dbname. It could be problematic. Although I haven't looked at the original thread, it seems that the dbname is used only by pg_sync_replication_slots(). If that is true, couldn't we make the SQL function require a database name to make a connection, instead of requiring it in the physical-replication conninfo? regards.
-- Kyotaro Horiguchi NTT Open Source Software Center

diff --git a/src/bin/pg_basebackup/streamutil.c b/src/bin/pg_basebackup/streamutil.c
index 56d1b15951..da63a04de4 100644
--- a/src/bin/pg_basebackup/streamutil.c
+++ b/src/bin/pg_basebackup/streamutil.c
@@ -78,7 +78,7 @@ GetConnection(void)
 
 	/*
 	 * Merge the connection info inputs given in form of connection string,
-	 * options and default values (dbname=replication, replication=true, etc.)
+	 * options and default values (replication=true, etc.)
 	 */
 	i = 0;
 	if (connection_string)
@@ -96,14 +96,6 @@ GetConnection(void)
 	keywords = pg_malloc0((argcount + 1) * sizeof(*keywords));
 	values = pg_malloc0((argcount + 1) * sizeof(*values));
 
-	/*
-	 * Set dbname here already, so it can be overridden by a dbname in the
-	 * connection string.
-	 */
-	keywords[i] = "dbname";
-	values[i] = "replication";
-	i++;
-
 	for (conn_opt = conn_opts; conn_opt->keyword != NULL; conn_opt++)
 	{
 		if (conn_opt->val != NULL && conn_opt->val[0] != '\0')
RE: Have pg_basebackup write "dbname" in "primary_conninfo"?
Dear Robert, > > Just FYI - here is an extreme case. And note that I have applied the proposed > > patch. > > > > When `pg_basebackup -D data_N2 -R` is used: > > ``` > > primary_conninfo = 'user=hayato ... dbname=hayato ... > > ``` > > > > But when `pg_basebackup -d "" -D data_N2 -R` is used: > > ``` > > primary_conninfo = 'user=hayato ... dbname=replication > > ``` > > It seems like maybe somebody should look into why this is happening, > and perhaps fix it. I think this is caused by the part [1] below in GetConnection(). If both dbname and connection_string are NULL, we enter the else branch and NULL is substituted - the {"dbname", NULL} key-value pair is generated only here. Then, in PQconnectdbParams()->PQconnectStartParams->pqConnectOptions2(), that entry is found and replaced with the username [2]. I think that if both the connection string and the dbname are NULL, it should be considered a physical replication connection. Here is a patch to fix it. After applying it, both examples below output "dbname=replication". You can also confirm:
```
pg_basebackup -D data_N2 -U postgres
pg_basebackup -D data_N2 -R -v
-> primary_conninfo = 'user=postgres ... dbname=replication ...
```

[1]
```
	else
	{
		keywords = pg_malloc0((argcount + 1) * sizeof(*keywords));
		values = pg_malloc0((argcount + 1) * sizeof(*values));
		keywords[i] = "dbname";
		values[i] = dbname;
		i++;
	}
```

[2]
```
	/*
	 * If database name was not given, default it to equal user name
	 */
	if (conn->dbName == NULL || conn->dbName[0] == '\0')
	{
		free(conn->dbName);
		conn->dbName = strdup(conn->pguser);
		if (!conn->dbName)
			goto oom_error;
	}
```

Best Regards, Hayato Kuroda FUJITSU LIMITED https://www.fujitsu.com/

diff --git a/src/bin/pg_basebackup/streamutil.c b/src/bin/pg_basebackup/streamutil.c
index 56d1b15951..aea1ccba36 100644
--- a/src/bin/pg_basebackup/streamutil.c
+++ b/src/bin/pg_basebackup/streamutil.c
@@ -119,7 +119,7 @@ GetConnection(void)
 		keywords = pg_malloc0((argcount + 1) * sizeof(*keywords));
 		values = pg_malloc0((argcount + 1) * sizeof(*values));
 		keywords[i] = "dbname";
-		values[i] = dbname;
+		values[i] = dbname == NULL ? "replication" : dbname;
 		i++;
 	}
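The one-line decision the patch makes can be modeled in isolation (illustrative only: effective_dbname is a made-up helper, and the empty-string case is included here because libpq also treats an empty dbname as unset, as the [2] fragment shows):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Model of the fallback under discussion: when the caller supplied no
 * dbname (or an empty one), substitute the pseudo database name
 * "replication" so that libpq does not later default dbname to the
 * user name.
 */
static const char *
effective_dbname(const char *dbname)
{
	if (dbname == NULL || dbname[0] == '\0')
		return "replication";
	return dbname;
}
```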
Re: Patch: Add parse_type Function
On 2024-02-20 15:44 +0100, David E. Wheeler wrote: > I’ve tweaked the comments and their order in v7, attached. > > This goes back to the discussion of the error raising of > to_regtype[1], so I’ve incorporated the patch from that thread into > this patch, and set up the docs for to_regtypemod() with similar > information. Makes sense. > [1] > https://www.postgresql.org/message-id/flat/57E1FDDC-5A38-452D-82D7-A44DA2E13862%40justatheory.com#1ae0b11634bc33c7ad3cd728e43d504e > - > +role="func_signature"> > > format_type > > @@ -25462,7 +25462,7 @@ SELECT collation for ('foo' COLLATE "de_DE"); > > > > - > +role="func_signature"> The references are printed as "???" right now. Can be fixed with xreflabel="format_type" and xreflabel="to_regtype" on those elements. > +to_regtypemod ( type > text ) The docs show parameter name "type" but pg_proc.data does not define proargnames. So the to_regtypemod() cannot be called using named notation: test=> select to_regtypemod(type => 'int'::text); ERROR: function to_regtypemod(type => text) does not exist > +Parses a string of text, extracts a potential type name from it, and > +translates its typmod, the modifier for the type, if any. Failure to s/typmod, the modifier of the type/type modifier/ Because format_type() already uses "type modifier" and I think it helps searchability to use the same wording. -- Erik
Re: [PoC] Federated Authn/z with OAUTHBEARER
Hi all, v14 rebases over latest and fixes a warning when assertions are disabled. 0006 is temporary and hacks past a couple of issues I have not yet root caused -- one of which makes me wonder if 0001 needs to be considered alongside the recent pg_combinebackup and incremental JSON work...? --Jacob 1: e7f87668ab ! 1: b6e8358f44 common/jsonapi: support FRONTEND clients @@ Commit message We can now partially revert b44669b2ca, now that json_errdetail() works correctly. - ## src/bin/pg_verifybackup/parse_manifest.c ## -@@ src/bin/pg_verifybackup/parse_manifest.c: json_parse_manifest(JsonManifestParseContext *context, char *buffer, - /* Run the actual JSON parser. */ - json_error = pg_parse_json(lex, ); - if (json_error != JSON_SUCCESS) -- json_manifest_parse_failure(context, "parsing failed"); -+ json_manifest_parse_failure(context, json_errdetail(json_error, lex)); - if (parse.state != JM_EXPECT_EOF) - json_manifest_parse_failure(context, "manifest ended unexpectedly"); - - ## src/bin/pg_verifybackup/t/005_bad_manifest.pl ## @@ src/bin/pg_verifybackup/t/005_bad_manifest.pl: use Test::More; my $tempdir = PostgreSQL::Test::Utils::tempdir; @@ src/common/Makefile: override CPPFLAGS += -DVAL_LDFLAGS_EX="\"$(LDFLAGS_EX)\"" +override CPPFLAGS := -DFRONTEND -I. -I$(top_srcdir)/src/common -I$(libpq_srcdir) $(CPPFLAGS) LIBS += $(PTHREAD_LIBS) - # If you add objects here, see also src/tools/msvc/Mkvcbuild.pm + OBJS_COMMON = \ ## src/common/jsonapi.c ## @@ @@ src/common/meson.build: foreach name, opts : pgcommon_variants 'dependencies': opts['dependencies'] + [ssl], } + ## src/common/parse_manifest.c ## +@@ src/common/parse_manifest.c: json_parse_manifest(JsonManifestParseContext *context, char *buffer, + /* Run the actual JSON parser. 
*/ + json_error = pg_parse_json(lex, ); + if (json_error != JSON_SUCCESS) +- json_manifest_parse_failure(context, "parsing failed"); ++ json_manifest_parse_failure(context, json_errdetail(json_error, lex)); + if (parse.state != JM_EXPECT_EOF) + json_manifest_parse_failure(context, "manifest ended unexpectedly"); + + ## src/include/common/jsonapi.h ## @@ #ifndef JSONAPI_H 2: 0ab79a168f ! 2: 5fa08a8033 libpq: add OAUTHBEARER SASL mechanism @@ Commit message - fix libcurl initialization thread-safety - harden the libcurl flow implementation - figure out pgsocket/int difference on Windows +- fix intermittent failure in the cleanup callback tests (race + condition?) - ...and more. ## configure ## @@ src/interfaces/libpq/Makefile: endif endif ## src/interfaces/libpq/exports.txt ## -@@ src/interfaces/libpq/exports.txt: PQclosePrepared 188 - PQclosePortal 189 - PQsendClosePrepared 190 +@@ src/interfaces/libpq/exports.txt: PQsendClosePrepared 190 PQsendClosePortal 191 -+PQsetAuthDataHook 192 -+PQgetAuthDataHook 193 -+PQdefaultAuthDataHook 194 + PQchangePassword 192 + PQsendPipelineSync193 ++PQsetAuthDataHook 194 ++PQgetAuthDataHook 195 ++PQdefaultAuthDataHook 196 ## src/interfaces/libpq/fe-auth-oauth-curl.c (new) ## @@ @@ src/interfaces/libpq/fe-auth-oauth-curl.c (new) + */ + cnt = sscanf(interval_str, "%f", ); + -+ Assert(cnt == 1); /* otherwise the lexer screwed up */ ++ if (cnt != 1) ++ { ++ /* ++ * Either the lexer screwed up or our assumption above isn't true, and ++ * either way a developer needs to take a look. 
++ */ ++ Assert(cnt == 1); ++ return 1; /* don't fall through in release builds */ ++ } ++ + parsed = ceilf(parsed); + + if (parsed < 1) @@ src/interfaces/libpq/fe-auth.c: pg_fe_sendauth(AuthRequest areq, int payloadlen, { /* Use this message if pg_SASL_continue didn't supply one */ if (conn->errorMessage.len == oldmsglen) -@@ src/interfaces/libpq/fe-auth.c: PQencryptPasswordConn(PGconn *conn, const char *passwd, const char *user, - - return crypt_pwd; +@@ src/interfaces/libpq/fe-auth.c: PQchangePassword(PGconn *conn, const char *user, const char *passwd) + } + } } + +PQauthDataHook_type PQauthDataHook = PQdefaultAuthDataHook; @@ src/interfaces/libpq/fe-connect.c: keep_going: /* We will come back to here case CONNECTION_AUTH_OK: {
Re: WIP Incremental JSON Parser
On Tue, Feb 20, 2024 at 2:10 PM Andrew Dunstan wrote: > Well, that didn't help a lot, but meanwhile the CFBot seems to have > decided in the last few days that it's now happy, so full steam ahead! ;-) I haven't been able to track down the root cause yet, but I am able to reproduce the failure consistently on my machine: ERROR: could not parse backup manifest: file size is not an integer: '16384, "Last-Modified": "2024-02-20 23:07:43 GMT", "Checksum-Algorithm": "CRC32C", "Checksum": "66c829cd" }, { "Path": "base/4/2606", "Size": 24576, "Last-Modified": "20 Full log is attached. --Jacob regress_log_003_timeline.gz Description: GNU Zip compressed data
Re: partitioning and identity column
On Tue, Feb 20, 2024 at 8:00 AM Alexander Lakhin wrote: > 20.02.2024 07:57, Ashutosh Bapat wrote: > >> Could you please name functions, which you suspect, for me to recheck them? > >> Perhaps we should consider fixing all of such functions, in light of > >> b0f7dd915 and d57b7cc33... > > Looks like the second commit has fixed all other places I knew except > > Identity related functions. So worth fixing identity related functions > > too. I see > > dropconstraint_internal() has two calls to check_stack_depth() back to > > back. The second one is not needed? > > Yeah, that's funny. It looks like such a double protection emerged > because Alvaro protected the function (in b0f7dd915), which was waiting for > adding check_stack_depth() in the other thread (resulted in d57b7cc33). > > Thank you for spending time on this! Thank you, I removed the second check. -- Regards, Alexander Korotkov
Re: Lessons from using mmap()
Hi! On Wed, Feb 21, 2024 at 1:21 AM Bruce Momjian wrote: > I think we have come to the same conclusion in the past, but I thought > it would be good to share someone else's research, and it might be > helpful if we ever revisit this idea. I have read this blog post before. In my personal opinion it is an example of awful speculation. If your DBMS needs WAL, then your first question about using mmap() for your data is how to enforce that the WAL is really written ahead of the data pages. As far as I know, there is no solution for that with mmap() (or at least none with acceptable portability). The blog post advertises LMDB, and LMDB is really good. But LMDB uses copy-on-write instead of WAL to ensure durability, and it supports only a single writer at a time. This is just another niche of solutions! The blog post gives the impression that developers of non-mmap() DBMSes are idiots who didn't manage to use mmap() properly. We're not idiots, we develop DBMSes of high concurrency! :-) -- Regards, Alexander Korotkov
Re: Add bump memory context type and use it for tuplesorts
On Wed, 21 Feb 2024 at 00:02, Matthias van de Meent wrote: > > On Tue, 20 Feb 2024 at 11:02, David Rowley wrote: > > On Fri, 26 Jan 2024 at 01:29, Matthias van de Meent > > wrote: > > > >> +allocSize = MAXALIGN(sizeof(BumpContext)) + Bump_BLOCKHDRSZ + > > > >> +if (minContextSize != 0) > > > >> +allocSize = Max(allocSize, minContextSize); > > > >> +else > > > >> +allocSize = Max(allocSize, initBlockSize); > > > > > > Shouldn't this be the following, considering the meaning of > > > "initBlockSize"? > > > > No, we want to make the blocksize exactly initBlockSize if we can. Not > > initBlockSize plus all the header stuff. We do it that way for all > > the other contexts and I agree that it's a good idea as it keeps the > > malloc request sizes powers of 2. > > One part of the reason of my comment was that initBlockSize was > ignored in favour of minContextSize if that was configured, regardless > of the value of initBlockSize. Is it deliberately ignored when > minContextSize is set? Ok, it's a good question. It allows finer-grained control over the initial block: it can be a fixed, given size without affecting the size that we double for the subsequent blocks. e.g. BumpContextCreate(64*1024, 8*1024, 1024*1024); would make the first block 64K and the next block 16K, followed by 32K, 64K ... 1MB. Whereas, BumpContextCreate(0, 8*1024, 1024*1024) will start at 8K, 16K ... 1MB. It seems useful as you might have a good idea of how much memory the common case needs and want to allocate that without having to allocate subsequent blocks, but if slightly more memory is required sometimes, you probably don't want the next malloc to be double the common size, especially if the common size is large. Apart from slab.c, this is how all the other contexts work. It seems best to keep this and not to go against the grain on this as there's more to consider if we opt to change the context types of existing contexts. David
Re: Change the bool member of the Query structure to bits
On 2024/2/20 19:18, Tomas Vondra wrote:
> On 2/20/24 11:11, Quan Zongliang wrote:
>> Sorry. I forgot to save a file. This is the latest.
>>
>> On 2024/2/20 18:07, Quan Zongliang wrote:
>>> The Query structure has an increasing number of bool attributes. This
>>> is likely to increase in the future. And they have the same properties.
>>> Wouldn't it be better to store them in bits? Common statements don't
>>> use them, so they have little impact. This also saves memory space.
>
> Hi,
>
> Are we really adding bools to Query that often? A bit of git-blame says
> it's usually multiple years to add a single new flag, which is what I'd
> expect. I doubt that'll change.
>
> As for the memory savings, can you quantify how much memory this would
> save? I highly doubt that's actually true (or at least measurable). The
> Query struct has ~256B, the patch cuts that to ~232B. But we allocate
> stuff in powers of 2, so we'll allocate a 256B chunk anyway. And we
> allocate very few of those objects anyway ...
>
> regards

That makes sense. Withdraw.
Re: Change the bool member of the Query structure to bits
On 2024/2/20 23:45, Tom Lane wrote:
> Quan Zongliang writes:
>> The Query structure has an increasing number of bool attributes. This
>> is likely to increase in the future. And they have the same properties.
>> Wouldn't it be better to store them in bits? Common statements don't
>> use them, so they have little impact. This also saves memory space.
>
> I'm -1 on that, for three reasons:
>
> * The amount of space saved is quite negligible. If queries had many
> Query structs then it could matter, but they don't.
>
> * This causes enough code churn to create a headache for back-patching.
>
> * This'll completely destroy the readability of these flags in pprint
> output.
>
> I'm not greatly in love with the macro layer you propose, either, but
> those details don't matter because I think we should just leave well
> enough alone.
>
> regards, tom lane

I get it. Withdraw.
Re: Add lookup table for replication slot invalidation causes
On Tue, Feb 20, 2024 at 05:53:03PM +0100, Jelte Fennema-Nio wrote: > Seems like a good improvement overall. But I'd prefer the definition > of the lookup table to use this syntax: > > const char *const SlotInvalidationCauses[] = { > [RS_INVAL_NONE] = "none", > [RS_INVAL_WAL_REMOVED] = "wal_removed", > [RS_INVAL_HORIZON] = "rows_removed", > [RS_INVAL_WAL_LEVEL] = "wal_level_sufficient", > }; +1. > Regarding the actual patch: > > - Assert(conflict_reason); > > Probably we should keep this Assert. As well as the Assert(0) The assert(0) at the end of the routine, likely so. I don't see a huge point for the assert on conflict_reason as we'd crash anyway on strcmp, no? > + for (cause = RS_INVAL_NONE; cause <= RS_INVAL_MAX_CAUSES; cause++) > > Strictly speaking this is a slight change in behaviour, since now > "none" is also parsed. That seems fine to me though. Yep. This does not strike me as an issue. We only use GetSlotInvalidationCause() in synchronize_slots(), mapping to NULL in the case of "none". Agreed that this is an improvement. +/* Maximum number of invalidation causes */ +#define RS_INVAL_MAX_CAUSES RS_INVAL_WAL_LEVEL There is no need to add that to slot.h: it is only used in slot.c. -- Michael signature.asc Description: PGP signature
Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
Hello! > I think the best way for this to work would be an index method that > exclusively stores TIDs, and of which we can quickly determine new > tuples, too. I was thinking about something like GIN's format, but > using (generation number, tid) instead of ([colno, colvalue], tid) as > key data for the internal trees, and would be unlogged (because the > data wouldn't have to survive a crash) Yeah, this seems to be a reasonable approach, but there are some doubts related to it - it needs a new index type, as well as unlogged indexes, to be introduced - this may make the patch too invasive to be merged. Also, some way to remove the index from the catalog in case of a crash may be required. A few more thoughts: * it is possible to go without a generation number - we may provide a way to do some kind of fast index lookup (by TID) directly during the second table scan phase. * one more option is to maintain a Tuplesort (instead of an index) with TIDs as a changelog and merge it with the index snapshot after taking a new visibility snapshot. But it is not clear how to share the same Tuplesort with multiple inserting backends. * crazy idea - what about doing the scan in the index we are building? We have the tuple, so we have all the data indexed in the index. We may try to do an index scan using that data to get all tuples and find the one with our TID :) Yes, in some cases it may be too bad because of the huge number of TIDs we need to scan + also btree copies the whole page even though we need a single item. But some additional index method may help - feels like something related to uniqueness (but that is only in btree anyway). Thanks, Mikhail.
Re: Add bump memory context type and use it for tuplesorts
On Tue, 20 Feb 2024 at 23:52, Matthias van de Meent wrote: > What I meant was that > > > (char *) block + Bump_BLOCKHDRSZ > vs > > ((char *) block) + Bump_BLOCKHDRSZ > > , when combined with my little experience with pointer addition and > precedence, and a lack of compiler at the ready at that point in time, > I was afraid that "(char *) block + Bump_BLOCKHDRSZ" would be parsed > as "(char *) (block + Bump_BLOCKHDRSZ)", which would get different > offsets across the two statements. > Godbolt has since helped me understand that both are equivalent. I failed to notice this. I've made them the same regardless to prevent future questions from being raised about the discrepancy between the two. David
Re: Injection points: some tools to wait and wake
On Tue, Feb 20, 2024 at 05:32:38PM +0300, Andrey M. Borodin wrote: > I will try to simplify test to 2-step, but it would be much easier > to implement if injection points could be awaken independently. I don't mind implementing something that wakes only a given point name, that's just more state data to track in shmem for the module. > No, "prepared injection point" is not about wait\wake logic. It's > about having injection point in critical section. > Normal injection point will pstrdup(name) and fail. In [0] I need a > test that waits after multixact generation before WAL-logging > it. It's only possible in a critical section. It took me some time to understand what you mean here. You are referring to the allocations done in load_external_function() -> expand_dynamic_library_name() -> substitute_libpath_macro(), which is something that has to happen the first time a callback is loaded into the local cache of a process. So what you want to achieve is to preload the callback in its cache in a first step without running it, then be able to run it, so your Prepare[d] layer is just a way to split InjectionPointRun() into two. Fancy use case. You could disable the critical section around the INJECTION_POINT() as one solution, though having something to pre-warm the local cache for a point name and avoid the allocations done in the external load would be a second way to achieve that. "Prepare" is not the best term I would have used, perhaps just have one INJECTION_POINT_PRELOAD() macro to warm up the cache outside the critical section with a wrapper routine? You could then leave InjectionPointRun() as it is now. Having a check on CritSectionCount in the injection point internals to temporarily disable a critical section would not feel right, as this is used as a safety check everywhere else. -- Michael signature.asc Description: PGP signature
Lessons from using mmap()
This blog, and the blogs it links to, explains the complexities of using mmap() for database data/index file I/O. https://www.symas.com/post/are-you-sure-you-want-to-use-mmap-in-your-dbms The blog starts by stating: There are, however, severe correctness and performance issues with mmap that are not immediately apparent. Such problems make it difficult, if not impossible, to use mmap correctly and efficiently in a modern DBMS. The remainder of the article makes various arguments that such mmap use is _possible_, but ends with a reasonable conclusion: Ultimately, the answer to the question "are you sure you want to use mmap in your DBMS?" should be rephrased - do you really want to reimplement everything the OS already does for you? Do you really believe you can do it correctly, better than the OS already does? The DBMS world is littered with projects whose authors believed, incorrectly, that they could. I think we have come to the same conclusion in the past, but I thought it would be good to share someone else's research, and it might be helpful if we ever revisit this idea. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Only you can decide what is important to you.
Re: Injection points: some tools to wait and wake
On Tue, Feb 20, 2024 at 03:55:08PM +, Bertrand Drouvot wrote: > Okay, makes sense to keep this as it is as a "template" in case more stuff is > added. > > + /* Counter advancing when injection_points_wake() is called */ > + int wait_counts; > > In that case what about using an unsigned instead? (Nit) uint32. Sure. > 1 === > > +void > +injection_wait(const char *name) > > Looks like name is not used in the function. I guess the reason it is a > parameter > is because that's the way the callback function is being called in > InjectionPointRun()? Right. The callback has to define this argument. > 2 === > > +PG_FUNCTION_INFO_V1(injection_points_wake); > +Datum > +injection_points_wake(PG_FUNCTION_ARGS) > +{ > > I think that This function will wake up all the "wait" injection points. > Would that make sense to implement some filtering based on the name? That > could > be useful for tests that would need multiple wait injection points and that > want > to wake them up "sequentially". > > We could think about it if there is such a need in the future though. Well, both you and Andrey are asking for it now, so let's do it. The implementation is simple: - Store in InjectionPointSharedState an array of wait_counts and an array of names. There is only one condition variable. - When a point wants to wait, it takes the spinlock and looks within the array of names until it finds a free slot, adds its name into the array to reserve a wait counter at the same position, releases the spinlock. Then it loops on the condition variable for an update of the counter it has reserved. It is possible to make something more efficient, but at a small size it would not really matter. - The wakeup takes a point name in argument, acquires the spinlock, and checks if it can find the point into the array, pinpoints the location of the counter to update and updates it. Then it broadcasts the change. 
- The wait loop checks its counter, leaves its loop, cancels the sleep, takes the spinlock to unregister from the array, and leaves. I would just hardcode the number of points that can wait, say 5 of them tracked in shmem? Does that look like what you are looking at? > +# Register a injection point on the standby so as the follow-up > > typo: "an injection"? Oops. Fixed locally. > +for (my $i = 0; $i < 3000; $i++) > +{ > > is 3000 due to?: > > +checkpoint_timeout = 30s > > If so, would that make sense to reduce the value for both? That had better be based on PostgreSQL::Test::Utils::timeout_default, actually, as in something like: foreach my $i (0 .. 10 * $PostgreSQL::Test::Utils::timeout_default) -- Michael signature.asc Description: PGP signature
Re: System username in pg_stat_activity
On Fri, Feb 16, 2024 at 9:45 PM Andres Freund wrote: > > Hi, > > On 2024-02-16 15:22:16 -0500, Tom Lane wrote: > > Magnus Hagander writes: > > > I mean, we could split it into more than one view. But adding a new > > > view for every new thing we want to show is also not very good from > > > either a usability or performance perspective. So where would we put > > > it? > > > > It'd have to be a new view with a row per session, showing static > > (or at least mostly static?) properties of the session. > > Yep. > > > > Could we move some existing fields of pg_stat_activity into such a > > view? > > I'd suspect that at least some of > - leader_pid > - datid > - datname > - usesysid > - usename > - backend_start > - client_addr > - client_hostname > - client_port > - backend_type > > could be moved. Whether it's worth breaking existing queries, I don't quite know. I think that's the big question. I think if we move all of those we will break every single monitoring tool out there for postgres... That's a pretty hefty price. > One option would be to not return (some) of them from pg_stat_get_activity(), > but add them to the view in a way that the planner can elide the reference. Without having any numbers, I would think that the join to pg_authid for example is likely more costly than returning all the other fields. But that one does get eliminated as long as one doesn't query that column. But if we make more things "joined in from the view", isn't that likely to just make it more expensive in most cases? > > I'm not sure that this is worth the trouble TBH. If it can be shown > > that pulling a few fields out of pg_stat_activity actually does make > > for a useful speedup, then maybe OK ... but Andres hasn't provided > > any evidence that there's a measurable issue. > > If I thought that the two columns proposed here were all that we wanted to > add, I'd not be worried. But there have been quite a few other fields > proposed, e.g.
tracking idle/active time on a per-connection granularity. > > We even already have a patch to add pg_stat_session > https://commitfest.postgresql.org/47/3405/ In a way, that's yet another different type of values though -- it contains accumulated stats. So we really have 3 types -- "info" that's not really stats (username, etc), "current state" (query, wait events, state) and "accumulated stats" (counters since start). If we don't want to combine them all, we should perhaps not combine any and actually have 3 views? -- Magnus Hagander Me: https://www.hagander.net/ Work: https://www.redpill-linpro.com/
Re: System username in pg_stat_activity
On Fri, Feb 16, 2024 at 9:31 PM Magnus Hagander wrote: > > On Fri, Feb 16, 2024 at 9:20 PM Andres Freund wrote: > > > > Hi, > > > > On 2024-02-16 20:57:59 +0100, Magnus Hagander wrote: > > > On Fri, Feb 16, 2024 at 8:41 PM Andres Freund wrote: > > > > On 2024-01-10 12:46:34 +0100, Magnus Hagander wrote: > > > > > The attached patch adds a column "authuser" to pg_stat_activity which > > > > > contains the username of the externally authenticated user, being the > > > > > same value as the SYSTEM_USER keyword returns in a backend. > > > > > > > > I continue to think that it's a bad idea to make pg_stat_activity ever > > > > wider > > > > with columns that do not actually describe properties that change > > > > across the > > > > course of a session. Yes, there's the argument that that ship has > > > > sailed, but > > > > I don't think that's a good reason to continue ever further down that > > > > road. > > > > > > > > It's not just a usability issue, it also makes it more expensive to > > > > query > > > > pg_stat_activity. This is of course more pronounced with textual > > > > columns than > > > > with integer ones. > > > > > > That's a fair point, but I do think that has in most ways already sailed, > > > yes. > > > > > > I mean, we could split it into more than one view. But adding a new > > > view for every new thing we want to show is also not very good from > > > either a usability or performance perspective. So where would we put > > > it? > > > > I think we should group new properties that don't change over the course of > > a > > session ([1]) in a new view (e.g. pg_stat_session). I don't think we need > > one > > view per property, but I do think it makes sense to split information that > > changes very frequently (like most pg_stat_activity contents) from > > information > > that doesn't (like auth_method, auth_identity). > > That would make sense in many ways, but ends up with "other level of > annoyances". E.g. 
the database name and oid don't change, but would we > want to move those out of pg_stat_activity? Same for username? Don't > we just end up in a grayzone about what belongs where? > > Also - were you envisioning just another view, or actually replacing > the pg_stat_get_activity() part? As in where do you think the cost > comes? Andres -- did you spot this question in the middle or did it get lost in the flurry of others? :) -- Magnus Hagander Me: https://www.hagander.net/ Work: https://www.redpill-linpro.com/
Re: Shared detoast Datum proposal
Tomas Vondra writes: Hi Tomas, > > I took a quick look on this thread/patch today, so let me share a couple > initial thoughts. I may not have a particularly coherent/consistent > opinion on the patch or what would be a better way to do this yet, but > perhaps it'll start a discussion ... Thank you for this! > > The goal of the patch (as I understand it) is essentially to cache > detoasted values, so that the value does not need to be detoasted > repeatedly in different parts of the plan. I think that's perfectly > sensible and worthwhile goal - detoasting is not cheap, and complex > plans may easily spend a lot of time on it. exactly. > > That being said, the approach seems somewhat invasive, and touching > parts I wouldn't have expected to need a change to implement this. For > example, I certainly would not have guessed the patch to need changes in > createplan.c or setrefs.c. > > Perhaps it really needs to do these things, but neither the thread nor > the comments are very enlightening as for why it's needed :-( In many > cases I can guess, but I'm not sure my guess is correct. And comments in > code generally describe what's happening locally / next line, not the > bigger picture & why it's happening. There is an explanation at [1], but it is probably too high-level. Writing proper comments is challenging for me, but I am happy to try harder. At the end of this reply I explain the data workflow; I feel that would be useful for reviewers. > IIUC we walk the plan to decide which Vars should be detoasted (and > cached) once, and which cases should not do that because it'd inflate > the amount of data we need to keep in a Sort, Hash etc. Exactly. > Not sure if > there's a better way to do this - it depends on what happens in the > upper parts of the plan, so we can't decide while building the paths. I'd say I did this intentionally. Deciding such things in paths would be more expensive than doing it at the create_plan stage, IMO.
> But maybe we could decide this while transforming the paths into a plan? > (I realize the JIT thread nearby needs to do something like that in > create_plan, and in that one I suggested maybe walking the plan would be > a better approach, so I may be contradicting myself a little bit.). I think that's pretty similar to what I'm doing now, just that I do it *just after* create_plan. This is because create_plan doesn't transform paths to plans in a top->down manner all the time; the known exception is create_mergejoin_plan, so I have to walk just after create_plan is done. In create_mergejoin_plan, the Sort node is created *after* the subplan for the Sort is created. /* Recursively process the path tree, demanding the correct tlist result */ plan = create_plan_recurse(root, best_path, CP_EXACT_TLIST); + /* + * After the plan tree is completely built, we walk it to find which + * expressions should not use the shared-detoast feature. + */ + set_plan_forbid_pre_detoast_vars_recurse(plan, NIL); > > In any case, set_plan_forbid_pre_detoast_vars_recurse should probably > explain the overall strategy / reasoning in a bit more detail. Maybe > it's somewhere in this thread, but that's not great for reviewers. A lesson learnt, thanks. A revised version of the comments from the latest patch: /* * set_plan_forbid_pre_detoast_vars_recurse * Walk the Plan tree top-down to gather the Vars which should stay as * small as possible, recording them in Plan.forbid_pre_detoast_vars. * * plan: the plan node to walk right now. * small_tlist: a list of nodes which the subplan should provide as * small as possible. */ static void set_plan_forbid_pre_detoast_vars_recurse(Plan *plan, List *small_tlist)
> Those functions have a pretty clearly defined purpose, not sure we want > to also extend them to also deal with this new thing. (FWIW I'd 100% > did it this way if I hacked on a PoC of this, to make it work. But I'm > not sure it's the right solution.) The main reason for doing so is that I want to share the same walking effort as fix_scan_expr; otherwise I would have to walk the plan for every expression again. I thought of this as a best practice, and that we could treat the pre_detoast_attrs as a valuable side effect :( > I don't know what to think about the Bitset - maybe it's necessary, but > how would I know? I don't have any way to measure the benefits, because > the 0002 patch uses it right away. A revised version of the comments from the latest patch; paragraph 2 explains this decision: /* * The attributes whose values in tts_values[*] are the detoasted versions; * this memory needs some extra clean-up. It can't be put into * ecxt_per_tuple_memory since much of it needs a longer life span, * for example the
Re: Running the fdw test from the terminal crashes into the core-dump
Alvaro Herrera writes: > The real problem is that a MERGE ... DO NOTHING action reports that no > permissions need to be checked on the target relation, which is not a > problem when there are other actions in the MERGE command since they add > privs to check. When DO NOTHING is the only action, the added assert in > a61b1f748 is triggered. > So, this means we can fix this by simply requiring ACL_SELECT privileges > on a DO NOTHING action. We don't need to request specific privileges on > any particular column (perminfo->selectedCols continues to be the empty > set) -- which means that any role that has privileges on *any* column > would get a pass. If you're doing MERGE with any other action besides > DO NOTHING, you already have privileges on at least some column, so > adding ACL_SELECT breaks nothing. LGTM. regards, tom lane
Re: Running the fdw test from the terminal crashes into the core-dump
On 2024-Feb-18, Alvaro Herrera wrote: > On 2024-Feb-18, Tom Lane wrote: > > > So I'd blame this on faulty handling of the zero-partitions case in > > the RTEPermissionInfo refactoring. (I didn't bisect to prove that > > a61b1f748 is exactly where it broke, but I do see that the query > > successfully does nothing in v15.) > > You're right, this is the commit that broke it. It's unclear to me if > Amit is available to look at it, so I'll give this a look tomorrow if he > isn't. OK, so it turns out that the bug is older than that -- it was actually introduced with MERGE itself (7103ebb7aae8) and has nothing to do with partitioning or RTEPermissionInfo; commit a61b1f748 is only related because it added the Assert() that barfs when there are no privileges to check. The real problem is that a MERGE ... DO NOTHING action reports that no permissions need to be checked on the target relation, which is not a problem when there are other actions in the MERGE command since they add privs to check. When DO NOTHING is the only action, the added assert in a61b1f748 is triggered. So, this means we can fix this by simply requiring ACL_SELECT privileges on a DO NOTHING action. We don't need to request specific privileges on any particular column (perminfo->selectedCols continues to be the empty set) -- which means that any role that has privileges on *any* column would get a pass. If you're doing MERGE with any other action besides DO NOTHING, you already have privileges on at least some column, so adding ACL_SELECT breaks nothing. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "Pido que me den el Nobel por razones humanitarias" (Nicanor Parra) >From 5f2eb9992f40933aa58f8bb0e0bed7ebd99f2952 Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Tue, 20 Feb 2024 18:30:37 +0100 Subject: [PATCH] MERGE ... DO NOTHING: require SELECT privileges Verify that a user running MERGE with a DO NOTHING clause has privileges to read the table, even if no columns are referenced. 
Such privileges were already required if the ON clause or any of the WHEN conditions referenced any column at all, so there's no functional change in practice. This change fixes an assertion failure in the case where no column is referenced by the command and the WHEN clauses are all DO NOTHING. Backpatch to 15, where MERGE was introduced. Reported-by: Alena Rybakina Discussion: https://postgr.es/m/4d65a385-7efa-4436-a825-0869f89d9...@postgrespro.ru --- src/backend/parser/parse_merge.c| 7 ++- src/test/regress/expected/merge.out | 11 ++- src/test/regress/sql/merge.sql | 11 +++ 3 files changed, 27 insertions(+), 2 deletions(-) diff --git a/src/backend/parser/parse_merge.c b/src/backend/parser/parse_merge.c index 5f6a683ab9..73f7a48b3c 100644 --- a/src/backend/parser/parse_merge.c +++ b/src/backend/parser/parse_merge.c @@ -133,7 +133,11 @@ transformMergeStmt(ParseState *pstate, MergeStmt *stmt) int when_type = (mergeWhenClause->matched ? 0 : 1); /* - * Collect action types so we can check target permissions + * Collect permissions to check, according to action types. We require + * SELECT privileges for DO NOTHING because it'd be irregular to have + * a target relation with zero privileges checked, in case DO NOTHING + * is the only action. There's no damage from that: any meaningful + * MERGE command requires at least some access to the table anyway. 
*/ switch (mergeWhenClause->commandType) { @@ -147,6 +151,7 @@ transformMergeStmt(ParseState *pstate, MergeStmt *stmt) targetPerms |= ACL_DELETE; break; case CMD_NOTHING: +targetPerms |= ACL_SELECT; break; default: elog(ERROR, "unknown action in MERGE WHEN clause"); diff --git a/src/test/regress/expected/merge.out b/src/test/regress/expected/merge.out index f87905fabd..125d7b6bca 100644 --- a/src/test/regress/expected/merge.out +++ b/src/test/regress/expected/merge.out @@ -3,6 +3,7 @@ -- CREATE USER regress_merge_privs; CREATE USER regress_merge_no_privs; +CREATE USER regress_merge_none; DROP TABLE IF EXISTS target; NOTICE: table "target" does not exist, skipping DROP TABLE IF EXISTS source; @@ -159,12 +160,19 @@ ERROR: cannot execute MERGE on relation "mv" DETAIL: This operation is not supported for materialized views. DROP MATERIALIZED VIEW mv; -- permissions +SET SESSION AUTHORIZATION regress_merge_none; +MERGE INTO target +USING (SELECT 1) +ON true +WHEN MATCHED THEN + DO NOTHING; +ERROR: permission denied for table target +RESET SESSION AUTHORIZATION; MERGE INTO target USING source2 ON target.tid = source2.sid WHEN MATCHED THEN UPDATE SET balance = 0; -ERROR: permission denied for table source2 GRANT INSERT ON target TO regress_merge_no_privs; SET SESSION AUTHORIZATION regress_merge_no_privs; MERGE INTO target @@ -2248,3 +2256,4 @@ DROP TABLE source, source2; DROP FUNCTION merge_trigfunc(); DROP USER regress_merge_privs; DROP USER
Re: Proposal: Adjacent B-Tree index
> only 1 index lookup is needed. Sorry, must be "only lookups of 1 index are needed". -- Dilshod Urazov Tue, 20 Feb 2024 at 21:09, Dilshod Urazov : > > I'm not sure why are two indexes not sufficient here? > > Did I write that they are not sufficient? The whole point is that in > relational DBMSs, which are widely used > to store graphs, we can optimize storage in such cases. Also we can > optimize traversals, e.g. if we want to > get all nodes that are adjacent to a given node with id = X in an oriented > graph > > SELECT id, label > FROM Nodes > JOIN Edges ON Nodes.id = Edges.target > WHERE Edges.source = X; > > only 1 index lookup is needed. > > > The entry could've been removed because (e.g.) > > test's b column was updated thus inserting a new index entry for the > > new HOT-chain's TID. > > If test's b column was updated and HOT optimization took place, no new index > entry is created. The index tuple > pointing to the old heap tuple is valid since it now points to the HOT chain. > > -- > Dilshod Urazov > > Mon, 19 Feb 2024 at 22:32, Matthias van de Meent < > boekewurm+postg...@gmail.com>: > >> On Mon, 19 Feb 2024 at 18:48, Dilshod Urazov >> wrote: >> > >> > - Motivation >> > >> > A regular B-tree index provides efficient mapping of key values to >> tuples within a table. However, if you have two tables connected in some >> way, a regular B-tree index may not be efficient enough. In this case, you >> would need to create an index for each table. The purpose will become >> clearer if we consider a simple example which is the main use-case as I see >> it. >> >> I'm not sure why are two indexes not sufficient here? PostgreSQL can >> already do merge joins, which would have the same effect of hitting >> the same location in the index at the same time between all tables, >> without the additional overhead of having to scan two table's worth of >> indexes in VACUUM. 
>> >> > During the vacuum of A an index tuple pointing to a dead tuple in A >> should be cleaned as well as all index tuples for the same key. >> >> This is definitely not correct. If I have this schema >> >> table test (id int primary key, b text unique) >> table test_ref(test_id int references test(id)) >> >> and if an index would contain entries for both test and test_ref, it >> can't just remove all test_ref entries because an index entry with the >> same id was removed: The entry could've been removed because (e.g.) >> test's b column was updated thus inserting a new index entry for the >> new HOT-chain's TID. >> >> > would suffice for this new semantics. >> >> With the provided explanation I don't think this is a great idea. >> >> Kind regards, >> >> Matthias van de Meent. >> >
Re: Missing Group Key in grouped aggregate
On 2/20/24 16:53, Tom Lane wrote: > =?UTF-8?Q?Erik_Nordstr=C3=B6m?= writes: >> I noticed that, beginning with PG16, grouped aggregates are missing the >> "Group Key" in the EXPLAIN output. > >> It seems the Agg node has numCols (number of grouping cols) set to zero in >> queries like > >> SELECT foo, count(*) FROM bar WHERE foo=1 GROUP BY foo; > >> In PG15, the "Group Key" is shown and the Agg node has numCols set as >> expected. > > Looks sane to me: the planner now notices that there can only > be one group so it doesn't tell the GroupAgg node to worry about > making groups. If it were missing in a case where there could be > multiple output groups, yes that'd be a bug. > > If you want to run it to ground you could bisect to see where the > behavior changed, but you'd probably just find it was intentional. > I believe this changed in: commit 8d83a5d0a2673174dc478e707de1f502935391a5 Author: Tom Lane Date: Wed Jan 18 12:37:57 2023 -0500 Remove redundant grouping and DISTINCT columns. Avoid explicitly grouping by columns that we know are redundant for sorting, for example we need group by only one of x and y in SELECT ... WHERE x = y GROUP BY x, y This comes up more often than you might think, as shown by the changes in the regression tests. It's nearly free to detect too, since we are just piggybacking on the existing logic that detects redundant pathkeys. (In some of the existing plans that change, it's visible that a sort step preceding the grouping step already didn't bother to sort by the redundant column, making the old plan a bit silly-looking.) ... It's not quite obvious from the commit message, but that's where git bisect says the behavior changed. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Shared detoast Datum proposal
Hi,

I took a quick look at this thread/patch today, so let me share a couple of initial thoughts. I may not have a particularly coherent/consistent opinion on the patch or what would be a better way to do this yet, but perhaps it'll start a discussion ...

The goal of the patch (as I understand it) is essentially to cache detoasted values, so that a value does not need to be detoasted repeatedly in different parts of the plan. I think that's a perfectly sensible and worthwhile goal - detoasting is not cheap, and complex plans may easily spend a lot of time on it.

That being said, the approach seems somewhat invasive, touching parts I wouldn't have expected to need a change to implement this. For example, I certainly would not have guessed the patch to need changes in createplan.c or setrefs.c. Perhaps it really needs to do these things, but neither the thread nor the comments are very enlightening as to why it's needed :-( In many cases I can guess, but I'm not sure my guess is correct. And the comments in the code generally describe what's happening locally / on the next line, not the bigger picture & why it's happening.

IIUC we walk the plan to decide which Vars should be detoasted (and cached) once, and which cases should not do that because it'd inflate the amount of data we need to keep in a Sort, Hash etc. Not sure if there's a better way to do this - it depends on what happens in the upper parts of the plan, so we can't decide while building the paths. But maybe we could decide this while transforming the paths into a plan? (I realize the JIT thread nearby needs to do something like that in create_plan, and in that one I suggested maybe walking the plan would be a better approach, so I may be contradicting myself a little bit.)

In any case, set_plan_forbid_pre_detoast_vars_recurse should probably explain the overall strategy / reasoning in a bit more detail. Maybe it's somewhere in this thread, but that's not great for reviewers. Similar for the setrefs.c changes.
It seems a bit suspicious to piggyback the new code onto fix_scan_expr/fix_scan_list and similar code. Those functions have a pretty clearly defined purpose, and I'm not sure we want to extend them to also deal with this new thing. (FWIW I'd 100% have done it this way if I'd hacked on a PoC of this, to make it work. But I'm not sure it's the right solution.)

I don't know what to think about the Bitset - maybe it's necessary, but how would I know? I don't have any way to measure the benefits, because the 0002 patch uses it right away. I think it should be done the other way around, i.e. the patch should introduce the main feature first (using the traditional Bitmapset), and then add Bitset on top of that. That way we could easily measure the impact and see if it's useful.

On the whole, my biggest concern is memory usage & leaks. It's not difficult to already have problems with large detoasted values, and if we start keeping more of them, that may get worse. Or at least that's my intuition - it can't really get better by keeping the values longer, right? The other thing is the risk of leaks (in the sense of keeping detoasted values longer than expected). I see the values are allocated in tts_mcxt, and maybe that's the right solution - not sure.

FWIW while looking at the patch, I couldn't help but think about expanded datums. There's similarity in what these two features do - keep detoasted values for a while, so that we don't need to do the expensive processing if we access them repeatedly. Of course, expanded datums are not meant to be long-lived, while "shared detoasted values" are meant to exist (potentially) for the query duration. But maybe there's something we could learn from expanded datums? For example how the varlena pointer is leveraged to point to the expanded object. For example, what if we add a "TOAST cache" as a query-level hash table, and modify the detoasting to first check the hash table (with the TOAST pointer as a key)?
It'd be fairly trivial to enforce a memory limit on the hash table, evict values from it, etc. And it wouldn't require any of the createplan/setrefs changes, I think ... regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Proposal: Adjacent B-Tree index
> I'm not sure why are two indexes not sufficient here?

Did I write that they are not sufficient? The whole point is that in relational DBMSs which are widely used to store graphs we can optimize storage in such cases. Also we can optimize traversals e.g. if we want to get all nodes that are adjacent to a given node with id = X in an oriented graph

SELECT id, label
FROM Nodes
JOIN Edges ON Nodes.id = Edges.target
WHERE Edges.source = X;

only 1 index lookup is needed.

> The entry could've been removed because (e.g.)
> test's b column was updated thus inserting a new index entry for the
> new HOT-chain's TID.

If test's b column was updated and HOT optimization took place, no new index entry is created. The index tuple pointing to the old heap tuple is valid, since it now points to the HOT chain.

--
Dilshod Urazov

Mon, 19 Feb 2024 at 22:32, Matthias van de Meent <
boekewurm+postg...@gmail.com>:

> On Mon, 19 Feb 2024 at 18:48, Dilshod Urazov
> wrote:
> >
> > - Motivation
> >
> > A regular B-tree index provides efficient mapping of key values to
> tuples within a table. However, if you have two tables connected in some
> way, a regular B-tree index may not be efficient enough. In this case, you
> would need to create an index for each table. The purpose will become
> clearer if we consider a simple example which is the main use-case as I see
> it.
>
> I'm not sure why are two indexes not sufficient here? PostgreSQL can
> already do merge joins, which would have the same effect of hitting
> the same location in the index at the same time between all tables,
> without the additional overhead of having to scan two table's worth of
> indexes in VACUUM.
If I have this schema > > table test (id int primary key, b text unique) > table test_ref(test_id int references test(id)) > > and if an index would contain entries for both test and test_ref, it > can't just remove all test_ref entries because an index entry with the > same id was removed: The entry could've been removed because (e.g.) > test's b column was updated thus inserting a new index entry for the > new HOT-chain's TID. > > > would suffice for this new semantics. > > With the provided explanation I don't think this is a great idea. > > Kind regards, > > Matthias van de Meent. >
Re: Integer undeflow in fprintf in dsa.c
> On 20 Feb 2024, at 17:13, Ilyasov Ian wrote: > Sorry for not answering quickly. There is no need for any apology, there is no obligation to answer within any specific timeframe. > I attached a patch to the letter with changes to take into account Daniel > Gustafsson's comment. Looks good on a quick skim, I'll take care of this shortly. -- Daniel Gustafsson
Re: Add lookup table for replication slot invalidation causes
On Tue, 20 Feb 2024 at 12:11, Bharath Rupireddy wrote:
> Thoughts?

Seems like a good improvement overall. But I'd prefer the definition of the lookup table to use this syntax:

const char *const SlotInvalidationCauses[] = {
	[RS_INVAL_NONE] = "none",
	[RS_INVAL_WAL_REMOVED] = "wal_removed",
	[RS_INVAL_HORIZON] = "rows_removed",
	[RS_INVAL_WAL_LEVEL] = "wal_level_insufficient",
};

Regarding the actual patch:

-   Assert(conflict_reason);

Probably we should keep this Assert, as well as the Assert(0).

+   for (cause = RS_INVAL_NONE; cause <= RS_INVAL_MAX_CAUSES; cause++)

Strictly speaking this is a slight change in behaviour, since now "none" is also parsed. That seems fine to me though.
Re: Streaming read-ready sequential scan code
On Tue, Feb 20, 2024 at 6:13 AM Robert Haas wrote: > > On Tue, Feb 20, 2024 at 4:35 AM Melanie Plageman > wrote: > > I've written three alternative implementations of the actual streaming > > read user for sequential scan which handle the question of where to > > allocate the streaming read object and how to handle changing scan > > direction in different ways. > > It's weird to me that the prospect of changing the scan direction > causes such complexity. I mean, why doesn't a streaming read object > have a forget_all_my_previous_requests() method or somesuch? Basically, that is what pg_streaming_read_free() does. It goes through and releases the buffers it had pinned and frees any per-buffer data allocated. The complexity with the sequential scan streaming read user and scan direction is just that it has to detect when the scan direction changes and do the releasing/freeing and reallocation. The scan direction is passed to heapgettup[_pagemode](), so this is something that can change on a tuple-to-tuple basis. It is less that doing this is complicated and more that it is annoying and distracting to have to check for and handle a very unimportant and uncommon case in the main path of the common case. - Melanie
RE: Integer undeflow in fprintf in dsa.c
Sorry for not answering quickly. Thank you for your comments. I attached a patch to the letter with changes to take into account Daniel Gustafsson's comment.

Kind regards,
Ian Ilyasov.
Junior Software Developer at Postgres Professional

Subject: [PATCH] Integer underflow fix in fprintf in dsa.c.
---
Index: src/backend/utils/mmgr/dsa.c
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===
diff --git a/src/backend/utils/mmgr/dsa.c b/src/backend/utils/mmgr/dsa.c
--- a/src/backend/utils/mmgr/dsa.c	(revision b78fa8547d02fc72ace679fb4d5289dccdbfc781)
+++ b/src/backend/utils/mmgr/dsa.c	(date 1708440995707)
@@ -1105,9 +1105,13 @@
 	{
 		dsa_segment_index segment_index;
 
-		fprintf(stderr,
-				"segment bin %zu (at least %d contiguous pages free):\n",
-				i, 1 << (i - 1));
+		if (i == 0)
+			fprintf(stderr,
+					"segment bin %zu (no contiguous free pages):\n", i);
+		else
+			fprintf(stderr,
+					"segment bin %zu (at least %d contiguous pages free):\n",
+					i, 1 << (i - 1));
 		segment_index = area->control->segment_bins[i];
 		while (segment_index != DSA_SEGMENT_INDEX_NONE)
 		{
Re: Fix race condition in InvalidatePossiblyObsoleteSlot()
Hi, On Tue, Feb 20, 2024 at 02:33:44PM +0900, Michael Paquier wrote: > On Tue, Feb 20, 2024 at 08:51:17AM +0900, Michael Paquier wrote: > > Prefixing these with "initial_" is fine, IMO. That shows the > > intention that these come from the slot's data before doing the > > termination. So I'm OK with what's been proposed in v3. > > I was looking at that a second time, and just concluded that this is > OK, so I've applied that down to 16, wordsmithing a bit the comments. Thanks! FWIW, I've started to write a POC regarding the test we mentioned up-thread. The POC test is based on what has been submitted by Michael in [1]. The POC test seems to work fine and it seems that nothing more is needed in [1] (at some point I thought I would need the ability to wake up multiple "wait" injection points in sequence but that was not necessary). I'll polish and propose my POC test once [1] is pushed (unless you're curious about it before). [1]: https://www.postgresql.org/message-id/flat/ZdLuxBk5hGpol91B%40paquier.xyz Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: pg_restore problem to load constraints with tables
Fabrice Chapuis writes: > When a table is reloaded wit pg_restore, it is recreated without indexes or > constraints. There are automatically skipped. Is there a reason for this? [ shrug ] That's how the -t switch is defined. If you want something else, you can use the -l and -L switches to pick out a custom collection of objects to restore. regards, tom lane
Re: Injection points: some tools to wait and wake
Hi, On Tue, Feb 20, 2024 at 08:28:28AM +0900, Michael Paquier wrote: > On Mon, Feb 19, 2024 at 02:28:04PM +, Bertrand Drouvot wrote: > > If slock_t protects "only" one counter, then what about using > > pg_atomic_uint64 > > or pg_atomic_uint32 instead? And btw do we need wait_counts at all? (see > > comment > > number 4) > > We could, indeed, even if we use more than one counter. Still, I > would be tempted to keep it in case more data is added to this > structure as that would make for less stuff to do while being normally > non-critical. This sentence may sound pedantic, though.. Okay, makes sense to keep this as it is as a "template" in case more stuff is added. + /* Counter advancing when injection_points_wake() is called */ + int wait_counts; In that case what about using an unsigned instead? (Nit) > > I'm wondering if this loop and wait_counts are needed, why not doing > > something > > like (and completely get rid of wait_counts)? > > > > " > > ConditionVariablePrepareToSleep(_state->wait_point); > > ConditionVariableSleep(_state->wait_point, injection_wait_event); > > ConditionVariableCancelSleep(); > > " > > > > It's true that the comment above ConditionVariableSleep() mentions that: > > Perhaps not, but it encourages good practices around the use of > condition variables, and we need to track all that in shared memory > anyway. Ashutosh has argued in favor of the approach taken by the > patch in the original thread when I've sent a version doing exactly > what you are saying now to not track a state in shmem. Oh okay I missed this previous discussion, let's keep it as it is then. New comments: 1 === +void +injection_wait(const char *name) Looks like name is not used in the function. I guess the reason it is a parameter is because that's the way the callback function is being called in InjectionPointRun()? 
2 ===

+PG_FUNCTION_INFO_V1(injection_points_wake);
+Datum
+injection_points_wake(PG_FUNCTION_ARGS)
+{

I think this function will wake up all the "wait" injection points. Would it make sense to implement some filtering based on the name? That could be useful for tests that need multiple wait injection points and want to wake them up "sequentially". We could think about it if there is such a need in the future though.

3 ===

+# Register a injection point on the standby so as the follow-up

typo: "an injection"?

4 ===

+for (my $i = 0; $i < 3000; $i++)
+{

is 3000 due to?:

+checkpoint_timeout = 30s

If so, would it make sense to reduce the value for both?

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Re: Missing Group Key in grouped aggregate
=?UTF-8?Q?Erik_Nordstr=C3=B6m?= writes: > I noticed that, beginning with PG16, grouped aggregates are missing the > "Group Key" in the EXPLAIN output. > It seems the Agg node has numCols (number of grouping cols) set to zero in > queries like > SELECT foo, count(*) FROM bar WHERE foo=1 GROUP BY foo; > In PG15, the "Group Key" is shown and the Agg node has numCols set as > expected. Looks sane to me: the planner now notices that there can only be one group so it doesn't tell the GroupAgg node to worry about making groups. If it were missing in a case where there could be multiple output groups, yes that'd be a bug. If you want to run it to ground you could bisect to see where the behavior changed, but you'd probably just find it was intentional. regards, tom lane
Re: Change the bool member of the Query structure to bits
Quan Zongliang writes: > The Query structure has an increasing number of bool attributes. This is > likely to increase in the future. And they have the same properties. > Wouldn't it be better to store them in bits? Common statements don't use > them, so they have little impact. This also saves memory space. I'm -1 on that, for three reasons: * The amount of space saved is quite negligible. If queries had many Query structs then it could matter, but they don't. * This causes enough code churn to create a headache for back-patching. * This'll completely destroy the readability of these flags in pprint output. I'm not greatly in love with the macro layer you propose, either, but those details don't matter because I think we should just leave well enough alone. regards, tom lane
Re: numeric_big in make check?
On Tue, 20 Feb 2024 at 15:16, Tom Lane wrote: > > Dean Rasheed writes: > > Looking at the script itself, the addition, subtraction, > > multiplication and division tests at the top are probably pointless, > > since I would expect those operations to be tested adequately (and > > probably more thoroughly) by the transcendental test cases. In fact, I > > think it would probably be OK to delete everything above line 650, and > > just keep the bottom half of the script -- the pow(), exp(), ln() and > > log() tests, which cover various edge cases, as well as exercising > > basic arithmetic operations internally. > > I could go with that, but let's just transpose those into numeric. > Works for me. Regards, Dean
Re: numeric_big in make check?
Dean Rasheed writes: > Looking at the script itself, the addition, subtraction, > multiplication and division tests at the top are probably pointless, > since I would expect those operations to be tested adequately (and > probably more thoroughly) by the transcendental test cases. In fact, I > think it would probably be OK to delete everything above line 650, and > just keep the bottom half of the script -- the pow(), exp(), ln() and > log() tests, which cover various edge cases, as well as exercising > basic arithmetic operations internally. I could go with that, but let's just transpose those into numeric. regards, tom lane
Re: bug report: some issues about pg_15_stable(8fa4a1ac61189efffb8b851ee77e1bc87360c445)
On Tue, 20 Feb 2024 at 14:49, Dean Rasheed wrote:
>
> Also, if the concurrent update were an update of a key
> column that was included in the join condition, the re-scan would
> follow the update to a new matching source row, which is inconsistent
> with what would happen if it were a join to a regular relation.
>

In case it wasn't clear what I was talking about there, here's a simple example:

-- Setup
DROP TABLE IF EXISTS src1, src2, tgt;
CREATE TABLE src1 (a int, b text);
CREATE TABLE src2 (a int, b text);
CREATE TABLE tgt (a int, b text);

INSERT INTO src1 SELECT x, 'Src1 '||x FROM generate_series(1, 3) g(x);
INSERT INTO src2 SELECT x, 'Src2 '||x FROM generate_series(4, 6) g(x);
INSERT INTO tgt SELECT x, 'Tgt '||x FROM generate_series(1, 6, 2) g(x);

-- Session 1
BEGIN;
UPDATE tgt SET a = 2 WHERE a = 1;

-- Session 2
UPDATE tgt t SET b = s.b
  FROM (SELECT * FROM src1 UNION ALL SELECT * FROM src2) s
 WHERE s.a = t.a;
SELECT * FROM tgt;

-- Session 1
COMMIT;

and the result in tgt is:

 a |    b
---+--------
 2 | Src1 2
 3 | Src1 3
 5 | Src2 5
(3 rows)

whereas if that UNION ALL subquery had been a regular table with the same contents, the result would have been:

 a |    b
---+--------
 2 | Tgt 1
 3 | Src1 3
 5 | Src2 5

i.e., the concurrently modified row would not have been updated.

Regards,
Dean
Re: Replace current implementations in crypt() and gen_salt() to OpenSSL
> On 20 Feb 2024, at 13:40, Peter Eisentraut wrote: > > On 20.02.24 12:39, Daniel Gustafsson wrote: >> A fifth option is to throw away our in-tree implementations and use the >> OpenSSL >> API's for everything, which is where this thread started. If the effort to >> payoff ratio is palatable to anyone then patches are for sure welcome. > > The problem is that, as I understand it, these crypt routines are not > designed in a way that you can just plug in a crypto library underneath. > Effectively, the definition of what, say, blowfish crypt does, is whatever is > in that source file, and transitively, whatever OpenBSD does. I don't disagree, but if the OP is willing to take a stab at it then.. > (Fun question: Does OpenBSD care about FIPS?) No, LibreSSL ripped out FIPS support early on. > Of course, you could reimplement the same algorithms independently, using > OpenSSL or whatever. But I don't think this will really improve the state of > the world in aggregate, because to a large degree we are relying on the > upstream to keep these implementations maintained, and if we rewrite them, we > become the upstream. As a sidenote, we are already trailing behind upstream on this, the patch in [0] sits on my TODO, but given the lack of complaints over the years it's not been bumped to the top. -- Daniel Gustafsson [0] https://www.postgresql.org/message-id/flat/CAA-7PziyARoKi_9e2xdC75RJ068XPVk1CHDDdscu2BGrPuW9TQ%40mail.gmail.com#b20783dd6c72e95a8a0f6464d1228ed5
Re: bug report: some issues about pg_15_stable(8fa4a1ac61189efffb8b851ee77e1bc87360c445)
On Mon, 19 Feb 2024 at 17:48, zwj wrote:
>
> Hello,
>
> I found an issue while using the latest version of PG15
> (8fa4a1ac61189efffb8b851ee77e1bc87360c445).
> This question is about 'merge into'.
>
> When two merge into statements are executed concurrently, I obtain the
> following process and results.
> Firstly, the execution results of each Oracle are different, and secondly,
> I tried to understand its execution process and found that it was not very
> clear.
>

Hmm, looking at this I think there is a problem with how UNION ALL subqueries are pulled up, and I don't think it's necessarily limited to MERGE.

Looking at the plan for this MERGE operation:

explain (verbose, costs off)
merge into mergeinto_0023_tb01 a
  using (select aid,name,year from mergeinto_0023_tb02
         union all
         select aid,name,year from mergeinto_0023_tb03) c
  on (a.id=c.aid)
  when matched then update set year=c.year
  when not matched then insert(id,name,year) values(c.aid,c.name,c.year);

                              QUERY PLAN
-----------------------------------------------------------------------
 Merge on public.mergeinto_0023_tb01 a
   ->  Merge Right Join
         Output: a.ctid, mergeinto_0023_tb02.year, mergeinto_0023_tb02.aid, mergeinto_0023_tb02.name, (ROW(mergeinto_0023_tb02.aid, mergeinto_0023_tb02.name, mergeinto_0023_tb02.year))
         Merge Cond: (a.id = mergeinto_0023_tb02.aid)
         ->  Sort
               Output: a.ctid, a.id
               Sort Key: a.id
               ->  Seq Scan on public.mergeinto_0023_tb01 a
                     Output: a.ctid, a.id
         ->  Sort
               Output: mergeinto_0023_tb02.year, mergeinto_0023_tb02.aid, mergeinto_0023_tb02.name, (ROW(mergeinto_0023_tb02.aid, mergeinto_0023_tb02.name, mergeinto_0023_tb02.year))
               Sort Key: mergeinto_0023_tb02.aid
               ->  Append
                     ->  Seq Scan on public.mergeinto_0023_tb02
                           Output: mergeinto_0023_tb02.year, mergeinto_0023_tb02.aid, mergeinto_0023_tb02.name, ROW(mergeinto_0023_tb02.aid, mergeinto_0023_tb02.name, mergeinto_0023_tb02.year)
                     ->  Seq Scan on public.mergeinto_0023_tb03
                           Output: mergeinto_0023_tb03.year, mergeinto_0023_tb03.aid, mergeinto_0023_tb03.name, ROW(mergeinto_0023_tb03.aid, mergeinto_0023_tb03.name, mergeinto_0023_tb03.year)
The "ROW(...)" targetlist entries are added because preprocess_rowmarks() adds a rowmark to the UNION ALL subquery, which at that point is the only non-target relation in the jointree. It does this intending that the same values be returned during EPQ rechecking. However, pull_up_subqueries() causes the UNION all subquery and its leaf subqueries to be pulled up into the main query as appendrel entries. So when it comes to EPQ rechecking, the rowmark does absolutely nothing, and EvalPlanQual() does a full re-scan of mergeinto_0023_tb02 and mergeinto_0023_tb03 and a re-sort for each concurrently modified row. A similar thing happens for UPDATE and DELETE, if they're joined to a UNION ALL subquery. However, AFAICS that doesn't cause a problem (other than being pretty inefficient) since, for UPDATE and DELETE, the join to the UNION ALL subquery will always be an inner join, I think, and so the join output will always be correct. However, for MERGE, the join may be an outer join, so during an EPQ recheck, we're joining the target relation (fixed to return just the updated row) to the full UNION ALL subquery. So if it's an outer join, the join output will return all-but-one of the subquery rows as not matched rows in addition to the one matched row that we want, whereas the EPQ mechanism is expecting the plan to return just one row. On the face of it, the simplest fix is to tweak is_simple_union_all() to prevent UNION ALL subquery pullup for MERGE, forcing a subquery-scan plan. A quick test shows that that fixes the reported issue. is_simple_union_all() already has a test for rowmarks, and a comment saying that locking isn't supported, but since it is called before preprocess_rowmarks(), it doesn't know that the subquery is about to be marked. However, that leaves the question of whether we should do the same for UPDATE and DELETE. There doesn't appear to be a live bug there, so maybe they're best left alone. 
Also, back-patching a change like that might make existing queries less efficient. But I feel like I might be overlooking something here, and this doesn't seem to be how EPQ rechecks are meant to work (doing a full re-scan of non-target relations). Also, if the concurrent update were an update of a key column that was included in the join condition, the re-scan would follow the update to a new matching source row, which is inconsistent with what would happen if it were a join to a regular relation. Thoughts? Regards, Dean
Re: to_regtype() Raises Error
On Feb 5, 2024, at 09:01, David E. Wheeler wrote: > Ah, thank you. Updated patch attached. I’ve moved this patch into the to_regtype patch thread[1], since it exhibits the same behavior. Best, David [1] https://www.postgresql.org/message-id/60ef4e11-bc1c-4034-b37b-695662d28...@justatheory.com
Re: ALTER TABLE SET ACCESS METHOD on partitioned tables
On 19.07.23 20:13, Justin Pryzby wrote: On Thu, Jun 01, 2023 at 08:50:50AM -0400, Michael Paquier wrote: On Wed, May 31, 2023 at 06:35:34PM -0500, Justin Pryzby wrote: What do you think the comment ought to say ? It already says: src/backend/catalog/heap.c- * Make a dependency link to force the relation to be deleted if its src/backend/catalog/heap.c- * access method is. This is the third location where we rely on the fact that RELKIND_HAS_TABLE_AM() does not include RELKIND_PARTITIONED_TABLE, so it seems worth documenting what we are relying on as a comment? Say: * Make a dependency link to force the relation to be deleted if its * access method is. * * No need to add an explicit dependency for the toast table, as the * main table depends on it. Partitioned tables have a table access * method defined, and RELKIND_HAS_TABLE_AM ignores them. You said that this location "relies on" the macro not including partitioned tables, but I would say the opposite: the places that use RELKIND_HAS_TABLE_AM() and do *not* say "or relkind==PARTITIONED_TABLE" are the ones that "rely on" that... Anyway, this updates various comments. No other changes. It would be helpful if this patch could more extensively document in its commit message what semantic changes it makes. Various options of possible behaviors were discussed in this thread, but it's not clear which behaviors were chosen in this particular patch version. The general idea is that you can set an access method on a partitioned table. That much seems very agreeable. But then what happens with this setting, how can you override it, how can you change it, what happens when you change it, what happens with existing partitions and new partitions, etc. -- and which of these behaviors are new and old. Many things to specify.
Re: Patch: Add parse_type Function
On Feb 20, 2024, at 01:30, jian he wrote: > the second hint `-- grammar error expected` seems to contradict with > the results? Quite right, thank you, that’s actually a trapped error. I’ve tweaked the comments and their order in v7, attached. This goes back to the discussion of the error raising of to_regtype[1], so I’ve incorporated the patch from that thread into this patch, and set up the docs for to_regtypemod() with similar information. The wording is still a little opaque for my taste, though, written more for someone who knows a bit about the internals, but it’s a start. I’ve also fixed the wayward parameter in the function signature in the docs, and added a note about why I’ve also patched genbki.pl. Best, David [1] https://www.postgresql.org/message-id/flat/57E1FDDC-5A38-452D-82D7-A44DA2E13862%40justatheory.com#1ae0b11634bc33c7ad3cd728e43d504e v7-0001-Add-to_regtypemod-SQL-function.patch Description: Binary data
pg_restore problem to load constraints with tables
Hi,

When a table is reloaded with pg_restore, it is recreated without indexes or constraints. They are automatically skipped. Is there a reason for this?

pg_restore -j 8 -v -d zof /shared/pgdump/aq/backup/dbtest/shtest --no-owner --role=test -t mytable 2>&1 | tee -a dbest.log
pg_restore: skipping item 7727 SEQUENCE SET xxx_seq
pg_restore: skipping item 5110 INDEX x_cons
pg_restore: skipping item 5143 CONSTRAINT xxx
pg_restore: skipping item 5670 FK CONSTRAINT xxx

Thanks for your feedback
Fabrice
Re: Injection points: some tools to wait and wake
> On 20 Feb 2024, at 02:21, Michael Paquier wrote: > > On Mon, Feb 19, 2024 at 11:54:20AM +0300, Andrey M. Borodin wrote: >> 1. injection_points_wake() will wake all of waiters. But it's not >> suitable for complex tests. I think there must be a way to wake only >> specific waiter by injection point name. > > I don't disagree with that, but I don't have a strong argument for > implementing that until there is an explicit need for it in core. It > is also possible to plug in your own module, outside of core, if you > are looking for something more specific. The backend APIs allow that. In [0] I want to create a test for edge case of reading recent multixact. External module does not allow to have a test within repository. In that test I need to sync backends in 3 steps (backend 1 starts to wait in injection point, backend 2 starts to sleep in other point, backend 1 is released and observe 3rd injection point). "wake them all" implementation allows only 2-step synchronization. I will try to simplify test to 2-step, but it would be much easier to implement if injection points could be awaken independently. >> 2. Alexander Korotkov's stopevents could be used in isolation >> tests. This kind of tests is perfect for describing complex race >> conditions. (as a side note, I'd be happy if we could have >> primary\standby in isolation tests too) > > This requires plugging is more into src/test/isolation/, with multiple > connection strings. This has been suggested in the past. I think standby isolation tests are just a sugar-on-top feature here. Wrt injection points, I'd like to see a function to wait until some injection point is observed. With this function at hand developer can implement race condition tests as an isolation test. >> 5. In many cases we need to have injection point under critical >> section. I propose to have a "prepared injection point". 
See [0] for >> example in v2-0003-Test-multixact-CV-sleep.patch >> + INJECTION_POINT_PREPARE("GetNewMultiXactId-done"); >> + >>START_CRIT_SECTION(); >> >> + INJECTION_POINT_RUN_PREPARED(); > > I don't see how that's different from a wait/wake logic? The only > thing you've changed is to stop a wait when a point is detached and > you want to make the stop conditional. Plugging in a condition > variable is more flexible than a hardcoded sleep in terms of wait, > while being more responsive. No, "prepared injection point" is not about wait\wake logic. It's about having an injection point in a critical section. A normal injection point will pstrdup(name) and fail. In [0] I need a test that waits after multixact generation before WAL-logging it. That is only possible in a critical section. Thanks! Best regards, Andrey Borodin. [0] https://www.postgresql.org/message-id/0925f9a9-4d53-4b27-a87e-3d83a757b...@yandex-team.ru
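The per-point wakeup requested above could be built on the array-of-names scheme described elsewhere in this thread. Below is a minimal standalone illustration of waking only one named waiter; the struct layout, names, and sizes are invented for the sketch, and the real code would additionally hold a spinlock and broadcast a condition variable:

```c
#include <stdbool.h>
#include <string.h>

#define MAX_INJ_WAITS 32        /* assumed cap on concurrent waiters */
#define INJ_NAME_MAXLEN 64

typedef struct InjWaitState
{
    char names[MAX_INJ_WAITS][INJ_NAME_MAXLEN]; /* "" marks a free slot */
    int  wait_counts[MAX_INJ_WAITS];
} InjWaitState;

/* Register a waiter under a point name; returns its slot index, or -1
 * if all slots are taken. */
static int
inj_wait_register(InjWaitState *st, const char *name)
{
    for (int i = 0; i < MAX_INJ_WAITS; i++)
    {
        if (st->names[i][0] == '\0')
        {
            strncpy(st->names[i], name, INJ_NAME_MAXLEN - 1);
            return i;
        }
    }
    return -1;
}

/* Wake only the waiter registered under 'name': bump its counter so its
 * wait loop (polling the counter under a CV in the real design) can
 * exit.  Other waiters are untouched, which is what makes multi-step
 * synchronization possible. */
static bool
inj_wake(InjWaitState *st, const char *name)
{
    for (int i = 0; i < MAX_INJ_WAITS; i++)
    {
        if (strcmp(st->names[i], name) == 0)
        {
            st->wait_counts[i]++;
            return true;
        }
    }
    return false;
}
```

With per-name wakeups, the 3-step scenario above becomes: wake "point-a" for backend 1, then later wake "point-b" for backend 2, in any order the test requires.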
Re: Have pg_basebackup write "dbname" in "primary_conninfo"?
On Tue, Feb 20, 2024 at 5:58 PM Hayato Kuroda (Fujitsu) wrote: > > Seems weird to me. You don't use dbname=replication to ask for a > > replication connection, so why would we ever end up with that > > anywhere? And especially in only one of two such closely related > > cases? > > Just FYI - here is an extreme case. And note that I have applied proposed > patch. > > When `pg_basebackup -D data_N2 -R` is used: > ``` > primary_conninfo = 'user=hayato ... dbname=hayato ... > ``` > > But when `pg_basebackup -d "" -D data_N2 -R` is used: > ``` > primary_conninfo = 'user=hayato ... dbname=replication > ``` It seems like maybe somebody should look into why this is happening, and perhaps fix it. -- Robert Haas EDB: http://www.enterprisedb.com
Re: A new message seems missing a punctuation
On Tue, Feb 20, 2024 at 5:01 PM Amit Kapila wrote: > > > We do expose the required information (restart_lsn, catalog_xmin, > synced, temporary, etc) via pg_replication_slots. So, I feel the LOG > message here is sufficient to DEBUG (or know the details) when the > slot sync doesn't succeed. > Please find the patch having the suggested changes. thanks Shveta v2-0001-Reword-LOG-msg-for-slot-sync.patch
Re: Thoughts about NUM_BUFFER_PARTITIONS
Hi, On 2/20/24 03:16, wenhui qiu wrote: > Hi Heikki Linnakangas >I saw git log found this commit: > https://github.com/postgres/postgres/commit/3acc10c997f916f6a741d0b4876126b7b08e3892 > ,I don't seem to see an email discussing this commit. As the commit log > tells us, we don't know exactly how large a value is optimal, and I believe > it's more flexible to make it a parameter. Thank you very much > tomas.vondra for explaining the relationship, I see that MAX_SIMUL_LWLOCKS > was just doubled in this commit, is there a more appropriate ratio between > them? > I think the discussion for that commit is in [1] (and especially [2]). That being said, I don't think MAX_SIMUL_LWLOCKS and NUM_BUFFER_PARTITIONS need to be in any particular ratio. The only requirement is that there needs to be enough slack, and 72 locks seemed to work quite fine until now - I don't think we need to change that. What might be necessary is improving held_lwlocks - we treat it as LIFO, but more as an expectation than a hard rule. I'm not sure how often we violate that rule (if at all), but if we do then it's going to get more expensive as we increase the number of locks. But I'm not sure this is actually a problem in practice, we usually hold very few LWLocks at the same time. As for making this a parameter, I'm rather opposed to the idea. If we don't have a very clear idea how to set this limit, what's the chance users with little knowledge of the internals will pick a good value? Adding yet another knob would just mean users start messing with it in random ways (typically increasing it to a very high value, because "more is better"), causing more harm than good. Adding it as a GUC would also require making some parts dynamic (instead of just doing static allocation with compile-time constants). That's not great, but I'm not sure how significant the penalty might be.
IMHO adding a GUC might be acceptable only if we fail to come up with a good value (which is going to be a trade-off), and if someone demonstrates a clear benefit of increasing the value (which I don't think has happened in this thread yet). regards [1] https://www.postgresql.org/message-id/flat/CAA4eK1LSTcMwXNO8ovGh7c0UgCHzGbN%3D%2BPjggfzQDukKr3q_DA%40mail.gmail.com [2] https://www.postgresql.org/message-id/CA%2BTgmoY58dQi8Z%3DFDAu4ggxHV-HYV03-R9on1LSP9OJU_fy_zA%40mail.gmail.com -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
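For context on what NUM_BUFFER_PARTITIONS actually controls: the buffer mapping table is split across that many LWLock partitions by reducing the buffer-tag hash modulo the partition count, which is what the BufTableHashPartition() macro in the tree does. A minimal standalone sketch (the constant is 128 in current sources, after the commit linked above raised it from 16; the function name here is invented):

```c
#include <stdint.h>

#define NUM_BUFFER_PARTITIONS 128   /* in-core value since commit 3acc10c9 */

/* Map a buffer-tag hash code to its lock partition.  Since the
 * partition count is a power of two, the modulo compiles down to a
 * cheap bit mask; hashes that agree in the low bits contend for the
 * same partition LWLock. */
static uint32_t
buf_partition(uint32_t hashcode)
{
    return hashcode % NUM_BUFFER_PARTITIONS;
}
```

This is also why the number of partitions trades off contention (more partitions, fewer collisions) against the slack needed in MAX_SIMUL_LWLOCKS when many partitions are locked at once.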
Re: numeric_big in make check?
> On 20 Feb 2024, at 14:23, Dean Rasheed wrote: > If we did that, numeric_big would be even further down the list of > expensive tests, and I'd say it should be run by default. My motivation for raising this was to get a test which is executed as part of parallel_schedule to make sure failures aren't missed. If we get there by slimming down numeric_big to keep the unique coverage then that sounds like a good plan to me. -- Daniel Gustafsson
Re: speed up a logical replica setup
On Mon, 19 Feb 2024 at 11:15, Hayato Kuroda (Fujitsu) wrote: > > Dear hackers, > > > Since it may be useful, I will post top-up patch on Monday, if there are no > > updating. > > And here are top-up patches. Feel free to check and include. > > v22-0001: Same as v21-0001. > === rebased patches === > v22-0002: Update docs per recent changes. Same as v20-0002. > v22-0003: Add check versions of the target. Extracted from v20-0003. > v22-0004: Remove -S option. Mostly same as v20-0009, but commit message was > slightly changed. > === Newbie === > V22-0005: Addressed my comments which seems to be trivial[1]. > Comments #1, 3, 4, 8, 10, 14, 17 were addressed here. > v22-0006: Consider the scenario when commands are failed after the recovery. > drop_subscription() is removed and some messages are added per [2]. > V22-0007: Revise server_is_in_recovery() per [1]. Comments #5, 6, 7, were > addressed here. > V22-0008: Fix a strange report when physical_primary_slot is null. Per > comment #9 [1]. > V22-0009: Prohibit reuse publications when it has already existed. Per > comments #11 and 12 [1]. > V22-0010: Avoid to call PQclear()/PQfinish()/pg_free() if the process exits > soon. Per comment #15 [1]. > V22-0011: Update testcode. Per comments #17- [1]. > > I did not handle below points because I have unclear points. > > a. > This patch set cannot detect the disconnection between the target (standby) > and > source (primary) during the catch-up. Because the connection status must be > gotten > at the same time (=in the same query) with the recovery status, but it is > now an > independent function (server_is_in_recovery()). > > b. > This patch set cannot detect the inconsistency reported by Shubham [3]. I > could not > come up with solutions without removing -P...
Few comments: 1) The below code can lead to assertion failure if the publisher is stopped while dropping the replication slot: + if (primary_slot_name != NULL) + { + conn = connect_database(dbinfo[0].pubconninfo); + if (conn != NULL) + { + drop_replication_slot(conn, &dbinfo[0], primary_slot_name); + } + else + { + pg_log_warning("could not drop replication slot \"%s\" on primary", + primary_slot_name); + pg_log_warning_hint("Drop this replication slot soon to avoid retention of WAL files."); + } + disconnect_database(conn); + } pg_createsubscriber: error: connection to database failed: connection to server on socket "/tmp/.s.PGSQL.5432" failed: No such file or directory Is the server running locally and accepting connections on that socket? pg_createsubscriber: warning: could not drop replication slot "standby_1" on primary pg_createsubscriber: hint: Drop this replication slot soon to avoid retention of WAL files. pg_createsubscriber: pg_createsubscriber.c:432: disconnect_database: Assertion `conn != ((void *)0)' failed.
Aborted (core dumped) This is happening because we are calling disconnect_database in case of connection failure case too which has the following assert: +static void +disconnect_database(PGconn *conn) +{ + Assert(conn != NULL); + + PQfinish(conn); +} 2) There is a CheckDataVersion function which does exactly this, will we be able to use this: + /* Check standby server version */ + if ((ver_fd = fopen(versionfile, "r")) == NULL) + pg_fatal("could not open file \"%s\" for reading: %m", versionfile); + + /* Version number has to be the first line read */ + if (!fgets(rawline, sizeof(rawline), ver_fd)) + { + if (!ferror(ver_fd)) + pg_fatal("unexpected empty file \"%s\"", versionfile); + else + pg_fatal("could not read file \"%s\": %m", versionfile); + } + + /* Strip trailing newline and carriage return */ + (void) pg_strip_crlf(rawline); + + if (strcmp(rawline, PG_MAJORVERSION) != 0) + { + pg_log_error("standby server is of wrong version"); + pg_log_error_detail("File \"%s\" contains \"%s\", which is not compatible with this program's version \"%s\".", + versionfile, rawline, PG_MAJORVERSION); + exit(1); + } + + fclose(ver_fd); 3) Should this be added to typedefs.list: +enum WaitPMResult +{ + POSTMASTER_READY, + POSTMASTER_STILL_STARTING +}; 4) pgCreateSubscriber should be mentioned after pg_controldata to keep the ordering consistency: diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml index aa94f6adf6..c5edd244ef 100644 --- a/doc/src/sgml/reference.sgml +++ b/doc/src/sgml/reference.sgml @@ -285,6 +285,7 @@ + 5) Here pg_replication_slots should be
Re: confirmed flush lsn seems to be move backward in certain error cases
On Fri, 16 Feb 2024 at 17:39, vignesh C wrote: > > Hi, > > The following assertion failure was seen while testing one scenario > for other patch: > TRAP: failed Assert("s->data.confirmed_flush >= > s->last_saved_confirmed_flush"), File: "slot.c", Line: 1760, PID: > 545314 > postgres: checkpointer performing shutdown > checkpoint(ExceptionalCondition+0xbb)[0x564ee6870c58] > postgres: checkpointer performing shutdown > checkpoint(CheckPointReplicationSlots+0x18e)[0x564ee65e9c71] > postgres: checkpointer performing shutdown > checkpoint(+0x1e1403)[0x564ee61be403] > postgres: checkpointer performing shutdown > checkpoint(CreateCheckPoint+0x78a)[0x564ee61bdace] > postgres: checkpointer performing shutdown > checkpoint(ShutdownXLOG+0x150)[0x564ee61bc735] > postgres: checkpointer performing shutdown > checkpoint(+0x5ae28c)[0x564ee658b28c] > postgres: checkpointer performing shutdown > checkpoint(CheckpointerMain+0x31e)[0x564ee658ad55] > postgres: checkpointer performing shutdown > checkpoint(AuxiliaryProcessMain+0x1d1)[0x564ee65888d9] > postgres: checkpointer performing shutdown > checkpoint(+0x5b7200)[0x564ee6594200] > postgres: checkpointer performing shutdown > checkpoint(PostmasterMain+0x14da)[0x564ee658f12f] > postgres: checkpointer performing shutdown > checkpoint(+0x464fc6)[0x564ee6441fc6] > /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7ff6afa29d90] > /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7ff6afa29e40] > postgres: checkpointer performing shutdown > checkpoint(_start+0x25)[0x564ee60b8e05] > > I was able to reproduce this issue with the following steps: > -- Setup > -- Publisher node: > create table t1(c1 int); > create table t2(c1 int); > create publication pub1 for table t1; > create publication pub2 for table t2; > > -- Subscriber node: > create table t1(c1 int); > create table t2(c1 int); > create subscription test1 connection 'dbname=postgres host=localhost > port=5432' publication pub1, pub2; > select * from pg_subscription; > > -- Actual 
test > insert into t1 values(10); > insert into t2 values(20); > select pg_sleep(10); > drop publication pub2; > insert into t1 values(10); > insert into t2 values(20); > > Stop the publisher to see the assertion. > > For me the issue reproduces about twice in five times using the > assert_failure.sh script attached. > > After the insert operation is replicated to the subscriber, the > subscriber will set the lsn value sent by the publisher in the > replication origin (in my case it was 0/1510978). publisher will then > send keepalive messages with the current WAL position in the publisher > (in my case it was 0/15109B0), but subscriber will simply send this > position as the flush_lsn to the publisher as there are no ongoing > transactions. Then since the publisher is started, it will identify > that publication does not exist and stop the walsender/apply worker > process. When the apply worker is restarted, we will get the > remote_lsn (in my case it was 0/1510978) of the origin and set it to > origin_startpos. We will start the apply worker with this > origin_startpos (origin's remote_lsn). This position will be sent as > feedback to the walsender process from the below stack: > run_apply_worker->start_apply->LogicalRepApplyLoop->send_feedback. > It will use the following send_feedback function call of > LogicalRepApplyLoop function as in below code here as nothing is > received from walsender: > LogicalRepApplyLoop function > ... > len = walrcv_receive(LogRepWorkerWalRcvConn, &buf, &fd); > if (len != 0) > { >/* Loop to process all available data (without blocking). */ >for (;;) >{ > CHECK_FOR_INTERRUPTS(); > ... >} > } > > /* confirm all writes so far */ > send_feedback(last_received, false, false); > ... > In send_feedback, we will set flushpos to replication origin's > remote_lsn and send it to the walsender process.
Walsender process > will receive this information and set confirmed_flush in: > ProcessStandbyReplyMessage->LogicalConfirmReceivedLocation > > Then immediately we try to stop the publisher instance; the shutdown > checkpoint process will be triggered. In this case: > confirmed_flush = 0/1510978 will be less than > last_saved_confirmed_flush = 0/15109B0 which will result in Assertion > failure. > > This issue is happening because we allow setting the confirmed_flush > to a backward position. > There are a couple of ways to fix this: > a) One way is not to update the confirmed_flush if the lsn sent is an > older value like in Confirm_flush_dont_allow_backward.patch > b) Another way is to remove the assertion in > CheckPointReplicationSlots and mark the slot as dirty only if > confirmed_flush is greater than last_saved_confirmed_flush like in > Assert_confirmed_flush_will_always_not_be_less_than_last_saved_confirmed_flush.patch > > I preferred the first approach. I have created the following
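Approach (a) above amounts to making the confirmed_flush update monotonic. A minimal sketch with simplified types (this is not the actual slot.c code; names and the struct are invented for illustration):

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;    /* simplified stand-in for the real type */

typedef struct ReplSlotSketch
{
    XLogRecPtr confirmed_flush;
} ReplSlotSketch;

/* Only ever advance confirmed_flush.  A stale (older or equal) LSN
 * reported by the downstream is ignored, so the value can never move
 * backward and the checkpoint-time assertion against
 * last_saved_confirmed_flush cannot fire. */
static bool
confirm_received(ReplSlotSketch *s, XLogRecPtr lsn)
{
    if (lsn <= s->confirmed_flush)
        return false;           /* backward or duplicate report: skip */
    s->confirmed_flush = lsn;
    return true;
}
```

Using the LSNs from the report above, a slot already at 0/15109B0 would simply ignore the replayed 0/1510978 instead of moving backward.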
Re: Synchronizing slots from primary to standby
On Tue, Feb 20, 2024 at 6:19 PM Masahiko Sawada wrote: > > Thank you for the explanation. It makes sense to me to move the check. > > As for ValidateSlotSyncParams() called by SlotSyncWorkerAllowed(), I > have two comments: > > 1. The error messages are not very descriptive and seem not to match > other messages the postmaster says. When starting the standby server > with misconfiguration about the slotsync, I got the following messages > from the postmaster: > > 2024-02-20 17:01:16.356 JST [456741] LOG: database system is ready to > accept read-only connections > 2024-02-20 17:01:16.358 JST [456741] LOG: bad configuration for slot > synchronization > 2024-02-20 17:01:16.358 JST [456741] HINT: "hot_standby_feedback" > must be enabled. > > It says "bad configuration" but is still working, and does not say > further information such as whether it skipped to start the slotsync > worker etc. I think these messages could work for the slotsync worker > but we might want to have more descriptive messages for the > postmaster. For example, "skipped starting slot sync worker because > hot_standby_feedback is disabled". > We are planning to change it to something like: "slot synchronization requires hot_standby_feedback to be enabled". See [1] > 2. If the wal_level is not logical, the server will need to restart > anyway to change the wal_level and have the slotsync worker work. Does > it make sense to have the postmaster exit if the wal_level is not > logical and sync_replication_slots is enabled? For instance, we have > similar checks in PostmasterMain(): > > if (summarize_wal && wal_level == WAL_LEVEL_MINIMAL) > ereport(ERROR, > (errmsg("WAL cannot be summarized when wal_level is > \"minimal\""))); > +1. I think giving an error in this case makes sense. Miscellaneous comments: 1. +void +ShutDownSlotSync(void) +{ + SpinLockAcquire(&SlotSyncCtx->mutex); + + SlotSyncCtx->stopSignaled = true; This flag is never reset back. I think we should reset this once the promotion is complete.
Though offhand, I don't see any problem with this but it doesn't look clean and can be a source of bugs in the future. 2. +char * +CheckDbnameInConninfo(void) { char *dbname; Let's name this function as CheckAndGetDbnameFromConninfo(). Apart from the above, I have made cosmetic changes in the attached. [1] - https://www.postgresql.org/message-id/CAJpy0uBWomyAjP0zyFdzhGxn%2BXsAb2OdJA%2BKfNyZRv2nV6PD9g%40mail.gmail.com -- With Regards, Amit Kapila. diff --git a/src/backend/replication/logical/slotsync.c b/src/backend/replication/logical/slotsync.c index d139d53173..c133fed6c2 100644 --- a/src/backend/replication/logical/slotsync.c +++ b/src/backend/replication/logical/slotsync.c @@ -1389,7 +1389,7 @@ ShutDownSlotSync(void) /* * SlotSyncWorkerCanRestart * - * Returns true if enough time has passed (SLOTSYNC_RESTART_INTERVAL_SEC) + * Returns true if enough time (SLOTSYNC_RESTART_INTERVAL_SEC) has passed * since it was launched last. Otherwise returns false. * * This is a safety valve to protect against continuous respawn attempts if the diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c index a2269c46f7..7e9bdf9e33 100644 --- a/src/backend/replication/slot.c +++ b/src/backend/replication/slot.c @@ -1242,14 +1242,13 @@ restart: * happening here. The persistent synced slots are thus safe but there * is a possibility that the slot sync worker has created a temporary * slot (which stays active even on release) and we are trying to drop -* the same here. In practice, the chances of hitting this scenario is -* very less as during slot synchronization, the temporary slot is -* immediately converted to persistent and thus is safe due to the -* shared lock taken on the database. So for the time being, we'll -* just bail out in such a scenario. +* the same here.
In practice, the chances of hitting this scenario are less +* as during slot synchronization, the temporary slot is immediately +* converted to persistent and thus is safe due to the shared lock +* taken on the database. So, we'll just bail out in such a case. * -* XXX: If needed, we can consider shutting down slot sync worker -* before trying to drop synced temporary slots here. +* XXX: We can consider shutting down the slot sync worker before +* trying to drop synced temporary slots here. */ if (active_pid) ereport(ERROR,
Re: numeric_big in make check?
On Mon, 19 Feb 2024 at 15:35, Tom Lane wrote: > > I thought I'd try to acquire some actual facts here, so I compared > the code coverage shown by "make check" as of HEAD, versus "make > check" after adding numeric_big to parallel_schedule. I saw the > following lines of numeric.c as being covered in the second run > and not the first: > I don't think that tells the whole story. Code coverage only tells you whether a particular line of code has been hit, not whether it has been properly tested with all the values that might lead to different cases. For example: > sqrt_var(): > 10073 res_ndigits = res_weight + 1 - (-rscale - 1) / DEC_DIGITS; > To get code coverage of this line, all you need to do is ensure that sqrt_var() is called with rscale < -1 (which can only happen from the range-reduction code in ln_var()). You could do that by computing ln(1e50), which results in calling sqrt_var() with rscale = -2, causing that line of code in sqrt_var() to be hit. That would satisfy code coverage, but I would argue that you've only really tested that line of code properly if you also hit it with rscale = -3, and probably a few more values, to check that the round-down division is working as intended. Similarly, looking at commit 18a02ad2a5, the crucial code change was the following in power_var(): -val = Max(val, -NUMERIC_MAX_RESULT_SCALE); -val = Min(val, NUMERIC_MAX_RESULT_SCALE); val *= 0.434294481903252;/* approximate decimal result weight */ Any test that calls numeric_power() is going to hit those lines of code, but to see a failure, it was necessary to hit them with the absolute value of "val" greater than NUMERIC_MAX_RESULT_SCALE, which is why that commit added 2 new test cases to numeric_big, calling power_var() with "val" outside that range. Code coverage is never going to tell you whether or not that is being tested, since the code change was to delete lines. 
Even if that weren't the case, any line of code involving Min() or Max() has 2 branches, and code coverage won't tell you if you've hit both of them. > Pretty poor return on investment for the runtime consumed. I don't > object to adding something to numeric.sql that's targeted at adding > coverage for these (or anyplace else that's not covered), but let's > not just throw cycles at the problem. > I agree that blindly performing a bunch of large computations (just throwing cycles at the problem) is not a good approach to testing. And maybe that's a fair characterisation of parts of numeric_big. However, it also contains some fairly well-targeted tests intended to test specific edge cases that only occur with particular ranges of inputs, which don't necessarily show up as increased code coverage. So I think this requires more careful analysis. Simply deleting numeric_big and adding tests to numeric.sql until the same level of code coverage is achieved will not give the same level of testing. It's also worth noting that the cost of running numeric_big has come down very significantly over the last few years, as can be seen by running the current numeric_big script against old backends. There's a lot of random variation in the timings, but the trend is very clear:

9.5   1.641s
9.6   0.856s
10    0.845s
11    0.750s
12    0.760s
13    0.672s
14    0.430s
15    0.347s
16    0.336s

Arguably, this is a useful benchmark to spot performance regressions and test proposed performance-improving patches. If I run "EXTRA_TESTS=numeric_big make check | grep 'ok ' | sort -nrk5", numeric_big is not in the top 10 most expensive tests (it's usually down at around 15'th place). Looking at the script itself, the addition, subtraction, multiplication and division tests at the top are probably pointless, since I would expect those operations to be tested adequately (and probably more thoroughly) by the transcendental test cases.
In fact, I think it would probably be OK to delete everything above line 650, and just keep the bottom half of the script -- the pow(), exp(), ln() and log() tests, which cover various edge cases, as well as exercising basic arithmetic operations internally. We might want to check that I/O of large numerics is still being tested properly though. If we did that, numeric_big would be even further down the list of expensive tests, and I'd say it should be run by default. Regards, Dean
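The round-down division in the sqrt_var() line quoted earlier in this message can be exercised in isolation. A standalone mirror of that one expression (DEC_DIGITS is 4 in numeric.c; the wrapper function name is invented), showing that several negative rscale values collapse onto the same digit count, which is exactly why hitting the line once in coverage proves little:

```c
#define DEC_DIGITS 4            /* decimal digits per NBASE digit, as in numeric.c */

/* Mirror of: res_ndigits = res_weight + 1 - (-rscale - 1) / DEC_DIGITS;
 * C integer division truncates toward zero, so rscale values -1 through
 * -4 all yield the same result -- line coverage cannot distinguish
 * them, only value-targeted tests can. */
static int
sqrt_res_ndigits(int res_weight, int rscale)
{
    return res_weight + 1 - (-rscale - 1) / DEC_DIGITS;
}
```

For example, with res_weight = 10, rscale values of -2 and -4 both give 11 result digits, while -5 is the first value that drops to 10.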
Re: Synchronizing slots from primary to standby
On Tue, Feb 20, 2024 at 12:44 PM Amit Kapila wrote: > > On Tue, Feb 20, 2024 at 8:25 AM Masahiko Sawada wrote: > > > > Some comments not related to the patch but to the existing code: > > > > --- > > It might have already been discussed but is the > > src/backend/replication/logical the right place for the slotsync.c? If > > it's independent of logical decoding/replication, being under > > src/backend/replication could be more appropriate? > > Thank you for the comment. > > This point has not been discussed, so thanks for raising it. I think > the reasoning behind keeping it in logical is that this file contains > the code for logical slot syncing and a worker doing that. As it is > mostly about logical slot syncing so there is an argument to keep it > under logical. In the future, we may need to extend this functionality > to have a per-db slot sync worker as well in which case it will > probably be again somewhat related to logical slots. That's a valid argument. > Having said that, > there is an argument to keep it under replication as well because the > functionality it provides is for physical standbys. Another argument to keep it under replication is; all files under the replication/logical directory are logical decoding and logical replication infrastructures. IOW these are the functionality built on top of (logical) replication slots. On the other hand, the slotsync worker (and slotsync functionality) looks like a part of the slot management functionality, which seems to be on the same layer as slot.c. BTW the slotsync.c of v91 patch includes "replication/logical.h" but it isn't actually necessary and #include'ing "replication/slot.h" is sufficient. Regards -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: Synchronizing slots from primary to standby
On Tue, Feb 20, 2024 at 12:33 PM shveta malik wrote: > > On Tue, Feb 20, 2024 at 8:25 AM Masahiko Sawada wrote: > > > > > > I've reviewed the v91 patch. Here are random comments: > > Thanks for the comments. > > > --- > > /* > > * Checks the remote server info. > > * > > - * We ensure that the 'primary_slot_name' exists on the remote server and > > the > > - * remote server is not a standby node. > > + * Check whether we are a cascading standby. For non-cascading standbys, it > > + * also ensures that the 'primary_slot_name' exists on the remote server. > > */ > > > > IIUC what the validate_remote_info() does doesn't change by this > > patch, so the previous comment seems to be clearer to me. > > > > --- > > if (remote_in_recovery) > > + { > > + /* > > +* If we are a cascading standby, no need to check further for > > +* 'primary_slot_name'. > > +*/ > > ereport(ERROR, > > errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > > errmsg("cannot synchronize replication slots from a > > standby server")); > > + } > > + else > > + { > > + bool primary_slot_valid; > > > > - primary_slot_valid = DatumGetBool(slot_getattr(tupslot, 2, &isnull)); > > - Assert(!isnull); > > + primary_slot_valid = DatumGetBool(slot_getattr(tupslot, 2, > > &isnull)); > > + Assert(!isnull); > > > > - if (!primary_slot_valid) > > - ereport(ERROR, > > - errcode(ERRCODE_INVALID_PARAMETER_VALUE), > > - errmsg("bad configuration for slot synchronization"), > > - /* translator: second %s is a GUC variable name */ > > - errdetail("The replication slot \"%s\" specified by > > \"%s\" does not exist on the primary server.", > > - PrimarySlotName, "primary_slot_name")); > > + if (!primary_slot_valid) > > + ereport(ERROR, > > + errcode(ERRCODE_INVALID_PARAMETER_VALUE), > > + errmsg("bad configuration for slot synchronization"), > > + /* translator: second %s is a GUC variable name */ > > + errdetail("The replication slot \"%s\" specified > > by \"%s\" does not exist on the primary server.", > > + PrimarySlotName,
"primary_slot_name")); > > + } > > > > I think it's a refactoring rather than changes required by the > > slotsync worker. We can do that in a separate patch but why do we need > > this change in the first place? > > In v90, this refactoring was made due to the fact that > validate_remote_info() was supposed to behave differently for SQL > function and slot-sync worker. SQL-function was supposed to ERROR out > while the worker was supposed to LOG and become no-op. And thus the > change was needed. In v91, we made this functionality same i.e. both > sql function and worker will error out but missed to remove this > refactoring. Thanks for catching this, I will revert it in the next > version. To match the refactoring, I made the comment change too, will > revert that as well. > > > --- > > +ValidateSlotSyncParams(ERROR); > > + > > /* Load the libpq-specific functions */ > > load_file("libpqwalreceiver", false); > > > > -ValidateSlotSyncParams(); > > +(void) CheckDbnameInConninfo(); > > > > Is there any reason why we move where to check the parameters? > > Earlier DBname verification was done inside ValidateSlotSyncParams() > and thus it was needed to load 'libpqwalreceiver' before we call this > function. Now we have moved dbname verification in a separate call and > thus the above order got changed. ValidateSlotSyncParams() is a common > function used by SQL function and worker. Earlier slot sync worker was > checking all the GUCs after starting up and was exiting each time any > GUC was invalid. It was suggested that it would be better to check the > GUCs before starting the slot sync worker in the postmaster itself, > making the ValidateSlotSyncParams() move to postmaster (see > SlotSyncWorkerAllowed). But it was not a good idea to load libpq in > postmaster and thus we moved libpq related verification out of > ValidateSlotSyncParams(). This resulted in the above change. I hope > it answers your query. Thank you for the explanation. 
It makes sense to me to move the check. As for ValidateSlotSyncParams() called by SlotSyncWorkerAllowed(), I have two comments: 1. The error messages are not very descriptive and seem not to match other messages the postmaster says. When starting the standby server with misconfiguration about the slotsync, I got the following messages from the postmaster: 2024-02-20 17:01:16.356 JST [456741] LOG: database system is ready to accept read-only connections 2024-02-20 17:01:16.358 JST [456741] LOG: bad configuration for slot synchronization 2024-02-20 17:01:16.358 JST [456741] HINT: "hot_standby_feedback" must be enabled. It says "bad configuration" but is still working, and
Re: Replace current implementations in crypt() and gen_salt() to OpenSSL
On 20.02.24 12:39, Daniel Gustafsson wrote: A fifth option is to throw away our in-tree implementations and use the OpenSSL API's for everything, which is where this thread started. If the effort to payoff ratio is palatable to anyone then patches are for sure welcome. The problem is that, as I understand it, these crypt routines are not designed in a way that you can just plug in a crypto library underneath. Effectively, the definition of what, say, blowfish crypt does, is whatever is in that source file, and transitively, whatever OpenBSD does. (Fun question: Does OpenBSD care about FIPS?) Of course, you could reimplement the same algorithms independently, using OpenSSL or whatever. But I don't think this will really improve the state of the world in aggregate, because to a large degree we are relying on the upstream to keep these implementations maintained, and if we rewrite them, we become the upstream.
Re: Replace current implementations in crypt() and gen_salt() to OpenSSL
> On 20 Feb 2024, at 13:24, Robert Haas wrote: > > On Tue, Feb 20, 2024 at 5:09 PM Daniel Gustafsson wrote: >> A fifth option is to throw away our in-tree implementations and use the >> OpenSSL >> API's for everything, which is where this thread started. If the effort to >> payoff ratio is palatable to anyone then patches are for sure welcome. > > That generally seems fine, although I'm fuzzy on what our policy > actually is. We have fallback implementations for some things and not > others, IIRC. I'm not sure there is a well-formed policy, but IIRC the idea with cryptohash was to provide in-core functionality iff OpenSSL isn't used, and only use the OpenSSL implementations if it is. Since pgcrypto cannot be built without OpenSSL (since db7d1a7b0530e8cbd045744e1c75b0e63fb6916f) I don't think it's a problem to continue the work from that commit and replace more with OpenSSL implementations. -- Daniel Gustafsson
Re: Replace current implementations in crypt() and gen_salt() to OpenSSL
On 20.02.24 12:27, Robert Haas wrote: I don't think the first two of these proposals help anything. AIUI, FIPS mode is supposed to be a system wide toggle that affects everything on the machine. The third one might help if you can be compliant by just choosing not to install that extension, and the fourth one solves the problem by sledgehammer. Does Linux provide some way of asking whether "fips=1" was specified at kernel boot time? What you are describing only happens on Red Hat systems, I think. They have built additional integration around this, which is great. But that's not something you can rely on being the case on all systems, not even all Linux systems.
RE: Have pg_basebackup write "dbname" in "primary_conninfo"?
Dear Robert, > Seems weird to me. You don't use dbname=replication to ask for a > replication connection, so why would we ever end up with that > anywhere? And especially in only one of two such closely related > cases? Just FYI - here is an extreme case. And note that I have applied the proposed patch. When `pg_basebackup -D data_N2 -R` is used:

```
primary_conninfo = 'user=hayato ... dbname=hayato ...
```

But when `pg_basebackup -d "" -D data_N2 -R` is used:

```
primary_conninfo = 'user=hayato ... dbname=replication
```

Best Regards, Hayato Kuroda FUJITSU LIMITED https://www.fujitsu.com/
Re: Replace current implementations in crypt() and gen_salt() to OpenSSL
On Tue, Feb 20, 2024 at 5:09 PM Daniel Gustafsson wrote: > A fifth option is to throw away our in-tree implementations and use the > OpenSSL > API's for everything, which is where this thread started. If the effort to > payoff ratio is palatable to anyone then patches are for sure welcome. That generally seems fine, although I'm fuzzy on what our policy actually is. We have fallback implementations for some things and not others, IIRC. > > Does Linux provide some way of asking whether "fips=1" was specified > > at kernel boot time? > > There is a crypto.fips_enabled sysctl but I have no idea how portable that is > across distributions etc. My guess would be that it's pretty portable, but my guesses about Linux might not be very good. Still, if we wanted to go this route, it probably wouldn't be too hard to figure out how portable this is. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Integer underflow in fprintf in dsa.c
On Tue, Feb 20, 2024 at 5:30 PM Daniel Gustafsson wrote:
> The message "at least 0 contiguous pages free" reads a bit nonsensical though,
> wouldn't it be preferable to check for i being zero and print a custom message
> for that case? Something like the below untested sketch?
>
> + if (i == 0)
> +     fprintf(stderr,
> +             "segment bin %zu (no contiguous free pages):\n", i);
> + else
> +     fprintf(stderr,
> +             "segment bin %zu (at least %d contiguous pages free):\n",
> +             i, 1 << (i - 1));

That does seem reasonable. However, this is just debugging code, so it also probably isn't necessary to sweat anything too much. -- Robert Haas EDB: http://www.enterprisedb.com
Re: logical decoding and replication of sequences, take 2
On 2/20/24 06:54, Amit Kapila wrote: > On Thu, Dec 21, 2023 at 6:47 PM Tomas Vondra > wrote: >> >> On 12/19/23 13:54, Christophe Pettus wrote: >>> Hi, >>> >>> I wanted to hop in here on one particular issue: >>> On Dec 12, 2023, at 02:01, Tomas Vondra wrote: - desirability of the feature: Random IDs (UUIDs etc.) are likely a much better solution for distributed (esp. active-active) systems. But there are important use cases that are likely to keep using regular sequences (online upgrades of single-node instances, existing systems, ...). >>> >>> +1. >>> >>> Right now, the lack of sequence replication is a rather large >>> foot-gun on logical replication upgrades. Copying the sequences >>> over during the cutover period is doable, of course, but: >>> >>> (a) There's no out-of-the-box tooling that does it, so everyone has >>> to write some scripts just for that one function. >>> >>> (b) It's one more thing that extends the cutover window. >>> >> >> I agree it's an annoying gap for this use case. But if this is the only >> use cases, maybe a better solution would be to provide such tooling >> instead of adding it to the logical decoding? >> >> It might seem a bit strange if most data is copied by replication >> directly, while sequences need special handling, ofc. >> > > One difference between the logical replication of tables and sequences > is that we can guarantee with synchronous_commit (and > synchronous_standby_names) that after failover transactions data is > replicated or not whereas for sequences we can't guarantee that > because of their non-transactional nature. Say, there are two > transactions T1 and T2, it is possible that T1's entire table data and > sequence data are committed and replicated but T2's sequence data is > replicated. So, after failover to logical subscriber in such a case if > one routes T2 again to the new node as it was not successful > previously then it would needlessly perform the sequence changes > again. 
I don't know how much that matters but that would probably be the > difference between the replication of tables and sequences. > I don't quite follow what the problem with synchronous_commit is :-( For sequences, we log the changes ahead, i.e. even if nextval() did not write anything into WAL, it's still safe because these changes are covered by the WAL generated some time ago (up to ~32 values back). And that's certainly subject to synchronous_commit, right? There certainly are issues with sequences and syncrep: https://www.postgresql.org/message-id/712cad46-a9c8-1389-aef8-faf0203c9...@enterprisedb.com but that's unrelated to logical replication. FWIW I don't think we'd re-apply sequence changes needlessly, because the worker does update the origin after applying non-transactional changes. So after the replication gets restarted, we'd skip what we already applied, no? But maybe there is an issue and I'm just not getting it. Could you maybe share an example of T1/T2, with a replication restart and what you think would happen? > I agree with your point above that for upgrades some tool like > pg_copysequence where we can provide a way to copy sequence data to > subscribers from the publisher would suffice the need. > Perhaps. Unfortunately it doesn't quite work for failovers, and it's yet another tool users would need to use. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Integer undeflow in fprintf in dsa.c
> On 20 Feb 2024, at 12:28, Ильясов Ян wrote:
> fprintf(stderr,
>         "segment bin %zu (at least %d contiguous pages free):\n",
>         i, 1 << (i - 1));
>
> In case i equals zero user will get "at least -2147483648 contiguous pages free".

That does indeed seem like an oversight.

> I believe that this is a mistake, and fprintf should print "at least 0 contiguous pages free"
> in case i equals zero.

The message "at least 0 contiguous pages free" reads a bit nonsensical though, wouldn't it be preferable to check for i being zero and print a custom message for that case? Something like the below untested sketch?

+ if (i == 0)
+     fprintf(stderr,
+             "segment bin %zu (no contiguous free pages):\n", i);
+ else
+     fprintf(stderr,
+             "segment bin %zu (at least %d contiguous pages free):\n",
+             i, 1 << (i - 1));

-- Daniel Gustafsson
Re: Replace current implementations in crypt() and gen_salt() to OpenSSL
> On 20 Feb 2024, at 12:18, Peter Eisentraut wrote: > I think we are going about this the wrong way. It doesn't make sense to ask > OpenSSL what a piece of code that doesn't use OpenSSL should do. Given that pgcrypto cannot be built without OpenSSL, and ideally we should be using the OpenSSL implementations for everything, I don't think it's too far-fetched. -- Daniel Gustafsson
Re: Replace current implementations in crypt() and gen_salt() to OpenSSL
> On 20 Feb 2024, at 12:27, Robert Haas wrote: > > On Tue, Feb 20, 2024 at 4:49 PM Peter Eisentraut wrote: >> I think there are several less weird ways to address this: >> >> * Just document it. >> >> * Make a pgcrypto-level GUC setting. >> >> * Split out these functions into a separate extension. >> >> * Deprecate these functions. >> >> Or some combination of these. > > I don't think the first two of these proposals help anything. AIUI, > FIPS mode is supposed to be a system wide toggle that affects > everything on the machine. The third one might help if you can be > compliant by just choosing not to install that extension, and the > fourth one solves the problem by sledgehammer. A fifth option is to throw away our in-tree implementations and use the OpenSSL API's for everything, which is where this thread started. If the effort to payoff ratio is palatable to anyone then patches are for sure welcome. > Does Linux provide some way of asking whether "fips=1" was specified > at kernel boot time? There is a crypto.fips_enabled sysctl but I have no idea how portable that is across distributions etc. -- Daniel Gustafsson
Re: speed up a logical replica setup
On Mon, 19 Feb 2024 at 11:15, Hayato Kuroda (Fujitsu) wrote: > > Dear hackers, > > > Since it may be useful, I will post a top-up patch on Monday, if there are no > > updates. > > And here are the top-up patches. Feel free to check and include them. > > v22-0001: Same as v21-0001. > === rebased patches === > v22-0002: Update docs per recent changes. Same as v20-0002. > v22-0003: Add version checks for the target. Extracted from v20-0003. > v22-0004: Remove -S option. Mostly same as v20-0009, but the commit message was > slightly changed. > === new patches === > V22-0005: Addressed my comments which seem to be trivial [1]. > Comments #1, 3, 4, 8, 10, 14, 17 were addressed here. > v22-0006: Consider the scenario where commands fail after the recovery. > drop_subscription() is removed and some messages are added per [2]. > V22-0007: Revise server_is_in_recovery() per [1]. Comments #5, 6, 7 were > addressed here. > V22-0008: Fix a strange report when physical_primary_slot is null. Per > comment #9 [1]. > V22-0009: Prohibit reusing publications when they already exist. Per > comments #11 and 12 [1]. > V22-0010: Avoid calling PQclear()/PQfinish()/pg_free() if the process exits > soon. Per comment #15 [1]. > V22-0011: Update test code. Per comments #17- [1]. > > I did not handle the below points because some things are still unclear to me. > > a. > This patch set cannot detect a disconnection between the target (standby) and > source (primary) during the catch-up, because the connection status must be obtained > at the same time (= in the same query) as the recovery status, but it is currently an > independent function (server_is_in_recovery()). > > b. > This patch set cannot detect the inconsistency reported by Shubham [3]. I could not > come up with a solution without removing -P... 
A few comments regarding the documentation: 1) max_replication_slots information seems to be present a couple of times: + + The target instance must have + max_replication_slots + and max_logical_replication_workers + configured to a value greater than or equal to the number of target + databases. + + + + The target instance must have + max_replication_slots + configured to a value greater than or equal to the number of target + databases and replication slots. + + 2) Can we add an id to prerequisites and use it instead of referring to r1-app-pg_createsubscriber-1: - pg_createsubscriber checks if the given target data - directory has the same system identifier than the source data directory. - Since it uses the recovery process as one of the steps, it starts the - target server as a replica from the source server. If the system - identifier is not the same, pg_createsubscriber will - terminate with an error. + Checks the target can be converted. In particular, things listed in + above section would be + checked. If these are not met pg_createsubscriber + will terminate with an error. 3) The code also checks the following: Verify if a PostgreSQL binary (progname) is available in the same directory as pg_createsubscriber. But this is not present in the prerequisites section of the documentation. 4) Here we mention that the target server should be stopped, but the same is not mentioned in the prerequisites: + Here is an example of using pg_createsubscriber. + Before running the command, please make sure target server is stopped. + +$ pg_ctl -D /usr/local/pgsql/data stop + + 5) If there is an error during any of the pg_createsubscriber operations, like if create subscription fails, it might not be possible to roll back to the earlier state which had physical-standby replication. I felt we should document this and also add it to the console message, as we do in the case of pg_upgrade. Regards, Vignesh
Re: A new message seems missing a punctuation
On Tue, Feb 20, 2024 at 4:50 PM Robert Haas wrote: > > On Tue, Feb 20, 2024 at 4:42 PM Amit Kapila wrote: > > > So why do we log a message about this? > > > > This was added after the main commit of this functionality to find > > some BF failures (where we were expecting the slot to sync but due to > > one of these conditions not being met the slot was not synced) and we > > can probably change it to DEBUG1 as well. I think we would need this > > information w.r.t this functionality to gather more information in > > case expected slots are not being synced and it may be helpful for > > users to also know why the slots are not synced, if that happens. > > Ah, OK. Do you think we need any kind of system view to provide more > insight here or is a log message sufficient? > We do expose the required information (restart_lsn, catalog_xmin, synced, temporary, etc) via pg_replication_slots. So, I feel the LOG message here is sufficient to DEBUG (or know the details) when the slot sync doesn't succeed. -- With Regards, Amit Kapila.
Integer underflow in fprintf in dsa.c
Hello hackers, Using Svace* I think I've found a little bug in src/backend/utils/mmgr/dsa.c. This bug is present in REL_12_STABLE, REL_13_STABLE, REL_14_STABLE, REL_15_STABLE, REL_16_STABLE and master. I see that it was introduced together with dynamic shared memory areas in commit 13df76a537cca3b8884911d8fdf7c89a457a8dd3. I also see that at least two people have encountered this fprintf output. (https://postgrespro.com/list/thread-id/2419512, https://www.postgresql.org/message-id/15e9501170d.e4b5a3858707.3339083113985275726%40zohocorp.com)

fprintf(stderr,
        "segment bin %zu (at least %d contiguous pages free):\n",
        i, 1 << (i - 1));

In case i equals zero the user will get "at least -2147483648 contiguous pages free". I believe that this is a mistake, and fprintf should print "at least 0 contiguous pages free" in case i equals zero. The patch with a fix is attached.

* - https://svace.pages.ispras.ru/svace-website/en/

Kind regards, Ian Ilyasov. Junior Software Developer at Postgres Professional

Subject: [PATCH] Integer underflow fix in fprintf in dsa.c.
---
diff --git a/src/backend/utils/mmgr/dsa.c b/src/backend/utils/mmgr/dsa.c
--- a/src/backend/utils/mmgr/dsa.c
+++ b/src/backend/utils/mmgr/dsa.c
@@ -1107,7 +1107,7 @@
 	fprintf(stderr,
 			"segment bin %zu (at least %d contiguous pages free):\n",
-			i, 1 << (i - 1));
+			i, i != 0 ? 1 << (i - 1) : 0);
 	segment_index = area->control->segment_bins[i];
 	while (segment_index != DSA_SEGMENT_INDEX_NONE)
 	{
Re: Replace current implementations in crypt() and gen_salt() to OpenSSL
On Tue, Feb 20, 2024 at 4:49 PM Peter Eisentraut wrote: > I think there are several less weird ways to address this: > > * Just document it. > > * Make a pgcrypto-level GUC setting. > > * Split out these functions into a separate extension. > > * Deprecate these functions. > > Or some combination of these. I don't think the first two of these proposals help anything. AIUI, FIPS mode is supposed to be a system wide toggle that affects everything on the machine. The third one might help if you can be compliant by just choosing not to install that extension, and the fourth one solves the problem by sledgehammer. Does Linux provide some way of asking whether "fips=1" was specified at kernel boot time? -- Robert Haas EDB: http://www.enterprisedb.com
Re: A new message seems missing a punctuation
On Tue, Feb 20, 2024 at 4:42 PM Amit Kapila wrote: > > So why do we log a message about this? > > This was added after the main commit of this functionality to find > some BF failures (where we were expecting the slot to sync but due to > one of these conditions not being met the slot was not synced) and we > can probably change it to DEBUG1 as well. I think we would need this > information w.r.t this functionality to gather more information in > case expected slots are not being synced and it may be helpful for > users to also know why the slots are not synced, if that happens. Ah, OK. Do you think we need any kind of system view to provide more insight here or is a log message sufficient? -- Robert Haas EDB: http://www.enterprisedb.com
Re: Replace current implementations in crypt() and gen_salt() to OpenSSL
On 20.02.24 11:09, Daniel Gustafsson wrote: On 20 Feb 2024, at 10:56, Koshi Shibagaki (Fujitsu) wrote: Let me confirm the discussion in threads. I think there are two topics. 1. prohibit the use of ciphers disallowed in FIPS mode at the level of block cipher (crypt-bf, etc...) in crypt() and gen_salt() That level might be overkill given that any cipher not in the FIPS certified module mustn't be used, but it's also not the wrong place to put it IMHO. I think we are going about this the wrong way. It doesn't make sense to ask OpenSSL what a piece of code that doesn't use OpenSSL should do. (And would that even give a sensible answer? Like, you can configure OpenSSL to load the fips module, but you can also load the legacy module alongside it(??).) And as you say, even if this code supported modern block ciphers, it wouldn't be FIPS compliant. I think there are several less weird ways to address this: * Just document it. * Make a pgcrypto-level GUC setting. * Split out these functions into a separate extension. * Deprecate these functions. Or some combination of these.
Re: Change the bool member of the Query structure to bits
On 2/20/24 11:11, Quan Zongliang wrote: > > Sorry. I forgot to save a file. This is the latest. > > On 2024/2/20 18:07, Quan Zongliang wrote: >> >> The Query structure has an increasing number of bool attributes. This >> is likely to increase in the future. And they have the same >> properties. Wouldn't it be better to store them in bits? Common >> statements don't use them, so they have little impact. This also saves >> memory space. >> Hi, Are we really adding bools to Query that often? A bit of git-blame says it's usually multiple years to add a single new flag, which is what I'd expect. I doubt that'll change. As for the memory savings, can you quantify how much memory this would save? I highly doubt that's actually true (or at least measurable). The Query struct has ~256B, the patch cuts that to ~232B. But we allocate stuff in power-of-2, so we'll allocate 256B chunk anyway. And we allocate very few of those objects anyway ... regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: POC, WIP: OR-clause support for indexes
On Tue, Feb 20, 2024 at 00:18, Andrei Lepikhov <a.lepik...@postgrespro.ru> wrote: > On 19/2/2024 19:53, Ranier Vilela wrote: > > v17-0002 > > 1) move the vars *arrayconst and *dest, to after if, to avoid makeNode > > (palloc). > > + Const *arrayconst; > > + ScalarArrayOpExpr *dest; > > + > > + pd = (PredicatesData *) lfirst(lc); > > + if (pd->elems == NIL) > > + /* The index doesn't participate in this operation */ > > + continue; > > > > + arrayconst = lsecond_node(Const, saop->args); > > + dest = makeNode(ScalarArrayOpExpr); > Thanks for the review! > I'm not sure I understand you clearly. Does the patch in attachment fix > the issue you raised? > Sorry if I wasn't clear. What I meant is to move the initializations of the variables *arrayconst* and *dest* to after the test (if (pd->elems == NIL)), to avoid unnecessary initialization if the test fails. best regards, Ranier Vilela
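In other words, the suggested shape keeps the declarations but defers the assignments until after the early continue, roughly as below. This is a fragment assembled from the quoted patch lines, using PostgreSQL internals, so it is a sketch rather than self-contained code:

```
foreach(lc, predicates)
{
    Const              *arrayconst;
    ScalarArrayOpExpr  *dest;
    PredicatesData     *pd = (PredicatesData *) lfirst(lc);

    if (pd->elems == NIL)
        continue;   /* index doesn't participate; nothing was allocated */

    /* only reached when needed, so makeNode()'s palloc isn't wasted */
    arrayconst = lsecond_node(Const, saop->args);
    dest = makeNode(ScalarArrayOpExpr);
    ...
}
```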
Re: Streaming read-ready sequential scan code
On Tue, Feb 20, 2024 at 4:35 AM Melanie Plageman wrote: > I've written three alternative implementations of the actual streaming > read user for sequential scan which handle the question of where to > allocate the streaming read object and how to handle changing scan > direction in different ways. It's weird to me that the prospect of changing the scan direction causes such complexity. I mean, why doesn't a streaming read object have a forget_all_my_previous_requests() method or somesuch? -- Robert Haas EDB: http://www.enterprisedb.com