Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

2023-11-18 Thread Andrey M. Borodin



> On 17 Nov 2023, at 16:11, Dilip Kumar  wrote:
> 
> On Fri, Nov 17, 2023 at 1:09 PM Dilip Kumar  wrote:
>> 
>> On Thu, Nov 16, 2023 at 3:11 PM Alvaro Herrera  
>> wrote:
> 
> PFA, updated patch version, this fixes the comment given by Alvaro and
> also improves some of the comments.

I’ve skimmed through the patch set. Here are some minor notes.

1. The loops “for (slotno = bankstart; slotno < bankend; slotno++)” in 
SlruSelectLRUPage() and SimpleLruReadPage_ReadOnly() now have identical 
comments. I think a little copy-paste is OK.
But SimpleLruReadPage_ReadOnly() does pgstat_count_slru_page_hit(), while 
SlruSelectLRUPage() does not. This is not related to the patch set, just 
nearby code.

2. Do we really want all of these functions doing the same thing?
extern bool check_multixact_offsets_buffers(int *newval, void **extra, GucSource source);
extern bool check_multixact_members_buffers(int *newval, void **extra, GucSource source);
extern bool check_subtrans_buffers(int *newval, void **extra, GucSource source);
extern bool check_notify_buffers(int *newval, void **extra, GucSource source);
extern bool check_serial_buffers(int *newval, void **extra, GucSource source);
extern bool check_xact_buffers(int *newval, void **extra, GucSource source);
extern bool check_commit_ts_buffers(int *newval, void **extra, GucSource source);
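
For illustration, the shared logic could presumably live in one helper that
each hook delegates to (the helper name and the multiple-of-16 rule below are
assumptions for the sketch, not taken from the patch):

static bool
check_slru_buffers(const char *name, int *newval)
{
    /* assumed rule: the value must be a multiple of the SLRU bank size */
    if (*newval % 16 == 0)
        return true;

    GUC_check_errdetail("\"%s\" must be a multiple of 16.", name);
    return false;
}

bool
check_multixact_offsets_buffers(int *newval, void **extra, GucSource source)
{
    return check_slru_buffers("multixact_offsets_buffers", newval);
}

The remaining hooks would then be one-line wrappers in the same style.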

3. The name SimpleLruGetSLRUBankLock() contains the meaning of SLRU twice. I’d 
suggest truncating the prefix or the infix.

I do not have a hard opinion on any of these items.


Best regards, Andrey Borodin.



Re: SQL:2011 application time

2023-11-18 Thread Paul A Jungwirth
On Mon, Nov 6, 2023 at 11:07 PM jian he  wrote:
> + 
> +  In a temporal foreign key, the delete/update will use
> +  FOR PORTION OF semantics to constrain the
> +  effect to the bounds being deleted/updated in the referenced row.
> + 
>
> The first "para" should be  ?

Thanks, fixed (in v18)!

> There are many warnings after #define WRITE_READ_PARSE_PLAN_TREES
> see: http://cfbot.cputube.org/highlights/all.html#4308
> Does that mean our new change in gram.y is somehow wrong?

Fixed (in read+out node funcs).

> sgml/html/sql-update.html:
> "range_or_period_name
> The range column or period to use when performing a temporal update.
> This must match the range or period used in the table's temporal
> primary key."
>
> Is the second sentence unnecessary? since no primary can still do "for
> portion of update".

You're right, this dates back to an older version of the patch. Removed.

> sgml/html/sql-update.html:
> "start_time
> The earliest time (inclusive) to change in a temporal update. This
> must be a value matching the base type of the range or period from
> range_or_period_name. It may also be the special value MINVALUE to
> indicate an update whose beginning is unbounded."
>
> probably something like the following:
> "lower_bound"
> The lower bound (inclusive) to change in an overlap update. This must
> be a value matching the base type of the range or period from
> range_or_period_name. It may also be the special value UNBOUNDED to
> indicate an update whose beginning is unbounded."
>
> Obviously the "start_time" reference also needs to change, and the
> sql-delete.html reference also needs to change.

See below re UNBOUNDED

> UPDATE for_portion_of_test FOR PORTION OF valid_at  FROM NULL TO
> "unbounded" SET name = 'NULL to NULL';
> should fail, but not. double quoted unbounded is a column reference, I assume.
>
> That's why I am confused with the function transformForPortionOfBound.
> "if (nodeTag(n) == T_ColumnRef)" part.

You're right, using a ColumnDef was probably not good here, and
treating `"UNBOUNDED"` (with quotes from the user) as a keyword is no
good. I couldn't find a way to make this work without reduce/reduce
conflicts, so I just took it out. It was syntactic sugar for `FROM/TO
NULL` and not part of the standard, so it's not too important. Also I
see that UNBOUNDED causes difficult problems already with window
functions (comments in gram.y). I hope I can find a way to make this
work eventually, but it can go for now.

> in create_table.sgml. you also need to add  WITHOUT OVERLAPS related
> info into 

You're right, fixed (though Peter's patch then changed this same spot).

Thanks,

-- 
Paul  ~{:-)
p...@illuminatedcomputing.com




Re: SQL:2011 application time

2023-11-18 Thread Paul A Jungwirth
Thank you for continuing to review this submission! My changes are in
the v18 patch I sent a few days ago. Details below.

On Sun, Oct 29, 2023 at 5:01 PM jian he  wrote:
> * The attached patch makes foreign keys with PERIOD fail if any of the
> foreign key columns is "generated columns".

I don't see anything like that included in your attachment. I do see
the restriction on `ON DELETE SET NULL/DEFAULT (columnlist)`, which I
included. But you are referring to something else I take it? Why do
you think FKs should fail if the referred column is GENERATED? Is that
a restriction you think should apply to all FKs or only temporal ones?

> * The following queries will cause segmentation fault. not sure the
> best way to fix it.
> . . .
> CONSTRAINT temporal_fk_rng2rng_fk FOREIGN KEY (parent_id, PERIOD valid_at)
> REFERENCES temporal3 (id, valid_at)
> );

Fixed, with additional tests re PERIOD on one side but not the other.

> * change the function FindFKComparisonOperators's "eqstrategy"  to
> make pg_constraint record correct {conpfeqop,conppeqop,conffeqop}.

This change is incorrect because it causes foreign keys to fail when
created with btree_gist. See my reply to Peter for more about that. My
v18 patch also includes some new (very simple) tests in the btree_gist
extension so it's easier to see whether temporal PKs & FKs work there.

> * fix the ON DELETE SET NULL/DEFAULT (columnlist). Now the following
> queries error will be more consistent.
> ALTER TABLE temporal_fk_rng2rng DROP CONSTRAINT temporal_fk_rng2rng_fk,
> ADD CONSTRAINT temporal_fk_rng2rng_fk
> FOREIGN KEY (parent_id, PERIOD valid_at) REFERENCES temporal_rng
> ON DELETE SET DEFAULT(valid_at);
> --ON DELETE SET NULL(valid_at);

Okay, thanks!

> * refactor restrict_cascading_range function.

It looks like your attachment only renames the column, but I think
"restrict" is more expressive and accurate than "get", so I'd like to
keep the original name here.

> * you did if (numfks != numpks) before if (is_temporal) {numfks +=
> 1;}, So I changed the code order to make the error report more
> consistent.

Since we do numfks +=1 and numpks +=1, I don't see any inconsistency
here. Also you are making things now happen before a permissions
check, which may be important (I'm not sure). Can you explain what
improvement is intended here? Your changes don't seem to cause any
changes in the tests, so what is the goal? Perhaps I'm
misunderstanding what you mean by "more consistent."

Thanks! I'll reply to your Nov 6 email separately.

Yours,

--
Paul  ~{:-)
p...@illuminatedcomputing.com




Re: pg_upgrade and logical replication

2023-11-18 Thread vignesh C
On Thu, 16 Nov 2023 at 18:25, Hayato Kuroda (Fujitsu)
 wrote:
>
> Dear Vignesh,
>
> Thanks for updating the patch! Here are some comments.
> They are mainly cosmetic because I have not read yours these days.
>
> 01. binary_upgrade_add_sub_rel_state()
>
> ```
> +/* We must check these things before dereferencing the arguments */
> +if (PG_ARGISNULL(0) || PG_ARGISNULL(1) || PG_ARGISNULL(2))
> +elog(ERROR, "null argument to binary_upgrade_add_sub_rel_state is 
> not allowed")
> ```
>
> But fourth argument can be NULL, right? I know you copied from other 
> functions,
> but they do not accept for all arguments. One approach is that pg_dump 
> explicitly
> writes InvalidXLogRecPtr as the fourth argument.

I did not find any problem with this approach: if the lsn is valid,
as in the ready state, we will send a valid lsn; if the lsn is not
valid, as in the init state, we will pass NULL. This approach was also
suggested at [1].

> 02. binary_upgrade_add_sub_rel_state()
>
> ```
> +if (!OidIsValid(relid))
> +ereport(ERROR,
> +errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> +errmsg("invalid relation identifier used: %u", relid));
> +
> +tup = SearchSysCache1(RELOID, ObjectIdGetDatum(relid));
> +if (!HeapTupleIsValid(tup))
> +ereport(ERROR,
> +errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> +errmsg("relation %u does not exist", relid))
> ```
>
> I'm not sure they should be ereport(). Isn't it that they will be never 
> occurred?
> Other upgrade funcs do not have ereport(), and I think it does not have to be
> translated.

I have removed the first check and retained the second one for a sanity check.

> 03. binary_upgrade_replorigin_advance()
>
> IIUC this function is very similar to pg_replication_origin_advance(). Can we
> extract a common part of them? I think pg_replication_origin_advance() will be
> just a wrapper, and binary_upgrade_replorigin_advance() will get the name of
> origin and pass to it.

That would save barely 4 lines, so I felt the existing code is better.

> 04. binary_upgrade_replorigin_advance()
>
> Even if you do not accept 03, some variable name can be follow the function.

Modified

> 05. getSubscriptions()
>
> ```
> +appendPQExpBufferStr(query, "o.remote_lsn AS suboriginremotelsn\n")
> ```
>
> Hmm, this value is taken anyway, but will be dumped only when the cluster is 
> PG17+.
> Should we avoid getting the value like subrunasowner and subpasswordrequired?
> Not sure...

Modified

> 06. dumpSubscriptionTable()
>
> Can we assert that remote version is PG17+?

Modified

> 07. check_for_subscription_state()
>
> IIUC, this function is used only for old cluster. Should we follow
> check_old_cluster_for_valid_slots()?

Modified

> 08. check_for_subscription_state()
>
> ```
> +fprintf(script, "database:%s subscription:%s schema:%s 
> relation:%s state:%s not in required state\n",
> +active_db->db_name,
> +PQgetvalue(res, i, 0),
> +PQgetvalue(res, i, 1),
> +PQgetvalue(res, i, 2),
> +PQgetvalue(res, i, 3));
> ```
>
> IIRC, format strings should be double-quoted.

Modified

> 09. check_new_cluster_logical_replication_slots()
>
> Checks for replication origin were added in 
> check_new_cluster_logical_replication_slots(),
> but I felt it became a super function. Can we divide it?

Modified

> 10. check_new_cluster_logical_replication_slots()
>
> Even if you reject above, it should be renamed.

Since the previous comment is handled, this one no longer applies.

> 11. pg_upgrade.h
>
> ```
> +int subscription_count; /* number of subscriptions */
> ```
>
> Based on other struct, it should be "nsubscriptions".

Modified

> 12. 004_subscription.pl
>
> ```
> +use File::Path qw(rmtree);
> ```
>
> I think this is not used.

Modified

> 13. 004_subscription.pl
>
> ```
> +my $bindir = $new_sub->config_data('--bindir');
> ```
> For extensibility, it might be better to separate for old/new bindir.

Modified

> 14. 004_subscription.pl
>
> ```
> +my $synced_query =
> +  "SELECT count(1) = 1 FROM pg_subscription_rel WHERE srsubstate = 'r'";
> +$old_sub->poll_query_until('postgres', $synced_query)
> +  or die "Timed out while waiting for subscriber to synchronize data";
> ```
>
> Actually, I'm not sure it is really needed. wait_for_subscription_sync() in 
> line 163
> ensures that sync are done? Are there any holes around here?

wait_for_subscription_sync will check if the table is in syncdone or
ready state; since we are allowing the syncdone state, I have removed
this part.

> 15. 004_subscription.pl
>
> ```
> +# Check the number of rows for each table on each server
> +my $result =
> +  $publisher->safe_psql('postgres', "SELECT count(*) FROM tab_upgraded");
> +is($result, qq(50), "check initial tab_upgraded table data on publisher");
> +$result =
> +  $publisher->safe_psql('postgres', "SELECT count(*) FROM 

Re: pg_upgrade and logical replication

2023-11-18 Thread vignesh C
On Thu, 16 Nov 2023 at 07:45, Peter Smith  wrote:
>
> Here are some review comments for patch v14-0001
>
> ==
> src/backend/utils/adt/pg_upgrade_support.c
>
> 1. binary_upgrade_replorigin_advance
>
> + /* lock to prevent the replication origin from vanishing */
> + LockRelationOid(ReplicationOriginRelationId, RowExclusiveLock);
> + originid = replorigin_by_name(originname, false);
>
> Use uppercase for the lock comment.

Modified

> ==
> src/bin/pg_upgrade/check.c
>
> 2. check_for_subscription_state
>
> > > + prep_status("Checking for subscription state");
> > > +
> > > + snprintf(output_path, sizeof(output_path), "%s/%s",
> > > + log_opts.basedir,
> > > + "subscription_state.txt");
> > >
> > > I felt this filename ought to be more like
> > > 'subscriptions_with_bad_state.txt' because the current name looks like
> > > a normal logfile with nothing to indicate that it is only for the
> > > states of the "bad" subscriptions.
> >
> > I  have kept the file name intentionally shorted as we noticed that
> > when the upgrade of the publisher patch used a longer name there were
> > some buildfarm failures because of longer names.
>
> OK, but how about some other short meaningful name like 'subs_invalid.txt'?
>
> I also thought "state" in the original name was misleading because
> this file contains not only subscriptions with bad 'state' but also
> subscriptions with missing 'origin'.

Modified

> ~~~
>
> 3. check_new_cluster_logical_replication_slots
>
>   int nslots_on_old;
>   int nslots_on_new;
> + int nsubs_on_old = old_cluster.subscription_count;
>
> I felt it might be better to make both these quantities 'unsigned' to
> make it more obvious that there are no special meanings for negative
> numbers.

I have kept int, since everything else (e.g. the logical slots code)
also uses int. I tried making the change, but it made the code
inconsistent, so I used int as is used elsewhere.


> ~~~
>
> 4. check_new_cluster_logical_replication_slots
>
> nslots_on_old = count_old_cluster_logical_slots();
>
> ~
>
> IMO the 'nsubs_on_old' should be coded the same as above. AFAICT, this
> is the only code where you are interested in the number of
> subscribers, and furthermore, it seems you only care about that count
> in the *old* cluster. This means the current implementation of
> get_subscription_count() seems more generic than it needs to be and
> that results in more unnecessary patch code. (I will repeat this same
> review comment in the other relevant places).
>
> SUGGESTION
> nslots_on_old = count_old_cluster_logical_slots();
> nsubs_on_old = count_old_cluster_subscriptions();

Modified to keep it similar to logical slot implementation.

> ~~~
>
> 5.
> + /*
> + * Quick return if there are no logical slots and subscriptions to be
> + * migrated.
> + */
> + if (nslots_on_old == 0 && nsubs_on_old == 0)
>   return;
>
> /and subscriptions/and no subscriptions/

Modified

> ~~~
>
> 6.
> - if (nslots_on_old > max_replication_slots)
> + if (nslots_on_old && nslots_on_old > max_replication_slots)
>   pg_fatal("max_replication_slots (%d) must be greater than or equal
> to the number of "
>   "logical replication slots (%d) on the old cluster",
>   max_replication_slots, nslots_on_old);
>
> Neither nslots_on_old nor max_replication_slots can be < 0, so I don't
> see why the additional check is needed here.
> AFAICT "if (nslots_on_old > max_replication_slots)" acheives the same
> thing that you want.

This part of code is changed now

> ~~~
>
> 7.
> + if (nsubs_on_old && nsubs_on_old > max_replication_slots)
> + pg_fatal("max_replication_slots (%d) must be greater than or equal
> to the number of "
> + "subscriptions (%d) on the old cluster",
> + max_replication_slots, nsubs_on_old);
>
> Neither nsubs_on_old nor max_replication_slots can be < 0, so I don't
> see why the additional check is needed here.
> AFAICT "if (nsubs_on_old > max_replication_slots)" achieves the same
> thing that you want.

This part of code is changed now

> ==
> src/bin/pg_upgrade/info.c
>
> 8. get_db_rel_and_slot_infos
>
> + if (cluster == &old_cluster)
> + get_subscription_count(cluster);
> +
>
> I felt this is unnecessary because you only want to know the
> nsubs_on_old in one place and then only for the old cluster, so
> calling this to set a generic attribute for the cluster is overkill.

We need to do this here because by the time we validate the new
cluster, the old cluster will not be running. I have made the flow
similar to the logical slots handling now.

> ~~~
>
> 9.
> +/*
> + * Get the number of subscriptions in the old cluster.
> + */
> +static void
> +get_subscription_count(ClusterInfo *cluster)
> +{
> + PGconn*conn;
> + PGresult   *res;
> +
> + if (GET_MAJOR_VERSION(cluster->major_version) < 1700)
> + return;
> +
> + conn = connectToServer(cluster, "template1");
> + res = executeQueryOrDie(conn,
> +   "SELECT oid FROM pg_catalog.pg_subscription");
> +
> + cluster->subscription_count = PQntuples(res);
> +
> + PQclear(res);
> + 

Re: long-standing data loss bug in initial sync of logical replication

2023-11-18 Thread Andres Freund
On 2023-11-19 02:15:33 +0100, Tomas Vondra wrote:
> 
> 
> On 11/18/23 22:05, Andres Freund wrote:
> > Hi,
> > 
> > On 2023-11-18 21:45:35 +0100, Tomas Vondra wrote:
> >> On 11/18/23 19:12, Andres Freund wrote:
>  If we increase the locks from ShareUpdateExclusive to ShareRowExclusive,
>  we're making it conflict with RowExclusive. Which is just DML, and I
>  think we need to do that.
> >>>
> >>> From what I can tell it needs to be an AccessExclusiveLock. Completely
> >>> independent of logical decoding. The way the cache stays coherent is 
> >>> catalog
> >>> modifications conflicting with anything that builds cache entries. We 
> >>> have a
> >>> few cases where we do use lower level locks, but for those we have 
> >>> explicit
> >>> analysis for why that's ok (see e.g. reloptions.c) or we block until 
> >>> nobody
> >>> could have an old view of the catalog (various CONCURRENTLY) operations.
> >>>
> >>
> >> Yeah, I got too focused on the issue I triggered, which seems to be
> >> fixed by using SRE (still don't understand why ...). But you're probably
> >> right there may be other cases where SRE would not be sufficient, I
> >> certainly can't prove it'd be safe.
> > 
> > I think it makes sense here: SRE prevents the problematic "scheduling" in 
> > your
> > test - with SRE no DML started before ALTER PUB ... ADD can commit after.
> > 
> 
> If I understand correctly, with the current code (which only gets
> ShareUpdateExclusiveLock), we may end up in a situation like this
> (sessions A and B):
> 
>   A: starts "ALTER PUBLICATION p ADD TABLE t" and gets the SUE lock
>   A: writes the invalidation message(s) into WAL
>   B: inserts into table "t"
>   B: commit
>   A: commit

I don't think this is the problematic sequence - at least it's not what I had
reproed in
https://postgr.es/m/20231118025445.crhaeeuvoe2g5dv6%40awork3.anarazel.de

Adding line numbers:

1) S1: CREATE TABLE d(data text not null);
2) S1: INSERT INTO d VALUES('d1');
3) S2: BEGIN; INSERT INTO d VALUES('d2');
4) S1: ALTER PUBLICATION pb ADD TABLE d;
5) S2: COMMIT
6) S2: INSERT INTO d VALUES('d3');
7) S1: INSERT INTO d VALUES('d4');
8) RL: 

The problem with the sequence is that the insert from 3) is decoded *after* 4)
and that to decode the insert (which happened before the ALTER) the catalog
snapshot and cache state is from *before* the ALTER TABLE. Because the
transaction started in 3) doesn't actually modify any catalogs, no
invalidations are executed after decoding it. The result is that the cache
looks like it did at 3), not like after 4). Undesirable timetravel...

It's worth noting that here the cache state is briefly correct, after 4), it's
just that after 5) it stays the old state.

If 4) instead uses a SRE lock, then S1 will be blocked until S2 commits, and
everything is fine.



> > I'm not sure there are any cases where using SRE instead of AE would cause
> > problems for logical decoding, but it seems very hard to prove. I'd be very
> > surprised if just using SRE would not lead to corrupted cache contents in 
> > some
> > situations. The cases where a lower lock level is ok are ones where we just
> > don't care that the cache is coherent in that moment.

> Are you saying it might break cases that are not corrupted now? How
> could obtaining a stronger lock have such effect?

No, I mean that I don't know if using SRE instead of AE would have negative
consequences for logical decoding. I.e. whether, from a logical decoding POV,
it'd suffice to increase the lock level to just SRE instead of AE.

Since I don't see how it'd be correct otherwise, it's kind of a moot question.


> > In a way, the logical decoding cache-invalidation situation is a lot more
> > atomic than the "normal" situation. During normal operation locking is
> > strictly required to prevent incoherent states when building a cache entry
> > after a transaction committed, but before the sinval entries have been
> > queued. But in the logical decoding case that window doesn't exist.
> > 
> Because we apply the invalidations at commit time, so it happens as a
> single operation that can't interleave with other sessions?

Yea, the situation is much simpler during logical decoding than "originally" -
there's no concurrency.

Greetings,

Andres Freund




Re: pg_upgrade and logical replication

2023-11-18 Thread vignesh C
On Sun, 19 Nov 2023 at 06:52, vignesh C  wrote:
>
> On Fri, 10 Nov 2023 at 19:26, vignesh C  wrote:
> >
> > On Thu, 9 Nov 2023 at 12:23, Michael Paquier  wrote:
> > >
> >
> > > Note: actually, this would be OK if we are able to keep the OIDs of
> > > the subscribers consistent across upgrades?  I'm OK to not do nothing
> > > about that in this patch, to keep it simpler.  Just asking in passing.
> >
> > I will analyze more on this and post the analysis in the subsequent mail.
>
> I analyzed further and felt that retaining the subscription oid would be
> cleaner, as subscription/subscription_rel/replication_origin/replication_origin_status
> would all use the same oid as before, and it would probably also help in
> supporting the upgrade of subscriptions in more scenarios later.
> Here is a patch to handle the same.

Sorry I had attached the older patch, here is the correct updated one.

Regards,
Vignesh
From 94b1ca337498f2b2e5368c3f7179ba63dd75954a Mon Sep 17 00:00:00 2001
From: Vignesh C 
Date: Sun, 19 Nov 2023 06:53:59 +0530
Subject: [PATCH v1] Retain the subscription oids during upgrade.

Retain the subscription oids during upgrade.
---
 src/backend/commands/subscriptioncmds.c| 22 --
 src/backend/utils/adt/pg_upgrade_support.c | 10 ++
 src/bin/pg_dump/pg_dump.c  |  8 
 src/include/catalog/binary_upgrade.h   |  1 +
 src/include/catalog/pg_proc.dat|  4 
 5 files changed, 43 insertions(+), 2 deletions(-)

diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index edc82c11be..1c7bb4b7cd 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -75,6 +75,9 @@
 /* check if the 'val' has 'bits' set */
 #define IsSet(val, bits)  (((val) & (bits)) == (bits))
 
+/* Potentially set by pg_upgrade_support functions */
+Oid			binary_upgrade_next_pg_subscription_oid = InvalidOid;
+
 /*
  * Structure to hold a bitmap representing the user-provided CREATE/ALTER
  * SUBSCRIPTION command options and the parsed/default values of each of them.
@@ -679,8 +682,23 @@ CreateSubscription(ParseState *pstate, CreateSubscriptionStmt *stmt,
 	memset(values, 0, sizeof(values));
 	memset(nulls, false, sizeof(nulls));
 
-	subid = GetNewOidWithIndex(rel, SubscriptionObjectIndexId,
-			   Anum_pg_subscription_oid);
+	/* Use binary-upgrade override for pg_subscription.oid? */
+	if (IsBinaryUpgrade)
+	{
+		if (!OidIsValid(binary_upgrade_next_pg_subscription_oid))
+			ereport(ERROR,
+	(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+	 errmsg("pg_subscription OID value not set when in binary upgrade mode")));
+
+		subid = binary_upgrade_next_pg_subscription_oid;
+		binary_upgrade_next_pg_subscription_oid = InvalidOid;
+	}
+	else
+	{
+		subid = GetNewOidWithIndex(rel, SubscriptionObjectIndexId,
+   Anum_pg_subscription_oid);
+	}
+
 	values[Anum_pg_subscription_oid - 1] = ObjectIdGetDatum(subid);
 	values[Anum_pg_subscription_subdbid - 1] = ObjectIdGetDatum(MyDatabaseId);
 	values[Anum_pg_subscription_subskiplsn - 1] = LSNGetDatum(InvalidXLogRecPtr);
diff --git a/src/backend/utils/adt/pg_upgrade_support.c b/src/backend/utils/adt/pg_upgrade_support.c
index 2f6fc86c3d..f5be088e6e 100644
--- a/src/backend/utils/adt/pg_upgrade_support.c
+++ b/src/backend/utils/adt/pg_upgrade_support.c
@@ -172,6 +172,16 @@ binary_upgrade_set_next_pg_authid_oid(PG_FUNCTION_ARGS)
 	PG_RETURN_VOID();
 }
 
+Datum
+binary_upgrade_set_next_pg_subscription_oid(PG_FUNCTION_ARGS)
+{
+	Oid			subid = PG_GETARG_OID(0);
+
+	CHECK_IS_BINARY_UPGRADE;
+	binary_upgrade_next_pg_subscription_oid = subid;
+	PG_RETURN_VOID();
+}
+
 Datum
 binary_upgrade_create_empty_extension(PG_FUNCTION_ARGS)
 {
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 34fd0a86e9..f592d7c979 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -4775,6 +4775,14 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	appendPQExpBuffer(delq, "DROP SUBSCRIPTION %s;\n",
 	  qsubname);
 
+	if (dopt->binary_upgrade)
+	{
+		appendPQExpBufferStr(query, "\n-- For binary upgrade, must preserve pg_subscription.oid\n");
+		appendPQExpBuffer(query,
+		  "SELECT pg_catalog.binary_upgrade_set_next_pg_subscription_oid('%u'::pg_catalog.oid);\n\n",
+		  subinfo->dobj.catId.oid);
+	}
+
 	appendPQExpBuffer(query, "CREATE SUBSCRIPTION %s CONNECTION ",
 	  qsubname);
 	appendStringLiteralAH(query, subinfo->subconninfo, fout);
diff --git a/src/include/catalog/binary_upgrade.h b/src/include/catalog/binary_upgrade.h
index 82a9125ba9..dc7b251051 100644
--- a/src/include/catalog/binary_upgrade.h
+++ b/src/include/catalog/binary_upgrade.h
@@ -32,6 +32,7 @@ extern PGDLLIMPORT RelFileNumber binary_upgrade_next_toast_pg_class_relfilenumbe
 
 extern PGDLLIMPORT Oid binary_upgrade_next_pg_enum_oid;
 extern PGDLLIMPORT Oid binary_upgrade_next_pg_authid_oid;
+extern PGDLLIMPORT Oid 

Re: pg_upgrade and logical replication

2023-11-18 Thread vignesh C
On Fri, 10 Nov 2023 at 19:26, vignesh C  wrote:
>
> On Thu, 9 Nov 2023 at 12:23, Michael Paquier  wrote:
> >
>
> > Note: actually, this would be OK if we are able to keep the OIDs of
> > the subscribers consistent across upgrades?  I'm OK to not do nothing
> > about that in this patch, to keep it simpler.  Just asking in passing.
>
> I will analyze more on this and post the analysis in the subsequent mail.

I analyzed further and felt that retaining the subscription oid would be
cleaner, as subscription/subscription_rel/replication_origin/replication_origin_status
would all use the same oid as before, and it would probably also help in
supporting the upgrade of subscriptions in more scenarios later.
Here is a patch to handle the same.

Regards,
Vignesh
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index edc82c11be..1c7bb4b7cd 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -75,6 +75,9 @@
 /* check if the 'val' has 'bits' set */
 #define IsSet(val, bits)  (((val) & (bits)) == (bits))
 
+/* Potentially set by pg_upgrade_support functions */
+Oid			binary_upgrade_next_pg_subscription_oid = InvalidOid;
+
 /*
  * Structure to hold a bitmap representing the user-provided CREATE/ALTER
  * SUBSCRIPTION command options and the parsed/default values of each of them.
@@ -679,8 +682,23 @@ CreateSubscription(ParseState *pstate, CreateSubscriptionStmt *stmt,
 	memset(values, 0, sizeof(values));
 	memset(nulls, false, sizeof(nulls));
 
-	subid = GetNewOidWithIndex(rel, SubscriptionObjectIndexId,
-			   Anum_pg_subscription_oid);
+	/* Use binary-upgrade override for pg_subscription.oid? */
+	if (IsBinaryUpgrade)
+	{
+		if (!OidIsValid(binary_upgrade_next_pg_subscription_oid))
+			ereport(ERROR,
+	(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+	 errmsg("pg_subscription OID value not set when in binary upgrade mode")));
+
+		subid = binary_upgrade_next_pg_subscription_oid;
+		binary_upgrade_next_pg_subscription_oid = InvalidOid;
+	}
+	else
+	{
+		subid = GetNewOidWithIndex(rel, SubscriptionObjectIndexId,
+   Anum_pg_subscription_oid);
+	}
+
 	values[Anum_pg_subscription_oid - 1] = ObjectIdGetDatum(subid);
 	values[Anum_pg_subscription_subdbid - 1] = ObjectIdGetDatum(MyDatabaseId);
 	values[Anum_pg_subscription_subskiplsn - 1] = LSNGetDatum(InvalidXLogRecPtr);
diff --git a/src/backend/utils/adt/pg_upgrade_support.c b/src/backend/utils/adt/pg_upgrade_support.c
index 53cfa72b6f..34c328ea0d 100644
--- a/src/backend/utils/adt/pg_upgrade_support.c
+++ b/src/backend/utils/adt/pg_upgrade_support.c
@@ -59,6 +59,17 @@ binary_upgrade_set_next_pg_type_oid(PG_FUNCTION_ARGS)
 	PG_RETURN_VOID();
 }
 
+Datum
+binary_upgrade_set_next_pg_subscription_oid(PG_FUNCTION_ARGS)
+{
+	Oid			subid = PG_GETARG_OID(0);
+
+	CHECK_IS_BINARY_UPGRADE;
+	binary_upgrade_next_pg_subscription_oid = subid;
+
+	PG_RETURN_VOID();
+}
+
 Datum
 binary_upgrade_set_next_array_pg_type_oid(PG_FUNCTION_ARGS)
 {
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 4528b7cc39..bc309be2a8 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -4954,6 +4954,14 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	appendPQExpBuffer(delq, "DROP SUBSCRIPTION %s;\n",
 	  qsubname);
 
+	if (dopt->binary_upgrade)
+	{
+		appendPQExpBufferStr(query, "\n-- For binary upgrade, must preserve pg_subscription.oid\n");
+		appendPQExpBuffer(query,
+		  "SELECT pg_catalog.binary_upgrade_set_next_pg_subscription_oid('%u'::pg_catalog.oid);\n\n",
+		  subinfo->dobj.catId.oid);
+	}
+
 	appendPQExpBuffer(query, "CREATE SUBSCRIPTION %s CONNECTION ",
 	  qsubname);
 	appendStringLiteralAH(query, subinfo->subconninfo, fout);
diff --git a/src/include/catalog/binary_upgrade.h b/src/include/catalog/binary_upgrade.h
index 82a9125ba9..456e777b8d 100644
--- a/src/include/catalog/binary_upgrade.h
+++ b/src/include/catalog/binary_upgrade.h
@@ -19,6 +19,7 @@
 extern PGDLLIMPORT Oid binary_upgrade_next_pg_tablespace_oid;
 
 extern PGDLLIMPORT Oid binary_upgrade_next_pg_type_oid;
+extern PGDLLIMPORT Oid binary_upgrade_next_pg_subscription_oid;
 extern PGDLLIMPORT Oid binary_upgrade_next_array_pg_type_oid;
 extern PGDLLIMPORT Oid binary_upgrade_next_mrng_pg_type_oid;
 extern PGDLLIMPORT Oid binary_upgrade_next_mrng_array_pg_type_oid;
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 45c681db5e..43cf39acae 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -11328,6 +11328,10 @@
   proname => 'binary_upgrade_set_next_pg_type_oid', provolatile => 'v',
   proparallel => 'r', prorettype => 'void', proargtypes => 'oid',
   prosrc => 'binary_upgrade_set_next_pg_type_oid' },
+{ oid => '8406', descr => 'for use by pg_upgrade',
+  proname => 'binary_upgrade_set_next_pg_subscription_oid', provolatile => 'v',
+  proparallel => 'r', prorettype => 'void', proargtypes => 

Re: Assert failure on 'list_member_ptr(rel->joininfo, restrictinfo)'

2023-11-18 Thread Alexander Korotkov
On Wed, Nov 15, 2023 at 5:07 PM Alexander Korotkov  wrote:
>
> On Wed, Nov 15, 2023 at 8:02 AM Andres Freund  wrote:
> > The kinda because there are callers to bms_(add|del)_members() that pass the
> > same bms as a and b, which only works if the reallocation happens "late".
>
> +1,
> Neat idea. I'm willing to work on this. Will propose the patch soon.


It's here.  The new REALLOCATE_BITMAPSETS option forces bitmapset
reallocation on each modification.  I also found it useful to add an
assert on the argument's NodeTag to all bitmapset functions.  This lets
you catch accesses through dangling pointers earlier.

I had the feeling of falling into a rabbit hole while debugging all
the cases of failure with this new option.  With the second patch the
regression tests pass.

Any thoughts?

--
Regards,
Alexander Korotkov


0001-REALLOCATE_BITMAPSETS-manual-compile-time-option-v1.patch
Description: Binary data


0002-Make-regression-tests-pass-with-REALLOCATE_BITMAP-v1.patch
Description: Binary data


Re: long-standing data loss bug in initial sync of logical replication

2023-11-18 Thread Tomas Vondra



On 11/18/23 22:05, Andres Freund wrote:
> Hi,
> 
> On 2023-11-18 21:45:35 +0100, Tomas Vondra wrote:
>> On 11/18/23 19:12, Andres Freund wrote:
 If we increase the locks from ShareUpdateExclusive to ShareRowExclusive,
 we're making it conflict with RowExclusive. Which is just DML, and I
 think we need to do that.
>>>
>>> From what I can tell it needs to be an AccessExclusiveLock. Completely
>>> independent of logical decoding. The way the cache stays coherent is catalog
>>> modifications conflicting with anything that builds cache entries. We have a
>>> few cases where we do use lower level locks, but for those we have explicit
>>> analysis for why that's ok (see e.g. reloptions.c) or we block until nobody
>>> could have an old view of the catalog (various CONCURRENTLY) operations.
>>>
>>
>> Yeah, I got too focused on the issue I triggered, which seems to be
>> fixed by using SRE (still don't understand why ...). But you're probably
>> right there may be other cases where SRE would not be sufficient, I
>> certainly can't prove it'd be safe.
> 
> I think it makes sense here: SRE prevents the problematic "scheduling" in your
> test - with SRE no DML started before ALTER PUB ... ADD can commit after.
> 

If I understand correctly, with the current code (which only gets
ShareUpdateExclusiveLock), we may end up in a situation like this
(sessions A and B):

  A: starts "ALTER PUBLICATION p ADD TABLE t" and gets the SUE lock
  A: writes the invalidation message(s) into WAL
  B: inserts into table "t"
  B: commit
  A: commit

With the stronger SRE lock, the commits would have to happen in the
opposite order, because as you say it prevents the bad ordering.

But why would this matter for logical decoding? We accumulate the
invalidations and execute them at transaction commit, or did I miss
something?

So what I think should happen is we get to apply B first, which won't
see the table as part of the publication. It might even build the cache
entries (syscache+relsync), reflecting that. But then we get to execute
A, along with all the invalidations, and that should invalidate them.

I'm clearly missing something, because the SRE lock does change the behavior,
so there has to be a difference (though by my reasoning there shouldn't be).

Or maybe it's the other way around? Won't B get the invalidation, but
use a historical snapshot that doesn't yet see the table in publication?

> I'm not sure there are any cases where using SRE instead of AE would cause
> problems for logical decoding, but it seems very hard to prove. I'd be very
> surprised if just using SRE would not lead to corrupted cache contents in some
> situations. The cases where a lower lock level is ok are ones where we just
> don't care that the cache is coherent in that moment.
> 

Are you saying it might break cases that are not corrupted now? How
could obtaining a stronger lock have such effect?

> In a way, the logical decoding cache-invalidation situation is a lot more
> atomic than the "normal" situation. During normal operation locking is
> strictly required to prevent incoherent states when building a cache entry
> after a transaction committed, but before the sinval entries have been
> queued. But in the logical decoding case that window doesn't exist.
> 

Because we apply the invalidations at commit time, so it happens as a
single operation that can't interleave with other sessions?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: reindexing an invalid index should not use ERRCODE_INDEX_CORRUPTED

2023-11-18 Thread Noah Misch
On Sat, Nov 18, 2023 at 03:09:58PM -0800, Andres Freund wrote:
> We currently provide no way to learn about a postgres instance having
> corruption other than searching the logs for corruption events, matching by
> sqlstate on ERRCODE_DATA_CORRUPTED and ERRCODE_INDEX_CORRUPTED.
> 
> Unfortunately, there is a case of such an sqlstate that's not at all 
> indicating
> corruption, namely REINDEX CONCURRENTLY when the index is invalid:
> 
> if (!indexRelation->rd_index->indisvalid)
>     ereport(WARNING,
>             (errcode(ERRCODE_INDEX_CORRUPTED),
>              errmsg("cannot reindex invalid index \"%s.%s\" concurrently, skipping",
>                     get_namespace_name(get_rel_namespace(cellOid)),
>                     get_rel_name(cellOid))));
> 
> The only thing required to get to this is an interrupted CREATE INDEX
> CONCURRENTLY, which I don't think can be fairly characterized as "corruption".
> 
> ISTM something like ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE would be more
> appropriate?

+1, that's a clear improvement.

The "cannot" part of the message is also inaccurate, and it's not clear to me
why we have this specific restriction at all.  REINDEX INDEX CONCURRENTLY
accepts such indexes, so I doubt it's an implementation gap.  Since an INVALID
index often duplicates some valid index, I could see an argument that
reindexing INVALID indexes as part of a table-level REINDEX is wanted less
often than not.  But that argument would be just as pertinent to REINDEX TABLE
(w/o CONCURRENTLY), which doesn't impose this restriction.  Hmmm.




Re: Relation bulk write facility

2023-11-18 Thread Andres Freund
Hi,

On 2023-11-17 11:37:21 +0100, Heikki Linnakangas wrote:
> The new facility makes it easier to optimize bulk loading, as the
> logic for buffering, WAL-logging, and syncing the relation only needs
> to be implemented once. It's also less error-prone: We have had a
> number of bugs in how a relation is fsync'd - or not - at the end of a
> bulk loading operation. By centralizing that logic to one place, we
> only need to write it correctly once.

One thing I'd like to use the centralized handling for is to track such
writes in pg_stat_io. I don't mean as part of the initial patch, just that
that's another reason I like the facility.


> > The new facility is faster for small relations: Instead of calling
> smgrimmedsync(), we register the fsync to happen at next checkpoint,
> which avoids the fsync latency. That can make a big difference if you
> are e.g. restoring a schema-only dump with lots of relations.

I think this would be a huge win for people running their application tests
against postgres.


> + bulkw = bulkw_start_smgr(dst, forkNum, use_wal);
> +
>   nblocks = smgrnblocks(src, forkNum);
>  
>   for (blkno = 0; blkno < nblocks; blkno++)
>   {
> + Pagepage;
> +
>   /* If we got a cancel signal during the copy of the data, quit 
> */
>   CHECK_FOR_INTERRUPTS();
>  
> - smgrread(src, forkNum, blkno, buf.data);
> + page = bulkw_alloc_buf(bulkw);
> + smgrread(src, forkNum, blkno, page);
>  
>   if (!PageIsVerifiedExtended(page, blkno,
>   
> PIV_LOG_WARNING | PIV_REPORT_STAT))
> @@ -511,30 +514,9 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
>* page this is, so we have to log the full page including any 
> unused
>* space.
>*/
> - if (use_wal)
> - log_newpage(&dst->smgr_rlocator.locator, forkNum, blkno, page, false);
> -
> - PageSetChecksumInplace(page, blkno);
> -
> - /*
> -  * Now write the page.  We say skipFsync = true because there's 
> no
> -  * need for smgr to schedule an fsync for this write; we'll do 
> it
> -  * ourselves below.
> -  */
> - smgrextend(dst, forkNum, blkno, buf.data, true);
> + bulkw_write(bulkw, blkno, page, false);

I wonder if bulkw_alloc_buf() is a good name - if you naively read this
change, it looks like it'll just leak memory. It also might be taken to be
valid until freed, which I don't think is the case?


> +/*-
> + *
> + * bulk_write.c
> + * Efficiently and reliably populate a new relation
> + *
> + * The assumption is that no other backends access the relation while we are
> + * loading it, so we can take some shortcuts.  Alternatively, you can use the
> + * buffer manager as usual, if performance is not critical, but you must not
> + * mix operations through the buffer manager and the bulk loading interface 
> at
> + * the same time.

>From "Alternatively" onward this is is somewhat confusing.


> + * We bypass the buffer manager to avoid the locking overhead, and call
> + * smgrextend() directly.  A downside is that the pages will need to be
> + * re-read into shared buffers on first use after the build finishes.  That's
> + * usually a good tradeoff for large relations, and for small relations, the
> + * overhead isn't very significant compared to creating the relation in the
> + * first place.

FWIW, I doubt the "isn't very significant" bit is really true.


> + * One tricky point is that because we bypass the buffer manager, we need to
> + * register the relation for fsyncing at the next checkpoint ourselves, and
> + * make sure that the relation is correctly fsync by us or the checkpointer
> + * even if a checkpoint happens concurrently.

"fsync'ed" or such? Otherwise this reads awkwardly for me.



> + *
> + *
> + * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
> + * Portions Copyright (c) 1994, Regents of the University of California
> + *
> + *
> + * IDENTIFICATION
> + * src/backend/storage/smgr/bulk_write.c
> + *
> + *-
> + */
> +#include "postgres.h"
> +
> +#include "access/xloginsert.h"
> +#include "access/xlogrecord.h"
> +#include "storage/bufmgr.h"
> +#include "storage/bufpage.h"
> +#include "storage/bulk_write.h"
> +#include "storage/proc.h"
> +#include "storage/smgr.h"
> +#include "utils/rel.h"
> +
> +#define MAX_BUFFERED_PAGES XLR_MAX_BLOCK_ID
> +
> +typedef struct BulkWriteBuffer
> +{
> + Pagepage;
> + BlockNumber blkno;
> + boolpage_std;
> + int16   order;
> +} BulkWriteBuffer;
> +

The name makes it sound like this struct itself contains a buffer - but it's

reindexing an invalid index should not use ERRCODE_INDEX_CORRUPTED

2023-11-18 Thread Andres Freund
Hi,

We currently provide no way to learn about a postgres instance having
corruption other than searching the logs for corruption events, matching by
sqlstate on ERRCODE_DATA_CORRUPTED and ERRCODE_INDEX_CORRUPTED.

Unfortunately, there is a case of such an sqlstate that's not at all indicating
corruption, namely REINDEX CONCURRENTLY when the index is invalid:

if (!indexRelation->rd_index->indisvalid)
    ereport(WARNING,
            (errcode(ERRCODE_INDEX_CORRUPTED),
             errmsg("cannot reindex invalid index \"%s.%s\" concurrently, skipping",
                    get_namespace_name(get_rel_namespace(cellOid)),
                    get_rel_name(cellOid))));

The only thing required to get to this is an interrupted CREATE INDEX
CONCURRENTLY, which I don't think can be fairly characterized as "corruption".

ISTM something like ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE would be more
appropriate?

Greetings,

Andres Freund




errcode_for_file_access() maps EROFS to INSUFFICIENT_PRIVILEGE

2023-11-18 Thread Andres Freund
Hi,

On linux, many filesystems default to remounting themselves read-only when
metadata IO fails. I.e. one common reaction to disks failing is a previously
read-write filesystem becoming read-only.

When e.g. trying to create a file on such a filesystem, errno is set to
EROFS. Writing with pre-existing FDs seems to mostly generate EIO.

In errcode_for_file_access(), we map EROFS to
ERRCODE_INSUFFICIENT_PRIVILEGE. An error code that's used very widely for many
other purposes.

Because it is so widely used, just searching for log messages with an
ERRCODE_INSUFFICIENT_PRIVILEGE sqlstate isn't promising; obviously stuff like
  ERROR: permission denied to set parameter \"%s\"
isn't interesting.

Nor is EROFS a question of insufficient privileges - the filesystem is read
only, even root would not be permitted to write.


I think ERRCODE_IO_ERROR would be more appropriate than
ERRCODE_INSUFFICIENT_PRIVILEGE, but not exactly great.
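
Concretely, the remapping would be a small change in errcode_for_file_access();
an abbreviated, illustrative sketch of the relevant switch (not an exact diff):

  switch (edata->saved_errno)
  {
          /* Permission-denied failures */
      case EPERM:
      case EACCES:
          edata->sqlerrcode = ERRCODE_INSUFFICIENT_PRIVILEGE;
          break;

#ifdef EROFS
      case EROFS:             /* read-only file system */
          /* proposed: classify as an I/O-level failure, not a privilege one */
          edata->sqlerrcode = ERRCODE_IO_ERROR;
          break;
#endif

          /* ... other cases unchanged ... */

      default:
          edata->sqlerrcode = ERRCODE_IO_ERROR;
          break;
  }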

The only real downside would be a slightly odd sqlstate for postmaster's
creation of a lock file.  If a tablespace were mounted read-only, IO_ERROR
actually seems fine.

Greetings,

Andres Freund




PANIC serves too many masters

2023-11-18 Thread Andres Freund
Hi,

Right now we use PANIC for very different kinds of errors.

Sometimes for errors that are persistent, where crash-restarting and trying
again won't help:
  ereport(PANIC,
  (errmsg("could not locate a valid checkpoint record")));
or
  ereport(PANIC,
  (errmsg("online backup was canceled, recovery cannot continue")));



Sometimes for errors that could be transient, e.g. when running out of space
while trying to write out WAL:
  ereport(ERROR,
  (errcode_for_file_access(),
   errmsg("could not write to file \"%s\": %m", tmppath)));
(the ERROR is often promoted to a PANIC due to critical sections).
or
ereport(PANIC,
        (errcode_for_file_access(),
         errmsg("could not write to log file \"%s\" at offset %u, length %zu: %m",
                xlogfname, startoffset, nleft)));


Sometimes for "should never happen" checks that are important enough to check
in production builds:
  elog(PANIC, "stuck spinlock detected at %s, %s:%d",
or
  elog(PANIC, "failed to re-find shared proclock object");


I have two main issues with this:

1) If core dumps are allowed, we trigger core dumps for all of these. That's
   good for "should never happen" type of errors like a stuck spinlock. But it
   makes no sense for things like the on-disk state being wrong at startup, or
   running out of space while writing WAL - if anything, it might make that
   worse!

   It's very useful to be able to collect dumps for crashes in production, but
   it's not useful to generate thousands of identical cores because crash
   recovery fails with out-of-space and we retry over and over.


2) For errors where crash-restarting won't fix anything, using PANIC doesn't
   allow postmaster to distinguish between an error that should lead
   postmaster to exit itself (after killing other processes, obviously) and
   the normal crash restart cycle.


I've been trying to do some fleet wide analyses of the causes of crashes, but
having core dumps for lots of stuff that aren't crashes, often repeated many
times, makes that much harder. Filtering out abort()s and just looking at
segfaults filters out far too much.


I don't quite know what we should do. But the current situation decidedly
doesn't seem great.

Maybe we could have:
- PANIC_BUG - triggers abort() followed by a crash restart cycle
  to be used for things like a stuck spinlock
- PANIC_RETRY - causes a crash restart cycle, no core dump
  to be used for things like ENOSPC during WAL writes
- PANIC_EXIT - causes postmaster to exit(1)
  to be used for things where retrying won't help, like
  "requested recovery stop point is before consistent recovery point"


One could argue that some of the PANICs that want to just shut down the server
should instead be FATALs, with knowledge in postmaster about which/when such
errors should trigger exiting. We do have something like this for the startup
process, but only when errors happen "early enough", and without being able to
distinguish between "retryable" and "should exit" type errors.  But ISTM that
that requires adding more and more knowledge to postmaster.c, instead of
leaving it with the code that raises the error.


Greetings,

Andres Freund




Re: Use of backup_label not noted in log

2023-11-18 Thread Andres Freund
Hi,

On 2023-11-18 10:01:42 -0800, Andres Freund wrote:
> > What about adding it to the "redo starts at" message, something like
> >
> >   redo starts at 12/12345678, taken from control file
> >
> > or
> >
> >   redo starts at 12/12345678, taken from backup label
> 
> I think it'd make sense to log use of backup_label earlier than that - the
> locations from backup_label might end up not being available in the archive,
> the primary or locally, and we'll error out with "could not locate a valid
> checkpoint record".
> 
> I'd probably just do it within the if (read_backup_label()) block in
> InitWalRecovery(), *before* the ReadCheckpointRecord().

Not enamored with the phrasing of the log messages, but here's a prototype:

When starting up with backup_label present:
LOG:  starting from base backup with redo LSN A/34100028, checkpoint LSN A/34100080 on timeline ID 1

When restarting before reaching the end of the backup, but after backup_label
has been removed:
LOG:  continuing to start from base backup with redo LSN A/34100028
LOG:  entering standby mode
LOG:  redo starts at A/3954B958

Note that the LSN in the "continuing" case is the one the backup started at,
not where recovery will start.


I've wondered whether it's worth also adding an explicit message just after
ReachedEndOfBackup(), but it seems far less urgent due to the existing
"consistent recovery state reached at %X/%X" message.


We are quite inconsistent about how we spell LSNs. Sometimes with LSN
preceding, sometimes not. Sometimes with (LSN). Etc.


> I do like the idea of expanding the "redo starts at" message
> though. E.g. including minRecoveryLSN, ControlFile->backupStartPoint,
> ControlFile->backupEndPoint would provide information about when the node
> might become consistent.

Playing around with this a bit, I'm wondering if we instead should remove that
message, and emit something more informative earlier on. If there's a problem,
you kinda want the information before we finally get to the loop in
PerformWalRecovery(). If e.g. there's no WAL you'll only get
LOG:  invalid checkpoint record
PANIC:  could not locate a valid checkpoint record

which is delightfully lacking in details.


There also are some other oddities:

If the primary is down when starting up, and we'd need WAL from the primary
for the first record, the "redo start at" message is delayed until that
happens, because we emit the message not before we read the first record, but
after. That's just plain odd.

And sometimes we'll start referencing the LSN at which we are starting
recovery before the "redo starts at" message. If e.g. we shut down
at a restart point, we'll emit

  LOG:  consistent recovery state reached at ...
before
  LOG:  redo starts at ...


But that's all clearly just material for HEAD.

Greetings,

Andres Freund
>From 37e9ebf2167892b0e58df04670288b9b8780bfdc Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Sat, 18 Nov 2023 13:27:17 -0800
Subject: [PATCH v1] wip: Log when starting up from a base backup

Author:
Reviewed-by:
Discussion: https://postgr.es/m/20231117041811.vz4vgkthwjnwp...@awork3.anarazel.de
Backpatch:
---
 src/backend/access/transam/xlogrecovery.c | 26 +++
 1 file changed, 26 insertions(+)

diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index c6156aa..c25831303fa 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -603,6 +603,22 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 		if (StandbyModeRequested)
 			EnableStandbyMode();
 
+		/*
+		 * Omitting backup_label when creating a new replica, PITR node etc.
+		 * unfortunately is a common cause of corruption. Logging that
+		 * backup_label was used makes it a bit easier to exclude that as the
+		 * cause of observed corruption.
+		 *
+		 * Do so before we try to read the checkpoint record (which can fail),
+		 * as otherwise it can be hard to understand why a checkpoint other
+		 * than ControlFile->checkPoint is used.
+		 */
+		ereport(LOG,
+(errmsg("starting from base backup with redo LSN %X/%X, checkpoint LSN %X/%X on timeline ID %d",
+		LSN_FORMAT_ARGS(RedoStartLSN),
+		LSN_FORMAT_ARGS(CheckPointLoc),
+		CheckPointTLI)));
+
 		/*
 		 * When a backup_label file is present, we want to roll forward from
 		 * the checkpoint it identifies, rather than using pg_control.
@@ -742,6 +758,16 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 EnableStandbyMode();
 		}
 
+		/*
+		 * For the same reason as when starting up with backup_label present,
+		 * emit a log message when we continue initializing from a base
+		 * backup.
+		 */
+		if (ControlFile->backupStartPoint != InvalidXLogRecPtr)
+			ereport(LOG,
+	(errmsg("continuing to start from base backup with redo LSN %X/%X",
+			LSN_FORMAT_ARGS(ControlFile->backupStartPoint))));
+
 		/* Get the last valid checkpoint record. */
 		

Re: long-standing data loss bug in initial sync of logical replication

2023-11-18 Thread Andres Freund
Hi,

On 2023-11-18 21:45:35 +0100, Tomas Vondra wrote:
> On 11/18/23 19:12, Andres Freund wrote:
> >> If we increase the locks from ShareUpdateExclusive to ShareRowExclusive,
> >> we're making it conflict with RowExclusive. Which is just DML, and I
> >> think we need to do that.
> > 
> > From what I can tell it needs to be an AccessExclusiveLock. Completely
> > independent of logical decoding. The way the cache stays coherent is catalog
> > modifications conflicting with anything that builds cache entries. We have a
> > few cases where we do use lower level locks, but for those we have explicit
> > analysis for why that's ok (see e.g. reloptions.c) or we block until nobody
> > could have an old view of the catalog (various CONCURRENTLY) operations.
> > 
> 
> Yeah, I got too focused on the issue I triggered, which seems to be
> fixed by using SRE (still don't understand why ...). But you're probably
> right there may be other cases where SRE would not be sufficient, I
> certainly can't prove it'd be safe.

I think it makes sense here: SRE prevents the problematic "scheduling" in your
test - with SRE no DML started before ALTER PUB ... ADD can commit after.

I'm not sure there are any cases where using SRE instead of AE would cause
problems for logical decoding, but it seems very hard to prove. I'd be very
surprised if just using SRE would not lead to corrupted cache contents in some
situations. The cases where a lower lock level is ok are ones where we just
don't care that the cache is coherent in that moment.

In a way, the logical decoding cache-invalidation situation is a lot more
atomic than the "normal" situation. During normal operation locking is
strictly required to prevent incoherent states when building a cache entry
after a transaction committed, but before the sinval entries have been
queued. But in the logical decoding case that window doesn't exist.

Greetings,

Andres Freund




Re: long-standing data loss bug in initial sync of logical replication

2023-11-18 Thread Tomas Vondra
On 11/18/23 19:12, Andres Freund wrote:
> Hi,
> 
> On 2023-11-18 11:56:47 +0100, Tomas Vondra wrote:
>>> I guess it's not really feasible to just increase the lock level here though
>>> :(. The use of ShareUpdateExclusiveLock isn't new, and suddenly using AEL
>>> would perhaps lead to new deadlocks and such? But it also seems quite wrong.
>>>
>>
>> If this really is about the lock being too weak, then I don't see why
>> would it be wrong?
> 
> Sorry, that was badly formulated. The wrong bit is the use of
> ShareUpdateExclusiveLock.
> 

Ah, you meant the current lock mode seems wrong, not that changing the
locks seems wrong. Yeah, true.

> 
>> If it's required for correctness, it's not really wrong, IMO. Sure, stronger
>> locks are not great ...
>>
>> I'm not sure about the risk of deadlocks. If you do
>>
>> ALTER PUBLICATION ... ADD TABLE
>>
>> it's not holding many other locks. It essentially gets a lock just a
>> lock on pg_publication catalog, and then the publication row. That's it.
>>
>> If we increase the locks from ShareUpdateExclusive to ShareRowExclusive,
>> we're making it conflict with RowExclusive. Which is just DML, and I
>> think we need to do that.
> 
> From what I can tell it needs to be an AccessExclusiveLock. Completely
> independent of logical decoding. The way the cache stays coherent is catalog
> modifications conflicting with anything that builds cache entries. We have a
> few cases where we do use lower level locks, but for those we have explicit
> analysis for why that's ok (see e.g. reloptions.c) or we block until nobody
> could have an old view of the catalog (various CONCURRENTLY) operations.
> 

Yeah, I got too focused on the issue I triggered, which seems to be
fixed by using SRE (still don't understand why ...). But you're probably
right there may be other cases where SRE would not be sufficient, I
certainly can't prove it'd be safe.

> 
>> So maybe that's fine? For me, a detected deadlock is better than
>> silently missing some of the data.
> 
> That certainly is true.
> 
> 
>>> We could brute force this in the logical decoding infrastructure, by
>>> distributing invalidations from catalog modifying transactions to all
>>> concurrent in-progress transactions (like already done for historic catalog
>>> snapshot, c.f. SnapBuildDistributeNewCatalogSnapshot()).  But I think that'd
>>> be a fairly significant increase in overhead.
>>>
>>
>> I have no idea what the overhead would be - perhaps not too bad,
>> considering catalog changes are not too common (I'm sure there are
>> extreme cases). And maybe we could even restrict this only to
>> "interesting" catalogs, or something like that? (However I hate those
>> weird differences in behavior, it can easily lead to bugs.)
>>
>> But it feels more like a band-aid than actually fixing the issue.
> 
> Agreed.
> 

... and it would not fix the other places outside logical decoding.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

2023-11-18 Thread Alvaro Herrera
On 2023-Nov-18, Dilip Kumar wrote:

> On Fri, Nov 17, 2023 at 6:16 PM Alvaro Herrera  
> wrote:

> > I wonder what's the deal with false sharing in the new
> > bank_cur_lru_count array.  Maybe instead of using LWLockPadded for
> > bank_locks, we should have a new struct, with both the LWLock and the
> > LRU counter; then pad *that* to the cacheline size.  This way, both the
> > lwlock and the counter come to the CPU running this code together.
> 
> Actually, the array lengths of both LWLock and the LRU counter are
> different so I don't think we can move them to a common structure.
> The length of the *buffer_locks array is equal to the number of slots,
> the length of the *bank_locks array is Min (number_of_banks, 128), and
> the length of the *bank_cur_lru_count array is number_of_banks.

Oh.

> > Looking at the coverage for this code,
> > https://coverage.postgresql.org/src/backend/access/transam/clog.c.gcov.html#413
> > it seems in our test suites we never hit the case where there is
> > anything in the "nextidx" field for commit groups.  To be honest, I
> > don't understand this group stuff, and so I'm doubly hesitant to go
> > without any testing here.  Maybe it'd be possible to use Michael
> > Paquier's injection points somehow?
> 
> Sorry, but I am not aware of "Michael Paquier's injection points". Is it
> something already in the repo? Can you redirect me to some of the
> example test cases if we already have them? Then I will try it out.

https://postgr.es/zvwufo_ykztjh...@paquier.xyz

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Sallah, I said NO camels! That's FIVE camels; can't you count?"
(Indiana Jones)




Re: long-standing data loss bug in initial sync of logical replication

2023-11-18 Thread Andres Freund
Hi,

On 2023-11-18 11:56:47 +0100, Tomas Vondra wrote:
> > I guess it's not really feasible to just increase the lock level here though
> > :(. The use of ShareUpdateExclusiveLock isn't new, and suddenly using AEL
> > would perhaps lead to new deadlocks and such? But it also seems quite wrong.
> > 
> 
> If this really is about the lock being too weak, then I don't see why
> it would be wrong.

Sorry, that was badly formulated. The wrong bit is the use of
ShareUpdateExclusiveLock.


> If it's required for correctness, it's not really wrong, IMO. Sure, stronger
> locks are not great ...
> 
> I'm not sure about the risk of deadlocks. If you do
> 
> ALTER PUBLICATION ... ADD TABLE
> 
> it's not holding many other locks. It essentially just gets a lock on the
> pg_publication catalog, and then on the publication row. That's it.
> 
> If we increase the locks from ShareUpdateExclusive to ShareRowExclusive,
> we're making it conflict with RowExclusive. Which is just DML, and I
> think we need to do that.

From what I can tell it needs to be an AccessExclusiveLock. Completely
independent of logical decoding. The way the cache stays coherent is catalog
modifications conflicting with anything that builds cache entries. We have a
few cases where we do use lower level locks, but for those we have explicit
analysis for why that's ok (see e.g. reloptions.c) or we block until nobody
could have an old view of the catalog (various CONCURRENTLY operations).


> So maybe that's fine? For me, a detected deadlock is better than
> silently missing some of the data.

That certainly is true.


> > We could brute force this in the logical decoding infrastructure, by
> > distributing invalidations from catalog modifying transactions to all
> > concurrent in-progress transactions (like already done for historic catalog
> > snapshot, c.f. SnapBuildDistributeNewCatalogSnapshot()).  But I think that'd
> > be a fairly significant increase in overhead.
> > 
> 
> I have no idea what the overhead would be - perhaps not too bad,
> considering catalog changes are not too common (I'm sure there are
> extreme cases). And maybe we could even restrict this only to
> "interesting" catalogs, or something like that? (However I hate those
> weird differences in behavior, it can easily lead to bugs.)
>
> But it feels more like a band-aid than actually fixing the issue.

Agreed.

Greetings,

Andres Freund




Re: Use of backup_label not noted in log

2023-11-18 Thread Andres Freund
Hi,

On 2023-11-18 09:30:01 -0400, David Steele wrote:
> I know this isn't really a bug, but not being able to tell where recovery
> information came from seems like a major omission in the logging.

Yea. I was preparing to forcefully suggest that monitoring tooling for new
standbys and PITRs needs to verify that backup_label was actually used, only
to remember that there's nothing it could realistically use for that (relying
on DEBUG1 in production imo isn't realistic).

Greetings,

Andres Freund




Re: Use of backup_label not noted in log

2023-11-18 Thread Andres Freund
Hi,

On 2023-11-17 06:41:46 +0100, Laurenz Albe wrote:
> On Thu, 2023-11-16 at 20:18 -0800, Andres Freund wrote:
> > I've often had to analyze what caused corruption in PG instances, where the
> > symptoms match not having had backup_label in place when bringing up the
> > node. However that's surprisingly hard - the only log messages that indicate
> > use of backup_label are at DEBUG1.
> >
> > Given how crucial use of backup_label is and how frequently people do get it
> > wrong, I think we should add a LOG message - it's not like use of 
> > backup_label
> > is a frequent thing in the life of a postgres instance and is going to swamp
> > the log.  And I think we should backpatch that addition.
>
> +1
>
> I am not sure about the backpatch: it is not a bug, and we should not wantonly
> introduce new log messages in a minor release.  Some monitoring system may
> get confused.

I think log monitoring need (and do) handle unknown log messages
gracefully. You're constantly encountering them.  If we were to change an
existing log message in the back branches it'd be a different story.

The reason for backpatching is that this is by far the most common reason for
corrupted systems in the wild that I have seen. And there's no way to
determine from the logs whether something has gone right or wrong - not really
a bug, but a pretty substantial weakness. And we're going to have to deal with
< 17 releases for 5 years, so making this at least somewhat diagnosable seems
like a good idea.


> What about adding it to the "redo starts at" message, something like
>
>   redo starts at 12/12345678, taken from control file
>
> or
>
>   redo starts at 12/12345678, taken from backup label

I think it'd make sense to log use of backup_label earlier than that - the
locations from backup_label might end up not being available in the archive,
the primary or locally, and we'll error out with "could not locate a valid
checkpoint record".

I'd probably just do it within the if (read_backup_label()) block in
InitWalRecovery(), *before* the ReadCheckpointRecord().

I do like the idea of expanding the "redo starts at" message
though. E.g. including minRecoveryLSN, ControlFile->backupStartPoint,
ControlFile->backupEndPoint would provide information about when the node
might become consistent.
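
(For illustration only, a rough sketch of both ideas; variable and field
names are from memory and merely indicative, not a concrete patch:)

	if (read_backup_label(&CheckPointLoc, &CheckPointTLI, &backupEndRequired,
						  &backupFromStandby))
	{
		/* hypothetical addition: announce backup_label use before trying
		 * to read the checkpoint record it points at */
		ereport(LOG,
				(errmsg("starting backup recovery using backup_label, "
						"checkpoint location %X/%X",
						LSN_FORMAT_ARGS(CheckPointLoc))));

		/* ... existing ReadCheckpointRecord() etc. ... */
	}

and the expanded message could be something like:

	ereport(LOG,
			(errmsg("redo starts at %X/%X, min recovery point %X/%X, "
					"backup start %X/%X, backup end %X/%X",
					LSN_FORMAT_ARGS(RedoStartLSN),
					LSN_FORMAT_ARGS(ControlFile->minRecoveryPoint),
					LSN_FORMAT_ARGS(ControlFile->backupStartPoint),
					LSN_FORMAT_ARGS(ControlFile->backupEndPoint))));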

Greetings,

Andres Freund




Re: Infinite Interval

2023-11-18 Thread Joseph Koshakow
On Thu, Nov 16, 2023 at 2:03 AM Ashutosh Bapat 
wrote:
>
>On Tue, Nov 14, 2023 at 4:39 PM Dean Rasheed 
wrote:
>>
>> On Thu, 9 Nov 2023 at 12:49, Dean Rasheed 
wrote:
>> >
>> > OK, I have pushed 0001 and 0002. Here's the remaining (main) patch.
>> >
>>
>> OK, I have now pushed the main patch.
>
>Thanks a lot Dean.

Yes, thanks Dean!


Re: Schema variables - new implementation for Postgres 15

2023-11-18 Thread Dmitry Dolgov
> On Sat, Nov 18, 2023 at 02:19:09PM +0100, Pavel Stehule wrote:
> > Would it be a problem to make pg_session_variables inspect the catalog
> > or something similar if needed?
> >
>
> It can be very easy to build pg_session_variables based on iteration over
> the system catalog. But I am not sure if we want it. pg_session_variables()
> is designed to show the variables from session memory, and it is used for
> testing. Originally it was named pg_debug_session_variables. If we iterate
> over catalog, it means using locks, and it can have an impact on isolation
> tests.

I see, thanks for the clarification. In the end one can check the catalog
directly, of course; is there any other value in this function beyond
debugging purposes?

As a side note, I intend to go through the first few patches introducing
the basic functionality one more time, and then mark the entry as ready in
the CF. I haven't been able to break the patch in testing for quite a long
time, and for the most part the changes make sense to me.




Re: Use of backup_label not noted in log

2023-11-18 Thread David Steele

On 11/17/23 01:41, Laurenz Albe wrote:

On Thu, 2023-11-16 at 20:18 -0800, Andres Freund wrote:

I've often had to analyze what caused corruption in PG instances, where the
symptoms match not having had backup_label in place when bringing up the
node. However that's surprisingly hard - the only log messages that indicate
use of backup_label are at DEBUG1.

Given how crucial use of backup_label is and how frequently people do get it
wrong, I think we should add a LOG message - it's not like use of backup_label
is a frequent thing in the life of a postgres instance and is going to swamp
the log.  And I think we should backpatch that addition.


+1

I am not sure about the backpatch: it is not a bug, and we should not wantonly
introduce new log messages in a minor release.  Some monitoring system may
get confused.

What about adding it to the "redo starts at" message, something like

   redo starts at 12/12345678, taken from control file

or

   redo starts at 12/12345678, taken from backup label


I think a backpatch is OK as long as it is a separate message, but I 
like your idea of adding to the "redo starts" message going forward.


I know this isn't really a bug, but not being able to tell where 
recovery information came from seems like a major omission in the logging.


Regards,
-David




Re: Schema variables - new implementation for Postgres 15

2023-11-18 Thread Pavel Stehule
>>
>> The difference between debug_parallel_query = 1 and debug_parallel_query
>> = 0 is strange - and I'll check it.
>>
>
Looks like pg_session_variables() doesn't work in debug_parallel_query mode.
>

It is marked as parallel safe, which is probably nonsense.


Re: Use of backup_label not noted in log

2023-11-18 Thread David Steele

On 11/17/23 00:18, Andres Freund wrote:


I've often had to analyze what caused corruption in PG instances, where the
symptoms match not having had backup_label in place when bringing up the
node. However that's surprisingly hard - the only log messages that indicate
use of backup_label are at DEBUG1.

Given how crucial use of backup_label is and how frequently people do get it
wrong, I think we should add a LOG message - it's not like use of backup_label
is a frequent thing in the life of a postgres instance and is going to swamp
the log.  And I think we should backpatch that addition.


+1 for the message and I think a backpatch is fine as long as it is a 
new message. If monitoring systems can't handle an unrecognized message 
then that feels like a problem on their part.



Medium term I think we should go further, and leave evidence in pg_control
about the last use of ControlFile->backupStartPoint, instead of resetting it.


Michael also thinks this is a good idea.


I realize that there's a discussion about removing backup_label - but I think
that's fairly orthogonal. Should we go with the pg_control approach, we should
still emit a useful message when starting in a state that's "equivalent" to
having used the backup_label.


Agreed, this new message could easily be adapted to the recovery in 
pg_control patch.


Regards,
-David




Re: Schema variables - new implementation for Postgres 15

2023-11-18 Thread Pavel Stehule
On Sat, 18 Nov 2023 at 14:19, Pavel Stehule wrote:

> Hi
>
> On Fri, 17 Nov 2023 at 20:17, Dmitry Dolgov <9erthali...@gmail.com> wrote:
>
>> > On Wed, Aug 23, 2023 at 04:02:44PM +0200, Pavel Stehule wrote:
>> > NameListToString is already a builtin function. Do you think
>> NamesFromList?
>> >
>> > This is my oversight - there is just `+extern List *NamesFromList(List
>> > *names); ` line, but sure - it should be in 0002 patch
>> >
>> > fixed now
>>
>> Right, thanks for fixing.
>>
>> I think there is a wrinkle with pg_session_variables function. It
>> returns nothing if sessionvars hash table is empty, which has two
>> consequences:
>>
>> * One might get confused about whether a variable is created,
>>   based on the information from the function. An expected behaviour, but
>>   could be considered a bad UX.
>>
>> =# CREATE VARIABLE var1 AS varchar;
>>
>> -- empty, is expected
>> =# SELECT name, typname, can_select, can_update FROM
>> pg_session_variables();
>>  name | typname | can_select | can_update
>>  --+-++
>>  (0 rows)
>>
>> -- but one can't create a variable
>> =# CREATE VARIABLE var1 AS varchar;
>> ERROR:  42710: session variable "var1" already exists
>> LOCATION:  create_variable, pg_variable.c:102
>>
>> -- yet, suddenly after a select...
>> =# SELECT var2;
>>  var2
>>  --
>>   NULL
>>   (1 row)
>>
>> -- ... it's not empty
>> =# SELECT name, typname, can_select, can_update FROM pg_sessio
>> n_variables();
>>  name |  typname  | can_select | can_update
>>  --+---++
>>   var2 | character varying | t  | t
>>   (1 row)
>>
>> * Running a parallel query will end up returning an empty result even
>>   after accessing the variable.
>>
>> -- debug_parallel_query = 1 all the time
>> =# CREATE VARIABLE var2 AS varchar;
>>
>> -- empty, is expected
>> =# SELECT name, typname, can_select, can_update FROM
>> pg_session_variables();
>>  name | typname | can_select | can_update
>>  --+-++
>>  (0 rows)
>>
>> -- but this time an access...
>> SELECT var2;
>>  var2
>>  --
>>   NULL
>>   (1 row)
>>
>> -- or set...
>> =# LET var2 = 'test';
>>
>> -- doesn't change the result, it's still empty
>> =# SELECT name, typname, can_select, can_update FROM
>> pg_session_variables();
>>  name | typname | can_select | can_update
>>  --+-++
>>  (0 rows)
>>
>> Would it be a problem to make pg_session_variables inspect the catalog
>> or something similar if needed?
>>
>
> It can be very easy to build pg_session_variables based on iteration over
> the system catalog. But I am not sure if we want it. pg_session_variables()
> is designed to show the variables from session memory, and it is used for
> testing. Originally it was named pg_debug_session_variables. If we iterate
> over catalog, it means using locks, and it can have an impact on isolation
> tests.
>
> So maybe we can introduce a parameter for this function to show all
> session variables (based on catalog) or only used based on iteration over
> memory. Default can be "all". What do you think about it?
>
> The difference between debug_parallel_query = 1 and debug_parallel_query =
> 0 is strange - and I'll check it.
>

Looks like pg_session_variables() doesn't work in debug_parallel_query mode.


Re: Schema variables - new implementation for Postgres 15

2023-11-18 Thread Pavel Stehule
Hi

On Fri, 17 Nov 2023 at 20:17, Dmitry Dolgov <9erthali...@gmail.com> wrote:

> > On Wed, Aug 23, 2023 at 04:02:44PM +0200, Pavel Stehule wrote:
> > NameListToString is already a builtin function. Do you think NamesFromList?
> >
> > This is my oversight - there is just `+extern List *NamesFromList(List
> > *names); ` line, but sure - it should be in 0002 patch
> >
> > fixed now
>
> Right, thanks for fixing.
>
> I think there is a wrinkle with pg_session_variables function. It
> returns nothing if sessionvars hash table is empty, which has two
> consequences:
>
> * One might get confused about whether a variable is created,
>   based on the information from the function. An expected behaviour, but
>   could be considered a bad UX.
>
> =# CREATE VARIABLE var1 AS varchar;
>
> -- empty, is expected
> =# SELECT name, typname, can_select, can_update FROM
> pg_session_variables();
>  name | typname | can_select | can_update
>  --+-++
>  (0 rows)
>
> -- but one can't create a variable
> =# CREATE VARIABLE var1 AS varchar;
> ERROR:  42710: session variable "var1" already exists
> LOCATION:  create_variable, pg_variable.c:102
>
> -- yet, suddenly after a select...
> =# SELECT var2;
>  var2
>  --
>   NULL
>   (1 row)
>
> -- ... it's not empty
> =# SELECT name, typname, can_select, can_update FROM pg_sessio
> n_variables();
>  name |  typname  | can_select | can_update
>  --+---++
>   var2 | character varying | t  | t
>   (1 row)
>
> * Running a parallel query will end up returning an empty result even
>   after accessing the variable.
>
> -- debug_parallel_query = 1 all the time
> =# CREATE VARIABLE var2 AS varchar;
>
> -- empty, is expected
> =# SELECT name, typname, can_select, can_update FROM
> pg_session_variables();
>  name | typname | can_select | can_update
>  --+-++
>  (0 rows)
>
> -- but this time an access...
> SELECT var2;
>  var2
>  --
>   NULL
>   (1 row)
>
> -- or set...
> =# LET var2 = 'test';
>
> -- doesn't change the result, it's still empty
> =# SELECT name, typname, can_select, can_update FROM
> pg_session_variables();
>  name | typname | can_select | can_update
>  --+-++
>  (0 rows)
>
> Would it be a problem to make pg_session_variables inspect the catalog
> or something similar if needed?
>

It can be very easy to build pg_session_variables based on iteration over
the system catalog. But I am not sure if we want it. pg_session_variables()
is designed to show the variables from session memory, and it is used for
testing. Originally it was named pg_debug_session_variables. If we iterate
over catalog, it means using locks, and it can have an impact on isolation
tests.

So maybe we can introduce a parameter for this function to show all session
variables (based on catalog) or only used based on iteration over memory.
Default can be "all". What do you think about it?
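
(The catalog-based variant would presumably be just the usual systable scan,
something like the sketch below - the catalog OID and Form_ struct names are
illustrative, not necessarily the ones used in the patch:)

	Relation	rel = table_open(VariableRelationId, AccessShareLock);
	SysScanDesc scan = systable_beginscan(rel, InvalidOid, false, NULL, 0, NULL);
	HeapTuple	tup;

	while (HeapTupleIsValid(tup = systable_getnext(scan)))
	{
		Form_pg_variable varform = (Form_pg_variable) GETSTRUCT(tup);

		/* ... emit one result row per catalog entry ... */
	}

	systable_endscan(scan);
	table_close(rel, AccessShareLock);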

The difference between debug_parallel_query = 1 and debug_parallel_query =
0 is strange - and I'll check it.


Re: MERGE ... RETURNING

2023-11-18 Thread Dean Rasheed
On Fri, 17 Nov 2023 at 04:30, jian he  wrote:
>
> I think it should be:
> +   You will require the SELECT privilege on any column(s)
> +   of the data_source and
> +   target_table_name referred to
> +   in any condition or expression.
>

Ah, of course. As always, I'm blind to grammatical errors in my own
text, no matter how many times I read it. Thanks for checking!

Pushed.

The v13 patch still applies on top of this, so I won't re-post it.

Regards,
Dean




Re: add log messages when replication slots become active and inactive (was Re: Is it worth adding ReplicationSlot active_pid to ReplicationSlotPersistentData?)

2023-11-18 Thread Amit Kapila
On Thu, Nov 16, 2023 at 6:09 PM Bharath Rupireddy
 wrote:
>
> On Thu, Nov 16, 2023 at 4:01 PM Alvaro Herrera  
> wrote:
> >
> > On 2023-Nov-16, Peter Smith wrote:
> >
> > > I searched HEAD code and did not find any "translator:" comments for
> > > just ordinary slot name substitutions like these; AFAICT that comment
> > > is not necessary anymore.
> >
> > True.  Lose that.
>
> Removed.
>
> > The rationale I have is to consider whether a translator looking at the
> > original message message in isolation is going to understand what the %s
> > means.  If it's possible to tell what it is without having to go read
> > the source code that leads to the message, then you don't need a
> > "translator:" comment.  Otherwise you do.
>
> Agree. I think it's easy for one to guess the slot name follows "
> replication slot %s", so I removed the translator message.
>
> > > SUGGESTION (#1a and #1b)
> > >
> > > ereport(log_replication_commands ? LOG : DEBUG1,
> > > errmsg(SlotIsLogical(s)
> > >? "acquired logical replication slot \"%s\""
> > >: "acquired physical replication slot \"%s\"",
> > >NameStr(s->data.name)));
> >
> > The bad thing about this is that gettext() is not going to pick up these
> > strings into the translation catalog.  You could fix that by adding
> > gettext_noop() calls around them:
> >
> >  ereport(log_replication_commands ? LOG : DEBUG1,
> >  errmsg(SlotIsLogical(s)
> > ? gettext_noop("acquired logical replication slot \"%s\"")
> > : gettext_noop("acquired physical replication slot \"%s\""),
> > NameStr(s->data.name)));
> >
> > but at that point it's not clear that it's really better than putting
> > the ternary in the outer scope.
>
> I retained wrapping messages in errmsg("...").
>
> > You can verify this by doing "make update-po" and then searching for the
> > messages in postgres.pot.
>
> Translation gives me [1] with v18 patch
>
> PSA v18 patch.
>

LGTM. I'll push this early next week unless there are further
suggestions or comments.

-- 
With Regards,
Amit Kapila.




Re: long-standing data loss bug in initial sync of logical replication

2023-11-18 Thread Tomas Vondra



On 11/18/23 03:54, Andres Freund wrote:
> Hi,
> 
> On 2023-11-17 17:54:43 -0800, Andres Freund wrote:
>> On 2023-11-17 15:36:25 +0100, Tomas Vondra wrote:
>>> Overall, this looks, walks and quacks like a cache invalidation issue,
>>> likely a missing invalidation somewhere in the ALTER PUBLICATION code.
> 
> I can confirm that something is broken with invalidation handling.
> 
> To test this I just used pg_recvlogical to stdout. It's just interesting
> whether something arrives, that's easy to discern even with binary output.
> 
> CREATE PUBLICATION pb;
> src/bin/pg_basebackup/pg_recvlogical --plugin=pgoutput --start --slot test -d 
> postgres -o proto_version=4 -o publication_names=pb -o messages=true -f -
> 
> S1: CREATE TABLE d(data text not null);
> S1: INSERT INTO d VALUES('d1');
> S2: BEGIN; INSERT INTO d VALUES('d2');
> S1: ALTER PUBLICATION pb ADD TABLE d;
> S2: COMMIT
> S2: INSERT INTO d VALUES('d3');
> S1: INSERT INTO d VALUES('d4');
> RL: 
> 
> Without the 'd2' insert in an in-progress transaction, pgoutput *does* react
> to the ALTER PUBLICATION.
> 
> I think the problem here is insufficient locking. The ALTER PUBLICATION pb ADD
> TABLE d basically modifies the catalog state of 'd', without a lock preventing
> other sessions from having a valid cache entry that they could continue to
> use. Due to this, decoding S2's transactions that started before S2's commit,
> will populate the cache entry with the state as of the time of S1's last
> action, i.e. no need to output the change.
> 
> The reason this can happen is because OpenTableList() uses
> ShareUpdateExclusiveLock. That allows the ALTER PUBLICATION to happen while
> there's an ongoing INSERT.
> 

I guess this would also explain why changing the lock mode from
ShareUpdateExclusiveLock to ShareRowExclusiveLock changes the behavior.
INSERT acquires RowExclusiveLock, which conflicts with the latter but not
with the former.

> I think this isn't just a logical decoding issue. S2's cache state just after
> the ALTER PUBLICATION is going to be wrong - the table is already locked,
> therefore further operations on the table don't trigger cache invalidation
> processing - but the catalog state *has* changed.  It's a bigger problem for
> logical decoding though, as it's a bit more lazy about invalidation processing
> than normal transactions, allowing the problem to persist for longer.
> 

Yeah. I'm wondering if there's some other operation acquiring a lock
weaker than RowExclusiveLock that might be affected by this. Because
then we'd need to get an even stronger lock ...

> 
> I guess it's not really feasible to just increase the lock level here though
> :(. The use of ShareUpdateExclusiveLock isn't new, and suddenly using AEL
> would perhaps lead to new deadlocks and such? But it also seems quite wrong.
> 

If this really is about the lock being too weak, then I don't see why
it would be wrong. If it's required for correctness, it's not really
wrong, IMO. Sure, stronger locks are not great ...

I'm not sure about the risk of deadlocks. If you do

ALTER PUBLICATION ... ADD TABLE

it's not holding many other locks. It essentially just gets a lock on the
pg_publication catalog, and then on the publication row. That's it.

If we increase the locks from ShareUpdateExclusive to ShareRowExclusive,
we're making it conflict with RowExclusive. Which is just DML, and I
think we need to do that.

So maybe that's fine? For me, a detected deadlock is better than
silently missing some of the data.

> 
> We could brute force this in the logical decoding infrastructure, by
> distributing invalidations from catalog modifying transactions to all
> concurrent in-progress transactions (like already done for historic catalog
> snapshot, c.f. SnapBuildDistributeNewCatalogSnapshot()).  But I think that'd
> be a fairly significant increase in overhead.
> 

I have no idea what the overhead would be - perhaps not too bad,
considering catalog changes are not too common (I'm sure there are
extreme cases). And maybe we could even restrict this only to
"interesting" catalogs, or something like that? (However I hate those
weird differences in behavior, it can easily lead to bugs.)

But it feels more like a band-aid than actually fixing the issue.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Synchronizing slots from primary to standby

2023-11-18 Thread Amit Kapila
On Fri, Nov 17, 2023 at 5:18 PM Drouvot, Bertrand
 wrote:
>
> On 11/17/23 2:46 AM, Zhijie Hou (Fujitsu) wrote:
> > On Tuesday, November 14, 2023 10:27 PM Drouvot, Bertrand 
> >  wrote:
> >
> > I feel the WaitForWALToBecomeAvailable may not be the best place to shutdown
> > slotsync worker and drop slots. There could be other reasons(other than
> > promotion) as mentioned in comments in case XLOG_FROM_STREAM to reach the 
> > code
> > there. I thought if the intention is to stop slotsync workers on promotion,
> > maybe FinishWalRecovery() is a better place to do it as it's indicating the 
> > end
> > of recovery and XLogShutdownWalRcv is also called in it.
>
> I can see that slotsync_drop_initiated_slots() has been moved in 
> FinishWalRecovery()
> in v35. That looks ok.
> >

I was thinking what if we just ignore creating such slots (which
require init state) in the first place? I think that can be
time-consuming in some cases but it will reduce the complexity and we
can always improve such cases later if we really encounter them in the
real world. I am not very sure the added complexity is worth it for
addressing this particular case, so I would like to know your and
others' opinions.

More Review for v35-0002*

1.
+ ereport(WARNING,
+ errmsg("skipping slots synchronization as primary_slot_name "
+"is not set."));

There is no need to use a full stop at the end for WARNING messages
and as previously mentioned, let's not split message lines in such
cases. There are other messages in the patch with similar problems,
please fix those as well.
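
That is, roughly:

+ ereport(WARNING,
+ errmsg("skipping slots synchronization as primary_slot_name is not set"));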

2.
+slotsync_checks()
{
...
...
+ /* The hot_standby_feedback must be ON for slot-sync to work */
+ if (!hot_standby_feedback)
+ {
+ ereport(WARNING,
+ errmsg("skipping slots synchronization as hot_standby_feedback "
+"is off."));

This message has the same problem as mentioned in the previous
comment. Additionally, I think either atop slotsync_checks or along
with GUC check we should write comments as to why we expect these
values to be set for slot sync to work.

3.
+ /* The worker is running already */
+ if (SlotSyncWorker &&SlotSyncWorker->hdr.in_use
+ && SlotSyncWorker->hdr.proc)

The spacing for both the &&'s has problems. You need a space after the
first && and the second && should be in the prior line.
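
That is:

+ if (SlotSyncWorker && SlotSyncWorker->hdr.in_use &&
+ SlotSyncWorker->hdr.proc)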

4.
+ LauncherRereadConfig(_slotsync);
+
  }

An empty line after LauncherRereadConfig() is not required.

5.
+static void
+LauncherRereadConfig(bool *ss_recheck)
+{
+ char*conninfo = pstrdup(PrimaryConnInfo);
+ char*slotname = pstrdup(PrimarySlotName);
+ bool syncslot = enable_syncslot;
+ bool feedback = hot_standby_feedback;

Can we change the variable name 'feedback' to 'standbyfeedback' to
make it slightly more descriptive?

6. The logic to recheck the slot_sync related parameters in
LauncherMain() is not very clear. IIUC, if after reload config any
parameter is changed, we just seem to be checking the validity of the
changed parameter but not restarting the slot sync worker, is that
correct? If so, what if dbname is changed, don't we need to restart
the slot-sync worker and re-initialize the connection; similarly
slotname change also needs some thoughts. Also, if all the parameters
are valid we seem to be re-launching the slot-sync worker without
first stopping it which doesn't seem correct, am I missing something
in this logic?

7.
@@ -524,6 +525,25 @@ CreateDecodingContext(XLogRecPtr start_lsn,
  errmsg("replication slot \"%s\" was not created in this database",
  NameStr(slot->data.name))));

+ in_recovery = RecoveryInProgress();
+
+ /*
+ * Do not allow consumption of a "synchronized" slot until the standby
+ * gets promoted. Also do not allow consumption of slots with sync_state
+ * as SYNCSLOT_STATE_INITIATED as they are not synced completely to be
+ * used.
+ */
+ if ((in_recovery && (slot->data.sync_state != SYNCSLOT_STATE_NONE)) ||
+ slot->data.sync_state == SYNCSLOT_STATE_INITIATED)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cannot use replication slot \"%s\" for logical decoding",
+ NameStr(slot->data.name)),
+ in_recovery ?
+ errdetail("This slot is being synced from the primary server.") :
+ errdetail("This slot was not synced completely from the primary server."),
+ errhint("Specify another replication slot.")));
+

If we are planning to drop slots in state SYNCSLOT_STATE_INITIATED at
the time of promotion, don't we need to just have an assert or
elog(ERROR, .. for non-recovery cases as such cases won't be
reachable? If so, I think we can separate out that case here.
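
(A rough sketch of the separation being suggested - illustrative only:)

	if (RecoveryInProgress())
	{
		if (slot->data.sync_state != SYNCSLOT_STATE_NONE)
			ereport(ERROR, ...);	/* "being synced from the primary" case */
	}
	else
	{
		/*
		 * Slots in SYNCSLOT_STATE_INITIATED are dropped at promotion, so
		 * after recovery this state should be unreachable.
		 */
		Assert(slot->data.sync_state != SYNCSLOT_STATE_INITIATED);
	}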

8.
wait_for_primary_slot_catchup()
{
...
+ /* Check if this standby is promoted while we are waiting */
+ if (!RecoveryInProgress())
+ {
+ /*
+ * The remote slot didn't pass the locally reserved position at
+ * the time of local promotion, so it's not safe to use.
+ */
+ ereport(
+ WARNING,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg(
+ "slot-sync wait for slot %s interrupted by promotion, "
+ "slot 

Re: long-standing data loss bug in initial sync of logical replication

2023-11-18 Thread Tomas Vondra
On 11/18/23 02:54, Andres Freund wrote:
> Hi,
> 
> On 2023-11-17 15:36:25 +0100, Tomas Vondra wrote:
>> It seems there's a long-standing data loss issue related to the initial
>> sync of tables in the built-in logical replication (publications etc.).
> 
> :(
> 

Yeah :-(

> 
>> Overall, this looks, walks and quacks like a cache invalidation issue,
>> likely a missing invalidation somewhere in the ALTER PUBLICATION code.
> 
> It could also be be that pgoutput doesn't have sufficient invalidation
> handling.
> 

I'm not sure about the details, but it can't be just about pgoutput
failing to react to some syscache invalidation. As described, just
resetting the RelationSyncEntry doesn't fix the issue - it's the
syscache that's not invalidated, IMO. But maybe that's what you mean.

> 
> One thing that looks bogus on the DDL side is how the invalidation handling
> interacts with locking.
> 
> 
> For tables etc the invalidation handling works because we hold a lock on the
> relation before modifying the catalog and don't release that lock until
> transaction end. That part is crucial: We queue shared invalidations at
> transaction commit, *after* the transaction is marked as visible, but *before*
> locks are released. That guarantees that any backend processing invalidations
> will see the new contents.  However, if the lock on the modified object is
> released before transaction commit, other backends can build and use a cache
> entry that hasn't processed invalidations (invalidations are processed when
> acquiring locks).
> 

Right.

> While there is such an object for publications, it seems to be acquired too
> late to actually do much good in a number of paths. And not at all in others.
> 
> E.g.:
> 
>   pubform = (Form_pg_publication) GETSTRUCT(tup);
> 
>   /*
>* If the publication doesn't publish changes via the root partitioned
>* table, the partition's row filter and column list will be used. So
>* disallow using WHERE clause and column lists on partitioned table in
>* this case.
>*/
>   if (!pubform->puballtables && publish_via_partition_root_given &&
>   !publish_via_partition_root)
> {
>   /*
>* Lock the publication so nobody else can do anything with it. 
> This
>* prevents concurrent alter to add partitioned table(s) with 
> WHERE
>* clause(s) and/or column lists which we don't allow when not
>* publishing via root.
>*/
>   LockDatabaseObject(PublicationRelationId, pubform->oid, 0,
>  AccessShareLock);
> 
> a) Another session could have modified the publication and made puballtables 
> out-of-date
> b) The LockDatabaseObject() uses AccessShareLock, so others can get past this
>point as well
> 
> b) seems like a copy-paste bug or such?
> 
> 
> I don't see any locking of the publication around RemovePublicationRelById(),
> for example.
> 
> I might just be misunderstanding things the way publication locking is
> intended to work.
> 

I've been asking similar questions while investigating this, but the
interactions with logical decoding (which kinda happens concurrently in
terms of WAL, but not concurrently in terms of time), historical
snapshots etc. make my head spin.

> 
>> However, while randomly poking at different things, I realized that if I
>> change the lock obtained on the relation in OpenTableList() from
>> ShareUpdateExclusiveLock to ShareRowExclusiveLock, the issue goes away.
> 
> That's odd. There's cases where changing the lock level can cause invalidation
> processing to happen because there is no pre-existing lock for the "new" lock
> level, but there was for the old. But OpenTableList() is used when altering
> the publications, so I don't see how that connects.
> 

Yeah, I had the idea that maybe the transaction already holds the lock
on the table, and changing this to ShareRowExclusiveLock makes it
different, possibly triggering a new invalidation or something. But I
did check with gdb, and if I set a breakpoint at OpenTableList, there
are no locks on the table.

But the effect is hard to deny - if I run the test 100 times, with the
ShareUpdateExclusiveLock I get maybe 80 failures. After changing it to
ShareRowExclusiveLock I get 0. Sure, there's some randomness for cases
like this, but this is pretty unlikely.
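
(For illustration, the experiment amounts to roughly a one-line change in
OpenTableList() in src/backend/commands/publicationcmds.c - quoted from
memory, the actual call site may differ slightly:)

-	rel = table_openrv(t->relation, ShareUpdateExclusiveLock);
+	rel = table_openrv(t->relation, ShareRowExclusiveLock);

With ShareRowExclusiveLock, ALTER PUBLICATION ... ADD TABLE has to wait for
the concurrent INSERT's RowExclusiveLock, which matches the change in
behavior observed above.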


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

2023-11-18 Thread Dilip Kumar
On Fri, Nov 17, 2023 at 6:16 PM Alvaro Herrera  wrote:

Thanks for the review; all the comments look fine to me. Replying to
those that need some clarification:

> I wonder what's the deal with false sharing in the new
> bank_cur_lru_count array.  Maybe instead of using LWLockPadded for
> bank_locks, we should have a new struct, with both the LWLock and the
> LRU counter; then pad *that* to the cacheline size.  This way, both the
> lwlock and the counter come to the CPU running this code together.

Actually, the array lengths of both LWLock and the LRU counter are
different so I don't think we can move them to a common structure.
The length of the *buffer_locks array is equal to the number of slots,
the length of the *bank_locks array is Min (number_of_banks, 128), and
the length of the *bank_cur_lru_count array is number_of_banks.
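
To spell out the layout being discussed (a sketch only; names follow the
thread rather than the actual patch):

	int				num_slots;			/* configured number of buffers */
	int				num_banks;			/* num_slots divided by the bank size */

	LWLockPadded   *buffer_locks;		/* num_slots entries, one per buffer */
	LWLockPadded   *bank_locks;			/* Min(num_banks, 128) entries */
	int			   *bank_cur_lru_count;	/* num_banks entries, one per bank */

So the lock array and the LRU-count array could only be merged into one
padded struct if both were indexed per bank, which is not the case here.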

> Looking at the coverage for this code,
> https://coverage.postgresql.org/src/backend/access/transam/clog.c.gcov.html#413
> it seems in our test suites we never hit the case where there is
> anything in the "nextidx" field for commit groups.  To be honest, I
> don't understand this group stuff, and so I'm doubly hesitant to go
> without any testing here.  Maybe it'd be possible to use Michael
> Paquier's injection points somehow?

Sorry, but I am not aware of "Michael Paquier's injection points". Is it
something already in the repo? Can you redirect me to some of the
example test cases if we already have them? Then I will try it out.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com