Re: Logical Replication of sequences

2024-07-25 Thread shveta malik
On Thu, Jul 25, 2024 at 9:06 AM vignesh C  wrote:
>
> The attached v20240725 version patch has the changes for the same.

Thank you for addressing the comments. Please review the issues below:

1) The sub gets ahead of the pub due to a wrong initial sync of last_value
for never-incremented sequences. Steps at [1]
2) The sequence's min value is not honored on the sub during replication. Steps at [2]

[1]:
---
on PUB:
CREATE SEQUENCE myseq001 INCREMENT 5 START 100;
SELECT * from pg_sequences; -->shows last_value as NULL

on SUB:
CREATE SEQUENCE myseq001 INCREMENT 5 START 100;
SELECT * from pg_sequences; -->correctly shows last_value as NULL
ALTER SUBSCRIPTION sub1 REFRESH PUBLICATION SEQUENCES;
SELECT * from pg_sequences;  -->wrongly updates last_value to 100; it is
still NULL on the pub.

Thus, SELECT nextval('myseq001') gives 100 on the pub, while on the sub it gives 105.
---


[2]:
---
Pub:
CREATE SEQUENCE myseq0 INCREMENT 5 START 10;
SELECT * from pg_sequences;

Sub:
CREATE SEQUENCE myseq0 INCREMENT 5 MINVALUE 100;

Pub:
SELECT nextval('myseq0');
SELECT nextval('myseq0');

Sub:
ALTER SUBSCRIPTION sub1 REFRESH PUBLICATION SEQUENCES;
--check 'last_value', it is 15 while min_value is 100
SELECT * from pg_sequences;
---

thanks
Shveta




Re: Logical Replication of sequences

2024-07-24 Thread shveta malik
On Wed, Jul 24, 2024 at 11:52 AM shveta malik  wrote:
>
> On Wed, Jul 24, 2024 at 9:17 AM Peter Smith  wrote:
> >
>
> I had a look at patches v20240720* (considering these as the latest
> one) and tried to do some basic testing (WIP). Few comments:
>
> 1)
> I see 'last_value' is updated wrongly after create-sub.  Steps:
>
> ---
> pub:
> CREATE SEQUENCE myseq0 INCREMENT 5 START 100;
> SELECT nextval('myseq0');
> SELECT nextval('myseq0');
> --last_value on pub is 105
> select * from pg_sequences;
> create publication pub1 for all tables, sequences;
>
> Sub:
> CREATE SEQUENCE myseq0 INCREMENT 5 START 100;
> create subscription sub1 connection 'dbname=postgres host=localhost
> user=shveta port=5433' publication pub1;
>
> --check 'r' state is reached
> select pc.relname, pr.srsubstate, pr.srsublsn from pg_subscription_rel
> pr, pg_class pc where (pr.srrelid = pc.oid);
>
> --check 'last_value', it shows some random value as 136
> select * from pg_sequences;

Okay, I see that in fetch_remote_sequence_data(), we insert
'last_value + log_cnt' fetched from the remote as 'last_value' on the
subscriber, which leads to the above behaviour. I did not understand
why this is done. It may cause problems when we insert data into a
table with an identity column on the subscriber (whose internal
sequence is replicated); the identity column will end up with a wrong
value in that case.

thanks
Shveta




Re: Logical Replication of sequences

2024-07-24 Thread shveta malik
On Wed, Jul 24, 2024 at 9:17 AM Peter Smith  wrote:
>

I had a look at the v20240720* patches (considering these to be the
latest ones) and tried to do some basic testing (WIP). A few comments:

1)
I see that 'last_value' is updated wrongly after CREATE SUBSCRIPTION.  Steps:

---
pub:
CREATE SEQUENCE myseq0 INCREMENT 5 START 100;
SELECT nextval('myseq0');
SELECT nextval('myseq0');
--last_value on pub is 105
select * from pg_sequences;
create publication pub1 for all tables, sequences;

Sub:
CREATE SEQUENCE myseq0 INCREMENT 5 START 100;
create subscription sub1 connection 'dbname=postgres host=localhost
user=shveta port=5433' publication pub1;

--check 'r' state is reached
select pc.relname, pr.srsubstate, pr.srsublsn from pg_subscription_rel
pr, pg_class pc where (pr.srrelid = pc.oid);

--check 'last_value'; it shows an unexpected value such as 136
select * from pg_sequences;
---

2)
I can use 'for all sequences' only with 'for all tables' and cannot
use it as below. Shouldn't that be allowed?

create publication pub2 for tables in schema public, for all sequences;
create publication pub2 for table t1, for all sequences;

3)
preprocess_pub_all_objtype_list():
Do we need 'alltables_specified' and 'allsequences_specified'? Can't
we do the repetition check using *alltables and *allsequences?

4) patch002's commit msg says: 'Additionally, a new system view,
pg_publication_sequences, has been introduced'.
But it is not part of patch002.

thanks
Shveta




Re: Conflict detection and logging in logical replication

2024-07-22 Thread shveta malik
On Fri, Jul 19, 2024 at 2:06 PM shveta malik  wrote:
>
> On Thu, Jul 18, 2024 at 7:52 AM Zhijie Hou (Fujitsu)
>  wrote:
> >
> > Attach the V5 patch set which changed the following.
>

Please find the last batch of comments on v5:

patch001:
1)
create subscription sub1 ... (detect_conflict=true);
I think it will be good to give a WARNING here indicating that
detect_conflict is enabled but track_commit_timestamp is disabled, and
thus some conflict detection may not work. (Rephrase as appropriate.)

2)
013_partition.pl: Since we have added an update_differ testcase here,
we should add delete_differ as well. And in some appropriate file, we
should add an update_exists test too.

3)
013_partition.pl (#799,802):
For update_missing and delete_missing, we verify the log with the
format 'conflict delete_missing/update_missing detected on relation '.
But for update_differ, we do not capture "conflict update_differ
detected on relation ..."; we capture only the DETAIL.
I think we should be consistent and capture the conflict name here as well.


patch002:

4)
pg_stat_get_subscription_stats():

-
/* conflict count */
for (int nconflict = 0; nconflict < CONFLICT_NUM_TYPES; nconflict++)
values[i + nconflict] = Int64GetDatum(subentry->conflict_count[nconflict]);

i += CONFLICT_NUM_TYPES;
-

Can't we do values[i++] here as well (instead of values[i +
nconflict])? Then we don't need to do 'i += CONFLICT_NUM_TYPES'.
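To illustrate the suggestion, a minimal sketch (hypothetical helper names, not the patch's code) showing that the two indexing styles fill the array identically, while the post-increment form tracks the next free slot by itself:

```c
#include <assert.h>

#define CONFLICT_NUM_TYPES 3

/* Style currently in the patch: offset by the loop counter, then bump i. */
int fill_offset(long values[], int i, const long counts[])
{
    for (int n = 0; n < CONFLICT_NUM_TYPES; n++)
        values[i + n] = counts[n];
    i += CONFLICT_NUM_TYPES;
    return i;                    /* index of the next free slot */
}

/* Suggested style: a single post-incremented index, no separate bump. */
int fill_postinc(long values[], int i, const long counts[])
{
    for (int n = 0; n < CONFLICT_NUM_TYPES; n++)
        values[i++] = counts[n];
    return i;                    /* index of the next free slot */
}
```

Both leave `i` at the same final position, so the explicit `i += CONFLICT_NUM_TYPES` becomes unnecessary.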

5)
026_stats.pl:
Wherever we are checking 'Apply and Sync errors are > 0 and
reset timestamp is NULL', we need to check update_exists_count as well,
along with the other counts.


thanks
Shveta




Re: Allow logical failover slots to wait on synchronous replication

2024-07-21 Thread shveta malik
On Fri, Jul 19, 2024 at 2:52 AM John H  wrote:
>
> Hi Shveta,
>
> Thanks for taking a look at the patch.
>
> > > will leave user no option to unlink failover-enabled logical
> > > subscribers from the wait on synchronous standbys.
>
> That's a good point. I'm a bit biased in that I don't think there's a
> great reason why someone would
> want to replicate logical changes out of the synchronous cluster
> without it having been synchronously replicated
>  but yes this would be different behavior compared to strictly the slot one.
>
> > ...
> > So when 'synchronized_standby_slots' is comma separated list, we pick
> > those slots; if it is empty, then no wait on standbys, and if its
> > value is 'DEFAULT' as configured by user, then go with
> > 'synchronous_standby_names'. Thoughts?
>
> I think I'd prefer having a separate GUC if the alternative is to reserve
> special keywords in 'synchronized_standby_slots' but I'm not sure if I
> feel strongly about that.

My only concern is that earlier we provided a way to set the failover
property of slots without mandatorily waiting on physical standbys,
but now we would be changing this behaviour. Okay, let us see what
others comment. If we plan to go this way, we should change the docs
to mention this clearly.


> > > 2)
> > > When  'synchronized_standby_slots' is configured but standby named in
> > > it is down blocking logical replication, then we get a WARNING in
> > > subscriber's log file:
> > >
> > > WARNING:  replication slot "standby_2" specified in parameter
> > > synchronized_standby_slots does not have active_pid.
> > > DETAIL:  Logical replication is waiting on the standby associated with
> > > "standby_2".
> > > HINT:  Consider starting standby associated with "standby_2" or amend
> > > parameter synchronized_standby_slots.
> > >
> > > But OTOH, when  'synchronous_standby_names' is configured instead of
> > > 'synchronized_standby_slots' and any of the standbys listed is down
> > > blocking logical replication, we do not get any sort of warning. It is
> > > inconsistent behavior. Also user might be left clueless on why
> > > subscribers are not getting changes.
>
> Ah that's a gap. Let me add some logging/warning in a similar fashion.
> Although I think I'd have the warning be relatively generic (e.g.
> changes are blocked because
> they're not synchronously committed)
>

okay, sounds good.

thanks
Shveta




Re: Conflict detection and logging in logical replication

2024-07-19 Thread shveta malik
On Thu, Jul 18, 2024 at 7:52 AM Zhijie Hou (Fujitsu)
 wrote:
>
> Attach the V5 patch set which changed the following.

Thanks for the patch. I tested that the previously reported issues are
fixed. Please have a look at the scenario below; I was expecting it to
raise 'update_differ', but it raised both 'update_differ' and
'delete_differ' together:

-
Pub:
create table tab (a int not null, b int primary key);
create publication pub1 for table tab;

Sub (partitioned table):
create table tab (a int not null, b int primary key) partition by
range (b);
create table tab_1 partition of tab for values from (minvalue) to
(100);
create table tab_2 partition of tab for values from (100) to
(maxvalue);
create subscription sub1 connection '...' publication pub1 WITH
(detect_conflict=true);

Pub -  insert into tab values (1,1);
Sub - update tab set b=1000 where a=1;
Pub - update tab set b=1000 where a=1;  -->update_missing detected
correctly as b=1 will not be found on sub.
Pub - update tab set b=1 where b=1000;  -->update_differ expected, but
it gives both update_differ and delete_differ.
-

Few trivial comments:

1)
Commit msg:
For insert_exists conflict, the log can include origin and commit
timestamp details of the conflicting key with track_commit_timestamp
enabled.

--Please add update_exists as well.

2)
execReplication.c:
Return false if there is no conflict and *conflictslot is set to NULL.

--This gives the feeling that the function returns false only if both
conditions hold. But actually the first part is the condition, while
the second is the action. Better to rephrase as:

Returns false if there is no conflict. Sets *conflictslot to NULL in
such a case.
Or
Sets *conflictslot to NULL and returns false in case of no conflict.

3)
FindConflictTuple() shares some code with
RelationFindReplTupleByIndex() and RelationFindReplTupleSeq() for
checking the status in 'res'. Is it worth factoring out a function to
be used in all three?

thanks
Shveta




Re: Conflict detection and logging in logical replication

2024-07-17 Thread shveta malik
On Thu, Jul 18, 2024 at 7:52 AM Zhijie Hou (Fujitsu)
 wrote:
>
> On Thursday, July 11, 2024 1:03 PM shveta malik  
> wrote:
>
> Hi,
>
> Thanks for the comments!
>
> >
> > I have one concern about how we deal with conflicts. As for insert_exists, 
> > we
> > keep on erroring out while raising conflict, until it is manually resolved:
> > ERROR:  conflict insert_exists detected
> >
> > But for other cases, we just log conflict and either skip or apply the 
> > operation. I
> > LOG:  conflict update_differ detected
> > DETAIL:  Updating a row that was modified by a different origin
> >
> > I know that it is no different than HEAD. But now since we are logging 
> > conflicts
> > explicitly, we should call out default behavior on each conflict. I see some
> > incomplete and scattered info in '31.5.
> > Conflicts' section saying that:
> >  "When replicating UPDATE or DELETE operations, missing data will not
> > produce a conflict and such operations will simply be skipped."
> > (lets say it as pt a)
> >
> > Also some more info in a later section saying (pt b):
> > :A conflict will produce an error and will stop the replication; it must be 
> > resolved
> > manually by the user."
> >
> > My suggestions:
> > 1) in point a above, shall we have:
> > missing data or differing data (i.e. somehow reword to accommodate
> > update_differ and delete_differ cases)
>
> I am not sure if rewording existing words is better. I feel adding a link to
> let user refer to the detect_conflict section for the all the
> conflicts is sufficient, so did like that.

Agree, it looks better with detect_conflict link.

> >
> > 2)
> > monitoring.sgml: Below are my suggestions, please change if you feel apt.
> >
> > 2a) insert_exists_count : Number of times inserting a row that violates a 
> > NOT
> > DEFERRABLE unique constraint while applying changes. Suggestion: Number of
> > times a row insertion violated a NOT DEFERRABLE unique constraint while
> > applying changes.
> >
> > 2b) update_differ_count : Number of times updating a row that was previously
> > modified by another source while applying changes. Suggestion: Number of 
> > times
> > update was performed on a row that was previously modified by another source
> > while applying changes.
> >
> > 2c) delete_differ_count: Number of times deleting a row that was previously
> > modified by another source while applying changes. Suggestion: Number of 
> > times
> > delete was performed on a row that was previously modified by another source
> > while applying changes.
>
> I am a bit unsure which one is better, so I didn't change in this version.

I still feel the wording is a bit unclear/incomplete. Also, to be
consistent with the previous fields (see sync_error_count: Number of
times an error occurred during the initial table synchronization), we
should at least phrase it in the past tense. Another way of writing it
could be:

'Number of times inserting a row violated a NOT DEFERRABLE unique
constraint while applying changes.'   and likewise for each conflict
field.


> >
> > 5)
> > conflict.h:CONFLICT_NUM_TYPES
> >
> > --Shall the macro be CONFLICT_TYPES_NUM  instead?
>
> I think the current style followed existing ones(e.g. IOOP_NUM_TYPES,
> BACKEND_NUM_TYPES, IOOBJECT_NUM_TYPES ...), so I didn't change this.
>
> Attach the V5 patch set which changed the following:
> 1. addressed shveta's comments which are not mentioned above.
> 2. support update_exists conflict which indicates
> that the updated value of a row violates the unique constraint.

Thanks for making the changes.

thanks
Shveta




Re: Allow logical failover slots to wait on synchronous replication

2024-07-17 Thread shveta malik
On Thu, Jul 18, 2024 at 9:25 AM shveta malik  wrote:
>
> On Tue, Jul 9, 2024 at 12:42 AM John H  wrote:
> >
> >
> > > Can we make it a default
> > > behavior that logical slots marked with a failover option will wait
> > > for 'synchronous_standby_names' as per your patch's idea unless
> > > 'standby_slot_names' is specified? I don't know if there is any value
> > > in setting the 'failover' option for a slot without specifying
> > > 'standby_slot_names', so was wondering if we can additionally tie it
> > > to 'synchronous_standby_names'. Any better ideas?
> > >
> >
> > No, I think that works pretty cleanly actually. Reserving some special
> > keyword isn't great
> > which is the only other thing I can think of. I've updated the patch
> > and tests to reflect that.
> >
> > Attached the patch that addresses these changes.
>
> Thank You for the patch. I like the overall idea, it is a useful one
> and will make user experience better. Please find few comments.
>
> 1)
> I agree with the idea that instead of introducing new GUC, we can make
> failover logical slots wait on 'synchronous_standby_names', but this
> will leave user no option to unlink failover-enabled logical
> subscribers from the wait on synchronous standbys. We provide user a
> way to switch off automatic slot-sync by disabling
> 'sync_replication_slots' and we also provide user a way to manually
> sync the slots using function 'pg_sync_replication_slots()' and
> nowhere we make it mandatory to set 'synchronized_standby_slots'; in
> fact in docs, it is a recommended setting and not a mandatory one.
> User can always create failover slots, switch off automatic slot sync
> and disable wait on standbys by not specifying
> 'synchronized_standby_slots' and do the slot-sync and consistency
> checks manually when needed. I feel, for worst case scenarios, we
> should provide user an option to delink failover-enabled logical
> subscriptions from  'synchronous_standby_names'.
>
> We can have 'synchronized_standby_slots' (standby_slot_names) to
> accept one such option as in 'SAME_AS_SYNCREP_STANDBYS' or  say
> 'DEFAULT'. So when 'synchronous_standby_names' is comma separated
> list, we pick those slots; if it is empty, then no wait on standbys,
> and if its value is 'DEFAULT' as configured by the user, then go with
> 'synchronous_standby_names'. Thoughts?

One correction here
('synchronous_standby_names' --> 'synchronized_standby_slots').
Corrected version:

So when 'synchronized_standby_slots' is comma separated list, we pick
those slots; if it is empty, then no wait on standbys, and if its
value is 'DEFAULT' as configured by user, then go with
'synchronous_standby_names'. Thoughts?

>
> 2)
> When  'synchronized_standby_slots' is configured but standby named in
> it is down blocking logical replication, then we get a WARNING in
> subscriber's log file:
>
> WARNING:  replication slot "standby_2" specified in parameter
> synchronized_standby_slots does not have active_pid.
> DETAIL:  Logical replication is waiting on the standby associated with
> "standby_2".
> HINT:  Consider starting standby associated with "standby_2" or amend
> parameter synchronized_standby_slots.
>
> But OTOH, when  'synchronous_standby_names' is configured instead of
> 'synchronized_standby_slots' and any of the standbys listed is down
> blocking logical replication, we do not get any sort of warning. It is
> inconsistent behavior. Also user might be left clueless on why
> subscribers are not getting changes.
>
> thanks
> Shveta




Re: Allow logical failover slots to wait on synchronous replication

2024-07-17 Thread shveta malik
On Tue, Jul 9, 2024 at 12:42 AM John H  wrote:
>
>
> > Can we make it a default
> > behavior that logical slots marked with a failover option will wait
> > for 'synchronous_standby_names' as per your patch's idea unless
> > 'standby_slot_names' is specified? I don't know if there is any value
> > in setting the 'failover' option for a slot without specifying
> > 'standby_slot_names', so was wondering if we can additionally tie it
> > to 'synchronous_standby_names'. Any better ideas?
> >
>
> No, I think that works pretty cleanly actually. Reserving some special
> keyword isn't great
> which is the only other thing I can think of. I've updated the patch
> and tests to reflect that.
>
> Attached the patch that addresses these changes.

Thank you for the patch. I like the overall idea; it is a useful one
and will make the user experience better. Please find a few comments.

1)
I agree with the idea that instead of introducing a new GUC, we can
make failover logical slots wait on 'synchronous_standby_names', but
this will leave the user no option to unlink failover-enabled logical
subscribers from the wait on synchronous standbys. We provide the user
a way to switch off automatic slot-sync by disabling
'sync_replication_slots', and we also provide a way to manually sync
the slots using the function 'pg_sync_replication_slots()'; nowhere do
we make it mandatory to set 'synchronized_standby_slots'. In fact, in
the docs it is a recommended setting, not a mandatory one. The user
can always create failover slots, switch off automatic slot sync,
disable the wait on standbys by not specifying
'synchronized_standby_slots', and do the slot-sync and consistency
checks manually when needed. I feel that, for worst-case scenarios, we
should provide the user an option to unlink failover-enabled logical
subscriptions from 'synchronous_standby_names'.

We can have 'synchronized_standby_slots' (standby_slot_names) to
accept one such option as in 'SAME_AS_SYNCREP_STANDBYS' or  say
'DEFAULT'. So when 'synchronous_standby_names' is comma separated
list, we pick those slots; if it is empty, then no wait on standbys,
and if its value is 'DEFAULT' as configured by user, then go with
'synchronous_standby_names'. Thoughts?


2)
When 'synchronized_standby_slots' is configured but a standby named in
it is down, blocking logical replication, we get a WARNING in the
subscriber's log file:

WARNING:  replication slot "standby_2" specified in parameter
synchronized_standby_slots does not have active_pid.
DETAIL:  Logical replication is waiting on the standby associated with
"standby_2".
HINT:  Consider starting standby associated with "standby_2" or amend
parameter synchronized_standby_slots.

But OTOH, when 'synchronous_standby_names' is configured instead of
'synchronized_standby_slots' and any of the listed standbys is down,
blocking logical replication, we do not get any sort of warning. This
is inconsistent behavior. Also, the user might be left clueless about
why subscribers are not getting changes.

thanks
Shveta




Re: Conflict detection and logging in logical replication

2024-07-11 Thread shveta malik
On Wed, Jul 10, 2024 at 3:09 PM shveta malik  wrote:
>
> On Wed, Jul 3, 2024 at 8:31 AM Zhijie Hou (Fujitsu)
>  wrote:
> >
> > On Wednesday, June 26, 2024 10:58 AM Zhijie Hou (Fujitsu) 
> >  wrote:
> > >
> >
> > Hi,
> >
> > As suggested by Sawada-san in another thread[1].
> >
> > I am attaching the V4 patch set which tracks the delete_differ
> > conflict in logical replication.
> >
> > delete_differ means that the replicated DELETE is deleting a row
> > that was modified by a different origin.
> >

Thanks for the patch. Please find a few comments for patch002:

1)
Commit msg says: The conflicts will be tracked only when
track_conflict option of the subscription is enabled.

track_conflict --> detect_conflict

2)
monitoring.sgml: Below are my suggestions, please change if you feel apt.

2a) insert_exists_count : Number of times inserting a row that
violates a NOT DEFERRABLE unique constraint while applying changes.
Suggestion: Number of times a row insertion violated a NOT DEFERRABLE
unique constraint while applying changes.

2b) update_differ_count : Number of times updating a row that was
previously modified by another source while applying changes.
Suggestion: Number of times update was performed on a row that was
previously modified by another source while applying changes.

2c) delete_differ_count: Number of times deleting a row that was
previously modified by another source while applying changes.
Suggestion: Number of times delete was performed on a row that was
previously modified by another source while applying changes.

2d) To be consistent, we can change 'is not found' to 'was not found'
in the update_missing_count and delete_missing_count cases as well.


3)
create_subscription.sgml has:
When conflict detection is enabled, additional logging is triggered
and the conflict statistics are collected in the following scenarios:

--Can we rephrase this a little and link the pg_stat_subscription_stats
view here for reference?

4)
IIUC, the conflict_count array (in pgstat.h) maps directly to the
ConflictType enum. So if the order of entries in this enum ever
changes without also changing it in pg_stat_subscription_stats and
pg_proc, we may get wrong values under each column when querying
pg_stat_subscription_stats. If so, then perhaps it is good to add a
comment atop ConflictType saying that if someone changes this order,
the order in the other files needs to be changed too.
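As a sketch of the concern (hypothetical enum values and column names, not the patch's actual ones), the array index and the column list are correct only while they stay in the same order:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical conflict types; the real enum lives in conflict.h. */
enum ConflictType
{
    CT_INSERT_EXISTS,
    CT_UPDATE_DIFFER,
    CT_DELETE_DIFFER,
    CONFLICT_NUM_TYPES
};

/*
 * Column names as exposed to SQL; this list must stay in the same order
 * as ConflictType, otherwise counts silently land under the wrong column.
 */
const char *const conflict_colnames[CONFLICT_NUM_TYPES] = {
    "insert_exists_count",
    "update_differ_count",
    "delete_differ_count",
};

/* Returns the column name a given conflict's count is reported under. */
const char *column_for(enum ConflictType type)
{
    return conflict_colnames[type];
}
```

Nothing in the compiler ties the two orderings together, which is why a warning comment atop the enum is worthwhile.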

5)
conflict.h:CONFLICT_NUM_TYPES

--Shall the macro be CONFLICT_TYPES_NUM  instead?

6)
pgstatsfuncs.c

-
/* conflict count */
for (int i = 0; i < CONFLICT_NUM_TYPES; i++)
values[3 + i] = Int64GetDatum(subentry->conflict_count[i]);

/* stats_reset */
if (subentry->stat_reset_timestamp == 0)
nulls[8] = true;
else
values[8] = TimestampTzGetDatum(subentry->stat_reset_timestamp);
-

After setting values for [3 + i], we abruptly have [8]. I think it
would be better to use i++ to increment the values index, and at the
end we can check that it reached 'PG_STAT_GET_SUBSCRIPTION_STATS_COLS'.

thanks
Shveta




Re: Conflict detection and logging in logical replication

2024-07-10 Thread shveta malik
On Thu, Jul 11, 2024 at 7:47 AM Zhijie Hou (Fujitsu)
 wrote:
>
> On Wednesday, July 10, 2024 5:39 PM shveta malik  
> wrote:
> >
> > On Wed, Jul 3, 2024 at 8:31 AM Zhijie Hou (Fujitsu) 
> > wrote:
> > >
> > > On Wednesday, June 26, 2024 10:58 AM Zhijie Hou (Fujitsu)
> >  wrote:
> > > >
> > >
> > > Hi,
> > >
> > > As suggested by Sawada-san in another thread[1].
> > >
> > > I am attaching the V4 patch set which tracks the delete_differ
> > > conflict in logical replication.
> > >
> > > delete_differ means that the replicated DELETE is deleting a row that
> > > was modified by a different origin.
> > >
> >
> > Thanks for the patch. I am still in process of review but please find few
> > comments:
>
> Thanks for the comments!
>
> > 1) When I try to *insert* primary/unique key on pub, which already exists on
> > sub, conflict gets detected. But when I try to *update* primary/unique key 
> > to a
> > value on pub which already exists on sub, conflict is not detected. I get 
> > the
> > error:
> >
> > 2024-07-10 14:21:09.976 IST [647678] ERROR:  duplicate key value violates
> > unique constraint "t1_pkey"
> > 2024-07-10 14:21:09.976 IST [647678] DETAIL:  Key (pk)=(4) already exists.
>
> Yes, I think the detection of this conflict is not added with the
> intention to control the size of the patch in the first version.
>
> >
> > This is because such conflict detection needs detection of constraint 
> > violation
> > using the *new value* rather than *existing* value during UPDATE. INSERT
> > conflict detection takes care of this case i.e. the columns of incoming row 
> > are
> > considered as new values and it tries to see if all unique indexes are okay 
> > to
> > digest such new values (all incoming columns) but update's logic is 
> > different.
> > It searches based on oldTuple *only* and thus above detection is missing.
>
> I think the logic is the same if we want to detect the unique violation
> for UDPATE, we need to check if the new value of the UPDATE violates any
> unique constraints same as the detection of insert_exists (e.g. check
> the conflict around ExecInsertIndexTuples())
>
> >
> > Shall we support such detection? If not, is it worth docuementing?
>
> I am personally OK to support this detection.

+1. I think it should not be a complex or too big change.

> And
> I think it's already documented that we only detect unique violation for
> insert which mean update conflict is not detected.
>
> > 2)
> > Another case which might confuse user:
> >
> > CREATE TABLE t1 (pk integer primary key, val1 integer, val2 integer);
> >
> > On PUB: insert into t1 values(1,10,10); insert into t1 values(2,20,20);
> >
> > On SUB: update t1 set pk=3 where pk=2;
> >
> > Data on PUB: {1,10,10}, {2,20,20}
> > Data on SUB: {1,10,10}, {3,20,20}
> >
> > Now on PUB: update t1 set val1=200 where val1=20;
> >
> > On Sub, I get this:
> > 2024-07-10 14:44:00.160 IST [648287] LOG:  conflict update_missing detected
> > on relation "public.t1"
> > 2024-07-10 14:44:00.160 IST [648287] DETAIL:  Did not find the row to be
> > updated.
> > 2024-07-10 14:44:00.160 IST [648287] CONTEXT:  processing remote data for
> > replication origin "pg_16389" during message type "UPDATE" for replication
> > target relation "public.t1" in transaction 760, finished at 0/156D658
> >
> > To user, it could be quite confusing, as val1=20 exists on sub but still he 
> > gets
> > update_missing conflict and the 'DETAIL' is not sufficient to give the 
> > clarity. I
> > think on HEAD as well (have not tested), we will get same behavior i.e. 
> > update
> > will be ignored as we make search based on RI (pk in this case). So we are 
> > not
> > worsening the situation, but now since we are detecting conflict, is it 
> > possible
> > to give better details in 'DETAIL' section indicating what is actually 
> > missing?
>
> I think It's doable to report the row value that cannot be found in the local
> relation, but the concern is the potential risk of exposing some
> sensitive data in the log. This may be OK, as we are already reporting the
> key value for constraints violation, so if others also agree, we can add
> the row value in the DETAIL as well.

Okay, let's see what others say. JFYI, the same situation holds for
the delete_missing case as well.

I have one concern about how we deal with conflicts. As for
insert_

Re: Conflict detection and logging in logical replication

2024-07-10 Thread shveta malik
On Wed, Jul 3, 2024 at 8:31 AM Zhijie Hou (Fujitsu)
 wrote:
>
> On Wednesday, June 26, 2024 10:58 AM Zhijie Hou (Fujitsu) 
>  wrote:
> >
>
> Hi,
>
> As suggested by Sawada-san in another thread[1].
>
> I am attaching the V4 patch set which tracks the delete_differ
> conflict in logical replication.
>
> delete_differ means that the replicated DELETE is deleting a row
> that was modified by a different origin.
>

Thanks for the patch. I am still in the process of reviewing it, but
please find a few comments:

1) When I try to *insert* a primary/unique key on the pub which
already exists on the sub, the conflict gets detected. But when I try
to *update* a primary/unique key to a value on the pub which already
exists on the sub, the conflict is not detected. I get this error:

2024-07-10 14:21:09.976 IST [647678] ERROR:  duplicate key value
violates unique constraint "t1_pkey"
2024-07-10 14:21:09.976 IST [647678] DETAIL:  Key (pk)=(4) already exists.

This is because such conflict detection needs to detect the constraint
violation using the *new value* rather than the *existing* value
during UPDATE. INSERT conflict detection takes care of this case, i.e.
the columns of the incoming row are considered new values and it
checks whether all unique indexes can digest such new values (all
incoming columns), but UPDATE's logic is different: it searches based
on the oldTuple *only*, and thus the above detection is missing.

Shall we support such detection? If not, is it worth documenting? It
basically falls in the 'pkey_exists' conflict category, but to the
user it might look like an ordinary update leading to a 'unique key
constraint violation'.


2)
Another case which might confuse user:

CREATE TABLE t1 (pk integer primary key, val1 integer, val2 integer);

On PUB: insert into t1 values(1,10,10); insert into t1 values(2,20,20);

On SUB: update t1 set pk=3 where pk=2;

Data on PUB: {1,10,10}, {2,20,20}
Data on SUB: {1,10,10}, {3,20,20}

Now on PUB: update t1 set val1=200 where val1=20;

On Sub, I get this:
2024-07-10 14:44:00.160 IST [648287] LOG:  conflict update_missing
detected on relation "public.t1"
2024-07-10 14:44:00.160 IST [648287] DETAIL:  Did not find the row to
be updated.
2024-07-10 14:44:00.160 IST [648287] CONTEXT:  processing remote data
for replication origin "pg_16389" during message type "UPDATE" for
replication target relation "public.t1" in transaction 760, finished
at 0/156D658

To the user this could be quite confusing, as val1=20 exists on the
sub but he still gets an update_missing conflict, and the 'DETAIL' is
not sufficient to give clarity. I think on HEAD as well (have not
tested) we will get the same behavior, i.e. the update will be ignored
as we search based on the RI (pk in this case). So we are not
worsening the situation, but now that we are detecting the conflict,
is it possible to give better details in the 'DETAIL' section
indicating what is actually missing?


thanks
Shveta




Re: Conflict Detection and Resolution

2024-07-03 Thread shveta malik
On Wed, Jul 3, 2024 at 4:12 PM Dilip Kumar  wrote:
>
> On Wed, Jul 3, 2024 at 4:02 PM shveta malik  wrote:
> >
> > On Wed, Jul 3, 2024 at 11:29 AM Dilip Kumar  wrote:
> > >
> > > On Wed, Jul 3, 2024 at 11:00 AM shveta malik  
> > > wrote:
> > > >
> > > > > Yes, I also think it should be independent of CDR.  IMHO, it should be
> > > > > based on the user-configured maximum clock skew tolerance and can be
> > > > > independent of CDR.
> > > >
> > > > +1
> > > >
> > > > > IIUC we would make the remote apply wait just
> > > > > before committing if the remote commit timestamp is ahead of the local
> > > > > clock by more than the maximum clock skew tolerance, is that correct?
> > > >
> > > > +1 on condition to wait.
> > > >
> > > > But I think we should make apply worker wait during begin
> > > > (apply_handle_begin) instead of commit. It makes more sense to delay
> > > > the entire operation to manage clock-skew rather than the commit
> > > > alone. And only then CDR's timestamp based resolution which are much
> > > > prior to commit-stage can benefit from this. Thoughts?
> > >
> > > But do we really need to wait at apply_handle_begin()? I mean if we
> > > already know the commit_ts then we can perform the conflict resolution
> > > no?
> >
> > I would like to highlight one point here that the resultant data may
> > be different depending upon at what stage (begin or commit) we
> > conclude to wait. Example:
> >
> > --max_clock_skew set to 0 i.e. no tolerance for clock skew.
> > --Remote Update with commit_timestamp = 10.20AM.
> > --Local clock (which is say 5 min behind) shows = 10.15AM.
> >
> > Case 1: Wait during Begin:
> > When remote update arrives at local node, apply worker waits till
> > local clock hits 'remote's commit_tts - max_clock_skew' i.e. till
> > 10.20 AM. In the meantime (during the wait period of apply worker) if
> > some local update on the same row has happened at say 10.18am (local
> > clock), that will be applied first. Now when apply worker's wait is
> > over, it will detect 'update_differ' conflict and as per
> > 'last_update_win', remote_tuple will win as 10.20 is later than
> > 10.18.
> >
> > Case 2: Wait during Commit:
> > When remote update arrives at local node, it finds no conflict and
> > goes for commit. But before commit, it waits till the local clock hits
> > 10.20 AM. In the meantime (during the wait period of apply worker) if
> > some local update is trying to update the same row say at 10.18, it
> > has to wait (due to locks taken by remote update on that row) and
> > remote tuple will get committed first with commit timestamp of 10.20.
> > Then local update will proceed and will overwrite remote tuple.
> >
> > So in case1, remote tuple is the final change while in case2, local
> > tuple is the final change.
>
> Got it, but which case is correct, I think both.  Because in case-1
> local commit's commit_ts is 10:18 and the remote commit's commit_ts is
> 10:20 so remote apply wins.  And case 2, the remote commit's commit_ts
> is 10:20 whereas the local commit's commit_ts must be 10:20 + delta
> (because it waited for the remote transaction to get committed).
>
> Now say which is better, in case-1 we have to make the remote apply to
> wait at the beginning state without knowing what would be the local
> clock when it actually comes to commit, it may so happen that if we
> choose case-2 by the time the remote transaction finish applying the
> local clock is beyond 10:20 and we do not even need to wait?

yes, agree that the wait time could be somewhat shorter in case 2.
But the wait during commit will make user operations on the same row
wait, without the user having any clue about the concurrent blocking
operation. I am not sure that will be acceptable.

thanks
Shveta




Re: Conflict Detection and Resolution

2024-07-03 Thread shveta malik
On Wed, Jul 3, 2024 at 3:35 PM Amit Kapila  wrote:
>
> On Wed, Jul 3, 2024 at 2:16 PM Dilip Kumar  wrote:
> >
> > On Wed, Jul 3, 2024 at 12:30 PM Amit Kapila  wrote:
> > >
> > > On Wed, Jul 3, 2024 at 11:29 AM Dilip Kumar  wrote:
> >
> > > But waiting after applying the operations and before applying the
> > > commit would mean that we need to wait with the locks held. That could
> > > be a recipe for deadlocks in the system. I see your point related to
> > > performance but as we are not expecting clock skew in normal cases, we
> > > shouldn't be too much bothered on the performance due to this. If
> > > there is clock skew, we expect users to fix it, this is just a
> > > worst-case aid for users.
> >
> > But if we make it wait at the very first operation that means we will
> > not suck more decoded data from the network and wouldn't that make the
> > sender wait for the network buffer to get sucked in by the receiver?
> >
>
> That would be true even if we wait just before applying the commit
> record considering the transaction is small and the wait time is
> large.
>
> > Also, we already have a handling of parallel apply workers so if we do
> > not have an issue of deadlock there or if we can handle those issues
> > there we can do it here as well no?
> >
>
> Parallel apply workers won't wait for a long time. There is some
> similarity and in both cases, deadlock will be detected but chances of
> such implementation-related deadlocks will be higher if we start
> waiting for a random amount of times. The other possibility is that we
> can keep a cap on the max clock skew time above which we will give
> ERROR even if the user has configured wait.

+1. But I think the cap has to be on wait-time. As an example, let's
say the user has configured a 'clock skew tolerance' of 10 sec while
the actual clock skew between nodes is 5 min. It means we will mostly
have to wait '5 min - 10 sec' to bring the clock skew within the
tolerable limit, which is a huge waiting time. We can keep a max limit
on this wait time.
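
A minimal sketch of such a capped wait, with assumed constants standing
in for the GUCs discussed above (max_clock_skew tolerance and a max
wait-time cap; neither name is final):

```python
# Sketch only: error out instead of stalling when the computed wait
# exceeds a configured cap. Times are UNIX seconds.
import time

MAX_CLOCK_SKEW = 10      # assumed: tolerated skew, seconds
MAX_WAIT_TIME = 60       # assumed: cap on how long the worker may sleep

def wait_or_error(remote_commit_ts, now=None):
    """Wait until the local clock reaches remote_commit_ts - MAX_CLOCK_SKEW,
    but raise an error if that wait would exceed MAX_WAIT_TIME."""
    now = time.time() if now is None else now
    wait = (remote_commit_ts - MAX_CLOCK_SKEW) - now
    if wait <= 0:
        return 0.0       # within tolerance, apply immediately
    if wait > MAX_WAIT_TIME:
        raise RuntimeError("clock skew exceeds configured maximum wait")
    time.sleep(wait)
    return wait
```

In the 5-min-skew example the computed wait would be about 290 sec,
above a 60 sec cap, so the worker would raise an error rather than stall
user operations for minutes.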

thanks
Shveta




Re: Conflict Detection and Resolution

2024-07-03 Thread shveta malik
On Wed, Jul 3, 2024 at 11:29 AM Dilip Kumar  wrote:
>
> On Wed, Jul 3, 2024 at 11:00 AM shveta malik  wrote:
> >
> > > Yes, I also think it should be independent of CDR.  IMHO, it should be
> > > based on the user-configured maximum clock skew tolerance and can be
> > > independent of CDR.
> >
> > +1
> >
> > > IIUC we would make the remote apply wait just
> > > before committing if the remote commit timestamp is ahead of the local
> > > clock by more than the maximum clock skew tolerance, is that correct?
> >
> > +1 on condition to wait.
> >
> > But I think we should make apply worker wait during begin
> > (apply_handle_begin) instead of commit. It makes more sense to delay
> > the entire operation to manage clock-skew rather than the commit
> > alone. And only then CDR's timestamp based resolution which are much
> > prior to commit-stage can benefit from this. Thoughts?
>
> But do we really need to wait at apply_handle_begin()? I mean if we
> already know the commit_ts then we can perform the conflict resolution
> no?

I would like to highlight one point here: the resultant data may
differ depending upon the stage (begin or commit) at which we decide
to wait. Example:

--max_clock_skew set to 0 i.e. no tolerance for clock skew.
--Remote Update with commit_timestamp = 10.20AM.
--Local clock (which is say 5 min behind) shows = 10.15AM.

Case 1: Wait during Begin:
When remote update arrives at local node, apply worker waits till
local clock hits 'remote's commit_tts - max_clock_skew' i.e. till
10.20 AM. In the meantime (during the wait period of apply worker) if
some local update on the same row has happened at say 10.18am (local
clock), that will be applied first. Now when apply worker's wait is
over, it will detect 'update_differ' conflict and as per
'last_update_win', remote_tuple will win as 10.20 is later than
10.18.

Case 2: Wait during Commit:
When remote update arrives at local node, it finds no conflict and
goes for commit. But before commit, it waits till the local clock hits
10.20 AM. In the meantime (during the wait period of apply worker) if
some local update is trying to update the same row say at 10.18, it
has to wait (due to locks taken by remote update on that row) and
remote tuple will get committed first with commit timestamp of 10.20.
Then local update will proceed and will overwrite remote tuple.

So in case1, remote tuple is the final change while in case2, local
tuple is the final change.
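
The two outcomes above can be illustrated with a toy model (plain
Python, assuming last_update_wins and the timestamps from the example;
times are minutes after 10:00):

```python
# Toy model of the two wait placements under last_update_wins. The
# remote update has commit_ts 20; a concurrent local update fires at 18
# local time while the apply worker is waiting out the clock skew.

def final_tuple(wait_at):
    row = {"val": "old", "commit_ts": 0}

    def apply(val, ts):
        # last_update_wins: a newer commit_ts overwrites the row
        if ts >= row["commit_ts"]:
            row.update(val=val, commit_ts=ts)

    if wait_at == "begin":
        # worker waits before applying anything; the local update slips
        # in first, then the remote change is applied and wins on ts
        apply("local", 18)
        apply("remote", 20)
    else:  # wait_at == "commit"
        # remote change holds the row lock through its wait; the local
        # update blocks and commits after it (at 20 + delta)
        apply("remote", 20)
        apply("local", 21)
    return row["val"]

print(final_tuple("begin"))   # remote
print(final_tuple("commit"))  # local
```

The same two changes, yet the surviving tuple differs: remote wins when
we wait at begin, local wins when we wait at commit.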

thanks
Shveta




Re: Conflict Detection and Resolution

2024-07-02 Thread shveta malik
On Wed, Jul 3, 2024 at 10:47 AM Dilip Kumar  wrote:
>
> On Tue, Jul 2, 2024 at 2:40 PM shveta malik  wrote:
> >
> > On Wed, Jun 19, 2024 at 1:52 PM Dilip Kumar  wrote:
> > >
> > > On Tue, Jun 18, 2024 at 3:29 PM shveta malik  
> > > wrote:
> > > > On Tue, Jun 18, 2024 at 11:34 AM Dilip Kumar  
> > > > wrote:
> > > >
> > > > I tried to work out a few scenarios with this, where the apply worker
> > > > will wait until its local clock hits 'remote_commit_tts - max_skew
> > > > permitted'. Please have a look.
> > > >
> > > > Let's say, we have a GUC to configure max_clock_skew permitted.
> > > > Resolver is last_update_wins in both cases.
> > > > 
> > > > 1) Case 1: max_clock_skew set to 0 i.e. no tolerance for clock skew.
> > > >
> > > > Remote Update with commit_timestamp = 10.20AM.
> > > > Local clock (which is say 5 min behind) shows = 10.15AM.
> > > >
> > > > When remote update arrives at local node, we see that skew is greater
> > > > than max_clock_skew and thus apply worker waits till local clock hits
> > > > 'remote's commit_tts - max_clock_skew' i.e. till 10.20 AM. Once the
> > > > local clock hits 10.20 AM, the worker applies the remote change with
> > > > commit_tts of 10.20AM. In the meantime (during the wait period of apply
> > > > worker) if some local update on same row has happened at say 10.18am,
> > > > that will be applied first, which will be later overwritten by above
> > > > remote change of 10.20AM as remote-change's timestamp appears
> > > > later, even though it has happened earlier than local change.
> > >
> > > For the sake of simplicity let's call the change that happened at
> > > 10:20 AM change-1 and the change that happened at 10:15 as change-2
> > > and assume we are talking about the synchronous commit only.
> > >
> > > I think now from an application perspective the change-1 wouldn't have
> > > caused the change-2 because we delayed applying change-2 on the local
> > > node which would have delayed the confirmation of the change-1 to the
> > > application that means we have got the change-2 on the local node
> > > without the confirmation of change-1 hence change-2 has no causal
> > > dependency on the change-1.  So it's fine that we perform change-1
> > > before change-2 and the timestamp will also show the same at any other
> > > node if they receive these 2 changes.
> > >
> > > The goal is to ensure that if we define the order where change-2
> > > happens before change-1, this same order should be visible on all
> > > other nodes. This will hold true because the commit timestamp of
> > > change-2 is earlier than that of change-1.
> > >
> > > > 2)  Case 2: max_clock_skew is set to 2min.
> > > >
> > > > Remote Update with commit_timestamp=10.20AM
> > > > Local clock (which is say 5 min behind) = 10.15AM.
> > > >
> > > > Now apply worker will notice skew greater than 2min and thus will wait
> > > > till local clock hits 'remote's commit_tts - max_clock_skew' i.e.
> > > > 10.18 and will apply the change with commit_tts of 10.20 ( as we
> > > > always save the origin's commit timestamp into local commit_tts, see
> > > > RecordTransactionCommit->TransactionTreeSetCommitTsData). Now lets say
> > > > another local update is triggered at 10.19am, it will be applied
> > > > locally but it will be ignored on remote node. On the remote node ,
> > > > the existing change with a timestamp of 10.20 am will win resulting in
> > > > data divergence.
> > >
> > > Let's call the 10:20 AM change as a change-1 and the change that
> > > happened at 10:19 as change-2
> > >
> > > IIUC, although we apply the change-1 at 10:18 AM the commit_ts of that
> > > commit_ts of that change is 10:20, and the same will be visible to all
> > > other nodes.  So in conflict resolution still the change-1 happened
> > > after the change-2 because change-2's commit_ts is 10:19 AM.   Now
> > > there could be a problem with the causal order because we applied the
> > > change-1 at 10:18 AM so the application might have gotten confirmation
> > > at 10:18 AM and the change-2 of the local node may be triggered as a
> > > result of confirmation of the change-1 that means now change-2 has a
> > > causal dependency on the chan

Re: Conflict Detection and Resolution

2024-07-02 Thread shveta malik
On Wed, Jun 19, 2024 at 1:52 PM Dilip Kumar  wrote:
>
> On Tue, Jun 18, 2024 at 3:29 PM shveta malik  wrote:
> > On Tue, Jun 18, 2024 at 11:34 AM Dilip Kumar  wrote:
> >
> > I tried to work out a few scenarios with this, where the apply worker
> > will wait until its local clock hits 'remote_commit_tts - max_skew
> > permitted'. Please have a look.
> >
> > Let's say, we have a GUC to configure max_clock_skew permitted.
> > Resolver is last_update_wins in both cases.
> > 
> > 1) Case 1: max_clock_skew set to 0 i.e. no tolerance for clock skew.
> >
> > Remote Update with commit_timestamp = 10.20AM.
> > Local clock (which is say 5 min behind) shows = 10.15AM.
> >
> > When remote update arrives at local node, we see that skew is greater
> > than max_clock_skew and thus apply worker waits till local clock hits
> > 'remote's commit_tts - max_clock_skew' i.e. till 10.20 AM. Once the
> > local clock hits 10.20 AM, the worker applies the remote change with
> > commit_tts of 10.20AM. In the meantime (during the wait period of apply
> > worker) if some local update on same row has happened at say 10.18am,
> > that will be applied first, which will be later overwritten by above
> > remote change of 10.20AM as remote-change's timestamp appears
> > later, even though it has happened earlier than local change.
>
> For the sake of simplicity let's call the change that happened at
> 10:20 AM change-1 and the change that happened at 10:15 as change-2
> and assume we are talking about the synchronous commit only.
>
> I think now from an application perspective the change-1 wouldn't have
> caused the change-2 because we delayed applying change-2 on the local
> node which would have delayed the confirmation of the change-1 to the
> application that means we have got the change-2 on the local node
> without the confirmation of change-1 hence change-2 has no causal
> dependency on the change-1.  So it's fine that we perform change-1
> before change-2 and the timestamp will also show the same at any other
> node if they receive these 2 changes.
>
> The goal is to ensure that if we define the order where change-2
> happens before change-1, this same order should be visible on all
> other nodes. This will hold true because the commit timestamp of
> change-2 is earlier than that of change-1.
>
> > 2)  Case 2: max_clock_skew is set to 2min.
> >
> > Remote Update with commit_timestamp=10.20AM
> > Local clock (which is say 5 min behind) = 10.15AM.
> >
> > Now apply worker will notice skew greater than 2min and thus will wait
> > till local clock hits 'remote's commit_tts - max_clock_skew' i.e.
> > 10.18 and will apply the change with commit_tts of 10.20 ( as we
> > always save the origin's commit timestamp into local commit_tts, see
> > RecordTransactionCommit->TransactionTreeSetCommitTsData). Now lets say
> > another local update is triggered at 10.19am, it will be applied
> > locally but it will be ignored on remote node. On the remote node ,
> > the existing change with a timestamp of 10.20 am will win resulting in
> > data divergence.
>
> Let's call the 10:20 AM change as a change-1 and the change that
> happened at 10:19 as change-2
>
> IIUC, although we apply the change-1 at 10:18 AM the commit_ts of that
> commit_ts of that change is 10:20, and the same will be visible to all
> other nodes.  So in conflict resolution still the change-1 happened
> after the change-2 because change-2's commit_ts is 10:19 AM.   Now
> there could be a problem with the causal order because we applied the
> change-1 at 10:18 AM so the application might have gotten confirmation
> at 10:18 AM and the change-2 of the local node may be triggered as a
> result of confirmation of the change-1 that means now change-2 has a
> causal dependency on the change-1 but commit_ts shows change-2
> happened before the change-1 on all the nodes.
>
> So, is this acceptable? I think yes because the user has configured a
> maximum clock skew of 2 minutes, which means the detected order might
> not always align with the causal order for transactions occurring
> within that time frame. Generally, the ideal configuration for
> max_clock_skew should be in multiple of the network round trip time.
> Assuming this configuration, we wouldn’t encounter this problem
> because for change-2 to be caused by change-1, the client would need
> to get confirmation of change-1 and then trigger change-2, which would
> take at least 2-3 network round trips.

As we agreed, the subscriber should wait before applying an operation
if the commit timestamp of the currently replayed transaction is in
the future and the differen

Re: Conflict Detection and Resolution

2024-07-01 Thread shveta malik
On Mon, Jul 1, 2024 at 11:47 AM Masahiko Sawada  wrote:
>
> Hi,
>
> On Thu, May 23, 2024 at 3:37 PM shveta malik  wrote:
> >
> > DELETE
> > 
> > Conflict Type:
> > 
> > delete_missing: An incoming delete is trying to delete a row on a
> > target node which does not exist.
>
> IIUC the 'delete_missing' conflict doesn't cover the case where an
> incoming delete message is trying to delete a row that has already
> been updated locally or by another node. I think in update/delete
> conflict situations, we need to resolve the conflicts based on commit
> timestamps like we do for update/update and insert/update conflicts.
>
> For example, suppose there are two node-A and node-B and setup
> bi-directional replication, and suppose further that both have the row
> with id = 1, consider the following sequences:
>
> 09:00:00  DELETE ... WHERE id = 1 on node-A.
> 09:00:05  UPDATE ... WHERE id = 1 on node-B.
> 09:00:10  node-A received the update message from node-B.
> 09:00:15  node-B received the delete message from node-A.
>
> At 09:00:10 on node-A, an update_deleted conflict is generated since
> the row on node-A is already deleted locally. Suppose that we use
> 'apply_or_skip' resolution for this conflict, we convert the update
> message into an insertion, so node-A now has the row with id = 1. At
> 09:00:15 on node-B, the incoming delete message is applied and deletes
> the row with id = 1, even though the row has already been modified
> locally. The node-A and node-B are now inconsistent. This
> inconsistency can be avoided by using 'skip' resolution for the
> 'update_deleted' conflict on node-A, and 'skip' resolution is the
> default method for that actually. However, if we handle it as
> 'update_missing', the 'apply_or_skip' resolution is used by default.
>
> IIUC with the proposed architecture, DELETE always takes precedence
> over UPDATE since both 'update_deleted' and 'update_missing' don't use
> commit timestamps to resolve the conflicts. As long as that is true, I
> think there is no use case for 'apply_or_skip' and 'apply_or_error'
> resolutions in update/delete conflict cases. In short, I think we need
> something like 'delete_differ' conflict type as well.

Thanks for the feedback. Sure, we can have 'delete_differ'.

> FYI PGD and
> Oracle GoldenGate seem to have this conflict type[1][2].
>
> The 'delete_differ' conflict type would have at least
> 'latest_timestamp_wins' resolution. With the timestamp based
> resolution method, we would deal with update/delete conflicts as
> follows:
>
> 09:00:00: DELETE ... WHERE id = 1 on node-A.
> 09:00:05: UPDATE ... WHERE id = 1 on node-B.
> - the updated row doesn't have the origin since it's a local change.
> 09:00:10: node-A received the update message from node-B.
> - the incoming update message has the origin of node-B whereas the
> local row is already removed locally.
> - 'update_deleted' conflict is generated.
> - do the insert of the new row instead, because the commit
> timestamp of UPDATE is newer than DELETE's one.

So, are you suggesting to support latest_timestamp_wins for the
'update_deleted' case? And shall 'latest_timestamp_wins' be the
default then instead of 'skip'? In some cases, the complete row cannot
be constructed, and then 'insertion' might not be possible even if the
timestamp of the 'update' is the latest. Shall we then skip or error
out under the latest_timestamp_wins config?

Even if we support 'latest_timestamp_wins' as default, we can still
have 'apply_or_skip' and 'apply_or_error' as other options for
'update_deleted' case. Or do you suggest getting rid of these options
completely?

> 09:00:15: node-B received the delete message from node-A.
> - the incoming delete message has the origin of node-B whereas the
> (updated) row doesn't have the origin.
> - 'update_differ' conflict is generated.

Here, do you mean 'delete_differ' conflict is generated?

thanks
Shveta




Re: Conflict Detection and Resolution

2024-06-26 Thread shveta malik
On Wed, Jun 26, 2024 at 2:33 PM Amit Kapila  wrote:
>
> On Tue, Jun 25, 2024 at 3:39 PM shveta malik  wrote:
> >
> > On Tue, Jun 25, 2024 at 3:12 PM Amit Kapila  wrote:
> > >
> > > On Mon, Jun 24, 2024 at 1:47 PM shveta malik  
> > > wrote:
> > > >
> > > > On Thu, Jun 20, 2024 at 6:41 PM Amit Kapila  
> > > > wrote:
> > > > >
> > > > > >> In the second patch, we can implement simple built-in resolution
> > > > > >> strategies like apply and skip (which can be named as remote_apply 
> > > > > >> and
> > > > > >> keep_local, see [3][4] for details on these strategies) with ERROR 
> > > > > >> or
> > > > > >> LOG being the default strategy. We can allow these strategies to be
> > > > > >> configured at the global and table level.
> > > >
> > > > Before we implement resolvers, we need a way to configure them. Please
> > > > find the patch002 which attempts to implement Global Level Conflict
> > > > Resolvers Configuration.  Note that patch002 is dependent upon
> > > > Conflict-Detection patch001 which is reviewed in another thread [1].
> > > > I have attached patch001 here for convenience and to avoid CFBot
> > > > failures. But please use [1] if you have any comments on patch001.
> > > >
> > > > New DDL commands in patch002 are:
> > > >
> > > > To set global resolver for given conflict_type:
> > > > SET CONFLICT RESOLVER 'conflict_resolver' FOR 'conflict_type'
> > > >
> > > > To reset to default resolver:
> > > > RESET CONFLICT RESOLVER FOR 'conflict_type'
> > > >
> > >
> > > Does setting up resolvers have any meaning without subscriptions? I am
> > > wondering whether we should allow to set up the resolvers at the
> > > subscription level. One benefit is that users don't need to use a
> > > different DDL to set up resolvers. The first patch gives a conflict
> > > detection option at the subscription level, so it would be symmetrical
> > > to provide a resolver at the subscription level. Yet another benefit
> > > could be that it provides users facility to configure different
> > > resolvers for a set of tables belonging to a particular
> > > publication/node.
> >
> > There can be multiple tables included in a publication with varying
> > business use-cases and thus may need different resolvers set, even
> > though they all are part of the same publication.
> >
>
> Agreed but this is the reason we are planning to keep resolvers at the
> table level. Here, I am asking to set resolvers at the subscription
> level rather than at the global level.

Okay, got it. I had misunderstood earlier that we wanted to replace
table-level resolvers with subscription-level ones.
Global configuration has one benefit: if the user has no requirement
to set different resolvers for different subscriptions or tables, he
may always set one global configuration and be done with it. OTOH, I
also agree with the benefits of subscription-level configuration.

thanks
Shveta




Re: Conflict Detection and Resolution

2024-06-25 Thread shveta malik
On Tue, Jun 25, 2024 at 3:12 PM Amit Kapila  wrote:
>
> On Mon, Jun 24, 2024 at 1:47 PM shveta malik  wrote:
> >
> > On Thu, Jun 20, 2024 at 6:41 PM Amit Kapila  wrote:
> > >
> > > >> In the second patch, we can implement simple built-in resolution
> > > >> strategies like apply and skip (which can be named as remote_apply and
> > > >> keep_local, see [3][4] for details on these strategies) with ERROR or
> > > >> LOG being the default strategy. We can allow these strategies to be
> > > >> configured at the global and table level.
> >
> > Before we implement resolvers, we need a way to configure them. Please
> > find the patch002 which attempts to implement Global Level Conflict
> > Resolvers Configuration.  Note that patch002 is dependent upon
> > Conflict-Detection patch001 which is reviewed in another thread [1].
> > I have attached patch001 here for convenience and to avoid CFBot
> > failures. But please use [1] if you have any comments on patch001.
> >
> > New DDL commands in patch002 are:
> >
> > To set global resolver for given conflict_type:
> > SET CONFLICT RESOLVER 'conflict_resolver' FOR 'conflict_type'
> >
> > To reset to default resolver:
> > RESET CONFLICT RESOLVER FOR 'conflict_type'
> >
>
> Does setting up resolvers have any meaning without subscriptions? I am
> wondering whether we should allow to set up the resolvers at the
> subscription level. One benefit is that users don't need to use a
> different DDL to set up resolvers. The first patch gives a conflict
> detection option at the subscription level, so it would be symmetrical
> to provide a resolver at the subscription level. Yet another benefit
> could be that it provides users facility to configure different
> resolvers for a set of tables belonging to a particular
> publication/node.

There can be multiple tables included in a publication with varying
business use-cases and thus may need different resolvers set, even
though they all are part of the same publication.

> >
> > 
> >
> > As suggested in [2] and above, it seems logical to have table-specific
> > resolvers configuration along with global one.
> >
> > Here is the proposal for table level resolvers:
> >
> > 1) We can provide support for table level resolvers using ALTER TABLE:
> >
> > ALTER TABLE <table_name> SET CONFLICT RESOLVER <resolver> on <conflict_type>,
> >              SET CONFLICT RESOLVER <resolver> on <conflict_type>, ...;
> >
> > Reset can be done using:
> > ALTER TABLE <table_name> RESET CONFLICT RESOLVER on <conflict_type>,
> >              RESET CONFLICT RESOLVER on <conflict_type>, ...;
> >
> > Above commands will save/remove configuration in/from the new system
> > catalog pg_conflict_rel.
> >
> > 2) Table level configuration (if any) will be given preference over
> > global ones. The tables not having table-specific resolvers will use
> > global configured ones.
> >
> > 3) If the table is a partition table, then resolvers created for the
> > parent will be inherited by all child partition tables. Multiple
> > resolver entries will be created, one for each child partition in the
> > system catalog (similar to constraints).
> >
> > 4) Users can also configure explicit resolvers for child partitions.
> > In such a case, child's resolvers will override inherited resolvers
> > (if any).
> >
> > 5) Any attempt to RESET (remove) inherited resolvers on the child
> > partition table *alone* will result in error:  "cannot reset inherited
> > resolvers" (similar to constraints). But RESET of explicit created
> > resolvers (non-inherited ones) will be permitted for child partitions.
> > On RESET, the resolver configuration will not fallback to the
> > inherited resolver again. Users need to explicitly configure new
> > resolvers for the child partition tables (after RESET) if needed.
> >
>
> Why so? If we can allow the RESET command to fallback to the inherited
> resolver it would make the behavior consistent for the child table
> where we don't have performed SET.

The thought behind not falling back is that since the user has done
'RESET', he may want to remove the resolver completely. We don't know
whether he really wants to go back to the previous one. If he does, it
is easy to set it again. But if he does not, and we restore the
inherited resolver during 'RESET', there is no way he can drop that
inherited resolver alone on the child partition.
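
A toy model of the precedence being proposed (illustrative names, not
actual catalog code; it omits the error raised when resetting an
inherited entry alone): parent-level resolvers are materialized into
child partitions, and RESET removes an entry without falling back to
the inherited one, leaving only the global default.

```python
# Sketch of resolver precedence: table entry -> materialized inherited
# entry -> global default. RESET pops the entry with no fallback to the
# parent's resolver.

GLOBAL = {"insert_exists": "error"}     # assumed global default
TABLE = {}                              # table -> {conflict_type: resolver}
PARENT_OF = {"t1_p1": "t1"}             # child partition -> parent table

def set_resolver(table, ctype, resolver):
    TABLE.setdefault(table, {})[ctype] = resolver
    # inheritance is materialized: setting on a parent copies the entry
    # to each child partition (similar to constraints)
    for child, parent in PARENT_OF.items():
        if parent == table:
            TABLE.setdefault(child, {})[ctype] = resolver

def reset_resolver(table, ctype):
    # no fallback to the inherited resolver; lookup now hits the global
    TABLE.get(table, {}).pop(ctype, None)

def resolver_for(table, ctype):
    return TABLE.get(table, {}).get(ctype, GLOBAL.get(ctype))

set_resolver("t1", "insert_exists", "keep_local")
print(resolver_for("t1_p1", "insert_exists"))   # keep_local (inherited copy)
reset_resolver("t1_p1", "insert_exists")
print(resolver_for("t1_p1", "insert_exists"))   # error (global default)
```

After the RESET on the child, the lookup lands on the global default
rather than the parent's resolver, matching the no-fallback semantics
described above.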

> > 6) Removal/Reset of resolvers on parent will remove corresponding
> > "inherited" resolvers on all the child partitions 

Re: Conflict detection and logging in logical replication

2024-06-24 Thread shveta malik
On Mon, Jun 24, 2024 at 7:39 AM Zhijie Hou (Fujitsu)
 wrote:
>
> When testing the patch, I noticed a bug that when reporting the conflict
> after calling ExecInsertIndexTuples(), we might find the tuple that we
> just inserted and report it.(we should only report conflict if there are
> other conflict tuples which are not inserted by us) Here is a new patch
> which fixed this and fixed a compile warning reported by CFbot.
>

Thanks for the patch. Few comments:

1) Few typos:
Commit msg of patch001: iolates--> violates
execIndexing.c:  ingored --> ignored

2) Commit msg of stats patch: "The commit adds columns in view
pg_stat_subscription_workers to shows"
--"pg_stat_subscription_workers" --> "pg_stat_subscription_stats"

3) I feel chapter '31.5. Conflicts' in the docs should also mention
detection or point to the page where it is already mentioned.

thanks
Shveta




Re: Conflict Detection and Resolution

2024-06-19 Thread shveta malik
On Wed, Jun 19, 2024 at 1:52 PM Dilip Kumar  wrote:
>
> On Tue, Jun 18, 2024 at 3:29 PM shveta malik  wrote:
> > On Tue, Jun 18, 2024 at 11:34 AM Dilip Kumar  wrote:
> >
> > I tried to work out a few scenarios with this, where the apply worker
> > will wait until its local clock hits 'remote_commit_tts - max_skew
> > permitted'. Please have a look.
> >
> > Let's say, we have a GUC to configure max_clock_skew permitted.
> > Resolver is last_update_wins in both cases.
> > 
> > 1) Case 1: max_clock_skew set to 0 i.e. no tolerance for clock skew.
> >
> > Remote Update with commit_timestamp = 10.20AM.
> > Local clock (which is say 5 min behind) shows = 10.15AM.
> >
> > When remote update arrives at local node, we see that skew is greater
> > than max_clock_skew and thus apply worker waits till local clock hits
> > 'remote's commit_tts - max_clock_skew' i.e. till 10.20 AM. Once the
> > local clock hits 10.20 AM, the worker applies the remote change with
> > commit_tts of 10.20AM. In the meantime (during the wait period of apply
> > worker) if some local update on same row has happened at say 10.18am,
> > that will be applied first, which will be later overwritten by above
> > remote change of 10.20AM as remote-change's timestamp appears
> > later, even though it has happened earlier than local change.
>
> For the sake of simplicity let's call the change that happened at
> 10:20 AM change-1 and the change that happened at 10:15 as change-2
> and assume we are talking about the synchronous commit only.

Do you mean "the change that happened at 10:18 as change-2"

>
> I think now from an application perspective the change-1 wouldn't have
> caused the change-2 because we delayed applying change-2 on the local
> node

Do you mean "we delayed applying change-1 on the local node."

>which would have delayed the confirmation of the change-1 to the
> application that means we have got the change-2 on the local node
> without the confirmation of change-1 hence change-2 has no causal
> dependency on the change-1.  So it's fine that we perform change-1
> before change-2

Do you mean "So it's fine that we perform change-2 before change-1"

>and the timestamp will also show the same at any other
> node if they receive these 2 changes.
>
> The goal is to ensure that if we define the order where change-2
> happens before change-1, this same order should be visible on all
> other nodes. This will hold true because the commit timestamp of
> change-2 is earlier than that of change-1.

Considering the above corrections as base, I agree with this.

> > 2)  Case 2: max_clock_skew is set to 2min.
> >
> > Remote Update with commit_timestamp=10.20AM
> > Local clock (which is say 5 min behind) = 10.15AM.
> >
> > Now apply worker will notice skew greater than 2min and thus will wait
> > till local clock hits 'remote's commit_tts - max_clock_skew' i.e.
> > 10.18 and will apply the change with commit_tts of 10.20 ( as we
> > always save the origin's commit timestamp into local commit_tts, see
> > RecordTransactionCommit->TransactionTreeSetCommitTsData). Now lets say
> > another local update is triggered at 10.19am, it will be applied
> > locally but it will be ignored on remote node. On the remote node ,
> > the existing change with a timestamp of 10.20 am will win resulting in
> > data divergence.
>
> Let's call the 10:20 AM change as a change-1 and the change that
> happened at 10:19 as change-2
>
> IIUC, although we apply the change-1 at 10:18 AM, the commit_ts of
> that change is 10:20, and the same will be visible to all
> other nodes.  So in conflict resolution still the change-1 happened
> after the change-2 because change-2's commit_ts is 10:19 AM.   Now
> there could be a problem with the causal order because we applied the
> change-1 at 10:18 AM so the application might have gotten confirmation
> at 10:18 AM and the change-2 of the local node may be triggered as a
> result of confirmation of the change-1 that means now change-2 has a
> causal dependency on the change-1 but commit_ts shows change-2
> happened before the change-1 on all the nodes.
>
> So, is this acceptable? I think yes because the user has configured a
> maximum clock skew of 2 minutes, which means the detected order might
> not always align with the causal order for transactions occurring
> within that time frame.

Agree. I had the same thoughts, and wanted to confirm my understanding.

>Generally, the ideal configuration for
> max_clock_skew should be in multiple of the network round trip time.
> Assuming this configuration, we wouldn’t encounter this problem
> because for change-2 to be caused by change-1, the client would need
> to get confirmation of change-1 and then trigger change-2, which would
> take at least 2-3 network round trips.


thanks
Shveta




Re: Conflict Detection and Resolution

2024-06-18 Thread shveta malik
On Tue, Jun 18, 2024 at 11:34 AM Dilip Kumar  wrote:
>
> On Tue, Jun 18, 2024 at 10:17 AM Dilip Kumar  wrote:
> >
> > On Mon, Jun 17, 2024 at 3:23 PM Amit Kapila  wrote:
> > >
> > > On Wed, Jun 12, 2024 at 10:03 AM Dilip Kumar  
> > > wrote:
> > > >
> > > > On Tue, Jun 11, 2024 at 7:44 PM Tomas Vondra
> > > >  wrote:
> > > >
> > > > > > Yes, that's correct. However, many cases could benefit from the
> > > > > > update_deleted conflict type if it can be implemented reliably. 
> > > > > > That's
> > > > > > why we wanted to give it a try. But if we can't achieve predictable
> > > > > > results with it, I'm fine to drop this approach and conflict_type. 
> > > > > > We
> > > > > > can consider a better design in the future that doesn't depend on
> > > > > > non-vacuumed entries and provides a more robust method for 
> > > > > > identifying
> > > > > > deleted rows.
> > > > > >
> > > > >
> > > > > I agree having a separate update_deleted conflict would be beneficial,
> > > > > I'm not arguing against that - my point is actually that I think this
> > > > > conflict type is required, and that it needs to be detected reliably.
> > > > >
> > > >
> > > > When working with a distributed system, we must accept some form of
> > > > eventual consistency model. However, it's essential to design a
> > > > predictable and acceptable behavior. For example, if a change is a
> > > > result of a previous operation (such as an update on node B triggered
> > > > after observing an operation on node A), we can say that the operation
> > > > on node A happened before the operation on node B. Conversely, if
> > > > operations on nodes A and B are independent, we consider them
> > > > concurrent.
> > > >
> > > > In distributed systems, clock skew is a known issue. To establish a
> > > > consistency model, we need to ensure it guarantees the
> > > > "happens-before" relationship. Consider a scenario with three nodes:
> > > > NodeA, NodeB, and NodeC. If NodeA sends changes to NodeB, and
> > > > subsequently NodeB makes changes, and then both NodeA's and NodeB's
> > > > changes are sent to NodeC, the clock skew might make NodeB's changes
> > > > appear to have occurred before NodeA's changes. However, we should
> > > > maintain data that indicates NodeB's changes were triggered after
> > > > NodeA's changes arrived at NodeB. This implies that logically, NodeB's
> > > > changes happened after NodeA's changes, despite what the timestamps
> > > > suggest.
> > > >
> > > > A common method to handle such cases is using vector clocks for
> > > > conflict resolution.
> > > >
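(For reference, the vector-clock idea mentioned above can be sketched in a few lines of Python; purely illustrative, not a proposal for the patch:)

```python
def vc_merge(a, b):
    # element-wise max of two vector clocks
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}

def happens_before(a, b):
    # a happened-before b iff a <= b element-wise and a != b
    keys = set(a) | set(b)
    return a != b and all(a.get(n, 0) <= b.get(n, 0) for n in keys)

# NodeB changes a row only after receiving NodeA's change, so NodeB's
# clock dominates NodeA's regardless of wall-clock skew:
vc_a = {"A": 1}
vc_b = vc_merge(vc_a, {"B": 0})
vc_b["B"] += 1                  # NodeB's own event -> {"A": 1, "B": 1}
```

The drawback noted below is visible here too: every event carries one counter per node, so the clock grows with the number of nodes.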
> > >
> > > I think the unbounded size of the vector could be a problem to store
> > > for each event. However, while researching previous discussions, it
> > > came to our notice that we have discussed this topic in the past as
> > > well in the context of standbys. For recovery_min_apply_delay, we
> > > decided the clock skew is not a problem as the settings of this
> > > parameter are much larger than typical time deviations between servers
> > > as mentioned in docs. Similarly for casual reads [1], there was a
> > > proposal to introduce max_clock_skew parameter and suggesting the user
> > > to make sure to have NTP set up correctly. We have tried to check
> > > other databases (like Ora and BDR) where CDR is implemented but didn't
> > > find anything specific to clock skew. So, I propose to go with a GUC
> > > like max_clock_skew such that if the difference of time between the
> > > incoming transaction's commit time and the local time is more than
> > > max_clock_skew then we raise an ERROR. It is not clear to me that
> > > putting bigger effort into clock skew is worth especially when other
> > > systems providing CDR feature (like Ora or BDR) for decades have not
> > > done anything like vector clocks. It is possible that this is less of
> > > a problem w.r.t CDR and just detecting the anomaly in clock skew is
> > > good enough.
> >
> > I believe that if we've accepted this solution elsewhere, then we can
> > also consider the same. Basically, we're allowing the application to
> > set its tolerance for clock skew. And, if the skew exceeds that
> > tolerance, it's the application's responsibility to synchronize;
> > otherwise, an error will occur. This approach seems reasonable.
>
> This model can be further extended by making the apply worker wait if
> the remote transaction's commit_ts is greater than the local
> timestamp. This ensures that no local transactions occurring after the
> remote transaction appear to have happened earlier due to clock skew
> instead we make them happen before the remote transaction by delaying
> the remote transaction apply.  Essentially, by having the remote
> application wait until the local timestamp matches the remote
> transaction's timestamp, we ensure that the remote transaction, which
> seems to occur after concurrent local transactions due to clock skew,
> is actually applied after those transactions.
>
> With this model, 
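(As an aside, the reason both apply orders converge under this model is that resolution keys on the preserved origin commit timestamp, not on the order in which changes arrive. A toy Python sketch, purely illustrative:)

```python
def apply_change(row, change):
    # Last-write-wins keyed on the origin's commit timestamp, which the
    # apply worker preserves locally (as noted above, via
    # RecordTransactionCommit->TransactionTreeSetCommitTsData).
    if row is None or change["commit_ts"] >= row["commit_ts"]:
        return {"value": change["value"], "commit_ts": change["commit_ts"]}
    return row

remote_chg = {"value": "remote", "commit_ts": 1020}  # committed at 10:20
local_chg  = {"value": "local",  "commit_ts": 1019}  # committed at 10:19
final_ab = apply_change(apply_change(None, remote_chg), local_chg)
final_ba = apply_change(apply_change(None, local_chg), remote_chg)
# Either order yields the 10:20 change, on every node.
```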

Re: Conflict Detection and Resolution

2024-06-12 Thread shveta malik
On Tue, Jun 11, 2024 at 7:44 PM Tomas Vondra
 wrote:
>
>
>
> On 6/11/24 10:35, shveta malik wrote:
> > On Mon, Jun 10, 2024 at 5:24 PM Tomas Vondra
> >  wrote:
> >>
> >>
> >>
> >> On 6/10/24 12:56, shveta malik wrote:
> >>> On Fri, Jun 7, 2024 at 6:08 PM Tomas Vondra
> >>>  wrote:
> >>>>
> >>>>>>>
> >>>>>>>  UPDATE
> >>>>>>> 
> >>>>>>>
> >>>>>>> Conflict Detection Method:
> >>>>>>> 
> >>>>>>> Origin conflict detection: The ‘origin’ info is used to detect
> >>>>>>> conflict which can be obtained from commit-timestamp generated for
> >>>>>>> incoming txn at the source node. To compare remote’s origin with the
> >>>>>>> local’s origin, we must have origin information for local txns as well
> >>>>>>> which can be obtained from commit-timestamp after enabling
> >>>>>>> ‘track_commit_timestamp’ locally.
> >>>>>>> The one drawback here is the ‘origin’ information cannot be obtained
> >>>>>>> once the row is frozen and the commit-timestamp info is removed by
> >>>>>>> vacuum. For a frozen row, conflicts cannot be raised, and thus the
> >>>>>>> incoming changes will be applied in all the cases.
> >>>>>>>
> >>>>>>> Conflict Types:
> >>>>>>> 
> >>>>>>> a) update_differ: The origin of an incoming update's key row differs
> >>>>>>> from the local row i.e.; the row has already been updated locally or
> >>>>>>> by different nodes.
> >>>>>>> b) update_missing: The row with the same value as that incoming
> >>>>>>> update's key does not exist. Remote is trying to update a row which
> >>>>>>> does not exist locally.
> >>>>>>> c) update_deleted: The row with the same value as that incoming
> >>>>>>> update's key does not exist. The row is already deleted. This conflict
> >>>>>>> type is generated only if the deleted row is still detectable i.e., it
> >>>>>>> is not removed by VACUUM yet. If the row is removed by VACUUM already,
> >>>>>>> it cannot detect this conflict. It will detect it as update_missing
> >>>>>>> and will follow the default or configured resolver of update_missing
> >>>>>>> itself.
> >>>>>>>
> >>>>>>
> >>>>>> I don't understand the why should update_missing or update_deleted be
> >>>>>> different, especially considering it's not detected reliably. And also
> >>>>>> that even if we happen to find the row the associated TOAST data may
> >>>>>> have already been removed. So why would this matter?
> >>>>>
> >>>>> Here, we are trying to tackle the case where the row is 'recently'
> >>>>> deleted i.e. concurrent UPDATE and DELETE on pub and sub. User may
> >>>>> want to opt for a different resolution in such a case as against the
> >>>>> one where the corresponding row was not even present in the first
> >>>>> place. The case where the row was deleted long back may not fall into
> >>>>> this category as there are higher chances that they have been removed
> >>>>> by vacuum and can be considered equivalent to the update_ missing
> >>>>> case.
> >>>>>
> >>>>
> >>>> My point is that if we can't detect the difference reliably, it's not
> >>>> very useful. Consider this example:
> >>>>
> >>>> Node A:
> >>>>
> >>>>   T1: INSERT INTO t (id, value) VALUES (1,1);
> >>>>
> >>>>   T2: DELETE FROM t WHERE id = 1;
> >>>>
> >>>> Node B:
> >>>>
> >>>>   T3: UPDATE t SET value = 2 WHERE id = 1;
> >>>>
> >>>> The "correct" order of received messages on a third node is T1-T3-T2.
> >>>> But we may also see T1-T2-T3 and T3-T1-T2, e.g. due to network issues
> >>>> and so on. For T1-T2-T3 the right decision is to discard the update,
> >>>> while for T3-T1-T2 it's to either wait for the INSERT or wait for the
> >>>> insert to arrive.

Re: Conflict Detection and Resolution

2024-06-11 Thread shveta malik
On Mon, Jun 10, 2024 at 5:24 PM Tomas Vondra
 wrote:
>
>
>
> On 6/10/24 12:56, shveta malik wrote:
> > On Fri, Jun 7, 2024 at 6:08 PM Tomas Vondra
> >  wrote:
> >>
> >>>>>
> >>>>>  UPDATE
> >>>>> 
> >>>>>
> >>>>> Conflict Detection Method:
> >>>>> 
> >>>>> Origin conflict detection: The ‘origin’ info is used to detect
> >>>>> conflict which can be obtained from commit-timestamp generated for
> >>>>> incoming txn at the source node. To compare remote’s origin with the
> >>>>> local’s origin, we must have origin information for local txns as well
> >>>>> which can be obtained from commit-timestamp after enabling
> >>>>> ‘track_commit_timestamp’ locally.
> >>>>> The one drawback here is the ‘origin’ information cannot be obtained
> >>>>> once the row is frozen and the commit-timestamp info is removed by
> >>>>> vacuum. For a frozen row, conflicts cannot be raised, and thus the
> >>>>> incoming changes will be applied in all the cases.
> >>>>>
> >>>>> Conflict Types:
> >>>>> 
> >>>>> a) update_differ: The origin of an incoming update's key row differs
> >>>>> from the local row i.e.; the row has already been updated locally or
> >>>>> by different nodes.
> >>>>> b) update_missing: The row with the same value as that incoming
> >>>>> update's key does not exist. Remote is trying to update a row which
> >>>>> does not exist locally.
> >>>>> c) update_deleted: The row with the same value as that incoming
> >>>>> update's key does not exist. The row is already deleted. This conflict
> >>>>> type is generated only if the deleted row is still detectable i.e., it
> >>>>> is not removed by VACUUM yet. If the row is removed by VACUUM already,
> >>>>> it cannot detect this conflict. It will detect it as update_missing
> >>>>> and will follow the default or configured resolver of update_missing
> >>>>> itself.
> >>>>>
> >>>>
> >>>> I don't understand the why should update_missing or update_deleted be
> >>>> different, especially considering it's not detected reliably. And also
> >>>> that even if we happen to find the row the associated TOAST data may
> >>>> have already been removed. So why would this matter?
> >>>
> >>> Here, we are trying to tackle the case where the row is 'recently'
> >>> deleted i.e. concurrent UPDATE and DELETE on pub and sub. User may
> >>> want to opt for a different resolution in such a case as against the
> >>> one where the corresponding row was not even present in the first
> >>> place. The case where the row was deleted long back may not fall into
> >>> this category as there are higher chances that they have been removed
> >>> by vacuum and can be considered equivalent to the update_ missing
> >>> case.
> >>>
> >>
> >> My point is that if we can't detect the difference reliably, it's not
> >> very useful. Consider this example:
> >>
> >> Node A:
> >>
> >>   T1: INSERT INTO t (id, value) VALUES (1,1);
> >>
> >>   T2: DELETE FROM t WHERE id = 1;
> >>
> >> Node B:
> >>
> >>   T3: UPDATE t SET value = 2 WHERE id = 1;
> >>
> >> The "correct" order of received messages on a third node is T1-T3-T2.
> >> But we may also see T1-T2-T3 and T3-T1-T2, e.g. due to network issues
> >> and so on. For T1-T2-T3 the right decision is to discard the update,
> >> while for T3-T1-T2 it's to either wait for the INSERT or wait for the
> >> insert to arrive.
> >>
> >> But if we misdetect the situation, we either end up with a row that
> >> shouldn't be there, or losing an update.
> >
> > Doesn't the above example indicate that 'update_deleted' should also
> > be considered a necessary conflict type?  Please see the possibilities
> > of conflicts in all three cases:
> >
> >
> > The "correct" order of receiving messages on node C (as suggested
> > above) is T1-T3-T2   (case1)
> > --
> > T1 will insert the row.
> > T3 will have update_differ conflict; latest_timestamp_wins or apply
> > will apply it. earliest_timestamp_wins or skip will skip it.

Re: Conflict Detection and Resolution

2024-06-10 Thread shveta malik
On Fri, Jun 7, 2024 at 6:10 PM Tomas Vondra
 wrote:
>
> >>> I don't understand the why should update_missing or update_deleted be
> >>> different, especially considering it's not detected reliably. And also
> >>> that even if we happen to find the row the associated TOAST data may
> >>> have already been removed. So why would this matter?
> >>
> >> Here, we are trying to tackle the case where the row is 'recently'
> >> deleted i.e. concurrent UPDATE and DELETE on pub and sub. User may
> >> want to opt for a different resolution in such a case as against the
> >> one where the corresponding row was not even present in the first
> >> place. The case where the row was deleted long back may not fall into
> >> this category as there are higher chances that they have been removed
> >> by vacuum and can be considered equivalent to the update_ missing
> >> case.
> >>
> >> Regarding "TOAST column" for deleted row cases, we may need to dig
> >> more. Thanks for bringing this case. Let me analyze more here.
> >>
> > I tested a simple case with a table with one TOAST column and found
> > that when a tuple with a TOAST column is deleted, both the tuple and
> > corresponding pg_toast entries are marked as ‘deleted’ (dead) but not
> > removed immediately. The main tuple and respective pg_toast entry are
> > permanently deleted only during vacuum. First, the main table’s dead
> > tuples are vacuumed, followed by the secondary TOAST relation ones (if
> > available).
> > Please let us know if you have a specific scenario in mind where the
> > TOAST column data is deleted immediately upon ‘delete’ operation,
> > rather than during vacuum, which we are missing.
> >
>
> I'm pretty sure you can vacuum the TOAST table directly, which means
> you'll end up with a deleted tuple with TOAST pointers, but with the
> TOAST entries already gone.
>

It is true that for a deleted row, its TOAST entries can be vacuumed
earlier than the original/parent row, but we do not need to be
concerned about that to raise 'update_deleted'. To raise an
'update_deleted' conflict, it is sufficient to know that the row has
been deleted and not yet vacuumed, regardless of the presence or
absence of its TOAST entries. Once this is determined, we need to
build the tuple from the remote data and apply it (provided the
resolver favors applying). If the tuple cannot be fully constructed
from the remote data, the apply operation will either be skipped or an
error will be raised, depending on whether the user has chosen the
apply_or_skip or apply_or_error option.

In cases where the table has TOAST columns but the remote data does
not include the TOAST-column entry (when the TOAST column is
unmodified and not part of the replica identity), the resolution for
'update_deleted' will be no worse than for 'update_missing'. That is,
in both cases we cannot construct the full tuple, and thus the
operation either needs to be skipped or an error needs to be raised.
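A sketch of that decision (Python, hypothetical names; the real code would work on tuples rather than column-name sets):

```python
def resolve_update_deleted(remote_cols, all_cols, resolver):
    # A full tuple can be built only if the remote change carries every
    # column; unchanged TOAST columns outside the replica identity are
    # not sent, leaving a gap we cannot fill locally (row is deleted).
    if set(all_cols) <= set(remote_cols):
        return "apply"
    if resolver == "apply_or_skip":
        return "skip"
    # apply_or_error
    raise ValueError("update_deleted: cannot construct full tuple from remote data")
```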

thanks
Shveta




Re: Conflict Detection and Resolution

2024-06-10 Thread shveta malik
On Fri, Jun 7, 2024 at 6:08 PM Tomas Vondra
 wrote:
>
> >>>
> >>>  UPDATE
> >>> 
> >>>
> >>> Conflict Detection Method:
> >>> 
> >>> Origin conflict detection: The ‘origin’ info is used to detect
> >>> conflict which can be obtained from commit-timestamp generated for
> >>> incoming txn at the source node. To compare remote’s origin with the
> >>> local’s origin, we must have origin information for local txns as well
> >>> which can be obtained from commit-timestamp after enabling
> >>> ‘track_commit_timestamp’ locally.
> >>> The one drawback here is the ‘origin’ information cannot be obtained
> >>> once the row is frozen and the commit-timestamp info is removed by
> >>> vacuum. For a frozen row, conflicts cannot be raised, and thus the
> >>> incoming changes will be applied in all the cases.
> >>>
> >>> Conflict Types:
> >>> 
> >>> a) update_differ: The origin of an incoming update's key row differs
> >>> from the local row i.e.; the row has already been updated locally or
> >>> by different nodes.
> >>> b) update_missing: The row with the same value as that incoming
> >>> update's key does not exist. Remote is trying to update a row which
> >>> does not exist locally.
> >>> c) update_deleted: The row with the same value as that incoming
> >>> update's key does not exist. The row is already deleted. This conflict
> >>> type is generated only if the deleted row is still detectable i.e., it
> >>> is not removed by VACUUM yet. If the row is removed by VACUUM already,
> >>> it cannot detect this conflict. It will detect it as update_missing
> >>> and will follow the default or configured resolver of update_missing
> >>> itself.
> >>>
> >>
> >> I don't understand the why should update_missing or update_deleted be
> >> different, especially considering it's not detected reliably. And also
> >> that even if we happen to find the row the associated TOAST data may
> >> have already been removed. So why would this matter?
> >
> > Here, we are trying to tackle the case where the row is 'recently'
> > deleted i.e. concurrent UPDATE and DELETE on pub and sub. User may
> > want to opt for a different resolution in such a case as against the
> > one where the corresponding row was not even present in the first
> > place. The case where the row was deleted long back may not fall into
> > this category as there are higher chances that they have been removed
> > by vacuum and can be considered equivalent to the update_ missing
> > case.
> >
>
> My point is that if we can't detect the difference reliably, it's not
> very useful. Consider this example:
>
> Node A:
>
>   T1: INSERT INTO t (id, value) VALUES (1,1);
>
>   T2: DELETE FROM t WHERE id = 1;
>
> Node B:
>
>   T3: UPDATE t SET value = 2 WHERE id = 1;
>
> The "correct" order of received messages on a third node is T1-T3-T2.
> But we may also see T1-T2-T3 and T3-T1-T2, e.g. due to network issues
> and so on. For T1-T2-T3 the right decision is to discard the update,
> while for T3-T1-T2 it's to either wait for the INSERT or wait for the
> insert to arrive.
>
> But if we misdetect the situation, we either end up with a row that
> shouldn't be there, or losing an update.

Doesn't the above example indicate that 'update_deleted' should also
be considered a necessary conflict type?  Please see the possibilities
of conflicts in all three cases:


The "correct" order of receiving messages on node C (as suggested
above) is T1-T3-T2   (case1)
--
T1 will insert the row.
T3 will have update_differ conflict; latest_timestamp_wins or apply
will apply it. earliest_timestamp_wins or skip will skip it.
T2 will delete the row (irrespective of whether the update happened or not).
End Result: No Data.

T1-T2-T3
--
T1 will insert the row.
T2 will delete the row.
T3 will have an update_deleted conflict. For 'update_deleted', the
resolver configured is likely 'skip' (the default is also 'skip' in
this case).

If vacuum has removed that row (or if we don't support the
'update_deleted' conflict), it will be detected as an 'update_missing'
conflict. In that case, the user may end up inserting the row if the
chosen resolver favors applying (which seems an obvious choice for the
'update_missing' conflict; the default is also 'apply_or_skip').

End result:
Row inserted with 'update_missing'.
Row correctly skipped with 'update_deleted' (assuming the obvious
choice is 'skip' for the update_deleted case).

So it seems that with the 'update_deleted' conflict, there is a higher
chance of making the right decision here (which is to discard the
update), as 'update_deleted' conveys the correct info to the user. The
'update_missing' conflict, OTOH, does not convey the correct info, and
the user may end up inserting the data by choosing apply-favoring
resolvers for 'update_missing'. Again, we get the benefit of
'update_deleted' only for *recently* deleted rows.

T3-T1-T2
--
T3 may end up inserting the record if the resolver is in favor of
'apply' and all 
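To make the orderings above concrete, here is a toy Python simulation (illustrative only: table state as a dict, a 'dead' set standing in for deleted-but-not-yet-vacuumed tuples, and resolvers hard-coded to the defaults discussed):

```python
def run(ops, detect_deleted):
    # Toy apply loop. Resolver choices: update_differ -> apply,
    # update_deleted -> skip, update_missing -> apply_or_skip (inserts).
    table, dead = {}, set()
    for op, rid, val in ops:
        if op == "insert":
            table[rid] = val
        elif op == "delete":
            table.pop(rid, None)
            dead.add(rid)                 # dead but not yet vacuumed
        elif op == "update":
            if rid in table:
                table[rid] = val          # update_differ: apply
            elif rid in dead and detect_deleted:
                pass                      # update_deleted: discard the update
            else:
                table[rid] = val          # update_missing: row gets inserted
    return table

T1 = ("insert", 1, 1); T2 = ("delete", 1, None); T3 = ("update", 1, 2)
case1 = run([T1, T3, T2], True)           # correct order: ends with no data
case2 = run([T1, T2, T3], True)           # update correctly skipped
case2_vacuumed = run([T1, T2, T3], False) # row wrongly resurrected as {1: 2}
```

This is exactly the divergence described above: without update_deleted detection, the T1-T2-T3 order ends with a row that the correct order would never have produced.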

Re: Conflict Detection and Resolution

2024-06-04 Thread shveta malik
On Tue, Jun 4, 2024 at 9:37 AM Amit Kapila  wrote:
>
> > >
> > > >
> > > > Conflict Resolution
> > > > 
> > > > a) latest_timestamp_wins:The change with later commit timestamp 
> > > > wins.
> > > > b) earliest_timestamp_wins:   The change with earlier commit timestamp 
> > > > wins.
>
> Can you share the use case of "earliest_timestamp_wins" resolution
> method? It seems after the initial update on the local node, it will
> never allow remote update to succeed which sounds a bit odd. Jan has
> shared this and similar concerns about this resolution method, so I
> have added him to the email as well.

I do not have an exact scenario for this. But I feel that if two
nodes are concurrently inserting different data against the same
primary key, some users may prefer to retain the row that was inserted
earlier. In that sense it is no different from latest_timestamp_wins:
it entirely depends on the kind of application and requirements the
user has, based on which they may discard the later-arriving rows
(especially in the INSERT case).

> > > > Conflict Types:
> > > > 
> > > > a) update_differ: The origin of an incoming update's key row differs
> > > > from the local row i.e.; the row has already been updated locally or
> > > > by different nodes.
> > > > b) update_missing: The row with the same value as that incoming
> > > > update's key does not exist. Remote is trying to update a row which
> > > > does not exist locally.
> > > > c) update_deleted: The row with the same value as that incoming
> > > > update's key does not exist. The row is already deleted. This conflict
> > > > type is generated only if the deleted row is still detectable i.e., it
> > > > is not removed by VACUUM yet. If the row is removed by VACUUM already,
> > > > it cannot detect this conflict. It will detect it as update_missing
> > > > and will follow the default or configured resolver of update_missing
> > > > itself.
> > > >
> > >
> > > I don't understand the why should update_missing or update_deleted be
> > > different, especially considering it's not detected reliably. And also
> > > that even if we happen to find the row the associated TOAST data may
> > > have already been removed. So why would this matter?
> >
> > Here, we are trying to tackle the case where the row is 'recently'
> > deleted i.e. concurrent UPDATE and DELETE on pub and sub. User may
> > want to opt for a different resolution in such a case as against the
> > one where the corresponding row was not even present in the first
> > place. The case where the row was deleted long back may not fall into
> > this category as there are higher chances that they have been removed
> > by vacuum and can be considered equivalent to the update_ missing
> > case.
> >
>
> I think to make 'update_deleted' work, we need another scan with a
> different snapshot type to find the recently deleted row. I don't know
> if it is a good idea to scan the index twice with different snapshots,
> so for the sake of simplicity, can we consider 'updated_deleted' same
> as 'update_missing'? If we think it is an important case to consider
> then we can try to accomplish this once we finalize the
> design/implementation of other resolution methods.

I think it is important for scenarios where data is being updated and
deleted concurrently. But yes, I agree that the implementation may
incur a performance hit for this case. We can tackle this scenario at
a later stage.

> > > >
> > > > To implement the above, subscription commands will be changed to have
> > > > one more parameter 'conflict_resolution=on/off', default will be OFF.
> > > >
> > > > To configure global resolvers, new DDL command will be introduced:
> > > >
> > > > CONFLICT RESOLVER ON  IS 
> > > >
> > >
> > > I very much doubt we want a single global conflict resolver, or even one
> > > resolver per subscription. It seems like a very table-specific thing.
> >
>
> +1 to make it a table-level configuration but we probably need
> something at the global level as well such that by default if users
> don't define anything at table-level global-level configuration will
> be used.
>
> >
> > >
> > > Also, doesn't all this whole design ignore the concurrency between
> > > publishers? Isn't this problematic considering the commit timestamps may
> > > go backwards (for a given publisher), which means the conflict
> > > resolution is not deterministic (as it depends on how exactly it
> > > interleaves)?
> > >
>
> I am not able to imagine the cases you are worried about. Can you
> please be specific? Is it similar to the case I described in
> yesterday's email [1]?
>
> [1] - 
> https://www.postgresql.org/message-id/CAA4eK1JTMiBOoGqkt%3DaLPLU8Rs45ihbLhXaGHsz8XC76%2BOG3%2BQ%40mail.gmail.com
>

thanks
Shveta




Re: Conflict Detection and Resolution

2024-05-26 Thread shveta malik
On Sat, May 25, 2024 at 2:39 AM Tomas Vondra
 wrote:
>
> On 5/23/24 08:36, shveta malik wrote:
> > Hello hackers,
> >
> > Please find the proposal for Conflict Detection and Resolution (CDR)
> > for Logical replication.
> >
> >
> > Introduction
> > 
> > In case the node is subscribed to multiple providers, or when local
> > writes happen on a subscriber, conflicts can arise for the incoming
> > changes.  CDR is the mechanism to automatically detect and resolve
> > these conflicts depending on the application and configurations.
> > CDR is not applicable for the initial table sync. If locally, there
> > exists conflicting data on the table, the table sync worker will fail.
> > Please find the details on CDR in apply worker for INSERT, UPDATE and
> > DELETE operations:
> >
>
> Which architecture are you aiming for? Here you talk about multiple
> providers, but the wiki page mentions active-active. I'm not sure how
> much this matters, but it might.

Currently, we are targeting the multiple-providers case, but ideally
it should work for active-active as well. If, during further
discussion and the implementation phase, we find cases that do not
work in a straightforward way for active-active, our primary focus
will remain on first implementing it for the multiple-providers
architecture.

>
> Also, what kind of consistency you expect from this? Because none of
> these simple conflict resolution methods can give you the regular
> consistency models we're used to, AFAICS.

Could you please explain this a little more?

>
> > INSERT
> > 
> > To resolve INSERT conflict on subscriber, it is important to find out
> > the conflicting row (if any) before we attempt an insertion. The
> > indexes or search preference for the same will be:
> > First check for replica identity (RI) index.
> >   - if not found, check for the primary key (PK) index.
> > - if not found, then check for unique indexes (individual ones or
> > added by unique constraints)
> >  - if unique index also not found, skip CDR
> >
> > Note: if no RI index, PK, or unique index is found but
> > REPLICA_IDENTITY_FULL is defined, CDR will still be skipped.
> > The reason being that even though a row can be identified with
> > REPLICA_IDENTITY_FULL, such tables are allowed to have duplicate
> > rows. Hence, we should not go for conflict detection in such a case.
> >
>
> It's not clear to me why would REPLICA_IDENTITY_FULL mean the table is
> allowed to have duplicate values? It just means the upstream is sending
> the whole original row, there can still be a PK/UNIQUE index on both the
> publisher and subscriber.

Yes, right. Sorry for the confusion. I meant the same, i.e., in the
absence of an RI index, PK, or unique index, tables can have
duplicates. So even in the presence of replica identity (FULL in this
case) but in the absence of a unique/primary index, CDR will be
skipped for INSERT.

>
> > In case of replica identity ‘nothing’ and in absence of any suitable
> > index (as defined above), CDR will be skipped for INSERT.
> >
> > Conflict Type:
> > 
> > insert_exists: A conflict is detected when the table has the same
> > value for a key column as the new value in the incoming row.
> >
> > Conflict Resolution
> > 
> > a) latest_timestamp_wins:The change with later commit timestamp wins.
> > b) earliest_timestamp_wins:   The change with earlier commit timestamp wins.
> > c) apply:   Always apply the remote change.
> > d) skip:Remote change is skipped.
> > e) error:   Error out on conflict. Replication is stopped, manual
> > action is needed.
> >
>
> Why not to have some support for user-defined conflict resolution
> methods, allowing to do more complex stuff (e.g. merging the rows in
> some way, perhaps even with datatype-specific behavior)?

Initially, for the sake of simplicity, we are targeting support for
built-in resolvers. But we plan to work on user-defined resolvers as
well; we shall propose that separately.

>
> > The change will be converted to 'UPDATE' and applied if the decision
> > is in favor of applying remote change.
> >
> > It is important to have commit timestamp info available on subscriber
> > when latest_timestamp_wins or earliest_timestamp_wins method is chosen
> > as resolution method.  Thus ‘track_commit_timestamp’ must be enabled
> > on subscriber, in absence of which, configuring the said
> > timestamp-based resolution methods will result in error.
> >
> > Note: If the user has chosen the latest or earlies

Conflict Detection and Resolution

2024-05-23 Thread shveta malik
Hello hackers,

Please find the proposal for Conflict Detection and Resolution (CDR)
for Logical replication.


Introduction

In case the node is subscribed to multiple providers, or when local
writes happen on a subscriber, conflicts can arise for the incoming
changes.  CDR is the mechanism to automatically detect and resolve
these conflicts depending on the application and configurations.
CDR is not applicable to the initial table sync. If conflicting data
already exists locally in the table, the table sync worker will fail.
Please find the details on CDR in apply worker for INSERT, UPDATE and
DELETE operations:

INSERT

To resolve an INSERT conflict on the subscriber, it is important to
find the conflicting row (if any) before we attempt the insertion.
The index search preference for this will be:
  - First, check for a replica identity (RI) index.
  - If not found, check for a primary key (PK) index.
  - If not found, check for unique indexes (individual ones or those
    added by unique constraints).
  - If no unique index is found either, skip CDR.

Note: if no RI index, PK, or unique index is found but
REPLICA_IDENTITY_FULL is defined, CDR will still be skipped.
The reason is that even though a row can be identified with
REPLICA_IDENTITY_FULL, such tables are allowed to have duplicate
rows. Hence, we should not go for conflict detection in such a case.

In case of replica identity ‘nothing’ and in absence of any suitable
index (as defined above), CDR will be skipped for INSERT.

Conflict Type:

insert_exists: A conflict is detected when the table has the same
value for a key column as the new value in the incoming row.

Conflict Resolution

a) latest_timestamp_wins:   The change with the later commit timestamp wins.
b) earliest_timestamp_wins: The change with the earlier commit timestamp wins.
c) apply:                   Always apply the remote change.
d) skip:                    The remote change is skipped.
e) error:                   Error out on conflict. Replication is stopped;
manual action is needed.

The change will be converted to 'UPDATE' and applied if the decision
is in favor of applying the remote change.

It is important to have commit timestamp info available on the
subscriber when latest_timestamp_wins or earliest_timestamp_wins is
chosen as the resolution method. Thus ‘track_commit_timestamp’ must be
enabled on the subscriber; in its absence, configuring the said
timestamp-based resolution methods will result in an error.

Note: If the user has chosen latest_timestamp_wins or
earliest_timestamp_wins, and the remote and local timestamps are the
same, then it will go by system identifier. The change with the higher
system identifier will win. This ensures that the same change is
picked on all the nodes.

 UPDATE


Conflict Detection Method:

Origin conflict detection: The ‘origin’ info is used to detect
conflicts; it can be obtained from the commit timestamp generated for
the incoming txn at the source node. To compare the remote origin with
the local origin, we must have origin information for local txns as
well, which can be obtained from the commit timestamp after enabling
‘track_commit_timestamp’ locally.
The one drawback here is that the ‘origin’ information cannot be
obtained once the row is frozen and the commit-timestamp info is
removed by vacuum. For a frozen row, conflicts cannot be raised, and
thus the incoming changes will be applied in all cases.

Conflict Types:

a) update_differ: The row matching the incoming update's key has a
different origin from the incoming change, i.e., the row has already
been updated locally or by a different node.
b) update_missing: No row with the same key as the incoming update's
key exists. The remote is trying to update a row which does not exist
locally.
c) update_deleted: No row with the same key as the incoming update's
key exists; the row has already been deleted. This conflict type is
generated only if the deleted row is still detectable, i.e., it has
not been removed by VACUUM yet. If the row has already been removed by
VACUUM, this conflict cannot be detected; it will be detected as
update_missing and will follow the default or configured resolver of
update_missing itself.

 Conflict Resolutions:

a) latest_timestamp_wins:   The change with the later commit timestamp
wins. Can be used for ‘update_differ’.
b) earliest_timestamp_wins: The change with the earlier commit
timestamp wins. Can be used for ‘update_differ’.
c) apply:          The remote change is always applied. Can be used for
‘update_differ’.
d) apply_or_skip:  The remote change is converted to INSERT and is
applied. If the complete row cannot be constructed from the info
provided by the publisher, then the change is skipped. Can be used for
‘update_missing’ or ‘update_deleted’.
e) apply_or_error: The remote change is converted to INSERT and is
applied. If the complete row cannot be constructed from the info
provided by the publisher, then an error is raised.

Re: Synchronizing slots from primary to standby

2024-04-29 Thread shveta malik
On Mon, Apr 29, 2024 at 5:28 PM Zhijie Hou (Fujitsu)
 wrote:
>
> On Monday, April 29, 2024 5:11 PM shveta malik  wrote:
> >
> > On Mon, Apr 29, 2024 at 11:38 AM shveta malik 
> > wrote:
> > >
> > > On Mon, Apr 29, 2024 at 10:57 AM Zhijie Hou (Fujitsu)
> > >  wrote:
> > > >
> > > > On Friday, March 15, 2024 10:45 PM Bertrand Drouvot
> >  wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > On Thu, Mar 14, 2024 at 02:22:44AM +, Zhijie Hou (Fujitsu) wrote:
> > > > > > Hi,
> > > > > >
> > > > > > Since the standby_slot_names patch has been committed, I am
> > > > > > attaching the last doc patch for review.
> > > > > >
> > > > >
> > > > > Thanks!
> > > > >
> > > > > 1 ===
> > > > >
> > > > > +   continue subscribing to publications now on the new primary
> > > > > + server
> > > > > without
> > > > > +   any data loss.
> > > > >
> > > > > I think "without any data loss" should be re-worded in this
> > > > > context. Data loss in the sense "data committed on the primary and
> > > > > not visible on the subscriber in case of failover" can still occurs 
> > > > > (in case
> > synchronous replication is not used).
> > > > >
> > > > > 2 ===
> > > > >
> > > > > +   If the result (failover_ready) of both above 
> > > > > steps is
> > > > > +   true, existing subscriptions will be able to continue without data
> > loss.
> > > > > +  
> > > > >
> > > > > I don't think that's true if synchronous replication is not used.
> > > > > Say,
> > > > >
> > > > > - synchronous replication is not used
> > > > > - primary is not able to reach the standby anymore and
> > > > > standby_slot_names is set
> > > > > - new data is inserted into the primary
> > > > > - then not replicated to subscriber (due to standby_slot_names)
> > > > >
> > > > > Then I think the both above steps will return true but data would
> > > > > be lost in case of failover.
> > > >
> > > > Thanks for the comments, attach the new version patch which reworded
> > > > the above places.
> > >
> > > Thanks for the patch.
> > >
> > > Few comments:
> > >
> > > 1)  Tested the steps, one of the queries still refers to
> > > 'conflict_reason'. I think it should refer 'conflicting'.
>
> Thanks for catching this. Fixed.
>
> > >
> > > 2) Will it be good to mention that in case of planned promotion, it is
> > > recommended to run  pg_sync_replication_slots() as last sync attempt
> > > before we run failvoer-ready validation steps? This can be mentioned
> > > in high-availaibility.sgml of current patch
> >
> > I recall now that with the latest fix, we cannot run
> > pg_sync_replication_slots() unless we disable the slot-sync worker.
> > Considering that, I think it will be too many steps just to run the SQL 
> > function at
> > the end without much value added. Thus we can skip this point, we can rely 
> > on
> > slot sync worker completely.
>
> Agreed. I didn't change this.
>
> Here is the V3 doc patch.

Thanks for the patch.

It will be good if 1a can produce a quoted slot-names list as output,
which can be used directly in step 1b's query; otherwise, it is a
little inconvenient to give input to 1b if the number of slots is
huge. The user needs to manually quote each slot name.

Other than this, the patch looks good to me.

thanks
Shveta




Re: Synchronizing slots from primary to standby

2024-04-29 Thread shveta malik
On Mon, Apr 29, 2024 at 11:38 AM shveta malik  wrote:
>
> On Mon, Apr 29, 2024 at 10:57 AM Zhijie Hou (Fujitsu)
>  wrote:
> >
> > On Friday, March 15, 2024 10:45 PM Bertrand Drouvot 
> >  wrote:
> > >
> > > Hi,
> > >
> > > On Thu, Mar 14, 2024 at 02:22:44AM +, Zhijie Hou (Fujitsu) wrote:
> > > > Hi,
> > > >
> > > > Since the standby_slot_names patch has been committed, I am attaching
> > > > the last doc patch for review.
> > > >
> > >
> > > Thanks!
> > >
> > > 1 ===
> > >
> > > +   continue subscribing to publications now on the new primary server
> > > without
> > > +   any data loss.
> > >
> > > I think "without any data loss" should be re-worded in this context. Data 
> > > loss in
> > > the sense "data committed on the primary and not visible on the 
> > > subscriber in
> > > case of failover" can still occurs (in case synchronous replication is 
> > > not used).
> > >
> > > 2 ===
> > >
> > > +   If the result (failover_ready) of both above steps 
> > > is
> > > +   true, existing subscriptions will be able to continue without data 
> > > loss.
> > > +  
> > >
> > > I don't think that's true if synchronous replication is not used. Say,
> > >
> > > - synchronous replication is not used
> > > - primary is not able to reach the standby anymore and standby_slot_names 
> > > is
> > > set
> > > - new data is inserted into the primary
> > > - then not replicated to subscriber (due to standby_slot_names)
> > >
> > > Then I think the both above steps will return true but data would be lost 
> > > in case
> > > of failover.
> >
> > Thanks for the comments, attach the new version patch which reworded the
> > above places.
>
> Thanks for the patch.
>
> Few comments:
>
> 1)  Tested the steps, one of the queries still refers to
> 'conflict_reason'. I think it should refer 'conflicting'.
>
> 2) Will it be good to mention that in case of planned promotion, it is
> recommended to run  pg_sync_replication_slots() as last sync attempt
> before we run failvoer-ready validation steps? This can be mentioned
> in high-availaibility.sgml of current patch

I recall now that with the latest fix, we cannot run
pg_sync_replication_slots() unless we disable the slot-sync worker.
Considering that, I think it will be too many steps just to run the
SQL function at the end without much value added. Thus we can skip
this point and rely on the slot sync worker completely.

thanks
Shveta




Re: Synchronizing slots from primary to standby

2024-04-29 Thread shveta malik
On Mon, Apr 29, 2024 at 10:57 AM Zhijie Hou (Fujitsu)
 wrote:
>
> On Friday, March 15, 2024 10:45 PM Bertrand Drouvot 
>  wrote:
> >
> > Hi,
> >
> > On Thu, Mar 14, 2024 at 02:22:44AM +, Zhijie Hou (Fujitsu) wrote:
> > > Hi,
> > >
> > > Since the standby_slot_names patch has been committed, I am attaching
> > > the last doc patch for review.
> > >
> >
> > Thanks!
> >
> > 1 ===
> >
> > +   continue subscribing to publications now on the new primary server
> > without
> > +   any data loss.
> >
> > I think "without any data loss" should be re-worded in this context. Data 
> > loss in
> > the sense "data committed on the primary and not visible on the subscriber 
> > in
> > case of failover" can still occurs (in case synchronous replication is not 
> > used).
> >
> > 2 ===
> >
> > +   If the result (failover_ready) of both above steps is
> > +   true, existing subscriptions will be able to continue without data loss.
> > +  
> >
> > I don't think that's true if synchronous replication is not used. Say,
> >
> > - synchronous replication is not used
> > - primary is not able to reach the standby anymore and standby_slot_names is
> > set
> > - new data is inserted into the primary
> > - then not replicated to subscriber (due to standby_slot_names)
> >
> > Then I think the both above steps will return true but data would be lost 
> > in case
> > of failover.
>
> Thanks for the comments, attach the new version patch which reworded the
> above places.

Thanks for the patch.

Few comments:

1) Tested the steps; one of the queries still refers to
'conflict_reason'. I think it should refer to 'conflicting'.

2) Will it be good to mention that in case of planned promotion, it is
recommended to run pg_sync_replication_slots() as a last sync attempt
before we run the failover-ready validation steps? This can be
mentioned in high-availability.sgml of the current patch.

thanks
Shveta




Re: promotion related handling in pg_sync_replication_slots()

2024-04-23 Thread shveta malik
On Tue, Apr 23, 2024 at 9:07 AM Amit Kapila  wrote:
>
> On Mon, Apr 22, 2024 at 7:04 PM Masahiko Sawada  wrote:
> >
> > On Mon, Apr 22, 2024 at 9:02 PM shveta malik  wrote:
> > >
> > > Thanks for the patch, the changes look good Amit. Please find the merged 
> > > patch.
> > >
> >
> > I've reviewed the patch and have some comments:

Thanks for the comments.

> > ---
> > /*
> > -* Early initialization.
> > +* Register slotsync_worker_onexit() before we register
> > +* ReplicationSlotShmemExit() in BaseInit(), to ensure that during the
> > +* exit of the slot sync worker, ReplicationSlotShmemExit() is called
> > +* first, followed by slotsync_worker_onexit(). The startup process 
> > during
> > +* promotion invokes ShutDownSlotSync() which waits for slot sync to
> > +* finish and it does that by checking the 'syncing' flag. Thus worker
> > +* must be done with the slots' release and cleanup before it marks 
> > itself
> > +* as finished syncing.
> >  */
> >
> > I'm slightly worried that we register the slotsync_worker_onexit()
> > callback before BaseInit(), because it could be a blocker when we want
> > to add more work in the callback, for example sending the stats.
> >
>
> The other possibility is that we do slot release/clean up in the
> slotsync_worker_onexit() call itself and then we can do it after
> BaseInit(). Do you have any other/better idea for this?

I have currently implemented it this way in v11.

> > ---
> > synchronize_slots(wrconn);
> > +
> > +   /* Cleanup the temporary slots */
> > +   ReplicationSlotCleanup();
> > +
> > +   /* We are done with sync, so reset sync flag */
> > +   reset_syncing_flag();
> >
> > I think it ends up removing other temp slots that are created by the
> > same backend process using
> > pg_create_{physical,logical_replication_slots() function, which could
> > be a large side effect of this function for users.

Yes, this is a problem. Thanks for catching it.

>
> True, I think here we should either remove only temporary and synced
> marked slots. The other possibility is to create slots as RS_EPHEMERAL
> initially when called from the SQL function but that doesn't sound
> like a neat approach.

Modified the logic to remove only synced temporary slots during
SQL-function exit.

Please find v11 with above changes.

thanks
Shveta


v11-0001-Handle-stopSignaled-during-sync-function-call.patch
Description: Binary data


Re: promotion related handling in pg_sync_replication_slots()

2024-04-22 Thread shveta malik
On Mon, Apr 22, 2024 at 5:10 PM Amit Kapila  wrote:
>
> On Fri, Apr 19, 2024 at 1:52 PM shveta malik  wrote:
> >
> > Please find v9 with the above comments addressed.
> >
>
> I have made minor modifications in the comments and a function name.
> Please see the attached top-up patch. Apart from this, the patch looks
> good to me.

Thanks for the patch, the changes look good Amit. Please find the merged patch.

thanks
Shveta


v10-0001-Handle-stopSignaled-during-sync-function-call.patch
Description: Binary data


Re: Disallow changing slot's failover option in transaction block

2024-04-22 Thread shveta malik
On Mon, Apr 22, 2024 at 5:57 AM Zhijie Hou (Fujitsu)
 wrote:
>
> On Friday, April 19, 2024 10:54 AM Kuroda, Hayato/黒田 隼人 
>  wrote:
> > In your patch, the pg_dump.c was updated. IIUC the main reason was that
> > pg_restore executes some queries as single transactions so that ALTER
> > SUBSCRIPTION cannot be done, right?
>
> Yes, and please see below for other reasons.
>
> > Also, failover was synchronized only when we were in the upgrade mode, but
> > your patch seems to remove the condition. Can you clarify the reason?
>
> We used ALTER SUBSCRIPTION in upgrade mode because it was not allowed to use
> connect=false and failover=true together when CREATE SUBSCRIPTION. But since 
> we
> don't have this restriction anymore(we don't alter slot when creating sub
> anymore), we can directly specify failover in CREATE SUBSCRIPTION and do that
> in non-upgrade mode as well.
>
> Attach the V3 patch which also addressed Shveta[1] and Bertrand[2]'s comments.

Tested the patch; it works well.

thanks
Shveta




Re: Disallow changing slot's failover option in transaction block

2024-04-19 Thread shveta malik
On Fri, Apr 19, 2024 at 6:09 AM Zhijie Hou (Fujitsu)
 wrote:
>
> Here is V2 patch which includes the changes for CREATE SUBSCRIPTION as
> suggested. Since we don't connect pub to alter slot when (create_slot=false)
> anymore, the restriction that disallows failover=true when connect=false is
> also removed.

Thanks for the patch. I feel getSubscription() also needs to fetch the
'subfailover' option independent of dopt->binary_upgrade, i.e., it
needs changes similar to those in dumpSubscription(). I tested
pg_dump; currently it is not dumping the failover parameter for
failover-enabled subscriptions, perhaps due to the same bug. The
Create-sub and Alter-sub changes look good and work well.

thanks
Shveta




Re: promotion related handling in pg_sync_replication_slots()

2024-04-19 Thread shveta malik
On Fri, Apr 19, 2024 at 11:37 AM shveta malik  wrote:
>
> On Fri, Apr 19, 2024 at 10:53 AM Bertrand Drouvot
>  wrote:
> >
> > Hi,
> >
> > On Thu, Apr 18, 2024 at 05:36:05PM +0530, shveta malik wrote:
> > > Please find v8 attached. Changes are:
> >
> > Thanks!
> >
> > A few comments:
>
> Thanks for reviewing.
>
> > 1 ===
> >
> > @@ -1440,7 +1461,7 @@ ReplSlotSyncWorkerMain(char *startup_data, size_t 
> > startup_data_len)
> >  * slotsync_worker_onexit() but that will need the connection to be 
> > made
> >  * global and we want to avoid introducing global for this purpose.
> >  */
> > -   before_shmem_exit(slotsync_failure_callback, 
> > PointerGetDatum(wrconn));
> > +   before_shmem_exit(slotsync_worker_disconnect, 
> > PointerGetDatum(wrconn));
> >
> > The comment above this change still states "Register the failure callback 
> > once
> > we have the connection", I think it has to be reworded a bit now that v8 is
> > making use of slotsync_worker_disconnect().
> >
> > 2 ===
> >
> > +* Register slotsync_worker_onexit() before we register
> > +* ReplicationSlotShmemExit() in BaseInit(), to ensure that during 
> > exit of
> > +* slot sync worker, ReplicationSlotShmemExit() is called first, 
> > followed
> > +* by slotsync_worker_onexit(). Startup process during promotion 
> > waits for
> >
> > Worth to mention in shmem_exit() (where it "while 
> > (--before_shmem_exit_index >= 0)"
> > or before the shmem_exit() definition) that ReplSlotSyncWorkerMain() relies 
> > on
> > this LIFO behavior? (not sure if there is other "strong" LIFO requirement in
> > other part of the code).
>
> I see other modules as well relying on LIFO behavior.
> Please see applyparallelworker.c where
> 'before_shmem_exit(pa_shutdown)' is needed to be done after
> 'before_shmem_exit(logicalrep_worker_onexit)' (commit id 3d144c6).
> Also in postinit.c, I see such comments atop
> 'before_shmem_exit(ShutdownPostgres, 0)'.
> I feel we can skip adding this specific comment about
> ReplSlotSyncWorkerMain() in ipc.c, as none of the other modules has
> also not added any. I will address the rest of your comments in the
> next version.
>
> > 3 ===
> >
> > +* Startup process during promotion waits for slot sync to finish 
> > and it
> > +* does that by checking the 'syncing' flag.
> >
> > worth to mention ShutDownSlotSync()?
> >
> > 4 ===
> >
> > I did a few tests manually (launching ShutDownSlotSync() through gdb / with 
> > and
> > without sync worker and with / without pg_sync_replication_slots() running
> > concurrently) and it looks like it works as designed.
>
> Thanks for testing it.
>
> > Having said that, the logic that is in place to take care of the corner 
> > cases
> > described up-thread seems reasonable to me.

Please find v9 with the above comments addressed.

thanks
Shveta


v9-0001-Handle-stopSignaled-during-sync-function-call.patch
Description: Binary data


Re: promotion related handling in pg_sync_replication_slots()

2024-04-19 Thread shveta malik
On Fri, Apr 19, 2024 at 10:53 AM Bertrand Drouvot
 wrote:
>
> Hi,
>
> On Thu, Apr 18, 2024 at 05:36:05PM +0530, shveta malik wrote:
> > Please find v8 attached. Changes are:
>
> Thanks!
>
> A few comments:

Thanks for reviewing.

> 1 ===
>
> @@ -1440,7 +1461,7 @@ ReplSlotSyncWorkerMain(char *startup_data, size_t 
> startup_data_len)
>  * slotsync_worker_onexit() but that will need the connection to be 
> made
>  * global and we want to avoid introducing global for this purpose.
>  */
> -   before_shmem_exit(slotsync_failure_callback, PointerGetDatum(wrconn));
> +   before_shmem_exit(slotsync_worker_disconnect, 
> PointerGetDatum(wrconn));
>
> The comment above this change still states "Register the failure callback once
> we have the connection", I think it has to be reworded a bit now that v8 is
> making use of slotsync_worker_disconnect().
>
> 2 ===
>
> +* Register slotsync_worker_onexit() before we register
> +* ReplicationSlotShmemExit() in BaseInit(), to ensure that during 
> exit of
> +* slot sync worker, ReplicationSlotShmemExit() is called first, 
> followed
> +* by slotsync_worker_onexit(). Startup process during promotion 
> waits for
>
> Worth to mention in shmem_exit() (where it "while (--before_shmem_exit_index 
> >= 0)"
> or before the shmem_exit() definition) that ReplSlotSyncWorkerMain() relies on
> this LIFO behavior? (not sure if there is other "strong" LIFO requirement in
> other part of the code).

I see other modules relying on this LIFO behavior as well.
Please see applyparallelworker.c, where
'before_shmem_exit(pa_shutdown)' needs to be done after
'before_shmem_exit(logicalrep_worker_onexit)' (commit id 3d144c6).
Also in postinit.c, I see such comments atop
'before_shmem_exit(ShutdownPostgres, 0)'.
I feel we can skip adding this specific comment about
ReplSlotSyncWorkerMain() in ipc.c, as none of the other modules have
added one either. I will address the rest of your comments in the
next version.

> 3 ===
>
> +* Startup process during promotion waits for slot sync to finish and 
> it
> +* does that by checking the 'syncing' flag.
>
> worth to mention ShutDownSlotSync()?
>
> 4 ===
>
> I did a few tests manually (launching ShutDownSlotSync() through gdb / with 
> and
> without sync worker and with / without pg_sync_replication_slots() running
> concurrently) and it looks like it works as designed.

Thanks for testing it.

> Having said that, the logic that is in place to take care of the corner cases
> described up-thread seems reasonable to me.

thanks
Shveta




Re: promotion related handling in pg_sync_replication_slots()

2024-04-18 Thread shveta malik
On Thu, Apr 18, 2024 at 12:35 PM shveta malik  wrote:
>
> To fix above issues, these changes have been made in v7:

Please find v8 attached. Changes are:

1) It fixes ShutDownSlotSync() issue, where we perform
kill(SlotSyncCtx->pid). There are chances that after we release
spin-lock and before we perform kill, slot-sync worker has error-ed
out and has set SlotSyncCtx->pid to InvalidPid (-1) already. And thus
kill(-1) could result in abnormal process kills on some platforms.
Now, we get pid under spin-lock and then use it to perform kill to
avoid pid=-1 kill. This is on a similar line of how ShutdownWalRcv()
does it.

2) Improved comments in code.

3) Updated the commit message with the new fixes. I had missed
updating it in the previous version.

thanks
Shveta


v8-0001-Handle-stopSignaled-during-sync-function-call.patch
Description: Binary data


Re: Disallow changing slot's failover option in transaction block

2024-04-18 Thread shveta malik
On Thu, Apr 18, 2024 at 11:40 AM Hayato Kuroda (Fujitsu)
 wrote:
>
> Dear Shveta,
>
> Sorry for delay response. I missed your post.
>
> > +1.
> >
> > Similar to ALTER SUB, CREATE SUB also needs to be fixed. This is
> > because we call alter_replication_slot in CREATE SUB as well, for the
> > case when slot_name is provided and create_slot=false. But the tricky
> > part is we call alter_replication_slot() when creating subscription
> > for both failover=true and false. That means if we want to fix it on
> > the similar line of ALTER SUB, we have to disallow user from executing
> > the CREATE SUBSCRIPTION (slot_name = xx) in a txn block, which seems
> > to break some existing use cases. (previously, user can execute such a
> > command inside a txn block).
> >
> > So, we need to think if there are better ways to fix it.  After
> > discussion with Hou-San offlist, here are some ideas:
> > 1. do not alter replication slot's failover option when CREATE
> > SUBSCRIPTION   WITH failover=false. This means we alter replication
> > slot only when failover is set to true. And thus we can disallow
> > CREATE SUB WITH (slot_name =xx, failover=true, create_slot=false)
> > inside a txn block.
> >
> > This option allows user to run CREATE-SUB(create_slot=false) with
> > failover=false in txn block like earlier. But on the downside, it
> > makes the behavior inconsistent for otherwise simpler option like
> > failover,  i.e. with failover=true, CREATE SUB is not allowed in txn
> > block while with failover=false, it is allowed. It makes it slightly
> > complex to be understood by user.
> >
> > 2. let's not disallow CREATE SUB in txn block as earlier, just don't
> > perform internal alter-failover during CREATE SUB for existing
> > slots(create_slot=false, slot_name=xx)  i.e. when create_slot is
> > false, we will ignore failover parameter of CREATE SUB and it is
> > user's responsibility to set it appropriately using ALTER SUB cmd. For
> > create_slot=true, the behaviour of CREATE-SUB remains same as earlier.
> >
> > This option does not add new restriction for CREATE SUB wrt txn block.
> > In context of failover with create_slot=false, we already have a
> > similar restriction (documented one) for ALTER SUB, i.e. with 'ALTER
> > SUBSCRIPTION SET(slot_name = new)', user needs to alter the new slot's
> > failover by himself. CREAT SUB can also be documented in similar way.
> > This seems simpler to be understood considering existing ALTER SUB's
> > behavior as well. Plus, this will make CREATE-SUB code slightly
> > simpler and thus easily manageable.
> >
> > 3. add a alter_slot option for CREATE SUBSCRIPTION, we can only alter
> > the  slot's failover if alter_slot=true. And so we can disallow CREATE
> > SUB WITH (slot_name =xx, alter_slot=true) inside a txn block.
> >
> > This seems a clean way, as everything will be as per user's consent
> > based on alter_slot parameter. But on the downside, this will need
> > introducing additional parameter and also adding new restriction of
> > running CREATE-sub in txn  block for a specific case.
> >
> > 4. Don't alter replication in subscription DDLs. Instead, try to alter
> > replication slot's failover in the apply worker. This means we need to
> > execute alter_replication_slot each time before starting streaming in
> > the apply worker.
> >
> > This does not seem appealing to execute alter_replication_slot
> > everytime the apply worker starts. But if others think it as a better
> > option, it can be further analyzed.
>
> Thanks for describing, I also prefer 2, because it seems bit strange that
> CREATE statement leads ALTER.

Thanks for feedback.

> > Currently, our preference is option 2, as that looks a clean solution
> > and also aligns with ALTER-SUB behavior which is already documented.
> > Thoughts?
> >
> > 
> > Note that we could not refer to the design of two_phase here, because
> > two_phase can be considered as a streaming option, so it's fine to
> > change the two_phase along with START_REPLICATION command. (the
> > two_phase is not changed in subscription DDLs, but get changed in
> > START_REPLICATION command). But the failover is closely related to a
> > replication slot itself.
> > 

Sorry for causing confusion. This is not the documented statement; it
was an additional note to support our analysis.

> Sorry, I cannot find statements. Where did you refer?

When I said that option 2 aligns with ALTER-SUB documented behaviour,
I meant the doc described in [1] stating "When altering the slot_name,
the failover and two_phase property values of the named slot may
differ from the counterpart failover and two_phase parameters
specified in the subscription"

[1]: 
https://www.postgresql.org/docs/devel/sql-altersubscription.html#SQL-ALTERSUBSCRIPTION-PARAMS-SET

thanks
Shveta




Re: promotion related handling in pg_sync_replication_slots()

2024-04-18 Thread shveta malik
On Tue, Apr 16, 2024 at 2:52 PM Bertrand Drouvot
 wrote:
>
> Hi,
>
> On Tue, Apr 16, 2024 at 02:06:45PM +0530, shveta malik wrote:
> > On Tue, Apr 16, 2024 at 1:55 PM Zhijie Hou (Fujitsu)
> >  wrote:
> > > I personally feel adding the additional check for sync_replication_slots 
> > > may
> > > not improve the situation here. Because the GUC sync_replication_slots can
> > > change at any point, the GUC could be false when performing this addition 
> > > check
> > > and is set to true immediately after the check, so It could not simplify 
> > > the logic
> > > anyway.
> >
> > +1.
> > I feel doc and "cannot synchronize replication slots concurrently"
> > check should suffice.
> >
> > In the scenario which Hou-San pointed out,  if after performing the
> > GUC check in SQL function, this GUC is enabled immediately and say
> > worker is started sooner than the function could get chance to sync,
> > in that case as well, SQL function will ultimately get error "cannot
> > synchronize replication slots concurrently", even though GUC is
> > enabled.  Thus, I feel we should stick with samer error in all
> > scenarios.
>
> Okay, fine by me, let's forget about checking sync_replication_slots then.

Thanks.

While reviewing and testing this patch further, we encountered 2
race conditions which need to be handled:

1) For the slot sync worker, the order of cleanup execution was a) first
reset the 'syncing' flag (slotsync_failure_callback), b) then reset pid and
syncing (slotsync_worker_onexit). But in ShutDownSlotSync(), we rely
only on the 'syncing' flag for the wait-exit logic. So it may happen
that in the window between these two callbacks, ShutDownSlotSync()
proceeds and calls update_synced_slots_inactive_since(), which may then
hit the assertion Assert(SlotSyncCtx->pid == InvalidPid).

2) Another problem as described by Hou-San off-list:
When the slotsync worker error out after acquiring a slot, it will
first call slotsync_worker_onexit() and then
ReplicationSlotShmemExit(), so in the window between these two
callbacks, it's possible that SlotSyncCtx->syncing and
SlotSyncCtx->pid have been reset but slot->active_pid is still
valid. The Assert will be broken in this case:
@@ -1471,6 +1503,9 @@ update_synced_slots_inactive_since(void)
{
Assert(SlotIsLogical(s));

+   /* The slot must not be acquired by any process */
+   Assert(s->active_pid == 0);
+


To fix above issues, these changes have been made in v7:
1) For worker, replaced slotsync_failure_callback() with
slotsync_worker_disconnect() so that the latter only disconnects and
thus slotsync_worker_onexit() does pid cleanup followed by syncing
flag cleanup. This will make ShutDownSlotSync()'s wait exit reliably.

2) To fix second problem, changes are:

2.1) For worker, moved slotsync_worker_onexit() registration before
BaseInit() (BaseInit is the one doing ReplicationSlotShmemExit
registration). If we do this change in order of registration, then
order of cleanup for worker will be a) slotsync_worker_disconnect() b)
ReplicationSlotShmemExit() c) slotsync_worker_onexit(). This order
ensures that the worker is actually done with slots release and
cleanup before it marks itself as done syncing.

2.2) For SQL function, did ReplicationSlotRelease() and
ReplicationSlotCleanup() as first step in slotsync_failure_callback().

While doing change 2.2, it occurred to us that it would be a cleaner
solution to do ReplicationSlotCleanup() even on successful execution
of the SQL function. It seems better that the temporary slots are
cleaned up when the SQL function exits, as we do not know when the user
will run this SQL function again, and leaving temp slots around for
longer does not seem a good idea. Thoughts?
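As a rough illustration (hedged; this only assumes the standard
pg_replication_slots view, the slot names shown would vary), any
temporary synced slots left behind on the standby could be observed with:

```sql
-- On the standby, after pg_sync_replication_slots() returns or errors out:
-- any rows here are temporary slots that would linger until the next
-- sync cycle or session exit if not cleaned up explicitly.
SELECT slot_name, temporary, synced
FROM pg_replication_slots
WHERE temporary;
```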

thanks
Shveta


v7-0001-Handle-stopSignaled-during-sync-function-call.patch
Description: Binary data


Re: Disallow changing slot's failover option in transaction block

2024-04-18 Thread shveta malik
On Thu, Apr 18, 2024 at 11:22 AM Amit Kapila  wrote:
>
> On Tue, Apr 16, 2024 at 5:06 PM shveta malik  wrote:
> >
> > On Tue, Apr 16, 2024 at 1:45 PM Hayato Kuroda (Fujitsu)
> >  wrote:
> > >
> > > Dear Hou,
> > >
> > > > Kuroda-San reported an issue off-list that:
> > > >
> > > > If user execute ALTER SUBSCRIPTION SET (failover) command inside a txn 
> > > > block
> > > > and rollback, only the subscription option change can be rolled back, 
> > > > while the
> > > > replication slot's failover change is preserved.
> > > >
> > > > This is because ALTER SUBSCRIPTION SET (failover) command internally
> > > > executes
> > > > the replication command ALTER_REPLICATION_SLOT to change the replication
> > > > slot's
> > > > failover property, but this replication command execution cannot be
> > > > rollback.
> > > >
> > > > To fix it, I think we can prevent user from executing ALTER 
> > > > SUBSCRIPTION set
> > > > (failover) inside a txn block, which is also consistent to the ALTER
> > > > SUBSCRIPTION REFRESH/DROP SUBSCRIPTION command. Attach a small
> > > > patch to address this.
> > >
> > > Thanks for posting the patch, the fix is same as my expectation.
> > > Also, should we add the restriction to the doc? I feel [1] can be updated.
> >
> > +1.
> >
> > Similar to ALTER SUB, CREATE SUB also needs to be fixed. This is
> > because we call alter_replication_slot in CREATE SUB as well, for the
> > case when slot_name is provided and create_slot=false. But the tricky
> > part is we call alter_replication_slot() when creating subscription
> > for both failover=true and false. That means if we want to fix it on
> > the similar line of ALTER SUB, we have to disallow user from executing
> > the CREATE SUBSCRIPTION (slot_name = xx) in a txn block, which seems
> > to break some existing use cases. (previously, user can execute such a
> > command inside a txn block).
> >
> > So, we need to think if there are better ways to fix it.  After
> > discussion with Hou-San offlist, here are some ideas:
> >
> > 1. do not alter replication slot's failover option when CREATE
> > SUBSCRIPTION   WITH failover=false. This means we alter replication
> > slot only when failover is set to true. And thus we can disallow
> > CREATE SUB WITH (slot_name =xx, failover=true, create_slot=false)
> > inside a txn block.
> >
> > This option allows user to run CREATE-SUB(create_slot=false) with
> > failover=false in txn block like earlier. But on the downside, it
> > makes the behavior inconsistent for otherwise simpler option like
> > failover,  i.e. with failover=true, CREATE SUB is not allowed in txn
> > block while with failover=false, it is allowed. It makes it slightly
> > complex to be understood by user.
> >
> > 2. let's not disallow CREATE SUB in txn block as earlier, just don't
> > perform internal alter-failover during CREATE SUB for existing
> > slots(create_slot=false, slot_name=xx)  i.e. when create_slot is
> > false, we will ignore failover parameter of CREATE SUB and it is
> > user's responsibility to set it appropriately using ALTER SUB cmd. For
> > create_slot=true, the behaviour of CREATE-SUB remains same as earlier.
> >
> > This option does not add new restriction for CREATE SUB wrt txn block.
> > In context of failover with create_slot=false, we already have a
> > similar restriction (documented one) for ALTER SUB, i.e. with 'ALTER
> > SUBSCRIPTION SET(slot_name = new)', user needs to alter the new slot's
> > failover by himself. CREAT SUB can also be documented in similar way.
> > This seems simpler to be understood considering existing ALTER SUB's
> > behavior as well. Plus, this will make CREATE-SUB code slightly
> > simpler and thus easily manageable.
> >
>
> +1 for option 2 as it sounds logical to me and consistent with ALTER
> SUBSCRIPTION. BTW, IIUC, you are referring to: "When altering the
> slot_name, the failover and two_phase property values of the named
> slot may differ from the counterpart failover and two_phase parameters
> specified in the subscription. When creating the slot, ensure the slot
> properties failover and two_phase match their counterpart parameters
> of the subscription." in docs [1], right?

Yes. Here:
https://www.postgresql.org/docs/devel/sql-altersubscription.html#SQL-ALTERSUBSCRIPTION-PARAMS-SET

thanks
Shveta




Re: Disallow changing slot's failover option in transaction block

2024-04-16 Thread shveta malik
On Tue, Apr 16, 2024 at 1:45 PM Hayato Kuroda (Fujitsu)
 wrote:
>
> Dear Hou,
>
> > Kuroda-San reported an issue off-list that:
> >
> > If user execute ALTER SUBSCRIPTION SET (failover) command inside a txn block
> > and rollback, only the subscription option change can be rolled back, while 
> > the
> > replication slot's failover change is preserved.
> >
> > This is because ALTER SUBSCRIPTION SET (failover) command internally
> > executes
> > the replication command ALTER_REPLICATION_SLOT to change the replication
> > slot's
> > failover property, but this replication command execution cannot be
> > rollback.
> >
> > To fix it, I think we can prevent user from executing ALTER SUBSCRIPTION set
> > (failover) inside a txn block, which is also consistent to the ALTER
> > SUBSCRIPTION REFRESH/DROP SUBSCRIPTION command. Attach a small
> > patch to address this.
>
> Thanks for posting the patch, the fix is same as my expectation.
> Also, should we add the restriction to the doc? I feel [1] can be updated.

+1.

Similar to ALTER SUB, CREATE SUB also needs to be fixed. This is
because we call alter_replication_slot in CREATE SUB as well, for the
case when slot_name is provided and create_slot=false. But the tricky
part is that we call alter_replication_slot() when creating a subscription
for both failover=true and false. That means if we want to fix it along
the similar lines of ALTER SUB, we have to disallow users from executing
CREATE SUBSCRIPTION (slot_name = xx) in a txn block, which seems
to break some existing use cases (previously, a user could execute such a
command inside a txn block).
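For reference, a minimal reproduction of the rollback problem reported
above (hypothetical subscription name sub1; behavior before the proposed
fix) would be:

```sql
-- On the subscriber; sub1 already exists with a failover-enabled slot.
BEGIN;
ALTER SUBSCRIPTION sub1 SET (failover = false);
ROLLBACK;
-- The catalog change to the subscription is rolled back, but the
-- internal ALTER_REPLICATION_SLOT already ran on the publisher, so
-- there the slot's failover property stays false, out of sync with
-- the subscription:
--   SELECT slot_name, failover FROM pg_replication_slots;
```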

So, we need to think if there are better ways to fix it.  After
discussion with Hou-San offlist, here are some ideas:

1. Do not alter the replication slot's failover option when CREATE
SUBSCRIPTION is run WITH failover=false. This means we alter the
replication slot only when failover is set to true, and thus we can
disallow CREATE SUB WITH (slot_name = xx, failover=true,
create_slot=false) inside a txn block.

This option allows the user to run CREATE SUB (create_slot=false) with
failover=false in a txn block like earlier. But on the downside, it
makes the behavior inconsistent for an otherwise simple option like
failover, i.e. with failover=true, CREATE SUB is not allowed in a txn
block, while with failover=false it is allowed. This makes it slightly
harder for users to understand.

2. Let's not disallow CREATE SUB in a txn block as earlier; just don't
perform the internal alter-failover during CREATE SUB for existing
slots (create_slot=false, slot_name=xx), i.e. when create_slot is
false, we will ignore the failover parameter of CREATE SUB and it is
the user's responsibility to set it appropriately using the ALTER SUB
command. For create_slot=true, the behaviour of CREATE SUB remains the
same as earlier.

This option does not add a new restriction for CREATE SUB wrt txn blocks.
In the context of failover with create_slot=false, we already have a
similar (documented) restriction for ALTER SUB, i.e. with 'ALTER
SUBSCRIPTION SET (slot_name = new)', the user needs to alter the new
slot's failover themselves. CREATE SUB can also be documented in a
similar way. This seems simpler to understand considering existing
ALTER SUB behavior as well. Plus, this will make the CREATE SUB code
slightly simpler and thus more easily manageable.

3. Add an alter_slot option for CREATE SUBSCRIPTION; we alter the
slot's failover only if alter_slot=true. And so we can disallow CREATE
SUB WITH (slot_name = xx, alter_slot=true) inside a txn block.

This seems a clean way, as everything will be done with the user's
consent based on the alter_slot parameter. But on the downside, this
will need introducing an additional parameter and also adding a new
restriction on running CREATE SUB in a txn block for a specific case.
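Under option 3, the syntax would look something like the sketch below.
Note that alter_slot is a proposed parameter in this discussion, not an
existing one, and the connection string and names are hypothetical:

```sql
-- alter_slot here is the new parameter proposed in option 3.
CREATE SUBSCRIPTION sub1
    CONNECTION 'dbname=postgres host=primary'
    PUBLICATION pub1
    WITH (create_slot = false, slot_name = 'sub1_slot', alter_slot = true);
-- With alter_slot = true, this form would be disallowed inside a
-- transaction block, since it alters the existing slot's failover
-- property, which cannot be rolled back.
```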

4. Don't alter the replication slot in subscription DDLs at all. Instead,
try to alter the replication slot's failover in the apply worker. This
means we need to execute alter_replication_slot each time before starting
streaming in the apply worker.

It does not seem appealing to execute alter_replication_slot
every time the apply worker starts. But if others think it is a better
option, it can be analyzed further.


Currently, our preference is option 2, as that looks like a clean solution
and also aligns with the ALTER SUB behavior which is already documented.
Thoughts?


Note that we could not refer to the design of two_phase here, because
two_phase can be considered as a streaming option, so it's fine to
change the two_phase along with START_REPLICATION command. (the
two_phase is not changed in subscription DDLs, but get changed in
START_REPLICATION command). But the failover is closely related to a
replication slot itself.



Thanks
Shveta




Re: promotion related handling in pg_sync_replication_slots()

2024-04-16 Thread shveta malik
On Tue, Apr 16, 2024 at 1:55 PM Zhijie Hou (Fujitsu)
 wrote:
>
> On Tuesday, April 16, 2024 2:52 PM Bertrand Drouvot 
>  wrote:
>
>
> Hi,
>
> > On Tue, Apr 16, 2024 at 10:00:04AM +0530, shveta malik wrote:
> > > Please find v5 addressing above comments.
> >
> > Thanks!
> >
> > @@ -1634,9 +1677,14 @@ SyncReplicationSlots(WalReceiverConn *wrconn)  {
> > PG_ENSURE_ERROR_CLEANUP(slotsync_failure_callback,
> > PointerGetDatum(wrconn));
> > {
> > +   check_flags_and_set_sync_info(InvalidPid);
> > +
> >
> > Given the fact that if the sync worker is running it won't be possible to 
> > trigger a
> > manual sync with pg_sync_replication_slots(), what about also checking the
> > "sync_replication_slots" value at the start of SyncReplicationSlots() and 
> > emmit
> > an error if sync_replication_slots is set to on? (The message could 
> > explicitly
> > states that it's not possible to use the function if sync_replication_slots 
> > is set to
> > on).
>
> I personally feel adding the additional check for sync_replication_slots may
> not improve the situation here. Because the GUC sync_replication_slots can
> change at any point, the GUC could be false when performing this addition 
> check
> and is set to true immediately after the check, so It could not simplify the 
> logic
> anyway.

+1.
I feel doc and "cannot synchronize replication slots concurrently"
check should suffice.

In the scenario which Hou-San pointed out, if after performing the
GUC check in the SQL function this GUC is enabled immediately, and say
the worker is started sooner than the function could get a chance to
sync, then in that case as well the SQL function will ultimately get the
error "cannot synchronize replication slots concurrently", even though
the GUC is enabled. Thus, I feel we should stick with the same error in
all scenarios.


thanks
Shveta




Re: promotion related handling in pg_sync_replication_slots()

2024-04-16 Thread shveta malik
On Tue, Apr 16, 2024 at 12:03 PM Bertrand Drouvot
 wrote:
>
> > I think for now let's restrict their usage in parallel and make the
> > promotion behavior consistent both for worker and API.
>
> Okay, let's do it that way. Is it worth to add a few words in the doc related 
> to
> pg_sync_replication_slots() though? (to mention it can not be used if the sync
> slot worker is running).

+1. Please find v6 having the suggested doc changes.


thanks
Shveta


v6-0001-Handle-stopSignaled-during-sync-function-call.patch
Description: Binary data


Re: promotion related handling in pg_sync_replication_slots()

2024-04-15 Thread shveta malik
On Tue, Apr 16, 2024 at 9:27 AM Zhijie Hou (Fujitsu)
 wrote:
>
> On Monday, April 15, 2024 6:09 PM shveta malik  wrote:
> >
> > Please find v4 addressing the above comments.
>
> Thanks for the patch.
>
> Here are few comments:

Thanks for reviewing the patch.

>
> 1.
>
> +   ereport(ERROR,
> +   errmsg("promotion in progress, can 
> not synchronize replication slots"));
> +   }
>
> I think an errcode is needed.
>
> The style of the error message seems a bit unnatural to me. I suggest:
> "cannot synchronize replication slots when standby promotion is ongoing"

Modified.

>
> 2.
>
> +   if (worker_pid != InvalidPid)
> +   Assert(SlotSyncCtx->pid == InvalidPid);
>
> We could merge the checks into one Assert().
> Assert(SlotSyncCtx->pid == InvalidPid || worker_pid == InvalidPid);

Modified.

>
> 3.
>
> -   pqsignal(SIGINT, SignalHandlerForShutdownRequest);
>
> I realized that we should register this before setting SlotSyncCtx->pid,
> otherwise if the standby is promoted after setting pid and before registering
> signal handle function, the slotsync worker could miss to handle SIGINT sent 
> by
> startup process(ShutDownSlotSync). This is an existing issue for slotsync
> worker, but maybe we could fix it together with the patch.

Yes, it seems like a problem. Fixed it. Also to be consistent, moved
other signal handlers' registration as well before we set pid.

Please find v5 addressing above comments.

thanks
Shveta


v5-0001-Handle-stopSignaled-during-sync-function-call.patch
Description: Binary data


Re: promotion related handling in pg_sync_replication_slots()

2024-04-15 Thread shveta malik
On Mon, Apr 15, 2024 at 2:29 PM Amit Kapila  wrote:
>
> On Fri, Apr 12, 2024 at 5:25 PM shveta malik  wrote:
> >
> > Please find v3 addressing race-condition and one other comment.
> >
> > Up-thread it was suggested that,  probably, checking
> > SlotSyncCtx->syncing should be sufficient in ShutDownSlotSync(). On
> > re-thinking, it might not be. Slot sync worker sets and resets
> > 'syncing' with each sync-cycle, and thus we need to rely on worker's
> > pid in ShutDownSlotSync(), as there could be a window where promotion
> > is triggered and 'syncing' is not set for worker, while the worker is
> > still running. This implementation of setting and resetting syncing
> > with each sync-cycle looks better as compared to setting syncing
> > during the entire life-cycle of the worker. So, I did not change it.
> >
>
> To retain this we need to have different handling for 'syncing' for
> workers and function which seems like more maintenance burden than the
> value it provides. Moreover, in SyncReplicationSlots(), we are calling
> a function after acquiring spinlock which is not our usual coding
> practice.

Okay.  Changed it to consistent handling. Now both worker and SQL
function set 'syncing' when they start and reset it when they exit.

> One minor comment:
>  * All the fields except 'syncing' are used only by slotsync worker.
>  * 'syncing' is used both by worker and SQL function 
> pg_sync_replication_slots.
>  */
> typedef struct SlotSyncCtxStruct
> {
> pid_t pid;
> bool stopSignaled;
> bool syncing;
> time_t last_start_time;
> slock_t mutex;
> } SlotSyncCtxStruct;
>
> I feel the above comment is no longer valid after this patch. We can
> probably remove this altogether.

Yes, changed.

Please find v4 addressing the above comments.

thanks
Shveta


v4-0001-Handle-stopSignaled-during-sync-function-call.patch
Description: Binary data


Re: promotion related handling in pg_sync_replication_slots()

2024-04-12 Thread shveta malik
On Fri, Apr 12, 2024 at 4:18 PM Amit Kapila  wrote:
>
> On Fri, Apr 12, 2024 at 7:57 AM shveta malik  wrote:
> >
> > On Sat, Apr 6, 2024 at 12:25 PM Bharath Rupireddy
> >  wrote:
> > >
> > > On Fri, Apr 5, 2024 at 10:31 AM shveta malik  
> > > wrote:
> > > >
> > > > Please find v2. Changes are:
> > >
> > > Thanks for the patch. Here are some comments.
> >
> > Thanks for reviewing.
> > >
> > > 1. Can we have a clear saying in the shmem variable who's syncing at
> > > the moment? Is it a slot sync worker or a backend via SQL function?
> > > Perhaps turn "bool syncing;" to "SlotSyncSource sync_source;"
> > >
> > > typedef enum SlotSyncSource
> > > {
> > > SLOT_SYNC_NONE,
> > > SLOT_SYNC_WORKER,
> > > SLOT_SYNC_BACKEND,
> > > } SlotSyncSource;
> > >
> > > Then, the check in ShutDownSlotSync can be:
> > >
> > > +   /*
> > > +* Return if neither the slot sync worker is running nor the 
> > > function
> > > +* pg_sync_replication_slots() is executing.
> > > +*/
> > > +   if ((SlotSyncCtx->pid == InvalidPid) &&
> > > SlotSyncCtx->sync_source != SLOT_SYNC_BACKEND)
> > > {
> > >
>
> I don't know if this will be help, especially after fixing the race
> condition I mentioned. But otherwise, also, at this stage it doesn't
> seem helpful to add the source of sync explicitly.
>

Agreed.

Please find v3 addressing race-condition and one other comment.

Up-thread it was suggested that checking SlotSyncCtx->syncing should
probably be sufficient in ShutDownSlotSync(). On rethinking, it might
not be. The slot sync worker sets and resets 'syncing' with each sync
cycle, and thus we need to rely on the worker's pid in
ShutDownSlotSync(), as there could be a window where promotion is
triggered and 'syncing' is not set for the worker, while the worker is
still running. This implementation of setting and resetting 'syncing'
with each sync cycle looks better compared to setting 'syncing' during
the entire life cycle of the worker. So, I did not change it.

To fix the race condition, I moved the setting of the 'syncing' flag
together with the 'stopSignaled' check under the same spinlock for the
SQL function. OTOH, for the worker, I feel it is good to check
'stopSignaled' at the beginning itself, while retaining the
setting/resetting of 'syncing' at a later stage during the actual sync
cycle. This makes the handling for the SQL function and the worker
slightly different. And thus, to achieve this, I had to take the
'syncing' flag handling out of synchronize_slots() and move it to both
the worker and the SQL function by introducing 2 new functions,
check_and_set_syncing_flag() and reset_syncing_flag().
I am analyzing if there are better ways to achieve this; any
suggestions are welcome.

thanks
Shveta


v3-0001-Handle-stopSignaled-during-sync-function-call.patch
Description: Binary data


Re: promotion related handling in pg_sync_replication_slots()

2024-04-11 Thread shveta malik
On Sat, Apr 6, 2024 at 12:25 PM Bharath Rupireddy
 wrote:
>
> On Fri, Apr 5, 2024 at 10:31 AM shveta malik  wrote:
> >
> > Please find v2. Changes are:
>
> Thanks for the patch. Here are some comments.

Thanks for reviewing.
>
> 1. Can we have a clear saying in the shmem variable who's syncing at
> the moment? Is it a slot sync worker or a backend via SQL function?
> Perhaps turn "bool syncing;" to "SlotSyncSource sync_source;"
>
> typedef enum SlotSyncSource
> {
> SLOT_SYNC_NONE,
> SLOT_SYNC_WORKER,
> SLOT_SYNC_BACKEND,
> } SlotSyncSource;
>
> Then, the check in ShutDownSlotSync can be:
>
> +   /*
> +* Return if neither the slot sync worker is running nor the function
> +* pg_sync_replication_slots() is executing.
> +*/
> +   if ((SlotSyncCtx->pid == InvalidPid) &&
> SlotSyncCtx->sync_source != SLOT_SYNC_BACKEND)
> {
>
> 2.
> SyncReplicationSlots(WalReceiverConn *wrconn)
>  {
> +/*
> + * Startup process signaled the slot sync to stop, so if meanwhile user
> + * has invoked slot sync SQL function, simply return.
> + */
> +SpinLockAcquire(&SlotSyncCtx->mutex);
> +if (SlotSyncCtx->stopSignaled)
> +{
> +ereport(LOG,
> +errmsg("skipping slot synchronization as slot sync
> shutdown is signaled during promotion"));
> +
>
> Unless I'm missing something, I think this can't detect if the backend
> via SQL function is already half-way through syncing in
> synchronize_one_slot. So, better move this check to (or also have it
> there) slot sync loop that calls synchronize_one_slot. To avoid
> spinlock acquisitions, we can perhaps do this check in when we acquire
> the spinlock for synced flag.

If the sync via the SQL function is already half-way through, then
promotion should wait for it to finish. I don't think it is a good idea
to move the check to synchronize_one_slot(). The sync call should either
not start (if it notices the promotion) or finish the sync and then let
promotion proceed. But I would like to know others' opinions on this.

>
> /* Search for the named slot */
> if ((slot = SearchNamedReplicationSlot(remote_slot->name, true)))
> {
> boolsynced;
>
> > SpinLockAcquire(&slot->mutex);
> synced = slot->data.synced;
> << get SlotSyncCtx->stopSignaled here >>
> > SpinLockRelease(&slot->mutex);
>
> << do the slot sync skip check here if (stopSignaled) >>
>
> 3. Can we have a test or steps at least to check the consequences
> manually one can get if slot syncing via SQL function is happening
> during the promotion?
>
> IIUC, we need to ensure there is no backend acquiring it and
> performing sync while the slot sync worker is shutting down/standby
> promotion is occuring. Otherwise, some of the slots can get resynced
> and some are not while we are shutting down the slot sync worker as
> part of the standby promotion which might leave the slots in an
> inconsistent state.

I do not think that we can reach a state (except in some error
scenarios) where some of the slots are synced while the rest are not
during a *particular* sync cycle only because promotion is going on in
parallel. (And yes, we need to fix the race condition stated by Amit
up-thread for this statement to be true.)

thanks
Shveta




Re: promotion related handling in pg_sync_replication_slots()

2024-04-11 Thread shveta malik
On Sat, Apr 6, 2024 at 11:49 AM Amit Kapila  wrote:
>
> On Fri, Apr 5, 2024 at 10:31 AM shveta malik  wrote:
> >
> > Please find v2. Changes are:
> > 1) Rebased the patch as there was conflict due to recent commit 6f132ed.
> > 2) Added an Assert in update_synced_slots_inactive_since() to ensure
> > that the slot does not have active_pid.
> > 3) Improved commit msg and comments.
> >
>
> Few comments:
> ==
> 1.
>  void
>  SyncReplicationSlots(WalReceiverConn *wrconn)
>  {
> + /*
> + * Startup process signaled the slot sync to stop, so if meanwhile user
> + * has invoked slot sync SQL function, simply return.
> + */
> + SpinLockAcquire(&SlotSyncCtx->mutex);
> + if (SlotSyncCtx->stopSignaled)
> + {
> + ereport(LOG,
> + errmsg("skipping slot synchronization as slot sync shutdown is
> signaled during promotion"));
> +
> + SpinLockRelease(&SlotSyncCtx->mutex);
> + return;
> + }
> + SpinLockRelease(&SlotSyncCtx->mutex);
>
> There is a race condition with this code. Say during promotion
> ShutDownSlotSync() is just before setting this flag and the user has
> invoked pg_sync_replication_slots() and passed this check but still
> didn't set the SlotSyncCtx->syncing flag. So, now, the promotion would
> recognize that there is slot sync going on in parallel, and slot sync
> wouldn't know that the promotion is in progress.

Did you mean that now, the promotion *would not* recognize...

I see, I will fix this.

> The current coding for slot sync worker ensures there is no such race
> by checking the stopSignaled and setting the pid together under
> spinlock. I think we need to move the setting of the syncing flag
> similarly. Once we do that probably checking SlotSyncCtx->syncing
> should be sufficient in ShutDownSlotSync(). If we change the location
> of setting the 'syncing' flag then please ensure its cleanup as we
> currently do in slotsync_failure_callback().

Sure, let me review.

> 2.
> @@ -1395,6 +1395,7 @@ update_synced_slots_inactive_since(void)
>   if (s->in_use && s->data.synced)
>   {
>   Assert(SlotIsLogical(s));
> + Assert(s->active_pid == 0);
>
> We can add a comment like: "The slot must not be acquired by any process"
>
> --
> With Regards,
> Amit Kapila.




Re: Synchronizing slots from primary to standby

2024-04-05 Thread shveta malik
On Fri, Apr 5, 2024 at 4:31 PM Bertrand Drouvot
 wrote:
>
> BTW, I just realized that the LSN I used in my example in the 
> LSN_FORMAT_ARGS()
> are not the right ones.

Noted. Thanks.

Please find v3 with the comments addressed.

thanks
Shveta


v3-0001-Correct-sanity-check-to-compare-confirmed_lsn.patch
Description: Binary data


Re: Synchronizing slots from primary to standby

2024-04-05 Thread shveta malik
On Fri, Apr 5, 2024 at 10:09 AM Bertrand Drouvot
 wrote:
>
> What about something like?
>
> ereport(LOG,
> errmsg("synchronized confirmed_flush_lsn for slot \"%s\" differs from 
> remote slot",
> remote_slot->name),
> errdetail("Remote slot has LSN %X/%X but local slot has LSN %X/%X.",
> LSN_FORMAT_ARGS(remote_slot->restart_lsn),
> LSN_FORMAT_ARGS(slot->data.restart_lsn));
>
> Regards,

+1. Better than earlier. I will update and post the patch.

thanks
Shveta




Re: promotion related handling in pg_sync_replication_slots()

2024-04-04 Thread shveta malik
On Thu, Apr 4, 2024 at 5:17 PM Amit Kapila  wrote:
>
> On Thu, Apr 4, 2024 at 5:05 PM shveta malik  wrote:
> >
> > Hello hackers,
> >
> > Currently, promotion related handling is missing in the slot sync SQL
> > function pg_sync_replication_slots().   Here is the background on how
> > it is done in slot sync worker:
> > During promotion, the startup process in order to shut down the
> > slot-sync worker, sets the 'stopSignaled' flag, sends the shut-down
> > signal, and waits for slot sync worker to exit. Meanwhile if the
> > postmaster has not noticed the promotion yet, it may end up restarting
> > slot sync worker. In such a case, the worker exits if 'stopSignaled'
> > is set.
> >
> > Since there is a chance that the user (or any of his scripts/tools)
> > may execute SQL function pg_sync_replication_slots() in parallel to
> > promotion, such handling is needed in this SQL function as well, The
> > attached patch attempts to implement the same.
> >
>
> Thanks for the report and patch. I'll look into it.
>

Please find v2. Changes are:
1) Rebased the patch as there was conflict due to recent commit 6f132ed.
2) Added an Assert in update_synced_slots_inactive_since() to ensure
that the slot does not have active_pid.
3) Improved commit msg and comments.


thanks
Shveta


v2-0001-Handle-stopSignaled-during-sync-function-call.patch
Description: Binary data


Re: Synchronizing slots from primary to standby

2024-04-04 Thread shveta malik
On Fri, Apr 5, 2024 at 9:22 AM Bertrand Drouvot
 wrote:
>
> Hi,
>
> On Thu, Apr 04, 2024 at 05:31:45PM +0530, shveta malik wrote:
> > On Thu, Apr 4, 2024 at 2:59 PM shveta malik  wrote:
> > >
> > >
> > > Prior to commit 2ec005b, this check was okay, as we did not expect
> > > restart_lsn of the synced slot to be ahead of remote since we were
> > > directly copying the lsns. But now when we use 'advance' to do logical
> > > decoding on standby, there is a possibility that restart lsn of the
> > > synced slot is ahead of remote slot, if there are running txns records
> > > found after reaching consistent-point while consuming WALs from
> > > restart_lsn till confirmed_lsn. In such a case, slot-sync's advance
> > > may end up serializing snapshots and setting restart_lsn to the
> > > serialized snapshot point, ahead of remote one.
> > >
> > > Fix:
> > > The sanity check needs to be corrected. Attached a patch to address the 
> > > issue.
> >
>
> Thanks for reporting, explaining the issue and providing a patch.
>
> Regarding the patch:
>
> 1 ===
>
> +* Attempt to sync lsns and xmins only if remote slot is ahead of 
> local
>
> s/lsns/LSNs/?
>
> 2 ===
>
> +   if (slot->data.confirmed_flush != 
> remote_slot->confirmed_lsn)
> +   elog(LOG,
> +"could not synchronize local slot 
> \"%s\" LSN(%X/%X)"
> +" to remote slot's LSN(%X/%X) ",
> +remote_slot->name,
> +
> LSN_FORMAT_ARGS(slot->data.confirmed_flush),
> +
> LSN_FORMAT_ARGS(remote_slot->confirmed_lsn));
>
> I don't think that the message is correct here. Unless I am missing something
> there is nothing in the following code path that would prevent the slot to be
> sync during this cycle.

This is a sanity check; I will add a comment to indicate the same. We
want to ensure that if anything changes in future, we get correct logs
indicating it.
If required, the LOG message can be changed. Kindly suggest if you have
anything better in mind.

thanks
Shveta




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-04-04 Thread shveta malik
On Thu, Apr 4, 2024 at 5:53 PM Bharath Rupireddy
 wrote:
>
> On Thu, Apr 4, 2024 at 4:35 PM Amit Kapila  wrote:
> >
> > > Thanks for the changes. v34-0001 LGTM.
> >
> > I was doing a final review before pushing 0001 and found that
> > 'inactive_since' could be set twice during startup after promotion,
> > once while restoring slots and then via ShutDownSlotSync(). The reason
> > is that ShutDownSlotSync() will be invoked in normal startup on
> > primary though it won't do anything apart from setting inactive_since
> > if we have synced slots. I think you need to check 'StandbyMode' in
> > update_synced_slots_inactive_since() and return if the same is not
> > set. We can't use 'InRecovery' flag as that will be set even during
> > crash recovery.
> >
> > Can you please test this once unless you don't agree with the above theory?
>
> Nice catch. I've verified that update_synced_slots_inactive_since is
> called even for normal server startups/crash recovery. I've added a
> check to exit if the StandbyMode isn't set.
>
> Please find the attached v35 patch.

Thanks for the patch. Tested it, works well. A few cosmetic changes are needed:

in 040 test file:
1)
# Capture the inactive_since of the slot from the primary. Note that the slot
# will be inactive since the corresponding subscription is disabled..

Two dots at the end; replace with one.

2)
# Synced slot on the standby must get its own inactive_since.

The '.' is not needed in a single-line comment (to be consistent with
neighbouring comments).


3)
update_synced_slots_inactive_since():

if (!StandbyMode)
return;

It will be good to add comments here.

thanks
Shveta




Re: Synchronizing slots from primary to standby

2024-04-04 Thread shveta malik
On Thu, Apr 4, 2024 at 2:59 PM shveta malik  wrote:
>
>
> Prior to commit 2ec005b, this check was okay, as we did not expect
> restart_lsn of the synced slot to be ahead of remote since we were
> directly copying the lsns. But now when we use 'advance' to do logical
> decoding on standby, there is a possibility that restart lsn of the
> synced slot is ahead of remote slot, if there are running txns records
> found after reaching consistent-point while consuming WALs from
> restart_lsn till confirmed_lsn. In such a case, slot-sync's advance
> may end up serializing snapshots and setting restart_lsn to the
> serialized snapshot point, ahead of remote one.
>
> Fix:
> The sanity check needs to be corrected. Attached a patch to address the issue.

Please find v2, which has a detailed commit message and some more comments in the code.

thanks
Shveta


v2-0001-Correct-sanity-check-to-compare-confirmed_lsn.patch
Description: Binary data


promotion related handling in pg_sync_replication_slots()

2024-04-04 Thread shveta malik
Hello hackers,

Currently, promotion-related handling is missing in the slot sync SQL
function pg_sync_replication_slots(). Here is the background on how
it is done in the slot sync worker:
During promotion, the startup process, in order to shut down the
slot-sync worker, sets the 'stopSignaled' flag, sends the shut-down
signal, and waits for the slot sync worker to exit. Meanwhile, if the
postmaster has not noticed the promotion yet, it may end up restarting
the slot sync worker. In such a case, the worker exits if 'stopSignaled'
is set.

Since there is a chance that the user (or any of their scripts/tools)
may execute the SQL function pg_sync_replication_slots() in parallel to
promotion, such handling is needed in this SQL function as well. The
attached patch attempts to implement the same. The changes are:

1) If pg_sync_replication_slots() is already running when the
promotion is triggered, ShutDownSlotSync() checks the
'SlotSyncCtx->syncing' flag as well and waits for it to become false,
i.e., waits till the parallel running SQL function is finished.

2) If pg_sync_replication_slots() is invoked when promotion is
already in progress, it respects the 'stopSignaled' flag set by the
startup process and becomes a no-op.

thanks
Shveta


v1-0001-Handle-stopSignaled-during-sync-function-call.patch
Description: Binary data


Re: Synchronizing slots from primary to standby

2024-04-04 Thread shveta malik
On Wed, Apr 3, 2024 at 3:36 PM Amit Kapila  wrote:
>
> On Wed, Apr 3, 2024 at 11:13 AM Amit Kapila  wrote:
> >
> > On Wed, Apr 3, 2024 at 9:36 AM Bharath Rupireddy
> >  wrote:
> >
> > > I quickly looked at v8, and have a nit, rest all looks good.
> > >
> > > +if (DecodingContextReady(ctx) && found_consistent_snapshot)
> > > +*found_consistent_snapshot = true;
> > >
> > > Can the found_consistent_snapshot be checked first to help avoid the
> > > function call DecodingContextReady() for pg_replication_slot_advance
> > > callers?
> > >
> >
> > Okay, changed. Additionally, I have updated the comments and commit
> > message. I'll push this patch after some more testing.
> >
>
> Pushed!

There is an intermittent BF failure observed at [1] after this commit (2ec005b).

Analysis:
We see in the BF logs that during the first call of the sync function,
the restart_lsn of the synced slot is advanced to an LSN greater than
the remote slot's restart_lsn. And when the sync call is done a second
time without any further change on the primary, it hits the error:
  ERROR:  cannot synchronize local slot "lsub1_slot" LSN(0/360) to
remote slot's LSN(0/328) as synchronization would move it
backwards

Relevant BF logs are given at [2]. This reproduces intermittently,
depending upon whether bgwriter logs a running-txns record while the
test is running. We were able to mimic the test case to reproduce the
failure. Please see the attached bf-test.txt for steps.

Issue:
The issue is that we have a wrong sanity check based on
'restart_lsn' in synchronize_one_slot():

if (remote_slot->restart_lsn < slot->data.restart_lsn)
elog(ERROR, ...);

Prior to commit 2ec005b, this check was okay, as we did not expect the
restart_lsn of the synced slot to be ahead of the remote one since we
were directly copying the LSNs. But now that we use 'advance' to do
logical decoding on the standby, there is a possibility that the
restart_lsn of the synced slot is ahead of the remote slot, if
running-txns records are found after reaching the consistent point
while consuming WALs from restart_lsn till confirmed_lsn. In such a
case, slot-sync's advance may end up serializing snapshots and setting
restart_lsn to the serialized snapshot point, ahead of the remote one.

Fix:
The sanity check needs to be corrected. Attached a patch to address the issue.
a) The sanity check is corrected to compare confirmed_lsn rather than
restart_lsn.
Additional changes:
b) A log has been added after LogicalSlotAdvanceAndCheckSnapState() to
log the case when the local and remote slots' confirmed-lsn were not
found to be the same after sync (if at all).
c) Now we attempt to sync in update_local_synced_slot() if any of
confirmed_lsn, restart_lsn, or catalog_xmin for the remote slot is
ahead of the local slot, instead of them just being unequal.

[1]:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae=2024-04-03%2017%3A57%3A28

[2]:
2024-04-03 18:00:41.896 UTC [3239617][client backend][0/2:0] LOG:
statement: SELECT pg_sync_replication_slots();
LOG:  starting logical decoding for slot "lsub1_slot"
DETAIL:  Streaming transactions committing after 0/0, reading WAL from
0/328.
LOG:  logical decoding found consistent point at 0/328
DEBUG:  serializing snapshot to pg_logical/snapshots/0-360.snap
DEBUG:  got new restart lsn 0/360 at 0/360
LOG:  newly created slot "lsub1_slot" is sync-ready now

2024-04-03 18:00:45.218 UTC [3243487][client backend][2/2:0] LOG:
statement: SELECT pg_sync_replication_slots();
  ERROR:  cannot synchronize local slot "lsub1_slot" LSN(0/360) to
remote slot's LSN(0/328) as synchronization would move it
backwards

thanks
Shveta
Keep sync_replication_slots as disabled on standby.

--primary:
SELECT 'init' FROM pg_create_logical_replication_slot('logicalslot', 
'test_decoding', false, false, true);

SELECT pg_log_standby_snapshot();
SELECT pg_log_standby_snapshot();
SELECT pg_log_standby_snapshot();
SELECT pg_log_standby_snapshot();
select pg_sleep(2);

SELECT pg_replication_slot_advance('logicalslot', pg_current_wal_lsn());
select slot_name,database,xmin,catalog_xmin 
cxmin,restart_lsn,confirmed_flush_lsn flushLsn,wal_status,conflicting conf, 
failover,synced,temporary temp, active from pg_replication_slots order by 
slot_name;

--standby:
select pg_sync_replication_slots();
select pg_sync_replication_slots(); -- ERROR here


v1-0001-Correct-sanity-check-to-compare-confirmed_lsn.patch
Description: Binary data


Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-04-03 Thread shveta malik
On Wed, Apr 3, 2024 at 11:17 AM Bharath Rupireddy
 wrote:
>
> On Wed, Apr 3, 2024 at 8:38 AM shveta malik  wrote:
> >
> > > > Or a simple solution is that the slotsync worker updates
> > > > inactive_since as it does for non-synced slots, and disables
> > > > timeout-based slot invalidation for synced slots.
> >
> > I like this idea better, it takes care of such a case too when the
> > user is relying on sync-function rather than worker and does not want
> > to get the slots invalidated in between 2 sync function calls.
>
> Please find the attached v31 patches implementing the above idea:
>

Thanks for the patches, please find few comments:

v31-001:

1)
system-views.sgml:
value will get updated  after every synchronization from the
corresponding remote slot on the primary.

--This is confusing. It will be good to rephrase it.

2)
update_synced_slots_inactive_since()

--May be, we should mention in the header that this function is called
only during promotion.

3) 040_standby_failover_slots_sync.pl:
We capture inactive_since_on_primary when we do this for the first time at #175
ALTER SUBSCRIPTION regress_mysub1 DISABLE"

But we again recreate the sub and disable it at line #280.
Do you think we shall get inactive_since_on_primary again here, to be
compared with inactive_since_on_new_primary later?


v31-002:
(I had reviewed v29-002 but missed to post comments,  I think these
are still applicable)

1) I think replication_slot_inactivity_timeout was recommended here
(instead of replication_slot_inactive_timeout), so please give it a
thought:
https://www.postgresql.org/message-id/202403260739.udlp7lxixktx%40alvherre.pgsql

2) Commit msg:
a)
"It is often easy for developers to set a timeout of say 1
or 2 or 3 days at slot level, after which the inactive slots get
dropped."

Shall we say invalidated rather than dropped?

b)
"To achieve the above, postgres introduces a GUC allowing users
set inactive timeout and then a slot stays inactive for this much
amount of time it invalidates the slot."

Broken sentence.



thanks
Shveta




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-04-02 Thread shveta malik
On Tue, Apr 2, 2024 at 11:58 AM Bertrand Drouvot
 wrote:
>
> Hi,
>
> On Tue, Apr 02, 2024 at 12:07:54PM +0900, Masahiko Sawada wrote:
> > On Mon, Apr 1, 2024 at 12:18 PM Bharath Rupireddy
> >
> > FWIW, coming to this thread late, I think that the inactive_since
> > should not be synchronized from the primary. The wall clocks are
> > different on the primary and the standby so having the primary's
> > timestamp on the standby can confuse users, especially when there is a
> > big clock drift. Also, as Amit mentioned, inactive_since seems not to
> > be consistent with other parameters we copy. The
> > replication_slot_inactive_timeout feature should work on the standby
> > independent from the primary, like other slot invalidation mechanisms,
> > and it should be based on its own local clock.
>
> Thanks for sharing your thoughts! So, it looks like that most of us agree to 
> not
> sync inactive_since from the primary, I'm fine with that.

+1 on not syncing slots from primary.

> > If we want to invalidate the synced slots due to the timeout, I think
> > we need to define what is "inactive" for synced slots.
> >
> > Suppose that the slotsync worker updates the local (synced) slot's
> > inactive_since whenever releasing the slot, irrespective of the actual
> > LSNs (or other slot parameters) having been updated. I think that this
> > idea cannot handle a slot that is not acquired on the primary. In this
> > case, the remote slot is inactive but the local slot is regarded as
> > active.  WAL files are piled up on the standby (and on the primary) as
> > the slot's LSNs don't move forward. I think we want to regard such a
> > slot as "inactive" also on the standby and invalidate it because of
> > the timeout.
>
> I think that makes sense to somehow link inactive_since on the standby to
> the actual LSNs (or other slot parameters) being updated or not.
>
> > > > Now, the other concern is that calling GetCurrentTimestamp()
> > > > could be costly when the values for the slot are not going to be
> > > > updated but if that happens we can optimize such that before acquiring
> > > > the slot we can have some minimal pre-checks to ensure whether we need
> > > > to update the slot or not.
> >
> > If we use such pre-checks, another problem might happen; it cannot
> > handle a case where the slot is acquired on the primary but its LSNs
> > don't move forward. Imagine a logical replication conflict happened on
> > the subscriber, and the logical replication enters the retry loop. In
> > this case, the remote slot's inactive_since gets updated for every
> > retry, but it looks inactive from the standby since the slot LSNs
> > don't change. Therefore, only the local slot could be invalidated due
> > to the timeout but probably we don't want to regard such a slot as
> > "inactive".
> >
> > Another idea I came up with is that the slotsync worker updates the
> > local slot's inactive_since to the local timestamp only when the
> > remote slot might have got inactive. If the remote slot is acquired by
> > someone, the local slot's inactive_since is also NULL. If the remote
> > slot gets inactive, the slotsync worker sets the local timestamp to
> > the local slot's inactive_since. Since the remote slot could be
> > acquired and released before the slotsync worker gets the remote slot
> > data again, if the remote slot's inactive_since > the local slot's
> > inactive_since, the slotsync worker updates the local one.
>
> Then I think we would need to be careful about time zone comparison.

Yes. Also, we need to consider the case when a user is relying on
pg_sync_replication_slots() and has not enabled the slot-sync worker. In
such a case, if the synced slot's inactive_since is derived from the
inactivity of the remote slot, it might not be that frequently updated
(based on when the user actually runs the SQL function) and thus may be
misleading. OTOH, if inactive_since of synced slots represents their own
inactivity, then it will give correct info even for the case when the
SQL function is run after a long time and the slot-sync worker is
disabled.

> > IOW, we
> > detect whether the remote slot was acquired and released since the
> > last synchronization, by checking the remote slot's inactive_since.
> > This idea seems to handle these cases I mentioned unless I'm missing
> > something, but it requires for the slotsync worker to update
> > inactive_since in a different way than other parameters.
> >
> > Or a simple solution is that the slotsync worker updates
> > inactive_since as it does for non-synced slots, and disables
> > timeout-based slot invalidation for synced slots.

I like this idea better, it takes care of such a case too when the
user is relying on sync-function rather than worker and does not want
to get the slots invalidated in between 2 sync function calls.

> Yeah, I think the main question to help us decide is: do we want to invalidate
> "inactive" synced slots locally (in addition to synchronizing the invalidation
> from the primary)?


Re: Synchronizing slots from primary to standby

2024-04-01 Thread shveta malik
On Mon, Apr 1, 2024 at 5:05 PM Amit Kapila  wrote:
>
> > 2 ===
> >
> > +   {
> > +   if (SnapBuildSnapshotExists(remote_slot->restart_lsn))
> > +   {
> >
> > That could call SnapBuildSnapshotExists() multiple times for the same
> > "restart_lsn" (for example in case of multiple remote slots to sync).
> >
> > What if the sync worker records the last lsn it asks for serialization (and
> > serialized ? Then we could check that value first before deciding to call 
> > (or not)
> > SnapBuildSnapshotExists() on it?
> >
> > It's not ideal because it would record "only the last one" but that would be
> > simple enough for now (currently there is only one sync worker so that 
> > scenario
> > is likely to happen).
> >
>
> Yeah, we could do that but I am not sure how much it can help. I guess
> we could do some tests to see if it helps.

I had a look at the test results conducted by Nisha; I did not find any
repetitive restart_lsn from the primary being synced to the standby for
that particular test of 100 slots. Unless we have some concrete test in
mind (having repetitive restart_lsn), I do not think that by using the
given tests, we can establish the benefit of the suggested optimization.
Attached are the log files of the all-slots test for reference.

thanks
Shveta
 slot_name | database |  xmin   | cxmin | restart_lsn |  flushlsn  | wal_status 
| conf | failover | synced | temp 
---+--+-+---+-+++--+--++--
 slot_1| newdb1   | |   772 | 0/AD0   | 0/E43E9240 | reserved   
| f| t| f  | f
 slot_10   | newdb2   | |   772 | 0/A0002C8   | 0/E43E9240 | reserved   
| f| t| f  | f
 slot_100  | newdb20  | |   772 | 0/A005F90   | 0/E43E9240 | reserved   
| f| t| f  | f
 slot_11   | newdb3   | |   772 | 0/A000300   | 0/E43E9240 | reserved   
| f| t| f  | f
 slot_12   | newdb3   | |   772 | 0/A000338   | 0/E43E9240 | reserved   
| f| t| f  | f
 slot_13   | newdb3   | |   772 | 0/A000370   | 0/E43E9240 | reserved   
| f| t| f  | f
 slot_14   | newdb3   | |   772 | 0/A0003A8   | 0/E43E9240 | reserved   
| f| t| f  | f
 slot_15   | newdb3   | |   772 | 0/A0003E0   | 0/E43E9240 | reserved   
| f| t| f  | f
 slot_16   | newdb4   | |   772 | 0/A000418   | 0/E43E9240 | reserved   
| f| t| f  | f
 slot_17   | newdb4   | |   772 | 0/A000450   | 0/E43E9240 | reserved   
| f| t| f  | f
 slot_18   | newdb4   | |   772 | 0/A000488   | 0/E43E9240 | reserved   
| f| t| f  | f
 slot_19   | newdb4   | |   772 | 0/A0004C0   | 0/E43E9240 | reserved   
| f| t| f  | f
 slot_2| newdb1   | |   772 | 0/A000108   | 0/E43E9240 | reserved   
| f| t| f  | f
 slot_20   | newdb4   | |   772 | 0/A0004F8   | 0/E43E9240 | reserved   
| f| t| f  | f
 slot_21   | newdb5   | |   772 | 0/A000530   | 0/E43E9240 | reserved   
| f| t| f  | f
 slot_22   | newdb5   | |   772 | 0/A000568   | 0/E43E9240 | reserved   
| f| t| f  | f
 slot_23   | newdb5   | |   772 | 0/A0005A0   | 0/E43E9240 | reserved   
| f| t| f  | f
 slot_24   | newdb5   | |   772 | 0/A0005D8   | 0/E43E9240 | reserved   
| f| t| f  | f
 slot_25   | newdb5   | |   772 | 0/A000610   | 0/E43E9240 | reserved   
| f| t| f  | f
 slot_26   | newdb6   | |   772 | 0/A000648   | 0/E43E9240 | reserved   
| f| t| f  | f
 slot_27   | newdb6   | |   772 | 0/A000680   | 0/E43E9240 | reserved   
| f| t| f  | f
 slot_28   | newdb6   | |   772 | 0/A0006B8   | 0/E43E9240 | reserved   
| f| t| f  | f
 slot_29   | newdb6   | |   772 | 0/A0006F0   | 0/E43E9240 | reserved   
| f| t| f  | f
 slot_3| newdb1   | |   772 | 0/A000140   | 0/E43E9240 | reserved   
| f| t| f  | f
 slot_30   | newdb6   | |   772 | 0/A000728   | 0/E43E9240 | reserved   
| f| t| f  | f
 slot_31   | newdb7   | |   772 | 0/A000760   | 0/E43E9240 | reserved   
| f| t| f  | f
 slot_32   | newdb7   | |   772 | 0/A000798   | 0/E43E9240 | reserved   
| f| t| f  | f
 slot_33   | newdb7   | |   772 | 0/A0007D0   | 0/E43E9240 | reserved   
| f| t| f  | f
 slot_34   | newdb7   | |   772 | 0/A000808   | 0/E43E9240 | reserved   
| f| t| f  | f
 slot_35   | newdb7   | |   772 | 0/A000840   | 0/E43E9240 | reserved   
| f| t| f  | f
 slot_36   | newdb8   | |   772 | 0/A000878   | 0/E43E9240 | reserved   
| f| t| f  | f
 slot_37   | 

Re: Synchronizing slots from primary to standby

2024-04-01 Thread shveta malik
On Mon, Apr 1, 2024 at 10:40 AM Amit Kapila  wrote:
>
> On Mon, Apr 1, 2024 at 10:01 AM Bharath Rupireddy
>  wrote:
> >
> > On Thu, Mar 28, 2024 at 10:08 AM Zhijie Hou (Fujitsu)
> >  wrote:
> > >
> > > [2] The steps to reproduce the data miss issue on a primary->standby 
> > > setup:
> >
> > I'm trying to reproduce the problem with [1], but I can see the
> > changes after the standby is promoted. Am I missing anything here?
> >
> > ubuntu:~/postgres/pg17/bin$ ./psql -d postgres -p 5433 -c "select *
> > from pg_logical_slot_get_changes('lrep_sync_slot', NULL, NULL);"
> > lsn| xid |data
> > ---+-+-
> >  0/3B0 | 738 | BEGIN 738
> >  0/3017FC8 | 738 | COMMIT 738
> >  0/3017FF8 | 739 | BEGIN 739
> >  0/3019A38 | 739 | COMMIT 739
> >  0/3019A38 | 740 | BEGIN 740
> >  0/3019A38 | 740 | table public.dummy1: INSERT: a[integer]:999
> >  0/3019AA8 | 740 | COMMIT 740
> > (7 rows)
> >
> > [1]
> > -#define LOG_SNAPSHOT_INTERVAL_MS 15000
> > +#define LOG_SNAPSHOT_INTERVAL_MS 150
> >
> > ./initdb -D db17
> > echo "archive_mode = on
> > archive_command='cp %p /home/ubuntu/postgres/pg17/bin/archived_wal/%f'
> > wal_level='logical'
> > autovacuum = off
> > checkpoint_timeout='1h'" | tee -a db17/postgresql.conf
> >
> > ./pg_ctl -D db17 -l logfile17 start
> >
> > rm -rf sbdata logfilesbdata
> > ./pg_basebackup -D sbdata
> >
> > ./psql -d postgres -p 5432 -c "SELECT
> > pg_create_logical_replication_slot('lrep_sync_slot', 'test_decoding',
> > false, false, true);"
> > ./psql -d postgres -p 5432 -c "SELECT
> > pg_create_physical_replication_slot('phy_repl_slot', true, false);"
> >
> > echo "port=5433
> > primary_conninfo='host=localhost port=5432 dbname=postgres user=ubuntu'
> > primary_slot_name='phy_repl_slot'
> > restore_command='cp /home/ubuntu/postgres/pg17/bin/archived_wal/%f %p'
> > hot_standby_feedback=on
> > sync_replication_slots=on" | tee -a sbdata/postgresql.conf
> >
> > touch sbdata/standby.signal
> >
> > ./pg_ctl -D sbdata -l logfilesbdata start
> > ./psql -d postgres -p 5433 -c "SELECT pg_is_in_recovery();"
> >
> > ./psql -d postgres
> >
> > SESSION1, TXN1
> > BEGIN;
> > create table dummy1(a int);
> >
> > SESSION2, TXN2
> > SELECT pg_log_standby_snapshot();
> >
> > SESSION1, TXN1
> > COMMIT;
> >
> > SESSION1, TXN1
> > BEGIN;
> > create table dummy2(a int);
> >
> > SESSION2, TXN2
> > SELECT pg_log_standby_snapshot();
> >
> > SESSION1, TXN1
> > COMMIT;
> >
> > ./psql -d postgres -p 5432 -c "SELECT
> > pg_replication_slot_advance('lrep_sync_slot', pg_current_wal_lsn());"
> >
>
> After this step and before the next, did you ensure that the slot sync
> has synced the latest confirmed_flush/restart LSNs? You can query:
> "select slot_name,restart_lsn, confirmed_flush_lsn from
> pg_replication_slots;" to ensure the same on both the primary and
> standby.

+1. To ensure last sync, one can run this manually on standby just
before promotion :
SELECT pg_sync_replication_slots();

thanks

Shveta




Re: Synchronizing slots from primary to standby

2024-03-28 Thread shveta malik
On Fri, Mar 29, 2024 at 6:36 AM Zhijie Hou (Fujitsu)
 wrote:
>
> Attach a new version patch which fixed an un-initialized variable issue and
> added some comments. Also, temporarily enable DEBUG2 for the 040 tap-test so 
> that
> we can analyze the possible CFbot failures easily.

As suggested by Amit in [1], for the fix being discussed where we need
to advance the synced slot on the standby, we need to skip the dbid
check in fast_forward mode in CreateDecodingContext(). We tried a few
tests to make sure that there is no table access during fast-forward
mode:

1) Initially we tried avoiding database-id check in
CreateDecodingContext() only when called by
pg_logical_replication_slot_advance(). 'make check-world' passed on
HEAD for the same.

2) But the more generic solution was to skip the database check if
"fast_forward" is true. It was tried and 'make check-world' passed on
HEAD for that as well.

3) Another thing tried by Hou-San was to run pgbench after skipping the
db check in the fast_forward logical decoding case.
pgbench was run to generate some changes, and then the logical slot was
advanced to the latest position in another database. A LOG was added
in relation_open to catch table access. It was found that there was no
table access in fast-forward logical decoding, i.e., no LOGs for
relation-open were generated during the test. Steps are given at [2].

[1]: 
https://www.postgresql.org/message-id/CAA4eK1KMiKangJa4NH_K1oFc87Y01n3rnpuwYagT59Y%3DADW8Dw%40mail.gmail.com

[2]:
--
1. apply the DEBUG patch (attached as .txt) which will log the
relation open and table cache access.

2. create a slot:
SELECT 'init' FROM pg_create_logical_replication_slot('logicalslot',
'test_decoding', false, false, true);

3. run pgbench to generate some data.
pgbench -i postgres
pgbench --aggregate-interval=5 --time=5 --client=10 --log --rate=1000
--latency-limit=10 --failures-detailed --max-tries=10 postgres

4. start a fresh session in a different db and advance the slot to the
latest position. There should be no relation open or CatCache log
between the LOG "starting logical decoding for slot .." and LOG
"decoding over".
SELECT pg_replication_slot_advance('logicalslot', pg_current_wal_lsn());
--

thanks
Shveta
From 5386894faa14c0de9854e0eee9679f8eea775f65 Mon Sep 17 00:00:00 2001
From: Hou Zhijie 
Date: Fri, 29 Mar 2024 11:46:36 +0800
Subject: [PATCH] debug log

---
 src/backend/access/common/relation.c | 2 ++
 src/backend/replication/slotfuncs.c  | 1 +
 src/backend/utils/cache/catcache.c   | 1 +
 3 files changed, 4 insertions(+)

diff --git a/src/backend/access/common/relation.c 
b/src/backend/access/common/relation.c
index d8a313a2c9..40718fc47e 100644
--- a/src/backend/access/common/relation.c
+++ b/src/backend/access/common/relation.c
@@ -50,6 +50,7 @@ relation_open(Oid relationId, LOCKMODE lockmode)
 
Assert(lockmode >= NoLock && lockmode < MAX_LOCKMODES);
 
+   elog(LOG, "relation_open");
/* Get the lock before trying to open the relcache entry */
if (lockmode != NoLock)
LockRelationOid(relationId, lockmode);
@@ -91,6 +92,7 @@ try_relation_open(Oid relationId, LOCKMODE lockmode)
 
Assert(lockmode >= NoLock && lockmode < MAX_LOCKMODES);
 
+   elog(LOG, "try_relation_open");
/* Get the lock first */
if (lockmode != NoLock)
LockRelationOid(relationId, lockmode);
diff --git a/src/backend/replication/slotfuncs.c 
b/src/backend/replication/slotfuncs.c
index ef5081784c..564b36fc45 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -608,6 +608,7 @@ pg_logical_replication_slot_advance(XLogRecPtr moveto,
 
/* free context, call shutdown callback */
FreeDecodingContext(ctx);
+   elog(LOG, "decoding over");
 
InvalidateSystemCaches();
}
diff --git a/src/backend/utils/cache/catcache.c 
b/src/backend/utils/cache/catcache.c
index 569f51cb33..e19c586697 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -1328,6 +1328,7 @@ SearchCatCacheInternal(CatCache *cache,
 
Assert(cache->cc_nkeys == nkeys);
 
+   elog(LOG, "SearchCatCacheInternal");
/*
 * one-time startup overhead for each cache
 */
-- 
2.31.1



Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-27 Thread shveta malik
On Wed, Mar 27, 2024 at 9:00 PM Bharath Rupireddy
 wrote:
>
> Thanks. I'm attaching v29 patches. 0001 managing inactive_since on the
> standby for sync slots. 0002 implementing inactive timeout GUC based
> invalidation mechanism.
>
> Please have a look.

Thanks for the patches. v29-001 looks good to me.

thanks
Shveta




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-27 Thread shveta malik
On Wed, Mar 27, 2024 at 2:55 PM Bharath Rupireddy
 wrote:
>
> On Wed, Mar 27, 2024 at 11:39 AM shveta malik  wrote:
> >
> > Thanks for the patch. Few trivial things:
>
> Thanks for reviewing.
>
> > --
> > 1)
> > system-views.sgml:
> >
> > a) "Note that the slots" --> "Note that the slots on the standbys,"
> > --it is good to mention "standbys" as synced could be true on primary
> > as well (promoted standby)
>
> Done.
>
> > b) If you plan to add more info which Bertrand suggested, then it will
> > be better to make a <note> section instead of using "Note"
>
> I added the note that Bertrand specified upthread. But, I couldn't
> find an instance of adding <note> ... </note> within a table. Hence went
> with "Note that ..." statements just like any other notes in
> system-views.sgml. pg_replication_slot in system-views.sgml renders as a
> table, so having <note> ... </note> may not be a great idea.
>
> > 2)
> > commit msg:
> >
> > "The impact of this
> > on a promoted standby inactive_since is always NULL for all
> > synced slots even after server restart.
> > "
> > Sentence looks broken.
> > -
>
> Reworded.
>
> > Apart from the above trivial things, v26-001 looks good to me.
>
> Please check the attached v27 patch which also has Bertrand's comment
> on deduplicating the TAP function. I've now moved it to Cluster.pm.
>

Thanks for the patch. Regarding the doc, I have a few comments.

+Note that the slots on the standbys that are being synced from a
+primary server (whose synced field is
+true), will get the
+inactive_since value from the
+corresponding remote slot on the primary. Also, note that for the
+synced slots on the standby, after the standby starts up (i.e. after
+the slots are loaded from the disk), the inactive_since will remain
+zero until the next slot sync cycle.

a)  "inactive_since will remain  zero"
Since it is user-exposed info and the user finds it NULL in
pg_replication_slots, shall we mention NULL instead of 0?

b) Since we are referring to the sync cycle here, I feel it will be
good to give a link to that page.
+zero until the next slot sync cycle (see
+ for
+slot synchronization details).

thanks
Shveta




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-27 Thread shveta malik
On Wed, Mar 27, 2024 at 11:05 AM Bharath Rupireddy
 wrote:
>
> Fixed an issue in synchronize_slots where DatumGetLSN is being used in
> place of DatumGetTimestampTz. Found this via CF bot member [1], not on
> my dev system.
>
> Please find the attached v6 patch.

Thanks for the patch. Few trivial things:

--
1)
system-views.sgml:

a) "Note that the slots" --> "Note that the slots on the standbys,"
--it is good to mention "standbys" as synced could be true on primary
as well (promoted standby)

b) If you plan to add more info which Bertrand suggested, then it will
be better to make a <note> section instead of using "Note"

2)
commit msg:

"The impact of this
on a promoted standby inactive_since is always NULL for all
synced slots even after server restart.
"
Sentence looks broken.
-

Apart from the above trivial things, v26-001 looks good to me.

thanks
Shveta




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-26 Thread shveta malik
On Tue, Mar 26, 2024 at 6:05 PM Bertrand Drouvot
 wrote:
>
>
> > We can think on that later if we really need another
> > field which give us sync time.
>
> I think that calling GetCurrentTimestamp() so frequently could be too costly, 
> so
> I'm not sure we should.

Agreed.

> > In my second approach, I have tried to
> > avoid updating inactive_since for synced slots during sync process. We
> > update that field during creation of synced slot so that
> > inactive_since reflects correct info even for synced slots (rather
> > than copying from primary).
>
> Yeah, and I think we could create a dedicated field with this information
> if we feel the need.

Okay.

thanks
Shveta




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-26 Thread shveta malik
On Wed, Mar 27, 2024 at 10:22 AM Amit Kapila  wrote:
>
> On Wed, Mar 27, 2024 at 10:08 AM Bharath Rupireddy
>  wrote:
> >
> > On Tue, Mar 26, 2024 at 11:22 PM Bertrand Drouvot
> >  wrote:
> >
> > > 3)
> > > update_synced_slots_inactive_time():
> > >
> > > This assert is removed, is it intentional?
> > > Assert(s->active_pid == 0);
> >
> > Yes, the slot can get acquired in the corner case when someone runs
> > pg_sync_replication_slots concurrently at this time. I'm referring to
> > the issue reported upthread. We don't prevent one running
> > pg_sync_replication_slots in promotion/ShutDownSlotSync phase right?
> > Maybe we should prevent that otherwise some of the slots are synced
> > and the standby gets promoted while others are yet-to-be-synced.
> >
>
> We should do something about it but that shouldn't be done in this
> patch. We can handle it separately and then add such an assert.

Agreed. Once this patch is concluded, I can fix the slot sync shutdown
issue and will also add this 'assert' back.

thanks
Shveta




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-26 Thread shveta malik
On Tue, Mar 26, 2024 at 9:59 PM Bharath Rupireddy
 wrote:
>
> On Tue, Mar 26, 2024 at 4:35 PM Bharath Rupireddy
>  wrote:
> >
> > If we just sync inactive_since value for synced slots while in
> > recovery from the primary, so be it. Why do we need to update it to
> > the current time when the slot is being created? We don't expose slot
> > creation time, no? Aren't we fine if we just sync the value from
> > primary and document that fact? After the promotion, we can reset it
> > to the current time so that it gets its own time.
>
> I'm attaching v24 patches. It implements the above idea proposed
> upthread for synced slots. I've now separated
> s/last_inactive_time/inactive_since and synced slots behaviour. Please
> have a look.

Thanks for the patches. Few trivial comments for v24-002:

1)
slot.c:
+ * data from the remote slot. We use InRecovery flag instead of
+ * RecoveryInProgress() as it always returns true even for normal
+ * server startup.

a) Not clear what 'it' refers to. Better to use 'the latter'
b) Is it better to mention the primary here:
 'as the latter always returns true even on the primary server during startup'.


2)
update_local_synced_slot():

- strcmp(remote_slot->plugin, NameStr(slot->data.plugin)) == 0)
+ strcmp(remote_slot->plugin, NameStr(slot->data.plugin)) == 0 &&
+ remote_slot->inactive_since == slot->inactive_since)

When this code was written initially, the intent was to do strcmp at
the end (only if absolutely needed). It will be good if we maintain
the same and add new checks before strcmp.

3)
update_synced_slots_inactive_time():

This assert is removed, is it intentional?
Assert(s->active_pid == 0);


4)
040_standby_failover_slots_sync.pl:

+# Capture the inactive_since of the slot from the standby the logical failover
+# slots are synced/created on the standby.

The comment is unclear, something seems missing.

thanks
Shveta




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-26 Thread shveta malik
On Tue, Mar 26, 2024 at 4:35 PM Bharath Rupireddy
 wrote:
>
> On Tue, Mar 26, 2024 at 4:18 PM shveta malik  wrote:
> >
> > > What about another approach?: inactive_since gives data synced from 
> > > primary for
> > > synced slots and another dedicated field (could be added later...) could
> > > represent what you suggest as the other option.
> >
> > Yes, okay with me. I think there is some confusion here as well. In my
> > second approach above, I have not suggested anything related to
> > sync-worker. We can think on that later if we really need another
> > field which give us sync time.  In my second approach, I have tried to
> > avoid updating inactive_since for synced slots during sync process. We
> > update that field during creation of synced slot so that
> > inactive_since reflects correct info even for synced slots (rather
> > than copying from primary). Please have a look at my patch and let me
> > know your thoughts. I am fine with copying it from primary as well and
> > documenting this behaviour.
>
> I took a look at your patch.
>
> --- a/src/backend/replication/logical/slotsync.c
> +++ b/src/backend/replication/logical/slotsync.c
> @@ -628,6 +628,7 @@ synchronize_one_slot(RemoteSlot *remote_slot, Oid
> remote_dbid)
>  SpinLockAcquire(&slot->mutex);
>  slot->effective_catalog_xmin = xmin_horizon;
>  slot->data.catalog_xmin = xmin_horizon;
> +slot->inactive_since = GetCurrentTimestamp();
>  SpinLockRelease(&slot->mutex);
>
> If we just sync inactive_since value for synced slots while in
> recovery from the primary, so be it. Why do we need to update it to
> the current time when the slot is being created?

If we update inactive_since at a synced slot's creation or during
restart (skipping setting it during sync), then this time reflects the
actual 'inactive_since' for that particular synced slot. Isn't that
clearer info for the user and in alignment with what the name
'inactive_since' actually suggests?

> We don't expose slot
> creation time, no?

No, we don't. But for a synced slot, that is the time since the slot
has been inactive (unless promoted), so we are exposing inactive_since
and not creation time.

>Aren't we fine if we just sync the value from
> primary and document that fact? After the promotion, we can reset it
> to the current time so that it gets its own time. Do you see any
> issues with it?

Yes, we can do that. But I am curious to know: do we see any
additional benefit of reflecting the primary's inactive_since on the
standby which I might be missing?

thanks
Shveta




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-26 Thread shveta malik
On Tue, Mar 26, 2024 at 3:50 PM Bertrand Drouvot
 wrote:
>
> Hi,
>
> > I think there may have been some misunderstanding here.
>
> Indeed ;-)
>
> > But now if I
> > rethink this, I am fine with 'inactive_since' getting synced from
> > primary to standby. But if we do that, we need to add docs stating
> > "inactive_since" represents primary's inactivity and not standby's
> > slots inactivity for synced slots.
>
> Yeah sure.
>
> > The reason for this clarification
> > is that the synced slot might be generated much later, yet
> > 'inactive_since' is synced from the primary, potentially indicating a
> > time considerably earlier than when the synced slot was actually
> > created.
>
> Right.
>
> > Another approach could be that "inactive_since" for synced slot
> > actually gives its own inactivity data rather than giving primary's
> > slot data. We update inactive_since on standby only at 3 occasions:
> > 1) at the time of creation of the synced slot.
> > 2) during standby restart.
> > 3) during promotion of standby.
> >
> > I have attached a sample patch for this idea as.txt file.
>
> Thanks!
>
> > I am fine with any of these approaches.  One gives data synced from
> > primary for synced slots, while another gives actual inactivity data
> > of synced slots.
>
> What about another approach?: inactive_since gives data synced from primary 
> for
> synced slots and another dedicated field (could be added later...) could
> represent what you suggest as the other option.

Yes, okay with me. I think there is some confusion here as well. In my
second approach above, I have not suggested anything related to the
sync-worker. We can think about that later if we really need another
field which gives us the sync time. In my second approach, I have
tried to avoid updating inactive_since for synced slots during the
sync process. We update that field during creation of the synced slot
so that inactive_since reflects correct info even for synced slots
(rather than copying from the primary). Please have a look at my patch
and let me know your thoughts. I am fine with copying it from the
primary as well and documenting this behaviour.

> Another cons of updating inactive_since at the current time during each slot
> sync cycle is that calling GetCurrentTimestamp() very frequently
> (during each sync cycle of very active slots) could be too costly.

Right.

thanks
Shveta




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-26 Thread shveta malik
On Tue, Mar 26, 2024 at 1:54 PM Bertrand Drouvot
 wrote:
>
> Hi,
>
> On Tue, Mar 26, 2024 at 01:37:21PM +0530, Amit Kapila wrote:
> > On Tue, Mar 26, 2024 at 1:15 PM Bertrand Drouvot
> >  wrote:
> > >
> > > 2 ===
> > >
> > > It looks like inactive_since is set to the current timestamp on the 
> > > standby
> > > each time the sync worker does a cycle:
> > >
> > > primary:
> > >
> > > postgres=# select slot_name,inactive_since from pg_replication_slots 
> > > where failover = 't';
> > >   slot_name  |inactive_since
> > > -+---
> > >  lsub27_slot | 2024-03-26 07:39:19.745517+00
> > >  lsub28_slot | 2024-03-26 07:40:24.953826+00
> > >
> > > standby:
> > >
> > > postgres=# select slot_name,inactive_since from pg_replication_slots 
> > > where failover = 't';
> > >   slot_name  |inactive_since
> > > -+---
> > >  lsub27_slot | 2024-03-26 07:43:56.387324+00
> > >  lsub28_slot | 2024-03-26 07:43:56.387338+00
> > >
> > > I don't think that should be the case.
> > >
> >
> > But why? This is exactly what we discussed in another thread where we
> > agreed to update inactive_since even for sync slots.
>
> Hum, I thought we agreed to "sync" it and to "update it to current time"
> only at promotion time.

I think there may have been some misunderstanding here. But now if I
rethink this, I am fine with 'inactive_since' getting synced from
primary to standby. But if we do that, we need to add docs stating
"inactive_since" represents primary's inactivity and not standby's
slots inactivity for synced slots. The reason for this clarification
is that the synced slot might be generated much later, yet
'inactive_since' is synced from the primary, potentially indicating a
time considerably earlier than when the synced slot was actually
created.

Another approach could be that "inactive_since" for synced slot
actually gives its own inactivity data rather than giving primary's
slot data. We update inactive_since on standby only at 3 occasions:
1) at the time of creation of the synced slot.
2) during standby restart.
3) during promotion of standby.

I have attached a sample patch for this idea as a .txt file.

I am fine with any of these approaches.  One gives data synced from
primary for synced slots, while another gives actual inactivity data
of synced slots.

thanks
Shveta
From 7dcd0e95299263187eb1f03812f8321b2612ee5c Mon Sep 17 00:00:00 2001
From: Shveta Malik 
Date: Tue, 26 Mar 2024 14:42:25 +0530
Subject: [PATCH v1] inactive_since for synced slots.

inactive_since is updated for synced slots:
1) at the time of creation of slot.
2) during server restart.
3) during promotion.
---
 src/backend/replication/logical/slotsync.c |  1 +
 src/backend/replication/slot.c             | 15 ++++++++++++---
 2 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/src/backend/replication/logical/slotsync.c 
b/src/backend/replication/logical/slotsync.c
index bbf9a2c485..6114895dca 100644
--- a/src/backend/replication/logical/slotsync.c
+++ b/src/backend/replication/logical/slotsync.c
@@ -628,6 +628,7 @@ synchronize_one_slot(RemoteSlot *remote_slot, Oid 
remote_dbid)
	SpinLockAcquire(&slot->mutex);
slot->effective_catalog_xmin = xmin_horizon;
slot->data.catalog_xmin = xmin_horizon;
+   slot->inactive_since = GetCurrentTimestamp();
	SpinLockRelease(&slot->mutex);
ReplicationSlotsComputeRequiredXmin(true);
LWLockRelease(ProcArrayLock);
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index d0a2f440ef..f2a57a14ec 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -628,7 +628,10 @@ retry:
 * now.
 */
	SpinLockAcquire(&s->mutex);
-   s->inactive_since = 0;
+
+   if (!(RecoveryInProgress() && s->data.synced))
+   s->inactive_since = 0;
+
	SpinLockRelease(&s->mutex);
 
if (am_walsender)
@@ -704,14 +707,20 @@ ReplicationSlotRelease(void)
 */
	SpinLockAcquire(&slot->mutex);
slot->active_pid = 0;
-   slot->inactive_since = now;
+
+   if (!(RecoveryInProgress() && slot->data.synced))
+   slot->inactive_since = now;
+
	SpinLockRelease(&slot->mutex);
ConditionVariableBroadcast(>active_cv);
}
else
{
	SpinLockAcquire(&slot->mutex);
-   slot->inactive_since = now;
+
+   if (!(RecoveryInProgress() && slot->data.synced))
+   slot->inactive_since = now;
+
	SpinLockRelease(&slot->mutex);
}
 
-- 
2.34.1



Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-26 Thread shveta malik
On Tue, Mar 26, 2024 at 2:27 PM Bharath Rupireddy
 wrote:
>
> >
> > 1)
> > Commti msg:
> >
> > ensures the value is set to current timestamp during the
> > shutdown to help correctly interpret the time if the standby gets
> > promoted without a restart.
> >
> > shutdown --> shutdown of slot sync worker   (as it was not clear if it
> > is instance shutdown or something else)
>
> Changed it to "shutdown of slot sync machinery" to be consistent with
> the comments.

Thanks for addressing the comments. Just to give more clarity here (so
that you take an informed decision): I am not sure if we actually shut
down the slot-sync machinery. We only shut down the slot sync worker.
The slot-sync machinery can still be used through the
'pg_sync_replication_slots' SQL function. I can easily reproduce the
scenario where the SQL function and reset_synced_slots_info() run in
parallel and the latter hits 'Assert(s->active_pid == 0)' because the
concurrent SQL sync function is active on that slot.

thanks
Shveta




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-26 Thread shveta malik
On Tue, Mar 26, 2024 at 11:08 AM Bharath Rupireddy
 wrote:
>
> On Tue, Mar 26, 2024 at 9:30 AM shveta malik  wrote:
> >
> > On Mon, Mar 25, 2024 at 12:43 PM shveta malik  
> > wrote:
> > >
> > > I have one concern, for synced slots on standby, how do we disallow
> > > invalidation due to inactive-timeout immediately after promotion?
> > >
> > > For synced slots, last_inactive_time and inactive_timeout are both
> > > set. Let's say I bring down primary for promotion of standby and then
> > > promote standby, there are chances that it may end up invalidating
> > > synced slots (considering standby is not brought down during promotion
> > > and thus inactive_timeout may already be past 'last_inactive_time').
> > >
> >
> > On standby, if we decide to maintain valid last_inactive_time for
> > synced slots, then invalidation is correctly restricted in
> > InvalidateSlotForInactiveTimeout() for synced slots using the check:
> >
> > if (RecoveryInProgress() && slot->data.synced)
> > return false;
> >
> > But immediately after promotion, we can not rely on the above check
> > and thus possibility of synced slots invalidation is there. To
> > maintain consistent behavior regarding the setting of
> > last_inactive_time for synced slots, similar to user slots, one
> > potential solution to prevent this invalidation issue is to update the
> > last_inactive_time of all synced slots within the ShutDownSlotSync()
> > function during FinishWalRecovery(). This approach ensures that
> > promotion doesn't immediately invalidate slots, and henceforth, we
> > possess a correct last_inactive_time as a basis for invalidation going
> > forward. This will be equivalent to updating last_inactive_time during
> > restart (but without actual restart during promotion).
> > The plus point of maintaining last_inactive_time for synced slots
> > could be, this can provide data to the user on when last time the sync
> > was attempted on that particular slot by background slot sync worker
> > or SQl function. Thoughts?
>
> Please find the attached v21 patch implementing the above idea. It
> also has changes for renaming last_inactive_time to inactive_since.
>

Thanks for the patch. I have tested this patch alone, and it does what
it says. One additional thing which I noticed is that now it sets
inactive_since for temp slots as well, but that idea looks fine to me.

I could not test 'invalidation on promotion bug' with this change, as
that needed rebasing of the rest of the patches.

Few trivial things:

1)
Commit msg:

ensures the value is set to current timestamp during the
shutdown to help correctly interpret the time if the standby gets
promoted without a restart.

shutdown --> shutdown of slot sync worker   (as it was not clear if it
is instance shutdown or something else)

2)
'The time since the slot has became inactive'.

has became-->has become
or just became

Please check it in all the files. There are multiple places.

thanks
Shveta




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-26 Thread shveta malik
On Tue, Mar 26, 2024 at 11:36 AM Bertrand Drouvot
 wrote:
> >
> > The issue that I can see with your proposal is: what if one synced the slots
> > manually (with pg_sync_replication_slots()) but does not use the sync 
> > worker?
> > Then I think ShutDownSlotSync() is not going to help in that case.
>
> It looks like ShutDownSlotSync() is always called (even if 
> sync_replication_slots = off),
> so that sounds ok to me (I should have checked the code, I was under the 
> impression
> ShutDownSlotSync() was not called if sync_replication_slots = off).

Right, it is called irrespective of sync_replication_slots.

thanks
Shveta




Re: pgsql: Track last_inactive_time in pg_replication_slots.

2024-03-25 Thread shveta malik
On Tue, Mar 26, 2024 at 1:50 AM Bharath Rupireddy
 wrote:
>
> On Tue, Mar 26, 2024 at 1:30 AM Nathan Bossart  
> wrote:
> >
> > On Mon, Mar 25, 2024 at 04:49:12PM +, Bertrand Drouvot wrote:
> > > On Mon, Mar 25, 2024 at 12:25:37PM -0400, Robert Haas wrote:
> > >> In the same vein, I think deactivated_at or inactive_since might be
> > >> good names to consider. I think they get at the same thing as
> > >> released_time, but they avoid introducing a completely new word
> > >> (release, as opposed to active/inactive).
> > >
> > > Yeah, I'd vote for inactive_since then.
> >
> > Having only skimmed some of the related discussions, I'm inclined to agree
> > that inactive_since provides the clearest description for the column.
>
> I think we all have some agreement on inactive_since. So, I'm
> attaching the patch for that change.

pg_proc.dat needs to be changed to refer to 'inactive_since' instead
of 'last_inactive_time' in the attached patch.

thanks
Shveta




Re: pgsql: Track last_inactive_time in pg_replication_slots.

2024-03-25 Thread shveta malik
On Mon, Mar 25, 2024 at 9:54 PM Bertrand Drouvot
 wrote:
>
> Hi,
>
> On Mon, Mar 25, 2024 at 07:32:11PM +0530, Amit Kapila wrote:
> > On Mon, Mar 25, 2024 at 6:57 PM Robert Haas  wrote:
> > > And I'm suspicious that having an exception for slots being synced is
> > > a bad idea. That makes too much of a judgement about how the user will
> > > use this field. It's usually better to just expose the data, and if
> > > the user needs helps to make sense of that data, then give them that
> > > help separately.
> >
> > The reason we didn't set this for sync slots is that they won't be
> > usable (one can't use them to decode WAL) unless standby is promoted
> > [2]. But I see your point as well. So, I have copied the others
> > involved in this discussion to see what they think.
>
> Yeah I also see Robert's point. If we also sync the "last inactive time" 
> field then
> we would need to take care of the corner case mentioned by Shveta in [1] 
> during
> promotion.

I have suggested one potential solution for that in [1]. Please have a look.

[1]: 
https://www.postgresql.org/message-id/CAJpy0uB-yE%2BRiw7JQ4hW0%2BigJxvPc%2Brq%2B9c7WyTa1Jz7%2B2gAiA%40mail.gmail.com

thanks
Shveta




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-25 Thread shveta malik
On Mon, Mar 25, 2024 at 12:43 PM shveta malik  wrote:
>
> I have one concern, for synced slots on standby, how do we disallow
> invalidation due to inactive-timeout immediately after promotion?
>
> For synced slots, last_inactive_time and inactive_timeout are both
> set. Let's say I bring down primary for promotion of standby and then
> promote standby, there are chances that it may end up invalidating
> synced slots (considering standby is not brought down during promotion
> and thus inactive_timeout may already be past 'last_inactive_time').
>

On standby, if we decide to maintain a valid last_inactive_time for
synced slots, then invalidation is correctly restricted in
InvalidateSlotForInactiveTimeout() for synced slots using the check:

if (RecoveryInProgress() && slot->data.synced)
return false;

But immediately after promotion, we cannot rely on the above check
and thus the possibility of synced slot invalidation is there. To
maintain consistent behavior regarding the setting of
last_inactive_time for synced slots, similar to user slots, one
potential solution to prevent this invalidation issue is to update the
last_inactive_time of all synced slots within the ShutDownSlotSync()
function during FinishWalRecovery(). This approach ensures that
promotion doesn't immediately invalidate slots, and henceforth, we
possess a correct last_inactive_time as a basis for invalidation going
forward. This will be equivalent to updating last_inactive_time during
restart (but without an actual restart during promotion).
The plus point of maintaining last_inactive_time for synced slots
could be that it provides data to the user on when the sync was last
attempted on that particular slot by the background slot sync worker
or SQL function. Thoughts?

thanks
Shveta




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-25 Thread shveta malik
On Mon, Mar 25, 2024 at 5:10 PM Amit Kapila  wrote:
>
> I think we should keep pg_alter_replication_slot() as the last
> priority among the remaining patches for this release. Let's try to
> first finish the primary functionality of inactive_timeout patch.
> Otherwise, I agree that the problem reported by you should be fixed.

Noted. Will focus on v18-002 patch now.

I was debugging the flow and just noticed that RecoveryInProgress()
always returns 'true' during
StartupReplicationSlots()-->RestoreSlotFromDisk() (even on primary) as
'xlogctl->SharedRecoveryState' is always 'RECOVERY_STATE_CRASH' at
that time. The 'xlogctl->SharedRecoveryState' is changed  to
'RECOVERY_STATE_DONE' on primary and to 'RECOVERY_STATE_ARCHIVE' on
standby at a later stage in StartupXLOG() (after we are done loading
slots).

The impact of this is, the condition in RestoreSlotFromDisk() in v20-001:

if (!(RecoveryInProgress() && slot->data.synced))
 slot->last_inactive_time = GetCurrentTimestamp();

is merely equivalent to:

if (!slot->data.synced)
slot->last_inactive_time = GetCurrentTimestamp();

Thus on primary, after restart, last_inactive_at is set correctly,
while on promoted standby (new primary), last_inactive_at is always
NULL after restart for the synced slots.

thanks
Shveta




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-25 Thread shveta malik
On Mon, Mar 25, 2024 at 3:31 PM Bharath Rupireddy
 wrote:
>
> Right. Done that way i.e. not setting the last_inactive_time for slots
> both while releasing the slot and restoring from the disk.
>
> Also, I've added a TAP function to check if the captured times are
> sane per Bertrand's review comment.
>
> Please see the attached v20 patch.

Thanks for the patch. The issue of unnecessary invalidation of synced
slots on promotion is resolved in this patch.

thanks
Shveta




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-25 Thread shveta malik
On Mon, Mar 25, 2024 at 1:37 PM Bertrand Drouvot
 wrote:
>
> Hi,
>
> Yeah, and I can see last_inactive_time is moving on the standby (while not the
> case on the primary), probably due to the sync worker slot acquisition/release
> which does not seem right.
>

Yes, you are right, last_inactive_time keeps on moving for synced
slots on standby. Once I disabled the slot-sync worker, it stays
constant. Then it only changes if I call pg_sync_replication_slots().

On a different note, I noticed that we allow altering
inactive_timeout for synced slots on standby, and then overwrite it
with the primary's value in the next sync cycle. Steps:


--Check pg_replication_slots for synced slot on standby, inactive_timeout is 120
   slot_name   | failover | synced | active | inactive_timeout
---------------+----------+--------+--------+------------------
 logical_slot1 | t        | t      | f      |              120

--Alter on standby
SELECT 'alter' FROM pg_alter_replication_slot('logical_slot1', 900);

--Check pg_replication_slots:
   slot_name   | failover | synced | active | inactive_timeout
---------------+----------+--------+--------+------------------
 logical_slot1 | t        | t      | f      |              900

--Run sync function
SELECT pg_sync_replication_slots();

--check again, inactive_timeout is set back to primary's value.
   slot_name   | failover | synced | active | inactive_timeout
---------------+----------+--------+--------+------------------
 logical_slot1 | t        | t      | f      |              120

 

I feel altering a synced slot's inactive_timeout should be prohibited
on standby. It should always be in sync with the primary. Thoughts?

I am listing the concerns raised by me:
1) create-subscription with create_slot=false overwriting
inactive_timeout of existing slot  ([1])
2) last_inactive_time set for synced slots may result in invalidation
of slot on promotion.  ([2])
3) alter replication slot to alter inactive_timeout for synced slots
on standby, should this be allowed?

[1]: 
https://www.postgresql.org/message-id/CAJpy0uAqBi%2BGbNn2ngJ-A_Z905CD3ss896bqY2ACUjGiF1Gkng%40mail.gmail.com
[2]: 
https://www.postgresql.org/message-id/CAJpy0uCLu%2BmqAwAMum%3DpXE9YYsy0BE7hOSw_Wno5vjwpFY%3D63g%40mail.gmail.com

thanks
Shveta




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-25 Thread shveta malik
On Mon, Mar 25, 2024 at 1:37 PM Bertrand Drouvot
 wrote:
>
> Hi,
>
> On Mon, Mar 25, 2024 at 12:59:52PM +0530, Amit Kapila wrote:
> > On Mon, Mar 25, 2024 at 12:43 PM shveta malik  
> > wrote:
> > >
> > > On Mon, Mar 25, 2024 at 11:53 AM shveta malik  
> > > wrote:
> > > >
> > > > On Mon, Mar 25, 2024 at 10:33 AM shveta malik  
> > > > wrote:
> > > > >
> > > > > On Sun, Mar 24, 2024 at 3:06 PM Bharath Rupireddy
> > > > >  wrote:
> > > > > >
> > > > > > I've attached the v18 patch set here.
> > >
> > > I have one concern, for synced slots on standby, how do we disallow
> > > invalidation due to inactive-timeout immediately after promotion?
> > >
> > > For synced slots, last_inactive_time and inactive_timeout are both
> > > set.
>
> Yeah, and I can see last_inactive_time is moving on the standby (while not the
> case on the primary), probably due to the sync worker slot acquisition/release
> which does not seem right.
>
> > Let's say I bring down primary for promotion of standby and then
> > > promote standby, there are chances that it may end up invalidating
> > > synced slots (considering standby is not brought down during promotion
> > > and thus inactive_timeout may already be past 'last_inactive_time').
> > >
> >
> > This raises the question of whether we need to set
> > 'last_inactive_time' synced slots on the standby?
>
> Yeah, I think that last_inactive_time should stay at 0 on synced slots on the
> standby because such slots are not usable anyway (until the standby gets 
> promoted).
>
> So, I think that last_inactive_time does not make sense if the slot never had
> the chance to be active.
>
> OTOH I think the timeout invalidation (if any) should be synced from primary.

Yes, even I feel that last_inactive_time makes sense only when the
slot is available to be used. Synced slots are not available to be
used until the standby is promoted, and thus setting
last_inactive_time can be skipped for synced slots. But once the slot
on the primary is invalidated due to inactive-timeout, that
invalidation should be synced to the standby (which is happening
currently).

thanks
Shveta




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-25 Thread shveta malik
On Mon, Mar 25, 2024 at 11:53 AM shveta malik  wrote:
>
> On Mon, Mar 25, 2024 at 10:33 AM shveta malik  wrote:
> >
> > On Sun, Mar 24, 2024 at 3:06 PM Bharath Rupireddy
> >  wrote:
> > >
> > > I've attached the v18 patch set here.

I have one concern, for synced slots on standby, how do we disallow
invalidation due to inactive-timeout immediately after promotion?

For synced slots, last_inactive_time and inactive_timeout are both
set. Let's say I bring down primary for promotion of standby and then
promote standby, there are chances that it may end up invalidating
synced slots (considering standby is not brought down during promotion
and thus inactive_timeout may already be past 'last_inactive_time'). I
tried with a smaller inactive_timeout value:

--Shutdown primary to prepare for planned promotion.

--On standby, one synced slot with last_inactive_time (lat) as 12:21
   slot_name   | failover | synced | active | temp | conf | res |               lat                | inactive_timeout
---------------+----------+--------+--------+------+------+-----+----------------------------------+------------------
 logical_slot1 | t        | t      | f      | f    | f    |     | 2024-03-25 12:21:09.020757+05:30 |               60

--wait for some time, now the time is 12:24
postgres=# select now();
               now
----------------------------------
 2024-03-25 12:24:17.616716+05:30

-- promote immediately:
./pg_ctl -D ../../standbydb/ promote -w

--on promoted standby:
postgres=# select pg_is_in_recovery();
 pg_is_in_recovery
-------------------
 f

--synced slot is invalidated immediately on promotion.
   slot_name   | failover | synced | active | temp | conf |       res        |               lat                | inactive_timeout
---------------+----------+--------+--------+------+------+------------------+----------------------------------+------------------
 logical_slot1 | t        | t      | f      | f    | f    | inactive_timeout | 2024-03-25 12:21:09.020757+05:30 |


thanks
Shveta




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-25 Thread shveta malik
On Mon, Mar 25, 2024 at 10:33 AM shveta malik  wrote:
>
> On Sun, Mar 24, 2024 at 3:06 PM Bharath Rupireddy
>  wrote:
> >
> > I've attached the v18 patch set here.
>

I have a question. Don't we allow creating a subscription on an
existing slot with a non-null 'inactive_timeout' set, where the
slot's 'inactive_timeout' is retained even after subscription
creation?

I tried this:

===
--On publisher, create slot with 120sec inactive_timeout:
SELECT * FROM pg_create_logical_replication_slot('logical_slot1',
'pgoutput', false, true, true, 120);

--On subscriber, create sub using logical_slot1
create subscription mysubnew1_1  connection 'dbname=newdb1
host=localhost user=shveta port=5433' publication mypubnew1_1 WITH
(failover = true, create_slot=false, slot_name='logical_slot1');

--Before creating sub, pg_replication_slots output:
   slot_name   | failover | synced | active | temp | conf |               lat                | inactive_timeout
---------------+----------+--------+--------+------+------+----------------------------------+------------------
 logical_slot1 | t        | f      | f      | f    | f    | 2024-03-25 11:11:55.375736+05:30 |              120

--After creating sub, pg_replication_slots output (inactive_timeout is 0 now):
   slot_name   | failover | synced | active | temp | conf | res | lat | inactive_timeout
---------------+----------+--------+--------+------+------+-----+-----+------------------
 logical_slot1 | t        | f      | t      | f    | f    |     |     |                0
===

In CreateSubscription, we call 'walrcv_alter_slot()' /
'ReplicationSlotAlter()' when create_slot is false. This call ends up
resetting inactive_timeout from 120sec to 0. Is it intentional?

thanks
Shveta




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-24 Thread shveta malik
On Sun, Mar 24, 2024 at 3:06 PM Bharath Rupireddy
 wrote:
>
> I've attached the v18 patch set here.

Thanks for the patches. Please find few comments:

patch 001:


1)
slot.h:

+ /* The time at which this slot become inactive */
+ TimestampTz last_inactive_time;

become -->became

-
patch 002:

2)
slotsync.c:

  ReplicationSlotCreate(remote_slot->name, true, RS_TEMPORARY,
remote_slot->two_phase,
remote_slot->failover,
-   true);
+   true, 0);

+ slot->data.inactive_timeout = remote_slot->inactive_timeout;

Is there a reason we are not passing 'remote_slot->inactive_timeout'
to ReplicationSlotCreate() directly?

-

3)
slotfuncs.c
pg_create_logical_replication_slot():
+ int inactive_timeout = PG_GETARG_INT32(5);

Can we mention here that the timeout is in seconds, either in a
comment or by renaming the variable to inactive_timeout_secs?

Please do this for create_physical_replication_slot(),
create_logical_replication_slot(),
pg_create_physical_replication_slot() as well.

-
4)
+ int inactive_timeout; /* The amount of time in seconds the slot
+ * is allowed to be inactive. */
 } LogicalSlotInfo;

 Do we need to mention "before getting invalidated" like other places
(in the last patch)?

--

 5)
Same at these two places. "before getting invalidated" to be added in
the last patch, otherwise the info is incomplete.

+
+ /* The amount of time in seconds the slot is allowed to be inactive */
+ int inactive_timeout;
 } ReplicationSlotPersistentData;


+ * inactive_timeout: The amount of time in seconds the slot is allowed to be
+ * inactive.
  */
 void
 ReplicationSlotCreate(const char *name, bool db_specific,
 Same here. "before getting invalidated" ?



Reviewing more..

thanks
Shveta




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-15 Thread shveta malik
On Thu, Mar 14, 2024 at 7:58 PM Bharath Rupireddy
 wrote:
>
> While we wait to hear from others on this, I'm attaching the v9 patch
> set implementing the above idea (check 0001 patch). Please have a
> look. I'll come back to the other review comments soon.
>

patch002:

1)
I would like to understand the purpose of 'inactive_count'. Is it
only there for users' monitoring purposes? We are not using it
anywhere internally.

I shut down the instance 5 times and found that 'inactive_count'
became 5 for all the slots created on that instance. Is this
intentional? I mean we cannot really use the slots while the instance
is down. I felt inactive_count should be incremented only if, during
the lifetime of the instance, the slots were actually inactive, i.e.
no streaming or replication was happening through them.


2)
slot.c:
+ case RS_INVAL_XID_AGE:
+ {
+ if (TransactionIdIsNormal(s->data.xmin))
+ {
+  ..
+ }
+ if (TransactionIdIsNormal(s->data.catalog_xmin))
+ {
+  ..
+ }
+ }

Can we optimize this code? It has duplicate code for processing
s->data.catalog_xmin and s->data.xmin. Can we create a sub-function
for this purpose and call it twice here?

thanks
Shveta




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-14 Thread shveta malik
On Thu, Mar 14, 2024 at 7:58 PM Bharath Rupireddy
 wrote:
>
> On Thu, Mar 14, 2024 at 12:24 PM Amit Kapila  wrote:
> >
> > On Wed, Mar 13, 2024 at 9:24 PM Bharath Rupireddy
> > >
> > > Yes, there will be some sort of duplicity if we emit conflict_reason
> > > as a text field. However, I still think the better way is to turn
> > > conflict_reason text to conflict boolean and set it to true only on
> > > rows_removed and wal_level_insufficient invalidations. When conflict
> > > boolean is true, one (including all the tests that we've added
> > > recently) can look for invalidation_reason text field for the reason.
> > > This sounds reasonable to me as opposed to we just mentioning in the
> > > docs that "if invalidation_reason is rows_removed or
> > > wal_level_insufficient it's the reason for conflict with recovery".

+1 on maintaining both conflicting and invalidation_reason

> > Fair point. I think we can go either way. Bertrand, Nathan, and
> > others, do you have an opinion on this matter?
>
> While we wait to hear from others on this, I'm attaching the v9 patch
> set implementing the above idea (check 0001 patch). Please have a
> look. I'll come back to the other review comments soon.

Thanks for the patch. JFYI, patch09 does not apply to HEAD; some
recent commit caused a conflict.

Some trivial comments on patch001 (yet to review other patches)

1)
info.c:

- "%s as caught_up, conflict_reason IS NOT NULL as invalid "
+ "%s as caught_up, invalidation_reason IS NOT NULL as invalid "

Can we revert to 'conflicting as invalid', since it is a query for
logical slots only?

2)
040_standby_failover_slots_sync.pl:

- q{SELECT conflict_reason IS NULL AND synced AND NOT temporary FROM
pg_replication_slots WHERE slot_name = 'lsub1_slot';}
+ q{SELECT invalidation_reason IS NULL AND synced AND NOT temporary
FROM pg_replication_slots WHERE slot_name = 'lsub1_slot';}

Here too, can we have 'NOT conflicting' instead of
'invalidation_reason IS NULL', as it is a logical slot test?

thanks
Shveta




Re: Have pg_basebackup write "dbname" in "primary_conninfo"?

2024-03-14 Thread shveta malik
On Thu, Mar 14, 2024 at 10:57 AM Amit Kapila  wrote:
>
> On Thu, Mar 14, 2024 at 5:57 AM Masahiko Sawada  wrote:
> >
> > This fact makes me think that the slotsync worker might be able to
> > accept the primary_conninfo value even if there is no dbname in the
> > value. That is, if there is no dbname in the primary_conninfo, it uses
> > the username in accordance with the specs of the connection string.
> > Currently, the slotsync worker connects to the local database first
> > and then establishes the connection to the primary server. But if we
> > can reverse the two steps, it can get the dbname that has actually
> > been used to establish the remote connection and use it for the local
> > connection too. That way, the primary_conninfo generated by
> > pg_basebackup could work even without the patch. For example, if the
> > OS user executing pg_basebackup is 'postgres', the slotsync worker
> > would connect to the postgres database. Given the 'postgres' database
> > is created by default and 'postgres' OS user is used in common, I
> > guess it could cover many cases in practice actually.
> >
>
> I think this is worth investigating but I suspect that in most cases
> users will end up using a replication connection without specifying
> the user name and we may not be able to give a meaningful error
> message when slotsync worker won't be able to connect. The same will
> be true even when the dbname same as the username would be used.
>

I attempted the change as suggested by Sawada-san. Attached is the PoC
patch. For it to work, I had to expose a new get API in libpq-fe
which gets the dbname from the streaming connection. Please have a look.

Without this PoC patch, the errors in slot-sync worker:

-
a) If dbname is missing:
[1230932] LOG:  slot sync worker started
[1230932] ERROR:  slot synchronization requires dbname to be specified
in primary_conninfo

b) If specified db does not exist
[1230913] LOG:  slot sync worker started
[1230913] FATAL:  database "postgres1" does not exist
-

Now with this patch:
-
a) If the dbname same as user does not exist:
[1232473] LOG:  slot sync worker started
[1232473] ERROR:  could not connect to the primary server: connection
to server at "127.0.0.1", port 5433 failed: FATAL: database
"bckp_user" does not exist

b) If the user itself is removed from primary_conninfo, libpq by
default takes the OS user who authenticated the system, and gives an
error if a db of the same name does not exist
ERROR:  could not connect to the primary server: connection to server
at "127.0.0.1", port 5433 failed: FATAL:  database "shveta" does not
exist
-

The errors in the second case look slightly confusing to me.

thanks
Shveta


v1-0001-Use-user-as-dbname-for-slot-sync.patch
Description: Binary data


Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-13 Thread shveta malik
On Fri, Mar 8, 2024 at 10:42 PM Bharath Rupireddy
 wrote:
>
> On Wed, Mar 6, 2024 at 4:49 PM Amit Kapila  wrote:
> >
> > You might want to consider its interaction with sync slots on standby.
> > Say, there is no activity on slots in terms of processing the changes
> > for slots. Now, we won't perform sync of such slots on standby showing
> > them inactive as per your new criteria where as same slots could still
> > be valid on primary as the walsender is still active. This may be more
> > of a theoretical point as in running system there will probably be
> > some activity but I think this needs some thoughts.
>
> I believe the xmin and catalog_xmin of the sync slots on the standby
> keep advancing depending on the slots on the primary, no? If yes, the
> XID age based invalidation shouldn't be a problem.

If the user has not enabled slot-sync worker and is relying on the SQL
function pg_sync_replication_slots(), then the xmin and catalog_xmin
of synced slots may not keep advancing. These will be advanced only
on the next run of the function. Meanwhile, the synced slots may be
invalidated due to 'xid_aged'. Then the next time the user runs
pg_sync_replication_slots(), the invalidated slots will be dropped
and recreated by the SQL function (provided they are valid on the
primary and invalidated on the standby alone). I am not stating that
it is a problem, but we need to think about whether this is what we
want. Secondly, the behaviour is not the same as 'inactive_timeout'
invalidation: synced slots are immune to 'inactive_timeout'
invalidation, as that invalidation happens only in the walsender,
while they are not immune to 'xid_aged' invalidation. So again, this
needs some thought.

> I believe there are no walsenders started for the sync slots on the
> standbys, right? If yes, the inactive timeout based invalidation also
> shouldn't be a problem. Because, the inactive timeouts for a slot are
> tracked only for walsenders because they are the ones that typically
> hold replication slots for longer durations and for real replication
> use. We did a similar thing in a recent commit [1].
>
> Is my understanding right? Do you still see any problems with it?

I have explained the situation above for us to think over it better.

thanks
Shveta




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-13 Thread shveta malik
> JFYI, the patch does not apply to the head. There is a conflict in
> multiple files.

For review purposes, I applied v8 to the March 6 code-base. I have yet
to review it in detail; please find my initial thoughts:

1)
I found that 'inactive_replication_slot_timeout' works only if a
walsender was ever started for that slot, as the logic is under the
'am_walsender' check. Is this intentional?
If I create a slot and use only pg_logical_slot_get_changes or
pg_replication_slot_advance on it, it never gets invalidated due to
timeout. In contrast, when I set 'max_slot_xid_age' or
'max_slot_wal_keep_size' to a lower value, the said slot is correctly
invalidated with the 'xid_aged' and 'wal_removed' reasons
respectively.

Example:
With inactive_replication_slot_timeout=1min, test1_3  is the slot for
which there is no walsender and only advance and get_changes SQL
functions were called; test1_4 is the one for which pg_recvlogical was
run for a second.

 test1_3 |   785 | | reserved   |   | t |  |
 test1_4 |   798 | | lost   | inactive_timeout| t| 2024-03-13 11:52:41.58446+05:30  |

And when inactive_replication_slot_timeout=0  and max_slot_xid_age=10

 test1_3 |   785 | | lost   |  xid_aged   | t |  |
 test1_4 |   798 | | lost   | inactive_timeout| t| 2024-03-13 11:52:41.58446+05:30  |


2)
The msg for patch 3 says:
--
a) when replication slots is lying inactive for a day or so using
last_inactive_at metric,
b) when a replication slot is becoming inactive too frequently using
last_inactive_at metric.
--
 I think in b, you want to refer to inactive_count instead of last_inactive_at?

3)
I do not see invalidation_reason updated for 2 new reasons in system-views.sgml


thanks
Shveta




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-12 Thread shveta malik
On Wed, Mar 6, 2024 at 2:47 PM Bharath Rupireddy
 wrote:
>
> On Wed, Mar 6, 2024 at 2:42 PM Bertrand Drouvot
>  wrote:
> >
> > Hi,
> >
> > On Tue, Mar 05, 2024 at 01:44:43PM -0600, Nathan Bossart wrote:
> > > On Wed, Mar 06, 2024 at 12:50:38AM +0530, Bharath Rupireddy wrote:
> > > > On Mon, Mar 4, 2024 at 2:11 PM Bertrand Drouvot
> > > >  wrote:
> > > >> On Sun, Mar 03, 2024 at 03:44:34PM -0600, Nathan Bossart wrote:
> > > >> > Unless I am misinterpreting some details, ISTM we could rename this 
> > > >> > column
> > > >> > to invalidation_reason and use it for both logical and physical 
> > > >> > slots.  I'm
> > > >> > not seeing a strong need for another column.
> > > >>
> > > >> Yeah having two columns was more for convenience purpose. Without the 
> > > >> "conflict"
> > > >> one, a slot conflicting with recovery would be "a logical slot having 
> > > >> a non NULL
> > > >> invalidation_reason".
> > > >>
> > > >> I'm also fine with one column if most of you prefer that way.
> > > >
> > > > While we debate on the above, please find the attached v7 patch set
> > > > after rebasing.
> > >
> > > It looks like Bertrand is okay with reusing the same column for both
> > > logical and physical slots
> >
> > Yeah, I'm okay with one column.
>
> Thanks. v8-0001 is how it looks. Please see the v8 patch set with this change.

JFYI, the patch does not apply to the head. There is a conflict in
multiple files.

thanks
Shveta




Re: Add comment to specify timeout unit in ConditionVariableTimedSleep()

2024-03-11 Thread shveta malik
On Sat, Mar 9, 2024 at 12:19 PM Michael Paquier  wrote:
>
> On Tue, Mar 05, 2024 at 03:20:48PM +0900, Michael Paquier wrote:
> > That sounds like a good idea to me, so I'm OK with your suggestion.
>
> Applied this one as f160bf06f72a.  Thanks.

Thanks!

thanks
Shveta




Re: Regardign RecentFlushPtr in WalSndWaitForWal()

2024-03-11 Thread shveta malik
On Sat, Mar 2, 2024 at 4:44 PM Amit Kapila  wrote:
>
> Right, I think the quoted code has check "if (!RecoveryInProgress())".
>
> >
>  But apart from that, your
> > observation seems accurate, yes.
> >
>
> I also find the observation correct and the code has been like that
> since commit 5a991ef8 [1]. So, let's wait to see if Robert or Andres
> remembers the reason, otherwise, we should probably nuke this code.

Please find the patch attached for the same.

thanks
Shveta


v1-0001-Remove-redundant-RecentFlushPtr-fetch-in-WalSndWa.patch
Description: Binary data


Re: Synchronizing slots from primary to standby

2024-03-07 Thread shveta malik
On Fri, Mar 8, 2024 at 9:56 AM Ajin Cherian  wrote:
>
>> Pushed with minor modifications. I'll keep an eye on BF.
>>
>> BTW, one thing that we should try to evaluate a bit more is the
>> traversal of slots in StandbySlotsHaveCaughtup() where we verify if
>> all the slots mentioned in standby_slot_names have received the
>> required WAL. Even if the standby_slot_names list is short the total
>> number of slots can be much larger which can lead to an increase in
>> CPU usage during traversal. There is an optimization that allows to
>> cache ss_oldest_flush_lsn and ensures that we don't need to traverse
>> the slots each time so it may not hit frequently but still there is a
>> chance. I see it is possible to further optimize this area by caching
>> the position of each slot mentioned in standby_slot_names in
>> replication_slots array but not sure whether it is worth.
>>
>>
>
> I tried to test this by configuring a large number of logical slots while 
> making sure the standby slots are at the end of the array and checking if 
> there was any performance hit in logical replication from these searches.
>

Thanks  Ajin and Nisha.

We also plan:
1) Redoing XLogSendLogical time-log related test with
'sync_replication_slots' enabled.
2) pg_recvlogical test to monitor lag in StandbySlotsHaveCaughtup()
for a large number of slots.
3) Profiling to see if StandbySlotsHaveCaughtup() is noticeable in the
report when there are a large number of slots to traverse.

thanks
Shveta




Re: Missing LWLock protection in pgstat_reset_replslot()

2024-03-06 Thread shveta malik
On Thu, Mar 7, 2024 at 11:12 AM Michael Paquier  wrote:
>

> Yeah, it is possible that what you retrieve from
> pgstat_fetch_replslot() does not refer exactly to the slot's content
> under concurrent activity, but you cannot protect a single scan of
> pg_stat_replication_slots as of an effect of its design:
> pg_stat_get_replication_slot() has to be called multiple times.  The
> patch at least makes sure that the copy of the slot's stats retrieved
> by pgstat_fetch_entry() is slightly more consistent

Right, I understand that.

, but you cannot do
> better than that except if the data retrieved from
> pg_replication_slots and its stats are fetched in the same context
> function call, holding the replslot LWLock for the whole scan
> duration.

Yes, agreed.

>
> > Do you feel that the lock in pgstat_fetch_replslot() should be moved
> > to its caller when we are done copying the results of that slot? This
> > will improve the protection slightly.
>
> I don't see what extra protection this would offer, as
> pg_stat_get_replication_slot() is called once for each slot.  Feel
> free to correct me if I'm missing something.

It slightly improves the chances. Since pgstat_fetch_replslot is only
called from pg_stat_get_replication_slot(), I thought it might be
better to acquire the lock before we call pgstat_fetch_replslot and
release it once we are done copying the results for that particular
slot. But I also understand that it will not prevent someone from
dropping that slot at a later stage when the rest of the calls of
pg_stat_get_replication_slot() are still pending. So I am okay with
the current one.

thanks
Shveta




Re: Missing LWLock protection in pgstat_reset_replslot()

2024-03-06 Thread shveta malik
On Wed, Mar 6, 2024 at 2:36 PM Bertrand Drouvot
 wrote:
>
> Hi,
>
> On Wed, Mar 06, 2024 at 10:24:46AM +0530, shveta malik wrote:
> > On Tue, Mar 5, 2024 at 6:52 PM Bertrand Drouvot
> >  wrote:
> > Thanks.  Can we try to get rid of multiple LwLockRelease in
> > pgstat_reset_replslot(). Is this any better?
> >
> > /*
> > -* Nothing to do for physical slots as we collect stats only for 
> > logical
> > -* slots.
> > +* Reset stats if it is a logical slot. Nothing to do for physical 
> > slots
> > +* as we collect stats only for logical slots.
> >  */
> > -   if (SlotIsPhysical(slot))
> > -   {
> > -   LWLockRelease(ReplicationSlotControlLock);
> > -   return;
> > -   }
> > -
> > -   /* reset this one entry */
> > -   pgstat_reset(PGSTAT_KIND_REPLSLOT, InvalidOid,
> > -ReplicationSlotIndex(slot));
> > +   if (SlotIsLogical(slot))
> > +   pgstat_reset(PGSTAT_KIND_REPLSLOT, InvalidOid,
> > +ReplicationSlotIndex(slot));
> >
> > LWLockRelease(ReplicationSlotControlLock);
> >
>
> Yeah, it's easier to read and probably reduce the pgstat_replslot.o object 
> file
> size a bit for non optimized build.
>
> > Something similar in pgstat_fetch_replslot() perhaps?
>
> Yeah, all of the above done in v3 attached.
>

Thanks for the patch.

For the fix in pgstat_fetch_replslot(), even with the lock in fetch
call, there are chances that the concerned slot can be dropped and
recreated.

--It can happen in a small window in pg_stat_get_replication_slot()
when we are consuming the return values of pgstat_fetch_replslot
(using slotent).

--Also it can happen at a later stage: when we have switched to
fetching the next slot (obtained from 'pg_replication_slots' through
the view 'pg_stat_replication_slots'), the previous one can be dropped.

Ultimately the results of system view 'pg_replication_slots' can still
give us non-existing or re-created slots. But yes, I do not deny that
it gives us better consistency protection.

Do you feel that the lock in pgstat_fetch_replslot() should be moved
to its caller when we are done copying the results of that slot? This
will improve the protection slightly.

thanks
Shveta




Re: Synchronizing slots from primary to standby

2024-03-06 Thread shveta malik
On Wed, Mar 6, 2024 at 6:54 PM Zhijie Hou (Fujitsu)
 wrote:
>
> On Wednesday, March 6, 2024 9:13 PM Zhijie Hou (Fujitsu) 
>  wrote:
> >
> > On Wednesday, March 6, 2024 11:04 AM Zhijie Hou (Fujitsu)
> >  wrote:
> > >
> > > On Wednesday, March 6, 2024 9:30 AM Masahiko Sawada
> > >  wrote:
> > >
> > > Hi,
> > >
> > > > On Fri, Mar 1, 2024 at 4:21 PM Zhijie Hou (Fujitsu)
> > > > 
> > > > wrote:
> > > > >
> > > > > On Friday, March 1, 2024 2:11 PM Masahiko Sawada
> > > >  wrote:
> > > > > >
> > > > > >
> > > > > > ---
> > > > > > +void
> > > > > > +assign_standby_slot_names(const char *newval, void *extra) {
> > > > > > +List  *standby_slots;
> > > > > > +MemoryContext oldcxt;
> > > > > > +char  *standby_slot_names_cpy = extra;
> > > > > > +
> > > > > >
> > > > > > Given that the newval and extra have the same data
> > > > > > (standby_slot_names value), why do we not use newval instead? I
> > > > > > think that if we use newval, we don't need to guc_strdup() in
> > > > > > check_standby_slot_names(), we might need to do list_copy_deep()
> > > > > > instead, though. It's not clear to me as there is no comment.
> > > > >
> > > > > I think SplitIdentifierString will modify the passed in string, so
> > > > > we'd better not pass the newval to it, otherwise the stored guc
> > > > > string(standby_slot_names) will be changed. I can see we are doing
> > > > > similar thing in other GUC check/assign function as well.
> > > > > (check_wal_consistency_checking/ assign_wal_consistency_checking,
> > > > > check_createrole_self_grant/ assign_createrole_self_grant ...).
> > > >
> > > > Why does it have to be a List in the first place?
> > >
> > > I thought the List type is convenient to use here, as we have existing
> > > list build function(SplitIdentifierString), and have convenient list
> > > macro to loop the
> > > list(foreach_ptr) which can save some codes.
> > >
> > > > In earlier version patches, we
> > > > used to copy the list and delete the element until it became empty,
> > > > while waiting for physical wal senders. But we now just refer to
> > > > each slot name in the list. The current code assumes that
> > > > stnadby_slot_names_cpy is allocated in GUCMemoryContext but once it
> > > > changes, it will silently get broken. I think we can check and
> > > > assign standby_slot_names in a similar way to
> > > > check/assign_temp_tablespaces and
> > check/assign_synchronous_standby_names.
> > >
> > > Yes, we could do follow it by allocating an array and copy each slot
> > > name into it, but it also requires some codes to build and scan the
> > > array. So, is it possible to expose the GucMemorycontext or have an API 
> > > like
> > guc_copy_list instead ?
> > > If we don't want to touch the guc api, I am ok with using an array as 
> > > well.
> >
> > I rethink about this and realize that it's not good to do the memory 
> > allocation in
> > assign hook function. As the "src/backend/utils/misc/README" said, we'd
> > better do that in check hook function and pass it via extra to assign hook
> > function. And thus array is a good choice in this case rather than a List 
> > which
> > cannot be passed to *extra.
> >
> > Here is the V107 patch set which parse and cache the standby slot names in 
> > an
> > array instead of a List.
>
> The patch needs to be rebased due to recent commit.
>
> Attach the V107_2 path set. There are no code changes in this version.

 The patch needed to be rebased due to a recent commit. Attached
v107_3; there are no code changes in this version.

thanks
Shveta


v107_3-0002-Document-the-steps-to-check-if-the-standby-is.patch
Description: Binary data


v107_3-0001-Allow-logical-walsenders-to-wait-for-the-phys.patch
Description: Binary data


Re: Missing LWLock protection in pgstat_reset_replslot()

2024-03-05 Thread shveta malik
On Tue, Mar 5, 2024 at 6:52 PM Bertrand Drouvot
 wrote:
>
> > /*
> > * Nothing to do for physical slots as we collect stats only for logical
> > * slots.
> > */
> > if (SlotIsPhysical(slot))
> > return;
>
> D'oh! Thanks! Fixed in v2 shared up-thread.

Thanks.  Can we try to get rid of multiple LwLockRelease in
pgstat_reset_replslot(). Is this any better?

/*
-* Nothing to do for physical slots as we collect stats only for logical
-* slots.
+* Reset stats if it is a logical slot. Nothing to do for physical slots
+* as we collect stats only for logical slots.
 */
-   if (SlotIsPhysical(slot))
-   {
-   LWLockRelease(ReplicationSlotControlLock);
-   return;
-   }
-
-   /* reset this one entry */
-   pgstat_reset(PGSTAT_KIND_REPLSLOT, InvalidOid,
-ReplicationSlotIndex(slot));
+   if (SlotIsLogical(slot))
+   pgstat_reset(PGSTAT_KIND_REPLSLOT, InvalidOid,
+ReplicationSlotIndex(slot));

LWLockRelease(ReplicationSlotControlLock);


Something similar in pgstat_fetch_replslot() perhaps?

thanks
Shveta



