RE: Conflict detection for update_deleted in logical replication

2024-11-04 Thread Zhijie Hou (Fujitsu)
On Monday, October 28, 2024 1:40 PM Peter Smith  wrote:
> 
> Hi Hou-San, here are a few trivial comments remaining for patch v6-0001.

Thanks for the comments!

> 
> ==
> doc/src/sgml/protocol.sgml
> 
> 3.
> +Primary status update (B)
> +
> + 
> +  
> +   Byte1('s')
> 
> Currently, there are identifiers 's' for the "Primary status update"
> message, and 'S' for the "Primary status request" message.
> 
> As mentioned in the previous review ([1] #5b) I preferred it to be the other way around:
> 'S' = status from primary
> 's' = request status from primary
> 
> Of course, it doesn't make any difference, but "S" seems more important than
> "s", so therefore "S" being the main msg and coming from the *primary*
> seemed more natural to me.

I am not sure that one message is more important than the other, so I prefer to
keep the current style. Since this is a minor issue, we can easily revise it in
a future patch version if we receive additional feedback.

The other comments look good to me, and I will address them in the V7 patch set.

Best Regards,
Hou zj


RE: Pgoutput not capturing the generated columns

2024-10-28 Thread Zhijie Hou (Fujitsu)
On Monday, October 28, 2024 2:54 PM Hayato Kuroda (Fujitsu) wrote:
> 
> 
> 02. 031_column_list.pl
> 
> ```
> -# TEST: Generated and dropped columns are not considered for the column list.
> +# TEST: Dropped columns are not considered for the column list.
>  # So, the publication having a column list except for those columns and a
>  # publication without any column (aka all columns as part of the columns
>  # list) are considered to have the same column list.
> ```
> 
> Based on the comment, this case does not test the behavior of generated
> columns anymore. So, I felt column 'd' could be removed from the case.

I think keeping the generated column can test the cases you mentioned
in comment #03, so we can modify the comments here to make that clear.

> 
> 03. 031_column_list.pl
> 
> Can we test that generated columns won't be replicated if they are not
> included in the column list?

As stated above, it can be covered in existing tests.

Best Regards,
Hou zj


RE: Pgoutput not capturing the generated columns

2024-10-28 Thread Zhijie Hou (Fujitsu)
On Monday, October 28, 2024 1:34 PM Amit Kapila  wrote:
> 
> On Mon, Oct 28, 2024 at 7:43 AM Peter Smith wrote:
> 
> >
> > 4. pgoutput_column_list_init
> >
> > - if (att->attisdropped || att->attgenerated)
> > + if (att->attisdropped)
> >   continue;
> >
> > + if (att->attgenerated)
> > + {
> > + if (bms_is_member(att->attnum, cols)) gencolpresent = true;
> > +
> > + continue;
> > + }
> > +
> >   nliveatts++;
> >   }
> >
> >   /*
> > - * If column list includes all the columns of the table,
> > - * set it to NULL.
> > + * If column list includes all the columns of the table
> > + * and there are no generated columns, set it to NULL.
> >   */
> > - if (bms_num_members(cols) == nliveatts)
> > + if (bms_num_members(cols) == nliveatts && !gencolpresent)
> >   {
> >
> > Something seems not quite right (or maybe redundant) with this logic.
> > For example, because you unconditionally 'continue' for generated
> > columns, then AFAICT it is just not possible for bms_num_members(cols)
> > == nliveatts and at the same time 'gencolpresent' to be true. So you
> > could've just Asserted (!gencolpresent) instead of checking it in the
> > condition and mentioning it in the comment.

I think it's possible for the condition you mentioned to happen.

For example:

CREATE TABLE test_mix_4 (a int primary key, b int, d int GENERATED ALWAYS AS (a + 1) STORED);
CREATE PUBLICATION pub FOR TABLE test_mix_4(a, d);

Here the column list {a, d} has two members, while the live non-generated
columns are a and b, so bms_num_members(cols) == nliveatts (2 == 2) at the
same time as gencolpresent is true.

> >
> 
> It seems part of the logic is redundant. I propose to change something along
> the lines of the attached. I haven't tested the attached change as it shows
> how we can improve this part of code.

Thanks for the changes. I tried them and hit unexpected behavior: the
walsender reports the error "cannot use different column lists for table ..."
in the following case:

Pub:
CREATE TABLE test_mix_4 (a int PRIMARY KEY, b int, c int, d int GENERATED ALWAYS AS (a + 1) STORED);
ALTER TABLE test_mix_4 DROP COLUMN c;
CREATE PUBLICATION pub_mix_7 FOR TABLE test_mix_4 (a, b);
CREATE PUBLICATION pub_mix_8 FOR TABLE test_mix_4;
Sub:
CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub_mix_7, pub_mix_8;

pub_mix_7 publishes columns a and b, which covers all the live non-generated
columns, so its column list should be converted to NULL in pgoutput; but it
was not, due to the check of att_gen_present.

Based on the above, I feel we can keep the original code as it is.

Best Regards,
Hou zj


RE: Conflict Detection and Resolution

2024-10-18 Thread Zhijie Hou (Fujitsu)
On Wednesday, October 9, 2024 2:34 PM shveta malik wrote:
> 
> On Wed, Oct 9, 2024 at 8:58 AM shveta malik 
> wrote:
> >
> > On Tue, Oct 8, 2024 at 3:12 PM Nisha Moond
>  wrote:
> > >
> >
> 
> Please find few comments on v14-patch004:
> 
> patch004:
> 1)
> GetConflictResolver currently errors out when the resolver is last_update_wins
> and track_commit_timestamp is disabled. It means every conflict resolution
> with this resolver will keep on erroring out. I am not sure if we should emit
> ERROR here. We do emit ERROR when someone tries to configure
> last_update_wins but track_commit_timestamp is disabled. I think that should
> suffice. The one in GetConflictResolver can be converted to WARNING max.
> 
> What could be the side-effect if we do not emit an error here? In such a
> case, the local timestamp will be 0 and the remote change will always win.
> Is that right? If so, then if needed, we can emit a warning saying something
> like: 'track_commit_timestamp is disabled and thus the remote change is
> always applied.'
> 
> Thoughts?

I think simply reporting a warning and applying remote changes without further
action could lead to data inconsistencies between nodes. Considering the
potential challenges and time required to recover from these inconsistencies, I
prefer to keep reporting errors, in which case users have an opportunity to
resolve the issue by enabling track_commit_timestamp.
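
For reference, the check in question is essentially of the following shape.
This is a minimal sketch with assumed names (the enum value and error wording
are assumptions, not the actual patch code):

    /* in GetConflictResolver(), hypothetical names */
    if (resolver == LAST_UPDATE_WINS && !track_commit_timestamp)
        ereport(ERROR,
                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                 errmsg("conflict resolution with last_update_wins requires \"track_commit_timestamp\" to be enabled")));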

Best Regards,
Hou zj




RE: Conflict detection for update_deleted in logical replication

2024-10-13 Thread Zhijie Hou (Fujitsu)
On Friday, October 11, 2024 4:35 PM Zhijie Hou (Fujitsu) wrote:
> 
> Attach the V4 patch set which addressed above comments.
> 

While reviewing the patch, I noticed that the current design cannot work in
a non-bidirectional cluster (publisher -> subscriber) when the publisher is
also a physical standby. (Logical decoding on a physical standby was supported
recently, so it's possible to use a physical standby as a logical publisher.)

The cluster looks like:

physical primary -> physical standby (also publisher) -> logical subscriber (detect_update_deleted)

The issue arises because the physical standby (acting as the publisher) might
lag behind its primary. As a result, the logical walsender on the standby might
not be able to return the latest WAL position when the logical subscriber
requests it. We can only get the WAL replay position, but there may be more WAL
still being replicated from the primary, and that WAL could carry older commit
timestamps. (Note that a transaction has the same commit timestamp on both the
primary and the standby.)

So, the logical walsender might send an outdated WAL position as feedback.
This, in turn, can cause the replication slot on the subscriber to advance
prematurely, leading to the unwanted removal of dead tuples. As a result, the
apply worker may fail to correctly detect update-delete conflicts.

We thought of a few options to fix this:

1) Add a Time-Based Subscription Option:

We could add a new time-based option for subscriptions, such as
retain_dead_tuples = '5s'. In the logical launcher, after obtaining the
candidate XID, the launcher will wait for the specified time before advancing
the slot.xmin. This ensures that deleted tuples are retained for at least the
duration defined by this new option.

This approach is designed to simulate the functionality of the GUC
(vacuum_committs_age), but with a simpler implementation that does not impact
vacuum performance. We can maintain both this time-based method and the current
automatic method. If a user does not specify the time-based option, we will
continue using the existing approach to retain dead tuples until all concurrent
transactions from the remote node have been applied.
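
As a sketch, the new option could look like this when creating a subscription
(hypothetical syntax following the idea above; nothing here is committed):

    CREATE SUBSCRIPTION sub
        CONNECTION 'host=nodeA dbname=postgres'
        PUBLICATION pub
        WITH (retain_dead_tuples = '5s');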

2) Modification to the Logical Walsender

The logical walsender, which runs on the physical standby, could open an
additional connection to the physical primary to obtain the latest WAL
position. This position would then be sent as feedback to the logical
subscriber.

A potential concern is that this requires the walsender to use the walreceiver
API, which may seem a bit unnatural. It also starts an additional walsender
process on the primary, as the logical walsender on the physical standby would
need to communicate with that walsender to fetch the WAL position.

3) Documentation of Restrictions

As an alternative, we could simply document the restriction that detecting
update_deleted is not supported if the publisher is also acting as a physical
standby.


Best Regards,
Hou zj


RE: Conflict detection for update_deleted in logical replication

2024-09-24 Thread Zhijie Hou (Fujitsu)
On Friday, September 20, 2024 11:59 AM Hou, Zhijie/侯 志杰 wrote:
> 
> On Friday, September 20, 2024 10:55 AM Zhijie Hou (Fujitsu) wrote:
> > On Friday, September 20, 2024 2:49 AM Masahiko Sawada wrote:
> > >
> > >
> > > I think that such a time-based configuration parameter would be a
> > > reasonable solution. The current concerns are that it might affect
> > > vacuum performance and lead to a similar bug we had with
> > > vacuum_defer_cleanup_age.
> >
> > Thanks for the feedback!
> >
> > I am working on the POC patch and doing some initial performance tests
> > on this idea.
> > I will share the results after finishing.

Here is a POC patch for the vacuum_committs_age idea. The patch adds a GUC
vacuum_committs_age to prevent dead rows from being removed if the age of the
deleting transaction (xmax) has not exceeded the vacuum_committs_age threshold.
E.g., it ensures the row is retained if now() - commit_timestamp_of_xmax <
vacuum_committs_age.
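
For illustration, the retention check could sit in the pruning code roughly as
follows. This is only a sketch: vacuum_committs_age (assumed to be in seconds)
and the exact placement are assumptions, while TransactionIdGetCommitTsData()
and TimestampDifferenceExceeds() are existing APIs:

    TimestampTz xmax_committs;

    /* Retain the tuple if the deleting transaction is still too young. */
    if (TransactionIdGetCommitTsData(HeapTupleHeaderGetRawXmax(tuple->t_data),
                                     &xmax_committs, NULL) &&
        !TimestampDifferenceExceeds(xmax_committs, GetCurrentTimestamp(),
                                    vacuum_committs_age * 1000))
        return HEAPTUPLE_RECENTLY_DEAD;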

However, please note that the patch is still unfinished due to a few
issues that need to be addressed. For instance: We need to prevent
relfrozenxid/datfrozenxid from being advanced in both aggressive and
non-aggressive vacuum modes. Otherwise, the commit timestamp data is cleaned
up after advancing frozenxid, and we won’t be able to compute the age of a 
tuple.

Additionally, the patch has a noticeable performance impact on vacuum
operations when rows in a table are deleted by multiple transactions. Here are
the results of VACUUMing a table after deleting each row in a separate
transaction (1000 dead rows in total), where the xmax age of every dead tuple
exceeded vacuum_committs_age in the patched tests (see the attachment for the
basic test configuration):

HEAD:   Time: 848.637 ms
patched, SLRU 8MB:  Time: 1423.915 ms
patched, SLRU 1G:   Time: 1310.869 ms

Since we have discussed an alternative approach that can reliably retain dead
tuples without modifying the vacuum process, we plan to shift our focus to
that approach [1]. I am currently working on another POC patch based on it and
will share it later.

[1] https://www.postgresql.org/message-id/CAD21AoD%3Dm-YHceYMpsdu0HnGCaezeyVhaCPFxDLHU7aN0wgzqg%40mail.gmail.com

Best Regards,
Hou zj


perftest.conf
Description: perftest.conf


0001-try-to-add-vacuum_committs_age.patch
Description: 0001-try-to-add-vacuum_committs_age.patch


RE: Conflict detection for update_deleted in logical replication

2024-09-24 Thread Zhijie Hou (Fujitsu)
On Tuesday, September 24, 2024 2:42 PM Masahiko Sawada wrote:
> 
> On Mon, Sep 23, 2024 at 8:32 PM Zhijie Hou (Fujitsu) wrote:
> >
> > On Tuesday, September 24, 2024 5:05 AM Masahiko Sawada wrote:
> > > I'm still studying this idea but let me confirm the following scenario.
> > >
> > > Suppose both Node-A and Node-B have the same row (1,1) in table t,
> > > and XIDs and commit LSNs of T2 and T3 are the following:
> > >
> > > Node A
> > >   T2: DELETE FROM t WHERE id = 1 (10:02 AM) XID:100, commit-LSN:1000
> > >
> > > Node B
> > >   T3: UPDATE t SET value = 2 WHERE id = 1 (10:01 AM) XID:500, commit-LSN:5000
> > >
> > > Further suppose that it's now 10:05 AM, and the latest XID and the
> > > latest flush WAL position of Node-A and Node-B are following:
> > >
> > > Node A
> > >   current XID: 300
> > >   latest flush LSN; 3000
> > >
> > > Node B
> > >   current XID: 700
> > >   latest flush LSN: 7000
> > >
> > > Both T2 and T3 are NOT sent to Node B and Node A yet, respectively
> > > (i.e., the logical replication is delaying for 5 min).
> > >
> > > Consider the following scenario:
> > >
> > > 1. The apply worker on Node-A calls GetRunningTransactionData() and
> > > gets 301 (set as candidate_xmin).
> > > 2. The apply worker on Node-A requests the latest WAL flush position
> > > from Node-B, and gets 7000 (set as candidate_remote_wal_lsn).
> > > 3. T2 is applied on Node-B, and the latest flush position of Node-B is now 8000.
> > > 4. The apply worker on Node-A continues applying changes, and
> > > applies the transactions up to remote (commit) LSN 7100.
> > > 5. Now that the apply worker on Node-A applied all changes smaller
> > > than candidate_remote_wal_lsn (7000), it increases the slot.xmin to
> > > 301 (candidate_xmin).
> > > 6. On Node-A, vacuum runs and physically removes the tuple that was
> > > deleted by T2.
> > >
> > > Here, on Node-B, there might be a transition between LSN 7100 and
> > > 8000 that might require the tuple that is deleted by T2.
> > >
> > > For example, "UPDATE t SET value = 3 WHERE id = 1" (say T4) is
> > > executed on Node-B at LSN 7200, and it's sent to Node-A after step 6.
> > > On Node-A, whether we detect "update_deleted" or "update_missing"
> > > still depends on when vacuum removes the tuple deleted by T2.
> >
> > I think in this case, no matter whether we detect "update_deleted" or
> > "update_missing", the final data is the same. Because T4's commit
> > timestamp should be later than T2's on node A, in the case of
> > "update_deleted" we will compare the commit timestamp of the deleted
> > tuple's xmax with T4's timestamp, and T4 should win, which means we will
> > convert the update into an insert and apply it. Even if the dead tuple
> > has already been removed and "update_missing" is detected, the update
> > will still be converted into an insert and applied. So, the result is
> > the same.
> 
> The "latest_timestamp_wins" is the default resolution method for
> "update_deleted"? When I checked the wiki page[1], the "skip" was the default
> solution method for that.

Right, I think the wiki needs some updates.

I think using 'skip' as the default for update_deleted could easily cause data
divergence when the dead tuple was deleted by an old transaction while the
UPDATE has a newer timestamp, as in the case you mentioned. It's necessary to
follow the last-update-wins strategy when the incoming update has a later
timestamp, which means converting the update to an insert.
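
In other words, the resolution being described is essentially the following
(illustrative pseudocode only; the variable and function names are
assumptions):

    /* update_deleted under last-update-wins; TimestampTz compares as int64 */
    if (remote_update_committs > local_delete_committs)
        convert_update_to_insert_and_apply();   /* remote UPDATE is newer */
    else
        skip_remote_update();                   /* local DELETE is newer */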

Best Regards,
Hou zj


RE: Conflict detection for update_deleted in logical replication

2024-09-23 Thread Zhijie Hou (Fujitsu)
On Tuesday, September 24, 2024 5:05 AM Masahiko Sawada wrote:
> 
> Thank you for considering another idea.

Thanks for reviewing the idea!

> 
> On Fri, Sep 20, 2024 at 2:46 AM Amit Kapila  wrote:
> >
> > On Fri, Sep 20, 2024 at 8:25 AM Zhijie Hou (Fujitsu) wrote:
> > >
> > > Apart from the vacuum_defer_cleanup_age idea.
> > >
> >
> > I think you meant to say vacuum_committs_age idea.
> >
> > > we've given more thought to our approach for retaining dead tuples and
> > > have come up with another idea that can reliably detect conflicts
> > > without requiring users to choose a wise value for the
> > > vacuum_committs_age. This new idea could also reduce the performance
> > > impact. Thanks a lot to Amit for off-list discussion.
> > >
> > > The concept of the new idea is that the dead tuples are only useful to
> > > detect conflicts when applying *concurrent* transactions from remotes.
> > > Any subsequent UPDATE from a remote node after removing the dead tuples
> > > should have a later timestamp, meaning it's reasonable to detect an
> > > update_missing scenario and convert the UPDATE to an INSERT when
> > > applying it.
> > >
> > > To achieve the above, we can create an additional replication slot on
> > > the subscriber side, maintained by the apply worker. This slot is used
> > > to retain the dead tuples. The apply worker will advance the slot.xmin
> > > after confirming that all the concurrent transactions on the publisher
> > > have been applied locally.
> 
> The replication slot used for this purpose will be a physical one or
> logical one? And IIUC such a slot doesn't need to retain WAL but if we
> do that, how do we advance the LSN of the slot?

I think it would be a logical slot. We can keep the
restart_lsn/confirmed_flush_lsn invalid because we don't need to retain the
WALs for decoding purposes.
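
Sketched in terms of the slot fields (assumed usage, not committed code), such
a slot would pin the tuple horizon without holding back any WAL:

    SpinLockAcquire(&slot->mutex);
    slot->data.xmin = candidate_xmin;               /* retains dead tuples */
    slot->data.restart_lsn = InvalidXLogRecPtr;     /* no WAL retention */
    slot->data.confirmed_flush = InvalidXLogRecPtr; /* not used for decoding */
    SpinLockRelease(&slot->mutex);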

> 
> > > 2) the apply worker sends a new message to the walsender to request the
> > > latest WAL flush position (GetFlushRecPtr) on the publisher, and saves
> > > it to 'candidate_remote_wal_lsn'. Here we could introduce a new feedback
> > > message or extend the existing keepalive message (e.g. extend the
> > > requestReply bit in the keepalive message to add a
> > > 'request_wal_position' value)
> 
> The apply worker sends a keepalive message when it didn't receive
> anything more than wal_receiver_timeout / 2. So in a very active
> system, we cannot rely on piggybacking new information to the
> keepalive messages to get the latest remote flush LSN.

Right. I think we need to send this new message at some interval independent of
wal_receiver_timeout.
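
Something along these lines, as a sketch (the interval constant and the
message-sending function are assumptions):

    now = GetCurrentTimestamp();
    if (TimestampDifferenceExceeds(last_request_time, now,
                                   WAL_POSITION_REQUEST_INTERVAL_MS))
    {
        request_publisher_wal_position();   /* hypothetical new message */
        last_request_time = now;
    }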

> 
> > > 3) The apply worker can continue to apply changes. After applying all
> > > the WALs up to 'candidate_remote_wal_lsn', the apply worker can then
> > > advance the slot.xmin to 'candidate_xmin'.
> > >
> > > This approach ensures that dead tuples are not removed until all
> > > concurrent transactions have been applied. It can be effective for both
> > > bidirectional and non-bidirectional replication cases.
> > >
> > > We could introduce a boolean subscription option (retain_dead_tuples)
> > > to control whether this feature is enabled. Each subscription intending
> > > to detect update-delete conflicts should set retain_dead_tuples to true.
> > >
> 
> I'm still studying this idea but let me confirm the following scenario.
> 
> Suppose both Node-A and Node-B have the same row (1,1) in table t, and
> XIDs and commit LSNs of T2 and T3 are the following:
> 
> Node A
>   T2: DELETE FROM t WHERE id = 1 (10:02 AM) XID:100, commit-LSN:1000
> 
> Node B
>   T3: UPDATE t SET value = 2 WHERE id = 1 (10:01 AM) XID:500, commit-LSN:5000
> 
> Further suppose that it's now 10:05 AM, and the latest XID and the
> latest flush WAL position of Node-A and Node-B are following:
> 
> Node A
>   current XID: 300
>   latest flush LSN; 3000
> 
> Node B
>   current XID: 700
>   latest flush LSN: 7000
> 
> Both T2 and T3 are NOT sent to Node B and Node A yet, respectively
> (i.e., the logical replication is delaying for 5 min).
> 
> Consider the following scenario:
> 
> 1. The apply worker on Node-A calls GetRunningTransactionData() and
> gets 301 (set as candidate_xmin).
> 2. The apply worker on Node-A requests the latest WAL flush position
> from Node-B, and gets 7000 (set as candidate_remote_wal_lsn).
> 3. T2 is applied on Node-B, and the latest flush position of Node-B is now 8000.

RE: Conflict detection for update_deleted in logical replication

2024-09-19 Thread Zhijie Hou (Fujitsu)
On Friday, September 20, 2024 10:55 AM Zhijie Hou (Fujitsu) wrote:
> On Friday, September 20, 2024 2:49 AM Masahiko Sawada wrote:
> > 
> >
> > I think that such a time-based configuration parameter would be a
> > reasonable solution. The current concerns are that it might affect
> > vacuum performance and lead to a similar bug we had with
> > vacuum_defer_cleanup_age.
> 
> Thanks for the feedback!
> 
> I am working on the POC patch and doing some initial performance tests on
> this idea.
> I will share the results after finishing.
> 
> Apart from the vacuum_defer_cleanup_age idea, we've given more thought to
> our approach for retaining dead tuples and have come up with another idea that
> can reliably detect conflicts without requiring users to choose a wise value
> for the vacuum_committs_age. This new idea could also reduce the performance
> impact. Thanks a lot to Amit for off-list discussion.
> 
> The concept of the new idea is that the dead tuples are only useful to detect
> conflicts when applying *concurrent* transactions from remotes. Any
> subsequent UPDATE from a remote node after removing the dead tuples
> should have a later timestamp, meaning it's reasonable to detect an
> update_missing scenario and convert the UPDATE to an INSERT when
> applying it.
> 
> To achieve the above, we can create an additional replication slot on the
> subscriber side, maintained by the apply worker. This slot is used to retain
> the dead tuples. The apply worker will advance the slot.xmin after confirming
> that all the concurrent transactions on the publisher have been applied
> locally.
> 
> The process of advancing the slot.xmin could be:
> 
> 1) the apply worker call GetRunningTransactionData() to get the
> 'oldestRunningXid' and consider this as 'candidate_xmin'.
> 2) the apply worker sends a new message to the walsender to request the latest
> WAL flush position (GetFlushRecPtr) on the publisher, and saves it to
> 'candidate_remote_wal_lsn'. Here we could introduce a new feedback
> message or extend the existing keepalive message (e.g. extend the
> requestReply bit in the keepalive message to add a 'request_wal_position' value)
> 3) The apply worker can continue to apply changes. After applying all the WALs
> up to 'candidate_remote_wal_lsn', the apply worker can then advance the
> slot.xmin to 'candidate_xmin'.
> 
> This approach ensures that dead tuples are not removed until all concurrent
> transactions have been applied. It can be effective for both bidirectional and
> non-bidirectional replication cases.
> 
> We could introduce a boolean subscription option (retain_dead_tuples) to
> control whether this feature is enabled. Each subscription intending to detect
> update-delete conflicts should set retain_dead_tuples to true.
> 
> The following explains how it works in different cases to achieve data
> consistency:
...
> --
> 3 nodes, non-bidirectional, Node C subscribes to both Node A and Node B:
> --

Sorry for a typo here: the times of T2 and T3 were reversed.
Please see the following correction:

> 
> Node A:
>   T1: INSERT INTO t (id, value) VALUES (1,1); ts=10.00 AM
>   T2: DELETE FROM t WHERE id = 1; ts=10.01 AM

Here T2 should be at ts=10.02 AM

> 
> Node B:
>   T3: UPDATE t SET value = 2 WHERE id = 1;ts=10.02 AM

T3 should be at ts=10.01 AM

> 
> Node C:
>   apply T1, T2, T3
> 
> After applying T2, the apply worker on Node C will check the latest WAL flush
> location on Node B. By that time, T3 should have finished, so the xmin will
> be advanced only after applying the WALs that are later than T3. So, the dead
> tuple will not be removed before applying T3, which means the
> update_deleted conflict can be detected.
> 
> Your feedback on this idea would be greatly appreciated.
> 

Best Regards,
Hou zj 



RE: Conflict detection for update_deleted in logical replication

2024-09-19 Thread Zhijie Hou (Fujitsu)


> -----Original Message-----
> From: Masahiko Sawada
> Sent: Friday, September 20, 2024 2:49 AM
> To: Amit Kapila
> Cc: shveta malik; Hou, Zhijie/侯 志杰; pgsql-hackers
> Subject: Re: Conflict detection for update_deleted in logical replication
> 
> On Tue, Sep 17, 2024 at 9:29 PM Amit Kapila wrote:
> >
> > On Tue, Sep 17, 2024 at 11:24 PM Masahiko Sawada wrote:
> > >
> > > On Mon, Sep 16, 2024 at 11:53 PM Amit Kapila wrote:
> > > >
> > > > On Tue, Sep 17, 2024 at 6:08 AM Masahiko Sawada wrote:
> > > >
> > > > I haven't thought about the implementation details yet but I think
> > > > during pruning (for example in heap_prune_satisfies_vacuum()),
> > > > apart from checking if the tuple satisfies
> > > > HeapTupleSatisfiesVacuumHorizon(), we should also check if the
> > > > tuple's committs is greater than configured vacuum_committs_age
> > > > (for the
> > > > table) to decide whether tuple can be removed.
> > >
> > > Sounds very costly. I think we need to do performance tests. Even if
> > > the vacuum gets slower only on the particular table having the
> > > vacuum_committs_age setting, it would affect overall autovacuum
> > > performance. Also, it would affect HOT pruning performance.
> > >
> >
> > Agreed that we should do some performance testing and additionally
> > think of any better way to implement. I think the cost won't be much
> > if the tuples to be removed are from a single transaction because the
> > required commit_ts information would be cached but when the tuples are
> > from different transactions, we could see a noticeable impact. We need
> > to test to say anything concrete on this.
> 
> Agreed.
> 
> >
> > > >
> > > > > > but IIUC this value doesn’t need to be significant; it can be
> > > > > > limited to just a few minutes. The one which is sufficient to
> > > > > > handle replication delays caused by network lag or other
> > > > > > factors, assuming clock skew has already been addressed.
> > > > >
> > > > > I think that in a non-bidirectional case the value could need to
> > > > > be a large number. Is that right?
> > > > >
> > > >
> > > > As per my understanding, even for non-bidirectional cases, the
> > > > value should be small. For example, in the case, pointed out by
> > > > Shveta [1], where the updates from 2 nodes are received by a third
> > > > node, this setting is expected to be small. This setting primarily
> > > > deals with concurrent transactions on multiple nodes, so it should
> > > > be small but I could be missing something.
> > > >
> > >
> > > I might be missing something but the scenario I was thinking of is
> > > something below.
> > >
> > > Suppose that we setup uni-directional logical replication between
> > > Node A and Node B (e.g., Node A -> Node B) and both nodes have the
> > > same row with key = 1:
> > >
> > > Node A:
> > > T1: UPDATE t SET val = 2 WHERE key = 1; (10:00 AM)
> > >   -> This change is applied on Node B at 10:01 AM.
> > >
> > > Node B:
> > > T2: DELETE FROM t WHERE key = 1; (05:00 AM)
> > >
> > > If a vacuum runs on Node B at 06:00 AM, the change of T1 coming from
> > > Node A would raise an "update_missing" conflict. On the other hand,
> > > if a vacuum runs on Node B at 11:00 AM, the change would raise an
> > > "update_deleted" conflict. It looks whether we detect an
> > > "update_deleted" or an "updated_missing" depends on the timing of
> > > vacuum, and to avoid such a situation, we would need to set
> > > vacuum_committs_age to more than 5 hours.
> > >
> >
> > Yeah, in this case, it would detect a different conflict (if we don't
> > set vacuum_committs_age to greater than 5 hours) but as per my
> > understanding, the primary purpose of conflict detection and
> > resolution is to avoid data inconsistency in a bi-directional setup.
> > Assume, in the above case it is a bi-directional setup, then we want
> > to have the same data in both nodes. Now, if there are other cases
> > like the one you mentioned that require to detect the conflict
> > reliably than I agree this value could be large and probably not the
> > best way to achieve it. I think we can mention in the docs that the
> > primary purpose of this is to achieve data consistency among
> > bi-directional kind of setups.
> >
> > Having said that even in the above case, the result should be the same
> > whether the vacuum has removed the row or not. Say, if the vacuum has
> > not yet removed the row (due to vacuum_committs_age or otherwise) then
> > also because the incoming update has a later timestamp, we will
> > convert the update to insert as per last_update_wins resolution
> > method, so the conflict will be considered as update_missing. And,
> > say, the vacuum has removed the row and the conflict detected is
> > update_missing, then also we will convert the update to insert. In
> > short, if UPDATE has lower commit-ts, DELETE should win and if UPDATE
> > has higher commit-ts, UPDATE should win.
> >
> > So, we can expect data consistency

RE: Conflict detection for update_deleted in logical replication

2024-09-10 Thread Zhijie Hou (Fujitsu)
On Wednesday, September 11, 2024 1:03 PM shveta malik  
wrote:
> 
> On Wed, Sep 11, 2024 at 10:15 AM Zhijie Hou (Fujitsu) wrote:
> >
> > On Wednesday, September 11, 2024 12:18 PM shveta malik wrote:
> > >
> > > ~~
> > >
> > > Another query is about 3 node setup. I couldn't figure out what
> > > would be feedback_slots setting when it is not bidirectional, as in
> > > consider the case where there are three nodes A,B,C. Node C is
> > > subscribing to both Node A and Node B. Node A and Node B are the
> > > ones doing concurrent "update" and "delete" which will both be
> > > replicated to Node C. In this case what will be the feedback_slots
> > > setting on Node C? We don't have any slots here which will be
> > > replicating changes from Node C to Node A and Node C to Node B. This
> > > is given in [3] in your first email ([1])
> >
> > Thanks for pointing this out; the link was a bit misleading. I think the
> > solution proposed in this thread is only used to allow detecting
> > update_deleted reliably in a bidirectional cluster. For non-bidirectional
> > cases, it would be more tricky to predict the timing until which we
> > should retain the dead tuples.
> >
> 
> So in brief, this solution is only for a bidirectional setup? For
> non-bidirectional, feedback_slots is non-configurable and thus irrelevant.

Right.

> 
> Irrespective of the above, if the user ends up setting feedback_slot to some
> random but existing slot which is not at all consuming changes, then it may
> so happen that the node will never send feedback msgs to the other node,
> resulting in accumulation of dead tuples on the other node. Is that a
> possibility?

Yes, it's possible. I think this is a common situation for this kind of
user-specified option. For example, user DML will be blocked if any inactive
standby names are added to synchronous_standby_names.

Best Regards,
Hou zj




RE: Conflict detection for update_deleted in logical replication

2024-09-10 Thread Zhijie Hou (Fujitsu)
On Wednesday, September 11, 2024 12:18 PM shveta malik wrote:
> 
> On Tue, Sep 10, 2024 at 4:30 PM Zhijie Hou (Fujitsu) wrote:
> >
> > On Tuesday, September 10, 2024 5:56 PM shveta malik wrote:
> > >
> > > Thanks for the example. Can you please review below and let me know
> > > if my understanding is correct.
> > >
> > > 1)
> > > In a bidirectional replication setup, the user has to create slots
> > > in a way that NodeA's sub's slot is Node B's feedback_slot and Node
> > > B's sub's slot is Node A's feedback slot. And then only this feature will
> work well, is it correct to say?
> >
> > Yes, your understanding is correct.
> >
> > >
> > > 2)
> > > Now coming back to multiple feedback_slots in a subscription, is the
> > > below
> > > correct:
> > >
> > > Say Node A has publications and subscriptions as follow:
> > > --
> > > A_pub1
> > >
> > > A_sub1 (subscribing to B_pub1 with the default slot_name of A_sub1)
> > > A_sub2 (subscribing to B_pub2 with the default slot_name of A_sub2)
> > > A_sub3 (subscribing to B_pub3 with the default slot_name of A_sub3)
> > >
> > >
> > > Say Node B has publications and subscriptions as follow:
> > > --
> > > B_sub1 (subscribing to A_pub1 with the default slot_name of B_sub1)
> > >
> > > B_pub1
> > > B_pub2
> > > B_pub3
> > >
> > > Then what will be the feedback_slot configuration for all
> > > subscriptions of A and B? Is below correct:
> > > --
> > > A_sub1, A_sub2, A_sub3: feedback_slots=B_sub1
> > > B_sub1: feedback_slots=A_sub1,A_sub2, A_sub3
> >
> > Right. The above configurations are correct.
> 
> Okay. It seems difficult to understand configuration from user's perspective.

Right. I think we could give an example in the document to make it clear.

> 
> > >
> > > 3)
> > > If the above is true, then do we have a way to make sure that the
> > > user  has given this configuration exactly the above way? If users
> > > end up giving feedback_slots as some random slot  (say A_slot4 or
> > > incomplete list), do we validate that? (I have not looked at code
> > > yet, just trying to understand design first).
> >
> > The patch doesn't validate if the feedback slots belong to the correct
> > subscriptions on the remote server. It only validates if the slot is an
> > existing, valid, logical slot. I think there are a few challenges to
> > validating it further. E.g. we need a way to identify which server the
> > slot is replicating changes to, which could be tricky as the slot
> > currently doesn't have any info to identify the remote server. Besides,
> > the slot could be inactive temporarily due to some subscriber-side error,
> > in which case we cannot verify the subscription that used it.
> 
> Okay, I understand the challenges here.
> 
> > >
> > > 4)
> > > Now coming to this:
> > >
> > > > The apply worker will get the oldest confirmed flush LSN among the
> > > > specified slots and send the LSN as a feedback message to the
> > > > walsender.
> > >
> > >  There will be one apply worker on B which will be due to B_sub1, so
> > > will it check confirmed_lsn of all slots A_sub1, A_sub2, A_sub3?
> > > Won't it be sufficient to check confirmed_lsn of say slot A_sub1
> > > alone which has subscribed to table 't' on which the delete has been
> > > performed? The rest of the slots (A_sub2, A_sub3) might have subscribed
> > > to different tables?
> >
> > I think it's theoretically correct to only check the A_sub1. We could
> > document that user can do this by identifying the tables that each
> > subscription replicates, but it may not be user friendly.
> >
> 
> Sorry, I fail to understand how the user can identify the tables and give
> feedback_slots accordingly? I thought feedback_slots is a one-time
> configuration when replication is set up (or when the setup changes in
> future); it cannot keep on changing with each query. Or am I missing
> something?

I meant that the user has all the publication information (including the
tables added to a publication) that the subscription subscribes to, and could
also have the slot_name, so I think it's possible to identify the tables that
each subscription includes and set feedback_slots correspondingly before
starting the replication.
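
For example, the tables covered by a given publication (and hence by the
subscription using it) can be listed with the existing catalog view:

    -- run on the publisher; 'b_pub1' follows the naming used above
    SELECT schemaname, tablename
    FROM pg_publication_tables
    WHERE pubname = 'b_pub1';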

RE: Conflict detection for update_deleted in logical replication

2024-09-10 Thread Zhijie Hou (Fujitsu)
On Tuesday, September 10, 2024 7:25 PM Amit Kapila wrote:
> 
> On Thu, Sep 5, 2024 at 5:07 PM Zhijie Hou (Fujitsu) wrote:
> >
> > ---
> > ISSUES and SOLUTION
> > ---
> >
> > To detect update_deleted conflicts, we need to search for dead tuples
> > in the table. However, dead tuples can be removed by VACUUM at any
> > time. Therefore, to ensure consistent and accurate conflict detection,
> > tuples deleted by other origins must not be removed by VACUUM before
> > the conflict detection process. If the tuples are removed prematurely,
> > it might lead to incorrect conflict identification and resolution,
> > causing data divergence between nodes.
> >
> > Here is an example of how VACUUM could affect conflict detection and
> > how to prevent this issue. Assume we have a bidirectional cluster with
> > two nodes, A and B.
> >
> > Node A:
> >   T1: INSERT INTO t (id, value) VALUES (1,1);
> >   T2: DELETE FROM t WHERE id = 1;
> >
> > Node B:
> >   T3: UPDATE t SET value = 2 WHERE id = 1;
> >
> > To retain the deleted tuples, the initial idea was that once
> > transaction T2 had been applied to both nodes, there was no longer a
> > need to preserve the dead tuple on Node A. However, a scenario arises
> > where transactions T3 and T2 occur concurrently, with T3 committing
> > slightly earlier than T2. In this case, if Node B applies T2 and Node
> > A removes the dead tuple (1,1) via VACUUM, and then Node A applies T3
> > after the VACUUM operation, it can only result in an update_missing
> > conflict. Given that the default resolution for update_missing
> > conflicts is apply_or_skip (e.g. convert update to insert if possible
> > and apply the insert), Node A will eventually hold a row (1,2) while
> > Node B becomes empty, causing data inconsistency.
> >
> > Therefore, the strategy needs to be expanded as follows: Node A cannot
> > remove the dead tuple until:
> > (a) The DELETE operation is replayed on all remote nodes, *AND*
> > (b) The transactions on logical standbys occurring before the replay
> > of Node A's DELETE are replayed on Node A as well.
> >
> > ---
> > THE DESIGN
> > ---
> >
> > To achieve the above, we plan to allow the logical walsender to
> > maintain and advance the slot.xmin to protect the data in the user
> > table and introduce a new logical standby feedback message. This
> > message reports the WAL position that has been replayed on the logical
> > standby *AND* the changes occurring on the logical standby before the
> > WAL position are also replayed to the walsender's node (where the
> > walsender is running). After receiving the new feedback message, the
> > walsender will advance the slot.xmin based on the flush info, similar
> > to the advancement of catalog_xmin. Currently, the effective_xmin/xmin
> > of logical slot are unused during logical replication, so I think it's
> > safe and won't cause side-effects to reuse the xmin for this feature.
> >
> > We have introduced a new subscription option
> > (feedback_slots='slot1,...'), where these slots will be used to check
> > condition (b): the transactions on logical standbys occurring before
> > the replay of Node A's DELETE are replayed on Node A as well.
> > Therefore, on Node B, users should specify the slots corresponding to
> > Node A in this option. The apply worker will get the oldest confirmed
> > flush LSN among the specified slots and send the LSN as a feedback
> message to the walsender. -- I also thought of making it an automaic way, e.g.
> > let apply worker select the slots that acquired by the walsenders
> > which connect to the same remote server(e.g. if apply worker's
> > connection info or some other flags is same as the walsender's
> > connection info). But it seems tricky because if some slots are
> > inactive which means the walsenders are not there, the apply worker
> > could not find the correct slots to check unless we save the host along with
> the slot's persistence data.
> >
> > The new feedback message is sent only if feedback_slots is not NULL.
> >
> 
> Don't you need to deal with versioning stuff for sending this new message? I
> mean what if the receiver side of this message is old and doesn't support this
> new message.

Yes, I think we can avoid sending the new message if the remote server version
doesn't support handling this message (e.g. server_version < 18). Will address
this in the next version.
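
A sketch of such a guard in the apply worker (walrcv_server_version() is
existing infrastructure; the message-sending call is hypothetical):

    /* Only publishers on v18 or later understand the new feedback message. */
    if (walrcv_server_version(LogRepWorkerWalRcvConn) >= 180000)
        send_feedback_slots_message();      /* hypothetical */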

> 
> One minor comment on 0003
> ===

RE: Conflict detection for update_deleted in logical replication

2024-09-10 Thread Zhijie Hou (Fujitsu)
On Tuesday, September 10, 2024 5:56 PM shveta malik wrote:
> 
> On Tue, Sep 10, 2024 at 1:40 PM Zhijie Hou (Fujitsu) wrote:
> >
> > On Tuesday, September 10, 2024 2:45 PM shveta malik wrote:
> > >
> > > Thank You Hou-San for explaining the design. But to make it easier
> > > to understand, would you be able to explain the sequence/timeline of
> > > the
> > > *new* actions performed by the walsender and the apply processes for
> > > the given example along with new feedback_slot config needed
> > >
> > > Node A: (Procs: walsenderA, applyA)
> > >   T1: INSERT INTO t (id, value) VALUES (1,1);  ts=10.00 AM
> > >   T2: DELETE FROM t WHERE id = 1;   ts=10.02 AM
> > >
> > > Node B: (Procs: walsenderB, applyB)
> > >   T3: UPDATE t SET value = 2 WHERE id = 1; ts=10.01 AM
> >
> > Thanks for reviewing! Let me elaborate further on the example:
> >
> > On node A, feedback_slots should include the logical slot that is used to
> > replicate changes from Node A to Node B. On node B, feedback_slots
> > should include the logical slot that replicates changes from Node B to
> > Node A.
> >
> > Assume the slot.xmin on Node A has been initialized to a valid number
> > (740) before the following flow:
> >
> > Node A executed T1                                    - 10.00 AM
> > T1 replicated and applied on Node B                   - 10.0001 AM
> > Node B executed T3                                    - 10.01 AM
> > Node A executed T2 (741)                              - 10.02 AM
> > T2 replicated and applied on Node B (delete_missing)  - 10.03 AM
> 
> Not related to this feature, but do you mean delete_origin_differ here?

Oh sorry, that was a mistake. I meant delete_origin_differ.

> 
> > T3 replicated and applied on Node A (new action, detect update_deleted) - 10.04 AM
> >
> > (new action) The apply worker on Node B has confirmed that T2 has been
> > applied locally and that the transactions before T2 (e.g., T3) have been
> > replicated and applied to Node A (e.g. feedback_slot.confirmed_flush_lsn
> > >= lsn of the locally replayed T2), and thus sends the new feedback
> > message to Node A.                                    - 10.05 AM
> >
> > (new action) The walsender on Node A receives the message and
> > advances the slot.xmin.                               - 10.06 AM
> >
> > Then, after the slot.xmin is advanced to a number greater than 741,
> > the VACUUM would be able to remove the dead tuple on Node A.
> >
> 
> Thanks for the example. Can you please review below and let me know if my
> understanding is correct.
> 
> 1)
> In a bidirectional replication setup, the user has to create slots in a way 
> that
> NodeA's sub's slot is Node B's feedback_slot and Node B's sub's slot is Node
> A's feedback slot. And then only this feature will work well, is it correct 
> to say?

Yes, your understanding is correct.

> 
> 2)
> Now coming back to multiple feedback_slots in a subscription, is the below
> correct:
> 
> Say Node A has publications and subscriptions as follow:
> --
> A_pub1
> 
> A_sub1 (subscribing to B_pub1 with the default slot_name of A_sub1)
> A_sub2 (subscribing to B_pub2 with the default slot_name of A_sub2)
> A_sub3 (subscribing to B_pub3 with the default slot_name of A_sub3)
> 
> 
> Say Node B has publications and subscriptions as follow:
> --
> B_sub1 (subscribing to A_pub1 with the default slot_name of B_sub1)
> 
> B_pub1
> B_pub2
> B_pub3
> 
> Then what will be the feedback_slot configuration for all subscriptions of A 
> and
> B? Is below correct:
> --
> A_sub1, A_sub2, A_sub3: feedback_slots=B_sub1
> B_sub1: feedback_slots=A_sub1,A_sub2, A_sub3

Right. The above configurations are correct.
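
Spelled out with the proposed (not committed) feedback_slots option, the
configuration above would be something like:

    -- on Node A
    ALTER SUBSCRIPTION a_sub1 SET (feedback_slots = 'b_sub1');
    ALTER SUBSCRIPTION a_sub2 SET (feedback_slots = 'b_sub1');
    ALTER SUBSCRIPTION a_sub3 SET (feedback_slots = 'b_sub1');
    -- on Node B
    ALTER SUBSCRIPTION b_sub1 SET (feedback_slots = 'a_sub1,a_sub2,a_sub3');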

> 
> 3)
> If the above is true, then do we have a way to make sure that the user  has
> given this configuration exactly the above way? If users end up giving
> feedback_slots as some random slot  (say A_slot4 or incomplete list), do we
> validate that? (I have not looked at code yet, just trying to understand 
> design
> first).

The patch doesn't validate if the feedback slots belong to the correct
subscriptions on the remote server. It only validates if the slot is an
existing, valid, logical slot. I think there are a few challenges to
validating it further. E.g. we need a way to identify which server the slot
is replicating changes to, which could be tricky as the slot currently
doesn't have any info to identify the remote server.

RE: Conflict detection for update_deleted in logical replication

2024-09-10 Thread Zhijie Hou (Fujitsu)
On Tuesday, September 10, 2024 2:45 PM shveta malik wrote:
> > ---
> > THE DESIGN
> > ---
> >
> > To achieve the above, we plan to allow the logical walsender to
> > maintain and advance the slot.xmin to protect the data in the user
> > table and introduce a new logical standby feedback message. This
> > message reports the WAL position that has been replayed on the logical
> > standby *AND* the changes occurring on the logical standby before the
> > WAL position are also replayed to the walsender's node (where the
> > walsender is running). After receiving the new feedback message, the
> > walsender will advance the slot.xmin based on the flush info, similar
> > to the advancement of catalog_xmin. Currently, the effective_xmin/xmin
> > of logical slot are unused during logical replication, so I think it's
> > safe and won't cause side-effects to reuse the xmin for this feature.
> >
> > We have introduced a new subscription option
> > (feedback_slots='slot1,...'), where these slots will be used to check
> > condition (b): the transactions on logical standbys occurring before
> > the replay of Node A's DELETE are replayed on Node A as well.
> > Therefore, on Node B, users should specify the slots corresponding to
> > Node A in this option. The apply worker will get the oldest confirmed
> > flush LSN among the specified slots and send the LSN as a feedback
> message to the walsender. -- I also thought of making it an automaic way, e.g.
> > let apply worker select the slots that acquired by the walsenders
> > which connect to the same remote server(e.g. if apply worker's
> > connection info or some other flags is same as the walsender's
> > connection info). But it seems tricky because if some slots are
> > inactive which means the walsenders are not there, the apply worker
> > could not find the correct slots to check unless we save the host along with
> the slot's persistence data.
> >
> > The new feedback message is sent only if feedback_slots is not NULL.
> > If the slots in feedback_slots are removed, a final message containing
> > InvalidXLogRecPtr will be sent to inform the walsender to forget about
> > the slot.xmin.
> >
> > To detect update_deleted conflicts during update operations, if the
> > target row cannot be found, we perform an additional scan of the table
> > using SnapshotAny.
> > This scan aims to locate the most recently deleted row that matches
> > the old column values from the remote update operation and has not yet
> > been removed by VACUUM. If any such tuples are found, we report the
> > update_deleted conflict along with the origin and transaction information
> that deleted the tuple.
> >
> > Please refer to the attached POC patch set which implements the above
> > design. The patch set is split into some parts to make it easier for the
> > initial review. Please note that the patches are interdependent and
> > cannot work independently.
> >
> > Thanks a lot to Kuroda-San and Amit for the off-list discussion.
> >
> > Suggestions and comments are highly appreciated !
> >
> 
> Thank You Hou-San for explaining the design. But to make it easier to
> understand, would you be able to explain the sequence/timeline of the
> *new* actions performed by the walsender and the apply processes for the
> given example along with new feedback_slot config needed
> 
> Node A: (Procs: walsenderA, applyA)
>   T1: INSERT INTO t (id, value) VALUES (1,1);  ts=10.00 AM
>   T2: DELETE FROM t WHERE id = 1;   ts=10.02 AM
> 
> Node B: (Procs: walsenderB, applyB)
>   T3: UPDATE t SET value = 2 WHERE id = 1; ts=10.01 AM

Thanks for reviewing! Let me elaborate further on the example:

On node A, feedback_slots should include the logical slot that is used to
replicate changes from Node A to Node B. On node B, feedback_slots should
include the logical slot that replicates changes from Node B to Node A.

Assume the slot.xmin on Node A has been initialized to a valid number (740)
before the following flow:

Node A executed T1                                                        - 10.00 AM
T1 replicated and applied on Node B                                       - 10.0001 AM
Node B executed T3                                                        - 10.01 AM
Node A executed T2 (741)                                                  - 10.02 AM
T2 replicated and applied on Node B (delete_missing)                      - 10.03 AM
T3 replicated and applied on Node A (new action, detect update_deleted)   - 10.04 AM

(new action) The apply worker on Node B has confirmed that T2 has been applied
locally and that the transactions before T2 (e.g., T3) have been replicated and
applied to Node A (e.g. feedback_slot.confirmed_flush_lsn >= lsn of the locally
replayed T2), and thus sends the new feedback message to Node A.          - 10.05 AM

RE: long-standing data loss bug in initial sync of logical replication

2024-09-09 Thread Zhijie Hou (Fujitsu)
On Friday, August 9, 2024 7:21 PM Shlok Kyal  wrote:

Hi,

> 
> In the v7 patch, I am looping through the reorder buffer of the current 
> committed
> transaction and storing all invalidation messages in a list. Then I am
> distributing those invalidations.
> But I found that for a transaction we already store all the invalidation 
> messages
> (see [1]). So we don't need to loop through the reorder buffer and store the
> invalidations.
> 
> I have modified the patch accordingly and attached the same.

I have tested this patch across various scenarios and did not find issues.

I confirmed that changes are correctly replicated after adding the table or
schema to the publication, and changes will not be replicated after removing
the table or schema from the publication. This behavior is consistent in both
streaming and non-streaming modes. Additionally, I verified that invalidations
occurring within subtransactions are appropriately distributed.

Please refer to the attached ISOLATION tests which tested the above cases.
This also makes me wonder whether it would be cheaper to write an ISOLATION
test for this bug instead of building a real pub/sub cluster. But I am not
against the current tests in the V8 patch, as they can check the replicated
data in a visible way.

Best Regards,
Hou zj
From 8f4e36c5fc65d4a88058467a73cbe423a5f0e91e Mon Sep 17 00:00:00 2001
From: Hou Zhijie 
Date: Mon, 9 Sep 2024 19:56:18 +0800
Subject: [PATCH] test invalidation distribution

---
 contrib/test_decoding/Makefile|   2 +-
 .../expected/invalidation_distrubution.out| 173 ++
 .../specs/invalidation_distrubution.spec  |  56 ++
 3 files changed, 230 insertions(+), 1 deletion(-)
 create mode 100644 contrib/test_decoding/expected/invalidation_distrubution.out
 create mode 100644 contrib/test_decoding/specs/invalidation_distrubution.spec

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index a4ba1a509a..eef7077067 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -9,7 +9,7 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
oldest_xmin snapshot_transfer subxact_without_top concurrent_stream \
twophase_snapshot slot_creation_error catalog_change_snapshot \
-   skip_snapshot_restore
+   skip_snapshot_restore invalidation_distrubution
 
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
diff --git a/contrib/test_decoding/expected/invalidation_distrubution.out b/contrib/test_decoding/expected/invalidation_distrubution.out
new file mode 100644
index 00..cdc871d31d
--- /dev/null
+++ b/contrib/test_decoding/expected/invalidation_distrubution.out
@@ -0,0 +1,173 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s1_init s2_create_pub s1_insert_tbl1 s1_begin s1_insert_tbl1 s2_alter_pub_add_tbl s1_commit s1_insert_tbl1 s2_get_binary_changes s2_drop_pub
+step s1_init: SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'pgoutput');
+?column?
+
+init
+(1 row)
+
+step s2_create_pub: CREATE PUBLICATION pub;
+step s1_insert_tbl1: INSERT INTO tbl1 (val1, val2) VALUES (1, 1);
+step s1_begin: BEGIN;
+step s1_insert_tbl1: INSERT INTO tbl1 (val1, val2) VALUES (1, 1);
+step s2_alter_pub_add_tbl: ALTER PUBLICATION pub ADD TABLE tbl1;
+step s1_commit: COMMIT;
+step s1_insert_tbl1: INSERT INTO tbl1 (val1, val2) VALUES (1, 1);
+step s2_get_binary_changes: SELECT count(data) FROM pg_logical_slot_get_binary_changes('isolation_slot', NULL, NULL, 'proto_version', '4', 'publication_names', 'pub') WHERE get_byte(data, 0) = 73;
+count
+-
+1
+(1 row)
+
+step s2_drop_pub: DROP PUBLICATION pub;
+?column?
+
+stop
+(1 row)
+
+
+starting permutation: s1_init s2_create_pub s1_insert_tbl1 s1_begin s1_insert_tbl1 s2_alter_pub_add_schema s1_commit s1_insert_tbl1 s2_get_binary_changes s2_drop_pub
+step s1_init: SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'pgoutput');
+?column?
+
+init
+(1 row)
+
+step s2_create_pub: CREATE PUBLICATION pub;
+step s1_insert_tbl1: INSERT INTO tbl1 (val1, val2) VALUES (1, 1);
+step s1_begin: BEGIN;
+step s1_insert_tbl1: INSERT INTO tbl1 (val1, val2) VALUES (1, 1);
+step s2_alter_pub_add_schema: ALTER PUBLICATION pub ADD TABLES IN SCHEMA public;
+step s1_commit: COMMIT;
+step s1_insert_tbl1: INSERT INTO tbl1 (val1, val2) VALUES (1, 1);
+step s2_get_binary_changes: SELECT count(data) FROM pg_logical_slot_get_binary_changes('isolation_slot', NULL, NULL, 'proto_version', '4', 'publication_names', 'pub') WHERE get_byte(data, 0) = 73;
+count
+-
+1
+(1 row)
+
+step s2_drop_pub: DROP PUBLICATION pub;
+?column?
+
+stop
+(1 row)
+
+
+starting permutation: s1_init s2_create_pub s2_alter_pub_add_tbl s1_begin 
s

RE: Collect statistics about conflicts in logical replication

2024-09-03 Thread Zhijie Hou (Fujitsu)
On Tuesday, September 3, 2024 7:23 PM Zhijie Hou (Fujitsu) wrote:
> 
> On Tuesday, September 3, 2024 7:12 PM Amit Kapila wrote:
> >
> > On Fri, Aug 30, 2024 at 12:15 PM Zhijie Hou (Fujitsu) wrote:
> > >
> > > Here is V5 patch which addressed above and Shveta's[1] comments.
> > >
> >
> > Testing the stats for all types of conflicts is not required for this
> > patch especially because they increase the timings by 3-4s. We can add
> > tests for one or two types of conflicts.
> >
> > *
> > (see
> > + * PgStat_StatSubEntry::conflict_count and
> > + PgStat_StatSubEntry::conflict_count)
> >
> > There is a typo in the above comment.
> 
> Thanks for the comments. I have addressed them and adjusted the tests.
> In the V6 patch, only insert_exists and delete_missing are tested.
> 
> I confirmed that it only increased the testing time by 1 second on my machine.

Sorry, I sent the wrong patch in the last email; please refer to the correct
patch here.

Best Regards,
Hou zj


v6_2-0001-Collect-statistics-about-conflicts-in-logical-rep.patch
Description:  v6_2-0001-Collect-statistics-about-conflicts-in-logical-rep.patch


RE: Collect statistics about conflicts in logical replication

2024-09-03 Thread Zhijie Hou (Fujitsu)
On Tuesday, September 3, 2024 7:12 PM Amit Kapila wrote:
> 
> On Fri, Aug 30, 2024 at 12:15 PM Zhijie Hou (Fujitsu) wrote:
> >
> > Here is V5 patch which addressed above and Shveta's[1] comments.
> >
> 
> Testing the stats for all types of conflicts is not required for this patch
> especially because they increase the timings by 3-4s. We can add tests for one
> or two types of conflicts.
> 
> *
> (see
> + * PgStat_StatSubEntry::conflict_count and
> + PgStat_StatSubEntry::conflict_count)
> 
> There is a typo in the above comment.

Thanks for the comments. I have addressed them and adjusted the tests.
In the V6 patch, only insert_exists and delete_missing are tested.

I confirmed that it only increased the testing time by 1 second on my machine.

Best Regards,
Hou zj


v6-0001-Collect-statistics-about-conflicts-in-logical-rep.patch
Description:  v6-0001-Collect-statistics-about-conflicts-in-logical-rep.patch


RE: Collect statistics about conflicts in logical replication

2024-08-29 Thread Zhijie Hou (Fujitsu)
On Friday, August 30, 2024 2:24 PM shveta malik  wrote:
> 
> On Fri, Aug 30, 2024 at 10:53 AM Peter Smith wrote:
> >
> > Hi Hou-San. Here are my review comments for v4-0001.

Thanks Shveta and Peter for giving comments !

> >
> > ==
> >
> > 1. Add links in the docs
> >
> > IMO it would be good for all these confl_* descriptions (in
> > doc/src/sgml/monitoring.sgml) to include links back to where each of
> > those conflict types was defined [1].
> >
> > Indeed, when links are included to the original conflict type
> > information then I think you should remove mentioning
> > "track_commit_timestamp":
> > +   counted only when the
> > +   <link linkend="guc-track-commit-timestamp">track_commit_timestamp</link>
> > +   option is enabled on the subscriber.
> >
> > It should be obvious that you cannot count a conflict if the conflict
> > does not happen, but I don't think we should scatter/duplicate those
> > rules in different places saying when certain conflicts can/can't
> > happen -- we should just link everywhere back to the original
> > description for those rules.
> 
> +1, will make the doc better.

Changed. To add a link to each conflict type, I added an id to each conflict type entry in the docs so that it can be linked.
> > ~~~
> >
> > 2. Arrange all the counts into an intuitive/natural order
> >
> > There is an intuitive/natural ordering for these counts. For example,
> > the 'confl_*' count fields are in the order insert -> update ->
> > delete, which LGTM.
> >
> > Meanwhile, the 'apply_error_count' and the 'sync_error_count' are not
> > in a good order.
> >
> > IMO it makes more sense if everything is ordered as:
> > 'sync_error_count', then 'apply_error_count', then all the 'confl_*'
> > counts.
> >
> > This comment applies to lots of places, e.g.:
> > - docs (doc/src/sgml/monitoring.sgml)
> > - function pg_stat_get_subscription_stats (pg_proc.dat)
> > - view pg_stat_subscription_stats
> > (src/backend/catalog/system_views.sql)
> > - TAP test SELECTs (test/subscription/t/026_stats.pl)
> >
> > As all those places are already impacted by this patch, I think it
> > would be good if (in passing) we (if possible) also swapped the
> > sync/apply counts so everything  is ordered intuitively top-to-bottom
> > or left-to-right.
> 
> Not sure about this though. It does not seem to belong to the current patch.

I also don't think we should handle that in this patch.

Here is V5 patch which addressed above and Shveta's[1] comments.

[1] 
https://www.postgresql.org/message-id/CAJpy0uAZpzustNOMBhxBctHHWbBA%3DsnTAYsLpoWZg%2BcqegmD-A%40mail.gmail.com

Best Regards,
Hou zj


v5-0001-Collect-statistics-about-conflicts-in-logical-rep.patch
Description:  v5-0001-Collect-statistics-about-conflicts-in-logical-rep.patch


RE: Collect statistics about conflicts in logical replication

2024-08-28 Thread Zhijie Hou (Fujitsu)
On Thursday, August 29, 2024 11:18 AM shveta malik  
wrote:
> 
> On Thu, Aug 29, 2024 at 4:59 AM Peter Smith 
> wrote:
> >
> > On Wed, Aug 28, 2024 at 9:19 PM Amit Kapila 
> wrote:
> > >
> > > On Wed, Aug 28, 2024 at 11:43 AM shveta malik
>  wrote:
> > > >
> > > > Thanks for the patch. Just thinking out loud, since we have names
> > > > like 'apply_error_count', 'sync_error_count' which tells that they
> > > > are actually error-count, will it be better to have something
> > > > similar in conflict-count cases, like
> > > > 'insert_exists_conflict_count', 'delete_missing_conflict_count' and so
> on. Thoughts?
> > > >
> > >
> > > It would be better to have conflict in the names but OTOH it will
> > > make the names a bit longer. The other alternatives could be (a)
> > > insert_exists_confl_count, etc. (b) confl_insert_exists_count, etc.
> > > (c) confl_insert_exists, etc. These are based on the column names in
> > > the existing view pg_stat_database_conflicts [1]. The (c) looks
> > > better than other options but it will make the conflict-related
> > > columns different from error-related columns.
> > >
> > > Yet another option is to have a different view like
> > > pg_stat_subscription_conflicts but that sounds like going too far.
> 
> Yes, I think we are good with pg_stat_subscription_stats for the time being.
> 
> > >
> > > [1] -
> > >
> https://www.postgresql.org/docs/devel/monitoring-stats.html#MONITORI
> > > NG-PG-STAT-DATABASE-CONFLICTS-VIEW
> >
> > Option (c) looked good to me.
> 
> +1 for option c. it should be okay to not have '_count' in the name.

Agreed. Here is the new version patch, which changes the names as suggested. I
also rebased the patch on top of another renaming commit, 640178c9.

Best Regards,
Hou zj


v4-0001-Collect-statistics-about-conflicts-in-logical-rep.patch
Description:  v4-0001-Collect-statistics-about-conflicts-in-logical-rep.patch


RE: Collect statistics about conflicts in logical replication

2024-08-28 Thread Zhijie Hou (Fujitsu)
On Thursday, August 29, 2024 8:31 AM Peter Smith  wrote:

Hi,

> I tried an experiment where I deliberately violated a primary key during 
> initial
> table synchronization.
> 
> For example:
...
> test_sub=# 2024-08-29 09:53:57.245 AEST [4345] LOG:  logical replication
> apply worker for subscription "sub1" has started
> 2024-08-29 09:53:57.258 AEST [4347] LOG:  logical replication table
> synchronization worker for subscription "sub1", table "t1" has started
> 2024-08-29 09:53:57.311 AEST [4347] ERROR:  duplicate key value violates
> unique constraint "t1_pkey"
> 2024-08-29 09:53:57.311 AEST [4347] DETAIL:  Key (a)=(1) already exists.
> 2024-08-29 09:53:57.311 AEST [4347] CONTEXT:  COPY t1, line 1
> ~~~
> 
> Notice how after a while there were multiple (15) 'sync_error_count' recorded.
> 
> According to the docs: 'insert_exists' happens when "Inserting a row that
> violates a NOT DEFERRABLE unique constraint.".  So why are there not the
> same number of 'insert_exists_count' recorded in pg_stat_subscription_stats?

This error was caused by COPY rather than INSERT (see "CONTEXT:  COPY t1,
line 1"), so it is expected. The docs for the conflict counts
(insert_exists_count) already mention that conflicts are counted only *during
the application of changes*, which makes it clear that failures during initial
table synchronization are not counted. See the existing apply_error_count,
which has similar wording (e.g., "an error occurred while applying changes").
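
To make the distinction concrete, here is a minimal sketch of the scenario
(connection string elided; counter names as used in this version of the patch):

-- publisher
CREATE TABLE t1 (a int PRIMARY KEY);
INSERT INTO t1 VALUES (1);
CREATE PUBLICATION pub1 FOR TABLE t1;

-- subscriber: the same key already exists, so the initial COPY fails
CREATE TABLE t1 (a int PRIMARY KEY);
INSERT INTO t1 VALUES (1);
CREATE SUBSCRIPTION sub1 CONNECTION '...' PUBLICATION pub1;

-- only sync_error_count grows; insert_exists_count stays 0 because the
-- failure happens in COPY, not while applying changes
SELECT sync_error_count, insert_exists_count
FROM pg_stat_subscription_stats WHERE subname = 'sub1';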

Best Regards,
Hou zj


RE: Conflict detection and logging in logical replication

2024-08-27 Thread Zhijie Hou (Fujitsu)
On Wednesday, August 28, 2024 12:11 PM  Zhijie Hou (Fujitsu) 
 wrote:
> > > > Peter Smith mentioned to me off-list that the names of conflict
> > > > types 'update_differ' and 'delete_differ' are not intuitive as
> > > > compared to all other conflict types like insert_exists,
> > > > update_missing, etc. The other alternative that comes to mind for
> > > > those conflicts is to name them as
> > 'update_origin_differ'/''delete_origin_differ'.
> > > >
> > >
> > > For things to "differ" there must be more than one them. The plural
> > > of origin is origins.
> > >
> > > e.g. 'update_origins_differ'/''delete_origins_differ'.
> > >
> > > OTOH, you could say "differs" instead of differ:
> > >
> > > e.g. 'update_origin_differs'/''delete_origin_differs'.
> > >
> >
> > +1 on 'update_origin_differs' instead of 'update_origins_differ' as
> > the former is somewhat similar to other conflict names 'insert_exists'
> > and 'update_exists'.
> 
> Since we reached a consensus on this, I am attaching a small patch to rename
> as suggested.

Sorry, I attached the wrong patch. Here is the correct one.

Best Regards,
Hou zj


0001-Rename-the-conflict-types-for-origin-differ-cases.patch
Description:  0001-Rename-the-conflict-types-for-origin-differ-cases.patch


RE: Conflict detection and logging in logical replication

2024-08-27 Thread Zhijie Hou (Fujitsu)
On Wednesday, August 28, 2024 11:30 AM shveta malik  
wrote:
> 
> On Tue, Aug 27, 2024 at 4:37 AM Peter Smith 
> wrote:
> >
> > On Mon, Aug 26, 2024 at 7:52 PM Amit Kapila 
> wrote:
> > >
> > > On Thu, Aug 22, 2024 at 2:21 PM Amit Kapila 
> wrote:
> > > >
> > > > On Thu, Aug 22, 2024 at 1:33 PM Peter Smith
>  wrote:
> > > > >
> > > > > Do you think the documentation for the 'column_value' parameter
> > > > > of the conflict logging should say that the displayed value
> > > > > might be truncated?
> > > > >
> > > >
> > > > I updated the patch to mention this and pushed it.
> > > >
> > >
> > > Peter Smith mentioned to me off-list that the names of conflict
> > > types 'update_differ' and 'delete_differ' are not intuitive as
> > > compared to all other conflict types like insert_exists,
> > > update_missing, etc. The other alternative that comes to mind for
> > > those conflicts is to name them as
> 'update_origin_differ'/''delete_origin_differ'.
> > >
> >
> > For things to "differ" there must be more than one them. The plural of
> > origin is origins.
> >
> > e.g. 'update_origins_differ'/''delete_origins_differ'.
> >
> > OTOH, you could say "differs" instead of differ:
> >
> > e.g. 'update_origin_differs'/''delete_origin_differs'.
> >
> 
> +1 on 'update_origin_differs' instead of 'update_origins_differ' as
> the former is somewhat similar to other conflict names 'insert_exists'
> and 'update_exists'.

Since we reached a consensus on this, I am attaching a small patch
to rename as suggested.

Best Regards,
Hou zj


0001-Rename-the-conflict-types-for-origin-differ-cases.patch
Description:  0001-Rename-the-conflict-types-for-origin-differ-cases.patch


RE: Collect statistics about conflicts in logical replication

2024-08-27 Thread Zhijie Hou (Fujitsu)
On Tuesday, August 27, 2024 10:59 AM Peter Smith  wrote:
> 
> ~~~
> 
> 3.
> +# Enable track_commit_timestamp to detect origin-differ conflicts in
> +logical # replication. Reduce wal_retrieve_retry_interval to 1ms to
> +accelerate the # restart of the logical replication worker after 
> encountering a
> conflict.
> +$node_subscriber->append_conf(
> + 'postgresql.conf', q{
> +track_commit_timestamp = on
> +wal_retrieve_retry_interval = 1ms
> +});
> 
> Later, after CDR resolvers are implemented, it might be good to revisit these
> conflict test cases and re-write them to use some conflict resolvers like 
> 'skip'.
> Then the subscriber won't give ERRORs and restart apply workers all the time
> behind the scenes, so you won't need the above configuration for accelerating
> the worker restarts. In other words, running these tests might be more 
> efficient
> if you can avoid restarting workers all the time.
> 
> I suggest putting an XXX comment here as a reminder that these tests should
> be revisited to make use of conflict resolvers in the future.

I think it would be too early to mention the resolution implementation details
in the comments, considering that the resolution work is still not RFC. Also, I
think reducing wal_retrieve_retry_interval is a reasonable way to speed up the
test in this case, because the test does not let the worker restart all the
time; the error that causes the restart is resolved immediately after the stats
check. So, I think adding an XXX comment is not appropriate here.

Other comments look good to me.
I slightly adjusted a few words and merged the changes. Thanks!

Here is V3 patch.

Best Regards,
Hou zj

 


v3-0001-Collect-statistics-about-conflicts-in-logical-rep.patch
Description:  v3-0001-Collect-statistics-about-conflicts-in-logical-rep.patch


RE: Doc: fix the note related to the GUC "synchronized_standby_slots"

2024-08-26 Thread Zhijie Hou (Fujitsu)
On Monday, August 26, 2024 5:37 PM Amit Kapila  wrote:
> 
> On Mon, Aug 26, 2024 at 1:30 PM  wrote:
> >
> > When I read the following documentation related to the
> "synchronized_standby_slots", I misunderstood that data loss would not occur
> in the case of synchronous physical replication. However, this is incorrect 
> (see
> reproduce.txt).
> >
> > > Note that in the case of asynchronous replication, there remains a risk of
> data loss for transactions committed on the former primary server but have yet
> to be replicated to the new primary server.
> > https://www.postgresql.org/docs/17/logical-replication-failover.html
> >
> > Am I missing something?
> >
> 
> It seems part of the paragraph: "Note that in the case of asynchronous
> replication, there remains a risk of data loss for transactions committed on 
> the
> former primary server but have yet to be replicated to the new primary 
> server." is
> a bit confusing. Will it make things clear to me if we remove that part?

I think the intention is to address a complaint[1] that data inserted on the
primary after the primary disconnects from the standby is still lost after
failover. But after rethinking, maybe it doesn't directly belong in the logical
failover section, because it's a general fact for async replication.
If we think it matters, maybe we can remove this part and slightly modify
another part:

   parameter ensures a seamless transition of those subscriptions after the
   standby is promoted. They can continue subscribing to publications on the
-   new primary server without losing data.
+   new primary server without losing data that has already been replicated
+   and flushed on the standby server.

[1] 
https://www.postgresql.org/message-id/ZfRe2%2BOxMS0kvNvx%40ip-10-97-1-34.eu-west-3.compute.internal

Best Regards,
Hou zj


RE: Conflict detection and logging in logical replication

2024-08-26 Thread Zhijie Hou (Fujitsu)
On Monday, August 26, 2024 6:36 PM shveta malik  wrote:
> 
> On Mon, Aug 26, 2024 at 3:22 PM Amit Kapila 
> wrote:
> >
> > On Thu, Aug 22, 2024 at 2:21 PM Amit Kapila 
> wrote:
> > >
> > > On Thu, Aug 22, 2024 at 1:33 PM Peter Smith 
> wrote:
> > > >
> > > > Do you think the documentation for the 'column_value' parameter of
> > > > the conflict logging should say that the displayed value might be
> > > > truncated?
> > > >
> > >
> > > I updated the patch to mention this and pushed it.
> > >
> >
> > Peter Smith mentioned to me off-list that the names of conflict types
> > 'update_differ' and 'delete_differ' are not intuitive as compared to
> > all other conflict types like insert_exists, update_missing, etc. The
> > other alternative that comes to mind for those conflicts is to name
> > them as 'update_origin_differ'/''delete_origin_differ'.
> 
> +1 on  'update_origin_differ'/''delete_origin_differ'. Gives more clarity.

+1

> 
> > The description in docs for 'update_differ' is as follows: Updating a
> > row that was previously modified by another origin. Note that this
> > conflict can only be detected when track_commit_timestamp is enabled
> > on the subscriber. Currently, the update is always applied regardless
> > of the origin of the local row.
> >
> > Does anyone else have any thoughts on the naming of these conflicts?

Best Regards,
Hou zj


RE: Collect statistics about conflicts in logical replication

2024-08-26 Thread Zhijie Hou (Fujitsu)
On Monday, August 26, 2024 3:30 PM Peter Smith  wrote:
> 
> ==
> src/include/replication/conflict.h
> 
> nit - defined 'NUM_CONFLICT_TYPES' inside the enum (I think this way is
> often used in other PG source enums)

I think we have recently tended to avoid doing that, as it has been commented
that this style is somewhat deceptive and can cause confusion. See a previous
similar comment[1]. The current style follows the other existing examples like:

#define IOOBJECT_NUM_TYPES (IOOBJECT_TEMP_RELATION + 1)
#define IOCONTEXT_NUM_TYPES (IOCONTEXT_VACUUM + 1)
#define IOOP_NUM_TYPES (IOOP_WRITEBACK + 1)
#define BACKEND_NUM_TYPES (B_LOGGER + 1)
...


> ==
> src/test/subscription/t/026_stats.pl
> 
> 1.
> + # Delete data from the test table on the publisher. This delete
> + operation # should be skipped on the subscriber since the table is already
> empty.
> + $node_publisher->safe_psql($db, qq(DELETE FROM $table_name;));
> +
> + # Wait for the subscriber to report tuple missing conflict.
> + $node_subscriber->poll_query_until(
> + $db,
> + qq[
> + SELECT update_missing_count > 0 AND delete_missing_count > 0 FROM
> + pg_stat_subscription_stats WHERE subname = '$sub_name'
> + ])
> +   or die
> +   qq(Timed out while waiting for tuple missing conflict for
> subscription '$sub_name');
> 
> Can you write a comment to explain why the replicated DELETE is
> expected to increment both the 'update_missing_count' and the
> 'delete_missing_count'?

I think the comments several lines above the wait explain the reason[2]. I
slightly modified the comments to make it clearer.

Other changes look good to me and have been merged, thanks!

Here is the V2 patch.

[1] 
https://www.postgresql.org/message-id/202201130922.izanq4hkkqnx%40alvherre.pgsql

[2]
..
# Truncate test table to ensure the upcoming update operation is skipped
# and the test can continue.
$node_subscriber->safe_psql($db, qq(TRUNCATE $table_name));

Best Regards,
Hou zj 


v2-0001-Collect-statistics-about-conflicts-in-logical-rep.patch
Description:  v2-0001-Collect-statistics-about-conflicts-in-logical-rep.patch


Collect statistics about conflicts in logical replication

2024-08-22 Thread Zhijie Hou (Fujitsu)
Hi hackers,
Cc people involved in the related work.

In the original conflict resolution thread[1], we decided to split the
conflict resolution work into multiple patches to facilitate incremental
progress towards supporting conflict resolution in logical replication, and one
of those pieces is statistics collection for the conflicts.

Following the discussions in the conflict resolution thread, collecting
statistics about logical replication conflicts is independently important: it
can help users understand conflict stats (e.g., conflict rates) and potentially
identify portions of the application and other parts of the system to optimize.
So, I am starting a new thread for this feature.

The idea is to add columns (insert_exists_count, update_differ_count,
update_exists_count, update_missing_count, delete_differ_count,
delete_missing_count) to the pg_stat_subscription_stats view to show
information about the conflicts which occur during the application of logical
replication changes. The conflict types originate from the committed work that
reports additional information for each conflict in logical replication.
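
As an illustration, the new counters could be inspected like this (a sketch;
column names as proposed above):

SELECT subname, apply_error_count, sync_error_count,
       insert_exists_count, update_differ_count, update_exists_count,
       update_missing_count, delete_differ_count, delete_missing_count
FROM pg_stat_subscription_stats;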

The patch for this feature is attached.

Suggestions and comments are highly appreciated.

[1] 
https://www.postgresql.org/message-id/CAA4eK1LgPyzPr_Vrvvr4syrde4hyT%3DQQnGjdRUNP-tz3eYa%3DGQ%40mail.gmail.com

Best Regards,
Hou Zhijie



v1-0001-Collect-statistics-about-conflicts-in-logical-rep.patch
Description:  v1-0001-Collect-statistics-about-conflicts-in-logical-rep.patch


RE: Conflict detection and logging in logical replication

2024-08-22 Thread Zhijie Hou (Fujitsu)
On Thursday, August 22, 2024 11:25 AM shveta malik  
wrote:
> 
> On Wed, Aug 21, 2024 at 3:04 PM Zhijie Hou (Fujitsu)
>  wrote:
> >
> >
> > Attach the V20 patch set which addressed above, Shveta[1][2] and
> > Kuroda-san's[3] comments.
> >
> 
> Thank You for the patch. Few comments:

Thanks for the patches. Here is the V21 patch, which addresses
Peter's and your comments.

Best Regards,
Hou zj


v21-0001-Doc-explain-the-log-format-of-logical-replicati.patch
Description:  v21-0001-Doc-explain-the-log-format-of-logical-replicati.patch


RE: Conflict detection and logging in logical replication

2024-08-21 Thread Zhijie Hou (Fujitsu)
On Wednesday, August 21, 2024 2:45 PM Peter Smith  wrote:
> 
> Here are some review comments for the v19-0001 docs patch.
> 
> The content seemed reasonable, but IMO it should be presented quite
> differently.
> 
> 
> 
> 1. Use sub-sections
> 
> I expect this logical replication "Conflicts" section is going to evolve into
> something much bigger. Surely, it's not going to be one humongous page of
> details, so it will be a section with lots of subsections like all the other 
> in
> Chapter 29.
> 
> IMO, you should be writing the docs in that kind of structure from the
> beginning.
> 
> For example, I'm thinking something like below (this is just an example - 
> surely
> lots more subsections will be needed for this topic):
> 
> 29.6 Conflicts
> 29.6.1. Conflict types
> 29.6.2. Logging format
> 29.6.3. Examples
> 
> Specifically, this v19-0001 patch information should be put into a subsection
> like the 29.6.2 shown above.

I think that's a good idea. But I would prefer to do that in a separate patch
(maybe a third patch after the first and second are RFC), because AFAICS we
would need to adjust some existing docs, which falls outside the scope of the
current patch.

> 
> ~~~
> 
> 2. Markup
> 
> +
> +LOG:  conflict detected on relation "schemaname.tablename":
> conflict=conflict_type
> +DETAIL:  detailed explanation.
> +Key (column_name, ...)=(column_name, ...);
> existing local tuple (column_name, ...)=(column_name, ...);
> remote tuple (column_name, ...)=(column_name, ...);
> replica identity (column_name, ...)=(column_name, ...).
> +
> 
> IMO this should be using markup more like the SQL syntax references.
> - e.g. I suggest <synopsis> instead of <programlisting>
> - e.g. I suggest all the substitution parameters (e.g. detailed explanation,
> conflict_type, column_name, ...) in the log should use <replaceable
> class="parameter"> and use those markups again later in these docs instead
> of <literal>

Agreed. I have changed to use <synopsis> and <replaceable>. But for static
words like "Key" or "replica identity" it doesn't look appropriate to use
<replaceable>, so I kept using <literal> for them.

> nit - The amount of scrolling needed makes this LOG format too hard to see.
> Try to wrap it better so it can fit without being so wide.

I thought about this, but wrapping the sentence would cause the words to be
displayed on different lines after compiling. I think that's inconsistent with
the real log, which displays the tuples in one line.

Other comments not mentioned above look good to me.

Attached is the V20 patch set, which addresses the above along with Shveta's[1][2]
and Kuroda-san's[3] comments.

[1] 
https://www.postgresql.org/message-id/CAJpy0uDUNigg86KRnk4A0KjwY8-pPaXzZ2eCjnb1ydCH48VzJQ%40mail.gmail.com
[2] 
https://www.postgresql.org/message-id/CAJpy0uARh2RRDBF6mJ7d807DsNXuCNQmEXZUn__fw4KZv8qEMg%40mail.gmail.com
[3] 
https://www.postgresql.org/message-id/TYAPR01MB5692C4EDD8B86760496A993AF58E2%40TYAPR01MB5692.jpnprd01.prod.outlook.com

Best Regards,
Hou zj



v20-0002-Collect-statistics-about-conflicts-in-logical-re.patch
Description:  v20-0002-Collect-statistics-about-conflicts-in-logical-re.patch


v20-0001-Doc-explain-the-log-format-of-logical-replicatio.patch
Description:  v20-0001-Doc-explain-the-log-format-of-logical-replicatio.patch


RE: Conflict detection and logging in logical replication

2024-08-21 Thread Zhijie Hou (Fujitsu)
On Wednesday, August 21, 2024 1:31 PM Kuroda, Hayato/黒田 隼人 
 wrote:
> 
> Dear Hou,
> 
> Thanks for updating the patch! I think the patch is mostly good.
> Here are minor comments.

Thanks for the comments!

> 
> 02.
> ```
> + 
> +  The key section in the second sentence of the
> ...
> ```
> 
> I would prefer that the section name be quoted.

I thought about this. But I feel the 'key' here is not a real string, so I
chose not to quote it.

> 
> 0002:
> 
> 03.
> ```
> -#include "replication/logicalrelation.h"
> ```
> 
> Just to confirm - this removal is not related with the feature but just the
> improvement, right?

logicalrelation.h becomes unnecessary after adding worker_internal.h, so I
think it's this patch's job to remove it.

Best Regards,
Hou zj



RE: Conflict detection and logging in logical replication

2024-08-21 Thread Zhijie Hou (Fujitsu)
On Wednesday, August 21, 2024 11:40 AM shveta malik  
wrote:
> 
> On Tue, Aug 20, 2024 at 4:45 PM Zhijie Hou (Fujitsu) 
> wrote:> 

Thanks for the comments!

> 4)
> Shall we give an example LOG message in the end?

I feel the current insert_exists log in the conflict section is
sufficient as an example of a real conflict log.

Other comments look good, and I will address.

Best Regards,
Hou zj


RE: Conflict detection and logging in logical replication

2024-08-20 Thread Zhijie Hou (Fujitsu)
On Wednesday, August 21, 2024 9:33 AM Jonathan S. Katz  
wrote:
> On 8/6/24 4:15 AM, Zhijie Hou (Fujitsu) wrote:
> 
> > Thanks for the idea! I thought about few styles based on the suggested
> > format, what do you think about the following ?
> 
> Thanks for proposing formats. Before commenting on the specifics, I do want to
> ensure that we're thinking about the following for the log formats:
> 
> 1. For the PostgreSQL logs, we'll want to ensure we do it in a way that's as
> convenient as possible for people to parse the context from scripts.

Yeah. And I personally think the current log format is OK for parsing purposes.

> 
> 2. Semi-related, I still think the simplest way to surface this info to a 
> user is
> through a "pg_stat_..." view or similar catalog mechanism (I'm less 
> opinionated
> on the how outside of we should make it available via SQL).

We have a patch(v19-0002) in this thread to collect conflict stats and display
them in the view, and the patch is under review.

Storing it in a catalog needs more analysis, as we may need to add additional
logic to clean up old conflict data in that catalog table. I think we can
consider it as a future improvement.

> 
> 3. We should ensure we're able to convey to the user these details about the
> conflict:
> 
> * What time it occurred on the local server (which we'd have in the logs)
> * What kind of conflict it is
> * What table the conflict occurred on
> * What action caused the conflict
> * How the conflict was resolved (ability to include source/origin info)

I think all of the above are already covered in the current conflict log,
except that we don't yet support resolving conflicts, so we don't log the
resolution.

> 
> 
> I think outputting the remote/local tuple value may be a parameter we need to
> think about (with the desired outcome of trying to avoid another parameter). I
> have a concern about unintentionally leaking data (and I understand that
> someone with access to the logs does have a broad ability to view data); I'm
> less concerned about the size of the logs, as conflicts in a well-designed
> system should be rare (though a conflict storm could fill up the logs, likely 
> there
> are other issues to content with at that point).

We could add an option to control this, but the tuple value is already output
in some existing cases (e.g., partition checks, table constraint checks, views
with check options, unique violations), and that code checks the current user's
privileges to decide whether to output the tuple or not. So, I think it's OK
to display the tuple for conflicts.
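
For example, the existing unique-violation path already includes key values in
the server log without any extra option (a minimal example; the table name u
is illustrative):

CREATE TABLE u (a int PRIMARY KEY);
INSERT INTO u VALUES (1);
INSERT INTO u VALUES (1);
-- ERROR:  duplicate key value violates unique constraint "u_pkey"
-- DETAIL:  Key (a)=(1) already exists.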

Best Regards,
Hou zj


RE: Conflict detection and logging in logical replication

2024-08-20 Thread Zhijie Hou (Fujitsu)
On Tuesday, August 20, 2024 12:37 PM Amit Kapila  
wrote:
> 
> On Mon, Aug 19, 2024 at 4:16 PM Amit Kapila 
> Pushed.

Thanks for pushing.

Here are the remaining patches.

0001 adds additional doc to explain the log format.
0002 collects statistics about conflicts in logical replication.

Best Regards,
Hou zj


v19-0002-Collect-statistics-about-conflicts-in-logical-re.patch
Description:  v19-0002-Collect-statistics-about-conflicts-in-logical-re.patch


v19-0001-Doc-explain-the-log-format-of-logical-replicatio.patch
Description:  v19-0001-Doc-explain-the-log-format-of-logical-replicatio.patch


RE: Conflict detection and logging in logical replication

2024-08-19 Thread Zhijie Hou (Fujitsu)
On Monday, August 19, 2024 2:40 PM Amit Kapila  wrote:
> 
> On Mon, Aug 19, 2024 at 11:54 AM shveta malik 
> wrote:
> >
> > On Mon, Aug 19, 2024 at 11:37 AM Amit Kapila 
> wrote:
> > >
> > > On Mon, Aug 19, 2024 at 9:08 AM shveta malik 
> wrote:
> > > >
> > > > On Sun, Aug 18, 2024 at 2:27 PM Zhijie Hou (Fujitsu)
> > > >  wrote:
> > > > >
> > > > > Attach the V16 patch which addressed the comments we agreed on.
> > > > > I will add a doc patch to explain the log format after the 0001 is 
> > > > > RFC.
> > > > >
> > > >
> > > > Thank You for addressing comments. Please see this scenario:
> > > >
> > > > create table tab1(pk int primary key, val1 int unique, val2 int);
> > > >
> > > > pub: insert into tab1 values(1,1,1);
> > > > sub: insert into tab1 values(2,2,3);
> > > > pub: update tab1 set val1=2 where pk=1;
> > > >
> > > > Wrong 'replica identity' column logged? shouldn't it be pk?
> > > >
> > > > ERROR:  conflict detected on relation "public.tab1":
> > > > conflict=update_exists
> > > > DETAIL:  Key already exists in unique index "tab1_val1_key",
> > > > modified locally in transaction 801 at 2024-08-19 08:50:47.974815+05:30.
> > > > Key (val1)=(2); existing local tuple (2, 2, 3); remote tuple (1,
> > > > 2, 1); replica identity (val1)=(1).
> > > >
> > >
> > > The docs say that by default replica identity is primary_key [1]
> > > (see REPLICA IDENTITY),
> >
> > yes, I agree. But here the importance of dumping it was to know the
> > value of RI as well which is being used as an identification of row
> > being updated rather than row being conflicted. Value is logged
> > correctly.
> >
> 
> Agreed, sorry, I misunderstood the problem reported. I thought the suggestion
> was to use 'primary key' instead of 'replica identity' but you pointed out 
> that the
> column used in 'replica identity' was wrong.
> We should fix this one.

Thanks for reporting the bug. I have fixed it and ran pgindent in V17 patch.
I also adjusted few comments and fixed a typo.

Best Regards,
Hou zj


v17-0001-Detect-and-log-conflicts-in-logical-replication.patch
Description:  v17-0001-Detect-and-log-conflicts-in-logical-replication.patch


RE: Conflict detection and logging in logical replication

2024-08-18 Thread Zhijie Hou (Fujitsu)
On Friday, August 16, 2024 7:47 PM Michail Nikolaev 
  wrote:
> > I think you might misunderstand the behavior of CheckAndReportConflict(),
> > even if it found a conflict, it still inserts the tuple into the index which
> > means the change is anyway applied.
> 
> > In the above conditions where a concurrent tuple insertion is removed or
> > rolled back before CheckAndReportConflict, the tuple inserted by apply will
> > remain. There is no need to report anything in such cases as apply was
> > successful.
> 
> Yes, thank you for explanation, I was thinking UNIQUE_CHECK_PARTIAL works
> differently.
> 
> But now I think DirtySnapshot-related bug is a blocker for this feature then,
> I'll reply into original after rechecking it.

Based on your response in the original thread[1], where you confirmed that the
dirty snapshot bug does not impact the detection of insert_exists conflicts, I
assume we are in agreement that this bug is not a blocker for the detection
feature. If you think otherwise, please feel free to let me know.

[1] 
https://www.postgresql.org/message-id/CANtu0oh69b%2BVCiASX86dF_eY%3D9%3DA2RmMQ_%2B0%2BuxZ_Zir%2BoNhhw%40mail.gmail.com

Best Regards,
Hou zj


RE: Conflict detection and logging in logical replication

2024-08-18 Thread Zhijie Hou (Fujitsu)
On Friday, August 16, 2024 5:25 PM shveta malik  wrote:
> 
> On Fri, Aug 16, 2024 at 12:19 PM Amit Kapila 
> wrote:
> >
> > On Fri, Aug 16, 2024 at 11:48 AM shveta malik 
> wrote:
> > >
> > > On Fri, Aug 16, 2024 at 10:46 AM shveta malik 
> wrote:
> > > >
> > > > 3)
> > > > For update_exists(), we dump:
> > > > Key (a, b)=(2, 1)
> > > >
> > > > For delete_missing, update_missing, update_differ, we dump:
> > > > Replica identity (a, b)=(2, 1).
> > > >
> > > > For update_exists as well, shouldn't we dump 'Replica identity'?
> > > > Only for insert case, it should be referred as 'Key'.
> > > >
> > >
> > > On rethinking, is it because for update_exists case 'Key' dumped is
> > > not the one used to search the row to be updated? Instead it is the
> > > one used to search the conflicting row. Unlike update_differ, the
> > > row to be updated and the row currently conflicting will be
> > > different for update_exists case. I earlier thought that 'KEY' and
> > > 'Existing local tuple' dumped always belong to the row currently
> > > being updated/deleted/inserted. But for 'update_eixsts', that is not
> > > the case. We are dumping 'Existing local tuple' and 'Key' for the
> > > row which is conflicting and not the one being updated.  Example:
> > >
> > > ERROR:  conflict detected on relation "public.tab_1":
> > > conflict=update_exists Key (a, b)=(2, 1); existing local tuple (2, 1); 
> > > remote
> tuple (2, 1).
> > >
> > > Operations performed were:
> > > Pub: insert into tab values (1,1);
> > > Sub: insert into tab values (2,1);
> > > Pub: update tab set a=2 where a=1;
> > >
> > > Here Key and local tuple are both 2,1 instead of 1,1. While replica
> > > identity value (used to search original row) will be 1,1 only.
> > >
> > > It may be slightly confusing or say tricky to understand when
> > > compared to other conflicts' LOGs. But not sure what better we can do
> here.
> > >
> >
> > The update_exists behaves more like insert_exists as we detect that
> > only while inserting into index. It is also not clear to me if we can
> > do better than to clarify this in docs.
> >
> 
> Instead of 'existing local tuple', will it be slightly better to have 
> 'conflicting local
> tuple'?

I am slightly unsure about adding one more term to describe the "existing
local tuple". I think we'd better use a consistent word. But if others feel
otherwise, I can change it in the next version.

> 
> Few trivial comments:
> 
> 1)
> errdetail_apply_conflict() header says:
> 
>  * 2. Display of conflicting key, existing local tuple, remote new tuple, and
>  *replica identity columns,  if any.
> 
> We may mention that existing *conflicting* local tuple.

As above, I think that would duplicate the "existing local tuple" wording.

> 
> Looking at build_tuple_value_details(), the cases where we display 'KEY 'and
> the ones where we display 'replica identity' are mutually exclusives (we have
> ASSERTs like that).  Shall we add this info in
> header that either Key or   'replica identity' is displayed. Or if we
> don't want to make it mutually exclusive then update_exists is one such casw
> where we can have both Key and 'Replica Identity cols'.

I think it's fine to display the replica identity for update_exists, so I added it.

> 
> 
> 2)
> BuildIndexValueDescription() header comment says:
> 
>  * This is currently used
>  * for building unique-constraint, exclusion-constraint and logical 
> replication
>  * tuple missing conflict error messages
> 
> Is it being used only for 'tuple missing conflict' flow? I thought, it will 
> be hit for
> other flows as well.

Removed the "tuple missing".

Attached is the V16 patch, which addresses the comments we agreed on.
I will add a doc patch to explain the log format after 0001 is RFC.


Best Regards,
Hou zj



v16-0001-Detect-and-log-conflicts-in-logical-replication.patch
Description:  v16-0001-Detect-and-log-conflicts-in-logical-replication.patch


RE: Conflict detection and logging in logical replication

2024-08-18 Thread Zhijie Hou (Fujitsu)
On Friday, August 16, 2024 2:49 PM Amit Kapila  wrote:
> 
> 
> > 
> >
> > One more comment:
> >
> > 5)
> > For insert/update_exists, the sequence is:
> > Key .. ; existing local tuple .. ; remote tuple ...
> >
> > For rest of the conflicts, sequence is:
> >  Existing local tuple .. ; remote tuple .. ; replica identity ..
> >
> > Is it intentional? Shall the 'Key' or 'Replica Identity' be the first
> > one to come in all conflicts?
> >
> 
> This is worth considering but Replica Identity signifies the old tuple values,
> that is why it is probably kept at the end. 

Right. I personally think the current position is ok.

Best Regards,
Hou zj 




RE: Conflict detection and logging in logical replication

2024-08-18 Thread Zhijie Hou (Fujitsu)
On Friday, August 16, 2024 2:31 PM Amit Kapila  wrote:
> 
> On Fri, Aug 16, 2024 at 10:46 AM shveta malik 
> wrote:
> >
> > On Thu, Aug 15, 2024 at 12:47 PM Zhijie Hou (Fujitsu)
> >  wrote:
> > >
> > > Thanks. I have checked and merged the changes. Here is the V15 patch
> > > which addressed above comments.
> >
> > Thanks for the patch. Please find few comments and queries:
> >
> > 1)
> > For various conflicts , we have these in Logs:
> > Replica identity (val1)=(30).(for RI on 1 column)
> > Replica identity (pk, val1)=(200, 20).  (for RI on  2 columns)
> > Replica identity (40, 40, 11).(for RI full)
> >
> > Shall we have have column list in last case as well, or can simply
> > have *full* keyword i.e. Replica identity full (40, 40, 11)
> >
> 
> I would prefer 'full' instead of the entire column list as the complete 
> column list
> could be long and may not much sense.

+1, and I will change it in the V16 patch.

Best Regards,
Hou zj


RE: [BUG?] check_exclusion_or_unique_constraint false negative

2024-08-15 Thread Zhijie Hou (Fujitsu)
On Monday, August 12, 2024 7:11 PM Michail Nikolaev 
  wrote:
> > In my test, if the tuple is updated and new tuple is in the same page,
> > heapam_index_fetch_tuple should find the new tuple using HOT chain. So, 
> > it's a
> > bit unclear to me how the updated tuple is missing. Maybe I missed some 
> > other
> > conditions for this issue.
> 
> Yeah, I think the pgbench-based reproducer may also cause page splits in 
> btree.
> But we may add an index to the table to disable HOT.
> 
> I have attached a reproducer for this case using a spec and injection points.
> 
> I hope it helps, check the attached files.

Thanks a lot for the steps!

I successfully reproduced the issue you mentioned in the context of logical
replication[1]. As you said, it could increase the possibility of a tuple being
missed when applying updates or deletes in the logical apply worker. I think
this is a long-standing issue, and I will investigate the fix you proposed.

In addition, I think the bug is not a blocker for the conflict detection
feature, as the feature simply reports the current behavior of the logical
apply worker (either unique violation or tuple missing) without introducing any
new functionality. Furthermore, I think that the new ExecCheckIndexConstraints()
call after ExecInsertIndexTuples() is not affected by the dirty snapshot bug.
This is because a tuple has already been inserted into the btree before the
dirty snapshot scan, which means that a concurrent non-HOT update would not be
possible (it would be blocked after finding the just-inserted tuple and would
wait for the apply worker to commit the current transaction).

It would be good if others could also share their opinion on this.


[1] The steps to reproduce the tuple missing in logical replication.

1. setup pub/sub env, and publish a table with 1 row.

pub:
CREATE TABLE t(a int primary key, b int);
INSERT INTO t VALUES(1,1);
CREATE PUBLICATION pub FOR TABLE t;

sub:
CREATE TABLE t (a int primary key, b int check (b < 5));
CREATE INDEX t_b_idx ON t(b);
CREATE SUBSCRIPTION sub CONNECTION 'dbname=postgres port=$port_publisher' 
PUBLICATION pub;

2. Execute an UPDATE(UPDATE t set b = b + 1) on the publisher and use gdb to
stop the apply worker at the point after index_getnext_tid() and before
index_fetch_heap().

3. execute a concurrent update(UPDATE t set b = b + 100) on the subscriber to
update a non-key column value and commit the update.

4. release the apply worker and it would report the update_missing conflict.

Best Regards,
Hou zj


RE: Conflict detection and logging in logical replication

2024-08-15 Thread Zhijie Hou (Fujitsu)
On Wednesday, August 14, 2024 10:15 PM Michail Nikolaev 
  wrote:
> > This is as expected, and we have documented this in the code comments. We 
> > don't
> > need to report a conflict if the conflicting tuple has been removed or 
> > updated
> > due to concurrent transaction. The same is true if the transaction that
> > inserted the conflicting tuple is rolled back before 
> > CheckAndReportConflict().
> > We don't consider such cases as a conflict.
> 
> That seems a little bit strange to me.
> 
> From the perspective of a user, I expect that if a change from publisher is 
> not
> applied - I need to know about it from the logs. 

I think this is exactly the current behavior in the patch. In the race
condition we discussed, the insert will be applied if the conflicting tuple is
removed concurrently before CheckAndReportConflict().

> But in that case, I will not see any information about conflict in the logs
> in SOME cases. But in OTHER cases I will see it. However, in both cases the
> change from publisher was not applied. And these cases are just random and
> depend on the timing of race conditions. It is not something I am expecting
> from the database.

I think you might misunderstand the behavior of CheckAndReportConflict(): even
if it finds a conflict, it still inserts the tuple into the index, which means
the change is applied anyway.

Best Regards,
Hou zj


RE: Conflict detection and logging in logical replication

2024-08-15 Thread Zhijie Hou (Fujitsu)
On Wednesday, August 14, 2024 7:02 PM Amit Kapila  
wrote:
> 
> On Wed, Aug 14, 2024 at 8:05 AM Zhijie Hou (Fujitsu)
>  wrote:
> >
> > Here is the V14 patch.
> >
> 
> Review comments:
> 1.
> ReportApplyConflict()
> {
> ...
> + ereport(elevel,
> + errcode(ERRCODE_INTEGRITY_CONSTRAINT_VIOLATION),
> + errmsg("conflict detected on relation \"%s.%s\": conflict=%s",
> +get_namespace_name(RelationGetNamespace(localrel)),
> ...
> 
> Is it a good idea to use ERRCODE_INTEGRITY_CONSTRAINT_VIOLATION for
> all conflicts? I think it is okay to use for insert_exists and update_exists. 
> The
> other error codes to consider for conflicts other than insert_exists and
> update_exists are ERRCODE_T_R_SERIALIZATION_FAILURE,
> ERRCODE_CARDINALITY_VIOLATION, ERRCODE_NO_DATA,
> ERRCODE_NO_DATA_FOUND,
> ERRCODE_TRIGGERED_DATA_CHANGE_VIOLATION,
> ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE.
> 
> BTW, even for insert/update_exists, won't it better to use
> ERRCODE_UNIQUE_VIOLATION ?

Agreed. I changed the patch to use ERRCODE_UNIQUE_VIOLATION for insert_exists
and update_exists, and ERRCODE_T_R_SERIALIZATION_FAILURE for the other
conflicts.

> 
> Apart from the above, the attached contains some cosmetic changes.

Thanks. I have checked and merged the changes. Here is the V15 patch,
which addresses the above comments.

Best Regards,
Hou zj



v15-0001-Detect-and-log-conflicts-in-logical-replication.patch
Description:  v15-0001-Detect-and-log-conflicts-in-logical-replication.patch


RE: Conflict detection and logging in logical replication

2024-08-13 Thread Zhijie Hou (Fujitsu)
On Tuesday, August 13, 2024 7:04 PM Amit Kapila  wrote:
> 
> On Tue, Aug 13, 2024 at 10:09 AM Zhijie Hou (Fujitsu)
>  wrote:
> >
> > Here is V13 patch set which addressed above comments.
> >
> 
> 1.
> +ReportApplyConflict(int elevel, ConflictType type, EState *estate,
> +ResultRelInfo *relinfo,
> 
> The change looks better but it would still be better to keep elevel and type 
> after
> relinfo. The same applies to other places as well.

Changed.

> 
> 2.
> + * The caller should ensure that the index with the OID 'indexoid' is locked.
> + *
> + * Refer to errdetail_apply_conflict for the content that will be
> +included in
> + * the DETAIL line.
> + */
> +void
> +ReportApplyConflict(int elevel, ConflictType type, EState *estate,
> 
> Is it possible to add an assert to ensure that the index is locked by the 
> caller?

Added.

> 
> 3.
> +static char *
> +build_tuple_value_details(EState *estate, ResultRelInfo *relinfo,
> +   TupleTableSlot *searchslot,
> +   TupleTableSlot *localslot,
> +   TupleTableSlot *remoteslot,
> +   Oid indexoid)
> {
> ...
> ...
> + /*
> + * If 'searchslot' is NULL and 'indexoid' is valid, it indicates that
> + we
> + * are reporting the unique constraint violation conflict, in which
> + case
> + * the conflicting key values will be reported.
> + */
> + if (OidIsValid(indexoid) && !searchslot) {
> ...
> ...
> }
> 
> This indirect way of inferencing constraint violation looks fragile.
> The caller should pass the required information explicitly and then you can
> have the required assertions here.
> 
> Apart from the above, I have made quite a few changes in the code comments
> and LOG messages in the attached.

Thanks. I have addressed the above comments and merged the changes.

Here is the V14 patch.

Best Regards,
Hou zj


v14-0001-Detect-and-log-conflicts-in-logical-replication.patch
Description:  v14-0001-Detect-and-log-conflicts-in-logical-replication.patch


RE: Conflict detection and logging in logical replication

2024-08-13 Thread Zhijie Hou (Fujitsu)
On Tuesday, August 13, 2024 7:33 PM Michail Nikolaev 
  wrote:

> I think this is an independent issue which can be discussed separately in the
> original thread[1], and I have replied to that thread.

>Thanks! But it seems like this part is still relevant to the current thread:

> > It also seems possible that a conflict could be resolved by a concurrent 
> > update
> > before the call to CheckAndReportConflict, which means there's no guarantee
> > that the conflict will be reported correctly. Should we be concerned about
> > this?

This is as expected, and we have documented this in the code comments. We
don't need to report a conflict if the conflicting tuple has been removed or
updated due to a concurrent transaction. The same is true if the transaction
that inserted the conflicting tuple is rolled back before
CheckAndReportConflict(). We don't consider such cases as conflicts.
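
To illustrate the interleaving that we consider non-reportable (a sketch, not
exact code paths):

-- the subscriber already has a local tuple with key 1
-- apply worker: starts applying a remote INSERT with key 1
-- local session: DELETE FROM t WHERE a = 1;  -- the conflicting tuple is gone
-- apply worker: CheckAndReportConflict() finds no live conflicting tuple, so
--               no conflict is logged and the insert is applied as usual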

Best Regards,
Hou zj


RE: Conflict detection and logging in logical replication

2024-08-12 Thread Zhijie Hou (Fujitsu)
On Monday, August 12, 2024 7:41 PM Amit Kapila  wrote:
> 
> On Fri, Aug 9, 2024 at 12:29 PM Zhijie Hou (Fujitsu)
>  wrote:
> >
> > Here is the V12 patch that improved the log format as discussed.
> >
> 
> Review comments:

Thanks for the comments.

> ===
> 1. The patch doesn't display the remote tuple for delete_differ case.
> However, it shows the remote tuple correctly for update_differ. Is
> there a reason for the same? See below messages:
> 
> update_differ:
> --
> LOG:  conflict detected on relation "public.t1": conflict=update_differ
> DETAIL:  Updating the row containing (c1)=(1) that was modified
> locally in transaction 806 at 2024-08-12 11:48:14.970002+05:30.
> Existing local tuple (1, 3, arun  ); remote tuple (1, 3,
> ajay  ).
> ...
> 
> delete_differ
> --
> LOG:  conflict detected on relation "public.t1": conflict=delete_differ
> DETAIL:  Deleting the row containing (c1)=(1) that was modified by
> locally in transaction 809 at 2024-08-12 14:15:41.966467+05:30.
> Existing local tuple (1, 3, arun  ).
> 
> Note this happens when the publisher table has a REPLICA IDENTITY FULL
> and the subscriber table has primary_key. It would be better to keep
> the messages consistent. One possibility is that we remove
> key/old_tuple from the first line of the DETAIL message and display it
> in the second line as Existing local tuple ; remote tuple
> <..>; key <...>

Agreed. I had thought that in the delete_differ/missing cases the remote tuple
was covered by the key values in the first sentence. To be consistent, I have
moved the column values from the first sentence to the second sentence,
including for the insert_exists conflict.
conflict.

The new format looks like:

LOG: xxx
DETAIL: Key %s; existing local tuple %s; remote new tuple %s; replica identity 
%s

The Key will include the conflicting key for xxx_exists conflicts. And the 
replica identity part
will include the replica identity keys or the full tuple value in replica 
identity FULL case.

> 
> 2. Similar to above, the remote tuple is not displayed in
> delete_missing but displayed in updated_missing type of conflict. If
> we follow the style mentioned in the previous point then the DETAIL
> message: "DETAIL:  Did not find the row containing (c1)=(1) to be
> updated." can also be changed to: "DETAIL:  Could not find the row to
> be updated." followed by other detail.

Same as above.


> 3. The detail of insert_exists is confusing.
> 
> ERROR:  conflict detected on relation "public.t1": conflict=insert_exists
> DETAIL:  Key (c1)=(1) already exists in unique index "t1_pkey", which
> was modified locally in transaction 802 at 2024-08-12
> 11:11:31.252148+05:30.
> 
> It sounds like the key value "(c1)=(1)" in the index is modified. How
> about changing slightly as: "Key (c1)=(1) already exists in unique
> index "t1_pkey", modified locally in transaction 802 at 2024-08-12
> 11:11:31.252148+05:30."? Feel free to propose if anything better comes
> to your mind.

The suggested message looks good to me.

> 
> 4.
> if (localorigin == InvalidRepOriginId)
> + appendStringInfo(&err_detail, _("Deleting the row containing %s that
> was modified by locally in transaction %u at %s."),
> + val_desc, localxmin, timestamptz_to_str(localts));
> 
> Typo in the above message. /modified by locally/modified locally

Fixed.

> 
> 5.
> @@ -2661,6 +2662,29 @@ apply_handle_update_internal(ApplyExecutionData
> *edata,
> {
> ...
> found = FindReplTupleInLocalRel(edata, localrel,
> &relmapentry->remoterel,
> localindexoid,
> remoteslot, &localslot);
> ...
> ...
> +
> + ReportApplyConflict(LOG, CT_UPDATE_DIFFER, relinfo,
> + GetRelationIdentityOrPK(localrel),
> 
> To find the tuple, we may have used an index other than Replica
> Identity or PK (see IsIndexUsableForReplicaIdentityFull), but while
> reporting conflict we don't consider such an index. I think the reason
> is that such an index scan wouldn't have resulted in a unique tuple
> and that is why we always compare the complete tuple in such cases. Is
> that the reason? Can we write a comment to make it clear?

Added comments atop of ReportApplyConflict for the 'indexoid' parameter.

> 
> 6.
> void ReportApplyConflict(int elevel, ConflictType type,
> + ResultRelInfo *relinfo, Oid indexoid,
> + TransactionId localxmin,
> + RepOriginId localorigin,
> + TimestampTz localts,
> + TupleTableSlot *searchslot,
> + TupleTableSlot *localslot,
> + TupleTableSlot *remoteslot,
> + EState *estate);
> 
> The prototype looks odd with pointers and no

RE: Conflict detection and logging in logical replication

2024-08-11 Thread Zhijie Hou (Fujitsu)
On Friday, August 9, 2024 7:45 PM Michail Nikolaev  
 wrote:
> There are some comments on this patch related to issue [0]. In short: any
> DirtySnapshot index scan may fail to find an existing tuple in the case of a
> concurrent update.
> 
> - FindConflictTuple may return a false negative result in the case of a
> concurrent update, because ExecCheckIndexConstraints uses SnapshotDirty.
> - As a result, CheckAndReportConflict may fail to report the conflict.
> - In apply_handle_update_internal we may get an CT_UPDATE_MISSING instead of 
> CT_UPDATE_DIFFER
> - In apply_handle_update_internal we may get an CT_DELETE_MISSING instead of 
> CT_DELETE_DIFFER
> - In apply_handle_tuple_routing we may get an CT_UPDATE_MISSING instead of 
> CT_UPDATE_DIFFER
> 
> If you're interested, I could create a test to reproduce the issue within the
> context of logical replication. Issue [0] itself includes a test case to
> replicate the problem.
> 
> It also seems possible that a conflict could be resolved by a concurrent 
> update
> before the call to CheckAndReportConflict, which means there's no guarantee
> that the conflict will be reported correctly. Should we be concerned about
> this?

Thanks for reporting.

I think this is an independent issue which can be discussed separately in the
original thread[1], and I have replied to that thread.

Best Regards,
Hou zj


RE: [BUG?] check_exclusion_or_unique_constraint false negative

2024-08-11 Thread Zhijie Hou (Fujitsu)
Hi,

Thanks for reporting the issue!

I tried to reproduce this in logical replication but failed. If possible,
could you please share some steps to reproduce it in the logical replication
context?

In my test, if the tuple is updated and the new tuple is in the same page,
heapam_index_fetch_tuple() should find the new tuple using the HOT chain. So,
it's a bit unclear to me how the updated tuple goes missing. Maybe I missed
some other conditions for this issue.

It would be better if we could reproduce this by adding some breakpoints using
gdb, which may help us write a TAP test using an injection point to reproduce
this reliably. I see the TAP test you shared uses pgbench to reproduce this,
and it works, but it would be great if we could analyze the issue more deeply
by debugging the code.

And I have a few questions related to the steps you shared:

> * Session 1 reads a B-tree page using SnapshotDirty and copies item X to the 
> buffer.
> * Session 2 updates item X, inserting a new TID Y into the same page.
> * Session 2 commits its transaction.
> * Session 1 starts to fetch from the heap and tries to fetch X, but it was
>   already deleted by session 2. So, it goes to the B-tree for the next TID.
> * The B-tree goes to the next page, skipping Y.
> * Therefore, the search finds nothing, but tuple Y is still alive.

I am wondering at which point the update should happen. Should it happen after
calling index_getnext_tid() and before index_fetch_heap()? It would be great if
you could give more details on the above steps. Thanks!

Best Regards,
Hou zj


RE: Conflict detection and logging in logical replication

2024-08-09 Thread Zhijie Hou (Fujitsu)
On Wednesday, August 7, 2024 1:24 PM Amit Kapila  
wrote:
> 
> On Tue, Aug 6, 2024 at 1:45 PM Zhijie Hou (Fujitsu)
> > Thanks for the idea! I thought about few styles based on the suggested
> format,
> > what do you think about the following ?
> >
> > ---
> > Version 1
> > ---
> > LOG: CONFLICT: insert_exists; DESCRIPTION: remote INSERT violates
> unique constraint "uniqueindex" on relation "public.test".
> > DETAIL: Existing local tuple (a, b, c) = (2, 3, 4)
> xid=123,origin="pub",timestamp=xxx; remote tuple (a, b, c) = (2, 4, 5).
> >
> 
> Won't this case be ERROR? If so, the error message format like the
> above appears odd to me because in some cases, the user may want to
> add some filter based on the error message though that is not ideal.
> Also, the primary error message starts with a small case letter and
> should be short.
> 
> > LOG: CONFLICT: update_differ; DESCRIPTION: updating a row with key (a,
> b) = (2, 4) on relation "public.test" was modified by a different source.
> > DETAIL: Existing local tuple (a, b, c) = (2, 3, 4)
> xid=123,origin="pub",timestamp=xxx; remote tuple (a, b, c) = (2, 4, 5).
> >
> > LOG: CONFLICT: update_missing; DESCRIPTION: did not find the row with
> key (a, b) = (2, 4) on "public.test" to update.
> > DETAIL: remote tuple (a, b, c) = (2, 4, 5).
> >
> > ---
> > Version 2
> > It moves most the details to the DETAIL line compared to version 1.
> > ---
> > LOG: CONFLICT: insert_exists on relation "public.test".
> > DETAIL: Key (a)=(1) already exists in unique index "uniqueindex", which
> was modified by origin "pub" in transaction 123 at 2024xxx;
> > Existing local tuple (a, b, c) = (1, 3, 4), remote tuple 
> > (a, b, c)
> = (1, 4, 5).
> >
> > LOG: CONFLICT: update_differ on relation "public.test".
> > DETAIL: Updating a row with key (a, b) = (2, 4) that was modified by a
> different origin "pub" in transaction 123 at 2024xxx;
> > Existing local tuple (a, b, c) = (2, 3, 4); remote tuple 
> > (a, b, c)
> = (2, 4, 5).
> >
> > LOG: CONFLICT: update_missing on relation "public.test".
> > DETAIL: Did not find the row with key (a, b) = (2, 4) to update;
> > Remote tuple (a, b, c) = (2, 4, 5).
> >
> 
> I think we can combine sentences with full stop.
> 
> ...
> > ---
> > Version 3
> > It is similar to the style in the current patch, I only added the key value 
> > for
> > differ and missing conflicts without outputting the complete
> > remote/local tuple value.
> > ---
> > LOG: conflict insert_exists detected on relation "public.test".
> > DETAIL: Key (a)=(1) already exists in unique index "uniqueindex", which
> was modified by origin "pub" in transaction 123 at 2024xxx.
> >
> 
> For ERROR messages this appears suitable.
> 
> Considering all the above points, I propose yet another version:
> 
> LOG: conflict detected for relation "public.test": conflict=insert_exists
> DETAIL: Key (a)=(1) already exists in unique index "uniqueindex",
> which was modified by the origin "pub" in transaction 123 at 2024xxx.
> Existing local tuple (a, b, c) = (1, 3, 4), remote tuple (a, b, c) =
> (1, 4, 5).
> 
> LOG: conflict detected for relation "public.test": conflict=update_differ
> DETAIL: Updating a row with key (a, b) = (2, 4) that was modified by a
> different origin "pub" in transaction 123 at 2024xxx. Existing local
> tuple (a, b, c) = (2, 3, 4); remote tuple (a, b, c) = (2, 4, 5).
> 
> LOG:  conflict detected for relation "public.test": conflict=update_missing
> DETAIL: Could not find the row with key (a, b) = (2, 4) to update.
> Remote tuple (a, b, c) = (2, 4, 5).

Here is the V12 patch, which improves the log format as discussed. I also fixed
a bug in the previous version where the wrong column value was reported in the
DETAIL message.

In the latest patch, the DETAIL line comprises two parts:
1. An explanation of the conflict type, including the tuple value used to
   search for the existing local tuple provided for update or deletion, or
   the tuple value causing the unique constraint violation.
2. A display of the complete existing local tuple and the remote tuple, if any.
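
For example, with this format an update_differ conflict is logged like this
(representative output; values from a local run):

LOG:  conflict detected on relation "public.t1": conflict=update_differ
DETAIL:  Updating the row containing (c1)=(1) that was modified locally in
transaction 806 at 2024-08-12 11:48:14.970002+05:30. Existing local tuple
(1, 3, arun); remote tuple (1, 3, ajay).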

I also addressed Shveta's comments[1] and tried to merge Kuroda-san's
changes[2] into the new code.

The 0002 (new sub option) patch has been removed as discussed. The 0003 (stats
collection) patch is also removed temporarily; we can bring it back after
finishing the 0001 work.

[1] 
https://www.postgresql.org/message-id/CAJpy0uAjJci%2BOtm4ANU0__-2qqhH2cALp8hQw5pBjNZyREF7rg%40mail.gmail.com
[2] 
https://www.postgresql.org/message-id/TYAPR01MB5692224DB472AA3FA58E1D1AF5B82%40TYAPR01MB5692.jpnprd01.prod.outlook.com

Best Regards,
Hou zj


v12-0001-Detect-and-log-conflicts-in-logical-replication.patch
Description:  v12-0001-Detect-and-log-conflicts-in-logical-replication.patch


RE: [bug fix] prepared transaction might be lost when max_prepared_transactions is zero on the subscriber

2024-08-08 Thread Zhijie Hou (Fujitsu)
On Thursday, August 8, 2024 6:01 PM shveta malik  wrote:
> 
> On Thu, Aug 8, 2024 at 12:03 PM Amit Kapila 
> wrote:
> >
> > On Thu, Aug 8, 2024 at 10:37 AM Hayato Kuroda (Fujitsu)
> >  wrote:
> > >
> > ...
> > >
> > > An easiest fix is to reset session replication origin before calling
> > > the RecordTransactionAbort(). I think this can happen when 1)
> > > LogicalRepApplyLoop() raises an ERROR or 2) apply worker exits.
> Attached patch can fix the issue on HEAD.
> > >
> >
> > Few comments:
> > =
> > *
> > @@ -4409,6 +4409,14 @@ start_apply(XLogRecPtr origin_startpos)
> >   }
> >   PG_CATCH();
> >   {
> > + /*
> > + * Reset the origin data to prevent the advancement of origin
> > + progress
> > + * if the transaction failed to apply.
> > + */
> > + replorigin_session_origin = InvalidRepOriginId;
> > + replorigin_session_origin_lsn = InvalidXLogRecPtr;
> > + replorigin_session_origin_timestamp = 0;
> >
> > Can't we call replorigin_reset() instead here?
> >
> > *
> > + /*
> > + * Register a callback to reset the origin state before aborting the
> > + * transaction in ShutdownPostgres(). This is to prevent the
> > + advancement
> > + * of origin progress if the transaction failed to apply.
> > + */
> > + before_shmem_exit(replorigin_reset, (Datum) 0);
> >
> > I think we need this despite resetting the origin-related variables in
> > PG_CATCH block to handle FATAL error cases, right? If so, can we use
> > PG_ENSURE_ERROR_CLEANUP() instead of PG_CATCH()?
> 
> +1
> 
> Basic tests work fine on this patch. Just thinking out loud,
> SetupApplyOrSyncWorker() is called for table-sync worker as well and IIUC
> tablesync worker does not deal with 2PC txns. So do we even need to register
> replorigin_reset() for tablesync worker as well? If we may hit such an issue 
> in
> general, then perhaps we need it in table-sync worker  otherwise not.  It
> needs some investigation. Thoughts?

I think this is a general issue that can occur not only due to 2PC. IIUC, this
problem should arise if any ERROR arises after setting the
replorigin_session_origin_lsn but before the CommitTransactionCommand is
completed. If so, I think we should register it for the tablesync worker as well.
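
For illustration, a minimal sketch of how this could be wired up (the callback
name and its placement in start_apply() follow the discussion above and are
assumptions, not the committed change):

/* Callback usable with both PG_ENSURE_ERROR_CLEANUP and before_shmem_exit(). */
static void
replorigin_reset(int code, Datum arg)
{
	replorigin_session_origin = InvalidRepOriginId;
	replorigin_session_origin_lsn = InvalidXLogRecPtr;
	replorigin_session_origin_timestamp = 0;
}

/*
 * In start_apply(): guard the apply loop so that both ERROR and FATAL
 * exits reset the origin state before the transaction is aborted.
 */
PG_ENSURE_ERROR_CLEANUP(replorigin_reset, (Datum) 0);
{
	LogicalRepApplyLoop(origin_startpos);
}
PG_END_ENSURE_ERROR_CLEANUP(replorigin_reset, (Datum) 0);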

Best Regards,
Hou zj


RE: Conflict detection and logging in logical replication

2024-08-07 Thread Zhijie Hou (Fujitsu)
On Wednesday, August 7, 2024 3:00 PM Kuroda, Hayato/黒田 隼人 
 wrote:
> 
> While playing with the 0003 patch (the patch may not be ready), I found that
> when the insert_exists event occurred, both apply_error_count and
> insert_exists_count was counted.

Thanks for testing. 0003 is a separate feature which we might review
after 0001 is in good shape or committed.

> 
> ```
> -- insert a tuple on the subscriber
> subscriber =# INSERT INTO tab VALUES (1);
> 
> -- insert the same tuple on the publisher, which causes insert_exists conflict
> publisher =# INSERT INTO tab VALUES (1);
> 
> -- after some time...
> subscriber =# SELECT * FROM pg_stat_subscription_stats;
> -[ RECORD 1 ]--------+------
> subid                | 16389
> subname              | sub
> apply_error_count    | 16
> sync_error_count     | 0
> insert_exists_count  | 16
> update_differ_count  | 0
> update_exists_count  | 0
> update_missing_count | 0
> delete_differ_count  | 0
> delete_missing_count | 0
> stats_reset          |
> ```
> 
> Not tested, but I think this could also happen for the update_exists_count 
> case,
> or sync_error_count may be counted when the tablesync worker detects the
> conflict.
> 
> IIUC, the reason is that pgstat_report_subscription_error() is called in the
> PG_CATCH part in start_apply() even after ReportApplyConflict(ERROR) is
> called.
> 
> What do you think of the current behavior? I wouldn't say I like that the same
> phenomenon is counted as several events. E.g., in the case of vacuum, the
> entry seemed to be separated based on the process by backends or
> autovacuum.

I think this is as expected. When the insert conflicts, it will report an ERROR,
so both the conflict count and the error count are incremented, which looks
reasonable to me. The default behavior for each conflict could be different and is
documented; I think it's clear that insert_exists will cause an ERROR while
delete_missing or .. will not.

In addition, we might support a resolution called "error" which is to report an
ERROR when facing the specified conflict. It would be a bit confusing to me if
the apply_error_count is not incremented on the specified conflict when I set the
resolution to ERROR.

> I feel the spec is unfamiliar in that only insert_exists and update_exists are
> counted duplicated with the apply_error_count.
> 
> An easy fix is to introduce a global variable which is turned on when the 
> conflict
> is found.

I am not sure about the benefit of changing the current behavior in the patch.
It would also change the existing behavior: before the conflict detection
patch, the apply_error_count is incremented on each unique key violation, while
after the detection patch, it would stop incrementing apply_error_count and only
the conflict count would be incremented.

Best Regards,
Hou zj


RE: Conflict detection and logging in logical replication

2024-08-06 Thread Zhijie Hou (Fujitsu)
On Monday, August 5, 2024 6:52 PM Amit Kapila  wrote:
> 
> On Sun, Aug 4, 2024 at 1:22 PM Zhijie Hou (Fujitsu) 
> wrote:
> >
> > On Friday, August 2, 2024 7:03 PM Amit Kapila 
> wrote:
> > >
> >
> > Here is the V11 patch set which addressed above and Kuroda-san[1]
> comments.
> >
> 
> A few design-level points:
> 
> *
> @@ -525,10 +602,33 @@ ExecSimpleRelationInsert(ResultRelInfo
> *resultRelInfo,
>   /* OK, store the tuple and create index entries for it */
>   simple_table_tuple_insert(resultRelInfo->ri_RelationDesc, slot);
> 
> + conflictindexes = resultRelInfo->ri_onConflictArbiterIndexes;
> +
>   if (resultRelInfo->ri_NumIndices > 0)
>   recheckIndexes = ExecInsertIndexTuples(resultRelInfo,
> -slot, estate, false, false,
> -NULL, NIL, false);
> +slot, estate, false,
> +conflictindexes ? true : false,
> +&conflict,
> +conflictindexes, false);
> +
> + /*
> + * Checks the conflict indexes to fetch the conflicting local tuple
> + * and reports the conflict. We perform this check here, instead of
> + * performing an additional index scan before the actual insertion and
> + * reporting the conflict if any conflicting tuples are found. This is
> + * to avoid the overhead of executing the extra scan for each INSERT
> + * operation, even when no conflict arises, which could introduce
> + * significant overhead to replication, particularly in cases where
> + * conflicts are rare.
> + *
> + * XXX OTOH, this could lead to clean-up effort for dead tuples added
> + * in heap and index in case of conflicts. But as conflicts shouldn't
> + * be a frequent thing so we preferred to save the performance overhead
> + * of extra scan before each insertion.
> + */
> + if (conflict)
> + CheckAndReportConflict(resultRelInfo, estate, CT_INSERT_EXISTS,
> +recheckIndexes, slot);
> 
> I was thinking about this case where we have some pros and cons of doing
> additional scans only after we found the conflict. I was wondering how we will
> handle the resolution strategy for this when we have to remote_apply the tuple
> for insert_exists/update_exists cases.
> We would have already inserted the remote tuple in the heap and index before
> we found the conflict which means we have to roll back that change and then
> start a fresh transaction to perform remote_apply which probably has to
> update the existing tuple. We may have to perform something like speculative
> insertion and then abort it. That doesn't sound cheap either. Do you have any
> better ideas?

Since most of the conflict detection code can be reused in the later
resolution patch, I am thinking we can go for the re-scan-after-insertion approach
in the detection patch. Then in the resolution patch we can have a check such that
if the resolver is remote_apply/last_update_win we detect the conflict before
insertion, otherwise we detect it after. This way we can avoid adding a
subscription option in the detection patch because we are not introducing overhead
for the detection. And we can also save some overhead in the resolution patch
if there is no need to do a prior check. There could be some duplicated code
in the resolution patch, as it would have code for both the prior check and the
after check, but that seems acceptable.


> 
> *
> -ERROR:  duplicate key value violates unique constraint "test_pkey"
> -DETAIL:  Key (c)=(1) already exists.
> +ERROR:  conflict insert_exists detected on relation "public.test"
> +DETAIL:  Key (c)=(1) already exists in unique index "t_pkey", which
> was modified locally in transaction 740 at 2024-06-26 10:47:04.727375+08.
> 
> I think the format to display conflicts is not very clear. The conflict 
> should be
> apparent just by seeing the LOG/ERROR message. I am thinking of something
> like below:
> 
> LOG: CONFLICT: ;
> DESCRIPTION: If any .. ; RESOLUTION: (This one can be added later)
> DEATAIL: remote_tuple (tuple values); local_tuple (tuple values);
> 
> With the above, one can easily identify the conflict's reason and action 
> taken by
> apply worker.

Thanks for the idea! I thought about a few styles based on the suggested format;
what do you think about the following?

---
Version 1
---
LOG: CONFLICT: insert_exists; DESCRIPTION: remote INSERT violates unique 
constraint "uniqueindex" on relation "public.test".
DETAIL: Existing local tuple (a, b, c) = (2, 3, 4) 
xid=123,origin="pub",timestamp=xxx; remote tuple (a, b, c) = (2, 4, 5).

LOG: CONFLICT: update_differ; DESCRIPTION: updating a row with key (a, b) = (2, 4) on relation "public.test" was modified by a different source.
DETAIL: Existing local tuple (a, b, c) = (2, 3, 4) xid=123,origin="pub",timestamp=xxx; remote tuple (a, b, c) = (2, 4, 5).

RE: Conflict detection and logging in logical replication

2024-08-04 Thread Zhijie Hou (Fujitsu)
On Friday, July 26, 2024 2:26 PM Amit Kapila  wrote:
> 
> On Fri, Jul 26, 2024 at 9:39 AM shveta malik  wrote:
> >
> > On Thu, Jul 11, 2024 at 7:47 AM Zhijie Hou (Fujitsu)
> >  wrote:
> > >
> > > On Wednesday, July 10, 2024 5:39 PM shveta malik
>  wrote:
> > > >
> >
> > > > 2)
> > > > Another case which might confuse user:
> > > >
> > > > CREATE TABLE t1 (pk integer primary key, val1 integer, val2
> > > > integer);
> > > >
> > > > On PUB: insert into t1 values(1,10,10); insert into t1
> > > > values(2,20,20);
> > > >
> > > > On SUB: update t1 set pk=3 where pk=2;
> > > >
> > > > Data on PUB: {1,10,10}, {2,20,20}
> > > > Data on SUB: {1,10,10}, {3,20,20}
> > > >
> > > > Now on PUB: update t1 set val1=200 where val1=20;
> > > >
> > > > On Sub, I get this:
> > > > 2024-07-10 14:44:00.160 IST [648287] LOG:  conflict update_missing
> > > > detected on relation "public.t1"
> > > > 2024-07-10 14:44:00.160 IST [648287] DETAIL:  Did not find the row
> > > > to be updated.
> > > > 2024-07-10 14:44:00.160 IST [648287] CONTEXT:  processing remote
> > > > data for replication origin "pg_16389" during message type
> > > > "UPDATE" for replication target relation "public.t1" in
> > > > transaction 760, finished at 0/156D658
> > > >
> > > > To user, it could be quite confusing, as val1=20 exists on sub but
> > > > still he gets update_missing conflict and the 'DETAIL' is not
> > > > sufficient to give the clarity. I think on HEAD as well (have not
> > > > tested), we will get same behavior i.e. update will be ignored as
> > > > we make search based on RI (pk in this case). So we are not
> > > > worsening the situation, but now since we are detecting conflict, is it
> possible to give better details in 'DETAIL' section indicating what is 
> actually
> missing?
> > >
> > > I think It's doable to report the row value that cannot be found in
> > > the local relation, but the concern is the potential risk of
> > > exposing some sensitive data in the log. This may be OK, as we are
> > > already reporting the key value for constraints violation, so if
> > > others also agree, we can add the row value in the DETAIL as well.
> >
> > This is still awaiting some feedback. I feel it will be good to add
> > some pk value at-least in DETAIL section, like we add for other
> > conflict types.
> >
> 
> I agree that displaying pk where applicable should be okay as we display it at
> other places but the same won't be possible when we do a sequential scan to
> fetch the required tuple. So, the message will be different in that case, 
> right?

After some research, I think we can report the key values in DETAIL if the
apply worker uses any unique index to find the tuple to update/delete.
Otherwise, we can try to output all column values in DETAIL if the current user
of the apply worker has SELECT access to those columns.

This is consistent with what we do when reporting table constraint violations
(e.g. when violating a check constraint, it outputs all the column values
if the current user has access to all the columns):

- First, use a superuser to create a table.
CREATE TABLE t1 (c1 int, c2 int, c3 int check (c3 < 5));

- 1) Using a superuser, insert a row that violates the constraint. We should
see all the column values.

INSERT INTO t1(c3) VALUES (6);
ERROR:  new row for relation "t1" violates check constraint 
"t1_c3_check"
DETAIL:  Failing row contains (null, null, 6).

- 2) Use a user without access to all the columns. We can only see the inserted
column.
CREATE USER regress_priv_user2;
GRANT INSERT (c1, c2, c3) ON t1 TO regress_priv_user2;

SET SESSION AUTHORIZATION regress_priv_user2;
INSERT INTO t1 (c3) VALUES (6);

ERROR:  new row for relation "t1" violates check constraint 
"t1_c3_check"
DETAIL:  Failing row contains (c3) = (6).

To achieve this, I think we can expose the ExecBuildSlotValueDescription
function and use it in conflict reporting. What do you think?
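
For illustration, a rough sketch of that idea, assuming
ExecBuildSlotValueDescription() is exported unchanged from execMain.c
('localrel', 'localslot' and 'err_detail' are illustrative names only):

/*
 * Build an ACL-aware description of the local tuple.  Marking every user
 * column as "modified" makes the function apply its per-column SELECT
 * privilege check to all of them; it returns NULL if nothing is printable.
 */
Bitmapset  *allcols = NULL;
TupleDesc	tupdesc = RelationGetDescr(localrel);
char	   *desc;

for (int i = 0; i < tupdesc->natts; i++)
	allcols = bms_add_member(allcols,
							 i + 1 - FirstLowInvalidHeapAttributeNumber);

desc = ExecBuildSlotValueDescription(RelationGetRelid(localrel), localslot,
									 tupdesc, allcols, 64);
if (desc != NULL)
	appendStringInfo(&err_detail, " Existing local tuple %s.", desc);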

Best Regards,
Hou zj


RE: Conflict detection and logging in logical replication

2024-07-31 Thread Zhijie Hou (Fujitsu)
On Wednesday, July 31, 2024 1:36 PM shveta malik  wrote:
> 
> On Wed, Jul 31, 2024 at 7:40 AM Zhijie Hou (Fujitsu) 
> 
> wrote:
> >
> > >
> > > 2)
> > > apply_handle_delete_internal()
> > >
> > > --Do we need to check "(!edata->mtstate || 
> > > edata->mtstate->operation != CMD_UPDATE)" in the else part as 
> > > well? Can there be a scenario where during update flow, it is 
> > > trying to delete from a partition and comes here, but till then 
> > > that row is deleted already and we end up raising 'delete_missing' 
> > > additionally instead of 'update_missing'
> > > alone?
> >
> > I think this shouldn't happen because the row to be deleted should
> > have been locked before entering apply_handle_delete_internal().
> > Actually, calling
> > apply_handle_delete_internal() for a cross-partition update is a bit
> > buggy because the row to be deleted has already been found in
> > apply_handle_tuple_routing(), so we could have avoided scanning the
> > tuple again. I have posted another patch to fix this issue in thread[1].
> 
> Thanks for the details.
> 
> >
> > Here is the V8 patch set. It includes the following changes:
> >
> 
> Thanks for the patch. I verified that all the bugs reported so far are 
> addressed.
> Few trivial comments:

Thanks for the comments!

> 
> 1)
> 029_on_error.pl:
> --I did not understand the intent of this change. The existing insert 
> would also have resulted in conflict (insert_exists) and we would have 
> identified and skipped that. Why change to UPDATE?
> 
>  $node_publisher->safe_psql(
>   'postgres',
>   qq[
>  BEGIN;
> -INSERT INTO tbl VALUES (1, NULL);
> +UPDATE tbl SET i = 2;
>  PREPARE TRANSACTION 'gtx';
>  COMMIT PREPARED 'gtx';
>  ]);
> 

The intention of this change is to cover the code path of update_exists.
The original test only tested the code of insert_exists.

> 
> 2)
> logical-replication.sgml
> --In doc, shall we have 'delete_differ' first and then 
> 'delete_missing', similar to what we have for update (first 
> 'update_differ' and then 'update_missing')
> 
> 3)
> logical-replication.sgml: "For instance, the origin in the above log 
> indicates that the existing row was modified by a local change."
> 
> --This clarification about origin was required when we had 'origin 0'
> in 'DETAILS'. Now we have "locally":
> "Key (c)=(1) already exists in unique index "t_pkey", which was 
> modified locally in transaction 740".
> 
> And thus shall we rephrase the concerned line ?

Fixed in the V9 patch set.

Best Regards,
Hou zj


RE: Remove duplicate table scan in logical apply worker and code refactoring

2024-07-31 Thread Zhijie Hou (Fujitsu)
On Wednesday, July 31, 2024 5:07 PM Kuroda, Hayato/黒田 隼人 
 wrote:
> 
> Dear Hou,
> 
> > When reviewing the code in logical/worker.c, I noticed that when
> > applying a cross-partition update action, it scans the old partition twice.
> > I am attaching the patch 0001 to remove this duplicate table scan.
> 
> Just to clarify, you meant that FindReplTupleInLocalRel() are called in
> apply_handle_tuple_routing() and
> apply_handle_tuple_routing()->apply_handle_delete_internal(),
> which requires the index or sequential scan, right? LGTM.

Thanks for reviewing the patch, and your understanding is correct.

Here is the updated patch 0001. I removed the comments as suggested by Amit.

Since the 0002 patch only refactors the code and I need some time to review
the comments for it, I will hold it until 0001 is committed.

Best Regards,
Hou zj


v2-0001-avoid-duplicate-table-scan-for-cross-partition-up.patch
Description:  v2-0001-avoid-duplicate-table-scan-for-cross-partition-up.patch


RE: Conflict detection and logging in logical replication

2024-07-30 Thread Zhijie Hou (Fujitsu)
> On Monday, July 29, 2024 6:59 PM Dilip Kumar 
> wrote:
> >
> > On Mon, Jul 29, 2024 at 11:44 AM Zhijie Hou (Fujitsu)
> >  wrote:
> > >
> >
> > I was going through v7-0001, and I have some initial comments.
> 
> Thanks for the comments !
> 
> >
> > @@ -536,11 +542,9 @@ ExecCheckIndexConstraints(ResultRelInfo
> > *resultRelInfo, TupleTableSlot *slot,
> >   ExprContext *econtext;
> >   Datum values[INDEX_MAX_KEYS];
> >   bool isnull[INDEX_MAX_KEYS];
> > - ItemPointerData invalidItemPtr;
> >   bool checkedIndex = false;
> >
> >   ItemPointerSetInvalid(conflictTid);
> > - ItemPointerSetInvalid(&invalidItemPtr);
> >
> >   /*
> >   * Get information from the result relation info structure.
> > @@ -629,7 +633,7 @@ ExecCheckIndexConstraints(ResultRelInfo
> > *resultRelInfo, TupleTableSlot *slot,
> >
> >   satisfiesConstraint =
> >   check_exclusion_or_unique_constraint(heapRelation, indexRelation,
> > - indexInfo, &invalidItemPtr,
> > + indexInfo, &slot->tts_tid,
> >   values, isnull, estate, false,
> >   CEOUC_WAIT, true,
> >   conflictTid);
> >
> > What is the purpose of this change?  I mean
> > 'check_exclusion_or_unique_constraint' says that 'tupleid'
> > should be invalidItemPtr if the tuple is not yet inserted and
> > ExecCheckIndexConstraints is called by ExecInsert before inserting the
> tuple.
> > So what is this change?
> 
> Because the function ExecCheckIndexConstraints() is now invoked after
> inserting a tuple (in the patch). So, we need to ignore the newly inserted 
> tuple
> when checking conflict in check_exclusion_or_unique_constraint().
> 
> > Would this change ExecInsert's behavior as well?
> 
> Thanks for pointing it out, I will check and reply.

After checking, I think it may affect ExecInsert's behavior if the slot passed
to ExecCheckIndexConstraints() comes from another table (e.g. when executing
INSERT INTO ... SELECT FROM othertbl), because the slot->tts_tid points to a valid
position in another table in this case, which can cause
check_exclusion_or_unique_constraint() to skip a tuple unexpectedly.

I thought about two ideas to fix this. One is to reset slot->tts_tid before
calling ExecCheckIndexConstraints() in ExecInsert(), but I feel a bit
uncomfortable with this since it touches existing logic. The other idea is to
add a new parameter 'tupletid' to ExecCheckIndexConstraints(), pass
tupletid=InvalidOffsetNumber when invoking the function in ExecInsert(), and
pass a valid tupletid in the new code paths in the patch. The new
'tupletid' will be passed down to check_exclusion_or_unique_constraint() to
skip the target tuple. I feel the second one may be better.
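
For illustration, a hypothetical sketch of the second idea (the parameter name
and position are assumptions, not a final signature):

extern bool ExecCheckIndexConstraints(ResultRelInfo *resultRelInfo,
									  TupleTableSlot *slot,
									  EState *estate,
									  ItemPointer conflictTid,
									  ItemPointer tupleid,	/* tuple to skip */
									  List *arbiterIndexes);

/* In ExecInsert(): nothing has been inserted yet, so pass an invalid TID. */
ItemPointerData invalidItemPtr;

ItemPointerSetInvalid(&invalidItemPtr);
ExecCheckIndexConstraints(resultRelInfo, slot, estate, &conflictTid,
						  &invalidItemPtr, arbiterIndexes);

/* In the new post-insert re-check: skip the tuple we just inserted. */
ExecCheckIndexConstraints(resultRelInfo, slot, estate, &conflictTid,
						  &slot->tts_tid, arbiterIndexes);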

What do you think ?

Best Regards,
Hou zj



RE: Conflict detection and logging in logical replication

2024-07-29 Thread Zhijie Hou (Fujitsu)
On Monday, July 29, 2024 6:59 PM Dilip Kumar  wrote:
> 
> On Mon, Jul 29, 2024 at 11:44 AM Zhijie Hou (Fujitsu)
>  wrote:
> >
> 
> I was going through v7-0001, and I have some initial comments.

Thanks for the comments !

> 
> @@ -536,11 +542,9 @@ ExecCheckIndexConstraints(ResultRelInfo
> *resultRelInfo, TupleTableSlot *slot,
>   ExprContext *econtext;
>   Datum values[INDEX_MAX_KEYS];
>   bool isnull[INDEX_MAX_KEYS];
> - ItemPointerData invalidItemPtr;
>   bool checkedIndex = false;
> 
>   ItemPointerSetInvalid(conflictTid);
> - ItemPointerSetInvalid(&invalidItemPtr);
> 
>   /*
>   * Get information from the result relation info structure.
> @@ -629,7 +633,7 @@ ExecCheckIndexConstraints(ResultRelInfo
> *resultRelInfo, TupleTableSlot *slot,
> 
>   satisfiesConstraint =
>   check_exclusion_or_unique_constraint(heapRelation, indexRelation,
> - indexInfo, &invalidItemPtr,
> + indexInfo, &slot->tts_tid,
>   values, isnull, estate, false,
>   CEOUC_WAIT, true,
>   conflictTid);
> 
> What is the purpose of this change?  I mean
> 'check_exclusion_or_unique_constraint' says that 'tupleid'
> should be invalidItemPtr if the tuple is not yet inserted and
> ExecCheckIndexConstraints is called by ExecInsert before inserting the tuple.
> So what is this change?

Because the function ExecCheckIndexConstraints() is now invoked after inserting
a tuple (in the patch), we need to ignore the newly inserted tuple when
checking for a conflict in check_exclusion_or_unique_constraint().

> Would this change ExecInsert's behavior as well?

Thanks for pointing it out, I will check and reply.

> 
> 
> 
> 
> +ReCheckConflictIndexes(ResultRelInfo *resultRelInfo, EState *estate,
> +ConflictType type, List *recheckIndexes,
> +TupleTableSlot *slot)
> +{
> + /* Re-check all the unique indexes for potential conflicts */
> +foreach_oid(uniqueidx, resultRelInfo->ri_onConflictArbiterIndexes)
> + {
> + TupleTableSlot *conflictslot;
> +
> + if (list_member_oid(recheckIndexes, uniqueidx) &&
> + FindConflictTuple(resultRelInfo, estate, uniqueidx, slot,
> + &conflictslot))
> + {
> + RepOriginId origin;
> + TimestampTz committs;
> + TransactionId xmin;
> +
> + GetTupleCommitTs(conflictslot, &xmin, &origin, &committs);
> + ReportApplyConflict(ERROR, type, resultRelInfo->ri_RelationDesc,
> + uniqueidx, xmin, origin, committs, conflictslot);
> + }
> + }
> +}
>  and
> 
> + conflictindexes = resultRelInfo->ri_onConflictArbiterIndexes;
> +
>   if (resultRelInfo->ri_NumIndices > 0)
>   recheckIndexes = ExecInsertIndexTuples(resultRelInfo,
> -slot, estate, false, false,
> -NULL, NIL, false);
> +slot, estate, false,
> +conflictindexes ? true : false,
> +&conflict,
> +conflictindexes, false);
> +
> + /*
> + * Rechecks the conflict indexes to fetch the conflicting local tuple
> + * and reports the conflict. We perform this check here, instead of
> + * perform an additional index scan before the actual insertion and
> + * reporting the conflict if any conflicting tuples are found. This is
> + * to avoid the overhead of executing the extra scan for each INSERT
> + * operation, even when no conflict arises, which could introduce
> + * significant overhead to replication, particularly in cases where
> + * conflicts are rare.
> + */
> + if (conflict)
> + ReCheckConflictIndexes(resultRelInfo, estate, CT_INSERT_EXISTS,
> +recheckIndexes, slot);
> 
> 
>  This logic is confusing, first, you are calling
> ExecInsertIndexTuples() with no duplicate error for the indexes
> present in 'ri_onConflictArbiterIndexes' which means
>  the indexes returned by the function must be a subset of
> 'ri_onConflictArbiterIndexes' and later in ReCheckConflictIndexes()
> you are again processing all the
>  indexes of 'ri_onConflictArbiterIndexes' and checking if any of these
> is a subset of the indexes that is returned by
> ExecInsertIndexTuples().

I think that's not always true. The indexes returned by the function *may not*
be a subset of 'ri_onConflictArbiterIndexes'. Based on the comments atop
ExecInsertIndexTuples(), it returns a list of index OIDs for any unique or
exclusion constraints that are deferred, and in addition to that, it will
include the indexes in 'arbiterIndexes' if noDupErr == true.

> 
>  Why are we doing that, I think we can directly use the
> 'recheckIndexes' which is returned by ExecInsertIndexTuples(), and
> those indexes are guaranteed to be a subset of
>  ri_onConflictArbiterIndexes. No?

Based on the above, we need to filter out the deferred indexes or exclusion
constraints by intersecting with 'ri_onConflictArbiterIndexes'.
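
To illustrate the filtering being described (ReCheckOneIndex() is a
hypothetical helper standing in for the re-scan-and-report logic):

/*
 * Only OIDs that are both returned by ExecInsertIndexTuples() and present
 * in the arbiter list are insert_exists candidates; anything else in
 * recheckIndexes is a deferred unique or exclusion constraint.
 */
foreach_oid(uniqueidx, resultRelInfo->ri_onConflictArbiterIndexes)
{
	if (!list_member_oid(recheckIndexes, uniqueidx))
		continue;

	ReCheckOneIndex(resultRelInfo, estate, uniqueidx, slot);
}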

Best Regards,
Hou zj



Remove duplicate table scan in logical apply worker and code refactoring

2024-07-25 Thread Zhijie Hou (Fujitsu)
Hi,

When reviewing the code in logical/worker.c, I noticed that when applying a
cross-partition update action, it scans the old partition twice.
I am attaching the patch 0001 to remove this duplicate table scan.

The test shows that it brings noticeable improvement:

Steps
-
Pub:
create table tab (a int not null, b int);
alter table tab replica identity full;
insert into tab select 1,generate_series(1, 100, 1);

Sub:
create table tab (a int not null, b int) partition by range (b);
create table tab_1 partition of tab for values from (minvalue) to (500);
create table tab_2 partition of tab for values from (500) to (maxvalue);
alter table tab replica identity full;


Test query:
update tab set b = 600 where b > 00; -- UPDATE 100

Results (the time spent by the apply worker to apply all the UPDATEs):
Before  14s
After   7s
-

Apart from the above, I found quite a few duplicated code paths related to partition
handling (e.g. apply_handle_tuple_routing), so I tried to extract some
common logic to simplify the code. Please see 0002 for this refactoring.

Best Regards,
Hou Zhijie



v1-0002-refactor-the-partition-related-logic-in-worker.c.patch
Description:  v1-0002-refactor-the-partition-related-logic-in-worker.c.patch


perftest.conf
Description: perftest.conf


v1-0001-avoid-duplicate-table-scan-for-cross-partition-up.patch
Description:  v1-0001-avoid-duplicate-table-scan-for-cross-partition-up.patch


RE: Slow catchup of 2PC (twophase) transactions on replica in LR

2024-07-17 Thread Zhijie Hou (Fujitsu)
On Thursday, July 18, 2024 10:11 AM Kuroda, Hayato/黒田 隼人 
 wrote:
> 
> Dear Peter,
> 
> Thanks for giving comments! PSA new version.

I did a few more tests and some analysis and didn't find issues. Just sharing the
cases I tested:

1. After manually rolling back xacts for two_pc and switching the two_pc option
   from true to false, does the prepared transaction get replicated again when
   COMMIT PREPARED happens?

It works as expected in this case, e.g. the transaction will be sent to the
subscriber after disabling two_pc.

And I think there wouldn't be race conditions between the ALTER command
and the apply worker, because the user needs to disable the subscription (the
apply worker will stop) before altering the two_phase option.

And the WAL for the prepared transaction is retained until the COMMIT
PREPARED, because we don't advance the slot's restart_lsn over ongoing
transactions (e.g. the prepared transaction in this case):
 
SnapBuildProcessRunningXacts
...
txn = ReorderBufferGetOldestTXN(builder->reorder);
...
/*
 * oldest ongoing txn might have started when we didn't yet serialize
 * anything because we hadn't reached a consistent state yet.
 */
if (txn != NULL && txn->restart_decoding_lsn != InvalidXLogRecPtr)
LogicalIncreaseRestartDecodingForSlot(lsn, txn->restart_decoding_lsn);

So, the data of the prepared transaction is safe.

2. Test when the prepare is already processed but we alter the option from
   false to true.

This case works as expected as well, e.g. the whole transaction will be sent to
the subscriber on COMMIT PREPARED using the two_pc flow:

"begin prepare" -> "txn data" -> "prepare" -> "commit prepare"

For the same reason as in case 1, there is no concurrency issue and the
data of the transaction will be retained until COMMIT PREPARED.

Best Regards,
Hou zj





RE: Slow catchup of 2PC (twophase) transactions on replica in LR

2024-07-16 Thread Zhijie Hou (Fujitsu)
On Tuesday, July 16, 2024 1:17 PM Kuroda, Hayato/黒田 隼人 
 wrote
> 
> Dear Amit, Hou,
> 
> Thanks for giving comments! PSA new versions.
> What's new:
> 
> 0001: included Hou's patch [1] not to overwrite slot options.
>   Some other comments were also addressed.

Thanks for the patch!

One more issue I found is that:

+IsTwoPhaseTransactionGidForSubid(Oid subid, char *gid)
+{
+   int ret;
+   Oid subid_written;
+   TransactionId xid;
+
+   ret = sscanf(gid, "pg_gid_%u_%u", &subid_written, &xid);
+
+   return (ret == 2 && subid == subid_written);

I think it's not correct to use sscanf here, because it will return the same
value even if the gid is "pg_gid_123_123_123_123...", which isn't a
gid created by the apply worker. I think we should use TwoPhaseTransactionGid()
to build the gid string and compare it with each existing gid (strcmp).
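
For illustration, a sketch of the suggested check (TwoPhaseTransactionGid() is
the existing gid builder; the exact integration is an assumption):

static bool
IsTwoPhaseTransactionGidForSubid(Oid subid, char *gid)
{
	char		expected[GIDSIZE];
	Oid			subid_written;
	TransactionId xid;

	/* Parse as before, but don't trust the parse alone. */
	if (sscanf(gid, "pg_gid_%u_%u", &subid_written, &xid) != 2 ||
		subid != subid_written)
		return false;

	/* Rebuild the canonical gid and require an exact match. */
	TwoPhaseTransactionGid(subid, xid, expected, sizeof(expected));
	return strcmp(gid, expected) == 0;
}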

Best Regards,
Hou zj






RE: Slow catchup of 2PC (twophase) transactions on replica in LR

2024-07-13 Thread Zhijie Hou (Fujitsu)
On Tuesday, July 9, 2024 8:53 PM Hayato Kuroda (Fujitsu) 
 wrote:
> 
> > 0001 - Codes for SUBOPT_TWOPHASE_COMMIT are moved per requirement
> [1].
> >Also, checks for failover and two_phase are unified into one 
> > function.
> > 0002 - updated accordingly. An argument for the check function is added.
> > 0003 - this contains documentation changes required in [2].
> 
> Previous patch set could not be accepted due to the initialization miss.
> PSA new version.

Thanks for the patches! I initially reviewed 0001 and found that
the implementation of ALTER_REPLICATION_SLOT has an issue, e.g.
it doesn't handle the case when there is only one specified option
in the replication command:

ALTER_REPLICATION_SLOT slot (two_phase)

In this case, it always overwrites the unspecified option (failover) to false
even when failover was set to true. I made a small fix which is on
top of 0001 (please see the attachment).

I also added documentation for the new two_phase option of the replication command
and a missing period in an errhint in the top-up patch.

Best Regards,
Hou zj
From 3bbaaba53a0cb3db43cc893acbd3ffbedd61bff1 Mon Sep 17 00:00:00 2001
From: Hou Zhijie 
Date: Sat, 13 Jul 2024 18:31:28 +0800
Subject: [PATCH] fix alter replication slot

---
 doc/src/sgml/protocol.sgml  | 16 ++
 src/backend/commands/subscriptioncmds.c |  2 +-
 src/backend/replication/slot.c  | 16 --
 src/backend/replication/walsender.c | 29 +++--
 src/include/replication/slot.h  |  4 ++--
 5 files changed, 46 insertions(+), 21 deletions(-)

diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 1b27d0a547..3ac4a4be28 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -2206,6 +2206,22 @@ psql "dbname=postgres replication=database" -c "IDENTIFY_SYSTEM;"

   
 
+  
+   
+TWO_PHASE [ boolean ]
+
+ 
+  If true, this logical replication slot supports decoding of two-phase
+  commit. With this option, commands related to two-phase commit such as
+  PREPARE TRANSACTION, COMMIT PREPARED
+  and ROLLBACK PREPARED are decoded and transmitted.
+  The transaction will be decoded and transmitted at
+  PREPARE TRANSACTION time.
+ 
+
+   
+  
+
 
 

diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 7604e228e8..c48b6d0549 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -1308,7 +1308,7 @@ AlterSubscription(ParseState *pstate, AlterSubscriptionStmt *stmt,
 			ereport(ERROR,
 					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 					 errmsg("cannot disable two_phase when uncommitted prepared transactions present"),
-					 errhint("Resolve these transactions and try again")));
+					 errhint("Resolve these transactions and try again.")));
 
 			/* Change system catalog acoordingly */
 			values[Anum_pg_subscription_subtwophasestate - 1] =
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 2ad6dca993..2f167a2adc 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -804,11 +804,12 @@ ReplicationSlotDrop(const char *name, bool nowait)
  * Change the definition of the slot identified by the specified name.
  */
 void
-ReplicationSlotAlter(const char *name, bool failover, bool two_phase)
+ReplicationSlotAlter(const char *name, bool *failover, bool *two_phase)
 {
 	bool		update_slot = false;
 
 	Assert(MyReplicationSlot == NULL);
+	Assert(failover || two_phase);
 
 	ReplicationSlotAcquire(name, false);
 
@@ -834,7 +835,7 @@ ReplicationSlotAlter(const char *name, bool failover, bool two_phase)
 	 * Do not allow users to enable failover on the standby as we do not
 	 * support sync to the cascading standby.
 	 */
-	if (failover)
+	if (failover && *failover)
 		ereport(ERROR,
 				errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				errmsg("cannot enable failover for a replication slot"
@@ -845,24 +846,25 @@ ReplicationSlotAlter(const char *name, bool failover, bool two_phase)
 	 * Do not allow users to enable failover for temporary slots as we do not
 	 * support syncing temporary slots to the standby.
 	 */
-	if (failover && MyReplicationSlot->data.persistency == RS_TEMPORARY)
+	i

RE: speed up a logical replica setup

2024-07-11 Thread Zhijie Hou (Fujitsu)
On Thursday, July 11, 2024 6:21 PM Amit Kapila  wrote:
> 
> On Tue, Jul 9, 2024 at 4:30 PM Alexander Lakhin 
> wrote:
> >
> > Please look at another failure of the test [1]:
> > [13:28:05.647](2.460s) not ok 26 - failover slot is synced
> > [13:28:05.648](0.001s) #   Failed test 'failover slot is synced'
> > #   at
> /home/bf/bf-build/skink-master/HEAD/pgsql/src/bin/pg_basebackup/t/04
> 0_pg_createsubscriber.pl line 307.
> > [13:28:05.648](0.000s) #  got: ''
> > # expected: 'failover_slot'
> >
> > with 040_pg_createsubscriber_node_s.log containing:
> > 2024-07-08 13:28:05.369 UTC [3985464][client backend][0/2:0] LOG:
> > statement: SELECT pg_sync_replication_slots()
> > 2024-07-08 13:28:05.557 UTC [3985464][client backend][0/2:0] LOG:
> > could not sync slot "failover_slot" as remote slot precedes local slot
> > 2024-07-08 13:28:05.557 UTC [3985464][client backend][0/2:0] DETAIL:
> > Remote slot has LSN 0/30047B8 and catalog xmin 743, but local slot has LSN
> 0/30047B8 and catalog xmin 744.
> >
> > I could not reproduce it locally, but I've discovered that that
> > subtest somehow depends on pg_createsubscriber executed for the
> > 'primary contains unmet conditions on node P' check. For example with
> > this test modification:
> > @@ -249,7 +249,7 @@ command_fails(
> >   $node_p->connstr($db1), '--socket-directory',
> >   $node_s->host, '--subscriber-port',
> >   $node_s->port, '--database',
> > -$db1, '--database',
> > +'XXX', '--database',
> >   $db2
> >   ],
> >   'primary contains unmet conditions on node P');
> >
> > I see the same failure:
> > 2024-07-09 10:19:43.284 UTC [938890] 040_pg_createsubscriber.pl LOG:
> > statement: SELECT pg_sync_replication_slots()
> > 2024-07-09 10:19:43.292 UTC [938890] 040_pg_createsubscriber.pl LOG:
> > could not sync slot "failover_slot" as remote slot precedes local slot
> > 2024-07-09 10:19:43.292 UTC [938890] 040_pg_createsubscriber.pl
> > DETAIL:  Remote slot has LSN 0/3004780 and catalog xmin 743, but local
> slot has LSN 0/3004780 and catalog xmin 744.
> >
> > Thus maybe even a normal pg_createsubscriber run can affect the
> > primary server (it's catalog xmin) differently?
> >
> 
> Yes, pg_createsubscriber can affect the primary server's catalog xmin because
> it starts the standby server that can send HSFeedback (See
> XLogWalRcvSendHSFeedback()), which can advance the physical slot's xmin
> corresponding the following Insert in the test:
> 
> # Insert another row on node P and wait node S to catch up
> $node_p->safe_psql($db1, "INSERT INTO tbl1 VALUES('second row')");
> $node_p->wait_for_replay_catchup($node_s);
> 
> In the success case, pg_createsubscriber is able to send HSFeedback and in
> the failure case, it won't. We can see the following logs in
> 040_pg_createsubscriber_node_p.log:
> 
> 2024-07-08 13:28:00.872 UTC [3982331][walsender][:0] FATAL:  the database
> system is starting up
> 2024-07-08 13:28:00.875 UTC [3982328][startup][:0] LOG:  database system
> was shut down at 2024-07-08 13:28:00 UTC
> 2024-07-08 13:28:01.105 UTC [3981996][postmaster][:0] LOG:  database
> system is ready to accept connections
> 
> This shows that when the test  'primary contains unmet conditions on node P'
> starts the standby server the corresponding primary node was not ready
> because we just restarted node_p before that test and didn't ensure that the
> node_p is up and ready to accept connections before starting the
> pg_createsubscriber test.
> 
> Even in the successful cases where the standby is able to connect to primary
> for test 'primary contains unmet conditions on node P', there is no guarantee
> that xmin of the physical slot will be updated at least, we don't have 
> anything in
> the test to ensure the same.
> 
> Now as before creating logical replication, we didn't ensure that the physical
> slot's xmin has been caught up to the latest value, the test can lead to 
> failure
> like: "Remote slot has LSN 0/3004780 and catalog xmin 743, but local slot has
> LSN 0/3004780 and catalog xmin 744".
> 
> The xmin on standby could have been advanced due to the following Insert:
> # Insert another row on node P and wait node S to catch up
> $node_p->safe_psql($db1, "INSERT INTO tbl1 VALUES('second row')");
> $node_p->wait_for_replay_catchup($node_s);
> 
> We don't wait for the xmin to catch up corresponding to this insert and I 
> don't
> know if there is a way to do that. So, we should move this Insert to after 
> the call
> to pg_sync_replication_slots(). It won't impact the general test of
> pg_createsubscriber.

The analysis and suggestion look reasonable to me.
Here is a small patch which does the same.


> Thanks to Hou-San for helping me in the analysis of this BF failure.

Best Regards,
Hou zj


0001-fix-unstable-test-in-040_pg_createsubscriber.patch
Description: 0001-fix-unstable-test-in-040_pg_createsubscriber.patch


RE: Conflict detection and logging in logical replication

2024-07-10 Thread Zhijie Hou (Fujitsu)
On Wednesday, July 10, 2024 5:39 PM shveta malik  wrote:
> 
> On Wed, Jul 3, 2024 at 8:31 AM Zhijie Hou (Fujitsu) 
> wrote:
> >
> > On Wednesday, June 26, 2024 10:58 AM Zhijie Hou (Fujitsu)
>  wrote:
> > >
> >
> > Hi,
> >
> > As suggested by Sawada-san in another thread[1].
> >
> > I am attaching the V4 patch set which tracks the delete_differ
> > conflict in logical replication.
> >
> > delete_differ means that the replicated DELETE is deleting a row that
> > was modified by a different origin.
> >
> 
> Thanks for the patch. I am still in process of review but please find few
> comments:

Thanks for the comments!

> 1) When I try to *insert* primary/unique key on pub, which already exists on
> sub, conflict gets detected. But when I try to *update* primary/unique key to 
> a
> value on pub which already exists on sub, conflict is not detected. I get the
> error:
> 
> 2024-07-10 14:21:09.976 IST [647678] ERROR:  duplicate key value violates
> unique constraint "t1_pkey"
> 2024-07-10 14:21:09.976 IST [647678] DETAIL:  Key (pk)=(4) already exists.

Yes; the detection of this conflict was not added, with the
intention of controlling the size of the patch in the first version.

> 
> This is because such conflict detection needs detection of constraint 
> violation
> using the *new value* rather than *existing* value during UPDATE. INSERT
> conflict detection takes care of this case i.e. the columns of incoming row 
> are
> considered as new values and it tries to see if all unique indexes are okay to
> digest such new values (all incoming columns) but update's logic is different.
> It searches based on oldTuple *only* and thus above detection is missing.

I think the logic is the same: if we want to detect the unique violation
for UPDATE, we need to check whether the new value of the UPDATE violates any
unique constraints, the same as the detection of insert_exists (e.g. check
for the conflict around ExecInsertIndexTuples()).

> 
> Shall we support such detection? If not, is it worth docuementing?

I am personally OK with supporting this detection. And
I think it's already documented that we only detect unique violations for
insert, which means the update conflict is not detected.

> 2)
> Another case which might confuse user:
> 
> CREATE TABLE t1 (pk integer primary key, val1 integer, val2 integer);
> 
> On PUB: insert into t1 values(1,10,10); insert into t1 values(2,20,20);
> 
> On SUB: update t1 set pk=3 where pk=2;
> 
> Data on PUB: {1,10,10}, {2,20,20}
> Data on SUB: {1,10,10}, {3,20,20}
> 
> Now on PUB: update t1 set val1=200 where val1=20;
> 
> On Sub, I get this:
> 2024-07-10 14:44:00.160 IST [648287] LOG:  conflict update_missing detected
> on relation "public.t1"
> 2024-07-10 14:44:00.160 IST [648287] DETAIL:  Did not find the row to be
> updated.
> 2024-07-10 14:44:00.160 IST [648287] CONTEXT:  processing remote data for
> replication origin "pg_16389" during message type "UPDATE" for replication
> target relation "public.t1" in transaction 760, finished at 0/156D658
> 
> To user, it could be quite confusing, as val1=20 exists on sub but still he 
> gets
> update_missing conflict and the 'DETAIL' is not sufficient to give the 
> clarity. I
> think on HEAD as well (have not tested), we will get same behavior i.e. update
> will be ignored as we make search based on RI (pk in this case). So we are not
> worsening the situation, but now since we are detecting conflict, is it 
> possible
> to give better details in 'DETAIL' section indicating what is actually 
> missing?

I think it's doable to report the row value that cannot be found in the local
relation, but the concern is the potential risk of exposing some
sensitive data in the log. This may be OK, as we are already reporting the
key value for constraint violations, so if others also agree, we can add
the row value in the DETAIL as well.

Best Regards,
Hou zj



RE: Conflict Detection and Resolution

2024-07-08 Thread Zhijie Hou (Fujitsu)
On Monday, July 8, 2024 12:32 PM Zhijie Hou (Fujitsu)  
wrote:
> 
> I researched how to detect and resolve update_deleted and thought
> of one idea: maintain the xmin in the logical slot to preserve
> the dead row and support a latest_timestamp_wins resolution for
> update_deleted to maintain data consistency.
> 
> Here are details of the xmin idea and resolution of update_deleted:
> 
> 1. how to preserve the dead row so that we can detect the update_deleted
>    conflict correctly. (In the following explanation, let's assume there is
>    a multi-master setup with nodes A and B.)
> 
> To preserve the dead row on node A, I think we could maintain the "xmin"
> in the logical replication slot on Node A to prevent VACUUM from
> removing the dead row in the user table. The walsender that acquires the slot
> is responsible for advancing the xmin. (Note that I am trying to explore the
> xmin idea as it could be more efficient than using commit_timestamp, and the
> logic could be simpler as we are already maintaining catalog_xmin in the
> logical slot and xmin in the physical slot.)
> 
> - Strategy for advancing xmin:
> 
> The xmin can be advanced if a) a transaction (xid:1000) has been flushed
> to the remote node (Node B in this case). *AND* b) On Node B, the local
> transactions that happened before applying the remote
> transaction(xid:1000) were also sent and flushed to the Node A.
> 
> - The implementation:
> 
> condition a) can be achieved with existing codes, the walsender can
> advance the xmin similar to the catalog_xmin.
> 
> For condition b), we can add a subscription option (say 'feedback_slot').
> The feedback_slot indicates the replication slot that will send changes to
> the origin (On Node B, the slot should be subBA). The apply worker will
> check the status(confirmed flush lsn) of the 'feedback slot' and send
> feedback to the walsender about the WAL position that has been sent and
> flushed via the feedback_slot.

The above are some initial thoughts on how to preserve the dead row for
update_deleted conflict detection.

After thinking more, I have identified a few additional cases that I
missed analyzing in the design. One aspect that needs more
thought is the possibility of multiple slots on each node. In this
scenario, the 'feedback_slot' subscription option would need to be
structured as a list. However, requiring users to specify all the slots
may not be user-friendly. I will explore whether this process can be
automated.

In addition, I will think more about the potential impact of re-using the
existing 'xmin' of the slot, which may affect existing logic that relies on
'xmin'.

I will analyze more and reply about these points.

Best Regards,
Hou zj


RE: Conflict Detection and Resolution

2024-07-07 Thread Zhijie Hou (Fujitsu)
Hi,

I researched how to detect and resolve update_deleted and thought
of one idea: maintain the xmin in the logical slot to preserve
the dead row and support a latest_timestamp_wins resolution for
update_deleted to maintain data consistency.

Here are details of the xmin idea and resolution of update_deleted:

1. how to preserve the dead row so that we can detect the update_deleted
   conflict correctly. (In the following explanation, let's assume there is
   a multi-master setup with nodes A and B.)

To preserve the dead row on node A, I think we could maintain the "xmin"
in the logical replication slot on Node A to prevent VACUUM from
removing the dead row in the user table. The walsender that acquires the slot
is responsible for advancing the xmin. (Note that I am trying to explore the
xmin idea as it could be more efficient than using commit_timestamp, and the
logic could be simpler as we are already maintaining catalog_xmin in the
logical slot and xmin in the physical slot.)

- Strategy for advancing xmin:

The xmin can be advanced if a) a transaction (xid:1000) has been flushed
to the remote node (Node B in this case). *AND* b) On Node B, the local
transactions that happened before applying the remote
transaction(xid:1000) were also sent and flushed to the Node A.

- The implementation:

Condition a) can be achieved with existing code: the walsender can
advance the xmin similarly to the catalog_xmin.

For condition b), we can add a subscription option (say 'feedback_slot').
The feedback_slot indicates the replication slot that will send changes to
the origin (On Node B, the slot should be subBA). The apply worker will
check the status(confirmed flush lsn) of the 'feedback slot' and send
feedback to the walsender about the WAL position that has been sent and
flushed via the feedback_slot.

For example, on Node B, we specify the replication slot (subBA) that is
sending changes to Node A. The apply worker on Node B will send
feedback (the WAL position that has been sent to Node A) to Node A
regularly. Then Node A can use the position to advance the xmin.
(Similar to hot_standby_feedback.)
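
For illustration only, a rough sketch of that advancement rule in walsender
terms (candidate_xmin, candidate_lsn and feedback_flush_lsn are hypothetical
bookkeeping, not existing fields):

static void
MaybeAdvanceSlotXmin(XLogRecPtr feedback_flush_lsn,
					 XLogRecPtr candidate_lsn,
					 TransactionId candidate_xmin)
{
	if (feedback_flush_lsn < candidate_lsn)
		return;				/* remote node hasn't confirmed far enough yet */

	SpinLockAcquire(&MyReplicationSlot->mutex);
	MyReplicationSlot->data.xmin = candidate_xmin;
	SpinLockRelease(&MyReplicationSlot->mutex);

	/* Recompute the global horizon so VACUUM sees the new value. */
	ReplicationSlotsComputeRequiredXmin(false);
}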

2. The resolution for update_deleted

The current design doesn't support 'latest_timestamp_wins'. But this could be
a problem if update_deleted is detected due to some very old dead row.
Assume the update has the latest timestamp; if we skip the update due
to these very old dead rows, the data would be inconsistent because the
latest update data is missing.

The ideal resolution should compare the timestamp of the UPDATE and the
timestamp of the transaction that produced these dead rows. If the UPDATE
is newer, then convert the UPDATE to an INSERT; otherwise, skip the UPDATE.

Best Regards,
Hou zj


RE: New standby_slot_names GUC in PG 17

2024-07-01 Thread Zhijie Hou (Fujitsu)
On Monday, July 1, 2024 6:45 PM Amit Kapila  wrote:
> 
> On Thu, Jun 27, 2024 at 7:14 AM Masahiko Sawada
>  wrote:
> >
> > On Wed, Jun 26, 2024 at 6:15 PM Zhijie Hou (Fujitsu)
> >  wrote:
> >
> > Thank you for updating the patch. The v2 patch looks good to me.
> >
> 
> Pushed.

Thanks! I am attaching another patch to modify the release note as discussed.

Best Regards,
Hou zj


0001-add-recent-renaming-commit-to-the-release-note.patch
Description: 0001-add-recent-renaming-commit-to-the-release-note.patch


RE: New standby_slot_names GUC in PG 17

2024-06-26 Thread Zhijie Hou (Fujitsu)
On Wednesday, June 26, 2024 12:49 PM Bertrand Drouvot 
 wrote:
> 
> Hi,
> 
> On Wed, Jun 26, 2024 at 04:17:45AM +0000, Zhijie Hou (Fujitsu) wrote:
> > On Wednesday, June 26, 2024 9:40 AM Masahiko Sawada
>  wrote:
> > >
> > > On Tue, Jun 25, 2024 at 5:32 PM Amit Kapila
> > > 
> > > wrote:
> > > >
> > > > I feel synchronized better indicates the purpose because we ensure
> > > > such slots are synchronized before we process changes for logical
> > > > failover slots. We already have a 'failover' option for logical
> > > > slots which could make things confusing if we add 'failover' where
> > > > physical slots need to be specified.
> > >
> > > Agreed. So +1 for synchronized_standby_slots.
> >
> > +1.
> >
> > Since there is a consensus on this name, I am attaching the patch to
> > rename the GUC to synchronized_standby_slots. I have confirmed that
> > the regression tests and pgindent passed for the patch.
> A few comments:

Thanks for the comments!

> 1 
> 
> In the commit message:
> 
> "
> The standby_slot_names GUC is intended to allow specification of physical
> standby slots that must be synchronized before they are visible to
> subscribers
> "
> 
> Not sure that wording is correct, if we feel the need to explain the GUC, 
> maybe
> repeat some wording from bf279ddd1c?

I intentionally copied some words from the release note of this GUC, which was
also part of the content in the initial email of this thread. I think it
is easier to understand than the original commit message. But others may
have different opinions, so I will leave the decision to the committer. (I
adjusted the wording a bit in this version.)

> 
> 2 
> 
> Should we rename StandbySlotNamesConfigData too?
> 
> 3 
> 
> Should we rename SlotExistsInStandbySlotNames too?
> 
> 4 
> 
> Should we rename validate_standby_slots() too?
> 

Renamed these to the names suggested by Amit.

Attached is the v2 patch set, which addressed the above and removed
the changes in release-17.sgml according to the comment from Amit.

Best Regards,
Hou zj


v2-0001-Rename-standby_slot_names-to-synchronized_standby.patch
Description:  v2-0001-Rename-standby_slot_names-to-synchronized_standby.patch


RE: New standby_slot_names GUC in PG 17

2024-06-25 Thread Zhijie Hou (Fujitsu)
On Wednesday, June 26, 2024 9:40 AM Masahiko Sawada  
wrote:
> 
> On Tue, Jun 25, 2024 at 5:32 PM Amit Kapila 
> wrote:
> >
> > On Tue, Jun 25, 2024 at 12:30 PM Masahiko Sawada
>  wrote:
> > >
> > > On Tue, Jun 25, 2024 at 1:54 PM Amit Kapila 
> wrote:
> > > >
> > >
> > > > So, my
> > > > preference is in order as follows: synchronized_standby_slots,
> > > > wait_for_standby_slots, logical_replication_wait_slots,
> > > > logical_replication_synchronous_slots, and
> > > > logical_replication_synchronous_standby_slots.
> > >
> > > I also prefer synchronized_standby_slots.
> > >
> > > From a different angle just for discussion, is it worth considering
> > > the term 'failover' since the purpose of this feature is to ensure a
> > > standby to be ready for failover in terms of logical replication?
> > > For example, failover_standby_slot_names?
> > >
> >
> > I feel synchronized better indicates the purpose because we ensure
> > such slots are synchronized before we process changes for logical
> > failover slots. We already have a 'failover' option for logical slots
> > which could make things confusing if we add 'failover' where physical
> > slots need to be specified.
> 
> Agreed. So +1 for synchronized_standby_slots.

+1.

Since there is a consensus on this name, I am attaching the patch to rename
the GUC to synchronized_standby_slots. I have confirmed that the regression
tests and pgindent pass for the patch.

Best Regards,
Hou zj



0001-Rename-standby_slot_names-to-synchronized_standby_sl.patch
Description:  0001-Rename-standby_slot_names-to-synchronized_standby_sl.patch


RE: New standby_slot_names GUC in PG 17

2024-06-24 Thread Zhijie Hou (Fujitsu)
On Saturday, June 22, 2024 5:47 PM Amit Kapila  wrote:
> 
> On Sat, Jun 22, 2024 at 1:49 AM Nathan Bossart
>  wrote:
> >
> > On Fri, Jun 21, 2024 at 03:50:00PM -0400, Tom Lane wrote:
> > > Allow specification of physical standbys that must be
> > > synchronized before they are visible to subscribers (Hou Zhijie,
> > > Shveta Malik)
> > >
> > > it seems like the name ought to have some connection to
> > > synchronization.  Perhaps something like "synchronized_standby_slots"?
> >
> > IMHO that might be a bit too close to synchronous_standby_names.
> >
> 
> Right, but better than the current one. The other possibility could be
> wait_for_standby_slots.

I agree the current name seems too generic, and the suggested
'synchronized_standby_slots' is better than the current one.

Some other ideas could be:

synchronize_slots_on_standbys: it indicates that the standbys that have slot
sync enabled should be listed in this GUC.

logical_replication_wait_slots: it means logical replication (the logical
walsender process) will wait for these slots to advance the confirmed flush
LSN before proceeding.

Best Regards,
Hou zj


RE: Conflict Detection and Resolution

2024-06-17 Thread Zhijie Hou (Fujitsu)
On Thursday, June 13, 2024 2:11 PM Masahiko Sawada  
wrote:

Hi,

> On Wed, Jun 5, 2024 at 3:32 PM Zhijie Hou (Fujitsu) 
> wrote:
> >
> > This time at PGconf.dev[1], we had some discussions regarding this
> > project. The proposed approach is to split the work into two main
> > components. The first part focuses on conflict detection, which aims
> > to identify and report conflicts in logical replication. This feature
> > will enable users to monitor the unexpected conflicts that may occur.
> > The second part involves the actual conflict resolution. Here, we will
> > provide built-in resolutions for each conflict and allow user to
> > choose which resolution will be used for which conflict(as described
> > in the initial email of this thread).
> 
> I agree with this direction that we focus on conflict detection (and
> logging) first and then develop conflict resolution on top of that.

Thanks for your reply !

> 
> >
> > Of course, we are open to alternative ideas and suggestions, and the
> > strategy above can be changed based on ongoing discussions and
> > feedback received.
> >
> > Here is the patch of the first part work, which adds a new parameter
> > detect_conflict for CREATE and ALTER subscription commands. This new
> > parameter will decide if subscription will go for conflict detection.
> > By default, conflict detection will be off for a subscription.
> >
> > When conflict detection is enabled, additional logging is triggered in
> > the following conflict scenarios:
> >
> > * updating a row that was previously modified by another origin.
> > * The tuple to be updated is not found.
> > * The tuple to be deleted is not found.
> >
> > While there exist other conflict types in logical replication, such as
> > an incoming insert conflicting with an existing row due to a primary
> > key or unique index, these cases already result in constraint violation 
> > errors.
> 
> What does detect_conflict being true actually mean to users? I understand that
> detect_conflict being true could introduce some overhead to detect conflicts.
> But in terms of conflict detection, even if detect_confict is false, we detect
> some conflicts such as concurrent inserts with the same key. Once we
> introduce the complete conflict detection feature, I'm not sure there is a 
> case
> where a user wants to detect only some particular types of conflict.
> 
> > Therefore, additional conflict detection for these cases is currently
> > omitted to minimize potential overhead. However, the pre-detection for
> > conflict in these error cases is still essential to support automatic
> > conflict resolution in the future.
> 
> I feel that we should log all types of conflict in an uniform way. For 
> example,
> with detect_conflict being true, the update_differ conflict is reported as
> "conflict %s detected on relation "%s"", whereas concurrent inserts with the
> same key is reported as "duplicate key value violates unique constraint "%s"",
> which could confuse users.

Do you mean it's OK to add a pre-check before applying the INSERT, which would
verify whether the remote tuple violates any unique constraints, and log a
conflict message if it does? I thought about this but was slightly worried
about the extra cost it would bring. OTOH, if we think it's acceptable, we
could do that, since the cost is incurred only when detect_conflict is enabled.
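
For illustration, this is the kind of case under discussion (a hypothetical
table tbl with a primary key; today the conflict surfaces only as a constraint
violation during apply):

-- on the subscriber, a local row already occupies the key
INSERT INTO tbl VALUES (1, 'local');
-- when the remote INSERT with key 1 arrives, the apply worker fails with:
--   ERROR:  duplicate key value violates unique constraint "tbl_pkey"
-- the discussed pre-check would instead detect this case and log it as a
-- conflict in the uniform "conflict ... detected on relation ..." style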

I also thought of logging such a conflict message in PG_CATCH(), but I think we
lack some of the necessary information (relation, index name, column name) in
the catch block.

Best Regards,
Hou zj





RE: Conflict Detection and Resolution

2024-06-14 Thread Zhijie Hou (Fujitsu)
On Thursday, June 13, 2024 8:46 PM Peter Eisentraut  
wrote:
> 
> On 23.05.24 08:36, shveta malik wrote:
> > Conflict Resolution
> > 
> > a) latest_timestamp_wins:The change with later commit timestamp
> wins.
> > b) earliest_timestamp_wins:   The change with earlier commit timestamp
> wins.
> > c) apply:   Always apply the remote change.
> > d) skip:Remote change is skipped.
> > e) error:   Error out on conflict. Replication is stopped, manual
> > action is needed.
> 
> You might be aware of pglogical, which has similar conflict resolution modes,
> but they appear to be spelled a bit different.  It might be worth reviewing 
> this,
> so that we don't unnecessarily introduce differences.

Right. Some of the proposed resolution names differ from pglogical's while the
functionality is the same. The following is a comparison with pglogical:

 latest_timestamp_wins (proposal)   - last_update_wins (pglogical)
 earliest_timestamp_wins (proposal) - first_update_wins (pglogical)
 apply (proposal)                   - apply_remote (pglogical)
 skip (proposal)                    - keep_local (pglogical)

I personally think pglogical's names read more naturally, but others may have
different opinions on this.

> 
> https://github.com/2ndquadrant/pglogical?tab=readme-ov-file#conflicts
> 
> There might also be other inspiration to be found related to this in pglogical
> documentation or code.

Another difference is that we allow users to specify different resolutions for
different conflicts, while pglogical allows specifying one resolution for all
conflicts. I think the proposed approach offers more flexibility to users,
which seems preferable to me.
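
Purely for illustration, a hypothetical sketch of what per-conflict resolvers
could look like, using the conflict and resolution names proposed in this
thread (this syntax is invented for the example and is not from any posted
patch):

-- hypothetical syntax, for illustration only
ALTER SUBSCRIPTION sub1 SET CONFLICT RESOLVER
    (update_differ  = 'latest_timestamp_wins',
     delete_missing = 'skip');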

Best Regards,
Hou zj


RE: Synchronizing slots from primary to standby

2024-06-06 Thread Zhijie Hou (Fujitsu)
On Thursday, June 6, 2024 12:21 PM Peter Smith 
> 
> Hi, here are some review comments for the docs patch v5-0001.

Thanks for the comments! Here is the V6 patch, which addresses them.

Best Regards,
Hou zj


v6-0001-Document-the-steps-to-check-if-the-standby-is-rea.patch
Description:  v6-0001-Document-the-steps-to-check-if-the-standby-is-rea.patch


RE: Synchronizing slots from primary to standby

2024-06-05 Thread Zhijie Hou (Fujitsu)
On Wednesday, June 5, 2024 2:32 PM Peter Smith  wrote:
 
> Hi. Here are some minor review comments for the docs patch v4-0001.

Thanks for the comments!

> The SGML file wrapping can be fixed to fill up to 80 cols for some of the
> paragraphs.

Unlike comments in C code, I don't think we enforce the 80-column limit in doc
files unless a line is too long to read. I checked the doc and think it's OK.

Here is the V5 patch, which addresses Peter's comments and Amit's comments[1].

[1] 
https://www.postgresql.org/message-id/CAA4eK1%2Bq1MYGgF3-LZCj6Xd0idujnjbTsfk-RqU%2BC51wYGaD5g%40mail.gmail.com

Best Regards,
Hou zj


v5-0001-Document-the-steps-to-check-if-the-standby-is-rea.patch
Description:  v5-0001-Document-the-steps-to-check-if-the-standby-is-rea.patch


RE: Conflict Detection and Resolution

2024-06-04 Thread Zhijie Hou (Fujitsu)
Hi,

This time at PGconf.dev[1], we had some discussions regarding this project.
The proposed approach is to split the work into two main components. The first
part focuses on conflict detection, which aims to identify and report
conflicts in logical replication. This feature will enable users to monitor
unexpected conflicts that may occur. The second part involves the actual
conflict resolution: here, we will provide built-in resolutions for each
conflict and allow users to choose which resolution is used for which conflict
(as described in the initial email of this thread).
 
Of course, we are open to alternative ideas and suggestions, and the
strategy above can be changed based on ongoing discussions and feedback
received.
 
Here is the patch for the first part of the work, which adds a new parameter,
detect_conflict, to the CREATE and ALTER SUBSCRIPTION commands. This new
parameter decides whether the subscription performs conflict detection. By
default, conflict detection is off for a subscription.
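
A minimal sketch of how the new parameter is intended to be used (object names
are illustrative; detect_conflict is assumed to take the usual boolean
subscription-option values):

CREATE SUBSCRIPTION sub1
    CONNECTION 'host=primary dbname=postgres'
    PUBLICATION pub1
    WITH (detect_conflict = on);

ALTER SUBSCRIPTION sub1 SET (detect_conflict = off);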
 
When conflict detection is enabled, additional logging is triggered in the
following conflict scenarios:
 
* Updating a row that was previously modified by another origin.
* The tuple to be updated is not found.
* The tuple to be deleted is not found.
 
While other conflict types exist in logical replication, such as an incoming
insert conflicting with an existing row due to a primary key or unique index,
those cases already result in constraint violation errors. Therefore,
additional conflict detection for these cases is currently omitted to minimize
potential overhead. However, pre-detection of conflicts in these error cases
is still essential to support automatic conflict resolution in the future.

[1] https://2024.pgconf.dev/

Best Regards,
Hou zj


v1-0001-Detect-update-and-delete-conflict-in-logical-repl.patch
Description:  v1-0001-Detect-update-and-delete-conflict-in-logical-repl.patch


RE: Synchronizing slots from primary to standby

2024-06-04 Thread Zhijie Hou (Fujitsu)
On Thursday, May 23, 2024 1:34 PM Peter Smith  wrote:

Thanks for the comments. I addressed most of them, except the following one,
which I am not sure about:

> 5b.
> Patch says "on the subscriber node", but isn't that the simplest case?
> e.g. maybe there are multiple nodes having subscriptions for these
> publications. Maybe the sentence needs to account for case of subscribers on
> >1 nodes.

I think it's not necessary to mention the multi-node case; in that case, the
user can simply perform the same steps on each node that has a failover
subscription.

> Is there no way to discover this information by querying the publisher?

I am not aware of a way for the user to get the necessary information, such as
replication origin progress, by querying the publisher, because such
information is only available on the subscriber.

Attached is the V4 doc patch, which addresses Peter's and Bertrand's comments.

Best Regards,
Hou zj


v4-0001-Document-the-steps-to-check-if-the-standby-is-rea.patch
Description:  v4-0001-Document-the-steps-to-check-if-the-standby-is-rea.patch


RE: Synchronizing slots from primary to standby

2024-06-04 Thread Zhijie Hou (Fujitsu)
On Wednesday, May 8, 2024 5:21 PM Bertrand Drouvot 
 wrote:
> A few comments:
Thanks for the comments!

> 2 ===
> 
> +test_sub=# SELECT
> +   array_agg(slotname) AS slots
> +   FROM
> +   ((
> +   SELECT r.srsubid AS subid, CONCAT('pg_', srsubid, '_sync_',
> srrelid, '_', ctl.system_identifier) AS slotname
> +   FROM pg_control_system() ctl, pg_subscription_rel r,
> pg_subscription s
> +   WHERE r.srsubstate = 'f' AND s.oid = r.srsubid AND
> s.subfailover
> +   ) UNION (
> 
> I guess this format comes from ReplicationSlotNameForTablesync(). What
> about creating a SQL callable function on top of it and make use of it in the
> query above? (that would ensure to keep the doc up to date even if the format
> changes in ReplicationSlotNameForTablesync()).

We could add a new function as suggested, but I think it's not the right time
(beta1) to add it, because a new function would require a catversion bump,
which may not be worthwhile at this stage. I think we can consider this after
the release, and perhaps gather more use cases for the new function you
suggested.

> 
> 3 ===
> 
> +test_sub=# SELECT
> +   MAX(remote_lsn) AS remote_lsn_on_subscriber
> +   FROM
> +   ((
> +   SELECT (CASE WHEN r.srsubstate = 'f' THEN
> pg_replication_origin_progress(CONCAT('pg_', r.srsubid, '_', r.srrelid), 
> false)
> +   WHEN r.srsubstate IN ('s', 'r') THEN r.srsublsn
> END) AS remote_lsn
> +   FROM pg_subscription_rel r, pg_subscription s
> +   WHERE r.srsubstate IN ('f', 's', 'r') AND s.oid = r.srsubid 
> AND
> s.subfailover
> +   ) UNION (
> +   SELECT pg_replication_origin_progress(CONCAT('pg_',
> s.oid), false) AS remote_lsn
> +   FROM pg_subscription s
> +   WHERE s.subfailover
> +   ));
> 
> What about adding a join to pg_replication_origin to get rid of the 
> "hardcoded"
> format "CONCAT('pg_', r.srsubid, '_', r.srrelid)" and "CONCAT('pg_', s.oid)"?

I tried a bit, but it doesn't seem feasible to derive the relationship between
a subscription and its origin by querying pg_subscription and
pg_replication_origin.

Best Regards,
Hou zj


RE: Synchronizing slots from primary to standby

2024-04-29 Thread Zhijie Hou (Fujitsu)
On Monday, April 29, 2024 5:11 PM shveta malik  wrote:
> 
> On Mon, Apr 29, 2024 at 11:38 AM shveta malik 
> wrote:
> >
> > On Mon, Apr 29, 2024 at 10:57 AM Zhijie Hou (Fujitsu)
> >  wrote:
> > >
> > > On Friday, March 15, 2024 10:45 PM Bertrand Drouvot
>  wrote:
> > > >
> > > > Hi,
> > > >
> > > > On Thu, Mar 14, 2024 at 02:22:44AM +, Zhijie Hou (Fujitsu) wrote:
> > > > > Hi,
> > > > >
> > > > > Since the standby_slot_names patch has been committed, I am
> > > > > attaching the last doc patch for review.
> > > > >
> > > >
> > > > Thanks!
> > > >
> > > > 1 ===
> > > >
> > > > +   continue subscribing to publications now on the new primary
> > > > + server
> > > > without
> > > > +   any data loss.
> > > >
> > > > I think "without any data loss" should be re-worded in this
> > > > context. Data loss in the sense "data committed on the primary and
> > > > not visible on the subscriber in case of failover" can still occurs (in 
> > > > case
> synchronous replication is not used).
> > > >
> > > > 2 ===
> > > >
> > > > +   If the result (failover_ready) of both above 
> > > > steps is
> > > > +   true, existing subscriptions will be able to continue without data
> loss.
> > > > +  
> > > >
> > > > I don't think that's true if synchronous replication is not used.
> > > > Say,
> > > >
> > > > - synchronous replication is not used
> > > > - primary is not able to reach the standby anymore and
> > > > standby_slot_names is set
> > > > - new data is inserted into the primary
> > > > - then not replicated to subscriber (due to standby_slot_names)
> > > >
> > > > Then I think the both above steps will return true but data would
> > > > be lost in case of failover.
> > >
> > > Thanks for the comments, attach the new version patch which reworded
> > > the above places.
> >
> > Thanks for the patch.
> >
> > Few comments:
> >
> > 1)  Tested the steps, one of the queries still refers to
> > 'conflict_reason'. I think it should refer 'conflicting'.

Thanks for catching this. Fixed.

> >
> > 2) Will it be good to mention that in case of planned promotion, it is
> > recommended to run  pg_sync_replication_slots() as last sync attempt
> > before we run failvoer-ready validation steps? This can be mentioned
> > in high-availaibility.sgml of current patch
> 
> I recall now that with the latest fix, we cannot run
> pg_sync_replication_slots() unless we disable the slot-sync worker.
> Considering that, I think it will be too many steps just to run the SQL 
> function at
> the end without much value added. Thus we can skip this point, we can rely on
> slot sync worker completely.

Agreed. I didn't change this.

Here is the V3 doc patch.

Best Regards,
Hou zj


v3-0001-Document-the-steps-to-check-if-the-standby-is-rea.patch
Description:  v3-0001-Document-the-steps-to-check-if-the-standby-is-rea.patch


RE: Synchronizing slots from primary to standby

2024-04-28 Thread Zhijie Hou (Fujitsu)
On Friday, March 15, 2024 10:45 PM Bertrand Drouvot 
 wrote:
> 
> Hi,
> 
> On Thu, Mar 14, 2024 at 02:22:44AM +0000, Zhijie Hou (Fujitsu) wrote:
> > Hi,
> >
> > Since the standby_slot_names patch has been committed, I am attaching
> > the last doc patch for review.
> >
> 
> Thanks!
> 
> 1 ===
> 
> +   continue subscribing to publications now on the new primary server
> without
> +   any data loss.
> 
> I think "without any data loss" should be re-worded in this context. Data 
> loss in
> the sense "data committed on the primary and not visible on the subscriber in
> case of failover" can still occurs (in case synchronous replication is not 
> used).
> 
> 2 ===
> 
> +   If the result (failover_ready) of both above steps is
> +   true, existing subscriptions will be able to continue without data loss.
> +  
> 
> I don't think that's true if synchronous replication is not used. Say,
> 
> - synchronous replication is not used
> - primary is not able to reach the standby anymore and standby_slot_names is
> set
> - new data is inserted into the primary
> - then not replicated to subscriber (due to standby_slot_names)
> 
> Then I think the both above steps will return true but data would be lost in 
> case
> of failover.

Thanks for the comments, attach the new version patch which reworded the
above places.

Best Regards,
Hou zj


v2-0001-Document-the-steps-to-check-if-the-standby-is-rea.patch
Description:  v2-0001-Document-the-steps-to-check-if-the-standby-is-rea.patch


RE: Improving the latch handling between logical replication launcher and worker processes.

2024-04-26 Thread Zhijie Hou (Fujitsu)
On Thursday, April 25, 2024 4:59 PM vignesh C  wrote:
> 
> Hi,
> 
> Currently the launcher's latch is used for the following: a) worker process 
> attach
> b) worker process exit and c) subscription creation.
> Since this same latch is used for multiple cases, the launcher process is not 
> able
> to handle concurrent scenarios like: a) Launcher started a new apply worker 
> and
> waiting for apply worker to attach and b) create subscription sub2 sending
> launcher wake up signal. In this scenario, both of them will set latch of the
> launcher process, the launcher process is not able to identify that both
> operations have occurred 1) worker is attached 2) subscription is created and
> apply worker should be started. As a result the apply worker does not get
> started for the new subscription created immediately and gets started after 
> the
> timeout of 180 seconds.
> 
> I have started a new thread for this based on suggestions at [1].
> 
> a) Introduce a new latch to handle worker attach and exit.

I found that the startup process also uses two latches (see
recoveryWakeupLatch) for different purposes, so maybe this is OK. But note
that both the logical launcher and the apply worker call
logicalrep_worker_launch(); if we only add one new latch, both will wait on
the same new latch, and the launcher may consume a SetLatch() that should have
been consumed by the apply worker (although this is rare).

> b) Add a new GUC launcher_retry_time which gives more flexibility to users as
> suggested by Amit at [1]. Before 5a3a953, the wal_retrieve_retry_interval 
> plays
> a similar role as the suggested new GUC launcher_retry_time, e.g. even if a
> worker is launched, the launcher only wait wal_retrieve_retry_interval time
> before next round.

IIUC, the issue does not happen frequently and may not be noticeable in setups
where apply workers are rarely restarted. So I am slightly unsure whether it's
worth adding a new GUC.

> c) Don't reset the latch at worker attach and allow launcher main to identify 
> and
> handle it. For this there is a patch v6-0002 available at [2].

This seems simple. I found that we do something similar in
RegisterSyncRequest() and WalSummarizerMain().
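
A minimal sketch of that pattern as it could apply to option (c), loosely
modeled on WalSummarizerMain(): the latch is reset only at the top of the main
loop, never in the worker-attach path, so a SetLatch() that arrives while the
launcher is busy is observed on the next iteration instead of being lost. The
helper name is hypothetical, not the actual patch:

    for (;;)
    {
        CHECK_FOR_INTERRUPTS();

        /* Reset first, then re-check shared state, so no wakeup is lost. */
        ResetLatch(MyLatch);

        /* Handle worker attach/exit and any newly created subscriptions. */
        ApplyLauncherProcessPendingWork();  /* hypothetical helper */

        (void) WaitLatch(MyLatch,
                         WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                         wal_retrieve_retry_interval,
                         WAIT_EVENT_LOGICAL_LAUNCHER_MAIN);
    }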

Best Regards,
Hou zj


RE: Race condition in FetchTableStates() breaks synchronization of subscription tables

2024-04-24 Thread Zhijie Hou (Fujitsu)
On Wednesday, April 24, 2024 6:29 PM vignesh C  wrote:
> 
> On Wed, 24 Apr 2024 at 11:59, Amit Kapila  wrote:
> >
> > On Wed, Mar 13, 2024 at 9:19 AM vignesh C  wrote:
> > >
> > > On Tue, 12 Mar 2024 at 09:34, Ajin Cherian  wrote:
> > > >
> > > >
> > > >
> > > > On Tue, Mar 12, 2024 at 2:59 PM vignesh C 
> wrote:
> > > >>
> > > >>
> > > >> Thanks, I have created the following Commitfest entry for this:
> > > >> https://commitfest.postgresql.org/47/4816/
> > > >>
> > > >> Regards,
> > > >> Vignesh
> > > >
> > > >
> > > > Thanks for the patch, I have verified that the fix works well by 
> > > > following
> the steps mentioned to reproduce the problem.
> > > > Reviewing the patch, it seems good and is well documented. Just one
> minor comment I had was probably to change the name of the variable
> table_states_valid to table_states_validity. The current name made sense when
> it was a bool, but now that it is a tri-state enum, it doesn't fit well.
> > >
> > > Thanks for reviewing the patch, the attached v6 patch has the
> > > changes for the same.
> > >
> >
> > v6_0001* looks good to me. This should be backpatched unless you or
> > others think otherwise.
> 
> This patch needs to be committed in master,PG16 and PG15.
> This is not required from PG14 onwards because they don't have
> HasSubscriptionRelations call before updating table_states_valid:
> /*
>  * Does the subscription have tables?
>  *
>  * If there were not-READY relations found then we know it does. But
>  * if table_states_not_ready was empty we still need to check again to
>  * see if there are 0 tables.
>  */
> has_subrels = (table_states_not_ready != NIL) ||
>   HasSubscriptionRelations(MySubscription->oid);
> 
> So the invalidation function will not be called here.
> 
> Whereas for PG15 and newer versions this is applicable:
> HasSubscriptionRelations calls table_open function which will get the
> invalidate callback like in:
> HasSubscriptionRelations -> table_open -> relation_open -> LockRelationOid
> -> AcceptInvalidationMessages -> ReceiveSharedInvalidMessages ->
> LocalExecuteInvalidationMessage -> CallSyscacheCallbacks ->
> invalidate_syncing_table_states
> 
> The attached patch
> v7-0001-Table-sync-missed-due-to-race-condition-in-subscr.patch
> applies for master and PG16 branch,
> v7-0001-Table-sync-missed-due-to-race-condition-in-subscr_PG15.patch
> applies for PG15 branch.

Thanks, I have verified that the patches apply cleanly and fix the issue on
each branch. The regression tests also pass after applying the patch on my
machine.

Best Regards,
Hou zj


RE: promotion related handling in pg_sync_replication_slots()

2024-04-21 Thread Zhijie Hou (Fujitsu)
On Friday, April 19, 2024 4:22 PM shveta malik  wrote:
> On Fri, Apr 19, 2024 at 11:37 AM shveta malik  wrote:
> >
> > On Fri, Apr 19, 2024 at 10:53 AM Bertrand Drouvot
> >  wrote:
> > >
> > > Hi,
> > >
> > > On Thu, Apr 18, 2024 at 05:36:05PM +0530, shveta malik wrote:
> > > > Please find v8 attached. Changes are:
> > >
> > > Thanks!
> > >
> > > A few comments:
> >
> > Thanks for reviewing.
> >
> > > 1 ===
> > >
> > > @@ -1440,7 +1461,7 @@ ReplSlotSyncWorkerMain(char *startup_data,
> size_t startup_data_len)
> > >  * slotsync_worker_onexit() but that will need the connection to 
> > > be
> made
> > >  * global and we want to avoid introducing global for this 
> > > purpose.
> > >  */
> > > -   before_shmem_exit(slotsync_failure_callback,
> PointerGetDatum(wrconn));
> > > +   before_shmem_exit(slotsync_worker_disconnect,
> > > + PointerGetDatum(wrconn));
> > >
> > > The comment above this change still states "Register the failure
> > > callback once we have the connection", I think it has to be reworded
> > > a bit now that v8 is making use of slotsync_worker_disconnect().
> > >
> > > 2 ===
> > >
> > > +* Register slotsync_worker_onexit() before we register
> > > +* ReplicationSlotShmemExit() in BaseInit(), to ensure that during
> exit of
> > > +* slot sync worker, ReplicationSlotShmemExit() is called first,
> followed
> > > +* by slotsync_worker_onexit(). Startup process during
> > > + promotion waits for
> > >
> > > Worth to mention in shmem_exit() (where it "while
> (--before_shmem_exit_index >= 0)"
> > > or before the shmem_exit() definition) that ReplSlotSyncWorkerMain()
> > > relies on this LIFO behavior? (not sure if there is other "strong"
> > > LIFO requirement in other part of the code).
> >
> > I see other modules as well relying on LIFO behavior.
> > Please see applyparallelworker.c where
> > 'before_shmem_exit(pa_shutdown)' is needed to be done after
> > 'before_shmem_exit(logicalrep_worker_onexit)' (commit id 3d144c6).
> > Also in postinit.c, I see such comments atop
> > 'before_shmem_exit(ShutdownPostgres, 0)'.
> > I feel we can skip adding this specific comment about
> > ReplSlotSyncWorkerMain() in ipc.c, as none of the other modules has
> > also not added any. I will address the rest of your comments in the
> > next version.
> >
> > > 3 ===
> > >
> > > +* Startup process during promotion waits for slot sync to finish
> and it
> > > +* does that by checking the 'syncing' flag.
> > >
> > > worth to mention ShutDownSlotSync()?
> > >
> > > 4 ===
> > >
> > > I did a few tests manually (launching ShutDownSlotSync() through gdb
> > > / with and without sync worker and with / without
> > > pg_sync_replication_slots() running
> > > concurrently) and it looks like it works as designed.
> >
> > Thanks for testing it.
> >
> > > Having said that, the logic that is in place to take care of the
> > > corner cases described up-thread seems reasonable to me.
> 
> Please find v9 with the above comments addressed.

Thanks, the patch looks good to me. I also tested a few concurrent
promotion/function-execution cases and didn't find any issues.

Best Regards,
Hou zj


RE: Disallow changing slot's failover option in transaction block

2024-04-21 Thread Zhijie Hou (Fujitsu)
On Friday, April 19, 2024 10:54 AM Kuroda, Hayato/黒田 隼人 
 wrote:
> In your patch, the pg_dump.c was updated. IIUC the main reason was that
> pg_restore executes some queries as single transactions so that ALTER
> SUBSCRIPTION cannot be done, right?

Yes, and please see below for other reasons.

> Also, failover was synchronized only when we were in the upgrade mode, but
> your patch seems to remove the condition. Can you clarify the reason?

We used ALTER SUBSCRIPTION in upgrade mode because it was not allowed to use
connect=false and failover=true together in CREATE SUBSCRIPTION. But since we
no longer have this restriction (we no longer alter the slot when creating a
subscription), we can specify failover directly in CREATE SUBSCRIPTION and do
so in non-upgrade mode as well.
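
A minimal sketch of what this allows (object names are illustrative):

CREATE SUBSCRIPTION sub1
    CONNECTION 'host=primary dbname=postgres'
    PUBLICATION pub1
    WITH (create_slot = false, slot_name = 'sub1_slot', failover = true);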

Attached is the V3 patch, which also addresses Shveta's[1] and Bertrand's[2]
comments.

[1] 
https://www.postgresql.org/message-id/CAJpy0uD3YOeDg-tTCUi3EZ8vznRDfDqO_k6LepJpXUV1Z_%3DgkA%40mail.gmail.com
[2] 
https://www.postgresql.org/message-id/ZiIxuaiINsuaWuDK%40ip-10-97-1-34.eu-west-3.compute.internal

Best Regards,
Hou zj


v3-0001-Fix-the-handling-of-failover-option-in-subscripti.patch
Description:  v3-0001-Fix-the-handling-of-failover-option-in-subscripti.patch


RE: Disallow changing slot's failover option in transaction block

2024-04-18 Thread Zhijie Hou (Fujitsu)
On Thursday, April 18, 2024 1:52 PM Amit Kapila  wrote:
> 
> On Tue, Apr 16, 2024 at 5:06 PM shveta malik 
> wrote:
> >
> > On Tue, Apr 16, 2024 at 1:45 PM Hayato Kuroda (Fujitsu)
> >  wrote:
> > >
> > > Dear Hou,
> > >
> > > > Kuroda-San reported an issue off-list that:
> > > >
> > > > If user execute ALTER SUBSCRIPTION SET (failover) command inside a
> > > > txn block and rollback, only the subscription option change can be
> > > > rolled back, while the replication slot's failover change is preserved.
> > > >
> > > > This is because ALTER SUBSCRIPTION SET (failover) command
> > > > internally executes the replication command ALTER_REPLICATION_SLOT
> > > > to change the replication slot's failover property, but this
> > > > replication command execution cannot be rollback.
> > > >
> > > > To fix it, I think we can prevent user from executing ALTER
> > > > SUBSCRIPTION set
> > > > (failover) inside a txn block, which is also consistent to the
> > > > ALTER SUBSCRIPTION REFRESH/DROP SUBSCRIPTION command. Attach
> a
> > > > small patch to address this.
> > >
> > > Thanks for posting the patch, the fix is same as my expectation.
> > > Also, should we add the restriction to the doc? I feel [1] can be updated.
> >
> > +1.
> >
> > Similar to ALTER SUB, CREATE SUB also needs to be fixed. This is
> > because we call alter_replication_slot in CREATE SUB as well, for the
> > case when slot_name is provided and create_slot=false. But the tricky
> > part is we call alter_replication_slot() when creating subscription
> > for both failover=true and false. That means if we want to fix it on
> > the similar line of ALTER SUB, we have to disallow user from executing
> > the CREATE SUBSCRIPTION (slot_name = xx) in a txn block, which seems
> > to break some existing use cases. (previously, user can execute such a
> > command inside a txn block).
> >
> > So, we need to think if there are better ways to fix it.  After
> > discussion with Hou-San offlist, here are some ideas:
> >
> > 1. do not alter replication slot's failover option when CREATE
> > SUBSCRIPTION   WITH failover=false. This means we alter replication
> > slot only when failover is set to true. And thus we can disallow
> > CREATE SUB WITH (slot_name =xx, failover=true, create_slot=false)
> > inside a txn block.
> >
> > This option allows user to run CREATE-SUB(create_slot=false) with
> > failover=false in txn block like earlier. But on the downside, it
> > makes the behavior inconsistent for otherwise simpler option like
> > failover,  i.e. with failover=true, CREATE SUB is not allowed in txn
> > block while with failover=false, it is allowed. It makes it slightly
> > complex to be understood by user.
> >
> > 2. let's not disallow CREATE SUB in txn block as earlier, just don't
> > perform internal alter-failover during CREATE SUB for existing
> > slots(create_slot=false, slot_name=xx)  i.e. when create_slot is
> > false, we will ignore failover parameter of CREATE SUB and it is
> > user's responsibility to set it appropriately using ALTER SUB cmd. For
> > create_slot=true, the behaviour of CREATE-SUB remains same as earlier.
> >
> > This option does not add new restriction for CREATE SUB wrt txn block.
> > In context of failover with create_slot=false, we already have a
> > similar restriction (documented one) for ALTER SUB, i.e. with 'ALTER
> > SUBSCRIPTION SET(slot_name = new)', user needs to alter the new slot's
> > failover by himself. CREAT SUB can also be documented in similar way.
> > This seems simpler to be understood considering existing ALTER SUB's
> > behavior as well. Plus, this will make CREATE-SUB code slightly
> > simpler and thus easily manageable.
> >
> 
> +1 for option 2 as it sounds logical to me and consistent with ALTER
> SUBSCRIPTION.

+1.

Here is the V2 patch, which includes the suggested changes for CREATE
SUBSCRIPTION. Since we no longer connect to the publisher to alter the slot
when create_slot=false, the restriction that disallowed failover=true together
with connect=false is also removed.

Best Regards,
Hou zj


v2-0001-Fix-the-handling-of-failover-option-in-subscripti.patch
Description:  v2-0001-Fix-the-handling-of-failover-option-in-subscripti.patch


RE: promotion related handling in pg_sync_replication_slots()

2024-04-16 Thread Zhijie Hou (Fujitsu)
On Tuesday, April 16, 2024 2:52 PM Bertrand Drouvot 
 wrote:


Hi,

> On Tue, Apr 16, 2024 at 10:00:04AM +0530, shveta malik wrote:
> > Please find v5 addressing above comments.
> 
> Thanks!
> 
> @@ -1634,9 +1677,14 @@ SyncReplicationSlots(WalReceiverConn *wrconn)  {
> PG_ENSURE_ERROR_CLEANUP(slotsync_failure_callback,
> PointerGetDatum(wrconn));
> {
> +   check_flags_and_set_sync_info(InvalidPid);
> +
> 
> Given the fact that if the sync worker is running it won't be possible to 
> trigger a
> manual sync with pg_sync_replication_slots(), what about also checking the
> "sync_replication_slots" value at the start of SyncReplicationSlots() and 
> emmit
> an error if sync_replication_slots is set to on? (The message could explicitly
> states that it's not possible to use the function if sync_replication_slots 
> is set to
> on).

I personally feel that adding the additional check for sync_replication_slots
may not improve the situation here. Because the GUC sync_replication_slots can
change at any point, it could be false when this additional check is performed
and then be set to true immediately after the check, so the check could not
simplify the logic anyway.

Best Regards,
Hou zj




RE: Race condition in FetchTableStates() breaks synchronization of subscription tables

2024-04-15 Thread Zhijie Hou (Fujitsu)
On Wednesday, March 13, 2024 11:49 AM vignesh C  wrote:
> 
> On Tue, 12 Mar 2024 at 09:34, Ajin Cherian  wrote:
> >
> >
> >
> > On Tue, Mar 12, 2024 at 2:59 PM vignesh C  wrote:
> >>
> >>
> >> Thanks, I have created the following Commitfest entry for this:
> >> https://commitfest.postgresql.org/47/4816/
> >>
> >> Regards,
> >> Vignesh
> >
> >
> > Thanks for the patch, I have verified that the fix works well by following 
> > the
> steps mentioned to reproduce the problem.
> > Reviewing the patch, it seems good and is well documented. Just one minor
> comment I had was probably to change the name of the variable
> table_states_valid to table_states_validity. The current name made sense when
> it was a bool, but now that it is a tri-state enum, it doesn't fit well.
> 
> Thanks for reviewing the patch, the attached v6 patch has the changes for the
> same.

FYI, I noticed that there is one open item on
https://wiki.postgresql.org/wiki/PostgreSQL_17_Open_Items which is related to
the fix in this thread.

--
Intermittent failures in 040_standby_failover_slots_sync test
Possible solution in this thread: Race condition in FetchTableStates
--

AFAICS, the bug discussed here is not a new issue in PG17, so I am thinking of
moving the item to the "Older bugs affecting stable branches" section if there
are no objections.

Best Regards,
Hou zj



Disallow changing slot's failover option in transaction block

2024-04-15 Thread Zhijie Hou (Fujitsu)
Hi,

Kuroda-San reported an issue off-list that:

If a user executes the ALTER SUBSCRIPTION SET (failover) command inside a
transaction block and then rolls back, only the subscription option change is
rolled back, while the replication slot's failover change is preserved.

This is because the ALTER SUBSCRIPTION SET (failover) command internally
executes the replication command ALTER_REPLICATION_SLOT to change the
replication slot's failover property, and the execution of this replication
command cannot be rolled back.
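
A minimal reproduction sketch (assuming a subscription sub1 whose slot is
sub1_slot, and that the slot's failover flag is visible in
pg_replication_slots on the publisher):

BEGIN;
ALTER SUBSCRIPTION sub1 SET (failover = true);
ROLLBACK;

-- The subscription's catalog flag is rolled back:
SELECT subfailover FROM pg_subscription WHERE subname = 'sub1';
-- ... but on the publisher the slot's failover property has already changed:
SELECT failover FROM pg_replication_slots WHERE slot_name = 'sub1_slot';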

To fix it, I think we can prevent users from executing ALTER SUBSCRIPTION SET
(failover) inside a transaction block, which is also consistent with the ALTER
SUBSCRIPTION REFRESH and DROP SUBSCRIPTION commands. Attached is a small patch
to address this.

Best Regards,
Hou Zhijie



v1-0001-Disallow-alter-subscription-s-failover-option-ins.patch
Description:  v1-0001-Disallow-alter-subscription-s-failover-option-ins.patch


RE: promotion related handling in pg_sync_replication_slots()

2024-04-15 Thread Zhijie Hou (Fujitsu)
On Monday, April 15, 2024 6:09 PM shveta malik  wrote:
> 
> Please find v4 addressing the above comments.

Thanks for the patch.

Here are a few comments:

1.

+   ereport(ERROR,
+   errmsg("promotion in progress, can not 
synchronize replication slots"));
+   }

I think an errcode is needed.

The style of the error message seems a bit unnatural to me. I suggest:
"cannot synchronize replication slots when standby promotion is ongoing"


2.

+   if (worker_pid != InvalidPid)
+   Assert(SlotSyncCtx->pid == InvalidPid);

We could merge the checks into one Assert().
Assert(SlotSyncCtx->pid == InvalidPid || worker_pid == InvalidPid);


3.

-   pqsignal(SIGINT, SignalHandlerForShutdownRequest);

I realized that we should register this before setting SlotSyncCtx->pid;
otherwise, if the standby is promoted after the pid is set but before the
signal handler is registered, the slotsync worker could miss the SIGINT sent
by the startup process (ShutDownSlotSync). This is an existing issue for the
slotsync worker, but maybe we can fix it together with this patch.
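
A sketch of the suggested ordering (illustrative, not the actual patch; it
assumes SlotSyncCtx protects its pid field with a spinlock):

    /*
     * Register the SIGINT handler before advertising our PID, so that a
     * promotion happening in between cannot send a SIGINT that we would
     * miss.
     */
    pqsignal(SIGINT, SignalHandlerForShutdownRequest);

    SpinLockAcquire(&SlotSyncCtx->mutex);
    SlotSyncCtx->pid = MyProcPid;
    SpinLockRelease(&SlotSyncCtx->mutex);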

Best Regards,
Hou zj


RE: Synchronizing slots from primary to standby

2024-04-12 Thread Zhijie Hou (Fujitsu)
On Friday, April 12, 2024 11:31 AM Amit Kapila  wrote:
> 
> On Thu, Apr 11, 2024 at 5:04 PM Zhijie Hou (Fujitsu) 
> wrote:
> >
> > On Thursday, April 11, 2024 12:11 PM Amit Kapila 
> wrote:
> >
> > >
> > > 2.
> > > - if (remote_slot->restart_lsn < slot->data.restart_lsn)
> > > + if (remote_slot->confirmed_lsn < slot->data.confirmed_flush)
> > >   elog(ERROR,
> > >   "cannot synchronize local slot \"%s\" LSN(%X/%X)"
> > >
> > > Can we be more specific in this message? How about splitting it into
> > > error_message as "cannot synchronize local slot \"%s\"" and then
> > > errdetail as "Local slot's start streaming location LSN(%X/%X) is
> > > ahead of remote slot's LSN(%X/%X)"?
> >
> > Your version looks better. Since the above two messages all have
> > errdetail, I used the style of ereport(ERROR, errmsg_internal(),
> > errdetail_internal()... in the patch which is equal to the elog(ERROR but 
> > has an
> additional detail message.
> >
> 
> makes sense.
> 
> > Here is V5 patch set.
> >
> 
> I think we should move the check to not advance slot when one of
> remote_slot's restart_lsn or catalog_xmin is lesser than the local slot inside
> update_local_synced_slot() as we want to prevent updating slot in those cases
> even during slot synchronization.

Agreed. Here is the V6 patch, which addresses this. I have merged the two
patches into one.

Best Regards,
Hou zj


v6-0001-Fix-the-handling-of-LSN-and-xmin-in-slot-sync.patch
Description:  v6-0001-Fix-the-handling-of-LSN-and-xmin-in-slot-sync.patch


RE: Synchronizing slots from primary to standby

2024-04-11 Thread Zhijie Hou (Fujitsu)
On Thursday, April 11, 2024 12:11 PM Amit Kapila  
wrote:
> 
> On Wed, Apr 10, 2024 at 5:28 PM Zhijie Hou (Fujitsu) 
> wrote:
> >
> > On Thursday, April 4, 2024 5:37 PM Amit Kapila 
> wrote:
> > >
> > > BTW, while thinking on this one, I
> > > noticed that in the function LogicalConfirmReceivedLocation(), we
> > > first update the disk copy, see comment [1] and then in-memory
> > > whereas the same is not true in
> > > update_local_synced_slot() for the case when snapshot exists. Now,
> > > do we have the same risk here in case of standby? Because I think we
> > > will use these xmins while sending the feedback message (in
> XLogWalRcvSendHSFeedback()).
> > >
> > > * We have to write the changed xmin to disk *before* we change
> > > * the in-memory value, otherwise after a crash we wouldn't know
> > > * that some catalog tuples might have been removed already.
> >
> > Yes, I think we have the risk on the standby, I can reproduce the case
> > that if the server crashes after updating the in-memory value and
> > before saving them to disk, the synced slot could be invalidated after
> > restarting from crash, because the necessary rows have been removed on
> > the primary. The steps can be found in [1].
> >
> > I think we'd better fix the order in update_local_synced_slot() as
> > well. I tried to make the fix in 0002, 0001 is Shveta's patch to fix
> > another issue in this thread. Since they are touching the same function, so
> attach them together for review.
> >
> 
> Few comments:
> ===
> 1.
> +
> + /* Sanity check */
> + if (slot->data.confirmed_flush != remote_slot->confirmed_lsn)
> + ereport(LOG, errmsg("synchronized confirmed_flush for slot \"%s\"
> + differs from
> remote slot",
> +remote_slot->name),
> 
> Is there a reason to use elevel as LOG instead of ERROR? I think it should be
> elog(ERROR, .. as this is an unexpected case.

Agreed.

> 
> 2.
> - if (remote_slot->restart_lsn < slot->data.restart_lsn)
> + if (remote_slot->confirmed_lsn < slot->data.confirmed_flush)
>   elog(ERROR,
>   "cannot synchronize local slot \"%s\" LSN(%X/%X)"
> 
> Can we be more specific in this message? How about splitting it into
> error_message as "cannot synchronize local slot \"%s\"" and then errdetail as
> "Local slot's start streaming location LSN(%X/%X) is ahead of remote slot's
> LSN(%X/%X)"?

Your version looks better. Since both of the above messages have an errdetail,
I used the ereport(ERROR, errmsg_internal(), errdetail_internal(), ...) style
in the patch, which is equivalent to elog(ERROR, ...) but carries an
additional detail message.
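
For reference, a sketch of the resulting style (illustrative; the LSN
arguments assume the confirmed_flush comparison discussed above):

    ereport(ERROR,
            errmsg_internal("cannot synchronize local slot \"%s\"",
                            remote_slot->name),
            errdetail_internal("Local slot's start streaming location LSN(%X/%X) is ahead of remote slot's LSN(%X/%X).",
                               LSN_FORMAT_ARGS(slot->data.confirmed_flush),
                               LSN_FORMAT_ARGS(remote_slot->confirmed_lsn)));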

Here is the V5 patch set.

Best Regards,
Hou zj


v5-0002-write-the-changed-xmin-to-disk-before-computing-g.patch
Description:  v5-0002-write-the-changed-xmin-to-disk-before-computing-g.patch


v5-0001-Correct-sanity-check-to-compare-confirmed_lsn.patch
Description:  v5-0001-Correct-sanity-check-to-compare-confirmed_lsn.patch


RE: Synchronizing slots from primary to standby

2024-04-10 Thread Zhijie Hou (Fujitsu)

On Thursday, April 4, 2024 5:37 PM Amit Kapila  wrote:
> 
> BTW, while thinking on this one, I
> noticed that in the function LogicalConfirmReceivedLocation(), we first update
> the disk copy, see comment [1] and then in-memory whereas the same is not
> true in
> update_local_synced_slot() for the case when snapshot exists. Now, do we have
> the same risk here in case of standby? Because I think we will use these xmins
> while sending the feedback message (in XLogWalRcvSendHSFeedback()).
>
> * We have to write the changed xmin to disk *before* we change
> * the in-memory value, otherwise after a crash we wouldn't know
> * that some catalog tuples might have been removed already.

Yes, I think we have the same risk on the standby. I can reproduce a case
where, if the server crashes after updating the in-memory value and before
saving it to disk, the synced slot is invalidated after restarting from the
crash, because the necessary rows have already been removed on the primary.
The steps can be found in [1].

I think we'd better fix the ordering in update_local_synced_slot() as well. I
made that fix in 0002; 0001 is Shveta's patch fixing another issue in this
thread. Since they touch the same function, I am attaching them together for
review.

[1]
-- Primary:
SELECT 'init' FROM pg_create_logical_replication_slot('logicalslot', 
'test_decoding', false, false, true);

-- Standby:
SELECT 'init' FROM pg_create_logical_replication_slot('standbylogicalslot', 
'test_decoding', false, false, false);
SELECT pg_sync_replication_slots();

-- Primary:
CREATE TABLE test (a int);
INSERT INTO test VALUES(1);
DROP TABLE test;

SELECT txid_current();
SELECT txid_current();
SELECT txid_current();
SELECT pg_log_standby_snapshot();

SELECT pg_replication_slot_advance('logicalslot', pg_current_wal_lsn());

-- Standby:
- wait for standby to replay all the changes on the primary.

- this is to serialize snapshots.
SELECT pg_replication_slot_advance('standbylogicalslot', 
pg_last_wal_replay_lsn());

- Use gdb to stop at the place after calling ReplicationSlotsComputexx()
  functions and before calling ReplicationSlotSave().
SELECT pg_sync_replication_slots();

-- Primary:

- First, wait for the primary slot's (the physical slot's) catalog_xmin to be
  updated to match that of the failover slot.

VACUUM FULL;

- Wait for the VACUUM FULL to be replayed on the standby.

-- Standby:

- For the process which is blocked by gdb, let the process crash (elog(PANIC,
  ...)).

After restarting the standby from the crash, we can see that the synced slot
is invalidated.

LOG:  invalidating obsolete replication slot "logicalslot"
DETAIL:  The slot conflicted with xid horizon 741.
CONTEXT:  WAL redo at 0/3059B90 for Heap2/PRUNE_ON_ACCESS: 
snapshotConflictHorizon: 741, isCatalogRel: T, nplans: 0, nredirected: 0, 
ndead: 7, nunused: 0, dead: [22, 23, 24, 25, 26, 27, 28]; blkref #0: rel 
1663/5/1249, blk 16


Best Regards,
Hou zj


v4-0002-write-the-changed-xmin-to-disk-before-computing-g.patch
Description:  v4-0002-write-the-changed-xmin-to-disk-before-computing-g.patch


v4-0001-Correct-sanity-check-to-compare-confirmed_lsn.patch
Description:  v4-0001-Correct-sanity-check-to-compare-confirmed_lsn.patch


RE: Synchronizing slots from primary to standby

2024-04-09 Thread Zhijie Hou (Fujitsu)
On Thursday, April 4, 2024 4:25 PM Masahiko Sawada  
wrote:

Hi,

> On Wed, Apr 3, 2024 at 7:06 PM Amit Kapila 
> wrote:
> >
> > On Wed, Apr 3, 2024 at 11:13 AM Amit Kapila 
> wrote:
> > >
> > > On Wed, Apr 3, 2024 at 9:36 AM Bharath Rupireddy
> > >  wrote:
> > >
> > > > I quickly looked at v8, and have a nit, rest all looks good.
> > > >
> > > > +if (DecodingContextReady(ctx) &&
> found_consistent_snapshot)
> > > > +*found_consistent_snapshot = true;
> > > >
> > > > Can the found_consistent_snapshot be checked first to help avoid
> > > > the function call DecodingContextReady() for
> > > > pg_replication_slot_advance callers?
> > > >
> > >
> > > Okay, changed. Additionally, I have updated the comments and commit
> > > message. I'll push this patch after some more testing.
> > >
> >
> > Pushed!
> 
> While testing this change, I realized that it could happen that the server 
> logs
> are flooded with the following logical decoding logs that are written every 
> 200
> ms:

Thanks for reporting!

> 
> 2024-04-04 16:15:19.270 JST [3838739] LOG:  starting logical decoding for slot
> "test_sub"
> 2024-04-04 16:15:19.270 JST [3838739] DETAIL:  Streaming transactions
> committing after 0/50006F48, reading WAL from 0/50006F10.
> 2024-04-04 16:15:19.270 JST [3838739] LOG:  logical decoding found
> consistent point at 0/50006F10
> 2024-04-04 16:15:19.270 JST [3838739] DETAIL:  There are no running
> transactions.
> 2024-04-04 16:15:19.477 JST [3838739] LOG:  starting logical decoding for slot
> "test_sub"
> 2024-04-04 16:15:19.477 JST [3838739] DETAIL:  Streaming transactions
> committing after 0/50006F48, reading WAL from 0/50006F10.
> 2024-04-04 16:15:19.477 JST [3838739] LOG:  logical decoding found
> consistent point at 0/50006F10
> 2024-04-04 16:15:19.477 JST [3838739] DETAIL:  There are no running
> transactions.
> 
> For example, I could reproduce it with the following steps:
> 
> 1. create the primary and start.
> 2. run "pgbench -i -s 100" on the primary.
> 3. run pg_basebackup to create the standby.
> 4. configure slotsync setup on the standby and start.
> 5. create a publication for all tables on the primary.
> 6. create the subscriber and start.
> 7. run "pgbench -i -Idtpf" on the subscriber.
> 8. create a subscription on the subscriber (initial data copy will start).
> 
> The logical decoding logs were written every 200 ms during the initial data
> synchronization.
> 
> Looking at the new changes for update_local_synced_slot():
...
> We call LogicalSlotAdvanceAndCheckSnapState() if one of confirmed_lsn,
> restart_lsn, and catalog_xmin is different between the remote slot and the 
> local
> slot. In my test case, during the initial sync performing, only catalog_xmin 
> was
> different and there was no serialized snapshot at restart_lsn, and the 
> slotsync
> worker called LogicalSlotAdvanceAndCheckSnapState(). However no slot
> properties were changed even after the function and it set slot_updated = 
> true.
> So it starts the next slot synchronization after 200ms.

I was trying to reproduce this and check why the catalog_xmin differs between
the synced slot and the remote slot, but I was not able to reproduce a case
where there are lots of logical decoding logs. The script I used is attached.

Would it be possible for you to share the script you used to reproduce this
issue? Alternatively, could you please share the log files from both the
primary and standby servers after reproducing the problem? (It would be very
helpful if you could set the log level to DEBUG2.)

Best Regards,
Hou zj


test.sh
Description: test.sh


RE: Synchronizing slots from primary to standby

2024-04-08 Thread Zhijie Hou (Fujitsu)
On Monday, April 8, 2024 6:32 PM Amit Kapila  wrote:
> 
> On Mon, Apr 8, 2024 at 12:19 PM Zhijie Hou (Fujitsu) 
> wrote:
> >
> > On Saturday, April 6, 2024 12:43 PM Amit Kapila 
> wrote:
> > > On Fri, Apr 5, 2024 at 8:05 PM Bertrand Drouvot
> > >  wrote:
> > >
> > > Yeah, that could be the first step. We can probably add an injection
> > > point to control the bgwrite behavior and then add tests involving
> > > walsender performing the decoding. But I think it is important to
> > > have sufficient tests in this area as I see they are quite helpful in 
> > > uncovering
> the issues.
> >
> > Here is the patch to drop the subscription in the beginning so that
> > the restart_lsn of the lsub1_slot won't be advanced due to concurrent
> > xl_running_xacts from bgwriter. The subscription will be re-created
> > after all the slots are sync-ready. I think maybe we can use this to
> > stabilize the test as a first step and then think about how to make
> > use of injection point to add more tests if it's worth it.
> >
> 
> Pushed.

Thanks for pushing.

I checked the BF status and noticed one BF failure, which I think is related
to an omission in the test code.
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=adder&dt=2024-04-08%2012%3A04%3A27

From the following log, I can see that the sync failed because the standby is
lagging behind the failover slot.

-
# No postmaster PID for node "cascading_standby"
error running SQL: 'psql::1: ERROR:  skipping slot synchronization as 
the received slot sync LSN 0/4000148 for slot "snap_test_slot" is ahead of the 
standby position 0/4000114'
while running 'psql -XAtq -d port=50074 host=/tmp/t4HQFlrDmI dbname='postgres' 
-f - -v ON_ERROR_STOP=1' with sql 'SELECT pg_sync_replication_slots();' at 
/home/bf/bf-build/adder/HEAD/pgsql/src/test/perl/PostgreSQL/Test/Cluster.pm 
line 2042.
# Postmaster PID for node "publisher" is 3715298
-

I think it's because we missed calling wait_for_replay_catchup before syncing
the slots.

-
$primary->safe_psql('postgres',
"SELECT pg_create_logical_replication_slot('snap_test_slot', 
'test_decoding', false, false, true);"
);
# ? missed to wait here
$standby1->safe_psql('postgres', "SELECT pg_sync_replication_slots();");
-
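
A minimal sketch of the missing call (assuming $standby1 is the standby's
PostgreSQL::Test::Cluster handle), placed at the "# ? missed to wait here"
point so the standby replays past the new slot's position before the sync:

$primary->wait_for_replay_catchup($standby1);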

While testing, I noticed another place where we call wait_for_replay_catchup
before doing pg_replication_slot_advance, which also has a small chance of
leaving the failover slot ahead of the standby if some WAL is written between
these two steps. So I adjusted that as well.

Here is a small patch to improve the test.

Best Regards,
Hou zj


0001-Ensure-the-standby-is-not-lagging-behind-the-failove.patch
Description:  0001-Ensure-the-standby-is-not-lagging-behind-the-failove.patch


RE: Synchronizing slots from primary to standby

2024-04-07 Thread Zhijie Hou (Fujitsu)
On Saturday, April 6, 2024 12:43 PM Amit Kapila  wrote:
> On Fri, Apr 5, 2024 at 8:05 PM Bertrand Drouvot
>  wrote:
> >
> > On Fri, Apr 05, 2024 at 06:23:10PM +0530, Amit Kapila wrote:
> > > On Fri, Apr 5, 2024 at 5:17 PM Amit Kapila 
> wrote:
> > > Thinking more on this, it doesn't seem related to
> > > c9920a9068eac2e6c8fb34988d18c0b42b9bf811 as that commit doesn't
> > > change any locking or something like that which impacts write positions.
> >
> > Agree.
> >
> > > I think what has happened here is that running_xact record written
> > > by the background writer [1] is not written to the kernel or disk
> > > (see LogStandbySnapshot()), before pg_current_wal_lsn() checks the
> > > current_lsn to be compared with replayed LSN.
> >
> > Agree, I think it's not visible through pg_current_wal_lsn() yet.
> >
> > Also I think that the DEBUG message in LogCurrentRunningXacts()
> >
> > "
> > elog(DEBUG2,
> >  "snapshot of %d+%d running transaction ids (lsn %X/%X oldest
> xid %u latest complete %u next xid %u)",
> >  CurrRunningXacts->xcnt, CurrRunningXacts->subxcnt,
> >  LSN_FORMAT_ARGS(recptr),
> >  CurrRunningXacts->oldestRunningXid,
> >  CurrRunningXacts->latestCompletedXid,
> >  CurrRunningXacts->nextXid); "
> >
> > should be located after the XLogSetAsyncXactLSN() call. Indeed, the
> > new LSN is visible after the spinlock (XLogCtl->info_lck) in
> > XLogSetAsyncXactLSN() is released,
> >
> 
> I think the new LSN can be visible only when the corresponding WAL is written
> by XLogWrite(). I don't know what in XLogSetAsyncXactLSN() can make it
> visible. In your experiment below, isn't it possible that in the meantime WAL
> writer has written that WAL due to which you are seeing an updated location?
> 
> >see:
> >
> > \watch on Session 1 provides:
> >
> >  pg_current_wal_lsn
> > 
> >  0/87D110
> > (1 row)
> >
> > Until:
> >
> > Breakpoint 2, XLogSetAsyncXactLSN (asyncXactLSN=8900936) at
> xlog.c:2579
> > 2579XLogRecPtr  WriteRqstPtr = asyncXactLSN;
> > (gdb) n
> > 2581boolwakeup = false;
> > (gdb)
> > 2584SpinLockAcquire(&XLogCtl->info_lck);
> > (gdb)
> > 2585RefreshXLogWriteResult(LogwrtResult);
> > (gdb)
> > 2586sleeping = XLogCtl->WalWriterSleeping;
> > (gdb)
> > 2587prevAsyncXactLSN = XLogCtl->asyncXactLSN;
> > (gdb)
> > 2588if (XLogCtl->asyncXactLSN < asyncXactLSN)
> > (gdb)
> > 2589XLogCtl->asyncXactLSN = asyncXactLSN;
> > (gdb)
> > 2590SpinLockRelease(&XLogCtl->info_lck);
> > (gdb) p p/x (uint32) XLogCtl->asyncXactLSN
> > $1 = 0x87d148
> >
> > Then session 1 provides:
> >
> >  pg_current_wal_lsn
> > 
> >  0/87D148
> > (1 row)
> >
> > So, when we see in the log:
> >
> > 2024-04-05 04:37:05.074 UTC [3854278][background writer][:0] DEBUG:
> > snapshot of 0+0 running transaction ids (lsn 0/398 oldest xid 740
> > latest complete 739 next xid 740)
> > 2024-04-05 04:37:05.197 UTC [3866475][client backend][2/4:0] LOG:
> statement: SELECT '0/360' <= replay_lsn AND state = 'streaming'
> >
> > It's indeed possible that the new LSN was not visible yet (spinlock
> > not released?) before the query began (because we can not rely on the
> > time the DEBUG message has been emitted).
> >
> > > Note that the reason why
> > > walsender has picked the running_xact written by background writer
> > > is because it has checked after pg_current_wal_lsn() query, see LOGs [2].
> > > I think we can probably try to reproduce manually via debugger.
> > >
> > > If this theory is correct
> >
> > It think it is.
> >
> > > then I think we will need to use injection points to control the
> > > behavior of bgwriter or use the slots created via SQL API for
> > > syncing in tests.
> > >
> > > Thoughts?
> >
> > I think that maybe as a first step we should move the "elog(DEBUG2,"
> > message as proposed above to help debugging (that could help to confirm
> the above theory).
> >
> 
> I think I am missing how exactly moving DEBUG2 can confirm the above theory.
> 
> > If the theory is proven then I'm not sure we need the extra complexity
> > of injection point here, maybe just relying on the slots created via
> > SQL API could be enough.
> >
> 
> Yeah, that could be the first step. We can probably add an injection point to
> control the bgwrite behavior and then add tests involving walsender
> performing the decoding. But I think it is important to have sufficient tests 
> in
> this area as I see they are quite helpful in uncovering the issues.

Here is the patch to drop the subscription at the beginning so that the
restart_lsn of the lsub1_slot won't be advanced due to concurrent
xl_running_xacts from the bgwriter. The subscription will be re-created after
all the slots are sync-ready. I think maybe we can use this to stabilize the
test as a first step and then think about how to make use of injection points
to add more tests if it's worth it.

Best Regards,
Hou zj

RE: Synchronizing slots from primary to standby

2024-04-02 Thread Zhijie Hou (Fujitsu)
On Tuesday, April 2, 2024 8:49 PM Bharath Rupireddy 
 wrote:
> 
> On Tue, Apr 2, 2024 at 2:11 PM Zhijie Hou (Fujitsu)
>  wrote:
> >
> > CFbot[1] complained about one query result's order in the tap-test, so I am
> > attaching a V7 patch set which fixed this. There are no changes in 0001.
> >
> > [1] https://cirrus-ci.com/task/6375962162495488
> 
> Thanks. Here are some comments:

Thanks for the comments.

> 
> 1. Can we just remove pg_logical_replication_slot_advance and use
> LogicalSlotAdvanceAndCheckSnapState instead? If worried about the
> function naming, LogicalSlotAdvanceAndCheckSnapState can be renamed to
> pg_logical_replication_slot_advance?
> 
> + * Advance our logical replication slot forward. See
> + * LogicalSlotAdvanceAndCheckSnapState for details.
>   */
>  static XLogRecPtr
>  pg_logical_replication_slot_advance(XLogRecPtr moveto)
>  {

It was commented[1] that it's not appropriate for
pg_logical_replication_slot_advance() to have an out parameter
'ready_for_decoding', which looks a bit awkward since the functionality
doesn't match the name and is also inconsistent with the style of
pg_physical_replication_slot_advance(). So we added a new function.

> 
> 2.
> +if (!ready_for_decoding)
> +{
> +elog(DEBUG1, "could not find consistent point for synced
> slot; restart_lsn = %X/%X",
> + LSN_FORMAT_ARGS(slot->data.restart_lsn));
> 
> Can we specify the slot name in the message?

Added.

> 
> 3. Also, can the "could not find consistent point for synced slot;
> restart_lsn = %X/%X" be emitted at LOG level just like other messages
> in update_and_persist_local_synced_slot. Although, I see "XXX should
> this be changed to elog(DEBUG1) perhaps?", these messages need to be
> at LOG level as they help debug issues if at all they are hit.

Changed to LOG and reworded the message.

> 
> 4. How about using found_consistent_snapshot instead of
> ready_for_decoding? A general understanding is that the synced slots
> are not allowed for decoding (although with this fix, we do that for
> internal purposes), ready_for_decoding looks a bit misleading.

Agreed and renamed.

> 
> 5. As far as the test case for this issue is concerned, I'm fine with
> adding one using an INJECTION point because we seem to be having no
> consistent way to control postgres writing current snapshot to WAL.

Since my colleagues and I can reproduce the issue consistently after applying
0002, and concurrent xl_running_xacts records should be rare, we discussed[2]
considering the addition of the injection point after pushing the main fix.

> 
> 6. A nit: can we use "fast_forward mode" instead of "fast-forward
> mode" just to be consistent?
> + * logical changes unless we are in fast-forward mode where no changes
> are
> 
> 7.
> +/*
> + * We need to access the system tables during decoding to build the
> + * logical changes unless we are in fast-forward mode where no changes
> are
> + * generated.
> + */
> +if (slot->data.database != MyDatabaseId && !fast_forward)
> 
> May I know if we need this change for this fix?

The slotsync worker needs to advance slots from different databases in
fast_forward mode, so we need to skip this check in that mode. The analysis
can be found in [3].

Attached is the V8 patch, which addresses the above comments.


[1] 
https://www.postgresql.org/message-id/CAA4eK1%2BwkaRi2BrLLC_0gKbHN68Awc9dRp811G3An6A6fuqdOg%40mail.gmail.com
[2] 
https://www.postgresql.org/message-id/ZgvI9iAUWCZ17z5V%40ip-10-97-1-34.eu-west-3.compute.internal
[3] 
https://www.postgresql.org/message-id/CAJpy0uCQ2PDCAqcnbdOz6q_ZqmBfMyBpVqKDqL_XZBP%3DeK-1yw%40mail.gmail.com

Best Regards,
Hou zj


v8-0002-test-the-data-loss-case.patch
Description: v8-0002-test-the-data-loss-case.patch


v8-0001-advance-the-restart_lsn-of-synced-slots-using-log.patch
Description:  v8-0001-advance-the-restart_lsn-of-synced-slots-using-log.patch


RE: Synchronizing slots from primary to standby

2024-04-02 Thread Zhijie Hou (Fujitsu)
On Tuesday, April 2, 2024 3:21 PM Zhijie Hou (Fujitsu)  
wrote:
> On Tuesday, April 2, 2024 8:35 AM Zhijie Hou (Fujitsu)
>  wrote:
> >
> > On Monday, April 1, 2024 7:30 PM Amit Kapila 
> > wrote:
> > >
> > > On Mon, Apr 1, 2024 at 6:26 AM Zhijie Hou (Fujitsu)
> > > 
> > > wrote:
> > > >
> > > > On Friday, March 29, 2024 2:50 PM Amit Kapila
> > > > 
> > > wrote:
> > > > >
> > > >
> > > > >
> > > > >
> > > > > 2.
> > > > > +extern XLogRecPtr
> > > > > +pg_logical_replication_slot_advance(XLogRecPtr
> > > moveto,
> > > > > +   bool *found_consistent_point);
> > > > > +
> > > > >
> > > > > This API looks a bit awkward as the functionality doesn't match
> > > > > the name. How about having a function with name
> > > > > LogicalSlotAdvanceAndCheckReadynessForDecoding(moveto,
> > > > > ready_for_decoding) with the same functionality as your patch
> > > > > has for
> > > > > pg_logical_replication_slot_advance() and then invoke it both
> > > > > from pg_logical_replication_slot_advance and slotsync.c. The
> > > > > function name is too big, we can think of a shorter name. Any ideas?
> > > >
> > > > How about LogicalSlotAdvanceAndCheckDecodingState() Or just
> > > > LogicalSlotAdvanceAndCheckDecoding()?
> > > >
> > >
> > > It is about snapbuild state, so how about naming the function as
> > > LogicalSlotAdvanceAndCheckSnapState()?
> >
> > It looks better to me, so changed.
> >
> > >
> > > I have made quite a few cosmetic changes in comments and code. See
> > > attached. This is atop your latest patch. Can you please review and
> > > include these changes in the next version?
> >
> > Thanks, I have reviewed and merged them.
> > Attach the V5 patch set which addressed above comments and ran pgindent.
> 
> I added one test in 040_standby_failover_slots_sync.pl in 0002 patch, which
> can reproduce the data loss issue consistently on my machine. It may not
> reproduce in some rare cases if concurrent xl_running_xacts are written by
> bgwriter, but I think it's still valuable if it can verify the fix in most
> cases. The test will fail if directly applied on HEAD, and will pass after
> applying atop of 0001.

CFbot[1] complained about one query result's order in the tap-test, so I am
attaching a V7 patch set which fixed this. There are no changes in 0001.

[1] https://cirrus-ci.com/task/6375962162495488

Best Regards,
Hou zj



v7-0002-test-the-data-loss-case.patch
Description: v7-0002-test-the-data-loss-case.patch


v7-0001-advance-the-restart_lsn-of-synced-slots-using-log.patch
Description:  v7-0001-advance-the-restart_lsn-of-synced-slots-using-log.patch


RE: Synchronizing slots from primary to standby

2024-04-02 Thread Zhijie Hou (Fujitsu)
On Tuesday, April 2, 2024 8:35 AM Zhijie Hou (Fujitsu)  
wrote:
> 
> On Monday, April 1, 2024 7:30 PM Amit Kapila 
> wrote:
> >
> > On Mon, Apr 1, 2024 at 6:26 AM Zhijie Hou (Fujitsu)
> > 
> > wrote:
> > >
> > > On Friday, March 29, 2024 2:50 PM Amit Kapila
> > > 
> > wrote:
> > > >
> > >
> > > >
> > > >
> > > > 2.
> > > > +extern XLogRecPtr pg_logical_replication_slot_advance(XLogRecPtr moveto,
> > > > +   bool *found_consistent_point);
> > > > +
> > > >
> > > > This API looks a bit awkward as the functionality doesn't match
> > > > the name. How about having a function with name
> > > > LogicalSlotAdvanceAndCheckReadynessForDecoding(moveto,
> > > > ready_for_decoding) with the same functionality as your patch has
> > > > for
> > > > pg_logical_replication_slot_advance() and then invoke it both from
> > > > pg_logical_replication_slot_advance and slotsync.c. The function
> > > > name is too big, we can think of a shorter name. Any ideas?
> > >
> > > How about LogicalSlotAdvanceAndCheckDecodingState() or just
> > > LogicalSlotAdvanceAndCheckDecoding()?
> > >
> >
> > It is about snapbuild state, so how about naming the function as
> > LogicalSlotAdvanceAndCheckSnapState()?
> 
> It looks better to me, so changed.
> 
> >
> > I have made quite a few cosmetic changes in comments and code. See
> > attached. This is atop your latest patch. Can you please review and
> > include these changes in the next version?
> 
> Thanks, I have reviewed and merged them.
> Attach the V5 patch set which addressed above comments and ran pgindent.

I added one test in 040_standby_failover_slots_sync.pl in 0002 patch, which can
reproduce the data loss issue consistently on my machine. It may not reproduce
in some rare cases if concurrent xl_running_xacts are written by bgwriter, but
I think it's still valuable if it can verify the fix in most cases. The test
will fail if directly applied on HEAD, and will pass after applying atop of
0001.

Best Regards,
Hou zj


v6-0001-advance-the-restart_lsn-of-synced-slots-using-log.patch
Description:  v6-0001-advance-the-restart_lsn-of-synced-slots-using-log.patch


v6-0002-test-the-data-loss-case.patch
Description: v6-0002-test-the-data-loss-case.patch


RE: Synchronizing slots from primary to standby

2024-04-01 Thread Zhijie Hou (Fujitsu)
On Tuesday, April 2, 2024 8:43 AM Bharath Rupireddy 
 wrote:
> 
> On Mon, Apr 1, 2024 at 11:36 AM Zhijie Hou (Fujitsu) 
> wrote:
> >
> > Attach the V4 patch which includes the optimization to skip the
> > decoding if the snapshot at the syncing restart_lsn is already
> > serialized. It can avoid most of the duplicate decoding in my test, and I am
> > doing some more tests locally.
> 
> Thanks for the patch. I'm thinking if we can reduce the amount of work that
> we do for synced slots in each sync worker cycle. With that context in mind,
> why do we need to create decoding context every time? Can't we create it
> once, store it in an in-memory structure and use it for each sync worker
> cycle? Is there any problem with it? What do you think?

Thanks for the idea. I think the cost of creating the decoding context is
relatively minor when compared to the IO cost. After generating profiles for
the tests shared by Nisha[1], it appears that StartupDecodingContext is not
an issue. While the suggested refactoring is an option, I think we can
consider it as a future improvement and address it only if we encounter
scenarios where the creation of the decoding context becomes a bottleneck.

[1] 
https://www.postgresql.org/message-id/CALj2ACUeij5tFzJ1-cuoUh%2Bmhj33v%2BYgqD_gHYUpRdXSCSBbhw%40mail.gmail.com

Best Regards,
Hou zj


RE: Synchronizing slots from primary to standby

2024-04-01 Thread Zhijie Hou (Fujitsu)
On Monday, April 1, 2024 7:30 PM Amit Kapila  wrote:
> 
> On Mon, Apr 1, 2024 at 6:26 AM Zhijie Hou (Fujitsu) 
> wrote:
> >
> > On Friday, March 29, 2024 2:50 PM Amit Kapila 
> wrote:
> > >
> >
> > >
> > >
> > > 2.
> > > +extern XLogRecPtr pg_logical_replication_slot_advance(XLogRecPtr moveto,
> > > +   bool *found_consistent_point);
> > > +
> > >
> > > This API looks a bit awkward as the functionality doesn't match the
> > > name. How about having a function with name
> > > LogicalSlotAdvanceAndCheckReadynessForDecoding(moveto,
> > > ready_for_decoding) with the same functionality as your patch has
> > > for
> > > pg_logical_replication_slot_advance() and then invoke it both from
> > > pg_logical_replication_slot_advance and slotsync.c. The function
> > > name is too big, we can think of a shorter name. Any ideas?
> >
> > How about LogicalSlotAdvanceAndCheckDecodingState() or just
> > LogicalSlotAdvanceAndCheckDecoding()?
> >
> 
> It is about snapbuild state, so how about naming the function as
> LogicalSlotAdvanceAndCheckSnapState()?

It looks better to me, so changed.

> 
> I have made quite a few cosmetic changes in comments and code. See
> attached. This is atop your latest patch. Can you please review and include
> these changes in the next version?

Thanks, I have reviewed and merged them.
Attach the V5 patch set which addressed above comments and ran pgindent.

I will think about and test the improvement suggested by Bertrand[1] and
reply after that.

[1] 
https://www.postgresql.org/message-id/Zgp8n9QD5nYSESnM%40ip-10-97-1-34.eu-west-3.compute.internal

Best Regards,
Hou zj


v5-0001-advance-the-restart_lsn-of-synced-slots-using-log.patch
Description:  v5-0001-advance-the-restart_lsn-of-synced-slots-using-log.patch


RE: Synchronizing slots from primary to standby

2024-03-31 Thread Zhijie Hou (Fujitsu)
On Monday, April 1, 2024 8:56 AM Zhijie Hou (Fujitsu)  
wrote:
> 
> On Friday, March 29, 2024 2:50 PM Amit Kapila 
> wrote:
> >
> > On Fri, Mar 29, 2024 at 6:36 AM Zhijie Hou (Fujitsu)
> > 
> > wrote:
> > >
> > >
> > > Attach a new version patch which fixed an un-initialized variable
> > > issue and added some comments.
> > >
> >
> > The other approach to fix this issue could be that the slotsync worker
> > gets the serialized snapshot using pg_read_binary_file() corresponding
> > to restart_lsn and writes it at the standby. But there are cases when
> > we won't have such a file like (a) when we initially create the slot
> > and reach the consistent_point, or (b) also by the time the slotsync
> > worker starts to read the remote snapshot file, the snapshot file
> > could have been removed by the checkpointer on the primary (if the
> > restart_lsn of the remote has been advanced in this window). So, in
> > such cases, we anyway need to advance the slot. I think these could be
> optimizations that we could do in the future.
> >
> > Few comments:
> 
> Thanks for the comments.
> 
> > =
> > 1.
> > - if (slot->data.database != MyDatabaseId)
> > + if (slot->data.database != MyDatabaseId && !fast_forward)
> >   ereport(ERROR,
> >   (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> >   errmsg("replication slot \"%s\" was not created in this database",
> > @@ -526,7
> > +527,7 @@ CreateDecodingContext(XLogRecPtr start_lsn,
> >   * Do not allow consumption of a "synchronized" slot until the standby
> >   * gets promoted.
> >   */
> > - if (RecoveryInProgress() && slot->data.synced)
> > + if (RecoveryInProgress() && slot->data.synced &&
> > + !IsSyncingReplicationSlots())
> >
> >
> > Add comments at both of the above places.
> 
> Added.
> 
> >
> >
> > 2.
> > +extern XLogRecPtr pg_logical_replication_slot_advance(XLogRecPtr moveto,
> > +   bool *found_consistent_point);
> > +
> >
> > This API looks a bit awkward as the functionality doesn't match the
> > name. How about having a function with name
> > LogicalSlotAdvanceAndCheckReadynessForDecoding(moveto,
> > ready_for_decoding) with the same functionality as your patch has for
> > pg_logical_replication_slot_advance() and then invoke it both from
> > pg_logical_replication_slot_advance and slotsync.c. The function name
> > is too big, we can think of a shorter name. Any ideas?
> 
> How about LogicalSlotAdvanceAndCheckDecodingState() or just
> LogicalSlotAdvanceAndCheckDecoding()? (I used the suggested
> LogicalSlotAdvanceAndCheckReadynessForDecoding in this version; it can be
> renamed in the next version if we agree.)
> 
> Attach the V3 patch which addressed above comments and Kuroda-san's
> comments[1]. I also adjusted the tap-test to only check the
> confirmed_flush_lsn after syncing, as the restart_lsn could be different
> from the remote one due to the new slot_advance() call. I am also testing
> some optimization idea locally and will share if ready.

Attach the V4 patch which includes the optimization to skip the decoding if
the snapshot at the syncing restart_lsn is already serialized. It can avoid most
of the duplicate decoding in my test, and I am doing some more tests locally.

Best Regards,
Hou zj


v4-0001-advance-the-restart_lsn-of-synced-slots-using-log.patch
Description:  v4-0001-advance-the-restart_lsn-of-synced-slots-using-log.patch


RE: Synchronizing slots from primary to standby

2024-03-31 Thread Zhijie Hou (Fujitsu)
On Friday, March 29, 2024 2:50 PM Amit Kapila  wrote:
> 
> On Fri, Mar 29, 2024 at 6:36 AM Zhijie Hou (Fujitsu) 
> wrote:
> >
> >
> > Attach a new version patch which fixed an un-initialized variable
> > issue and added some comments.
> >
> 
> The other approach to fix this issue could be that the slotsync worker gets
> the serialized snapshot using pg_read_binary_file() corresponding to
> restart_lsn and writes it at the standby. But there are cases when we won't
> have such a file, like (a) when we initially create the slot and reach the
> consistent_point, or (b) also by the time the slotsync worker starts to read
> the remote snapshot file, the snapshot file could have been removed by the
> checkpointer on the primary (if the restart_lsn of the remote has been
> advanced in this window). So, in such cases, we anyway need to advance the
> slot. I think these could be optimizations that we could do in the future.
> 
> Few comments:

Thanks for the comments.

> =
> 1.
> - if (slot->data.database != MyDatabaseId)
> + if (slot->data.database != MyDatabaseId && !fast_forward)
>   ereport(ERROR,
>   (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
>   errmsg("replication slot \"%s\" was not created in this database", @@ -526,7
> +527,7 @@ CreateDecodingContext(XLogRecPtr start_lsn,
>   * Do not allow consumption of a "synchronized" slot until the standby
>   * gets promoted.
>   */
> - if (RecoveryInProgress() && slot->data.synced)
> + if (RecoveryInProgress() && slot->data.synced &&
> + !IsSyncingReplicationSlots())
> 
> 
> Add comments at both of the above places.

Added.

> 
> 
> 2.
> +extern XLogRecPtr pg_logical_replication_slot_advance(XLogRecPtr moveto,
> +   bool *found_consistent_point);
> +
> 
> This API looks a bit awkward as the functionality doesn't match the name. How
> about having a function with name
> LogicalSlotAdvanceAndCheckReadynessForDecoding(moveto,
> ready_for_decoding) with the same functionality as your patch has for
> pg_logical_replication_slot_advance() and then invoke it both from
> pg_logical_replication_slot_advance and slotsync.c. The function name is too
> big, we can think of a shorter name. Any ideas?

How about LogicalSlotAdvanceAndCheckDecodingState() or just
LogicalSlotAdvanceAndCheckDecoding()? (I used the suggested
LogicalSlotAdvanceAndCheckReadynessForDecoding in this version; it can be
renamed in the next version if we agree.)

Attach the V3 patch which addressed above comments and Kuroda-san's
comments[1]. I also adjusted the tap-test to only check the confirmed_flush_lsn
after syncing, as the restart_lsn could be different from the remote one due to
the new slot_advance() call. I am also testing some optimization idea locally
and will share if ready.

[1] 
https://www.postgresql.org/message-id/TYCPR01MB1207757BB2A32B6815CE1CCE7F53A2%40TYCPR01MB12077.jpnprd01.prod.outlook.com

Best Regards,
Hou zj


v3-0001-advance-the-restart_lsn-of-synced-slots-using-log.patch
Description:  v3-0001-advance-the-restart_lsn-of-synced-slots-using-log.patch


RE: Synchronizing slots from primary to standby

2024-03-29 Thread Zhijie Hou (Fujitsu)
On Friday, March 29, 2024 2:48 PM Bertrand Drouvot 
 wrote:
> 
> Hi,
> 
> On Fri, Mar 29, 2024 at 01:06:15AM +0000, Zhijie Hou (Fujitsu) wrote:
> > Attach a new version patch which fixed an un-initialized variable
> > issue and added some comments. Also, temporarily enable DEBUG2 for the
> > 040 tap-test so that we can analyze the possible CFbot failures easily.
> >
> 
> Thanks!
> 
> +   if (remote_slot->confirmed_lsn != slot->data.confirmed_flush)
> +   {
> +       /*
> +        * By advancing the restart_lsn, confirmed_lsn, and xmin using
> +        * fast-forward logical decoding, we ensure that the required
> +        * snapshots are saved to disk. This enables logical decoding to
> +        * quickly reach a consistent point at the restart_lsn, eliminating
> +        * the risk of missing data during snapshot creation.
> +        */
> +       pg_logical_replication_slot_advance(remote_slot->confirmed_lsn,
> +                                           found_consistent_point);
> +       ReplicationSlotsComputeRequiredLSN();
> +       updated_lsn = true;
> +   }
> 
> Instead of using pg_logical_replication_slot_advance() for each synced slot
> and during sync cycles, what about:
> 
> - keep sync slot synchronization as it is currently (not using
> pg_logical_replication_slot_advance())
> - create a "hidden" logical slot if the sync slot feature is on
> - at the time of promotion use pg_logical_replication_slot_advance() on this
> hidden slot only to advance to the max lsn of the synced slots
> 
> I'm not sure that would be enough, just asking your thoughts on this (the
> benefit would be to avoid calling pg_logical_replication_slot_advance() on
> each synced slot and during the sync cycles).

Thanks for the idea!

I considered this. I think advancing the "hidden" slot at promotion time may
be a bit late, because if we cannot reach a consistent point after advancing
the "hidden" slot, it means we may need to remove all the synced slots, as we
are not sure they are usable (will not lose data) after promotion. And it may
confuse users a bit, as they have already seen these slots as sync-ready.

The current approach is to mark such a not-yet-consistent slot as temporary
and persist it once it reaches a consistent point, so that users can be sure
a persisted slot is usable after promotion.
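As a rough sketch of that flow (the function name is hypothetical, and
found_consistent_point is assumed to come back from the fast-forward advance
call):

#include "postgres.h"
#include "replication/slot.h"

/*
 * Illustrative sketch only: persist a synced slot only once fast-forward
 * decoding has confirmed a consistent snapshot at its restart_lsn.
 */
static void
persist_synced_slot_sketch(bool found_consistent_point)
{
	if (!found_consistent_point)
	{
		/*
		 * Leave the slot temporary (RS_TEMPORARY); it will be retried in
		 * the next sync cycle and dropped automatically on restart, so
		 * users never see a slot that could lose data after promotion.
		 */
		elog(DEBUG1, "synced slot not yet consistent; keeping it temporary");
		return;
	}

	/* Safe to expose: mark the slot persistent (RS_PERSISTENT) on disk. */
	ReplicationSlotPersist();
}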


Another optimization idea is to check for the snapshot file's existence before
calling the slot_advance(). If the file already exists, we skip the decoding
and directly update the restart_lsn. This way, we could also avoid some
duplicate decoding work.
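For illustration, a minimal sketch of such a check, assuming the
serialized-snapshot naming convention used by snapbuild.c
(pg_logical/snapshots/%X-%X.snap); the helper name is hypothetical:

#include <unistd.h>

#include "postgres.h"
#include "access/xlogdefs.h"

/*
 * Hypothetical helper: return true if a serialized snapshot already exists
 * for the given restart_lsn, letting the caller skip fast-forward decoding
 * and update the slot's restart_lsn directly.
 */
static bool
restart_lsn_snapshot_exists_sketch(XLogRecPtr restart_lsn)
{
	char		path[MAXPGPATH];

	snprintf(path, sizeof(path), "pg_logical/snapshots/%X-%X.snap",
			 LSN_FORMAT_ARGS(restart_lsn));

	return access(path, F_OK) == 0;
}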

Best Regards,
Hou zj



