Re: Clock-skew management in logical replication
On Mon, Sep 23, 2024 at 4:00 PM Nisha Moond wrote:
> On Fri, Sep 20, 2024 at 7:51 PM Tom Lane wrote:
> > Nisha Moond writes:
> > > While considering the implementation of timestamp-based conflict
> > > resolution (last_update_wins) in logical replication (see [1]), there
> > > was feedback at [2] and a discussion on whether or not to manage
> > > clock-skew at the database level.
> >
> > FWIW, I cannot see why we would do anything beyond suggesting that
> > people run NTP.  That's standard anyway on the vast majority of
> > machines these days.  Why would we add complexity that we have
> > to maintain (and document) in order to cater to somebody not doing
> > that?
> >
> > regards, tom lane
>
> Thank you for your response.
>
> I agree with suggesting that users run NTP, and we can recommend it in
> the docs rather than introducing additional complexity.
>
> In my research on setting up NTP servers on Linux, I found that
> Chrony[1] is a lightweight and efficient solution for time
> synchronization across nodes. Another reliable option is the classic
> NTP daemon (ntpd)[2], which is also easy to configure and maintain.
> Both Chrony and ntpd can be used to configure a local machine as an
> NTP server for localized time synchronization, or as clients syncing
> from public NTP servers such as 'ntp.ubuntu.com' (the default NTP
> server pool for Ubuntu systems) or 'time.google.com' (Google Public
> NTP). For example, on Ubuntu, Chrony is straightforward to install
> and configure[3]. Comprehensive NTP (ntpd) configuration guides are
> available for various Linux distributions, such as Ubuntu[4] and
> Red Hat Linux[5].
>
> Further, I'm exploring options for implementing NTP on Windows systems.

Windows platforms provide built-in time synchronization services. As a
client, they allow users to sync system time using internet or public
NTP servers. This can be easily configured by selecting a public NTP
server directly in the Date and Time settings. More details can be
found at [1].

Additionally, Windows servers can be configured as NTP servers for
localized time synchronization within a network, allowing other nodes
to sync with them. Further instructions on configuring an NTP server on
Windows can be found at [2].

[1] https://learn.microsoft.com/en-us/windows-server/networking/windows-time-service/how-the-windows-time-service-works
[2] https://learn.microsoft.com/en-us/troubleshoot/windows-server/active-directory/configure-authoritative-time-server

Thanks,
Nisha
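As a concrete illustration of the Linux setup described above, a minimal Chrony configuration might look like the following. The server names and the subnet here are examples only; [3] remains the authoritative guide:

```
# /etc/chrony/chrony.conf -- minimal illustrative setup

# Sync this machine from public NTP servers (example sources)
pool ntp.ubuntu.com iburst
server time.google.com iburst

# Optionally serve time to other nodes on a local network,
# turning this machine into a localized NTP server:
# allow 192.168.0.0/24
```

After editing the file, restarting the chrony service and checking `chronyc tracking` shows the current offset from the selected source.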
Re: Clock-skew management in logical replication
On Fri, Sep 20, 2024 at 7:51 PM Tom Lane wrote:
> Nisha Moond writes:
> > While considering the implementation of timestamp-based conflict
> > resolution (last_update_wins) in logical replication (see [1]), there
> > was feedback at [2] and a discussion on whether or not to manage
> > clock-skew at the database level.
>
> FWIW, I cannot see why we would do anything beyond suggesting that
> people run NTP.  That's standard anyway on the vast majority of
> machines these days.  Why would we add complexity that we have
> to maintain (and document) in order to cater to somebody not doing
> that?
>
> regards, tom lane

Thank you for your response.

I agree with suggesting that users run NTP, and we can recommend it in
the docs rather than introducing additional complexity.

In my research on setting up NTP servers on Linux, I found that
Chrony[1] is a lightweight and efficient solution for time
synchronization across nodes. Another reliable option is the classic
NTP daemon (ntpd)[2], which is also easy to configure and maintain.
Both Chrony and ntpd can be used to configure a local machine as an NTP
server for localized time synchronization, or as clients syncing from
public NTP servers such as 'ntp.ubuntu.com' (the default NTP server
pool for Ubuntu systems) or 'time.google.com' (Google Public NTP). For
example, on Ubuntu, Chrony is straightforward to install and
configure[3]. Comprehensive NTP (ntpd) configuration guides are
available for various Linux distributions, such as Ubuntu[4] and
Red Hat Linux[5].

Further, I'm exploring options for implementing NTP on Windows systems.

[1] https://chrony-project.org/index.html
[2] https://www.ntp.org/documentation/4.2.8-series/
[3] https://documentation.ubuntu.com/server/how-to/networking/serve-ntp-with-chrony/
[4] https://askubuntu.com/questions/14558/how-do-i-setup-a-local-ntp-server
[5] https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/7/html/system_administrators_guide/ch-configuring_ntp_using_ntpd#s1-Understanding_the_ntpd_Configuration_File

Thanks,
Nisha
Clock-skew management in logical replication
Hello Hackers,
(CC people involved in the earlier discussion)

While considering the implementation of timestamp-based conflict
resolution (last_update_wins) in logical replication (see [1]), there
was feedback at [2] and a discussion on whether or not to manage
clock-skew at the database level.

We tried to research the history of clock-skew related discussions in
Postgres itself and summarized that at [3]. We also analyzed how other
databases deal with it. Based on our research, classic RDBMSs like
Oracle and IBM, which use similar timestamp-based resolution methods,
do not address clock-skew at the database level. Instead, they
recommend using external time synchronization solutions, such as NTP.

- Oracle, while handling conflicts[4], assumes clocks are synchronized
  and relies on external tools like NTP for time synchronization
  between nodes[5].
- IBM Informix, similarly, recommends using their network commands to
  ensure clock synchronization across nodes[6].

Other Postgres-derived databases like EDB-BDR and YugabyteDB provide
GUC parameters to manage clock-skew within the database:

- EDB-BDR allows configuration of parameters like
  bdr.maximum_clock_skew and bdr.maximum_clock_skew_action to define
  the acceptable skew and the action to take when it is exceeded[7].
- YugabyteDB offers a max_clock_skew_usec setting, which causes the
  node to crash if the clock-skew exceeds the specified value[8].

There are, of course, other approaches to managing clock-skew used by
distributed systems, such as NTP daemons, centralized logical clocks,
atomic clocks (as in Google Spanner), and time-sync services such as
AWS's[9]. Implementing any of these time-sync services for CDR seems
like quite a deviation and a big project in itself, which we are not
sure is really needed. At best, as an aid to users, we could provide a
GUC-based implementation to handle clock-skew in logical replication.
The idea is that users should handle clock-skew outside of the
database; but in worst-case scenarios, they can rely on these GUCs.

We have attempted to implement a patch which manages clock-skew in
logical replication. It works based on these new GUCs (see [10] for the
detailed discussion):

- max_logical_rep_clock_skew: Defines the tolerable limit for
  clock-skew.
- max_logical_rep_clock_skew_action: Configures the action to take when
  clock-skew exceeds the limit.
- max_logical_rep_clock_skew_wait: Limits the maximum wait time if the
  action is configured as "wait".

The proposed idea is implemented in the attached patch v1. Thank you
Shveta for implementing it. Thanks Kuroda-san for assisting in the
research.

Thoughts? Looking forward to hearing others' opinions!

[1]: https://www.postgresql.org/message-id/CAJpy0uD0-DpYVMtsxK5R%3DzszXauZBayQMAYET9sWr_w0CNWXxQ%40mail.gmail.com
[2]: https://www.postgresql.org/message-id/CAFiTN-uTycjZWdp1kEpN9w7b7SQpoGL5zyg_qZzjpY_vr2%2BKsg%40mail.gmail.com
[3]: https://www.postgresql.org/message-id/CAA4eK1Jn4r-y%2BbkW%3DJaKCbxEz%3DjawzQAS1Z4wAd8jT%2B1B0RL2w%40mail.gmail.com
[4]: https://www.oracle.com/cn/a/tech/docs/technical-resources/wp-oracle-goldengate-activeactive-final2-1.pdf
[5]: https://docs.oracle.com/en/operating-systems/oracle-linux/8/network/network-ConfiguringNetworkTime.html
[6]: https://www.ibm.com/docs/en/informix-servers/14.10?topic=environment-time-synchronization
[7]: https://www.enterprisedb.com/docs/pgd/latest/reference/pgd-settings/#bdrmaximum_clock_skew
[8]: https://support.yugabyte.com/hc/en-us/articles/4403707404173-Too-big-clock-skew-leading-to-error-messages-or-tserver-crashes
[9]: https://aws.amazon.com/about-aws/whats-new/2023/11/amazon-time-sync-service-microsecond-accurate-time/
[10]: https://www.postgresql.org/message-id/CAJpy0uDCW%2BvrBoUZWrBWPjsM%3D9wwpwbpZuZa8Raj3VqeVYs3PQ%40mail.gmail.com

--
Thanks,
Nisha

Attachment: v1-0001-Implements-Clock-skew-management-between-nodes.patch
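For illustration only, the three proposed GUCs might be set in postgresql.conf along the following lines. The values, units, and accepted actions here are examples, not defaults or syntax confirmed by the patch; [10] covers the actual options:

```
# Tolerate up to 5 seconds of clock-skew between nodes (example value)
max_logical_rep_clock_skew = 5s

# Action when the skew limit is exceeded (example: wait for the
# local clock to come within the tolerable range)
max_logical_rep_clock_skew_action = wait

# Upper bound on how long to wait when the action is "wait"
max_logical_rep_clock_skew_wait = 2min
```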
Re: Conflict Detection and Resolution
On Fri, Sep 20, 2024 at 8:40 AM Nisha Moond wrote:
> On Wed, Sep 18, 2024 at 10:46 AM vignesh C wrote:
> > On Thu, 12 Sept 2024 at 14:03, Ajin Cherian wrote:
> > > On Tue, Sep 3, 2024 at 7:42 PM vignesh C wrote:
> > > > On Fri, 30 Aug 2024 at 11:01, Nisha Moond wrote:
> > > > > Here is the v11 patch-set. Changes are:
> > > >
> > > > 1) This command crashes:
> > > > ALTER SUBSCRIPTION name RESET CONFLICT RESOLVER FOR NULL;
> > > > #0  __strcmp_avx2 () at ../sysdeps/x86_64/multiarch/strcmp-avx2.S:116
> > > > #1  0x55c67270600a in ResetConflictResolver (subid=16404,
> > > >     conflict_type=0x0) at conflict.c:744
> > > > #2  0x55c67247e0c3 in AlterSubscription (pstate=0x55c6748ff9d0,
> > > >     stmt=0x55c67497dfe0, isTopLevel=true) at subscriptioncmds.c:1664
> > > >
> > > > +    | ALTER SUBSCRIPTION name RESET CONFLICT RESOLVER FOR conflict_type
> > > > +        {
> > > > +            AlterSubscriptionStmt *n =
> > > > +                makeNode(AlterSubscriptionStmt);
> > > > +
> > > > +            n->kind = ALTER_SUBSCRIPTION_RESET_CONFLICT_RESOLVER;
> > > > +            n->subname = $3;
> > > > +            n->conflict_type = $8;
> > > > +            $$ = (Node *) n;
> > > > +        }
> > > > +    ;
> > > > +conflict_type:
> > > > +    Sconst        { $$ = $1; }
> > > > +    | NULL_P      { $$ = NULL; }
> > > >     ;
> > > >
> > > > May be conflict_type should be changed to:
> > > > +conflict_type:
> > > > +    Sconst        { $$ = $1; }
> > > >     ;
> > >
> > > Fixed.
> >
> > Few comments:
> > 1) This should be in the (fout->remoteVersion >= 18) check to support
> > dumping backward compatible server objects, else dump with older
> > version will fail:
> > +    /* Populate conflict type fields using the new query */
> > +    confQuery = createPQExpBuffer();
> > +    appendPQExpBuffer(confQuery,
> > +                      "SELECT confrtype, confrres FROM pg_catalog.pg_subscription_conflict "
> > +                      "WHERE confsubid = %u;", subinfo[i].dobj.catId.oid);
> > +    confRes = ExecuteSqlQuery(fout, confQuery->data, PGRES_TUPLES_OK);
> > +
> > +    ntuples = PQntuples(confRes);
> > +    for (j = 0; j < ntuples; j++)
> >
> > 2) Can we check and throw an error before the warning is logged in
> > this case as it seems strange to throw a warning first and then an
> > error for the same track_commit_timestamp configuration:
> > postgres=# create subscription sub1 connection ... publication pub1
> > conflict resolver (insert_exists = 'last_update_wins');
> > WARNING:  conflict detection and resolution could be incomplete due to
> > disabled track_commit_timestamp
> > DETAIL:  Conflicts update_origin_differs and delete_origin_differs
> > cannot be detected, and the origin and commit timestamp for the local
> > row will not be logged.
> > ERROR:  resolver last_update_wins requires "track_commit_timestamp" to
> > be enabled
> > HINT:  Make sure the configuration parameter "track_commit_timestamp"
> > is set.
>
> Thanks for the review.
> Here is the v14 patch-set fixing review comments in [1] and [2].

Just noticed that there are failures for the new 034_conflict_resolver.pl
test on CFbot. From the initial review it seems to be a test issue and
not a bug. We will fix these along with the next version of the
patch-sets.

Thanks,
Nisha
Re: Conflict Detection and Resolution
On Fri, Sep 20, 2024 at 8:40 AM Nisha Moond wrote:
> On Wed, Sep 18, 2024 at 10:46 AM vignesh C wrote:
> > On Thu, 12 Sept 2024 at 14:03, Ajin Cherian wrote:
> > > On Tue, Sep 3, 2024 at 7:42 PM vignesh C wrote:
> > > > On Fri, 30 Aug 2024 at 11:01, Nisha Moond wrote:
> > > > > Here is the v11 patch-set. Changes are:
> > > >
> > > > 1) This command crashes:
> > > > ALTER SUBSCRIPTION name RESET CONFLICT RESOLVER FOR NULL;
> > > > #0  __strcmp_avx2 () at ../sysdeps/x86_64/multiarch/strcmp-avx2.S:116
> > > > #1  0x55c67270600a in ResetConflictResolver (subid=16404,
> > > >     conflict_type=0x0) at conflict.c:744
> > > > #2  0x55c67247e0c3 in AlterSubscription (pstate=0x55c6748ff9d0,
> > > >     stmt=0x55c67497dfe0, isTopLevel=true) at subscriptioncmds.c:1664
> > > >
> > > > +    | ALTER SUBSCRIPTION name RESET CONFLICT RESOLVER FOR conflict_type
> > > > +        {
> > > > +            AlterSubscriptionStmt *n =
> > > > +                makeNode(AlterSubscriptionStmt);
> > > > +
> > > > +            n->kind = ALTER_SUBSCRIPTION_RESET_CONFLICT_RESOLVER;
> > > > +            n->subname = $3;
> > > > +            n->conflict_type = $8;
> > > > +            $$ = (Node *) n;
> > > > +        }
> > > > +    ;
> > > > +conflict_type:
> > > > +    Sconst        { $$ = $1; }
> > > > +    | NULL_P      { $$ = NULL; }
> > > >     ;
> > > >
> > > > May be conflict_type should be changed to:
> > > > +conflict_type:
> > > > +    Sconst        { $$ = $1; }
> > > >     ;
> > >
> > > Fixed.
> >
> > Few comments:
> > 1) This should be in the (fout->remoteVersion >= 18) check to support
> > dumping backward compatible server objects, else dump with older
> > version will fail:
> > +    /* Populate conflict type fields using the new query */
> > +    confQuery = createPQExpBuffer();
> > +    appendPQExpBuffer(confQuery,
> > +                      "SELECT confrtype, confrres FROM pg_catalog.pg_subscription_conflict "
> > +                      "WHERE confsubid = %u;", subinfo[i].dobj.catId.oid);
> > +    confRes = ExecuteSqlQuery(fout, confQuery->data, PGRES_TUPLES_OK);
> > +
> > +    ntuples = PQntuples(confRes);
> > +    for (j = 0; j < ntuples; j++)
> >
> > 2) Can we check and throw an error before the warning is logged in
> > this case as it seems strange to throw a warning first and then an
> > error for the same track_commit_timestamp configuration:
> > postgres=# create subscription sub1 connection ... publication pub1
> > conflict resolver (insert_exists = 'last_update_wins');
> > WARNING:  conflict detection and resolution could be incomplete due to
> > disabled track_commit_timestamp
> > DETAIL:  Conflicts update_origin_differs and delete_origin_differs
> > cannot be detected, and the origin and commit timestamp for the local
> > row will not be logged.
> > ERROR:  resolver last_update_wins requires "track_commit_timestamp" to
> > be enabled
> > HINT:  Make sure the configuration parameter "track_commit_timestamp"
> > is set.
>
> Thanks for the review.
> Here is the v14 patch-set fixing review comments in [1] and [2].

Clarification: The fixes for the mentioned comments from Vignesh - [1]
and [2] - are in patch-001. Thank you Ajin for providing the changes.

> New in patches:
> 1) Added partition table tests in 034_conflict_resolver.pl in the 002
> and 003 patches.
> 2) 003 has a bug fix for update_exists conflict resolution on
> partitioned tables.
>
> [1]: https://www.postgresql.org/message-id/CALDaNm3es1JqU8Qcv5Yw%3D7Ts2dOvaV8a_boxPSdofB%2BDTx1oFg%40mail.gmail.com
> [2]: https://www.postgresql.org/message-id/CALDaNm18HuAcNsEC47J6qLRC7rMD2Q9_wT_hFtcc4UWqsfkgjA%40mail.gmail.com
>
> Thanks,
> Nisha
Re: Conflict Detection and Resolution
On Fri, Sep 6, 2024 at 2:05 PM Ajin Cherian wrote:
> On Thu, Aug 29, 2024 at 2:50 PM shveta malik wrote:
> > On Wed, Aug 28, 2024 at 4:07 PM shveta malik wrote:
> > > On Wed, Aug 28, 2024 at 10:30 AM Ajin Cherian wrote:
> > >
> > > The review is WIP. Please find a few comments on patch001.
> >
> > More comments on patch001 in continuation of the previous comments:
>
> Thank you for your feedback, Shveta. I've addressed both sets of
> comments you provided.

Thanks for the patches. I tested the v12-0001 patch, and here are my
comments:

1) An unexpected error occurs when attempting to alter the resolver for
multiple conflict_type(s) in the ALTER SUB...CONFLICT RESOLVER command.
See the examples below:

postgres=# alter subscription sub2 CONFLICT RESOLVER
(update_exists=keep_local, delete_missing=error,
update_origin_differs=error);
ERROR:  unrecognized node type: 1633972341

postgres=# alter subscription sub2 CONFLICT RESOLVER
(update_origin_differs=error, update_exists=error);
ERROR:  unrecognized node type: 1633972341

postgres=# alter subscription sub2 CONFLICT RESOLVER
(delete_origin_differs=error, delete_missing=error);
ERROR:  unrecognized node type: 1701602660

postgres=# alter subscription sub2 CONFLICT RESOLVER
(update_exists=keep_local, delete_missing=error);
ALTER SUBSCRIPTION

-- It appears that the error occurs only when at least two conflict
types belong to the same category, either UPDATE or DELETE.

2) Given the above issue, it would be beneficial to add a test in
subscription.sql to cover cases where all valid conflict types are set
with appropriate resolvers in both the ALTER and CREATE commands.

Thanks,
Nisha
Re: Commit Timestamp and LSN Inversion issue
On Wed, Sep 4, 2024 at 12:23 PM shveta malik wrote:
> Hello hackers,
> (CC people involved in the earlier discussion)
>
> I would like to discuss the $Subject.
>
> While discussing Logical Replication's Conflict Detection and
> Resolution (CDR) design in [1], it came to our notice that the
> commit LSN and timestamp may not correlate perfectly, i.e., commits
> may happen with LSN1 < LSN2 but with Ts1 > Ts2. This issue may arise
> because, during the commit process, the timestamp (xactStopTimestamp)
> is captured slightly earlier than when space is reserved in the WAL.
>
> ~~
>
> Reproducibility of the conflict-resolution problem due to timestamp
> inversion
>
> It was suggested that timestamp inversion *may* impact time-based
> resolutions such as last_update_wins (targeted to be implemented in
> [1]), as we may end up making wrong decisions if timestamps and LSNs
> are not correctly ordered. And thus we tried some tests but failed to
> find any practical scenario where it could be a problem.
>
> Basically, the proposed conflict resolution is a row-level resolution,
> and to cause the row value to be inconsistent, we need to modify the
> same row in concurrent transactions and commit the changes
> concurrently. But this doesn't seem possible because concurrent
> updates on the same row are disallowed (e.g., the later update will be
> blocked due to the row lock). See [2] for the details.
>
> We tried to give some thought to multi-table cases as well, e.g.,
> update table A with a foreign key and update the table B that table A
> refers to. But the update on table A will block the update on table B
> as well, so we could not reproduce data divergence due to the
> LSN/timestamp mismatch issue there.
>
> ~~
>
> Idea proposed to fix the timestamp inversion issue
>
> There was a suggestion in [3] to acquire the timestamp while reserving
> the space (because that happens in LSN order). The clock would need to
> be monotonic (easy enough with CLOCK_MONOTONIC), but also cheap. The
> main reason it is currently done outside the critical section is that
> gettimeofday() may be quite expensive. There's a concept of a hybrid
> clock, combining "time" and a logical counter, which might be useful
> independently of CDR.
>
> On further analyzing this idea, we found that CLOCK_MONOTONIC can be
> accepted only by clock_gettime(), which has more precision than
> gettimeofday() and thus is theoretically equally or more expensive (we
> plan to test it and post the results). It does not look like a good
> idea to call either of these while holding the spinlock to reserve the
> WAL position. As for the suggested "hybrid clock" solution, it might
> not help here because the logical counter is only used to order
> transactions with the same timestamp. The problem here is how to get
> the timestamp along with the WAL position reservation
> (ReserveXLogInsertLocation).

Here are the tests done to compare clock_gettime() and gettimeofday()
performance.

Machine details:
Intel(R) Xeon(R) CPU E7-4890 v2 @ 2.80GHz; CPU(s): 120; 800GB RAM

Three functions were tested across three different call volumes
(1 million, 100 million, and 1 billion):
1) clock_gettime() with CLOCK_REALTIME
2) clock_gettime() with CLOCK_MONOTONIC
3) gettimeofday()

--> clock_gettime() with CLOCK_MONOTONIC sometimes shows slightly
better performance, but not consistently. The difference in time taken
by all three functions is minimal, with averages varying by no more
than ~2.5%. Overall, the performance of CLOCK_MONOTONIC and
gettimeofday() is essentially the same.

Below are the test results (each test was run twice for consistency):

1) For 1 million calls:
1a) clock_gettime() with CLOCK_REALTIME:
    - Run 1: 0.01770 s, Run 2: 0.01772 s, Average: 0.01771 s
1b) clock_gettime() with CLOCK_MONOTONIC:
    - Run 1: 0.01753 s, Run 2: 0.01748 s, Average: 0.01750 s
1c) gettimeofday():
    - Run 1: 0.01742 s, Run 2: 0.01777 s, Average: 0.01760 s

2) For 100 million calls:
2a) clock_gettime() with CLOCK_REALTIME:
    - Run 1: 1.76649 s, Run 2: 1.76602 s, Average: 1.76625 s
2b) clock_gettime() with CLOCK_MONOTONIC:
    - Run 1: 1.72768 s, Run 2: 1.72988 s, Average: 1.72878 s
2c) gettimeofday():
    - Run 1: 1.72436 s, Run 2: 1.72174 s, Average: 1.72305 s

3) For 1 billion calls:
3a) clock_gettime() with CLOCK_REALTIME:
    - Run 1: 17.63859 s, Run 2: 17.65529 s, Average: 17.64694 s
3b) clock_gettime() with CLOCK_MONOTONIC:
    - Run 1: 17.15109 s, Run 2: 17.27406 s, Average: 17.21257 s
3c) gettimeofday():
    - Run 1: 17.21368 s, Run 2: 17.22983 s, Average: 17.22175 s

Attached are the scripts used for the tests.

--
Thanks,
Nisha
Re: Conflict Detection and Resolution
On Thu, Aug 29, 2024 at 4:43 PM Amit Kapila wrote:
> On Fri, Aug 23, 2024 at 10:39 AM shveta malik wrote:
> > Please find issues which need some thoughts and approval for
> > time-based resolution and clock-skew.
> >
> > 1)
> > Time-based conflict resolution and two-phase transactions:
> >
> > Time-based conflict resolution (last_update_wins) is the one
> > resolution which will not result in data divergence, considering
> > clock-skew is taken care of. But when it comes to two-phase
> > transactions, it might not be the case. For a two-phase transaction,
> > we do not have the commit timestamp when the changes are being
> > applied. Thus for time-based comparison, initially it was decided to
> > use the prepare timestamp, but it may result in data divergence.
> > Please see the example at [1].
> >
> > The example at [1] is a tricky situation, and thus in the initial
> > draft, we decided to restrict usage of 2PC and CDR together. The
> > plan is:
> >
> > a) During CREATE SUBSCRIPTION, if the user has given the
> > last_update_wins resolver for any conflict_type and 'two_phase' is
> > also enabled, we ERROR out.
> > b) During ALTER SUBSCRIPTION, if the user tries to update the
> > resolver to 'last_update_wins' but 'two_phase' is enabled, we error
> > out.
> >
> > Another solution could be to save both prepare_ts and commit_ts. And
> > when any txn comes for conflict resolution, we first check if
> > prepare_ts is available and use that, else use commit_ts.
> > Availability of prepare_ts would indicate it was a prepared txn, and
> > thus even if it is committed, we should use prepare_ts for
> > comparison for consistency. This will have some overhead of storing
> > prepare_ts along with commit_ts. But if the number of prepared txns
> > is reasonably small, this overhead should be small.
>
> Yet another idea is that if the conflict is detected and the
> resolution strategy is last_update_wins, then from that point we start
> writing all the changes to a file, similar to what we do for streaming
> mode, and only once commit_prepared arrives do we read and apply the
> changes. That will solve this problem.
>
> > We currently plan to go with restricting 2PC and last_update_wins
> > together, unless others have different opinions.
>
> Sounds reasonable, but we should add comments on possible solutions
> like the one I have mentioned so that we can extend it afterwards.

Done, the v12-004 patch has the comments for the possible solution.

> > ~~
> >
> > 2)
> > Parallel apply worker and conflict resolution:
> > As discussed in [2] (see the last paragraph in [2]), for streaming
> > of in-progress transactions by a parallel worker, we do not have a
> > commit timestamp with each change, and thus it makes sense to
> > disable the parallel apply worker with CDR. The plan is to not start
> > a parallel apply worker if 'last_update_wins' is configured for any
> > conflict_type.
>
> The other idea is that we can let the changes be written to a file if
> any conflict is detected and then at commit time let the remaining
> changes be applied by the apply worker. This can introduce some
> complexity, so similar to two_phase we can extend this functionality
> later.

The v12-004 patch has the comments to extend it later.

> > ~~
> >
> > 3)
> > Parallel apply worker and clock-skew management:
> > Regarding clock-skew management as discussed in [3], we will wait
> > for the local clock to come within the tolerable range during
> > 'begin' rather than before 'commit'. And this wait needs the commit
> > timestamp at the beginning, thus we plan to restrict starting a
> > pa-worker even when clock-skew related GUCs are configured.
> >
> > Earlier we had restricted both 2PC and parallel apply worker start
> > when detect_conflict was enabled, but now since the detect_conflict
> > parameter is removed, we will change the implementation to restrict
> > all 3 above cases when last_update_wins is configured. When the
> > changes are done, we will post the patch.
>
> At this stage, we are not sure how we want to deal with clock skew.
> There is an argument that clock-skew should be handled outside the
> database, so we can probably have the clock-skew-related stuff in a
> separate patch.

Separated the clock-skew related code into the v12-005 patch.

--
Thanks,
Nisha
Re: Conflict Detection and Resolution
On Mon, Aug 26, 2024 at 2:23 PM Amit Kapila wrote:
> On Thu, Aug 22, 2024 at 3:45 PM shveta malik wrote:
> > On Wed, Aug 21, 2024 at 4:08 PM Nisha Moond wrote:
> > > The patches have been rebased on the latest pgHead following the
> > > merge of the conflict detection patch [1].
> >
> > Thanks for working on patches.
> >
> > Summarizing the issues which need some suggestions/thoughts.
> >
> > 1)
> > For subscription based resolvers, currently the syntax implemented
> > is:
> >
> > 1a)
> > CREATE SUBSCRIPTION
> > CONNECTION PUBLICATION
> > CONFLICT RESOLVER
> > (conflict_type1 = resolver1, conflict_type2 = resolver2,
> > conflict_type3 = resolver3, ...);
> >
> > 1b)
> > ALTER SUBSCRIPTION CONFLICT RESOLVER
> > (conflict_type1 = resolver1, conflict_type2 = resolver2,
> > conflict_type3 = resolver3, ...);
> >
> > Earlier the syntax suggested in [1] was:
> > CREATE SUBSCRIPTION CONNECTION PUBLICATION
> > CONFLICT RESOLVER 'conflict_resolver1' FOR 'conflict_type1',
> > CONFLICT RESOLVER 'conflict_resolver2' FOR 'conflict_type2';
> >
> > I think the currently implemented syntax is good as it has less
> > repetition, unless others think otherwise.
> >
> > ~~
> >
> > 2)
> > For subscription based resolvers, do we need a RESET command to
> > reset resolvers to default? Any one of below or both?
> >
> > 2a) reset all at once:
> > ALTER SUBSCRIPTION RESET CONFLICT RESOLVERS
> >
> > 2b) reset one at a time:
> > ALTER SUBSCRIPTION RESET CONFLICT RESOLVER for 'conflict_type';
> >
> > The issue I see here is, to implement 1a and 1b, we have introduced
> > the 'RESOLVER' keyword. If we want to implement 2a, we will have to
> > introduce the 'RESOLVERS' keyword as well. But we can come up with
> > some alternative syntax if we plan to implement these. Thoughts?
>
> It makes sense to have a RESET on the lines of (a) and (b). At this
> stage, we should do minimal in extending the syntax. How about RESET
> CONFLICT RESOLVER ALL for (a)?

Done, v11 implements the suggested RESET command.

--
Thanks,
Nisha
Re: Conflict Detection and Resolution
On Fri, Aug 23, 2024 at 10:39 AM shveta malik wrote:
> On Thu, Aug 22, 2024 at 3:44 PM shveta malik wrote:
> > For clock-skew and timestamp-based resolution, if needed, I will
> > post another email for the design items where suggestions are
> > needed.
>
> Please find issues which need some thoughts and approval for
> time-based resolution and clock-skew.
>
> 1)
> Time-based conflict resolution and two-phase transactions:
>
> Time-based conflict resolution (last_update_wins) is the one
> resolution which will not result in data divergence, considering
> clock-skew is taken care of. But when it comes to two-phase
> transactions, it might not be the case. For a two-phase transaction,
> we do not have the commit timestamp when the changes are being
> applied. Thus for time-based comparison, initially it was decided to
> use the prepare timestamp, but it may result in data divergence.
> Please see the example at [1].
>
> The example at [1] is a tricky situation, and thus in the initial
> draft, we decided to restrict usage of 2PC and CDR together. The plan
> is:
>
> a) During CREATE SUBSCRIPTION, if the user has given the
> last_update_wins resolver for any conflict_type and 'two_phase' is
> also enabled, we ERROR out.
> b) During ALTER SUBSCRIPTION, if the user tries to update the
> resolver to 'last_update_wins' but 'two_phase' is enabled, we error
> out.
>
> Another solution could be to save both prepare_ts and commit_ts. And
> when any txn comes for conflict resolution, we first check if
> prepare_ts is available and use that, else use commit_ts.
> Availability of prepare_ts would indicate it was a prepared txn, and
> thus even if it is committed, we should use prepare_ts for comparison
> for consistency. This will have some overhead of storing prepare_ts
> along with commit_ts. But if the number of prepared txns is
> reasonably small, this overhead should be small.
>
> We currently plan to go with restricting 2PC and last_update_wins
> together, unless others have different opinions.

Done. v11-004 implements the idea of restricting 2PC and
last_update_wins together.

> ~~
>
> 2)
> Parallel apply worker and conflict resolution:
> As discussed in [2] (see the last paragraph in [2]), for streaming of
> in-progress transactions by a parallel worker, we do not have a
> commit timestamp with each change, and thus it makes sense to disable
> the parallel apply worker with CDR. The plan is to not start a
> parallel apply worker if 'last_update_wins' is configured for any
> conflict_type.

Done.

> ~~
>
> 3)
> Parallel apply worker and clock-skew management:
> Regarding clock-skew management as discussed in [3], we will wait for
> the local clock to come within the tolerable range during 'begin'
> rather than before 'commit'. And this wait needs the commit timestamp
> at the beginning, thus we plan to restrict starting a pa-worker even
> when clock-skew related GUCs are configured.

Done. v11 implements it.

> Earlier we had restricted both 2PC and parallel apply worker start
> when detect_conflict was enabled, but now since the detect_conflict
> parameter is removed, we will change the implementation to restrict
> all 3 above cases when last_update_wins is configured. When the
> changes are done, we will post the patch.
>
> ~~

--
Thanks,
Nisha
Re: Conflict Detection and Resolution
On Fri, Aug 23, 2024 at 10:39 AM shveta malik wrote: > > On Thu, Aug 22, 2024 at 3:44 PM shveta malik wrote: > > > > > > For clock-skew and timestamp based resolution, if needed, I will post > > another email for the design items where suggestions are needed. > > > > Please find issues which need some thoughts and approval for > time-based resolution and clock-skew. > > 1) > Time based conflict resolution and two phase transactions: > > Time based conflict resolution (last_update_wins) is the one > resolution which will not result in data-divergence considering > clock-skew is taken care of. But when it comes to two-phase > transactions, it might not be the case. For two-phase transaction, we > do not have commit timestamp when the changes are being applied. Thus > for time-based comparison, initially it was decided to user prepare > timestamp but it may result in data-divergence. Please see the > example at [1]. > > Example at [1] is a tricky situation, and thus in the initial draft, > we decided to restrict usage of 2pc and CDR together. The plan is: > > a) During Create subscription, if the user has given last_update_wins > resolver for any conflict_type and 'two_phase' is also enabled, we > ERROR out. > b) During Alter subscription, if the user tries to update resolver to > 'last_update_wins' but 'two_phase' is enabled, we error out. > > Another solution could be to save both prepare_ts and commit_ts. And > when any txn comes for conflict resolution, we first check if > prepare_ts is available, use that else use commit_ts. Availability of > prepare_ts would indicate it was a prepared txn and thus even if it is > committed, we should use prepare_ts for comparison for consistency. > This will have some overhead of storing prepare_ts along with > commit_ts. But if the number of prepared txns are reasonably small, > this overhead should be less. > > We currently plan to go with restricting 2pc and last_update_wins > together, unless others have different opinions. 
> ~~
>
> 2)
> Parallel apply worker and conflict resolution:
> As discussed in [2] (see the last paragraph in [2]), for streaming of
> in-progress transactions by the parallel worker, we do not have a
> commit timestamp with each change, and thus it makes sense to disable
> the parallel apply worker with CDR. The plan is to not start the
> parallel apply worker if 'last_update_wins' is configured for any
> conflict_type.
>
> ~~
>
> 3)
> Parallel apply worker and clock-skew management:
> Regarding clock-skew management as discussed in [3], we will wait for
> the local clock to come within the tolerable range during 'begin'
> rather than before 'commit'. This wait needs the commit timestamp at
> the beginning, thus we plan to restrict starting the pa-worker even
> when clock-skew related GUCs are configured.
>
> Earlier we had restricted both 2pc and parallel apply worker start
> when detect_conflict was enabled, but now since the detect_conflict
> parameter is removed, we will change the implementation to restrict
> all 3 above cases when last_update_wins is configured. When the
> changes are done, we will post the patch.
>
> ~~
>
> 4)
>
> Earlier, when 'detect_conflict' was enabled, we were giving a WARNING
> if 'track_commit_timestamp' was not enabled. This was during CREATE
> and ALTER subscription. Now with this parameter removed, this WARNING
> has also been removed. But I think we need to bring back this WARNING.
> Currently the default resolver set may work without
> 'track_commit_timestamp', but when the user gives CONFLICT RESOLVER in
> CREATE SUBSCRIPTION or ALTER SUBSCRIPTION explicitly, configuring the
> resolvers to non-default values (or say any values, it does not matter
> if a few are defaults), we may still emit this warning to alert the
> user:
>
> 2024-07-26 09:14:03.152 IST [195415] WARNING: conflict detection
> could be incomplete due to disabled track_commit_timestamp
> 2024-07-26 09:14:03.152 IST [195415] DETAIL: Conflicts update_differ
> and delete_differ cannot be detected, and the origin and commit
> timestamp for the local row will not be logged.
>
> Thoughts?
>
> If we emit this WARNING during each resolution, then it may flood our
> log files; thus it seems better to emit it during create or alter
> subscription instead of during resolution.

Done. v10 has implemented the suggested warning when a user gives CONFLICT RESOLVER in create-sub or alter-sub explicitly.

Thanks,
Nisha
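The clock-skew wait described in item 3 above ("wait for the local clock to come within tolerable range during 'begin'") can be sketched as below. This is a simplified model under assumed names (the real implementation would use the GUC-configured tolerance and the commit timestamp carried with the remote 'begin' message); the clock and sleep hooks are only there to make the sketch testable.

```python
import time


def wait_for_tolerable_skew(remote_commit_ts, max_skew_seconds,
                            clock=time.time, sleep=time.sleep):
    """Block until the local clock is within max_skew_seconds of the
    remote commit timestamp (all values in Unix seconds).

    Returns the approximate time spent waiting. If the local clock is
    already within range (or ahead), it returns immediately.
    """
    waited = 0.0
    while remote_commit_ts - clock() > max_skew_seconds:
        sleep(0.01)  # poll; a real apply worker would use a latch/timeout
        waited += 0.01
    return waited
```

The point of doing this at 'begin' rather than 'commit' is that the whole transaction is applied only once the local clock can produce timestamps comparable with the remote one.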
Re: Conflict Detection and Resolution
On Mon, Aug 26, 2024 at 2:23 PM Amit Kapila wrote: > > On Thu, Aug 22, 2024 at 3:45 PM shveta malik wrote: > > > > On Wed, Aug 21, 2024 at 4:08 PM Nisha Moond > > wrote: > > > > > > The patches have been rebased on the latest pgHead following the merge > > > of the conflict detection patch [1]. > > > > Thanks for working on patches. > > > > Summarizing the issues which need some suggestions/thoughts. > > > > 1) > > For subscription based resolvers, currently the syntax implemented is: > > > > 1a) > > CREATE SUBSCRIPTION > > CONNECTION PUBLICATION > > CONFLICT RESOLVER > > (conflict_type1 = resolver1, conflict_type2 = resolver2, > > conflict_type3 = resolver3,...); > > > > 1b) > > ALTER SUBSCRIPTION CONFLICT RESOLVER > > (conflict_type1 = resolver1, conflict_type2 = resolver2, > > conflict_type3 = resolver3,...); > > > > Earlier the syntax suggested in [1] was: > > CREATE SUBSCRIPTION CONNECTION PUBLICATION > > CONFLICT RESOLVER 'conflict_resolver1' FOR 'conflict_type1', > > CONFLICT RESOLVER 'conflict_resolver2' FOR 'conflict_type2'; > > > > I think the currently implemented syntax is good as it has less > > repetition, unless others think otherwise. > > > > ~~ > > > > 2) > > For subscription based resolvers, do we need a RESET command to reset > > resolvers to default? Any one of below or both? > > > > 2a) reset all at once: > > ALTER SUBSCRIPTION RESET CONFLICT RESOLVERS > > > > 2b) reset one at a time: > > ALTER SUBSCRIPTION RESET CONFLICT RESOLVER for 'conflict_type'; > > > > The issue I see here is, to implement 1a and 1b, we have introduced > > the 'RESOLVER' keyword. If we want to implement 2a, we will have to > > introduce the 'RESOLVERS' keyword as well. But we can come up with > > some alternative syntax if we plan to implement these. Thoughts? > > > > It makes sense to have a RESET on the lines of (a) and (b). At this > stage, we should do minimal in extending the syntax. How about RESET > CONFLICT RESOLVER ALL for (a)? 
> > > ~~ > > > > 3) Regarding update_exists: > > > > 3a) > > Currently update_exists resolver patch is kept separate. The reason > > being, it performs resolution which will need deletion of multiple > > rows. It will be good to discuss if we want to target this in the > > first draft. Please see the example: > > > > create table tab (a int primary key, b int unique, c int unique); > > > > Pub: insert into tab values (1,1,1); > > Sub: > > insert into tab values (2,20,30); > > insert into tab values (3,40,50); > > insert into tab values (4,60,70); > > > > Pub: update tab set a=2,b=40,c=70 where a=1; > > > > The above 'update' on pub will result in 'update_exists' on sub and if > > resolution is in favour of 'apply', then it will conflict with all the > > three local rows of subscriber due to unique constraint present on all > > three columns. Thus in order to resolve the conflict, it will have to > > delete these 3 rows on sub: > > > > 2,20,30 > > 3,40,50 > > 4,60,70 > > and then update 1,1,1 to 2,40,70. > > > > Just need opinion on if we shall target this in the initial draft. > > > > This case looks a bit complicated. It seems there is no other > alternative than to delete the multiple rows. It is better to create a > separate top-up patch for this and we can discuss in detail about this > once the basic patch is in better shape. v9 onwards the patch-0003 is a separate top-up patch implementing update_exists. > > 3b) > > If we plan to implement this, we need to work on optimal design where > > we can find all the conflicting rows at once and delete those. > > Currently the implementation has been done using recursion i.e. find > > one conflicting row, then delete it and then next and so on i.e. we > > call apply_handle_update_internal() recursively. On initial code > > review, I feel it is doable to scan all indexes at once and get > > conflicting-tuple-ids in one go and get rid of recursion. It can be > > attempted once we decide on 3a. 
> > > > I suggest following the simplest strategy (even if that means calling > the update function recursively) by adding comments on the optimal > strategy. We can optimize it later as well. > > > ~~ > > > > 4) > > Now for insert_exists and update_exists, we are doing a pre-
Re: Conflict Detection and Resolution
On Mon, Aug 26, 2024 at 9:05 AM shveta malik wrote:
> On Wed, Aug 21, 2024 at 4:08 PM Nisha Moond wrote:
> >
> > The patches have been rebased on the latest pgHead following the merge
> > of the conflict detection patch [1]. The detect_conflict option has
> > been removed, and conflict detection is now enabled by default. This
> > change required the following updates in resolver patches:
> >
> > patch-0001:
> > - Removed dependency on the detect_conflict option. Now, default
> > conflict resolvers are set on CREATE SUBSCRIPTION if no values are
> > provided.
> > - To keep the behavior unchanged, the default resolvers are now set as -
> > insert_exists = error
> > update_exists = error
> > update_differ = apply_remote
> > update_missing = skip
> > delete_missing = skip
> > delete_differ = apply_remote
> > - Added documentation for conflict resolvers.
> >
> > patch-0002:
> > - Removed dependency on the detect_conflict option.
> > - Updated test cases in 034_conflict_resolver.pl to reflect new
> > default resolvers and the removal of the detect_conflict option.
> >
> > patch-0003:
> > - Implemented resolver for the update_exists conflict type. Supported
> > resolvers are: apply_remote, keep_local, error.
>
> Thanks Nisha for the patches. I was running some tests on
> update_exists and found a case wherein it fails to LOG one conflict
> out of 3.
>
> create table tab (a int primary key, b int unique, c int unique);
> Pub: insert into tab values (1,1,1);
>
> Sub:
> insert into tab values (2,20,30);
> insert into tab values (3,40,50);
> insert into tab values (4,60,70);
>
> Pub: update tab set a=2,b=40,c=70 where a=1;
>
> Here it logs the update_exists conflict and the resolution for Key
> (b)=(40) and Key (c)=(70), but misses logging the first one, with
> Key (a)=(2).

Fixed.

Thanks,
Nisha
Re: Conflict Detection and Resolution
On Thu, Aug 22, 2024 at 3:45 PM shveta malik wrote: > > On Wed, Aug 21, 2024 at 4:08 PM Nisha Moond wrote: > > > > The patches have been rebased on the latest pgHead following the merge > > of the conflict detection patch [1]. > > Thanks for working on patches. > > Summarizing the issues which need some suggestions/thoughts. > > 1) > For subscription based resolvers, currently the syntax implemented is: > > 1a) > CREATE SUBSCRIPTION > CONNECTION PUBLICATION > CONFLICT RESOLVER > (conflict_type1 = resolver1, conflict_type2 = resolver2, > conflict_type3 = resolver3,...); > > 1b) > ALTER SUBSCRIPTION CONFLICT RESOLVER > (conflict_type1 = resolver1, conflict_type2 = resolver2, > conflict_type3 = resolver3,...); > > Earlier the syntax suggested in [1] was: > CREATE SUBSCRIPTION CONNECTION PUBLICATION > CONFLICT RESOLVER 'conflict_resolver1' FOR 'conflict_type1', > CONFLICT RESOLVER 'conflict_resolver2' FOR 'conflict_type2'; > > I think the currently implemented syntax is good as it has less > repetition, unless others think otherwise. > > ~~ > > 2) > For subscription based resolvers, do we need a RESET command to reset > resolvers to default? Any one of below or both? > > 2a) reset all at once: > ALTER SUBSCRIPTION RESET CONFLICT RESOLVERS > > 2b) reset one at a time: > ALTER SUBSCRIPTION RESET CONFLICT RESOLVER for 'conflict_type'; > > The issue I see here is, to implement 1a and 1b, we have introduced > the 'RESOLVER' keyword. If we want to implement 2a, we will have to > introduce the 'RESOLVERS' keyword as well. But we can come up with > some alternative syntax if we plan to implement these. Thoughts? > > ~~ > > 3) Regarding update_exists: > > 3a) > Currently update_exists resolver patch is kept separate. The reason > being, it performs resolution which will need deletion of multiple > rows. It will be good to discuss if we want to target this in the > first draft. 
Please see the example: > > create table tab (a int primary key, b int unique, c int unique); > > Pub: insert into tab values (1,1,1); > Sub: > insert into tab values (2,20,30); > insert into tab values (3,40,50); > insert into tab values (4,60,70); > > Pub: update tab set a=2,b=40,c=70 where a=1; > > The above 'update' on pub will result in 'update_exists' on sub and if > resolution is in favour of 'apply', then it will conflict with all the > three local rows of subscriber due to unique constraint present on all > three columns. Thus in order to resolve the conflict, it will have to > delete these 3 rows on sub: > > 2,20,30 > 3,40,50 > 4,60,70 > and then update 1,1,1 to 2,40,70. > > Just need opinion on if we shall target this in the initial draft. > > 3b) > If we plan to implement this, we need to work on optimal design where > we can find all the conflicting rows at once and delete those. > Currently the implementation has been done using recursion i.e. find > one conflicting row, then delete it and then next and so on i.e. we > call apply_handle_update_internal() recursively. On initial code > review, I feel it is doable to scan all indexes at once and get > conflicting-tuple-ids in one go and get rid of recursion. It can be > attempted once we decide on 3a. > > ~~ > > 4) > Now for insert_exists and update_exists, we are doing a pre-scan of > all unique indexes to find conflict. Also there is post-scan to figure > out if the conflicting row is inserted meanwhile. This needs to be > reviewed for optimization. We need to avoid pre-scan wherever > possible. I think the only case for which it can be avoided is > 'ERROR'. For the cases where resolver is in favor of remote-apply, we > need to check conflict beforehand to avoid rollback of already > inserted data. And for the case where resolver is in favor of skipping > the change, then too we should know beforehand about the conflict to > avoid heap-insertion and rollback. Thoughts? 
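The multi-row resolution in the a/b/c example above can be sketched as follows. This is a hypothetical model, not the patch's code: every column is treated as unique (as in the example table), whereas the real implementation would scan each unique index to collect conflicting tuple ids.

```python
def resolve_update_exists(local_rows, old_key, new_row):
    """Resolve an update_exists conflict in favour of apply_remote when
    the incoming row conflicts with several local rows via per-column
    unique constraints.

    Returns (remaining_rows, deleted_rows). local_rows are tuples whose
    first element is the primary key; old_key identifies the row being
    updated.
    """
    # Find every local row (other than the one being updated) that shares
    # a value with the incoming row in any uniquely-constrained column.
    conflicting = [
        r for r in local_rows
        if r[0] != old_key and any(rv == nv for rv, nv in zip(r, new_row))
    ]
    # Delete the conflicting rows, then apply the update to the target row.
    remaining = [r for r in local_rows if r not in conflicting]
    remaining = [new_row if r[0] == old_key else r for r in remaining]
    return remaining, conflicting
```

With the thread's example (local rows (2,20,30), (3,40,50), (4,60,70) plus the replicated (1,1,1), and the remote update to (2,40,70)), all three local rows conflict through a, b, and c respectively and are deleted before the update is applied.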
> +1 to the idea of optimization, but it seems that when the resolver is set to ERROR, skipping the pre-scan only optimizes the case where no conflict exists. If a conflict is found, the apply-worker will error out during the pre-scan, and no post-scan occurs, so there's no opportunity for optimization. However, if no conflict is present, we currently do both pre-scan and post-scan. Skipping the pre-scan in this scenario could be a worthwhile optimization, even if it only benefits the no-conflict case. -- Thanks, Nisha
Re: Conflict detection and logging in logical replication
On Mon, Aug 5, 2024 at 10:05 AM shveta malik wrote: > > On Mon, Aug 5, 2024 at 9:19 AM Amit Kapila wrote: > > > > On Fri, Aug 2, 2024 at 6:28 PM Nisha Moond wrote: > > > > > > Performance tests done on the v8-0001 and v8-0002 patches, available at > > > [1]. > > > > > > > Thanks for doing the detailed tests for this patch. > > > > > The purpose of the performance tests is to measure the impact on > > > logical replication with track_commit_timestamp enabled, as this > > > involves fetching the commit_ts data to determine > > > delete_differ/update_differ conflicts. > > > > > > Fortunately, we did not see any noticeable overhead from the new > > > commit_ts fetch and comparison logic. The only notable impact is > > > potential overhead from logging conflicts if they occur frequently. > > > Therefore, enabling conflict detection by default seems feasible, and > > > introducing a new detect_conflict option may not be necessary. > > > > > ... > > > > > > Test 1: create conflicts on Sub using pgbench. > > > > > > Setup: > > > - Both publisher and subscriber have pgbench tables created as- > > > pgbench -p $node1_port postgres -qis 1 > > > - At Sub, a subscription created for all the changes from Pub node. > > > > > > Test Run: > > > - To test, ran pgbench for 15 minutes on both nodes simultaneously, > > > which led to concurrent updates and update_differ conflicts on the > > > Subscriber node. > > > Command used to run pgbench on both nodes- > > > ./pgbench postgres -p 8833 -c 10 -j 3 -T 300 -P 20 > > > > > > Results: > > > For each case, note the “tps” and total time taken by the apply-worker > > > on Sub to apply the changes coming from Pub. 
> > > Case1: track_commit_timestamp = off, detect_conflict = off
> > > Pub-tps = 9139.556405
> > > Sub-tps = 8456.787967
> > > Time of replicating all the changes: 19min 28s
> > >
> > > Case 2: track_commit_timestamp = on, detect_conflict = on
> > > Pub-tps = 8833.016548
> > > Sub-tps = 8389.763739
> > > Time of replicating all the changes: 20min 20s
> >
> > Why is there a noticeable tps (~3%) reduction in publisher TPS? Is it
> > the impact of track_commit_timestamp = on or something else?

When both the publisher and subscriber nodes are on the same machine, we observe a decrease in the publisher's TPS when 'track_commit_timestamp' is ON for the subscriber. Testing on pgHead (without the patch) also showed a similar reduction in the publisher's TPS.

Test Setup: The test was conducted with the same setup as Test-1.

Results:
Case 1: pgHead - 'track_commit_timestamp' = OFF
 - Pub TPS: 9306.25
 - Sub TPS: 8848.91
Case 2: pgHead - 'track_commit_timestamp' = ON
 - Pub TPS: 8915.75
 - Sub TPS: 8667.12

On pgHead too, there was a ~400tps reduction in the publisher when 'track_commit_timestamp' was enabled on the subscriber. Additionally, code profiling of the walsender on the publisher showed that the overhead in Case-2 was mainly in the DecodeCommit() call stack, causing slower write operations, especially in logicalrep_write_update() and OutputPluginWrite().

case1: 'track_commit_timestamp' = OFF
 --11.57%--xact_decode
   |--DecodeCommit
   |--ReorderBufferCommit
   ...
   --6.10%--pgoutput_change
     |--3.09%--logicalrep_write_update
     |    |--2.01%--OutputPluginWrite
     |    |--1.97%--WalSndWriteData

case2: 'track_commit_timestamp' = ON
 --53.19%--xact_decode
   |--DecodeCommit
   |--ReorderBufferCommit
   ...
   --30.25%--pgoutput_change
     |--15.23%--logicalrep_write_update
     |    |--9.82%--OutputPluginWrite
     |    |--9.57%--WalSndWriteData

In Case 2, the subscriber's process of writing timestamp data for millions of rows appears to have impacted all write operations on the machine. To confirm the profiling results, we conducted the same test with the publisher and subscriber on separate machines.

Results:
Case 1: 'track_commit_timestamp' = OFF
 - Run 1: Pub TPS: 2144.10, Sub TPS: 2216.02
 - Run 2: Pub TPS: 2159.41, Sub TPS: 2233.82
Case 2: 'track_commit_timestamp' = ON
 - Run 1: Pub TPS: 2174.39, Sub TPS: 2226.89
 - Run 2: Pub TPS: 2148.92, Sub TPS: 2224.80

Note: The machines used in this test were not as powerful as the one used in the earlier tests, resulting in lower overall TPS (~2k vs. ~8-9k). However, the results sh
Re: Conflict detection and logging in logical replication
Performance tests done on the v8-0001 and v8-0002 patches, available at [1].

The purpose of the performance tests is to measure the impact on logical replication with track_commit_timestamp enabled, as this involves fetching the commit_ts data to determine delete_differ/update_differ conflicts.

Fortunately, we did not see any noticeable overhead from the new commit_ts fetch and comparison logic. The only notable impact is potential overhead from logging conflicts if they occur frequently. Therefore, enabling conflict detection by default seems feasible, and introducing a new detect_conflict option may not be necessary.

Please refer to the following for detailed test results:

For all the tests, created two server nodes, one as publisher and one as subscriber. Both the nodes are configured with the below settings -
wal_level = logical
shared_buffers = 40GB
max_worker_processes = 32
max_parallel_maintenance_workers = 24
max_parallel_workers = 32
synchronous_commit = off
checkpoint_timeout = 1d
max_wal_size = 24GB
min_wal_size = 15GB
autovacuum = off

Test 1: create conflicts on Sub using pgbench.

Setup:
- Both publisher and subscriber have pgbench tables created as-
  pgbench -p $node1_port postgres -qis 1
- At Sub, a subscription created for all the changes from Pub node.

Test Run:
- To test, ran pgbench for 15 minutes on both nodes simultaneously, which led to concurrent updates and update_differ conflicts on the Subscriber node.
  Command used to run pgbench on both nodes-
  ./pgbench postgres -p 8833 -c 10 -j 3 -T 300 -P 20

Results:
For each case, note the "tps" and total time taken by the apply-worker on Sub to apply the changes coming from Pub.
Case1: track_commit_timestamp = off, detect_conflict = off
Pub-tps = 9139.556405
Sub-tps = 8456.787967
Time of replicating all the changes: 19min 28s

Case 2: track_commit_timestamp = on, detect_conflict = on
Pub-tps = 8833.016548
Sub-tps = 8389.763739
Time of replicating all the changes: 20min 20s

Case3: track_commit_timestamp = on, detect_conflict = off
Pub-tps = 8886.101726
Sub-tps = 8374.508017
Time of replicating all the changes: 19min 35s

Case 4: track_commit_timestamp = off, detect_conflict = on
Pub-tps = 8981.924596
Sub-tps = 8411.120808
Time of replicating all the changes: 19min 27s

** The difference in TPS between the cases is small, though I can see a slight increase in the replication time (about 5%) when enabling both track_commit_timestamp and detect_conflict.

Test2: create conflict using a manual script
- To measure the precise time taken by the apply-worker in all cases, create a test with a table having 10 million rows.
- To record the total time taken by the apply-worker, dump the current time in the logfile for apply_handle_begin() and apply_handle_commit().

Setup:
Pub: has a table 'perf' with 10 million rows.
Sub: has the same table 'perf' with its own 10 million rows (inserted by 1000 different transactions). This table is subscribed for all changes from Pub.

Test Run:
At Pub: run UPDATE on the table 'perf' to update all its rows in a single transaction. (This will lead to update_differ conflict for all rows on Sub when enabled.)
At Sub: record the time (from log file) taken by the apply-worker to apply all updates coming from Pub.

Results:
The below table shows the total time taken by the apply-worker (apply_handle_commit Time - apply_handle_begin Time).
(Two test runs for each of the four cases)

Case1: track_commit_timestamp = off, detect_conflict = off
Run1 - 2min 42sec 579ms
Run2 - 2min 41sec 75ms

Case 2: track_commit_timestamp = on, detect_conflict = on
Run1 - 6min 11sec 602ms
Run2 - 6min 25sec 179ms

Case3: track_commit_timestamp = on, detect_conflict = off
Run1 - 2min 34sec 223ms
Run2 - 2min 33sec 482ms

Case 4: track_commit_timestamp = off, detect_conflict = on
Run1 - 2min 35sec 276ms
Run2 - 2min 38sec 745ms

** In case-2, when both track_commit_timestamp and detect_conflict are enabled, the time taken by the apply-worker is ~140% higher.

Test3: Case when no conflict is detected.
To measure the time taken by the apply-worker when there is no conflict detected. This test is to confirm whether the time overhead in Test1-Case2 is due to the new function GetTupleCommitTs(), which fetches the origin and timestamp information for each row in the table before applying the update.

Setup:
- The Publisher and Subscriber both have an empty table to start with.
- At Sub, the table is subscribed for all changes from Pub.
- At Pub: Insert 10 million rows and the same will be replicated to the Sub table as well.

Test Run:
At Pub: run an UPDATE on the table to update all rows in a single transaction. (This will NOT hit the update_differ
Re: Conflict detection and logging in logical replication
On Thu, Jul 25, 2024 at 12:04 PM Zhijie Hou (Fujitsu) wrote:
> Here is the V6 patch set which addressed Shveta and Nisha's comments
> in [1][2][3][4].

Thanks for the patch. I tested the v6-0001 patch with partition table scenarios. Please review the following scenario where Pub updates a tuple, causing it to move from one partition to another on Sub.

Setup:
Pub:
create table tab (a int not null, b int not null);
alter table tab add constraint tab_pk primary key (a,b);

Sub:
create table tab (a int not null, b int not null) partition by range (b);
alter table tab add constraint tab_pk primary key (a,b);
create table tab_1 partition of tab for values from (MINVALUE) to (100);
create table tab_2 partition of tab for values from (101) to (MAXVALUE);

Test:
Pub: insert into tab values (1,1);
Sub: update tab set a=1 where a=1; --> just to make it Sub's origin
Sub: insert into tab values (1,101);
Pub: update tab set b=101 where b=1; --> Both 'update_differ' and 'insert_exists' are detected.

For non-partitioned tables, a similar update results in 'update_differ' and 'update_exists' conflicts. After detecting 'update_differ', the apply worker proceeds to apply the remote update, and if a tuple with the updated key already exists, it raises 'update_exists'. This same behavior is expected for partitioned tables too.

Thanks,
Nisha
Re: Conflict detection and logging in logical replication
On Thu, Jul 18, 2024 at 7:52 AM Zhijie Hou (Fujitsu) wrote: > > Attach the V5 patch set which changed the following: > Tested v5-0001 patch, and it fails to detect the update_exists conflict for a setup where Pub has a non-partitioned table and Sub has the same table partitioned. Below is a testcase showcasing the issue: Setup: Pub: create table tab (a int not null, b int not null); alter table tab add constraint tab_pk primary key (a,b); Sub: create table tab (a int not null, b int not null) partition by range (b); alter table tab add constraint tab_pk primary key (a,b); CREATE TABLE tab_1 PARTITION OF tab FOR VALUES FROM (MINVALUE) TO (100); CREATE TABLE tab_2 PARTITION OF tab FOR VALUES FROM (101) TO (MAXVALUE); Test: Pub: insert into tab values (1,1); Sub: insert into tab values (2,1); Pub: update tab set a=2 where a=1;--> After this update on Pub, 'update_exists' is expected on Sub, but it fails to detect the conflict and logs the key violation error - ERROR: duplicate key value violates unique constraint "tab_1_pkey" DETAIL: Key (a, b)=(2, 1) already exists. Thanks, Nisha
Re: Improve the connection failure error messages
On Tue, Jul 9, 2024 at 1:00 AM Tom Lane wrote: > > Nisha Moond writes: > > Attached v5 patch with the translator comments as suggested. > > I looked at this, and I agree with the goal, but I find just about all > of the translator comments unnecessary. The ones that are useful are > useful only because the message is violating one of our message style > guidelines [1]: > > When citing the name of an object, state what kind of object it is. > > Rationale: Otherwise no one will know what “foo.bar.baz” refers to. > > So, for example, where you have > > + > +/* > + * translator: first %s is the subscription name, second %s is > the > + * error > + */ > + errmsg("subscription \"%s\" could not connect to the > publisher: %s", stmt->subname, err))); > > I don't find that that translator comment is adding anything. > But there are a couple of places like > > +/* > + * translator: first %s is the slotsync worker name, second %s is the > + * error > + */ > +errmsg("\"%s\" could not connect to the primary server: %s", > app_name.data, err)); > > I think that the right cure for the ambiguity here is not to add a > translator comment, but to label the name properly, perhaps like > > errmsg("synchronization worker \"%s\" could not connect to > the primary server: %s", >app_name.data, err)); > > > regards, tom lane > > [1] > https://www.postgresql.org/docs/devel/error-style-guide.html#ERROR-STYLE-GUIDE-OBJECT-TYPE Thank you for the review. Attached the patch v6 with suggested improvements. - Removed unnecessary translator comments. - Added appropriate identifier names where missing. -- Thanks, Nisha v6-0001-Improve-the-connection-failure-error-messages.patch Description: Binary data
Re: Conflict Detection and Resolution
On Mon, Jul 1, 2024 at 1:17 PM Ajin Cherian wrote: > > > > On Thu, Jun 27, 2024 at 1:14 PM Nisha Moond wrote: >> >> Please find the attached 'patch0003', which implements conflict >> resolutions according to the global resolver settings. >> >> Summary of Conflict Resolutions Implemented in 'patch0003': >> >> INSERT Conflicts: >> >> 1) Conflict Type: 'insert_exists' >> >> Supported Resolutions: >> a) 'remote_apply': Convert the INSERT to an UPDATE and apply. >> b) 'keep_local': Ignore the incoming (conflicting) INSERT and retain >> the local tuple. >> c) 'error': The apply worker will error out and restart. >> > > Hi Nisha, > > While testing the patch, when conflict resolution is configured and > insert_exists is set to "remote_apply", I see this warning in the logs due to > a resource not being closed: > > 2024-07-01 02:52:59.427 EDT [20304] LOG: conflict insert_exists detected on > relation "public.test1" > 2024-07-01 02:52:59.427 EDT [20304] DETAIL: Key already exists. Applying > resolution method "remote_apply" > 2024-07-01 02:52:59.427 EDT [20304] CONTEXT: processing remote data for > replication origin "pg_16417" during message type "INSERT" for replication > target relation "public.test1" in transaction 763, finished at 0/15E7F68 > 2024-07-01 02:52:59.427 EDT [20304] WARNING: resource was not closed: [138] > (rel=base/5/16413, blockNum=0, flags=0x9380, refcount=1 1) > 2024-07-01 02:52:59.427 EDT [20304] CONTEXT: processing remote data for > replication origin "pg_16417" during message type "COMMIT" in transaction > 763, finished at 0/15E7F68 > 2024-07-01 02:52:59.427 EDT [20304] WARNING: resource was not closed: > TupleDesc 0x7f8c0439e448 (16402,-1) > 2024-07-01 02:52:59.427 EDT [20304] CONTEXT: processing remote data for > replication origin "pg_16417" during message type "COMMIT" in transaction > 763, finished at 0/15E7F68 > Thank you Ajin for reporting the issue, This is now fixed with the v4-0003 patch. -- Thanks, Nisha
Re: Conflict detection and logging in logical replication
On Mon, Jun 24, 2024 at 7:39 AM Zhijie Hou (Fujitsu) wrote: > > When testing the patch, I noticed a bug that when reporting the conflict > after calling ExecInsertIndexTuples(), we might find the tuple that we > just inserted and report it.(we should only report conflict if there are > other conflict tuples which are not inserted by us) Here is a new patch > which fixed this and fixed a compile warning reported by CFbot. > Thank you for the patch! A review comment: The patch does not detect 'update_differ' conflicts when the Publisher has a non-partitioned table and the Subscriber has a partitioned version. Here’s a simple failing test case: Pub: create table tab (a int primary key, b int not null, c varchar(5)); Sub: create table tab (a int not null, b int not null, c varchar(5)) partition by range (b); alter table tab add constraint tab_pk primary key (a, b); create table tab_1 partition of tab for values from (minvalue) to (100); create table tab_2 partition of tab for values from (100) to (maxvalue); With the above setup, in case the Subscriber table has a tuple with its own origin, the incoming remote update from the Publisher fails to detect the 'update_differ' conflict. -- Thanks, Nisha
Re: Conflict Detection and Resolution
On Wed, Jun 5, 2024 at 7:29 PM Dilip Kumar wrote: > > On Tue, Jun 4, 2024 at 9:37 AM Amit Kapila wrote: > > > > Can you share the use case of "earliest_timestamp_wins" resolution > > method? It seems after the initial update on the local node, it will > > never allow remote update to succeed which sounds a bit odd. Jan has > > shared this and similar concerns about this resolution method, so I > > have added him to the email as well. > > > I can not think of a use case exactly in this context but it's very > common to have such a use case while designing a distributed > application with multiple clients. For example, when we are doing git > push concurrently from multiple clients it is expected that the > earliest commit wins. > Here are more use cases of the "earliest_timestamp_wins" resolution method: 1) Applications where the record of first occurrence of an event is important. For example, sensor based applications like earthquake detection systems, capturing the first seismic wave's time is crucial. 2) Scheduling systems, like appointment booking, prioritize the earliest request when handling concurrent ones. 3) In contexts where maintaining chronological order is important - a) Social media platforms display comments ensuring that the earliest ones are visible first. b) Finance transaction processing systems rely on timestamps to prioritize the processing of transactions, ensuring that the earliest transaction is handled first -- Thanks, Nisha
Re: Improve the connection failure error messages
On Fri, Apr 26, 2024 at 1:10 PM Daniel Gustafsson wrote: > > > On 22 Mar 2024, at 11:42, Nisha Moond wrote: > > > Here is the v4 patch with changes required in slotfuncs.c and slotsync.c > > files. > > - errmsg("could not connect to the primary server: %s", err)); > + errmsg("\"%s\" could not connect to the primary server: %s", > app_name.data, err)); > > Messages like this should perhaps have translator comments to indicate what > the > leading "%s" will contain? Attached v5 patch with the translator comments as suggested. -- Thanks, Nisha v5-0001-Improve-the-connection-failure-error-messages.patch Description: Binary data
Re: Conflict Detection and Resolution
On Mon, May 27, 2024 at 11:19 AM shveta malik wrote: > > On Sat, May 25, 2024 at 2:39 AM Tomas Vondra > wrote: > > > > On 5/23/24 08:36, shveta malik wrote: > > > Hello hackers, > > > > > > Please find the proposal for Conflict Detection and Resolution (CDR) > > > for Logical replication. > > > > > below details.> > > > > > > Introduction > > > > > > In case the node is subscribed to multiple providers, or when local > > > writes happen on a subscriber, conflicts can arise for the incoming > > > changes. CDR is the mechanism to automatically detect and resolve > > > these conflicts depending on the application and configurations. > > > CDR is not applicable for the initial table sync. If locally, there > > > exists conflicting data on the table, the table sync worker will fail. > > > Please find the details on CDR in apply worker for INSERT, UPDATE and > > > DELETE operations: > > > > > > > Which architecture are you aiming for? Here you talk about multiple > > providers, but the wiki page mentions active-active. I'm not sure how > > much this matters, but it might. > > Currently, we are working for multi providers case but ideally it > should work for active-active also. During further discussion and > implementation phase, if we find that, there are cases which will not > work in straight-forward way for active-active, then our primary focus > will remain to first implement it for multiple providers architecture. > > > > > Also, what kind of consistency you expect from this? Because none of > > these simple conflict resolution methods can give you the regular > > consistency models we're used to, AFAICS. > > Can you please explain a little bit more on this. > > > > > > INSERT > > > > > > To resolve INSERT conflict on subscriber, it is important to find out > > > the conflicting row (if any) before we attempt an insertion. The > > > indexes or search preference for the same will be: > > > First check for replica identity (RI) index. 
> > > - if not found, check for the primary key (PK) index. > > > - if not found, then check for unique indexes (individual ones or > > > added by unique constraints) > > > - if unique index also not found, skip CDR > > > > > > Note: if no RI index, PK, or unique index is found but > > > REPLICA_IDENTITY_FULL is defined, CDR will still be skipped. > > > The reason being that even though a row can be identified with > > > REPLICA_IDENTITY_FULL, such tables are allowed to have duplicate > > > rows. Hence, we should not go for conflict detection in such a case. > > > > > > > It's not clear to me why would REPLICA_IDENTITY_FULL mean the table is > > allowed to have duplicate values? It just means the upstream is sending > > the whole original row, there can still be a PK/UNIQUE index on both the > > publisher and subscriber. > > Yes, right. Sorry for the confusion. I meant the same i.e. in absence of > 'RI index, PK, or unique index', tables can have duplicates. So even > in presence of Replica-identity (FULL in this case) but in absence of > unique/primary index, CDR will be skipped for INSERT. > > > > > > In case of replica identity ‘nothing’ and in absence of any suitable > > > index (as defined above), CDR will be skipped for INSERT. > > > > > > Conflict Type: > > > > > > insert_exists: A conflict is detected when the table has the same > > > value for a key column as the new value in the incoming row. > > > > > > Conflict Resolution > > > > > > a) latest_timestamp_wins: The change with later commit timestamp wins. > > > b) earliest_timestamp_wins: The change with earlier commit timestamp > > > wins. > > > c) apply: Always apply the remote change. > > > d) skip: Remote change is skipped. > > > e) error: Error out on conflict. Replication is stopped, manual > > > action is needed. > > > > > > > Why not to have some support for user-defined conflict resolution > > methods, allowing to do more complex stuff (e.g.
merging the rows in > > some way, perhaps even with datatype-specific behavior)? > > Initially, for the sake of simplicity, we are targeting to support > built-in resolvers. But we have a plan to work on user-defined > resolvers as well. We shall propose that separately. > > > > > > The change will be converted to 'UPDATE' and applied if the decision > > > is in favor of applying remote change. > > > > > > It is important to have commit timestamp info available on subscriber > > > when latest_timestamp_wins or earliest_timestamp_wins method is chosen > > > as resolution method. Thus ‘track_commit_timestamp’ must be enabled > > > on subscriber, in absence of which, configuring the said > > > timestamp-based resolution methods will result in error. > > > > > > Note: If the user has chosen the latest or earliest_timestamp_wins, > > > and the remote and local timestamps are the same, then it will go by > > > system identifier. The change with a higher system identifier will win.
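The ‘track_commit_timestamp’ prerequisite mentioned above is standard PostgreSQL functionality and can be sketched as follows (the table name is a placeholder; the conflict-resolver configuration itself belongs to the proposed patch and is not shown):

```sql
-- On the subscriber: enable commit-timestamp tracking (server restart required).
ALTER SYSTEM SET track_commit_timestamp = on;

-- After the restart, verify the setting:
SHOW track_commit_timestamp;

-- Once enabled, the commit timestamp that timestamp-based resolvers would
-- compare can be inspected per row, e.g.:
SELECT pg_xact_commit_timestamp(xmin), * FROM my_table;  -- my_table is hypothetical
```

Without this setting, pg_xact_commit_timestamp() raises an error, which matches the proposal's requirement that configuring the timestamp-based resolution methods without it should fail.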
Re: Synchronizing slots from primary to standby
Did a performance test on the optimization patch (v2-0001-optimize-the-slot-advancement.patch). Please find the results:

Setup:
- One primary node with 100 failover-enabled logical slots - 20 DBs, each having 5 failover-enabled logical replication slots
- One physical standby node with 'sync_replication_slots' as off but other parameters required by slot-sync as enabled.

Node Configurations: please see config.txt

Test Plan:
1) Create 20 Databases on Primary node, each with 5 failover slots using "pg_create_logical_replication_slot()". Overall 100 failover slots.
2) Use pg_sync_replication_slot() to sync them to the standby. Note the execution time of sync and lsns values.
3) On Primary node, run pgbench for 15 mins on postgres db
4) Advance lsns of all the 100 slots on primary using pg_replication_slot_advance().
5) Use pg_sync_replication_slot() to sync slots to the standby. Note the execution time of sync and lsns values.

Executed the above test plan for three cases and compared the elapsed time for the pg_replication_slot_advance() -

(1) HEAD
Time taken by pg_sync_replication_slot() on Standby node -
a) The initial sync (step 2) = 140.208 ms
b) Sync after pgbench run on primary (step 5) = 66.994 ms

(2) HEAD + v3-0001-advance-the-restart_lsn-of-synced-slots-using-log.patch
a) The initial sync (step 2) = 163.885 ms
b) Sync after pgbench run on primary (step 5) = 837901.290 ms (13:57.901)
>> With the v3 patch, pg_sync_replication_slot() takes a significant amount of time to sync the slots.

(3) HEAD + v3-0001-advance-the-restart_lsn-of-synced-slots-using-log.patch + v2-0001-optimize-the-slot-advancement.patch
a) The initial sync (step 2) = 165.554 ms
b) Sync after pgbench run on primary (step 5) = 7991.718 ms (00:07.992)
>> With the optimization patch, the time taken by pg_sync_replication_slot() is reduced significantly to ~7 seconds.

We did the same test with a single DB too, by creating all 100 failover slots in the postgres DB, and the results were almost similar.
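The core calls behind steps 2, 4 and 5 of the plan above can be sketched as below (function names as used in this thread; the slot name is a placeholder):

```sql
-- Steps 2 and 5, on the standby: sync the failover slots from the primary.
-- Execution time can be captured with psql's \timing.
SELECT pg_sync_replication_slot();

-- Step 4, on the primary: advance a slot's lsn to the current WAL position,
-- repeated for each of the 100 slots:
SELECT pg_replication_slot_advance('slot_db1_1', pg_current_wal_lsn());
```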
Attached the scripts used for the test - "v3_perf_test_scripts.tar.gz" - which includes the files:
setup_multidb.sh : setup primary and standby nodes
createdb20.sql : create 20 DBs
createslot20.sql : create total 100 logical slots, 5 on each DB
run_sync.sql : call pg_replication_slot_advance() with timing
advance20.sql : advance lsn of all slots on Primary node to current lsn
advance20_perdb.sql : use on HEAD to advance lsn on Primary node
get_synced_data.sql : get details of the
config.txt : configuration used for nodes

v3_perf_test_scripts.tar.gz Description: GNU Zip compressed data
Re: Improve the connection failure error messages
On Wed, Mar 13, 2024 at 11:16 AM Peter Smith wrote: > > FYI -- some more code has been pushed since this patch was last > updated. AFAICT perhaps you'll want to update this patch again for the > following new connection messages on HEAD: > > - slotfuncs.c [1] > - slotsync.c [2] > > -- > [1] > https://github.com/postgres/postgres/blob/0b84f5c419a300dc1b1a70cf63b9907208e52643/src/backend/replication/slotfuncs.c#L989 > [2] > https://github.com/postgres/postgres/blob/0b84f5c419a300dc1b1a70cf63b9907208e52643/src/backend/replication/logical/slotsync.c#L1258 > Thanks for the update. Here is the v4 patch with changes required in slotfuncs.c and slotsync.c files. -- Thanks, Nisha v4-0001-Improve-the-connection-failure-error-messages.patch Description: Binary data
Re: Synchronizing slots from primary to standby
I did performance tests for the v99 patch w.r.t. wait time analysis. As this patch introduces a wait for the standby before sending changes to a subscriber, at the primary node I logged the time at the start and end of the XLogSendLogical() call (which eventually calls WalSndWaitForWal()) and calculated the total time taken by this function during the load run.

For load, ran pgbench for 15 minutes:
Creating tables: pgbench -p 5833 postgres -qis 2
Running benchmark: pgbench postgres -p 5833 -c 10 -j 3 -T 900 -P 20

Machine details:
11th Gen Intel(R) Core(TM) i9-11950H @ 2.60GHz
32GB RAM
OS - Windows 10 Enterprise

Test setup:
Primary node -->
  -> One physical standby node
  -> One subscriber node having only one subscription with failover=true

-- the slot-sync relevant parameters are set to default (OFF) for all the tests i.e.
hot_standby_feedback = off
sync_replication_slots = false

-- additional configuration on each instance is:
shared_buffers = 6GB
max_worker_processes = 32
max_parallel_maintenance_workers = 24
max_parallel_workers = 32
synchronous_commit = off
checkpoint_timeout = 1d
max_wal_size = 24GB
min_wal_size = 15GB
autovacuum = off

To review the wait time impact with and without the patch, compared three cases (did two runs for each case) -

(1) HEAD code:
time taken in run 1 = 103.935631 seconds
time taken in run 2 = 104.832186 seconds

(2) HEAD code + v99 patch ('standby_slot_names' is not set):
time taken in run 1 = 104.076343 seconds
time taken in run 2 = 103.116226 seconds

(3) HEAD code + v99 patch + a valid 'standby_slot_names' is set:
time taken in run 1 = 103.871012 seconds
time taken in run 2 = 103.793524 seconds

The time consumption of XLogSendLogical() is almost the same in all the cases and no performance degradation is observed.

--
Thanks,
Nisha
Re: Synchronizing slots from primary to standby
We conducted stress testing for the patch with a setup of one primary node with 100 tables and five subscribers, each having 20 subscriptions. Then created three physical standbys syncing the logical replication slots from the primary node. All 100 slots were successfully synced on all three standbys. We then ran the load and monitored LSN convergence using the prescribed SQL checks. Once the standbys were failover-ready, we were able to successfully promote one of the standbys and all the subscribers seamlessly migrated to the new primary node. We replicated the tests with 200 tables, creating 200 logical replication slots. With the increased load, all the tests were completed successfully. Minor errors (not due to patch) observed during tests - 1) When the load was run, on subscribers, the logical replication apply workers started failing due to timeout. This is not related to the patch as it happened due to the small "wal_receiver_timeout" setting w.r.t. the load. To confirm, we ran the same load without the patch too, and the same failure happened. 2) There was a buffer overflow exception on the primary node with the '200 replication slots' case. It was not related to the patch as it was due to short memory configuration. All the tests were done on Windows as well as Linux environments. Thank you Ajin for the stress test and analysis on Linux.
Re: Improve the connection failure error messages
> AFAIK some recent commits patches (e,g [1] for the "slot sync" > development) have created some more cases of "could not connect..." > messages. So, you might need to enhance your patch to deal with any > new ones in the latest HEAD. > > == > [1] > https://github.com/postgres/postgres/commit/776621a5e4796fa214b6b29a7ca134f6c138572a > > Thank you for the update. The v3 patch has the changes needed as per the latest HEAD. -- Thanks, Nisha v3-0001-Improve-the-connection-failure-error-messages.patch Description: Binary data
Re: Improve the connection failure error messages
On Fri, Jan 12, 2024 at 7:06 PM Aleksander Alekseev wrote: > > Hi, > > Thanks for the patch. > > > Due to this behavior, it is not possible to add a test to show the > > error message as it is done for CREATE SUBSCRIPTION. > > Let me know if you think there is another way to add this test. > > I believe it can be done with TAP tests, see for instance: > > contrib/auto_explain/t/001_auto_explain.pl > > However I wouldn't insist on including the test in scope of this > particular patch. Such a test doesn't currently exist, it can be added > as a separate patch, and whether this is actually a useful test (all > the tests consume time after all...) is somewhat debatable. Personally > I agree that it would be nice to have though. Thank you for providing this information. Yes, I can write a TAP test to check the log for the same error message. I'll attempt it and perform a time analysis. I'm unsure where to appropriately add this test. Any suggestions? Following your suggestion, I won't include the test in the scope of this patch. Instead, I'll start a new thread once I have gathered sufficient information. -- Thanks, Nisha
Re: Improve the connection failure error messages
> > ~~
>
> BTW, while experimenting with the bad connection ALTER I also tried
> setting 'disable_on_error' like below:
>
> ALTER SUBSCRIPTION sub4 SET (disable_on_error);
> ALTER SUBSCRIPTION sub4 CONNECTION 'port = -1';
>
> ...but here the subscription did not become DISABLED as I expected it
> would do on the next connection error iteration. It remains enabled
> and just continues to loop relaunch/ERROR indefinitely same as before.
>
> That looks like it may be a bug. Thoughts?
>

Ideally, if the already running apply worker in "LogicalRepApplyLoop()" has any exception/error, it will be handled and the subscription will be disabled if 'disable_on_error' is set -

start_apply(XLogRecPtr origin_startpos)
{
    PG_TRY();
    {
        LogicalRepApplyLoop(origin_startpos);
    }
    PG_CATCH();
    {
        if (MySubscription->disableonerr)
            DisableSubscriptionAndExit();
    ...

What is happening in this case is that the control reaches the function -

run_apply_worker() -> start_apply() -> LogicalRepApplyLoop -> maybe_reread_subscription()
...
    /*
     * Exit if any parameter that affects the remote connection was changed.
     * The launcher will start a new worker but note that the parallel apply
     * worker won't restart if the streaming option's value is changed from
     * 'parallel' to any other value or the server decides not to stream the
     * in-progress transaction.
     */
    if (strcmp(newsub->conninfo, MySubscription->conninfo) != 0 ||
...

and it sees a change in the parameter and calls apply_worker_exit(). This will exit the current process without throwing an exception to the caller, and the postmaster will try to restart the apply worker. The new apply worker, before reaching the start_apply() [where we handle the exception], will hit the code to establish the connection to the publisher -

ApplyWorkerMain() -> run_apply_worker() -
...
    LogRepWorkerWalRcvConn = walrcv_connect(MySubscription->conninfo,
                                            true /* replication */ ,
                                            true, must_use_password,
                                            MySubscription->name, &err);
    if (LogRepWorkerWalRcvConn == NULL)
        ereport(ERROR,
                (errcode(ERRCODE_CONNECTION_FAILURE),
                 errmsg("could not connect to the publisher: %s", err)));
...

and due to the bad connection string in the subscription, it will error out.

[28680] ERROR: could not connect to the publisher: invalid port number: "-1"
[3196] LOG: background worker "logical replication apply worker" (PID 28680) exited with exit code 1

Now, the postmaster keeps trying to restart the apply worker and it will keep failing until the connection string is corrected or the subscription is disabled manually. I think this is a bug that needs to be handled in run_apply_worker() when disable_on_error is set. IMO, this bug-fix discussion deserves a separate thread. Thoughts?
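The reported behavior can be reproduced with plain SQL (the subscription name is taken from the quoted example; pg_subscription is the standard catalog):

```sql
-- Given an existing, enabled subscription sub4:
ALTER SUBSCRIPTION sub4 SET (disable_on_error = true);
ALTER SUBSCRIPTION sub4 CONNECTION 'port = -1';

-- Expected: the next connection failure disables the subscription.
-- Observed: the apply worker keeps relaunching and failing while
-- subenabled remains true:
SELECT subname, subenabled, subdisableonerr
FROM pg_subscription
WHERE subname = 'sub4';
```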
Re: Synchronizing slots from primary to standby
A review on v62-006: failover-ready validation steps doc -

+    Next, check that the logical replication slots identified above exist on
+    the standby server. This step can be skipped if
+    standby_slot_names has been correctly configured.
+
+test_standby=# SELECT bool_and(synced AND NOT temporary AND conflict_reason IS NULL) AS failover_ready
+    FROM pg_replication_slots
+    WHERE slot_name in ('sub');
+ failover_ready
+
+ t
+(1 row)

This query does not ensure that all the logical replication slots exist on the standby. Due to the 'IN ('slots')' check, it will return 'true' even if only one or a few of the slots exist.
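One possible way to tighten the documented check (a sketch; the slot names and their count are illustrative) is to require that the number of healthy matching slots equals the number of expected slots, so a missing slot makes the query return false instead of being silently ignored:

```sql
-- failover_ready is true only if all listed slots exist on the standby
-- AND each is synced, persistent, and conflict-free.
SELECT count(*) FILTER (WHERE synced AND NOT temporary
                          AND conflict_reason IS NULL) = 2 AS failover_ready
  FROM pg_replication_slots
 WHERE slot_name IN ('sub1', 'sub2');  -- 2 = number of expected slots
```

With no matching rows, count(*) is 0 and the comparison yields false, unlike bool_and() over an empty set.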
Re: Improve the connection failure error messages
Thanks for reviewing, please find my response inline. On Wed, Jan 17, 2024 at 4:56 AM Peter Smith wrote: > > On Sat, Jan 13, 2024 at 12:36 AM Aleksander Alekseev > wrote: > > > > Hi, > > > > Thanks for the patch. > > > > > Due to this behavior, it is not possible to add a test to show the > > > error message as it is done for CREATE SUBSCRIPTION. > > > Let me know if you think there is another way to add this test. > > > > I believe it can be done with TAP tests, see for instance: > > > > contrib/auto_explain/t/001_auto_explain.pl > > > > However I wouldn't insist on including the test in scope of this > > particular patch. Such a test doesn't currently exist, it can be added > > as a separate patch, and whether this is actually a useful test (all > > the tests consume time after all...) is somewhat debatable. Personally > > I agree that it would be nice to have though. > > > > This being said... > > > > > The ALTER SUBSCRIPTION command does not error out on the user > > > interface if updated with a bad connection string and the connection > > > failure error can only be seen in the respective log file. > > > > I wonder if we should fix this. Again, not necessarily in scope of > > this work, but if this is not a difficult task, again, it would be > > nice to have. > > > > Other than that v2 looks good. > > > > OK. I see now that any ALTER of the subscription's connection, even > to some value that fails, will restart a new worker (like ALTER of any > other subscription parameters). For a bad connection, it will continue > to relaunch-worker/ERROR over and over. 
> > test_sub=# \r2024-01-17 09:34:28.665 AEDT [11274] LOG: logical > replication apply worker for subscription "sub4" has started > 2024-01-17 09:34:28.666 AEDT [11274] ERROR: could not connect to the > publisher: invalid port number: "-1" > 2024-01-17 09:34:28.667 AEDT [928] LOG: background worker "logical > replication apply worker" (PID 11274) exited with exit code 1 > dRs su2024-01-17 09:34:33.669 AEDT [11391] LOG: logical replication > apply worker for subscription "sub4" has started > 2024-01-17 09:34:33.669 AEDT [11391] ERROR: could not connect to the > publisher: invalid port number: "-1" > 2024-01-17 09:34:33.670 AEDT [928] LOG: background worker "logical > replication apply worker" (PID 11391) exited with exit code 1 > b4 > ... > > I don't really have any opinion if that behaviour is good or bad, but > anyway, it is deliberate, and IMO it is outside the scope of your > patch, so v2 patch LGTM. Upon code review, the ALTER SUBSCRIPTION command updates the connection string after checking it for parse errors and a few other obvious errors, and does not attempt to establish a connection. It is the apply worker running for the respective subscription that will try to connect and fail in case of a bad connection string. To me, it seems to be intentional design behavior, and I agree that deciding to change or maintain this behavior is out of this patch's scope. -- Thanks, Nisha
Re: Improve the connection failure error messages
Thanks for the review. Attached v2 patch with suggested changes. Please find my response inline. On Fri, Jan 12, 2024 at 8:20 AM Peter Smith wrote: > > Thanks for the patch! Here are a couple of review comments for it. > > == > src/backend/commands/subscriptioncmds.c > > 1. > @@ -742,7 +742,7 @@ CreateSubscription(ParseState *pstate, > CreateSubscriptionStmt *stmt, > if (!wrconn) > ereport(ERROR, > (errcode(ERRCODE_CONNECTION_FAILURE), > - errmsg("could not connect to the publisher: %s", err))); > + errmsg("\"%s\" could not connect to the publisher: %s", stmt->subname, > err))); > > In practice, these commands give errors like: > > test_sub=# create subscription sub1 connection 'dbname=bogus' publication > pub1; > ERROR: could not connect to the publisher: connection to server on > socket "/tmp/.s.PGSQL.5432" failed: FATAL: database "bogus" does not > exist > > and logs like: > > 2024-01-12 12:45:05.177 AEDT [13108] ERROR: could not connect to the > publisher: connection to server on socket "/tmp/.s.PGSQL.5432" failed: > FATAL: database "bogus" does not exist > 2024-01-12 12:45:05.177 AEDT [13108] STATEMENT: create subscription > sub1 connection 'dbname=bogus' publication pub1; > > Since the subscription name is already obvious from the statement that > caused the error I'm not sure it benefits much to add this to the > error message (but maybe it is useful if the error message was somehow > read in isolation from the statement). > > Anyway, I felt at least it should include the word "subscription" for > better consistency with the other messages in this patch: > > SUGGESTION > subscription \"%s\" could not connect to the publisher: %s Done. > == > > 2. > + appname = cluster_name[0] ? cluster_name : "walreceiver"; > + > /* Establish the connection to the primary for XLOG streaming */ > - wrconn = walrcv_connect(conninfo, false, false, > - cluster_name[0] ? 
cluster_name : "walreceiver", > - &err); > + wrconn = walrcv_connect(conninfo, false, false, appname, &err); > if (!wrconn) > ereport(ERROR, > (errcode(ERRCODE_CONNECTION_FAILURE), > - errmsg("could not connect to the primary server: %s", err))); > + errmsg("%s could not connect to the primary server: %s", appname, err))); > > I think your new %s should be quoted according to the guidelines at [1]. Done. > == > src/test/regress/expected/subscription.out > > 3. > Apparently, there is no existing regression test case for the ALTER > "could not connect" message because if there was, it would have > failed. Maybe a test should be added? > The ALTER SUBSCRIPTION command does not error out on the user interface if updated with a bad connection string and the connection failure error can only be seen in the respective log file. Due to this behavior, it is not possible to add a test to show the error message as it is done for CREATE SUBSCRIPTION. Let me know if you think there is another way to add this test. -- Thanks, Nisha v2-0001-Improve-the-connection-failure-error-messages.patch Description: Binary data
Improve the connection failure error messages
Hi Hackers,

Several parts of the code use the walrcv_connect() function, called from processes such as the walreceiver and the logical replication apply worker, to establish connections with other hosts. Presently, in case of connection failures, the error message lacks information about the specific process that attempted the connection and encountered the failure. The attached patch enhances the error messages for such connection failures, offering more details on the process that failed to establish a connection.

--
Thanks,
Nisha

v1-0001-Improve-the-connection-failure-error-messages.patch Description: Binary data
Re: Intermittent failure with t/003_logical_slots.pl test on windows
Thanks for working on it. I tested the patch on my system and it resolved the issue with commands running -V (version check). As you mentioned, I am also still seeing intermittent errors even with the patch, as below -

in 'pg_upgrade/002_pg_upgrade' -

# Running: pg_upgrade --no-sync -d D:\Project\pg2\postgres\build/testrun/pg_upgrade/002_pg_upgrade\data/t_002_pg_upgrade_old_node_data/pgdata -D D:\Project\pg2\postgres\build/testrun/pg_upgrade/002_pg_upgrade\data/t_002_pg_upgrade_new_node_data/pgdata -b D:/Project/pg2/postgres/build/tmp_install/Project/pg2/postgresql/bin -B D:/Project/pg2/postgres/build/tmp_install/Project/pg2/postgresql/bin -s 127.0.0.1 -p 56095 -P 56096 --copy --check
Performing Consistency Checks
-
Checking cluster versions ok
The source cluster lacks cluster state information:
Failure, exiting
[12:37:38.666](3.317s) not ok 12 - run of pg_upgrade --check for new instance
[12:37:38.666](0.000s) # Failed test 'run of pg_upgrade --check for new instance'
# at D:/Project/pg2/postgres/src/bin/pg_upgrade/t/002_pg_upgrade.pl line 375.

and in 'pg_upgrade/003_logical_slots' -

[12:35:33.773](0.001s) not ok 6 - run of pg_upgrade of old cluster with slots having unconsumed WAL records stdout /(?^:Your installation contains logical replication slots that can't be upgraded.)/
[12:35:33.773](0.000s) # Failed test 'run of pg_upgrade of old cluster with slots having unconsumed WAL records stdout /(?^:Your installation contains logical replication slots that can't be upgraded.)/'
# at D:/Project/pg2/postgres/src/bin/pg_upgrade/t/003_logical_slots.pl line 102.
[12:35:33.773](0.000s) # 'Performing Consistency Checks
# -
# Checking cluster versions ok
#
# The target cluster lacks cluster state information:
# Failure, exiting

It seems the 'Performing Consistency Checks' step fails due to missing cluster state information, and it is possible that it can be fixed along the same lines.

--
Thanks,
Nisha
Re: Synchronizing slots from primary to standby
Review for v47 patch -

(1) When we try to create a subscription on standby using a synced slot that is in 'r' sync_state, the subscription will be created at the subscriber, and on standby, two actions will take place -
(i) As copy_data is true by default, it will switch the failover state of the synced slot to 'false'.
(ii) As we don't allow using synced slots, it will start giving the expected error in the log file -
ERROR: cannot use replication slot "logical_slot" for logical decoding
DETAIL: This slot is being synced from the primary server.
HINT: Specify another replication slot.

The first one seems an issue: it toggles the failover to false, and it remains false after that. I think it should be fixed.

(2) With the patch, the 'CREATE SUBSCRIPTION' command with a 'slot_name' of an 'active' logical slot fails and errors out -
ERROR: could not alter replication slot "logical_slot" on publisher: ERROR: replication slot "logical_slot1" is active for PID

Without the patch, the create subscription with an 'active' slot_name succeeds and the log file shows the error "could not start WAL streaming: ERROR: replication slot "logical_slot" is active for PID ".

Given that the specified active slot_name has failover set to false and the create subscription command also specifies failover=false, the behavior of the "with-patch" case is expected to be the same as that of the "without-patch" scenario.
Re: Synchronizing slots from primary to standby
A review on v45 patch: If one creates a logical slot with failover=true as - select pg_create_logical_replication_slot('logical_slot','pgoutput', false, true, true); Then, uses the existing logical slot while creating a subscription - postgres=# create subscription sub4 connection 'dbname=postgres host=localhost port=5433' publication pub1t4 WITH (slot_name=logical_slot, create_slot=false, failover=true); NOTICE: changed the failover state of replication slot "logical_slot" on publisher to false CREATE SUBSCRIPTION Despite configuring logical_slot's failover to true and specifying failover=true during subscription creation, the NOTICE indicates a change in the failover state to 'false', without providing any explanation for this transition. It can be confusing for users, so IMO, the notice should include the reason for switching failover to 'false' or should give a hint to use either refresh=false or copy_data=false to enable failover=true for the slot as we do in other similar 'alter subscription...' scenarios. -- Thanks & Regards, Nisha
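Following the hint proposed above, a command along these lines should let the pre-created slot keep failover=true by skipping the initial data copy (a sketch; names are taken from the example above, and the exact behavior depends on the patch):

```sql
CREATE SUBSCRIPTION sub4
    CONNECTION 'dbname=postgres host=localhost port=5433'
    PUBLICATION pub1t4
    WITH (slot_name = logical_slot, create_slot = false,
          failover = true, copy_data = false);
```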
Re: Synchronizing slots from primary to standby
On Fri, Dec 1, 2023 at 5:40 PM Nisha Moond wrote: > > Review for v41 patch. > > 1. > == > src/backend/utils/misc/postgresql.conf.sample > > +#enable_syncslot = on # enables slot synchronization on the physical > standby from the primary > > enable_syncslot is disabled by default, so, it should be 'off' here. > > ~~~ > 2. > IIUC, the slotsyncworker's connection to the primary is to execute a > query. Its aim is not walsender type connection, but at primary when > queried, the 'backend_type' is set to 'walsender'. > Snippet from primary db- > > datname | usename | application_name | wait_event_type | backend_type > -+-+--+-+-- > postgres | replication | slotsyncworker | Client | walsender > > Is it okay? > > ~~~ > 3. > As per current logic, if there are slots on primary with disabled > subscriptions, then, when standby is created it replicates these slots > but can't make them sync-ready until any activity happens on the > slots. > So, such slots stay in 'i' sync-state and get dropped when failover > happens. Now, if the subscriber tries to enable their existing > subscription after failover, it gives an error that the slot does not > exist. Is this behavior expected? If yes, then is it worth documenting that slots of disabled subscriptions are not synced? -- Thanks, Nisha
Re: Synchronizing slots from primary to standby
Review for v41 patch.

1.
==
src/backend/utils/misc/postgresql.conf.sample

+#enable_syncslot = on # enables slot synchronization on the physical standby from the primary

enable_syncslot is disabled by default, so, it should be 'off' here.

~~~
2.
IIUC, the slotsyncworker's connection to the primary is to execute a query. Its aim is not a walsender type connection, but at the primary, when queried, the 'backend_type' is set to 'walsender'.
Snippet from primary db -

 datname  |   usename   | application_name | wait_event_type | backend_type
----------+-------------+------------------+-----------------+--------------
 postgres | replication | slotsyncworker   | Client          | walsender

Is it okay?

~~~
3.
As per current logic, if there are slots on primary with disabled subscriptions, then, when standby is created it replicates these slots but can't make them sync-ready until any activity happens on the slots. So, such slots stay in 'i' sync-state and get dropped when failover happens. Now, if the subscriber tries to enable their existing subscription after failover, it gives an error that the slot does not exist.

~~~
4.
primary_slot_name GUC value test:
When standby is started with a non-existing primary_slot_name, the wal-receiver gives an error but the slot-sync worker does not raise any error/warning. It is no-op though as it has a check 'if (XLogRecPtrIsInvalid(WalRcv->latestWalEnd)) do nothing'. Is this okay or shall the slot-sync worker too raise an error and exit?
In another case, when standby is started with a valid primary_slot_name, but it is changed to some invalid value at runtime, then the walreceiver starts giving an error but the slot-sync worker keeps on running. In this case, unlike the previous case, it does not even go to no-op mode (as it sees a valid WalRcv->latestWalEnd from the earlier run) and keeps pinging the primary repeatedly for slots. Shall it error out here, or at least be no-op until we give a valid primary_slot_name?

--
Thanks,
Nisha
Re: Intermittent failure with t/003_logical_slots.pl test on windows
On Fri, Nov 3, 2023 at 5:02 PM Nisha Moond wrote: > > On Thu, Nov 2, 2023 at 11:52 AM Kyotaro Horiguchi > wrote: > > > > At Tue, 31 Oct 2023 18:11:48 +0530, vignesh C wrote in > > > Few others are also facing this problem with similar code like in: > > > https://stackoverflow.com/questions/15882799/fgets-returning-error-for-file-returned-by-popen > > > > I'm inclined to believe that the pipe won't enter the EOF state until > > the target command terminates (then the top-level cmd.exe). The target > > command likely terminated prematurely due to an error before printing > > any output. > > > > If we append "2>&1" to the command line, we can capture the error > > message through the returned pipe if any. Such error messages will > > cause the subsequent code to fail with an error such as "unexpected > > string: 'the output'". I'm not sure, but if this is permissive, the > > returned error messages could potentially provide insight into the > > underlying issue, paving the way for a potential solution. > > > > Appending '2>&1' test: > The command still results in NULL and ends up failing as no data is > returned. Which means even no error message is returned. The error log > with appended '2>&1' is - > > no data was returned by command > ""D:/Project/pg1/postgres/tmp_install/bin/pg_resetwal" -V 2>&1" > > check for "D:/Project/pg1/postgres/tmp_install/bin/pg_resetwal" > failed: cannot execute > Failure, exiting > > Further observations: > 1. To make sure the forked process completes before fgets(), I tested > with Sleep(100) before fgets() call. > ... > ... > if ((pgver = popen(cmd, "r")) == NULL) > { > perror("popen failure"); > return NULL; > } > > errno = 0; > Sleep(100); > if (fgets(line, maxsize, pgver) == NULL) > { > if (feof(pgver)) > fprintf(stderr, "no data was returned by command \"%s\"\n", cmd); > ... > ... > > This also doesn't resolve the issue, the error is still seen intermittently. > > 2. 
Even though fgets() fails, the output is still getting captured in > 'line' string. > Tested with printing the 'line' in case of failure: > ... > ... > if ((pgver = popen(cmd, "r")) == NULL) > { > perror("popen failure"); > return NULL; > } > > errno = 0; > if (fgets(line, maxsize, pgver) == NULL) > { > if (line) > fprintf(stderr, "cmd output - %s\n", line); > > if (feof(pgver)) > fprintf(stderr, "no data was returned by command \"%s\"\n", cmd); > … > … > And the log looks like - > cmd output - postgres (PostgreSQL) 17devel > no data was returned by command > ""D:/Project/pg1/postgres/tmp_install/bin/pg_controldata" -V" > > check for "D:/Project/pg1/postgres/tmp_install/bin/pg_controldata" > failed: cannot execute > Failure, exiting > > Attached test result log for the same - "regress_log_003_logical_slots". The same failure is observed with test 't\002_pg_upgrade.pl' too (intermittently). So, it is not limited to "t/003_logical_slots.pl" test alone. It is more closely associated with the pg_upgrade command run. -- Thanks, Nisha Moond
Re: Intermittent failure with t/003_logical_slots.pl test on windows
On Thu, Nov 2, 2023 at 11:52 AM Kyotaro Horiguchi wrote:
>
> At Tue, 31 Oct 2023 18:11:48 +0530, vignesh C wrote:
> > Few others are also facing this problem with similar code like in:
> > https://stackoverflow.com/questions/15882799/fgets-returning-error-for-file-returned-by-popen
>
> I'm inclined to believe that the pipe won't enter the EOF state until
> the target command terminates (then the top-level cmd.exe). The target
> command likely terminated prematurely due to an error before printing
> any output.
>
> If we append "2>&1" to the command line, we can capture the error
> message through the returned pipe, if any. Such error messages will
> cause the subsequent code to fail with an error such as "unexpected
> string: 'the output'". I'm not sure, but if this is permissive, the
> returned error messages could potentially provide insight into the
> underlying issue, paving the way for a potential solution.

Appending '2>&1' test:
The command still results in NULL and ends up failing because no data is
returned, which means not even an error message is returned. The error
log with '2>&1' appended is:

  no data was returned by command
  ""D:/Project/pg1/postgres/tmp_install/bin/pg_resetwal" -V 2>&1"
  check for "D:/Project/pg1/postgres/tmp_install/bin/pg_resetwal"
  failed: cannot execute
  Failure, exiting

Further observations:

1. To make sure the forked process completes before fgets(), I tested
   with a Sleep(100) call before fgets():

   ...
   if ((pgver = popen(cmd, "r")) == NULL)
   {
       perror("popen failure");
       return NULL;
   }

   errno = 0;
   Sleep(100);
   if (fgets(line, maxsize, pgver) == NULL)
   {
       if (feof(pgver))
           fprintf(stderr, "no data was returned by command \"%s\"\n", cmd);
   ...

   This also doesn't resolve the issue; the error is still seen
   intermittently.

2. Even though fgets() fails, the output is still captured in the
   'line' string. Tested by printing 'line' in the failure case:

   ...
   if ((pgver = popen(cmd, "r")) == NULL)
   {
       perror("popen failure");
       return NULL;
   }

   errno = 0;
   if (fgets(line, maxsize, pgver) == NULL)
   {
       if (line)
           fprintf(stderr, "cmd output - %s\n", line);

       if (feof(pgver))
           fprintf(stderr, "no data was returned by command \"%s\"\n", cmd);
   ...

   And the log looks like:

   cmd output - postgres (PostgreSQL) 17devel
   no data was returned by command
   ""D:/Project/pg1/postgres/tmp_install/bin/pg_controldata" -V"
   check for "D:/Project/pg1/postgres/tmp_install/bin/pg_controldata"
   failed: cannot execute
   Failure, exiting

Attached test result log for the same - "regress_log_003_logical_slots".

Thanks,
Nisha Moond

Attachment: regress_log_003_logical_slots
Intermittent failure with t/003_logical_slots.pl test on windows
Hi hackers,

There is a failure with the 't/003_logical_slots.pl' test during the
upgrade. The failure is intermittent and is observed in the Windows
environment.

Details:

Test - pg_upgrade/t/003_logical_slots.pl
Result -

  t/003_logical_slots.pl .. 5/?
  # Failed test 'run of pg_upgrade of old cluster'
  # at t/003_logical_slots.pl line 165.
  t/003_logical_slots.pl .. 10/?
  # Failed test 'check the slot exists on new cluster'
  # at t/003_logical_slots.pl line 171.
  #          got: ''
  #     expected: 'regress_sub|t'
  # Tests were run but no plan was declared and done_testing() was not seen.
  # Looks like your test exited with 25 just after 11.
  t/003_logical_slots.pl .. Dubious, test returned 25 (wstat 6400, 0x1900)
  Failed 2/11 subtests

  Test Summary Report
  ---
  t/003_logical_slots.pl (Wstat: 6400 (exited 25) Tests: 11 Failed: 2)
    Failed tests:  10-11
    Non-zero exit status: 25
    Parse errors: No plan found in TAP output
  Files=1, Tests=11, 32 wallclock secs ( 0.03 usr + 0.01 sys = 0.05 CPU)
  Result: FAIL

Log attached - 'regress_log_003_logical_slots'.

The failure cause is:

  no data was returned by command
  ""D:/Project/pg1/postgres/tmp_install/bin/pg_resetwal" -V"
  check for "D:/Project/pg1/postgres/tmp_install/bin/pg_resetwal"
  failed: cannot execute
  Failure, exiting
  [16:24:21.144](6.275s) not ok 10 - run of pg_upgrade of old cluster

If the same command is run manually, it succeeds:

  >"D:/Project/pg1/postgres/tmp_install/bin/pg_resetwal" -V
  pg_resetwal (PostgreSQL) 17devel

The same intermittent test failure is also seen with other commands,
such as pg_ctl and pg_dump, as the failure cause while retrieving the
version. For example:

  no data was returned by command
  ""D:/Project/pg1/postgres/tmp_install/bin/pg_dump" -V"
  check for "D:/Project/pg1/postgres/tmp_install/bin/pg_dump"
  failed: cannot execute
  Failure, exiting
  [16:08:50.444](7.434s) not ok 10 - run of pg_upgrade of old cluster

Has anyone come across this issue? I am not sure what the issue is here.
Any thoughts?
Thanks,
Nisha Moond

Attachment: regress_log_003_logical_slots