Re: Conflict detection and logging in logical replication

Nisha Moond Fri, 02 Aug 2024 05:58:41 -0700

Performance tests done on the v8-0001 and v8-0002 patches, available at [1].


The purpose of the performance tests is to measure the impact on
logical replication with track_commit_timestamp enabled, as this
involves fetching the commit_ts data to determine
delete_differ/update_differ conflicts.

Fortunately, we did not see any noticeable overhead from the new
commit_ts fetch and comparison logic. The only notable impact is
potential overhead from logging conflicts if they occur frequently.
Therefore, enabling conflict detection by default seems feasible, and
introducing a new detect_conflict option may not be necessary.

Please refer to the following for detailed test results:

For all the tests, created two server nodes, one publisher and one as
subscriber. Both the nodes are configured with below settings -
  wal_level = logical
  shared_buffers = 40GB
  max_worker_processes = 32
  max_parallel_maintenance_workers = 24
  max_parallel_workers = 32
  synchronous_commit = off
  checkpoint_timeout = 1d
  max_wal_size = 24GB
  min_wal_size = 15GB
  autovacuum = off
~~~~

Test 1: create conflicts on Sub using pgbench.
----------------------------------------------------------------
Setup:
 - Both publisher and subscriber have pgbench tables created as-
      pgbench -p $node1_port postgres -qis 1
 - At Sub, a subscription created for all the changes from Pub node.

Test Run:
 - To test, ran pgbench for 15 minutes on both nodes simultaneously,
which led to concurrent updates and update_differ conflicts on the
Subscriber node.
 Command used to run pgbench on both nodes-
        ./pgbench postgres -p 8833 -c 10 -j 3 -T 300 -P 20

Results:
For each case, note the “tps” and total time taken by the apply-worker
on Sub to apply the changes coming from Pub.

Case1: track_commit_timestamp = off, detect_conflict = off
    Pub-tps = 9139.556405
    Sub-tps = 8456.787967
    Time of replicating all the changes: 19min 28s
Case 2 : track_commit_timestamp = on, detect_conflict = on
    Pub-tps = 8833.016548
    Sub-tps = 8389.763739
    Time of replicating all the changes: 20min 20s
Case3: track_commit_timestamp = on, detect_conflict = off
    Pub-tps = 8886.101726
    Sub-tps = 8374.508017
    Time of replicating all the changes: 19min 35s
Case 4: track_commit_timestamp = off, detect_conflict = on
    Pub-tps = 8981.924596
    Sub-tps = 8411.120808
    Time of replicating all the changes: 19min 27s

**The difference of TPS between each case is small. While I can see a
slight increase of the replication time (about 5%), when enabling both
track_commit_timestamp and detect_conflict.

Test2: create conflict using a manual script
----------------------------------------------------------------
 - To measure the precise time taken by the apply-worker in all cases,
create a test with a table having 10 million rows.
 - To record the total time taken by the apply-worker, dump the
current time in the logfile for apply_handle_begin() and
apply_handle_commit().

Setup:
Pub : has a table ‘perf’ with 10 million rows.
Sub : has the same table ‘perf’ with its own 10 million rows (inserted
by 1000 different transactions). This table is subscribed for all
changes from Pub.

Test Run:
At Pub: run UPDATE on the table ‘perf’ to update all its rows in a
single transaction. (this will lead to update_differ conflict for all
rows on Sub when enabled).
At Sub: record the time(from log file) taken by the apply-worker to
apply all updates coming from Pub.

Results:
Below table shows the total time taken by the apply-worker
(apply_handle_commit Time - apply_handle_begin Time ).
(Two test runs for each of the four cases)

Case1: track_commit_timestamp = off, detect_conflict = off
    Run1 - 2min 42sec 579ms
    Run2 - 2min 41sec 75ms
Case 2 : track_commit_timestamp = on, detect_conflict = on
    Run1 - 6min 11sec 602ms
    Run2 - 6min 25sec 179ms
Case3: track_commit_timestamp = on, detect_conflict = off
    Run1 - 2min 34sec 223ms
    Run2 - 2min 33sec 482ms
Case 4: track_commit_timestamp = off, detect_conflict = on
    Run1 - 2min 35sec 276ms
    Run2 - 2min 38sec 745ms

** In the case-2 when both track_commit_timestamp and detect_conflict
are enabled, the time taken by the apply-worker is ~140% higher.

Test3: Case when no conflict is detected.
----------------------------------------------------------------
To measure the time taken by the apply-worker when there is no
conflict detected. This test is to confirm if the time overhead in
Test1-Case2 is due to the new function GetTupleCommitTs() which
fetches the origin and timestamp information for each row in the table
before applying the update.

Setup:
 - The Publisher and Subscriber both have an empty table to start with.
 - At Sub, the table is subscribed for all changes from Pub.
 - At Pub: Insert 10 million rows and the same will be replicated to
the Sub table as well.

Test Run:
At Pub: run an UPDATE on the table to update all rows in a single
transaction. (This will NOT hit the update_differ on Sub because now
all the tuples have the Pub’s origin).

Results:
Case1: track_commit_timestamp = off, detect_conflict = off
    Run1 - 2min 39sec 261ms
    Run2 - 2min 30sec 95ms
Case 2 : track_commit_timestamp = on, detect_conflict = on
    Run1 - 2min 38sec 985ms
    Run2 - 2min 46sec 624ms
Case3: track_commit_timestamp = on, detect_conflict = off
    Run1 - 2min 59sec 887ms
    Run2 - 2min 34sec 336ms
Case 4: track_commit_timestamp = off, detect_conflict = on
    Run1 - 2min 33sec 477min
    Run2 - 2min 37sec 677ms

Test Summary -
-- The duration for case-2 was reduced to 2-3 minutes, matching the
times of the other cases.
-- The test revealed that the overhead in case-2 was not due to
commit_ts fetching (GetTupleCommitTs).
-- The additional action in case-2 was the error logging of all 10
million update_differ conflicts.
-- To confirm that the additional time was due to logging, I conducted
another test. I removed the "ReportApplyConflict()" call for
update_differ in the code and re-ran test1-case2 (which initially took
~6 minutes). Without conflict logging, the duration was reduced to
"2min 56sec 758 ms".

Test4 - Code Profiling
----------------------------------------------------------------
To narrow down the cause of the time overhead in Test2-case2, did code
profiling patches. Used same setup and test script as Test2.
The overhead in (track_committs=on and detect_conflict=on) case is not
introduced by the commit timestamp fetching(e.g. GetTupleCommitTs).
The main overhead comes from the log reporting which happens when
applying each change:

|--16.57%--ReportApplyConflict
    |--13.17%--errfinish
         --11.53%--EmitErrorReport
             --11.41%--send_message_to_server_log ...
            ...
...
|--0.74%--GetTupleCommitTs"

Thank you Hou-San for helping in Test1 and conducting Test4.

[1] 
https://www.postgresql.org/message-id/OS0PR01MB57162919F1D6C55D82D4D89D94B12%40OS0PR01MB5716.jpnprd01.prod.outlook.com

Thanks,

Nisha

Re: Conflict detection and logging in logical replication

Reply via email to