On 2017-05-05 02:00, Andres Freund wrote:

Could you have a look?

Running tests with these three patches:

0001-WIP-Fix-off-by-one-around-GetLastImportantRecPtr.patch+
0002-WIP-Possibly-more-robust-snapbuild-approach.patch     +
fix-statistics-reporting-in-logical-replication-work.patch
    (on top of 44c528810)

I test by 15-minute pgbench runs while there is a logical replication connection. Primary and replica are on the same machine.

I have seen errors on 3 different machines (where error means: at least 1 of the 4 pgbench tables is not md5-equal). It seems better, faster machines yield less errors.

Normally I see in pg_stat_replication (on master) one process in state 'streaming'.

pid | wal | replay_loc | diff | state | app | sync_state 16495 | 11/EDBC0000 | 11/EA3FEEE8 | 58462488 | streaming | derail2 | async

Often there are another two processes in pg_stat_replication that remain in state 'startup'.

In the failing sessions the 'streaming'-state process is missing; in failing sessions there are only the two processes that are and remain in 'startup'.

FWIW, below  the output of a succesful and a failed run:


------------------ successful run:
creating tables...
1590400 of 2500000 tuples (63%) done (elapsed 5.34 s, remaining 3.05 s)
2500000 of 2500000 tuples (100%) done (elapsed 9.63 s, remaining 0.00 s)
vacuum...
set primary keys...
done.
create publication pub1 for all tables;
create subscription sub1 connection 'port=6972 application_name=derail2' publication pub1 with (disabled);
alter subscription sub1 enable;
-- pgbench -c 90 -j 8 -T 900 -P 180 -n   --  scale 25
progress: 180.0 s, 82.5 tps, lat 1086.845 ms stddev 3211.785
progress: 360.0 s, 25.4 tps, lat 3469.040 ms stddev 6297.440
progress: 540.0 s, 28.9 tps, lat 3131.438 ms stddev 4288.130
progress: 720.0 s, 27.5 tps, lat 3285.024 ms stddev 4113.841
progress: 900.0 s, 47.2 tps, lat 1896.698 ms stddev 2182.695
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 25
query mode: simple
number of clients: 90
number of threads: 8
duration: 900 s
number of transactions actually processed: 38175
latency average = 2128.606 ms
latency stddev = 3948.634 ms
tps = 42.151205 (including connections establishing)
tps = 42.151589 (excluding connections establishing)
-- waiting 0s... (always)
port | pg_stat_replication | pid | wal | replay_loc | diff | ?column? | state | app | sync_state 6972 | pg_stat_replication | 2545 | 18/432B2180 | 18/432B2180 | 0 | t | streaming | derail2 | async

2017.05.08 23:19:22
-- getting md5 (cb)
6972 a,b,t,h: 2500000 25 250 38175 b2ba48b53 b3788a837 d1afac950 d4abcc72e master 6973 a,b,t,h: 2500000 25 250 38175 b2ba48b53 b3788a837 d1afac950 d4abcc72e replica ok bee2312c7
2017.05.08 23:20:48

port | pg_stat_replication | pid | wal | replay_loc | diff | ?column? | state | app | sync_state 6972 | pg_stat_replication | 2545 | 18/4AEEC8C0 | 18/453FBD20 | 95357856 | f | streaming | derail2 | async
------------------------------------


------------------ failure:
creating tables...
1777100 of 2500000 tuples (71%) done (elapsed 5.06 s, remaining 2.06 s)
2500000 of 2500000 tuples (100%) done (elapsed 7.41 s, remaining 0.00 s)
vacuum...
set primary keys...
done.
create publication pub1 for all tables;
create subscription sub1 connection 'port=6972 application_name=derail2' publication pub1 with (disabled);
alter subscription sub1 enable;
port | pg_stat_replication | pid | wal | replay_loc | diff | ?column? | state | app | sync_state 6972 | pg_stat_replication | 11945 | 18/5E2913D0 | | | | catchup | derail2 | async

-- pgbench -c 90 -j 8 -T 900 -P 180 -n   --  scale 25
progress: 180.0 s, 78.4 tps, lat 1138.348 ms stddev 2884.815
progress: 360.0 s, 69.2 tps, lat 1309.716 ms stddev 2594.231
progress: 540.0 s, 59.0 tps, lat 1519.146 ms stddev 2033.400
progress: 720.0 s, 62.9 tps, lat 1421.854 ms stddev 1775.066
progress: 900.0 s, 57.0 tps, lat 1575.693 ms stddev 1681.800
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 25
query mode: simple
number of clients: 90
number of threads: 8
duration: 900 s
number of transactions actually processed: 58846
latency average = 1378.259 ms
latency stddev = 2304.159 ms
tps = 65.224168 (including connections establishing)
tps = 65.224788 (excluding connections establishing)
-- waiting 0s... (always)
port | pg_stat_replication | pid | wal | replay_loc | diff | ?column? | state | app | sync_state 6972 | pg_stat_replication | 11948 | 18/7469A038 | | | | startup | derail2 | async 6972 | pg_stat_replication | 12372 | 18/7469A038 | | | | startup | derail2 | async

------------------------------------

During my tests, I keep an eye on pg_stat_replication (refreshing every 2s), and when I see those two 'startup'-lines in pg_stat_replication without any 'streaming'-line I know the test is going to fail. I believe this has been true for all failure cases that I've seen (except the much more rare stuck-DROP-SUBSCRIPTION which is mentioned in another thread).

Sorry, I have not been able to get any thing more clear or definitive...


thanks,


Erik Rijkers










--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to