On 2017-03-09 11:06, Erik Rijkers wrote:

I use three different machines (2 desktop, 1 server) to test logical
replication, and all three have now at least once failed to correctly
synchronise a pgbench session (amidst many succesful runs, of course)



(At the moment using tese patches for tests:)

0001-Reserve-global-xmin-for-create-slot-snasphot-export.patch +
0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch+
0003-Prevent-snapshot-builder-xmin-from-going-backwards.patch  +
0004-Fix-xl_running_xacts-usage-in-snapshot-builder.patch      +
0005-Skip-unnecessary-snapshot-builds.patch                    +


The failed tests that I kept seeing (see the pgbench-over-logical-replication tests upthread) were never really 'solved'.


But I have now finally figured out what caused these unexpected failed tests: it was wal_sender_timeout or rather, its default of 60 s.

This caused 'terminating walsender process due to replication timeout' on the primary (not strictly an error), and the concomittant ERROR on the replica: 'could not receive data from WAL stream: server closed the connection unexpectedly'.

here is a typical example (primary/replica logs time-intertwined, with 'primary'):

[...]
2017-03-24 16:21:38.129 CET [15002] primary LOG: using stale statistics instead of current ones because stats collector is not responding 2017-03-24 16:21:42.690 CET [27515] primary LOG: using stale statistics instead of current ones because stats collector is not responding 2017-03-24 16:21:42.965 CET [14999] replica LOG: using stale statistics instead of current ones because stats collector is not responding 2017-03-24 16:21:49.816 CET [14930] primary LOG: terminating walsender process due to 2017-03-24 16:21:49.817 CET [14926] replica ERROR: could not receive data from WAL stream: server closed the connection unexpectedly 2017-03-24 16:21:49.824 CET [27502] replica LOG: worker process: logical replication worker for subscription 24864 (PID 14926) exited with exit code 1 2017-03-24 16:21:49.824 CET [27521] replica LOG: starting logical replication worker for subscription "sub1" 2017-03-24 16:21:49.828 CET [15008] replica LOG: logical replication apply for subscription sub1 started 2017-03-24 16:21:49.832 CET [15009] primary LOG: received replication command: IDENTIFY_SYSTEM 2017-03-24 16:21:49.832 CET [15009] primary LOG: received replication command: START_REPLICATION SLOT "sub1" LOGICAL 3/FC976440 (proto_version '1', publication_names '"pub1"') 2017-03-24 16:21:49.833 CET [15009] primary DETAIL: streaming transactions committing after 3/FC889810, reading WAL from 3/FC820FC0 2017-03-24 16:21:49.833 CET [15009] primary LOG: starting logical decoding for slot "sub1" 2017-03-24 16:21:50.471 CET [15009] primary DETAIL: Logical decoding will begin using saved snapshot. 2017-03-24 16:21:50.471 CET [15009] primary LOG: logical decoding found consistent point at 3/FC820FC0 2017-03-24 16:21:51.169 CET [15008] replica DETAIL: Key (hid)=(9014) already exists. 2017-03-24 16:21:51.169 CET [15008] replica ERROR: duplicate key value violates unique constraint "pgbench_history_pkey" 2017-03-24 16:21:51.170 CET [27502] replica LOG: worker process: logical replication worker for subscription 24864 (PID 15008) exited with exit code 1 2017-03-24 16:21:51.170 CET [27521] replica LOG: starting logical replication worker for subscription "sub1"
[...]

My primary and replica were always on a single machine (making it more likely that that timeout is reached?). In my testing it seems that reaching the timeout on the primary (and 'closing the connection unexpectedly' on the replica) does not necessarily break the logical replication. But almost all log-rep failures that I have seen were started by this sequence of events.

After setting wal_sender_timeout to 3 minutes there were no more failed tests.

Perhaps it warrants setting wal_sender_timeout a bit higher than the current default of 60 seconds? After all I also saw the 'replication timeout' / 'closed the connection' couple rather often during not-failing tests. (These also disappeared, almost completely, with a higher setting of wal_sender_timeout)

In any case it would be good to mention the setting (and its potentially deteriorating effect) somehere nearer the logical replication treatment.

( I read about wal_sender_timeout and keepalive ping, perhaps there's (still) something amiss there? Just a guess, I don't know )

As I said, I saw no more failures with the higher 3 minute setting, with one exception: the one test that straddled the DST change (saterday 24 march 02:00 h). I am happy to discount that one failure but strictly speaking I suppose it should be able to take DST into its stride.


Thanks,

Erik Rijkers











--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to