Re: [HACKERS] Logical replication existing data copy

Erik Rijkers Wed, 29 Mar 2017 01:15:07 -0700

On 2017-03-09 11:06, Erik Rijkers wrote:


I use three different machines (2 desktop, 1 server) to test logical
replication, and all three have now at least once failed to correctly
synchronise a pgbench session (amidst many successful runs, of course)



(At the moment using these patches for tests:)

0001-Reserve-global-xmin-for-create-slot-snasphot-export.patch +
0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch+
0003-Prevent-snapshot-builder-xmin-from-going-backwards.patch  +
0004-Fix-xl_running_xacts-usage-in-snapshot-builder.patch      +
0005-Skip-unnecessary-snapshot-builds.patch                    +

The failed tests that I kept seeing (see the pgbench-over-logical-replication tests upthread) were never really 'solved'.

But I have now finally figured out what caused these unexpected failed tests: it was wal_sender_timeout or rather, its default of 60 s.

This caused 'terminating walsender process due to replication timeout' on the primary (not strictly an error), and the concomitant ERROR on the replica: 'could not receive data from WAL stream: server closed the connection unexpectedly'.

Here is a typical example (primary/replica logs time-intertwined, with each line marked 'primary' or 'replica'):


[...]

2017-03-24 16:21:38.129 CET [15002] primary LOG: using stale statistics instead of current ones because stats collector is not responding
2017-03-24 16:21:42.690 CET [27515] primary LOG: using stale statistics instead of current ones because stats collector is not responding
2017-03-24 16:21:42.965 CET [14999] replica LOG: using stale statistics instead of current ones because stats collector is not responding
2017-03-24 16:21:49.816 CET [14930] primary LOG: terminating walsender process due to
2017-03-24 16:21:49.817 CET [14926] replica ERROR: could not receive data from WAL stream: server closed the connection unexpectedly
2017-03-24 16:21:49.824 CET [27502] replica LOG: worker process: logical replication worker for subscription 24864 (PID 14926) exited with exit code 1
2017-03-24 16:21:49.824 CET [27521] replica LOG: starting logical replication worker for subscription "sub1"
2017-03-24 16:21:49.828 CET [15008] replica LOG: logical replication apply for subscription sub1 started
2017-03-24 16:21:49.832 CET [15009] primary LOG: received replication command: IDENTIFY_SYSTEM
2017-03-24 16:21:49.832 CET [15009] primary LOG: received replication command: START_REPLICATION SLOT "sub1" LOGICAL 3/FC976440 (proto_version '1', publication_names '"pub1"')
2017-03-24 16:21:49.833 CET [15009] primary DETAIL: streaming transactions committing after 3/FC889810, reading WAL from 3/FC820FC0
2017-03-24 16:21:49.833 CET [15009] primary LOG: starting logical decoding for slot "sub1"
2017-03-24 16:21:50.471 CET [15009] primary DETAIL: Logical decoding will begin using saved snapshot.
2017-03-24 16:21:50.471 CET [15009] primary LOG: logical decoding found consistent point at 3/FC820FC0
2017-03-24 16:21:51.169 CET [15008] replica DETAIL: Key (hid)=(9014) already exists.
2017-03-24 16:21:51.169 CET [15008] replica ERROR: duplicate key value violates unique constraint "pgbench_history_pkey"
2017-03-24 16:21:51.170 CET [27502] replica LOG: worker process: logical replication worker for subscription 24864 (PID 15008) exited with exit code 1
2017-03-24 16:21:51.170 CET [27521] replica LOG: starting logical replication worker for subscription "sub1"

[...]

My primary and replica were always on a single machine (making it more likely that that timeout is reached?). In my testing it seems that reaching the timeout on the primary (and 'closing the connection unexpectedly' on the replica) does not necessarily break the logical replication. But almost all log-rep failures that I have seen were started by this sequence of events.
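
A minimal sketch for spotting this sequence in a merged log, assuming one entry per line in the format of the excerpt above (timestamp, time zone, [pid], a 'primary'/'replica' tag, severity, message). The function name, the regular expression and the 30-second window are illustrative assumptions, not part of PostgreSQL or of the test setup:

```python
import re
from datetime import datetime, timedelta

# Assumed line format (taken from the log excerpt in this mail), e.g.
# 2017-03-24 16:21:49.816 CET [14930] primary LOG: terminating walsender ...
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) \S+ "
    r"\[(?P<pid>\d+)\] (?P<node>primary|replica) "
    r"(?P<level>[A-Z]+):\s+(?P<msg>.*)$"
)


def parse(line):
    """Return (timestamp, node, level, message), or None if the line does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    # The time zone abbreviation is ignored; timestamps are compared as naive values.
    ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S.%f")
    return ts, m.group("node"), m.group("level"), m.group("msg")


def find_timeout_then_duplicate(lines, window=timedelta(seconds=30)):
    """Find walsender terminations on the primary that are followed, within
    `window` (an arbitrary choice), by a duplicate-key ERROR on the replica."""
    hits = []
    last_timeout = None
    for line in lines:
        rec = parse(line)
        if rec is None:
            continue
        ts, node, level, msg = rec
        if node == "primary" and msg.startswith("terminating walsender process"):
            last_timeout = ts
        elif (
            node == "replica"
            and level == "ERROR"
            and msg.startswith("duplicate key value")
            and last_timeout is not None
            and ts - last_timeout <= window
        ):
            hits.append((last_timeout, ts))
            last_timeout = None
    return hits
```

Called on the lines of a merged log file, it returns a list of (termination time, duplicate-key time) pairs; an empty list means the sequence did not occur within the chosen window.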

After setting wal_sender_timeout to 3 minutes there were no more failed tests.
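
For reference, that change amounts to one line in postgresql.conf on the primary. This is only a sketch of the setting used in these tests, not a recommended value:

```
# postgresql.conf on the primary (publisher)
wal_sender_timeout = 3min    # default is 60s
```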

Perhaps it warrants setting wal_sender_timeout a bit higher than the current default of 60 seconds? After all, I also saw the 'replication timeout' / 'closed the connection' couple rather often during non-failing tests. (These also disappeared, almost completely, with a higher setting of wal_sender_timeout.)

In any case it would be good to mention the setting (and its potentially deteriorating effect) somewhere nearer the logical replication treatment.

(I read about wal_sender_timeout and the keepalive ping; perhaps there's (still) something amiss there? Just a guess, I don't know.)

As I said, I saw no more failures with the higher 3 minute setting, with one exception: the one test that straddled the DST change (Saturday 24 March 02:00 h). I am happy to discount that one failure, but strictly speaking I suppose it should be able to take DST in its stride.



Thanks,

Erik Rijkers










