Hello.

At Mon, 21 May 2018 05:18:57 -0700 (MST), greigwise <greigw...@comcast.net> wrote in <1526905137308-0.p...@n3.nabble.com>
> Hello.
>
> We are on PostgreSQL version 9.6.6. We have 2 EC2 instances in different
> Amazon regions and we are doing physical replication via VPN. It all seems
> to work just fine most of the time. I'm noticing in the logs that we have
> recurring errors (maybe 10 or 12 times per day) that look like this:

<following is digested>
> 2018-05-17 06:36:14 UTC 5af0599f.210d LOG: invalid resource manager ID 49
> 2018-05-17 06:36:14 UTC 5afd22de.7ac4 LOG: started streaming WAL from
> 2018-05-17 07:20:17 UTC 5afd22de.7ac4 FATAL: could not receive data from
>   WAL stream: server closed the connection unexpectedly
>
> Or some that also look like this:
>
> 2018-05-17 07:20:17 UTC 5af0599f.210d LOG: record with incorrect prev-link
> 2018-05-17 07:20:18 UTC 5afd2d31.1889 LOG: started streaming WAL from
> 2018-05-17 08:03:28 UTC 5afd2d31.1889 FATAL: could not receive data from
>   WAL stream: server closed the connection unexpectedly
>
> And some like this:
>
> 2018-05-17 23:00:13 UTC 5afd63ec.26fc LOG: invalid magic number 0000 in
>   log segment 00000001000003850000003C, offset 10436608
> 2018-05-17 23:00:14 UTC 5afe097d.49aa LOG: started streaming WAL from
>   primary at 385/3C000000 on timeline 1

Your replication connection seems quite unstable and is being disconnected frequently. After each disconnection you will see several kinds of "I found a broken record in my WAL file" message; they are the cue for the standby to switch back to streaming. That is in itself normal operation for PostgreSQL, with one known exception.

> Then, like maybe once every couple months or so, we have a crash with logs
> looking like this:
>
> 2018-05-17 08:03:28 UTC hireology 5af47b75.2670 hireology WARNING:
>   terminating connection because of crash of another server process

I think those lines follow an error message like "FATAL: invalid memory alloc request size 3075129344". That is also a kind of "broken record", but one that is known to crash the standby. It is discussed here:

[bug fix] Cascaded standby cannot start after a clean shutdown
https://www.postgresql.org/message-id/flat/0A3221C70F24FB45833433255569204D1F8B57AD@G01JPEXMBYT05#0A3221C70F24FB45833433255569204D1F8B57AD@G01JPEXMBYT05

> When this last error occurs, the recovery is to go on the replica and remove
> all the WAL logs from the pg_xlog directory and then restart PostgreSQL.
> Everything seems to recover and come up fine. I've done some tests
> comparing counts between the replica and the primary and everything seems
> synced just fine from all I can tell.

Those are the right recovery steps, as far as I can tell from the attached log messages.
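For the archives, a minimal sketch of that procedure on the standby (the data directory path is an assumption for a typical 9.6 installation; adjust it for your system):

    # Stop the standby first; PGDATA is assumed to be
    # /var/lib/postgresql/9.6/main.
    pg_ctl -D /var/lib/postgresql/9.6/main stop -m fast

    # Remove the locally retained (possibly broken) WAL segment files;
    # only the segment files, not the archive_status subdirectory.
    rm /var/lib/postgresql/9.6/main/pg_xlog/0000*

    # On restart the standby re-fetches the WAL it needs from the primary.
    pg_ctl -D /var/lib/postgresql/9.6/main start

Note that this works only while the primary (or your WAL archive) can still supply the segments the standby needs; if they are gone, you would have to rebuild the standby with pg_basebackup instead.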
> So, a couple of questions. 1) Should I be worried that my replica is
> corrupt in some way or given that everything *seems* ok, is it reasonable to
> believe that things are working correctly in spite of these errors being
> reported. 2) Is there something I should configure differently to avoid
> some of these errors?

It doesn't seem worth worrying about from the viewpoint of data integrity, but if the walsender/walreceiver timeouts are firing too frequently, you might need to increase them for better stability.

> Thanks in advance for any help.
>
> Greig Wise

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
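P.S. Raising those timeouts would look something like this in postgresql.conf (the 120s values are only an illustration, not a recommendation):

    # postgresql.conf on the primary
    wal_sender_timeout = 120s      # default is 60s

    # postgresql.conf on the standby
    wal_receiver_timeout = 120s    # default is 60s

Both settings take effect on a reload (pg_ctl reload); no restart is needed.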