Re: [HACKERS] Funny WAL corruption issue

Vladimir Borodin Thu, 10 Aug 2017 05:20:56 -0700

Hi, Chris.

> 10 авг. 2017 г., в 15:09, Chris Travers <[email protected]> написал(а):
> 
> Hi;
> 
> I ran into a funny situation today regarding PostgreSQL replication and wal 
> corruption and wanted to go over what I think happened and what I wonder 
> about as a possible solution.
> 
> Basic information is custom-build PostgreSQL 9.6.3 on Gentoo, on a ~5TB 
> database with variable load.  Master database has two slaves and generates 
> 10-20MB of WAL traffic a second.  The data_checksum option is off.
> 
> 
> The problem occurred when I attempted to restart the service on the slave 
> using pg_ctl (I believe the service had been started with sys V init 
> scripts).  On trying to restart, it gave me a nice "Invalid memory allocation 
> request" error and promptly stopped.
> 
> The main logs showed a lot of messages like before the restart:
> 2017-08-02 11:47:33 UTC LOG:  PID 19033 in cancel request did not match any 
> process
> 2017-08-02 11:47:33 UTC LOG:  PID 19032 in cancel request did not match any 
> process
> 2017-08-02 11:47:33 UTC LOG:  PID 19024 in cancel request did not match any 
> process
> 2017-08-02 11:47:33 UTC LOG:  PID 19034 in cancel request did not match any 
> process
> 
> On restart, the following was logged to stderr:
> LOG:  entering standby mode
> LOG:  redo starts at 1E39C/8B77B458
> LOG:  consistent recovery state reached at 1E39C/E1117FF8
> FATAL:  invalid memory alloc request size 3456458752
> LOG:  startup process (PID 18167) exited with exit code 1
> LOG:  terminating any other active server processes
> LOG:  database system is shut down
> 
> After some troubleshooting I found that the wal segment had become corrupt, I 
> copied the correct one from the master and everything came up to present.
> 
> So It seems like somewhere something crashed big time on the back-end and 
> when we tried to restart, the wal ended in an invalid way.


We have reported the same thing [1] nearly a year ago. Could you please check 
with pg_xlogdump that both WALs (normal from master and corrupted) are exactly 
the same until some certain LSN?

[1] 
https://www.postgresql.org/message-id/20160614103415.5796.6885%40wrigleys.postgresql.org

> 
> I am wondering what can be done to prevent these sorts of things from 
> happening in the future if, for example, a replica dies in the middle of a 
> wal fsync. 
> -- 
> Best Wishes,
> Chris Travers
> 
> Efficito:  Hosted Accounting and ERP.  Robust and Flexible.  No vendor 
> lock-in.
> http://www.efficito.com/learn_more <http://www.efficito.com/learn_more>

--
May the force be with you…
https://simply.name

Re: [HACKERS] Funny WAL corruption issue

Reply via email to