Sorry, meant to reply all. On Thu, Aug 10, 2017 at 2:19 PM, Vladimir Borodin <r...@simply.name> wrote:
> Hi, Chris. > > 10 авг. 2017 г., в 15:09, Chris Travers <chris.trav...@gmail.com> > написал(а): > > Hi; > > I ran into a funny situation today regarding PostgreSQL replication and > wal corruption and wanted to go over what I think happened and what I > wonder about as a possible solution. > > Basic information is custom-build PostgreSQL 9.6.3 on Gentoo, on a ~5TB > database with variable load. Master database has two slaves and generates > 10-20MB of WAL traffic a second. The data_checksum option is off. > > > The problem occurred when I attempted to restart the service on the slave > using pg_ctl (I believe the service had been started with sys V init > scripts). On trying to restart, it gave me a nice "Invalid memory > allocation request" error and promptly stopped. > > The main logs showed a lot of messages like before the restart: > 2017-08-02 11:47:33 UTC LOG: PID 19033 in cancel request did not match > any process > 2017-08-02 11:47:33 UTC LOG: PID 19032 in cancel request did not match > any process > 2017-08-02 11:47:33 UTC LOG: PID 19024 in cancel request did not match > any process > 2017-08-02 11:47:33 UTC LOG: PID 19034 in cancel request did not match > any process > > On restart, the following was logged to stderr: > LOG: entering standby mode > LOG: redo starts at 1E39C/8B77B458 > LOG: consistent recovery state reached at 1E39C/E1117FF8 > FATAL: invalid memory alloc request size 3456458752 > LOG: startup process (PID 18167) exited with exit code 1 > LOG: terminating any other active server processes > > LOG: database system is shut down > > After some troubleshooting I found that the wal segment had become > corrupt, I copied the correct one from the master and everything came up to > present. > > So It seems like somewhere something crashed big time on the back-end and > when we tried to restart, the wal ended in an invalid way. > > > We have reported the same thing [1] nearly a year ago. Could you please > check with pg_xlogdump that both WALs (normal from master and corrupted) > are exactly the same until some certain LSN? > > [1] https://www.postgresql.org/message-id/20160614103415. > 5796.6885%40wrigleys.postgresql.org > Yes. Exactly the same output until a certain point where pg_xlogdump dies with an error. That is the same LSN where PostgreSQL died with an error on restart. > > > I am wondering what can be done to prevent these sorts of things from > happening in the future if, for example, a replica dies in the middle of a > wal fsync. > -- > Best Wishes, > Chris Travers > > Efficito: Hosted Accounting and ERP. Robust and Flexible. No vendor > lock-in. > http://www.efficito.com/learn_more > > > > -- > May the force be with you… > https://simply.name > > -- Best Wishes, Chris Travers Efficito: Hosted Accounting and ERP. Robust and Flexible. No vendor lock-in. http://www.efficito.com/learn_more