Re: [GENERAL] Standby crash

2016-08-12 Thread Adrian Klaver

On 08/11/2016 11:28 PM, James Sewell wrote:

Hello,

We recently experienced a critical failure when failing to a DR environment.

This is in the following environment:


What is the version of Postgres used?

What is the OS and version used?



  * 3 x PostgreSQL machines in Prod in a sync replication cluster


Is there a connection between the above and the below?

Or can you provide a schematic representation of both?



  * 3 x PostgreSQL machines in DR, with a single machine async and the
other two cascading from the first machine.



The key seems to be this:
"The postmaster has commanded this server process to roll back the 
current transaction and exit, because another server process exited 
abnormally and possibly corrupted shared memory."


So is that appearing on all the machines in the DR setup?

Do you have other Postgres instances on those machines?

What do the OS logs show at the same time?




There was network failure which isolated Production from everything
else, Production has no errors during this time (and has now come back OK).

DR did not tolerate the break, the following appeared in the logs and
none of them can start postgres. There were no queries coming into DR at
the time of the break.

Please note that the "Host Key verification failed" messages are due to
the scp command not functioning. This means restore_command is not
working to restore from the XLOG archive, but should not effect anything
else.

2016-08-12 02:52:55 GMT [23205]: [9077-1] user=,db=,client=
 (0:0)LOG:  restartpoint starting: time
2016-08-12 02:57:25 GMT [23205]: [9078-1] user=,db=,client=
 (0:0)LOG:  restartpoint complete: wrote 13361 buffers (0.6%); 0
transaction log file(s) added, 0 removed, 3 recycled; write=269.972 s,
sync=0.031 s, total=270.008 s; sync files=41, longest=0.004 s,
average=0.000 s; distance=37747 kB, estimate=48968 kB
2016-08-12 02:57:25 GMT [23205]: [9079-1] user=,db=,client=
 (0:0)LOG:  recovery restart point at 3/7C7B0228
2016-08-12 02:57:25 GMT [23205]: [9080-1] user=,db=,client=
 (0:0)DETAIL:  last completed transaction was at log time 2016-08-12
02:57:24.033588+00
2016-08-12 02:57:55 GMT [23205]: [9081-1] user=,db=,client=
 (0:0)LOG:  restartpoint starting: time
2016-08-12 03:02:25 GMT [23205]: [9082-1] user=,db=,client=
 (0:0)LOG:  restartpoint complete: wrote 7196 buffers (0.3%); 0
transaction log file(s) added, 0 removed, 2 recycled; write=269.988 s,
sync=0.014 s, total=270.007 s; sync files=42, longest=0.002 s,
average=0.000 s; distance=79888 kB, estimate=79888 kB
2016-08-12 03:02:25 GMT [23205]: [9083-1] user=,db=,client=
 (0:0)LOG:  recovery restart point at 3/815B42C8
2016-08-12 03:02:25 GMT [23205]: [9084-1] user=,db=,client=
 (0:0)DETAIL:  last completed transaction was at log time 2016-08-12
03:02:15.07639+00
2016-08-12 03:02:55 GMT [23205]: [9085-1] user=,db=,client=
 (0:0)LOG:  restartpoint starting: time
2016-08-12 03:04:00 GMT [22350]: [2-1] user=,db=,client=
 (0:XX000)FATAL:  terminating walreceiver due to timeout
Host key verification failed.^M
Host key verification failed.^M
2016-08-12 03:04:00 GMT [23188]: [9-1] user=,db=,client=
 (0:XX000)FATAL:  invalid memory alloc request size 3445219328
2016-08-12 03:04:01 GMT [23182]: [5-1] user=,db=,client=  (0:0)LOG:
 startup process (PID 23188) exited with exit code 1
2016-08-12 03:04:01 GMT [23182]: [6-1] user=,db=,client=  (0:0)LOG:
 terminating any other active server processes
2016-08-12 03:04:01 GMT [22285]: [1-1]
user=replication,db=[unknown],client=10.8.0.14 10.8.0.14(55826)
(0:57P02)WARNING:  terminating connection because of crash of another
server process
2016-08-12 03:04:01 GMT [22285]: [2-1]
user=replication,db=[unknown],client=10.8.0.14 10.8.0.14(55826)
(0:57P02)DETAIL:  The postmaster has commanded this server process to
roll back the current transaction and exit, because another server
process exited abnormally and possibly corrupted shared memory.
2016-08-12 03:04:01 GMT [22285]: [3-1]
user=replication,db=[unknown],client=10.8.0.14 10.8.0.14(55826)
(0:57P02)HINT:  In a moment you should be able to reconnect to the
database and repeat your command.
2016-08-12 03:04:01 GMT [22286]: [1-1]
user=replication,db=[unknown],client=10.8.0.32 10.8.0.32(56442)
(0:57P02)WARNING:  terminating connection because of crash of another
server process
2016-08-12 03:04:01 GMT [22286]: [2-1]
user=replication,db=[unknown],client=10.8.0.32 10.8.0.32(56442)
(0:57P02)DETAIL:  The postmaster has commanded this server process to
roll back the current transaction and exit, because another server
process exited abnormally and possibly corrupted shared memory.
2016-08-12 03:04:01 GMT [22286]: [3-1]
user=replication,db=[unknown],client=10.8.0.32 10.8.0.32(56442)
(0:57P02)HINT:  In a moment you should be able to reconnect to the
database and repeat your command.
2016-08-12 03:04:01 GMT [23182]: [7-1] user=,db=,client=  (0:0)LOG:
 archiver process (PID 23207) exited with exit code 1

[GENERAL] Standby crash

2016-08-12 Thread James Sewell
Hello,

We recently experienced a critical failure when failing to a DR environment.

This is in the following environment:


   - 3 x PostgreSQL machines in Prod in a sync replication cluster
   - 3 x PostgreSQL machines in DR, with a single machine async and the
   other two cascading from the first machine.

There was network failure which isolated Production from everything else,
Production has no errors during this time (and has now come back OK).

DR did not tolerate the break, the following appeared in the logs and none
of them can start postgres. There were no queries coming into DR at the
time of the break.

Please note that the "Host Key verification failed" messages are due to the
scp command not functioning. This means restore_command is not working to
restore from the XLOG archive, but should not effect anything else.

2016-08-12 02:52:55 GMT [23205]: [9077-1] user=,db=,client=  (0:0)LOG:
 restartpoint starting: time
2016-08-12 02:57:25 GMT [23205]: [9078-1] user=,db=,client=  (0:0)LOG:
 restartpoint complete: wrote 13361 buffers (0.6%); 0 transaction log
file(s) added, 0 removed, 3 recycled; write=269.972 s, sync=0.031 s,
total=270.008 s; sync files=41, longest=0.004 s, average=0.000 s;
distance=37747 kB, estimate=48968 kB
2016-08-12 02:57:25 GMT [23205]: [9079-1] user=,db=,client=  (0:0)LOG:
 recovery restart point at 3/7C7B0228
2016-08-12 02:57:25 GMT [23205]: [9080-1] user=,db=,client=
 (0:0)DETAIL:  last completed transaction was at log time 2016-08-12
02:57:24.033588+00
2016-08-12 02:57:55 GMT [23205]: [9081-1] user=,db=,client=  (0:0)LOG:
 restartpoint starting: time
2016-08-12 03:02:25 GMT [23205]: [9082-1] user=,db=,client=  (0:0)LOG:
 restartpoint complete: wrote 7196 buffers (0.3%); 0 transaction log
file(s) added, 0 removed, 2 recycled; write=269.988 s, sync=0.014 s,
total=270.007 s; sync files=42, longest=0.002 s, average=0.000 s;
distance=79888 kB, estimate=79888 kB
2016-08-12 03:02:25 GMT [23205]: [9083-1] user=,db=,client=  (0:0)LOG:
 recovery restart point at 3/815B42C8
2016-08-12 03:02:25 GMT [23205]: [9084-1] user=,db=,client=
 (0:0)DETAIL:  last completed transaction was at log time 2016-08-12
03:02:15.07639+00
2016-08-12 03:02:55 GMT [23205]: [9085-1] user=,db=,client=  (0:0)LOG:
 restartpoint starting: time
2016-08-12 03:04:00 GMT [22350]: [2-1] user=,db=,client=  (0:XX000)FATAL:
 terminating walreceiver due to timeout
Host key verification failed.^M
Host key verification failed.^M
2016-08-12 03:04:00 GMT [23188]: [9-1] user=,db=,client=  (0:XX000)FATAL:
 invalid memory alloc request size 3445219328
2016-08-12 03:04:01 GMT [23182]: [5-1] user=,db=,client=  (0:0)LOG:
 startup process (PID 23188) exited with exit code 1
2016-08-12 03:04:01 GMT [23182]: [6-1] user=,db=,client=  (0:0)LOG:
 terminating any other active server processes
2016-08-12 03:04:01 GMT [22285]: [1-1]
user=replication,db=[unknown],client=10.8.0.14
10.8.0.14(55826) (0:57P02)WARNING:  terminating connection because of crash
of another server process
2016-08-12 03:04:01 GMT [22285]: [2-1]
user=replication,db=[unknown],client=10.8.0.14
10.8.0.14(55826) (0:57P02)DETAIL:  The postmaster has commanded this server
process to roll back the current transaction and exit, because another
server process exited abnormally and possibly corrupted shared memory.
2016-08-12 03:04:01 GMT [22285]: [3-1]
user=replication,db=[unknown],client=10.8.0.14
10.8.0.14(55826) (0:57P02)HINT:  In a moment you should be able to
reconnect to the database and repeat your command.
2016-08-12 03:04:01 GMT [22286]: [1-1]
user=replication,db=[unknown],client=10.8.0.32
10.8.0.32(56442) (0:57P02)WARNING:  terminating connection because of crash
of another server process
2016-08-12 03:04:01 GMT [22286]: [2-1]
user=replication,db=[unknown],client=10.8.0.32
10.8.0.32(56442) (0:57P02)DETAIL:  The postmaster has commanded this server
process to roll back the current transaction and exit, because another
server process exited abnormally and possibly corrupted shared memory.
2016-08-12 03:04:01 GMT [22286]: [3-1]
user=replication,db=[unknown],client=10.8.0.32
10.8.0.32(56442) (0:57P02)HINT:  In a moment you should be able to
reconnect to the database and repeat your command.
2016-08-12 03:04:01 GMT [23182]: [7-1] user=,db=,client=  (0:0)LOG:
 archiver process (PID 23207) exited with exit code 1
2016-08-12 04:43:51 GMT [23614]: [1-1] user=,db=,client=  (0:0)LOG:
 database system was interrupted while in recovery at log time 2016-08-12
02:53:19 GMT
2016-08-12 04:43:51 GMT [23614]: [2-1] user=,db=,client=  (0:0)HINT:
 If this has occurred more than once some data might be corrupted and you
might need to choose an earlier recovery target.
2016-08-12 04:43:51 GMT [23615]: [1-1] user=postgres,db=postgres,client=[local]
[local] (0:57P03)FATAL:  the database system is starting up
Host key verification failed.^M
2016-08-12 04:43:51 GMT [23614]: [3-1] user=,db=,client=  (0:0)LOG:
 entering standby