During internal tests, we observed that the checkpointer crashes on the slave on Windows, with the following log on the slave:

LOG:  checkpointer process (PID 4040) was terminated by exception 0xC0000005
HINT:  See C include file "ntstatus.h" for a description of the hexadecimal value.
LOG:  terminating any other active server processes

I debugged this and found that it happens when the checkpointer tries to update the shared memory config. The call stack is below:

> postgres.exe!LWLockAcquireCommon(LWLock * l=0x0000000000000000, LWLockMode mode=LW_EXCLUSIVE, unsigned __int64 * valptr=0x0000000000000020, unsigned __int64 val=18446744073709551615) Line 579 + 0x14 bytes C
  postgres.exe!LWLockAcquireWithVar(LWLock * l=0x0000000000000000, unsigned __int64 * valptr=0x0000000000000020, unsigned __int64 val=18446744073709551615) Line 510 C
  postgres.exe!WALInsertLockAcquireExclusive() Line 1627 C
  postgres.exe!UpdateFullPageWrites() Line 9037 C
  postgres.exe!UpdateSharedMemoryConfig() Line 1364 C
  postgres.exe!CheckpointerMain() Line 359 C
  postgres.exe!AuxiliaryProcessMain(int argc=2, char * * argv=0x00000000007d2180) Line 427 C
  postgres.exe!SubPostmasterMain(int argc=4, char * * argv=0x00000000007d2170) Line 4635 C
  postgres.exe!main(int argc=4, char * * argv=0x00000000007d2170) Line 207 C

Basically, the issue is that during startup, when the checkpointer tries to acquire the WAL Insertion Locks to update the value of fullPageWrites, it crashes because the locks are not yet initialized. They are initialized in InitXLOGAccess(), which gets called via RecoveryInProgress() in case recovery is in progress, before the actual checkpoint is done. However, we try to access them before that, which leads to the crash.

I think the reason it occurs only on Windows is that on Linux, fork ensures that the WAL Insertion Locks are initialized with the same values as in the postmaster.

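As a rough illustration of the failure mode described above, here is a minimal, self-contained C sketch. It is not PostgreSQL code: DemoLock, demo_locks, demo_init_access(), demo_simulate_fresh_process() and demo_acquire() are hypothetical names invented for this example. It assumes only that the lock array is reached through a process-local pointer that stays NULL until an initialization routine (playing the role of InitXLOGAccess()) has run in that process.

```c
#include <stddef.h>

/* Hypothetical stand-in for a WAL insertion lock; not the PostgreSQL type. */
typedef struct DemoLock
{
    int held;
} DemoLock;

/* Stands in for the lock array living in shared memory. */
static DemoLock demo_shared_area[4];

/* Process-local pointer: NULL until demo_init_access() runs in this process. */
static DemoLock *demo_locks = NULL;

/* Plays the role of InitXLOGAccess(): sets up the process-local pointer. */
void
demo_init_access(void)
{
    demo_locks = demo_shared_area;
}

/*
 * Simulates a child that is started as a fresh process rather than forked:
 * nothing is inherited from the parent, so the pointer is NULL again.
 */
void
demo_simulate_fresh_process(void)
{
    demo_locks = NULL;
}

/*
 * Returns 0 on success, or -1 if the pointer was never initialized.
 * The -1 case marks the spot where unguarded code would dereference NULL
 * and die with an access violation (0xC0000005 on Windows).
 */
int
demo_acquire(void)
{
    if (demo_locks == NULL)
        return -1;
    demo_locks[0].held = 1;
    return 0;
}
```

Calling demo_acquire() right after demo_simulate_fresh_process() returns -1, while calling demo_init_access() first makes it return 0. A forked child would start with whatever value the pointer had in the parent, which is presumably why the problem does not show up on Linux.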
To fix this issue, we need to ensure that the WAL Insertion Locks are initialized before we use them. One way is to call InitXLOGAccess() before calling CheckpointerMain(), as I have done in the attached patch; another would be to call RecoveryInProgress() much earlier in the path than now.

Steps to reproduce the issue
-------------------------------------------
On Master
a. Change the below parameters in postgresql.conf
   wal_level = archive
   archive_mode = on
   archive_command = 'copy "%p" "c:\\Users\\PostgreSQL\9.4\\bin\\archive\\%f"'
   archive_timeout = 10
b. Change pg_hba.conf to accept connections from the slave
c. Start the server
d. Connect to the server and start an online backup
   psql.exe -p 5432 -c "select pg_start_backup('label-1')"; postgres
e. Create the slave directory by copying everything from the master
f. Remove postmaster.pid from the slave directory
g. Change the port on the slave
h. Create recovery.conf with the below parameters on the slave:
   standby_mode=on
   restore_command = 'copy "c:\\Users\\PostgreSQL\9.4\\bin\\archive\\%f" "%p"'
i. Stop the online backup on the master
   psql.exe -p 5432 -c "select pg_stop_backup('1')"; postgres
j. Start the slave and you can observe the below logs:
   LOG:  checkpointer process (PID 4040) was terminated by exception 0xC0000005
   HINT:  See C include file "ntstatus.h" for a description of the hexadecimal value.

Comments?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

fix_checkpointer_crash_on_slave_v1.patch