[HACKERS] Checkpointer crashes on slave in 9.4 on windows

Amit Kapila Mon, 21 Jul 2014 01:17:12 -0700

During internal tests, we observed that the checkpointer
crashes on a slave on Windows, with the following log on
the slave:


LOG:  checkpointer process (PID 4040) was terminated by exception 0xC0000005
HINT:  See C include file "ntstatus.h" for a description of the hexadecimal value.
LOG:  terminating any other active server processes

I debugged this and found that the crash happens when the
checkpointer tries to update the shared memory config. The
call stack is below.

> postgres.exe!LWLockAcquireCommon(LWLock * l=0x0000000000000000, LWLockMode mode=LW_EXCLUSIVE, unsigned __int64 * valptr=0x0000000000000020, unsigned __int64 val=18446744073709551615)  Line 579 + 0x14 bytes C
  postgres.exe!LWLockAcquireWithVar(LWLock * l=0x0000000000000000, unsigned __int64 * valptr=0x0000000000000020, unsigned __int64 val=18446744073709551615)  Line 510 C
  postgres.exe!WALInsertLockAcquireExclusive()  Line 1627 C
  postgres.exe!UpdateFullPageWrites()  Line 9037 C
  postgres.exe!UpdateSharedMemoryConfig()  Line 1364 C
  postgres.exe!CheckpointerMain()  Line 359 C
  postgres.exe!AuxiliaryProcessMain(int argc=2, char * * argv=0x00000000007d2180)  Line 427 C
  postgres.exe!SubPostmasterMain(int argc=4, char * * argv=0x00000000007d2170)  Line 4635 C
  postgres.exe!main(int argc=4, char * * argv=0x00000000007d2170)  Line 207 C

The issue is that during startup, when the checkpointer
tries to acquire the WAL Insertion Locks to update the
value of fullPageWrites, it crashes because the locks are
not yet initialized (note the NULL LWLock pointer,
l=0x0000000000000000, in the call stack above). They are
initialized in InitXLOGAccess(), which gets called via
RecoveryInProgress() when recovery is in progress, before
the actual checkpoint is performed. However, we try to
access them before that, which leads to the crash.
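
The failure mode can be modelled outside PostgreSQL. The sketch below is a toy stand-in, not PostgreSQL source: ToyLock, insert_locks, init_xlog_access() and update_full_page_writes() are hypothetical names that mimic a process-local pointer which is only set by a lazy initializer. The real code dereferences the NULL pointer and crashes; the toy version checks for NULL and returns false instead, so it can be run safely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model, not PostgreSQL source. All names here are hypothetical. */
typedef struct ToyLock
{
    int held;
} ToyLock;

static ToyLock shared_locks[8];        /* stands in for the locks in shared memory */
static ToyLock *insert_locks = NULL;   /* process-local pointer, unset at process start */
static bool full_page_writes = false;

/* Lazy initializer: in this toy model, the only place the pointer is set. */
void init_xlog_access(void)
{
    insert_locks = shared_locks;
}

/*
 * Mimics the update path. Returns false at the point where the real code
 * would dereference a NULL pointer and hit the access violation.
 */
bool update_full_page_writes(bool new_value)
{
    if (insert_locks == NULL)
        return false;                  /* locks not initialized yet */

    assert(insert_locks[0].held == 0); /* the toy lock must be free here */
    insert_locks[0].held = 1;          /* acquire */
    full_page_writes = new_value;
    insert_locks[0].held = 0;          /* release */
    return true;
}
```

Calling update_full_page_writes() before init_xlog_access() takes the failing branch; calling it afterwards succeeds, which is the ordering problem described above.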

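I think the reason it occurs only on Windows is that on
Linux the checkpointer is created by fork(), which ensures
that the WAL Insertion Locks are initialized with the same
values as in the postmaster. On Windows the child is started
as a new process (see SubPostmasterMain in the call stack),
so nothing is inherited that way.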
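
The fork() half of this reasoning can be checked with a small POSIX-only sketch. It is an illustration under assumptions, not PostgreSQL code: shared_area, local_ptr and child_inherits_pointer() are hypothetical names. The parent sets a process-local pointer and then forks; the child reports through its exit status whether it sees the pointer as set.

```c
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Toy model, not PostgreSQL source. All names here are hypothetical. */
static int shared_area[8];
static int *local_ptr = NULL;   /* process-local, like a pointer cached per process */

/*
 * Returns 1 if the forked child sees the pointer the parent set,
 * 0 if it does not, and -1 on error.
 */
int child_inherits_pointer(void)
{
    pid_t pid;
    int status;

    local_ptr = shared_area;    /* parent initializes before forking */

    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(local_ptr != NULL ? 1 : 0);   /* child reports what it sees */

    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

A process that is started from scratch rather than forked begins with local_ptr at its initial value, NULL, and has to run the initialization itself.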

To fix this issue, we need to ensure that the WAL Insertion
Locks are initialized before we use them. One way is to call
InitXLOGAccess() before calling CheckpointerMain(), as I have
done in the attached patch; another would be to call
RecoveryInProgress() much earlier in this path than we do
now.
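
The two options can be compared in a toy model. This is a sketch under stated assumptions and is not the attached patch: every name below is hypothetical, the initializer is assumed to be idempotent, and the toy recovery check is assumed to always run the initializer as a side effect.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model, not PostgreSQL source and not the attached patch. */
typedef struct ToyLock
{
    int held;
} ToyLock;

static ToyLock shared_locks[8];
static ToyLock *insert_locks = NULL;   /* process-local, unset at process start */
static int init_calls = 0;             /* counts real initializations */

/* Puts the toy process back into its "just started" state. */
void reset_toy_state(void)
{
    insert_locks = NULL;
    init_calls = 0;
}

/* Idempotent initializer: safe to call eagerly and again from a lazy path. */
void init_xlog_access(void)
{
    if (insert_locks != NULL)
        return;
    insert_locks = shared_locks;
    init_calls++;
}

/* Assumption: the toy recovery check always initializes as a side effect. */
bool recovery_in_progress(void)
{
    init_xlog_access();
    return true;                       /* the toy process is always a standby */
}

/* Fails (returns false) if the locks were never initialized. */
bool update_shared_memory_config(void)
{
    if (insert_locks == NULL)
        return false;
    insert_locks[0].held = 1;          /* acquire */
    insert_locks[0].held = 0;          /* release */
    return true;
}

/* Option 1: initialize explicitly before entering the main routine. */
bool checkpointer_start_eager(void)
{
    init_xlog_access();
    return update_shared_memory_config();
}

/* Option 2: perform the recovery check earlier, before the first update. */
bool checkpointer_start_early_check(void)
{
    (void) recovery_in_progress();
    return update_shared_memory_config();
}
```

Both variants make the first update succeed. Because the toy initializer is idempotent, a later call from the lazy path does no harm in either case.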

Steps to reproduce the issue
-------------------------------------------
On Master
a. Change below parameters in postgresql.conf
    wal_level = archive
    archive_mode = on
    archive_command = 'copy "%p" "c:\\Users\\PostgreSQL\9.4\\bin\\archive\\%f"'
    archive_timeout = 10
b. Change pg_hba.conf to accept connections from slave
c. Start Server
d. Connect to server and start online backup
    psql.exe -p 5432 -c "select pg_start_backup('label-1');" postgres
e. Create the slave directory by copying everything from master
f. remove postmaster.pid from slave directory
g. change port on slave
h. create recovery.conf with below parameters on slave:
    standby_mode=on
    restore_command = 'copy "c:\\Users\\PostgreSQL\9.4\\bin\\archive\\%f" "%p"'
i. Stop online backup on master
    psql.exe -p 5432 -c "select pg_stop_backup();" postgres
j. Start the slave and you can observe below logs:
LOG:  checkpointer process (PID 4040) was terminated by exception 0xC0000005
HINT:  See C include file "ntstatus.h" for a description of the hexadecimal value.

Comments?


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

fix_checkpointer_crash_on_slave_v1.patch
Description: Binary data
