I discovered while poking at an LDAP problem that our TAP tests are
100% reproducibly capable of ignoring server crashes and reporting
success anyway.  This problem occurs if the postmaster doesn't get
the child failure report until after shutdown has been initiated,
in which case you find something like this in the postmaster log:

2022-05-04 12:01:33.946 EDT [57945] [unknown] LOG:  connection received: 
host=[local]
2022-05-04 12:01:33.995 EDT [57945] [unknown] LOG:  connection authenticated: 
identity="uid=test1,dc=example,dc=net" method=ldap 
(/Users/tgl/pgsql/src/test/ldap/tmp_check/t_001_auth_node_data/pgdata/pg_hba.conf:1)
2022-05-04 12:01:33.995 EDT [57945] [unknown] LOG:  connection authorized: 
user=test1 database=postgres application_name=001_auth.pl
2022-05-04 12:01:33.998 EDT [57945] 001_auth.pl LOG:  statement: SELECT 
$$connected with user=test1$$
2022-05-04 12:01:34.003 EDT [57937] LOG:  received fast shutdown request
2022-05-04 12:01:34.003 EDT [57937] LOG:  aborting any active transactions
2022-05-04 12:01:34.003 EDT [57937] LOG:  background worker "logical 
replication launcher" (PID 57943) exited with exit code 1
2022-05-04 12:01:35.750 EDT [57937] LOG:  server process (PID 57945) was 
terminated by signal 11: Segmentation fault: 11
2022-05-04 12:01:35.751 EDT [57937] LOG:  terminating any other active server 
processes
2022-05-04 12:01:35.751 EDT [57937] LOG:  abnormal database system shutdown
2022-05-04 12:01:35.751 EDT [57937] LOG:  database system is shut down

Our TAP scripts don't notice the "abnormal database system shutdown",
so it looks like things have passed.  It's even worse when a script
demands an immediate shutdown, because then the postmaster won't wait
around for the child status.

If you have core dumps enabled, that adds some time before the child
exit status is delivered, making this scenario extremely probable.
I'm finding that src/test/ldap reports success 100% reproducibly
after doing "ulimit -c unlimited", even though four backend core dumps
are produced during the run.

I think that (a) at least by default, node->stop() ought to check for
normal shutdown status, and (b) immediate shutdown should only be used
in cases where we've already decided that the test failed.

                        regards, tom lane


Reply via email to