This same crash occurred on a second server. Also qemu/KVM, but this time on
a secondary ZFS tablespace which (fails to) include the missing relfilenode.
Linux database7 3.10.0-957.10.1.el7.x86_64 #1 SMP Mon Mar 18 15:06:45 UTC 2019
x86_64 x86_64 x86_64 GNU/Linux
This is postgresql12-12.1-1PGDG.rhel7.x86_64 (same as the first crash),
running since:
|30515 Tue Nov 19 10:04:33 2019 S ? 00:09:54
/usr/pgsql-12/bin/postmaster -D /var/lib/pgsql/12/data/
Before that, this server ran v12.0 since Oct 30 (without crashing).
In this case, the pg_dump --snap finished and released its snapshot at 21:50,
and there were no ALTERed tables. I see a temp file written since the previous
checkpoint, but unlike the previous server's crash, it wasn't written by a
parallel worker. The crash happened while reindexing, though. The "DROP INDEX
CONCURRENTLY" is from pg_repack -i, and completed successfully, but is
followed immediately by the abort log. The following "REINDEX toast..."
failed. In this case, I *guess* that the missing filenode is due to a dropped
index (570627937 or otherwise). I don't see any other CLUSTER, VACUUM FULL,
DROP, TRUNCATE, or ALTER within that checkpoint interval (note, we have
1-minute checkpoints).
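For reference, here's how I'd check whether that filenode still maps to any
relation (just a sketch; 16401 and 973123799 are the tablespace OID and
filenode from the fsync error below):

  -- Resolve (tablespace OID, filenode) back to a relation, if one still
  -- exists; returns NULL once the owning relation has been dropped or
  -- rewritten under a new filenode.
  SELECT pg_filenode_relation(16401, 973123799);

  -- Or search pg_class directly:
  SELECT oid, relname, relkind FROM pg_class WHERE relfilenode = 973123799;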
Note: I double-checked on the first server which crashed; it definitely
wasn't running pg_repack or the reindex script, since I had removed
pg_repack12 from our servers until 12.1 was installed, to avoid the
"concurrently" progress-reporting crash fixed in 1cd5bc3c. So I think ALTER
TABLE ... TYPE and REINDEX can both trigger this crash, at least on v12.1.
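To spell out why I suspect those two commands: both give the relation a new
relfilenode, after which the old file is unlinked, so either could leave a
stale fsync request queued for a file that no longer exists. Roughly (the
table and column names in the first statement are made up, not the real ones):

  -- ALTER TABLE ... TYPE rewrites the table under a new relfilenode and
  -- the old file is removed (hypothetical names):
  ALTER TABLE some_table ALTER COLUMN some_col TYPE text;

  -- REINDEX likewise builds the index under a new relfilenode and drops
  -- the old file; this is the toast index from the log below:
  REINDEX INDEX pg_toast.pg_toast_570627929_index;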
Note that I actually have *full logs*, which I've now saved, but here's an excerpt:
postgres=# SELECT log_time, message FROM ckpt_crash WHERE log_time BETWEEN
'2019-11-26 23:40:20' AND '2019-11-26 23:48:58' AND user_name IS NULL ORDER BY
1;
2019-11-26 23:40:20.139-05 | checkpoint starting: time
2019-11-26 23:40:50.069-05 | checkpoint complete: wrote 11093 buffers (5.6%);
0 WAL file(s) added, 0 removed, 12 recycled; write=29.885 s, sync=0.008 s,
total=29.930 s; sync files=71, longest=0.001 s, average=0.000 s;
distance=193388 kB, estimate=550813 kB
2019-11-26 23:41:16.234-05 | automatic analyze of table
"postgres.public.postgres_log_2019_11_26_2300" system usage: CPU: user: 3.00 s,
system: 0.19 s, elapsed: 10.92 s
2019-11-26 23:41:20.101-05 | checkpoint starting: time
2019-11-26 23:41:50.009-05 | could not fsync file
"pg_tblspc/16401/PG_12_201909212/16460/973123799.10": No such file or directory
2019-11-26 23:42:04.397-05 | checkpointer process (PID 30560) was terminated
by signal 6: Aborted
2019-11-26 23:42:04.397-05 | terminating any other active server processes
2019-11-26 23:42:04.397-05 | terminating connection because of crash of
another server process
2019-11-26 23:42:04.42-05 | terminating connection because of crash of
another server process
2019-11-26 23:42:04.493-05 | all server processes terminated; reinitializing
2019-11-26 23:42:05.651-05 | database system was interrupted; last known up at
2019-11-27 00:40:50 -04
2019-11-26 23:47:30.404-05 | database system was not properly shut down;
automatic recovery in progress
2019-11-26 23:47:30.435-05 | redo starts at 3450/1B202938
2019-11-26 23:47:54.501-05 | redo done at 3450/205CE960
2019-11-26 23:47:54.501-05 | invalid record length at 3450/205CEA18: wanted
24, got 0
2019-11-26 23:47:54.567-05 | checkpoint starting: end-of-recovery immediate
2019-11-26 23:47:57.365-05 | checkpoint complete: wrote 3287 buffers (1.7%);
0 WAL file(s) added, 0 removed, 5 recycled; write=2.606 s, sync=0.183 s,
total=2.798 s; sync files=145, longest=0.150 s, average=0.001 s;
distance=85808 kB, estimate=85808 kB
2019-11-26 23:47:57.769-05 | database system is ready to accept connections
2019-11-26 23:48:57.774-05 | checkpoint starting: time
< 2019-11-27 00:42:04.342 -04 postgres >LOG: duration: 13.028 ms statement:
DROP INDEX CONCURRENTLY "child"."index_570627937"
< 2019-11-27 00:42:04.397 -04 >LOG: checkpointer process (PID 30560) was
terminated by signal 6: Aborted
< 2019-11-27 00:42:04.397 -04 >LOG: terminating any other active server
processes
< 2019-11-27 00:42:04.397 -04 >WARNING: terminating connection because of
crash of another server process
< 2019-11-27 00:42:04.397 -04 >DETAIL: The postmaster has commanded this
server process to roll back the current transaction and exit, because another
server process exited abnormally and possibly corrupted shared memory.
< 2019-11-27 00:42:04.397 -04 >HINT: In a moment you should be able to
reconnect to the database and repeat your command.
...
< 2019-11-27 00:42:04.421 -04 postgres >STATEMENT: begin; LOCK TABLE
child.ericsson_sgsn_ss7_remote_sp_201911 IN SHARE MODE;REINDEX INDEX
pg_toast.pg_toast_570627929_index;commit
Here are all the nondefault settings which seem plausibly relevant or interesting:
 name                             | setting            | unit | source
----------------------------------+--------------------+------+----------------------
 autovacuum_analyze_scale_factor  | 0.005              |      | configuration file
 autovacuum_analyze_threshold     | 2                  |      | configuration file
 checkpoint_timeout               | 60                 | s    | configuration file
 max_files_per_process            | 1000               |      | configuration file
 max_stack_depth                  | 2048               | kB   | environment variable
 max_wal_size                     | 4096               | MB   | configuration file
 min_wal_size                     | 4096               | MB   | configuration file
 shared_buffers                   | 196608             | 8kB  | configuration file
 shared_preload_libraries         | pg_stat_statements |      | configuration file
 wal_buffers                      | 2048               | 8kB  | override
 wal_compression                  | on                 |      | configuration file
 wal_segment_size                 | 16777216           | B    | override
(gdb) bt
#0 0x00007f07c0070207 in raise () from /lib64/libc.so.6
#1 0x00007f07c00718f8 in abort () from /lib64/libc.so.6
#2 0x000000000087752a in errfinish (dummy=<optimized out>) at elog.c:552
#3 0x000000000075c8ec in ProcessSyncRequests () at sync.c:398
#4 0x0000000000734dd9 in CheckPointBuffers (flags=flags@entry=256) at bufmgr.c:2588
#5 0x00000000005095e1 in CheckPointGuts (checkPointRedo=57518713529016, flags=flags@entry=256) at xlog.c:9006
#6 0x000000000050ff86 in CreateCheckPoint (flags=flags@entry=256) at xlog.c:8795
#7 0x00000000006e4092 in CheckpointerMain () at checkpointer.c:481
#8 0x000000000051fcd5 in AuxiliaryProcessMain (argc=argc@entry=2, argv=argv@entry=0x7ffd82122400) at bootstrap.c:461
#9 0x00000000006ee680 in StartChildProcess (type=CheckpointerProcess) at postmaster.c:5392
#10 0x00000000006ef9ca in reaper (postgres_signal_arg=<optimized out>) at postmaster.c:2973
#11 <signal handler called>
#12 0x00007f07c012ef53 in __select_nocancel () from /lib64/libc.so.6
#13 0x00000000004833d4 in ServerLoop () at postmaster.c:1668
#14 0x00000000006f106f in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x19d4280) at postmaster.c:1377
#15 0x0000000000484cd3 in main (argc=3, argv=0x19d4280) at main.c:228
#3 0x000000000075c8ec in ProcessSyncRequests () at sync.c:398
path = "pg_tblspc/16401/PG_12_201909212/16460/973123799.10", '\000'
<repeats 14 times>, ...
failures = 1
sync_in_progress = true
hstat = {hashp = 0x19fd2f0, curBucket = 1443, curEntry = 0x0}
entry = 0x1a61260
absorb_counter = <optimized out>
processed = 23
sync_start = {tv_sec = 21582125, tv_nsec = 303557162}
sync_end = {tv_sec = 21582125, tv_nsec = 303536006}
sync_diff = <optimized out>
elapsed = <optimized out>
longest = 1674
total_elapsed = 7074
__func__ = "ProcessSyncRequests"