Hi there,I've been doing a lot of filesystem testing / benchmarking recently, and today I've ran into a really strange issue with ext4 on two SSD devices in a RAID-0 configuration (Linux software raid).
I'm currently trying rerunning the test to see if it's reproducible, but maybe someone has an idea of what might be the problem.
The issue demonstrates like this: FATAL: "base/12140" is not a valid data directory DETAIL: File "base/12140/PG_VERSION" does not contain valid data. HINT: You might need to initdb.The paths are obviously nonsense. But it gets funnier - the database continues to run seemingly just fine (doing checkpoints, serving queries, ...), until this happens:
ERROR: index "pg_type_oid_index" contains unexpected zero page at block 3 at character 61 This happens after the benchmarking script runs vacuumdb: vacuumdb: query failed: ERROR: index "pg_type_oid_index" contains unexpected zero page at block 3 LINE 1: ...LECT datname FROM pg_database WHERE datallowconn ORDER BY 1Attached is a PostgreSQL log for the whole benchmark run, log tracking the benchmark script (useful for mapping the pg.log to steps of the benchmark), and also log with mdadm info.
, which initializes a new cluster and then does this: 1) run on small dataset (scale=10) - pgbench init - vacuumdb - warmup - pgbench runs for various client runs (with explicit checkpoints) 2) run on large data set (scale=1100) - ... same as for (1) 3) run on medium data set (scale=140) - ... same as for (1) (The data set sizes are for a machine with 8GB of RAM.)Anyway, the (1) completes without any errors, then while doing warmup for (2) the "not a valid data directory" errors start to pop up, and finally when (3) attempts to do the vacuumdb, it fails because of the zero page in pg_type_oid_index.
All this happens on an ext4 filesystem, created on a sw raid0 manager by kernel 4.0.4. The filesystem is created like this:
mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sda1 /dev/sdb1 mkfs.ext4 -E stride=128,stripe-width=256 /dev/md0 and mounted like this /dev/md0 on /mnt/data type ext4 (rw,noatime,nobarrier,discard)Neither the array nor the filesystem are corrupted in any way, and there's no sign of kernel issues in any of the logs (/var/log/messages or dmesg, for example).
Also, I've done a number of tests with ext4 with exactly the same mount options, but placed directly on a single device (thus not going through the sw raid layers), and none of those had this issue.
So it seems to me that either the sw raid somehow breaks the guarantees we expect from ext4, or something like that. Another possibility is that using two devices introduces some sort of race condition somewhere in the stack. Or maybe it's not safe to use nobarrier in this case, I don't know.
Now, I don't really think people should use software raid in cases when data durability matters, but I'm not sure that's where the problem is.
I've found two threads that might be somewhat related:1) http://www.postgresql.org/message-id/201002200230.16951.and...@anarazel.de
- Same error message, but I don't see any conclusion except for "cannot happen" from Greg. 2) http://www.postgresql.org/message-id/48331f9f.9030...@demabg.com - Essentially talks about failed RAID5 array, but that does not seem to be the case here (no RAID failures here). BTW this was done on PostgreSQL 9.4.x. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
pg.log.gz
Description: application/gzip
bench pgdata # mdadm --detail --scan ARRAY /dev/md0 metadata=1.2 name=bench:0 UUID=8d407fb2:6d54e2cb:d5bf4532:7f8e3913 bench pgdata # mdadm --examine /dev/sda1 /dev/sda1: Magic : a92b4efc Version : 1.2 Feature Map : 0x0 Array UUID : 8d407fb2:6d54e2cb:d5bf4532:7f8e3913 Name : bench:0 (local to host bench) Creation Time : Sat Oct 17 16:18:50 2015 Raid Level : raid0 Raid Devices : 2 Avail Dev Size : 195238448 (93.10 GiB 99.96 GB) Data Offset : 131072 sectors Super Offset : 8 sectors Unused Space : before=130984 sectors, after=0 sectors State : clean Device UUID : d5253a3d:ba09fb80:30bc07a6:dd88ced6 Update Time : Sat Oct 17 16:18:50 2015 Bad Block Log : 512 entries available at offset 72 sectors Checksum : c3a9174c - correct Events : 0 Chunk Size : 512K Device Role : Active device 0 Array State : AA ('A' == active, '.' == missing, 'R' == replacing) bench pgdata # mdadm --examine /dev/sdb1 /dev/sdb1: Magic : a92b4efc Version : 1.2 Feature Map : 0x0 Array UUID : 8d407fb2:6d54e2cb:d5bf4532:7f8e3913 Name : bench:0 (local to host bench) Creation Time : Sat Oct 17 16:18:50 2015 Raid Level : raid0 Raid Devices : 2 Avail Dev Size : 195238448 (93.10 GiB 99.96 GB) Data Offset : 131072 sectors Super Offset : 8 sectors Unused Space : before=130984 sectors, after=0 sectors State : clean Device UUID : e5093e0d:88998e11:9bffea40:c05e005c Update Time : Sat Oct 17 16:18:50 2015 Bad Block Log : 512 entries available at offset 72 sectors Checksum : 4455a478 - correct Events : 0 Chunk Size : 512K Device Role : Active device 1 Array State : AA ('A' == active, '.' == missing, 'R' == replacing)
output directory: results-20151017-1622 clients: 1 2 4 8 16 scales: 10 1100 140 repeats: 3 test runtime (seconds): 1800 test runs: 90 estimated total runtime (seconds): 162000 estimated total runtime (hours): 45.0 2015-10-17 16:22:03.301367 : fetching pg_settings 2015-10-17 16:22:03.312466 : running r/o for 'small' scale 2015-10-17 16:22:03.312598 : running pgbench init for scale 10 2015-10-17 16:22:03.637502 : running vacuumdb 2015-10-17 16:22:06.990408 : running pgbench warmup with 16 clients 2015-10-17 16:52:07.024324 : running benchmark round 1 (dir=ro/small/1) 2015-10-17 16:52:07.024375 : running pgbench clients=1 threads=1 duration=1800 readonly=True 2015-10-17 17:22:08.052644 : running pgbench clients=2 threads=2 duration=1800 readonly=True 2015-10-17 17:52:08.209035 : running pgbench clients=4 threads=4 duration=1800 readonly=True 2015-10-17 18:22:08.399261 : running pgbench clients=8 threads=8 duration=1800 readonly=True 2015-10-17 18:52:08.652762 : running pgbench clients=16 threads=16 duration=1800 readonly=True 2015-10-17 19:22:09.046676 : running benchmark round 2 (dir=ro/small/2) 2015-10-17 19:22:09.046701 : running pgbench clients=1 threads=1 duration=1800 readonly=True 2015-10-17 19:52:09.184768 : running pgbench clients=2 threads=2 duration=1800 readonly=True 2015-10-17 20:22:09.341345 : running pgbench clients=4 threads=4 duration=1800 readonly=True 2015-10-17 20:52:09.531366 : running pgbench clients=8 threads=8 duration=1800 readonly=True 2015-10-17 21:22:09.786587 : running pgbench clients=16 threads=16 duration=1800 readonly=True 2015-10-17 21:52:10.180005 : running benchmark round 3 (dir=ro/small/3) 2015-10-17 21:52:10.180030 : running pgbench clients=1 threads=1 duration=1800 readonly=True 2015-10-17 22:22:10.318249 : running pgbench clients=2 threads=2 duration=1800 readonly=True 2015-10-17 22:52:10.474378 : running pgbench clients=4 threads=4 duration=1800 readonly=True 2015-10-17 23:22:10.664607 : running pgbench clients=8 threads=8 duration=1800 readonly=True 2015-10-17 23:52:10.921235 : running pgbench clients=16 threads=16 duration=1800 readonly=True 2015-10-18 00:22:11.317608 : running r/o for 'large' scale 2015-10-18 00:22:11.317686 : running pgbench init for scale 1100 2015-10-18 00:22:11.662897 : running vacuumdb 2015-10-18 00:26:16.244434 : running pgbench warmup with 16 clients 2015-10-18 00:56:16.301757 : running benchmark round 1 (dir=ro/large/1) 2015-10-18 00:56:16.301816 : running pgbench clients=1 threads=1 duration=1800 readonly=True 2015-10-18 01:26:18.413628 : running pgbench clients=2 threads=2 duration=1800 readonly=True 2015-10-18 01:56:18.740683 : running pgbench clients=4 threads=4 duration=1800 readonly=True 2015-10-18 02:26:19.015732 : running pgbench clients=8 threads=8 duration=1800 readonly=True 2015-10-18 02:56:19.378200 : running pgbench clients=16 threads=16 duration=1800 readonly=True 2015-10-18 03:26:19.932416 : running benchmark round 2 (dir=ro/large/2) 2015-10-18 03:26:19.932451 : running pgbench clients=1 threads=1 duration=1800 readonly=True 2015-10-18 03:56:20.149420 : running pgbench clients=2 threads=2 duration=1800 readonly=True 2015-10-18 04:26:20.377590 : running pgbench clients=4 threads=4 duration=1800 readonly=True 2015-10-18 04:56:20.645956 : running pgbench clients=8 threads=8 duration=1800 readonly=True 2015-10-18 05:26:21.012901 : running pgbench clients=16 threads=16 duration=1800 readonly=True 2015-10-18 05:56:21.570770 : running benchmark round 3 (dir=ro/large/3) 2015-10-18 05:56:21.570805 : running pgbench clients=1 threads=1 duration=1800 readonly=True 2015-10-18 06:26:21.784902 : running pgbench clients=2 threads=2 duration=1800 readonly=True 2015-10-18 06:56:22.011767 : running pgbench clients=4 threads=4 duration=1800 readonly=True 2015-10-18 07:26:22.281407 : running pgbench clients=8 threads=8 duration=1800 readonly=True 2015-10-18 07:56:22.650882 : running pgbench clients=16 threads=16 duration=1800 readonly=True 2015-10-18 08:26:23.204575 : running r/o for 'medium' scale 2015-10-18 08:26:23.204903 : running pgbench init for scale 140 2015-10-18 08:26:23.972575 : running vacuumdb 2015-10-18 08:26:23.977793 : ERROR: command failed: ['vacuumdb', '-a', '-f', '-z'] 2015-10-18 08:26:23.977821 : ERROR: exception 'command failed with exit code 1'
-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers