[HACKERS] fs issues on software raid0 (PG_VERSION does not contain valid data)

Tomas Vondra Sun, 18 Oct 2015 11:27:03 -0700

Hi there,

I've been doing a lot of filesystem testing / benchmarking recently, and today I ran into a really strange issue with ext4 on two SSD devices in a RAID-0 configuration (Linux software raid).

I'm currently rerunning the test to see if it's reproducible, but maybe someone has an idea of what the problem might be.


The issue manifests like this:

  FATAL:  "base/12140" is not a valid data directory
  DETAIL:  File "base/12140/PG_VERSION" does not contain valid data.
  HINT:  You might need to initdb.

The paths are obviously nonsense. But it gets funnier - the database continues to run seemingly just fine (doing checkpoints, serving queries, ...), until this happens:


  ERROR:  index "pg_type_oid_index" contains unexpected zero page
          at block 3 at character 61

This happens after the benchmarking script runs vacuumdb:

  vacuumdb: query failed: ERROR:  index "pg_type_oid_index" contains
             unexpected zero page at block 3
  LINE 1: ...LECT datname FROM pg_database WHERE datallowconn ORDER BY 1
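To check whether other files in the data directory contain zeroed pages, something like the following could be used. It is only a sketch: it assumes the default 8 kB PostgreSQL block size, and the helper name find_zero_pages is made up for illustration. It reports which blocks of a single file consist entirely of zero bytes.

```python
BLOCK_SIZE = 8192  # assumption: default PostgreSQL block size (BLCKSZ)


def find_zero_pages(path, block_size=BLOCK_SIZE):
    """Return the block numbers in `path` whose bytes are all zero."""
    zero_blocks = []
    zero_page = bytes(block_size)
    with open(path, "rb") as f:
        block_no = 0
        while True:
            page = f.read(block_size)
            if not page:
                break
            # A short trailing read is compared against an equally short run of zeros.
            if page == zero_page[:len(page)]:
                zero_blocks.append(block_no)
            block_no += 1
    return zero_blocks
```

Run against the file backing pg_type_oid_index, this would be expected to list block 3 if the on-disk page really is zeroed, as opposed to the zeros only showing up on read.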

Attached are a PostgreSQL log for the whole benchmark run, a log tracking the benchmark script (useful for mapping pg.log to the steps of the benchmark), and a log with mdadm info.


The benchmark script initializes a new cluster and then does this:

  1) run on small dataset (scale=10)
     - pgbench init
     - vacuumdb
     - warmup
     - pgbench runs for various client counts (with explicit checkpoints)

  2) run on large data set (scale=1100)
     - ... same as for (1)

  3) run on medium data set (scale=140)
     - ... same as for (1)

(The data set sizes are for a machine with 8GB of RAM.)
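For illustration, the steps above can be sketched as a plan of commands. This is not the actual benchmark script. The scales, client counts, repeats, run duration and vacuumdb flags are taken from the attached benchmark log. The database name, the placement of CHECKPOINT right before each run, and the second (read-write) mode are assumptions; the latter is inferred only from the "test runs: 90" figure in the log.

```python
SCALES = {"small": 10, "large": 1100, "medium": 140}  # order as run in the log
CLIENTS = [1, 2, 4, 8, 16]
REPEATS = 3
DURATION = 1800          # seconds per pgbench run (and per warmup)
MODES = ["ro", "rw"]     # assumption: 2 modes, which would explain 90 test runs
DBNAME = "bench"         # hypothetical database name


def pgbench_run(clients, readonly=True):
    """Command line for one timed pgbench run; -S selects the read-only script."""
    cmd = ["pgbench", "-c", str(clients), "-j", str(clients), "-T", str(DURATION)]
    if readonly:
        cmd.append("-S")
    cmd.append(DBNAME)
    return cmd


def plan_for_scale(scale, readonly=True):
    """Commands for one data set: init, vacuumdb, warmup, then the timed runs."""
    plan = [
        ["pgbench", "-i", "-s", str(scale), DBNAME],
        ["vacuumdb", "-a", "-f", "-z"],
        pgbench_run(16, readonly),  # warmup with 16 clients
    ]
    for _round in range(REPEATS):
        for clients in CLIENTS:
            # assumption: the explicit checkpoint is issued before each run
            plan.append(["psql", "-d", DBNAME, "-c", "CHECKPOINT"])
            plan.append(pgbench_run(clients, readonly))
    return plan


def estimated_runtime():
    """Number of timed runs and their total duration in seconds (warmups excluded)."""
    runs = len(MODES) * len(SCALES) * len(CLIENTS) * REPEATS
    return runs, runs * DURATION
```

Under these assumptions estimated_runtime() gives 90 runs and 162000 seconds (45 hours), which matches the header of the benchmark log.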

Anyway, step (1) completes without any errors, then while doing the warmup for (2) the "not a valid data directory" errors start to pop up, and finally when (3) attempts to run vacuumdb, it fails because of the zero page in pg_type_oid_index.

All this happens on an ext4 filesystem, created on a sw raid0 array managed by kernel 4.0.4. The filesystem is created like this:


  mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sda1 /dev/sdb1
  mkfs.ext4 -E stride=128,stripe-width=256 /dev/md0

and mounted like this:

  /dev/md0 on /mnt/data type ext4 (rw,noatime,nobarrier,discard)
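For reference, the stride and stripe-width values passed to mkfs.ext4 follow from the array geometry: stride is the chunk size expressed in filesystem blocks, and stripe-width is the stride multiplied by the number of data-bearing devices. A small sketch of that arithmetic, using the 512K chunk size from the mdadm output and assuming a 4 KiB ext4 block size (the block size is not stated in this message, and the function name is made up):

```python
def ext4_raid_geometry(chunk_kib, data_disks, block_kib=4):
    """Return (stride, stripe_width) in filesystem blocks for mkfs.ext4 -E."""
    if chunk_kib % block_kib != 0:
        raise ValueError("chunk size must be a multiple of the block size")
    stride = chunk_kib // block_kib      # blocks written to one disk per chunk
    stripe_width = stride * data_disks   # blocks in one full stripe
    return stride, stripe_width
```

For a two-device RAID-0 with 512K chunks, ext4_raid_geometry(512, 2) returns (128, 256), the same values used in the mkfs.ext4 command above.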

Neither the array nor the filesystem is corrupted in any way, and there's no sign of kernel issues in any of the logs (/var/log/messages or dmesg, for example).

Also, I've done a number of tests with ext4 with exactly the same mount options, but placed directly on a single device (thus not going through the sw raid layers), and none of those had this issue.

So it seems to me that either the sw raid somehow breaks the guarantees we expect from ext4, or something like that. Another possibility is that using two devices introduces some sort of race condition somewhere in the stack. Or maybe it's not safe to use nobarrier in this case, I don't know.

Now, I don't really think people should use software raid in cases when data durability matters, but I'm not sure that's where the problem is.


I've found two threads that might be somewhat related:

1) http://www.postgresql.org/message-id/201002200230.16951.and...@anarazel.de


  - Same error message, but I don't see any conclusion except for
   "cannot happen" from Greg.

2) http://www.postgresql.org/message-id/48331f9f.9030...@demabg.com

  - Essentially talks about failed RAID5 array, but that does not seem
    to be the case here (no RAID failures here).


BTW this was done on PostgreSQL 9.4.x.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

pg.log.gz
Description: application/gzip

bench pgdata # mdadm --detail --scan
ARRAY /dev/md0 metadata=1.2 name=bench:0 UUID=8d407fb2:6d54e2cb:d5bf4532:7f8e3913
bench pgdata # mdadm --examine /dev/sda1
/dev/sda1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 8d407fb2:6d54e2cb:d5bf4532:7f8e3913
           Name : bench:0  (local to host bench)
  Creation Time : Sat Oct 17 16:18:50 2015
     Raid Level : raid0
   Raid Devices : 2

 Avail Dev Size : 195238448 (93.10 GiB 99.96 GB)
    Data Offset : 131072 sectors
   Super Offset : 8 sectors
   Unused Space : before=130984 sectors, after=0 sectors
          State : clean
    Device UUID : d5253a3d:ba09fb80:30bc07a6:dd88ced6

    Update Time : Sat Oct 17 16:18:50 2015
  Bad Block Log : 512 entries available at offset 72 sectors
       Checksum : c3a9174c - correct
         Events : 0

     Chunk Size : 512K

   Device Role : Active device 0
   Array State : AA ('A' == active, '.' == missing, 'R' == replacing)
bench pgdata # mdadm --examine /dev/sdb1
/dev/sdb1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 8d407fb2:6d54e2cb:d5bf4532:7f8e3913
           Name : bench:0  (local to host bench)
  Creation Time : Sat Oct 17 16:18:50 2015
     Raid Level : raid0
   Raid Devices : 2

 Avail Dev Size : 195238448 (93.10 GiB 99.96 GB)
    Data Offset : 131072 sectors
   Super Offset : 8 sectors
   Unused Space : before=130984 sectors, after=0 sectors
          State : clean
    Device UUID : e5093e0d:88998e11:9bffea40:c05e005c

    Update Time : Sat Oct 17 16:18:50 2015
  Bad Block Log : 512 entries available at offset 72 sectors
       Checksum : 4455a478 - correct
         Events : 0

     Chunk Size : 512K

   Device Role : Active device 1
   Array State : AA ('A' == active, '.' == missing, 'R' == replacing)

output directory: results-20151017-1622
clients: 1 2 4 8 16
scales: 10 1100 140
repeats: 3
test runtime (seconds): 1800
test runs: 90
estimated total runtime (seconds): 162000
estimated total runtime (hours): 45.0
2015-10-17 16:22:03.301367 : fetching pg_settings
2015-10-17 16:22:03.312466 : running r/o for 'small' scale
2015-10-17 16:22:03.312598 : running pgbench init for scale 10
2015-10-17 16:22:03.637502 : running vacuumdb
2015-10-17 16:22:06.990408 : running pgbench warmup with 16 clients
2015-10-17 16:52:07.024324 : running benchmark round 1 (dir=ro/small/1)
2015-10-17 16:52:07.024375 : running pgbench clients=1 threads=1 duration=1800 readonly=True
2015-10-17 17:22:08.052644 : running pgbench clients=2 threads=2 duration=1800 readonly=True
2015-10-17 17:52:08.209035 : running pgbench clients=4 threads=4 duration=1800 readonly=True
2015-10-17 18:22:08.399261 : running pgbench clients=8 threads=8 duration=1800 readonly=True
2015-10-17 18:52:08.652762 : running pgbench clients=16 threads=16 duration=1800 readonly=True
2015-10-17 19:22:09.046676 : running benchmark round 2 (dir=ro/small/2)
2015-10-17 19:22:09.046701 : running pgbench clients=1 threads=1 duration=1800 readonly=True
2015-10-17 19:52:09.184768 : running pgbench clients=2 threads=2 duration=1800 readonly=True
2015-10-17 20:22:09.341345 : running pgbench clients=4 threads=4 duration=1800 readonly=True
2015-10-17 20:52:09.531366 : running pgbench clients=8 threads=8 duration=1800 readonly=True
2015-10-17 21:22:09.786587 : running pgbench clients=16 threads=16 duration=1800 readonly=True
2015-10-17 21:52:10.180005 : running benchmark round 3 (dir=ro/small/3)
2015-10-17 21:52:10.180030 : running pgbench clients=1 threads=1 duration=1800 readonly=True
2015-10-17 22:22:10.318249 : running pgbench clients=2 threads=2 duration=1800 readonly=True
2015-10-17 22:52:10.474378 : running pgbench clients=4 threads=4 duration=1800 readonly=True
2015-10-17 23:22:10.664607 : running pgbench clients=8 threads=8 duration=1800 readonly=True
2015-10-17 23:52:10.921235 : running pgbench clients=16 threads=16 duration=1800 readonly=True
2015-10-18 00:22:11.317608 : running r/o for 'large' scale
2015-10-18 00:22:11.317686 : running pgbench init for scale 1100
2015-10-18 00:22:11.662897 : running vacuumdb
2015-10-18 00:26:16.244434 : running pgbench warmup with 16 clients
2015-10-18 00:56:16.301757 : running benchmark round 1 (dir=ro/large/1)
2015-10-18 00:56:16.301816 : running pgbench clients=1 threads=1 duration=1800 readonly=True
2015-10-18 01:26:18.413628 : running pgbench clients=2 threads=2 duration=1800 readonly=True
2015-10-18 01:56:18.740683 : running pgbench clients=4 threads=4 duration=1800 readonly=True
2015-10-18 02:26:19.015732 : running pgbench clients=8 threads=8 duration=1800 readonly=True
2015-10-18 02:56:19.378200 : running pgbench clients=16 threads=16 duration=1800 readonly=True
2015-10-18 03:26:19.932416 : running benchmark round 2 (dir=ro/large/2)
2015-10-18 03:26:19.932451 : running pgbench clients=1 threads=1 duration=1800 readonly=True
2015-10-18 03:56:20.149420 : running pgbench clients=2 threads=2 duration=1800 readonly=True
2015-10-18 04:26:20.377590 : running pgbench clients=4 threads=4 duration=1800 readonly=True
2015-10-18 04:56:20.645956 : running pgbench clients=8 threads=8 duration=1800 readonly=True
2015-10-18 05:26:21.012901 : running pgbench clients=16 threads=16 duration=1800 readonly=True
2015-10-18 05:56:21.570770 : running benchmark round 3 (dir=ro/large/3)
2015-10-18 05:56:21.570805 : running pgbench clients=1 threads=1 duration=1800 readonly=True
2015-10-18 06:26:21.784902 : running pgbench clients=2 threads=2 duration=1800 readonly=True
2015-10-18 06:56:22.011767 : running pgbench clients=4 threads=4 duration=1800 readonly=True
2015-10-18 07:26:22.281407 : running pgbench clients=8 threads=8 duration=1800 readonly=True
2015-10-18 07:56:22.650882 : running pgbench clients=16 threads=16 duration=1800 readonly=True
2015-10-18 08:26:23.204575 : running r/o for 'medium' scale
2015-10-18 08:26:23.204903 : running pgbench init for scale 140
2015-10-18 08:26:23.972575 : running vacuumdb
2015-10-18 08:26:23.977793 : ERROR: command failed: ['vacuumdb', '-a', '-f', '-z']
2015-10-18 08:26:23.977821 : ERROR: exception 'command failed with exit code 1'

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
