2012-01-23 18:25, Jim Klimov wrote:
>> 4) I did not get to check whether "dedup=verify" triggers a
>> checksum mismatch alarm if the preexisting on-disk data
>> does not in fact match the checksum.
>
> All checksum mismatches are handled the same way.

>> I have yet to test (to be certain) whether writing over a
>> block which is invalid on-disk and marked as deduped, with
>> dedup=verify, would increase the CKSUM counter.

I checked (on an oi_148a LiveUSB) by writing a correct block
(128KB) over the corrupted one in the file, and:
* "dedup=on" neither fixed the on-disk file nor logged
  an error, and subsequent reads produced IO errors
  (and increased the CKSUM counter). Probably only the
  DDT reference counter was increased during the write
  (that's the "works as designed" part);
* "dedup=verify" *doesn't* log a checksum error when it
  finds an existing block whose recorded checksum matches
  that of the newly written block, but whose contents
  differ from the new block during dedup verification -
  and in fact those contents do not match the recorded
  checksum either (at least, not the one in the block
  pointer). Reading the block afterwards produced no
  errors;
* what's worse, re-enabling "dedup=on" and writing the
  same block again crashes (reboots) the system instantly.
  Possibly because there are now two DDT entries with the
  same checksum pointing to different blocks, and no
  verification was explicitly requested?

A reenactment of the test (as a hopefully reproducible
test case) makes up the remainder of this post, so it is
going to be lengthy... Analyze that! ;)

>> I think such an alarm should exist and do as much as a scrub,
>> read or other means of error detection and recovery would.

Statement/desire still stands.

> Checksum mismatches are logged,

No, they are not (in this case).

> what was your root cause?

Probably the same as before - some sort of pre-existing on-disk
data corruption which overwrote some sectors, and raidz2 failed
to reconstruct the stripe. I seem to have had about a dozen such
files. I fixed some of them by rsync with different dedup
settings before going into it all deeper. I am not sure if any
of them had overlapping DVAs (those which remain corrupted now
don't), but many addresses lie in very roughly similar ranges
(within several GBs of each other or so).


As written above, at least for one case it was probably
a random write by a disk over existing sectors, invalidating
the block.

Still, per the "works as designed" behavior above, logging the
mismatch so far has no effect on avoiding the old DDT entry
that points to the corrupt data.

Just in case, logged as https://www.illumos.org/issues/2015

REENACTMENT OF THE TEST CASE

Besides illustrating my error for those who decide to take on
the bug, I hope this post will also help others in their data
recovery attempts, ZFS research, etc.

If my methodology is faulty, I hope someone points that out ;)


1) Uh, I have unrecoverable errors!

The computer was freshly rebooted and the pool imported (with
rollback); there were no newly discovered CKSUM errors (but we
have the nvlist of existing mismatching files):

# zpool status -vx
  pool: pool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
scan: scrub repaired 244K in 138h57m with 31 errors on Sat Jan 14 01:50:16 2012
config:

        NAME        STATE     READ WRITE CKSUM
        pool        ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            c6t0d0  ONLINE       0     0     0
            c6t1d0  ONLINE       0     0     0
            c6t2d0  ONLINE       0     0     0
            c6t3d0  ONLINE       0     0     0
            c6t4d0  ONLINE       0     0     0
            c6t5d0  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <metadata>:<0x0>
...
        pool/mymedia:/incoming/DSCF1234.AVI

NOTE: I don't yet have full detail on the <metadata>:<0x0> error;
I have asked about it numerous times on the list.


2) Mine some information about the file and error location

* mount the dataset
  # zfs mount pool/mymedia
* find the inode number
  # ls -i /pool/mymedia/incoming/DSCF1234.AVI
  6313 /pool/mymedia/incoming/DSCF1234.AVI
* dump ZDB info
  # zdb -dddddd pool/mymedia 6313 > /tmp/zdb.txt
* find the bad block offset
  # dd if=/pool/mymedia/incoming/DSCF1234.AVI of=/dev/null \
    bs=512 conv=noerror,notrunc
  dd: reading `/pool/mymedia/incoming/DSCF1234.AVI': I/O error
  58880+0 records in
  58880+0 records out
  30146560 bytes (30 MB) copied, 676.772 s, 44.5 kB/s
    (error repeated 256 times)
  239145+1 records in
  239145+1 records out
  122442738 bytes (122 MB) copied, 2136.19 s, 57.3 kB/s

  So the error is at offset 58880*512 bytes = 0x1CC0000,
  and its size is 512b*256 = 128KB
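
  A quick way to double-check this arithmetic (any shell with
  POSIX arithmetic expansion will do; not part of the original
  session):
  # printf '%#x\n' $((58880 * 512))
  0x1cc0000
  # echo $((0x1cc0000 / 0x20000))
  230
  The second number, 230, is the 128KB block index within the
  file, used below as the skip/seek value for dd.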



3) Review the /tmp/zdb.txt information

We need the L0 entry for the erroneous block and its
parent L1 entry:

    Object  lvl   iblk   dblk  dsize  lsize   %full  type
      6313    3    16K   128K   117M   117M  100.00  ZFS plain file (K=inherit) (Z=inherit)
                                        168   bonus  System attributes
        dnode flags: USED_BYTES USERUSED_ACCOUNTED
        dnode maxblkid: 935
...
Indirect blocks:
       0 L2  0:ac25a001000:3000 0:927ba8e7000:3000 4000L/400P F=936 B=325191092/325191092
       0 L1  0:ac23ecbf000:6000 0:927b38c1000:6000 4000L/1800P F=128 B=325191082/325191082
...
 1000000 L1  0:ac243ea6000:6000 0:927b474c000:6000 4000L/1800P F=128 B=325191084/325191084
 1cc0000 L0  0:aca016fa000:30000 20000L/20000P F=1 B=325191083/325191083
...

For readers not in the know: block pointers at level L0 address
the "userdata" blocks themselves. Indirect blocks store up to
128 such block pointer entries each, so if more blocks are
needed to store the file, a tree of indirect blocks appears,
up to 6 levels deep or so. That's a lot of possibly addressable
userdata blocks (2^64 they say, IIRC).

In my case, the file comprises 936 "userdata" blocks (numbered
from 0 to maxblkid = 935) and requires an L2 tree to store it.
The file is conveniently made of 128KB blocks sized 0x20000
bytes (logical and physical userdata, no compression here),
each taking up 0x30000 bytes of on-disk allocation (including
raidz2 redundancy on 6 disks). Each L1 metadata block covers
0x1000000 bytes of this file, so there are 8 L1 blocks and one
L2 block to address them.

Our L0 block starting at 0x1CC0000 is number 0x66 (102) in
its parent L1 block starting at 0x1000000:
  (0x1CC0000-0x1000000)/0x20000 = 0x66 = 102
Its L0 metadata starts at the byte offset 0x3300 in the L1
metadata block, since all blkptr_t entries in Ln are 0x80
(128) bytes long:
  (0x66*0x80) = 0x3300 = 13056 bytes
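
Again, a quick shell sanity-check of those numbers (hypothetical,
not from the original session):
  # echo $(( (0x1CC0000 - 0x1000000) / 0x20000 ))
  102
  # printf '%#x\n' $(( 102 * 0x80 ))
  0x3300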

* Extract the raw L1 information - for checksum information,
compression and dedup flags, etc. Metadata is compressed
(as seen via 4000L/1800P in zdb.txt), so we decompress it
on-the-fly:
  # zdb -R pool 0:ac243ea6000:1800:dr > /tmp/l1
  Found vdev type: raidz

* Confirm that we can't read the block directly:
  # dd if=/pool/mymedia/incoming/DSCF1234.AVI of=/tmp/blk-bad.bin \
    bs=131072 count=1 skip=230
  dd: reading `/pool/mymedia/incoming/DSCF1234.AVI': I/O error
  0+0 records in
  0+0 records out
  0 bytes (0 B) copied, 0.0995645 s, 0.0 kB/s

* Dump the (corrupt) raw block data with ZDB, uncompressed:
  # zdb -R pool 0:aca016fa000:20000:r > /tmp/blk-bad.bin
  Found vdev type: raidz


* Use a "dd" command similar to the one above to extract the
  same offset and size from a known-good copy of the file,
  saving the block as /tmp/blk-orig.bin.
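  For example, assuming the known-good copy was stashed at a
  hypothetical path /backup/DSCF1234.AVI:
  # dd if=/backup/DSCF1234.AVI of=/tmp/blk-orig.bin \
    bs=131072 count=1 skip=230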

4) Inspect the L1 entry

NOTES: See definition of the blkptr_t structure here:
http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/fs/zfs/sys/spa.h#blkptr
  and some more details here:
http://hub.opensolaris.org/bin/download/Community+Group+zfs/docs/ondiskformat0822.pdf

* Dump the blkptr_t entry 102 (offset 0x3300) for inspection.

# od -Ax -tx1 /tmp/l1 | egrep '^0033' | head -8
003300 80 01 00 00 00 00 00 00 d0 b7 00 65 05 00 00 00
003310 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
003330 ff 00 ff 00 02 08 13 c0 00 00 00 00 00 00 00 00
003340 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
003350 ab 05 62 13 00 00 00 00 01 00 00 00 00 00 00 00
003360 14 ad 26 24 f2 89 8e 28 f1 ae 5d 99 2d 3b 3c 58
003370 c1 91 74 21 09 28 48 5b 55 6f b4 8f 7e ed 3a 52

(Note that od collapses repeated lines into a "*" line, which the
egrep above filters out - that is why offset 003320, all zeroes
just like 003310, does not appear.)

Cutting to the chase, I'm interested in these bytes now:
0x3334   02  compression = off
0x3335   08  checksum algo = sha256
0x3336   13  data type = 19 = file data (DMU_OT_PLAIN_FILE_CONTENTS)
0x3337   c0  11000000 binary; bits: endianness=1, dedup=1

Bytes 0x3360-0x337f contain the checksum value as four 64-bit
words stored in native (little-endian) byte order, so each
8-byte group is read right-to-left (0x3367-0x3360,
0x336f-0x3368, etc.)

Most of the other information about the userblock was
presented by ZDB in its report about the L0 entry
(converted DVA, sizes, TXGs, etc.).

Checksum value in correct order of bytes:
28 8e 89 f2 24 26 ad 14
58 3c 3b 2d 99 5d ae f1
5b 48 28 09 21 74 91 c1
52 3a ed 7e 8f b4 6f 55
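
As a possible shortcut (not part of the original run, and assuming
a little-endian x86 host with an od that supports 8-byte integer
output), printing the same 32 bytes as 64-bit words undoes the
byte swap in one go and should yield the values above directly:
  # od -An -tx8 -j 13152 -N 32 /tmp/l1
Here 13152 = 0x3360, the start of the checksum field in /tmp/l1.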

5) Test the checksums

  # openssl dgst -sha256 < /tmp/blk-orig.bin
  288e89f22426ad14583c3b2d995daef15b482809217491c1523aed7e8fb46f55

  # openssl dgst -sha256 < /tmp/blk-bad.bin
  ab34939a6342e3f353927a395a04aa37e50f1e74e7a9b993e731c790ba554a59

As we see, the checksums differ, and the "orig" one matches
the checksum value in the L1 block above. So that copy is
good indeed.

6) Compare the good and bad blocks

* Convert the files into readable forms
  # od -Ax -tx1 /tmp/blk-bad.bin >  /tmp/blk-bad.txt
  # od -Ax -tx1 /tmp/blk-orig.bin > /tmp/blk-orig.txt

* Compare the files
# diff -bu /tmp/blk-*txt | less
--- /tmp/blk-bad.txt    2012-01-23 22:20:56.669957059 +0000
+++ /tmp/blk-orig.txt   2012-01-23 22:20:42.886726737 +0000
@@ -6142,135 +6142,262 @@
 017fd0 bc ad fb 94 39 7d 9f e6 8b e8 ab 92 5a e5 70 b2
 017fe0 6e ed 47 e7 eb 9d e4 ed 9a 8c dd 13 66 79 aa 9e
 017ff0 8a cb d5 b4 59 f5 33 cc a8 b3 0f 52 86 64 e9 9e
-018000 00 02 00 69 f8 82 00 00 00 a8 02 00 8a 10 08 7e
-018010 10 08 a5 10 08 2a e4 10 08 d2 10 08 9f 08 08 05
-018020 00 54 1a f9 0c 38 32 10 08 29 10 08 8f 15 10 08
(...)
-0187e0 00 5b 54 00 83 00 00 00 00 00 00 00 00 00 00 00
-0187f0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
-*
+018000 d4 c3 94 06 b3 9a dc 89 26 24 c9 36 15 91 c2 73
+018010 1f 48 06 04 b7 57 17 d6 57 00 d7 b3 03 40 2b c1
(...)
+018fe0 32 b3 50 8c e8 c6 77 69 4e 54 7d 55 9e b7 d5 ce
+018ff0 a7 23 21 41 0c ac ea 11 82 e0 86 da 6a 62 ef a6
 019000 28 1c 74 02 66 d4 7a 21 0b 86 97 08 e0 13 bd 2a
(...)

So here we had one 4KB streak of unexpected data.
No particular (ir)regularities were noticed in this case -
just one block of "white noise" replaced by another.
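
For a quicker measure of the damaged extent than eyeballing the
diff, something along these lines should also work (hypothetical,
not from the original session; cmp -l lists the offsets of all
differing bytes):
  # cmp -l /tmp/blk-bad.bin /tmp/blk-orig.bin | awk \
    'NR==1 {print "first diff at byte", $1-1}
     END   {print "last diff at byte", $1-1, "(" NR " bytes differ)"}'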


============================================================

So, by this time we have:
* Known-good and proven-bad copies of the block;
* Known offset and size of this block in the file;
* ZDB info on layout of the file;
* Corrupt file in a deduped dataset;
* Unanswered question: how do dedup and errors interact?

Time to squash some bugs!


1) What type of dedup was on the dataset?

  # zfs get dedup pool/mymedia
  NAME              PROPERTY  VALUE          SOURCE
  pool/mymedia      dedup     sha256,verify  local

2) How many CKSUM errors do we have now?
  # zpool status -v pool | grep ONLINE
        pool        ONLINE       0     0     8
          raidz2-0  ONLINE       0     0     8
            c6t0d0  ONLINE       0     0     0
            c6t1d0  ONLINE       0     0     0
            c6t2d0  ONLINE       0     0     0
            c6t3d0  ONLINE       0     0     0
            c6t4d0  ONLINE       0     0     0
            c6t5d0  ONLINE       0     0     0

  OK, by the time I got here, the pool had logged 8 CKSUM errors,
but none were definitely attributed to any particular disk.
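
(If one wanted more detail on those pool-level checksum events,
the FMA error log should contain the matching ereports; a
hypothetical check, not part of the original run:
  # fmdump -eV | grep -c 'ereport.fs.zfs.checksum'
would count them, and the verbose records carry the per-event
details.)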

3) Let's try "fixing" the file with "dedup=on":

* Set dedup mode on dataset:
  # zfs set dedup=on pool/mymedia

* Overwrite the byterange in the file:
  # dd if=/tmp/blk-orig.bin of=/pool/mymedia/incoming/DSCF1234.AVI \
    bs=131072 count=1 seek=230 conv=notrunc
  1+0 records in
  1+0 records out
  131072 bytes (131 kB) copied, 0.000837707 s, 156 MB/s
  # sleep 10; sync

* Did the known CKSUM error count increase? No, it did not:
  # zpool status -v pool | grep ONLINE | grep -v c6
        pool        ONLINE       0     0     8
          raidz2-0  ONLINE       0     0     8

* Try reading the file:
  # md5sum /pool/mymedia/incoming/DSCF1234.AVI
  md5sum: /pool/mymedia/incoming/DSCF1234.AVI: I/O error

* Did the known CKSUM error count increase? Yes, it did now:
  # zpool status -v pool | grep ONLINE | grep -v c6
        pool        ONLINE       0     0     10
          raidz2-0  ONLINE       0     0     10

So, in fact, this did not fix the file.

Did anything change with the blockpointer tree?

* dump ZDB info:
  # zdb -dddddd pool/mymedia 6313 > /tmp/zdb-2.txt
* compare the two info files:
  # diff /tmp/zdb*.txt

< 0 L2 0:6a01ccb6000:3000 0:2902c091000:3000 4000L/400P F=688 B=325617005/325617005
> 0 L2 0:ac25a001000:3000 0:927ba8e7000:3000 4000L/400P F=936 B=325191092/325191092

< 1000000 L1 0:6a01ccb0000:6000 0:2902c05b000:6000 4000L/1800P F=128 B=325617005/325617005
> 1000000 L1 0:ac243ea6000:6000 0:927b474c000:6000 4000L/1800P F=128 B=325191084/325191084

< 1cc0000 L0 0:aca016fa000:30000 20000L/20000P F=1 B=325617005/325191083
> 1cc0000 L0 0:aca016fa000:30000 20000L/20000P F=1 B=325191083/325191083

So, the on-disk L0 block's DVA address did not change - only a
TXG entry, and accordingly the indirect blockpointer tree above
it. Since the DVA remains unchanged, and ZFS's COW never
overwrites existing data in place, the block on disk remains
corrupted - as we've seen with md5sum above.
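
One way to confirm that directly, reusing the zdb/openssl steps
from earlier (hypothetical, not part of the original run), is to
re-dump the same DVA and hash it again; the digest should still
match the "bad" one above rather than the "orig" one:
  # zdb -R pool 0:aca016fa000:20000:r > /tmp/blk-still-bad.bin
  # openssl dgst -sha256 < /tmp/blk-still-bad.bin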


4) Now let's retry with "dedup=verify":

* Set dedup mode on dataset:
  # zfs set dedup=sha256,verify pool/mymedia

* Overwrite the byterange in the file:
  # dd if=/tmp/blk-orig.bin of=/pool/mymedia/incoming/DSCF1234.AVI \
    bs=131072 count=1 seek=230 conv=notrunc
  1+0 records in
  1+0 records out
  131072 bytes (131 kB) copied, 0.00109255 s, 120 MB/s
  # sleep 10; sync

* Did the known CKSUM error count increase? No, it did not:
  # zpool status -v pool | grep ONLINE | grep -v c6
        pool        ONLINE       0     0     10
          raidz2-0  ONLINE       0     0     10

* Try reading the file:
  # md5sum /pool/mymedia/incoming/DSCF1234.AVI
  731247e7d71951eb622b979bc18f3caa  /pool/mymedia/incoming/DSCF1234.AVI

* Did the known CKSUM error count increase? No, it still did not:
  # zpool status -v pool | grep ONLINE | grep -v c6
        pool        ONLINE       0     0     10
          raidz2-0  ONLINE       0     0     10

So, in fact, this did fix the file.

We can even attest to that with ZDB: the tree of indirect blocks
has changed (levels L2, L1 and L0 of our branch, but not L1 of
another branch):

Indirect blocks:
       0 L2  0:6a27641d000:3000 0:2992ca90000:3000 4000L/400P F=936 B=325619762/325619762
       0 L1  0:ac23ecbf000:6000 0:927b38c1000:6000 4000L/1800P F=128 B=325191082/325191082
 1000000 L1  0:6a19c140000:6000 0:299204f9000:6000 4000L/1800P F=128 B=325619762/325619762
 1cc0000 L0  0:6a7df3b2000:30000 20000L/20000P F=1 B=325619762/325619762

But no CKSUM error was logged as such during the dedup-verify -
even though the block read back from disk mismatched the checksum
in the block pointer (and presumably in the DDT).
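
Likewise, one could re-dump the new L0 DVA and hash it to confirm
that the replacement block is the good one (hypothetical commands,
not from the original session; the digest should match the "orig"
sha256 above):
  # zdb -R pool 0:6a7df3b2000:20000:r > /tmp/blk-new.bin
  # openssl dgst -sha256 < /tmp/blk-new.bin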



5) Now for the interesting part: retry with "dedup=on"

* Set dedup mode on dataset:
  # zfs set dedup=on pool/mymedia

* Overwrite the byterange in the file:
  # dd if=/tmp/blk-orig.bin of=/pool/mymedia/incoming/DSCF1234.AVI \
    bs=131072 count=1 seek=230 conv=notrunc
  1+0 records in
  1+0 records out
  131072 bytes (131 kB) copied, 0.000873926 s, 150 MB/s

The command prompt never returned; the host rebooted "instantly".
According to another terminal logging "iostat -xnz 1", the new
writes never made it to disk.

After the reboot the pool imported cleanly, and the file's md5sum
matches. The fix from dedup=verify remains in place...

HTH and thanks,
//Jim Klimov
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
