[zfs-discuss] Single-disk rpool with inconsistent checksums, import fails

2011-11-08 Thread Jim Klimov

Hello all,

I have an oi_148a PC with a single root disk, and it has
recently started failing to boot: it hangs after the
copyright message whichever GRUB menu option I use.

Booting with an oi_148a LiveUSB I had around since
installation, I ran some zdb traversals over the rpool
and zpool import attempts. The imports fail by running
the kernel out of RAM (as recently discussed in the
list with Paul Kraus's problems).

However, in my current case, the rpool has just 11.2Gb
allocated with 8.7Gb available. So almost all of it
could fit in the 8Gb RAM of this computer (no more can
be placed into the motherboard). And I don't believe
there is so much metadata as to exhaust the RAM during
an import attempt.

I have also tried rollback imports with -F, but they
have also failed so far.

I am not ready to copypaste the zdb/zpool outputs here
(I have to get text files off that box), but in short:

1) zdb -bsvL -e rpool-GUID showed that there are some
problems:
* deferred free block count is not zero, although small
  (144 blocks amounting to 1.4Mbytes), and it remained at
  this value over several import attempts.
  I have removed a swap volume some time before the failure,
  so this might be its leftovers.
* It had also output this line:
block traversal size 11986202624 != alloc 11986203136 (unreachable 512)
  I believe this refers to the allocated data size in bytes,
  and that one sector (512b) is deemed unreachable. Is that
  so fatal?

2) zdb -bsvc -e rpool-GUID showed that there are some
consistency problems. Namely, five blocks had mismatching
checksums. They were named plain file blocks with no
further details (like what files they might be parts of).
But I hope that this means no metadata was hurt so far.

3) I've tried importing the pool in several ways (including
normal and rollback mounts, readonly and -n), but so far
all attempts led to the computer hanging within a minute
(vmstat 1 shows that free RAM plummets towards the zero
mark).
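
As a sanity check on the mismatch line from item 1, the
difference between the two figures can be recomputed by hand.
This is only a sketch: the numbers are copied from the zdb
line quoted above, and the 512-byte sector size is my
assumption about this disk.

```shell
# Figures copied from the "block traversal size ... != alloc ..." line.
traversed=11986202624
allocated=11986203136
sector=512   # assumed sector size in bytes

# Bytes the allocator counts as used but the traversal never reached.
unreachable=$((allocated - traversed))
sectors=$((unreachable / sector))

echo "unreachable: ${unreachable} bytes (${sectors} sector(s))"
```

So the gap really is exactly one 512-byte unit, which matches
the "(unreachable 512)" note in the zdb line.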

I've tried preparing the system tunables as well:

:; echo aok/W 1 | mdb -kw
:; echo zfs_recover/W 1 | mdb -kw

and sometimes adding:
:; echo zfs_vdev_max_pending/W0t5 | mdb -kw
:; echo zfs_resilver_delay/W0t0 | mdb -kw
:; echo zfs_resilver_min_time_ms/W0t2 | mdb -kw
:; echo zfs_txg_synctime/W0t1 | mdb -kw
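
Since these have to be re-entered after every LiveUSB reboot,
it may be handy to generate them from one list and review them
first. A minimal dry-run sketch: the function name
print_tunables is made up for illustration, all values are
written with mdb's 0t (decimal) prefix, and the function only
prints the command lines, it does not touch the kernel.

```shell
# Hypothetical helper: prints one "echo name/Wvalue | mdb -kw" line per
# tunable listed below, without executing anything.
print_tunables() {
    while read -r name value; do
        # Skip empty lines in the list.
        [ -n "$name" ] || continue
        printf 'echo %s/W%s | mdb -kw\n' "$name" "$value"
    done <<'EOF'
aok 0t1
zfs_recover 0t1
zfs_vdev_max_pending 0t5
zfs_resilver_delay 0t0
zfs_resilver_min_time_ms 0t2
zfs_txg_synctime 0t1
EOF
}

print_tunables
```

The printed lines can be checked by eye and then pasted into
a root shell before the next import attempt.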


In this case I am not very hesitant to recreate the rpool
and reinstall the OS - it was mostly needed to serve the
separate data pool. However this option is not always an
acceptable one, so I wonder if anything can be done to
repair an inconsistent non-redundant pool - at least to
make it importable again in order to evacuate some of the
settings and tunings that I've made over time.

//Jim

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Single-disk rpool with inconsistent checksums, import fails

2011-11-08 Thread Jim Klimov

2011-11-08 22:30, Jim Klimov wrote:

> Hello all,
>
> I have an oi_148a PC with a single root disk, and since
> recently it fails to boot - hangs after the copyright
> message whenever I use any of my GRUB menu options.


Thanks to my wife's sister, who is my hands and eyes near
the problematic PC, here's some ZDB output from this rpool:

# zpool import
  pool: rpool
    id: 17995958177810353692
 state: ONLINE
status: The pool was last accessed by another system.
action: The pool can be imported using its name or numeric identifier and
        the '-f' flag.
   see: http://www.sun.com/msg/ZFS-8000-EY
config:

        rpool       ONLINE
          c4t1d0s0  ONLINE


So here it is - a single-device rpool.
There are some on-disk errors, so some of the zdb walks fail:


root@openindiana:~# time zdb -bb -e 17995958177810353692

Traversing all blocks to verify nothing leaked ...
Assertion failed: ss->ss_start <= start (0x79e22600 <= 0x79e1dc00), file 
../../../uts/common/fs/zfs/space_map.c, line 173

Abort (core dumped)

real    0m12.184s
user    0m0.367s
sys     0m0.474s

root@openindiana:~# time zdb -bsvc -e 17995958177810353692

Traversing all blocks to verify checksums and verify nothing leaked ...
Assertion failed: ss->ss_start <= start (0x79e22600 <= 0x79e1dc00), file 
../../../uts/common/fs/zfs/space_map.c, line 173

Abort (core dumped)

real    0m12.019s
user    0m0.360s
sys     0m0.458s



However, -bsvL and -bsvcL (the latter with checksum checks)
do finish; results of the latter, more complete test are
listed below:



root@openindiana:~# time zdb -bsvcL -e 17995958177810353692

Traversing all blocks to verify checksums ...

zdb_blkptr_cb: Got error 50 reading 182, 19177, 0, 1 
DVA[0]=0:a8c8e600:2 [L0 ZFS plain file] fletcher4 uncompressed LE 
contiguous unique single size=2L/2P birth=82L/82P fill=1 
cksum=3401f5fe522b:109ee10ba48ed38c:e7f49c220f7b8bc:ff405ef051b91e65 -- 
skipping
zdb_blkptr_cb: Got error 50 reading 182, 19202, 0, 1 
DVA[0]=0:a9030a00:2 [L0 ZFS plain file] fletcher4 uncompressed LE 
contiguous unique single size=2L/2P birth=82L/82P fill=1 
cksum=11c4c738b0ba:7bb81bce3313913:8f85a7abf1b9e34:58e8746d63119393 -- 
skipping
zdb_blkptr_cb: Got error 50 reading 182, 24924, 0, 0 
DVA[0]=0:b1aaec00:14a00 [L0 ZFS plain file] fletcher4 uncompressed LE 
contiguous unique single size=14a00L/14a00P birth=85L/85P fill=1 
cksum=270679cd905d:6119a969a134566:6f0f7da64c4d2d90:3ab86aa985abef02 -- 
skipping
zdb_blkptr_cb: Got error 50 reading 182, 24944, 0, 0 
DVA[0]=0:b1cdf000:10800 [L0 ZFS plain file] fletcher4 uncompressed LE 
contiguous unique single size=10800L/10800P birth=85L/85P fill=1 
cksum=1ebb4d1ae9f5:3cf5f42afa9a332:757613fc2d2de7b3:5f197017333a4f89 -- 
skipping


zdb_blkptr_cb: Got error 50 reading 493, 947, 0, 165 
DVA[0]=0:b3efc200:2 [L0 ZFS plain file] fletcher4 uncompressed LE 
contiguous unique single size=2L/2P birth=26691L/26691P fill=1 
cksum=2cdc2ae22d10:b33d31bcbc0d8da:f1571c9975e151b0:a037073594569635 -- 
skipping


Error counts:

errno  count
   50  5
block traversal size 11986202624 != alloc 11986203136 (unreachable 512)
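
As far as I can tell from the zdb source, the four numbers
after "reading" in each zdb_blkptr_cb report are the objset,
object, level and block id of the failing block (treat that
reading as an assumption). A small sketch that pulls those
fields out of a saved log; the function name is made up, and
the sample input is just the first line of each report above:

```shell
# Hypothetical helper: reads zdb output on stdin and prints one summary
# line per "Got error" report. Angle brackets and commas are dropped
# first so both bracketed and unbracketed forms parse the same way.
summarize_zdb_errors() {
    tr -d '<>,' | awk '/Got error [0-9]+ reading/ {
        # Locate the word "reading": errno precedes it, four ids follow.
        for (i = 1; i <= NF; i++) if ($i == "reading") break
        printf "errno=%s objset=%s object=%s level=%s blkid=%s\n",
            $(i-1), $(i+1), $(i+2), $(i+3), $(i+4)
    }'
}

summarize_zdb_errors <<'EOF'
zdb_blkptr_cb: Got error 50 reading 182, 19177, 0, 1
zdb_blkptr_cb: Got error 50 reading 182, 19202, 0, 1
zdb_blkptr_cb: Got error 50 reading 182, 24924, 0, 0
zdb_blkptr_cb: Got error 50 reading 182, 24944, 0, 0
zdb_blkptr_cb: Got error 50 reading 493, 947, 0, 165
EOF
```

The block id is printed exactly as zdb wrote it. If the
reading above is right, four of the five bad blocks sit in
objset 182 and one in objset 493, and the object numbers
might then be looked up with zdb -dddd on the matching
datasets to find the affected files.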

bp count:           405927
bp logical:    15030449664      avg:  37027
bp physical:   12995855872      avg:  32015     compression:   1.16
bp allocated:  13172434944      avg:  32450     compression:   1.14
bp deduped:     1186232320    ref>1:  12767   deduplication:   1.09
SPA allocated: 11986203136     used: 56.17%

Blocks   LSIZE   PSIZE   ASIZE     avg   comp  %Total  Type
     -       -       -       -       -      -       -  unallocated
     2     32K      4K   12.0K   6.00K   8.00    0.00  object directory
     3   1.50K   1.50K   4.50K   1.50K   1.00    0.00  object array
     1     16K   1.50K   4.50K   4.50K  10.67    0.00  packed nvlist
     -       -       -       -       -      -       -  packed nvlist size
   197   24.2M   1.87M   5.61M   29.2K  12.92    0.04  bpobj
     -       -       -       -       -      -       -  bpobj header
     -       -       -       -       -      -       -  SPA space map header
 1.27K   6.79M   3.25M    9.8M   7.70K   2.09    0.08  SPA space map
     8    144K    144K    144K   18.0K   1.00    0.00  ZIL intent log
 26.6K    426M   91.1M    182M   6.86K   4.67    1.45  DMU dnode
    75    150K   39.0K   80.0K   1.07K   3.85    0.00  DMU objset
     -       -       -       -       -      -       -  DSL directory
    23   12.0K   11.5K   34.5K   1.50K   1.04    0.00  DSL directory child map
    21   11.5K   10.5K   31.5K   1.50K   1.10    0.00  DSL dataset snap map
    49    707K   79.5K    239K   4.87K   8.89    0.00  DSL props
     -       -       -       -       -      -       -  DSL dataset
     -       -       -       -       -      -       -  ZFS znode
     -       -       -       -       -      -       -  ZFS V0 ACL
  321K   12.0G   10.5G   10.5G   33.4K   1.14   85.46  ZFS plain file
 26.8K   41.5M   19.1M   38.2M   1.42K