I was in the middle of replacing the drives of my NAS one by one (the goal is to end up with bigger and faster storage), so I had one more SATA drive and SATA cable connected than usual. Unfortunately, the extra cable turned out to be faulty, and it looks like it caused some heavy damage to the file system.

There was no "device replace" running at the moment of the disaster. The first round had already finished hours earlier and I planned to start the next one before going to sleep. So it was a full RAID-5 setup in a normal state, but one of the active, mounted devices was the first replacement HDD, and it was hanging on the spare SATA cable.

I tried to save a file to my mounted Samba share and realized the file system had become read-only. I rebooted the machine and saw that my /data could no longer be mounted. According to smartmontools, one of the drives was suffering from SATA communication errors.

I tried some trivial recovery methods and searched the mailing list archives, but I didn't really find a solution. I wonder if somebody can help with this.

Should I run "btrfs rescue chunk-recover /dev/sda"?

Here are some raw details:

# uname -a
Linux F17a_NAS 4.2.3-gentoo #2 SMP Sun Oct 18 17:56:45 CEST 2015 x86_64 AMD E-350 Processor AuthenticAMD GNU/Linux

# btrfs --version
btrfs-progs v4.2.2

# btrfs check /dev/sda
checksum verify failed on 21102592 found 295F0086 wanted 00000000
checksum verify failed on 21102592 found 295F0086 wanted 00000000
checksum verify failed on 21102592 found 99D0FC26 wanted B08FFCA0
checksum verify failed on 21102592 found 99D0FC26 wanted B08FFCA0
bytenr mismatch, want=21102592, have=65536
Couldn't read chunk root
Couldn't open file system

# mount /dev/sda /data -o ro,recovery
mount: wrong fs type, bad option, bad superblock on /dev/sda, ...

# cat /proc/kmsg
<6>[ 1902.033164] BTRFS info (device sdb): enabling auto recovery
<6>[ 1902.033184] BTRFS info (device sdb): disk space caching is enabled
<6>[ 1902.033191] BTRFS: has skinny extents
<3>[ 1902.034931] BTRFS (device sdb): bad tree block start 0 21102592
<3>[ 1902.051259] BTRFS (device sdb): parent transid verify failed on 21147648 wanted 101748 found 101124
<3>[ 1902.111807] BTRFS (device sdb): parent transid verify failed on 44613632 wanted 101770 found 101233
<3>[ 1902.126529] BTRFS (device sdb): parent transid verify failed on 40595456 wanted 101767 found 101232
<6>[ 1902.164667] BTRFS: bdev /dev/sda errs: wr 858, rd 8057, flush 280, corrupt 0, gen 0
<3>[ 1902.165929] BTRFS (device sdb): parent transid verify failed on 44617728 wanted 101770 found 101233
<3>[ 1902.166975] BTRFS (device sdb): parent transid verify failed on 44621824 wanted 101770 found 101233
<3>[ 1902.271296] BTRFS (device sdb): parent transid verify failed on 38621184 wanted 101765 found 101223
<3>[ 1902.380526] BTRFS (device sdb): parent transid verify failed on 38719488 wanted 101765 found 101223
<3>[ 1902.381510] BTRFS (device sdb): parent transid verify failed on 38719488 wanted 101765 found 101223
<3>[ 1902.381549] BTRFS: Failed to read block groups: -5
<3>[ 1902.394835] BTRFS: open_ctree failed
<6>[ 1911.202254] BTRFS info (device sdb): enabling auto recovery
<6>[ 1911.202270] BTRFS info (device sdb): disk space caching is enabled
<6>[ 1911.202275] BTRFS: has skinny extents
<3>[ 1911.203611] BTRFS (device sdb): bad tree block start 0 21102592
<3>[ 1911.204803] BTRFS (device sdb): parent transid verify failed on 21147648 wanted 101748 found 101124
<3>[ 1911.246384] BTRFS (device sdb): parent transid verify failed on 44613632 wanted 101770 found 101233
<3>[ 1911.248729] BTRFS (device sdb): parent transid verify failed on 40595456 wanted 101767 found 101232
<6>[ 1911.251658] BTRFS: bdev /dev/sda errs: wr 858, rd 8057, flush 280, corrupt 0, gen 0
<3>[ 1911.252485] BTRFS (device sdb): parent transid verify failed on 44617728 wanted 101770 found 101233
<3>[ 1911.253542] BTRFS (device sdb): parent transid verify failed on 44621824 wanted 101770 found 101233
<3>[ 1911.278414] BTRFS (device sdb): parent transid verify failed on 38621184 wanted 101765 found 101223
<3>[ 1911.283950] BTRFS (device sdb): parent transid verify failed on 38719488 wanted 101765 found 101223
<3>[ 1911.284835] BTRFS (device sdb): parent transid verify failed on 38719488 wanted 101765 found 101223
<3>[ 1911.284873] BTRFS: Failed to read block groups: -5
<3>[ 1911.298783] BTRFS: open_ctree failed

# btrfs-show-super /dev/sda
superblock: bytenr=65536, device=/dev/sda
---------------------------------------------------------
csum                    0xe8789014 [match]
bytenr                  65536
flags                   0x1 ( WRITTEN )
magic                   _BHRfS_M [match]
fsid                    2bba7cff-b4bf-4554-bee4-66f69c761ec4
label
generation              101480
root                    37892096
sys_array_size          258
chunk_root_generation   101124
root_level              2
chunk_root              21147648
chunk_root_level        1
log_root                0
log_root_transid        0
log_root_level          0
total_bytes             6001196802048
bytes_used              3593129504768
sectorsize              4096
nodesize                4096
leafsize                4096
stripesize              4096
root_dir                6
num_devices             3
compat_flags            0x0
compat_ro_flags         0x0
incompat_flags          0x381 ( MIXED_BACKREF | RAID56 | SKINNY_METADATA | NO_HOLES )
csum_type               0
csum_size               4
cache_generation        101480
uuid_tree_generation    101480
dev_item.uuid           330c9c98-4140-497a-814f-ac76a5b07172
dev_item.fsid           2bba7cff-b4bf-4554-bee4-66f69c761ec4 [match]
dev_item.type           0
dev_item.total_bytes    2000398934016
dev_item.bytes_used     1809263362048
dev_item.io_align       4096
dev_item.io_width       4096
dev_item.sector_size    4096
dev_item.devid          2
dev_item.dev_group      0
dev_item.seek_speed     0
dev_item.bandwidth      0
dev_item.generation     0

# btrfs-show-super /dev/sdb
superblock: bytenr=65536, device=/dev/sdb
---------------------------------------------------------
csum                    0x177aae67 [match]
bytenr                  65536
flags                   0x1 ( WRITTEN )
magic                   _BHRfS_M [match]
fsid                    2bba7cff-b4bf-4554-bee4-66f69c761ec4
label
generation              101770
root                    44650496
sys_array_size          258
chunk_root_generation   101748
root_level              2
chunk_root              21102592
chunk_root_level        1
log_root                0
log_root_transid        0
log_root_level          0
total_bytes             6001196802048
bytes_used              3533993762816
sectorsize              4096
nodesize                4096
leafsize                4096
stripesize              4096
root_dir                6
num_devices             3
compat_flags            0x0
compat_ro_flags         0x0
incompat_flags          0x381 ( MIXED_BACKREF | RAID56 | SKINNY_METADATA | NO_HOLES )
csum_type               0
csum_size               4
cache_generation        101770
uuid_tree_generation    101770
dev_item.uuid           f14b343e-b701-47f2-a652-e52a47be42b2
dev_item.fsid           2bba7cff-b4bf-4554-bee4-66f69c761ec4 [match]
dev_item.type           0
dev_item.total_bytes    2000398934016
dev_item.bytes_used     1815705812992
dev_item.io_align       4096
dev_item.io_width       4096
dev_item.sector_size    4096
dev_item.devid          3
dev_item.dev_group      0
dev_item.seek_speed     0
dev_item.bandwidth      0
dev_item.generation     0

# btrfs-show-super /dev/sdc
superblock: bytenr=65536, device=/dev/sdc
---------------------------------------------------------
csum                    0xa06026f3 [match]
bytenr                  65536
flags                   0x1 ( WRITTEN )
magic                   _BHRfS_M [match]
fsid                    2bba7cff-b4bf-4554-bee4-66f69c761ec4
label
generation              101770
root                    44650496
sys_array_size          258
chunk_root_generation   101748
root_level              2
chunk_root              21102592
chunk_root_level        1
log_root                0
log_root_transid        0
log_root_level          0
total_bytes             6001196802048
bytes_used              3533993762816
sectorsize              4096
nodesize                4096
leafsize                4096
stripesize              4096
root_dir                6
num_devices             3
compat_flags            0x0
compat_ro_flags         0x0
incompat_flags          0x381 ( MIXED_BACKREF | RAID56 | SKINNY_METADATA | NO_HOLES )
csum_type               0
csum_size               4
cache_generation        101770
uuid_tree_generation    101770
dev_item.uuid           4dadced6-392f-4d57-920c-ee8fbebbd608
dev_item.fsid           2bba7cff-b4bf-4554-bee4-66f69c761ec4 [match]
dev_item.type           0
dev_item.total_bytes    2000398934016
dev_item.bytes_used     1815726784512
dev_item.io_align       4096
dev_item.io_width       4096
dev_item.sector_size    4096
dev_item.devid          1
dev_item.dev_group      0
dev_item.seek_speed     0
dev_item.bandwidth      0
dev_item.generation     0

# smartctl -a /dev/sda
smartctl 6.4 2015-06-04 r4109 [x86_64-linux-4.2.3-gentoo] (local build)
...
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       16

This was a new drive, and this counter didn't move until I touched the cables again to prepare for the next "device replace" round. I checked the SMART data several times before, during and after the first round of "device replace" to make sure the new drive hadn't come faulty from the factory/reseller...

I am sure these two things (the unmountable filesystem and this SATA cable error counter) are directly related. I threw away these SATA cables because another one from the same "batch" (a four-pack I picked up somewhere, sometime...) proved to be faulty as well (although that one didn't cause any practical harm, other than making a Windows PC hang and the CRC error counter of the SSD rise).

I am not really happy that Btrfs in RAID5 mode wasn't a little more fault tolerant towards "disk" faults. Although it might still be saved, right? Right? :)

Thank you for your answers in advance!
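
P.S. In case it helps anyone comparing the three superblocks above: below is a minimal Python sketch, not a recovery tool, that pulls the numeric fields out of btrfs-show-super style text and reports which devices lag behind the newest generation. The helper names (parse_fields, stale_devices) are my own and purely illustrative, and the DUMPS strings are abridged copies of the output above. With these values it reports /dev/sda (devid 2) at generation 101480, which is 290 generations behind sdb and sdc at 101770.

```python
# Abridged fields copied from the three btrfs-show-super dumps in this post.
DUMPS = {
    "/dev/sda": """
generation              101480
chunk_root_generation   101124
chunk_root              21147648
dev_item.devid          2
""",
    "/dev/sdb": """
generation              101770
chunk_root_generation   101748
chunk_root              21102592
dev_item.devid          3
""",
    "/dev/sdc": """
generation              101770
chunk_root_generation   101748
chunk_root              21102592
dev_item.devid          1
""",
}


def parse_fields(dump):
    """Collect 'name value' lines whose value is a plain decimal integer."""
    fields = {}
    for line in dump.splitlines():
        parts = line.split()
        # Skip blank lines, value-less fields such as an empty label,
        # and non-decimal values such as hex checksums or UUIDs.
        if len(parts) >= 2 and parts[1].isdigit():
            fields[parts[0]] = int(parts[1])
    return fields


def stale_devices(dumps):
    """Map each device behind the newest generation to how far it lags."""
    parsed = {dev: parse_fields(text) for dev, text in dumps.items()}
    newest = max(f["generation"] for f in parsed.values())
    return {
        dev: newest - f["generation"]
        for dev, f in parsed.items()
        if f["generation"] < newest
    }


for dev, lag in stale_devices(DUMPS).items():
    print(f"{dev} is {lag} generations behind the newest superblock")
```

This only compares the numbers the superblocks report; it says nothing about which copy of the metadata is actually intact.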